DATABRICKS DEVELOPER AND ADMIN: Azure / On-Prem
Time: 25 hrs (approx.)
Databricks Fundamentals
- Introduction to Databricks
- Databricks Terminology and Databricks Community
- Create a free Databricks account
- Introduction to the Databricks environment
- First steps with Databricks
Databricks Platforms
- Importing notebooks, language configuration and markdown
- Databricks File System (DBFS)
- Create, manipulate and visualize tables
- Databricks widgets
Databricks Utilities
- Databricks Utils for managing the file system and libraries
- Databricks Utils for notebooks, secrets and widgets (see the dbutils sketch after this list)
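The utilities above are exposed in notebooks through the dbutils object. A minimal sketch, assuming a notebook attached to a running cluster; the paths, widget name, and secret scope/key are placeholders:

# File system utilities
dbutils.fs.mkdirs("/tmp/demo")                         # create a DBFS folder
dbutils.fs.put("/tmp/demo/hello.txt", "hello", True)   # write a small file (overwrite=True)
display(dbutils.fs.ls("/tmp/demo"))                    # list folder contents

# Widgets: parameterize the notebook
dbutils.widgets.text("run_date", "2024-01-01", "Run Date")
run_date = dbutils.widgets.get("run_date")

# Secrets: read a value from a secret scope (scope and key are assumed to exist)
# token = dbutils.secrets.get(scope="demo-scope", key="demo-key")

# Notebook utilities: run another notebook and pass parameters (path is hypothetical)
# result = dbutils.notebook.run("./child_notebook", 600, {"run_date": run_date})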
ETL Approach in Databricks:
- Creating and saving DataFrames in Databricks
- Transformation and visualization of data in Databricks (see the sketch after this list)
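A small PySpark sketch of the DataFrame flow above (create, transform, save as Delta, visualize); the sample data and output path are illustrative only:

from pyspark.sql import functions as F

sales = spark.createDataFrame(
    [("2024-01-01", "A", 100.0), ("2024-01-02", "B", 250.0)],
    ["order_date", "product", "amount"],
)

# Transform: cast the date and derive a new column
enriched = (sales
            .withColumn("order_date", F.to_date("order_date"))
            .withColumn("amount_with_tax", F.round(F.col("amount") * 1.1, 2)))

# Save in Delta format and visualize in the notebook
enriched.write.format("delta").mode("overwrite").save("dbfs:/tmp/demo/sales_enriched")
display(enriched)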
Databricks installation using Azure:
- Introduction to Setup Databricks Environment using Azure
- Signup for Azure Portal
- Setup Azure Databricks using Azure Portal
- Launching Azure Databricks Environment
- Create Single Node Databricks Cluster
- Editing Databricks Clusters using Databricks UI
- Getting Started with Databricks Notebooks
- Create Databricks SQL Warehouse
- Increase Quota to Create Databricks SQL Warehouse Cluster
- Run Queries using Databricks SQL Warehouse
- Overview of Uploading Data using Databricks SQL Warehouse UI
- Review Data Explorer of Data Science and Engineering Environment
- Analyze Sales Data using Databricks Notebooks
- Terminate Databricks Data Science and Engineering Clusters
- Terminate Databricks SQL Warehouse Clusters
- Delete Azure Databricks Workspace
- Population Data Analytics Lab
Setup Databricks for SQL:
- Installing Databricks CLI using python3
- Configure Databricks CLI using Token and Profile
- Setup Git Repository for Material and Data Sets related to Databricks SQL Course
Databricks SQL
- Introduction to the Databricks SQL platform
- Run first SQL query using the Databricks SQL editor
- Introduction to Databricks SQL dashboards
- Overview of Databricks SQL Data Explorer to review Metastore Database and Tables
- Use Databricks SQL Editor to develop scripts or queries
- Review Metadata of Tables using Databricks SQL Platform
- Overview of loading data into retail_db tables
- Configure Databricks CLI to push data into Databricks Platform
- Copy JSON Data into DBFS using Databricks CLI
- Analyze JSON Data using Spark APIs
- Analyze Delta Table Schemas using Spark APIs
- Load Data from Spark Data Frames into Delta Tables
- Run Ad Hoc Queries using Databricks SQL Editor to validate data (see the sketch after this list)
- Overview of External Tables using Databricks SQL
- Using COPY Command to Copy Data into Delta Tables
- Manage Databricks SQL Endpoints
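A sketch of the JSON-to-Delta flow covered above, using Spark APIs from a notebook; the DBFS path is a placeholder and the order_status column is an assumption about the retail_db orders data:

spark.sql("CREATE DATABASE IF NOT EXISTS retail_db")

orders = spark.read.json("dbfs:/public/retail_db_json/orders")   # analyze the JSON data
orders.printSchema()                                             # review the inferred schema

(orders.write
       .format("delta")
       .mode("overwrite")
       .saveAsTable("retail_db.orders"))                         # load the DataFrame into a Delta table

# Ad hoc validation query - the same statement can be run from the Databricks SQL editor
spark.sql("SELECT order_status, count(*) AS order_count FROM retail_db.orders GROUP BY order_status").show()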
Managing Databases using Databricks SQL Warehouse:
- Review Databases using Databricks SQL Data Explorer
- Create Database or Schema using Databricks SQL
- Using IF NOT EXISTS while Creating Databases using Databricks SQL
- Listing or Showing Databases and Getting Metadata of Databases using Databricks
- Understand Default Location of Databricks SQL Database or Schema
- Create Database or Schema using Location in Databricks SQL Warehouse
- Drop Databases in Databricks SQL Warehouse
- Alter Database in Databricks SQL Warehouse
- Comments on Databases in Databricks SQL Warehouse (see the sketch after this list)
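The database-management statements above, issued through spark.sql() so they run from a notebook; the same SQL works unchanged in the Databricks SQL editor. Schema names and the location are placeholders:

spark.sql("CREATE SCHEMA IF NOT EXISTS demo_db COMMENT 'Training database'")
spark.sql("CREATE SCHEMA IF NOT EXISTS demo_db_ext LOCATION 'dbfs:/user/demo/demo_db_ext.db'")

spark.sql("SHOW DATABASES").show()                                    # list databases
spark.sql("DESCRIBE DATABASE EXTENDED demo_db").show(truncate=False)  # metadata, including the default location

spark.sql("ALTER DATABASE demo_db SET DBPROPERTIES ('purpose' = 'training')")
spark.sql("DROP SCHEMA IF EXISTS demo_db_ext CASCADE")                # drop the schema along with its tables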
Manage Delta Tables using Databricks SQL Warehouse:
- List Databases and Save Databricks SQL Script
- Create Table using Delta Format in Databricks SQL Warehouse
- Understand LOCATION and the USING Clause to specify File Format in Databricks SQL
- Create External Table using Delta Format in Databricks SQL Warehouse
- Drop External Table and Delete Folder in Databricks SQL Warehouse
- Overview of DML or CRUD Operations using Databricks SQL
- Insert Records into Databricks SQL Warehouse table
- Insert Multiple Records into Databricks SQL Warehouse table
- Update Existing Records in Databricks SQL Warehouse table
- Update Existing Records in Databricks SQL Warehouse table based on Null Values
- Delete Existing Records in Databricks SQL Warehouse table
- Cleanup Users Tables from Databricks SQL Warehouse Database or Schema (see the DML sketch after this list)
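A DDL and DML sketch for the Delta table topics above, via spark.sql(); the table, columns, and values are placeholders:

spark.sql("CREATE SCHEMA IF NOT EXISTS demo_db")
spark.sql("""
  CREATE TABLE IF NOT EXISTS demo_db.users (
    user_id INT, user_name STRING, city STRING
  ) USING DELTA
""")

spark.sql("INSERT INTO demo_db.users VALUES (1, 'Asha', 'Pune')")                         # single record
spark.sql("INSERT INTO demo_db.users VALUES (2, 'Ravi', 'Chennai'), (3, 'Meera', NULL)")  # multiple records

spark.sql("UPDATE demo_db.users SET city = 'Mumbai' WHERE user_id = 2")                   # update a record
spark.sql("UPDATE demo_db.users SET city = 'Unknown' WHERE city IS NULL")                 # update based on NULL values
spark.sql("DELETE FROM demo_db.users WHERE user_id = 3")                                  # delete a record
spark.sql("DROP TABLE IF EXISTS demo_db.users")                                           # cleanup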
Setup Dataset for Databricks SQL Views and the COPY Command:
- Create Folder in DBFS using Databricks CLI Commands
- Copy Files from Local File System into DBFS using Databricks CLI Commands
- Overwrite Files while Copying into DBFS using Databricks CLI Command
- Understand Course Catalog Data in the files uploaded to DBFS
- Options to Analyze Data using Databricks SQL Queries
- Run Select Queries using DBFS Path in FROM Clause
- Run Queries using Temporary Views in Databricks SQL
- Run Queries using External Tables in Databricks SQL (see the sketch after this list)
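A sketch of the three query options above, assuming the course-catalog JSON files have already been copied to the DBFS path shown (a placeholder):

files_path = "dbfs:/public/course_catalog"

# 1. Query the files directly by using the DBFS path in the FROM clause
spark.sql(f"SELECT * FROM json.`{files_path}` LIMIT 10").show()

# 2. Query through a temporary view
spark.read.json(files_path).createOrReplaceTempView("course_catalog_v")
spark.sql("SELECT count(*) AS record_count FROM course_catalog_v").show()

# 3. Query through an external table defined over the same files
spark.sql(f"CREATE TABLE IF NOT EXISTS course_catalog_ext USING JSON LOCATION '{files_path}'")
spark.sql("SELECT * FROM course_catalog_ext LIMIT 10").show()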
Queries to Process Values in JSON:
- Queries to Process Values in JSON String Columns
- Get Distinct and Count based on Key using Course Catalog Data
- Filter Data using Basic Databricks SQL Queries using Course Catalog Data
- Exploring Functions using Databricks SQL
- Understand Record Column Values in Course Catalog Table
- Processing JSON String Values using Databricks SQL Queries
- Process Instructors JSON Records using Databricks SQL Queries
- Create View for Instructors using Databricks SQL Queries (see the sketch after this list)
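A sketch of processing JSON string values as listed above; the sample record and field names are made-up stand-ins for the course-catalog data:

from pyspark.sql import Row

raw = spark.createDataFrame([
    Row(course='{"course_id": 1, "course_name": "Spark Basics", "instructors": [{"name": "Asha"}, {"name": "Ravi"}]}'),
])
raw.createOrReplaceTempView("course_catalog_raw")

# Extract simple values from the JSON string with get_json_object
spark.sql("""
  SELECT get_json_object(course, '$.course_id')   AS course_id,
         get_json_object(course, '$.course_name') AS course_name
  FROM course_catalog_raw
""").show()

# Parse the full string with from_json and explode the nested instructors array
# (the basis for an instructors view)
spark.sql("""
  WITH parsed AS (
    SELECT from_json(course,
             'course_id INT, course_name STRING, instructors ARRAY<STRUCT<name: STRING>>') AS c
    FROM course_catalog_raw
  )
  SELECT c.course_id, instructor.name AS instructor_name
  FROM parsed
  LATERAL VIEW explode(c.instructors) t AS instructor
""").show()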
Copy Data into Delta Tables in Databricks SQL Warehouse:
- Create Delta Table for Course Catalog Data Set
- Get File Names along with Data using Databricks SQL Queries
- Overview of Databricks SQL COPY Command
- Copy Data from single file into Delta Tables using Files
- Copy Data from multiple files into Delta Tables using Files
- Copy Data from multiple files into Delta Tables using Pattern
- Create Course Catalog Table in Databricks SQL Warehouse with additional Column
- Copy Data from Files using Queries into Delta Tables
- Validate Course Catalog Table in Bronze Layer (see the COPY INTO sketch after this list)
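A COPY INTO sketch for the topics above; the target table, source path, and pattern are placeholders, and the placeholder-table and schema-inference options assume a recent Databricks Runtime. COPY INTO is idempotent, so files that were already loaded are skipped on re-runs:

spark.sql("CREATE SCHEMA IF NOT EXISTS demo_db")
spark.sql("CREATE TABLE IF NOT EXISTS demo_db.course_catalog_bronze")   # empty placeholder Delta table

spark.sql("""
  COPY INTO demo_db.course_catalog_bronze
  FROM 'dbfs:/public/course_catalog/'
  FILEFORMAT = JSON
  PATTERN = '*.json'
  FORMAT_OPTIONS ('inferSchema' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true')
""")

# Validate the bronze table after the load
spark.sql("SELECT count(*) AS loaded_rows FROM demo_db.course_catalog_bronze").show()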
Insert or Merge Query Results or Views into Delta Tables using Databricks SQL:
- Introduction to Insert or Merge Query Results or Views into Delta Tables using Databricks SQL
- Create Course Catalog and Instructors Tables using Databricks SQL
- Copy Data into Course Catalog Table from JSON Files using Databricks SQL
- Insert Query Results into Delta Table using Databricks SQL
- Exercise to Create Courses Table and Insert Data
- Copy Instructors Data into Course Catalog Table from new file
- Understand the Concept of Merge or Upsert in DML or CRUD Operations
- Develop Query to Get the latest Instructors Records from Course Catalog Table
- Overview of Merge Statement Syntax using Databricks SQL
- Merge Data into Instructors Table from Course Catalog using Databricks SQL
- Exercise to Merge Courses Data from Course Catalog into Courses Table (see the MERGE sketch after this list)
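A MERGE (upsert) sketch for the topics above; the source and target are built inline from small sample DataFrames so the example is self-contained, and all names are illustrative:

spark.sql("CREATE SCHEMA IF NOT EXISTS demo_db")

# Target Delta table
spark.createDataFrame(
    [(1, "Asha"), (2, "Ravi")], ["instructor_id", "instructor_name"]
).write.format("delta").mode("overwrite").saveAsTable("demo_db.instructors")

# Source view standing in for the "latest instructors" query developed above
spark.createDataFrame(
    [(2, "Ravi K"), (3, "Meera")], ["instructor_id", "instructor_name"]
).createOrReplaceTempView("latest_instructors_v")

spark.sql("""
  MERGE INTO demo_db.instructors AS tgt
  USING latest_instructors_v AS src
    ON tgt.instructor_id = src.instructor_id
  WHEN MATCHED THEN UPDATE SET tgt.instructor_name = src.instructor_name
  WHEN NOT MATCHED THEN INSERT (instructor_id, instructor_name)
                        VALUES (src.instructor_id, src.instructor_name)
""")
spark.sql("SELECT * FROM demo_db.instructors ORDER BY instructor_id").show()

After the merge, instructor 2 is updated and instructor 3 is inserted, which is the upsert behaviour the exercises above target.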
Delta Lake Lab: Exercises
- Data Lakehouse Architecture
- Medallion Lakehouse architecture
- Delta Lake
- 1: Create Delta Table (SQL & Python)
- 2: Read & Write Delta Table
- 3: Update / Delete / Merge
- 4: Schema Validation
- 5: Time Travel
- 6: Convert a Parquet table to a Delta table
- 7: Generated Columns
- 8: Incremental ETL load
- 9: Incremental ETL load (@version property)
- Processing Nested XML file
- Processing Nested JSON file
- Delta Table - Time Travel and Vacuum (see the sketch after this list)
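A time-travel and VACUUM sketch for the lab topics above; the table name and Parquet path are placeholders, and the table is assumed to already have a few committed versions:

spark.sql("DESCRIBE HISTORY demo_db.sales_delta").show(truncate=False)        # list versions and operations

# Query earlier versions by version number or timestamp
spark.sql("SELECT count(*) FROM demo_db.sales_delta VERSION AS OF 0").show()
spark.sql("SELECT count(*) FROM demo_db.sales_delta TIMESTAMP AS OF '2024-01-01'").show()

# Lab 6: convert an existing Parquet folder to a Delta table in place
spark.sql("CONVERT TO DELTA parquet.`dbfs:/tmp/demo/sales_parquet`")

# Remove data files no longer referenced by the table (default retention is 7 days / 168 hours)
spark.sql("VACUUM demo_db.sales_delta RETAIN 168 HOURS")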
Databricks: Admin
- Manage User & Group
- Lab: Add User into Azure Active Directory
- Lab: Create Group
- Lab: Table Access Control
- Lab: Workspace, Cluster, Job Access
- Introduction to Azure Databricks Workspace.
- Databricks Clusters
- Databricks Pools
- Databricks Notebooks and magic commands
- Databricks CLI and DBFS management
- Administering Clusters via Terraform
Databricks Notebook - CI/CD using Azure DevOps
- Integrate Databricks notebooks with Git providers such as GitHub
- Configure Continuous Integration - build artifacts to be deployed to clusters
- Configure Continuous Delivery using DataThirst templates
- Run notebooks on Azure Databricks via Jobs
- Secure clusters via cluster policies and permissions
- Data Factory Linked Services
- Orchestrate notebooks via Data Factory
Databricks Cluster & Utilities Details:
- Navigate the Workspace
- Databricks Runtimes
- Clusters Part 1
- Clusters Part 2
- Notebooks
- Libraries
- Repos for Git integration
- Databricks File System (DBFS)
- DBUTILS
- Widgets
- Workflows
- Metastore - Setup External Metastore I
- Metastore - Setup External Metastore II
Structured Streaming using Databricks, Spark and Azure
- What is Spark Structured Streaming
- Data Source & Sink
- Lab: Rate & File Source
- Lab: Kafka Source
- Lab: Sink: Console, Memory, File & Custom
- Lab: Build Streaming ETL
- Lab: Setup Event Hub
- Lab: Event Hub Producer
- Lab: Integrate Event Hubs with Databricks
- Lab: Transformation
- Streaming ETL: Ingest into Azure storage (see the streaming sketch after this list)
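A minimal Structured Streaming sketch (rate source to Delta sink) in the spirit of the labs above; the Kafka and Event Hubs sources differ mainly in the format and connection options, and the paths here are placeholders:

from pyspark.sql import functions as F

stream = (spark.readStream
               .format("rate")             # built-in test source emitting (timestamp, value)
               .option("rowsPerSecond", 5)
               .load()
               .withColumn("is_even", F.col("value") % 2 == 0))

query = (stream.writeStream
               .format("delta")
               .outputMode("append")
               .option("checkpointLocation", "dbfs:/tmp/demo/rate_ckpt")   # required for recovery
               .start("dbfs:/tmp/demo/rate_delta"))

# query.stop()   # stop the stream when finished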
Deep Dive into Data Lakehouse, Delta Lake and Delta Tables:
- Understanding Data Warehouse, Data Lake and Data Lakehouse
- Databricks Lakehouse Architecture and Delta Lake
- Delta Tables
- Storing data in a Delta table, Databricks SQL and time travel
- Delta Table caching
- Delta Table partitioning
- Delta Table Z-ordering
- Where to go from here?
- Azure Databricks
- Why is Spark difficult? Why did Databricks evolve?
- Why Databricks in Cloud? Introduction to Azure Databricks
- How to save on Databricks demo costs
- Demo overview
- Understand Databricks tables and the file system
- Load CSV data in Azure blob storage
- Demo: Provision Databricks, Clusters and Notebooks
- Demo: Mount Data Lake to Databricks DBFS
- Creating Azure Free Account
- Azure Portal Overview
- Introduction to Azure Databricks
- Creating Azure Databricks Service
- Azure Databricks Architecture Overview
- Project Solution Databricks Notebooks
- Azure Databricks Cluster Types
- Azure Databricks Cluster Configuration
- Creating Azure Databricks Cluster
- Azure Databricks Cluster Pool
- Azure Databricks Notebooks Introduction
- Magic commands
- Databricks Utilities
- Databricks File System (DBFS)
- Databricks Mount overview
- Creating Azure Data Lake Storage Gen2
- Creating Azure Service Principal
- Mounting Azure Data Lake Storage Gen2
- Secret Scopes Overview
- Creating Secret Scope and Secrets in Key Vault
- Mounting Data Lake Using Secrets (see the mount sketch below)
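A sketch of mounting ADLS Gen2 with a service principal whose credentials are read from a secret scope, as covered above; the scope, key names, storage account, container, and mount point are all placeholders:

client_id     = dbutils.secrets.get(scope="formula1-scope", key="client-id")
client_secret = dbutils.secrets.get(scope="formula1-scope", key="client-secret")
tenant_id     = dbutils.secrets.get(scope="formula1-scope", key="tenant-id")

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://raw@formula1dl.dfs.core.windows.net/",   # container@storage-account
    mount_point="/mnt/formula1dl/raw",
    extra_configs=configs,
)
display(dbutils.fs.ls("/mnt/formula1dl/raw"))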
- Project:
- Formula1 Data Overview
- Upload Formula1 Data to Data Lake
- Project Requirement Overview
- Solution Architecture Overview
- Data Ingestion - CSV
- Data Ingestion - JSON
- Data Ingestion - Multiple Files
- Databricks Workflows
- Filter & Join Transformations
- Aggregations
- Spark SQL - Databases/ Tables/ Views
- Spark SQL - Filters/ Joins/ Aggregations
- Incremental Load
- Data Loading Design Patterns
- Formula1 Project Scenario
- Formula1 Project Data Set-up
- Full Refresh Implementation
- Incremental Load - Method 1
- Incremental Load - Method 2
- Incremental Load Improvements - Assignment
- Incremental Load Improvements - Solution
- Incremental Load - Notebook Workflows
- Incremental Load - Race Results
- Incremental Load - Driver Standings
- Incremental Load - Constructor Standings (Assignment)
- Pitfalls of Data Lakes
- Data Lakehouse Architecture
- Read & Write to Delta Lake
- Updates and Deletes on Delta Lake
- Merge/ Upsert to Delta Lake
- History, Time Travel, Vacuum
- Delta Lake Transaction Log
- Convert from Parquet to Delta
- Data Ingestion - Circuits File
- Data Ingestion - Results File
- File Improvements
- Data Transformation - PySpark / Spark Scala / SQL
- Demo: Explore, Analyse, Clean, Transform and Load Data in Databricks
- Azure Databricks Clusters
- Azure Databricks other Important Components
- Databricks – Monitoring
- Use Case Discussion and Solution using Databricks: any two use cases will be covered during the training
- Building a solution architecture for a data engineering solution using Azure Databricks, Azure Data Lake Gen2, Azure Data Factory and Power BI
- Mounting Azure Storage in Databricks using secrets stored in Azure Key Vault
- Working with Databricks Tables, Databricks File System (DBFS) etc
- Using Delta Lake to implement a solution using Lakehouse architecture
- Creating dashboards to visualise the outputs
- Connecting to the Azure Databricks tables from Power BI
- Working with Databricks notebooks as well as using Databricks utilities, magic commands etc.
- Configure Azure Databricks logging via Log4j and the Spark listener library to a Log Analytics workspace
- Configure notebook deployment via Databricks Jobs.
- Configure CI/CD using Azure DevOps
- Delta Lake: Spark / Scala using Databricks
Detailed Discussion on Delta Lake - Spark / Scala:
- Introduction to Data Lake
- Key Features of Delta Lake
- Implementing incremental load pattern using delta lake
- Emergence of Data Lakehouse architecture and the role of Delta Lake
- Read, Write, Update, Delete and Merge to delta lake using both PySpark as well as SQL
- Create a table
- Write a table
- Read a table
- Schema validation
- Update table schema
- Table Metadata
- Delete from a table
- Update a Table
- Vacuum
- History
- Concurrency Control
- Optimistic concurrency control
- Migrate Workloads to Delta Lake
- Optimize Performance with File Management
- Auto Optimize
- Optimize Performance with Caching
- Delta and Apache Spark caching
- Cache a subset of the data (see the sketch at the end of this section)
- Isolation Levels
- Best Practices
- Working on multiple use cases
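A closing sketch for the file-management and caching topics above (OPTIMIZE with Z-ordering, Auto Optimize, and caching a subset of the data); the table, columns, and filter predicate are placeholders:

spark.sql("CREATE SCHEMA IF NOT EXISTS demo_db")
spark.sql("""
  CREATE TABLE IF NOT EXISTS demo_db.sales_delta (
    order_id BIGINT, customer_id BIGINT, amount DOUBLE, order_date DATE
  ) USING DELTA
  PARTITIONED BY (order_date)
""")

# Compact small files and co-locate rows by a frequently filtered column
spark.sql("OPTIMIZE demo_db.sales_delta ZORDER BY (customer_id)")

# Enable Auto Optimize on the table
spark.sql("""
  ALTER TABLE demo_db.sales_delta SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")

# Cache a subset of the data on the cluster's local storage (Databricks disk cache)
spark.sql("CACHE SELECT * FROM demo_db.sales_delta WHERE order_date >= '2024-01-01'")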