PySpark 3.x - 30 hrs
- Understanding Data Lake
- Introduction to Spark
- Spark Architecture
- Spark RDD
- Apache Spark in Cloud - Databricks Community Edition & notebooks
- Apache Spark in Hadoop Ecosystem - Zeppelin notebook
- Apache Spark in Local Mode - Scala / PySpark shell
- Spark Execution model & Architecture
- How to Run Spark Programs
- Spark Single-Node and Multi-Node Cluster Setup
- Spark Execution Models & Cluster Managers
- Working with Notebooks in Cluster Mode
- Working with spark-submit
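The spark-submit topics above can be sketched as a single command line; the application file, resource sizes, and data paths below are hypothetical placeholders to adjust for your own project:

```shell
# Hypothetical example: submit a PySpark application to a YARN cluster.
# File names, paths, and resource sizes are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  my_app.py --input /data/in --output /data/out
```

For local experiments, replace `--master yarn --deploy-mode cluster` with `--master "local[*]"`.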
- Apache Spark using SQL - Getting Started
- Launching and using Spark SQL CLI
- Understanding Spark Metastore Warehouse Directory
- Managing Spark Metastore Databases
- Managing Spark Metastore Tables
- Retrieve Metadata of Spark Metastore Tables
- Role of Spark Metastore or Hive Metastore
- Examples of Working with DataFrames
- DataFrames with the Spark SQL Shell
- Spark DataFrame
- Working with DataFrame Rows
- Working with DataFrame Rows and Unit Tests
- Working with DataFrame Rows and Unstructured Data
- Working with DataFrame Columns
- DataFrame Partitions and Executors
- Creating and using UDF
- Aggregations in DataFrames
- Windowing in DataFrames
- Grouping Aggregations in DataFrames
- DataFrame joins
- Internals of Spark Joins & Shuffle
- Optimizing joins
- Implementing Bucket joins
- Spark Transformations and Actions
- Spark Jobs, Stages & Tasks
- Understanding Execution plan
- Unit Testing in Spark
- Debugging Spark Driver and Executors
- Spark Application logs in cluster
- Assignment:
- Spark SQL Exercise
- Apache Spark using SQL - Pre-defined Function
- Overview of Pre-defined Functions using Spark SQL
- Validating Functions using Spark SQL
- String Manipulation Functions using Spark SQL
- Date Manipulation Functions using Spark SQL
- Overview of Numeric Functions using Spark SQL
- Data Type Conversion using Spark SQL
- Dealing with Nulls using Spark SQL
- Using CASE and WHEN using Spark SQL
- Apache Spark using SQL - Basic Transformations
- Prepare or Create Tables using Spark SQL
- Projecting or Selecting Data using Spark SQL
- Filtering Data using Spark SQL
- Joining Tables using Spark SQL - Inner
- Joining Tables using Spark SQL - Outer
- Aggregating Data using Spark SQL
- Sorting Data using Spark SQL
- Apache Spark using SQL - Basic DDL and DML
- Introduction to Basic DDL and DML using Spark SQL
- Create Spark Metastore Tables using Spark SQL
- Overview of Data Types for Spark Metastore Table Columns
- Adding Comments to Spark Metastore Tables using Spark SQL
- Loading Data Into Spark Metastore Tables using Spark SQL - Local
- Loading Data Into Spark Metastore Tables using Spark SQL - HDFS
- Loading Data into Spark Metastore Tables using Spark SQL - Append and Overwrite
- Creating External Tables in Spark Metastore using Spark SQL
- Managed Spark Metastore Tables vs External Spark Metastore Tables
- Overview of Spark Metastore Table File Formats
- Drop Spark Metastore Tables and Databases
- Truncating Spark Metastore Tables
- Exercise - Managed Spark Metastore Tables
- Apache Spark using SQL - DML and Partitioning
- Introduction to DML and Partitioning of Spark Metastore Tables using Spark SQL
- Introduction to Partitioning of Spark Metastore Tables using Spark SQL
- Creating Spark Metastore Tables using Parquet File Format
- Load vs. Insert into Spark Metastore Tables using Spark SQL
- Inserting Data using Stage Spark Metastore Table using Spark SQL
- Creating Partitioned Spark Metastore Tables using Spark SQL
- Adding Partitions to Spark Metastore Tables using Spark SQL
- Loading Data into Partitioned Spark Metastore Tables using Spark SQL
- Inserting Data into Partitions of Spark Metastore Tables using Spark SQL
- Using Dynamic Partition Mode to insert data into Spark Metastore Tables
- Spark Streaming using Window Functions
- What are Discretized Streams?
- How to Create Discretized Streams
- Transformations on DStreams
- Transformation Operation
- Window Operations
- window
- countByWindow
- reduceByKeyAndWindow
- countByValueAndWindow
- Output Operations on DStreams
- foreachRDD
- SQL Operations
- Aggregating Dataframes
- Grouping Aggregations
- Windowing Aggregations
- Advanced PySpark
- Join Operations
- Stateful Transformations
- Checkpointing
- Accumulators
- Fault Tolerance
- DataFrame Joins and Column Name Ambiguity
- Outer Joins in DataFrames
- Internals of Spark Join and shuffle
- Optimizing your joins
- Implementing Bucket Joins
- Streaming Aggregates and State Store
- Incremental Aggregates and Update Mode
- Spark Streaming Output Modes
- Stateful vs. Stateless Aggregation
- Implementing Stateless Streaming Aggregation
- Time-Bound Stateful Tumbling Window Aggregation
- Watermarking and State Store Cleanup
- Sliding Window Aggregates
- Spark Structured Streaming
- Introduction to Structured Streaming
- Operations on Streaming DataFrames and Datasets
- Window Operations
- Handling Late Data and Watermarking
- Performance Tuning
- PySpark Streaming with Apache Kafka
- Integration with Kafka Text Lecture
- PySpark Streaming with Azure Databricks
- Spark Programming Model and Execution
- Execution Methods - How to Run Spark Programs?
- Spark Distributed Processing Model - How your program runs?
- Spark Execution Modes and Cluster Managers
- Summarizing Spark Execution Models - When to use What?
- Working with PySpark Shell - Demo
- Working with Notebooks in Cluster - Demo
- Working with Spark Submit
- Creating Spark Project Build Configuration
- Configuring Spark Project Application Logs
- Creating Spark Session
- Configuring Spark Session
- DataFrame Introduction
- DataFrame Partitions and Executors
- Spark Transformations and Actions
- Spark Jobs, Stages and Tasks
- Understanding your Execution Plan
- Unit Testing Spark Application
- There will be 4-5 use cases