Apache Spark Architecture & Features

Kayalvizhi T.

Let’s discuss the Apache Spark architecture. Spark is a distributed computing platform designed for fast and flexible large-scale parallel data processing. It follows a master-slave architecture, which requires a cluster.

A cluster is a pool of computers working together but viewed as a single system. For example, if I have ten worker nodes, each with 16 CPU cores and 64 GB RAM, my total capacity is 160 CPU cores and 640 GB of RAM.
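
To make the arithmetic explicit, here is a tiny Python sketch using the illustrative numbers above (the node count and per-node resources are just the example values, not a recommendation):

    # Worked example: total cluster capacity from the figures above
    worker_nodes = 10
    cores_per_node = 16
    ram_gb_per_node = 64

    total_cores = worker_nodes * cores_per_node    # 10 * 16 = 160 cores
    total_ram_gb = worker_nodes * ram_gb_per_node  # 10 * 64 = 640 GB

    print(f"Total capacity: {total_cores} cores, {total_ram_gb} GB RAM")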

The Spark architecture consists of four components: the Spark driver, the executors, the cluster manager, and the worker nodes. It uses Datasets and DataFrames as the fundamental data abstractions to optimize Spark processing and big data computation.

Spark Architecture

The driver program runs on the master node. It is the central coordinator that manages the execution of the Spark application, and it initiates the SparkContext (SC)/SparkSession, which is the entry point for the application. The SparkContext contains all of the basic functions. The Spark driver includes several other components, including a DAG scheduler, task scheduler, backend scheduler, and block manager, all of which are responsible for translating user-written code into jobs that are actually executed on the cluster.
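
As a minimal PySpark sketch (the application name is arbitrary), the driver program is where the SparkSession is created; it serves as the entry point and carries the underlying SparkContext:

    from pyspark.sql import SparkSession

    # The driver creates the SparkSession, the entry point for the application.
    spark = (SparkSession.builder
             .appName("ArchitectureDemo")
             .getOrCreate())

    # The SparkSession wraps the underlying SparkContext.
    sc = spark.sparkContext
    print(sc.applicationId)   # identifier assigned when the app registers with the cluster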

The cluster manager manages the execution of the various jobs in the cluster. The SparkContext communicates with the cluster manager and requests resources (RAM and cores). The cluster manager allocates resources for the job, launches executors, and locates the data on the worker nodes. Each executor sends heartbeats to the driver program, receives work from it, executes the tasks, and sends the results back to the driver.
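
As an illustration of how these resources are requested, the executor count, cores, and memory can be set through configuration when the SparkSession is created; the values below are placeholders, and spark.executor.instances applies when running on a cluster manager such as YARN rather than in local mode:

    from pyspark.sql import SparkSession

    # Illustrative resource request passed to the cluster manager via configuration.
    spark = (SparkSession.builder
             .appName("ResourceDemo")
             .config("spark.executor.instances", "10")   # number of executors
             .config("spark.executor.cores", "4")        # cores per executor
             .config("spark.executor.memory", "8g")      # RAM per executor
             .getOrCreate())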

Once the job has been broken down into smaller tasks, which are then distributed to the worker nodes, the Spark driver controls their execution. The executors are in charge of carrying out these tasks. The lifespan of the executors is the same as that of the Spark application. We can increase the number of workers if we want to improve the performance of the system; in this way, jobs can be divided into more parallel parts.

Features of Apache Spark

Apache Spark has many features which make it a great choice as a big data processing engine. Many of these features establish the advantages of Apache Spark over other big data processing engines. Let us look at some of the main features in detail.

  1. Fault Tolerance: Apache Spark is designed to handle worker node failures. It achieves this fault tolerance by using the DAG (directed acyclic graph) and RDDs (Resilient Distributed Datasets). The DAG contains the lineage of all the transformations and actions needed to complete a task, so in the event of a worker node failure, the same results can be achieved by rerunning the steps from the existing DAG.
  2. Lazy Evaluation: Spark does not evaluate any transformation immediately. All the transformations are lazily evaluated: they are added to the DAG, and the final computation or results are available only when actions are called. This gives Spark the ability to make optimization decisions, as all the transformations become visible to the Spark engine before any action is performed (see the sketch after this list).
  3. Real-Time Stream Processing: Spark Streaming brings stream processing to Apache Spark, letting you write streaming jobs the same way you write batch jobs.
  4. In-Memory Computing: Unlike Hadoop MapReduce, Apache Spark is capable of processing tasks in memory and is not required to write intermediate results back to disk. This feature gives Spark a massive speed advantage. Over and above this, Spark can also cache intermediate results so that they can be reused in the next iteration. This gives Spark an added performance boost for iterative and repetitive processes (also illustrated in the sketch after this list).
  5. Support for Multiple Languages: Spark comes with built-in multi-language support. Most of its APIs are available in Java, Scala, Python and R, and there are advanced features available with the R language for data analytics. In addition, Spark comes with Spark SQL, which offers SQL-like querying; SQL developers therefore find it very easy to use, and the learning curve is greatly reduced.
  6. Integrated with Hadoop: Apache Spark integrates very well with the Hadoop file system, HDFS. It offers support for multiple file formats like Parquet, JSON, CSV, ORC, Avro, etc. Hadoop can easily be leveraged with Spark as an input data source or destination.
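
As a minimal PySpark sketch of lazy evaluation and in-memory caching (points 2 and 4 above; the file path and column names are placeholders), the transformations only build up the DAG, nothing runs until an action is called, and cache() keeps the intermediate result in memory for reuse:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("LazyEvalDemo").getOrCreate()

    # Transformations: nothing executes yet; Spark only records them in the DAG.
    df = spark.read.csv("sales.csv", header=True, inferSchema=True)   # placeholder path
    filtered = df.filter(F.col("amount") > 100)                       # placeholder column
    totals = filtered.groupBy("region").agg(F.sum("amount").alias("total"))

    # Mark the intermediate result for in-memory caching so repeated actions reuse it.
    totals.cache()

    # Actions: only now does Spark optimize the DAG and actually run the job.
    totals.count()   # first action computes and caches the result
    totals.show(5)   # second action reads from the cache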

Spark job execution steps:

  1. Spark creates one job for each action.
  2. This job may contain a series of multiple transformations.
  3. The Spark engine optimizes those transformations and creates a logical plan for the job.
  4. Then Spark breaks the logical plan at the end of every wide dependency and creates two or more stages.
  5. If you have only narrow dependencies, your plan will be a single-stage plan.
  6. But if you have N wide dependencies, your plan will have N+1 stages.
  7. Data is shared between stages using the shuffle/sort operation.
  8. Now each stage may be executed as one or more parallel tasks.
  9. The number of tasks in the stage is equal to the number of input partitions.
  10. The task is the most critical concept for a Spark job and is the smallest unit of work in a Spark job. The Spark driver assigns these tasks to the executors and asks them to do the work.
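
A small PySpark sketch of these steps (the data is made up): the groupBy below introduces one wide dependency, so the count() action triggers a job that runs as two stages, with the number of tasks in each stage following the partition count:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("StagesDemo").getOrCreate()

    # A tiny in-memory DataFrame (values are made up for illustration).
    df = spark.createDataFrame(
        [("A", 10), ("B", 20), ("A", 30), ("C", 40)],
        ["key", "value"],
    )

    # Narrow dependency: filter works partition by partition, no shuffle needed.
    narrow = df.filter(F.col("value") > 15)

    # Wide dependency: groupBy requires a shuffle, so Spark breaks the plan here.
    wide = narrow.groupBy("key").agg(F.sum("value").alias("total"))

    # The action creates one job; with one wide dependency it runs as two stages.
    wide.count()

Running this with the Spark UI open (or calling wide.explain()) shows the shuffle boundary where the single wide dependency splits the job into two stages.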

The following are the steps in query execution in Spark SQL/DataFrames/Datasets.

When a query is executed, Spark first checks the syntax and creates an unresolved logical plan. This means there are unresolved attributes and relations in the plan. Spark then looks into the catalog to fill in the missing information for the plan, namely field and Dataset information. This leads to the generation of a resolved (analyzed) logical plan.

Next, a series of optimizations (like filter pushdown) is performed, which generates an optimized logical plan. This optimization engine in Spark is called the Catalyst optimizer. The optimized plan is then converted into multiple physical plans, and a cost model is used to select the optimal physical plan. This then goes into the final code generation step, after which the query is executed to produce the final output as RDDs.
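
You can see these phases for any query by asking Spark to print its plans; a minimal sketch (the data and column names are only for illustration) using DataFrame.explain:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("PlanDemo").getOrCreate()

    df = spark.createDataFrame([("A", 1), ("B", 2)], ["key", "value"])
    query = df.filter(F.col("value") > 1).select("key")

    # Prints the parsed (unresolved) logical plan, the analyzed logical plan,
    # the optimized logical plan produced by Catalyst, and the chosen physical plan.
    query.explain(extended=True)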

In the next article, we will discuss memory allocation in Spark, the types of memory involved, and their uses.
