Apache Spark Architecture & Features

Kayalvizhi T.

Let’s discuss the Apache Spark architecture. Spark is a distributed computing platform designed for fast and flexible large-scale parallel data processing. It follows a master-slave architecture, which requires a cluster.

A cluster is a pool of computers working together but viewed as a single system. For example, if I have ten worker nodes, each with 16 CPU cores and 64 GB RAM, my total capacity is 160 CPU cores and 640 GB of RAM.
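
To make the arithmetic explicit, here is a tiny Python sketch using the illustrative numbers above (the node count and per-node resources are just the example values, not a recommendation):

    # Worked example: total cluster capacity from the figures above
    worker_nodes = 10
    cores_per_node = 16
    ram_gb_per_node = 64

    total_cores = worker_nodes * cores_per_node    # 10 * 16 = 160 cores
    total_ram_gb = worker_nodes * ram_gb_per_node  # 10 * 64 = 640 GB

    print(f"Total capacity: {total_cores} cores, {total_ram_gb} GB RAM")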

The Spark architecture consists of four components: the Spark driver, the executors, the cluster manager, and the worker nodes. It uses Datasets and DataFrames as the fundamental data abstractions to optimize Spark processing and big data computation.

Spark Architecture

The driver program runs on the master node. It is the central coordinator that manages the execution of the Spark application, and it initiates the SparkContext (SC)/SparkSession, which is the entry point for the application. The SparkContext contains all of the basic functions. The Spark driver includes several other components, including a DAG scheduler, task scheduler, backend scheduler, and block manager, all of which are responsible for translating user-written code into jobs that are actually executed on the cluster.
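
As a minimal PySpark sketch (the application name is arbitrary), the driver program is where the SparkSession is created; it serves as the entry point and carries the underlying SparkContext:

    from pyspark.sql import SparkSession

    # The driver creates the SparkSession, the entry point for the application.
    spark = (SparkSession.builder
             .appName("ArchitectureDemo")
             .getOrCreate())

    # The SparkSession wraps the underlying SparkContext.
    sc = spark.sparkContext
    print(sc.applicationId)   # identifier assigned when the app registers with the cluster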

The cluster manager manages the execution of the various jobs in the cluster. The SparkContext communicates with the cluster manager and requests resources (RAM and cores). The cluster manager allocates resources for the job, launches executors, and locates the data on the worker nodes. Each executor sends heartbeats to the driver program, receives work from it, executes the tasks, and sends the results back to the driver.
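
As an illustration of how these resources are requested, the executor count, cores, and memory can be set through configuration when the SparkSession is created; the values below are placeholders, and spark.executor.instances applies when running on a cluster manager such as YARN rather than in local mode:

    from pyspark.sql import SparkSession

    # Illustrative resource request passed to the cluster manager via configuration.
    spark = (SparkSession.builder
             .appName("ResourceDemo")
             .config("spark.executor.instances", "10")   # number of executors
             .config("spark.executor.cores", "4")        # cores per executor
             .config("spark.executor.memory", "8g")      # RAM per executor
             .getOrCreate())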

Once the job has been broken down into smaller tasks, which are then distributed to the worker nodes, the Spark driver controls their execution. The executors are in charge of carrying out these tasks. The lifespan of the executors is the same as that of the Spark application. We can increase the number of workers if we want to improve the performance of the system; in this way, jobs can be divided into more parallel parts.

Features of Apache Spark

Apache Spark has many features which make it a great choice as a big data processing engine. Many of these features establish the advantages of Apache Spark over other big data processing engines. Let us look at some of the main features in detail.

  1. Fault Tolerance: Apache Spark is designed to handle worker node failures. It achieves this fault tolerance by using the DAG (directed acyclic graph) and RDDs (Resilient Distributed Datasets). The DAG contains the lineage of all the transformations and actions needed to complete a task, so in the event of a worker node failure, the same results can be achieved by rerunning the steps from the existing DAG.
  2. Lazy Evaluation: Spark does not evaluate any transformation immediately. All the transformations are lazily evaluated: they are added to the DAG, and the final computation or results are available only when actions are called. This gives Spark the ability to make optimization decisions, as all the transformations become visible to the Spark engine before any action is performed (see the sketch after this list).
  3. Real-Time Stream Processing: Spark Streaming brings stream processing to Apache Spark, letting you write streaming jobs the same way you write batch jobs.
  4. In-Memory Computing: Unlike Hadoop MapReduce, Apache Spark is capable of processing tasks in memory and is not required to write intermediate results back to disk. This feature gives Spark a massive speed advantage. Over and above this, Spark can also cache intermediate results so that they can be reused in the next iteration. This gives Spark an added performance boost for iterative and repetitive processes (also illustrated in the sketch after this list).
  5. Support for Multiple Languages: Spark comes with built-in multi-language support. Most of its APIs are available in Java, Scala, Python and R, and there are advanced features available with the R language for data analytics. In addition, Spark comes with Spark SQL, which offers SQL-like querying; SQL developers therefore find it very easy to use, and the learning curve is greatly reduced.
  6. Integrated with Hadoop: Apache Spark integrates very well with the Hadoop file system, HDFS. It offers support for multiple file formats like Parquet, JSON, CSV, ORC, Avro, etc. Hadoop can easily be leveraged with Spark as an input data source or destination.
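
As a minimal PySpark sketch of lazy evaluation and in-memory caching (points 2 and 4 above; the file path and column names are placeholders), the transformations only build up the DAG, nothing runs until an action is called, and cache() keeps the intermediate result in memory for reuse:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("LazyEvalDemo").getOrCreate()

    # Transformations: nothing executes yet; Spark only records them in the DAG.
    df = spark.read.csv("sales.csv", header=True, inferSchema=True)   # placeholder path
    filtered = df.filter(F.col("amount") > 100)                       # placeholder column
    totals = filtered.groupBy("region").agg(F.sum("amount").alias("total"))

    # Mark the intermediate result for in-memory caching so repeated actions reuse it.
    totals.cache()

    # Actions: only now does Spark optimize the DAG and actually run the job.
    totals.count()   # first action computes and caches the result
    totals.show(5)   # second action reads from the cache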

Spark job execution steps:

  1. Spark creates one job for each action.
  2. This job may contain a series of multiple transformations.
  3. The Spark engine optimizes those transformations and creates a logical plan for the job.
  4. Then Spark breaks the logical plan at the end of every wide dependency and creates two or more stages.
  5. If you have only narrow dependencies, your plan will be a single-stage plan.
  6. But if you have N wide dependencies, your plan will have N+1 stages.
  7. Data is shared between stages using the shuffle/sort operation.
  8. Now each stage may be executed as one or more parallel tasks.
  9. The number of tasks in the stage is equal to the number of input partitions.
  10. The task is the most critical concept for a Spark job and is the smallest unit of work in a Spark job. The Spark driver assigns these tasks to the executors and asks them to do the work.
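
A small PySpark sketch of these steps (the data is made up): the groupBy below introduces one wide dependency, so the count() action triggers a job that runs as two stages, with the number of tasks in each stage following the partition count:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("StagesDemo").getOrCreate()

    # A tiny in-memory DataFrame (values are made up for illustration).
    df = spark.createDataFrame(
        [("A", 10), ("B", 20), ("A", 30), ("C", 40)],
        ["key", "value"],
    )

    # Narrow dependency: filter works partition by partition, no shuffle needed.
    narrow = df.filter(F.col("value") > 15)

    # Wide dependency: groupBy requires a shuffle, so Spark breaks the plan here.
    wide = narrow.groupBy("key").agg(F.sum("value").alias("total"))

    # The action creates one job; with one wide dependency it runs as two stages.
    wide.count()

Running this with the Spark UI open (or calling wide.explain()) shows the shuffle boundary where the single wide dependency splits the job into two stages.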

The following are the steps in query execution in Spark SQL/DataFrames/Datasets.

When a query is executed, Spark first checks the syntax and creates an unresolved logical plan. This means there are unresolved attributes and relations in the plan. Spark then looks into the catalog to fill in the missing information for the plan, namely field and Dataset information. This leads to the generation of a resolved (analyzed) logical plan.

Next, a series of optimizations (like filter pushdown) is performed, which generates an optimized logical plan. This optimization engine in Spark is called the Catalyst optimizer. The optimized plan is then converted into multiple physical plans, and a cost model is used to select the optimal physical plan. This then goes into the final code generation step, after which the query is executed to produce the final output as RDDs.
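
You can see these phases for any query by asking Spark to print its plans; a minimal sketch (the data and column names are only for illustration) using DataFrame.explain:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("PlanDemo").getOrCreate()

    df = spark.createDataFrame([("A", 1), ("B", 2)], ["key", "value"])
    query = df.filter(F.col("value") > 1).select("key")

    # Prints the parsed (unresolved) logical plan, the analyzed logical plan,
    # the optimized logical plan produced by Catalyst, and the chosen physical plan.
    query.explain(extended=True)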

In the next article, we will discuss memory allocation in Spark, the types of memory involved, and their uses.
