
[Spark Advanced] -- Revisiting Spark's High-Level Architecture

The Spark ecosystem is built almost entirely around Spark Core, so let's start with Spark Core's high-level architecture.

Let's walk through the key concepts one by one.

1. Driver Programs
        A driver program is an application that uses Spark as a library. It provides the data processing code that Spark executes on the worker nodes. A driver program can launch one or more jobs on a Spark cluster.
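As a concrete illustration, here is a minimal sketch of a driver program in Scala. The object name, application name, and input path are hypothetical placeholders, not values from the original article.

```scala
import org.apache.spark.sql.SparkSession

// A minimal driver program: a standalone application that uses Spark
// as a library. The driver builds the computation; the executors on
// the worker nodes run the resulting tasks.
object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCount")               // hypothetical application name
      .getOrCreate()

    val counts = spark.sparkContext
      .textFile("hdfs:///data/input.txt") // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)      // an action launches a job
    spark.stop()
  }
}
```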

2. Executors
       An executor is a JVM (Java virtual machine) process that Spark creates on each worker for an application. It executes application code concurrently in multiple threads. It can also cache data in memory or on disk.
An executor has the same lifespan as the application for which it is created. When a Spark application terminates, all the executors created for it also terminate.
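For illustration, here is a hedged sketch of how an application typically requests executor resources. The configuration keys (`spark.executor.instances`, `spark.executor.cores`, `spark.executor.memory`) are standard Spark properties; the values are arbitrary examples.

```scala
import org.apache.spark.sql.SparkSession

// Request 3 executors, each a JVM with 4 task slots and a 4 GB heap.
val spark = SparkSession.builder()
  .appName("executor-config-example")       // hypothetical name
  .config("spark.executor.instances", "3")  // executors for this application
  .config("spark.executor.cores", "4")      // concurrent task threads each
  .config("spark.executor.memory", "4g")    // heap per executor JVM
  .getOrCreate()
```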

3. Tasks
      A task is the smallest unit of work that Spark sends to an executor. It is executed by a thread in an executor on a worker node. Each task performs some computation to either return a result to the driver program or partition its output for a shuffle.
Spark creates one task per data partition. An executor runs one or more tasks concurrently. The degree of parallelism is determined by the number of partitions: more partitions mean more tasks processing data in parallel.
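A small sketch of the partition-to-task relationship, assuming an existing SparkSession named `spark`:

```scala
// The number of partitions determines how many tasks Spark creates.
val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 4)
println(rdd.getNumPartitions)   // 4 -> four tasks per stage

// More partitions mean more, smaller tasks that can run in parallel.
val wider = rdd.repartition(16)
println(wider.getNumPartitions) // 16
```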
 

Application Execution


    This section briefly describes how data processing code is executed on a Spark cluster.

Terminology
    Let’s define a few terms first:
    Shuffle. A shuffle redistributes data among a cluster of nodes. It is an expensive operation because it involves moving data across a network. Note that a shuffle does not randomly redistribute data; it groups data elements into buckets based on some criteria. Each bucket forms a new partition.
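To make the shuffle boundary concrete, here is a sketch (assuming a SparkSession named `spark`) contrasting a narrow transformation with one that shuffles:

```scala
val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// map transforms each element in place: no data moves between nodes.
val doubled = pairs.map { case (k, v) => (k, v * 2) }

// reduceByKey groups elements into buckets by key, so records with the
// same key must travel to the same partition: this is a shuffle.
val summed = pairs.reduceByKey(_ + _)
```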

    Job. A job is a set of computations that Spark performs to return results to a driver program. Essentially, it is an execution of a data processing algorithm on a Spark cluster. An application can launch multiple jobs. Exactly how a job is executed is covered in the next section.
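A brief sketch of how actions map to jobs, again assuming a SparkSession named `spark`:

```scala
val nums = spark.sparkContext.parallelize(1 to 100)

// Transformations are lazy and launch no job by themselves.
val squares = nums.map(n => n * n)

// Each action launches a separate job on the cluster.
val total = squares.sum()    // job 1
val count = squares.count()  // job 2
val first = squares.take(5)  // job 3
```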

    Stage. A stage is a collection of tasks. Spark splits a job into a DAG of stages. A stage may depend on another stage. For example, a job may be split into two stages, stage 0 and stage 1, where stage 1 cannot begin until stage 0 is completed. Spark groups tasks into stages using shuffle boundaries. Tasks that do not require a shuffle are grouped into the same stage. A task that requires its input data to be shuffled begins a new stage.
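The following sketch shows where a stage boundary falls in a simple word count (the input path is hypothetical):

```scala
val counts = spark.sparkContext
  .textFile("hdfs:///data/words.txt")  // hypothetical path
  .flatMap(_.split("\\s+"))            // stage 0
  .map((_, 1))                         // stage 0 (no shuffle yet)
  .reduceByKey(_ + _)                  // shuffle boundary: stage 1 begins
  .collect()                           // action: runs stage 0, then stage 1
```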

How an Application Works


    With the definitions out of the way, I can now describe how a Spark application processes data in parallel across a cluster of nodes. When a Spark application is run, Spark connects to a cluster manager and acquires executors on the worker nodes. As mentioned earlier, a Spark application submits a data processing algorithm as a job. Spark splits the job into a directed acyclic graph (DAG) of stages. It then schedules the execution of these stages on the executors using a low-level scheduler provided by the cluster manager. The executors run the tasks submitted by Spark in parallel.
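As a sketch of the first step, this is roughly how a driver connects to a cluster manager; the master URL points at a hypothetical standalone master and would normally be supplied via `spark-submit --master` rather than hard-coded:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cluster-example")          // hypothetical name
  .master("spark://master-host:7077")  // assumed standalone cluster manager
  .getOrCreate()
// From here, each action launches a job that Spark splits into a DAG of
// stages and schedules as tasks on the acquired executors.
```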

    Every Spark application gets its own set of executors on the worker nodes. This design provides a few benefits.
    First, tasks from different applications are isolated from each other, since they run in different JVM processes. A misbehaving task from one application cannot crash another Spark application.

    Second, scheduling of tasks becomes easier. Spark has to schedule the tasks belonging to only one application at a time. It does not have to handle the complexities of scheduling tasks from multiple concurrently running applications.
    However, this design also has one disadvantage. Since applications run in separate JVM processes, they cannot easily share data. Even though they may be running on the same worker nodes, they cannot share data without writing it to disk. As previously mentioned, writing data to and reading it from disk are expensive operations, so applications that share data through disk will experience performance issues.
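A sketch of what sharing through disk looks like in practice; the path is hypothetical and `spark` is each application's own SparkSession:

```scala
// In application A: persist results to shared external storage.
val resultsA = spark.range(0, 1000).toDF("id")
resultsA.write.mode("overwrite").parquet("hdfs:///shared/results")

// In application B (a separate driver and separate executor JVMs):
// read the data back, paying the disk I/O cost described above.
val resultsB = spark.read.parquet("hdfs:///shared/results")
```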
 

To summarize:

  1. A physical node can run one or more workers.
  2. A worker can host one or more executors.
  3. An executor owns multiple CPU cores and a chunk of memory.
  4. A job is split into stages at shuffle boundaries; a shuffle (which regroups loosely structured data into partitions organized by some rule) starts a new stage.
  5. Each partition corresponds to one task (see the sketch below for how these numbers combine).
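A small worked sketch tying these numbers together; all values are hypothetical:

```scala
// Suppose the application asked for:
val executors    = 3                        // spark.executor.instances
val coresPerExec = 4                        // spark.executor.cores
val slots        = executors * coresPerExec // 12 tasks can run at once

// A stage over 48 partitions therefore runs as 48 tasks,
// executed in roughly 48 / 12 = 4 waves of parallel tasks.
```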