
Spark 2.3.0 + Kubernetes Application Deployment



Spark can run on clusters managed by Kubernetes; native Kubernetes scheduling support has been added to Spark. Kubernetes scheduling is currently experimental, and in future versions there may be behavioral changes around Spark's configuration, container images, and entrypoints.

(1)     Prerequisites.

  • Spark 2.3 or a newer version.
  • A Kubernetes cluster at version 1.6 or above, with access configured via kubectl. If you do not already have a working Kubernetes cluster, you can set up a local test cluster using minikube.
  • We recommend the latest version of minikube with the DNS addon enabled.
  • Note that the default minikube configuration is not enough for running Spark applications; we recommend 3 CPUs and 4 GB of memory, which is enough to start a Spark application with a single executor.
  • You must have appropriate permissions to list, create, edit, and delete pods in your cluster; you can verify these with kubectl auth can-i <list|create|edit|delete> pods. The service account credentials used by the driver pods must be allowed to create pods, services, and configmaps.
  • Kubernetes DNS must be configured in the cluster.

(2)     How it works. The way Spark works on Kubernetes is shown in Figure 2-1.

Figure 2-1: Kubernetes scheduling diagram

spark-submit can be used to submit a Spark application directly to a Kubernetes cluster. The submission mechanism works as follows:

• Spark creates a Spark driver running within a Kubernetes pod.

• The driver creates executors, which also run within Kubernetes pods, connects to them, and executes application code.

• When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in a "completed" state in the Kubernetes API until it is eventually garbage collected or manually cleaned up. Note that in the completed state, the driver pod does not use any computational or memory resources.

The driver and executor pods are scheduled by Kubernetes. It is possible to schedule the driver and executor pods on a subset of available nodes through a node selector using the relevant configuration properties. It will be possible to use more advanced scheduling hints like node/pod affinities in a future release.

(3)     Submitting applications to Kubernetes.

Docker images: Kubernetes requires users to supply images that can be deployed into containers within pods. The images are built for a container runtime environment that Kubernetes supports; Docker is a container runtime environment frequently used with Kubernetes. Spark (starting with version 2.3) ships with a Dockerfile that can be used for this purpose, or customized to match an individual application's needs; it can be found in the kubernetes/dockerfiles/ directory. Spark also ships with a bin/docker-image-tool.sh script that can be used to build and publish Docker images for use with the Kubernetes backend.

For example:

$ ./bin/docker-image-tool.sh -r <repo> -t my-tag build

$ ./bin/docker-image-tool.sh -r <repo> -t my-tag push

(4)     Cluster mode.

To submit the Spark Pi example in cluster mode:

$ bin/spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=5 \
    --conf spark.kubernetes.container.image=<spark-image> \
    local:///path/to/examples.jar

The Spark master, specified either via the --master command-line argument to spark-submit or by setting spark.master in the application's configuration, must be a URL of the form k8s://<api_server_url>. Prefixing the master string with k8s:// causes the Spark application to launch on the Kubernetes cluster, with the API server being contacted at api_server_url. If no HTTP protocol is specified in the URL, it defaults to https. For example, setting the master to k8s://example.com:443 is equivalent to setting it to k8s://https://example.com:443; to connect without TLS on a different port, the master can be set to k8s://http://example.com:8080.
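
The master URL rules described above can be sketched as a small helper function. This is purely illustrative (the function name is our own, not part of Spark): it prepends the k8s:// prefix and, when no scheme is given, defaults to https, mirroring what Spark does internally.

```shell
# Hypothetical helper mirroring Spark's k8s:// master URL handling:
# an API server URL without a scheme is assumed to be https.
to_k8s_master() {
  case "$1" in
    http://*|https://*) echo "k8s://$1" ;;            # scheme given: keep it
    *)                  echo "k8s://https://$1" ;;    # no scheme: default to https
  esac
}

to_k8s_master "example.com:443"          # prints k8s://https://example.com:443
to_k8s_master "http://example.com:8080"  # prints k8s://http://example.com:8080
```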

In Kubernetes mode, the Spark application name, specified via spark.app.name or the --name argument to spark-submit, is used by default to name the Kubernetes resources that are created, such as drivers and executors. The application name must consist of lowercase alphanumeric characters, '-', and '.', and must start and end with an alphanumeric character.
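
The naming rule above can be expressed as a regular expression. The following check is a sketch of our own (Spark performs its own validation internally); the function name and examples are illustrative:

```shell
# Hypothetical validator for the application name rule: lowercase
# alphanumerics, '-' and '.', starting and ending with an alphanumeric.
valid_app_name() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]([a-z0-9.-]*[a-z0-9])?$'
}

valid_app_name "spark-pi" && echo "spark-pi: ok"        # prints spark-pi: ok
valid_app_name "Spark_Pi" || echo "Spark_Pi: invalid"   # prints Spark_Pi: invalid
```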

If you have a Kubernetes cluster set up, one way to discover the API server URL is by executing kubectl cluster-info:

$ kubectl cluster-info

Kubernetes master is running at http://127.0.0.1:6443

In the above example, this Kubernetes cluster can be used with spark-submit by specifying --master k8s://http://127.0.0.1:6443 as an argument. Additionally, it is also possible to use the authenticating proxy, kubectl proxy, to communicate with the Kubernetes API.

To start the local proxy:

$ kubectl proxy

If the local proxy is running at localhost:8001, --master k8s://http://127.0.0.1:8001 can be used as the spark-submit argument. Finally, notice that in the example above we specify a jar with a specific URI scheme, local://; this URI points to the example jar that is already in the Docker image.

(5)     Dependency management.

If your application's dependencies are all hosted in remote locations like HDFS or HTTP servers, they may be referred to by their appropriate remote URIs. Also, application dependencies can be pre-mounted into custom-built Docker images. Those dependencies can be added to the classpath by referencing them with local:// URIs and/or by setting the SPARK_EXTRA_CLASSPATH environment variable in your Dockerfiles. The local:// scheme requires the dependencies to be in the custom-built Docker image. Note that using dependencies from the submission client's local file system is currently not yet supported.
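
As a sketch of the pre-mounted approach, a custom Dockerfile might copy a dependency into the image and expose it via SPARK_EXTRA_CLASSPATH. The base image name and the jar paths below are placeholders, not part of the Spark distribution:

```dockerfile
# Build on a Spark image produced by bin/docker-image-tool.sh
# (image name is a placeholder).
FROM my-repo/spark:my-tag

# Bake an application dependency into the image...
COPY app-deps.jar /opt/spark/extra/app-deps.jar

# ...and add it to the classpath inside the driver/executor containers.
ENV SPARK_EXTRA_CLASSPATH /opt/spark/extra/app-deps.jar
```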

(6)     Using remote dependencies.

When there are application dependencies hosted in remote locations like HDFS or HTTP servers, the driver and executor pods need a Kubernetes init-container for downloading the dependencies so the driver and executor containers can use them locally.

The init-container handles remote dependencies specified in spark.jars (or the --jars option of spark-submit) and spark.files (or the --files option of spark-submit), as well as a remotely hosted main application resource, e.g. the main application jar. The following shows an example of using remote dependencies with spark-submit:

$ bin/spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --jars https://path/to/dependency1.jar,https://path/to/dependency2.jar \
    --files hdfs://host:port/path/to/file1,hdfs://host:port/path/to/file2 \
    --conf spark.executor.instances=5 \
    --conf spark.kubernetes.container.image=<spark-image> \
    https://path/to/examples.jar

(7)     Secret management.

Kubernetes secrets can be used to provide Spark applications with credentials for accessing secured services. To mount user-specified secrets into the driver container, users can use the configuration property spark.kubernetes.driver.secrets.[SecretName]=<mount path>. Similarly, to mount user-specified secrets into the executor containers, users can use spark.kubernetes.executor.secrets.[SecretName]=<mount path>. Note that the secrets to be mounted are assumed to be in the same namespace as the driver and executor pods. For example, to mount a secret named spark-secret onto the path /etc/secrets in both the driver and executor containers, add the following options to the spark-submit command:

--conf spark.kubernetes.driver.secrets.spark-secret=/etc/secrets
--conf spark.kubernetes.executor.secrets.spark-secret=/etc/secrets
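
The secret itself has to exist in the pods' namespace before submission. A minimal manifest for the spark-secret referenced above might look like the following; the key name and value are illustrative placeholders:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: spark-secret        # referenced by the spark.kubernetes.*.secrets options
type: Opaque
stringData:
  service-credentials: "placeholder-value"   # illustrative key/value
```

Equivalently, a secret can be created imperatively with kubectl create secret generic spark-secret.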

Note that if an init-container is used, any secret mounted into the driver container will also be mounted into the driver's init-container. Similarly, any secret mounted into an executor container will also be mounted into the executor's init-container.

(8)     Introspection and debugging.

These are the different ways in which you can investigate a running or completed Spark application and monitor its progress.

• Accessing logs. Logs can be accessed using the Kubernetes API or the kubectl CLI. While a Spark application is running, it is possible to stream its logs:

$ kubectl -n=<namespace> logs -f <driver-pod-name>

The same logs can also be accessed through the Kubernetes dashboard if it is installed on the cluster.

• Accessing the driver UI. The UI associated with an application can be accessed locally using kubectl port-forward:

$ kubectl port-forward <driver-pod-name> 4040:4040

• Debugging. There may be several kinds of failures. If the Kubernetes API server rejects the request from spark-submit, or the connection is refused for a different reason, the submission logic should indicate the error encountered. However, if the problem is with a running application, the best way to investigate may be through the Kubernetes CLI.

To get some basic information about the scheduling decisions made around the driver pod, you can run:

$ kubectl describe pod <spark-driver-pod>

If the pod has encountered a runtime error, the status can be probed further using:

$ kubectl logs <spark-driver-pod>

The status and logs of failed executor pods can be checked in similar ways. Finally, deleting the driver pod will clean up the entire Spark application, including all executors, the associated service, etc. The driver pod can be thought of as the Kubernetes representation of the Spark application.

(9)     Kubernetes features.

• Namespaces. Kubernetes has the concept of namespaces. Namespaces are a way to divide cluster resources between multiple users (via resource quota). Spark on Kubernetes can use namespaces to launch Spark applications; this is configured through spark.kubernetes.namespace. Kubernetes allows using ResourceQuota to set limits on resources, number of objects, etc. on individual namespaces. Namespaces and ResourceQuota can be used in combination by administrators to control sharing and resource allocation in a Kubernetes cluster running Spark applications.
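
For illustration, a ResourceQuota capping a namespace used for Spark jobs might look like the following; the namespace name and the limit values are placeholder choices, not recommendations:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-quota
  namespace: spark-jobs     # set spark.kubernetes.namespace=spark-jobs to use it
spec:
  hard:
    pods: "20"              # caps driver + executor pods in this namespace
    limits.cpu: "16"
    limits.memory: 32Gi
```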

• Role-based access control. In Kubernetes clusters with RBAC enabled, users can configure Kubernetes RBAC roles and service accounts used by the various Spark on Kubernetes components to access the Kubernetes API server.

The Spark driver pod uses a Kubernetes service account to access the Kubernetes API server to create and watch executor pods. The service account used by the driver pod must have the appropriate permissions for the driver to be able to do its work. Specifically, at a minimum, the service account must be granted a Role or ClusterRole that allows driver pods to create pods and services. By default, if no service account is specified when the pod gets created, the driver pod is automatically assigned the default service account in the namespace specified by spark.kubernetes.namespace.

Depending on the version and setup of the Kubernetes deployment, this default service account may or may not have a role that allows driver pods to create pods and services under the default Kubernetes RBAC policies. Sometimes users may need to specify a custom service account that has been granted the right role. Spark on Kubernetes supports specifying a custom service account for the driver pod through the configuration property spark.kubernetes.authenticate.driver.serviceAccountName=<service account name>. For example, to make the driver pod use the spark service account, a user simply adds the following option to the spark-submit command:

--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark

To create a custom service account, a user can use the kubectl create serviceaccount command. For example, the following command creates a service account named spark:

$ kubectl create serviceaccount spark

To grant a service account a Role or ClusterRole, a RoleBinding or ClusterRoleBinding is needed. To create a RoleBinding or ClusterRoleBinding, a user can use the kubectl create rolebinding (or clusterrolebinding for a ClusterRoleBinding) command. For example, the following command grants the edit ClusterRole in the default namespace to the spark service account:

$ kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default

Note that a Role can only be used to grant access to resources (like pods) within a single namespace, whereas a ClusterRole can be used to grant access to cluster-scoped resources (like nodes) as well as namespaced resources (like pods) across all namespaces. For Spark on Kubernetes, since the driver always creates executor pods in the same namespace, a Role is sufficient, although users may use a ClusterRole instead. For more information on RBAC authorization and how to configure Kubernetes service accounts for pods, please refer to Using RBAC Authorization (https://kubernetes.io/docs/admin/authorization/rbac/) and Configure Service Accounts for Pods (https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/).
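
The binding created by the kubectl command above can also be written declaratively. A manifest equivalent to that command (granting the built-in edit ClusterRole to the spark service account in the default namespace) might look like:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-role
subjects:
- kind: ServiceAccount
  name: spark
  namespace: default
roleRef:
  kind: ClusterRole
  name: edit            # built-in ClusterRole allowing create/edit of pods and services
  apiGroup: rbac.authorization.k8s.io
```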

(10) Client mode. Client mode is currently not supported.

(11) Future work. Spark on Kubernetes functionality is being incubated in the apache-spark-on-k8s/spark fork (https://github.com/apache-spark-on-k8s/spark) and will eventually make its way into future versions of the Spark-Kubernetes integration.

Some of the planned features include:

• PySpark

• R

• Dynamic Executor Scaling

• Dependency management of local files

• Spark application management

• Work queues and resource management

(12) Configuration.

See Spark's configuration page (http://spark.apache.org/docs/latest/configuration.html) for general Spark configuration information. For configuration specific to Spark on Kubernetes, see http://spark.apache.org/docs/latest/running-on-kubernetes.html.

Spark Properties

Each entry below lists the property name, its default value in parentheses, and its meaning.

spark.kubernetes.namespace (default: default): The namespace that will be used for running the driver and executor pods.
spark.kubernetes.container.image (default: none): Container image to use for the Spark application. This is usually of the form example.com/repo/spark:v1.0.0. This configuration is required and must be provided by the user, unless explicit images are provided for each different container type.
spark.kubernetes.driver.container.image (default: value of spark.kubernetes.container.image): Custom container image to use for the driver.
spark.kubernetes.executor.container.image (default: value of spark.kubernetes.container.image): Custom container image to use for executors.
spark.kubernetes.container.image.pullPolicy (default: IfNotPresent): Container image pull policy used when pulling images within Kubernetes.
spark.kubernetes.allocation.batch.size (default: 5): Number of pods to launch at once in each round of executor pod allocation.
spark.kubernetes.allocation.batch.delay (default: 1s): Time to wait between each round of executor pod allocation. Specifying values less than 1 second may lead to excessive CPU usage on the spark driver.
spark.kubernetes.authenticate.submission.caCertFile (default: none): Path to the CA cert file for connecting to the Kubernetes API server over TLS when starting the driver. This file must be located on the submitting machine's disk. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.submission.clientKeyFile (default: none): Path to the client key file for authenticating against the Kubernetes API server when starting the driver. This file must be located on the submitting machine's disk. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.submission.clientCertFile (default: none): Path to the client cert file for authenticating against the Kubernetes API server when starting the driver. This file must be located on the submitting machine's disk. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.submission.oauthToken (default: none): OAuth token to use when authenticating against the Kubernetes API server when starting the driver. Note that unlike the other authentication options, this is expected to be the exact string value of the token to use for the authentication.
spark.kubernetes.authenticate.submission.oauthTokenFile (default: none): Path to the OAuth token file containing the token to use when authenticating against the Kubernetes API server when starting the driver. This file must be located on the submitting machine's disk. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.driver.caCertFile (default: none): Path to the CA cert file for connecting to the Kubernetes API server over TLS from the driver pod when requesting executors. This file must be located on the submitting machine's disk, and will be uploaded to the driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.driver.clientKeyFile (default: none): Path to the client key file for authenticating against the Kubernetes API server from the driver pod when requesting executors. This file must be located on the submitting machine's disk, and will be uploaded to the driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme). If this is specified, it is highly recommended to set up TLS for the driver submission server, as this value is sensitive information that would be passed to the driver pod in plaintext otherwise.
spark.kubernetes.authenticate.driver.clientCertFile (default: none): Path to the client cert file for authenticating against the Kubernetes API server from the driver pod when requesting executors. This file must be located on the submitting machine's disk, and will be uploaded to the driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.driver.oauthToken (default: none): OAuth token to use when authenticating against the Kubernetes API server from the driver pod when requesting executors. Note that unlike the other authentication options, this must be the exact string value of the token to use for the authentication. This token value is uploaded to the driver pod. If this is specified, it is highly recommended to set up TLS for the driver submission server, as this value is sensitive information that would be passed to the driver pod in plaintext otherwise.
spark.kubernetes.authenticate.driver.oauthTokenFile (default: none): Path to the OAuth token file containing the token to use when authenticating against the Kubernetes API server from the driver pod when requesting executors. Note that unlike the other authentication options, this file must contain the exact string value of the token to use for the authentication. This token value is uploaded to the driver pod. If this is specified, it is highly recommended to set up TLS for the driver submission server, as this value is sensitive information that would be passed to the driver pod in plaintext otherwise.
spark.kubernetes.authenticate.driver.mounted.caCertFile (default: none): Path to the CA cert file for connecting to the Kubernetes API server over TLS from the driver pod when requesting executors. This path must be accessible from the driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.driver.mounted.clientKeyFile (default: none): Path to the client key file for authenticating against the Kubernetes API server from the driver pod when requesting executors. This path must be accessible from the driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.driver.mounted.clientCertFile (default: none): Path to the client cert file for authenticating against the Kubernetes API server from the driver pod when requesting executors. This path must be accessible from the driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.driver.mounted.oauthTokenFile (default: none): Path to the file containing the OAuth token to use when authenticating against the Kubernetes API server from the driver pod when requesting executors. This path must be accessible from the driver pod. Note that unlike the other authentication options, this file must contain the exact string value of the token to use for the authentication.
spark.kubernetes.authenticate.driver.serviceAccountName (default: default): Service account that is used when running the driver pod. The driver pod uses this service account when requesting executor pods from the API server. Note that this cannot be specified alongside a CA cert file, client key file, client cert file, and/or OAuth token.
spark.kubernetes.driver.label.[LabelName] (default: none): Add the label specified by LabelName to the driver pod. For example, spark.kubernetes.driver.label.something=true. Note that Spark also adds its own labels to the driver pod for bookkeeping purposes.
spark.kubernetes.driver.annotation.[AnnotationName] (default: none): Add the annotation specified by AnnotationName to the driver pod. For example, spark.kubernetes.driver.annotation.something=true.
spark.kubernetes.executor.label.[LabelName] (default: none): Add the label specified by LabelName to the executor pods. For example, spark.kubernetes.executor.label.something=true. Note that Spark also adds its own labels to the executor pods for bookkeeping purposes.
spark.kubernetes.executor.annotation.[AnnotationName] (default: none): Add the annotation specified by AnnotationName to the executor pods. For example, spark.kubernetes.executor.annotation.something=true.
spark.kubernetes.driver.pod.name (default: none): Name of the driver pod. If not set, the driver pod name is set to "spark.app.name" suffixed by the current timestamp to avoid name conflicts.
spark.kubernetes.executor.lostCheck.maxAttempts (default: 10): Number of times that the driver will try to ascertain the loss reason for a specific executor. The loss reason is used to ascertain whether the executor failure is due to a framework or an application error, which in turn decides whether the executor is removed and replaced, or placed into a failed state for debugging.
spark.kubernetes.submission.waitAppCompletion (default: true): In cluster mode, whether to wait for the application to finish before exiting the launcher process. When changed to false, the launcher has a "fire-and-forget" behavior when launching the Spark job.
spark.kubernetes.report.interval (default: 1s): Interval between reports of the current Spark job status in cluster mode.
spark.kubernetes.driver.limit.cores (default: none): Specify the hard CPU limit (https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod.
spark.kubernetes.executor.limit.cores (default: none): Specify the hard CPU limit (https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for each executor pod launched for the Spark application.
spark.kubernetes.node.selector.[labelKey] (default: none): Adds to the node selector of the driver pod and executor pods, with key labelKey and the value as the configuration's value. For example, setting spark.kubernetes.node.selector.identifier to myIdentifier will result in the driver pod and executors having a node selector with key identifier and value myIdentifier. Multiple node selector keys can be added by setting multiple configurations with this prefix.
spark.kubernetes.driverEnv.[EnvironmentVariableName] (default: none): Add the environment variable specified by EnvironmentVariableName to the driver process. The user can specify multiple of these to set multiple environment variables.
spark.kubernetes.mountDependencies.jarsDownloadDir (default: /var/spark-data/spark-jars): Location to download jars to in the driver and executors. This directory must be empty and will be mounted as an empty directory volume on the driver and executor pods.
spark.kubernetes.mountDependencies.filesDownloadDir (default: /var/spark-data/spark-files): Location to download files to in the driver and executors. This directory must be empty and will be mounted as an empty directory volume on the driver and executor pods.
spark.kubernetes.mountDependencies.timeout (default: 300s): Timeout in seconds before aborting the attempt to download and unpack dependencies from remote locations into the driver and executor pods.
spark.kubernetes.mountDependencies.maxSimultaneousDownloads (default: 5): Maximum number of remote dependencies to download simultaneously in a driver or executor pod.
spark.kubernetes.initContainer.image (default: value of spark.kubernetes.container.image): Custom container image for the init container of both driver and executors.
spark.kubernetes.driver.secrets.[SecretName] (default: none): Add the Kubernetes Secret named SecretName to the driver pod on the path specified in the value. For example, spark.kubernetes.driver.secrets.spark-secret=/etc/secrets. Note that if an init-container is used, the secret will also be added to the init-container in the driver pod.
spark.kubernetes.executor.secrets.[SecretName] (default: none): Add the Kubernetes Secret named SecretName to the executor pod on the path specified in the value. For example, spark.kubernetes.executor.secrets.spark-secret=/etc/secrets. Note that if an init-container is used, the secret will also be added to the init-container in the executor pod.
