1. 程式人生 > >Spark Release 2.0.0

Spark Release 2.0.0

原文連結       譯者:小村長

Spark2.0在2016年7月26日釋出,因為工作中經常用到,所以對它關注比較多,正好今天”提前”下班,所以抽空翻譯一下spark2.0發版概述,簡單的介紹一下spark2.0的新特性和新變化。好吧,現在就讓村長帶領大家一起走進spark2.0的神祕殿堂。同時也希望更多的人蔘入進來,知識因為共享才變的有意義和價值。

Spark 2.0.0是第一個在2.x線上發行的版本. 主要的更新是在API的可用性,SQL2003的支援,效能的提升,結構化流,R UDF的支援還用可操作性的提升. 另外, 這個發行版本包括超過2500個補丁來只300個貢獻者.

可以通過 downloads 來下載spark2.0. 你也可以訪問 

detailed changes來了解細節的改變. 我們向你展示每個模組的細節變化.

API穩定性

Spark 2.0.0是spark 2.x產品線上第一個發行版. Spark保證它所有2.x發行版非實驗性API的穩定性. 雖然APIs和1.x有很多相似之處, 同時Spark 2.0.0也有很多大的變化. 可以通過這個 網站來 檢視API的移除,修改和過時的資訊.

核心和Spark SQL

程式 APIs

在Spark2.0最大的變化是最新更新的APIs:

  • 統一了DataFrame 和Dataset: 在Scala 和Java中, DataFrame 和Dataset做了統一, 也就是說. DataFrame僅僅是 Dataset行的類型別名. 在 Python 和R中, 由於缺乏型別安全, DataFrame僅僅是主要的程式介面.
  • SparkSession: 一個新的入口點代替老的SQLContext 和HiveContext 對於 DataFrame 和Dataset APIs. SQLContext 和HiveContext 繼續保留為向後相容.
  • 一個新的, 最新型的配置API對於SparkSession
  • 更簡單的, 效能更好的累加器(accumulator) API
  • 一個新的, 提升了Datasets聚合API的效能

SQL

Spark 2.0大體上實現了對SQL2003的函式支援. Spark SQL現在能夠執行所有的 99 TPC-DS 查詢. 更多的詳細情況如下:

  • Spark自帶的SQL解析器不僅僅支援 ANSI-SQL標準同時也支援 Hive QL
  • 啟動了本地的DDL 命令
  • 子查詢, 包括
    • 不相關的標量子查詢
    • 相關的標量子查詢
    • 基於NOT IN的子查詢 (在 WHERE/HAVING 語句)
    • 基於IN 語句的子查詢 (在 WHERE/HAVING 語句)
    • 基於(NOT) EXISTS 語句的子查詢 (在 WHERE/HAVING 語句)
  • 標準化View 的支援

另外,當構建沒有Hive支援的時候, Spark SQL也包括幾乎所有的函式功能當構建Hive支援的時候, 當連線Hive異常, Hive UDFs, 和指令碼的轉換.

新特性

  • 本地CSV 資料來源, 構建在 Databricks’ spark-csv module
  • 關閉快取和執行期間的堆記憶體的管理
  • Hive的桶表支援
  •  使用sketches近似統計功能, 包括quantile, Bloom filter, and count-min sketch.

效能和執行時間

  • 實質性的效能提升(2 – 10X) 通過對SQL和DataFrames的操作是通過一個新的技術,我們稱之為整個階段的程式碼生成.
  • 提升了Parquet瀏覽速度通過吞吐量的向量化
  • 提升了ORC 效能
  • 化了在 Catalyst查詢選項的通用的工作負載
  • 通過繼承window本地函式來提升在window上執行的效能
  • 對於本地資料來源的自動檔案合併

MLlib

MLlib API是以DataFrame為基礎的. 以RDD為API進入了過度模式. 通過查詢MLlib 嚮導來了解更多細節

新特徵

  • ML persistence: The DataFrames-based API provides near-complete support for saving and loading ML models and Pipelines in Scala, Java, Python, and R. See this blog post for details. (SPARK-6725, SPARK-11939, SPARK-14311)
  • MLlib in R: SparkR now offers MLlib APIs for generalized linear models, naive Bayes, k-means clustering, and survival regression. See this talk to learn more.
  • Python: PySpark now offers many more MLlib algorithms, including LDA, Gaussian Mixture Model, Generalized Linear Regression, and more.scaling
  • Algorithms added to DataFrames-based API: Bisecting K-Means clustering, Gaussian Mixture Model, MaxAbsScaler feature transformer.

這次羅列了很多新的特徵.

速度/換算

向量和矩陣儲存在DataFrames中使其更高效的序列化, 使其reduce 呼叫MLlib 演算法更加高效. (SPARK-14850)

SparkR

SparkR在spark2.0中最大的提升就是添加了使用者自定義函式的功能. 使用者可以定義以下三種函式: dapply, gapply, 和 lapply. The first two can be used to do partition-based UDFs using dapply and gapply, e.g. partitioned model learning. The latter can be used to do hyper-parameter tuning.

另外,也增加如下新特性:

  • Improved algorithm coverage for machine learning in R, including naive Bayes, k-means clustering, and survival regression.
  • Generalized linear models support more families and link functions.
  • Save and load for all ML models.
  • More DataFrame functionality: Window functions API, reader, writer support for JDBC, CSV, SparkSession

Streaming

Spark 2.0 ships the initial experimental release for Structured Streaming, a high level streaming API built on top of Spark SQL and the Catalyst optimizer. Structured Streaming enables users to program against streaming sources and sinks using the same DataFrame/Dataset API as in static data sources, leveraging the Catalyst optimizer to automatically incrementalize the query plans.

For the DStream API, the most prominent update is the new experimental support for Kafka 0.10.

依賴和包的改進

在最新的Spark中對spark的操作和包裝進行了改進:

  • Spark 2.0 n不在要求把所有的依賴打包到一個jar中.
  • Akka 依賴被移除, 使用者根據自己的需求適配任何版本的Akka.
  • Kryo版本適配到3.0.
  • 預設的採用 Scala 2.11編譯,二而不是Scala 2.10.

移除,特徵改變,過時

移除的

以下的特性在Spark2.0已經刪除:

  • Bagel
  • 不在支援Hadoop2.1和更早版本
  • 配置關閉序列化的選項
  • HTTPBroadcast
  • TTL-based metadata cleaning
  • Semi-private class org.apache.spark.Logging. We suggest you use slf4j directly.
  • SparkContext.metricsSystem
  • Block-oriented integration with Tachyon (subsumed by file system integration)
  • Methods deprecated in Spark 1.x
  • Methods on Python DataFrame that returned RDDs (map, flatMap, mapPartitions, etc). They are still available in dataframe.rdd field, e.g. dataframe.rdd.map.
  • Less frequently used streaming connectors, including Twitter, Akka, MQTT, ZeroMQ
  • Hash-based shuffle manager
  • History serving functionality from standalone Master
  • For Java and Scala, DataFrame no longer exists as a class. As a result, data sources would need to be updated.

Behavior Changes

The following changes might require updating existing applications that depend on the old behavior or API.

  • The default build is now using Scala 2.11 rather than Scala 2.10.
  • In SQL, floating literals are now parsed as decimal data type rather than double data type.
  • Kryo version is bumped to 3.0.
  • Java RDD’s flatMap and mapPartitions functions used to require functions returning Java Iterable. They have been updated to require functions returning Java iterator so the functions do not need to materialize all the data.
  • Java RDD’s countByKey and countAprroxDistinctByKey now returns a map from K to java.lang.Long, rather than to java.lang.Object.
  • When writing Parquet files, the summary files are not written by default. To re-enable it, users must set “parquet.enable.summary-metadata” to true.
  • The DataFrame-based API (spark.ml) now depends upon local linear algebra in spark.ml.linalg, rather than in spark.mllib.linalg. This removes the last dependencies of spark.ml.* on spark.mllib.*. (SPARK-13944) See the MLlib migration guide for a full list of API changes.

For a more complete list, please see SPARK-11806 for deprecations and removals.

過時的

下面的特性在Spark2.0中過時了, 可能在未來的Spark 2.x版本中移除:

  • 對Mesos的Fine-grained模式的支援
  • 對Java7的支援
  • 對Python 2.6的支援

Known Issues

  • Lead and Lag’s behaviors have been changed to ignoring nulls from respecting nulls (1.6’s behaviors). In 2.0.1, the behavioral changes will be fixed in 2.0.1 (SPARK-16721).
  • Lead and Lag functions using constant input values does not return the default value when the offset row does not exist (SPARK-16633).

工作人員

譯者注: 雖不認識他們,不知道他們是誰,但是感謝他們的辛勤付出,為開源社群提供了這麼好的分散式框架,請我們瞄一下他們的名字以示尊重。

Last but not least, this release would not have been possible without the following contributors: Aaron Tokhy, Abhinav Gupta, Abou Haydar Elias, Adam Budde, Adam Roberts, Ahmed Kamal, Ahmed Mahran, Alex Bozarth, Alexander Ulanov, Allen, Anatoliy Plastinin, Andrew, Andrew Ash, Andrew Or, Andrew Ray, Anthony Truchet, Antonio Murgia, Arun Allamsetty, Azeem Jiva, Ben McCann, BenFradet, Bertrand Bossy, Bill Chambers, Bjorn Jonsson, Bo Meng, Brandon Bradley, Brian O’Neill, BrianLondon, Bryan Cutler, Burak Köse, Burak Yavuz, Carson Wang, Cazen, Charles Allen, Cheng Hao, Cheng Lian, Claes Redestad, CodingCat, DB Tsai, DLucky, Daniel Jalova, Daoyuan Wang, Darek Blasiak, David Tolpin, Davies Liu, Devaraj K, Dhruve Ashar, Dilip Biswal, Dmitry Erastov, Dominik Jastrzębski, Dongjoon Hyun, Earthson Lu, Egor Pakhomov, Ehsan M.Kermani, Ergin Seyfe, Eric Liang, Ernest, Felix Cheung, Feynman Liang, Fokko Driesprong, Franklyn D’souza, François Garillot, Gabriele Nizzoli, Gary King, GayathriMurali, Gio Borje, Grace, Grzegorz Chilkiewicz, Guillaume Poulin, Gábor Lipták, Hemant Bhanawat, Herman van Hovell, Herman van Hövell tot Westerflier, Hiroshi Inoue, Holden Karau, Hossein, Huaxin Gao, Imran Rashid, Imran Younus, Ioana Delaney, Iulian Dragos, Jacek Laskowski, Jacek Lewandowski, Jakob Odersky, James Lohse, James Thomas, Jason Lee, Jason Moore, Jason White, Jean-Baptiste Onofré, Jeff L, Jeff Zhang, Jeremy Derr, JeremyNixon, Jo Voordeckers, Joan, Jon Maurer, Joseph K. Bradley, Josh Howes, Josh Rosen, Joshi, Juarez Bochi, Julien Baley, Junyang, Junyang Qian, Jurriaan Pruis, Kai Jiang, KaiXinXiaoLei, Kay Ousterhout, Kazuaki Ishizaki, Kevin Yu, Koert Kuipers, Kousuke Saruta, Koyo Yoshida, Krishna Kalyan, Lewuathe, Liang-Chi Hsieh, Lianhui Wang, Lin Zhao, Lining Sun, Liu Xiang, Liwei Lin, Luc Bourlier, Luciano Resende, Lukasz, Maciej Brynski, Malte, Marcelo Vanzin, Marcin Tustin, Mark Grover, Martin Menestret, Masayoshi TSUZUKI, Matei Zaharia, Matthew Wise, Michael Allman, Michael Armbrust, Michael Gummelt, Michel Lemay, Mike Dusenberry, Mortada Mehyar, Nakul Jindal, Nam Pham, Narine Kokhlikyan, NarineK, Neelesh Srinivas Salian, Nezih Yigitbasi, Nicholas Chammas, Nicholas Tietz, Nick Pentreath, Nilanjan Raychaudhuri, Nirman Narang, Nishkam Ravi, Nong, Nong Li, Oleg Danilov, Oliver Pierson, Oscar D. Lara Yejas, Parth Brahmbhatt, Patrick Wendell, Pete Robbins, Peter Ableda, Prajwal Tuladhar, Prashant Sharma, Pravin Gadakh, QiangCai, Qifan Pu, Raafat Akkad, Rahul Tanwani, Rajesh Balamohan, Rekha Joshi, Reynold Xin, Richard W. Eggert II, Robert Dodier, Robert Kruszewski, Robin East, Ruifeng Zheng, Ryan Blue, Sameer Agarwal, Sandeep Singh, Sanket, Sasaki Toru, Sean Owen, Sean Zhong, Sebastien Rainville, Sebastián Ramírez, Sela, Sergiusz Urbaniak, Shally Sangal, Sheamus K. Parkes, Shivaram Venkataraman, Shixiong Zhu, Shuai Lin, Shubhanshu Mishra, Sital Kedia, Stavros Kontopoulos, Stephan Kessler, Steve Loughran, Subhobrata Dey, Subroto Sanyal, Sumedh Mungee, Sun Rui, Sunitha Kambhampati, Takahashi Hiroshi, Takeshi YAMAMURO, Takuya Kuwahara, Takuya UESHIN, Tathagata Das, Tejas Patil, Terence Yim, Thomas Graves, Timothy Chen, Timothy Hunter, Tom Graves, Tom Magrino, Tommy YU, Travis Crawford, Tristan Reid, Victor Chima, Villu Ruusmann, Wayne Song, WeichenXu, Weiqing Yang, Wenchen Fan, Wesley Tang, Wilson Wu, Wojciech Jurczyk, Xiangrui Meng, Xin Ren, Xin Wu, Xinh Huynh, Xiu Guo, Xusen Yin, Yadong Qi, Yanbo Liang, Yash Datta, Yin Huai, Yonathan Randolph, Yong Gang Cao, Yong Tang, Yu ISHIKAWA, Yucai Yu, Yuhao Yang, Yury Liavitski, Zhang, Liye, Zheng RuiFeng, Zheng Tan, aokolnychyi, bomeng, catapan, cody koeninger, dding3, depend, echo2mei, felixcheung, frreiss, fwang1, gatorsmile, guoxu1231, huangzhaowei, hushan, hyukjinkwon, jayadevanmurali, jeanlyn, jerryshao, jliwork, junhao, kaklakariada, krishnakalyan3, lfzCarlosC, lgieron, mark800, mathieu longtin, mcheah, meiyoula, movelikeriver, mwws, nfraison, oraviv, peng.zhang, petermaxlee, pierre-borckmans, poolis, prabs, proflin, pshearer, rotems, sachin aggarwal, sandy, scwf, seddonm1, sethah, sharkd, shijinkui, sureshthalamati, tedyu, thomastechs, tmnd1991, vijaykiran, wangfei, wangyang, [email protected], wujian, xin Wu, yzhou2001, zero323, zhonghaihua, zhuol, zlpmichelle, Örjan Lundberg, Yang Bo.
Spark新文件