
Introducing Apache Spark Datasets (Bilingual)

Article Title

Introducing Apache Spark Datasets

About the Author

Article Body

Developers have always loved Apache Spark for providing APIs that are simple yet powerful, a combination of traits that makes complex analysis possible with minimal programmer effort. At Databricks, we have continued to push Spark’s usability and performance envelope through the introduction of DataFrames and Spark SQL. These are high-level APIs for working with structured data (e.g. database tables, JSON files), which let Spark automatically optimize both storage and computation. Behind these APIs, the Catalyst optimizer and Tungsten execution engine optimize applications in ways that were not possible with Spark’s object-oriented (RDD) API, such as operating on data in a raw binary form.

Today we’re excited to announce Spark Datasets, an extension of the DataFrame API that provides a type-safe, object-oriented programming interface. Spark 1.6 includes an API preview of Datasets, and they will be a development focus for the next several versions of Spark. Like DataFrames, Datasets take advantage of Spark’s Catalyst optimizer by exposing expressions and data fields to a query planner. Datasets also leverage Tungsten’s fast in-memory encoding. Datasets extend these benefits with compile-time type safety – meaning production applications can be checked for errors before they are run. They also allow direct operations over user-defined classes.

In the long run, we expect Datasets to become a powerful way to write more efficient Spark applications. We have designed them to work alongside the existing RDD API, but improve efficiency when data can be represented in a structured form.  Spark 1.6 offers the first glimpse at Datasets, and we expect to improve them in future releases.

1、Working with Datasets

A Dataset is a strongly-typed, immutable collection of objects that are mapped to a relational schema.  At the core of the Dataset API is a new concept called an encoder, which is responsible for converting between JVM objects and tabular representation. The tabular representation is stored using Spark’s internal Tungsten binary format, allowing for operations on serialized data and improved memory utilization.  Spark 1.6 comes with support for automatically generating encoders for a wide variety of types, including primitive types (e.g. String, Integer, Long), Scala case classes, and Java Beans.
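
For instance, once the implicit conversions from a SQLContext are imported, a Scala case class gets its encoder generated automatically. The following is a minimal sketch rather than code from the original post (Person is a hypothetical class introduced only for illustration; sqlContext is the same SQLContext used in the examples below):

// Hypothetical case class used only for illustration.
case class Person(name: String, age: Long)

// Importing the SQLContext implicits brings the automatically generated
// encoders for primitives and case classes into scope.
import sqlContext.implicits._

// Create a Dataset[Person] from a local collection; the encoder converts each
// Person into Spark's internal Tungsten binary representation.
val people = Seq(Person("Alice", 30L), Person("Bob", 25L)).toDS()

// The encoded fields show up as ordinary columns.
people.toDF().printSchema()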

Users of RDDs will find the Dataset API quite familiar, as it provides many of the same functional transformations (e.g. map, flatMap, filter).  Consider the following code, which reads lines of a text file and splits them into words:

// RDDs
val lines = sc.textFile("/wikipedia")
val words = lines
  .flatMap(_.split(" "))
  .filter(_ != "")

// Datasets
val lines = sqlContext.read.text("/wikipedia").as[String]
val words = lines
  .flatMap(_.split(" "))
  .filter(_ != "")

Both APIs make it easy to express the transformation using lambda functions. The compiler and your IDE understand the types being used, and can provide helpful tips and error messages while you construct your data pipeline.

While this high-level code may look similar syntactically, with Datasets you also have access to all the power of a full relational execution engine. For example, if you now want to perform an aggregation (such as counting the number of occurrences of each word), that operation can be expressed simply and efficiently as follows:

// RDDs
val counts = words
    .groupBy(_.toLowerCase)
    .map(w => (w._1, w._2.size))

// Datasets
val counts = words
    .groupBy(_.toLowerCase)
    .count()

Since the Dataset version of word count can take advantage of the built-in aggregate count, this computation can not only be expressed with less code, but it will also execute significantly faster.  As you can see in the graph below, the Dataset implementation runs much faster than the naive RDD implementation.  In contrast, getting the same performance using RDDs would require users to manually consider how to express the computation in a way that parallelizes optimally.

Another benefit of this new Dataset API is the reduction in memory usage. Since Spark understands the structure of data in Datasets, it can create a more optimal layout in memory when caching Datasets. In the following example, we compare caching several million strings in memory using Datasets as opposed to RDDs. In both cases, caching data can lead to significant performance improvements for subsequent queries.  However, since Dataset encoders provide more information to Spark about the data being stored, the cached representation can be optimized to use 4.5x less space.
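
A rough sketch of this kind of comparison (with assumed variable names, wordsRdd and wordsDs, rather than the exact benchmark code) could look like the following; the data is identical, but the cached Dataset is stored in the compact Tungsten layout:

// Assumed names: wordsRdd is an RDD[String] and wordsDs is a Dataset[String]
// holding the same strings.
val cachedRdd = wordsRdd.cache()   // cached as deserialized Java String objects
val cachedDs  = wordsDs.cache()    // cached in Spark's Tungsten binary format

// Force both caches to materialize, then compare their sizes under the
// "Storage" tab of the Spark UI.
cachedRdd.count()
cachedDs.count()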

To help you get started, we’ve put together some example notebooks: Working with Classes, Word Count.

2、Lightning-fast Serialization with Encoders

Encoders are highly optimized and use runtime code generation to build custom bytecode for serialization and deserialization.  As a result, they can operate significantly faster than Java or Kryo serialization.

In addition to speed, the resulting serialized size of encoded data can also be significantly smaller (up to 2x), reducing the cost of network transfers.  Furthermore, the serialized data is already in the Tungsten binary format, which means that many operations can be done in-place, without needing to materialize an object at all.  Spark has built-in support for automatically generating encoders for primitive types (e.g. String, Integer, Long), Scala case classes, and Java Beans.  We plan to open up this functionality and allow efficient serialization of custom types in a future release.
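
In Scala the right encoder is usually resolved implicitly, while the org.apache.spark.sql.Encoders factory (also used in the Java example later in this post) exposes the built-in encoders explicitly. A small illustrative sketch, not taken from the original post:

import org.apache.spark.sql.{Encoder, Encoders}

// Built-in encoders can be referenced explicitly when needed...
val stringEncoder: Encoder[String] = Encoders.STRING

// ...but in Scala they are normally supplied implicitly via the SQLContext.
import sqlContext.implicits._
val tokens = Seq("spark", "datasets", "encoders").toDS()   // uses the implicit String encoder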

3、Seamless Support for Semi-Structured Data

The power of encoders goes beyond performance.  They also serve as a powerful bridge between semi-structured formats (e.g. JSON) and type-safe languages like Java and Scala.

For example, consider the following dataset about universities:

{"name": "UC Berkeley", "yearFounded": 1868, numStudents: 37581}

{"name": "MIT", "yearFounded": 1860, numStudents: 11318}

…

Instead of manually extracting fields and casting them to the desired type, you can simply define a class with the expected structure and map the input data to it.  Columns are automatically lined up by name, and the types are preserved.

case class University(name: String, numStudents: Long, yearFounded: Long)

val schools = sqlContext.read.json("/schools.json").as[University]

schools.map(s => s"${s.name} is ${2015 - s.yearFounded} years old")

Encoders eagerly check that your data matches the expected schema, providing helpful error messages before you attempt to incorrectly process TBs of data. For example, if we try to use a datatype that is too small, such that conversion to an object would result in truncation (i.e. numStudents is larger than a byte, which holds a maximum value of 127), the Analyzer will emit an AnalysisException.

case class University(numStudents: Byte)

val schools = sqlContext.read.json("/schools.json").as[University]

org.apache.spark.sql.AnalysisException: Cannot upcast `yearFounded` from bigint to smallint as it may truncate

When performing the mapping, encoders will automatically handle complex types, including nested classes, arrays, and maps.
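
For example, a class whose fields include another case class, a sequence and a map can still be encoded without any manual conversion. A sketch with hypothetical Address and Employee classes (not part of the original examples):

// Hypothetical classes used only for illustration.
case class Address(city: String, country: String)
case class Employee(
    name: String,
    address: Address,            // nested class
    skills: Seq[String],         // array
    scores: Map[String, Long])   // map

import sqlContext.implicits._

// The generated encoder recursively handles the nested case class, the Seq
// and the Map when converting to and from the Tungsten representation.
val employees = Seq(
  Employee("Ada", Address("London", "UK"), Seq("scala", "sql"), Map("2015" -> 98L))
).toDS()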

4、A Single API for Java and Scala

Another goal of the Dataset API is to provide a single interface that is usable in both Scala and Java. This unification is great news for Java users as it ensures that their APIs won’t lag behind the Scala interfaces, code examples can easily be used from either language, and libraries no longer have to deal with two slightly different types of input. The only difference for Java users is that they need to specify the encoder to use, since the compiler does not provide type information. For example, if you wanted to process JSON data using Java, you could do it as follows:

public class University implements Serializable {
    private String name;
    private long numStudents;
    private long yearFounded;

    public void setName(String name) {...}
    public String getName() {...}
    public void setNumStudents(long numStudents) {...}
    public long getNumStudents() {...}
    public void setYearFounded(long yearFounded) {...}
    public long getYearFounded() {...}
}

class BuildString implements MapFunction<University, String> {
    public String call(University u) throws Exception {
        return u.getName() + " is " + (2015 - u.getYearFounded()) + " years old.";
    }
}

Dataset<University> schools = context.read().json("/schools.json").as(Encoders.bean(University.class));
Dataset<String> strings = schools.map(new BuildString(), Encoders.STRING());

5、Looking Forward

While Datasets are a new API, we have made them interoperate easily with RDDs and existing Spark programs. Simply calling the rdd() method on a Dataset will give an RDD. In the long run, we hope that Datasets can become a common way to work with structured data, and we may converge the APIs even further.
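
A minimal sketch of that round trip (assuming the same implicit encoders shown earlier), reusing the schools Dataset from the example above:

import org.apache.spark.rdd.RDD

// From a Dataset back to an RDD of the same object type.
val schoolsRdd: RDD[University] = schools.rdd

// And, with the SQLContext implicits in scope, from an existing RDD of case
// class instances back to a Dataset.
import sqlContext.implicits._
val schoolsDs = schoolsRdd.toDS()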

As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically:

  • Performance optimizations – In many cases, the current implementation of the Dataset API does not yet leverage the additional information it has and can be slower than RDDs. Over the next several releases, we will be working on improving the performance of this new API.
  • Custom encoders – while we currently autogenerate encoders for a wide variety of types, we’d like to open up an API for custom objects.
  • Python Support.
  • Unification of DataFrames with Datasets – due to compatibility guarantees, DataFrames and Datasets currently cannot share a common parent class. With Spark 2.0, we will be able to unify these abstractions with minor changes to the API, making it easy to build libraries that work with both.

If you’d like to try out Datasets yourself, they are already available in Databricks.  We’ve put together a few example notebooks for you to try out: Working with Classes, Word Count.

Spark 1.6 is available on Databricks today; sign up for a free 14-day trial.

References

  • https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html
  • https://blog.csdn.net/sunbow0/article/details/50723233
