
Winning a Kaggle competition with Apache Spark and SparkML Machine Learning Pipelines

IBM Chief Data Scientist Romeo Kienzler demonstrates how to use the new DataFrames-based SparkML pipelines (with data from a recent Kaggle competition on production line performance) to code a machine learning workflow from scratch. Romeo starts by showing you how to ingest the Kaggle data and then performs the ETL (extract, transform, load) process, using the Apache Parquet format and OpenStack Swift to store the data in ObjectStore.
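A minimal sketch of that ingest-and-persist step, assuming an existing SparkSession named spark and a configured Hadoop Swift/Stocator connector; the swift:// URL, container, service name, and file names below are placeholders, not the exact paths from the session:

```scala
// Read the raw Kaggle CSV from ObjectStore, letting Spark infer the schema.
// "kaggle" (container) and "keystone" (service) are hypothetical names.
val raw = spark.read
  .option("header", "true")       // first line holds the column names
  .option("inferSchema", "true")  // derive column types from the data
  .csv("swift://kaggle.keystone/train_numeric.csv")

// Persist the data back to ObjectStore in the columnar Parquet format.
raw.write
  .mode("overwrite")
  .parquet("swift://kaggle.keystone/train_numeric.parquet")
```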

He uses common pre-processing techniques such as one-hot encoding and string indexing to demonstrate how to create the Spark ML pipeline. Finally, Romeo feeds the data into a Random Forest algorithm and illustrates how to evaluate the results.
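A hedged sketch of such a pipeline, assuming a DataFrame df with a hypothetical categorical column station, two illustrative numeric columns, and a binary label column (these are not the actual competition feature names):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// Map string categories to numeric indices, then one-hot encode them.
val indexer = new StringIndexer()
  .setInputCol("station").setOutputCol("stationIndex")
val encoder = new OneHotEncoder()
  .setInputCol("stationIndex").setOutputCol("stationVec")

// Combine the encoded and numeric columns into a single feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("stationVec", "measurement1", "measurement2"))
  .setOutputCol("features")

val rf = new RandomForestClassifier()
  .setLabelCol("label").setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, rf))

val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42)
val model = pipeline.fit(train)

// Evaluate on held-out data; area under ROC is one reasonable
// choice of metric for a binary label.
val auc = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .evaluate(model.transform(test))
```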

After the session, you will come away with a template you can use for your own data science projects. The session is performed using the IBM Data Science Experience, so you can sign up and immediately replicate the example.

The Apache Spark open-source cluster-computing framework sports Spark ML, a package introduced in Spark 1.2 which provides a uniform set of high-level APIs that help developers create and tune practical machine learning pipelines. Spark ML represents a common machine learning workflow as a pipeline, a sequence of stages in which each stage is either a transformer or an estimator.

A simple text document processing workflow typically follows these stages, in order (a minimal pipeline sketch follows the list):

  1. Split the document text into words
  2. Convert the words into a numerical feature vector
  3. Develop a prediction model using the feature vectors and labels
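As a sketch, these three stages map directly onto Spark ML stages, following the standard pattern from the Spark documentation; a training DataFrame trainingDF with text and label columns is assumed:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Stage 1: split each document's text into words.
val tokenizer = new Tokenizer()
  .setInputCol("text").setOutputCol("words")

// Stage 2: hash the words into a numerical feature vector.
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("features")

// Stage 3: fit a prediction model on the feature vectors and labels.
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(trainingDF) // the fitted pipeline is itself a transformer
```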

A transformer is an abstraction that covers both feature transformers and learned models; it implements a transform() method which converts one DataFrame into another. An estimator abstracts the concept of a learning algorithm, or any algorithm that fits or trains on data; it implements a fit() method that accepts a DataFrame and produces a transformer.
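StringIndexer illustrates the split: the estimator's fit() learns a category-to-index mapping from the data and returns a StringIndexerModel, a transformer whose transform() maps one DataFrame to another. Assuming a DataFrame df with a category column:

```scala
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}

val indexer = new StringIndexer()                // estimator
  .setInputCol("category").setOutputCol("categoryIndex")

val model: StringIndexerModel = indexer.fit(df)  // fit() yields a transformer
val indexed = model.transform(df)                // transform(): DataFrame in, DataFrame out
```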

The Spark DataFrames API, an extension to the RDD API and inspired by data frames in R and Python, was designed to support modern big data and data science applications. It is simply a distributed collection of data organized into named columns that can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.
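For illustration, a few of those construction paths, assuming a SparkSession named spark; all paths and connection details are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sources").getOrCreate()
import spark.implicits._

// From a structured data file (Parquet is Spark's default format).
val fromParquet = spark.read.parquet("/data/readings.parquet")

// From an existing RDD or local collection via a case class.
case class Reading(id: Long, value: Double)
val fromRdd = spark.sparkContext
  .parallelize(Seq(Reading(1L, 0.5), Reading(2L, 1.5)))
  .toDF()

// From a Hive table (requires enableHiveSupport() on the builder):
// val fromHive = spark.table("warehouse.readings")

// From an external database over JDBC (connection details hypothetical):
// val fromJdbc = spark.read.format("jdbc")
//   .option("url", "jdbc:postgresql://host/db")
//   .option("dbtable", "readings")
//   .load()
```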

Romeo Kienzler is a Senior Data Scientist and Deep Learning and AI Engineer for IBM Watson IoT and an IBM Certified Senior Architect who spends much of his waking life helping global clients solve their data analysis challenges. Romeo holds an MSc (ETH) in Computer Science with specialization in information systems, bioinformatics, and applied statistics from the Swiss Federal Institute of Technology. He is an Associate Professor of artificial intelligence, and his current research focus is on cloud-scale machine learning and deep learning using open source technologies including R, Apache Spark, Apache SystemML, Apache Flink, DeepLearning4J, and TensorFlow.
