
Winning a Kaggle competition with Apache Spark and SparkML Machine Learning Pipelines

IBM Chief Data Scientist Romeo Kienzler demonstrates how to use the new DataFrames-based SparkML pipelines (with data from a recent Kaggle competition on production line performance) to code a machine learning workflow from scratch. Romeo starts by showing you how to ingest the Kaggle data and then performs the ETL (extract, transform, load) process, using the Apache Parquet format and OpenStack Swift to store the data in ObjectStore.
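A minimal sketch of that ingest-and-persist step, assuming an existing SparkSession named spark and a configured Hadoop Swift/Stocator connector; the swift:// URL, container, service name, and file names below are placeholders, not the exact paths from the session:

```scala
// Read the raw Kaggle CSV from ObjectStore, letting Spark infer the schema.
// "kaggle" (container) and "keystone" (service) are hypothetical names.
val raw = spark.read
  .option("header", "true")       // first line holds the column names
  .option("inferSchema", "true")  // derive column types from the data
  .csv("swift://kaggle.keystone/train_numeric.csv")

// Persist the data back to ObjectStore in the columnar Parquet format.
raw.write
  .mode("overwrite")
  .parquet("swift://kaggle.keystone/train_numeric.parquet")
```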

He uses common pre-processing techniques such as one-hot encoding and string indexing to demonstrate how to create the Spark ML pipeline. Finally, Romeo feeds the data into a Random Forest algorithm and illustrates how to evaluate the results.
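A hedged sketch of such a pipeline, assuming a DataFrame df with a hypothetical categorical column station, two illustrative numeric columns, and a binary label column (these are not the actual competition feature names):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// Map string categories to numeric indices, then one-hot encode them.
val indexer = new StringIndexer()
  .setInputCol("station").setOutputCol("stationIndex")
val encoder = new OneHotEncoder()
  .setInputCol("stationIndex").setOutputCol("stationVec")

// Combine the encoded and numeric columns into a single feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("stationVec", "measurement1", "measurement2"))
  .setOutputCol("features")

val rf = new RandomForestClassifier()
  .setLabelCol("label").setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, rf))

val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42)
val model = pipeline.fit(train)

// Evaluate on held-out data; area under ROC is one reasonable
// choice of metric for a binary label.
val auc = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .evaluate(model.transform(test))
```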

After the session, you will come away with a template you can use for your own data science projects. The session is performed using the IBM Data Science Experience, so you can sign up and immediately replicate the example.

The Apache Spark open-source cluster-computing framework sports Spark ML, a package introduced in Spark 1.2 which provides a uniform set of high-level APIs that help developers create and tune practical machine learning pipelines. Spark ML represents a common machine learning workflow as a pipeline, a sequence of stages in which each stage is either a transformer or an estimator.

A simple text document processing workflow typically follows these stages, in order (a minimal pipeline sketch follows the list):

  1. Split the document text into words
  2. Convert the words into a numerical feature vector
  3. Develop a prediction model using the feature vectors and labels
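As a sketch, these three stages map directly onto Spark ML stages, following the standard pattern from the Spark documentation; a training DataFrame trainingDF with text and label columns is assumed:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Stage 1: split each document's text into words.
val tokenizer = new Tokenizer()
  .setInputCol("text").setOutputCol("words")

// Stage 2: hash the words into a numerical feature vector.
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("features")

// Stage 3: fit a prediction model on the feature vectors and labels.
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(trainingDF) // the fitted pipeline is itself a transformer
```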

A transformer is an abstraction that covers both feature transformers and learned models; it implements a transform() method which converts one DataFrame into another. An estimator abstracts the concept of a learning algorithm, or any algorithm that fits or trains on data; it implements a fit() method that accepts a DataFrame and produces a transformer.
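StringIndexer illustrates the split: the estimator's fit() learns a category-to-index mapping from the data and returns a StringIndexerModel, a transformer whose transform() maps one DataFrame to another. Assuming a DataFrame df with a category column:

```scala
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}

val indexer = new StringIndexer()                // estimator
  .setInputCol("category").setOutputCol("categoryIndex")

val model: StringIndexerModel = indexer.fit(df)  // fit() yields a transformer
val indexed = model.transform(df)                // transform(): DataFrame in, DataFrame out
```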

The Spark DataFrames API, an extension to the RDD API and inspired by data frames in R and Python, was designed to support modern big data and data science applications. It is simply a distributed collection of data organized into named columns that can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.
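For illustration, a few of those construction paths, assuming a SparkSession named spark; all paths and connection details are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sources").getOrCreate()
import spark.implicits._

// From a structured data file (Parquet is Spark's default format).
val fromParquet = spark.read.parquet("/data/readings.parquet")

// From an existing RDD or local collection via a case class.
case class Reading(id: Long, value: Double)
val fromRdd = spark.sparkContext
  .parallelize(Seq(Reading(1L, 0.5), Reading(2L, 1.5)))
  .toDF()

// From a Hive table (requires enableHiveSupport() on the builder):
// val fromHive = spark.table("warehouse.readings")

// From an external database over JDBC (connection details hypothetical):
// val fromJdbc = spark.read.format("jdbc")
//   .option("url", "jdbc:postgresql://host/db")
//   .option("dbtable", "readings")
//   .load()
```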

Romeo Kienzler is a Senior Data Scientist and Deep Learning and AI Engineer for IBM Watson IoT and an IBM Certified Senior Architect who spends much of his waking life helping global clients solve their data analysis challenges. Romeo holds an MSc (ETH) in Computer Science with specialization in information systems, bioinformatics, and applied statistics from the Swiss Federal Institute of Technology. He is an Associate Professor of artificial intelligence, and his current research focus is on cloud-scale machine learning and deep learning using open source technologies including R, Apache Spark, Apache SystemML, Apache Flink, DeepLearning4J, and TensorFlow.
