Converting an RDD to a DataFrame: Worked Examples

阿新 • Published: 2019-02-15

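Spark SQL supports two ways of converting an RDD to a DataFrame.

The first uses reflection to infer the schema (metadata) of an RDD that holds objects of a specific type. The reflection-based approach keeps the code concise, and it is a good choice when you already know the schema of your RDD while writing the program.

The second builds the DataFrame through a programmatic interface: you construct a schema dynamically at runtime and then apply it to an existing RDD. This takes more code, but it is the only option when the schema is unknown at the time the program is written and only becomes available at runtime.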

The file students.txt contains the following:

1,leo,17
2,marry,17
3,jack,18
4,tom,19

1. Converting an RDD to a DataFrame using reflection

Java code:

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class RDD2DataFrameReflection {
	public static void main(String[] args) {
		SparkConf conf = new SparkConf()
					.setMaster("local")
					.setAppName("RDD2DataFrameReflection");
		JavaSparkContext sc = new JavaSparkContext(conf);
		SQLContext sqlContext = new SQLContext(sc);
		
		// Parse each line of the text file into a Student object
		JavaRDD<String> lines = sc.textFile("./data/students.txt");
		JavaRDD<Student> students = lines.map(new Function<String, Student>() {

			@Override
			public Student call(String line) throws Exception {
				String[] lineSplited = line.split(",");
				Student stu = new Student();
				stu.setId(Integer.valueOf(lineSplited[0].trim()));
				stu.setName(lineSplited[1].trim());
				stu.setAge(Integer.valueOf(lineSplited[2].trim()));
				
				return stu;
			}
		});
		
		// Convert the RDD to a DataFrame using reflection.
		// Passing Student.class lets Spark SQL reflect on the class
		// to discover its fields and build the schema from them.
		// The JavaBean must implement Serializable.
		DataFrame studentDF = sqlContext.createDataFrame(students, Student.class);
	
		// Once we have a DataFrame, register it as a temporary table so that SQL can be run against its data
		studentDF.registerTempTable("students");
		// Query the students temporary table for students aged 18 or younger (the teenagers)
		DataFrame teenagerDF = sqlContext.sql("select * from students where age<=18");
	
		// Convert the resulting DataFrame back to an RDD
		JavaRDD<Row> teenagerRDD = teenagerDF.javaRDD();
		
		// Map each Row in the RDD back to a Student
		JavaRDD<Student> teenagerStudentRDD = teenagerRDD.map(new Function<Row, Student>() {

			@Override
			public Student call(Row row) throws Exception {
				// The column order in the Row can differ from the field declaration order:
				// here the columns come back as age, id, name
				Student stu = new Student();
				stu.setAge(row.getInt(0));
				stu.setId(row.getInt(1));
				stu.setName(row.getString(2));
				
				return stu;
			}
		});
		
		// Collect the data back to the driver and print it
		List<Student> studentList = teenagerStudentRDD.collect();
		for (Student stu : studentList) {
			System.out.println(stu);
		}
	}
}
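
The Java listing relies on a Student JavaBean that is not shown. A minimal sketch of what such a class might look like follows, assuming only the three properties used by the listing (id, name, age); the serialVersionUID value and the toString format are arbitrary choices made for this sketch, not taken from the original.

```java
import java.io.Serializable;

// Hypothetical Student bean for the reflection example. The property names
// match the setters called in the listing (setId, setName, setAge).
public class Student implements Serializable {

    private static final long serialVersionUID = 1L;

    private int id;
    private String name;
    private int age;

    // Public no-argument constructor, as expected of a JavaBean.
    public Student() {
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public int getAge() {
        return age;
    }

    public void setAge(int age) {
        this.age = age;
    }

    // Readable output for the System.out.println(stu) call in the listing.
    @Override
    public String toString() {
        return "Student [id=" + id + ", name=" + name + ", age=" + age + "]";
    }
}
```

Spark SQL derives the columns from the bean's getters, so each column needs a getter, and the class is typically declared public so that reflection can reach it.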

Scala code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object RDD2DataFrameReflection extends App {

  val conf = new SparkConf()
      .setAppName("RDD2DataFrameReflection")
      .setMaster("local")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  // In Scala, the reflection-based RDD-to-DataFrame conversion requires
  // explicitly importing the implicit conversions of the SQLContext
  import sqlContext.implicits._

  case class Student(id: Int, name: String, age: Int)

  // This is just an ordinary RDD whose elements are case class instances;
  // calling toDF() on it converts it to a DataFrame
  val studentDF = sc.textFile("./data/students.txt", 1)
      .map { line => line.split(",") }
      .map { arr => Student(arr(0).trim.toInt, arr(1).trim, arr(2).trim.toInt) }
      .toDF()

  studentDF.registerTempTable("students")
  val teenagerDF = sqlContext.sql("select * from students where age<=18")

  val teenagerRDD = teenagerDF.rdd

  // Option 1: read the columns of each Row by position
  teenagerRDD.map { row => Student(row(0).toString.toInt, row(1).toString, row(2).toString.toInt) }
      .collect()
      .foreach { stu => println(stu.id + ":" + stu.name + ":" + stu.age) }

  // Option 2: use the Row's getAs() method to read a column by name
  teenagerRDD.map { row => Student(row.getAs[Int]("id"), row.getAs[String]("name"), row.getAs[Int]("age")) }
      .collect()
      .foreach { stu => println(stu.id + ":" + stu.name + ":" + stu.age) }

  // Option 3: use the Row's getValuesMap() method to read several columns at once; it returns a Map
  val studentRDD = teenagerRDD.map { row =>
    val map = row.getValuesMap[Any](Seq("id", "name", "age"))
    Student(map("id").toString.toInt, map("name").toString, map("age").toString.toInt)
  }
  studentRDD.collect().foreach { stu => println(stu.id + ":" + stu.name + ":" + stu.age) }

}

2. Converting an RDD to a DataFrame by specifying the schema programmatically

Java code:

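import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class RDD2DataFrameProgrammatically {
	
	public static void main(String[] args) {
		// Create the SparkConf, JavaSparkContext and SQLContext
		SparkConf conf = new SparkConf()
					.setMaster("local")
					.setAppName("RDD2DataFrameProgrammatically");
		
		JavaSparkContext sc = new JavaSparkContext(conf);
		SQLContext sqlContext = new SQLContext(sc);
		
		// Step 1: create an ordinary RDD, then convert it to an RDD<Row>
		JavaRDD<String> lines = sc.textFile("./data/students.txt");
		
		JavaRDD<Row> studentRows = lines.map(new Function<String, Row>() {

			@Override
			public Row call(String line) throws Exception {
				String[] lineSplited = line.split(",");
				// The value types must match the schema built in step 2
				return RowFactory.create(
						Integer.valueOf(lineSplited[0].trim()),
						lineSplited[1].trim(),
						Integer.valueOf(lineSplited[2].trim()));
			}
		});
		
		// Step 2: build the schema dynamically.
		// The names and types of fields such as id and name may be loaded at runtime
		// from a MySQL database or a configuration file, so they are not fixed in advance.
		// The programmatic approach is well suited to building the schema in that case.
		List<StructField> structFields = new ArrayList<StructField>();
		structFields.add(DataTypes.createStructField("id", DataTypes.IntegerType, true));
		structFields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
		structFields.add(DataTypes.createStructField("age", DataTypes.IntegerType, true));
		
		StructType structType = DataTypes.createStructType(structFields);
		
		// Step 3: apply the dynamically built schema to convert the RDD to a DataFrame
		DataFrame studentDF = sqlContext.createDataFrame(studentRows, structType);
		
		// From here on, the DataFrame can be used as usual
		studentDF.registerTempTable("students");
		
		// Same query as in the reflection example: students aged 18 or younger
		DataFrame teenagerDF = sqlContext.sql("select * from students where age<=18");
		
		List<Row> rows = teenagerDF.javaRDD().collect();
		for (Row row : rows) {
			System.out.println(row);
		}
	}
}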
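
The comments in step 2 mention that field names and types may come from a database or a configuration file at runtime. As a rough, Spark-free sketch of that idea, the class below parses a schema description and converts a text line into typed values. The spec format ("id:int,name:string,age:int"), the class name SchemaSpec and its methods are all hypothetical and only illustrate the approach. In the Spark program, each parsed entry would be turned into a DataTypes.createStructField(...) call and each converted array passed to RowFactory.create(...).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: reads a schema description that is only known at runtime.
public class SchemaSpec {

    // Parses a spec such as "id:int,name:string,age:int" into an ordered map
    // of field name -> type name. The spec format is an assumption of this
    // sketch, not something defined by Spark.
    public static Map<String, String> parse(String spec) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String part : spec.split(",")) {
            String[] nameAndType = part.trim().split(":");
            if (nameAndType.length != 2) {
                throw new IllegalArgumentException("Bad field spec: " + part);
            }
            fields.put(nameAndType[0].trim(), nameAndType[1].trim().toLowerCase());
        }
        return fields;
    }

    // Converts one comma-separated line into typed values, in field order.
    // Only the two types used by students.txt are handled here.
    public static Object[] convert(String line, Map<String, String> fields) {
        String[] raw = line.split(",");
        if (raw.length != fields.size()) {
            throw new IllegalArgumentException("Expected " + fields.size() + " columns: " + line);
        }
        Object[] values = new Object[raw.length];
        int i = 0;
        for (String type : fields.values()) {
            String text = raw[i].trim();
            switch (type) {
                case "int":
                    values[i] = Integer.valueOf(text);
                    break;
                case "string":
                    values[i] = text;
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported type: " + type);
            }
            i++;
        }
        return values;
    }

    public static void main(String[] args) {
        // In a real job the spec string would be loaded from configuration.
        Map<String, String> fields = parse("id:int,name:string,age:int");
        Object[] values = convert("1,leo,17", fields);
        System.out.println(fields);
        System.out.println(Arrays.toString(values));
    }
}
```

Keeping the field order in a LinkedHashMap matters here, because the values in each Row must line up with the order of the fields in the schema.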

Scala code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object RDD2DataFrameProgrammatically extends App {

  val conf = new SparkConf()
        .setMaster("local")
        .setAppName("RDD2DataFrameProgrammatically")

  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  // Step 1: build an ordinary RDD whose elements are Rows (each line is split only once)
  val studentRDD = sc.textFile("./data/students.txt", 1)
        .map { line => line.split(",") }
        .map { arr => Row(arr(0).trim.toInt, arr(1).trim, arr(2).trim.toInt) }

  // Step 2: build the schema programmatically
  val structType = StructType(Array(
      StructField("id", IntegerType, true),
      StructField("name", StringType, true),
      StructField("age", IntegerType, true)))

  // Step 3: convert the RDD to a DataFrame
  val studentDF = sqlContext.createDataFrame(studentRDD, structType)

  // From here on, use the DataFrame as usual
  studentDF.registerTempTable("students")

  val teenagerDF = sqlContext.sql("select * from students where age<=18")

  // Collect the result to the driver and print each Row
  teenagerDF.rdd.collect().foreach { row => println(row) }
}
