
Spark in Action (1): Configuring AWS EMR and Zeppelin Notebook

What is the difference between SparkContext and SparkSession, and which should you use?

  • SparkContext:
    • Used before Spark 2.0.0
    • Connects to the cluster through a resource manager such as YARN
    • Requires a SparkConf object to construct the SparkContext
    • Needs a separate context for each of the SQL, Hive, and Streaming APIs (see the sketch after this list)
      val conf = new SparkConf()
        .setAppName("RetailDataAnalysis")
        .setMaster("spark://master:7077")
        .set("spark.executor.memory", "2g")
      val sc = new SparkContext(conf)
  • SparkSession:
    • Introduced in Spark 2.0.0; the recommended entry point
    • Exposes all of Spark's functionality and adds the DataFrame and Dataset APIs
    • No separate contexts are needed for SQL, Hive, or Streaming
    • Config can still be set after the session is initialized
      // Creating a Spark session:
      val spark = SparkSession
        .builder
        .appName("WorldBankIndex")
        .getOrCreate()

      // Configuring properties:
      spark.conf.set("spark.sql.shuffle.partitions", 6)
      spark.conf.set("spark.executor.memory", "2g")
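To make the contrast concrete, below is a minimal PySpark sketch of the pre-2.0 pattern, where each API needed its own context built on top of the SparkContext. The app name mirrors the Scala snippet above; the master URL and batch interval are illustrative assumptions:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext
from pyspark.streaming import StreamingContext

# One SparkContext for core RDD work... (local master is just for the demo)
conf = SparkConf().setAppName("RetailDataAnalysis").setMaster("local[*]")
sc = SparkContext(conf=conf)

# ...and a separate context per API before Spark 2.0.0:
sqlContext = SQLContext(sc)        # SQL
hiveContext = HiveContext(sc)      # Hive
ssc = StreamingContext(sc, 10)     # Streaming, with 10-second batches

With a SparkSession, the same capabilities are reached through spark.sql(...), built-in Hive support, and Structured Streaming, all hanging off the one object.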

Configuring AWS EMR

# 1. Open the AWS console
# 2. Go to the EMR service
# 3. Create cluster
# 4. Go to advanced options
# 5. Release: emr-5.11.1
# 6. Hadoop: 2.7.3
# 7. Zeppelin: 0.7.3
# 8. Spark: 2.2.1
# 9. Choose a spot price to save budget
# 10. Create your key pair, download it, and chmod 400 it
# 11. Add inbound security group rules: port 22 for SSH, port 8890 for Zeppelin
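The same cluster can also be scripted instead of clicked through. Here is a hedged boto3 sketch mirroring the choices above; the region, key-pair name, instance types, instance counts, and bid price are assumptions, not values from the walkthrough:

import boto3

emr = boto3.client("emr", region_name="us-east-1")
response = emr.run_job_flow(
    Name="spark-zeppelin-demo",
    ReleaseLabel="emr-5.11.1",    # matches step 5
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Zeppelin"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m4.large", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE", "Market": "SPOT",
             "BidPrice": "0.10",   # spot bidding, as in step 9
             "InstanceType": "m4.large", "InstanceCount": 2},
        ],
        "Ec2KeyName": "my-key-pair",    # the key pair from step 10
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])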

Creating a Zeppelin Notebook

# 1. Access the master node's public DNS on port 8890
# 2. Create a new note
# 3. Default interpreter: spark
%pyspark # 4. start the paragraph with %pyspark to select the PySpark interpreter
# after selecting the interpreter, you can run Python code in Zeppelin
for i in [1, 2, 3]:
    print(i)

# the SparkContext is already set up for you as sc
sc

# the SparkSession is already set up for you as spark
spark
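
A minimal sketch of exercising the two preconfigured handles in a %pyspark paragraph (the numbers are demo data):

rdd = sc.parallelize([1, 2, 3])
print(rdd.sum())         # -> 6, through the SparkContext

spark.range(5).show()    # small DataFrame through the SparkSession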

# read a file from AWS S3 (access keys embedded in the URL; a safer variant follows)
df = spark.read.csv("s3n://MyaccessKey:[email protected]/file.csv")
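
On EMR, the cluster's EC2 instance profile normally grants S3 access, so a read through EMRFS needs no credentials in the URL. This is a sketch with a placeholder bucket and path, assuming the instance role has read access to them:

df = spark.read.csv("s3://my-bucket/path/file.csv", header=True, inferSchema=True)
df.printSchema()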