Spark Study Notes (12: Spark UDF, UDAF & Window Functions)

阿新 • Published: 2019-01-13

1. UDF & UDAF

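import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.expressions.MutableAggregationBuffer;
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class JavaExample {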
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setMaster("local");
        conf.setAppName("udf");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);
        JavaRDD<String> parallelize = sc.parallelize(Arrays.asList("Sam", "Tom", "Jetty", "Tom", "Jetty"));
        JavaRDD<Row> rowRDD = parallelize.map(new Function<String, Row>() {
            public Row call(String s) throws Exception {
                return RowFactory.create(s);
            }
        });

        /**
         * Build the DataFrame from a dynamically created schema.
         */
        List<StructField> fields = new ArrayList<StructField>();
        fields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
        StructType schema = DataTypes.createStructType(fields);

        Dataset<Row> dataFrame = sqlContext.createDataFrame(rowRDD, schema);
        // registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it.
        dataFrame.createOrReplaceTempView("user");

        /**
         * The number of arguments the UDF takes determines which interface to implement: UDF1, UDF2, ...
         * UDF1 takes a single argument, as in StrLen(name).
         */
        sqlContext.udf().register("StrLen", new UDF1<String, Integer>() {
            public Integer call(String s) throws Exception {
                // Guard against null names: the column is declared nullable in the schema.
                return s == null ? null : s.length();
            }
        }, DataTypes.IntegerType);

        sqlContext.sql("select name,StrLen(name) as length from user").show();

        /**
         * Register a UDAF that counts how many times each value occurs.
         */
        sqlContext.udf().register("StringCount", new UserDefinedAggregateFunction() {
            /**
             * Names and types of the input columns.
             */
            @Override
            public StructType inputSchema() {
                return DataTypes.createStructType(
                        Arrays.asList(DataTypes.createStructField("name",
                                DataTypes.StringType, true)));
            }
            /**
             * Type of the result the UDAF returns once aggregation finishes.
             */
            @Override
            public DataType dataType() {
                return DataTypes.IntegerType;
            }
            /**
             * Marks whether the UDAF always produces the same result for the same set of inputs;
             * normally true, for consistency.
             */
            @Override
            public boolean deterministic() {
                return true;
            }
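            /**
             * Called once for every input row in a group, so the per-group logic is applied one value at a time.
             * buffer.getInt(0) is the value aggregated so far; buffer.update(0, ...) stores the new value.
             * This is comparable to a map-side combiner, which pre-aggregates the output of each map task;
             * the final aggregation happens on the reduce side.
             * In short: whenever a new value arrives for a group, fold it into that group's running aggregate.
             */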
            @Override
            public void update(MutableAggregationBuffer buffer, Row input) {
                buffer.update(0,buffer.getInt(0)+1);
            }
            /**
             * Schema of the intermediate buffer that holds the data during aggregation.
             */
            @Override
            public StructType bufferSchema() {
                return DataTypes.createStructType(
                                Arrays.asList(DataTypes.createStructField("bf", DataTypes.IntegerType, true)));
            }
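            /**
             * update may only see part of a group's data, on a single node, because the rows of one group
             * can be spread across several nodes.
             * merge combines the partial counts computed on those nodes.
             * buffer1.getInt(0): the value merged so far
             * buffer2.getInt(0): a partial result produced by update
             * In short: once the distributed nodes finish, a global merge combines their results.
             */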
            @Override
            public void merge(MutableAggregationBuffer buffer1, Row buffer2) {
                buffer1.update(0, buffer1.getInt(0) + buffer2.getInt(0));
            }
            /**
             * Sets the initial value of the buffer for each group before aggregation starts.
             */
            @Override
            public void initialize(MutableAggregationBuffer buffer) {
                buffer.update(0, 0);
            }
            /**
             * Returns the final result of the UDAF; its type must match dataType().
             */
            @Override
            public Object evaluate(Row buffer) {
                return buffer.getInt(0);
            }
        });

        sqlContext.sql("select name ,StringCount(name) as number from user group by name").show();

        sc.stop();
    }
}
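
The way initialize, update, merge and evaluate cooperate can be hard to see from the callbacks alone. The following is a minimal plain-Java sketch of that lifecycle, not Spark code: the class name UdafLifecycleSketch, its helper methods, and the split of the five names into two "partitions" are all assumptions made for illustration, and Spark's real execution plan may order the calls differently.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Spark-free illustration of the StringCount lifecycle.
class UdafLifecycleSketch {

    // initialize: every group starts from a zeroed buffer.
    static int initialize() {
        return 0;
    }

    // update: called once per input row of a group inside one partition.
    static int update(int buffer, String input) {
        return buffer + 1;
    }

    // merge: combines two partial buffers that belong to the same group.
    static int merge(int buffer1, int buffer2) {
        return buffer1 + buffer2;
    }

    // evaluate: converts the final buffer into the returned value.
    static int evaluate(int buffer) {
        return buffer;
    }

    // Partial aggregation of one partition: group key -> partial count.
    static Map<String, Integer> aggregatePartition(List<String> partition) {
        Map<String, Integer> buffers = new LinkedHashMap<>();
        for (String name : partition) {
            int buffer = buffers.getOrDefault(name, initialize());
            buffers.put(name, update(buffer, name));
        }
        return buffers;
    }

    // Global aggregation: merge the partial buffers, then evaluate each group.
    static Map<String, Integer> stringCount(List<List<String>> partitions) {
        Map<String, Integer> merged = new LinkedHashMap<>();
        for (List<String> partition : partitions) {
            for (Map.Entry<String, Integer> partial : aggregatePartition(partition).entrySet()) {
                int current = merged.getOrDefault(partial.getKey(), initialize());
                merged.put(partial.getKey(), merge(current, partial.getValue()));
            }
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> group : merged.entrySet()) {
            result.put(group.getKey(), evaluate(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // The same five names as above, split into two assumed partitions.
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("Sam", "Tom", "Jetty"),
                Arrays.asList("Tom", "Jetty"));
        System.out.println(stringCount(partitions)); // prints {Sam=1, Tom=2, Jetty=2}
    }
}
```

Because merge simply adds two partial counts, the result does not depend on how the rows happen to be distributed across partitions, which is what makes the aggregate safe to run in a distributed fashion.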

2. Window Functions

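Example: assume the first column is the date, the second the category, and the third the amount. Find the three highest-earning sales in each category.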

Data: https://download.csdn.net/download/qq_33283652/10904792

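import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.hive.HiveContext;

/**
 * The row_number() window function groups rows by one column and ranks them by another,
 * which makes it possible to take the top N rows of each group.
 * In Spark 1.x, a SQL statement that uses a window function had to be executed through HiveContext
 * (Spark 2.0 and later support window functions natively). This example still needs Hive support
 * because it reads and writes Hive tables; by default a HiveContext cannot be created locally.
 * Window function syntax:
 * row_number() over (partition by XXX order by XXX desc)
 */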
public class JavaExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setAppName("windowfun");
        conf.set("spark.sql.shuffle.partitions","1");
        JavaSparkContext sc = new JavaSparkContext(conf);
        HiveContext hiveContext = new HiveContext(sc);
        hiveContext.sql("use spark");
        hiveContext.sql("drop table if exists sales");
        hiveContext.sql("create table if not exists sales (riqi string,leibie string,jine Int) "
                + "row format delimited fields terminated by '\t'");
        hiveContext.sql("load data local inpath '/root/test/sales' into table sales");
        /**
         * Window function syntax:
         * row_number() over (partition by XXX order by XXX)
         */
        Dataset<Row> result = hiveContext.sql("select riqi,leibie,jine "
                + "from ("
                + "select riqi,leibie,jine,"
                + "row_number() over (partition by leibie order by jine desc) rank "
                + "from sales) t "
                + "where t.rank<=3");
        result.write().mode(SaveMode.Overwrite).saveAsTable("sales_result");
        sc.stop();
    }
}
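
To make the semantics of row_number() over (partition by leibie order by jine desc) concrete, here is a minimal plain-Java sketch of the same top-N-per-group logic using only collections. It is an illustration, not how Spark executes the query: the class TopNPerGroupSketch, the Sale type, and the sample rows are hypothetical and do not come from the linked data file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Spark-free illustration of "top N per group" via row numbering.
class TopNPerGroupSketch {

    // One row of the sales table: date, category, amount.
    static final class Sale {
        final String riqi;
        final String leibie;
        final int jine;

        Sale(String riqi, String leibie, int jine) {
            this.riqi = riqi;
            this.leibie = leibie;
            this.jine = jine;
        }

        @Override
        public String toString() {
            return riqi + "\t" + leibie + "\t" + jine;
        }
    }

    static List<Sale> topNPerCategory(List<Sale> sales, int n) {
        // "partition by leibie": collect the rows of each category.
        Map<String, List<Sale>> partitions = new LinkedHashMap<>();
        for (Sale sale : sales) {
            partitions.computeIfAbsent(sale.leibie, key -> new ArrayList<>()).add(sale);
        }
        List<Sale> result = new ArrayList<>();
        for (List<Sale> partition : partitions.values()) {
            // "order by jine desc": largest amount first.
            partition.sort(Comparator.comparingInt((Sale s) -> s.jine).reversed());
            // row_number() is the 1-based position after sorting; keep rows where it is <= n.
            for (int rowNumber = 1; rowNumber <= partition.size() && rowNumber <= n; rowNumber++) {
                result.add(partition.get(rowNumber - 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        List<Sale> sales = Arrays.asList(
                new Sale("day1", "A", 10),
                new Sale("day2", "A", 50),
                new Sale("day3", "A", 30),
                new Sale("day4", "A", 20),
                new Sale("day5", "B", 5));
        for (Sale sale : topNPerCategory(sales, 3)) {
            System.out.println(sale);
        }
    }
}
```

Unlike rank() or dense_rank(), row_number() gives tied rows distinct numbers, so each category yields at most three rows even when amounts repeat; which of the tied rows is kept is not specified in SQL, while this sketch keeps input order because List.sort is stable.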
