Weibo User Data Analysis

I. Data Description

1) Data parameters

Historical Weibo posts of users

Collected up to 2013-12-15

244 MB compressed, 878 MB uncompressed

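2) Data types

The entire dataset is in JSON format.

Fields in each JSON record:

beCommentWeiboId   whether the post is a comment
beForwardWeiboId   whether the post is a repost
catchTime          crawl time
commentCount       number of comments
content            content
createTime         creation time
info1              info field 1
info2              info field 2
info3              info field 3
mlevel             meaning unclear
musicurl           music link
pic_list           list of pictures (may contain several)
praiseCount        number of likes
reportCount        number of reposts
source             source of the post
userId             user ID
videourl           video link
weiboId            Weibo post ID
weiboUrl           URL of the post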

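II. Practical Tasks

1. Organizing the data (Hive)

    Create a Hive table weibo(json STRING) with a single column, load all the data, and verify by querying the first 3 rows.

   1> Create the database and the table

       ① Create the database: create database weibo;

       ② Switch to the database: use weibo;

       ③ Create an external table: create external table weibo(json string) row format delimited lines terminated by "\n" stored as textfile location "/exam/weibo";

   2> Load the data

       ① Upload the data:

       ② Unzip the archive: unzip weibo.zip

       ③ Put the files into HDFS: hdfs dfs -put ~/data/619893/* /exam/weibo/

   3> Verify by querying the first three rows

       select json from weibo limit 3;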

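2. Statistics (Hive)

(1) Count the total number of Weibo posts and the number of distinct users

       ① Check for dirty data (the result makes it clear that there is none):

       select get_json_object(js.json,'$.userId') from (select json from weibo)js where substr(json,1,1)="{";

       ② The actual query:

       select "Total posts:", sum(u.cnt), "Distinct users:", count(u.userId)
       from (
           -- one row per user, with that user's post count
           select jj.uid as userId, count(*) as cnt
           from (
               select get_json_object(substring(js.json,2),'$.userId') as uid
               from (select json from weibo) as js
           ) as jj
           group by jj.uid
       ) as u;  -- aliased as u because user is a reserved word in recent Hive versions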

(2) Compute the total number of reposts of each user's posts and output the top 3 users

       ① Create a view:

       -- substring(json, 2) drops the first character of each line before the JSON is parsed
       create view userRecord
       as select
       get_json_object(substring(js.json,2),'$.beCommentWeiboId') as beCommentWeiboId,
       get_json_object(substring(js.json,2),'$.beForwardWeiboId') as beForwardWeiboId,
       get_json_object(substring(js.json,2),'$.catchTime') as catchTime,
       get_json_object(substring(js.json,2),'$.commentCount') as commentCount,
       get_json_object(substring(js.json,2),'$.content') as content,
       get_json_object(substring(js.json,2),'$.createTime') as createTime,
       get_json_object(substring(js.json,2),'$.info1') as info1,
       get_json_object(substring(js.json,2),'$.info2') as info2,
       get_json_object(substring(js.json,2),'$.info3') as info3,
       get_json_object(substring(js.json,2),'$.mlevel') as mlevel,
       get_json_object(substring(js.json,2),'$.musicurl') as musicurl,
       get_json_object(substring(js.json,2),'$.pic_list') as pic_list,
       get_json_object(substring(js.json,2),'$.praiseCount') as praiseCount,
       get_json_object(substring(js.json,2),'$.reportCount') as reportCount,
       get_json_object(substring(js.json,2),'$.source') as source,
       get_json_object(substring(js.json,2),'$.userId') as userId,
       get_json_object(substring(js.json,2),'$.videourl') as videourl,
       get_json_object(substring(js.json,2),'$.weiboId') as weiboId,
       get_json_object(substring(js.json,2),'$.weiboUrl') as weiboUrl
       from (select json from weibo) js;

       ② Run the query:

       select userId, sum(reportCount) as cnt from userRecord group by userId order by cnt desc limit 3;

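(3) Find the IDs of the 3 users with the most reposted posts

       Run the query:

       select uu.userId
       from (
           -- count each user's posts that were reposted at least once
           select userId, count(*) as cnt
           from userRecord
           where reportCount > 0
           group by userId
           order by cnt desc
           limit 3
       ) as uu;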

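(4) Count the total number of posts sent by each user and store the result in a temporary table

    Create the temporary table:

   create table tempory_uid_sum(
       uid string,
       total int
   );

    Query and insert the data:

   insert overwrite table tempory_uid_sum select userId, cast(count(*) as int) from userRecord group by userId;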

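(5) Count the posts that contain pictures

       Run the query (it assumes that an empty picture list is stored as "[]", so any value longer than 2 characters holds at least one picture):

       select count(*) from userRecord where length(pic_list) > 2;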

(6) Count the distinct users who post from an iPhone

       Run the query:

       select count(distinct(userId)) from userRecord where source="iPhone客戶端";

(7) Put the user ID and source of each post into a view, then count the users in the view whose source is "iPad客戶端" (the iPad client)

       Create the view:

       create view view_uid_source
       as
           select userId, source
           from userRecord;

       Run the query:

       select count(distinct(userId)) from view_uid_source where source="iPad客戶端";

3. Special requirements

① Add the jar to Hive: add jar /home/hadoop/data/UDF_11.jar;

② Create a temporary function: create temporary function AddTwo as "org.zkpk.func.Add";

③ Create a temporary function: create temporary function WordCount as "org.zkpk.func.QueryWord";

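(1) Implement a Hive UDF for the following requirement:

    Add each post's like count and repost count, sort the sums in descending order, and take the first 10 records.

    ① Code

package org.zkpk.func;  // matches the class name registered for AddTwo

import org.apache.hadoop.hive.ql.exec.UDF;

    public class Add extends UDF {

        // Returns the sum of the two counts for one row.
        public Integer evaluate(Integer val1, Integer val2) throws Exception {
            return val1 + val2;
        }

    }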
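
In Hive, casting a missing or non-numeric field to int yields NULL, and adding two boxed Integer values throws a NullPointerException when either one is null. The following is a minimal sketch of a null-safe version of the addition, written as plain Java so that it can be run without the Hive jars. The class name NullSafeAdd and the choice to return null (rather than treat a missing count as 0) are assumptions made for illustration, not part of the original UDF.

```java
// Hypothetical helper, not part of the original jar: null-safe integer addition.
final class NullSafeAdd {

    private NullSafeAdd() {
    }

    // Returns a + b, or null when either operand is null,
    // mirroring how SQL arithmetic propagates NULL.
    static Integer add(Integer a, Integer b) {
        if (a == null || b == null) {
            return null;
        }
        return a + b;
    }
}
```

The same null check could be placed at the top of evaluate in the UDF; whether a missing count should produce NULL or be counted as 0 depends on how the result is meant to be ranked.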

           ② Query:

    -- compute the sum once per post, then keep the 10 largest
    select r.*, AddTwo(cast(r.praiseCount as int), cast(r.reportCount as int)) as total
    from userRecord r
    order by total desc
    limit 10;

(2) Implement a Hive UDF for the following requirement:

    1> Count how many times a given word occurs in the post content; the method returns an int.

    ① Code:

package org.zkpk.func;  // matches the class name registered for WordCount

import org.apache.hadoop.hive.ql.exec.UDF;

public class QueryWord extends UDF {

    // Counts non-overlapping occurrences of se in str.
    // The counter is a local variable, so it starts from 0 for every row.
    public int stringNumbers(String str, String se) {
        if (str == null || se == null || se.isEmpty()) {
            return 0;
        }
        int counter = 0;
        int from = str.indexOf(se);
        while (from != -1) {
            counter++;
            from = str.indexOf(se, from + se.length());
        }
        return counter;
    }

    public Integer evaluate(String val1, String val2) throws Exception {
        return Integer.valueOf(stringNumbers(val1, val2));
    }

}

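    2> Use this function to find the users whose post content mentions "iphone" most often; output the user ID and the count.

    ① Query:

select userId, sum(WordCount(content,"iphone")) as cnt from userRecord group by userId order by cnt;

② Query (descending and limited to 5 rows, purely to make the result easier to read):

    select userId, sum(WordCount(content,"iphone")) as cnt from userRecord group by userId order by cnt DESC limit 5;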
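
String.indexOf is case-sensitive, so a search for "iphone" does not count occurrences written as "iPhone". The task statement does not say which behavior is wanted. If case-insensitive counting is preferred, the counting logic could look roughly like the sketch below. It is plain Java so that it can be tried outside Hive; the class name OccurrenceCounter and the ignoreCase parameter are hypothetical and do not appear in the original jar.

```java
import java.util.Locale;

// Hypothetical helper for experimenting with the counting logic outside Hive.
final class OccurrenceCounter {

    private OccurrenceCounter() {
    }

    // Counts non-overlapping occurrences of word in text.
    // When ignoreCase is true, both strings are lower-cased before matching.
    static int count(String text, String word, boolean ignoreCase) {
        if (text == null || word == null || word.isEmpty()) {
            return 0;
        }
        String haystack = ignoreCase ? text.toLowerCase(Locale.ROOT) : text;
        String needle = ignoreCase ? word.toLowerCase(Locale.ROOT) : word;
        int count = 0;
        int from = haystack.indexOf(needle);
        while (from != -1) {
            count++;
            // continue searching after the end of the current match
            from = haystack.indexOf(needle, from + needle.length());
        }
        return count;
    }
}
```

With the existing UDF, a similar effect can be had in the query itself by passing lower(content) instead of content, since lower is a built-in Hive string function.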