Weibo User Data Analysis
I. Data Description
1) Data parameters
Historical Weibo posts of users
Collected up to 2013-12-15
244 MB compressed, 878 MB uncompressed
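2) Data type
The whole dataset is in JSON format
Field descriptions in the JSON:
beCommentWeiboId  whether the post is a comment
beForwardWeiboId  whether the post is a forwarded post
catchTime  crawl time
commentCount  number of comments
content  content
createTime  creation time
info1  info field 1
info2  info field 2
info3  info field 3
mlevel  not sure
musicurl  music link
pic_list  picture list (may contain several pictures)
praiseCount  number of likes
reportCount  number of forwards
source  data source
userId  user ID
videourl  video link
weiboId  post ID
weiboUrl  post URL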
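II. Hands-on Exercises
1. Organize the data (Hive)
Create a Hive table weibo(json STRING) with a single column, load all the data, and verify by querying the first 3 rows
1> Create the table (and database)
① Create the database: create database weibo;
② Switch to the database: use weibo;
③ Create the external table: create external table weibo(json string) row format delimited lines terminated by "\n" stored as textfile location "/exam/weibo";
2> Load the data
① Upload the archive (weibo.zip) to the server:
② Unzip the archive: unzip weibo.zip
③ Upload the data to HDFS: hdfs dfs -put ~/data/619893/* /exam/weibo/
3> Verify by querying the first three rows
select json from weibo limit 3;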
2. Statistics requirements (Hive)
(1) Count the total number of posts and the number of distinct users
① Check for dirty data (the result makes it easy to see that there is none):
select get_json_object(js.json,'$.userId') from (select json from weibo) js where substr(json,1,1)="{";
② The actual query:
select "Total posts:", sum(u.cnt), "Distinct users:", count(u.userId)
from (
  select jj.uid as userId, count(*) as cnt
  from (
    select get_json_object(substring(js.json,2),'$.userId') as uid
    from (
      select json from weibo
    ) as js
  ) as jj
  group by jj.uid
) as u;
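The aggregation above can be checked outside Hive on a handful of rows. The following is a minimal sketch, not the author's code: the class name WeiboTotals and the sample user IDs are hypothetical, and null IDs are simply skipped, which may differ from how Hive treats a NULL group.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WeiboTotals {
    // Count posts per user, mirroring the inner GROUP BY on userId.
    // Assumption: null IDs are ignored in this sketch.
    static Map<String, Integer> postsPerUser(List<String> userIds) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String uid : userIds) {
            if (uid != null) {
                counts.merge(uid, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Sum the per-user counts, mirroring sum(cnt) in the outer query.
    static int totalPosts(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical input: one user ID per post.
        List<String> userIds = Arrays.asList("u1", "u2", "u1", "u3", "u1");
        Map<String, Integer> counts = postsPerUser(userIds);
        System.out.println("Total posts: " + totalPosts(counts));
        System.out.println("Distinct users: " + counts.size());
    }
}
```

The number of groups is the number of distinct users, and the sum of the group sizes is the total number of posts, which is why one grouped subquery can answer both questions.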
(2) Compute the total number of times each user's posts were forwarded, and output the top 3 users
① Create a view:
create view userRecord
as select
get_json_object(substring(js.json,2),'$.beCommentWeiboId') as beCommentWeiboId,
get_json_object(substring(js.json,2),'$.beForwardWeiboId') as beForwardWeiboId,
get_json_object(substring(js.json,2),'$.catchTime') as catchTime,
get_json_object(substring(js.json,2),'$.commentCount') as commentCount,
get_json_object(substring(js.json,2),'$.content') as content,
get_json_object(substring(js.json,2),'$.createTime') as createTime,
get_json_object(substring(js.json,2),'$.info1') as info1,
get_json_object(substring(js.json,2),'$.info2') as info2,
get_json_object(substring(js.json,2),'$.info3') as info3,
get_json_object(substring(js.json,2),'$.mlevel') as mlevel,
get_json_object(substring(js.json,2),'$.musicurl') as musicurl,
get_json_object(substring(js.json,2),'$.pic_list') as pic_list,
get_json_object(substring(js.json,2),'$.praiseCount') as praiseCount,
get_json_object(substring(js.json,2),'$.reportCount') as reportCount,
get_json_object(substring(js.json,2),'$.source') as source,
get_json_object(substring(js.json,2),'$.userId') as userId,
get_json_object(substring(js.json,2),'$.videourl') as videourl,
get_json_object(substring(js.json,2),'$.weiboId') as weiboId,
get_json_object(substring(js.json,2),'$.weiboUrl') as weiboUrl
from (select json from weibo) js;
② Run the query:
select userId, sum(reportCount) as cnt from userRecord group by userId order by cnt DESC limit 3;
(3) Find the IDs of the 3 users with the most forwarded posts
Run the query:
select uu.userId
from (
  select userId, count(*) as cnt
  from userRecord
  where reportCount > 0
  group by userId
  order by cnt DESC
  limit 3
) as uu;
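The intended logic here (count each user's posts that have reportCount > 0, sort by that count in descending order, keep three) can be sketched in plain Java. This is an illustration under assumptions: the class name TopForwarded, the parallel-array input, and the sample values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopForwarded {
    // Returns up to n user IDs, ordered by how many of their posts
    // were forwarded at least once (reportCount > 0), highest first.
    static List<String> topUsers(String[] userIds, int[] reportCounts, int n) {
        Map<String, Integer> forwardedPosts = new HashMap<>();
        for (int i = 0; i < userIds.length; i++) {
            if (reportCounts[i] > 0) {
                forwardedPosts.merge(userIds[i], 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(forwardedPosts.entrySet());
        // Descending order; an ascending sort would return the least-forwarded users.
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical posts: userIds[i] wrote a post forwarded reportCounts[i] times.
        String[] userIds = {"u1", "u1", "u2", "u2", "u2", "u3", "u3", "u4"};
        int[] reportCounts = {3, 0, 1, 2, 5, 1, 4, 0};
        System.out.println(topUsers(userIds, reportCounts, 3));
    }
}
```

Applying the limit to the sorted result matters: limiting rows before or independently of the sort does not guarantee that the rows kept are the largest ones.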
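(4) Count the total number of posts each user has sent and store the result in a temporary table
Create the temporary table:
create table tempory_uid_sum(
uid string,
total int
);
Query and insert the data:
insert overwrite table tempory_uid_sum select userId, cast(count(*) as int) from userRecord group by userId;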
(5) Count the posts that contain pictures
Run the query:
select count(*) from userRecord where length(pic_list) > 2;
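The length test presumably relies on an empty picture list being stored as the two-character string "[]", so anything longer holds at least one entry. That serialization is an assumption here, not something the dataset description states. A minimal sketch of the same check, with a hypothetical class name and sample values:

```java
class PicListCheck {
    // Assumption: an empty pic_list is serialized as "[]" (length 2),
    // so a longer string means the post has at least one picture.
    static boolean hasPictures(String picList) {
        return picList != null && picList.length() > 2;
    }

    public static void main(String[] args) {
        System.out.println(hasPictures("[]"));                               // false
        System.out.println(hasPictures("[\"http://example.com/a.jpg\"]"));   // true
        System.out.println(hasPictures(null));                               // false
    }
}
```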
(6) Count the distinct users who posted from an iPhone
Run the query:
select count(distinct(userId)) from userRecord where source="iPhone客戶端";
(7) Put the user ID and data source of each post into a view, then count the users in the view whose data source is "iPad客戶端"
Create the view:
create view view_uid_source
as
select userId, source
from
userRecord;
Run the query:
select count(distinct(userId)) from view_uid_source where source="iPad客戶端";
3. Special requirements
① Add the jar to Hive: add jar /home/hadoop/data/UDF_11.jar;
② Create a temporary function: create temporary function AddTwo as "org.zkpk.func.Add";
③ Create a temporary function: create temporary function WordCount as "org.zkpk.func.QueryWord";
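(1) Implement a Hive UDF for the following requirement:
Add each post's like count and forward count, sort by the sum in descending order, and take the first 10 records
① Code
package org.zkpk.func;

import org.apache.hadoop.hive.ql.exec.UDF;

public class Add extends UDF {
    // Returns the sum of the two values, or null if either input is null.
    public Integer evaluate(Integer val1, Integer val2) throws Exception {
        if (val1 == null || val2 == null) {
            return null;
        }
        return val1 + val2;
    }
}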
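The addition logic can be tried without a Hive installation. The sketch below is a standalone illustration with a hypothetical class name (AddSketch); it assumes that a NULL column value should produce a NULL result, which the requirement does not spell out.

```java
class AddSketch {
    // Same arithmetic as the UDF's evaluate method, without the Hive dependency.
    // Assumption: a null input yields a null result instead of an exception.
    static Integer add(Integer a, Integer b) {
        if (a == null || b == null) {
            return null;
        }
        return a + b;
    }

    public static void main(String[] args) {
        System.out.println(add(2, 3));     // 5
        System.out.println(add(null, 3));  // null
    }
}
```

Without the null check, unboxing a null Integer in `a + b` would throw a NullPointerException for any row where one of the counts is missing.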
② Query:
select r.*, AddTwo(cast(r.praiseCount as int), cast(r.reportCount as int)) as total
from userRecord r
order by total DESC
limit 10;
(2) Implement a Hive UDF for the following requirement:
1> Count how many times a given word occurs in the post content; the method returns an int
① Code:
package org.zkpk.func;

import org.apache.hadoop.hive.ql.exec.UDF;

public class QueryWord extends UDF {
    // Counts non-overlapping occurrences of se in str.
    // The counter is local so that it starts from zero for every row.
    public int stringNumbers(String str, String se) {
        if (str == null || se == null || se.isEmpty()) {
            return 0;
        }
        int counter = 0;
        int idx = str.indexOf(se);
        while (idx != -1) {
            counter++;
            idx = str.indexOf(se, idx + se.length());
        }
        return counter;
    }

    public Integer evaluate(String val1, String val2) throws Exception {
        return stringNumbers(val1, val2);
    }
}
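Because a UDF instance is reused across rows, any counter kept in a static or instance field carries over from one row to the next and inflates the result. The following standalone sketch, with a hypothetical class name (OccurrenceCounter), counts non-overlapping occurrences using only local state and runs without Hive.

```java
class OccurrenceCounter {
    // Counts non-overlapping occurrences of word in text.
    // All state is local, so repeated calls do not affect each other.
    static int count(String text, String word) {
        if (text == null || word == null || word.isEmpty()) {
            return 0;
        }
        int n = 0;
        int idx = text.indexOf(word);
        while (idx != -1) {
            n++;
            idx = text.indexOf(word, idx + word.length());
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count("iphone and iphone", "iphone")); // 2
        System.out.println(count("no match here", "iphone"));     // 0
        // A second call with the same input returns the same value.
        System.out.println(count("iphone and iphone", "iphone")); // 2
    }
}
```

Note that the match is case-sensitive, so "iPhone" is not counted when searching for "iphone"; lowercasing both arguments first would be one way to change that.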
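2> Use this function to find the user whose post content contains "iphone" the most times; output the user ID and the count
① Query:
select userId, sum(WordCount(content,"iphone")) as cnt from userRecord group by userId order by cnt DESC limit 1;
② Query (this one is only for inspecting the results more conveniently):
select userId, sum(WordCount(content,"iphone")) as cnt from userRecord group by userId order by cnt DESC limit 5;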