
A Hive interview question

Original post with the interview question: http://blog.csdn.net/zolalad/article/details/10819749#

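Approach: number each user's visits by user ID (1 for the first visit, 2 for the second, and so on), then use that visit number to work out from_url and to_url for each visit.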

The hard part is computing the per-user visit number. The method in the original post looked a bit complicated, so I wrote a simpler one:

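// The function is registered below as "udf.UdfTest", so the class must live in package udf.
package udf;

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hive.ql.exec.UDF;

// Returns a running visit count per user ID: 1 for the first row seen, 2 for the second, ...
// The result depends on the order in which rows reach this UDF instance.
public class UdfTest extends UDF {

	// user ID -> number of rows seen so far for that user
	private final Map<Integer, Integer> hm = new HashMap<Integer, Integer>();

	public int evaluate(int id) {
		Integer count = hm.get(id);
		if (count == null) {
			count = 0; // first row for this user
		}
		count++;
		hm.put(id, count);
		return count;
	}
}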

The UDF keeps each user's ID and running visit count in a map, and returns the updated count on every call.
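To make the numbering concrete, here is a minimal sketch of the same counting logic in plain Java, without the Hive dependency. The class name VisitCounterDemo and the sample user IDs are made up for illustration; they are not part of the original post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class VisitCounterDemo {

    // Same idea as the UDF: user ID -> number of rows seen so far.
    private final Map<Integer, Integer> counts = new HashMap<>();

    public int evaluate(int id) {
        int count = counts.getOrDefault(id, 0) + 1;
        counts.put(id, count);
        return count;
    }

    // Feeds user IDs through one counter instance, in the given order.
    public static List<Integer> number(List<Integer> userIds) {
        VisitCounterDemo counter = new VisitCounterDemo();
        List<Integer> result = new ArrayList<>();
        for (int id : userIds) {
            result.add(counter.evaluate(id));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical user IDs, in the order the rows are read.
        System.out.println(number(Arrays.asList(101, 102, 101, 101, 102)));
        // prints [1, 1, 2, 3, 2]
    }
}
```

Note that the numbers only mean "first visit, second visit, ..." if the rows arrive in visit order and all rows of a user pass through the same instance. As far as I know, Hive does not guarantee either for a plain UDF, so treat this approach as order-sensitive.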

Package the class into a jar, upload it, and run the following in Hive:

add jar /usr/local/udf.jar;
CREATE TEMPORARY FUNCTION num AS "udf.UdfTest";

SELECT t1.platform,t1.user_id,t1.n,t2.click_url FROM_URL,t1.click_url TO_URL FROM  
(select *,num(USER_ID) n from trlog)t1  
LEFT OUTER JOIN  
(select *,num(USER_ID) n from trlog)t2   
on t1.user_id = t2.user_id and t1.n = t2.n+1; 

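Note: the join condition t1.n = t2.n+1 pairs each visit in t1 with the same user's previous visit in t2. For a user's first visit t1.n is 1, and t2 has no row with n = 0 to match it, so the LEFT OUTER JOIN leaves from_url as NULL. That is why the +1 goes on the t2 side.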
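The pairing can be checked on a few rows in plain Java. This is only a sketch that emulates the join condition with nested loops; the class SelfJoinDemo, the Row type, and the sample URLs are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SelfJoinDemo {

    // One row of the numbered subquery: user_id, visit number n, click_url.
    public static final class Row {
        final int userId;
        final int n;
        final String clickUrl;

        Row(int userId, int n, String clickUrl) {
            this.userId = userId;
            this.n = n;
            this.clickUrl = clickUrl;
        }
    }

    // Hypothetical sample data: user 1 visits three pages, user 2 visits one.
    public static List<Row> sampleRows() {
        return Arrays.asList(
                new Row(1, 1, "/home"),
                new Row(1, 2, "/list"),
                new Row(1, 3, "/item"),
                new Row(2, 1, "/home"));
    }

    // Emulates: t1 LEFT OUTER JOIN t2
    //           ON t1.user_id = t2.user_id AND t1.n = t2.n + 1
    // Output format per line: user_id,n,from_url,to_url
    public static List<String> join(List<Row> rows) {
        List<String> out = new ArrayList<>();
        for (Row t1 : rows) {
            String fromUrl = null; // stays null when t2 has no matching row
            for (Row t2 : rows) {
                if (t1.userId == t2.userId && t1.n == t2.n + 1) {
                    fromUrl = t2.clickUrl;
                }
            }
            out.add(t1.userId + "," + t1.n + "," + fromUrl + "," + t1.clickUrl);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : join(sampleRows())) {
            System.out.println(line);
        }
    }
}
```

With this sample data, the rows with n = 1 come out with from_url null, and every later row picks up the previous row's click_url as its from_url.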

When I first ran the join query it failed with java.io.FileNotFoundException(File does not exist/usr/local/......), so I manually uploaded the jar to HDFS, and after that the query ran successfully.

Note: I recently found that this can be done with Hive's analytic functions alone, using ROW_NUMBER and LAG:

select platform,
user_id,
CLICK_TIME,
ROW_NUMBER() OVER(PARTITION BY platform,user_id order by CLICK_TIME) AS rn,
lag(click_url,1) over(partition by platform,user_id order by CLICK_TIME) as from_url,
click_url from trlog;
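For readers less familiar with window functions, the following Java sketch imitates what ROW_NUMBER and LAG compute here: sort by platform, user_id and CLICK_TIME, restart the counter at each new (platform, user_id) partition, and take the previous row's click_url as from_url. The class WindowDemo, the Click type, and the sample rows are hypothetical, and it assumes CLICK_TIME is a string that sorts chronologically.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class WindowDemo {

    public static final class Click {
        final String platform;
        final int userId;
        final String clickTime; // assumed to sort chronologically as text
        final String clickUrl;

        Click(String platform, int userId, String clickTime, String clickUrl) {
            this.platform = platform;
            this.userId = userId;
            this.clickTime = clickTime;
            this.clickUrl = clickUrl;
        }
    }

    // Hypothetical sample data, deliberately out of order.
    public static List<Click> sampleClicks() {
        return Arrays.asList(
                new Click("web", 1, "2013-08-01 10:05", "/list"),
                new Click("web", 1, "2013-08-01 10:00", "/home"),
                new Click("app", 1, "2013-08-01 09:00", "/home"),
                new Click("web", 1, "2013-08-01 10:09", "/item"));
    }

    // Output format per line: platform,user_id,rn,from_url,click_url
    public static List<String> rowNumberAndLag(List<Click> clicks) {
        List<Click> sorted = new ArrayList<>(clicks);
        sorted.sort(Comparator.comparing((Click c) -> c.platform)
                .thenComparingInt((Click c) -> c.userId)
                .thenComparing((Click c) -> c.clickTime));

        List<String> out = new ArrayList<>();
        Click prev = null;
        int rn = 0;
        for (Click c : sorted) {
            boolean samePartition = prev != null
                    && prev.platform.equals(c.platform)
                    && prev.userId == c.userId;
            rn = samePartition ? rn + 1 : 1;                  // ROW_NUMBER()
            String fromUrl = samePartition ? prev.clickUrl : null; // LAG(click_url, 1)
            out.add(c.platform + "," + c.userId + "," + rn + "," + fromUrl + "," + c.clickUrl);
            prev = c;
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : rowNumberAndLag(sampleClicks())) {
            System.out.println(line);
        }
    }
}
```

Unlike the UDF version, the ordering here is explicit (order by CLICK_TIME), so the result does not depend on the order in which rows happen to be read.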