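Alibaba Tianchi Big Data: Mobile Recommendation Algorithm Competition, Write-up and Full Code

The mobile recommendation algorithm competition ended a little over a week ago, so here is a write-up looking back at how it went for us.

First, for anyone who doesn't know the competition, here is an introduction (quoted from the official site):

Task overview

2014 was a year of rapid growth for Alibaba Group's mobile e-commerce business. For example, during the 2014 Double 11 promotion, mobile accounted for 42.6% of transactions, more than 24 billion yuan. Compared with the PC era, mobile access happens anytime and anywhere and comes with richer contextual data, such as the user's location and the time patterns of their visits.

This competition is based on real user-item behavior data from Alibaba's mobile e-commerce platform and also provides the location information that is specific to the mobile era. Teams are asked to use big data and algorithms to build an item recommendation model for mobile e-commerce. We hope teams can dig out the rich meaning behind the data and recommend the right content to mobile users at the right time and in the right place.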

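Schedule

Season 1: March 20 to April 25

1. A small amount of Taobao data can be downloaded; teams debug their algorithms locally and submit results. If a team submits several times in one day, the newest result overwrites the earlier one.

2. The submission entry opens on April 1 and the first leaderboard is published on April 2. The leaderboard is updated daily and sorted by F1 score from high to low; it shows each team's best score so far in this stage.

3. The data is switched once on April 20; the leaderboard ranks teams by their scores after the 20th.

4. When Season 1 closes, the top 500 teams by best score advance to Season 2.

Season 2: April 30 to July 1

1. Season 2 has two stages:

1) Part 1, April 30 to June 23. When Part 1 closes, the top 200 teams by best score advance to Part 2.

2) Part 2, June 24 to July 1. When Part 2 closes, the top 5 teams by best score are invited to the final defense.

Notes:

The Part 1 answer data contains the purchase data of 50% of all users who made a purchase on the observation date.

The Part 2 answer data contains the purchase data of 100% of all users who made a purchase on the observation date, so the number of users in Part 2 is twice that of Part 1.

2. Qualified teams log in to the Tianchi platform to access and use the massive Taobao data, and use Map&Reduce, SQL and the machine learning packages integrated in the platform to tune their models and submit results.

3. Season 2 allows one evaluation per day. The submission deadline is 0:00; the leaderboard is updated daily and sorted by F1 score from high to low.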

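Data description

The competition data has two parts. The first part is the mobile behavior data of users over the full item set (D). The table is named tianchi_mobile_recommend_train_user and has the following fields:

| Field | Description | Extraction notes |
|---|---|---|
| user_id | User identifier | Sampled and anonymized |
| item_id | Item identifier | Anonymized |
| behavior_type | Type of the user's action on the item | Browse (click), favorite, add to cart, purchase, encoded as 1, 2, 3, 4 respectively |
| user_geohash | Spatial identifier of the user's location; may be empty | Generated from latitude and longitude by an undisclosed algorithm |
| item_category | Item category identifier | Anonymized |
| time | Time of the action | Accurate to the hour |

The second part is the item subset (P). The table is named tianchi_mobile_recommend_train_item and has the following fields:

| Field | Description | Extraction notes |
|---|---|---|
| item_id | Item identifier | Sampled and anonymized |
| item_geohash | Spatial identifier of the item's location; may be empty | Generated from latitude and longitude by an undisclosed algorithm |
| item_category | Item category identifier | Anonymized |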

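The training data contains the mobile behavior data (D) of a sampled set of users over one month (Nov 18 to Dec 18). The scoring data is these users' purchases of items in the subset (P) on the day after that month (Dec 19). Participants use the training data to build a recommendation model and output predictions of which items in the subset each user will purchase on the following day.

Result format

Once the purchase predictions on the item subset are done, the results must be written to a table (non-partitioned) in the required format. The table must be named tianchi_mobile_recommendation_predict and contain two columns, user_id and item_id (both of type string), with duplicates removed.

Preliminary-round data

The preliminary round provides the complete behavior data of 10,000 users and item information on the order of millions. The training and prediction data are switched once on April 20 (to the data of another 10,000 users).

Final-round data

The final round provides the complete behavior data of 5 million users and item information on the order of tens of millions.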

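Evaluation metric

The competition uses the classic precision, recall and F1 score as evaluation metrics, computed as follows:

Precision = |PredictionSet ∩ ReferenceSet| / |PredictionSet|
Recall = |PredictionSet ∩ ReferenceSet| / |ReferenceSet|
F1 = 2 × Precision × Recall / (Precision + Recall)

Here PredictionSet is the set of purchases predicted by the algorithm and ReferenceSet is the set of true purchases. F1 is the sole final evaluation criterion.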
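For offline checks it helps to have the metric as a small function. The following is a minimal sketch, assuming predictions and answers are both sets of (user_id, item_id) pairs; the function name `evaluate` and the toy data are illustrative and not part of the official evaluation code.

```python
def evaluate(prediction_set, reference_set):
    """Return (precision, recall, f1) for two collections of (user_id, item_id) pairs."""
    prediction_set = set(prediction_set)
    reference_set = set(reference_set)
    hits = len(prediction_set & reference_set)
    precision = hits / len(prediction_set) if prediction_set else 0.0
    recall = hits / len(reference_set) if reference_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (made up): 4 predicted pairs, 3 true purchases, 2 of them hit.
predicted = {("u1", "i1"), ("u1", "i2"), ("u2", "i3"), ("u3", "i9")}
actual = {("u1", "i1"), ("u2", "i3"), ("u4", "i5")}
precision, recall, f1 = evaluate(predicted, actual)
print(precision, recall, f1)
```

Converting both inputs to sets also enforces the "duplicates removed" requirement of the result table.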

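With the background and data covered, let's move on to Season 1:

Season 1 had 10,000 users, about ten million user-item interactions over the month, and an item subset on the order of a hundred thousand. Our first idea came from shopper psychology: an item put in the cart one day is quite likely to be bought the next. So we simply submitted everything added to carts on Dec 18 (intersected with the item subset), and that gave us our first leaderboard score.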
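The cart rule is simple enough to sketch in a few lines. This is only an illustration, assuming each behavior record is a tuple (user_id, item_id, behavior_type, time) with time formatted as 'YYYY-MM-DD HH'; the function name `cart_rule` and the sample records are made up.

```python
def cart_rule(records, item_subset, day="2014-12-18"):
    """Predict every (user, item) pair carted on `day` whose item is in the subset P."""
    predictions = set()
    for user_id, item_id, behavior_type, time in records:
        # behavior_type 3 means "add to cart" in the competition data.
        if behavior_type == 3 and time.startswith(day) and item_id in item_subset:
            predictions.add((user_id, item_id))
    return predictions


records = [
    ("u1", "a", 3, "2014-12-18 09"),  # carted on the last day, item in P: kept
    ("u1", "b", 3, "2014-12-17 09"),  # carted a day too early
    ("u2", "c", 3, "2014-12-18 21"),  # item not in P
    ("u2", "a", 1, "2014-12-18 21"),  # only a click
]
item_subset = {"a", "b"}
cart_predictions = cart_rule(records, item_subset)
print(cart_predictions)
```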


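That score was quite good at the time, especially since many teams hadn't thought of this lucky shortcut yet, and we felt rather clever. Submitting the cart directly involves no algorithm at all, so from here on we call this kind of thing a rule. Over the next few days we excitedly hunted for more rules, including:

Dropping users who never bought anything during the 30 days.

Dropping items that were added to the cart on the previous day and already bought that same day.

And a tip from 雲泛天音: suppose a user adds three items to the cart at 9 a.m. on Dec 18 and buys one of them at 10 a.m. Then the other two can be dropped, because if the user wanted them, why not buy them together? The user has already made a choice. With this, F1 kept improving.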
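The three filtering rules can be combined into one pass over the data. The sketch below is my own reading of them, not the code we submitted: it assumes records are (user_id, item_id, behavior_type, time) tuples with zero-padded 'YYYY-MM-DD HH' times (so string comparison orders them correctly), and it interprets "the user has made a choice" as "the user bought some item from that day's cart at or after the hour this item was carted". Intersecting with the item subset is left out for brevity.

```python
def refine_cart(records, day="2014-12-18"):
    """Apply the three filtering rules to the carts of `day`."""
    buyers = set()   # users with at least one purchase in the whole window
    cart_time = {}   # (user, item) -> earliest cart hour on `day`
    buy_time = {}    # (user, item) -> latest purchase hour on `day`
    for user, item, behavior, time in records:
        if behavior == 4:
            buyers.add(user)
        if not time.startswith(day):
            continue
        key = (user, item)
        if behavior == 3:
            cart_time[key] = min(time, cart_time.get(key, time))
        elif behavior == 4:
            buy_time[key] = max(time, buy_time.get(key, time))

    # Latest hour at which each user bought something that was in that day's cart.
    last_choice = {}
    for (user, item), bought_at in buy_time.items():
        if (user, item) in cart_time:
            last_choice[user] = max(bought_at, last_choice.get(user, bought_at))

    kept = set()
    for (user, item), carted_at in cart_time.items():
        if user not in buyers:
            continue  # rule 1: the user never buys anything
        if (user, item) in buy_time:
            continue  # rule 2: already bought on the same day
        if user in last_choice and last_choice[user] >= carted_at:
            continue  # rule 3: the user picked another item from the cart
        kept.add((user, item))
    return kept


records = [
    ("u1", "a", 3, "2014-12-18 09"),
    ("u1", "b", 3, "2014-12-18 09"),
    ("u1", "c", 3, "2014-12-18 09"),
    ("u1", "a", 4, "2014-12-18 10"),  # u1 chose "a", so "b" and "c" are dropped
    ("u2", "d", 3, "2014-12-18 20"),  # u2 never bought anything
    ("u3", "e", 4, "2014-12-01 12"),
    ("u3", "f", 3, "2014-12-18 21"),  # u3 is a buyer and "f" is still pending
]
refined = refine_cart(records)
print(refined)
```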


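By this point F1 had passed 9 (percent), but because we were only ever working on a single day of data, recall had a ceiling. Even so, we still wanted to keep looking for rules. I believe many teams were in the same place at that stage, for one reason only: rules improved the score so quickly.

Eventually we couldn't find any more strong rules, so we decided to try a model. LR (logistic regression) is fast and simple and there are Python packages for it; a teammate had used one before, so we just used it, with L1 regularization to handle sparsity. Writing LR yourself is also quick: it is just gradient descent to fit the parameters. We split the dataset like this:

Dec 19: online test

Dec 18: offline test

Dec 17: offline training (user-item pairs are labeled by whether they were purchased that day, and features are extracted from Dec 16 and earlier)

Dec 16 and earlier: feature extraction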
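The split above boils down to "features from before the label day, labels from the label day". Here is a minimal sketch of the labeling step, assuming records are (user_id, item_id, behavior_type, time) tuples with 'YYYY-MM-DD HH' times; the helper name `build_samples` and the choice of candidate pairs (every pair seen before the label day) are my assumptions for illustration.

```python
def build_samples(records, label_day):
    """Label every (user, item) pair seen before `label_day` by purchase on that day."""
    candidates = set()
    bought = set()
    for user, item, behavior, time in records:
        day = time[:10]
        if day < label_day:
            candidates.add((user, item))  # this pair gets features from earlier days
        elif day == label_day and behavior == 4:
            bought.add((user, item))      # purchased on the label day
    return {pair: int(pair in bought) for pair in candidates}


records = [
    ("u1", "a", 3, "2014-12-16 10"),
    ("u1", "b", 1, "2014-12-15 10"),
    ("u1", "a", 4, "2014-12-17 09"),
    ("u2", "c", 1, "2014-12-16 22"),
    ("u2", "d", 4, "2014-12-17 12"),  # no earlier interaction, so not a candidate
]
train = build_samples(records, "2014-12-17")       # offline training set
validation = build_samples(records, "2014-12-18")  # offline test set
print(train)
```

Using the same function with a later `label_day` gives the offline test set, which keeps the training and evaluation pipelines identical.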

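I no longer remember the exact numbers since it was a while ago, but the biggest problem in training was the imbalance between positive and negative samples. In the code we treated the ratio as a parameter and picked the best value. In Season 1 we sampled 1/20 of the negatives, and as I recall the positive-to-negative ratio ended up at around 1:5.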
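As for the model itself, a hand-written LR really is short. The sketch below is not the package we used: it is a plain-Python batch gradient descent on the log-loss, with the L1 penalty applied as a soft-thresholding step after each update (proximal gradient descent, one common way to handle L1). The hyperparameters and the toy data are arbitrary.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_lr_l1(X, y, lr=0.5, l1=0.01, epochs=1000):
    """Fit weights w and bias b; the bias is not regularized."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        for j in range(d):
            stepped = w[j] - lr * grad_w[j] / n
            # Soft-thresholding: shrink toward zero by lr * l1, clipping at zero.
            w[j] = math.copysign(max(abs(stepped) - lr * l1, 0.0), stepped)
    return w, b


# Toy data: the first feature decides the label, the second carries no signal.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0, 0, 1, 1]
w, b = train_lr_l1(X, y)
predictions = [int(predict_proba(w, b, x) > 0.5) for x in X]
print(w, b, predictions)
```

The soft-thresholding step is what pushes weights of uninformative features toward exactly zero, which is the reason L1 suits sparse feature sets.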

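A quick analysis of the positive/negative ratio. Take three settings:

1) positive:negative = 1:3

2) positive:negative = 1:5

3) positive:negative = 1:7

From (1) to (3) the share of negatives grows, the model learns the negatives in finer detail and predicts fewer positives, so precision goes up and recall goes down. Because the metric, F1, balances the two, you can always tune the ratio to find the setting with the highest F1.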
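Treating the ratio as a parameter can look like the sketch below. It assumes samples are (key, label) tuples with label 1 for a purchase and 0 otherwise; the function name `undersample` and the fixed seed are illustrative choices.

```python
import random


def undersample(samples, neg_per_pos=5, seed=0):
    """Keep all positives and at most `neg_per_pos` random negatives per positive."""
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    keep = min(len(negatives), neg_per_pos * len(positives))
    return positives + rng.sample(negatives, keep)


# Toy data: 10 positives and 1000 negatives.
samples = [(("u", i), 1) for i in range(10)] + [(("v", i), 0) for i in range(1000)]
for ratio in (3, 5, 7):
    balanced = undersample(samples, neg_per_pos=ratio)
    print(ratio, len(balanced))
```

In practice one would train on each `balanced` set, score it on the offline test day, and keep the ratio with the best F1.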

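For LR we mainly used the following features (straight to the code; they are all simple features, such as the cart-to-purchase conversion rate, item popularity and so on):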

select user_id,count(behavior_type) from user_item where behavior_type=1 and time<'2014-12-18 00:00:00' group by user_id into outfile 'D://model//signal_feature//buser_click.txt';
select user_id,count(behavior_type) from user_item where behavior_type=2 and time<'2014-12-18 00:00:00' group by user_id into outfile 'D://model//signal_feature//buser_collect.txt';
select user_id,count(behavior_type) from user_item where behavior_type=3 and time<'2014-12-18 00:00:00' group by user_id into outfile 'D://model//signal_feature//buser_cart.txt';
select user_id,count(behavior_type) from user_item where behavior_type=4 and time<'2014-12-18 00:00:00' group by user_id into outfile 'D://model//signal_feature//buser_buy.txt';
select item_id,count(behavior_type) from user_item where behavior_type=1 and time<'2014-12-18 00:00:00' group by item_id into outfile 'D://model//signal_feature//bitem_click.txt';
select item_id,count(behavior_type) from user_item where behavior_type=2 and time<'2014-12-18 00:00:00' group by item_id into outfile 'D://model//signal_feature//bitem_collect.txt';
select item_id,count(behavior_type) from user_item where behavior_type=3 and time<'2014-12-18 00:00:00' group by item_id into outfile 'D://model//signal_feature//bitem_cart.txt';
select item_id,count(behavior_type) from user_item where behavior_type=4 and time<'2014-12-18 00:00:00' group by item_id into outfile 'D://model//signal_feature//bitem_buy.txt';
select user_id,item_id,count(behavior_type) from user_item where behavior_type=1 and time<'2014-12-18 00:00:00' group by user_id,item_id into outfile 'D://model//signal_feature//bu_it_click.txt';
select user_id,item_id,count(behavior_type) from user_item where behavior_type=2 and time<'2014-12-18 00:00:00' group by user_id,item_id into outfile 'D://model//signal_feature//bu_it_collect.txt';
select user_id,item_id,count(behavior_type) from user_item where behavior_type=3 and time<'2014-12-18 00:00:00' group by user_id,item_id into outfile 'D://model//signal_feature//bu_it_cart.txt';
select user_id,item_id,count(behavior_type) from user_item where behavior_type=4 and time<'2014-12-18 00:00:00' group by user_id,item_id into outfile 'D://model//signal_feature//bu_it_buy.txt';
select user_id,count(behavior_type) from user_item where behavior_type=1 group by user_id into outfile 'D://model//signal_feature//nuser_click.txt';
select user_id,count(behavior_type) from user_item where behavior_type=2 group by user_id into outfile 'D://model//signal_feature//nuser_collect.txt';
select user_id,count(behavior_type) from user_item where behavior_type=3 group by user_id into outfile 'D://model//signal_feature//nuser_cart.txt';
select user_id,count(behavior_type) from user_item where behavior_type=4 group by user_id into outfile 'D://model//signal_feature//nuser_buy.txt';
select item_id,count(behavior_type) from user_item where behavior_type=1 group by item_id into outfile 'D://model//signal_feature//nitem_click.txt';
select item_id,count(behavior_type) from user_item where behavior_type=2 group by item_id into outfile 'D://model//signal_feature//nitem_collect.txt';
select item_id,count(behavior_type) from user_item where behavior_type=3 group by item_id into outfile 'D://model//signal_feature//nitem_cart.txt';
select item_id,count(behavior_type) from user_item where behavior_type=4 group by item_id into outfile 'D://model//signal_feature//nitem_buy.txt';
select user_id,item_id,count(behavior_type) from user_item where behavior_type=1 group by user_id,item_id into outfile 'D://model//signal_feature//nu_it_click.txt';
select user_id,item_id,count(behavior_type) from user_item where behavior_type=2 group by user_id,item_id into outfile 'D://model//signal_feature//nu_it_collect.txt';
select user_id,item_id,count(behavior_type) from user_item where behavior_type=3 group by user_id,item_id into outfile 'D://model//signal_feature//nu_it_cart.txt';
select user_id,item_id,count(behavior_type) from user_item where behavior_type=4 group by user_id,item_id into outfile 'D://model//signal_feature//nu_it_buy.txt';
 
 
 
select user_id,count(behavior_type) from user_item where behavior_type=1 and time<'2014-12-18 00:00:00' and time>'2014-12-15 00:00:00' group by user_id into outfile 'D://model//signal_feature//buser_recent3day_click.txt';
select user_id,count(behavior_type) from user_item where behavior_type=2 and time<'2014-12-18 00:00:00' and time>'2014-12-15 00:00:00' group by user_id into outfile 'D://model//signal_feature//buser_recent3day_collect.txt';
select user_id,count(behavior_type) from user_item where behavior_type=3 and time<'2014-12-18 00:00:00' and time>'2014-12-15 00:00:00' group by user_id into outfile 'D://model//signal_feature//buser_recent3day_cart.txt';
select user_id,count(behavior_type) from user_item where behavior_type=4 and time<'2014-12-18 00:00:00' and time>'2014-12-15 00:00:00' group by user_id into outfile 'D://model//signal_feature//buser_recent3day_buy.txt';
 
 
 
select item_id,count(behavior_type) from user_item where behavior_type=1 and time<'2014-12-18 00:00:00' and time>'2014-12-15 00:00:00' group by item_id into outfile 'D://model//signal_feature//bitem_recent3day_click.txt';
select item_id,count(behavior_type) from user_item where behavior_type=2 and time<'2014-12-18 00:00:00' and time>'2014-12-15 00:00:00' group by item_id into outfile 'D://model//signal_feature//bitem_recent3day_collect.txt';
select item_id,count(behavior_type) from user_item where behavior_type=3 and time<'2014-12-18 00:00:00' and time>'2014-12-15 00:00:00' group by item_id into outfile 'D://model//signal_feature//bitem_recent3day_cart.txt';
select item_id,count(behavior_type) from user_item where behavior_type=4 and time<'2014-12-18 00:00:00' and time>'2014-12-15 00:00:00' group by item_id into outfile 'D://model//signal_feature//bitem_recent3day_buy.txt';
 
 
 
select user_id,item_id,count(behavior_type) from user_item where behavior_type=1 and time<'2014-12-18 00:00:00' and time>'2014-12-15 00:00:00' group by user_id,item_id into outfile 'D://model//signal_feature//bu_it_recent3day_click.txt';
select user_id,item_id,count(behavior_type) from user_item where behavior_type=2 and time<'2014-12-18 00:00:00' and time>'2014-12-15 00:00:00' group by user_id,item_id into outfile 'D://model//signal_feature//bu_it_recent3day_collect.txt';
select user_id,item_id,count(behavior_type) from user_item where behavior_type=3 and time<'2014-12-18 00:00:00' and time>'2014-12-15 00:00:00' group by user_id,item_id into outfile 'D://model//signal_feature//bu_it_recent3day_cart.txt';
select user_id,item_id,count(behavior_type) from user_item where behavior_type=4 and time<'2014-12-18 00:00:00' and time>'2014-12-15 00:00:00' group by user_id,item_id into outfile 'D://model//signal_feature//bu_it_recent3day_buy.txt';
 
 
 
select user_id,count(behavior_type) from user_item where behavior_type=1 and time<'2014-12-19 00:00:00' and time>'2014-12-16 00:00:00' group by user_id into outfile 'D://model//signal_feature//nuser_recent3day_click.txt';
select user_id,count(behavior_type) from user_item where behavior_type=2 and time<'2014-12-19 00:00:00' and time>'2014-12-16 00:00:00' group by user_id into outfile 'D://model//signal_feature//nuser_recent3day_collect.txt';
select user_id,count(behavior_type) from user_item where behavior_type=3 and time<'2014-12-19 00:00:00' and time>'2014-12-16 00:00:00' group by user_id into outfile 'D://model//signal_feature//nuser_recent3day_cart.txt';
select user_id,count(behavior_type) from user_item where behavior_type=4 and time<'2014-12-19 00:00:00' and time>'2014-12-16 00:00:00' group by user_id into outfile 'D://model//signal_feature//nuser_recent3day_buy.txt';
 
 
 
select item_id,count(behavior_type) from user_item where behavior_type=1 and time<'2014-12-19 00:00:00' and time>'2014-12-16 00:00:00' group by item_id into outfile 'D://model//signal_feature//nitem_recent3day_click.txt';
select item_id,count(behavior_type) from user_item where behavior_type=2 and time<'2014-12-19 00:00:00' and time>'2014-12-16 00:00:00' group by item_id into outfile 'D://model//signal_feature//nitem_recent3day_collect.txt';
select item_id,count(behavior_type) from user_item where behavior_type=3 and time<'2014-12-19 00:00:00' and time>'2014-12-16 00:00:00' group by item_id into outfile 'D://model//signal_feature//nitem_recent3day_cart.txt';
select item_id,count(behavior_type) from user_item where behavior_type=4 and time<'2014-12-19 00:00:00' and time>'2014-12-16 00:00:00' group by item_id into outfile 'D://model//signal_feature//nitem_recent3day_buy.txt';
 
 
 
select user_id,item_id,count(behavior_type) from user_item where behavior_type=1 and time<'2014-12-19 00:00:00' and time>'2014-12-16 00:00:00' group by user_id,item_id into outfile 'D://model//signal_feature//nu_it_recent3day_click.txt';
select user_id,item_id,count(behavior_type) from user_item where behavior_type=2 and time<'2014-12-19 00:00:00' and time>'2014-12-16 00:00:00' group by user_id,item_id into outfile 'D://model//signal_feature//nu_it_recent3day_collect.txt';
select user_id,item_id,count(behavior_type) from user_item where behavior_type=3 and time<'2014-12-19 00:00:00' and time>'2014-12-16 00:00:00' group by user_id,item_id into outfile 'D://model//signal_feature//nu_it_recent3day_cart.txt';
select user_id,item_id,count(behavior_type) from user_item where behavior_type=4 and time<'2014-12-19 00:00:00' and time>'2014-12-16 00:00:00' group by user_id,item_id into outfile 'D://model//signal_feature//nu_it_recent3day_buy.txt';
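For readers without a MySQL setup, the same counting features can be sketched in plain Python. This assumes records are (user_id, item_id, behavior_type, time) tuples with 'YYYY-MM-DD HH' times; the names `count_features` and `cart_to_buy_rate` are made up for the example, and the conversion rate shown is one plausible definition (purchases divided by cart additions), not necessarily the exact one we used.

```python
from collections import Counter

BEHAVIORS = {1: "click", 2: "collect", 3: "cart", 4: "buy"}


def count_features(records, before):
    """Count actions per user, per item and per (user, item) pair before `before`."""
    user_counts, item_counts, pair_counts = Counter(), Counter(), Counter()
    for user_id, item_id, behavior_type, time in records:
        if time >= before:
            continue  # only use history strictly before the cutoff
        name = BEHAVIORS[behavior_type]
        user_counts[(user_id, name)] += 1
        item_counts[(item_id, name)] += 1
        pair_counts[(user_id, item_id, name)] += 1
    return user_counts, item_counts, pair_counts


def cart_to_buy_rate(user_counts, user_id):
    """Purchases per cart addition for one user; 0.0 if the user never carted."""
    carts = user_counts[(user_id, "cart")]
    return user_counts[(user_id, "buy")] / carts if carts else 0.0


records = [
    ("u1", "a", 1, "2014-12-16 10"),
    ("u1", "a", 3, "2014-12-16 11"),
    ("u1", "a", 4, "2014-12-17 09"),
    ("u1", "b", 3, "2014-12-17 20"),
    ("u2", "a", 1, "2014-12-18 08"),  # on or after the cutoff, ignored
]
user_counts, item_counts, pair_counts = count_features(records, before="2014-12-18 00")
print(cart_to_buy_rate(user_counts, "u1"))
```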


At this point we were using two days of data on top of the rules.

We finished Season 1 with a score of 10.02.

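Season 2 was pretty painful. Just as I got onto ODPS and was ready to explore properly, some tasks came up in the lab and I had to shift most of my attention to research. My two teammates also had other commitments and dropped out. At one point I didn't even look at ODPS for a whole week; all the time I put in probably adds up to about 10 days. Looking back I really regret it, because a rare opportunity was wasted. So if you want to do well in this competition, make sure you have plenty of time. It really is time-consuming, but if you stick with it you will get good results.

OK, on to the main topic. Here is what I did in Season 2:

Season 2 ran on Alibaba's ODPS platform, which provides SQL and MR (MapReduce) for feature extraction; I mainly used SQL. Stage 1 of Season 2 had 500 teams and stage 2 had 200. From the moment I started using ODPS the platform felt very slow: a single select query could fail to finish in two hours because it kept waiting for resources. Quite a few teams were heavy resource users, and the total resources were fixed: whoever's job grabbed them got to run, rather than each team getting a fixed share. Thinking about it, Alibaba has to serve its own business first, and with 500 teams, two machines per team would already mean 1000 machines... that is a very high cost.

To get started with ODPS, read the official documentation. The SQL dialect is close to Hive; what I used was all fairly simple and not hard.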

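About the Season 2 data: there were 5 million users, items on the order of tens of millions, and roughly 5.8 billion interaction records over 30 days, so the data volume is huge. In the first stage of Season 2 only half of the results were used for scoring, while the second stage used all of them. The difference is that in the second stage precision doubles while recall stays basically unchanged.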
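Why precision doubles while recall stays put is easy to see with made-up numbers. The sketch assumes the same prediction table is scored in both stages and that hits are spread evenly over users, so scoring against 50% of the buyers halves both the hits and the reference set while the number of predictions stays the same.

```python
def precision_recall(hits, predicted, reference):
    """Precision and recall from raw counts."""
    return hits / predicted, hits / reference


# Hypothetical counts: 2000 predicted pairs, 1000 true purchases, 200 hits.
full_precision, full_recall = precision_recall(hits=200, predicted=2000, reference=1000)

# Scored against only half of the buyers: hits and reference halve, predictions do not.
half_precision, half_recall = precision_recall(hits=100, predicted=2000, reference=500)

print(half_precision, half_recall)  # stage 1
print(full_precision, full_recall)  # stage 2
```

This is also why an F1 from the first stage cannot be compared directly with one from the second stage.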

OK, you are probably more interested in the code, so I'll paste the code first (with some comments in it) and then explain:

--------------------------------------------------------------------------------
--                              Parameters                                    --
--------------------------------------------------------------------------------
-- Integrated data-processing pipeline: feature extraction, data merging, class labeling, normalization, under-sampling, etc.
-- Input parameters:
-- label_day: date of the class label, e.g. '2014-12-18 00'
-- table_label: label used as the prefix of the stored table names, e.g. 16
-- Output tables:
-- Feature tables: ${table_label}_item_features_1,...,${table_label}_item_features_n,${table_label}_item_features,
--           ${table_label}_ui_features_1,....,${table_label}_ui_features_n,${table_label}_ui_features,
--           ${table_label}_user_features_1,....,${table_label}_user_features_n,${table_label}_user_features,
-- Data merging:
--           ${table_label}_feature_table,${table_label}_normal,${table_label}_under_sample
 
 
--------------------------------------------------------------------------------
--                              Feature extraction                            --
--------------------------------------------------------------------------------
 
--------------------------------------------------------------------------------
-- Generate item features from the data before label_day
 
-- ran successfully
drop table if exists ${table_label}_item_features_1;
create table ${table_label}_item_features_1 as
select
    item_id,
    -- 1) total clicks, favorites, cart additions and purchases per item
    sum(case when (behavior_type=1 and time<'${label_day}') then 1 else 0 end) as i1,
    sum(case when (behavior_type=2 and time<'${label_day}') then 1 else 0 end) as i2,
    sum(case when (behavior_type=3 and time<'${label_day}') then 1 else 0 end) as i3,
    sum(case when (behavior_type=4 and time<'${label_day}') then 1 else 0 end) as i4,
    -- 2) average count per user for each item (only clicks are computed here)
    sum(case when (behavior_type=1 and time<'${label_day}') then 1 else 0 end)/count(distinct user_id) as i5,
    -- added 2015-5-6 (signs that the brand/item is becoming popular)
    -- ratio of the action count on the most recent day to the daily average action count
    case when sum(case when behavior_type=1 then 1 else 0 end)=0 then 0 else sum(case when behavior_type=1 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=1 then 1 else 0 end)*datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date('2014-11-18 00','yyyy-mm-dd hh'),'dd')/sum(case when behavior_type=1 then 1 else 0 end) end as i6,
    case when sum(case when behavior_type=2 then 1 else 0 end)=0 then 0 else sum(case when behavior_type=2 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=1 then 1 else 0 end)*datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date('2014-11-18 00','yyyy-mm-dd hh'),'dd')/sum(case when behavior_type=2 then 1 else 0 end) end as i7,
    case when sum(case when behavior_type=3 then 1 else 0 end)=0 then 0 else sum(case when behavior_type=3 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=1 then 1 else 0 end)*datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date('2014-11-18 00','yyyy-mm-dd hh'),'dd')/sum(case when behavior_type=3 then 1 else 0 end) end as i8,
    -- ratio of the action count 2 days back to the daily average action count
    case when sum(case when behavior_type=1 then 1 else 0 end)=0 then 0 else sum(case when behavior_type=1 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=2 then 1 else 0 end)*datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date('2014-11-18 00','yyyy-mm-dd hh'),'dd')/sum(case when behavior_type=1 then 1 else 0 end) end as i9,
    case when sum(case when behavior_type=2 then 1 else 0 end)=0 then 0 else sum(case when behavior_type=2 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=2 then 1 else 0 end)*datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date('2014-11-18 00','yyyy-mm-dd hh'),'dd')/sum(case when behavior_type=2 then 1 else 0 end) end as i10,
    case when sum(case when behavior_type=3 then 1 else 0 end)=0 then 0 else sum(case when behavior_type=3 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=2 then 1 else 0 end)*datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date('2014-11-18 00','yyyy-mm-dd hh'),'dd')/sum(case when behavior_type=3 then 1 else 0 end) end as i11,
    case when sum(case when behavior_type=4 then 1 else 0 end)=0 then 0 else sum(case when behavior_type=4 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=2 then 1 else 0 end)*datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date('2014-11-18 00','yyyy-mm-dd hh'),'dd')/sum(case when behavior_type=4 then 1 else 0 end) end as i12,
    -- ratio of the action count 3 days back to the daily average action count
    case when sum(case when behavior_type=1 then 1 else 0 end)=0 then 0 else sum(case when behavior_type=1 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=3 then 1 else 0 end)*datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date('2014-11-18 00','yyyy-mm-dd hh'),'dd')/sum(case when behavior_type=1 then 1 else 0 end) end as i13,
    case when sum(case when behavior_type=2 then 1 else 0 end)=0 then 0 else sum(case when behavior_type=2 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=3 then 1 else 0 end)*datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date('2014-11-18 00','yyyy-mm-dd hh'),'dd')/sum(case when behavior_type=2 then 1 else 0 end) end as i14,
    case when sum(case when behavior_type=3 then 1 else 0 end)=0 then 0 else sum(case when behavior_type=3 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=3 then 1 else 0 end)*datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date('2014-11-18 00','yyyy-mm-dd hh'),'dd')/sum(case when behavior_type=3 then 1 else 0 end) end as i15,
    case when sum(case when behavior_type=4 then 1 else 0 end)=0 then 0 else sum(case when behavior_type=4 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=3 then 1 else 0 end)*datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date('2014-11-18 00','yyyy-mm-dd hh'),'dd')/sum(case when behavior_type=4 then 1 else 0 end) end as i16,
    -- ratio of the action count over the last 3 days to the daily average action count
    case when sum(case when behavior_type=1 then 1 else 0 end)=0 then 0 else sum(case when behavior_type=1 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')<4 then 1 else 0 end)*datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date('2014-11-18 00','yyyy-mm-dd hh'),'dd')/sum(case when behavior_type=1 then 1 else 0 end) end as i17,
    case when sum(case when behavior_type=2 then 1 else 0 end)=0 then 0 else sum(case when behavior_type=2 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')<4 then 1 else 0 end)*datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date('2014-11-18 00','yyyy-mm-dd hh'),'dd')/sum(case when behavior_type=2 then 1 else 0 end) end as i18,
    case when sum(case when behavior_type=3 then 1 else 0 end)=0 then 0 else sum(case when behavior_type=3 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')<4 then 1 else 0 end)*datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date('2014-11-18 00','yyyy-mm-dd hh'),'dd')/sum(case when behavior_type=3 then 1 else 0 end) end as i19,
    case when sum(case when behavior_type=4 then 1 else 0 end)=0 then 0 else sum(case when behavior_type=4 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')<4 then 1 else 0 end)*datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date('2014-11-18 00','yyyy-mm-dd hh'),'dd')/sum(case when behavior_type=4 then 1 else 0 end) end as i20
from tianchi_lbs.tianchi_mobile_recommend_train_user
where time<'${label_day}'
group by item_id;
 
-- ran successfully
-- rank of each item within its category (added 2015-5-15)
drop table if exists ${table_label}_item_features_2;
create table ${table_label}_item_features_2 as
select item_id,
    dense_rank() over(partition by item_category order by ii1 desc) as i21,
    dense_rank() over(partition by item_category order by ii2 desc) as i22,
    dense_rank() over(partition by item_category order by ii3 desc) as i23,
    dense_rank() over(partition by item_category order by ii4 desc) as i24
from
(
    select item_category,item_id,
    sum(case when behavior_type=1 then 1 else 0 end) as ii1,
    sum(case when behavior_type=2 then 1 else 0 end) as ii2,
    sum(case when behavior_type=3 then 1 else 0 end) as ii3,
    sum(case when behavior_type=4 then 1 else 0 end) as ii4
    from tianchi_lbs.tianchi_mobile_recommend_train_user
    where time<'${label_day}'
    group by item_id,item_category
)t;
 
-- ##################### 05-24
-- number of users interacting with each item (all time, last 1 day, last 3 days)
-- note: the distinct below includes time, so a user is counted once per distinct hour
drop table if exists ${table_label}_item_features_3;
create table ${table_label}_item_features_3 as
select item_id,
-- all time
sum(case when behavior_type=1 then 1 else 0 end) as i25,
sum(case when behavior_type=2 then 1 else 0 end) as i26,
sum(case when behavior_type=3 then 1 else 0 end) as i27,
sum(case when behavior_type=4 then 1 else 0 end) as i28,
-- last 1 day
sum(case when behavior_type=1 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=1 then 1 else 0 end) as i29,
sum(case when behavior_type=2 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=1 then 1 else 0 end) as i30,
sum(case when behavior_type=3 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=1 then 1 else 0 end) as i31,
sum(case when behavior_type=4 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=1 then 1 else 0 end) as i32,
-- last 3 days
sum(case when behavior_type=1 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')<4 then 1 else 0 end) as i33,
sum(case when behavior_type=2 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')<4 then 1 else 0 end) as i34,
sum(case when behavior_type=3 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')<4 then 1 else 0 end) as i35,
sum(case when behavior_type=4 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')<4 then 1 else 0 end) as i36
from
(
select distinct item_id,user_id,time,behavior_type
from tianchi_lbs.tianchi_mobile_recommend_train_user
where time<'${label_day}'
)t
group by item_id;
 
-- action counts per item (last 1 day, last 3 days)
drop table if exists ${table_label}_item_features_4;
create table ${table_label}_item_features_4 as
select item_id,
sum(case when behavior_type=1 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=1 then 1 else 0 end) as i37,
sum(case when behavior_type=2 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=1 then 1 else 0 end) as i38,
sum(case when behavior_type=3 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=1 then 1 else 0 end) as i39,
sum(case when behavior_type=4 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=1 then 1 else 0 end) as i40,
sum(case when behavior_type=1 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')<4 then 1 else 0 end) as i41,
sum(case when behavior_type=2 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')<4 then 1 else 0 end) as i42,
sum(case when behavior_type=3 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')<4 then 1 else 0 end) as i43,
sum(case when behavior_type=4 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')<4 then 1 else 0 end) as i44
from tianchi_lbs.tianchi_mobile_recommend_train_user
where time<'${label_day}'
group by item_id;
 
-- purchase conversion of each item, and its ratio to the category-level conversion
drop table if exists ${table_label}_item_features_5;
create table ${table_label}_item_features_5 as
select item_id,r1 as i45,r2 as i46,r3 as i47,
case when cr1>0 then r1/cr1 else 0 end as i48,
case when cr2>0 then r2/cr2 else 0 end as i49,
case when cr3>0 then r3/cr3 else 0 end as i50
from
(
-- r1..r3: clicks / favorites / cart additions per purchase, for each item
select item_id,item_category,
case when sum(case when behavior_type=4 then 1 else 0 end)>0 then sum(case when behavior_type=1 then 1 else 0 end)/sum(case when behavior_type=4 then 1 else 0 end) else 0 end as r1,
case when sum(case when behavior_type=4 then 1 else 0 end)>0 then sum(case when behavior_type=2 then 1 else 0 end)/sum(case when behavior_type=4 then 1 else 0 end) else 0 end as r2,
case when sum(case when behavior_type=4 then 1 else 0 end)>0 then sum(case when behavior_type=3 then 1 else 0 end)/sum(case when behavior_type=4 then 1 else 0 end) else 0 end as r3
from tianchi_lbs.tianchi_mobile_recommend_train_user
where time<'${label_day}'
group by item_id,item_category
) t1
join
(
-- cr1..cr3: the same three ratios computed over the whole category
select item_category,
case when sum(case when behavior_type=4 then 1 else 0 end)>0 then sum(case when behavior_type=1 then 1 else 0 end)/sum(case when behavior_type=4 then 1 else 0 end) else 0 end as cr1,
case when sum(case when behavior_type=4 then 1 else 0 end)>0 then sum(case when behavior_type=2 then 1 else 0 end)/sum(case when behavior_type=4 then 1 else 0 end) else 0 end as cr2,
case when sum(case when behavior_type=4 then 1 else 0 end)>0 then sum(case when behavior_type=3 then 1 else 0 end)/sum(case when behavior_type=4 then 1 else 0 end) else 0 end as cr3
from tianchi_lbs.tianchi_mobile_recommend_train_user
where time<'${label_day}'
group by item_category
) t2
on t1.item_category=t2.item_category;
 
-- item action count / category average for the same action (overall table, per-category statistics)
drop table if exists ${table_label}_item_features_6;
create table ${table_label}_item_features_6 as
select item_id,
case when t2.click>0 then t1.click/t2.click else 0 end as i51,
case when t2.favorite>0 then t1.favorite/t2.favorite else 0 end as i52,
case when t2.cart>0 then t1.cart/t2.cart else 0 end as i53,
case when t2.buy>0 then t1.buy/t2.buy else 0 end as i54
from
(
-- action counts per item
select item_id,item_category,
sum(case when behavior_type=1 then 1 else 0 end) as click,
sum(case when behavior_type=2 then 1 else 0 end) as favorite,
sum(case when behavior_type=3 then 1 else 0 end) as cart,
sum(case when behavior_type=4 then 1 else 0 end) as buy
from tianchi_lbs.tianchi_mobile_recommend_train_user
where time<'${label_day}'
group by item_id,item_category
)t1
join
(
-- per-category averages (avg of a 0/1 indicator over the category's rows)
select item_category,
avg(case when behavior_type=1 then 1 else 0 end) as click,
avg(case when behavior_type=2 then 1 else 0 end) as favorite,
avg(case when behavior_type=3 then 1 else 0 end) as cart,
avg(case when behavior_type=4 then 1 else 0 end) as buy
from tianchi_lbs.tianchi_mobile_recommend_train_user
where time<'${label_day}'
group by item_category
)t2
on t1.item_category=t2.item_category;
 
-- merge the item features. This part changes often: it must be updated whenever new features are added
drop table if exists ${table_label}_item_features;
create table ${table_label}_item_features as
select
   ${table_label}_item_features_1.item_id,i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12,i13,i14,i15,i16,i17,i18,i19,i20,i21,i22,i23,i24
,i25,i26,i27,i28,i29,i30,i31,i32,i33,i34,i35,i36,i37,i38,i39,i40,i41,i42,i43,i44,i45,i46,i47,i48,i49,i50,i51,i52,i53,i54
from
${table_label}_item_features_1
join
${table_label}_item_features_2
on ${table_label}_item_features_1.item_id=${table_label}_item_features_2.item_id
join
${table_label}_item_features_3
on ${table_label}_item_features_1.item_id=${table_label}_item_features_3.item_id
join
${table_label}_item_features_4
on ${table_label}_item_features_1.item_id=${table_label}_item_features_4.item_id
join
${table_label}_item_features_5
on ${table_label}_item_features_1.item_id=${table_label}_item_features_5.item_id
join
${table_label}_item_features_6
on ${table_label}_item_features_1.item_id=${table_label}_item_features_6.item_id;
 
 
------------------------------------------------------------------------------
-- Generate user-item features from the data before label_day
 
-- ran successfully
drop table if exists ${table_label}_ui_features_1;
create table ${table_label}_ui_features_1 as
SELECT
    user_id,item_id,
    -- average number of actions on the item per day
    sum(case when (behavior_type=1 and time<'${label_day}') then 1 else 0 end)/30 as ui1,
    sum(case when (behavior_type=2 and time<'${label_day}') then 1 else 0 end)/30 as ui2,
    sum(case when (behavior_type=3 and time<'${label_day}') then 1 else 0 end)/30 as ui3,
    sum(case when (behavior_type=4 and time<'${label_day}') then 1 else 0 end)/30 as ui4,
    -- actions on the most recent day
    sum(case when (behavior_type=1 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=1) then 1 else 0 end) as ui5,
    sum(case when (behavior_type=2 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=1) then 1 else 0 end) as ui6,
    sum(case when (behavior_type=3 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=1) then 1 else 0 end) as ui7,
    -- actions 2 days back
    sum(case when (behavior_type=1 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=2) then 1 else 0 end) as ui8,
    sum(case when (behavior_type=2 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=2) then 1 else 0 end) as ui9,
    sum(case when (behavior_type=3 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=2) then 1 else 0 end) as ui10,
    sum(case when (behavior_type=4 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=2) then 1 else 0 end) as ui11,
    -- actions 3 days back
    sum(case when (behavior_type=1 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=3) then 1 else 0 end) as ui12,
    sum(case when (behavior_type=2 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=3) then 1 else 0 end) as ui13,
    sum(case when (behavior_type=3 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=3) then 1 else 0 end) as ui14,
    sum(case when (behavior_type=4 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=3) then 1 else 0 end) as ui15,
    -- actions in the last week
    sum(case when (behavior_type=1 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')<8) then 1 else 0 end) as ui16,
    sum(case when (behavior_type=2 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')<8) then 1 else 0 end) as ui17,
    sum(case when (behavior_type=3 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')<8) then 1 else 0 end) as ui18,
    sum(case when (behavior_type=4 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')<8) then 1 else 0 end) as ui19,
    -- hour of the last action on the most recent day
    max(case when (behavior_type=1 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=1) then cast(substr(time,-2,2) as bigint) else 0 end) as ui20,
    max(case when (behavior_type=2 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=1) then cast(substr(time,-2,2) as bigint) else 0 end) as ui21,
    max(case when (behavior_type=3 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-dd hh'),'dd')=1) then cast(substr(time,-2,2) as bigint) else 0 end) as ui22,
    -- hour of the earliest action on the most recent day
    min(case when (behavior_type=1 and datediff(to_date('${label_day}','yyyy-mm-dd hh'),to_date(time,'yyyy-mm-d