
Part 14: Data Analysis Examples

Having covered the preceding material, let's now look at some real-world datasets. For each dataset, we will use the techniques introduced earlier to extract meaningful information from the raw data. The approaches shown here apply to other datasets as well, including your own. This part contains a collection of miscellaneous example datasets that you can use for practice.

The example datasets can be found in the GitHub repository.

I. USA.gov Data from Bitly
In 2011, the URL-shortening service Bitly partnered with the US government website USA.gov to provide an anonymized feed of data gathered from users who created short links ending in .gov or .mil. At that time, in addition to the real-time feed, hourly snapshots were available as downloadable text files. The service has since been shut down, but a copy of the data was preserved for the examples in this part.

Taking one of the hourly snapshot files as an example, each line of the file is JSON (JavaScript Object Notation, a common web data format).
For example, if we read just the first line of a file, it should look something like this:
path = 'datasets/bitly_usagov/example.txt'
open(path).readline()               # output:
'{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11
(KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1,
"tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l":
"orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r":
"http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u":
"http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc":
1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'

Python has both built-in and third-party modules for converting a JSON string into a Python dictionary object. Here, I will use the json module and its loads function to load the downloaded data file line by line:
import json
path = 'datasets/bitly_usagov/example.txt'
records = [json.loads(line) for line in open(path)]
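The list comprehension above leaves the file handle for Python to clean up and assumes every line is valid JSON. As an optional, slightly more defensive sketch (using the same example.txt path), the file can be read with a context manager and blank lines skipped:
import json

def load_records(path):
    # Read newline-delimited JSON, closing the file and ignoring empty lines
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

records = load_records('datasets/bitly_usagov/example.txt')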

Now the records object is a list of Python dictionaries:
records[0]              # output:
{'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
  'c': 'US',
  'nk': 1,
  'tz': 'America/New_York',
  'gr': 'MA',
  'g': 'A6qOVH',
  'h': 'wfLQtf',
  'l': 'orofrog',
  'al': 'en-US,en;q=0.8',
  'hh': '1.usa.gov',
  'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
  'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991',
  't': 1331923247,
  'hc': 1331822918,
  'cy': 'Danvers',
  'll': [42.576698, -70.954903]}

1. Counting Time Zones in Pure Python
Suppose we want to find the time zone (the tz field) that occurs most often in the dataset. There are many ways to get the answer. First, let's extract a list of time zones using a list comprehension:
time_zones = [rec['tz'] for rec in records]     # raises an exception; output:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-22-f3fbbc37f129> in <module>()
----> 1 time_zones = [rec['tz'] for rec in records]
<ipython-input-22-f3fbbc37f129> in <listcomp>(.0)
----> 1 time_zones = [rec['tz'] for rec in records]
KeyError: 'tz'
This is because not all of the records have a time zone field. We can handle this by adding the check if 'tz' in rec at the end of the list comprehension:
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
time_zones[:10]                     # output:
['America/New_York',
  'America/Denver',
  'America/New_York',
  'America/Sao_Paulo',
  'America/New_York',
  'America/New_York',
  'Europe/Warsaw',
  '',
  '',
  '']

Looking at just the first 10 time zones, we can see that some of them are unknown (empty strings). We could filter these out, but we'll leave them in for now. Next, to count the time zones, I'll show two approaches: a harder one (using only the Python standard library) and a simpler one (using pandas). One way to count is to store the counts in a dictionary while iterating over the time zones:
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
Using more advanced tools from the Python standard library, you can write the same thing more concisely:
from collections import defaultdict

def get_counts2(sequence):
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts
I put this logic in a function to make it more reusable. To use it on the time zones, just pass time_zones to it:
counts = get_counts(time_zones)
counts['America/New_York']          # output: 1251
len(time_zones)                     # output: 3440

If we want the top 10 time zones and their counts, we need a little bit of dictionary acrobatics:
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]
We then have:
top_counts(counts)      # output:
[(33, 'America/Sao_Paulo'),
  (35, 'Europe/Madrid'),
  (36, 'Pacific/Honolulu'),
  (37, 'Asia/Tokyo'),
  (74, 'Europe/London'),
  (191, 'America/Denver'),
  (382, 'America/Los_Angeles'),
  (400, 'America/Chicago'),
  (521, ''),
  (1251, 'America/New_York')]
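As an aside (not part of the original walkthrough), the standard library's heapq module can pull out the top n pairs without sorting the entire dictionary:
import heapq

def top_counts2(count_dict, n=10):
    # nlargest avoids sorting every (count, tz) pair when only the top n are needed
    return heapq.nlargest(n, ((count, tz) for tz, count in count_dict.items()))

top_counts2(counts)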
 
If you search the Python standard library, you may find the collections.Counter class, which makes this task even simpler:
from collections import Counter
counts = Counter(time_zones)
counts.most_common(10)              # output:
[('America/New_York', 1251),
  ('', 521),
  ('America/Chicago', 400),
  ('America/Los_Angeles', 382),
  ('America/Denver', 191),
  ('Europe/London', 74),
  ('Asia/Tokyo', 37),
  ('Pacific/Honolulu', 36),
  ('Europe/Madrid', 35),
  ('America/Sao_Paulo', 33)]
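Since Counter accepts any iterable, the filtering and counting steps above can also be collapsed into a single expression:
# One-liner equivalent of building time_zones and then counting it
Counter(rec['tz'] for rec in records if 'tz' in rec).most_common(10)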

2. Counting Time Zones with pandas
Creating a DataFrame from the original set of records is as simple as passing the list of records to pandas.DataFrame:
import pandas as pd
frame = pd.DataFrame(records)
frame.info()            # output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3560 entries, 0 to 3559
Data columns (total 18 columns):
_heartbeat_       120 non-null float64
a                        3440 non-null object
al                       3094 non-null object
c                        2919 non-null object
cy                      2919 non-null object
g                       3440 non-null object
gr                      2919 non-null object
h                       3440 non-null object
hc                     3440 non-null float64
hh                     3440 non-null object
kw                     93 non-null object
l                        3440 non-null object
ll                       2919 non-null object
nk                     3440 non-null float64
r                       3440 non-null object
t                       3440 non-null float64
tz                     3440 non-null object
u                      3440 non-null object
dtypes: float64(4), object(14)
memory usage: 500.7+ KB
frame['tz'][:10]        # output: the first 10 elements of the tz column
0     America/New_York
1         America/Denver
2     America/New_York
3    America/Sao_Paulo
4     America/New_York
5     America/New_York
6           Europe/Warsaw
7
8
9
Name: tz, dtype: object

The output shown here for frame is the summary view, used for large DataFrame objects. We can then use the value_counts method on the Series:
tz_counts = frame['tz'].value_counts()
tz_counts[:10]          # output:
America/New_York             1251
                                               521
America/Chicago                  400
America/Los_Angeles           382
America/Denver                    191
Europe/London                       74
Asia/Tokyo                              37
Pacific/Honolulu                     36
Europe/Madrid                       35
America/Sao_Paulo                33
Name: tz, dtype: int64

We can visualize this data using matplotlib. To do so, we first fill in a substitute value for the unknown or missing time zones in the records. The fillna method can replace missing values (NA), while unknown values (empty strings) can be replaced using boolean array indexing:
clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()
tz_counts[:10]          # output:
America/New_York              1251
Unknown                                521
America/Chicago                   400
America/Los_Angeles            382
America/Denver                     191
Missing                                   120
Europe/London                        74
Asia/Tokyo                               37
Pacific/Honolulu                      36
Europe/Madrid                         35
Name: tz, dtype: int64

At this point, we can make a horizontal bar plot using the seaborn package (see Figure 14-1):
import seaborn as sns
subset = tz_counts[:10]
sns.barplot(y=subset.index, x=subset.values)    # produces Figure 14-1

Figure 14-1: Top time zones in the usa.gov sample data
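If seaborn is not installed, a roughly equivalent horizontal bar chart can be drawn with pandas's built-in matplotlib plotting; this is an optional alternative, not what produced Figure 14-1:
# Horizontal bars, one per time zone, using pandas/matplotlib only
tz_counts[:10].plot.barh()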

The a field contains information about the browser, device, or application used to perform the URL shortening:
frame['a'][1]           # output: 'GoogleMaps/RochesterNY'
frame['a'][50]          # output:
'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'
frame['a'][51][:50]     # first 50 characters of row 51 in column a; output:
'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P9'

Parsing all of the interesting information in these "agent" strings can be a daunting task. One possible strategy is to split off the first token in the string (corresponding roughly to the browser) and make another summary of the user behavior:
results = pd.Series([x.split()[0] for x in frame.a.dropna()])           # extract the first token
results[:5]             # view the result; output:
0                             Mozilla/5.0
1    GoogleMaps/RochesterNY
2                             Mozilla/4.0
3                             Mozilla/5.0
4                             Mozilla/5.0
dtype: object
results.value_counts()[:8]          # count the values; output:
Mozilla/5.0                              2594
Mozilla/4.0                                601
GoogleMaps/RochesterNY       121
Opera/9.80                                  34
TEST_INTERNET_AGENT             24
GoogleProducer                          21
Mozilla/6.0                                    5
BlackBerry8520/5.0.0.681              4
dtype: int64

Now, suppose you want to decompose the time zone counts into Windows and non-Windows users. As a simplification, we'll say that a user is on Windows if the string 'Windows' appears in the agent string. Since some of the agents are missing, we first exclude them from the data:
cframe = frame[frame.a.notnull()]               # keep only rows where column a is not null
Then we compute a value for each row indicating whether the agent string contains 'Windows' (note that str.contains here is the pandas string method, not Python's built-in):
import numpy as np
cframe['os'] = np.where(cframe['a'].str.contains('Windows'), 'Windows', 'Not Windows')
cframe['os'][:5]        # output:
0           Windows
1    Not Windows
2           Windows
3    Not Windows
4           Windows
Name: os, dtype: object
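One caveat: because cframe is a slice of frame, assigning the new 'os' column may trigger pandas's SettingWithCopyWarning in some versions. Taking an explicit copy before adding the column avoids the warning:
cframe = frame[frame.a.notnull()].copy()    # independent copy, safe to add columns to
cframe['os'] = np.where(cframe['a'].str.contains('Windows'), 'Windows', 'Not Windows')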
Now the data can be grouped by the time zone column and this new list of operating systems:
by_tz_os = cframe.groupby(['tz', 'os'])
The group counts, analogous to the value_counts function, can be computed with size, and the result reshaped into a table with unstack:
agg_counts = by_tz_os.size().unstack().fillna(0)
agg_counts[:10]                     # output:
os                                                          Not Windows  Windows
tz
                                                                            245.0         276.0
Africa/Cairo                                                            0.0             3.0
Africa/Casablanca                                                   0.0             1.0
Africa/Ceuta                                                            0.0             2.0
Africa/Johannesburg                                               0.0             1.0
Africa/Lusaka                                                           0.0             1.0
America/Anchorage                                                4.0             1.0
America/Argentina/Buenos_Aires                           1.0             0.0
America/Argentina/Cordoba                                   0.0             1.0
America/Argentina/Mendoza                                  0.0             1.0

Finally, let's select the top overall time zones. To do so, I construct an indirect index array from the row counts in agg_counts:
# Use to sort in ascending order
indexer = agg_counts.sum(1).argsort()    # sum across the rows, then take the argsort
indexer[:10]            # output:
tz
                                                           24
Africa/Cairo                                       20
Africa/Casablanca                              21
Africa/Ceuta                                       92
Africa/Johannesburg                          87
Africa/Lusaka                                      53
America/Anchorage                           54
America/Argentina/Buenos_Aires      57
America/Argentina/Cordoba              26
America/Argentina/Mendoza             55
dtype: int64

I then use take to select the rows in that order, slicing off the last 10 rows (those with the largest totals):
count_subset = agg_counts.take(indexer[-10:])
count_subset            # output:
os                                         Not Windows  Windows
tz
America/Sao_Paulo                             13.0          20.0
Europe/Madrid                                    16.0          19.0
Pacific/Honolulu                                    0.0          36.0
Asia/Tokyo                                             2.0          35.0
Europe/London                                    43.0          31.0
America/Denver                                 132.0          59.0
America/Los_Angeles                        130.0        252.0
America/Chicago                               115.0        285.0
                                                            245.0        276.0
America/New_York                            339.0        912.0

pandas has a convenience method called nlargest that does the same thing:
agg_counts.sum(1).nlargest(10)      # output: sum across the rows and take the 10 largest values
tz
America/New_York            1251.0
                                              521.0
America/Chicago                 400.0
America/Los_Angeles          382.0
America/Denver                   191.0
Europe/London                      74.0
Asia/Tokyo                             37.0
Pacific/Honolulu                    36.0
Europe/Madrid                       35.0
America/Sao_Paulo                33.0
dtype: float64
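The index of the nlargest result can also replace the argsort/take combination; the following selects the same ten rows as count_subset above, just ordered largest-first instead of smallest-first:
# Equivalent row selection via the nlargest index
agg_counts.loc[agg_counts.sum(1).nlargest(10).index]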
Then, as shown in the following code, this can be plotted as a bar chart. I pass an additional argument to seaborn's barplot function to plot a grouped bar chart (see Figure 14-2):
# Rearrange the data for plotting
count_subset = count_subset.stack()
count_subset.name = 'total'
count_subset = count_subset.reset_index()
count_subset[:10]                   # output:
                                  tz                      os        total
0   America/Sao_Paulo   Not Windows         13.0
1   America/Sao_Paulo          Windows         20.0
2         Europe/Madrid    Not Windows         16.0
3         Europe/Madrid           Windows         19.0
4       Pacific/Honolulu    Not Windows           0.0
5       Pacific/Honolulu           Windows         36.0
6                Asia/Tokyo    Not Windows           2.0
7                Asia/Tokyo           Windows          35.0
8         Europe/London    Not Windows          43.0
9         Europe/London           Windows          31.0
sns.barplot(x='total', y='tz', hue='os', data=count_subset)             # produces Figure 14-2

Figure 14-2: Top time zones by Windows and non-Windows users

This plot doesn't make it easy to see the relative percentage of Windows users in the smaller groups, so let's normalize the group percentages to sum to 1:
def norm_total(group):
    group['normed_total'] = group.total / group.total.sum()
    return group

results = count_subset.groupby('tz').apply(norm_total)
Then plot again (see Figure 14-3):
sns.barplot(x='normed_total', y='tz', hue='os', data=results)           # produces Figure 14-3; bars split by the os column

Figure 14-3: Percentage of Windows and non-Windows users in the top time zones

We could also compute the normalized sum more efficiently with groupby's transform method:
g = count_subset.groupby('tz')
results2 = count_subset.total / g.total.transform('sum')
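The transform call returns a Series aligned with count_subset, so it can be attached directly as a column; the column name normed_total2 below is just for illustration and should match the apply-based normed_total computed above:
count_subset['normed_total2'] = count_subset.total / g.total.transform('sum')
count_subset[['tz', 'os', 'normed_total2']].head()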

II. The MovieLens 1M Dataset
GroupLens Research (http://www.grouplens.org/node/73) collected movie rating data provided by MovieLens users from the late 1990s through the early 2000s. The data includes movie ratings, movie metadata (genres and release year), and demographic information about the users (age, zip code, gender, and occupation). Recommendation systems based on machine learning algorithms often make use of data like this. While we won't explore machine learning techniques in detail here, I will show you how to slice and dice this kind of data to suit real needs.

The MovieLens 1M dataset contains one million ratings of 4,000 movies collected from 6,000 users. It is spread across three tables: ratings, user information, and movie information. After extracting the data from the zip file, each table can be loaded into a pandas DataFrame object using pandas.read_table:
import pandas as pd

# Make display smaller
pd.options.display.max_rows = 10

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('datasets/movielens/users.dat', sep='::',
                      header=None, names=unames)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('datasets/movielens/ratings.dat', sep='::',
                        header=None, names=rnames)

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
                       header=None, names=mnames)
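Depending on your pandas version, these calls may emit a ParserWarning, because the multi-character '::' separator forces a fallback to the Python parsing engine, and movies.dat contains non-ASCII characters that can fail to decode under a UTF-8 default. A hedged variant that makes both choices explicit (shown for movies only) would be:
movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
                       header=None, names=mnames,
                       engine='python',        # silences the separator warning
                       encoding='latin-1')     # only needed if UTF-8 decoding fails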

By looking at the first few rows of each DataFrame with Python's slicing syntax, we can verify that the data loaded successfully:
users[:5]               # same as users.head(); output:
    user_id    gender  age  occupation           zip
0          1              F      1                 10      48067
1          2             M   56                  16     70072
2          3             M   25                  15     55117
3          4             M   45                    7     02460
4          5             M   25                  20     55455
ratings[:5]             # output:
      user_id    movie_id    rating       timestamp
0             1           1193           5      978300760
1             1             661           3      978302109
2             1             914           3      978301968
3             1           3408           4      978300275
4             1           2355           5      978824291
movies.head()           # output:
       movie_id                                                        title                                             genres
0                 1                                    Toy Story (1995)       Animation|Children's|Comedy
1                 2                                       Jumanji (1995)         Adventure|Children's|Fantasy
2                 3                     Grumpier Old Men (1995)                         Comedy|Romance
3                 4                        Waiting to Exhale (1995)                             Comedy|Drama
4                 5            Father of the Bride Part II (1995)                                         Comedy
ratings                 # output:
                user_id    movie_id    rating         timestamp
0                       1           1193           5        978300760
1                       1             661           3        978302109
2                       1             914           3        978301968
3                       1           3408           4        978300275
4                       1           2355           5        978824291
...                     ...                 ...           ...        ...
1000204     6040           1091           1        956716541
1000205     6040           1094           5        956704887
1000206     6040             562           5        956704746
1000207     6040           1096           4        956715648
1000208     6040           1097           4        956715569
[1000209 rows x 4 columns]

Note that ages and occupations are given as coded integers; refer to the dataset's README file for what the codes mean. Analyzing data spread across three tables is not a simple task. Suppose, for example, that we want to compute the mean rating of a particular movie by gender and age; this is much easier to do with all of the data merged into a single table. Using pandas's merge function, we first merge ratings with users and then merge that result with movies. pandas infers which columns to use as the merge (join) keys based on overlapping column names:
data = pd.merge(pd.merge(ratings, users), movies)
data        # output:
                     user_id    movie_id    rating     timestamp   gender  age   occupation          zip       \
0                            1          1193            5    978300760             F      1                 10      48067
1                            2          1193            5    978298413            M   56                  16     70072
2                          12          1193            4    978220179            M   25                  12     32793
3                          15          1193            4    978199279            M   25                    7     22903
4                          17          1193            5    978158471            M   50                    1     95350
...                          ...               ...             ...            ...                     ...    ...                   ...           ...
1000204          5949          2198            5    958846401            M   18                  17      47901
1000205          5675          2703            3    976029116            M   35                  14      30030
1000206          5780          2845            1    958153068            M   18                  17      92886
1000207          5851          3607            5    957756608             F   18                   20      55410
1000208          5938          2909            4    957273353            M   25                    1      35401
                                                                                       title                                 genres
0                         One Flew Over the Cuckoo's Nest (1975)                                 Drama
1                         One Flew Over the Cuckoo's Nest (1975)                                 Drama
2                         One Flew Over the Cuckoo's Nest (1975)                                 Drama
3                         One Flew Over the Cuckoo's Nest (1975)                                 Drama
4                         One Flew Over the Cuckoo's Nest (1975)                                 Drama
...                                                                                 ...                                  ...
1000204                                               Modulations (1998)                      Documentary
1000205                                           Broken Vessels (1998)                                 Drama
1000206                                                 White Boys (1999)                                 Drama
1000207                                         One Little Indian (1973)    Comedy|Drama|Western
1000208       Five Wives, Three Secretaries and Me (1998)                      Documentary
[1000209 rows x 10 columns]
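The merge above relies on pandas inferring the join keys from overlapping column names (user_id between ratings and users, then movie_id against movies). Spelling the keys out explicitly is equivalent and makes the intent clearer:
data = pd.merge(pd.merge(ratings, users, on='user_id'), movies, on='movie_id')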
data.iloc[0]            # output:
user_id                                                                             1
movie_id                                                                    1193
rating                                                                             5