
Data Analysis: Study Notes on "Python for Data Analysis" [02]

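The book Python for Data Analysis was written by Wes McKinney; the Chinese edition is titled 《利用Python進行資料分析》. These notes record my progress through it. Some of the methods differ from the book's because I implemented them in the way I am most familiar with.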

 

Second example: MovieLens 1M Data Set

 

Introduction: GroupLens Research provides a series of movie rating data sets collected from MovieLens users in the late 1990s and early 2000s.

 

Data location: http://files.grouplens.org/datasets/movielens/ml-1m.zip

 

Preparation: import pandas and matplotlib

import pandas as pd
import matplotlib.pyplot as plt
fig,ax=plt.subplots()

 

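The archive contains three .dat files: movies, users, and ratings. Each can be read into a DataFrame with the pandas read_table() function, using the names parameter to set the column names of each table. Because the separator '::' is longer than one character, the parser is set to engine='python'.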

movies=pd.read_table(r"...\movies.dat", sep='::', engine='python', names=["movieId", "title", "genre"])
users=pd.read_table(r"...\users.dat", sep='::', engine='python', names=["userId", "gender", "age", "occupation", "zip"])
ratings=pd.read_table(r"...\ratings.dat", sep='::', engine='python', names=["userId", "movieId", "rating", "timestamp"])
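To try the parsing options without downloading the archive, here is a minimal sketch that reads a few made-up rows in the same "::"-delimited layout from an in-memory string. The rows and the name ratings_sample are illustrative assumptions, not data from the real files.

```python
import io
import pandas as pd

# Made-up rows in the same "::"-delimited layout as ratings.dat (not real data).
sample = io.StringIO(
    "1::10::5::1000\n"
    "1::20::3::1010\n"
    "2::10::4::1020\n"
)

# A separator longer than one character is treated as a regular expression,
# which the default C parser does not support, hence engine="python".
ratings_sample = pd.read_csv(
    sample,
    sep="::",
    engine="python",
    header=None,  # the files have no header row
    names=["userId", "movieId", "rating", "timestamp"],
)
print(ratings_sample)
```

If reading the real movies.dat raises a UnicodeDecodeError, passing encoding="latin-1" is a commonly used workaround; I have not verified it against every copy of the file.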

 

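Next, the three tables are merged into one to make the analysis easier. movies and ratings are joined on the movieId column first, and the result is then joined with users on the userId column.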

data=pd.merge(pd.merge(movies, ratings, on="movieId", how="inner"), users, on="userId", how="inner")
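The two-step join can be sketched on toy tables. The frames below (movies_toy, ratings_toy, users_toy) are hypothetical stand-ins, and the validate argument is an optional check I added, not something the original code uses.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
movies_toy = pd.DataFrame({"movieId": [1, 2], "title": ["A", "B"]})
ratings_toy = pd.DataFrame(
    {"userId": [10, 10, 11], "movieId": [1, 2, 1], "rating": [5, 3, 4]}
)
users_toy = pd.DataFrame({"userId": [10, 11], "gender": ["F", "M"]})

# validate raises MergeError if the key relationship is not what we expect,
# e.g. if a movieId appeared twice in movies_toy.
merged_toy = (
    movies_toy
    .merge(ratings_toy, on="movieId", how="inner", validate="one_to_many")
    .merge(users_toy, on="userId", how="inner", validate="many_to_one")
)
print(merged_toy)
```

With inner joins, a rating survives only if both its movie and its user exist in the other tables, so the merged table has at most as many rows as ratings.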

 

The first 5 rows of the merged table look like this:

   movieId                                      title  \
0        1                           Toy Story (1995)   
1       48                          Pocahontas (1995)   
2      150                           Apollo 13 (1995)   
3      260  Star Wars: Episode IV - A New Hope (1977)   
4      527                    Schindler's List (1993)   

                                  genre  userId  rating  timestamp gender  \
0           Animation|Children's|Comedy       1       5  978824268      F   
1  Animation|Children's|Musical|Romance       1       5  978824351      F   
2                                 Drama       1       5  978301777      F   
3       Action|Adventure|Fantasy|Sci-Fi       1       4  978300760      F   
4                             Drama|War       1       5  978824195      F   

   age  occupation    zip  
0    1          10  48067  
1    1          10  48067  
2    1          10  48067  
3    1          10  48067  
4    1          10  48067 

 

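In the output above, the age column looks anomalous at first sight (1 year old?). According to the data set's README, age is stored as a bracket code rather than an actual age, and 1 stands for "Under 18". Here I drop the rows in data whose age is below 18 or above 100, which in effect excludes the under-18 group.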

data=data[(data["age"]>=18) & (data["age"]<=100)]
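To make the age codes easier to read, here is a small sketch that maps them to labels before filtering. The AGE_LABELS mapping is taken from the ml-1m README as I remember it and is not part of the original notes, so please check it against your copy; users_toy is a hypothetical table.

```python
import pandas as pd

# Assumption: age bracket codes as documented in the ml-1m README.
AGE_LABELS = {
    1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
    45: "45-49", 50: "50-55", 56: "56+",
}

users_toy = pd.DataFrame({"userId": [1, 2, 3], "age": [1, 25, 56]})
users_toy["age_group"] = users_toy["age"].map(AGE_LABELS)

# Same boolean filter as in the text: keeps codes from 18 up to 100.
kept = users_toy[(users_toy["age"] >= 18) & (users_toy["age"] <= 100)]
print(kept)
```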

 

Let's look at the average rating of each movie, grouped by gender:

by_gender_movie_rating=pd.pivot_table(data, values="rating", index="title", columns="gender", aggfunc="mean")

 

The pivot table shows the average rating that women (F) and men (M) gave each movie:

gender                                    F         M
title                                                
$1,000,000 Duck (1971)             3.375000  2.761905
'Night Mother (1986)               3.400000  3.424242
'Til There Was You (1997)          2.694444  2.571429
'burbs, The (1989)                 2.793478  2.947368
...And Justice for All (1979)      3.828571  3.693252
1-900 (1994)                       2.000000  3.000000
10 Things I Hate About You (1999)  3.593137  3.303855
101 Dalmatians (1961)              3.789474  3.512535
101 Dalmatians (1996)              3.210526  2.928934
12 Angry Men (1957)                4.229008  4.318376
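What pivot_table does here can be sketched on a toy frame: it groups by title and gender, averages the ratings, and spreads gender across the columns. The toy data and the names pivot and unstacked are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B"],
    "gender": ["F", "M", "M", "F", "F"],
    "rating": [4, 2, 4, 5, 3],
})

# One row per title, one column per gender, mean rating in each cell.
pivot = pd.pivot_table(toy, values="rating", index="title",
                       columns="gender", aggfunc="mean")

# The same table built step by step: group, average, then unstack gender.
unstacked = toy.groupby(["title", "gender"])["rating"].mean().unstack("gender")

print(pivot)
print(unstacked)
```

Movie B has no male ratings in the toy data, so its M cell is NaN in both versions.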

 

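However, if too few people rated a movie, its average rating is unreliable and the movie should not be part of the sample. We therefore count the number of ratings for each movie and keep only the movies with at least 250 ratings.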

movie_counts=data.groupby('title')['title'].count()

movies_select=movie_counts.index[movie_counts.values>=250]

 

Then we filter the pivot table above with the selected movies, movies_select, keeping every row that meets the condition:

by_gender_movie_rating=by_gender_movie_rating.loc[movies_select]

 

Let's see what the pivot table looks like now:

gender                                      F         M
title                                                  
'burbs, The (1989)                   2.793478  2.947368
10 Things I Hate About You (1999)    3.593137  3.303855
101 Dalmatians (1961)                3.789474  3.512535
101 Dalmatians (1996)                3.210526  2.928934
12 Angry Men (1957)                  4.229008  4.318376
13th Warrior, The (1999)             3.084746  3.172185
2 Days in the Valley (1996)          3.477273  3.246862
20,000 Leagues Under the Sea (1954)  3.648936  3.723404
2001: A Space Odyssey (1968)         3.829341  4.125931
2010 (1984)                          3.456522  3.418269
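The count-then-filter step can be sketched on toy data. MIN_RATINGS is a hypothetical threshold chosen to suit the tiny example (the text uses 250), and size() is used in place of selecting the title column and calling count().

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "rating": [4, 5, 3, 2, 4, 4],
})

MIN_RATINGS = 2  # hypothetical threshold for the toy data

# Number of rating rows per title.
counts = toy.groupby("title").size()

# Index of titles that reach the threshold.
popular = counts.index[counts >= MIN_RATINGS]

# Any table indexed by title can now be restricted with .loc.
mean_by_title = toy.groupby("title")["rating"].mean()
filtered = mean_by_title.loc[popular]
print(filtered)
```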

 

Now, to find the 10 movies that women rated highest, we sort by the F column in descending order; the first 10 rows are the answer:

top_female_rating=by_gender_movie_rating.sort_values(by='F', ascending=False)

 

Here is the result:

gender                                                     F         M
title                                                                 
Close Shave, A (1995)                               4.672619  4.479121
Wrong Trousers, The (1993)                          4.611607  4.485390
Wallace & Gromit: The Best of Aardman Animation...  4.587629  4.413043
Grand Day Out, A (1992)                             4.581967  4.288820
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)       4.575221  4.476744
Schindler's List (1993)                             4.563333  4.493325
To Kill a Mockingbird (1962)                        4.539792  4.395387
Shawshank Redemption, The (1994)                    4.539088  4.560944
Creature Comforts (1990)                            4.514286  4.287958
Usual Suspects, The (1995)                          4.512255  4.520864

 

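Next, we want the 10 movies on which men and women disagree most. First add a difference column, diff, to the pivot table, then sort by that column. Because diff is an absolute value, it shows the size of the gap but not which gender rated the movie higher.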

import numpy as np
by_gender_movie_rating["diff"]=np.abs(by_gender_movie_rating["F"]-by_gender_movie_rating["M"])

top_diff_rating=by_gender_movie_rating.sort_values(by='diff', ascending=False)

 

Here are the first 10 rows of top_diff_rating:

gender                                         F         M      diff
title                                                               
Dirty Dancing (1987)                    3.762590  2.961929  0.800661
Good, The Bad and The Ugly, The (1966)  3.484536  4.223776  0.739240
Jumpin' Jack Flash (1986)               3.269231  2.582707  0.686524
Kentucky Fried Movie, The (1977)        2.875000  3.555970  0.680970
Dumb & Dumber (1994)                    2.700000  3.318275  0.618275
Hidden, The (1987)                      3.137931  3.744094  0.606163
Cable Guy, The (1996)                   2.280488  2.878472  0.597984
Grease (1978)                           3.958955  3.376673  0.582282
Rocky III (1982)                        2.361702  2.939828  0.578126
Evil Dead II (Dead By Dawn) (1987)      3.328767  3.900000  0.571233
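If the direction of the gap matters, one option is to keep a signed difference and sort by its absolute value. This is a sketch on made-up numbers; the column name signed_diff is my own and does not appear in the original code.

```python
import pandas as pd

toy = pd.DataFrame(
    {"F": [3.8, 3.5, 4.2], "M": [3.0, 4.2, 4.1]},
    index=pd.Index(["X", "Y", "Z"], name="title"),
)

# Positive means men rated the movie higher, negative means women did.
toy["signed_diff"] = toy["M"] - toy["F"]

# Order rows by the size of the gap while keeping the sign in the table.
order = toy["signed_diff"].abs().sort_values(ascending=False).index
ranked = toy.loc[order]
print(ranked)
```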

 

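To find the 10 movies that divide viewers most, regardless of gender, we compute the standard deviation of each movie's ratings, keep the movies with at least 250 ratings, and sort by standard deviation in descending order.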

movie_rating_std=data.groupby('title')['rating'].std()

movie_rating_std=movie_rating_std.loc[movies_select]

top_rating_std=movie_rating_std.sort_values(ascending=False)

 

The result:

title
Dumb & Dumber (1994)                           1.324767
Blair Witch Project, The (1999)                1.319496
Natural Born Killers (1994)                    1.305525
Tank Girl (1995)                               1.278513
Rocky Horror Picture Show, The (1975)          1.259985
Eyes Wide Shut (1999)                          1.254972
Fear and Loathing in Las Vegas (1998)          1.247835
Evita (1996)                                   1.247072
Hellraiser (1987)                              1.243238
South Park: Bigger, Longer and Uncut (1999)    1.237987
Name: rating, dtype: float64
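A small sketch of why standard deviation measures disagreement: two hypothetical movies with similar means but very different spreads. Note that pandas std() uses the sample formula (ddof=1) by default.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Polarizing"] * 4 + ["Consensus"] * 4,
    "rating": [1, 5, 1, 5, 3, 3, 4, 3],
})

# Sample standard deviation (ddof=1) of the ratings for each title.
std_by_title = toy.groupby("title")["rating"].std()

# Largest spread first: these are the most divisive titles.
ranked = std_by_title.sort_values(ascending=False)
print(ranked)
```

The "Polarizing" movie has a mean of 3.0 and the "Consensus" movie a mean of 3.25, yet their spreads differ by a factor of more than four.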

 

This is the end of the analysis in the book. What follows is some analysis I added myself:

 

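To see whether the 10 movies with the highest overall rating show a large gap between men and women, we first add an overall average rating column to the pivot table by_gender_movie_rating above, then sort by that column.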

movie_rating=data.groupby('title')['rating'].mean()  # compute the overall mean rating of each movie
movie_rating=movie_rating.loc[movies_select]  # keep the movies with at least 250 ratings

by_gender_movie_rating['total']=movie_rating.values  # add the overall mean rating to the pivot table as a column

top_rating=by_gender_movie_rating.sort_values(by='total', ascending=False)  # sort by overall mean rating

 

The first 10 rows of top_rating:

gender                                                     F         M  \
title                                                                    
Seven Samurai (The Magnificent Seven) (Shichini...  4.471154  4.580392   
Shawshank Redemption, The (1994)                    4.539088  4.560944   
Close Shave, A (1995)                               4.672619  4.479121   
Godfather, The (1972)                               4.319829  4.583186   
Wrong Trousers, The (1993)                          4.611607  4.485390   
Usual Suspects, The (1995)                          4.512255  4.520864   
Schindler's List (1993)                             4.563333  4.493325   
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)       4.575221  4.476744   
Raiders of the Lost Ark (1981)                      4.341727  4.529474   
Rear Window (1954)                                  4.475524  4.482480   

gender                                                  diff     total  
title                                                                   
Seven Samurai (The Magnificent Seven) (Shichini...  0.109238  4.561889  
Shawshank Redemption, The (1994)                    0.021857  4.554791  
Close Shave, A (1995)                               0.193498  4.531300  
Godfather, The (1972)                               0.263357  4.526267  
Wrong Trousers, The (1993)                          0.126218  4.519048  
Usual Suspects, The (1995)                          0.008609  4.518857  
Schindler's List (1993)                             0.070008  4.512011  
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)       0.098477  4.501094  
Raiders of the Lost Ark (1981)                      0.187748  4.486675  
Rear Window (1954)                                  0.006955  4.480545
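Assigning movie_rating.values copies the numbers by position, which works in the code above because both objects were selected with the same movies_select and so share the same row order. The sketch below, on hypothetical data, shows what would happen if the orders differed and how assigning the Series itself aligns on the title labels instead.

```python
import pandas as pd

pivot_toy = pd.DataFrame(
    {"F": [4.0, 3.0], "M": [3.5, 3.2]},
    index=pd.Index(["B", "A"], name="title"),
)
# Overall means stored in a different row order than pivot_toy.
overall = pd.Series({"A": 3.1, "B": 3.8}, name="rating")

# Assigning the Series aligns on the index labels (title).
pivot_toy["total_by_label"] = overall

# Assigning the raw array ignores labels and copies values in order.
pivot_toy["total_by_position"] = overall.values

print(pivot_toy)
```

In the toy output, row B gets 3.8 by label but 3.1 by position, so the label-based form is the safer habit whenever the two indexes might not be in the same order.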

 

Plot the comparison as a bar chart:

ax.bar(range(10), top_rating['F'][:10], width=-0.3, label='Female', align='edge')
ax.bar([i+0.3 for i in range(10)], top_rating['M'][:10], width=-0.3, label='Male', align='edge')
ax.set_xticks(range(10))
ax.set_ylim(4,5)
ax.set_xticklabels(top_rating[:10].index.values, rotation=90)
ax.legend()

plt.show()

The chart shows that men rate The Godfather noticeably higher than women do (4.58 versus 4.32).
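An alternative way to lay out side-by-side bars is to offset each series by half a bar width around the tick position. This is only a sketch: the titles and values are made up, and the Agg backend is selected so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

titles = ["Movie A", "Movie B", "Movie C"]  # hypothetical labels
female = [4.5, 4.3, 4.6]                    # made-up mean ratings
male = [4.4, 4.6, 4.5]

x = np.arange(len(titles))
width = 0.3

fig, ax = plt.subplots()
# Shift each series half a bar to either side of the tick.
ax.bar(x - width / 2, female, width=width, label="Female")
ax.bar(x + width / 2, male, width=width, label="Male")
ax.set_xticks(x)
ax.set_xticklabels(titles, rotation=90)
ax.set_ylim(4, 5)  # truncated axis: gaps look larger than they are
ax.legend()
fig.tight_layout()
```

Because the y axis starts at 4, a difference of a few tenths of a point fills a large part of the chart, which is worth keeping in mind when reading it.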

 

Now let's look at the 10 favorite movies of each age group.

 

First, bin the ages into groups and add the age group to data as a new column:

age_range=pd.cut(data['age'], 3, labels=['Young', 'Middle', 'Old'])
data['age_range']=age_range

 

The first 10 rows of data now look like this:

    movieId                                              title  \
53        1                                   Toy Story (1995)   
54       17                       Sense and Sensibility (1995)   
55       34                                        Babe (1995)   
56       48                                  Pocahontas (1995)   
57      199  Umbrellas of Cherbourg, The (Parapluies de Che...   
58      266                         Legends of the Fall (1994)   
59      296                                Pulp Fiction (1994)   
60      364                              Lion King, The (1994)   
61      368                                    Maverick (1994)   
62      377                                       Speed (1994)   

                                   genre  userId  rating  timestamp gender  \
53           Animation|Children's|Comedy       6       4  978237008      F   
54                         Drama|Romance       6       4  978236383      F   
55               Children's|Comedy|Drama       6       4  978237444      F   
56  Animation|Children's|Musical|Romance       6       5  978237570      F   
57                         Drama|Musical       6       5  978237570      F   
58             Drama|Romance|War|Western       6       4  978237909      F   
59                           Crime|Drama       6       2  978237379      F   
60          Animation|Children's|Musical       6       4  978237570      F   
61                 Action|Comedy|Western       6       4  978237909      F   
62               Action|Romance|Thriller       6       3  978236383      F   

    age  occupation    zip age_range  
53   50           9  55117       Old  
54   50           9  55117       Old  
55   50           9  55117       Old  
56   50           9  55117       Old  
57   50           9  55117       Old  
58   50           9  55117       Old  
59   50           9  55117       Old  
60   50           9  55117       Old  
61   50           9  55117       Old  
62   50           9  55117       Old 
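pd.cut with an integer bin count splits the observed range into equal-width intervals, so it helps to check where the edges fall. The sketch assumes that, after the filter, the age column holds only the README bracket codes 18, 25, 35, 45, 50 and 56; the explicit edges in the second call are my own hypothetical choice, not the author's.

```python
import pandas as pd

# Assumption: age codes that remain after dropping the under-18 group.
age_codes = pd.Series([18, 25, 35, 45, 50, 56])

# Three equal-width bins over [18, 56]; retbins returns the computed edges.
age_range, edges = pd.cut(
    age_codes, 3, labels=["Young", "Middle", "Old"], retbins=True
)
print(edges)
print(age_range.tolist())

# Alternative: state the edges explicitly so the groups do not
# depend on the minimum and maximum present in the data.
explicit = pd.cut(age_codes, bins=[17, 34, 49, 100],
                  labels=["Young", "Middle", "Old"])
print(explicit.tolist())
```

With equal-width bins, only code 35 lands in Middle, while 45, 50 and 56 all land in Old; explicit edges make that kind of choice visible.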

 

Then build a pivot table with the age groups as columns:

by_age_movie_rating=pd.pivot_table(data, values="rating", index="title", columns="age_range", aggfunc="mean")

 

The pivot table by_age_movie_rating shows the average rating each age group gave each movie:

age_range                            Middle       Old     Young
title                                                          
$1,000,000 Duck (1971)             3.133333  2.600000  3.058824
'Night Mother (1986)               2.904762  3.777778  3.551724
'Til There Was You (1997)          2.900000  2.500000  2.625000
'burbs, The (1989)                 2.818182  2.951220  2.912195
...And Justice for All (1979)      3.657143  3.809524  3.692308
1-900 (1994)                            NaN  3.000000  2.000000
10 Things I Hate About You (1999)  3.102941  3.476190  3.424125
101 Dalmatians (1961)              3.826087  3.692308  3.488746
101 Dalmatians (1996)              3.279570  3.460317  2.764368
12 Angry Men (1957)                4.358333  4.268156  4.293333

 

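The NaN entries above presumably mean that nobody in that age group rated the movie. We replace them with 0 and, as before, keep only the movies with at least 250 ratings: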

by_age_movie_rating=by_age_movie_rating.fillna(0)
by_age_movie_rating=by_age_movie_rating.loc[movies_select]
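Filling with 0 is one choice; leaving the NaN in place is another, since sort_values puts NaN last by default. The sketch below compares the two on made-up data and shows where they differ.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"Young": [4.5, np.nan, 3.0], "Old": [4.0, 3.0, np.nan]},
    index=pd.Index(["A", "B", "C"], name="title"),
)

# Descending sort: NaN rows go to the end, same order as after fillna(0).
sorted_with_nan = toy.sort_values(by="Young", ascending=False)
sorted_filled = toy.fillna(0).sort_values(by="Young", ascending=False)

# Ascending sort: after fillna(0) the unrated movie looks like the worst one,
# while with NaN kept it still goes last.
worst_filled = toy.fillna(0).sort_values(by="Young").index[0]
worst_with_nan = toy.sort_values(by="Young").index[0]

print(sorted_with_nan)
print(worst_filled, worst_with_nan)
```

For a top-10 list both approaches give the same ranking; the difference matters when looking at the lowest-rated movies or averaging across columns.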

 

Then we sort by the Young column to see the 10 movies that young viewers rated highest:

top_young_movie_rating=by_age_movie_rating.sort_values(by='Young', ascending=False)

 

The result:

age_range                                             Middle       Old  \
title                                                                    
Shawshank Redemption, The (1994)                    4.487500  4.423690   
Usual Suspects, The (1995)                          4.390879  4.319231   
Seven Samurai (The Magnificent Seven) (Shichini...  4.532895  4.581006   
Godfather, The (1972)                               4.541935  4.452290   
Close Shave, A (1995)                               4.450704  4.577465   
Star Wars: Episode IV - A New Hope (1977)           4.354633  4.386760   
Raiders of the Lost Ark (1981)                      4.475538  4.414737   
Wrong Trousers, The (1993)                          4.443850  4.663866   
Rear Window (1954)                                  4.479245  4.461818   
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)       4.611570  4.432624   

age_range                                              Young  
title                                                         
Shawshank Redemption, The (1994)                    4.617735  
Usual Suspects, The (1995)                          4.595943  
Seven Samurai (The Magnificent Seven) (Shichini...  4.565371  
Godfather, The (1972)                               4.552921  
Close Shave, A (1995)                               4.551220  
Star Wars: Episode IV - A New Hope (1977)           4.524260  
Raiders of the Lost Ark (1981)                      4.514109  
Wrong Trousers, The (1993)                          4.513109  
Rear Window (1954)                                  4.491803  
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)       4.482051

First place goes to The Shawshank Redemption.

 

In the same way, we find that the top-rated movie in the Old group is Wrong Trousers, The (1993), and the top-rated movie in the Middle group is Sunset Blvd. (a.k.a. Sunset Boulevard) (1950).
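Instead of sorting once per column, the top title of every age group can be read off in one step with idxmax. This is a sketch on a hypothetical table shaped like by_age_movie_rating; the film names and values are made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Young": [4.6, 4.5, 4.4],
        "Middle": [4.4, 4.6, 4.5],
        "Old": [4.4, 4.5, 4.7],
    },
    index=pd.Index(["Film A", "Film B", "Film C"], name="title"),
)

# For each column, the index label (title) of its largest value.
best = toy.idxmax()
print(best)

# The n highest-rated titles for a single group.
top2_old = toy["Old"].nlargest(2)
print(top2_old)
```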