
Cris's Python Data Analysis Notes 06: Common Data Preprocessing with Pandas


1. Sorting by a specified column in Pandas

import pandas as pd

'''
    sort_values sorts by the given column. With inplace=True the original DataFrame is sorted in place;
    otherwise a new, sorted DataFrame is returned. NaN marks a missing value and is placed last by default.
    Sorting is ascending by default; pass ascending=False to reverse the order.
'''
data = pd.read_csv('food_info.csv')
print(data['Sodium_(mg)'])
data.sort_values('Sodium_(mg)', inplace=True)
print(data['Sodium_(mg)'])
data.sort_values('Sodium_(mg)', inplace=True, ascending=False)
print(data['Sodium_(mg)'])
0        643.0
1        659.0
2          2.0
3       1146.0
4        560.0
5        629.0
6        842.0
7        690.0
8        644.0
9        700.0
10       604.0
11       364.0
12       344.0
13       372.0
14       308.0
15       406.0
16       365.0
17       812.0
18       917.0
19       800.0
20       600.0
21       819.0
22       714.0
23       800.0
24       600.0
25       627.0
26       710.0
27       619.0
28       682.0
29       628.0
         ...  
8588       2.0
8589       2.0
8590       7.0
8591     564.0
8592     464.0
8593     490.0
8594       1.0
8595     199.0
8596     297.0
8597      16.0
8598     486.0
8599       0.0
8600       2.0
8601    1297.0
8602    1435.0
8603    2838.0
8604      10.0
8605       2.0
8606      12.0
8607       0.0
8608    3326.0
8609    1765.0
8610    3750.0
8611      29.0
8612      58.0
8613    4450.0
8614     667.0
8615      58.0
8616      70.0
8617      68.0
Name: Sodium_(mg), Length: 8618, dtype: float64
760     0.0
758     0.0
405     0.0
761     0.0
2269    0.0
763     0.0
764     0.0
770     0.0
774     0.0
396     0.0
395     0.0
6827    0.0
394     0.0
393     0.0
391     0.0
390     0.0
787     0.0
788     0.0
2270    0.0
2231    0.0
407     0.0
748     0.0
409     0.0
747     0.0
702     0.0
703     0.0
704     0.0
705     0.0
706     0.0
707     0.0
       ... 
8153    NaN
8155    NaN
8156    NaN
8157    NaN
8158    NaN
8159    NaN
8160    NaN
8161    NaN
8163    NaN
8164    NaN
8165    NaN
8167    NaN
8169    NaN
8170    NaN
8172    NaN
8173    NaN
8174    NaN
8175    NaN
8176    NaN
8177    NaN
8178    NaN
8179    NaN
8180    NaN
8181    NaN
8183    NaN
8184    NaN
8185    NaN
8195    NaN
8251    NaN
8267    NaN
Name: Sodium_(mg), Length: 8618, dtype: float64
276     38758.0
5814    27360.0
6192    26050.0
1242    26000.0
1245    24000.0
1243    24000.0
1244    23875.0
292     17000.0
1254    11588.0
5811    10600.0
8575     9690.0
291      8068.0
1249     8031.0
5812     7893.0
1292     7851.0
293      7203.0
4472     7027.0
4836     6820.0
1261     6580.0
3747     6008.0
1266     5730.0
4835     5586.0
4834     5493.0
1263     5356.0
1553     5203.0
1552     5053.0
1251     4957.0
1257     4843.0
294      4616.0
8613     4450.0
         ...   
8153        NaN
8155        NaN
8156        NaN
8157        NaN
8158        NaN
8159        NaN
8160        NaN
8161        NaN
8163        NaN
8164        NaN
8165        NaN
8167        NaN
8169        NaN
8170        NaN
8172        NaN
8173        NaN
8174        NaN
8175        NaN
8176        NaN
8177        NaN
8178        NaN
8179        NaN
8180        NaN
8181        NaN
8183        NaN
8184        NaN
8185        NaN
8195        NaN
8251        NaN
8267        NaN
Name: Sodium_(mg), Length: 8618, dtype: float64
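
To make the inplace and NaN behavior described above easier to see, here is a minimal sketch on a made-up four-row DataFrame (the values are hypothetical, not taken from food_info.csv). It assumes only the documented `na_position` parameter of `sort_values`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Sodium_(mg) column
toy = pd.DataFrame({'Sodium_(mg)': [643.0, np.nan, 2.0, 1146.0]})

# Without inplace=True the original is untouched and a sorted copy is returned
sorted_desc = toy.sort_values('Sodium_(mg)', ascending=False)
print(sorted_desc.index.tolist())   # [3, 0, 2, 1]: the NaN row still comes last

# na_position='first' moves the missing values to the top instead
nan_first = toy.sort_values('Sodium_(mg)', na_position='first')
print(nan_first.index.tolist())     # [1, 2, 0, 3]
```

Returning a copy is usually easier to reason about than inplace=True, because the unsorted data stays available for comparison.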

2. The Titanic dataset: a classic introductory example

import numpy as np

'''
    isnull checks a column for missing values: it returns True for NaN and False for a normal value
'''
titanic_survival = pd.read_csv('titanic_train.csv')
titanic_survival.head()

age = titanic_survival['Age']
age_top_10 = (age[0:10])
age_is_null = pd.isnull(age_top_10)
print(age_is_null)

# Use the boolean mask as an index to select only the missing values
age_null = age_top_10[age_is_null]
print(age_null)
age_null_count = len(age_null)
print(age_null_count)
0    False
1    False
2    False
3    False
4    False
5     True
6    False
7    False
8    False
9    False
Name: Age, dtype: bool
5   NaN
Name: Age, dtype: float64
1
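
The same count can be read straight off the mask. The sketch below uses a hypothetical four-value Series instead of the Titanic file; it relies on the fact that True counts as 1 when a boolean Series is summed.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; one value is missing
age = pd.Series([22.0, 38.0, np.nan, 35.0], name='Age')

mask = age.isnull()               # True where the value is missing
missing_count = int(mask.sum())   # True counts as 1, so sum() counts the NaNs
present = age[~mask]              # ~ inverts the mask and keeps the valid values

print(missing_count)      # 1
print(present.tolist())   # [22.0, 38.0, 35.0]
```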

3. Common Pandas data preprocessing functions

3.1 Handling missing values

'''
    If the NaN values are not handled first, the result of the calculation is nan
'''
average_age = sum(titanic_survival['Age'])/len(titanic_survival['Age'])
print(average_age)

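'''
    A handy way to deal with missing values: index with a boolean mask to keep only the values that are not NaN
'''
# isnull returns a boolean Series for the column: True where the value is missing, False otherwise
all_age_null = pd.isnull(titanic_survival['Age'])
print(all_age_null)
# Invert the mask and use it as an index to keep only the valid ages
good_ages = titanic_survival['Age'][~all_age_null]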
print(good_ages)
age_average = sum(good_ages)/len(good_ages)
# 29.69911764705882
print(age_average)
nan
0      False
1      False
2      False
3      False
4      False
5       True
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17      True
18     False
19      True
20     False
21     False
22     False
23     False
24     False
25     False
26      True
27     False
28      True
29      True
       ...  
861    False
862    False
863     True
864    False
865    False
866    False
867    False
868     True
869    False
870    False
871    False
872    False
873    False
874    False
875    False
876    False
877    False
878     True
879    False
880    False
881    False
882    False
883    False
884    False
885    False
886    False
887    False
888     True
889    False
890    False
Name: Age, Length: 891, dtype: bool
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
6      54.0
7       2.0
8      27.0
9      14.0
10      4.0
11     58.0
12     20.0
13     39.0
14     14.0
15     55.0
16      2.0
18     31.0
20     35.0
21     34.0
22     15.0
23     28.0
24      8.0
25     38.0
27     19.0
30     40.0
33     66.0
34     28.0
35     42.0
37     21.0
38     18.0
       ... 
856    45.0
857    51.0
858    24.0
860    41.0
861    21.0
862    48.0
864    24.0
865    42.0
866    27.0
867    31.0
869     4.0
870    26.0
871    47.0
872    33.0
873    47.0
874    28.0
875    15.0
876    20.0
877    19.0
879    56.0
880    25.0
881    33.0
882    22.0
883    28.0
884    25.0
885    39.0
886    27.0
887    19.0
889    26.0
890    32.0
Name: Age, Length: 714, dtype: float64
29.69911764705882

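3.2 Pandas methods that skip missing values automatically

# missing data is so common that many pandas methods automatically filter for it
# Pandas can filter out missing values for us, but relying on that is still not really recommended:
#  data should not be discarded lightly, and the usual practice is to fill the gaps with a computed default value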
mean_age = titanic_survival['Age'].mean()
print(mean_age)
29.69911764705882
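
The comment above mentions filling gaps with a computed default instead of dropping them. A minimal sketch of that idea, assuming the column mean is an acceptable default (the ages here are hypothetical):

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, np.nan, 40.0], name='Age')  # hypothetical values

mean_age = ages.mean()            # NaN is skipped, so this is (20 + 40) / 2
filled = ages.fillna(mean_age)    # returns a new Series; ages itself is unchanged

print(mean_age)          # 30.0
print(filled.tolist())   # [20.0, 30.0, 40.0]
```

Whether the mean, the median, or some other value is the right default depends on the data; the mean is used here only because it is the statistic this section computes.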

3.3 Manually computing the average fare for each passenger class

Pclass = [1,2,3]
Pclass_avg_price = {}
for this_pclass in Pclass:
    
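    # First select the rows (samples) whose Pclass matches, then sum the chosen column (feature)
    # of those rows and divide by the number of rows; the result is the mean of that feature
    prices = titanic_survival[titanic_survival['Pclass'] == this_pclass]
    # Pclass_avg_price[this_pclass] = sum(prices['Fare'])/len(prices)
    # The mean can also be computed with the built-in Pandas method shown in Section 3.2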
    Pclass_avg_price[this_pclass] = prices['Fare'].mean()
    
print(Pclass_avg_price)
{1: 84.15468749999992, 2: 20.66218315217391, 3: 13.675550101832997}

3.4 Simplifying the Section 3.3 calculation with Pandas built-ins

'''
    index tells the method which column to group by
    values is the column that we want to apply the calculation to
    aggfunc specifies the calculation we want to perform
'''
passenger_survival = titanic_survival.pivot_table(index='Pclass', values='Survived', aggfunc=np.mean)
print(passenger_survival)

# Note: if aggfunc is omitted, it defaults to the mean
avg_age = titanic_survival.pivot_table(index='Pclass', values='Age')
print(avg_age)
age = titanic_survival.pivot_table(index='Pclass', values='Age', aggfunc=np.mean)
print(age)
        Survived
Pclass          
1       0.629630
2       0.472826
3       0.242363
              Age
Pclass           
1       38.233441
2       29.877630
3       25.140620
              Age
Pclass           
1       38.233441
2       29.877630
3       25.140620
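
For comparison, the same per-class mean can be written with groupby. This is a sketch on a hypothetical five-row table, not the author's code. It passes the string 'mean' as aggfunc; as far as I know, recent pandas releases warn when np.mean is passed and suggest the string form.

```python
import pandas as pd

# Hypothetical miniature of the Titanic table
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 3, 3],
    'Survived': [1, 0, 1, 0, 0],
})

# pivot_table returns a DataFrame with one column per entry in values
by_pivot = toy.pivot_table(index='Pclass', values='Survived', aggfunc='mean')

# groupby on a single column returns a Series with the same numbers
by_group = toy.groupby('Pclass')['Survived'].mean()

print(by_pivot['Survived'].tolist())   # [0.5, 1.0, 0.0]
print(by_group.tolist())               # [0.5, 1.0, 0.0]
```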

3.5 Grouped calculations across specified columns

# Group by port of embarkation, then total the fares and the number of survivors for each group
Fare_survived = titanic_survival.pivot_table(index='Embarked', values=['Fare', 'Survived'], aggfunc=np.sum)
print(Fare_survived)
                Fare  Survived
Embarked                      
C         10072.2962        93
Q          1022.2543        30
S         17439.3988       217
# specifying axis = 1 or axis = 'columns' will drop any columns that have null values
drop_col = titanic_survival.dropna(axis=1)
print(drop_col.head())

# Drop any row (sample) in which Age or Sex is missing
new_data = titanic_survival.dropna(axis=0, subset=['Age','Sex'])
print(new_data.head())

# The counterpart, fillna, fills the null values instead of dropping them
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex  SibSp  Parch  \
0                            Braund, Mr. Owen Harris    male      1      0   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female      1      0   
2                             Heikkinen, Miss. Laina  female      0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female      1      0   
4                           Allen, Mr. William Henry    male      0      0   

             Ticket     Fare  
0         A/5 21171   7.2500  
1          PC 17599  71.2833  
2  STON/O2. 3101282   7.9250  
3            113803  53.1000  
4            373450   8.0500  
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
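
The three options are easier to compare side by side on a tiny table. This is a sketch with hypothetical rows; the 'Unknown' placeholder is my own choice for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: Age and Cabin each contain missing values, Sex does not
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0],
    'Sex':   ['male', 'female', 'female'],
    'Cabin': [np.nan, 'C85', np.nan],
})

no_null_cols = toy.dropna(axis=1)                     # drops Age and Cabin
rows_with_age = toy.dropna(axis=0, subset=['Age'])    # drops only the row at index 1
cabin_filled = toy['Cabin'].fillna('Unknown')         # keeps every row, fills the gaps

print(list(no_null_cols.columns))     # ['Sex']
print(rows_with_age.index.tolist())   # [0, 2]
print(cabin_filled.tolist())          # ['Unknown', 'C85', 'Unknown']
```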

3.6 Locating data

# loc looks up a single value by row label (here the default integer index) and column name
print(titanic_survival.loc[12,'Age'])
print(titanic_survival.loc[342,'Pclass'])
20.0
2
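
One thing worth keeping in mind: loc works on row labels, not positions, and the two stop matching once the rows are reordered. A small sketch with hypothetical values, using iloc for the positional lookup:

```python
import pandas as pd

toy = pd.DataFrame({'Age': [30.0, 10.0, 20.0]})   # default labels 0, 1, 2
by_age = toy.sort_values('Age')                   # labels are now ordered 1, 2, 0

label_lookup = by_age.loc[0, 'Age']        # the row whose label is 0
position_lookup = by_age['Age'].iloc[0]    # the first row in the current order

print(label_lookup)      # 30.0
print(position_lookup)   # 10.0
```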

3.7 Resetting the index after sorting

new_data = titanic_survival.sort_values('Age', ascending=False)
# Discard the old index and renumber the sorted rows from 0; inplace=True modifies new_data directly
new_data.reset_index(drop=True,inplace=True)
print(new_data.head())
   PassengerId  Survived  Pclass                                  Name   Sex  \
0          631         1       1  Barkworth, Mr. Algernon Henry Wilson  male   
1          852         0       3                   Svensson, Mr. Johan  male   
2          494         0       1               Artagaveytia, Mr. Ramon  male   
3           97         0       1             Goldschmidt, Mr. George B  male   
4          117         0       3                  Connors, Mr. Patrick  male   

    Age  SibSp  Parch    Ticket     Fare Cabin Embarked  
0  80.0      0      0     27042  30.0000   A23        S  
1  74.0      0      0    347060   7.7750   NaN        S  
2  71.0      0      0  PC 17609  49.5042   NaN        C  
3  71.0      0      0  PC 17754  34.6542    A5        C  
4  70.5      0      0    370369   7.7500   NaN        Q  
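
To show what drop=True actually discards, here is a sketch on three hypothetical rows. Without drop=True, reset_index keeps the old labels in a new column named 'index'.

```python
import pandas as pd

toy = pd.DataFrame({'Age': [22.0, 80.0, 35.0]})   # hypothetical values
by_age = toy.sort_values('Age', ascending=False)
print(by_age.index.tolist())        # [1, 2, 0]: old labels travel with their rows

renumbered = by_age.reset_index(drop=True)   # discard the old labels
kept = by_age.reset_index()                  # keep them in an 'index' column

print(renumbered.index.tolist())    # [0, 1, 2]
print(kept['index'].tolist())       # [1, 2, 0]
```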

3.8 Custom functions

# Define a function that returns the value in the 100th row (index label 99) of a column
def hundredth_data(column):
    data = column.loc[99]
    return data
data = titanic_survival.apply(hundredth_data)
print(data)

# Count the samples with a missing value in each column
def null_count (column):
    col_null = pd.isnull(column)
    null = column[col_null]
    return len(null)

count = titanic_survival.apply(null_count)
print('----------')
print(count)
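# help() prints the documentation itself and returns None, so wrapping it in print() also prints a trailing None
print(help(pd.isnull))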
PassengerId                  100
Survived                       0
Pclass                         2
Name           Kantor, Mr. Sinai
Sex                         male
Age                           34
SibSp                          1
Parch                          0
Ticket                    244367
Fare                          26
Cabin                        NaN
Embarked                       S
dtype: object
----------
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
Help on function isna in module pandas.core.dtypes.missing:

isna(obj)
    Detect missing values for an array-like object.
    
    This function takes a scalar or array-like object and indictates
    whether values are missing (``NaN`` in numeric arrays, ``None`` or ``NaN``
    in object arrays, ``NaT`` in datetimelike).
    
    Parameters
    ----------
    obj : scalar or array-like
        Object to check for null or missing values.
    
    Returns
    -------
    bool or array-like of bool
        For scalar input, returns a scalar boolean.
        For array input, returns an array of boolean indicating whether each
        corresponding element is missing.
    
    See Also
    --------
    notna : boolean inverse of pandas.isna.
    Series.isna : Detetct missing values in a Series.
    DataFrame.isna : Detect missing values in a DataFrame.
    Index.isna : Detect missing values in an Index.
    
    Examples
    --------
    Scalar arguments (including strings) result in a scalar boolean.
    
    >>> pd.isna('dog')
    False
    
    >>> pd.isna(np.nan)
    True
    
    ndarrays result in an ndarray of booleans.
    
    >>> array = np.array([[1, np.nan, 3], [4, 5, np.nan]])
    >>> array
    array([[ 1., nan,  3.],
           [ 4.,  5., nan]])
    >>> pd.isna(array)
    array([[False,  True, False],
           [False, False,  True]])
    
    For indexes, an ndarray of booleans is returned.
    
    >>> index = pd.DatetimeIndex(["2017-07-05", "2017-07-06", None,
    ...                           "2017-07-08"])
    >>> index
    DatetimeIndex(['2017-07-05', '2017-07-06', 'NaT', '2017-07-08'],
                  dtype='datetime64[ns]', freq=None)
    >>> pd.isna(index)
    array([False, False,  True, False])
    
    For Series and DataFrame, the same type is returned, containing booleans.
    
    >>> df = pd.DataFrame([['ant', 'bee', 'cat'], ['dog', None, 'fly']])
    >>> df
         0     1    2
    0  ant   bee  cat
    1  dog  None  fly
    >>> pd.isna(df)
           0      1      2
    0  False  False  False
    1  False   True  False
    
    >>> pd.isna(df[1])
    0    False
    1     True
    Name: 1, dtype: bool

None
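
The per-column null count does not strictly need a custom function. The sketch below, on a hypothetical three-row table, checks that `DataFrame.isna().sum()` gives the same counts as the apply version above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with a few gaps
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, np.nan],
    'Cabin': [np.nan, 'C85', 'C123'],
    'Fare':  [7.25, 71.28, 8.05],
})

def null_count(column):
    # Same logic as the function in this section
    return len(column[pd.isnull(column)])

via_apply = toy.apply(null_count)    # called once per column
via_builtin = toy.isna().sum()       # boolean frame, summed column by column

print(via_apply.to_dict())     # {'Age': 2, 'Cabin': 1, 'Fare': 0}
print(via_builtin.to_dict())   # {'Age': 2, 'Cabin': 1, 'Fare': 0}
```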

3.9 Row-wise iteration and data transformation

ages = titanic_survival['Age']
print(ages.head())

def which_class (row):
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return 'Unknown'
    elif pclass == 1:
        return 'First Class'
    elif pclass == 2:
        return 'Second Class'
    else:
        return 'Third Class'
    
# With axis=1, apply calls the function once per row, i.e. it iterates over the rows
result = titanic_survival.apply(which_class, axis=1)
print(result.head())

# Labels: '年輕人' = young (under 18), '中年人' = middle-aged (18 to 39), '老年人' = older (40 and above)
def age_class (row):
    age = row['Age']
    if pd.isna(age):
        return 'Unknown'
    elif age < 18:
        return '年輕人'
    elif age < 40:
        return '中年人'
    else:
        return '老年人'
age_lable = titanic_survival.apply(age_class, axis=1)
print(age_lable.tail())
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64
0    Third Class
1    First Class
2    Third Class
3    First Class
4    Third Class
dtype: object
886        中年人
887        中年人
888    Unknown
889        中年人
890        中年人
dtype: object
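
Row-wise apply is easy to read but calls the function once per row. As an alternative sketch (not the approach used in these notes), the same labeling can be done on the whole column at once with `np.select`, which takes the first condition that matches. The English labels and the sample ages are my own for illustration.

```python
import numpy as np
import pandas as pd

ages = pd.Series([5.0, 25.0, 60.0, np.nan], name='Age')  # hypothetical values

# Conditions are checked in order; the first True one decides the label
conditions = [ages.isna(), ages < 18, ages < 40]
choices = ['Unknown', 'young', 'middle-aged']
labels = pd.Series(
    np.select(conditions, choices, default='older'),
    index=ages.index,
)

print(labels.tolist())   # ['young', 'middle-aged', 'older', 'Unknown']
```

The missing-value check has to come first, for the same reason the function above tests pd.isna before comparing ages.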

3.10 Grouping by a derived label to relate columns

# Add a new column to the DataFrame
titanic_survival['age_label'] = age_lable
result = titanic_survival.pivot_table(index='age_label', values='Survived')
print(result)
           Survived
age_label          
Unknown    0.293785
中年人        0.383562
年輕人        0.539823
老年人        0.374233
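
A survival rate on its own says nothing about how many passengers are behind it. As a hedged follow-up sketch on hypothetical rows (with English labels of my own), groupby with agg can report the mean and the group size together:

```python
import pandas as pd

# Hypothetical rows; 'age_label' mirrors the column added in this section
toy = pd.DataFrame({
    'age_label': ['young', 'young', 'older', 'Unknown'],
    'Survived':  [1, 0, 1, 0],
})

# One row per label, with the survival rate and the number of passengers
summary = toy.groupby('age_label')['Survived'].agg(['mean', 'count'])
print(summary)
```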