Python Data Cleaning (Part 5)

阿新 • Published: 2018-12-17

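Case study
Apply all the data cleaning techniques covered so far to tidy a messy, real-world dataset obtained from the Gapminder Foundation. By the end you will have a clean, tidy dataset and be ready to use Python on your own data science projects.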

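1.1 Exploratory analysis
Whenever you get a new dataset, the first task should be some exploratory analysis, so that you understand the data better and can diagnose any potential problems.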

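The Gapminder data for the 19th century has been loaded into a DataFrame named g1800s. Explore it with pandas methods such as .head(), .info() and .describe(), and with DataFrame attributes such as .columns and .shape.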

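The dataset has 260 rows and 101 columns, 100 of which are numeric. 'Life expectancy' is the only column in the DataFrame that is not of type float64. For some countries, the values never change.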
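As a minimal sketch of this exploration, the snippet below runs the same calls on a small stand-in DataFrame. The name toy and its values are invented for illustration and are not the real g1800s data; only the column layout (a 'Life expectancy' label column followed by year columns) follows the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for g1800s: one label column plus a few year columns.
# The numbers are made up for illustration.
toy = pd.DataFrame({
    "Life expectancy": ["Country A", "Country B", "Country C"],
    "1800": [28.2, 35.4, 32.2],
    "1801": [28.2, 35.4, 36.4],
    "1802": [28.2, np.nan, 23.1],
})

print(toy.head())      # first rows
toy.info()             # dtypes and non-null counts (prints directly)
print(toy.describe())  # summary statistics of the numeric columns
print(toy.columns)     # column labels
print(toy.shape)       # (number of rows, number of columns)

# How many columns are numeric, and which ones are not float64?
n_numeric = toy.select_dtypes(include="number").shape[1]
non_float = toy.select_dtypes(exclude="float64").columns.tolist()
print(n_numeric, non_float)
```

On the real g1800s, the same two select_dtypes calls are one way to confirm the "100 numeric columns, one non-float64 column" observation.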

1.2 Visualizing the data
Since 1800, life expectancy around the world has been rising steadily. The Gapminder data should let you confirm this.

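The DataFrame g1800s has been preloaded. Your job in this exercise is to create a scatter plot with life expectancy in '1800' on the x-axis and life expectancy in '1899' on the y-axis.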

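The goal here is to inspect the data visually for insights and errors. When looking at the plot, note whether the points form a diagonal line, and which points fall below or above the diagonal. This tells you how life expectancy in 1899 changed (or did not change) compared with 1800 for each country. A point on the diagonal means that life expectancy stayed the same.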

# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Create the scatter plot
g1800s.plot(kind='scatter', x='1800', y='1899')

# Specify axis labels
plt.xlabel('Life Expectancy by Country in 1800')
plt.ylabel('Life Expectancy by Country in 1899')

# Specify axis limits
plt.xlim(20, 55)
plt.ylim(20, 55)

# Display the plot
plt.show()

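As the plot shows, quite a few countries fall on the diagonal. In fact, inspecting the DataFrame reveals that for 140 of the 260 countries, life expectancy did not change at all during the 19th century. This is probably because data was not available at the time. In this way, visualizing the data can help you uncover insights and diagnose errors.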
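The count of unchanged countries can be checked without a plot. The sketch below is one possible way to do it on a made-up stand-in for g1800s (the name toy and all values are hypothetical): it counts the points that sit on the diagonal and, separately, the rows whose value is constant across every year column.

```python
import pandas as pd

# Hypothetical stand-in for g1800s with three year columns; values are made up.
toy = pd.DataFrame({
    "Life expectancy": ["A", "B", "C", "D"],
    "1800": [28.2, 35.4, 32.2, 30.0],
    "1850": [28.2, 36.0, 40.0, 29.0],
    "1899": [28.2, 35.4, 46.1, 27.5],
})

# Points on the diagonal of the scatter plot: same value in 1800 and 1899.
on_diagonal = toy["1800"] == toy["1899"]

# Countries whose value never changes in any year column.
# nunique ignores missing values by default.
years = toy.drop(columns="Life expectancy")
constant = years.nunique(axis=1) == 1

# Countries above and below the diagonal.
improved = toy.loc[toy["1899"] > toy["1800"], "Life expectancy"].tolist()
declined = toy.loc[toy["1899"] < toy["1800"], "Life expectancy"].tolist()

print(int(on_diagonal.sum()), int(constant.sum()), improved, declined)
```

Note that a point on the diagonal only compares the two endpoint years; the nunique check is the stricter test for "did not change at all".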

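1.3 Thinking about the question at hand
Because the dataset gives life expectancy by country and by year, you can ask questions about how average life expectancy changes from year to year.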

Before continuing, however, it is important to make sure that the following assumptions about the data hold:

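'Life expectancy' is the first column (index 0) of the DataFrame.
The other columns contain either null or numeric values.
The numeric values are all greater than or equal to 0.
There is only one instance of each country.

You can write a function that can be applied over the entire DataFrame to verify some of these assumptions. Note that time spent writing such a script also pays off when working with other datasets.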

import pandas as pd

def check_null_or_valid(row_data):
    """Function that takes a row of data,
    drops all missing values,
    and checks if all remaining values are greater than or equal to 0
    """
    no_na = row_data.dropna()
    numeric = pd.to_numeric(no_na)
    ge0 = numeric >= 0
    return ge0

# Check whether the first column is 'Life expectancy'
assert g1800s.columns[0] == 'Life expectancy'

# Check whether the values in the row are valid
assert g1800s.iloc[:, 1:].apply(check_null_or_valid, axis=1).all().all()

# Check that there is only one instance of each country:
# value_counts() sorts in descending order, so the first count is the largest
assert g1800s['Life expectancy'].value_counts().iloc[0] == 1

Getting into the habit of testing your code like this is an important skill.
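To see the checks both pass and fail, here is a small sketch that applies the same function to two invented DataFrames. The names good, bad, all_valid and has_unique_countries are hypothetical helpers introduced only for this illustration, and the duplicated() test is an alternative to the value_counts() check, not the original method.

```python
import numpy as np
import pandas as pd

def check_null_or_valid(row_data):
    """Drop missing values and test that the rest are >= 0."""
    no_na = row_data.dropna()
    numeric = pd.to_numeric(no_na)
    return numeric >= 0

# Hypothetical data: 'good' satisfies the assumptions, 'bad' does not.
good = pd.DataFrame({
    "Life expectancy": ["A", "B"],
    "1800": [28.2, np.nan],   # a missing value is allowed
    "1801": [28.1, 35.4],
})
bad = good.copy()
bad.loc[1, "1801"] = -1.0     # a negative life expectancy is invalid

def all_valid(df):
    # Apply the row check to every column except the first one.
    return bool(df.iloc[:, 1:].apply(check_null_or_valid, axis=1).all().all())

def has_unique_countries(df):
    # Alternative uniqueness check: no country label appears twice.
    return not df["Life expectancy"].duplicated().any()

print(all_valid(good), all_valid(bad))
print(has_unique_countries(good))
print(has_unique_countries(pd.concat([good, good])))
```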

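1.4 Assembling the data
Three DataFrames have been preloaded here: g1800s, g1900s and g2000s. They contain the Gapminder life expectancy data for the 19th, 20th and 21st centuries respectively.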

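Your task in this exercise is to concatenate them into a single DataFrame named gapminder. This is a row-wise concatenation, similar to the way the monthly Uber datasets were concatenated in Part 3.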

# Concatenate the DataFrames row-wise
gapminder = pd.concat([g1800s,g1900s,g2000s])

# Print the shape of gapminder
print(gapminder.shape)

# Print the head of gapminder
print(gapminder.head())

All the Gapminder data from 1800 to 2016 is now contained in one DataFrame.
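It helps to know what a row-wise pd.concat does when the inputs have different year columns. The sketch below uses two tiny invented frames (a and b are hypothetical, not the real century tables): the rows are stacked, the columns become the union of both inputs, and cells with no source value are filled with NaN.

```python
import pandas as pd

# Hypothetical miniature versions of two century tables.
a = pd.DataFrame({"Life expectancy": ["A", "B"],
                  "1800": [28.2, 35.4], "1801": [28.1, 35.4]})
b = pd.DataFrame({"Life expectancy": ["A", "B"],
                  "1900": [30.0, 40.0], "1901": [30.5, 40.2]})

# Row-wise concatenation: rows are stacked, columns are unioned.
stacked = pd.concat([a, b])
print(stacked.shape)
print(stacked)

# The original row labels are kept, so they repeat.
print(stacked.index.tolist())

# Pass ignore_index=True to get a fresh 0..n-1 index instead.
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index.tolist())

# Number of cells filled with NaN because the year was absent in that input.
n_missing = int(stacked.isna().sum().sum())
print(n_missing)
```

This is why the concatenated table has one row per country per century rather than one row per country.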

2. Initial impressions of the data

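2.2 Checking the data types
Now that the data is in the right shape, you need to make sure the columns have the correct data types. That is, country should be of type object, year of type int64, and life_expectancy of type float64. The tidy DataFrame has been preloaded as gapminder.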

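Explore it in the IPython Shell with the .info() method. Notice that the column 'year' is of type object. This is incorrect, so it needs to be converted to a numeric data type with the pd.to_numeric() function.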

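import numpy as np
import pandas as pd

# Convert the year column to numeric
gapminder['year'] = pd.to_numeric(gapminder['year'])

# Test if country is of type object
# (the np.object alias was removed in NumPy 1.24; the builtin object is equivalent)
assert gapminder.country.dtypes == object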

# Test if year is of type int64
assert gapminder.year.dtypes == np.int64

# Test if life_expectancy is of type float64
assert gapminder.life_expectancy.dtypes == np.float64

Since the assert statements did not raise any errors, you can be sure the columns have the correct data types.
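The conversion can be tried on a small invented table. In the sketch below, tidy and messy are hypothetical examples; the second half shows the errors="coerce" option of pd.to_numeric, which turns unparseable strings into NaN instead of raising, at the cost of a float64 result.

```python
import pandas as pd

# Hypothetical tidy table in which 'year' was read in as strings.
tidy = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": ["1800", "1801", "1800"],
    "life_expectancy": [28.2, 28.1, 35.4],
})

# Clean digit strings convert to an integer dtype.
tidy["year"] = pd.to_numeric(tidy["year"])
print(tidy.dtypes)

# With errors="coerce", values that cannot be parsed become NaN,
# so the result is float64 rather than int64.
messy = pd.Series(["1800", "n/a"])
coerced = pd.to_numeric(messy, errors="coerce")
print(coerced)
```

If the real 'year' column contained stray non-numeric entries, the int64 assertion would therefore fail after coercion, which is a useful signal that the column needs more cleaning.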

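2.3 Looking at country spellings
Having tidied the DataFrame and checked the data types, the next task in the cleaning process is to look at the 'country' column and see whether it contains any special or invalid characters that need to be dealt with.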

It is reasonable to assume that country names will contain:

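The set of lowercase and uppercase letters.
Whitespace between words.
Periods for any abbreviations.

To confirm that this is the case, you can again use the power of regular expressions. For common operations like this, pandas has a built-in string method, str.contains(), which takes a regular expression pattern and applies it to a Series, returning True where there is a match and False otherwise.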

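Because you want to find the values that do not match, you have to invert the boolean, which can be done with ~. You can then use this boolean Series to select the countries that have invalid names.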
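A minimal sketch of this check follows. The pattern is an assumption built from the three rules listed above (letters, whitespace, periods) and is not taken from the original exercise; the countries Series is an illustrative sample rather than the real 'country' column.

```python
import pandas as pd

# Illustrative sample of country names; not the real gapminder column.
countries = pd.Series([
    "Sweden",
    "St. Lucia",
    "Cote d'Ivoire",
    "Congo, Dem. Rep.",
    "Åland",
])

# Assumed pattern: the whole string consists only of ASCII letters,
# periods and whitespace.
pattern = r"^[A-Za-z\.\s]*$"

# True where the name matches the pattern.
mask = countries.str.contains(pattern)

# Invert with ~ to select the names that do NOT match.
invalid = countries[~mask]
print(invalid.tolist())
```

Names flagged this way are not necessarily wrong; apostrophes, commas and accented letters occur in legitimate country names, so the result is a list to review rather than a list to delete. If the column contains missing values, passing na=False to str.contains keeps the mask strictly boolean.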
