Processing "Large" XLS Files with Python

This is just a practice exercise for learning Python.

  • What is in the files?

    • ‘Accident_Index’,
    • ‘Location_Easting_OSGR’,
    • ‘Location_Northing_OSGR’,
    • ‘Longitude’,
    • ‘Latitude’,
    • ‘Police_Force’,
    • ‘Accident_Severity’,
    • ‘Number_of_Vehicles’,
    • ‘Number_of_Casualties’,
    • ‘Date’,
    • ‘Day_of_Week’,
    • ‘Time’,
    • ‘Local_Authority_(District)’,
    • ‘Local_Authority_(Highway)’,
    • ‘1st_Road_Class’,
    • ‘1st_Road_Number’,
    • ‘Road_Type’,
    • ‘Speed_limit’,
    • ‘Junction_Detail’,
    • ‘Junction_Control’,
    • ‘2nd_Road_Class’,
    • ‘2nd_Road_Number’,
    • ‘Pedestrian_Crossing-Human_Control’,
    • ‘Pedestrian_Crossing_Physical_Facilities’,
    • ‘Light_Conditions’,
    • ‘Weather_Conditions’,
    • ‘Road_Surface_Conditions’,
    • ‘Special_Conditions_at_Site’,
    • ‘Carriageway_Hazards’,
    • ‘Urban_or_Rural_Area’,
    • ‘Did_Police_Officer_Attend_Scene_of_Accident’,
    • ‘LSOA_of_Accident_Location’

  • Reading the file with the low_memory option

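import pandas as pd

# read the file
filedir = '/home/derek/Desktop/python-data-analyis/large-excel-files/Accidents_2013.csv'
# low_memory=False parses the whole file in one pass so each column gets one inferred dtype
data = pd.read_csv(filedir, low_memory=False)
print(data.loc[:10, 'Day_of_Week'])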
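
To get a feel for what low_memory=False does, here is a minimal sketch on a tiny in-memory CSV. The sample rows are made up for illustration and are not taken from the accident files. By default read_csv processes a file in internal chunks, which can leave a column with mixed inferred types; low_memory=False infers each dtype from the whole file. The usecols argument is an optional extra, shown here as one way to load only the columns an analysis needs.

```python
import io
import pandas as pd

# Made-up sample rows that mimic a few columns of the accident files.
sample_csv = io.StringIO(
    "Accident_Index,Day_of_Week,Number_of_Vehicles\n"
    "A001,1,2\n"
    "A002,3,1\n"
    "A003,1,12\n"
)

# low_memory=False: infer each column's dtype from the whole file at once.
# usecols: load only the listed columns.
sample = pd.read_csv(sample_csv, low_memory=False,
                     usecols=["Accident_Index", "Day_of_Week"])

print(sample.dtypes)
# Label-based slice: with .loc the end label (row 1) is included.
print(sample.loc[:1, "Day_of_Week"])
```

Note that .loc slices by label and includes the end label, so loc[:10] on a default index returns eleven rows.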
  • Extracting data with SQL-like filters
print('Accidents')
print('----------')
# select accidents that happened on a Sunday (Day_of_Week == 1)
accidents_sunday = data[data.Day_of_Week == 1]
print('Accidents which happended on a Sunday: ', len(accidents_sunday))
# select Sunday accidents involving more than 10 vehicles
accidents_sunday_many_cars = data[(data.Day_of_Week == 1) & (data.Number_of_Vehicles > 10)]
print('Accidents which happened on a Sunday involving > 10 cars: ', len(accidents_sunday_many_cars))
# select Sunday accidents involving more than 10 vehicles in rain (2 means rain without wind)
accidents_sunday_many_cars_rain = data[(data.Day_of_Week == 1) & (data.Number_of_Vehicles > 10) & (data.Weather_Conditions == 2)]
print('Accidents which happened on a Sunday involving > 10 cars with rainning: ', len(accidents_sunday_many_cars_rain))
# select accidents in London on a Sunday (Police_Force == 1)
london_data = data[(data['Police_Force'] == 1) & (data.Day_of_Week == 1)]
print('Accidents in London on a Sunday', len(london_data))
# select accidents in London on a Sunday during the year 2000:
# the date must fall on or after 2000-01-01 and on or before 2000-12-31
dates = pd.to_datetime(london_data['Date'], errors='coerce')
london_data_2000 = london_data[(dates >= pd.to_datetime('2000-01-01')) & (dates <= pd.to_datetime('2000-12-31'))]
print('Accidents in London on a Sunday in 2000:', len(london_data_2000))

This feels a lot like writing SQL statements. Filtering a DataFrame this way is really handy, isn't it?
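
To make the SQL comparison concrete, here is a small sketch that writes one filter in two ways: as a boolean mask, like the code above, and with DataFrame.query, which reads closer to a SQL WHERE clause. The rows are made up; only the column names come from the accident files.

```python
import pandas as pd

# Made-up rows using column names from the accident files.
df = pd.DataFrame({
    "Day_of_Week": [1, 1, 2, 1],
    "Number_of_Vehicles": [2, 12, 11, 15],
    "Weather_Conditions": [1, 2, 2, 1],
})

# Boolean-mask style: wrap each condition in parentheses and combine with &.
mask = (df.Day_of_Week == 1) & (df.Number_of_Vehicles > 10)
by_mask = df[mask]

# The same filter as a query string, similar to a SQL WHERE clause.
by_query = df.query("Day_of_Week == 1 and Number_of_Vehicles > 10")

print(len(by_mask), len(by_query))
```

Both forms select the same rows here. The mask form needs the parentheses because & binds more tightly than == and >.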

pd.to_datetime(london_data['Date'],errors='coerce')

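This is the date conversion call. It parses the Date column into datetime values, and errors='coerce' turns anything that cannot be parsed into NaT instead of raising an error.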
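
As a rough illustration of how errors='coerce' behaves inside a date-range filter, the sketch below uses a few made-up date strings. It assumes the dates are stored as day/month/year text, which is an assumption for this example and not something checked against the files. Passing an explicit format is one way to avoid day/month ambiguity.

```python
import pandas as pd

# Made-up dates in day/month/year form; "not a date" is deliberately invalid.
raw = pd.Series(["05/01/2013", "31/12/2013", "not a date", "01/01/2014"])

# errors='coerce' yields NaT for values that do not match the format.
dates = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

start = pd.Timestamp("2013-01-01")
end = pd.Timestamp("2013-12-31")
# Comparisons against NaT are False, so unparseable rows drop out of the filter.
in_2013 = raw[(dates >= start) & (dates <= end)]

print(dates.isna().sum())
print(in_2013.tolist())
```

A year filter needs both a lower bound (on or after the start) and an upper bound (on or before the end), combined with &.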

Output:

Accidents
----------
Accidents which happended on a Sunday:  14854
Accidents which happened on a Sunday involving > 10 cars:  1
Accidents which happened on a Sunday involving > 10 cars with rainning:  1
Accidents in London on a Sunday 2374
Accidents in London on a Sunday in 2000: 0

  • Saving part of the DataFrame as an XLSX file
    Make sure XlsxWriter is installed:

sudo pip install XlsxWriter

writer = pd.ExcelWriter('london_data.xlsx', engine='xlsxwriter')
london_data.to_excel(writer, sheet_name='sheet1')
# close() writes the workbook to disk; recent pandas versions no longer have writer.save()
writer.close()
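
An alternative worth knowing is to use ExcelWriter as a context manager, so the file is saved and closed when the block ends. The sketch below wraps this in a helper; save_subset is a hypothetical name made up for this example, and the engine defaults to the XlsxWriter package mentioned above.

```python
import pandas as pd

def save_subset(df, path, sheet_name="sheet1", engine="xlsxwriter"):
    """Write df to an .xlsx file (save_subset is a hypothetical helper)."""
    # Leaving the with-block closes the writer, which saves the workbook.
    with pd.ExcelWriter(path, engine=engine) as writer:
        df.to_excel(writer, sheet_name=sheet_name)

# Usage with made-up data (needs the XlsxWriter package installed):
# save_subset(pd.DataFrame({"Police_Force": [1, 1]}), "london_data.xlsx")
```

With this form there is no separate close() call to forget if an exception is raised while writing.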
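  • Reading in chunks: finding which day of the week has the most accidents
    The code below uses the accident records for 2013, 2014 and 2015, stored in the three files 'Accidents_2013.csv', 'Accidents_2014.csv' and 'Accidents_2015.csv'.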
import pandas as pd
from pandas import Series
import matplotlib.pyplot as plt

# read the files
base_dir = '/home/derek/Desktop/python-data-analyis/large-excel-files/'
filenames = ['Accidents_2013.csv', 'Accidents_2014.csv', 'Accidents_2015.csv']
tot = Series(dtype='float64')
for name in filenames:
    # read the file in chunks of 1000 records at a time
    chunks = pd.read_csv(base_dir + name, chunksize=1000)
    for piece in chunks:
        # add this chunk's per-day counts to the running total
        tot = tot.add(piece['Day_of_Week'].value_counts(), fill_value=0)

# sort by day code (1..7) so the values line up with day_index below
tot = tot.sort_index()
day_index = ['Sun', 'Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat']
print('data like:')
# tot = tot.sort_values(ascending=False)
print(tot)
# build a new Series so that the index carries the day names
new_Series = Series(tot.values, index=day_index)
new_Series.plot()
plt.show()
plt.close()

Console output:

data like:
1    46052
2    60956
3    65006
4    64039
5    64445
6    69378
7    55162
dtype: float64

Plot: line chart of the accident counts by day of week, Sun through Sat (image not preserved).
Across the three years there are 425,038 recorded accidents in total.
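
The chunked counting can be tried without the real files. The following is a minimal sketch on a made-up in-memory CSV with a single Day_of_Week column. It uses a tiny chunksize so that several chunks are produced, and it accumulates the per-chunk value_counts in the same way as the code above.

```python
import io
import pandas as pd

# Made-up Day_of_Week values standing in for one accident file.
csv_text = "Day_of_Week\n" + "\n".join(str(d) for d in [1, 6, 6, 2, 7, 6, 1, 3])

tot = pd.Series(dtype="float64")
# chunksize=3 makes read_csv return an iterator of DataFrames of up to 3 rows each.
for piece in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    # add() aligns on the day code; fill_value=0 covers days missing on either side.
    tot = tot.add(piece["Day_of_Week"].value_counts(), fill_value=0)

# Sort by day code and convert the float totals back to integers.
tot = tot.sort_index().astype(int)
print(tot.to_dict())
print(int(tot.sum()))
```

Only one chunk is held in memory at a time, which is the point of chunksize for files too large to load whole. The fill_value=0 matters because a day that is absent from one chunk would otherwise turn the running total for that day into NaN.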

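Conclusion: It seems that in the UK, more accidents happen on working days than on days off. Friday has the most accidents; perhaps people are in a hurry to get home on Fridays, haha. By comparison, Friday is not a good day to go out.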
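
The comparison behind this conclusion can be checked with a few lines of arithmetic. The sketch below copies the counts from the console output above and takes 1 as Sunday through 7 as Saturday, following day_index in the code.

```python
# Counts copied from the console output above (1 = Sunday ... 7 = Saturday).
counts = {1: 46052, 2: 60956, 3: 65006, 4: 64039, 5: 64445, 6: 69378, 7: 55162}

weekend = [counts[d] for d in (1, 7)]
weekdays = [counts[d] for d in (2, 3, 4, 5, 6)]

total = sum(counts.values())
weekend_mean = sum(weekend) / len(weekend)
weekday_mean = sum(weekdays) / len(weekdays)
busiest = max(counts, key=counts.get)

print(total)         # 425038
print(weekend_mean)  # 50607.0
print(weekday_mean)  # 64764.8
print(busiest)       # 6, i.e. Friday
```

Comparing per-day averages rather than raw sums is the fairer test, since there are five working days and only two weekend days.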

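The data files are not provided here because readers can download them themselves, and may well find data they would rather analyze with Python.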