利用python進行資料分析----- 第一天,準備工作。DataFrame,Series,Matplotlib
阿新 • • 發佈:2018-12-13
目錄
工具
進行資料處理分析有很多公具,精通一種即可,本實驗只要使用pycharm.
建立變數
開啟pycharm,新建專案,點選python console進入互動式視窗
重疊的箭頭可以輸入指令,special variables是已經建立的變數。
每當輸入一行資料,按一下回車鍵,就會執行該語句,也相當於程式在一句一句的執行寫好的程式碼。右邊的special variables可以看到user建立的變數,包含了變數的型別
刪除變數
獲取資料
下載地址:
下載後為壓縮檔案,加壓後將字尾改為txt,或json,便於處理。順便名字也改一下。
引入檔案:
path:代表檔案路徑的字串,open(路徑)檔案載入函式,readline(),列印open函式讀取到的第一行資料。
>>> path = 'data/data1.txt' >>> open(path).readline() '{ "a": "Mozilla\\/5.0 (Linux; U; Android 4.1.2; en-us; HTC_PN071 Build\\/JZO54K) AppleWebKit\\/534.30 (KHTML, like Gecko) Version\\/4.0 Mobile Safari\\/534.30", "c": "US", "nk": 0, "tz": "America\\/Los_Angeles", "gr": "CA", "g": "15r91", "h": "10OBm3W", "l": "pontifier", "al": "en-US", "hh": "j.mp", "r": "direct", "u": "http:\\/\\/www.nsa.gov\\/", "t": 1368832205, "hc": 1365701422, "cy": "Anaheim", "ll": [ 33.816101, -117.979401 ] }\n'
轉換為json:
import:匯入包指令。
records=[?],建立名稱為recoeds的陣列
json.loads(?) 將引數轉化為json資料
for line in open(path) 開啟制定路徑檔案,for語句迴圈賦值給line
最後將line逐個寫入records.
records[0] 輸出索引為0的陣列元素
>>> import json >>> records = [json.loads(line) for line in open(path)] >>> records[0] {u'a': u'Mozilla/5.0 (Linux; U; Android 4.1.2; en-us; HTC_PN071 Build/JZO54K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30', u'c': u'US', u'nk': 0, u'tz': u'America/Los_Angeles', u'gr': u'CA', u'g': u'15r91', u'h': u'10OBm3W', u'cy': u'Anaheim', u'l': u'pontifier', u'al': u'en-US', u'hh': u'j.mp', u'r': u'direct', u'u': u'http://www.nsa.gov/', u't': 1368832205, u'hc': 1365701422, u'll': [33.816101, -117.979401]}
解析資料
單個物件輸出
>>> records[0]['tz']
u'America/Los_Angeles'
獲取所有時區
if ‘tz’ in rec : 判斷rec陣列是否存在tz屬性
>>> time_zone = [rec['tz'] for rec in records if 'tz' in rec]
引入自定義函式
count是一個字典,x和count[x]組成一個鍵值對
def get_counts(sequence):
counts = {}
for x in sequence:
if x in counts:
counts[x] +=1
else:
counts[x] = 1
return counts
在系統中引入路徑即可使用自定義函式
>>> sys.path.append('D:\\python\\DataAnalysis\\function')
>>> from getCounts import get_counts
使用函式:
>>> counts = get_counts(time_zone)
>>> print counts
{u'': 636, u'Europe/Lisbon': 8, u'America/Bogota': 16, u'America/Edmonton': 9, u'Australia/Tasmania': 1, u'Europe/Tallinn': 1, u'Asia/Calcutta': 6, u'Australia/South': 4, u'Europe/Skopje': 1, u'Europe/Copenhagen': 4, u'America/St_Lucia': 1, u'Europe/Amsterdam': 15, u'Europe/Zaporozhye': 1, u'America/Phoenix': 40, u'Europe/Moscow': 35, u'America/El_Salvador': 2, u'Europe/Madrid': 21, u'America/Argentina/Buenos_Aires': 11, u'America/Mazatlan': 2, u'America/Rainy_River': 33, u'Europe/Paris': 27, u'Europe/Stockholm': 4, u'America/Monterrey': 4, u'Europe/Athens': 1, u'America/Indianapolis': 50, u'America/Regina': 3, u'America/Mexico_City': 22, u'America/Puerto_Rico': 184, u'Asia/Manila': 4, u'Europe/Sarajevo': 1, u'Europe/Berlin': 24, u'Europe/Zurich': 5, u'Africa/Casablanca': 1, u'Asia/Karachi': 1, u'Europe/Rome': 19, u'Asia/Harbin': 4, u'Australia/West': 9, u'Asia/Kuching': 1, u'Europe/Warsaw': 2, u'Europe/Jersey': 1, u'Australia/Canberra': 7, u'Pacific/Honolulu': 12, u'America/St_Johns': 1, u'Europe/Oslo': 3, u'Asia/Hong_Kong': 5, u'America/Guadeloupe': 1, u'America/Nassau': 1, u'Europe/Prague': 1, u'Australia/NSW': 32, u'America/Halifax': 7, u'America/Jamaica': 1, u'Asia/Singapore': 4, u'America/Manaus': 2, u'America/Los_Angeles': 421, u'Asia/Amman': 1, u'Europe/Bratislava': 3, u'America/Vancouver': 23, u'Atlantic/Reykjavik': 1, u'Asia/Novokuznetsk': 1, u'America/Sao_Paulo': 29, u'America/Port_of_Spain': 1, u'Asia/Tokyo': 102, u'Asia/Jakarta': 4, u'Africa/Johannesburg': 2, u'Europe/Riga': 1, u'Chile/Continental': 16, u'Asia/Taipei': 1, u'Asia/Istanbul': 5, u'Australia/Victoria': 23, u'Europe/Bucharest': 3, u'Asia/Bangkok': 3, u'Africa/Ceuta': 6, u'America/Costa_Rica': 6, u'America/Winnipeg': 4, u'America/Chicago': 686, u'America/La_Paz': 4, u'Africa/Cairo': 3, u'Europe/Brussels': 14, u'Asia/Dubai': 1, u'Asia/Jerusalem': 1, u'Pacific/Auckland': 9, u'America/Argentina/Cordoba': 2, u'America/Caracas': 13, u'America/Panama': 2, u'America/Guayaquil': 4, u'Asia/Kuala_Lumpur': 3, u'America/Denver': 89, u'Asia/Riyadh': 5, u'Europe/Ljubljana': 1, u'Asia/Vladivostok': 1, u'Asia/Phnom_Penh': 1, u'Africa/Gaborone': 1, u'Europe/London': 85, u'America/Montevideo': 3, u'America/Managua': 3, u'Asia/Qatar': 1, u'Asia/Pontianak': 1, u'America/Tijuana': 1, u'America/Argentina/Catamarca': 1, u'Australia/Queensland': 10, u'America/Santo_Domingo': 4, u'Europe/Samara': 2, u'Asia/Yekaterinburg': 2, u'America/Asuncion': 1, u'Europe/Vienna': 6, u'America/New_York': 903, u'Europe/Dublin': 9, u'Europe/Sofia': 1, u'America/Montreal': 8, u'America/Anchorage': 8, u'Asia/Seoul': 3}
獲取數量前十的時區,倒序:
# coding=utf-8
def top_counts(count_dict,n):
value_key_pairs = [(count,tz) for tz,count in count_dict.items()]
value_key_pairs.sort()
return value_key_pairs[-n:]
>>> from topCounts import top_counts
>>> top_counts(counts,10)
[(40, u'America/Phoenix'), (50, u'America/Indianapolis'), (85, u'Europe/London'), (89, u'America/Denver'), (102, u'Asia/Tokyo'), (184, u'America/Puerto_Rico'), (421, u'America/Los_Angeles'), (636, u''), (686, u'America/Chicago'), (903, u'America/New_York')]
使用pandas對時區進行計數
DataFrame函式將資料表示為一個表格。
>>> from pandas import DataFrame,Series
>>> import pandas as pd;import numpy as np
>>> frame = DataFrame(records)
>>> frame
_heartbeat_ ... u
0 NaN ... http://www.nsa.gov/
1 NaN ... http://answers.usa.gov/system/selfservice.cont...
2 NaN ... http://www.saj.usace.army.mil/Media/NewsReleas...
3 NaN ... https://nationalregistry.fmcsa.dot.gov/
4 NaN ... http://www.peacecorps.gov/learn/howvol/ab530gr...
5 NaN ... https://petitions.whitehouse.gov/petition/repe...
6 NaN ... http://pld.dpi.wi.gov/files/pld/images/LinkWI.png
7 NaN ... http://www.nasa.gov/multimedia/imagegallery/im...
8 NaN ... http://www.nsa.gov/
9 NaN ... http://www.nasa.gov/mission_pages/sunearth/new...
10 NaN ... http://www.dodlive.mil/index.php/2013/05/the-2...
11 NaN ... http://doggett.house.gov/index.php/news/571-do...
12 NaN ... http://www.peacecorps.gov/learn/howvol/ab530gr...
13 NaN ... http://www.fws.gov/cno/press/release.cfm?rid=493
14 NaN ... http://www.cancer.gov/PublishedContent/Images/...
15 NaN ... http://www.army.mil/article/103380/
16 NaN ... http://pld.dpi.wi.gov/files/pld/images/LinkWI.png
17 NaN ... http://www.nws.noaa.gov/com/weatherreadynation...
18 NaN ... http://fastlane.dot.gov/2013/05/new-locomotive...
19 NaN ... http://apod.nasa.gov/apod/ap130517.html
20 NaN ... http://www.ice.gov/news/releases/1305/130516sa...
21 NaN ... http://www.dodlive.mil/index.php/2013/05/the-2...
22 NaN ... http://pld.dpi.wi.gov/files/pld/images/LinkWI.png
23 NaN ... http://doggett.house.gov/index.php/news/571-do...
24 NaN ... http://pld.dpi.wi.gov/files/pld/images/LinkWI.png
25 NaN ... http://answers.usa.gov/system/selfservice.cont...
26 NaN ... http://answers.usa.gov/system/selfservice.cont...
27 NaN ... http://answers.usa.gov/system/selfservice.cont...
28 NaN ... http://answers.usa.gov/system/selfservice.cont...
29 NaN ... http://answers.usa.gov/system/selfservice.cont...
... ... ...
3929 NaN ... http://www.nasa.gov/mission_pages/station/expe...
3930 NaN ... http://gsaauctions.gov/gsaauctions/aucdsclnk?s...
3931 NaN ... http://gsaauctions.gov/gsaauctions/aucdsclnk?s...
3932 NaN ... http://www.nsa.gov/
3933 NaN ... http://science.nasa.gov/science-news/science-a...
3934 NaN ... http://www.nsa.gov/
3935 NaN ... http://cms3.tucsonaz.gov/files/police/media-re...
3936 NaN ... http://www.irs.gov/uac/Newsroom/Tax-Relief-for...
3937 NaN ... http://www.jpl.nasa.gov/news/news.php?release=...
3938 NaN ... http://www.jpl.nasa.gov/news/news.php?release=...
3939 NaN ... http://www.doe.gov/articles/energy-department-...
3940 NaN ... http://www.nsa.gov/
3941 NaN ... http://pld.dpi.wi.gov/files/pld/images/LinkWI.png
3942 NaN ... http://fwp.mt.gov/hunting/hunterAccess/openFie...
3943 NaN ... http://science.nasa.gov/media/medialibrary/201...
3944 NaN ... http://gsaauctions.gov/gsaauctions/aucdsclnk?s...
3945 NaN ... http://inws.wrh.noaa.gov/weather/alertinfo/103...
3946 NaN ... http://pld.dpi.wi.gov/files/pld/images/LinkWI.png
3947 NaN ... http://doggett.house.gov/index.php/news/571-do...
3948 NaN ... http://www.nasa.gov/mission_pages/mer/news/mer...
3949 NaN ... http://fastlane.dot.gov/2013/05/new-locomotive...
3950 NaN ... http://studentaid.ed.gov/repay-loans/understan...
3951 NaN ... http://doggett.house.gov/index.php/news/571-do...
3952 NaN ... http://doggett.house.gov/index.php/news/571-do...
3953 1.368836e+09 ... NaN
3954 NaN ... http://inws.wrh.noaa.gov/weather/alertinfo/103...
3955 NaN ... http://pld.dpi.wi.gov/files/pld/images/LinkWI.png
3956 NaN ... http://www.doe.gov/articles/energy-department-...
3957 NaN ... http://www.jpl.nasa.gov/news/news.php?release=...
3958 NaN ... http://science.nasa.gov/media/medialibrary/201...
[3959 rows x 18 columns]
>>> frame['tz'][:10]
0 America/Los_Angeles
1
2 America/Phoenix
3 America/Chicago
4
5 America/Indianapolis
6 America/Chicago
7
8 Australia/NSW
9
Name: tz, dtype: object
獲取數量前十的時區:
value_counts() 統計元素個數
>>> tz_counts = frame['tz'].value_counts()
>>> tz_counts[:10]
America/New_York 903
America/Chicago 686
636
America/Los_Angeles 421
America/Puerto_Rico 184
Asia/Tokyo 102
America/Denver 89
Europe/London 85
America/Indianapolis 50
America/Phoenix 40
Name: tz, dtype: int64
替代填補缺失值:
如果frame['tz‘]不存在,則填充為missing,frame['tz']是個空白字串,表示沒有獲取到使用者資訊
>>> clean_tz = frame['tz'].fillna('missing')
>>> clean_tz[clean_tz == ''] = 'Unknow'
>>> tz_count = clean_tz.value_counts()
>>> tz_count[:10]
America/New_York 903
America/Chicago 686
Unknow 636
America/Los_Angeles 421
America/Puerto_Rico 184
missing 120
Asia/Tokyo 102
America/Denver 89
Europe/London 85
America/Indianapolis 50
Name: tz, dtype: int64
繪製水平條形圖
>>> tz_count[:10].plot(kind='barh',rot=0)
解析Agent字串
for迴圈取出frame中的a列資料,通過空格符分隔並獲取分隔後的第一個字串
>>> result = Series(x.split()[0] for x in frame.a.dropna())
>>> result[:5]
0 Mozilla/5.0
1 Mozilla/4.0
2 Mozilla/5.0
3 Mozilla/5.0
4 Opera/9.80
dtype: object
如果包含 windows 字元,就分為windiws組,反之Not Windows
>>> cframe = frame[frame.a.notnull()]
>>> operating_system = np.where(cframe['a'].str.contains('Windows'),'Windows','Not Windiws')
>>> operating_system[:10]
array(['Not Windiws', 'Windows', 'Windows', 'Not Windiws', 'Not Windiws',
'Windows', 'Windows', 'Not Windiws', 'Not Windiws', 'Windows'],
dtype='|S11')
unstack()用於對計算結果進行重塑
>>> by_tz_os = cframe.groupby(['tz',operating_system])
>>> agg_counts = by_tz_os.size().unstack().fillna(0)
>>> agg_counts[:10]
Not Windiws Windows
tz
484.0 152.0
Africa/Cairo 0.0 3.0
Africa/Casablanca 0.0 1.0
Africa/Ceuta 4.0 2.0
Africa/Gaborone 0.0 1.0
Africa/Johannesburg 2.0 0.0
America/Anchorage 5.0 3.0
America/Argentina/Buenos_Aires 4.0 7.0
America/Argentina/Catamarca 1.0 0.0
America/Argentina/Cordoba 0.0 2.0
構建間接索引進行統計
>>> indexer = agg_counts.sum(1).argsort()
>>> indexer[:10]
tz
55
Africa/Cairo 101
Africa/Casablanca 100
Africa/Ceuta 36
Africa/Gaborone 97
Africa/Johannesburg 42
America/Anchorage 43
America/Argentina/Buenos_Aires 44
America/Argentina/Catamarca 47
America/Argentina/Cordoba 50
dtype: int64
>>> count_subset = agg_counts.take(indexer)[-10:]
>>> count_subset
Not Windiws Windows
tz
America/Phoenix 22.0 18.0
America/Indianapolis 29.0 21.0
Europe/London 62.0 23.0
America/Denver 41.0 48.0
Asia/Tokyo 88.0 14.0
America/Puerto_Rico 93.0 91.0
America/Los_Angeles 207.0 214.0
484.0 152.0
America/Chicago 343.0 343.0
America/New_York 550.0 353.0
生成條形堆積圖
>>> count_subset.plot(kind = 'barh',stacked = True)
比例分佈
>>> normed_subset = count_subset.div(count_subset.sum(1),axis=0)
>>> normed_subset.plot(kind='barh',stacked = True)