Learning Python from Scratch, Study Notes: the pandas Series Section

阿新 • • Published: 2019-01-05

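Series

A Series can be thought of as an advanced version of Python's lists and tuples. Why advanced? Because a Series resembles a one-dimensional array and broadcasts well: it can be combined with a scalar in arithmetic, and it can be passed to element-wise functions. For example:

# A list cannot do arithmetic with a scalar (* raises no error, but it means repetition)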
ls1 =[1,4,5]
ls1+10

Traceback (most recent call last):
File "F:/pycharmPro/pandas/series.py", line 11, in <module>
ls1+10
TypeError: can only concatenate list (not "int") to list

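Adding the constant 10 to a list raises an error saying that a list cannot be concatenated with an integer. On lists, "+" is the concatenation operator.

import pandas as pd
# Convert the list to a Series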
series1 = pd.Series(ls1)
print(series1+10)

0 11
1 14
2 15
dtype: int64

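Once the list above is converted to a Series, the operation completes normally. This is the broadcasting ability of a Series. Likewise, a list cannot be passed to an element-wise math function. Compare:

# A list cannot be used with element-wise math functions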
ls2 = [1,3,8]
print(pow(ls2,2))

Traceback (most recent call last):
File "F:/pycharmPro/pandas/series.py", line 18, in <module>

print(pow(ls2,2))
TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

# Convert the list to a Series
series2 = pd.Series(ls2)
print(pow(series2,2))

0 1
1 9
2 64
dtype: int64
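
To make the contrast between list repetition and Series broadcasting concrete, here is a minimal sketch. The variable names (`values`, `repeated`, `doubled`, `roots`) are made up for illustration and are not from the original notes.

```python
import numpy as np
import pandas as pd

values = [1, 4, 5]

# On a list, * repeats the list instead of multiplying each element
repeated = values * 2

# On a Series, * is broadcast to every element
series = pd.Series(values)
doubled = series * 2

# NumPy element-wise functions also accept a Series directly
roots = np.sqrt(pd.Series([1, 4, 9]))

print(repeated)          # [1, 4, 5, 1, 4, 5]
print(doubled.tolist())  # [2, 8, 10]
print(roots.tolist())    # [1.0, 2.0, 3.0]
```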

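Beyond the features above, here are other common uses of a Series: indexing, membership tests, deduplication, sorting, counting, sampling, and statistical operations.

Series indexing:

Because a Series is an extended version of a list, it has a list-like set of indexing methods, as follows: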

# Positional indexing
import numpy as np
np.random.seed(1)
s1 = pd.Series(np.random.randint(size=5,low=1,high=10))
print(s1,'\n')
print(s1[0],'\n')  # first element
print(s1[1:3],'\n')  # 2nd to 3rd elements
print(s1[::2],'\n')  # every second element (step of 2)

0 6
1 9
2 6
3 1
4 1
dtype: int32

6

1 9
2 6
dtype: int32

0 6
2 6
4 1
dtype: int32

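Taking elements from the end is less convenient with a Series: on the default integer index, s1[-3] is looked up as a label rather than a position, so it fails. We recommend the iat accessor instead, which works well on both Series and DataFrames and is concise and fast.

print(s1.iat[-3],'\n')  # third element from the end
print(s1[-3:],'\n')  # the third element from the end and everything after it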

6

2 6
3 1
4 1
dtype: int32
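
The difference between label lookup and positional access can be sketched as follows. This is an illustrative example with assumed variable names, using `iloc` (the positional indexer) alongside `iat`.

```python
import pandas as pd

s = pd.Series([6, 9, 6, 1, 1])  # default integer index 0..4

# Plain s[-1] is treated as a label lookup on an integer index, so it fails
try:
    s[-1]
    negative_label_ok = True
except KeyError:
    negative_label_ok = False

last_by_position = s.iloc[-1]  # position-based, accepts negative positions
third_from_end = s.iat[-3]     # scalar access by position
tail = s.iloc[-3:]             # the last three elements

print(negative_label_ok, last_by_position, third_from_end, tail.tolist())
```

`iat` only returns a single scalar, while `iloc` also accepts slices and lists of positions, so the two complement each other.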

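In practice, however, elements are rarely retrieved by positional index (subscript). With a Series of 1,000 elements, you cannot count positions one by one to find the values in a given range. A Series offers another way to index: boolean indexing. Usage is as follows: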

np.random.seed(23)
s1 = pd.Series(np.random.randint(size=5,low=1,high=100))
print(s1)
print(s1[s1>=70])  # values greater than or equal to 70
print(s1[(s1>=40) & (s1<=50)])  # values between 40 and 50; combine the conditions with & in one mask

0 84
1 41
2 74
3 55
4 32
dtype: int32
0 84
2 74
dtype: int32
1 41
dtype: int32
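
Boolean masks can be combined with `&`, `|`, and `~`. The sketch below reuses the five values printed above; the `between` method and the variable names are additions for illustration, not part of the original notes.

```python
import pandas as pd

s = pd.Series([84, 41, 74, 55, 32])

# Each condition needs its own parentheses, because & binds tighter than >=
in_range = s[(s >= 40) & (s <= 50)]

# between() is inclusive on both ends by default
in_range_between = s[s.between(40, 50)]

# ~ negates a mask: keep everything outside 40..80
outside = s[~s.between(40, 80)]

print(in_range.tolist())          # [41]
print(in_range_between.tolist())  # [41]
print(outside.tolist())           # [84, 32]
```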

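How do you test whether the elements of one vector are contained in another? For a one-dimensional NumPy array, the in1d function does this; for a Series, the isin method does.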

arr1 = np.array([1,2,3,4])
arr2 = np.array([10,20,3,40])
print(np.in1d(arr1,arr2),'\n')

s3 =pd.Series(['A','B','C','D'])
s4 = pd.Series(['X','A','Y','D'])
print(s3.isin(s4))
print(np.in1d(s3,s4),'\n')

[False False True False]

0 True
1 False
2 False
3 True
dtype: bool
[ True False False True]

NumPy's in1d function can also be used for membership comparison on Series.
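
As a hedged aside, NumPy also provides `np.isin`, which newer NumPy releases recommend over `in1d`. The sketch below shows it next to the Series `isin` method; the variable names `mask`, `common`, and `not_in` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([10, 20, 3, 40])

# Element-wise membership test: is each element of arr1 found in arr2?
mask = np.isin(arr1, arr2)
common = arr1[mask]

# The boolean result of isin can be negated to keep non-members
s3 = pd.Series(['A', 'B', 'C', 'D'])
not_in = s3[~s3.isin(['X', 'A', 'Y', 'D'])]

print(mask.tolist())    # [False, False, True, False]
print(common.tolist())  # [3]
print(not_in.tolist())  # ['B', 'C']
```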

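Suppose you have a Series holding a discrete variable and want to see which levels it contains and how often each level occurs. How do you do that?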

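# Deduplication and level counts
np.random.seed(10)
s= np.random.randint(size=1000,low=1,high=4)
# Deduplicate
print(pd.unique(s),'\n')
# Count each level (recent pandas versions deprecate the top-level pd.value_counts
# in favor of pd.Series(s).value_counts())
print(pd.value_counts(s))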

[2 1 3]

3 342
2 334
1 324
dtype: int64

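The unique function deduplicates the data and returns the distinct levels. The value_counts function counts each level and presents the result in descending order of frequency.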
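
The same operations are also available as Series methods. The following sketch uses a small made-up Series and adds `nunique` and `normalize=True`, which are not covered in the original notes.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])

levels = s.unique()                       # distinct values, in order of first appearance
n_levels = s.nunique()                    # number of distinct values
counts = s.value_counts()                 # absolute frequencies, descending
shares = s.value_counts(normalize=True)   # relative frequencies instead of counts

print(list(levels), n_levels)
print(counts.to_dict())
```

Relative frequencies are often more useful than raw counts when comparing Series of different lengths.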

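Sometimes a Series needs to be sorted in ascending or descending order. This does not come up often for a Series on its own, but sorting is used heavily on DataFrames. First, here is how to sort a Series: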

# Sorting a Series (the sort methods default to ascending order)
np.random.seed(1)
s= pd.Series(np.random.normal(size=4))
# Sort by index
print(s.sort_index(ascending=False),'\n')  # descending by index
# Sort by value
print(s.sort_values())  # ascending by the actual values

3 -1.072969
2 -0.528172
1 -0.611756
0 1.624345
dtype: float64

3 -1.072969
1 -0.611756
2 -0.528172
0 1.624345
dtype: float64
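
A few variations on sorting, as an illustrative sketch: a descending sort by value, an index sort on string labels, and `nlargest` for picking the top values without a full sort. The data and names are made up for the example.

```python
import pandas as pd

s = pd.Series([3.2, -1.0, 0.5, 7.1], index=['d', 'a', 'c', 'b'])

by_value_desc = s.sort_values(ascending=False)  # largest value first
by_index = s.sort_index()                       # labels in alphabetical order
top2 = s.nlargest(2)                            # the two largest values

# Sorting returns a new Series; the original order is unchanged
print(by_value_desc.index.tolist())  # ['b', 'd', 'c', 'a']
print(by_index.index.tolist())       # ['a', 'b', 'c', 'd']
print(top2.tolist())                 # [7.1, 3.2]
print(s.index.tolist())              # ['d', 'a', 'c', 'b']
```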

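# Use value_counts to find the most frequent value and its count
s = pd.Series([1, 20, 20, 4, 100, 86, 66])
result = s.value_counts()  # count the occurrences of each value
print(result)  # the resulting counts, most frequent first
value = result.index[0]
print('Most frequent value:', value)
count = result.iloc[0]
print('Highest count:', count)

20 2
86 1
100 1
4 1
66 1
1 1
dtype: int64
Most frequent value: 20
Highest count: 2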
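
An alternative way to get the same two numbers, sketched under the assumption that a single most frequent value exists, is `idxmax` together with `max`, or the `mode` method. These calls are additions for illustration.

```python
import pandas as pd

s = pd.Series([1, 20, 20, 4, 100, 86, 66])
counts = s.value_counts()

most_common = counts.idxmax()     # label (value) with the highest count
most_common_count = counts.max()  # the highest count itself
modes = s.mode().tolist()         # mode() returns every value tied for first place

print(most_common, most_common_count, modes)  # 20 2 [20]
```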

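Sampling is another common technique in data analysis: draw a certain number of samples from the population to infer population-level properties, or split the data into two parts by sampling, one for modeling and one for testing. pandas provides the sample method for this task.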

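s.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
n: the number of samples to draw;
frac: the fraction of the data to draw;
replace: whether to sample with replacement; the default is without replacement;
weights: the probability of each element being drawn; the default is equal probability;
random_state: the random seed for the sampling;

# Randomly draw 3 lucky winners from 1...100
s = pd.Series(range(1,101))
print(s.sample(n=3,random_state=2),'\n')

# Randomly draw 3 lucky winners from 1...5, with replacement
s= pd.Series(range(1,6))
print(s.sample(n=3,replace=True,random_state=2),'\n')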

83 84
30 31
56 57
dtype: int32

0 1
0 1
3 4
dtype: int32

s = pd.Series(['male','female'])
data =s.sample(n=10,replace=True,weights=[0.2,0.8],random_state=3)
print(data)

1 female
1 female
1 female
1 female
1 female
1 female
0 male
1 female
0 male
1 female
dtype: object

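Because the population consists of only two values, male and female, drawing 10 samples requires sampling with replacement. The two values are also drawn with unequal probability: female is drawn with probability 0.8.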
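
The modeling/testing split mentioned at the start of this section can be sketched with `sample` and `drop`. This is one possible approach under assumed names (`data`, `train`, `test`) and an assumed 80/20 ratio, not a method given in the original notes.

```python
import pandas as pd

data = pd.Series(range(1, 11))

# Draw 80% of the rows without replacement for modeling
train = data.sample(frac=0.8, random_state=0)

# Everything not drawn becomes the test part
test = data.drop(train.index)

print(len(train), len(test))  # 8 2
```

Because the sampling is without replacement and `drop` removes exactly the sampled index labels, the two parts never overlap and together cover the whole Series.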

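Statistical operations
pandas provides a richer set of statistical functions than numpy, and it also offers describe, a summary function similar to summary in R.

# Summarize a Series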
np.random.seed(1234)
s =pd.Series(np.random.randint(size=100,low=10,high=30))
detail =s.describe()
print(detail)

count 100.000000
mean 20.360000
std 5.670266
min 10.000000
25% 15.750000
50% 21.000000
75% 25.000000
max 29.000000
dtype: float64

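Here count is the number of non-missing elements in the Series. How do you tell whether an element is missing? Use the isnull function.

# Detecting missing values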
s= pd.Series([1,2,np.nan,4,np.nan,6])
print(s,'\n')
print(s.isnull())

0 1.0
1 2.0
2 NaN
3 4.0
4 NaN
5 6.0
dtype: float64

0 False
1 False
2 True
3 False
4 True
5 False
dtype: bool

Beyond these, here is a list of commonly used statistical functions:

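s.min() # minimum

s.quantile(q=[0,0.25,0.5,0.75,1]) # quantiles

s.median() # median

s.mode() # mode

s.mean() # mean

s.mad() # mean absolute deviation (removed in pandas 2.0)

s.max() # maximum

s.sum() # sum

s.std() # standard deviation

s.var() # variance

s.skew() # skewness

s.kurtosis() # kurtosis

s.cumsum() # cumulative sum, returns a Series

s.cumprod() # cumulative product, returns a Series

s.product() # product of all elements

s.diff() # first-order difference between consecutive elements, returns a Series

s.abs() # absolute value, returns a Series

s.pct_change() # percentage change, returns a Series

s.corr(s2) # correlation coefficient with another Series s2

s.ptp() # range (max minus min), compare range in R (removed in pandas 1.0)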
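
On recent pandas versions where the mad and ptp methods are no longer available, the same quantities can be computed by hand. The sketch below is one way to do that; the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 10])

# Mean absolute deviation: average distance of each value from the mean
mean_abs_dev = (s - s.mean()).abs().mean()

# Range: largest value minus smallest value
value_range = s.max() - s.min()

# The same range via NumPy, applied to the underlying array
value_range_np = np.ptp(s.to_numpy())

print(mean_abs_dev, value_range, value_range_np)
```

Here the mean is 4, the absolute deviations are 3, 2, 1, 0 and 6, so the mean absolute deviation is 12 / 5 = 2.4, and the range is 10 - 1 = 9.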

Source: http://mp.weixin.qq.com/s/VwdF5u-FouTPRWg6sHAwqA
