
Python Data Analysis 3


Section overview

  Introduction to pandas

Installation

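pip install pandas

The examples in this section assume the following imports:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame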

pandas has two main data structures: DataFrame and Series.

Series

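A Series is a one-dimensional array-like object made up of a set of data and an associated set of data labels (its index). The simplest Series is built from just an array of data: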

obj = Series([4, 7, 9, -1])
print(obj)

0    4
1    7
2    9
3   -1
dtype: int64

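The string representation of a Series shows the index on the left and the values on the right. When no index is specified, a default integer index from 0 to N-1 is created automatically.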

obj = Series([4, 7, 9, -1])
print(obj.values)
print(obj.index)

[ 4  7  9 -1]
RangeIndex(start=0, stop=4, step=1)

Creating a Series with a custom index

obj = Series([4, 7, 9, -1], index=['a', 'b', 'c', 'd'])
print(obj)
print(obj.index)

a    4
b    7
c    9
d   -1
dtype: int64
Index(['a', 'b', 'c', 'd'], dtype='object')

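Selecting values by index

obj['a']           # ==> 4
obj['c']           # ==> 9
obj[['a', 'd']]    # ==> a Series holding 4 and -1 (pass a list of labels)

NumPy array operations preserve the link between index and value: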

obj[obj>2]

a    4
b    7
c    9
dtype: int64

obj*2

a     8
b    14
c    18
d    -2
dtype: int64
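As a small sketch of the point above, the following applies a boolean filter and a NumPy ufunc to the same Series and checks that the labels travel with the values. The variable names positive and squared are made up for this illustration.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, 7, 9, -1], index=["a", "b", "c", "d"])

# Boolean filtering keeps the labels of the values that survive.
positive = obj[obj > 0]

# A NumPy ufunc applied to a Series returns a Series with the same index.
squared = np.square(obj)

print(positive.index.tolist())
print(squared["d"])
```

Because the index is preserved, the result can still be looked up by label (squared["d"]) rather than by position.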

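A Series can be viewed as an ordered dict, since it maps index values to data values. It can be used in many places that would otherwise expect a dict:

'b' in obj        # ==> True

If the data is stored in a Python dict, a Series can be created directly from it: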

dict_obj = {"a":100,"b":20,"c":50,"d":69}
obj = Series(dict_obj)
obj

a    100
b     20
c     50
d     69
dtype: int64

If a dict is passed together with an index list:

dict_obj = {"a":100,"b":20,"c":50,"d":69}
states = ['LA', 'b', 'a', 'NY']
obj = Series(dict_obj, index=states)
obj

LA      NaN
b      20.0
a     100.0
NY      NaN
dtype: float64

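Entries found in the dict are placed at the matching positions, while labels with no match get NaN (not a number), which marks missing data. The pandas functions isnull and notnull detect missing data: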

pd.isnull(obj)

LA     True
b     False
a     False
NY     True
dtype: bool

Series has equivalent methods:

obj.isnull()

LA     True
b     False
a     False
NY     True
dtype: bool
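A minimal sketch of how the boolean masks from isnull and notnull are typically used, assuming the same dict and index list as above. The names missing_count and present are illustrative only.

```python
import pandas as pd

dict_obj = {"a": 100, "b": 20, "c": 50, "d": 69}
obj = pd.Series(dict_obj, index=["LA", "b", "a", "NY"])

# Summing the boolean mask counts the missing entries.
missing_count = int(obj.isnull().sum())

# Indexing with the notnull mask keeps only the entries that have data.
present = obj[obj.notnull()]

print(missing_count)
print(present.index.tolist())
```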

One of the most important features of Series is that it automatically aligns data by index label in arithmetic operations:

dict_obj = {"a":100,"b":20,"c":50,"d":69}
dict_obj1 = {"e":100,"b":20,"c":50,"f":69}

obj = Series(dict_obj)
obj1 = Series(dict_obj1)

obj+obj1

a      NaN
b     40.0
c    100.0
d      NaN
e      NaN
f      NaN
dtype: float64
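Labels that appear in only one operand produce NaN under the + operator. If that is not wanted, one option is the Series.add method with fill_value, which substitutes a default for the side that is missing. The sketch below reuses the two dicts from above; the names plain and filled are hypothetical.

```python
import pandas as pd

obj = pd.Series({"a": 100, "b": 20, "c": 50, "d": 69})
obj1 = pd.Series({"e": 100, "b": 20, "c": 50, "f": 69})

# The + operator aligns on labels; labels missing from either side give NaN.
plain = obj + obj1

# add() with fill_value treats a label missing from one side as 0.
filled = obj.add(obj1, fill_value=0)

print(int(plain.isnull().sum()))
print(filled["a"], filled["b"])
```

Whether 0 is a sensible default depends on the data; for counts it usually is, for measurements it may hide gaps.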

The name attribute of a Series

obj.name = 'qty'
obj.index.name = 'types'
obj


types
a    100
b     20
c     50
d     69
Name: qty, dtype: int64

A Series index can be changed in place by assignment:

obj.index = ['dandy', 'renee', 'Jeff', 'Steve']
obj

dandy    100
renee     20
Jeff      50
Steve     69
Name: qty, dtype: int64

DataFrame

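A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type. A DataFrame has both a row index and a column index, and can be seen as a dict of Series that all share the same index.

Building a DataFrame

data = {'states': ['NY', 'LA', 'CA', 'BS', 'CA'],
        'year': [2000, 2001, 2002, 2001, 2000],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame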


   pop states  year
0  1.5     NY  2000
1  1.7     LA  2001
2  3.6     CA  2002
3  2.4     BS  2001
4  2.9     CA  2000

Specifying the column order

frame2 = DataFrame(data, columns=['year', 'pop', 'states', 'test'])
frame2

   year  pop states test
0  2000  1.5     NY  NaN
1  2001  1.7     LA  NaN
2  2002  3.6     CA  NaN
3  2001  2.4     BS  NaN
4  2000  2.9     CA  NaN

frame2.columns

Index(['year', 'pop', 'states', 'test'], dtype='object')

# A column that does not exist in the data is filled with NaN

Selecting values:

# Two ways to select a single column
frame2['states']
frame2.year


0    NY
1    LA
2    CA
3    BS
4    CA
Name: states, dtype: object

0    2000
1    2001
2    2002
3    2001
4    2000
Name: year, dtype: int64
# Both forms return a Series

# Set the row index
frame2 = DataFrame(data, columns=['year', 'pop', 'states', 'test'], index=['one', 'two', 'three', 'four', 'five'])
frame2

       year  pop states test
one    2000  1.5     NY  NaN
two    2001  1.7     LA  NaN
three  2002  3.6     CA  NaN
four   2001  2.4     BS  NaN
five   2000  2.9     CA  NaN

Selecting a row by label (ix was removed in pandas 1.0; loc is the label-based replacement)

frame2.loc['three']

year      2002
pop        3.6
states      CA
test       NaN
Name: three, dtype: object

Columns can be modified by assignment:

frame2['test'] = '11'
frame2

       year  pop states test
one    2000  1.5     NY   11
two    2001  1.7     LA   11
three  2002  3.6     CA   11
four   2001  2.4     BS   11
five   2000  2.9     CA   11

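Column operations:

When a list or array is assigned to a column, its length must match the length of the DataFrame. If a Series is assigned, it is aligned exactly to the DataFrame's index, and any gaps are filled with missing values.

val = Series([-1, -2, 3], index=['two', 'one', 'three'])
frame2['test'] = val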

frame2

       year  pop states  test
one    2000  1.5     NY  -2.0
two    2001  1.7     LA  -1.0
three  2002  3.6     CA   3.0
four   2001  2.4     BS   NaN
five   2000  2.9     CA   NaN
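The difference between assigning a list and assigning a Series can be sketched as follows. The small frame and the column names by_position, by_label and bad are invented for this illustration.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001, 2002]},
                     index=["one", "two", "three"])

# A plain list is assigned by position and must match the frame's length.
frame["by_position"] = [10, 20, 30]

# A Series is aligned on index labels; 'three' has no match and becomes NaN.
frame["by_label"] = pd.Series([-1, -2], index=["two", "one"])

# A list of the wrong length is rejected with a ValueError.
try:
    frame["bad"] = [1, 2]
    length_error = None
except ValueError as exc:
    length_error = exc

print(frame)
print(type(length_error).__name__)
```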

Assigning to a column that does not exist creates a new column. del removes a column, much like deleting a key from a Python dict.

frame2['test1'] = frame2.test.notnull()
frame2

       year  pop states  test  test1
one    2000  1.5     NY  -2.0   True
two    2001  1.7     LA  -1.0   True
three  2002  3.6     CA   3.0   True
four   2001  2.4     BS   NaN  False
five   2000  2.9     CA   NaN  False

del frame2['test1']
frame2

       year  pop states  test
one    2000  1.5     NY  -2.0
two    2001  1.7     LA  -1.0
three  2002  3.6     CA   3.0
four   2001  2.4     BS   NaN
five   2000  2.9     CA   NaN

Creating a DataFrame from a nested dict (outer keys become columns, inner keys become the row index)

pop = {
    "dandy":{"age":18, "gender":"male"},
    "elina": {"age": 16, "gender": "female"},
    "renee": {"age": 16, "gender": "female"},
    "taylor": {"age": 18, "gender": "female"},
}
frame3 = DataFrame(pop)

frame3

       dandy   elina   renee  taylor
age       18      16      16      18
gender  male  female  female  female

frame3.T  # transpose

       age  gender
dandy   18    male
elina   16  female
renee   16  female
taylor  18  female
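If the outer keys are meant to be rows from the start, transposing is not the only route. One alternative, shown here as a sketch, is DataFrame.from_dict with orient="index". The names by_column and by_row are hypothetical.

```python
import pandas as pd

pop = {
    "dandy": {"age": 18, "gender": "male"},
    "elina": {"age": 16, "gender": "female"},
    "renee": {"age": 16, "gender": "female"},
    "taylor": {"age": 18, "gender": "female"},
}

# Default constructor: outer keys become columns, inner keys become rows.
by_column = pd.DataFrame(pop)

# orient="index": outer keys become rows, inner keys become columns.
by_row = pd.DataFrame.from_dict(pop, orient="index")

print(by_column.columns.tolist())
print(by_row.columns.tolist())
```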

Creating a DataFrame from a dict of Series:

pdata = {'dandy': frame3['dandy'][:-1],
         'elina': frame3['elina']}
frame4 = DataFrame(pdata)
frame4

       dandy   elina
age       18      16
gender   NaN  female

Setting the name attributes

frame3.index.name = 'detail'
frame3.columns.name = 'name'

frame3

name   dandy   elina   renee  taylor
detail                              
age       18      16      16      18
gender  male  female  female  female

The values attribute

frame3.values  # returns the data in the DataFrame as a two-dimensional ndarray

[[18 16 16 18]
 ['male' 'female' 'female' 'female']]

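Index objects

pandas Index objects hold the axis labels and other metadata (such as the axis names). When a Series or DataFrame is built, any array or other sequence of labels is converted to an Index. Index objects are immutable.

obj = Series(range(3), index=['a', 'b', 'c'])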
Index = obj.index
Index[0]

a

Assigning Index[0] = 'x' raises an error:

TypeError: Index does not support mutable operations

Because an Index is immutable, it can be shared safely between data structures:

Index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2, 0], index=Index)

obj2.index is Index

True
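The immutability described above can be checked directly. This sketch tries an item assignment on an Index, catches the resulting TypeError, and then builds two Series on the same labels. The names labels, s1 and s2 are made up for the example.

```python
import numpy as np
import pandas as pd

labels = pd.Index(np.arange(3))

# Item assignment on an Index is not allowed.
try:
    labels[0] = 10
    error = None
except TypeError as exc:
    error = exc

# The same Index can back several objects without risk of one changing it.
s1 = pd.Series([1.5, -2, 0], index=labels)
s2 = pd.Series([10, 20, 30], index=labels)

print(type(error).__name__)
print(s1.index.equals(s2.index))
```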

Besides being array-like, an Index also behaves like a fixed-size set:

pop = {
    "dandy":{"age":18, "gender":"male"},
    "elina": {"age": 16, "gender": "female"},
    "renee": {"age": 16, "gender": "female"},
    "taylor": {"age": 18, "gender": "female"},
}
frame3 = DataFrame(pop)
'dandy' in frame3.columns

True
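Beyond membership tests, Index offers set-style methods such as union, intersection and difference. The sketch below uses the column labels from above plus a second, invented Index (the label "zoe" is hypothetical).

```python
import pandas as pd

cols = pd.Index(["dandy", "elina", "renee", "taylor"])
other = pd.Index(["renee", "taylor", "zoe"])  # hypothetical second set of labels

# Membership test, as with a Python set.
has_dandy = "dandy" in cols

# Set-style combinations; each returns a new Index.
common = cols.intersection(other)
combined = cols.union(other)
only_in_cols = cols.difference(other)

print(has_dandy)
print(sorted(common))
print(sorted(only_in_cols))
```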


Basic functionality

obj = Series([4, 6, 9.9, 7], index=['a', 'v', 'b', 'd'])
obj

a    4.0
v    6.0
b    9.9
d    7.0
dtype: float64

reindex

obj2 = obj.reindex(['a', 'b', 'c', 'd', 'v'])  # labels that do not exist get NaN

obj2

a    4.0
b    9.9
c    NaN
d    7.0
v    6.0
dtype: float64


# Use fill_value=0 for labels that do not exist
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'v'], fill_value=0)
obj2

a    4.0
b    9.9
c    0.0
d    7.0
v    6.0
dtype: float64

# method: fill gaps while reindexing ('ffill' carries the previous value forward)
obj = Series(['aa', 'bb', 'cc', 'dd'], index=[0, 2, 4, 6])
obj2 = obj.reindex(range(7), method='ffill')
obj2

0    aa
1    aa
2    bb
3    bb
4    cc
5    cc
6    dd
dtype: object
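For comparison, here is a short sketch of forward fill next to backward fill on a similar Series. With 'bfill' the next valid value is pulled backward, so a trailing gap stays empty. The names forward and backward are illustrative.

```python
import pandas as pd

obj = pd.Series(["aa", "bb", "cc"], index=[0, 2, 4])

# ffill: each new label takes the last value seen before it.
forward = obj.reindex(range(6), method="ffill")

# bfill: each new label takes the next value after it; label 5 has none.
backward = obj.reindex(range(6), method="bfill")

print(forward.tolist())
print(backward.loc[1], pd.isna(backward.loc[5]))
```

Both methods require the original index to be sorted (monotonic), which is the case here.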


frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'b', 'c'], columns=['Ohio', 'Texas', 'California'])
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame

   Ohio  Texas  California
a     0      1           2
b     3      4           5
c     6      7           8

frame2

   Ohio  Texas  California
a   0.0    1.0         2.0
b   3.0    4.0         5.0
c   6.0    7.0         8.0
d   NaN    NaN         NaN

text = ['Texas', 'LA', 'California']
frame3 = frame.reindex(columns=text)
frame3

   Texas  LA  California
a      1 NaN           2
b      4 NaN           5
c      7 NaN           8

Rows and columns can be reindexed together, but filling with method only applies along the rows (axis 0):

text = ['Texas', 'LA', 'California']
frame.reindex(index=['a', 'b', 'c', 'd', 'e', 'f'], method='ffill').reindex(columns=text)

   Texas  LA  California
a      1 NaN           2
b      4 NaN           5
c      7 NaN           8
d      7 NaN           8
e      7 NaN           8
f      7 NaN           8

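Older pandas versions could do this more concisely with the label indexing of ix. Since ix was removed in pandas 1.0, current versions can pass both index and columns to reindex instead:

frame.reindex(index=['a', 'b', 'c', 'd'], columns=text)

   Texas  LA  California
a    1.0 NaN         2.0
b    4.0 NaN         5.0
c    7.0 NaN         8.0
d    NaN NaN         NaN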
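A point worth checking when moving old code to loc: the sketch below assumes pandas 1.0 or later, where loc raises a KeyError for a list containing a missing label, while reindex inserts a NaN row instead. The names err and safe are hypothetical.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "b", "c"],
                     columns=["Ohio", "Texas", "California"])

# loc is strict: a label that is not in the index raises KeyError
# (assumption: pandas >= 1.0).
try:
    frame.loc[["a", "d"]]
    err = None
except KeyError as exc:
    err = exc

# reindex is lenient: the missing label 'd' becomes a row of NaN.
safe = frame.reindex(["a", "d"])

print(type(err).__name__)
print(safe)
```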


Dropping entries from an axis

The drop method returns a new object with the specified values removed from the given axis:

obj = Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
new_obj

a    0
b    1
d    3
e    4
dtype: int32

For a DataFrame, index values can be dropped from either axis:

data = DataFrame(np.arange(16).reshape(4, 4),
                 index=['LA', 'UH', 'NY', 'BS'],
                 columns=['one', 'two', 'three', 'four'])
data.drop(['LA', 'BS'])

    one  two  three  four
UH    4    5      6     7
NY    8    9     10    11

# For columns, pass axis=1
data.drop(['one', 'three'], axis=1)

    two  four
LA    1     3
UH    5     7
NY    9    11
BS   13    15
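Since drop returns a new object, the original frame is left untouched. The sketch below checks that, and also uses the index= and columns= keywords of drop as an alternative to axis. The names rows_dropped and cols_dropped are invented for the example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=["LA", "UH", "NY", "BS"],
                    columns=["one", "two", "three", "four"])

# Keyword form: name the axis through index= or columns= instead of axis=.
rows_dropped = data.drop(index=["LA", "BS"])
cols_dropped = data.drop(columns=["one", "three"])

# The original DataFrame still has all four rows and columns.
print(data.shape)
print(rows_dropped.index.tolist())
print(cols_dropped.columns.tolist())
```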

Indexing, selection, and filtering

Series indexing works like NumPy array indexing, except that Series index values are not limited to integers.

obj = Series(np.arange(4), index=['a', 'b', 'c', 'd'])
obj

a    0
b    1
c    2
d    3
dtype: int32

obj['b']  # or obj.iloc[1] to select by position
1


obj[2:4]  # or obj[['c', 'd']]
c    2
d    3
dtype: int32

obj.iloc[[1, 3]]  # positional selection; plain obj[[1, 3]] is deprecated on a non-integer index
b    1
d    3
dtype: int32


obj[obj>2]
d    3
dtype: int32
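One difference from NumPy that is easy to miss: slicing by label includes the end point, while slicing by position does not. A minimal sketch using loc and iloc, with illustrative names by_label and by_position:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=["a", "b", "c", "d"])

# Label slice: both 'b' and 'c' are included.
by_label = obj.loc["b":"c"]

# Position slice: the end position 2 is excluded, as in NumPy.
by_position = obj.iloc[1:2]

print(by_label.tolist())
print(by_position.tolist())
```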