
How to read CSV data with pandas

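This article shows how to use pandas to read data in CSV format.
CSV here means that the elements are separated by commas, whatever the file extension is, for example .txt or .data.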

#x.txt

1,2,3
4,5,6

----------------------------------------------------------
import pandas as pd

column_name=['A','B','C']   # column labels, since x.txt has no header row
t=pd.read_csv('./x.txt',names=column_name)
print(t)

>>
   A  B  C
0  1  2  3
1  4  5  6
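
To try the `names` parameter without creating a file, the sketch below reads the same two rows from an in-memory string. The use of `io.StringIO` and the variable names are assumptions made for illustration and are not part of the original example. It also shows that, when `names` is omitted, pandas takes the first row as the header.

```python
import io
import pandas as pd

# Same contents as x.txt above, held in memory (illustrative stand-in for the file)
raw = "1,2,3\n4,5,6\n"

# With names=..., every row is treated as data
with_names = pd.read_csv(io.StringIO(raw), names=['A', 'B', 'C'])
print(with_names.shape)               # (2, 3)

# Without names, the first row is consumed as the header
default_header = pd.read_csv(io.StringIO(raw))
print(list(default_header.columns))   # ['1', '2', '3']
print(default_header.shape)           # (1, 3)
```

For a file that has no header row, passing `names` (or `header=None`) keeps the first data row from being lost.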

1. Import the pandas package

import pandas as pd

2. Read the file with the read_csv function

import numpy as np

train=pd.read_csv('./Datasets/Breast-Cancer/breast-cancer-train.csv')
test=pd.read_csv('./Datasets/Breast-Cancer/breast-cancer-test.csv')
print(np.shape(train))
print(type(train))

>> (175, 4)
>> <class 'pandas.core.frame.DataFrame'>

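After reading, the data is stored in train, but its type is a pandas DataFrame rather than the NumPy array we usually work with. np.array(train) converts it to an array, after which it can be handled like any matrix.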
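
As a minimal sketch of that conversion, the snippet below builds a tiny table in memory and turns it into an array. The three data rows and the first column name `id` are made up for illustration; only the column names `Clump Thickness`, `Cell Size` and `Type` come from the code later in this article.

```python
import io
import numpy as np
import pandas as pd

# Tiny made-up table standing in for the training file (values are illustrative only)
raw = "id,Clump Thickness,Cell Size,Type\n0,1,1,0\n1,8,7,1\n2,5,4,1\n"
train = pd.read_csv(io.StringIO(raw))

train_data = np.array(train)    # DataFrame -> ndarray; column names are dropped
print(type(train_data))         # <class 'numpy.ndarray'>
print(train_data.shape)         # (3, 4)
print(train_data[:, 1:3])       # columns 1 and 2, selected by position
```

Once the data is an array, columns can only be addressed by position, which is why the array version below uses slices such as `[:,1:3]`. In recent pandas versions `train.to_numpy()` gives the same array.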

3. Fit the data

3.1 Convert to array and process

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

train=pd.read_csv('./Datasets/Breast-Cancer/breast-cancer-train.csv')
test=pd.read_csv('./Datasets/Breast-Cancer/breast-cancer-test.csv')
train_data = np.array(train)
test_data = np.array(test)


X_train = train_data[:,1:3]   # columns 1 and 2 are the training features
y_train = train_data[:,3]     # column 3 is the label
X_test = test_data[:,1:3]
y_test = test_data[:,3]

p_index = np.where(train_data[:,3]==1)[0]   # indices of all positive samples
n_index = np.where(train_data[:,3]==0)[0]   # indices of all negative samples
positive = X_train[p_index,:]   # all positive samples
negative = X_train[n_index,:]   # all negative samples

plt.scatter(negative[:,0],negative[:,1],marker='o',s=200,c='red')   # plot the sample points
plt.scatter(positive[:,0],positive[:,1],marker='x',s=150,c='black')
plt.show()

lr=LogisticRegression()
lr.fit(X_train,y_train)
print(lr.score(X_test,y_test))
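
The step that separates positive and negative samples can be tried on its own. The sketch below uses four made-up rows laid out like the array above (index, two features, label) and compares `np.where` with a plain boolean mask; the mask variant is an alternative shown for illustration, not the method used in this article.

```python
import numpy as np

# Made-up rows laid out as in the array version: [index, feature 1, feature 2, label]
train_data = np.array([[0, 1, 1, 0],
                       [1, 8, 7, 1],
                       [2, 2, 1, 0],
                       [3, 5, 4, 1]])
X_train = train_data[:, 1:3]
y_train = train_data[:, 3]

# np.where returns the row indices where the condition holds
p_index = np.where(y_train == 1)[0]
positive = X_train[p_index, :]

# A boolean mask selects the same rows without the intermediate index array
positive_mask = X_train[y_train == 1]
negative_mask = X_train[y_train == 0]

print(p_index)          # [1 3]
print(positive_mask)
```

Both forms pick the same rows; the index array from `np.where` is mainly useful when the positions themselves are needed later.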

3.2 Process with the DataFrame directly

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

train=pd.read_csv('./Datasets/Breast-Cancer/breast-cancer-train.csv')
test=pd.read_csv('./Datasets/Breast-Cancer/breast-cancer-test.csv')

negative=train.loc[train['Type']==0][['Clump Thickness','Cell Size']]
positive=train.loc[train['Type']==1][['Clump Thickness','Cell Size']]
plt.scatter(negative['Clump Thickness'],negative['Cell Size'],\
            marker='o',s=200,c='red')
plt.scatter(positive['Clump Thickness'],positive['Cell Size'],\
            marker='x',s=150,c ='black')
plt.show()


X_train=train[['Clump Thickness','Cell Size']]
y_train=train['Type']
X_test=test[['Clump Thickness','Cell Size']]
y_test=test['Type']

lr=LogisticRegression()
lr.fit(X_train,y_train)
print(lr.score(X_test,y_test))


Reference:

Python機器學習及實踐 (Python Machine Learning and Practice)