
Data Preprocessing for Machine Learning: Reading Excel Data with pandas


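Python has many libraries for reading and writing Excel files, the best known being xlrd, xlwt, xlutils, and openpyxl. xlrd and xlwt are usually used together, one for reading and the other for writing. xlutils combined with xlrd can modify an existing Excel file. openpyxl can both read and write Excel files.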

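When it comes to data preprocessing, however, pandas shows its strength: it can read and write many file formats, Excel among them. This article focuses on using pandas to preprocess datasets stored in Excel.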

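Commonly used machine learning models place requirements on their input. The most basic one, for most algorithms, is that the training data be converted to numeric form. Some algorithms, such as decision trees, do not need this conversion; those special cases are not discussed here.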
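As a quick illustration of the numeric-input requirement, the sketch below lists the columns of a DataFrame that are not yet numeric. The DataFrame and its column names are made up for this example.

```python
import pandas as pd

# Hypothetical data: one numeric feature and one text label column.
df = pd.DataFrame({
    "height": [1.70, 1.82, 1.65],
    "label": ["a", "b", "a"],
})

# Columns whose dtype is not numeric still need to be encoded
# before they can be fed to most learning algorithms.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```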

The pandas function for reading Excel files is pandas.read_excel(). Its main parameters include:

io : the location of the Excel document to read,

string, path object (pathlib.Path or py._path.local.LocalPath),

file-like object, pandas ExcelFile, or xlrd workbook. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/workbook.xlsx
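The sketch below shows three of the input types listed for io: a string path, a pathlib.Path, and a file-like object. It assumes an Excel engine such as openpyxl is installed; the file name demo.xlsx and the sample data are hypothetical.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

# Small made-up dataset used only for this illustration.
df = pd.DataFrame({"Name": ["a", "b"], "Value": [1, 2]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "demo.xlsx"  # hypothetical file name
    df.to_excel(path, index=False)

    from_str = pd.read_excel(str(path))      # plain string path
    from_path = pd.read_excel(path)          # pathlib.Path object
    buffer = io.BytesIO(path.read_bytes())   # in-memory file-like object
    from_buffer = pd.read_excel(buffer)
```

All three calls should return the same table, so the choice mostly depends on where the workbook comes from.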

sheet_name : the sheet (or sheets) of the Excel file to read

string, int, mixed list of strings/ints, or None, default 0

Strings are used for sheet names, Integers are used in zero-indexed sheet positions.

Lists of strings/integers are used to request multiple sheets.

Specify None to get all sheets.

str|int -> DataFrame is returned. list|None -> Dict of DataFrames is returned, with keys representing sheets.

Available Cases

  • Defaults to 0 -> 1st sheet as a DataFrame
  • 1 -> 2nd sheet as a DataFrame
  • "Sheet1" -> 1st sheet as a DataFrame
  • [0, 1, "Sheet5"] -> 1st, 2nd & 5th sheet as a dictionary of DataFrames
  • None -> All sheets as a dictionary of DataFrames
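The cases above can be sketched as follows. This is a minimal example that assumes an Excel engine such as openpyxl is available; the workbook name, sheet names, and data are invented for the illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "book.xlsx"  # hypothetical workbook
    # Write a workbook with two sheets.
    with pd.ExcelWriter(path) as writer:
        pd.DataFrame({"x": [1, 2]}).to_excel(writer, sheet_name="Sheet1", index=False)
        pd.DataFrame({"y": [3, 4]}).to_excel(writer, sheet_name="Sheet2", index=False)

    first = pd.read_excel(path)                           # default: sheet_name=0
    second = pd.read_excel(path, sheet_name="Sheet2")     # select by name
    some = pd.read_excel(path, sheet_name=[0, "Sheet2"])  # dict keyed by the requested items
    everything = pd.read_excel(path, sheet_name=None)     # dict keyed by sheet name
```

Note that a list or None yields a dictionary, so individual sheets are then accessed as `some[0]` or `everything["Sheet1"]`.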

header : which row of the sheet, if any, to use as the column names

int, list of ints, default 0

Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row positions will be combined into a MultiIndex. Use None if there is no header.

names : the name of each column, passed as an array-like

   array-like, default None

List of column names to use. If file contains no header row, then you should explicitly pass header=None
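A small sketch of header and names working together on a sheet that has no header row. It assumes an Excel engine such as openpyxl is installed; the file name and the rows are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "no_header.xlsx"  # hypothetical file name
    # Write raw rows only: no header row and no index column.
    raw = pd.DataFrame([["string1", 1], ["string2", 2]])
    raw.to_excel(path, header=False, index=False)

    # header=None: no row is consumed as labels, columns are numbered 0, 1, ...
    no_header = pd.read_excel(path, header=None)
    # names supplies the column labels explicitly.
    named = pd.read_excel(path, header=None, names=["Name", "Value"])
```

Without header=None, the first data row would be consumed as column labels and silently disappear from the data.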

index_col : which column of the sheet, if any, to use as the row labels

   int, list of ints, default None

Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a MultiIndex. If a subset of data is selected with usecols, index_col is based on the subset.
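To illustrate index_col, the sketch below reads the same file with and without it. The id column, file name, and values are assumptions made for this example, and an Excel engine such as openpyxl is assumed to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"id": [101, 102], "Name": ["a", "b"], "Value": [1, 2]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "indexed.xlsx"  # hypothetical file name
    df.to_excel(path, index=False)

    plain = pd.read_excel(path)               # default integer row labels 0, 1, ...
    by_id = pd.read_excel(path, index_col=0)  # first column becomes the row labels
```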

usecols : the columns to read; a loaded Excel file often contains columns that are not needed

    int or list, default None

  • If None then parse all columns,
  • If int then indicates last column to be parsed (this form was deprecated in pandas 0.24; a list of ints is the safer choice)
  • If list of ints then indicates list of column numbers to be parsed
  • If string then indicates comma separated list of Excel column letters and column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of both sides.
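The two most common forms of usecols, Excel column letters and integer positions, might look like this. The column names c1 to c6 and the file name are invented for the example, and an Excel engine such as openpyxl is assumed to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Six columns, which land in Excel columns A through F when written without an index.
df = pd.DataFrame({f"c{i}": [i] for i in range(1, 7)})

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "wide.xlsx"  # hypothetical file name
    df.to_excel(path, index=False)

    by_letters = pd.read_excel(path, usecols="A,C,E:F")  # letters and an inclusive range
    by_position = pd.read_excel(path, usecols=[0, 2])    # zero-based column positions
```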

Below are some examples of reading Excel data with pandas:

Write a dataset to an Excel file:

>>> import pandas as pd
>>> df_out = pd.DataFrame([('string1', 1),
...                        ('string2', 2),
...                        ('string3', 3)],
...                       columns=['Name', 'Value'])
>>> df_out
      Name  Value
0  string1      1
1  string2      2
2  string3      3
>>> df_out.to_excel('tmp.xlsx')

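Read the Excel file back, using the first column (the index written by to_excel) as the row labels:

>>> pd.read_excel('tmp.xlsx', index_col=0)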
      Name  Value
0  string1      1
1  string2      2
2  string3      3

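Setting both index_col and header to None means the first row and first column of the sheet are not used as the column labels and the index; they are read as ordinary data:

>>> pd.read_excel('tmp.xlsx', index_col=None, header=None)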
     0        1      2
0  NaN     Name  Value
1  0.0  string1      1
2  1.0  string2      2
3  2.0  string3      3

You can even specify the data type of each column:

>>> pd.read_excel('tmp.xlsx', index_col=0, dtype={'Name': str, 'Value': float})
      Name  Value
0  string1    1.0
1  string2    2.0
2  string3    3.0

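Here is a combined example: read the sheet1 sheet of text.xlsx and load only columns D:F. Column F holds the class label (the 類別, or "category", column), so its values 類別1 and 類別2 need to be converted to numbers before the data can be used as input for a machine learning model.

import pandas as pd

def reader(path, sheet):
    # Load only Excel columns D through F from the given sheet
    return pd.read_excel(path, sheet_name=sheet, usecols="D:F")

trainrd = reader("text.xlsx", "sheet1")
trainrd.head(5)  # inspect the first 5 rows
trainrd["x"] = 0  # create a new column x
trainrd.loc[trainrd["類別"] == "類別1", "x"] = 0  # convert the text labels in the 類別 column to numbers
trainrd.loc[trainrd["類別"] == "類別2", "x"] = 1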
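As an alternative to assigning each label with a separate .loc statement, the same conversion can be sketched with Series.map and a dictionary. This is not the article's own method, just a compact variant; the in-memory DataFrame below is a made-up stand-in for the data loaded from the spreadsheet, and the feature column is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the rows read from the Excel file.
trainrd = pd.DataFrame({
    "feature": [0.5, 1.2, 3.3],
    "類別": ["類別1", "類別2", "類別1"],
})

# One dictionary holds the whole label-to-number mapping.
label_codes = {"類別1": 0, "類別2": 1}
trainrd["x"] = trainrd["類別"].map(label_codes)
print(trainrd)
```

Labels missing from the dictionary become NaN in the new column, which makes unexpected categories easy to spot with `trainrd["x"].isna()`.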
