Rookie Diary: "Python Data Analysis and Mining in Action", Lab 6-1, Lagrange Interpolation

Lab 6-1: Lagrange interpolation

Task: use Lagrange interpolation to fill in the null values in the table stored in missing_data.xls.
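As a rough illustration of what Lagrange interpolation computes, here is a minimal pure-Python sketch. The helper name `lagrange_value` and the sample points are made up for this example; the lab itself relies on `scipy.interpolate.lagrange` instead.

```python
def lagrange_value(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        # Basis polynomial l_i(x): equals 1 at xi and 0 at every other node
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total


# Hypothetical sample points taken from y = x**2, with the value at x = 3 "missing"
xs = [1, 2, 4, 5]
ys = [1, 4, 16, 25]
estimate = lagrange_value(xs, ys, 3)
print(estimate)  # should be very close to 9, since the points lie on y = x**2
```

The polynomial passes through every sample point, so the missing value is estimated from its neighbours on both sides.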

# p1, lab6
# Fill all of the null values with Lagrange's interpolation
# Data file name is "missing_data.xls"


import pandas as pd
from scipy.interpolate import lagrange


base_dir = 'F:/Data Mining/codes/ch6/lab6_1'     # named base_dir because dir would shadow the built-in dir()
data = pd.read_excel(base_dir + '/data/missing_data.xls', header=None)   # header=None: the table has no header row


def lagrange_interpolate(s, n, k=5):
    """Estimate s[n] from up to k non-null neighbours on each side of n."""
    labels = list(range(n-k, n)) + list(range(n+1, n+1+k))    # may contain labels outside the index
    y = s.reindex(labels)   # out-of-range labels become null values (s[labels] raises KeyError in newer pandas)
    y = y[y.notnull()]      # y.notnull() returns a boolean Series
    # lagrange(x, w) in module scipy.interpolate:
    # x is an array-like object holding the x-coordinates of a set of points
    # w is an array-like object holding the y-coordinates of the same points
    # returns a numpy.poly1d object representing the Lagrange interpolating polynomial
    # WARNING: this implementation is numerically unstable, do not expect to use more than about 20 points
    # calling the poly1d object with n evaluates the polynomial at x=n
    return lagrange(y.index, list(y))(n)


for col in data.columns:
    is_null = data[col].isnull()    # Series.isnull() returns a boolean Series; computed once per column
    for i in range(len(data)):
        if is_null[i]:
            data.loc[i, col] = lagrange_interpolate(data[col], i)   # .loc[row, column] writes to the DataFrame itself
# mistake I made earlier: leaving out [col] in the condition, which yields a DataFrame rather than a Series


# header=False and index=False write a table without a header row or an index column
# newer pandas versions may no longer be able to write .xls; if so, use a .xlsx file name
data.to_excel(base_dir + '/data/result.xls', header=False, index=False)

[Figures: missing_data.xls (before filling) and result.xls (after filling)]

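What did I learn?

pd.read_excel(header=None) states that the table being read has no header row; otherwise the first row of missing_data.xls would be treated as the header.

df.to_excel(header=False, index=False) writes the table without a header row or an index column; otherwise result.xls would contain a header row and show the index in the leftmost column.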
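The effect of these two options can be seen without an Excel file. The sketch below uses `pd.read_csv` and `DataFrame.to_csv` as stand-ins, on the assumption that their `header` and `index` parameters behave the same way for this purpose; the numbers are made up.

```python
import io

import pandas as pd

raw = "235.8,324.0\n236.2,325.6\n"   # made-up numbers, no header row

with_header = pd.read_csv(io.StringIO(raw))                    # first row is consumed as column names
without_header = pd.read_csv(io.StringIO(raw), header=None)    # every row is kept as data

print(with_header.shape)      # one data row is left
print(without_header.shape)   # both rows are kept

# Writing back without header and index reproduces the bare table
out = without_header.to_csv(header=False, index=False)
print(out)
```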

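  • What isnull() and notnull() return

Both are methods of DataFrame and Series that test for null values, and each returns a boolean object of the same type and shape as the caller. isnull() marks null positions as True and everything else as False; notnull() marks null positions as False and everything else as True.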
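A small sketch of the two masks on a made-up Series, including the boolean-indexing step used to drop null values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])   # hypothetical data with one missing value

null_mask = s.isnull()       # True where the value is missing
present_mask = s.notnull()   # True where the value is present

print(null_mask.tolist())
print(present_mask.tolist())
print(s[present_mask].tolist())   # boolean indexing keeps only the non-null values
```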

  • Locating elements in a DataFrame

data[column][index] locates the element in the column labelled column at the row labelled index.
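For reading, the chained form and `.loc` give the same element. For writing, the pandas documentation recommends `.loc`, because chained assignment such as `data[col][i] = v` can raise a warning or, depending on the pandas version, fail to update the DataFrame. A minimal sketch on a made-up table:

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 table with integer column labels, as produced by header=None
data = pd.DataFrame({0: [1.0, np.nan, 3.0], 1: [10.0, 20.0, 30.0]})

value = data[1][2]      # chained lookup: column 1 first, then row label 2
same = data.loc[2, 1]   # one-step lookup: row label first, then column label

data.loc[1, 0] = 2.0    # assignment through .loc updates the DataFrame itself
print(value, same, data.loc[1, 0])
```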

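  • Going out of bounds when extracting data

In the lagrange_interpolate() function above, the sample points are extracted at the start of the function, and both (n-k) and (n+k) can clearly fall outside the index. Stepping through with the debugger showed that when this happens, the out-of-range labels simply come back as null values, and the following statement, which drops nulls, removes them. Note that newer pandas versions raise a KeyError when a list of labels passed to s[...] contains labels that do not exist; Series.reindex() gives the null-filled behaviour described here.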
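The out-of-range behaviour can be reproduced on a small made-up Series. This sketch uses `Series.reindex`, which returns null for labels that are not in the index; the values and window size are chosen only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 11.0, np.nan, 13.0, 14.0])   # hypothetical column, missing value at label 2
n, k = 2, 5   # missing position and window half-width, as in lagrange_interpolate()

labels = list(range(n - k, n)) + list(range(n + 1, n + 1 + k))
window = s.reindex(labels)          # labels outside 0..4 come back as NaN
usable = window[window.notnull()]   # only the real neighbours survive

print(labels)
print(usable.index.tolist())
```

Only the labels that actually exist in the Series remain, so the interpolation uses fewer than 2k points near the edges of the table.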