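Time series: upsampling and downsampling
When upsampling (raising the frequency, for example from monthly to daily), take care over how interpolation is used to compute the finer-grained observations.
When downsampling (lowering the frequency, for example from daily to monthly), take care over which summary statistic is used to compute the new aggregated values.
There are perhaps two main reasons to resample time series data:
1. Problem framing: resampling may be needed if your data is not available at the frequency at which you want to make predictions.
2. Feature engineering: resampling can also provide additional structure, or insight into the learning problem, for supervised learning models.
The two cases overlap a great deal. For example, you may have daily data and want to predict a monthly quantity. You could use the daily data directly, or downsample it to monthly data and develop your model on that.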
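The daily-to-monthly case can be sketched on made-up data. This is only an illustration: the 90-day series and the names `daily`, `monthly_mean` and `monthly_sum` are assumptions, not part of the shampoo dataset used later. It also shows why the choice of summary statistic matters, since the mean and the sum give very different monthly values.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: 90 consecutive days with values 0.0 .. 89.0
days = pd.date_range('2020-01-01', periods=90, freq='D')
daily = pd.Series(np.arange(90, dtype=float), index=days)

# Downsample to one value per month; 'MS' labels each bin by its month start
monthly_mean = daily.resample('MS').mean()
monthly_sum = daily.resample('MS').sum()

print(monthly_mean)
print(monthly_sum)
```

Whether the mean or the sum is appropriate depends on the quantity: a level such as a temperature is usually averaged, while a flow such as sales is usually summed.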
https://machinelearningmastery.com/resample-interpolate-time-series-data-python/
Upsampling:
from pandas import read_csv, to_datetime
from matplotlib import pyplot

# Read the Month column as plain strings such as '1-01'
series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
# Prefix '190' so that '1-01' becomes '1901-01', then parse it as year-month
series.index = to_datetime('190' + series.index, format='%Y-%m')
print(series.head())
series.plot()
pyplot.show()
The data looks like this:
Month
1901-01-01 266.0
1901-02-01 145.9
1901-03-01 183.1
1901-04-01 119.3
1901-05-01 180.3
Name: Sales of shampoo over a three year period, dtype: float64
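In other words, we have monthly data and want to turn it into daily data.
First, convert the series to daily frequency:
from pandas import read_csv, to_datetime

# Read the Month column as plain strings such as '1-01'
series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
# Prefix '190' so that '1-01' becomes '1901-01', then parse it as year-month
series.index = to_datetime('190' + series.index, format='%Y-%m')
# resample('D') only groups the data; asfreq() returns the daily series,
# leaving NaN on every day that has no observation
upsampled = series.resample('D').asfreq()
print(upsampled.head(32))
Here 'D' stands for day. The result looks like this: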
Month
1901-01-01 266.0
1901-01-02 NaN
1901-01-03 NaN
1901-01-04 NaN
1901-01-05 NaN
1901-01-06 NaN
1901-01-07 NaN
1901-01-08 NaN
1901-01-09 NaN
1901-01-10 NaN
1901-01-11 NaN
1901-01-12 NaN
1901-01-13 NaN
1901-01-14 NaN
1901-01-15 NaN
1901-01-16 NaN
1901-01-17 NaN
1901-01-18 NaN
1901-01-19 NaN
1901-01-20 NaN
1901-01-21 NaN
1901-01-22 NaN
1901-01-23 NaN
1901-01-24 NaN
1901-01-25 NaN
1901-01-26 NaN
1901-01-27 NaN
1901-01-28 NaN
1901-01-29 NaN
1901-01-30 NaN
1901-01-31 NaN
1901-02-01 145.9
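With the daily slots in place, the gaps can be filled by interpolation. Many methods are available, such as linear, polynomial, and spline interpolation.
from pandas import read_csv, to_datetime

# Read the Month column as plain strings such as '1-01'
series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
# Prefix '190' so that '1-01' becomes '1901-01', then parse it as year-month
series.index = to_datetime('190' + series.index, format='%Y-%m')
upsampled = series.resample('D')
# Draw a straight line between each pair of consecutive monthly observations
interpolated = upsampled.interpolate(method='linear')
print(interpolated.head(32))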
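To see what linear interpolation does to the first gap, the arithmetic can be checked by hand. The following is a minimal sketch that assumes only the first two monthly values from the output shown earlier (266.0 and 145.9); the names `monthly`, `daily` and `step` are illustrative.

```python
import pandas as pd

# First two monthly observations, taken from the printed output
monthly = pd.Series(
    [266.0, 145.9],
    index=pd.to_datetime(['1901-01-01', '1901-02-01']),
)

# Upsample to daily frequency, then fill the NaN gaps linearly
daily = monthly.resample('D').asfreq().interpolate(method='linear')

# January has 31 days, so each daily step is (145.9 - 266.0) / 31
step = (145.9 - 266.0) / 31

print(daily.head(3))
print(step)
```

Each interpolated day differs from the previous one by the same `step`, and the original observations on the first day of each month are left unchanged.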
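[Figure: linearly interpolated daily series]
A spline can be used instead, here of order 2, with the result plotted:
from pandas import read_csv, to_datetime
from matplotlib import pyplot

# Read the Month column as plain strings such as '1-01'
series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
# Prefix '190' so that '1-01' becomes '1901-01', then parse it as year-month
series.index = to_datetime('190' + series.index, format='%Y-%m')
upsampled = series.resample('D')
# Spline interpolation requires SciPy to be installed
interpolated = upsampled.interpolate(method='spline', order=2)
print(interpolated.head(32))
interpolated.plot()
pyplot.show()
[Figure: spline-interpolated daily series]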
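Downsampling: we have monthly data and now want quarterly data
from pandas import read_csv, to_datetime
from matplotlib import pyplot

# Read the Month column as plain strings such as '1-01'
series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
# Prefix '190' so that '1-01' becomes '1901-01', then parse it as year-month
series.index = to_datetime('190' + series.index, format='%Y-%m')
# 'Q' groups by quarter end; pandas 2.2 and later prefer the alias 'QE'
resample = series.resample('Q')
quarterly_mean_sales = resample.mean()
print(quarterly_mean_sales.head())
quarterly_mean_sales.plot()
pyplot.show()
'Q' stands for quarter, and mean() replaces the months in each quarter with their average: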
Month
1901-03-31 198.333333
1901-06-30 156.033333
1901-09-30 216.366667
1901-12-31 215.100000
1902-03-31 184.633333
Freq: Q-DEC, Name: Sales, dtype: float64
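The first quarterly value can be reproduced by hand, which also makes the choice of summary statistic concrete. This is a sketch under stated assumptions: it uses only the three 1901 first-quarter values from the earlier output, and it swaps in the alias 'QS' (quarter start) for 'Q' so that it runs unchanged across pandas versions. As a result the bin is labelled 1901-01-01 rather than 1901-03-31.

```python
import pandas as pd

# January to March 1901, taken from the printed monthly data
q1 = pd.Series(
    [266.0, 145.9, 183.1],
    index=pd.to_datetime(['1901-01-01', '1901-02-01', '1901-03-01']),
)

# Compute several candidate summary statistics for the same quarter
summary = q1.resample('QS').agg(['mean', 'sum', 'max'])
print(summary)
```

The `mean` column matches the 198.333333 shown above, while `sum` and `max` give quite different pictures of the same three months.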
You can also aggregate by year; here sum() is used:
from pandas import read_csv, to_datetime
from matplotlib import pyplot

# Read the Month column as plain strings such as '1-01'
series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
# Prefix '190' so that '1-01' becomes '1901-01', then parse it as year-month
series.index = to_datetime('190' + series.index, format='%Y-%m')
# 'A' groups by year end; pandas 2.2 and later prefer the alias 'YE'
resample = series.resample('A')
yearly_total_sales = resample.sum()
print(yearly_total_sales.head())
yearly_total_sales.plot()
pyplot.show()
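Because the annual alias has changed between pandas versions, yearly totals can also be computed without any frequency alias by grouping on the calendar year of the index. This is an alternative sketch, not the method used above, and the 24-month series below is made up for illustration.

```python
import pandas as pd

# Hypothetical monthly series: 24 months with values 1.0 .. 24.0
months = pd.date_range('1901-01-01', periods=24, freq='MS')
monthly = pd.Series(range(1, 25), index=months, dtype=float)

# Group by the year of each timestamp and add up the months in each year
annual_total = monthly.groupby(monthly.index.year).sum()
print(annual_total)
```

The result is indexed by integer year (1901, 1902) instead of by a year-end timestamp, which is the main practical difference from `resample`.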
Further details:
1.http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.resample.html
2.http://pandas.pydata.org/pandas-docs/stable/timeseries.html#resampling
3.http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html