
Time Series: Upsampling and Downsampling

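When upsampling, care may be needed in how interpolation is used to compute the finer-grained observations.

When downsampling, care may be needed in choosing the summary statistic used to compute the new aggregated values.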

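There are perhaps two main reasons why you might want to resample your time series data:

1. Problem framing: resampling may be required if your data is not available at the same frequency at which you want to make predictions.

2. Feature engineering: resampling can also be used to provide additional structure or insight into the learning problem for supervised learning models.

These two cases overlap a lot. For example, you may have daily data and want to predict a monthly quantity. You could use the daily data directly, or you could downsample it to monthly data and develop your model on that.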
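To make the daily-to-monthly case concrete, here is a minimal sketch. The series is made up for illustration (90 consecutive days starting 2020-01-01 with the values 0 to 89), and the choice of `sum()` as the aggregate is an assumption; a mean or another statistic may suit your problem better.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: 90 days of made-up values starting 2020-01-01
index = pd.date_range("2020-01-01", periods=90, freq="D")
daily = pd.Series(np.arange(90, dtype=float), index=index, name="value")

# Downsample to monthly totals; "MS" labels each month by its first day
monthly = daily.resample("MS").sum()
print(monthly)
```

The monthly series has one row per calendar month, and its total equals the total of the daily series, since summing only regroups the values.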

https://machinelearningmastery.com/resample-interpolate-time-series-data-python/

Upsampling:

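from datetime import datetime
from pandas import read_csv, DatetimeIndex
from matplotlib import pyplot

def parser(x):
    # Dates are stored as 'Y-MM' (e.g. '1-01'); prefix '190' to get a four-digit year
    return datetime.strptime('190'+x, '%Y-%m')

# pandas.datetime, squeeze= and date_parser= are removed or deprecated in pandas 2.x,
# so load the column as a Series and parse the index explicitly
series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = DatetimeIndex([parser(x) for x in series.index], name=series.index.name)
print(series.head())
series.plot()
pyplot.show()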

 

The first rows look like this:

Month
1901-01-01 266.0
1901-02-01 145.9
1901-03-01 183.1
1901-04-01 119.3
1901-05-01 180.3
Name: Sales of shampoo over a three year period, dtype: float64

In other words, we have monthly data and want to turn it into daily data.

First, convert the series to daily frequency:

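from datetime import datetime
from pandas import read_csv, DatetimeIndex

def parser(x):
    return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = DatetimeIndex([parser(x) for x in series.index], name=series.index.name)
# resample() only returns a Resampler object; asfreq() materializes the daily series
upsampled = series.resample('D').asfreq()
print(upsampled.head(32))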

Here D stands for day. After resampling, the series looks like this:

Month
1901-01-01 266.0
1901-01-02 NaN
1901-01-03 NaN
1901-01-04 NaN
1901-01-05 NaN
1901-01-06 NaN
1901-01-07 NaN
1901-01-08 NaN
1901-01-09 NaN
1901-01-10 NaN
1901-01-11 NaN
1901-01-12 NaN
1901-01-13 NaN
1901-01-14 NaN
1901-01-15 NaN
1901-01-16 NaN
1901-01-17 NaN
1901-01-18 NaN
1901-01-19 NaN
1901-01-20 NaN
1901-01-21 NaN
1901-01-22 NaN
1901-01-23 NaN
1901-01-24 NaN
1901-01-25 NaN
1901-01-26 NaN
1901-01-27 NaN
1901-01-28 NaN
1901-01-29 NaN
1901-01-30 NaN
1901-01-31 NaN
1901-02-01 145.9

With the placeholder rows in place, we can now interpolate. There are many methods to choose from, such as linear, polynomial, and spline.
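As a quick sanity check on what linear interpolation does, the sketch below uses only the first two monthly values from the output above (266.0 and 145.9) rather than the full CSV. The variable names are made up for illustration. Between 1901-01-01 and 1901-02-01 there are 31 daily steps, so each day should move by (145.9 - 266.0) / 31.

```python
import pandas as pd

# First two monthly observations from the output shown above
monthly = pd.Series(
    [266.0, 145.9],
    index=pd.to_datetime(["1901-01-01", "1901-02-01"]),
)

# Upsample to daily frequency; the newly created days are NaN placeholders
daily = monthly.resample("D").asfreq()
n_missing = int(daily.isna().sum())

# Linear interpolation spreads the change evenly over the 31 daily steps
filled = daily.interpolate(method="linear")
step = (145.9 - 266.0) / 31

print(n_missing)
print(filled.head(3))
```

The two original observations are left untouched; only the placeholder days in between are filled.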

from datetime import datetime
from pandas import read_csv, DatetimeIndex

def parser(x):
    return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = DatetimeIndex([parser(x) for x in series.index], name=series.index.name)
# Create the daily placeholder rows, then fill them by linear interpolation
upsampled = series.resample('D').asfreq()
interpolated = upsampled.interpolate(method='linear')
print(interpolated.head(32))

The resulting plot:

Figure: Shampoo Sales Interpolated Linear

from datetime import datetime
from pandas import read_csv, DatetimeIndex
from matplotlib import pyplot

def parser(x):
    return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = DatetimeIndex([parser(x) for x in series.index], name=series.index.name)
upsampled = series.resample('D').asfreq()
# Second-order spline interpolation; this method requires SciPy to be installed
interpolated = upsampled.interpolate(method='spline', order=2)
print(interpolated.head(32))
interpolated.plot()
pyplot.show()

The resulting plot:

Figure: Shampoo Sales Interpolated Spline

Downsampling: we have monthly data and now want quarterly data.

from datetime import datetime
from pandas import read_csv, DatetimeIndex
from matplotlib import pyplot

def parser(x):
    return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = DatetimeIndex([parser(x) for x in series.index], name=series.index.name)
# 'Q' = quarter end; pandas 2.2+ deprecates this alias in favor of 'QE'
resample = series.resample('Q')
quarterly_mean_sales = resample.mean()
print(quarterly_mean_sales.head())
quarterly_mean_sales.plot()
pyplot.show()

Q stands for quarter, and mean() replaces the months in each quarter with their average.

Month
1901-03-31 198.333333
1901-06-30 156.033333
1901-09-30 216.366667
1901-12-31 215.100000
1902-03-31 184.633333
Freq: Q-DEC, Name: Sales, dtype: float64
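The choice of summary statistic changes what the downsampled series means, so it can help to compare a few side by side. The sketch below is only an illustration: it uses the five monthly values printed earlier instead of the full CSV, which makes the second quarter partial (two months only). It also uses the `QS` alias, which labels each quarter by its first day rather than its last, so the index differs from the output above.

```python
import pandas as pd

# The five monthly values shown in the first output above
sales = pd.Series(
    [266.0, 145.9, 183.1, 119.3, 180.3],
    index=pd.date_range("1901-01-01", periods=5, freq="MS"),
    name="Sales",
)

# "QS" labels each quarter by its first day (the text above uses quarter-end labels)
summary = sales.resample("QS").agg(["mean", "sum", "max"])
print(summary)
```

The first-quarter mean here matches the 198.333333 in the output above. A mean suits a rate-like view of sales, while a sum gives the total volume per quarter.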

You can of course aggregate by year as well. This example uses sum():

from datetime import datetime
from pandas import read_csv, DatetimeIndex
from matplotlib import pyplot

def parser(x):
    return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = DatetimeIndex([parser(x) for x in series.index], name=series.index.name)
# 'A' = year end; pandas 2.2+ deprecates this alias in favor of 'YE'
resample = series.resample('A')
yearly_total_sales = resample.sum()
print(yearly_total_sales.head())
yearly_total_sales.plot()
pyplot.show()

More details:

1.http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.resample.html

2.http://pandas.pydata.org/pandas-docs/stable/timeseries.html#resampling

3.http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html