Multi-step Time Series Forecasting with Machine Learning for Household Electricity Consumption

Given the rise of smart electricity meters and the wide adoption of electricity generation technology like solar panels, there is a wealth of electricity usage data available.

This data represents a multivariate time series of power-related variables that in turn could be used to model and even forecast future electricity consumption.

Most machine learning algorithms predict a single value and cannot be used directly for multi-step forecasting. Two strategies for making multi-step forecasts with such algorithms are the recursive method, which feeds each one-step prediction back in as an input for the next step, and the direct method, which fits a separate model for each forecast step.
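To make the two strategies concrete, here is a minimal sketch on a toy series. It is not the tutorial's implementation: the helper names (make_supervised, fit_linear, recursive_forecast, direct_forecast), the three-lag window, and the plain least-squares model are all assumptions chosen for illustration.

```python
import numpy as np

def make_supervised(series, n_lags, lead=1):
    """Build (X, y) pairs: n_lags past values -> the value `lead` steps ahead."""
    X, y = [], []
    for i in range(n_lags, len(series) - lead + 1):
        X.append(series[i - n_lags:i])
        y.append(series[i + lead - 1])
    return np.array(X), np.array(y)

def fit_linear(X, y):
    """Least-squares linear model with an intercept (stand-in for any regressor)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, x):
    return float(np.dot(coef[:-1], x) + coef[-1])

def recursive_forecast(series, n_lags, n_steps):
    """One one-step model; each prediction is appended and reused as an input."""
    coef = fit_linear(*make_supervised(series, n_lags, lead=1))
    history = list(series[-n_lags:])
    forecast = []
    for _ in range(n_steps):
        yhat = predict_linear(coef, history[-n_lags:])
        forecast.append(yhat)
        history.append(yhat)
    return forecast

def direct_forecast(series, n_lags, n_steps):
    """One model per lead time; every model sees only real observations."""
    last = series[-n_lags:]
    forecast = []
    for lead in range(1, n_steps + 1):
        coef = fit_linear(*make_supervised(series, n_lags, lead=lead))
        forecast.append(predict_linear(coef, last))
    return forecast

# toy series with a perfect linear trend (hypothetical data)
series = np.arange(1.0, 31.0)
rec = recursive_forecast(series, n_lags=3, n_steps=7)
direct = direct_forecast(series, n_lags=3, n_steps=7)
print(np.round(rec, 3))
print(np.round(direct, 3))
```

On real data the two usually differ: the recursive forecast can accumulate error because later steps are built on earlier predictions, while the direct forecast needs one fitted model per lead time.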

In this tutorial, you will discover how to develop recursive and direct multi-step forecasting models with machine learning algorithms.

After completing this tutorial, you will know:

  • How to develop a framework for evaluating linear, nonlinear, and ensemble machine learning algorithms for multi-step time series forecasting.
  • How to evaluate machine learning algorithms using a recursive multi-step time series forecasting strategy.
  • How to evaluate machine learning algorithms using a direct per-day and per-lead time multi-step time series forecasting strategy.

Let’s get started.

Photo by Sean McMenemy, some rights reserved.

Tutorial Overview

This tutorial is divided into five parts; they are:

  1. Problem Description
  2. Load and Prepare Dataset
  3. Model Evaluation
  4. Recursive Multi-Step Forecasting
  5. Direct Multi-Step Forecasting

Problem Description

The ‘Household Power Consumption’ dataset is a multivariate time series dataset that describes the electricity consumption for a single household over four years.

The data was collected between December 2006 and November 2010 and observations of power consumption within the household were collected every minute.

It is a multivariate series comprised of seven variables (besides the date and time); they are:

  • global_active_power: The total active power consumed by the household (kilowatts).
  • global_reactive_power: The total reactive power consumed by the household (kilowatts).
  • voltage: Average voltage (volts).
  • global_intensity: Average current intensity (amps).
  • sub_metering_1: Active energy for kitchen (watt-hours of active energy).
  • sub_metering_2: Active energy for laundry (watt-hours of active energy).
  • sub_metering_3: Active energy for climate control systems (watt-hours of active energy).

Active and reactive energy refer to the technical details of alternating current.

A fourth sub-metering variable can be created by subtracting the sum of three defined sub-metering variables from the total active energy as follows:

    sub_metering_remainder = (global_active_power * 1000 / 60) - (sub_metering_1 + sub_metering_2 + sub_metering_3)
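The factor of 1000 / 60 converts a one-minute average power in kilowatts into the watt-hours of energy used in that minute, which puts the total in the same units as the sub-metering variables. A small worked example follows; the reading is hypothetical and the function wrapper is only for illustration.

```python
def sub_metering_remainder(global_active_power_kw, sub_1_wh, sub_2_wh, sub_3_wh):
    # kW averaged over one minute -> Wh consumed during that minute
    total_wh = global_active_power_kw * 1000 / 60
    # energy not accounted for by the three sub-meters
    return total_wh - (sub_1_wh + sub_2_wh + sub_3_wh)

# hypothetical one-minute reading: 3.0 kW total, sub-meters at 0, 1 and 17 Wh
remainder = sub_metering_remainder(3.0, 0.0, 1.0, 17.0)
print(remainder)  # 3.0 kW for one minute is 50 Wh; 50 - 18 = 32.0
```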

Load and Prepare Dataset

The dataset can be downloaded from the UCI Machine Learning Repository as a single 20 megabyte .zip file.

Download the dataset and unzip it into your current working directory. You will now have the file “household_power_consumption.txt” that is about 127 megabytes in size and contains all of the observations.

We can use the read_csv() function to load the data and combine the first two columns into a single date-time column that we can use as an index.

    # load all data
    dataset = read_csv('household_power_consumption.txt', sep=';', header=0, low_memory=False, infer_datetime_format=True, parse_dates={'datetime':[0,1]}, index_col=['datetime'])

Next, we can replace all missing values, which are indicated with a ‘?’ character, with a NaN value, which is a float.

This will allow us to work with the data as one array of floating point values rather than as mixed types, which is less efficient.

    # mark all missing values
    dataset.replace('?', nan, inplace=True)
    # make dataset numeric
    dataset = dataset.astype('float32')

We also need to fill in the missing values now that they have been marked.

A very simple approach would be to copy the observation from the same time the day before. We can implement this in a function named fill_missing() that will take the NumPy array of the data and copy values from exactly 24 hours ago.

    # fill missing values with a value at the same time one day ago
    def fill_missing(values):
        one_day = 60 * 24
        for row in range(values.shape[0]):
            for col in range(values.shape[1]):
                if isnan(values[row, col]):
                    values[row, col] = values[row - one_day, col]

We can apply this function directly to the data within the DataFrame.

    # fill missing
    fill_missing(dataset.values)
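The behavior of this fill strategy is easier to see on a tiny array. The sketch below uses the same logic with the lag exposed as a parameter; the name fill_missing_with_lag and the lag of 3 rows are assumptions for illustration, standing in for the 1,440 minutes in a day.

```python
import numpy as np

def fill_missing_with_lag(values, lag):
    """Replace each NaN in place with the value `lag` rows earlier in the same column."""
    for row in range(values.shape[0]):
        for col in range(values.shape[1]):
            if np.isnan(values[row, col]):
                values[row, col] = values[row - lag, col]

# toy single-column series with two gaps (hypothetical data)
toy = np.array([[1.0], [2.0], [3.0], [np.nan], [5.0], [np.nan]])
fill_missing_with_lag(toy, lag=3)
print(toy[:, 0])  # row 3 copies row 0, row 5 copies row 2
```

Two properties are worth keeping in mind. A gap longer than the lag is still filled, because rows are processed in order and earlier filled values are reused. A NaN within the first `lag` rows would use a negative index and so copy a value from the end of the array.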

Now we can create a new column that contains the remainder of the sub-metering, using the calculation from the previous section.

    # add a column for the remainder of sub metering
    values = dataset.values
    dataset['sub_metering_4'] = (values[:,0] * 1000 / 60) - (values[:,4] + values[:,5] + values[:,6])

We can now save the cleaned-up version of the dataset to a new file; in this case we will just change the file extension to .csv and save the dataset as ‘household_power_consumption.csv’.

    # save updated dataset
    dataset.to_csv('household_power_consumption.csv')

Tying all of this together, the complete example of loading, cleaning-up, and saving the dataset is listed below.

    # load and clean-up data
    from numpy import nan
    from numpy import isnan
    from pandas import read_csv

    # fill missing values with a value at the same time one day ago
    def fill_missing(values):
        one_day = 60 * 24
        for row in range(values.shape[0]):
            for col in range(values.shape[1]):
                if isnan(values[row, col]):
                    values[row, col] = values[row - one_day, col]

    # load all data
    dataset = read_csv('household_power_consumption.txt', sep=';', header=0, low_memory=False, infer_datetime_format=True, parse_dates={'datetime':[0,1]}, index_col=['datetime'])
    # mark all missing values
    dataset.replace('?', nan, inplace=True)
    # make dataset numeric
    dataset = dataset.astype('float32')
    # fill missing
    fill_missing(dataset.values)
    # add a column for the remainder of sub metering
    values = dataset.values
    dataset['sub_metering_4'] = (values[:,0] * 1000 / 60) - (values[:,4] + values[:,5] + values[:,6])
    # save updated dataset
    dataset.to_csv('household_power_consumption.csv')

Running the example creates the new file ‘household_power_consumption.csv’ that we can use as the starting point for our modeling project.

Model Evaluation

In this section, we will consider how we can develop and evaluate predictive models for the household power dataset.

This section is divided into four parts; they are:

  1. Problem Framing
  2. Evaluation Metric
  3. Train and Test Sets
  4. Walk-Forward Validation

Problem Framing

There are many ways to harness and explore the household power consumption dataset.

In this tutorial, we will use the data to explore a very specific question; that is:

Given recent power consumption, what is the expected power consumption for the week ahead?

This requires that a predictive model forecast the total active power for each day over the next seven days.

Technically, this framing of the problem is referred to as a multi-step time series forecasting problem, given the multiple forecast steps. A model that makes use of multiple input variables may be referred to as a multivariate multi-step time series forecasting model.

A model of this type could be helpful within the household in planning expenditures. It could also be helpful on the supply side for planning electricity demand for a specific household.

This framing of the dataset also suggests that it would be useful to downsample the per-minute observations of power consumption to daily totals. This is not required, but makes sense, given that we are interested in total power per day.

We can achieve this easily using the resample() function on the pandas DataFrame. Calling this function with the argument ‘D’ allows the loaded data indexed by date-time to be grouped by day (see the pandas offset aliases for other options). We can then calculate the sum of all observations for each day and create a new dataset of daily power consumption data for each of the eight variables.

The complete example is listed below.

    # resample minute data to total for each day
    from pandas import read_csv
    # load the new file
    dataset = read_csv('household_power_consumption.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
    # resample data to daily
    daily_groups = dataset.resample('D')
    daily_data = daily_groups.sum()
    # summarize
    print(daily_data.shape)
    print(daily_data.head())
    # save
    daily_data.to_csv('household_power_consumption_days.csv')

Running the example creates a new daily total power consumption dataset and saves the result into a separate file named ‘household_power_consumption_days.csv’.
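The effect of the daily resampling can be checked on synthetic data without the real file. The sketch below assumes a hypothetical household drawing a constant 1 kW for two days; the frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# two days of per-minute readings at a constant 1.0 kW (hypothetical data)
index = pd.date_range('2007-01-01 00:00', periods=2 * 24 * 60, freq='min')
frame = pd.DataFrame({'global_active_power': np.ones(len(index))}, index=index)

# group by calendar day and sum, as in the tutorial
daily = frame.resample('D').sum()
print(daily)

# the daily sum of per-minute kW readings, divided by 60, is energy in kWh
daily_kwh = daily / 60
print(daily_kwh)
```

Each day sums to 1,440, one reading per minute, so the daily totals are sums of per-minute kilowatt readings rather than kilowatt-hours. Dividing by 60 recovers the 24 kWh this constant load would use per day.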

We can use this as the dataset for fitting and evaluating predictive models for the chosen framing of the problem.

Evaluation Metric

A forecast will be comprised of seven values, one for each day of the week ahead.

It is common with multi-step forecasting problems to evaluate each forecasted time step separately. This is helpful for a few reasons:

  • To comment on the skill at a specific lead time (e.g. +1 day vs +3 days).
  • To contrast models based on their skills at different lead times (e.g. models good at +1 day vs models good at +5 days).

The units of the total power are kilowatts, and it would be useful to have an error metric in the same units. Both Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) fit this bill, although RMSE is more commonly used and will be adopted in this tutorial. Unlike MAE, RMSE squares each error before averaging, so it penalizes large forecast errors more heavily.
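A short worked example shows the difference between the two metrics. The two error lists below are hypothetical and have the same total absolute error, spread evenly in one case and concentrated in a single forecast in the other.

```python
from math import sqrt

def mae(errors):
    # mean of the absolute errors
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    # square root of the mean squared error
    return sqrt(sum(e ** 2 for e in errors) / len(errors))

even_errors = [1.0, 1.0, 1.0, 1.0]    # four small misses
one_big_error = [0.0, 0.0, 0.0, 4.0]  # one large miss

print(mae(even_errors), rmse(even_errors))      # 1.0 1.0
print(mae(one_big_error), rmse(one_big_error))  # 1.0 2.0
```

MAE scores both cases the same, while RMSE doubles for the case with a single large miss.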

The performance metric for this problem will be the RMSE for each lead time from day 1 to day 7.

As a short-cut, it may be useful to summarize the performance of a model using a single score in order to aid in model selection.

One possible score that could be used would be the RMSE across all forecast days.

The function evaluate_forecasts() below will implement this behavior and return the performance of a model based on multiple seven-day forecasts.

    from math import sqrt
    from sklearn.metrics import mean_squared_error

    # evaluate one or more weekly forecasts against expected values
    def evaluate_forecasts(actual, predicted):
        scores = list()
        # calculate an RMSE score for each day
        for i in range(actual.shape[1]):
            # calculate mse
            mse = mean_squared_error(actual[:, i], predicted[:, i])
            # calculate rmse
            rmse = sqrt(mse)
            # store
            scores.append(rmse)
        # calculate overall RMSE
        s = 0
        for row in range(actual.shape[0]):
            for col in range(actual.shape[1]):
                s += (actual[row, col] - predicted[row, col])**2
        score = sqrt(s / (actual.shape[0] * actual.shape[1]))
        return score, scores

Calling the function returns the overall RMSE across all days first, then a list of RMSE scores, one for each day.
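The same two quantities can be computed without explicit loops. The sketch below is an assumed NumPy-only equivalent (the name evaluate_forecasts_vectorized is hypothetical), applied to made-up forecasts with three lead times instead of seven to keep the arithmetic easy to follow.

```python
import numpy as np

def evaluate_forecasts_vectorized(actual, predicted):
    """Return (overall RMSE, list of per-lead-time RMSE) for [n_forecasts, n_steps] arrays."""
    squared_error = (actual - predicted) ** 2
    # average over forecasts (rows) to get one score per lead time (column)
    per_day = np.sqrt(squared_error.mean(axis=0))
    # average over every cell to get the single summary score
    overall = float(np.sqrt(squared_error.mean()))
    return overall, per_day.tolist()

# two hypothetical forecasts, each covering three lead times
actual = np.array([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])
predicted = np.array([[1.0, 2.0, 5.0], [2.0, 3.0, 6.0]])
score, scores = evaluate_forecasts_vectorized(actual, predicted)
print(score, scores)
```

Here the first two lead times are forecast perfectly and the third is off by 2 in both forecasts, giving per-lead-time scores of 0, 0 and 2. The overall score is the square root of 8/6, about 1.155, which is not the same as the mean of the per-lead-time scores (about 0.667), because the squared errors are pooled before the root is taken.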

Train and Test Sets

We will use the first three years of data for training predictive models and the final year for evaluating models.

The data in a given dataset will be divided into standard weeks. These are weeks that begin on a Sunday and end on a Saturday.

This is a realistic and useful way to apply the chosen framing of the problem, where the power consumption for the week ahead can be predicted. It is also helpful with modeling, where models can be used to predict a specific day (e.g. Wednesday) or the entire sequence.

We will split the data into standard weeks, working backwards from the test dataset.

The final year of the data is in 2010 and the first Sunday for 2010 was January 3rd. The data ends in mid November 2010 and the closest final Saturday in the data is November 20th. This gives 46 weeks of test data.

The first and last rows of daily data for the test dataset are provided below for confirmation.

    2010-01-03,2083.4539999999984,191.61000000000055,350992.12000000034,8703.600000000033,3842.0,4920.0,10074.0,15888.233355799992
    ...
    2010-11-20,2197.006000000004,153.76800000000028,346475.9999999998,9320.20000000002,4367.0,2947.0,11433.0,17869.76663959999

The daily data starts in late 2006.

The first Sunday in the dataset is December 17th, which is the second row of data.

Organizing the data into standard weeks gives 159 full standard weeks for training a predictive model.

    2006-12-17,3390.46,226.0059999999994,345725.32000000024,14398.59999999998,2033.0,4187.0,13341.0,36946.66673200004
    ...
    2010-01-02,1309.2679999999998,199.54600000000016,352332.8399999997,5489.7999999999865,801.0,298.0,6425.0,14297.133406600002
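The week counts quoted above can be confirmed with the standard library. This is a small check, not part of the tutorial's code; the helper name count_standard_weeks is an assumption.

```python
from datetime import date

def count_standard_weeks(first_sunday, last_saturday):
    """Count whole Sunday-to-Saturday weeks between two dates, inclusive."""
    # date.weekday() numbers Monday as 0, so Saturday is 5 and Sunday is 6
    if first_sunday.weekday() != 6:
        raise ValueError('start date is not a Sunday')
    if last_saturday.weekday() != 5:
        raise ValueError('end date is not a Saturday')
    n_days = (last_saturday - first_sunday).days + 1
    return n_days // 7

train_weeks = count_standard_weeks(date(2006, 12, 17), date(2010, 1, 2))
test_weeks = count_standard_weeks(date(2010, 1, 3), date(2010, 11, 20))
print(train_weeks, test_weeks)  # 159 46
```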

The function split_dataset() below splits the daily data into train and test sets and organizes each into standard weeks.

Specific row offsets are used to split the data using knowledge of the dataset. The split datasets are then organized into weekly data using the NumPy split() function.
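The listing for split_dataset() is not reproduced here, so the following is only a sketch of the general idea on toy data, not the tutorial's function. The name split_into_weeks and the n_test_days parameter are assumptions, and the input is assumed to be already trimmed so that it starts on a Sunday and ends on a Saturday.

```python
import numpy as np

def split_into_weeks(daily_values, n_test_days):
    """Split rows into train/test, then reshape each into [weeks, 7, features]."""
    train, test = daily_values[:-n_test_days], daily_values[-n_test_days:]
    # np.split raises ValueError unless the rows divide evenly into the sections
    train = np.array(np.split(train, len(train) // 7))
    test = np.array(np.split(test, len(test) // 7))
    return train, test

# four toy weeks of daily data with two features (hypothetical values)
toy = np.arange(28 * 2, dtype=float).reshape(28, 2)
train, test = split_into_weeks(toy, n_test_days=7)
print(train.shape, test.shape)  # (3, 7, 2) (1, 7, 2)
```

With the real data, the slice boundaries would instead be row offsets chosen to drop the partial weeks at the start and end of the daily file.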
