Time Series Forecast Study with Python: Monthly Sales of French Champagne

Time series forecasting is a process, and the only way to get good forecasts is to practice this process.

In this tutorial, you will discover how to forecast the monthly sales of French champagne with Python.

Working through this tutorial will provide you with a framework for the steps and the tools for working through your own time series forecasting problems.

After completing this tutorial, you will know:

  • How to confirm your Python environment and carefully define a time series forecasting problem.
  • How to create a test harness for evaluating models, develop a baseline forecast, and better understand your problem with the tools of time series analysis.
  • How to develop an autoregressive integrated moving average model, save it to file, and later load it to make predictions for new time steps.

Let’s get started.

  • Update Mar/2017: Fixed a typo in the code example in the “Review Residual Errors” section where the wrong model was run.
Photo by Basheer Tome, some rights reserved.

Overview

In this tutorial, we will work through a time series forecasting project from end-to-end, from downloading the dataset and defining the problem to training a final model and making predictions.

This project is not exhaustive, but shows how you can get good results quickly by working through a time series forecasting problem systematically.

The steps of this project that we will work through are as follows:

  1. Environment.
  2. Problem Description.
  3. Test Harness.
  4. Persistence.
  5. Data Analysis.
  6. ARIMA Models.
  7. Model Validation.

This will provide a template for working through a time series prediction problem that you can use on your own dataset.


1. Environment

This tutorial assumes an installed and working SciPy environment and dependencies, including:

  • SciPy
  • NumPy
  • Matplotlib
  • Pandas
  • scikit-learn
  • statsmodels

If you need help installing Python and the SciPy environment on your workstation, consider the Anaconda distribution that manages much of it for you.

This script will help you check your installed versions of these libraries.

# scipy
import scipy
print('scipy: %s' % scipy.__version__)
# numpy
import numpy
print('numpy: %s' % numpy.__version__)
# matplotlib
import matplotlib
print('matplotlib: %s' % matplotlib.__version__)
# pandas
import pandas
print('pandas: %s' % pandas.__version__)
# scikit-learn
import sklearn
print('sklearn: %s' % sklearn.__version__)
# statsmodels
import statsmodels
print('statsmodels: %s' % statsmodels.__version__)

The results on my workstation used to write this tutorial are as follows:

scipy: 0.18.1
numpy: 1.11.2
matplotlib: 1.5.3
pandas: 0.19.1
sklearn: 0.18.1
statsmodels: 0.6.1

2. Problem Description

The problem is to predict the number of monthly sales of champagne for the Perrin Freres label (named for a region in France).

The dataset provides the number of monthly sales of champagne from January 1964 to September 1972, or just under 10 years of data.

The values are a count of millions of sales and there are 105 observations.

The dataset is credited to Makridakis and Wheelwright, 1989.

Download the dataset as a CSV file and place it in your current working directory with the filename “champagne.csv“.

3. Test Harness

We must develop a test harness to investigate the data and evaluate candidate models.

This involves two steps:

  1. Defining a Validation Dataset.
  2. Developing a Method for Model Evaluation.

3.1 Validation Dataset

The dataset is not current. This means that we cannot easily collect updated data to validate the model.

Therefore, we will pretend that it is September 1971 and withhold the final 12 months of data from analysis and model selection.

This final year of data will be used to validate the final model.

The code below will load the dataset as a Pandas Series and split it into two files: one for model development (dataset.csv) and the other for validation (validation.csv).

from pandas import Series
series = Series.from_csv('champagne.csv', header=0)
split_point = len(series) - 12
dataset, validation = series[0:split_point], series[split_point:]
print('Dataset %d, Validation %d' % (len(dataset), len(validation)))
dataset.to_csv('dataset.csv')
validation.to_csv('validation.csv')

Running the example creates two files and prints the number of observations in each.

Dataset 93, Validation 12

The specific contents of these files are:

  • dataset.csv: Observations from January 1964 to September 1971 (93 observations)
  • validation.csv: Observations from October 1971 to September 1972 (12 observations)

The validation dataset is about 11% of the original dataset.

Note that the saved datasets do not have a header line, therefore we do not need to cater for this when working with these files later.

3.2. Model Evaluation

Model evaluation will only be performed on the data in dataset.csv prepared in the previous section.

Model evaluation involves two elements:

  1. Performance Measure.
  2. Test Strategy.

3.2.1 Performance Measure

The observations are a count of champagne sales in millions of units.

We will evaluate the performance of predictions using the root mean squared error (RMSE). This will give more weight to predictions that are grossly wrong and will have the same units as the original data.

Any transforms to the data must be reversed before the RMSE is calculated and reported to make the performance between different methods directly comparable.

We can calculate the RMSE using the mean_squared_error() helper function from the scikit-learn library, which calculates the mean squared error between a list of expected values (the test set) and a list of predictions. We can then take the square root of this value to give us an RMSE score.

For example:

from sklearn.metrics import mean_squared_error
from math import sqrt
...
test = ...
predictions = ...
mse = mean_squared_error(test, predictions)
rmse = sqrt(mse)
print('RMSE: %.3f' % rmse)
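
As a concrete illustration of reversing a transform before scoring (a minimal, hypothetical sketch that is not part of this tutorial's test harness), suppose a log transform had been applied to the series before modeling. The forecasts would then need to be returned to the original scale with the exponential before the RMSE is calculated:

from sklearn.metrics import mean_squared_error
from math import sqrt, log, exp
# hypothetical observations on the original scale (illustrative values only)
test = [4676.0, 5010.0, 4874.0]
# hypothetical forecasts produced by a model trained on log-transformed data
predictions_log = [log(4800.0), log(4900.0), log(4700.0)]
# invert the log transform so the error is reported in the original units (sales)
predictions = [exp(yhat) for yhat in predictions_log]
mse = mean_squared_error(test, predictions)
rmse = sqrt(mse)
print('RMSE: %.3f' % rmse)

The same pattern applies to any other transform, such as a power transform or differencing: invert it on the predictions first, then score.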

3.2.2 Test Strategy

Candidate models will be evaluated using walk-forward validation.

This is because a rolling-forecast type of model is required by the problem definition: one-step forecasts are needed given all of the available data at each step.

The walk-forward validation will work as follows:

  • The first 50% of the dataset will be held back to train the model.
  • The remaining 50% of the dataset will be iterated over to test the model.
  • For each step in the test dataset:
    • A model will be trained.
    • A one-step prediction will be made and stored for later evaluation.
    • The actual observation from the test dataset will be added to the training dataset for the next iteration.
  • The predictions made during the iteration of the test dataset will be evaluated and an RMSE score reported.

Given the small size of the data, we will allow a model to be re-trained given all available data prior to each prediction.

We can write the code for the test harness using simple NumPy and Python code.

Firstly, we can split the dataset into train and test sets directly. We are careful to always convert a loaded dataset to float32 in case the loaded data still has some string or integer data types.

# prepare data
X = series.values
X = X.astype('float32')
train_size = int(len(X) * 0.50)
train, test = X[0:train_size], X[train_size:]

Next, we can iterate over the time steps in the test dataset. The train dataset is stored in a Python list as we need to easily append a new observation each iteration and NumPy array concatenation feels like overkill.

The prediction made by the model is called yhat for convention, as the outcome or observation is referred to as y and yhat (a ‘y‘ with a mark above) is the mathematical notation for the prediction of the y variable.

The prediction and observation are printed at each iteration as a sanity check in case there are issues with the model.

# walk-forward validation
history = [x for x in train]
predictions = list()
for i in range(len(test)):
    # predict
    yhat = ...
    predictions.append(yhat)
    # observation
    obs = test[i]
    history.append(obs)
    print('>Predicted=%.3f, Expected=%3.f' % (yhat, obs))

4. Persistence

The first step before getting bogged down in data analysis and modeling is to establish a baseline of performance.

This will provide both a template for evaluating models using the proposed test harness and a performance measure by which all more elaborate predictive models can be compared.

The baseline prediction for time series forecasting is called the naive forecast, or persistence.

This is where the observation from the previous time step is used as the prediction for the observation at the next time step.

We can plug this directly into the test harness defined in the previous section.

The complete code listing is provided below.

from pandas import Series
from sklearn.metrics import mean_squared_error
from math import sqrt
# load data
series = Series.from_csv('dataset.csv')
# prepare data
X = series.values
X = X.astype('float32')
train_size = int(len(X) * 0.50)
train, test = X[0:train_size], X[train_size:]
# walk-forward validation
history = [x for x in train]
predictions = list()
for i in range(len(test)):
    # predict
    yhat = history[-1]
    predictions.append(yhat)
    # observation
    obs = test[i]
    history.append(obs)
    print('>Predicted=%.3f, Expected=%3.f' % (yhat, obs))
# report performance
mse = mean_squared_error(test, predictions)
rmse = sqrt(mse)
print('RMSE: %.3f' % rmse)

Running the test harness prints the prediction and observation for each iteration of the test dataset.

The example ends by printing the RMSE for the model.

In this case, we can see that the persistence model achieved an RMSE of 3186.501. This means that on average, the model was wrong by about 3,186 million sales for each prediction made.

...
>Predicted=4676.000, Expected=5010
>Predicted=5010.000, Expected=4874
>Predicted=4874.000, Expected=4633
>Predicted=4633.000, Expected=1659
>Predicted=1659.000, Expected=5951
RMSE: 3186.501

We now have a baseline prediction method and a measure of its performance; next we can start digging into our data.

5. Data Analysis

We can use summary statistics and plots of the data to quickly learn more about the structure of the prediction problem.

In this section, we will look at the data from five perspectives:

  1. Summary Statistics.
  2. Line Plot.
  3. Seasonal Line Plots.
  4. Density Plots.
  5. Box and Whisker Plot.

5.1 Summary Statistics

Summary statistics provide a quick look at the limits of observed values and can help to get a quick idea of what we are working with.

The example below calculates and prints summary statistics for the time series.

from pandas import Series
series = Series.from_csv('dataset.csv')
print(series.describe())

Running the example provides a number of summary statistics to review.

Some observations from these statistics include:

  • The number of observations (count) matches our expectation, meaning we are handling the data correctly.
  • The mean is about 4,641, which we might consider our level in this series.
  • The standard deviation (average spread from the mean) is relatively large at 2,486 sales.
  • The percentiles along with the standard deviation do suggest a large spread to the data.

count       93.000000
mean      4641.118280
std       2486.403841
min       1573.000000
25%       3036.000000
50%       4016.000000
75%       5048.000000
max      13916.000000

5.2 Line Plot

A line plot of a time series can provide a lot of insight into the problem.

The example below creates and shows a line plot of the dataset.

from pandas import Series
from matplotlib import pyplot
series = Series.from_csv('dataset.csv')
series.plot()
pyplot.show()

Run the example and review the plot. Note any obvious temporal structures in the series.

Some observations from the plot include:

  • There may be an increasing trend of sales over time.
  • There appears to be systematic seasonality to the sales for each year.
  • The seasonal signal appears to be growing over time, suggesting a multiplicative relationship (increasing change).
  • There do not appear to be any obvious outliers.
  • The seasonality suggests that the series is almost certainly non-stationary.

Champagne Sales Line Plot

There may be benefit in explicitly modeling the seasonal component and removing it. You may also explore using differencing with one or two levels in order to make the series stationary.

The increasing trend or growth in the seasonal component may suggest the use of a log or other power transform.
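
As a rough sketch of these ideas (illustrative only, and not part of the tutorial's main workflow; it assumes the dataset.csv file created in Section 3), the snippet below applies a seasonal difference with a lag of 12 months and then checks the differenced series with an Augmented Dickey-Fuller test from statsmodels. A log or other power transform could be applied to the raw values first in the same way if the growing seasonal swing is a concern.

from pandas import Series
from statsmodels.tsa.stattools import adfuller
# load the modeling dataset created earlier (assumes dataset.csv from Section 3)
series = Series.from_csv('dataset.csv')
X = series.values.astype('float32')
# seasonal difference: subtract the observation from the same month one year earlier
months_in_year = 12
diff = [X[i] - X[i - months_in_year] for i in range(months_in_year, len(X))]
# Augmented Dickey-Fuller test on the differenced series
result = adfuller(diff)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

A strongly negative ADF statistic and a small p-value would support treating the seasonally differenced series as stationary, which in turn informs the differencing terms when configuring ARIMA models later.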

5.3 Seasonal Line Plots

We can confirm the assumption that the seasonality is a yearly cycle by eyeballing line plots of the dataset by year.

The example below takes the 7 full years of data as separate groups and creates one line plot for each. The line plots are aligned vertically to help spot any year-to-year pattern.