Beginner’s Guide to Predicting Time Series Data with Python

Learn how to predict time series data with Linear Regression, SARIMA, Exponential Smoothing or TBATS and what to consider when using these machine learning models.

Aug 01, 2024

Predicting future events is part of human nature. Even in ancient times, people tried to interpret future events from the flight of birds or the constellations of the stars. These early methods were based on observations and interpretations of natural phenomena. Today, instead of relying on experience alone, we can use machine learning models that make predictions from historical data.

This article introduces beginners to traditional machine learning models for predicting time series data. Time series data refers to observations recorded at regular intervals over time. This type of data is prevalent across various domains, including finance, where it’s used for tracking stock prices and market trends, and meteorology, for forecasting weather patterns. It’s also crucial in managing inventories and monitoring the performance of supply chains. Moreover, in the realm of IoT and smart devices, time series data plays a pivotal role, capturing everything from household energy usage to environmental conditions. The essence of time series analysis lies in its focus on temporal sequences, making the timing of each data point critical. This distinction sets it apart from other analytical approaches. The aim is to discover patterns, trends and seasonal fluctuations over time and then take these components into account in the prediction. This article covers linear regression, (S)ARIMA, exponential smoothing and TBATS. It explores the strengths and drawbacks of each approach and provides actionable guidance for beginners.

Predicting time series data with Linear Regression

Linear regression models the relationship between a dependent variable and one or more independent variables. The linear equation for this has the form:

y = xw + b

Here, y is the predicted variable, x the independent variable, w the weight (slope) and b the bias (intercept). Linear regression is particularly useful for identifying trends in time series data and making predictions by finding the best-fit line through the data points.

The advantages of linear regression are the simplicity of the model and its interpretability. The models are easy to implement and show how the independent variables influence the dependent variable. The method is less useful for capturing non-linear relationships and complex patterns within the data. When comparing different (advanced) machine learning methods, linear regression is often used as a baseline. The model provides a reference point to measure the improvements achieved by more complex algorithms.

One example is the prediction of energy consumption data: An energy supplier wants to predict the electricity consumption of its customers based on temperature data. The hypothesis is that as temperatures fall (e.g. in the winter months in Europe), energy consumption from heating and lighting will increase, while in the summer months energy consumption will fall.

Tips for practical implementation

The scikit-learn library offers an easy introduction to practical implementation in Python. After importing the necessary packages (pandas for loading the data, sklearn.model_selection for splitting it into training and test sets, sklearn.linear_model for the linear regression model, matplotlib for creating the visualizations) and loading the data, you can fit the model to the data using .fit(). X contains the data of the independent variables and Y those of the dependent variable. From the fitted model you can read the regression coefficient (model.coef_) and compute the coefficient of determination R² (model.score()). You need both to assess the quality of a linear regression analysis and to interpret it.

Python code example for a linear regression

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Load the datasets
temperature_data = pd.read_csv('temperaturedata.csv')
energy_data = pd.read_csv('energydata.csv')

# Assuming the data is in the first column, adjust if necessary
X = temperature_data.iloc[:, 0].values.reshape(-1, 1)  # Reshape for scikit-learn
Y = energy_data.iloc[:, 0].values

# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# Create a linear regression model
model = LinearRegression()

# Train the model
model.fit(X_train, Y_train)

# Make predictions
Y_pred = model.predict(X_test)

# Visualize the results
plt.scatter(X_test, Y_test, color='black')
plt.plot(X_test, Y_pred, color='blue', linewidth=3)
plt.xlabel('Temperature (Celsius)')
plt.ylabel('Energy Consumption (KW)')
plt.show()
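
As a minimal sketch of how the coefficient of determination and the regression coefficient can be read from a fitted scikit-learn model, the following example uses synthetic, made-up data (an assumption for illustration, not the article's dataset) in which consumption falls by about 2 units per degree Celsius:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): consumption = 100 - 2 * temperature + noise
rng = np.random.default_rng(0)
temperature = rng.uniform(-5, 30, size=200).reshape(-1, 1)
energy = 100 - 2.0 * temperature.ravel() + rng.normal(0, 3, size=200)

model = LinearRegression().fit(temperature, energy)

slope = model.coef_[0]                         # w in y = xw + b
intercept = model.intercept_                   # b in y = xw + b
r_squared = model.score(temperature, energy)   # coefficient of determination R^2

print(f"w = {slope:.2f}, b = {intercept:.2f}, R^2 = {r_squared:.3f}")
```

A negative slope here would support the hypothesis that consumption rises as temperatures fall, and an R² close to 1 indicates that temperature explains most of the variation in this artificial dataset.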

Predicting time series data with (S)ARIMA

(S)ARIMA stands for Seasonal Autoregressive Integrated Moving Average. The method originates from the well-known ARIMA family — but also takes into account the seasonality in the data. The model is well suited to time series that exhibit both seasonal and non-seasonal patterns.

When selecting the parameters, the ‘order’ and ‘seasonal_order’ parameters are decisive:

model = SARIMAX(cleaned_df['variabletopredict'], 
                order=(p, d, q), 
                seasonal_order=(P, D, Q, s))

p is read from the significant spikes in the PACF plot, d is the degree of differencing needed to achieve stationarity, and q is read from the significant spikes in the ACF plot. The parameters P, D and Q are analogous, but apply to the seasonal component of the time series. s is the periodicity of the seasonality, i.e. the number of observations in one seasonal cycle. For example, if the energy consumption data is available every 15 minutes and the pattern repeats daily, s could be 96 (4 readings per hour × 24 hours = 96).
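
The seasonal period s is simply the number of observations in one seasonal cycle. A small sketch of that arithmetic follows; the helper name observations_per_cycle is hypothetical and only serves as an illustration:

```python
import pandas as pd

def observations_per_cycle(sampling_interval: str, cycle_length: str) -> int:
    """Hypothetical helper: how many observations fit into one seasonal cycle."""
    ratio = pd.Timedelta(cycle_length) / pd.Timedelta(sampling_interval)
    return int(round(ratio))

s_daily = observations_per_cycle("15min", "1D")    # 15-minute readings, daily cycle
s_weekly = observations_per_cycle("15min", "7D")   # 15-minute readings, weekly cycle
s_hourly = observations_per_cycle("1h", "1D")      # hourly readings, daily cycle

print(s_daily, s_weekly, s_hourly)  # 96 672 24
```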

The method is particularly strong in modeling data with clear trends and seasonal patterns. For data that is highly volatile or contains non-linear patterns, modeling and parameter setting can be complex. In the example of energy consumption data, the energy supplier would like to predict future electricity consumption based on historical consumption data. Especially since smart meters have been installed in households across the board in various countries, this historical data has become more easily available. This data shows peaks in winter and summer (annual seasonality) and in the morning and evening (daily seasonality), for example.

Tips for practical implementation

To perform an in-depth analysis and modeling of time series data, such as energy consumption data, with (S)ARIMA, you ideally start with an exploratory data analysis (EDA). This involves examining the data distribution, for example using histograms, as well as a detailed time series analysis that shows missing data and temporal patterns. An essential step is the creation of an autocorrelation plot to visualize how the data depends on its own values at different time lags. In addition, the decomposition of the time series into its main components (trend, seasonality and residuals) is informative. The ACF (autocorrelation function) and PACF (partial autocorrelation function) plots are helpful for model parameterization. Before fitting the model, you should carry out a stationarity test to check whether differencing is necessary. Start with a simple model and increase the complexity step by step.

Python code example for SARIMA

# Import necessary libraries
import pandas as pd
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX
import matplotlib.pyplot as plt

# Load your dataset
energy_data = pd.read_csv('energydata.csv')
temperature_data = pd.read_csv('temperaturedata.csv')

# Assuming each file has a 'Date' column in a format pandas can parse
energy_data['Date'] = pd.to_datetime(energy_data['Date'])
temperature_data['Date'] = pd.to_datetime(temperature_data['Date'])

# Merge datasets on Date
data = pd.merge(energy_data, temperature_data, on='Date')

# Set Date column as index
data.set_index('Date', inplace=True)

# Define and fit the model (s=12 assumes monthly data with a yearly cycle)
model = SARIMAX(data['Energy'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
results = model.fit()

# Forecast
forecast = results.forecast(steps=12)

# Plot the results
plt.figure(figsize=(10,6))
plt.plot(data['Energy'], label='Actual Energy Consumption')
plt.plot(forecast, label='Forecasted Energy Consumption', color='red')
plt.xlabel('Date')
plt.ylabel('Energy Consumption')
plt.legend()
plt.show()

Time Series Forecasting with Python (Linear Regression, SARIMA, Exponential Smoothing, TBATS) — Image from Shutterstock

Predicting time series data with Exponential Smoothing

With exponential smoothing, the more recent observations are given greater weight than the older observations. The aim of this exponential smoothing is to ensure that the forecast is less influenced by random fluctuations in the data.

The model is well suited for short-term forecasts and is computationally and memory efficient. It is less suitable for modeling complex time series with strong non-linear trends or for forecasting over longer periods of time. It is also sometimes difficult to determine the optimum values for the smoothing parameters. For example, the energy consumption data shows daily and annual consumption patterns, which suggests the use of the Holt-Winters method (Triple Exponential Smoothing).

Tips for practical implementation

It is best to start again with an exploratory data analysis (EDA), as with SARIMA, in order to recognize patterns, trends and seasonalities in the data. Unlike (S)ARIMA, exponential smoothing does not require you to make the series stationary first: simple exponential smoothing assumes a series without trend or seasonality, and if the data shows either, you account for it by adding trend and seasonal components to the model (see the variants below). For model validation, it is best to look at the Mean Absolute Error (MAE), the Mean Squared Error (MSE), the Root Mean Squared Error (RMSE) and the Mean Absolute Percentage Error (MAPE). In addition, the residuals should be randomly distributed around zero.
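
The four error metrics mentioned above can be computed with a few lines of NumPy. The numbers below are made up purely to show the arithmetic:

```python
import numpy as np

# Made-up actual and predicted values (assumption, for illustration only)
actual = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 330.0, 380.0])

errors = actual - predicted

mae = np.mean(np.abs(errors))                    # Mean Absolute Error: 17.5
mse = np.mean(errors ** 2)                       # Mean Squared Error: 375.0
rmse = np.sqrt(mse)                              # Root Mean Squared Error: about 19.36
mape = np.mean(np.abs(errors / actual)) * 100    # Mean Absolute Percentage Error: 7.5 %

print(f"MAE={mae:.2f}, MSE={mse:.2f}, RMSE={rmse:.2f}, MAPE={mape:.2f}%")
```

Note that MAPE divides by the actual values, so it is not usable when the series contains zeros.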

There are different types of this model, with the following three being the most common:

1. Simple Exponential Smoothing
Simple exponential smoothing is the most basic form of the method. It assumes that the time series has no trend and no seasonality. The forecast for the next period is a weighted average of the most recent observation and the previous forecast. With trend=None and seasonal=None, neither a trend nor a seasonal component is included in the model.

# Import necessary libraries
import pandas as pd
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing
import matplotlib.pyplot as plt

# Load your dataset
energy_data = pd.read_csv('energydata.csv')

# Assuming the file has a 'Date' column in a format pandas can parse
energy_data['Date'] = pd.to_datetime(energy_data['Date'])

# Set Date column as index
energy_data.set_index('Date', inplace=True)

# Define and fit the model for simple exponential smoothing
model = ExponentialSmoothing(energy_data['Energy'], trend=None, seasonal=None)
results = model.fit()

# Forecast the next 14 periods
forecast = results.forecast(steps=14)

# Plot the results
plt.figure(figsize=(10,6))
plt.plot(energy_data['Energy'], label='Actual Energy Consumption')
plt.plot(forecast, label='Forecasted Energy Consumption', color='red')
plt.xlabel('Date')
plt.ylabel('Energy Consumption')
plt.legend()
plt.show()
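
To make the weighting idea concrete, here is a hand-written sketch of the simple exponential smoothing update. The function name, the smoothing factor alpha=0.5 and the starting level are assumptions chosen for illustration; statsmodels estimates these values itself.

```python
def simple_exponential_smoothing(observations, alpha, initial_level):
    """Return the one-step-ahead forecast after processing all observations."""
    level = initial_level
    for y in observations:
        # New level = weighted average of the newest observation and the old level
        level = alpha * y + (1 - alpha) * level
    return level

forecast = simple_exponential_smoothing([10.0, 20.0, 30.0], alpha=0.5, initial_level=10.0)
print(forecast)  # 22.5
```

With alpha close to 1 the forecast follows the latest observation closely; with alpha close to 0 it reacts slowly and smooths out random fluctuations more strongly.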

2. Double Exponential Smoothing / Holt’s Linear Exponential Smoothing

This model is used to forecast time series data that has a linear trend but no seasonal pattern. The argument trend='add' specifies that the model should take an additive trend component into account, and seasonal=None specifies that no seasonal component is modeled. Depending on the time series, the trend component can be additive or multiplicative.

from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Define and fit the model for double exponential smoothing
model = ExponentialSmoothing(energy_data['Energy'], trend='add', seasonal=None)
results = model.fit()

3. Triple Exponential Smoothing / Holt-Winters Exponential Smoothing

The third model is used if the time series data has both a trend and a seasonal component. With trend='add' and seasonal='add', an additive trend component and an additive seasonal component are taken into account. seasonal_periods=12 sets the length of one seasonal cycle to 12 observations, which corresponds to a yearly cycle in monthly data. The trend and seasonal components can be additive or multiplicative, depending on the time series.

from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Define and fit the model for triple exponential smoothing (Holt-Winters)
model = ExponentialSmoothing(energy_data['Energy'], trend='add', seasonal='add', seasonal_periods=12)
results = model.fit()

Predicting time series data with TBATS

TBATS stands for Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components. The model was specially developed to deal with complex seasonal patterns. The model contains various components:

Trigonometric seasonality: This enables the modeling of seasonal patterns with variable length. For example, daily and weekly patterns can be captured in energy consumption data.
Box-Cox transformation: This transforms the time series to ensure variance stability. This is particularly important when the variance is not constant. For example, non-stationarity in the variance of energy consumption data could be stabilized.
ARMA error: An ARMA model is used for the error terms, which takes autocorrelation into account.
Trend: The model can take into account both fixed and randomly changing trends in the data. For example, a stochastic trend in the energy consumption data could be assumed to account for possible changes in consumption patterns over time.
Seasonal component: This allows the model to handle multiple seasonal cycles with different lengths. In the energy consumption data, these could again be daily, weekly or annual cycles.
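
To illustrate the Box-Cox component in isolation, the following sketch applies the transformation with SciPy to a synthetic positive series whose spread grows with its level (an assumption for illustration). TBATS implementations typically handle this step internally, so the code is only meant to show what the transformation does.

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# Synthetic positive series (assumption): variance increases with the level
rng = np.random.default_rng(3)
level = np.linspace(10, 100, 300)
series = level * np.exp(rng.normal(0, 0.1, 300))

# boxcox estimates the lambda parameter when none is given
transformed, lam = boxcox(series)

# Forecasts made on the transformed scale must be transformed back
restored = inv_boxcox(transformed, lam)

print(f"Estimated lambda: {lam:.3f}")
print("Round trip OK:", np.allclose(restored, series))
```

The Box-Cox transformation requires strictly positive values, which is usually the case for energy consumption data.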

The model is suitable for data with complex seasonal patterns and when several seasonal patterns are to be modeled simultaneously. TBATS is often more computationally intensive than the simpler models — especially for long time series with several seasonal cycles.

Tips for practical implementation

Start again with an exploratory data analysis (EDA) to understand the structure and seasonal patterns in the time series. After importing the libraries (pandas for data manipulation and analysis, TBATS for modeling and forecasting) and loading the data, convert the column with the date into a date format (pd.to_datetime()). You then define the date column as the index of the data frame, which is always important for time series analyses. Before the prediction, a TBATS estimator is initialized with specific seasonal periods. Here, for example, 24 is defined for the number of hours in a day and 168 for the number of hours in a week. The model learns the patterns in the data and finally makes a prediction of energy consumption for the next 48 hours.

Python code example for TBATS

# Import necessary libraries
import pandas as pd
from tbats import TBATS

# Load your dataset
energy_data = pd.read_csv('energydata.csv')

# Assuming your datetime data is in the first column and is properly formatted
energy_data['Date'] = pd.to_datetime(energy_data['Date'])

# Set Date column as index
energy_data.set_index('Date', inplace=True)

# Initialize and fit the TBATS model
estimator = TBATS(seasonal_periods=(24, 168))  # Daily (24 hours) and weekly (168 hours) seasonal periods
model = estimator.fit(energy_data['Energy'])

# Forecast
forecast = model.forecast(steps=48)  # Forecast for the next 48 hours

# Show the results
print(forecast)
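
To build intuition for trigonometric seasonality, the sketch below generates sine and cosine terms for the daily (24) and weekly (168) periods used above. This illustrates the general idea only, not the internal state-space formulation of TBATS; the function name and the number of harmonics are assumptions.

```python
import numpy as np

def fourier_terms(n_obs, period, n_harmonics):
    """Hypothetical helper: sine/cosine pairs describing one seasonal period."""
    t = np.arange(n_obs)
    columns = []
    for k in range(1, n_harmonics + 1):
        angle = 2 * np.pi * k * t / period
        columns.append(np.sin(angle))
        columns.append(np.cos(angle))
    return np.column_stack(columns)

n_hours = 336  # two weeks of hourly observations
daily = fourier_terms(n_hours, period=24, n_harmonics=2)
weekly = fourier_terms(n_hours, period=168, n_harmonics=2)
features = np.hstack([daily, weekly])

print(features.shape)  # (336, 8)
```

A handful of such terms per period can describe smooth seasonal shapes compactly, which is why the approach copes with long periods and several overlapping cycles.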

Conclusion

In contrast to observations that we humans make and try to interpret, these models use historical data and make predictions based on them. We encounter time series data, which is recorded sequentially over a period of time, in many different areas. What the models have in common is that they place particular emphasis on recognizing patterns, trends and seasonal fluctuations when analyzing and predicting. While linear regression and exponential smoothing are ideal for simpler time series without strong seasonal or non-linear patterns, (S)ARIMA and TBATS offer more sophisticated methods for handling more complex data with pronounced seasonal components or trends.

Data Science Espresso by Sarah Lea
