Time Series Forecasting: A Comprehensive EDA Framework
Introduction
Time series analysis is a prevalent area in data science and machine learning, crucial for predicting various phenomena like financial events, energy usage, product sales, and market trends. This domain's relevance has surged with the exponential growth of available data and the continuous advancement of machine learning algorithms. Traditional statistical forecasting methods, such as regression models and ARIMA, have been complemented by machine learning techniques (e.g., tree-based models) and deep learning approaches (e.g., LSTMs, CNNs, Transformers).
Regardless of the forecasting method employed, one essential preliminary step is Exploratory Data Analysis (EDA). EDA involves the examination and visualization of data to extract key insights and summarize its primary characteristics. This step is vital in data science, providing the groundwork for another critical phase: feature engineering, which entails the creation and transformation of dataset features to optimize model performance.
This article aims to outline a clear template for conducting exploratory data analysis focused on time series, highlighting the dataset's most significant characteristics. We will utilize popular Python libraries such as Pandas, Seaborn, and Statsmodels.
Data
For our analysis, we will use Kaggle's Hourly Energy Consumption dataset, which contains PJM hourly power consumption data. PJM is a regional transmission organization in the U.S. that supplies electricity to several states, including Delaware, Illinois, and Virginia. Consumption is recorded hourly and measured in megawatts (MW).
Exploratory Data Analysis
We will identify key analyses to perform on time series data. One of the most critical tasks is plotting the data, as visualizations can reveal patterns, anomalies, temporal changes, and variable relationships. Insights gained from these visualizations are crucial for informing forecasting models. Additionally, descriptive statistics and time series decomposition will play significant roles.
The proposed EDA framework consists of six steps: Descriptive Statistics, Time Plot, Seasonal Plots, Box Plots, Time Series Decomposition, and Lag Analysis.
1. Descriptive Statistics
Descriptive statistics provide a quantitative summary of structured data. Common metrics include measures of central tendency (mean, median), measures of dispersion (range, standard deviation), and positional measures (percentiles, quartiles). The five-number summary—comprising minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum—effectively encapsulates these metrics.
In Python, you can easily obtain this information using the following Pandas method:
import pandas as pd

# Loading and preprocessing steps
df = pd.read_csv('../input/hourly-energy-consumption/PJME_hourly.csv')
df = df.set_index('Datetime')
df.index = pd.to_datetime(df.index)

df.describe()
2. Time Plot
The initial visualization should be a time plot, which displays observations against the time they were recorded, connecting consecutive observations with lines. In Python, you can create this plot using Pandas and Matplotlib:
import matplotlib.pyplot as plt

# Set pyplot style (on recent Matplotlib versions the style is named "seaborn-v0_8")
plt.style.use("seaborn")

# Plot
df['PJME_MW'].plot(title='PJME - Time Plot', figsize=(10, 6))
plt.ylabel('Consumption [MW]')
plt.xlabel('Date')
This plot reveals several insights:

1. The data exhibits yearly seasonality.
2. Within each year, the pattern repeats: consumption peaks in winter and summer, when electricity demand is highest.
3. The series shows no clear upward or downward trend over the years; average consumption remains roughly constant.
4. An anomaly appears around 2023, which may need addressing in the model.
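To confirm the intra-year pattern, it helps to zoom into a single year. A minimal sketch, assuming the df loaded earlier (the year chosen here is arbitrary):

# Zoom into a single year to inspect the winter and summer peaks
df.loc['2017', 'PJME_MW'].plot(title='PJME - Year 2017', figsize=(10, 6))
plt.ylabel('Consumption [MW]')
plt.xlabel('Date')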
3. Seasonal Plots
A seasonal plot is essentially a time plot where data points are grouped by the "seasons" they belong to. With hourly energy data, we can observe various seasonal patterns: yearly, weekly, and daily. First, we will create relevant variables in our Pandas DataFrame:
# Defining required fields
df['year'] = [x for x in df.index.year]
df['month'] = [x for x in df.index.month]
df['week'] = df.index.isocalendar().week.to_numpy()  # ISO week number (Timestamp.week is deprecated)
df['hour'] = [x for x in df.index.hour]
df['day'] = [x for x in df.index.day_of_week]
df['day_str'] = [x.strftime('%a') for x in df.index]
df['year_month'] = [str(x.year) + '_' + str(x.month) for x in df.index]
3.1 Seasonal Plot — Yearly Consumption
A significant visualization is the energy consumption grouped by month across years, which highlights yearly seasonality and potential trends over time. Here’s how to create this plot:
import numpy as np
import matplotlib as mpl

# Defining colors palette
np.random.seed(42)
df_plot = df[['month', 'year', 'PJME_MW']].dropna().groupby(['month', 'year']).mean()[['PJME_MW']].reset_index()
years = df_plot['year'].unique()
colors = np.random.choice(list(mpl.colors.XKCD_COLORS.keys()), len(years), replace=False)

# Plot
plt.figure(figsize=(16, 12))
for i, y in enumerate(years):
    if i > 0:
        plt.plot('month', 'PJME_MW', data=df_plot[df_plot['year'] == y], color=colors[i], label=y)
        if y == 2018:
            plt.text(df_plot.loc[df_plot.year == y, :].shape[0] + 0.3,
                     df_plot.loc[df_plot.year == y, 'PJME_MW'].values[-1],
                     y, fontsize=12, color=colors[i])
        else:
            plt.text(df_plot.loc[df_plot.year == y, :].shape[0] + 0.1,
                     df_plot.loc[df_plot.year == y, 'PJME_MW'].values[-1],
                     y, fontsize=12, color=colors[i])

# Setting labels
plt.gca().set(ylabel='PJME_MW', xlabel='Month')
plt.yticks(fontsize=12, alpha=0.7)
plt.title("Seasonal Plot - Monthly Consumption", fontsize=20)
plt.ylabel('Consumption [MW]')
plt.xlabel('Month')
plt.show()
This plot illustrates a clear pattern: consumption peaks during winter and summer, with lower usage in spring and autumn. Additionally, there is no evident trend of increasing or decreasing consumption over the years.
3.2 Seasonal Plot — Weekly Consumption
The weekly plot visualizes consumption trends throughout the week across months, helping to identify any changes in weekly consumption patterns. Here’s how to create this plot:
# Defining colors palette
np.random.seed(42)
df_plot = df[['month', 'day_str', 'PJME_MW', 'day']].dropna().groupby(['day_str', 'month', 'day']).mean()[['PJME_MW']].reset_index()
df_plot = df_plot.sort_values(by='day', ascending=True)

months = df_plot['month'].unique()
colors = np.random.choice(list(mpl.colors.XKCD_COLORS.keys()), len(months), replace=False)

# Plot
plt.figure(figsize=(16, 12))
for i, y in enumerate(months):
    if i > 0:
        plt.plot('day_str', 'PJME_MW', data=df_plot[df_plot['month'] == y], color=colors[i], label=y)
        plt.text(df_plot.loc[df_plot.month == y, :].shape[0] - 0.9,
                 df_plot.loc[df_plot.month == y, 'PJME_MW'].values[-1],
                 y, fontsize=12, color=colors[i])

# Setting labels
plt.gca().set(ylabel='PJME_MW', xlabel='Day of week')
plt.yticks(fontsize=12, alpha=0.7)
plt.title("Seasonal Plot - Weekly Consumption", fontsize=20)
plt.ylabel('Consumption [MW]')
plt.xlabel('Day of week')
plt.show()
3.3 Seasonal Plot — Daily Consumption
The daily consumption plot illustrates how energy use varies throughout the day. The data is grouped by hour and day of the week, then averaged. Here is the code to create this visualization:
import seaborn as sns

# Defining the dataframe
df_plot = df[['hour', 'day_str', 'PJME_MW']].dropna().groupby(['hour', 'day_str']).mean()[['PJME_MW']].reset_index()

# Plot using Seaborn
plt.figure(figsize=(10, 8))
sns.lineplot(data=df_plot, x='hour', y='PJME_MW', hue='day_str', legend=True)
plt.locator_params(axis='x', nbins=24)
plt.title("Seasonal Plot - Daily Consumption", fontsize=20)
plt.ylabel('Consumption [MW]')
plt.xlabel('Hour')
plt.legend()
This plot typically reveals a distinctive "M" pattern, indicating higher consumption around midday (10 am to 2 pm) and again in the early evening (6 pm to 8 pm), with lower usage in the afternoon (2 pm to 6 pm). It also highlights differences in consumption on weekdays versus weekends.
3.4 Seasonal Plot — Feature Engineering
We can leverage insights from seasonal plots for feature engineering. For instance:

1. Yearly consumption remains stable, suggesting yearly seasonal features built from lags or external variables.
2. Weekly consumption patterns are consistent across months, suggesting weekly features built from lags or external variables.
3. Daily consumption diverges between weekdays and weekends, suggesting a categorical feature that distinguishes normal days from weekends.

A minimal sketch of such features follows below.
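The column names below are illustrative rather than part of the original dataset, and the shifts assume a regular hourly index:

# Illustrative calendar and seasonal-lag features (assumes a regular hourly index)
df['is_weekend'] = (df.index.day_of_week >= 5).astype(int)  # Sat/Sun flag
df['lag_1w'] = df['PJME_MW'].shift(24 * 7)                  # same hour, one week earlier
df['lag_1y'] = df['PJME_MW'].shift(24 * 365)                # same hour, roughly one year earlier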
4. Box Plots
Box plots offer a concise way to visualize a distribution. They display the median (Q2) together with the first (Q1) and third (Q3) quartiles, plus whiskers that conventionally extend to 1.5 times the interquartile range. Values outside the whiskers are flagged as outliers.
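For reference, the whisker bounds can be computed manually with the 1.5 x IQR rule; a minimal sketch:

# Compute the 1.5 * IQR whisker bounds used by standard box plots
q1, q3 = df['PJME_MW'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df['PJME_MW'] < lower) | (df['PJME_MW'] > upper)]
print(f'{len(outliers)} observations fall outside the whiskers')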
4.1 Box Plots — Total Consumption
To compute the box plot for total consumption, we can use Seaborn:
plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x='PJME_MW')
plt.xlabel('Consumption [MW]')
plt.title('Boxplot - Consumption Distribution')
This box plot indicates a Gaussian-like distribution, albeit with a right-skewed tail.
4.2 Box Plots — Day/Month Distribution
An insightful box plot examines consumption by month, focusing on data from 2017 onward. Here is how to create it:
df['year'] = [x for x in df.index.year]
df['month'] = [x for x in df.index.month]
df['year_month'] = [str(x.year) + '_' + str(x.month) for x in df.index]

df_plot = df[df['year'] >= 2017].reset_index().sort_values(by='Datetime').set_index('Datetime')

sns.boxplot(x='year_month', y='PJME_MW', data=df_plot)
plt.title('Boxplot Year Month Distribution')
plt.xticks(rotation=90)
plt.ylabel('Consumption [MW]')
plt.xlabel('Year Month')
This box plot reveals that consumption is less variable during summer and winter months (when peaks occur) but more dispersed during spring and autumn. Notably, summer 2018 consumption is higher than in 2017, likely due to warmer weather. When feature engineering, consider including temperature data as an external variable.
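If such data were available, joining it would be straightforward. A sketch assuming a hypothetical hourly file temperature.csv with Datetime and temperature columns (not part of this dataset):

# Hypothetical external regressor: hourly temperature (file is not part of this dataset)
temp = pd.read_csv('temperature.csv', index_col='Datetime', parse_dates=True)
df = df.join(temp['temperature'], how='left')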
4.3 Box Plots — Day Distribution
Another useful visualization is the consumption distribution across the week, similar to the weekly consumption seasonal plot:
df_plot = df[['day_str', 'day', 'PJME_MW']].sort_values(by='day')

sns.boxplot(x='day_str', y='PJME_MW', data=df_plot)
plt.title('Boxplot Day Distribution')
plt.ylabel('Consumption [MW]')
plt.xlabel('Day of week')
As previously noted, consumption is lower on weekends, yet there are several outliers, suggesting that calendar features like "day of the week" are useful but may not fully capture the series’ dynamics.
4.4 Box Plots — Hour Distribution
Lastly, the hour distribution box plot indicates how consumption varies throughout the day. Here’s the code:
sns.boxplot(x='hour', y='PJME_MW', data=df)
plt.title('Boxplot Hour Distribution')
plt.ylabel('Consumption [MW]')
plt.xlabel('Hour')
The "M" shape observed earlier is less pronounced here, and there are numerous outliers. This suggests that consumption relies not only on daily seasonality but also on other factors, such as climatic conditions like temperature or humidity.
5. Time Series Decomposition
Time series data can exhibit various patterns. It is often beneficial to decompose a time series into components, each representing an underlying pattern category. We can conceptualize a time series as comprising three components: a trend, a seasonal component, and a remainder component (containing residual values).
There are two primary decomposition methods: additive and multiplicative. In additive decomposition, the series is represented as the sum of seasonal, trend, and remainder components:

$y_t = S_t + T_t + R_t$
Conversely, multiplicative decomposition expresses the series as their product:

$y_t = S_t \times T_t \times R_t$
In general, additive decomposition suits series whose seasonal fluctuations have roughly constant magnitude, while multiplicative decomposition is more appropriate when the seasonal fluctuations scale with the level of the series.
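One practical consequence: taking logarithms turns a multiplicative structure into an additive one, so an additive decomposition of the log series is a common alternative. A minimal sketch:

import numpy as np

# log(S_t * T_t * R_t) = log(S_t) + log(T_t) + log(R_t),
# so the log-transformed series can be decomposed additively
log_series = np.log(df['PJME_MW'])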
In Python, time series decomposition can be conveniently performed using the Statsmodels library:
from statsmodels.tsa.seasonal import seasonal_decompose

df_plot = df[df['year'] == 2017].reset_index()
df_plot = df_plot.drop_duplicates(subset=['Datetime']).sort_values(by='Datetime')
df_plot = df_plot.set_index('Datetime')
df_plot['PJME_MW - Multiplicative Decompose'] = df_plot['PJME_MW']
df_plot['PJME_MW - Additive Decompose'] = df_plot['PJME_MW']

# Additive decomposition (weekly period: 24 hours * 7 days)
result_add = seasonal_decompose(df_plot['PJME_MW - Additive Decompose'], model='additive', period=24*7)

# Multiplicative decomposition
result_mul = seasonal_decompose(df_plot['PJME_MW - Multiplicative Decompose'], model='multiplicative', period=24*7)

# Plot
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.xticks(rotation=45)
result_mul.plot().suptitle('Multiplicative Decompose', fontsize=22)
plt.xticks(rotation=45)
plt.show()
The plots above reference 2017. Both plots indicate several local peaks in the trend, with higher values in summer. The seasonal component reveals multiple periodicities, emphasizing the weekly cycle, though daily seasonality can also be observed when focusing on specific months.
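Since the series shows both daily and weekly periodicity, one possible extension is a decomposition with multiple seasonal periods. A sketch, assuming statsmodels >= 0.14 (which provides the MSTL class) and the df_plot defined above:

from statsmodels.tsa.seasonal import MSTL

# MSTL decomposes with several seasonal components at once (daily and weekly here)
mstl_result = MSTL(df_plot['PJME_MW'], periods=(24, 24 * 7)).fit()
mstl_result.plot()
plt.show()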
6. Lag Analysis
In time series forecasting, a lag refers to a past value of the series. For daily series, the first lag corresponds to the value from the previous day, the second lag to the value two days prior, and so forth.
Lag analysis involves computing correlations between the series and its lagged versions, known as autocorrelation. The autocorrelation coefficient for a k-lagged version of a series is defined as:

$$r_k = \frac{\sum_{t=k+1}^{T} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{T} (y_t - \bar{y})^2}$$

where $\bar{y}$ represents the series mean and $k$ denotes the lag.
The autocorrelation coefficients form the autocorrelation function (ACF) for the series, depicting the autocorrelation coefficient against the number of lags considered. When a series has a trend, small lags typically exhibit large positive autocorrelations since observations close in time are also similar in value. In seasonal data, autocorrelation values are higher at seasonal lags (and their multiples) compared to other lags. For data exhibiting both trend and seasonality, these effects combine.
The partial autocorrelation function (PACF) is often more useful in practice: it shows only the direct correlation between the series and each lag. For instance, the partial autocorrelation for lag 3 captures the correlation not already explained by lags 1 and 2. In other words, partial correlation reflects the direct effect a particular lag has on the current value.
It's important to note that autocorrelation is easier to interpret when the series is stationary; differencing the series beforehand therefore often helps stabilize the signal.
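One way to check whether differencing is needed is a unit-root test. A minimal sketch using the augmented Dickey-Fuller test from Statsmodels, run on a single hourly subset:

from statsmodels.tsa.stattools import adfuller

# ADF test: a small p-value rejects the unit-root hypothesis (series is stationary)
series = df[df.index.hour == 0]['PJME_MW']
print('p-value (raw):', adfuller(series.dropna())[1])
print('p-value (differenced):', adfuller(series.diff().dropna())[1])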
The following code plots the PACF for various hours of the day:
from statsmodels.graphics.tsaplots import plot_pacf

actual = df['PJME_MW']
hours = range(0, 24, 4)

for hour in hours:
    # Difference each hourly subset before computing the PACF
    plot_pacf(actual[actual.index.hour == hour].diff().dropna(), lags=30, alpha=0.01)
    plt.title(f'PACF - h = {hour}')
    plt.ylabel('Correlation')
    plt.xlabel('Lags')
    plt.show()
The PACF plots show the partial autocorrelation coefficients for different lags. Naturally, the non-lagged series correlates perfectly with itself, so lag 0 always equals 1. The blue band represents the confidence interval: if a coefficient falls outside this band, the corresponding lag is statistically significant and worth considering.
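The significant lags can also be extracted programmatically instead of read off the plot. A sketch using statsmodels' pacf function (the hour chosen is arbitrary):

from statsmodels.tsa.stattools import pacf

# PACF values and 99% confidence intervals for one hourly subset
series = actual[actual.index.hour == 0].diff().dropna()
values, confint = pacf(series, nlags=30, alpha=0.01)

# A lag is significant when its confidence interval excludes zero
significant = [lag for lag in range(1, 31)
               if confint[lag, 0] > 0 or confint[lag, 1] < 0]
print('Significant lags:', significant)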
6.1 Lag Analysis — Feature Engineering
Lag analysis significantly impacts feature engineering for time series. As shown above, a lag that is highly correlated with the series carries predictive information and should be included as a feature.
One common technique involves performing an hourly division of the dataset—dividing data into 24 subsets, each representing an hour of the day. This process regularizes and smooths the signal, simplifying forecasting.
Each subset then undergoes its own feature engineering, training, and fine-tuning, and the final forecast combines the outputs of the 24 models. Each hourly model will have its own characteristics, most notably its important lags.
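A minimal sketch of this split; the model and the two lag features are illustrative (scikit-learn assumed), not a prescription:

from sklearn.linear_model import LinearRegression

# One model per hour of the day (features here are illustrative)
models = {}
for hour in range(24):
    subset = df[df.index.hour == hour].copy()
    subset['lag_1'] = subset['PJME_MW'].shift(1)   # previous day, same hour
    subset['lag_7'] = subset['PJME_MW'].shift(7)   # previous week, same hour
    subset = subset.dropna(subset=['lag_1', 'lag_7'])

    model = LinearRegression()
    model.fit(subset[['lag_1', 'lag_7']], subset['PJME_MW'])
    models[hour] = model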
Two types of lags are relevant to lag analysis:

1. Auto-regressive lags: lags close to lag 0, which are expected to show high values (recent lags are more predictive of current values). They indicate the trend present in the series.
2. Seasonal lags: lags corresponding to seasonal periods; when hourly data is split by hour, these often represent the weekly cycle.
Note that auto-regressive lag 1 can also be interpreted as a daily seasonal lag for the series.
Let’s now analyze the PACF plots presented earlier.
Night Hours
Consumption during nighttime hours (0 and 4) relies more on auto-regressive lags than on weekly ones, with the most significant correlations concentrated in the first five lags. Seasonal lags such as 7, 14, 21, and 28 appear less impactful, so lags 1 to 5 should be prioritized during feature engineering.
Day Hours
In contrast, daytime hours (8, 12, 16, 20) exhibit both auto-regressive and seasonal lags. This is particularly true for hours 8 and 12, when consumption peaks, while the importance of seasonal lags diminishes as night approaches. For these subsets, include both seasonal and auto-regressive lags in your analysis.
Here are some tips for effective lag feature engineering (a sketch follows below):

- Avoid considering too many lags, to prevent overfitting. Generally, auto-regressive lags range from 1 to 7, while weekly lags cover 7, 14, 21, and 28; it is not mandatory to use all of them as features.
- Including non-auto-regressive and non-seasonal lags is typically inadvisable, as they may contribute to overfitting. Focus on understanding why a particular lag is important.
- Transforming lags can result in more effective features. For instance, seasonal lags can be aggregated using a weighted mean to create a single feature representing the series' seasonality.
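A minimal sketch of these ideas on one hourly subset; the weights and the hour chosen are illustrative:

# Lag features for one hourly subset (lag 1 = previous day, same hour)
subset = df[df.index.hour == 12].copy()

# Auto-regressive lags
for lag in range(1, 6):
    subset[f'lag_{lag}'] = subset['PJME_MW'].shift(lag)

# Seasonal (weekly) lags aggregated into a single weighted feature
weights = [0.4, 0.3, 0.2, 0.1]  # illustrative: recent weeks weigh more
subset['seasonal_mean'] = sum(
    w * subset['PJME_MW'].shift(lag)
    for w, lag in zip(weights, [7, 14, 21, 28])
)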
Free Resources
Lastly, I would like to highlight a valuable (and free) resource for understanding time series: Forecasting: Principles and Practice by Hyndman and Athanasopoulos. Though it primarily uses R rather than Python, this textbook offers an excellent introduction to forecasting methods and covers essential aspects of time series analysis.
Conclusion
This article has aimed to provide a comprehensive template for conducting Exploratory Data Analysis (EDA) in time series forecasting. EDA is a crucial step in any data science endeavor, as it helps to understand the data's nature and characteristics, laying the groundwork for feature engineering that can significantly enhance model performance.
We have explored various analyses commonly employed in time series EDA, encompassing both statistical/mathematical methods and visualizations. The intent of this work is to provide a practical framework for initiating your analysis, while further investigations should be conducted based on the specific historical series and business context.
Thank you for following this guide to the end.
Unless otherwise noted, all images are by the author.