Mastering Pandas MultiIndexing for Time Series Data Analysis
Written on
Chapter 1: Introduction to Data Cleaning
Before diving into predictive modeling or analytics, it's crucial to format and clean your dataset. The Pandas library offers robust tools that significantly expedite this process — the quicker we clean, the more time we can dedicate to analysis and modeling.
Here's a dataset consisting of stock prices for four companies over several weeks (notably, a date for Microsoft is intentionally excluded):
ticker date close
0 AAPL 2021-03-18 120.53
1 AAPL 2021-03-19 119.99
2 AAPL 2021-03-22 123.39
3 AAPL 2021-03-23 122.54
4 AAPL 2021-03-24 120.09
5 AAPL 2021-03-25 120.59
6 AAPL 2021-03-26 121.21
7 AMZN 2021-03-18 3027.99
8 AMZN 2021-03-19 3074.96
9 AMZN 2021-03-22 3110.87
10 AMZN 2021-03-23 3137.50
11 AMZN 2021-03-24 3087.07
12 AMZN 2021-03-25 3046.26
13 AMZN 2021-03-26 3052.03
14 GOOGL 2021-03-18 2021.34
15 GOOGL 2021-03-19 2026.96
16 GOOGL 2021-03-22 2030.69
17 GOOGL 2021-03-23 2041.33
18 GOOGL 2021-03-24 2032.53
19 GOOGL 2021-03-25 2032.46
20 GOOGL 2021-03-26 2024.73
21 MSFT 2021-03-18 230.72
22 MSFT 2021-03-19 230.35
23 MSFT 2021-03-22 235.99
24 MSFT 2021-03-23 237.58
25 MSFT 2021-03-24 235.46
26 MSFT 2021-03-25 232.34
Suppose we wish to compute the cumulative return for each stock during the week and evaluate their co-movement through a correlation matrix. Currently, the data is in a stacked configuration, making it challenging to analyze. A tabular format where each row represents a date and each column represents a stock is far more efficient for time series analysis.
Getting to a Tabular Format
To convert our data into a tabular structure, we need to:
- Group the data by stock.
- Horizontally stack the data.
Utilizing Pandas' MultiIndex feature makes this straightforward. By executing the following command on our dataset (stored in a DataFrame named data), we can create a multi-index:
data = data.set_index(['ticker', 'date'])
By indexing with both stock ticker and date, we enable multi-indexing since we are referencing more than one column. Displaying the resulting DataFrame illustrates how multi-indexing simplifies our work:
print(data)
Output:
close
ticker date
AAPL 2021-03-18 120.53
2021-03-19 119.99
2021-03-22 123.39
2021-03-23 122.54
2021-03-24 120.09
2021-03-25 120.59
2021-03-26 121.21
AMZN 2021-03-18 3027.99
2021-03-19 3074.96
2021-03-22 3110.87
2021-03-23 3137.50
2021-03-24 3087.07
2021-03-25 3046.26
2021-03-26 3052.03
GOOGL 2021-03-18 2021.34
2021-03-19 2026.96
2021-03-22 2030.69
2021-03-23 2041.33
2021-03-24 2032.53
2021-03-25 2032.46
2021-03-26 2024.73
MSFT 2021-03-18 230.72
2021-03-19 230.35
2021-03-22 235.99
2021-03-23 237.58
2021-03-24 235.46
2021-03-25 232.34
The DataFrame is now neatly organized by stock ticker. For instance, extracting just Apple's returns is as simple as:
data.loc['AAPL']
Output:
close
date
2021-03-18 120.53
2021-03-19 119.99
2021-03-22 123.39
2021-03-23 122.54
2021-03-24 120.09
2021-03-25 120.59
2021-03-26 121.21
You can also retrieve data for a specific date across multiple stocks using IndexSlice, which simplifies multi-index slicing:
idx = pd.IndexSlice
print(data.loc[idx[['AAPL', 'GOOGL'], '2021-03-18'],])
Output:
close
ticker date
AAPL 2021-03-18 120.53
GOOGL 2021-03-18 2021.34
Horizontally Stacking by Stock
In this scenario, we want to take the returns for each stock and stack them horizontally, ensuring the prices align correctly. Misalignment can ruin a time series analysis, so this is critical. We can achieve this with a straightforward loop:
# Retrieve tickers from the first level of the multi-index
tickers = list(set(data.index.get_level_values(0)))
# Initialize a DataFrame for the tabular stock price data
clean_table = data.loc['AAPL'][['close']]
clean_table.rename({'close': 'AAPL'}, axis=1, inplace=True)
# Loop through all tickers excluding Apple and "h-stack" prices
for ticker in tickers:
if ticker != 'AAPL':
clean_table[ticker] = data.loc[ticker][['close']]
print(clean_table)
Output:
AAPL AMZN MSFT GOOGL
date
2021-03-18 120.53 3027.99 230.72 2021.34
2021-03-19 119.99 3074.96 230.35 2026.96
2021-03-22 123.39 3110.87 235.99 2030.69
2021-03-23 122.54 3137.50 237.58 2041.33
2021-03-24 120.09 3087.07 235.46 2032.53
2021-03-25 120.59 3046.26 232.34 2032.46
2021-03-26 121.21 3052.03 NaN 2024.73
Notice how effortlessly we can incorporate new data into our DataFrame with just one line of code. Since both clean_table and data.loc[ticker][['close']] are indexed by date, Pandas ensures that the dates are aligned properly. When adding another DataFrame or Series to an existing one, Pandas will only add non-null values where the indexes match.
However, if we try to add data with mismatched dates:
# Sample stale data with mismatched dates
wrong_dates = pd.Series(
[26.337, 25.677, 25.175, 24.113, 24.240, 24.633, 24.990],
index=pd.date_range('2016-01-04', periods=7)
)
clean_table['wrong_dates'] = wrong_dates
print(clean_table)
Output:
AAPL AMZN MSFT GOOGL wrong_dates
date
2021-03-18 120.53 3027.99 230.72 2021.34 NaN
2021-03-19 119.99 3074.96 230.35 2026.96 NaN
2021-03-22 123.39 3110.87 235.99 2030.69 NaN
2021-03-23 122.54 3137.50 237.58 2041.33 NaN
2021-03-24 120.09 3087.07 235.46 2032.53 NaN
2021-03-25 120.59 3046.26 232.34 2032.46 NaN
2021-03-26 121.21 3052.03 NaN 2024.73 NaN
Pandas detects the date discrepancies and fills in null values instead of numeric data. If you wish to incorporate wrong_dates into clean_table, you could perform an outer merge:
clean_table.merge(wrong_dates, how='outer', left_index=True, right_index=True)
Additionally, observe that MSFT lacks a value due to having one less price observation than the others. Pandas effectively handles this scenario as well.
Now that we comprehend how indexes aid in aligning our data, let's remove the null values to finalize our analysis:
clean_table = clean_table.dropna(axis=1, how='all').dropna(axis=0)
print(clean_table)
Output:
AAPL AMZN MSFT GOOGL
date
2021-03-18 120.53 3027.99 230.72 2021.34
2021-03-19 119.99 3074.96 230.35 2026.96
2021-03-22 123.39 3110.87 235.99 2030.69
2021-03-23 122.54 3137.50 237.58 2041.33
2021-03-24 120.09 3087.07 235.46 2032.53
2021-03-25 120.59 3046.26 232.34 2032.46
Chapter 2: Cumulative Returns and Correlations
To derive cumulative returns and correlations, we first calculate the monthly returns, which can be easily obtained using the shift method (this method allows us to divide each observation by the preceding one). Then we apply cumprod to compute the cumulative return. The last entry of cumul_returns provides the cumulative return for each stock over our selected period:
stock_returns = (clean_table / clean_table.shift(1) - 1).dropna()
cumul_returns = (1 + stock_returns).cumprod() - 1
cumul_returns.iloc[-1].plot(kind='bar', figsize=(8, 5));
plt.tight_layout()
plt.savefig('ts_cumul_returns')
Here’s the plot illustrating cumulative returns for our brief sample period (Source: Sharadar, graphic created by the author):
Lastly, we can generate our correlation matrix:
print(stock_returns.corr())
Output:
AAPL AMZN MSFT GOOGL
AAPL 1.000000 0.440295 0.724448 0.344711
AMZN 0.440295 1.000000 0.710669 0.814061
MSFT 0.724448 0.710669 1.000000 0.499615
GOOGL 0.344711 0.814061 0.499615 1.000000
Conclusion
And there you have it! Pandas truly simplifies the analysis of time series data — a key reason for its popularity in data analysis. The more you understand Pandas' intricacies, like multi-indexing, the less time you’ll spend on cleaning and formatting, allowing you to focus more on analysis and modeling. Cheers!