Mastering Pandas MultiIndexing for Time Series Data Analysis

Chapter 1: Introduction to Data Cleaning

Before diving into predictive modeling or analytics, it's crucial to format and clean your dataset. The Pandas library offers robust tools that significantly expedite this process — the quicker we clean, the more time we can dedicate to analysis and modeling.

Here's a dataset consisting of stock prices for four companies over several weeks (notably, a date for Microsoft is intentionally excluded):

ticker date close

0 AAPL 2021-03-18 120.53

1 AAPL 2021-03-19 119.99

2 AAPL 2021-03-22 123.39

3 AAPL 2021-03-23 122.54

4 AAPL 2021-03-24 120.09

5 AAPL 2021-03-25 120.59

6 AAPL 2021-03-26 121.21

7 AMZN 2021-03-18 3027.99

8 AMZN 2021-03-19 3074.96

9 AMZN 2021-03-22 3110.87

10 AMZN 2021-03-23 3137.50

11 AMZN 2021-03-24 3087.07

12 AMZN 2021-03-25 3046.26

13 AMZN 2021-03-26 3052.03

14 GOOGL 2021-03-18 2021.34

15 GOOGL 2021-03-19 2026.96

16 GOOGL 2021-03-22 2030.69

17 GOOGL 2021-03-23 2041.33

18 GOOGL 2021-03-24 2032.53

19 GOOGL 2021-03-25 2032.46

20 GOOGL 2021-03-26 2024.73

21 MSFT 2021-03-18 230.72

22 MSFT 2021-03-19 230.35

23 MSFT 2021-03-22 235.99

24 MSFT 2021-03-23 237.58

25 MSFT 2021-03-24 235.46

26 MSFT 2021-03-25 232.34

Suppose we wish to compute the cumulative return for each stock during the week and evaluate their co-movement through a correlation matrix. Currently, the data is in a stacked configuration, making it challenging to analyze. A tabular format where each row represents a date and each column represents a stock is far more efficient for time series analysis.

Getting to a Tabular Format

To convert our data into a tabular structure, we need to:

Group the data by stock.
Horizontally stack the data.

Utilizing Pandas' MultiIndex feature makes this straightforward. By executing the following command on our dataset (stored in a DataFrame named data), we can create a multi-index:

data = data.set_index(['ticker', 'date'])

By indexing with both stock ticker and date, we enable multi-indexing since we are referencing more than one column. Displaying the resulting DataFrame illustrates how multi-indexing simplifies our work:

print(data)

Output:

close

ticker date

AAPL 2021-03-18 120.53

2021-03-19 119.99

2021-03-22 123.39

2021-03-23 122.54

2021-03-24 120.09

2021-03-25 120.59

2021-03-26 121.21

AMZN 2021-03-18 3027.99

2021-03-19 3074.96

2021-03-22 3110.87

2021-03-23 3137.50

2021-03-24 3087.07

2021-03-25 3046.26

2021-03-26 3052.03

GOOGL 2021-03-18 2021.34

2021-03-19 2026.96

2021-03-22 2030.69

2021-03-23 2041.33

2021-03-24 2032.53

2021-03-25 2032.46

2021-03-26 2024.73

MSFT 2021-03-18 230.72

2021-03-19 230.35

2021-03-22 235.99

2021-03-23 237.58

2021-03-24 235.46

2021-03-25 232.34

The DataFrame is now neatly organized by stock ticker. For instance, extracting just Apple's returns is as simple as:

data.loc['AAPL']

Output:

close

date

2021-03-18 120.53

2021-03-19 119.99

2021-03-22 123.39

2021-03-23 122.54

2021-03-24 120.09

2021-03-25 120.59

2021-03-26 121.21

You can also retrieve data for a specific date across multiple stocks using IndexSlice, which simplifies multi-index slicing:

idx = pd.IndexSlice

print(data.loc[idx[['AAPL', 'GOOGL'], '2021-03-18'],])

Output:

close

ticker date

AAPL 2021-03-18 120.53

GOOGL 2021-03-18 2021.34

Horizontally Stacking by Stock

In this scenario, we want to take the returns for each stock and stack them horizontally, ensuring the prices align correctly. Misalignment can ruin a time series analysis, so this is critical. We can achieve this with a straightforward loop:

# Retrieve tickers from the first level of the multi-index

tickers = list(set(data.index.get_level_values(0)))

# Initialize a DataFrame for the tabular stock price data

clean_table = data.loc['AAPL'][['close']]

clean_table.rename({'close': 'AAPL'}, axis=1, inplace=True)

# Loop through all tickers excluding Apple and "h-stack" prices

for ticker in tickers:

if ticker != 'AAPL':

clean_table[ticker] = data.loc[ticker][['close']]

print(clean_table)

Output:

AAPL AMZN MSFT GOOGL

date

2021-03-18 120.53 3027.99 230.72 2021.34

2021-03-19 119.99 3074.96 230.35 2026.96

2021-03-22 123.39 3110.87 235.99 2030.69

2021-03-23 122.54 3137.50 237.58 2041.33

2021-03-24 120.09 3087.07 235.46 2032.53

2021-03-25 120.59 3046.26 232.34 2032.46

2021-03-26 121.21 3052.03 NaN 2024.73

Notice how effortlessly we can incorporate new data into our DataFrame with just one line of code. Since both clean_table and data.loc[ticker][['close']] are indexed by date, Pandas ensures that the dates are aligned properly. When adding another DataFrame or Series to an existing one, Pandas will only add non-null values where the indexes match.

However, if we try to add data with mismatched dates:

# Sample stale data with mismatched dates

wrong_dates = pd.Series(

[26.337, 25.677, 25.175, 24.113, 24.240, 24.633, 24.990],

index=pd.date_range('2016-01-04', periods=7)

)

clean_table['wrong_dates'] = wrong_dates

print(clean_table)

Output:

AAPL AMZN MSFT GOOGL wrong_dates

date

2021-03-18 120.53 3027.99 230.72 2021.34 NaN

2021-03-19 119.99 3074.96 230.35 2026.96 NaN

2021-03-22 123.39 3110.87 235.99 2030.69 NaN

2021-03-23 122.54 3137.50 237.58 2041.33 NaN

2021-03-24 120.09 3087.07 235.46 2032.53 NaN

2021-03-25 120.59 3046.26 232.34 2032.46 NaN

2021-03-26 121.21 3052.03 NaN 2024.73 NaN

Pandas detects the date discrepancies and fills in null values instead of numeric data. If you wish to incorporate wrong_dates into clean_table, you could perform an outer merge:

clean_table.merge(wrong_dates, how='outer', left_index=True, right_index=True)

Additionally, observe that MSFT lacks a value due to having one less price observation than the others. Pandas effectively handles this scenario as well.

Now that we comprehend how indexes aid in aligning our data, let's remove the null values to finalize our analysis:

clean_table = clean_table.dropna(axis=1, how='all').dropna(axis=0)

print(clean_table)

Output:

AAPL AMZN MSFT GOOGL

date

2021-03-18 120.53 3027.99 230.72 2021.34

2021-03-19 119.99 3074.96 230.35 2026.96

2021-03-22 123.39 3110.87 235.99 2030.69

2021-03-23 122.54 3137.50 237.58 2041.33

2021-03-24 120.09 3087.07 235.46 2032.53

2021-03-25 120.59 3046.26 232.34 2032.46

Chapter 2: Cumulative Returns and Correlations

To derive cumulative returns and correlations, we first calculate the monthly returns, which can be easily obtained using the shift method (this method allows us to divide each observation by the preceding one). Then we apply cumprod to compute the cumulative return. The last entry of cumul_returns provides the cumulative return for each stock over our selected period:

stock_returns = (clean_table / clean_table.shift(1) - 1).dropna()

cumul_returns = (1 + stock_returns).cumprod() - 1

cumul_returns.iloc[-1].plot(kind='bar', figsize=(8, 5));

plt.tight_layout()

plt.savefig('ts_cumul_returns')

Here’s the plot illustrating cumulative returns for our brief sample period (Source: Sharadar, graphic created by the author):

Lastly, we can generate our correlation matrix:

print(stock_returns.corr())

Output:

AAPL AMZN MSFT GOOGL

AAPL 1.000000 0.440295 0.724448 0.344711

AMZN 0.440295 1.000000 0.710669 0.814061

MSFT 0.724448 0.710669 1.000000 0.499615

GOOGL 0.344711 0.814061 0.499615 1.000000

Conclusion

And there you have it! Pandas truly simplifies the analysis of time series data — a key reason for its popularity in data analysis. The more you understand Pandas' intricacies, like multi-indexing, the less time you’ll spend on cleaning and formatting, allowing you to focus more on analysis and modeling. Cheers!

arsalandywriter.com

Mastering Pandas MultiIndexing for Time Series Data Analysis

Chapter 1: Introduction to Data Cleaning

Getting to a Tabular Format

Horizontally Stacking by Stock

Chapter 2: Cumulative Returns and Correlations

Conclusion

Share the page:

Recent Post:

Unlock the Power of the Swiss Ball for Enhanced Workouts

Harnessing Quantum Algorithms for Mindfulness and Relaxation

Unlock Your Coding Potential: Earn an Extra $1000 This Month

Navigating the Cosmic Energies: Mercury and Venus Insights

The Myth of Software Requirements: A Deep Dive into Project Failures

Unlocking the Mind: Eye Movements, Brain Activation, and Consciousness

Aligning Your Life: The Triangulation Approach for Self-Improvement

Unlocking Design Potential with AI: 10 Figma Plugins You Need