Unlocking Insights from Facebook Messenger Data Using Python
Analyzing our conversations on Facebook Messenger can feel like wielding a magic wand to revisit past interactions. With the privacy laws in place, we can retrieve our old chats, and naturally, this calls for the power of Python and visualization tools!
This exploration stems from a simple question: what trends can we identify in my group chats with friends? There are services that allow you to print your entire conversation as a book, but I prefer not to share all the photos and messages with the world. Moreover, with around 50,000 messages accumulated over nearly a decade, reading through everything would be overwhelming. Thus, I decided to leverage my analytical skills to derive insights from this data—consider it a belated Christmas gift to my friends!
Here's what we'll cover:

1. How to retrieve our messages and interpret the data.
2. What specific insights we can extract from the data.
3. Compiling everything together to visualize the results.
If you want to replicate this analysis, my code is available at Messenger_Podium. I’ll assume you have a Python environment set up, either in Jupyter Notebook or an IDE. If you're new to Python, don't worry—it's a fun project to ease into data science!
Retrieving and Understanding Our Data
Thanks to Facebook for allowing us to explore our past interactions.
Note: I’m French, so my Facebook interface is in French, but you should still follow along easily.
To download your data, go to "Settings" ("Paramètres" on my French interface), select "Your Information," and then download your information. I opted to download only my messages from the last three years.
> While more data can be downloaded, I focused on one particular conversation from 2021.
The download format can be HTML or JSON. For ease of use, we’ll select JSON.
Now we have all our conversations from the last three years. I’ll concentrate on a chat thread with my oldest friends, which is the most active one. This thread contains everything, from images to voice messages, but we'll focus solely on the text messages for now. If you're interested in analyzing photos and voice messages as well, let me know in the comments!
What exactly is a JSON file? If you're unsure, my friend Omar provides a great overview [here](JSON-in-a-nutshell). In my chat archive, there are five JSON files, each representing messages from different time periods. The first JSON object lists participants, while the second contains the messages themselves.
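Concretely, each file looks roughly like this (participant names and message invented for illustration; the field names match what Facebook exports):

```json
{
  "participants": [{ "name": "Paul" }, { "name": "Alex" }],
  "messages": [
    {
      "sender_name": "Paul",
      "timestamp_ms": 1609455600000,
      "content": "Bonne année !",
      "type": "Generic"
    }
  ]
}
```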
Accessing Messenger Data with Pandas
I enjoy using Pandas for data manipulation because it suits this project well—each row represents a single message.
```python
import pandas as pd
import json

def load_all_messages(path):
    # My archive splits the conversation across five files,
    # message_1.json through message_5.json
    frames = []
    for i in range(1, 6):
        with open(path + f'message_{i}.json', encoding='utf8') as file:
            data = json.load(file, object_hook=parse_obj)
        frames.append(pd.json_normalize(data['messages']))
    # DataFrame.append is deprecated, so concatenate in one go
    return pd.concat(frames, ignore_index=True)
```
> Spoiler Alert: Facebook (now Meta) did not encode the JSON properly, so we need a workaround to access the content.
```python
def parse_obj(obj):
    # Facebook writes UTF-8 bytes escaped as latin-1 code points;
    # re-encoding to latin-1 and decoding as UTF-8 restores accents and emoji
    for key in obj:
        if isinstance(obj[key], str):
            obj[key] = obj[key].encode('latin_1').decode('utf-8')
        elif isinstance(obj[key], list):
            obj[key] = [x.encode('latin_1').decode('utf-8') if isinstance(x, str) else x
                        for x in obj[key]]
    return obj
```
Thanks to Jakub on Stack Overflow for the tip on encoding!
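To see why the trick works, run the round-trip on a single garbled string in any Python shell:

```python
# 'é' stored as UTF-8 bytes (0xC3 0xA9) but read back as latin-1 shows up as 'Ã©'
broken = 'Ã©'
print(broken.encode('latin_1').decode('utf-8'))  # é
```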
While we could inspect each message manually, let’s take a data-driven approach. The data includes timestamps, message content, authorship, and reactions.
```python
def clean_data(df):
    df['date_time'] = pd.to_datetime(df['timestamp_ms'], unit='ms')
    df['content'] = df['content'].str.lower()
    # Filter out unnecessary columns
    df.drop(columns=['timestamp_ms', 'gifs', 'is_unsent', 'photos', 'type',
                     'videos', 'audio_files', 'sticker.uri', 'call_duration',
                     'share.link', 'share.share_text', 'users', 'files'],
            inplace=True)
    df['year'] = df['date_time'].dt.year
    df['hour'] = df['date_time'].dt.hour
    df['weekday'] = df['date_time'].dt.weekday
    # Exclude non-participants
    df = df[~df['sender_name'].isin([''])]
    df['content'] = df['content'].fillna('')
    return df
```
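Putting the two functions together (the path below is a placeholder for wherever your archive sits):

```python
# Hypothetical path to the downloaded conversation folder
path = 'messages/inbox/my_group_chat/'
df = clean_data(load_all_messages(path))
print(df.shape)
print(df[['sender_name', 'date_time', 'content']].head())
```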
Defining Our Objectives
Now it’s time to ask ourselves: what insights do we seek?
#### 1. Total Message Count

The first metric we can analyze is the number of messages sent by each participant in the conversation. Let's celebrate the individual who sent the most messages in 2021.
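A few chained calls get us there; `df` is the cleaned DataFrame from above, and the 2021 filter is my addition:

```python
# Messages per participant in 2021, most talkative first
counts_2021 = (df[df['year'] == 2021]
               .groupby('sender_name')
               .size()
               .sort_values(ascending=False))
print(counts_2021)
```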
#### 2. Average Message Length

Who crafted the longest message? Additionally, we can compute the average word count per message. Is there a correlation between message length and perceived intelligence? I'm sure you could find an article discussing this!
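Splitting on whitespace is a rough but serviceable word count (a sketch, not the repo's exact code):

```python
# Approximate word count per message, then the per-sender average and the record holder
df['word_count'] = df['content'].str.split().str.len()
avg_length = df.groupby('sender_name')['word_count'].mean().round(1)
longest = df.loc[df['word_count'].idxmax()]
print(avg_length)
print(longest['sender_name'], longest['word_count'])
```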
#### 3. Message Timing

We can analyze the frequency of messages sent on different days of the week and at various hours of the day to uncover patterns.
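With the `hour` and `weekday` columns already derived in `clean_data`, two `value_counts` calls expose the rhythm:

```python
# Message volume by day of week and by hour of day
by_day = df['weekday'].value_counts().sort_index()   # 0 = Monday ... 6 = Sunday
by_hour = df['hour'].value_counts().sort_index()
print(by_day)
print('Busiest hour:', by_hour.idxmax())
```

One caveat: `pd.to_datetime(..., unit='ms')` returns UTC, so to get local (here, Paris) hours you would need to localize the timestamps first.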
#### 4. Most Common Words

By identifying the most frequently used words, we can gain insights into the conversation dynamics. This analysis can also be done per individual, revealing each person's chat behavior.
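A plain frequency count works once you drop stop words; the tiny French stop-word list below is illustrative only, so swap in a fuller one for real use:

```python
from collections import Counter

# Tiny illustrative stop-word list; extend with a real one (e.g. from NLTK)
stopwords = {'le', 'la', 'les', 'de', 'des', 'un', 'une', 'et', 'je', 'tu', 'pas'}

words = (w for msg in df['content'] for w in msg.split() if w not in stopwords)
print(Counter(words).most_common(20))
```

Filter `df` on `sender_name` first to get each person's signature vocabulary.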
#### 5. Reactions Analysis

Reactions to messages can provide additional insights, such as who responds most frequently and which messages sparked significant interaction.
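In my archive, each message's `reactions` field is a list of `{'reaction': ..., 'actor': ...}` objects, which `explode` flattens nicely (a sketch under that assumption):

```python
# One row per (message, reaction), then count who reacts the most
reactions = df['reactions'].explode().dropna()
reaction_df = pd.json_normalize(reactions.tolist())
print(reaction_df['actor'].value_counts())     # the most reactive friend
print(reaction_df['reaction'].value_counts())  # the house emoji

# Messages that sparked the biggest response
df['n_reactions'] = df['reactions'].str.len()
print(df.nlargest(5, 'n_reactions')[['sender_name', 'content', 'n_reactions']])
```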
Turning Data into Insights
A significant aspect of being a data scientist is crafting a narrative from the data we gather.
As mentioned, if you'd like to explore this analysis yourself, all the code is accessible at Messenger_Podium, so I won't overload this article with every snippet. Once we have the numbers we need, we can export them to Excel and build the final visuals in PowerPoint.
```python
import numpy as np

# df_grouped_2021, df_grouped_2020, temmenized and by_sender are built earlier
# in the notebook: see the Messenger_Podium repo for their definitions
with pd.ExcelWriter("../2. Output/Data.xlsx", engine='openpyxl', mode='w') as writer:
    df_grouped_2021.to_excel(writer, sheet_name='grouped', startrow=1)
    df_grouped_2020.to_excel(writer, sheet_name='grouped', startrow=15)
    sender_list = np.concatenate((df.sender_name.unique(), ['all']))
    for sender in sender_list:
        print(sender)
        day_max_message, hours, day, word_max_freq = by_sender(df, temmenized, sender)
        day_max_message.to_excel(writer, sheet_name=sender, startrow=0)
        hours.to_excel(writer, sheet_name=sender, startrow=5)
        day.to_excel(writer, sheet_name=sender, startrow=10)
        word_max_freq.to_excel(writer, sheet_name=sender, startrow=15)
```
What Comes Next?
Paul won the Chatterbox Award, sending an impressive 3,350 messages over the year, far surpassing the competition, while Alex took home the title for the longest message with a staggering 450 words.
As weekends approach, activity spikes, especially around 6 PM, which we’ve dubbed the “Aperitif Effect.”
If you’re interested in further analysis, such as applying NLP techniques to our messages, please let me know in the comments! I’m considering teaching an algorithm how to mimic a specific writing style. Thoughts?
If you enjoyed this post, check out my previous article:
Translating SQL Grouping Sets to Python
#### How to handle the multi-group by of Postgres 9.5 in Pandas