Mastering Exploratory Data Analysis with Python: A Hands-On Guide
Chapter 1: Introduction to the Fortune 500 Dataset
Before writing any analysis code, import the libraries used throughout this guide. Pandas reads the CSV file, renames columns, and sorts the data; Matplotlib and Seaborn draw the charts.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("fortune500.csv")
data.head()
To keep the column names short, I renamed the "Revenue (in millions)" column to simply "Revenue"; the values themselves are still expressed in millions. Next, I filtered the dataset to keep only the rows for the year 1971.
data.rename(columns={"Revenue (in millions)": "Revenue"}, inplace=True)
data_1971 = data[data['Year'] == 1971]
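As a minimal sketch of these two steps on made-up data (the toy frame and its company names are hypothetical, not rows from fortune500.csv), the rename and the year filter look like this. I assign the result of rename instead of using inplace=True, and take a copy of the filtered rows; both are stylistic choices rather than requirements.

```python
import pandas as pd

# Hypothetical miniature dataset standing in for fortune500.csv.
toy = pd.DataFrame({
    "Year": [1971, 1971, 1972],
    "Company": ["Alpha", "Beta", "Alpha"],
    "Revenue (in millions)": [120.0, 95.5, 130.0],
})

# rename returns a new DataFrame, so the result is assigned back.
toy = toy.rename(columns={"Revenue (in millions)": "Revenue"})

# The boolean mask keeps only rows whose Year equals 1971.
# .copy() makes the subset independent of the original frame.
toy_1971 = toy[toy["Year"] == 1971].copy()
print(toy_1971)
```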
I broke the code into small steps for clarity. Here I created a new variable that sorts the 1971 data by revenue. Since we are interested in the "top 20 companies," I sorted in descending order and kept the first 20 rows.
data_sorted = data_1971.sort_values('Revenue', ascending=False).head(20)
data_sorted
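For comparison, here is a small sketch on hypothetical data showing that sorting in descending order and taking the head selects the same companies as DataFrame.nlargest, which this guide uses later for the profit question. The toy values are invented and contain no ties.

```python
import pandas as pd

# Hypothetical 1971 subset with four invented companies.
toy_1971 = pd.DataFrame({
    "Company": ["Alpha", "Beta", "Gamma", "Delta"],
    "Revenue": [120.0, 95.5, 300.0, 210.0],
})

# Approach used above: sort descending, then keep the first n rows.
top_sorted = toy_1971.sort_values("Revenue", ascending=False).head(2)

# Alternative: nlargest picks the n rows with the largest values directly.
top_nlargest = toy_1971.nlargest(2, "Revenue")

print(top_sorted["Company"].tolist())
print(top_nlargest["Company"].tolist())
```

Both approaches need a numeric column; nlargest will raise an error on a column stored as text.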
For visualization, I used Matplotlib to draw a simple bar chart and overlaid a red line that connects the tops of the bars. The line makes it easier to see that the last five companies have very similar revenues.
plt.figure(figsize=(15, 9))
plt.plot(data_sorted['Company'], data_sorted['Revenue'], color='red')
plt.bar(data_sorted['Company'], data_sorted['Revenue'], color='lightgrey')
plt.xlabel('Company Name')
plt.ylabel('Revenue (in Millions)')
plt.title('Top 20 Company Revenues in 1971')
plt.xticks(rotation=45, ha='right')  # right-align so each label ends under its bar
plt.show()
Chapter 2: Analyzing Profit Growth (1990-1999)
Next, we will find the 10 companies with the greatest profit increases between 1990 and 1999. To do this, I stored the rows for each of the two years in separate variables.
data_1990 = data[data['Year'] == 1990]
data_1999 = data[data['Year'] == 1999]
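Using pd.merge, I combined the two datasets on the company name so that both years sit side by side in one row. Because the default is an inner join, only companies that appear in both 1990 and 1999 are kept.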
merged_data = pd.merge(data_1990, data_1999, on='Company', suffixes=('_1990', '_1999'))
merged_data.dtypes
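The sketch below, on invented data, shows what the suffixes do and which rows an inner join drops. The audit step with how="outer" and indicator=True is my own addition, not part of the original analysis; it is one way to see which companies were left out.

```python
import pandas as pd

# Hypothetical yearly subsets; names and profits are invented.
toy_1990 = pd.DataFrame({
    "Company": ["Alpha", "Beta", "Gamma"],
    "Profit": [10.0, 20.0, 30.0],
})
toy_1999 = pd.DataFrame({
    "Company": ["Alpha", "Beta", "Delta"],
    "Profit": [15.0, 50.0, 5.0],
})

# Inner join (the default): only companies present in both years survive.
# Overlapping column names get the suffixes appended.
merged = pd.merge(toy_1990, toy_1999, on="Company", suffixes=("_1990", "_1999"))
print(merged.columns.tolist())

# Optional audit: an outer join with indicator=True adds a "_merge" column
# marking rows as "both", "left_only", or "right_only".
audit = pd.merge(
    toy_1990, toy_1999, on="Company", how="outer",
    suffixes=("_1990", "_1999"), indicator=True,
)
dropped = audit.loc[audit["_merge"] != "both", "Company"]
print(sorted(dropped))
```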
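The profit columns must be numeric before any arithmetic, so I converted them with pd.to_numeric. Passing errors='coerce' turns any value that cannot be parsed as a number into NaN instead of raising an error. I also dropped the columns that are not needed for this comparison.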
merged_data['Profit_1999'] = pd.to_numeric(merged_data['Profit_1999'], errors='coerce')
merged_data['Profit_1990'] = pd.to_numeric(merged_data['Profit_1990'], errors='coerce')
merged_data.drop(columns=['Revenue_1990', 'Revenue_1999', 'Rank_1990', 'Rank_1999'], inplace=True)
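Here is a small illustration of the coercion step. The "N.A." placeholder is an assumption made for the example; I have not verified how missing profits are actually encoded in fortune500.csv. Counting the resulting NaN values is a quick way to see how many rows were affected.

```python
import pandas as pd

# Hypothetical raw profit values stored as text; "N.A." is an assumed placeholder.
raw = pd.Series(["12.5", "N.A.", "-3"], name="Profit")

# errors="coerce" converts unparseable entries to NaN instead of raising.
clean = pd.to_numeric(raw, errors="coerce")

# Count how many values could not be converted.
n_missing = int(clean.isna().sum())
print(clean.dtype, n_missing)
```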
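The question does not say whether "greatest increase" means absolute or relative growth, so I calculated both. Keep in mind that the percentage figure is hard to interpret when the 1990 profit is zero or negative.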
merged_data['Profit_Increase'] = merged_data['Profit_1999'] - merged_data['Profit_1990']
merged_data['Profit_Percentage_Increase'] = ((merged_data['Profit_1999'] - merged_data['Profit_1990']) / merged_data['Profit_1990']) * 100
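To make the two measures concrete, here is a sketch on invented numbers. Masking out non-positive 1990 profits before dividing is my own suggestion, not something the original analysis does; it leaves NaN where a percentage change would be misleading or undefined.

```python
import pandas as pd

# Hypothetical merged data: one normal case, one loss in 1990, one zero base.
toy = pd.DataFrame({
    "Company": ["Alpha", "Beta", "Gamma"],
    "Profit_1990": [100.0, -50.0, 0.0],
    "Profit_1999": [150.0, 25.0, 40.0],
})

# Absolute increase is always well defined.
toy["Profit_Increase"] = toy["Profit_1999"] - toy["Profit_1990"]

# Series.where keeps values where the condition holds and puts NaN elsewhere,
# so rows with a zero or negative base produce NaN instead of a misleading number.
base = toy["Profit_1990"].where(toy["Profit_1990"] > 0)
toy["Profit_Percentage_Increase"] = toy["Profit_Increase"] / base * 100

print(toy[["Company", "Profit_Increase", "Profit_Percentage_Increase"]])
```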
I created a variable to store the top 10 companies based on absolute profit increase and reset the index.
top_10_increases_absolute = merged_data.nlargest(10, 'Profit_Increase').reset_index(drop=True)
top_10_increases_absolute.head(5)
As in the first chart, I overlaid a red line on the bars to make the differences in profit increase between consecutive companies easier to see.
plt.figure(figsize=(10, 6))
plt.bar(top_10_increases_absolute['Company'], top_10_increases_absolute['Profit_Increase'], color='skyblue')
plt.plot(top_10_increases_absolute['Company'], top_10_increases_absolute['Profit_Increase'], color='red')
plt.xlabel('Company')
plt.ylabel('Profit Increase in $')
plt.title('Top 10 Companies with the Most Profit Increase (1990-1999)')
plt.xticks(rotation=45, ha='right')  # right-align so each label ends under its bar
plt.tight_layout()
plt.show()
Chapter 3: Revenue Trends Over Decades
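Finally, we will examine and visualize the average revenue by decade. I began by converting the "Year" column into a datetime format and created a new column to represent the decade. Note that this overwrites the integer "Year" column, so the earlier year filters (for example, Year == 1971) need to run before this step.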
data['Year'] = pd.to_datetime(data['Year'], format='%Y')
data['Decade'] = (data['Year'].dt.year // 10) * 10
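If the "Year" column holds plain integers, the decade can also be computed with integer arithmetic alone, which leaves the original column untouched. This is a sketch of that alternative on made-up years, not the method used above; it checks that both routes agree.

```python
import pandas as pd

# Hypothetical integer years.
years = pd.Series([1955, 1960, 1969, 1971, 1999, 2005], name="Year")

# Floor-divide by 10 to drop the last digit, then multiply back.
decade_int = (years // 10) * 10

# The datetime route, for comparison. The years are cast to strings first
# so that the "%Y" format is applied to text.
decade_dt = (pd.to_datetime(years.astype(str), format="%Y").dt.year // 10) * 10

print(decade_int.tolist())
```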
Next, I grouped the data by decade and calculated the mean revenue for each period.
decade_top_companies = data.groupby(['Decade'])['Revenue'].mean().reset_index().sort_values('Revenue', ascending=False)
decade_top_companies
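The grouping step can be sketched on a few invented rows as follows. The second sort is there to contrast the two orderings: sorting by revenue is handy for a ranking table, while sorting by decade gives the chronological order a line chart needs.

```python
import pandas as pd

# Hypothetical rows: two companies in the 1970s, three in the 1980s.
toy = pd.DataFrame({
    "Decade": [1970, 1970, 1980, 1980, 1980],
    "Revenue": [100.0, 200.0, 300.0, 500.0, 400.0],
})

# Mean revenue per decade; reset_index turns Decade back into a column.
avg = toy.groupby("Decade")["Revenue"].mean().reset_index()

# Ranking order (largest average first) versus chronological order.
avg_ranked = avg.sort_values("Revenue", ascending=False)
avg_chrono = avg.sort_values("Decade")

print(avg_ranked)
print(avg_chrono)
```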
I used a line chart to illustrate the revenue increase across decades, followed by a bar chart for a clearer comparison.
plt.figure(figsize=(10, 6))
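decade_chronological = decade_top_companies.sort_values('Decade')  # a line chart needs time order, not revenue order
plt.plot(decade_chronological['Decade'], decade_chronological['Revenue'], color='red')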
plt.xlabel('Decade')
plt.ylabel('Average Revenue (in Millions)')
plt.title('Average Company Revenue Over Decades')
plt.xticks(rotation=45)
plt.show()
plt.figure(figsize=(10, 6))
sns.barplot(data=decade_top_companies, x='Decade', y='Revenue', palette='viridis')
plt.xlabel('Decade')
plt.ylabel('Average Revenue (in Millions)')
plt.title('Average Revenue by Decade')
plt.show()