Mastering Exploratory Data Analysis with Python: A Hands-On Guide

Chapter 1: Introduction to the Fortune 500 Dataset

Before writing any analysis code, import the libraries this guide relies on: pandas for reading the CSV file, renaming columns, and sorting the data, and Matplotlib and seaborn for the charts.

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

data = pd.read_csv("fortune500.csv")

data.head()

To avoid confusion over the column names, I renamed the "Revenue (in millions)" column to simply "Revenue". Next, I filtered the dataset to include only the entries from 1971.

data.rename(columns={"Revenue (in millions)": "Revenue"}, inplace=True)

data_1971 = data[data['Year'] == 1971]
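
The same rename-and-filter step can be sketched on a small made-up table. The company names and figures below are invented for illustration and are not taken from fortune500.csv. This version returns new objects instead of using inplace=True, and takes an explicit copy of the filtered rows so that later column assignments do not act on a view of the original frame.

```python
import pandas as pd

# Hypothetical sample rows; the real file has many more rows and columns.
sample = pd.DataFrame({
    "Year": [1970, 1971, 1971],
    "Company": ["Alpha Corp", "Alpha Corp", "Beta Inc"],
    "Revenue (in millions)": [100.0, 120.0, 80.0],
})

# rename() returns a new DataFrame when inplace is not set.
sample = sample.rename(columns={"Revenue (in millions)": "Revenue"})

# The boolean mask keeps only the 1971 rows; copy() detaches the result from `sample`.
sample_1971 = sample[sample["Year"] == 1971].copy()
print(sample_1971)
```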

I broke the code into small steps for clarity. A new variable holds the 1971 data sorted by revenue; because we want the top 20 companies, the sort is descending and only the first 20 rows are kept.

data_sorted = data_1971.sort_values('Revenue', ascending=False).head(20)

data_sorted
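
As a side note, pandas also offers nlargest, which combines the sort and the row limit in one call. The sketch below uses invented sample values (not the real 1971 data) and asks for the top 2 rather than the top 20, to show that both approaches pick the same rows when there are no ties.

```python
import pandas as pd

# Hypothetical stand-in for the 1971 subset.
sample_1971 = pd.DataFrame({
    "Company": ["Alpha Corp", "Beta Inc", "Gamma Ltd", "Delta Co"],
    "Revenue": [120.0, 80.0, 200.0, 50.0],
})

# Approach used in this guide: sort descending, then keep the first rows.
top_sorted = sample_1971.sort_values("Revenue", ascending=False).head(2)

# Alternative: nlargest returns the rows with the largest values, already ordered.
top_nlargest = sample_1971.nlargest(2, "Revenue")

print(top_nlargest)
```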

For the visualization, I used Matplotlib to draw a simple bar chart and overlaid a red line that traces the bar heights. The line makes it easier to see that revenues among the last five companies differ only slightly.

plt.figure(figsize=(15, 9))

plt.plot(data_sorted['Company'], data_sorted['Revenue'], color='red')

plt.bar(data_sorted['Company'], data_sorted['Revenue'], color='lightgrey')

plt.xlabel('Company Name')

plt.ylabel('Revenue (in Millions)')

plt.title('Top 20 Company Revenues in 1971')

plt.xticks(rotation=45)

plt.show()

Chapter 2: Analyzing Profit Growth (1990-1999)

Next, we will find the 10 companies with the greatest profit increase between 1990 and 1999. To do this, I first stored the rows for each of the two years in separate variables.

data_1990 = data[data['Year'] == 1990]

data_1999 = data[data['Year'] == 1999]

Using pd.merge, I combined the two datasets on the company name to make the comparison easier. The default inner join keeps only the companies that appear in both years.

merged_data = pd.merge(data_1990, data_1999, on='Company', suffixes=('_1990', '_1999'))

merged_data.dtypes
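
Because an inner join silently drops any company that is missing from one of the two years, it can be worth checking how many rows are lost. The following is a minimal sketch on invented data (the company names and profits are hypothetical); it uses how="outer" together with indicator=True, which adds a _merge column recording where each row came from.

```python
import pandas as pd

# Hypothetical per-year tables; only "Alpha Corp" appears in both.
sample_1990 = pd.DataFrame({"Company": ["Alpha Corp", "Beta Inc"], "Profit": [10.0, 5.0]})
sample_1999 = pd.DataFrame({"Company": ["Alpha Corp", "Gamma Ltd"], "Profit": [25.0, 8.0]})

# Inner join (the default): keeps companies present in both years.
inner = pd.merge(sample_1990, sample_1999, on="Company", suffixes=("_1990", "_1999"))

# Outer join with an indicator column to see what the inner join leaves out.
outer = pd.merge(
    sample_1990, sample_1999, on="Company", how="outer",
    suffixes=("_1990", "_1999"), indicator=True,
)
only_one_year = outer[outer["_merge"] != "both"]

print(inner)
print(only_one_year[["Company", "_merge"]])
```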

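The profit columns must be numeric before any arithmetic, so I converted them with pd.to_numeric; errors='coerce' turns any value that cannot be parsed as a number into NaN. I also removed the columns that are not needed for this comparison.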

merged_data['Profit_1999'] = pd.to_numeric(merged_data['Profit_1999'], errors='coerce')

merged_data['Profit_1990'] = pd.to_numeric(merged_data['Profit_1990'], errors='coerce')

merged_data.drop(columns=['Revenue_1990', 'Revenue_1999', 'Rank_1990', 'Rank_1999'], inplace=True)
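
To see what errors='coerce' does, here is a small sketch on a made-up column. The "N.A." placeholder is an assumption about how missing profits might be written, not a confirmed detail of the dataset. Counting the resulting NaN values shows how many entries could not be converted.

```python
import pandas as pd

# Hypothetical raw profit values stored as text; "N.A." stands in for a missing entry.
raw_profit = pd.Series(["12.5", "N.A.", "-3", "40"])

# Unparseable entries become NaN instead of raising an error.
profit = pd.to_numeric(raw_profit, errors="coerce")

# Number of values that were coerced to NaN.
n_missing = int(profit.isna().sum())

print(profit)
print("coerced to NaN:", n_missing)
```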

"Greatest profit increase" can mean either the largest absolute gain or the largest percentage gain, so I calculated both.

merged_data['Profit_Increase'] = merged_data['Profit_1999'] - merged_data['Profit_1990']

merged_data['Profit_Percentage_Increase'] = ((merged_data['Profit_1999'] - merged_data['Profit_1990']) / merged_data['Profit_1990']) * 100
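
One caveat with the percentage figure: it is only meaningful when the 1990 profit is positive. A zero base gives an infinite value, and a negative base (a loss) flips the sign of the result. The sketch below, on invented numbers, shows one possible way to handle this by keeping the percentage only where the base is positive. The column name Pct_Increase and this masking rule are illustrative choices, not part of the original analysis.

```python
import pandas as pd

# Hypothetical profits: a normal case, a loss in 1990, and a zero base.
sample = pd.DataFrame({
    "Company": ["Alpha Corp", "Beta Inc", "Gamma Ltd"],
    "Profit_1990": [10.0, -5.0, 0.0],
    "Profit_1999": [25.0, 5.0, 8.0],
})

change = sample["Profit_1999"] - sample["Profit_1990"]
raw_pct = change / sample["Profit_1990"] * 100

# Keep the percentage only where the 1990 profit is positive; other rows become NaN.
sample["Pct_Increase"] = raw_pct.where(sample["Profit_1990"] > 0)

print(sample)
```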

I created a variable to store the top 10 companies based on absolute profit increase and reset the index.

top_10_increases_absolute = merged_data.nlargest(10, 'Profit_Increase').reset_index(drop=True)

top_10_increases_absolute.head(5)

To enhance visualization, I included a red line in the plots to better illustrate the differences in profit increases.

plt.figure(figsize=(10, 6))

plt.bar(top_10_increases_absolute['Company'], top_10_increases_absolute['Profit_Increase'], color='skyblue')

plt.plot(top_10_increases_absolute['Company'], top_10_increases_absolute['Profit_Increase'], color='red')

plt.xlabel('Company')

plt.ylabel('Profit Increase in $')

plt.title('Top 10 Companies with the Most Profit Increase (1990-1999)')

plt.xticks(rotation=45)

plt.tight_layout()

plt.show()

Chapter 3: Revenue Trends Over Decades

Finally, we will examine and visualize the average revenue by decade. I began by converting the "Year" column into a datetime format and then created a new column that rounds each year down to the start of its decade (for example, 1971 becomes 1970).

data['Year'] = pd.to_datetime(data['Year'], format='%Y')

data['Decade'] = (data['Year'].dt.year // 10) * 10
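
If the year is already stored as an integer, the decade can also be computed directly with integer division, without the datetime conversion. The sketch below uses a handful of arbitrary sample years and checks that both routes give the same result; the astype(str) call is added here only to make the parsing step explicit.

```python
import pandas as pd

# Arbitrary sample years for illustration.
years = pd.Series([1955, 1969, 1970, 1999, 2005])

# Direct route: floor-divide by 10, then scale back up.
decade_int = (years // 10) * 10

# Datetime route, as in this guide: parse the year, then extract it again.
decade_dt = (pd.to_datetime(years.astype(str), format="%Y").dt.year // 10) * 10

print(decade_int.tolist())
```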

Next, I grouped the data by decade, calculated the mean revenue for each one, and sorted the result from highest to lowest.

decade_top_companies = data.groupby(['Decade'])['Revenue'].mean().reset_index().sort_values('Revenue', ascending=False)

decade_top_companies
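
A mean on its own hides how many rows contributed to it, which matters if some decades are only partly covered by the data. As a sketch on invented numbers, the grouped result can report the row count alongside the mean. The output column names mean_revenue and n_rows are illustrative.

```python
import pandas as pd

# Hypothetical rows: two in the 1950s, three in the 1960s.
sample = pd.DataFrame({
    "Decade": [1950, 1950, 1960, 1960, 1960],
    "Revenue": [100.0, 300.0, 400.0, 500.0, 600.0],
})

# Named aggregation: one output column per statistic.
summary = (
    sample.groupby("Decade")["Revenue"]
    .agg(mean_revenue="mean", n_rows="count")
    .reset_index()
)

print(summary)
```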

I used a line chart to illustrate the revenue increase across decades, followed by a bar chart for a clearer comparison.

plt.figure(figsize=(10, 6))

decade_by_year = decade_top_companies.sort_values('Decade')  # plot in chronological order

plt.plot(decade_by_year['Decade'], decade_by_year['Revenue'], color='red')

plt.xlabel('Decade')

plt.ylabel('Average Revenue (in Millions)')

plt.title('Average Company Revenue Over Decades')

plt.xticks(rotation=45)

plt.show()

plt.figure(figsize=(10, 6))

sns.barplot(data=decade_top_companies, x='Decade', y='Revenue', palette='viridis')

plt.xlabel('Decade')

plt.ylabel('Average Revenue (in Millions)')

plt.title('Average Revenue by Decade')

plt.show()

