
Understanding Essential Data Cleaning Techniques in Python


Data cleaning is a fundamental yet often tedious aspect of data analysis.

Cleaning data can be quite labor-intensive! Real-world datasets tend to be disorganized and seldom present in an ideal state. They may feature incorrect or abbreviated column headings, gaps in data, mismatched data types, or excessive information within a single column, among other issues.

Addressing these problems is crucial prior to analyzing the data. Clean data enhances productivity and facilitates the generation of accurate insights.

In this article, I will outline three critical types of data cleaning that are essential when working with Python.

To illustrate these techniques, I will utilize an extended version of the Titanic dataset created by Pavlo Fesenko, which is freely accessible under a Creative Commons license.

This dataset comprises 1,309 rows and 21 columns. Below, I provide numerous examples demonstrating how to maximize the utility of this data.

Let’s dive in!

First, import pandas and load this CSV file into a pandas DataFrame. A good first step is to utilize the .info() method to gain a comprehensive overview of the dataset's size, columns, and their respective data types.

import pandas as pd

df = pd.read_csv("Complete_Titanic_Extended_Dataset.csv")

df.info()

Let’s begin with the simplest cleaning tasks that can save both memory and time as you continue processing the data.

Eliminate Unused and Irrelevant Columns

You may observe that this dataset includes 21 columns, most of which may not be necessary for your analysis. Thus, it is beneficial to retain only the relevant columns.

For instance, if you determine that the columns PassengerId, SibSp, Parch, WikiId, Name_wiki, and Age_wiki are not required, you can create a list of these column names and apply it in the df.drop() function as shown below.

columns_to_drop = ['PassengerId', 'SibSp', 'Parch', 'WikiId', 'Name_wiki', 'Age_wiki']

df.drop(columns_to_drop, inplace=True, axis=1)

df.head()

By checking memory consumption with the argument memory_usage="deep" in the .info() method, you’ll find that the trimmed DataFrame consumes only about 834 KB, compared to roughly 1,000 KB for the original DataFrame.
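For reference, the check is a single call:

df.info(memory_usage="deep")  # reports true memory usage, including object (string) columns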

While these figures may seem minor, they can be substantial when dealing with larger datasets.

Dropping irrelevant columns has resulted in a 17% reduction in memory usage!

However, a slight drawback of using the .drop() method with inplace=True is that it alters the original DataFrame. If you wish to keep the original DataFrame, you can assign the output of df.drop() (without inplace) to a different variable, as shown below.

df1 = df.drop(columns_to_drop, axis=1)

Alternatively, if you need to remove a large number of columns while retaining only 4-5 columns, consider using df.copy() with the selected columns.

For instance, to keep only the Name, Sex, Age, and Survived columns, you can subset the original dataset using df.copy() as follows:

df1 = df[["Name", "Age", "Sex", "Survived"]].copy()

Depending on the specific requirements of your task, you can choose any of the methods mentioned above for selecting relevant columns.

In the preview from .head() above, you may have noticed some missing values in the Age and Survived columns, which need to be addressed before proceeding.

Addressing Missing Values (NaNs)

Most datasets require handling missing values, a challenging aspect of data cleaning. If you intend to use this data for machine learning, it's important to remember that many models do not accept missing data.

> But how can you identify missing data?

Several techniques can help you detect which sections or columns of the dataset have missing values. Here are three commonly used methods:

Using the .info() Method

This straightforward approach allows you to quickly ascertain whether there are missing values in any columns. By executing df.info(), you can see a summary of the DataFrame.

Columns whose non-null count falls short of the total row count contain missing values. Ideally, each column in this dataset should hold 1,309 non-null values; however, the output reveals that most columns hold fewer than that.
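If you prefer an explicit count over reading the non-null column of the .info() output, one line gives it:

df.isna().sum()  # number of missing values per column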

You can also visualize these missing values.

Heatmap of Missing Data

Creating a heatmap is a common method for visualizing missing data. You can encode the data as Boolean values, True (1) for missing and False (0) for present, using the pandas method .isna().

> What does .isna() do in pandas?

The method .isna() returns a DataFrame where all values are replaced with True for NaN and False otherwise.

You can generate a heatmap with a single line of code:

import seaborn as sns

sns.heatmap(df.isna())

In the above graph, the X-axis displays all column names, while the Y-axis represents the index or row numbers. The legend on the right indicates the Boolean values used for missing data.

If the column names are difficult to read, you can create a transposed version as follows:

sns.heatmap(df.isna().transpose())

Such heatmaps are particularly useful when there are fewer features or columns. However, with a larger number of features, subsetting may be necessary.
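For instance, you can restrict the heatmap to a handful of columns before calling .isna(); the three columns below are picked purely for illustration:

sns.heatmap(df[["Age", "Cabin", "Survived"]].isna().transpose())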

Keep in mind that creating visualizations may take time, especially with extensive datasets.

While heatmaps provide insight into the location of missing data, they do not convey the extent of the missing data. You can determine this using the next method.

Missing Data as a Percentage of Total Data

While pandas has no single dedicated method for this, you can combine the .isna() method with a mean, as in the following snippet.

import numpy as np

print("Percentage of missing values in - ")
for column in df.columns:
    percentage_missing = np.mean(df[column].isna())
    print(f'{column} : {round(percentage_missing * 100)}%')

This approach lets you see what percentage of values is missing from each column, which is valuable when deciding how to handle those entries.
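Equivalently, a vectorized one-liner produces the same figures without an explicit loop:

(df.isna().mean() * 100).round().sort_values(ascending=False)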

> I've identified the missing data, but what's next?

There is no universally accepted method for addressing missing data. The approach you take should consider the individual column, the quantity of missing values, and the significance of that column for future analysis.

Based on these observations, you can choose from the following three methods to handle missing data; a short sketch of each follows the list:

  1. Drop the Record — Remove an entire record if a specific column contains a missing value or NaN. Be cautious, as this can significantly reduce the number of records in the dataset if the column has many missing entries.
  2. Drop the Column or Feature — This requires careful analysis of the column to assess its future relevance. You may proceed with this option if you are confident that the feature lacks useful information, as is the case with the PassengerId feature in this dataset.
  3. Impute Missing Data — This technique involves substituting missing values or NaNs with the mean, median, or mode of the same column.
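Here is a minimal sketch of all three options, assuming the named columns are still present (Cabin and mean imputation are chosen purely for illustration):

# 1. Drop every record that has a NaN in a specific column
df_rows = df.dropna(subset=["Survived"])

# 2. Drop a column judged to carry no useful information
df_cols = df.drop(columns=["Cabin"])

# 3. Impute missing values with a statistic of the same column
df_imputed = df.copy()
df_imputed["Age"] = df_imputed["Age"].fillna(df_imputed["Age"].mean())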

All these methods for handling missing data warrant further discussion, which I will explore in a subsequent article.

In addition to missing data, another common issue is incorrect data types, which must be addressed to maintain data quality.

Correcting Data Types

While utilizing various Python libraries, you'll find that specific transformations require the correct data type. Hence, each column's data type should be accurate and suitable for its intended use.

When you load data into a DataFrame using read_csv or similar functions, pandas attempts to infer the data type of each column based on the values present.

This inference is generally accurate, but some columns may require manual adjustments.

For instance, running .info() on the Titanic dataset again shows the columns Age and Survived as float64, whereas Age should be an integer and Survived should ideally hold just two values, Yes or No.

To further illustrate, let’s examine a random sample of 5 values from these columns.

df[["Name", "Sex", "Survived", "Age"]].sample(5)

In addition to missing values, the Survived column holds 0.0 and 1.0 where Boolean-like values of 0 and 1 (No and Yes, respectively) are expected. Furthermore, the Age column contains decimal values.

Before proceeding, you can rectify this issue by adjusting the data types of the columns. Depending on your version of pandas, it may be necessary to resolve missing values before correcting data types.
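One way to fix both columns, sketched here with pandas' nullable Int64 dtype (which, unlike plain int, tolerates missing values), is:

# Round fractional ages, then cast to a nullable integer dtype; NaN is preserved as <NA>
df["Age"] = df["Age"].round().astype("Int64")

# Map 0.0/1.0 to readable labels; NaN stays NaN
df["Survived"] = df["Survived"].map({0.0: "No", 1.0: "Yes"}).astype("category")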

Along with the aforementioned data cleaning steps, you might also need to consider the following techniques, depending on your use case; a combined sketch follows the list:

  1. Replace Values in a Column — Sometimes, columns contain values like True/False or Yes/No, which can easily be converted to 1 and 0 for machine learning applications.
  2. Remove Outliers — Outliers are data points that differ significantly from the rest. However, it's not always advisable to remove outliers; they require careful evaluation.
  3. Eliminate Duplicates — Data can be considered duplicate when all values across the records are identical. The pandas DataFrame method .drop_duplicates() is useful for this purpose.
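Here is a minimal sketch of all three, with the Fare column and the 1.5 * IQR rule chosen purely as one common way to flag outliers:

# 1. Replace labels with numeric codes for machine learning
df["Survived"] = df["Survived"].map({"No": 0, "Yes": 1})

# 2. Flag (rather than blindly remove) outliers using the IQR rule
q1, q3 = df["Fare"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["Fare"] < q1 - 1.5 * iqr) | (df["Fare"] > q3 + 1.5 * iqr)]

# 3. Drop records whose values are identical across all columns
df = df.drop_duplicates()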

That’s all!

I hope you found this article insightful and helpful. Data cleaning, while often time-consuming, necessitates creative approaches to handle various data issues effectively.

If you enjoy reading such informative articles, please subscribe to my email list!


Thank you for reading!
