Effortless Data Scraping: A Simple Python Approach
Chapter 1: Introduction to Data Scraping
Many articles about web scraping have appeared in recent years. Most of them recommend Node.js with the Cheerio library or Python with Beautiful Soup. These tools are powerful, but using them well takes time: you have to locate the right elements, retrieve the data, clean it, and build a DataFrame. And, of course, more time goes into troubleshooting bugs.
This brief guide demonstrates one of the simplest ways to scrape tabular data from a website that publishes it in HTML tables, using only a few lines of Python code!
Illustration by Chaeyun Kim
Chapter 2: Leveraging Pandas for Data Extraction
Our primary tool is Pandas, the well-known data analysis library, whose read_html() function extracts tabular data from HTML content with very little effort.
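As a minimal sketch of what read_html() does, the snippet below parses a small, made-up HTML fragment instead of a live page. The table contents are purely illustrative, and the example assumes an HTML parser such as lxml is installed, since Pandas relies on one to read HTML.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML fragment, used only for illustration.
html = """
<table>
  <thead>
    <tr><th>Country</th><th>Cases</th></tr>
  </thead>
  <tbody>
    <tr><td>A</td><td>10</td></tr>
    <tr><td>B</td><td>25</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> element it finds.
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df)
```

The header row becomes the column names, and each remaining row becomes one row of the DataFrame.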
Section 2.1: Scraping COVID-19 Data from Worldometer
To illustrate, let's consider scraping real-time COVID-19 statistics from the Worldometer website. As the data on this site is dynamic and frequently updated, it makes sense to retrieve the most current information each time the script is executed.
To obtain this dataset, make sure Python, Pandas, and requests are installed on your machine, along with an HTML parser such as lxml, which read_html() depends on. We will use the read_html() function from Pandas to extract all tables from the specified webpage. Note, however, that passing the URL directly to read_html() may result in a 403 Forbidden error, most likely because the site rejects requests that lack a browser-like User-Agent header. To avoid this, we first fetch the HTML with the requests module, sending such a header, and then let Pandas read the result.
The overall script appears as follows:
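from io import StringIO
import requests
import pandas as pd
# Assumed address of the Worldometer COVID-19 page (not given in the text); adjust as needed
url = 'https://www.worldometers.info/coronavirus/'
headers = {'User-Agent': 'Mozilla/5.0'}  # browser-like header to avoid the 403 error
response = requests.get(url, headers=headers)
response.raise_for_status()  # stop early if the request failed
# Wrap the HTML in StringIO, since newer Pandas versions deprecate passing raw HTML strings
dfs = pd.read_html(StringIO(response.text))
print(dfs[0])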
The variable dfs will contain a list of DataFrames extracted from the HTML tables. Let’s take a look at the first DataFrame generated by the script.
Yes, that's all it takes! With just a few lines of Python code, you can obtain an up-to-date DataFrame containing COVID-19 data for analysis. It's as simple as that!
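Because read_html() returns a list, it can help to check how many tables came back and what each one looks like before settling on an index. The sketch below does this on a made-up two-table page; the HTML and the column names are assumptions for illustration only, not the structure of the Worldometer page.

```python
from io import StringIO

import pandas as pd

# Hypothetical page containing two tables, used only for illustration.
html = """
<table>
  <thead><tr><th>Metric</th><th>Value</th></tr></thead>
  <tbody>
    <tr><td>Cases</td><td>35</td></tr>
    <tr><td>Deaths</td><td>3</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Country</th><th>Cases</th><th>Deaths</th></tr></thead>
  <tbody>
    <tr><td>A</td><td>10</td><td>1</td></tr>
    <tr><td>B</td><td>25</td><td>2</td></tr>
  </tbody>
</table>
"""

dfs = pd.read_html(StringIO(html))

# Summarize each table: its position in the list, its shape, and its columns.
summary = [(i, df.shape, list(df.columns)) for i, df in enumerate(dfs)]
for index, shape, columns in summary:
    print(f"table {index}: shape={shape}, columns={columns}")
```

Once the right index is known, dfs[index] can be used like any other DataFrame.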
The first video titled "Scrape HTML tables easily with Pandas and Python" provides a visual guide to effectively extract tabular data using these methods.
In the second video, "How I always get the right table when web scraping with Python," the presenter shares tips and techniques for ensuring accurate data extraction when scraping websites.
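One common way to pick out a specific table, offered here as a hedged sketch and not necessarily the presenter's method, is to use the match and attrs parameters of read_html(). The HTML, the id values, and the column names below are hypothetical.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; the ids are made up for this example.
html = """
<table id="summary">
  <thead><tr><th>Metric</th><th>Value</th></tr></thead>
  <tbody><tr><td>Recovered</td><td>30</td></tr></tbody>
</table>
<table id="by_country">
  <thead><tr><th>Country</th><th>Total Cases</th></tr></thead>
  <tbody>
    <tr><td>A</td><td>10</td></tr>
    <tr><td>B</td><td>25</td></tr>
  </tbody>
</table>
"""

# match keeps only tables whose text matches the given string or regex.
by_text = pd.read_html(StringIO(html), match="Total Cases")

# attrs keeps only tables whose HTML attributes match the given dictionary.
by_id = pd.read_html(StringIO(html), attrs={"id": "by_country"})

print(by_text[0])
```

Both calls still return a list, so the selected table is the first element. If nothing matches, read_html() raises a ValueError, which makes a changed page layout easy to notice.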
Chapter 3: Conclusion and Further Exploration
Enjoy experimenting with your data analysis and visualization projects! I hope you found this concise article helpful. Please feel free to reach out with any questions, comments, or suggestions.
Stay Safe and Healthy! Thank you for reading! 😊