Mastering Web Scraping: A Quick Guide to Extracting Betting Data
In this brief tutorial, I will guide you through the process of web scraping using Python, even if you have no prior coding experience.
When starting a new project, I often faced the same challenge: a lack of accessible data. For instance, while trying to bet on sports, I realized I needed extensive data to improve my odds of winning. Fortunately, you don't have to be a Python expert to scrape the web and obtain this crucial data. This guide will provide you with a straightforward, step-by-step approach to scraping your preferred websites from the ground up.
Additionally, as outlined in my previous article on monetizing web scraping without selling data, you can potentially earn extra income by scraping a betting site. So let’s get started!
What is Web Scraping?
Web scraping involves gathering information from websites. This is done using software that mimics human browsing behavior to collect the desired data.
In this guide, we will focus on scraping the betting platform 'Tipico' (link provided in the code below) to extract all available betting odds for various sports. Once you grasp the fundamentals of web scraping, you will be equipped to scrape most sites you encounter.
Legal Disclaimer: Excessive scraping can lead to high traffic, potentially overloading websites. Always adhere to the site's terms of service and review their 'robots.txt' file to understand how they permit scraping. Furthermore, I do not endorse gambling in any form.
Requirements for Web Scraping
- Selenium: This tool is essential for automating web applications. It enables you to launch a browser and perform actions like clicking buttons and searching for specific information. No previous experience with Selenium is needed; everything will be covered here from scratch!
- Python: You don’t need to be a Python guru for this tutorial. A basic understanding of for loops, if statements, and lists will suffice. If these concepts are unfamiliar, don’t worry; I will explain them as we go along.
Before proceeding, ensure that Python 3.x is installed on your machine. If you have that ready, let's begin by setting up Selenium!
Setting Up Selenium
- Install Selenium: Execute the following command in your command prompt or terminal: pip install selenium
- Download the Driver: This is necessary for Selenium to interact with the browser. Determine your Google Chrome version and download the appropriate Chromedriver from the official site (remember to download a new Chromedriver if Chrome updates). Unzip the driver if needed and note the path where you saved it.
Note 1: This basic project will focus on pre-match games. However, sure bets are often found in live games. A separate tutorial for scraping live odds is available in the article linked below. Be aware that scraping live odds can be more challenging than pre-match odds, so make sure you understand all concepts discussed in this tutorial first.
Code to Get Started
I will walk you through each line of code necessary to scrape the betting site. The complete code can be found at the end of this article.
Update March 10th, 2021: I added some lines of code to adapt to changes on the website.
Update April 22nd, 2021:
- Accessing the website may be restricted in certain countries. If this happens, use a VPN to connect to a European country (I recommend TunnelBear, which is free).
- The site has undergone significant changes in its live and pre-match sections, so for this tutorial, we will utilize the sections with the old structure.
To navigate to these sections, check the left panel for the "top sports" section and select the league(s) you want to scrape. After that, you’ll obtain a link we will use for scraping. In this guide, I will use the link generated after selecting the Spanish League (as shown in the code below). However, feel free to choose any league and replace it in the web variable in the code.
Each modified line of code is detailed in the full code provided at the end of this article and is functional as of April 22nd, 2021.
Update October 20th, 2021: If you encounter issues with the code, please inform me in the comments. I will need some time to address updates, but in the meantime, you can learn about Selenium through my step-by-step tutorial linked below.
With that said, let’s jump into the tutorial!
We will explore numerous functions and methods commonly used in Selenium. To assist with memorization, check out the web scraping cheat sheet I created.
Importing Selenium
To begin using Selenium, enter the following code in your Python editor. You will also need to import the time library, which we will use later to pause the script while the page loads.
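A minimal sketch of the imports (this tutorial uses the Selenium 3-style element-finding methods it references throughout; in Selenium 4 you would import By and call driver.find_element(By.XPATH, ...) instead):

```python
from selenium import webdriver  # controls the Chrome browser
import time                     # pauses the script while pages load
```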
Writing Your First Selenium Python Test
Start by defining the web and path variables. The web variable will hold the betting site URL, while path refers to the location where you saved the ChromeDriver from step 2.
In this case, I will scrape data from the Spanish League; however, you can select any league you wish to obtain the corresponding link to paste into the web variable.
Now we will create a driver instance to help navigate the website, which we will refer to as driver. This is achieved by writing the first line of the following code.
Once the driver instance is established, we can open the betting website using the driver.get command. Execute the code, and you will see the browser open automatically.
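Putting those pieces together, a sketch looks like this. The URL and driver path are placeholders: paste the league link you generated and the location where you saved ChromeDriver.

```python
web = 'https://www.tipico.de/...'  # placeholder: paste your league link here
path = '/path/to/chromedriver'     # placeholder: your ChromeDriver location

driver = webdriver.Chrome(path)  # create the driver instance (Selenium 3 style)
driver.get(web)                  # open the betting site in Chrome
```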
Handling the Popup
When the betting site loads, a popup appears. To continue scraping, we must dismiss this popup by instructing Selenium to click the accept button each time the website opens. We accomplish this with the following code.
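Here is a sketch of that step. The XPath below is a placeholder; replace it with the one you copy from Chrome's Developer Tools, as explained in the breakdown that follows.

```python
time.sleep(5)  # give the popup time to appear

# placeholder XPath: paste the one you copied from Developer Tools
accept = driver.find_element_by_xpath('//button[@title="Accept"]')
accept.click()  # dismiss the popup
```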
Code Breakdown
- time.sleep(5) pauses the script for 5 seconds before executing the next line, giving the popup time to appear.
- accept is a variable we created for the 'accept' button we need to click.
- driver.find_element_by_xpath() assists us in locating an element on the website by providing the XPath of that element. Finding the XPath is straightforward in Chrome. Here’s how to find the XPath of the 'accept' button:
- Open Google Chrome and navigate to the betting site.
- Right-click on the page and select ‘Inspect’ to access the Chrome Developer Tool, which reveals the underlying code of the website.
- Click the cursor icon at the top left of the Developer Tools panel, hover over the 'accept' button, and click it to highlight its code. Right-click the highlighted code and select 'Copy' > 'Copy XPath.'
Now that you have copied the XPath, paste it inside the parentheses in driver.find_element_by_xpath(). This tells Selenium where to find the 'accept' button.
- accept.click() commands Selenium to click the 'accept' button when the website loads.
With everything set, we are ready to start scraping the betting site.
Plan for Scraping the Website
Before we write the actual scraping code, let’s outline what we will be doing. Here are the components we will utilize to scrape the website:
- Sports Title: Represents the sports categories available. Although there are multiple sports, we will focus on football for simplicity. The code we write will enable you to scrape any sport.
- Single-row Event: Events that appear in a single row. Live events may have two rows, but we will concentrate on upcoming matches.
- Odds Event: Represents the available odds within a row. Each row contains one odds event, which includes three boxes: '3-way,' 'Over/Under,' and 'Handicap.'
Now, let’s build our web scraper!
Building the Web Scraper
Initializing Storage
We will use empty lists [] to store all the data we scrape.
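For this walkthrough, two lists are enough, since we will collect team names and the '3-way' odds:

```python
teams = []  # home and away team names
x12 = []    # '3-way' odds (1x2)
```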
Selecting Only Upcoming Matches
The website features both live and upcoming matches; for simplicity, we will extract odds solely from upcoming matches. We will choose the upcoming matches box using the following code.
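The exact selector depends on the page's current markup, so treat this as a sketch: the class name below is a placeholder, and you should replace it with the one you find by inspecting the upcoming-matches container.

```python
# placeholder class name: inspect the upcoming-matches box on the page
# and substitute the class name you actually find there
upcoming_matches = driver.find_element_by_class_name('upcoming-matches-box')
```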
Identifying Sports Titles
As illustrated earlier, we need to search for the names of the sports. To enable Selenium to locate all sports titles, we will write the following code:
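Using the class name identified in the steps below:

```python
# every sport heading on the page (class name found via Inspect)
sport_title = driver.find_elements_by_class_name('SportTitle-styles-sport')
```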
Code Breakdown
- sport_title represents each sport name.
- driver.find_elements_by_class_name() helps us find elements on the website by their class name. Remember that .find_element retrieves a single element, while .find_elements returns a list of elements. We will loop through this list shortly.
To find the class name of the football title, click on 'Football' with the cursor tool active; its corresponding code will be highlighted, resembling <div class="SportTitle-styles-sport">. Copy this class name and use it in driver.find_elements_by_class_name().
You have now learned how to use .find_elements_by_class_name(), .find_element_by_xpath(), and .click() with Selenium. Before we move on, make sure you know how to use for loops and if statements. If you are already familiar with these, feel free to skip the next section.
Refresher on for Loops, if Statements, and Lists
Consider a list of football teams: 'Barcelona,' 'Madrid,' and 'Sevilla.' To iterate through this list, use the following code:
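In Python, that looks like this:

```python
teams_example = ['Barcelona', 'Madrid', 'Sevilla']  # the list of teams

for team in teams_example:  # loop through each team in the list
    print(team)
```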
Code Breakdown
- teams_example is the list of teams we created.
- for team in teams_example loops through each team in the list.
- print(team) executes for each team in the teams_example list.
If you run this code, the output will be:
```
Barcelona
Madrid
Sevilla
```
By using an if statement, we can specify conditions under which the code continues. For instance:
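For example, to keep only 'Barcelona':

```python
for team in teams_example:
    if team == 'Barcelona':  # only continue when this condition holds
        print(team)
```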
If we run this code, it will print only 'Barcelona' since we instructed Python to only print when the team is Barcelona.
Great! Now that you understand for loops and if statements, let’s proceed with the tutorial.
Filtering for Football Only
The class name 'SportTitle-styles-sport' gives us access to all available sports on the site. As mentioned earlier, we will only extract data from football matches. We will select the football section using the following code.
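A sketch of that filter, using the parent-node trick explained in the breakdown below (how many levels you need to climb depends on the page's nesting, so adjust if the structure has changed):

```python
for sport in sport_title:
    if sport.text == 'Football':  # keep only the football section
        # climb two levels up to the node wrapping the whole section
        grandparent = sport.find_element_by_xpath('./../..')
```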
Code Breakdown
- sport represents each sport name in the sport_title variable.
- .text retrieves the text attribute of the element. By comparing sport.text to the sport name ('Football'), we ensure that we retrieve data solely from the football section.
- sport.find_element_by_xpath('./..') allows us to find an element using XPath relative to the current node (here, the football title). The './' refers to our current location, so './..' selects the parent node and './../..' the grandparent node, which is what we need to limit our scrape to the football section.
Locating Single Row Events
Next, we need to identify the 'single row events'. To do so, use this code:
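With the grandparent node in hand, the lookup is one line (still inside the if sport.text == 'Football': block):

```python
# all football matches/events, one per row (class name found via Inspect)
single_row_events = grandparent.find_elements_by_class_name('EventRow-styles-event-row')
```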
Code Breakdown
- single_row_events represents each event, typically found in one row.
- grandparent.find_elements_by_class_name() assists us in finding all football matches/events within the grandparent node (the football section).
To find the class name of a match, follow these steps.
Once the code is highlighted, look for the class name, which should be ‘EventRow-styles-event-row’.
Extracting Data: Team Names and Odds Events
To obtain team names and locate the odds (odd_events), we will write the following code:
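Continuing inside the football block:

```python
for match in single_row_events:
    # the odds groups ('3-way', 'Over/Under', 'Handicap') in this row
    odd_events = match.find_elements_by_class_name('EventOddGroup-styles-odd-groups')
    # the home and away team names in this row
    for team in match.find_elements_by_class_name('EventTeams-styles-titles'):
        teams.append(team.text)
```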
Code Breakdown
- for match in single_row_events loops through all matches in the single_row_events list.
- odd_events represents each event with available odds.
- match.find_elements_by_class_name('EventOddGroup-styles-odd-groups') helps us find all odds events within each match.
To find the code behind the odds box, follow these steps.
Once the code is highlighted, note the class name, which should be ‘EventOddGroup-styles-odd-groups’.
- for team in match.find_elements_by_class_name('EventTeams-styles-titles') loops through the elements with the class name ‘EventTeams-styles-titles’ within the match node. Each match contains two teams (home and away), and we will iterate through them.
To find the code for team names, follow these steps.
Although the highlighted class name is ‘EventTeams-styles-event-teams EventTeams-styles-additional-margin’, do not select this one, as it will yield the names of both rows (team names + ‘half time’) when scraping live games. Instead, select the class name that specifies ‘EventTeams-styles-titles’.
- team.text retrieves the text attribute from the team element.
- teams.append(team.text) stores the team names in the teams list we created at the beginning.
Extracting Data: The Odds
Having located the odds events, we will now extract the 3-way odds (1x2) using the following code:
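This loop sits inside the for match block from the previous step:

```python
    for n, box in enumerate(odd_events):
        rows = box.find_elements_by_xpath('.//*')  # one or two rows per box
        if n == 0:                    # the first box is the '3-way' (1x2) box
            x12.append(rows[0].text)  # keep only the first row of odds

driver.quit()  # close the browser once scraping is done
```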
Code Breakdown
- enumerate(odd_events) counts the elements of the odd_events list while iterating over it. It starts counting from 0.
- for n, box in enumerate(odd_events) goes through all 'odds boxes' in each match. As previously mentioned, there are three boxes: '3-way,' 'Over/Under,' and 'Handicap.' We will focus on the '3-way' box this time.
- box.find_elements_by_xpath('.//*') retrieves the child nodes within each 'odds box', resulting in a list of one or two rows.
- rows holds that list. Remember, there may be two rows when scraping live matches.
- if n == 0 tells Python to take values only from the first box, which is the '3-way' box (1x2).
- rows[0] instructs Python to only select the first row in each odds box, thus ignoring the second row in case of live matches.
- x12.append(rows[0].text) stores the ‘3-way’ odds in the x12 list we created earlier.
- driver.quit() closes the browser.
Congratulations! You have successfully scraped your first website.
Final Step
The last step of this tutorial involves automating the Python script, allowing you to run it daily, weekly, or at specific times. Here’s a tutorial on automating Python scripts on Mac and Windows in three simple steps.
If you wish to scrape websites without being blocked, refer to the article linked below.
Complete Code
There are several ways to view and manipulate the data you've collected. You can use CSV libraries to export the information to Excel, but I prefer to create a dictionary from the lists and utilize Pandas. Below is the complete code we wrote along with additional lines for exporting to Pandas.
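Here is a consolidated sketch of everything above, plus the pandas export. The URL, driver path, and popup XPath are placeholders to replace with your own values, and the home/away pairing at the end assumes each match appended exactly two team names:

```python
from selenium import webdriver
import time
import pandas as pd

web = 'https://www.tipico.de/...'  # placeholder: paste your league link here
path = '/path/to/chromedriver'     # placeholder: your ChromeDriver location

driver = webdriver.Chrome(path)
driver.get(web)

# dismiss the popup (placeholder XPath: paste the one you copied)
time.sleep(5)
accept = driver.find_element_by_xpath('//button[@title="Accept"]')
accept.click()

teams = []  # home and away team names
x12 = []    # '3-way' odds (1x2)

# (if needed, narrow the scrape to the upcoming-matches box here, as shown earlier)
sport_title = driver.find_elements_by_class_name('SportTitle-styles-sport')
for sport in sport_title:
    if sport.text == 'Football':
        grandparent = sport.find_element_by_xpath('./../..')
        single_row_events = grandparent.find_elements_by_class_name('EventRow-styles-event-row')
        for match in single_row_events:
            odd_events = match.find_elements_by_class_name('EventOddGroup-styles-odd-groups')
            for team in match.find_elements_by_class_name('EventTeams-styles-titles'):
                teams.append(team.text)
            for n, box in enumerate(odd_events):
                rows = box.find_elements_by_xpath('.//*')
                if n == 0:
                    x12.append(rows[0].text)

driver.quit()

# build a dictionary from the lists and hand it to pandas
df = pd.DataFrame({
    'home_team': teams[0::2],  # even positions: home teams
    'away_team': teams[1::2],  # odd positions: away teams
    'x12': x12,                # one '3-way' entry per match
})
df.to_csv('football_odds.csv', index=False)
print(df)
```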
If you're interested in learning more about beating the bookies, check out the articles linked below.
Final Thoughts
This guide covers the essentials of web scraping. With these skills, you can extract data from most websites. However, scraping dynamic data like that found in live matches requires more advanced knowledge of Selenium. Let me know if you’d like the second part of this tutorial to learn how to scrape more dynamic data.