Using requests and BeautifulSoup in Python to scrape data

The amount of data available on the internet is staggering. It is often easy to do a quick search and click through to view data on a website. However, if you want to actually use that data in your analysis, you have to be able to fetch it and convert it into a usable format. The creators and owners of the websites, however, may not want you to do this. They might prefer that you only look at the data, along with its surrounding ads. The fact that you want to use the data for analysis inherently makes it valuable. The data provider most likely makes money from the ads you view as you look at the data, and may even charge you for access to the data itself. For this reason, they are incentivized to stop you from fetching it.

In this article, I’ll show you a very basic way to download (or scrape) data when the simplest method doesn’t work. It will not work in every case, but you can add it to your toolbox for the times you need to scrape data with Python.

In a previous article I used the pandas library to download a table from Wikipedia. It worked quite well. Pandas will read an html page, look for tables within it, and turn each table it finds into a DataFrame, returning them all as a list.

import pandas as pd

fomc = pd.read_html("https://en.wikipedia.org/wiki/History_of_Federal_Open_Market_Committee_actions")

print(len(fomc))
fomc[1].head()
5
                 Date Fed. Funds Rate Discount Rate      Votes  \
0    November 5, 2020        0%–0.25%         0.25%       10-0   
1  September 16, 2020        0%–0.25%         0.25%        8-2   
2     August 27, 2020        0%–0.25%         0.25%  unanimous   
3       July 29, 2020        0%–0.25%         0.25%       10-0   
4       June 10, 2020        0%–0.25%         0.25%       10-0   

                                               Notes  Unnamed: 5  
0                                 Official statement         NaN  
1  Kaplan dissented, preferring "the Committee [t...         NaN  
2  No meeting, but announcement of approval of up...         NaN  
3                                 Official statement         NaN  
4                                 Official statement         NaN  

Why would we need anything else to scrape data?

If life were simple, this would work for every web page we ever want to use. However, sometimes it just doesn’t, so we need to dig further into the details of how it works. For example, let’s say we want to get historical earnings data from Yahoo! finance. I’ve talked about one of the Yahoo! finance APIs in a previous article, but the API I used there doesn’t give you the granular historical earnings data that is available on the Yahoo! finance earnings pages. Let’s see if we can grab historical earnings data for a symbol, such as AAPL, as you will see here. You can view this page in your browser, and it contains a table of results, but what happens when you try to load it with pandas?

url = "https://finance.yahoo.com/calendar/earnings/?symbol=AAPL"
try:
    pd.read_html(url)
except Exception as ex:
    print(ex)
HTTP Error 404: Not Found

Why the 404?

At the time of running this code, I got a 404 error. This means the page is “not found”. But we know it does exist, so what is happening?

This is Yahoo!’s way of telling you to buzz off, you are not welcome here with your screen scraping attempt. It turns out that Wikipedia allows us to download the web page mechanically, but Yahoo! doesn’t. Can we perhaps still attempt to download the data?

For web sites that return raw html for a request, it should be possible to read the data as long as you can convince the web server that you are not automated software, but a real web browser being read by a human being. If we look at the source code for read_html, we can see the basics of how pandas does this. The code is a bit involved (feel free to read it over), but it essentially does the following:

  1. fetches the raw html using urllib
  2. uses a parser to parse the raw html, then fetches all the tables
  3. turns the tables into DataFrames
  4. handles tons of options for all the above steps, including using different parsers and options for creating the DataFrames
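The steps above can be sketched offline on a tiny hand-written page, with no network involved. The markup below is invented for illustration; it just shows the fetch-parse-convert pipeline that read_html wraps up for you:

```python
from io import StringIO

import pandas as pd

# a tiny stand-in page (hypothetical markup, not the real Wikipedia page)
html = """
<html><body>
<table>
  <tr><th>Date</th><th>Rate</th></tr>
  <tr><td>Nov 5, 2020</td><td>0.25%</td></tr>
  <tr><td>Sep 16, 2020</td><td>0.25%</td></tr>
</table>
</body></html>
"""

# read_html parses the document, finds every <table>,
# and returns a list of DataFrames (here, a list of one)
tables = pd.read_html(StringIO(html))
print(len(tables))      # 1
print(tables[0].shape)  # (2, 2)
```

When pandas fetches the page itself (as in the Wikipedia example), it does the download with urllib before this parsing step, which is exactly where the trouble with Yahoo! begins.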

One thing that sticks out right away in the first step is that pandas doesn’t set any HTTP headers (or allow you to pass them into this method), so Yahoo! is probably rejecting the connection because it looks automated. To write some lower level code, let’s consider how we could connect directly to the Yahoo! HTTP server and have more control over what we send in our request.

The requests library

There is an easy-to-use Python library called requests that can be used to automate HTTP requests. (If you want to dig into the HTTP specs, they are all listed here). Let’s see what happens if we do the simplest request (an HTTP GET request) for the same url using requests. The requests library has a method for each of the HTTP verbs, and we can just pass it the url. It returns a response object that wraps the server response. If you invoke raise_for_status on the response, it will raise an HTTPError for any error encountered. Install requests with pip install requests.

import requests

res = requests.get(url)
try:
    res.raise_for_status()
except requests.exceptions.HTTPError as err:
    print(err)
404 Client Error: Not Found for url: https://finance.yahoo.com/calendar/earnings/?symbol=AAPL

OK, we have the same error, so that’s a good start. What else could we do to try to convince Yahoo! we are a real browser? The get method is just a thin wrapper around the request method, which takes a number of parameters. One that is very important to consider is headers, a dict of HTTP headers to pass in on the request. Web browsers always send along an identifier called the User-Agent, so the most logical header to include first is a valid User-Agent. One way to get a value to use is to see what your current web browser sends. A handy way to do this is DuckDuckGo, which gives you this info when you ask it my user agent, like this.

For me, this happened to be Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Safari/605.1.15. Let’s try adding that to our request.

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                         "AppleWebKit/605.1.15 (KHTML, like Gecko) "
                         "Version/15.4 Safari/605.1.15"}
res = requests.get(url, headers=headers)
try:
    res.raise_for_status()
except requests.exceptions.HTTPError as err:
    print(err)

Now we have a valid response. What does it look like? The actual HTML that the browser renders is included in the content of the response. Let’s just look at the beginning of it. It’s just a standard html document.

res.content[:50]
b'<!DOCTYPE html><html data-color-theme="light" id="'

Now how do I read the html in the scraped data?

Your web browser takes this html and renders it into a nice looking web page. The page will be complete with ads, a table, colors, and styles applied to the visual elements. You just want to grab the raw data underneath. In order to do this, we need to parse the html. Instead of writing your own parser, you can use BeautifulSoup, a library that parses the html and provides useful ways to extract what you want from it. Install it with pip install beautifulsoup4. You’ll also want to install lxml – pip install lxml.

from bs4 import BeautifulSoup

soup = BeautifulSoup(res.content, "lxml")
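If you want to see how the selection pattern works before poking at the real response, you can try it on a tiny hand-written document. The markup below is invented for illustration, but the select calls are the same ones we’ll use on the Yahoo! page:

```python
from bs4 import BeautifulSoup

# a tiny stand-in document (hypothetical markup, not the real Yahoo! page)
sample = """
<table>
  <tr><th>Symbol</th><th>Reported EPS</th></tr>
  <tr><td>AAPL</td><td>1.24</td></tr>
</table>
"""
# html.parser is the stdlib parser; lxml works here too
sample_soup = BeautifulSoup(sample, "html.parser")

# select returns every element matching a CSS selector; .text gives the enclosed text
headers = [th.text for th in sample_soup.select("th")]
cells = [td.text for td in sample_soup.select("td")]
print(headers)  # ['Symbol', 'Reported EPS']
print(cells)    # ['AAPL', '1.24']
```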

Before we start trying to select the table and data out of the soup, it may be helpful to look at the page in your web browser, such as Firefox, Safari, or Chrome, right click on the table, and choose the “Inspect Element” or “Inspect” option (assuming you have developer tools enabled). This will allow you to see the structure of the html document and the table itself.

In this case, we only have one table (at the time of writing this, Yahoo! could always change things!), so we will try to select it out of the soup. The select method will return a list of all the table elements in the page.

len(soup.select("table"))
1

Now that we’ve confirmed there’s just one, let’s see if we can get the header cells (th) and data rows (tr) from the table.

table = soup.select("table")[0]
columns = []
for th in table.select("th"):
    columns.append(th.text)
columns
['Symbol',
 'Company',
 'Earnings Date',
 'EPS Estimate',
 'Reported EPS',
 'Surprise(%)']

Now let’s grab the rows. We just loop through each table row (tr) and then each data element (td) and make a list of lists.

data = []
for tr in table.select("tr"):
    row = []
    for td in tr.select("td"):
        row.append(td.text)
    if len(row):
        data.append(row)
# first and last row and length of table
data[0], data[-1], len(data)    
(['AAPL', 'Apple Inc', 'Oct 26, 2022, 4 PMEDT', '-', '-', '-'],
 ['AAPL', 'Apple Inc.', 'Jan 15, 1997, 12 AMEST', '-0.02', '-0.03', '-48.31'],
 100)

We have six columns of data.

  • symbol
  • company name
  • a date that has a strange malformed timezone
  • the Earnings Per Share (EPS) estimate
  • the EPS reported value
  • a percentage called the surprise – how high or low the earnings are compared to the estimate.

Let’s make a DataFrame.

df = pd.DataFrame(data, columns=columns)
df.head()
  Symbol     Company           Earnings Date EPS Estimate Reported EPS  \
0   AAPL   Apple Inc   Oct 26, 2022, 4 PMEDT            -            -   
1   AAPL   Apple Inc   Jul 25, 2022, 4 PMEDT            -            -   
2   AAPL   Apple Inc   Apr 26, 2022, 4 PMEDT         1.43            -   
3   AAPL  Apple Inc.  Jan 27, 2022, 11 AMEST         1.89          2.1   
4   AAPL  Apple Inc.  Oct 28, 2021, 12 PMEDT         1.24         1.24   

  Surprise(%)  
0           -  
1           -  
2           -  
3      +11.17  
4       +0.32  

Data cleanup using pandas

At this point, we just want to do a little data cleanup. Since everything is text, we need to first convert things to the correct data types. If you’re interested in learning more about data conversion, check out this article. First, let’s make all the numeric values numbers; where there’s no data (like in the dates in the future), we’ll set them to NaN by setting errors to coerce.

for column in ['EPS Estimate', 'Reported EPS', 'Surprise(%)']:
    df[column] = pd.to_numeric(df[column], errors='coerce')

Now the Earnings Date column is a little strange because it has a timezone aware datetime in it. I happen to know that AAPL always reports earnings after the market closes at 16:00 Eastern time, so the historical earnings times are not accurate, just the dates. But let’s pretend we want to convert these into full datetime objects anyway, rather than just dates. We need to turn this into a format that can be parsed properly by pd.to_datetime. What we have now doesn’t work:

try:
    pd.to_datetime(df['Earnings Date'])
except Exception as ex:
    print(ex)
Unknown string format: Oct 26, 2022, 4 PMEDT

Ideally, we could just parse this by passing in a format (using this nice reference). Let’s attempt it with the AM/PM indicator and timezone run together, and see if that works.

try:
    pd.to_datetime(df['Earnings Date'], format='%b %d, %Y, %I %p%Z')
except Exception as ex:
    print(ex)
time data 'Oct 26, 2022, 4 PMEDT' does not match format '%b %d, %Y, %I %p%Z' (match)

The EDT/EST is not a full timezone name and doesn’t get parsed by to_datetime (or even by datetime.datetime.strptime directly). Since we know the values are in the Eastern US timezone, we can just set it directly. We’ll remove the timezone from the raw data, parse it into a datetime, then set the timezone.

Note that I use the str accessor to do the string replace operation since that field starts off as a string. Then, when it’s converted to a datetime using pd.to_datetime, I use the dt accessor to do the timezone localization.

# remove the timezone part of the date
df['Earnings Date'] = df['Earnings Date'].str.replace("EDT|EST", "", regex=True)
df['Earnings Date'] = pd.to_datetime(df['Earnings Date'])

# set the timezone manually
import pytz
eastern = pytz.timezone('US/Eastern')
df['Earnings Date'] = df['Earnings Date'].dt.tz_localize(eastern)

df.head()
  Symbol     Company             Earnings Date  EPS Estimate  Reported EPS  \
0   AAPL   Apple Inc 2022-10-26 16:00:00-04:00           NaN           NaN   
1   AAPL   Apple Inc 2022-07-25 16:00:00-04:00           NaN           NaN   
2   AAPL   Apple Inc 2022-04-26 16:00:00-04:00          1.43           NaN   
3   AAPL  Apple Inc. 2022-01-27 11:00:00-05:00          1.89          2.10   
4   AAPL  Apple Inc. 2021-10-28 12:00:00-04:00          1.24          1.24   

   Surprise(%)  
0          NaN  
1          NaN  
2          NaN  
3        11.17  
4         0.32  

Summary

In this example, we first tried getting data from a web page using pandas, then used the requests library to get the data instead, after setting the User-Agent header. We then used BeautifulSoup to parse the html to extract a table. Finally, we cleaned up the data using pandas.

At this point it makes sense to give you a few warnings. First, this method will not work on many websites. In this case, Yahoo! is only blocking obvious attempts to download data using automated software. This technique will not work for sites that do any of the following:

  1. Require authentication. You will need to authenticate your requests. This may or may not be easy to do, depending on the authentication method.
  2. Use JavaScript for rendering. If a site is rendered with JavaScript, your requests will just return the page shell and raw JavaScript code, so the table (and data) will not be present in the response. It is possible to scrape JavaScript sites, but those methods are more complicated and involve running a browser instance.
  3. Aggressively block automated code. Sites may choose to block any User-Agents that don’t look like a real browser. There are many ways that sites can do this, and so you may find this technique doesn’t work.
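For case 2, one quick check is whether the markup you got back actually contains a table at all. This is a rough heuristic of my own (the helper name is hypothetical), not a standard API, but it catches the common situation where a JavaScript site ships an empty shell:

```python
from bs4 import BeautifulSoup

def has_server_rendered_table(html):
    """Rough heuristic: does the markup we got back already contain a <table>?
    JavaScript-rendered sites often return an empty shell plus script tags."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select("table")) > 0

static_page = "<html><body><table><tr><td>1</td></tr></table></body></html>"
js_shell = "<html><body><div id='app'></div><script src='app.js'></script></body></html>"
print(has_server_rendered_table(static_page))  # True
print(has_server_rendered_table(js_shell))     # False
```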

On top of all of the above, you should be a good network citizen. Do not download aggressively from a website, even if they don’t block you. Most sites expect to handle typical human users who traverse a site slowly. You should, at a minimum, access the site as a human would. This means accessing it slowly, with lengthy pauses between fetching data.
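One way to enforce that pacing is a small wrapper that sleeps between fetches. This is a sketch: polite_get is a hypothetical helper, the five-second default is only illustrative, and the fetch function is injectable so the pacing logic can be exercised without a network (in real use you would pass something like a requests.get call with your headers):

```python
import time

def polite_get(urls, fetch, delay=5.0):
    """Fetch urls one at a time, pausing between requests like a human reader.
    fetch is any callable taking a url; inject requests.get (with headers
    bound) for real use, or a stand-in for testing."""
    responses = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # lengthy pause between fetches
        responses.append(fetch(url))
    return responses

# exercised with a stand-in fetcher and a tiny delay
fetched = polite_get(["page1", "page2"], fetch=lambda u: u.upper(), delay=0.01)
print(fetched)  # ['PAGE1', 'PAGE2']
```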

Finally, it’s important to note that websites will change their formats, designs, and layouts frequently. This will probably break any code you write that downloads data. For this reason, critical code and infrastructure should rely on supported APIs.

Hopefully you now have a better understanding of how data can be retrieved, parsed, and cleaned up from a basic website.
