An introduction to accessing financial data in EDGAR, using Python

Some sources of financial data can be expensive or difficult to find. For example, some is only available from exchanges or vendors who charge a hefty fee for access. However, the financial industry is also heavily regulated, and one of its main regulators provides free access to its data. The (U.S. Securities and Exchange Commission)[https://www.sec.gov] (known as the SEC), has the mission of protecting investors and ensuring that markets are fair, orderly, and efficient.

Companies and individuals are highly incentivized to provide the SEC with accurate and timely information about their businesses, or risk fines or criminal prosecution. Most of this information is provided by filling out forms, and the data in those forms are known as filings. The filing data is all available, for free, via the SEC’s EDGAR system. In this article, I’ll show you the basics of how to access EDGAR filings using Python, then show you an examples of querying multiple filings from EDGAR. Along the way, you’ll learn about a few useful Python libraries and techiques, as well as how to deal with some difficult data.

Even if you are not interested in finance, the techniques we use here for cleaning up the data can be useful in other domains.

Poking around EDGAR

Before we access data using Python, it helps to poke around in the system a bit using a web browser. If you try the main search page, you can type in a company name and the interface will helpfully give you results for that company. For instance, searching a popular name like “Apple” will show you various filings related to Apple Inc. There are dozens of different filings related to Apple, and you can click on the links and read them in a human readable format.

For example, when I ran this search, the top link was for an insider trading transaction. In this case, an insider (usually a director or officer in the company) bought or sold some stock in Apple and is required to report that to the SEC, using a form. The full rules for filling out SEC Form 4 are available if you find that sort of thing interesting.

Each filing has a human readable document, like this one, and a link to the actual filing data. In this case, the filing contains the following:

The last text document is actually in SGML, which is a predecessor to XML. We will parse that document later.

Accessing data using Python

I won’t spend any more time digging into the search interface, but you can definitely get plenty of data without writing code. There are number of useful tools for searching company filings, and many users will be able to use Edgar effectively in that manner. But we want to understand how to automate collection of useful data from EDGAR. Now that you’ve seen what a basic filing looks like, let’s try to access the data using Python. I’ll stick with the example of our insider trading transaction.

The Edgar site give us some rules about how we are to access the data using a computer program. I’d encourage you to check the details out, but I’ll summarize the important points here.

10 requests per second [10-requests-per-second]

You need to limit your requests to only 10 requests per second. If you request more than that, you risk being blocked. Note that the code I’m writing in this article will be run for only one document at a time on one computer. It’s highly unlikely that I will break this rule, but if you run multiple processes this is highly likely to cause you trouble. You should plan in inserting some throttling code to ensure you don’t exceed the limits if you’ll be fetching large amounts of data.

User agent requirement

The API for EDGAR uses simple HTTP with no authentication. You do need to supply a company name and administrative email in your HTTP headers. If you don’t, your request will be denied.

Accept compressed data

Your user agent needs to accept compressed content, using gzip or deflate.

With just those three pieces of information, we can continue. Let’s use the requests API to access the form above, with custom headers. Requests will accept the compressed data.

# Add your own email and website here before continuing
email="your email"
website="your website"
import requests

headers = { "User-Agent": f"{website} {email}"}

form = "https://www.sec.gov/Archives/edgar/data/320193/000032019322000063/wf-form4_165248105838188.xml"
res = requests.get(form)
try:
    res.raise_for_status()
except requests.HTTPError as err:
    print(err)
403 Client Error: Forbidden for url: https://www.sec.gov/Archives/edgar/data/320193/000032019322000063/wf-form4_165248105838188.xml

As you can see, EDGAR blocks us if we don’t supply our custom header with our User-Agent. If we instead supply our headers, we can take a look at our result.

res = requests.get(form, headers=headers)
res.raise_for_status()
res.content[0:50]
b'<?xml version="1.0"?>\n<ownershipDocument>\n\n    <sc'

We have an xml document (as expected if you clicked the links above). Let’s use Python’s lxml package to parse the XML. We can use XPath expressions to extract any information we want from the document. It would be useful to extract the name of the insider and the number of shares in the transaction, and whether they purchased or sold the shares.

from lxml import etree

doc = etree.fromstring(res.content)
owner = doc.xpath("/ownershipDocument/reportingOwner/reportingOwnerId/rptOwnerName")[0].text
security = doc.xpath("//securityTitle/value")[0].text
date = doc.xpath("//transactionDate/value")[0].text
amount = doc.xpath("//transactionAmounts/transactionShares/value")[0].text
disposal = doc.xpath("//transactionAmounts/transactionAcquiredDisposedCode/value")[0].text
owner, security, date, amount, disposal
('BELL JAMES A', 'Common Stock', '2022-05-06', '1276', 'D')

So here we see an Apple insider, James A Bell, disposed of 1,276 shares of Apple Common Stock on 2022-05-06. Let’s see if we can pull up all the insider transactions at Apple for a given time period, such as the first quarter of 2022. How would we do that? We’ll need to understand a few more EDGAR concepts.

Indexing in pandas can be so confusing

There are so many ways to do the same thing! What is the difference between .loc, .iloc, .ix, and []?  You can read the official documentation but there's so much of it and it seems so confusing. You can ask a question on Stack Overflow, but you're just as likely to get too many different and confusing answers as no answer at all. And existing answers don't fit your scenario.

You just need to get started with the basics.

What if you could quickly learn the basics of indexing and selecting data in pandas with clear examples and instructions on why and when you should use each one? What if the examples were all consistent, used realistic data, and included extra relevant background information?

Master the basics of pandas indexing with my free ebook. You'll learn what you need to get comfortable with pandas indexing. Covered topics include:

  • what an index is and why it is needed
  • how to select data in both a Series and DataFrame.
  • the difference between .loc, .iloc, .ix, and [] and when (and if) you should use them.
  • slicing, and how pandas slicing compares to regular Python slicing
  • boolean indexing
  • selecting via callable
  • how to use where and mask.
  • how to use query, and how it can help performance
  • time series indexing

Because it's highly focused, you'll learn the basics of indexing and be able to fall back on this knowledge time and again as you use other features in pandas.

Just give me your email and you'll get the free 57 page e-book, along with helpful articles about Python, pandas, and related technologies once or twice a month. Unsubscribe at any time.

Invalid email address

The Central Index Key (CIK)

The SEC assignes a special identifier to each filer, they call it the Central Index Key. This may or may not correspond one-to-one to other identifiers you may be familiar with, like CUSIP or ISIN or even a stock ticker. But EDGAR helpfully provides a reference between CIK and stock ticker for those public companies with a stock ticker. Since every filing will be under a CIK, we’ll use this cross-reference to find the CIK. I suspect that this document is sorted by search popularity, when you see the top 5 entries.

symbol_to_cik = requests.get("https://www.sec.gov/files/company_tickers.json").json()
for i in range(5):
    print(i, symbol_to_cik[f"{i}"])
0 {'cik_str': 320193, 'ticker': 'AAPL', 'title': 'Apple Inc.'}
1 {'cik_str': 789019, 'ticker': 'MSFT', 'title': 'MICROSOFT CORP'}
2 {'cik_str': 1652044, 'ticker': 'GOOGL', 'title': 'Alphabet Inc.'}
3 {'cik_str': 1018724, 'ticker': 'AMZN', 'title': 'AMAZON COM INC'}
4 {'cik_str': 1318605, 'ticker': 'TSLA', 'title': 'Tesla, Inc.'}

We will make a lookup dictionary from ticker to CIK.

cik_lookup = dict([(val['ticker'], val['cik_str']) for key, val in symbol_to_cik.items()])

cik = cik_lookup['AAPL']
cik
320193

Recent filings

Now we can request all the recent filings for a CIK. Let’s get all of AAPL’s recent filings. Searching for submissions for a CIK gives a JSON document that includes a number of high level pieces of information about the filer (like a description, name, addresses, and state of incorporation). It also includes several lists of recent filing information. We need to pad the CIK with leading 0s to get the submissions.

edgar_filings = requests.get(f"https://data.sec.gov/submissions/CIK{cik:0>10}.json", headers=headers).json()
edgar_filings.keys()
dict_keys(['cik', 'entityType', 'sic', 'sicDescription', 'insiderTransactionForOwnerExists', 'insiderTransactionForIssuerExists', 'name', 'tickers', 'exchanges', 'ein', 'description', 'website', 'investorWebsite', 'category', 'fiscalYearEnd', 'stateOfIncorporation', 'stateOfIncorporationDescription', 'addresses', 'phone', 'flags', 'formerNames', 'filings'])
edgar_filings['filings']['recent'].keys()
dict_keys(['accessionNumber', 'filingDate', 'reportDate', 'acceptanceDateTime', 'act', 'form', 'fileNumber', 'filmNumber', 'items', 'size', 'isXBRL', 'isInlineXBRL', 'primaryDocument', 'primaryDocDescription'])

So we can see that there are a number of top level fields, as well as a filings attribute that has lists of filing data. You can explore this data if you run the code yourself, but for now we can transform the EDGAR filings data into a pandas DataFrame.

import pandas as pd

recents = pd.DataFrame(edgar_filings['filings']['recent'])
recents.head()
        accessionNumber  filingDate  reportDate        acceptanceDateTime act  \
0  0000320193-22-000063  2022-05-13  2022-05-06  2022-05-13T18:31:28.000Z       
1  0000320193-22-000061  2022-05-06  2022-05-04  2022-05-06T18:30:57.000Z       
2  0001193125-22-128368  2022-04-29              2022-04-28T19:03:29.000Z  33   
3  0001193125-22-128361  2022-04-29              2022-04-28T18:59:32.000Z  33   
4  0001193125-22-128354  2022-04-29              2022-04-28T18:54:29.000Z  33   

      form  fileNumber filmNumber items    size  isXBRL  isInlineXBRL  \
0        4                                 4636       0             0   
1        4                                16362       0             0   
2      S-8  333-264555   22869039         99455       0             0   
3  S-8 POS  333-165214   22869020        121168       0             0   
4  S-8 POS  333-195509   22869000        121168       0             0   

                           primaryDocument primaryDocDescription  
0  xslF345X03/wf-form4_165248105838188.xml                FORM 4  
1  xslF345X03/wf-form4_165187623819106.xml                FORM 4  
2                           d332661ds8.htm                   S-8  
3                        d279200ds8pos.htm               S-8 POS  
4                        d279200ds8pos.htm               S-8 POS  

Let’s also turn our date columns into dates. You can check out this article on how to convert pandas datatypes.

recents['reportDate'] = pd.to_datetime(recents['reportDate'])
recents['filingDate'] = pd.to_datetime(recents['filingDate'])

If you rememember, the insider trading transactions were SEC Form 4. Let’s look at those, for the first quarter of 2022. If you have questions about how to index pandas DataFrames, you can check out this series of articles.

insider_q1 = recents[(recents['form'] == "4") &
        (recents['filingDate'] >= '2022-01-01') &
        (recents['filingDate'] <= '2022-03-31')]
insider_q1.shape
(17, 14)

OK, we now have 17 insider transactions in Apple for the first quarter of 2022. Let’s download the documents and build a pandas DataFrame of the transaction data. The first step is to make a function that downloads the correct filing. All EDGAR filings are found in the form https://www.sec.gov/Archives/edgar/data/[CIK]/[Accession Number]/[Document]. We already know the CIK. The Accession Number is available in our DataFrame. Given a row, we can return the URL. Note that the Primary Document listed is the human readable document. We want to access the raw filing data which is in SGML. Using that, we can extract the XML content. We also remove the - in the directory for the accessionNumber, but not the document itself.

Parsing the data

How do we parse the html? We use our old friend Beautiful Soup! The method is similar to my previous article, but in this case, the document is an SGML file. SGML is a markup language the precedes XML and html, but we can actually parse it using the same tools.

def make_url(cik, row):
    accessionNumber = row['accessionNumber'].replace("-", "")
    return f"https://www.sec.gov/Archives/edgar/data/{cik}/{accessionNumber}/{row['accessionNumber']}.txt"
make_url(cik, insider_q1.iloc[0])
'https://www.sec.gov/Archives/edgar/data/320193/000032019322000041/0000320193-22-000041.txt'
from bs4 import BeautifulSoup
req = requests.get(make_url(cik, insider_q1.iloc[0]), headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
req.content[0:300]
b'<SEC-DOCUMENT>0000320193-22-000041.txt : 20220308\n<SEC-HEADER>0000320193-22-000041.hdr.sgml : 20220308\n<ACCEPTANCE-DATETIME>20220308183738\nACCESSION NUMBER:\t\t0000320193-22-000041\nCONFORMED SUBMISSION TYPE:\t4\nPUBLIC DOCUMENT COUNT:\t\t1\nCONFORMED PERIOD OF REPORT:\t20220304\nFILED AS OF DATE:\t\t20220308\nD'

A quick explainer on what we’re doing here. First, get the main text document which is an SGML document. We then parse the content using Beautiful Soup, but use the html parser. The html parser is much more forgiving, so it will not fail as a standard xml parser would. We only want to find the xml node that is in the text node in this document. We use can look at that node to see what the “raw” xml looks like.

str(soup.find_all('xml')[0])[0:300]
'<xml>\n<?xml version="1.0"?>\n<ownershipdocument>\n<schemaversion>X0306</schemaversion>\n<documenttype>4</documenttype>\n<periodofreport>2022-03-04</periodofreport>\n<notsubjecttosection16>0</notsubjecttosection16>\n<issuer>\n<issuercik>0000320193</issuercik>\n<issuername>Apple Inc.</issuername>\n<issuertradi'

Messy data

Now as I was working through this, I noticed something. This XML document doesn’t look like the first one we looked at in the beginning of the article. Those tags were mixed case, these are all lowercase. XML is case sensitive, so one of these two is not valid.

This is typical of data that we find in EDGAR, thousands of companies use different software packages to prepare their filings, and the SEC doesn’t seem to enforce strict data compliance for the filings. Every time you think you have a way to fetch the data you want, you’ll likely find a company that does things just slightly differently. As a result, we have to wrestle with the data and figure out how to bend it to our will. In this case, I’m going to check the case of the root element, and use two different sets of XPath expressions to retrieve our data. This is a hack, but will hopefully get us to the next step

import re

def get_document(cik, row):
    url = make_url(cik, row)
    res = requests.get(url, headers=headers)
    res.raise_for_status()
    soup = BeautifulSoup(res.content, 'html.parser')
    # use a case insensitive search for the root node of the XML document
    docs = soup.find_all(re.compile("ownershipDocument", re.IGNORECASE))
    if len(docs) > 0:
        doc = etree.fromstring(str(docs[0]))
        if docs[0].name == "ownershipDocument":
            owner = doc.xpath("/ownershipDocument/reportingOwner/reportingOwnerId/rptOwnerName")[0].text
            security = doc.xpath("//securityTitle/value")[0].text
            date = doc.xpath("//transactionDate/value")[0].text
            amount = doc.xpath("//transactionAmounts/transactionShares/value")[0].text
            disposal = doc.xpath("//transactionAmounts/transactionAcquiredDisposedCode/value")[0].text
        elif docs[0].name == "ownershipdocument":
            owner = doc.xpath("/ownershipdocument/reportingowner/reportingownerid/rptownername")[0].text
            security = doc.xpath("//securitytitle/value")[0].text
            date = doc.xpath("//transactiondate/value")[0].text
            amount = doc.xpath("//transactionamounts/transactionshares/value")[0].text
            disposal = doc.xpath("//transactionamounts/transactionacquireddisposedcode/value")[0].text
            pass
        else:
            raise ValueError(f"Don't know how to process {docs[0].name}")

        return (owner, security, date, amount, disposal)
get_document(cik, insider_q1.iloc[0])
('WAGNER SUSAN', 'Restricted Stock Unit', '2022-03-04', '1685', 'A')

That worked. Now we can loop through our DataFrame rows and get the data for each row.

results = []
for i, row in insider_q1.iterrows():
    try:
        results.append(get_document(cik, row))
    except ValueError as ve:
        print(f"{ve} on row {i}")

pd.DataFrame(results, columns=['name', 'type', 'date', 'quantity', 'acquired_disposed' ])
                  name                   type        date quantity  \
0         WAGNER SUSAN  Restricted Stock Unit  2022-03-04     1685   
1       SUGAR RONALD D  Restricted Stock Unit  2022-03-04     1685   
2      LOZANO MONICA C  Restricted Stock Unit  2022-03-04     1685   
3    LEVINSON ARTHUR D  Restricted Stock Unit  2022-03-04     1685   
4          JUNG ANDREA  Restricted Stock Unit  2022-03-04     1685   
5          Gorsky Alex  Restricted Stock Unit  2022-03-04     1685   
6       GORE ALBERT JR  Restricted Stock Unit  2022-03-04     1685   
7         BELL JAMES A  Restricted Stock Unit  2022-03-04     1685   
8   Adams Katherine L.           Common Stock  2021-11-29     2000   
9         WAGNER SUSAN           Common Stock  2022-02-01     1986   
10      SUGAR RONALD D           Common Stock  2022-02-01     1986   
11     LOZANO MONICA C           Common Stock  2022-02-01     1986   
12   LEVINSON ARTHUR D           Common Stock  2022-02-01     1986   
13         JUNG ANDREA           Common Stock  2022-02-01     1986   
14         Gorsky Alex           Common Stock  2022-02-01      486   
15      GORE ALBERT JR           Common Stock  2022-02-01     1986   
16        BELL JAMES A           Common Stock  2022-02-01     1986   

   acquired_disposed  
0                  A  
1                  A  
2                  A  
3                  A  
4                  A  
5                  A  
6                  A  
7                  A  
8                  D  
9                  A  
10                 A  
11                 A  
12                 A  
13                 A  
14                 A  
15                 A  
16                 A  

There we go, there’s a list of insider stock trades for a single company during a single quarter.

Wrapping up

In summary, we learned a few things from this article. First, the SEC has a free system with a lot of data called EDGAR. The data consists of filings, which are submitted by individuals and companies as required by the SEC. It’s possible to access all the EDGAR data using HTTP APIs. Some of that data can be very messy, but with a little bit of creative Python code, we can extract what we’re looking for.

What might you do with EDGAR data and Python?

Have anything to say about this topic?