Finding and analyzing free stock index data with Python and EDGAR

A stock index is just a list of stocks. But an index is a special list because investors use it to make investing decisions. An index is constructed via rules about stocks to include, how much to include, and when to include (or remove it). Finding this data, especially for more obscure indexes, can be difficult. This is especially the case if you want to find this data for free. This article will show you one technique to find stock index information using free data. I’ll do all this with standard Python tools – all code included.

We will use pandas, lxml, matplotlib, and a little elbow grease.

What is a stock index?

A stock index is a bit more than just a list of stocks. We want three pieces of information: the list, the weights, and a date that the index changes.

List

The rules to pick stocks in an index can be somewhat complicated. Each index has some guidelines for selecting stocks and explains who makes the decisions on index inclusion. Most indexes can only change their constituents (a name for the members of the index) four times per year. For example, consider an index that contains the top 100 stocks by market capitalization for an exchange. As stocks go up and down in price, or as companies merge with other companies, it may be necessary to remove some and add others to the list.

Weights

You can think of the index as a list of weights. Each stock in the index will have a weight, and the sum of the weights will be 1. An index can be equal weighted: it can contain 100 equally weighted stocks having a weight of 0.01 per stock. Usually indexes use market capitalization for weighting. Market capitalization is the total value of the outstanding shares in the market. The higher market cap stocks will have a bigger weight than the smaller ones. Rebalancing is the act of recalculating the weights and adding or removing stocks from the list.

History

The index rebalance happens on a certain date, usually once per quarter. Any analysis of historical index data requires a history of index weights and constituents.

In terms of ease to obtain this data, the easiest to find is the list. But the weights and historical data can be a bigger challenge to find.

Index examples

Some common stock indexes that many people have heard of in the United States are the Dow Jones Industrial Average (a.k.a. “The Dow”), the S&P 500, and the Nasdaq 100. There are many more indexes, including indexes in other countries, like the FTSE 100 on the London Stock Exchange.

The news media will often report on the daily performance of the Dow when discussing the market. If you aren’t a finance person, you may not know that practically no one in finance cares about the Dow. For historical reasons, the media often says things like “the market was up 300 points today”. Usually we are interested in percentage point changes of the S&P500. The Dow is a bad index to use for investors for a few reasons, which we will show in this article.

The S&P 500 is a well known index maintained by S&P Global. The stocks in the index are from 500 US companies, and they are roughly the 500 largest by market capitalization. Companies move in and out of the index based on a set of rules created by S&P. Once a quarter, the index is rebalanced. The index currently has 503 stocks in it (as of time of writing) because three of the 500 companies have 2 types of common stock, and both symbols are included in the index. If you want to know more details about how S&P constructs and maintains their indexes, here’s a detailed document describing the rules.

Side note, I will choose to use the term indexes instead of indices for the plural of index, even though in this context indices would be appropriate. You will see both used in a financial context, but I find indexes to be more comfortable.

Getting this data for free

S&P makes their money by providing market data, including the most current data about the indexes that they maintain. While you can pretty easily find the current stocks that are in a popular index like the Dow or S&P 500, finding the historical values that are in the index, as well as their weights, can be a challenge. S&P would rather just sell you this data. If you work at a large bank or hedge fund, your company probably pays S&P (or a competitor) a fair amount of money for a database of index values, weights, and the dates for each of these values.

It turns out that you can pretty closely approximate this data for free, if you’re willing to put in the work to find it and clean it up.

To do that, we’ll use Python and various Python libraries to access the SEC’s EDGAR system. You can read all about what EDGAR and the SEC is and how to setup your environment to read EDGAR data in this article. However, I’ll cover some of the details here as well.

How do we find index data in EDGAR?

The approach that we’ll take to find the elements of an index is to look for an Exchange Traded Fund (ETF) that tracks the index. One of the main reason indexes exist is so that investors can measure their investing success against what “the market” does. If you decided to buy 10 stocks and hold them for one year, at the end of that year, how would you know if you did a good job of picking stocks? One way is to compare your performance to the performance of a stock index.

Over the years, many investors learned that outperforming the index was really difficult, and just buying all the stocks in the index was much easier. You won’t beat the index, but you will also not underperform it (except for fees). But buying all 503 stocks is difficult and expensive, so the financial industry offers a solution in the form of an ETF. The ETF is a special fund that contains all the stocks in the index. Since ETFs are listed securities, the SEC requires them to make regular filings. The filings are in the EDGAR database.

If you want to own the S&P 500 index, there are many ETFs that track it. The most common one is SPY, although there are many others, and the others tend to have lower fees.

For this article, since we want to make some of the manual steps later a little easier, we will use the Dow Jones Industrial Average. It is an inferior index for investing, but since it has only 30 stocks, it serves well for an example. You can find ETFs using a screener tool like this one. Just search for Dow Jones as the index provider, and you’ll find DIA. Here’s the result page for it

What filings are available for ETFs?

You can go into EDGAR and search for all the filings using their search form. Here are the results for DIA.

The SEC requires a filing called N-PORT that contains the portfolio investments for an ETF. This is a quarterly report. At the time of writing the most recent report is dated 2022-07-31 (but was filed on 2022-09-23).

Let’s take a look at what this report looks like. I’ll go through this without a ton of explanation (you can refer to the earlier article for details).

import requests
import json
from lxml import etree

import pandas as pd

website = "example.org" # your website
email = "[email protected]" # your email

headers = {"User-Agent": f"{website} {email}"}

# get the XML document directly (note, this is not the html version linked to above, but the underlying xml doc)
nport = requests.get("https://www.sec.gov/Archives/edgar/data/1041130/000175272422213777/primary_doc.xml", headers=headers)

nport.raise_for_status()

print(nport.content[:100]

b'<?xml version="1.0" encoding="UTF-8"?><edgarSubmission xmlns="http://www.sec.gov/edgar/nport" xmlns:'

# turn the xml above into an lxml document
doc = etree.fromstring(nport.content)

Parsing out the data

Now I’ll make things look really easy here. In reality I had to manually inspect a few examples of the N-PORT data to figure out the structure. I also took a look at the filing instructions supplied by the SEC, as well as the technical specs for the xml. You can pull up the raw xml to see the structure, and you can dig into the specs if you want to learn more. I learned that the investments are all located in the invstOrSec section of the document. Using lxml, we can make an XPath expression to select only those values.

investments = doc.xpath("//nport:invstOrSec", namespaces={'nport':'http://www.sec.gov/edgar/nport'})
investments[0].xpath("nport:name", namespaces={'nport':'http://www.sec.gov/edgar/nport'})[0].text

'Honeywell International Inc'

You can see the first element is for Honeywell. I then found a nice code snippet to use that would convert an lxml document into a simple Python dict for easier manipulation.

# from https://gist.github.com/jacobian/795571
def elem2dict(node):
    """
    Convert an lxml.etree node tree into a dict.
    """
    result = {}

    for element in node.iterchildren():
        # Remove namespace prefix
        key = element.tag.split('}')[1] if '}' in element.tag else element.tag

        # Process element as tree element if the inner XML contains non-whitespace content
        if element.text and element.text.strip():
            value = element.text
        else:
            value = elem2dict(element)
        if key in result:


            if type(result[key]) is list:
                result[key].append(value)
            else:
                tempvalue = result[key].copy()
                result[key] = [tempvalue, value]
        else:
            result[key] = value
    return result

data = [elem2dict(i) for i in investments]

print(len(data))
data[0]

30
{'name': 'Honeywell International Inc',
 'lei': 'ISRPG12PN4EIEOEMW547',
 'title': 'Honeywell International Inc',
 'cusip': '438516106',
 'identifiers': {'isin': {}},
 'balance': '5706213.00000000',
 'units': 'NS',
 'curCd': 'USD',
 'valUSD': '1098217753.98000000',
 'pctVal': '3.858904435656',
 'payoffProfile': 'Long',
 'assetCat': 'EC',
 'issuerCat': 'CORP',
 'invCountry': 'US',
 'isRestrictedSec': 'N',
 'fairValLevel': '1',
 'securityLending': {'isCashCollateral': 'N',
  'isNonCashCollateral': 'N',
  'isLoanByFund': 'N'}}

Further analysis with pandas and matplotlib

Now you can see that the first investment has a few interesting fields, including the name, the cusip (a standard identifier we’ll talk about in the future), the balance of shares (balance and units – NS means Number of Shares) and the percentage of the total investment (pctVal). We can make a pandas DataFrame of this data for some further analysis.

df = pd.DataFrame(data)

Note that in the data above, the numbers are passed in as text, so let’s convert balance, valUSD, and pctVal into numbers. If you would like an overview of converting data in pandas, check out this article.

for c in ['balance', 'valUSD', 'pctVal']:
    df[c] = pd.to_numeric(df[c])

df[['name', 'cusip', 'balance', 'units', 'pctVal']]

                                    name      cusip    balance units  \
0            Honeywell International Inc  438516106  5706213.0    NS   
1                      Cisco Systems Inc  17275R102  5706213.0    NS   
2                         Salesforce Inc  79466L302  5706213.0    NS   
3                                Dow Inc  260557103  5706241.0    NS   
4                      Johnson & Johnson  478160104  5706213.0    NS   
5                        McDonald's Corp  580135101  5706213.0    NS   
6                 UnitedHealth Group Inc  91324P102  5706213.0    NS   
7   International Business Machines Corp  459200101  5706213.0    NS   
8                                  3M Co  88579Y101  5706213.0    NS   
9                               NIKE Inc  654106103  5706213.0    NS   
10          Walgreens Boots Alliance Inc  931427108  5706213.0    NS   
11                              Visa Inc  92826C839  5706213.0    NS   
12               Procter & Gamble Co/The  742718109  5706213.0    NS   
13                        Microsoft Corp  594918104  5706213.0    NS   
14                          Chevron Corp  166764100  5706213.0    NS   
15                    Walt Disney Co/The  254687106  5706213.0    NS   
16                           Walmart Inc  931142103  5706213.0    NS   
17                      Coca-Cola Co/The  191216100  5706213.0    NS   
18                             Amgen Inc  031162100  5706213.0    NS   
19                    Home Depot Inc/The  437076102  5706213.0    NS   
20                            Intel Corp  458140100  5706213.0    NS   
21           Goldman Sachs Group Inc/The  38141G104  5706213.0    NS   
22            Verizon Communications Inc  92343V104  5706213.0    NS   
23                         Boeing Co/The  097023105  5706213.0    NS   
24                       Caterpillar Inc  149123101  5706213.0    NS   
25                             Apple Inc  037833100  5706213.0    NS   
26                 Travelers Cos Inc/The  89417E109  5706213.0    NS   
27                        Merck & Co Inc  58933Y105  5706213.0    NS   
28                   JPMorgan Chase & Co  46625H100  5706213.0    NS   
29                   American Express Co  025816109  5706213.0    NS   

       pctVal  
0    3.858904  
1    0.909688  
2    3.689679  
3    1.066888  
4    3.499200  
5    5.280680  
6   10.874146  
7    2.622395  
8    2.872023  
9    2.304195  
10   0.794398  
11   4.252895  
12   2.785204  
13   5.628956  
14   3.283858  
15   2.127350  
16   2.647658  
17   1.286636  
18   4.961878  
19   6.033974  
20   0.728031  
21   6.684611  
22   0.926129  
23   3.194233  
24   3.974996  
25   3.258394  
26   3.182002  
27   1.791305  
28   2.313017  
29   3.088166

Just from inspection, you might see a few things you can learn about this index, but let’s visualize the data a bit to illustrate a point or two about it. One way to do this is to just show a pie chart of both the pctVal and balance columns. Let’s make a function to do this which we’ll use later for comparing with another index.

from matplotlib import pyplot as plt

def plot_index(df, name, sort='pctVal'):
    fig, axes = plt.subplots(1,2, figsize=(12,8))
    df = df.sort_values(by=sort, ascending=False)
    fig.suptitle(f"Shares vs % of Total for {name}")
    axes[0].pie(df['balance'])
    labels = df['name'].copy()
    labels.iloc[min(25, len(labels)):] = ''
    axes[1].pie(df['pctVal'], labels=labels, rotatelabels=True)

plot_index(df, "Dow Jones Industrial Average")

pie chart of the shares and % of total value for DIA — Shares and % of total value for shares in DIA

Comparing indexes

Let’s now go ahead and grab the Nasdaq 100 index. The ETF QQQ tracks the Nasdaq 100, and so we can grab a recent N-PORT filing for it just like we did for DIA. I’ll do this all in one go, just repeating the steps from above, but with a recent filing for QQQ.

def get_nport_values(url):
    nport = requests.get(url, headers=headers)
    nport.raise_for_status()
    doc = etree.fromstring(nport.content)
    investments = doc.xpath("//nport:invstOrSec", namespaces={'nport':'http://www.sec.gov/edgar/nport'})
    data = [elem2dict(i) for i in investments]
    df = pd.DataFrame(data)
    for c in ['balance', 'valUSD', 'pctVal']:
        df[c] = pd.to_numeric(df[c])
    return df

# recent nport document for QQQ
url = "https://www.sec.gov/Archives/edgar/data/1067839/000175272422184608/primary_doc.xml"
df_n100 = get_nport_values(url)
plot_index(df_n100, 'Nasdaq 100')

Pie charts for shares and % of total value in QQQ — The shares and percent of total value for QQQ

Share counts

These two sets of pie charts do a good job of showing you the difference between these two types of indexes. It’s pretty clear that the Dow is equally weighted by shares. You see this on the left pie chart for DIA. Each slice is the same size. In the Nasdaq 100, the share sizes don’t especially line up with the percentage of the total value. In some cases (such as Tesla) a large slice of total value is a small slice of the total shares. This is because Tesla has a large share price, but not as many shares outstanding as Amazon, for example.

Now, I’ll make a little side note that equally weighted indexes aren’t inherently bad. Some investors may want an equal weighted index to prevent too much of the investment being in only the largest companies. The main problem with the Dow is that it has only 30 stocks in it and it equally weights by shares, not market cap. It also doesn’t represent the entire market well, which is how the new media uses it, unfortunately.

Percentage of value

Look at the top three holdings in the Dow by value and the bottom three, and do the same for the Nasdaq 100.

df.sort_values(by='pctVal').head(3)[['name', 'pctVal']]

                            name    pctVal
20                    Intel Corp  0.728031
10  Walgreens Boots Alliance Inc  0.794398
1              Cisco Systems Inc  0.909688

df.sort_values(by='pctVal').tail(3)[['name', 'pctVal']]

                           name     pctVal
6        UnitedHealth Group Inc  10.874146
21  Goldman Sachs Group Inc/The   6.684611
19           Home Depot Inc/The   6.033974

df_n100.sort_values(by='pctVal').head(3)[['name', 'pctVal']]

              name    pctVal
83  DocuSign, Inc.  0.108177
41      Okta, Inc.  0.128520
34    Splunk, Inc.  0.134234

df_n100.sort_values(by='pctVal').tail(3)[['name', 'pctVal']]

                name     pctVal
17       Apple, Inc.  12.578838
67   Microsoft Corp.  10.918983
10  Amazon.com, Inc.   6.142787

If you don’t follow the stock market, this might not be obvious, but Apple, Microsoft, and Amazon have much larger market capitalizations than DocuSign, Okta, and Splunk. But why is UnitedHealth group 10% of the Dow, when Intel is only 0.72 % ? UnitedHealth has a stock price of around $500, but Intel has a price around $25. However, their market capitalizations are ~ $479B and $105B respectively. A price weighted index is very different from a market cap weighted index because it depends somewhat arbitrarily on price.

Where do we go from here?

We’ve now just scratched the surface with using the EDGAR data to understand what is in an index. There is a lot more useful data in EDGAR about each of the stocks in the index. But to access it, we will need to map the value in the N-PORT report back to a stock ticker. You’ll notice that we have a name and a CUSIP, but not a ticker. We’ll tackle that problem next.