When exploring the world of quantitative finance or algorithmic trading, you quickly run into a very common question: where do I get historical market data? No matter what sort of analysis or trading you plan to do, you’ll need access to quality market data for your research and development. Getting it can be challenging and possibly expensive. If all you want is daily closing prices for large-cap U.S. equities, you’ll probably be able to find them from a number of free or nearly free sources. However, if you want intraday data (prices at hourly, minute, or even sub-minute levels), or data for other types of securities (futures, bonds, or foreign stocks, for example), you will find the data more expensive and harder to come by. For example, I found that historical 1-minute data for the full S&P 500 going back to 1998 costs over $750 from several vendors and amounts to over 50 GB of data.
However, some brokerages will give you access to historical data as part of their service offerings. For example, Interactive Brokers (IB) offers APIs for fetching historical data at different resolutions. For many people, this data may be good enough for historical backtesting and research, and it is included in the price you are already paying for market data.
OK, there has to be a catch, right?
Yes, there are several issues with downloading data from a broker like IB. IB itself points out that you may want to purchase your data from a vendor that specializes in historical market data. Some of these issues are:
- Being forced to use a clunky API instead of just downloading bulk CSV files. Some vendors, like Polygon, offer bulk file downloads of historical data, but IB forces you to use their API, which adds a little complexity to the process.
- IB restricts its APIs to prevent abuse, so you need to rate limit your downloads to avoid being flagged. Their servers will also throttle your results if you send too many requests, and you may get disconnected.
- IB doesn’t offer historical data for stocks that are no longer listed, so your dataset will automatically suffer from survivorship bias. Some companies will get acquired at high prices, others will go bankrupt or be delisted, and your historical backtests will have neither of these scenarios included. It also appears some expired futures data is not available, but I haven’t been able to verify this yet.
Even given these issues, using IB to obtain some historical data for research is worth considering as a first option, especially if you’re already paying for the market data. If it doesn’t meet your needs, you can always purchase data from someone else.
In order to fetch historical data, you need to have met several criteria:
- Opened an IB account, and funded it
- Downloaded and configured the TWS software and Python API
- Subscribed to Level 1 (top of book) market data for any contracts you wish to query
Along with these steps, IB places some limitations on fetching data:
- No more than 50 outstanding requests at a time. They note that it is probably more efficient to do fewer requests rather than try to test the upper limit.
- If asking for bars of 30 seconds or smaller, you cannot make six requests for the same contract within 2 seconds, 60 requests within 10 minutes, or two identical requests within 15 seconds. If you are grabbing consecutive single days for a symbol, you can hit these limits pretty easily.
- In general, if your request will return more than a few thousand bars you should consider splitting it up.
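The pacing limits above can be tracked on the client side before each request is sent. The following is a minimal sketch, not IB's own mechanism: `PacingLimiter`, `contract_key`, and `request_key` are hypothetical names, and the thresholds are a conservative reading of the limits listed above (never reach six per contract in 2 seconds, 60 overall in 10 minutes, or two identical requests in 15 seconds).

```python
import time


def _wait_for_window(times, now, limit, window):
    """Seconds to wait so one more request keeps the count in `window` below `limit`."""
    recent = [t for t in times if now - t < window]
    if len(recent) < limit - 1:
        return 0.0
    # The (limit - 1)-th most recent request must age out of the window first.
    return recent[-(limit - 1)] + window - now


class PacingLimiter:
    """Hypothetical helper: remembers sent requests and reports how long to wait."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock  # injectable so the logic can be exercised without sleeping
        self.sent = []      # (timestamp, contract_key, request_key), oldest first

    def wait_time(self, contract_key, request_key):
        now = self.clock()
        # Forget anything older than the longest window (10 minutes).
        self.sent = [s for s in self.sent if now - s[0] < 600]
        everything = [t for t, _, _ in self.sent]
        same_contract = [t for t, c, _ in self.sent if c == contract_key]
        identical = [t for t, _, r in self.sent if r == request_key]
        return max(
            0.0,
            _wait_for_window(identical, now, limit=2, window=15),
            _wait_for_window(same_contract, now, limit=6, window=2),
            _wait_for_window(everything, now, limit=60, window=600),
        )

    def record(self, contract_key, request_key):
        """Call this right after a request has actually been sent."""
        self.sent.append((self.clock(), contract_key, request_key))
```

A caller would sleep for `wait_time(...)` seconds, send the request, and then call `record(...)`. Since these limits apply to small bar sizes, a downloader could skip the check entirely for larger bars.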
So what sort of data is available? Bar data is available in sizes of 1, 5, 10, 15, and 30 seconds, but resolutions below 30 seconds are only available for six months from the current date. They will also generate larger bars of 1, 2, 3, 5, 10, 15, 20, and 30 minutes and 1, 2, 3, 4, and 8 hours, along with daily, weekly, and monthly bars. Those bars can consist of trades, bids and asks, midpoint, and various other fields described in the documentation. Note that building bars with last price and bid/ask will require at least two queries (TRADES and BID_ASK), then merging the data together. When considering the pacing of requests, this may factor into any downloading decisions.
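Merging the results of a TRADES query and a BID_ASK query can be as simple as joining on the bar timestamp. This is a minimal sketch that assumes each bar has already been converted to a dict with a `date` key; `merge_bars` and the `trades_`/`bid_ask_` prefixes are hypothetical, not part of the IB API.

```python
def merge_bars(trades, bid_ask):
    """Join two lists of bar dicts on 'date', keeping timestamps present in both."""
    quotes_by_time = {bar["date"]: bar for bar in bid_ask}
    merged = []
    for bar in trades:
        quote = quotes_by_time.get(bar["date"])
        if quote is None:
            continue  # no quote bar for this timestamp, so skip it
        row = {"date": bar["date"]}
        # Prefix the fields so the two sources cannot overwrite each other.
        row.update({f"trades_{k}": v for k, v in bar.items() if k != "date"})
        row.update({f"bid_ask_{k}": v for k, v in quote.items() if k != "date"})
        merged.append(row)
    return merged
```

Dropping unmatched timestamps is one design choice; keeping them with missing values would be equally reasonable depending on the analysis.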
In my testing, I found that more than a few thousand rows of data are returned for some queries (for example, fetching daily data for 40 years of AAPL returns over 9000 rows, and 20 years of NVDA returns over 5000 rows at once). For minute bar data, I found that querying multiple days of data will cause rate limiting to take effect.
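One way to keep intraday requests small is to split the date range into chunks before requesting anything. The helper below is a sketch under that assumption; `split_range` and `days_per_request` are hypothetical names, and it makes no attempt to skip weekends or holidays.

```python
from datetime import date, timedelta


def split_range(start, end, days_per_request=1):
    """Yield (first_day, last_day) pairs covering start..end inclusive."""
    if days_per_request < 1:
        raise ValueError("days_per_request must be at least 1")
    one_day = timedelta(days=1)
    first = start
    while first <= end:
        # Each chunk spans days_per_request days, clipped to the end of the range.
        last = min(first + timedelta(days=days_per_request) - one_day, end)
        yield first, last
        first = last + one_day
```

With `days_per_request=1`, each chunk maps naturally onto one request and one output file per day.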
In order to run my code, you need to follow the directions from my earlier post to install the IB API. Once you’ve activated your Python virtualenv, you also need to make sure you’ve installed a few more Python libraries.
pyenv activate ib-example
pip install python-dateutil matplotlib jupyter
I’ve posted a command line application to GitHub that allows for some flexible downloads of data. It supports a few different command line options for querying different ranges of data.
$ ./download_bars.py --help
usage: download_bars.py [-h] [-d] [-p PORT] [--size SIZE]
                        [--duration DURATION] [-t DATA_TYPE]
                        [--base-directory BASE_DIRECTORY]
                        [--currency CURRENCY] [--exchange EXCHANGE]
                        [--security-type SECURITY_TYPE]
                        [--start-date START_DATE] [--end-date END_DATE]
                        [--max-days]
                        symbol [symbol ...]

positional arguments:
  symbol

optional arguments:
  -h, --help            show this help message and exit
  -d, --debug           turn on debug logging
  -p PORT, --port PORT  local port for TWS connection
  --size SIZE           bar size
  --duration DURATION   bar duration
  -t DATA_TYPE, --data-type DATA_TYPE
                        bar data type
  --base-directory BASE_DIRECTORY
                        base directory to write bar files
  --currency CURRENCY   currency for symbols
  --exchange EXCHANGE   exchange for symbols
  --security-type SECURITY_TYPE
                        security type for symbols
  --start-date START_DATE
                        First day for bars
  --end-date END_DATE   Last day for bars
  --max-days            Set start date to earliest date
For example, to fetch all historical data for AAPL as daily bars and place the csv file in
./download_bars.py --max-days --size '1 day' AAPL
To fetch a week of 1 minute bars for AMGN, with each day saved as a separate csv file in
./download_bars.py --size "1 min" --start-date 20200202 --end-date 20200207 AMGN
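For readers who want to build something similar, the options in the help output could be declared with `argparse` roughly as follows. This is a sketch, not the actual script: the option names and help strings come from the help text above, but every default value and the `YYYYMMDD` date parsing are assumptions made for illustration.

```python
import argparse
from datetime import datetime


def parse_day(text):
    """Parse a YYYYMMDD string such as 20200202 (format assumed from the examples)."""
    return datetime.strptime(text, "%Y%m%d")


def build_parser():
    p = argparse.ArgumentParser(prog="download_bars.py")
    p.add_argument("symbol", nargs="+")
    p.add_argument("-d", "--debug", action="store_true", help="turn on debug logging")
    # All defaults below are assumptions, not taken from the original script.
    p.add_argument("-p", "--port", type=int, default=7496,
                   help="local port for TWS connection")
    p.add_argument("--size", default="1 min", help="bar size")
    p.add_argument("--duration", default="1 D", help="bar duration")
    p.add_argument("-t", "--data-type", default="TRADES", help="bar data type")
    p.add_argument("--base-directory", default="data",
                   help="base directory to write bar files")
    p.add_argument("--currency", default="USD", help="currency for symbols")
    p.add_argument("--exchange", default="SMART", help="exchange for symbols")
    p.add_argument("--security-type", default="STK", help="security type for symbols")
    p.add_argument("--start-date", type=parse_day, help="First day for bars")
    p.add_argument("--end-date", type=parse_day, help="Last day for bars")
    p.add_argument("--max-days", action="store_true",
                   help="Set start date to earliest date")
    return p
```

Calling `build_parser().parse_args()` then yields attributes such as `args.size`, `args.start_date`, and `args.max_days`.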
You can refer to the code for more details. However, at a higher level, using the IB historical data API involves several methods. First, I use the reqHeadTimeStamp method to find the timestamp for the earliest data available for the contract. This is useful if we want to access the entire history of data, or to validate that we aren’t requesting data before the earliest date. The result of this query is processed in the headTimestamp method. Next, we invoke the reqHistoricalData method, making sure to request a reasonable amount of data. The results of this call are handled in the historicalData method, which is called once for each bar. Once all the data has been delivered, the historicalDataEnd method is invoked. There, we check that we’ve received all our data, save it to disk, and check whether we have more data in our timespan to download. If so, we invoke the reqHistoricalData method again, repeating this process until all the data is downloaded. All the IB methods are well documented in the IB API documentation.
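The request and callback loop described above can be sketched without a TWS connection. The class below is a simplified illustration rather than the code from the repository: `DayByDayDownloader` and `send_request` are hypothetical, with `send_request` standing in for a call to reqHistoricalData that already has the contract, bar size, and data type bound. The callback names and arguments follow the IB wrapper methods mentioned above, and stepping backwards one day per request is an assumption.

```python
from datetime import datetime, timedelta


class DayByDayDownloader:
    """Hypothetical helper: requests one day at a time until the range is done."""

    def __init__(self, send_request, first_day, last_day):
        self.send_request = send_request  # stand-in for reqHistoricalData
        self.first_day = first_day        # datetime at midnight
        self.current_day = last_day       # datetime at midnight; we walk backwards
        self.req_id = 0
        self.pending = []                 # bars for the request in flight
        self.saved = {}                   # day -> bars (a real app would write a CSV)

    def request_current_day(self):
        self.req_id += 1
        self.pending = []
        # Ask for one day of bars ending at midnight after the current day.
        end = (self.current_day + timedelta(days=1)).strftime("%Y%m%d %H:%M:%S")
        self.send_request(self.req_id, end, "1 D")

    def historicalData(self, reqId, bar):
        # Called once per bar; ignore bars from stale requests.
        if reqId == self.req_id:
            self.pending.append(bar)

    def historicalDataEnd(self, reqId, start, end):
        # All bars delivered: store them, then move on to the previous day.
        self.saved[self.current_day] = self.pending
        self.current_day -= timedelta(days=1)
        if self.current_day >= self.first_day:
            self.request_current_day()
```

In a real application these callbacks would live on the class that subclasses the IB wrapper, and the pacing limits discussed earlier would apply between requests.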
I’ve also created a very simple Jupyter notebook that shows what some of the data looks like.