My previous article explained how to access corporate reports in the EDGAR database, but it didn't explain how to extract data from a report. If you look at a report listing, you'll see that EDGAR provides reports in three primary formats:
- Regular text - Data provided in regular files (*.txt)
- Web pages - Data to be viewed in a browser (*.htm)
- XBRL - Data provided in XBRL-formatted files (*.xml)
The first two options are fine if you want to read report data yourself. But if you want to extract data programmatically, the last option is the most practical. XBRL files aren't easy for humans to read, but because of their structure, they're ideally suited for computers.
This article introduces the XBRL format and then explains how to read XBRL using BeautifulSoup. At the end, I'll present example code that programmatically downloads and parses an XBRL file from EDGAR.
1. Introducing XBRL
A primary role of the US Securities and Exchange Commission (SEC) is to ensure that investors have reliable information with which to make decisions. To this end, the SEC requires that publicly-traded corporations submit reports that accurately portray their financial state. Corporations have traditionally provided these reports in regular text, but as computerized stock analysis became popular, the SEC decided on a more structured, computer-readable format.
The SEC selected the eXtensible Business Reporting Language (XBRL) for structured corporate reporting. As of April 2009, the SEC requires that corporations provide financial reports in XBRL format in addition to text. Since then, India and the United Kingdom have also adopted XBRL for corporate reporting.
XBRL is based on the eXtensible Markup Language (XML), but uses special tags to mark financial data. This section presents the basics of XML and namespaces, and then provides an overview of XBRL.
1.1 XML, Schema, and Namespaces
A good way to introduce XML is to compare it with HTML. An HTML document structures its content using nested tags that take the form <xyz>...</xyz>. For example, HTML uses <b>...</b> tags to display text in boldface, as in <b>Hi there!</b>. HTML lets you control a tag's behavior with attributes, such as the id attribute in <p id="...">...</p>.
I like to think of XML as generic HTML. An XML document contains tags and attributes similar to those in HTML but XML doesn't define any specific tags or attributes. Instead, implementers can define their own tags and attributes by creating a schema. Schemas are defined in special XML documents formatted with XML Schema Definition (XSD), and for this reason, schema documents have the suffix *.xsd instead of *.xml.
An XML document can access the tags and attributes of a schema using a namespace declaration. As an example, the following declaration specifies that the XML document will access the tags and attributes defined in the schema located at http://www.example.com:
xmlns:ex="http://www.example.com"
The xmlns portion stands for XML Namespace, and must be present in every namespace declaration. The ex is optional, and serves as a prefix for tags obtained from the schema. For example, if the schema defines an element named apple, the XML document can access the element using <ex:apple>...</ex:apple> tags.
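To see how a prefix maps to a namespace in practice, here is a minimal sketch. It uses Python's standard xml.etree.ElementTree module rather than BeautifulSoup so that it runs without extra packages. The basket and apple elements are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the root declares the 'ex' prefix for the example namespace
doc = """<basket xmlns:ex="http://www.example.com">
  <ex:apple>Granny Smith</ex:apple>
  <ex:apple>Fuji</ex:apple>
</basket>"""

root = ET.fromstring(doc)

# ElementTree stores each tag as '{namespace-uri}local-name'
first = root[0]
print(first.tag)

# A prefix-to-URI mapping lets searches use the short prefixed form
ns = {'ex': 'http://www.example.com'}
apples = [el.text for el in root.findall('ex:apple', ns)]
print(apples)
```

The prefix itself carries no meaning: what identifies an element is the namespace URI plus the local name, which is why the parser expands ex:apple to the full form.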
1.2 XBRL Reports and Schema
An XBRL document is an XML document that structures its content using XBRL's tags and attributes. This may sound straightforward, but a single document may need to access features from many different schemas. For example, different countries have different reporting requirements, so an American report will access a different set of elements than a British report. Similarly, different types of reports will require different schemas, so an annual report will use different tags than a prospectus.
A thorough discussion of the tags/attributes in an American corporation's annual report would take up a sizable book. In this discussion, my goal is to present some of the namespaces that are commonly accessed in American reports:
- Base XBRL Schema - Provides the overall structure of an XBRL document
- US Document and Entity Information (DEI) - Sets a document's type and characteristics
- US Generally Accepted Accounting Principles (GAAP) - Defines required elements of American reports
- Entity-specific Schema - Defines elements specific to the entity providing the report
You don't need to memorize the elements of these namespaces, but the more familiar you are, the better you'll be able to extract data from XBRL documents.
1.2.1 The Base XBRL Schema
The fundamental tags and attributes of XBRL are provided in the schema located at http://www.xbrl.org/2003/instance. Documents commonly access these elements through the xbrli prefix, as given in the following namespace declaration:
xmlns:xbrli="http://www.xbrl.org/2003/instance"
Of the many elements defined by the schema, xbrli:xbrl is particularly important. This is because the content of every XBRL document must be contained inside <xbrli:xbrl>...</xbrli:xbrl> tags.
To understand other tags provided by the base schema, you should be familiar with the following terms:
- instance - an XBRL document whose root element is <xbrli:xbrl>
- fact - an individual detail in a report, such as $20M
- concept - the meaning associated with a fact, such as the cost of goods sold
- entity - the company or individual described by a concept
- context - a data structure that associates facts with an entity and a reporting period
Many XBRL documents start by defining a long list of contexts. Each context is represented by an <xbrli:context> element and each has an id attribute. Each <xbrli:context> element contains an <xbrli:entity> subelement that identifies an entity. The following markup defines a context with an identifier of FD2013Q4YTD:
<xbrli:context id="FD2013Q4YTD">
  <xbrli:entity>
    <xbrli:identifier scheme="http://www.sec.gov/CIK">0001065088</xbrli:identifier>
  </xbrli:entity>
  <xbrli:period>
    <xbrli:startDate>2013-01-01</xbrli:startDate>
    <xbrli:endDate>2013-12-31</xbrli:endDate>
  </xbrli:period>
</xbrli:context>
Later sections in the document can reference this context by setting a contextRef attribute to the context's ID. This is shown in the following markup:
<us-gaap:IncomeTaxDisclosureTextBlock contextRef="FD2013Q4YTD" ...>
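As a rough sketch of how the contexts in a document might be collected for later lookup, the following code reads each <xbrli:context> into a dictionary keyed by its id. It uses the standard xml.etree.ElementTree module instead of BeautifulSoup, and the read_contexts function is a hypothetical helper, not part of any XBRL library. The markup is the context shown above, wrapped in a root element so that it parses on its own.

```python
import xml.etree.ElementTree as ET

XBRLI = 'http://www.xbrl.org/2003/instance'

# Context markup from the article, wrapped in <xbrli:xbrl> so it is a complete document
doc = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance">
  <xbrli:context id="FD2013Q4YTD">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.sec.gov/CIK">0001065088</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period>
      <xbrli:startDate>2013-01-01</xbrli:startDate>
      <xbrli:endDate>2013-12-31</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>
</xbrli:xbrl>"""

def read_contexts(xml_text):
    """Return {context id: (entity identifier, start date, end date)}."""
    ns = {'xbrli': XBRLI}
    root = ET.fromstring(xml_text)
    contexts = {}
    for ctx in root.findall('xbrli:context', ns):
        ident = ctx.findtext('xbrli:entity/xbrli:identifier', namespaces=ns)
        start = ctx.findtext('xbrli:period/xbrli:startDate', namespaces=ns)
        end = ctx.findtext('xbrli:period/xbrli:endDate', namespaces=ns)
        contexts[ctx.get('id')] = (ident, start, end)
    return contexts

contexts = read_contexts(doc)
print(contexts['FD2013Q4YTD'])
```

Given a fact's contextRef value, a lookup in this dictionary tells you which entity and period the fact belongs to. Contexts whose period is described by other elements would yield None for the dates in this sketch.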
1.2.2 US Document and Entity Information (DEI)
Every XBRL document submitted to the SEC needs to provide information about its content. A submitter can meet this requirement by including elements from the US Document and Entity Information (DEI) schema. These elements are commonly prefixed with dei and a document can access them with the following declaration:
xmlns:dei="http://xbrl.sec.gov/dei/2014-01-31"
The elements defined in this schema identify the XBRL report's type and provide information about the entity submitting the report. Table 1 lists eleven of the many elements available.
Table 1: Elements Provided by the US Document and Entity Information Schema (Abridged)
Element | Description |
DocumentType | Type of document being reported |
EntityCentralIndexKey | CIK of the entity submitting the report |
TradingSymbol | Exchange symbol of the entity submitting the report |
EntityCurrentReportingStatus | Identifies if the entity is subject to filing requirements |
EntityFilerCategory | Identifies the entity's filing category (large, small, ... |
EntityRegistrantName | Exact name of the entity as given in the charter |
DocumentFiscalPeriodFocus | The document's focus fiscal period |
DocumentFiscalYearFocus | The document's focus fiscal year |
CurrentFiscalYearEndDate | End of the current fiscal year |
AmendmentFlag | Identifies if the document is an amendment to a previously-filed document |
AmendmentDescription | Description of changes in amended document |
It's important to see the difference between EntityCentralIndexKey, TradingSymbol, and EntityRegistrantName. The EntityCentralIndexKey element identifies the submitter's CIK code, the TradingSymbol identifies the submitter's trading (ticker) symbol, and EntityRegistrantName provides the entity's formal name.
The following markup, taken from an eBay annual report, demonstrates how DEI elements are used:
<dei:DocumentType contextRef="..." id="Fact-...">
10-K
</dei:DocumentType>
<dei:EntityCentralIndexKey contextRef="..." id="Fact-...">
0001065088
</dei:EntityCentralIndexKey>
<dei:TradingSymbol contextRef="..." id="Fact-...">
EBAY
</dei:TradingSymbol>
<dei:EntityRegistrantName contextRef="..." id="Fact-...">
EBAY INC
</dei:EntityRegistrantName>
<dei:EntityFilerCategory contextRef="..." id="Fact-...">
Large Accelerated Filer
</dei:EntityFilerCategory>
As shown, each DEI element has an id attribute and a contextRef that refers to an <xbrli:context> element defined earlier in the document.
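The following sketch shows one way to gather DEI facts like these into a dictionary. It relies on the standard xml.etree.ElementTree module, and read_dei_facts is a hypothetical helper introduced for illustration. The markup is a subset of the eBay example wrapped in a placeholder root element, with the contextRef values left elided as they are above. It also assumes the document declares the 2014-01-31 DEI namespace; a filing based on a different taxonomy release would declare a different URI.

```python
import xml.etree.ElementTree as ET

# Namespace URI from the declaration in this section (an assumption for other filings)
DEI = 'http://xbrl.sec.gov/dei/2014-01-31'

doc = """<root xmlns:dei="http://xbrl.sec.gov/dei/2014-01-31">
  <dei:DocumentType contextRef="...">
    10-K
  </dei:DocumentType>
  <dei:EntityCentralIndexKey contextRef="...">
    0001065088
  </dei:EntityCentralIndexKey>
  <dei:TradingSymbol contextRef="...">
    EBAY
  </dei:TradingSymbol>
  <dei:EntityRegistrantName contextRef="...">
    EBAY INC
  </dei:EntityRegistrantName>
</root>"""

def read_dei_facts(xml_text):
    """Return {element name: text} for every element in the DEI namespace."""
    prefix = '{' + DEI + '}'
    facts = {}
    for el in ET.fromstring(xml_text).iter():
        if el.tag.startswith(prefix):
            # Drop the namespace part and the whitespace around the value
            facts[el.tag[len(prefix):]] = (el.text or '').strip()
    return facts

dei_facts = read_dei_facts(doc)
print(dei_facts['EntityRegistrantName'], dei_facts['DocumentType'])
```

Stripping whitespace matters here because the values sit on their own lines between the opening and closing tags.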
1.2.3 US Generally Accepted Accounting Principles (GAAP)
To ensure that businesses use common terminology in their accounting reports, the US Financial Accounting Standards Board (FASB) provides a set of standards called the Generally Accepted Accounting Principles, or GAAP. Entities can provide GAAP data in their XBRL reports by accessing the FASB's schema definitions. GAAP elements are commonly preceded with the us-gaap prefix:
xmlns:us-gaap="http://fasb.org/us-gaap/2014-01-31"
This schema provides thousands of elements related to accounting, and Table 2 lists a small but important subset. You can look through a more complete table here.
Table 2: Elements of the US Generally Accepted Accounting Principles Schema (Abridged)
Element | Description |
AccountsPayableCurrent | Liabilities payable to vendors as of the balance sheet date |
AccountsReceivableGross | Amounts due from customers or clients |
AccountsReceivableNet | Amounts due from customers or clients, reduced to estimated realizable value |
AccruedIncomeTaxes | Unpaid sum of known and estimated tax obligations |
AccruedInsuranceCurrent | Obligations payable to insurance entities to mitigate loss |
AssetManagementCosts | Aggregate costs related to asset management |
AssetsCurrent | Sum of all assets expected to be realized within a year |
BorrowedFunds | Sum of all debt amounts |
Cash | Unrestricted cash available for operating needs |
CommercialPaper | Value of short-term borrowings using unsecured obligations issued by banks and corporations |
CommonStockNoParValue | Issuance value per share of no-par value stock |
CommonStockSharesIssued | Total number of common shares that have been sold or granted to shareholders |
CommonStockValue | Aggregate par or stated value of issued common stock |
SalariesAndWages | Expenditures for salaries other than officers |
ConvertibleDebt | Amount of debt that can be converted into another form of financial instrument, such as common stock |
CostOfGoodsSold | Aggregate costs related to goods sold during the period |
CostOfServices | Total costs related to services rendered during the period |
CostsAndExpenses | Total costs of sales and operating expenses for the period |
DebtCurrent | Sum of short-term debt and maturities of long-term debt |
DeferredRevenue | Cash or other assets that have not yet been realized |
Depreciation | Amount of expense related to the cost of tangible assets over the assets' useful lives |
DirectOperatingCosts | Aggregate expenses directly related to operations |
Dividends | Equity impact of cash, stock, and dividends declared for all securities during the period |
EarningsPerShareBasic | Net income (loss) for the period per share of common stock |
GrossProfit | Aggregate revenue minus the cost of goods/services sold and operating expenses |
IntangibleAssetsCurrent | Current portion of non-physical assets, excluding financial assets |
InterestAndDebtExpense | Expenses related to interest and debt payments |
InventoryGross | Merchandise, goods, or supplies held for future sale or used in manufacturing or production |
Land | Real estate held for productive use, not held for sale |
Liabilities | Sum of all recognized liabilities |
LiabilitiesAndStockholdersEquity | Total of liabilities and stockholders' equity, including the portion of equity attributable to noncontrolling interests |
NetIncomeLoss | Portion of profit or loss for the period, net of income taxes |
ProfitLoss | Consolidated profit or loss for the period |
NotesPayable | Aggregate amount of notes payable, with initial maturities beyond one year or the normal operating cycle |
OfficersCompensation | Expenditures for salaries of officers |
OperatingCycle | Entity's operating cycle if less than 12 months |
OperatingExpenses | Recurring costs associated with normal operations except expenses included in the cost of sales or services |
PreferredStockValue | Stated value of issued nonredeemable preferred stock |
ResearchAndDevelopmentExpense | Costs incurred during research and development activities |
Revenues | Aggregate revenue recognized during the period |
SharesIssued | Number of shares of stock issued |
SharesOutstanding | Number of shares issued and outstanding |
StockholdersEquity | Total of stockholders' equity items, net of receivables from officers, directors, owners, and affiliates |
You can find accounting data in a report by searching for the appropriate us-gaap element. For example, eBay's 2014 annual report identifies its aggregate liabilities with the following markup:
<us-gaap:Liabilities contextRef="..." decimals="..." id="..." unitRef="usd">
25226000000
</us-gaap:Liabilities>
The us-gaap schema has many elements that closely resemble one another in name and purpose. If you're searching for specific accounting data, be sure not to confuse the elements.
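Once a us-gaap fact has been located, its text still has to be turned into a number before it can be used in calculations. The sketch below does this for the eBay liabilities markup shown above, using the standard xml.etree.ElementTree and decimal modules. The placeholder root element is an assumption added so the fragment parses by itself, and the elided attribute values are left as they appear in the article.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

# Liabilities fact from the article, wrapped in a placeholder root element
doc = """<root xmlns:us-gaap="http://fasb.org/us-gaap/2014-01-31">
  <us-gaap:Liabilities contextRef="..." decimals="..." id="..." unitRef="usd">
    25226000000
  </us-gaap:Liabilities>
</root>"""

ns = {'us-gaap': 'http://fasb.org/us-gaap/2014-01-31'}
root = ET.fromstring(doc)
fact = root.find('us-gaap:Liabilities', ns)

# Decimal avoids floating-point rounding for monetary amounts
value = Decimal(fact.text.strip())
unit = fact.get('unitRef')

# Format with thousands separators for display
summary = '{:,} {}'.format(value, unit)
print(summary)
```

Because the search names the element exactly, it won't pick up similarly named elements such as LiabilitiesAndStockholdersEquity.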
2. Parsing XBRL with BeautifulSoup
After you've downloaded an XBRL document, you can extract its data using a number of methods. If you know what element you're interested in, you can perform a brute-force search for the text, as in us-gaap:Assets. At the opposite extreme, the python-xbrl library was specially created for parsing XBRL documents, but I've never gotten it to work properly.
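For reference, a brute-force text search might look something like the following sketch, which uses a regular expression from Python's standard re module. The find_fact_values function is a hypothetical helper, and the XBRL fragment, its context ID, and its values are invented for illustration.

```python
import re

# Hypothetical fragment of an XBRL instance; the context id and amounts are made up
xbrl_text = """
<us-gaap:Assets contextRef="ctx1" unitRef="usd">1000</us-gaap:Assets>
<us-gaap:AssetsCurrent contextRef="ctx1" unitRef="usd">400</us-gaap:AssetsCurrent>
"""

def find_fact_values(text, element):
    """Return the text of every <element ...>value</element> occurrence."""
    name = re.escape(element)
    # The name must be followed by whitespace or '>', so 'us-gaap:Assets'
    # does not also match 'us-gaap:AssetsCurrent'
    pattern = r'<' + name + r'(?:\s[^>]*)?>\s*([^<]*?)\s*</' + name + r'>'
    return re.findall(pattern, text)

assets = find_fact_values(xbrl_text, 'us-gaap:Assets')
print(assets)
```

This approach ignores the document's structure entirely, so it only suits quick checks on simple facts that contain no nested markup.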
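This section explains how to parse XBRL using the BeautifulSoup package introduced in the previous article. You don't need to learn any new classes or methods, but it is important to specify which parser BeautifulSoup should use. If you install the lxml library (pip install lxml), then you can create the BeautifulSoup instance with the following code:
soup = BeautifulSoup(..., 'lxml')
Note that 'lxml' selects lxml's HTML parser, which converts tag names to lowercase. This is why the code in this article compares against lowercase names such as us-gaap:liabilities. BeautifulSoup also offers a true XML mode ('xml' or 'lxml-xml'), but the examples here would need to be adapted to use it.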
For some reason, when I call the find_all method to search for an XBRL tag, the returned list is always empty. But when I call find_all without arguments, the returned list contains Tags that represent XBRL tags. Therefore, I use code like the following:
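from bs4 import BeautifulSoup

# xbrl_string holds the text of a downloaded XBRL document
soup = BeautifulSoup(xbrl_string, 'lxml')

# Retrieve every tag, then compare each name against the lowercase element name
tag_list = soup.find_all()
for tag in tag_list:
    if tag.name == 'us-gaap:liabilities':
        print('Liabilities: ' + tag.text)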
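One plausible explanation for the empty search results is letter case: HTML-oriented parsers normalize tag and attribute names to lowercase, so a search for a mixed-case name such as us-gaap:Liabilities finds nothing. The sketch below illustrates the normalization with Python's built-in html.parser module, chosen so that the example runs without lxml. It is an assumption that this is what caused the empty lists described above, although it is consistent with the lowercase comparison in the preceding code. The TagNameCollector class and the c1 context ID are hypothetical.

```python
from html.parser import HTMLParser

class TagNameCollector(HTMLParser):
    """Record the name and attributes of every start tag as the parser reports them."""
    def __init__(self):
        super().__init__()
        self.names = []
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)
        self.attrs.append(dict(attrs))

collector = TagNameCollector()
collector.feed('<us-gaap:Liabilities contextRef="c1" unitRef="usd">25226000000</us-gaap:Liabilities>')

# Names come back in lowercase even though the markup used mixed case
print(collector.names)
print(collector.attrs)
```

Attribute names are lowercased as well, so with an HTML parser the contextRef attribute would be read under the key contextref.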
An annual report may contain multiple <us-gaap:liabilities> elements, each corresponding to a different reporting period. Each period corresponds to a <context> element, so you can distinguish between GAAP elements by checking their contextRef attributes.
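As a sketch of how facts with the same name might be told apart, the following code pairs each Liabilities fact with the end date of the context it references and then picks the most recent one. It uses the standard xml.etree.ElementTree module, which preserves the original case of tag and attribute names, rather than BeautifulSoup. The context IDs, dates, and amounts are invented for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    'xbrli': 'http://www.xbrl.org/2003/instance',
    'us-gaap': 'http://fasb.org/us-gaap/2014-01-31',
}

# Hypothetical instance: context ids, dates, and amounts are made up
doc = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
                     xmlns:us-gaap="http://fasb.org/us-gaap/2014-01-31">
  <xbrli:context id="FY2013">
    <xbrli:period>
      <xbrli:startDate>2013-01-01</xbrli:startDate>
      <xbrli:endDate>2013-12-31</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>
  <xbrli:context id="FY2014">
    <xbrli:period>
      <xbrli:startDate>2014-01-01</xbrli:startDate>
      <xbrli:endDate>2014-12-31</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>
  <us-gaap:Liabilities contextRef="FY2013" unitRef="usd">100</us-gaap:Liabilities>
  <us-gaap:Liabilities contextRef="FY2014" unitRef="usd">120</us-gaap:Liabilities>
</xbrli:xbrl>"""

root = ET.fromstring(doc)

# Map each context id to the end date of its period
end_dates = {
    ctx.get('id'): ctx.findtext('xbrli:period/xbrli:endDate', namespaces=NS)
    for ctx in root.findall('xbrli:context', NS)
}

# Pair every Liabilities fact with the end date of the context it references
by_period = {
    end_dates[fact.get('contextRef')]: fact.text.strip()
    for fact in root.findall('us-gaap:Liabilities', NS)
}

# ISO-formatted dates sort correctly as plain strings
latest = max(by_period)
print(latest, by_period[latest])
```

A real report is likely to contain many more contexts than this, so the lookup table is usually worth building once and reusing for every fact.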
3. Complete EDGAR-XBRL Example
If you followed the previous article and the content of this article, you shouldn't have any trouble understanding how to access a company's EDGAR reports and parse them in Python. To demonstrate this, the code in Listing 1 searches EDGAR for the 2014 annual report (10-K) from IBM (CIK: 0000051143) and then parses the XBRL to determine the stockholders' equity (us-gaap:stockholdersequity).
Listing 1: Reading Stockholders' Equity from IBM's Annual Report (xbrl_reader.py)
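from bs4 import BeautifulSoup
import requests
import sys

# Search parameters: IBM's CIK, the report type, and the latest filing date to consider
cik = '0000051143'
filing_type = '10-K'
dateb = '20160101'

# The SEC asks automated clients to identify themselves. This header is a
# placeholder (an assumption, not part of the original listing); use your own details.
headers = {'User-Agent': 'Your Name your.email@example.com'}

# Obtain the HTML page that lists the company's filings
base_url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type={}&dateb={}"
edgar_resp = requests.get(base_url.format(cik, filing_type, dateb), headers=headers)
edgar_str = edgar_resp.text

# Find the link to the filing submitted in 2015 (the report for fiscal year 2014)
doc_link = ''
soup = BeautifulSoup(edgar_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile2')
rows = table_tag.find_all('tr') if table_tag else []
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if '2015' in cells[3].text:
            doc_link = 'https://www.sec.gov' + cells[1].a['href']

# Exit if the document link couldn't be found
if doc_link == '':
    print("Couldn't find the document link")
    sys.exit()

# Obtain the HTML page for the filing
doc_resp = requests.get(doc_link, headers=headers)
doc_str = doc_resp.text

# Find the link to the XBRL instance document
xbrl_link = ''
soup = BeautifulSoup(doc_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile', summary='Data Files')
rows = table_tag.find_all('tr') if table_tag else []
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if 'INS' in cells[3].text:
            xbrl_link = 'https://www.sec.gov' + cells[2].a['href']

# Exit if the XBRL link couldn't be found
if xbrl_link == '':
    print("Couldn't find the XBRL link")
    sys.exit()

# Obtain the XBRL text
xbrl_resp = requests.get(xbrl_link, headers=headers)
xbrl_str = xbrl_resp.text

# Find and print the stockholders' equity facts
soup = BeautifulSoup(xbrl_str, 'lxml')
tag_list = soup.find_all()
for tag in tag_list:
    if tag.name == 'us-gaap:stockholdersequity':
        print("Stockholders' equity: " + tag.text)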
This code only works properly if the SEC doesn't change the markup for the EDGAR website. Of course, the markup is unlikely to remain constant over time, so keep in mind that you may have to dig into the markup to update the code.
History
- 2nd February, 2018 - Initial article submission