Parsing XBRL with Python

2 Feb 2018 (CPOL)
Extracting data from online financial reports with Python

My previous article explained how to access corporate reports in the EDGAR database, but it didn't explain how to extract data from a report. If you look at a report listing, you'll see that EDGAR provides reports in three primary formats:

  • Regular text - Data provided in regular files (*.txt)
  • Web pages - Data to be viewed in a browser (*.htm)
  • XBRL - Data provided in XBRL-formatted files (*.xml)

The first two options are fine if you want to read report data yourself. But if you want to extract data programmatically, the last option is the most practical. XBRL files aren't easy for humans to read, but because of their structure, they're ideally suited for computers.

This article introduces the XBRL format and then explains how to read XBRL using BeautifulSoup. At the end, I'll present example code that programmatically downloads and parses an XBRL file from EDGAR.

1. Introducing XBRL

A primary role of the US Securities and Exchange Commission (SEC) is to ensure that investors have reliable information with which to make decisions. To this end, the SEC requires that publicly-traded corporations submit reports that accurately portray their financial state. Corporations have traditionally provided these reports in regular text, but as computerized stock analysis became popular, the SEC decided on a more structured, computer-readable format.

The SEC selected the eXtensible Business Reporting Language (XBRL) for structured corporate reporting. As of April 2009, the SEC requires that corporations provide financial reports in XBRL format in addition to text. Since then, India and the United Kingdom have also adopted XBRL for corporate reporting.

XBRL is based on the eXtensible Markup Language (XML), but uses special tags to mark financial data. This section presents the basics of XML and namespaces, and then provides an overview of XBRL.

1.1 XML, Schema, and Namespaces

A good way to introduce XML is to compare it with HTML. An HTML document structures its content using nested tags that take the form <xyz>...</xyz>. For example, HTML uses <b>...</b> tags to display text in boldface, as in <b>Hi there!</b>. HTML lets you control a tag's behavior with attributes, such as the id attribute in <p id="...">...</p>.

I like to think of XML as generic HTML. An XML document contains tags and attributes similar to those in HTML, but XML itself doesn't define any specific tags or attributes. Instead, implementers define their own tags and attributes by creating a schema. Schemas are special XML documents written in the XML Schema Definition (XSD) language, and for this reason, schema documents have the suffix *.xsd instead of *.xml.

An XML document can access the tags and attributes of a schema using a namespace declaration. As an example, the following declaration specifies that the XML document will access the tags and attributes defined in the schema identified by the namespace URI http://www.example.com (the URI serves as a unique name and doesn't have to point to a downloadable file):

XML
xmlns:ex="http://www.example.com"

The xmlns portion stands for XML Namespace, and must be present in every namespace declaration. The ex is optional, and serves as a prefix for tags obtained from the schema. For example, if the schema defines an element named apple, the XML document can access the element using <ex:apple>...</ex:apple> tags.
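To make the prefix mechanism concrete, here is a minimal sketch that uses Python's standard xml.etree.ElementTree module (not the BeautifulSoup approach covered later in this article). The basket document and its content are made up for illustration; only the ex prefix and its URI come from the declaration above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document that declares the "ex" prefix discussed above
doc = """<basket xmlns:ex="http://www.example.com">
  <ex:apple>Granny Smith</ex:apple>
</basket>"""

root = ET.fromstring(doc)

# ElementTree expands each prefix into its namespace URI: {uri}localname
for child in root:
    print(child.tag)

# A prefix-to-URI mapping lets searches use the short prefixed form
ns = {'ex': 'http://www.example.com'}
apple = root.find('ex:apple', ns)
print(apple.text)
```

The parser treats the prefix as shorthand for the URI, so two documents that use different prefixes for the same URI refer to the same element.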

1.2 XBRL Reports and Schema

An XBRL document is an XML document that structures its content using XBRL's tags and attributes. This may sound straightforward, but a single document may need to access features from many different schemas. For example, different countries have different reporting requirements, so an American report will access a different set of elements than a British report. Similarly, different types of reports will require different schemas, so an annual report will use different tags than a prospectus.

A thorough discussion of the tags/attributes in an American corporation's annual report would take up a sizable book. In this discussion, my goal is to present some of the namespaces that are commonly accessed in American reports:

  1. Base XBRL Schema - Provides the overall structure of an XBRL document
  2. US Document and Entity Information (DEI) - Sets a document's type and characteristics
  3. US Generally Accepted Accounting Principles (GAAP) - Defines required elements of American reports
  4. Entity-specific Schema - Defines elements specific to the entity providing the report

You don't need to memorize the elements of these namespaces, but the more familiar you are, the better you'll be able to extract data from XBRL documents.

1.2.1 The Base XBRL Schema

The fundamental tags and attributes of XBRL are provided in the schema located at http://www.xbrl.org/2003/instance. Documents commonly access these elements through the xbrli prefix, as given in the following namespace declaration:

XML
xmlns:xbrli="http://www.xbrl.org/2003/instance"

Of the many elements defined by the schema, xbrli:xbrl is particularly important. This is because the content of every XBRL document must be contained inside <xbrli:xbrl>...</xbrli:xbrl> tags.

To understand other tags provided by the base schema, you should be familiar with the following terms:

  • instance - an XBRL document whose root element is <xbrli:xbrl>
  • fact - an individual detail in a report, such as $20M
  • concept - the meaning associated with a fact, such as the cost of goods sold
  • entity - the company or individual that a fact describes
  • context - a data structure that identifies the entity and the time period to which a fact applies

Many XBRL documents start by defining a long list of contexts. Each context is represented by an <xbrli:context> element and each has an id attribute. Each <xbrli:context> element contains an <xbrli:entity> subelement that identifies an entity. The following markup defines a context with an identifier of FD2013Q4YTD:

XML
<xbrli:context id="FD2013Q4YTD">
  <xbrli:entity>
    <xbrli:identifier scheme="http://www.sec.gov/CIK">0001065088</xbrli:identifier>
  </xbrli:entity>
  <xbrli:period>
    <xbrli:startDate>2013-01-01</xbrli:startDate>
    <xbrli:endDate>2013-12-31</xbrli:endDate>
  </xbrli:period>
</xbrli:context>

Later sections in the document can reference this context by setting a contextRef attribute to the context's ID. This is shown in the following markup:

XML
<us-gaap:IncomeTaxDisclosureTextBlock contextRef="FD2013Q4YTD" ...>
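As a rough sketch of how contexts can be collected for later lookup, the following code builds a dictionary keyed by context ID. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, and the instance string is a trimmed-down document containing only the context shown above. The dictionary layout (cik, start, end) is my own choice for illustration.

```python
import xml.etree.ElementTree as ET

ns = {'xbrli': 'http://www.xbrl.org/2003/instance'}

# Trimmed-down instance containing only the context shown above
instance = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance">
  <xbrli:context id="FD2013Q4YTD">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.sec.gov/CIK">0001065088</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period>
      <xbrli:startDate>2013-01-01</xbrli:startDate>
      <xbrli:endDate>2013-12-31</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>
</xbrli:xbrl>"""

root = ET.fromstring(instance)

# Map each context id to its entity identifier and period.
# findtext returns None when a subelement is missing.
contexts = {}
for ctx in root.findall('xbrli:context', ns):
    contexts[ctx.get('id')] = {
        'cik': ctx.findtext('xbrli:entity/xbrli:identifier', namespaces=ns),
        'start': ctx.findtext('xbrli:period/xbrli:startDate', namespaces=ns),
        'end': ctx.findtext('xbrli:period/xbrli:endDate', namespaces=ns),
    }

print(contexts['FD2013Q4YTD'])
```

With this dictionary in hand, the contextRef value of any fact can be used as a key to find the entity and period the fact belongs to.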

1.2.2 US Document and Entity Information (DEI)

Every XBRL document submitted to the SEC needs to provide information about its content. A submitter can meet this requirement by including elements from the US Document and Entity Information (DEI) schema. These elements are commonly prefixed with dei and a document can access them with the following declaration:

XML
xmlns:dei="http://xbrl.sec.gov/dei/2014-01-31"

The elements defined in this schema identify the XBRL report's type and provide information about the entity submitting the report. Table 1 lists eleven of the many elements available.

Table 1: Elements Provided by the US Document and Entity Information Schema (Abridged)
Element                       Description
DocumentType                  Type of document being reported
EntityCentralIndexKey         CIK of the entity submitting the report
TradingSymbol                 Exchange symbol of the entity submitting the report
EntityCurrentReportingStatus  Identifies if the entity is subject to filing requirements
EntityFilerCategory           Identifies the entity's filing category (large, small, ...)
EntityRegistrantName          Exact name of the entity as given in its charter
DocumentFiscalPeriodFocus     The document's focus fiscal period
DocumentFiscalYearFocus       The document's focus fiscal year
CurrentFiscalYearEndDate      End of the current fiscal year
AmendmentFlag                 Identifies if the document is an amendment to a previously-filed document
AmendmentDescription          Description of changes in amended document

It's important to see the difference between EntityCentralIndexKey, TradingSymbol, and EntityRegistrantName. The EntityCentralIndexKey element identifies the submitter's CIK code, the TradingSymbol identifies the submitter's trading (ticker) symbol, and EntityRegistrantName provides the entity's formal name.

The following markup, taken from an eBay annual report, demonstrates how DEI elements are used:

XML
<dei:DocumentType contextRef="..." id="Fact-...">
  10-K
</dei:DocumentType>
<dei:EntityCentralIndexKey contextRef="..." id="Fact-...">
  0001065088
</dei:EntityCentralIndexKey>
<dei:TradingSymbol contextRef="..." id="Fact-...">
  EBAY
</dei:TradingSymbol>
<dei:EntityRegistrantName contextRef="..." id="Fact-...">
  EBAY INC
</dei:EntityRegistrantName>
<dei:EntityFilerCategory contextRef="..." id="Fact-...">
  Large Accelerated Filer
</dei:EntityFilerCategory>

As shown, each DEI element has an id attribute and a contextRef that refers to an <xbrli:context> element defined earlier in the document.
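The sketch below gathers every DEI fact in a document into a dictionary keyed by element name. It relies on the standard library's xml.etree.ElementTree instead of BeautifulSoup. The instance is a simplified stand-in for a real filing: the contextRef value c1 is a made-up placeholder, and the DEI namespace URI is the one from the declaration above (other filings may use a different year's URI).

```python
import xml.etree.ElementTree as ET

DEI = 'http://xbrl.sec.gov/dei/2014-01-31'

# Simplified instance; the contextRef value "c1" is a made-up placeholder
instance = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
                          xmlns:dei="http://xbrl.sec.gov/dei/2014-01-31">
  <dei:DocumentType contextRef="c1">10-K</dei:DocumentType>
  <dei:EntityCentralIndexKey contextRef="c1">0001065088</dei:EntityCentralIndexKey>
  <dei:TradingSymbol contextRef="c1">EBAY</dei:TradingSymbol>
  <dei:EntityRegistrantName contextRef="c1">EBAY INC</dei:EntityRegistrantName>
</xbrli:xbrl>"""

root = ET.fromstring(instance)

# ElementTree stores tags as {namespace-uri}LocalName, so DEI elements
# can be recognized by their namespace part
prefix = '{' + DEI + '}'
dei_facts = {}
for elem in root.iter():
    if elem.tag.startswith(prefix):
        local_name = elem.tag[len(prefix):]
        dei_facts[local_name] = (elem.text or '').strip()

print(dei_facts['EntityRegistrantName'], dei_facts['TradingSymbol'])
```

Stripping whitespace matters because real filings often place the value on its own line, as in the eBay markup above.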

1.2.3 US Generally Accepted Accounting Principles (GAAP)

To ensure that businesses use common terminology in their accounting reports, the US Financial Accounting Standards Board (FASB) provides a set of standards called the Generally Accepted Accounting Principles, or GAAP. Entities can provide GAAP data in their XBRL reports by accessing the FASB's schema definitions. GAAP elements are commonly preceded with the us-gaap prefix:

XML
xmlns:us-gaap="http://fasb.org/us-gaap/2014-01-31" 

This schema provides thousands of elements related to accounting, and Table 2 lists a small but important subset. You can look through a more complete table here.

Table 2: Elements of the US Generally Accepted Accounting Principles Schema (Abridged)
Element                           Description
AccountsPayableCurrent            Liabilities payable to vendors as of the balance sheet date
AccountsReceivableGross           Amounts due from customers or clients
AccountsReceivableNet             Amounts due from customers or clients, reduced to estimated realizable value
AccruedIncomeTaxes                Unpaid sum of known and estimated tax obligations
AccruedInsuranceCurrent           Obligations payable to insurance entities to mitigate loss
AssetManagementCosts              Aggregate costs related to asset management
AssetsCurrent                     Sum of all assets expected to be realized within one year
BorrowedFunds                     Sum of all debt amounts
Cash                              Unrestricted cash available for operating needs
CommercialPaper                   Value of short-term borrowings using unsecured obligations issued by banks and corporations
CommonStockNoParValue             Issuance value per share of no-par value stock
CommonStockSharesIssued           Total number of common shares that have been sold or granted to shareholders
CommonStockValue                  Aggregate par or stated value of issued common stock
SalariesAndWages                  Expenditures for salaries other than officers
ConvertibleDebt                   Amount of debt that can be converted into another form of financial instrument, such as common stock
CostOfGoodsSold                   Aggregate costs related to goods sold during the period
CostOfServices                    Total costs related to services rendered during the period
CostsAndExpenses                  Total costs of sales and operating expenses for the period
DebtCurrent                       Sum of short-term debt and maturities of long-term debt
DeferredRevenue                   Cash or other assets that have not yet been realized
Depreciation                      Amount of expense related to the cost of tangible assets over the assets' useful lives
DirectOperatingCosts              Aggregate expenses directly related to operations
Dividends                         Equity impact of cash, stock, and dividends declared for all securities during the period
EarningsPerShareBasic             Net income (loss) for the period per share of common stock
GrossProfit                       Aggregate revenue minus the cost of goods/services sold and operating expenses
IntangibleAssetsCurrent           Current portion of non-physical assets, excluding financial assets
InterestAndDebtExpense            Expenses related to interest and debt payments
InventoryGross                    Merchandise, goods, or supplies held for future sale or used in manufacturing or production
Land                              Real estate held for productive use, not held for sale
Liabilities                       Sum of all recognized liabilities
LiabilitiesAndStockholdersEquity  Total of liabilities and stockholders' equity, including the portion of equity attributable to noncontrolling interests
NetIncomeLoss                     Portion of profit or loss for the period, net of income taxes
ProfitLoss                        Consolidated profit or loss for the period
NotesPayable                      Aggregate amount of notes payable, with initial maturities beyond one year or the normal operating cycle
OfficersCompensation              Expenditures for salaries of officers
OperatingCycle                    Entity's operating cycle if less than 12 months
OperatingExpenses                 Recurring costs associated with normal operations except expenses included in the cost of sales or services
PreferredStockValue               Stated value of issued nonredeemable preferred stock
ResearchAndDevelopmentExpense     Costs incurred during research and development activities
Revenues                          Aggregate revenue recognized during the period
SharesIssued                      Number of shares of stock issued
SharesOutstanding                 Number of shares issued and outstanding
StockholdersEquity                Total of stockholders' equity items, net of receivables from officers, directors, owners, and affiliates

You can find accounting data in a report by searching for the appropriate us-gaap element. For example, eBay's 2014 annual report identifies its aggregate liabilities with the following markup:

XML
<us-gaap:Liabilities contextRef="..." decimals="..." id="..." unitRef="usd">
  25226000000
</us-gaap:Liabilities>

The us-gaap schema has many elements that closely resemble one another in name and purpose. If you're searching for specific accounting data, be sure not to confuse the elements.
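Because a fact's value arrives as text surrounded by whitespace, it usually needs to be cleaned and converted before any arithmetic. The following sketch reads the Liabilities fact with the standard library's xml.etree.ElementTree (not BeautifulSoup). The contextRef value c1 is a made-up placeholder, and the attributes that are elided in the markup above are simply left out.

```python
import xml.etree.ElementTree as ET

ns = {'us-gaap': 'http://fasb.org/us-gaap/2014-01-31'}

# Simplified instance; contextRef "c1" is a made-up placeholder
instance = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
                          xmlns:us-gaap="http://fasb.org/us-gaap/2014-01-31">
  <us-gaap:Liabilities contextRef="c1" unitRef="usd">
    25226000000
  </us-gaap:Liabilities>
</xbrli:xbrl>"""

root = ET.fromstring(instance)
fact = root.find('us-gaap:Liabilities', ns)

# Strip the surrounding whitespace before converting the text to a number
value = int(fact.text.strip())
unit = fact.get('unitRef')

print('Liabilities: {:,} {}'.format(value, unit))
print('In billions: {:.3f}'.format(value / 1e9))
```

Searching by the exact element name, as find does here, also guards against accidentally picking up a similarly named element.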

2. Parsing XBRL with BeautifulSoup

After you've downloaded an XBRL document, you can extract its data using a number of methods. If you know what element you're interested in, you can perform a brute-force search for the text, as in us-gaap:Assets. At the opposite extreme, the python-xbrl library was specially created for parsing XBRL documents, but I've never gotten it to work properly.

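This section explains how to parse XBRL using the BeautifulSoup package introduced in the previous article. You don't need to learn any new classes or methods, but you do need to choose a parser that can cope with XBRL's prefixed tag names. If you install the lxml library (pip install lxml), then you can create the BeautifulSoup instance with the following code. Note that 'lxml' selects lxml's HTML parser, which converts tag and attribute names to lowercase; passing 'xml' instead would select lxml's XML parser, which preserves case: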

Python
soup = BeautifulSoup(..., 'lxml')

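When I call the find_all method to search for an XBRL tag, the returned list is always empty. One likely explanation is that the 'lxml' parser treats the document as HTML and converts tag names to lowercase, so a search for a mixed-case name such as us-gaap:Liabilities never matches. But when I call find_all without arguments, the returned list contains Tags that represent XBRL tags, and I can compare each tag's lowercase name myself. Therefore, I use code like the following: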

Python
soup = BeautifulSoup(xbrl_string, 'lxml')
tag_list = soup.find_all()
for tag in tag_list:
    if tag.name == 'us-gaap:liabilities':
        print('Liabilities: ' + tag.text)

An annual report may contain multiple <us-gaap:liabilities> elements, each corresponding to a different reporting period. Each period corresponds to an <xbrli:context> element, so you can distinguish between GAAP elements by checking their contextRef attributes.
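The following sketch shows one way to pair each fact with the date of its context. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, and everything in the instance is invented for illustration: the context IDs, the dates, and the amounts are placeholders, and the contexts omit the entity element to keep the example short.

```python
import xml.etree.ElementTree as ET

ns = {
    'xbrli': 'http://www.xbrl.org/2003/instance',
    'us-gaap': 'http://fasb.org/us-gaap/2014-01-31',
}

# Made-up instance: context ids, dates, and amounts are placeholders
instance = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
                          xmlns:us-gaap="http://fasb.org/us-gaap/2014-01-31">
  <xbrli:context id="AsOf2013">
    <xbrli:period><xbrli:instant>2013-12-31</xbrli:instant></xbrli:period>
  </xbrli:context>
  <xbrli:context id="AsOf2014">
    <xbrli:period><xbrli:instant>2014-12-31</xbrli:instant></xbrli:period>
  </xbrli:context>
  <us-gaap:Liabilities contextRef="AsOf2013" unitRef="usd">100</us-gaap:Liabilities>
  <us-gaap:Liabilities contextRef="AsOf2014" unitRef="usd">200</us-gaap:Liabilities>
</xbrli:xbrl>"""

root = ET.fromstring(instance)

# Map each context id to the date it describes
dates = {
    ctx.get('id'): ctx.findtext('xbrli:period/xbrli:instant', namespaces=ns)
    for ctx in root.findall('xbrli:context', ns)
}

# Pair every Liabilities fact with the date of its context
liabilities = {
    dates[fact.get('contextRef')]: int(fact.text)
    for fact in root.findall('us-gaap:Liabilities', ns)
}

print(liabilities)
```

Balance-sheet items such as liabilities are reported as of a single date, which is why these contexts use an instant element instead of the startDate and endDate pair shown earlier.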

3. Complete EDGAR-XBRL Example

If you followed the previous article and the content of this article, you shouldn't have any trouble understanding how to access a company's EDGAR reports and parse them in Python. To demonstrate this, the code in Listing 1 searches EDGAR for the 2014 annual report (10-K) from IBM (CIK: 0000051143) and then parses the XBRL to determine the stockholders' equity (us-gaap:stockholdersequity).

Listing 1: Reading Stockholders' Equity from IBM's Annual Report (xbrl_reader.py)
Python
from bs4 import BeautifulSoup
import requests
import sys

# The SEC asks automated clients to identify themselves in a User-Agent header.
# The value below is a placeholder; replace it with your own name and email.
headers = {'User-Agent': 'Sample Company Name admin@example.com'}

# Search parameters (filing_type avoids shadowing the built-in type function)
cik = '0000051143'
filing_type = '10-K'
dateb = '20160101'

# Obtain HTML for search page
base_url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type={}&dateb={}"
edgar_resp = requests.get(base_url.format(cik, filing_type, dateb), headers=headers)
edgar_str = edgar_resp.text

# Find the document link (a row whose filing date falls in 2015)
doc_link = ''
soup = BeautifulSoup(edgar_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile2')
rows = table_tag.find_all('tr') if table_tag else []
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3 and '2015' in cells[3].text:
        doc_link = 'https://www.sec.gov' + cells[1].a['href']

# Exit if document link couldn't be found
if doc_link == '':
    print("Couldn't find the document link")
    sys.exit()

# Obtain HTML for document page
doc_resp = requests.get(doc_link, headers=headers)
doc_str = doc_resp.text

# Find the XBRL link (the instance document is marked INS)
xbrl_link = ''
soup = BeautifulSoup(doc_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile', summary='Data Files')
rows = table_tag.find_all('tr') if table_tag else []
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3 and 'INS' in cells[3].text:
        xbrl_link = 'https://www.sec.gov' + cells[2].a['href']

# Exit if XBRL link couldn't be found
if xbrl_link == '':
    print("Couldn't find the XBRL link")
    sys.exit()

# Obtain XBRL text from document
xbrl_resp = requests.get(xbrl_link, headers=headers)
xbrl_str = xbrl_resp.text

# Find and print stockholders' equity
soup = BeautifulSoup(xbrl_str, 'lxml')
tag_list = soup.find_all()
for tag in tag_list:
    if tag.name == 'us-gaap:stockholdersequity':
        print("Stockholders' equity: " + tag.text)

This code only works properly if the SEC doesn't change the markup for the EDGAR website. Of course, the markup is unlikely to remain constant over time, so keep in mind that you may have to dig into the markup to update the code.

History

  • 2nd February, 2018 - Initial article submission

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)