Parsing XBRL with Python

Matt Scarpino

2 Feb 2018

Extracting data from online financial reports with Python

My previous article explained how to access corporate reports in the EDGAR database, but it didn't explain how to extract data from a report. If you look at a report listing, you'll see that EDGAR provides reports in three primary formats:

Regular text - Data provided in regular files (*.txt)
Web pages - Data to be viewed in a browser (*.htm)
XBRL - Data provided in XBRL-formatted files (*.xml)

The first two options are fine if you want to read report data yourself. But if you want to extract data programmatically, the last option is the most practical. XBRL files aren't easy for humans to read, but because of their structure, they're ideally suited for computers.

This article introduces the XBRL format and then explains how to read XBRL using BeautifulSoup. At the end, I'll present example code that programmatically downloads and parses an XBRL file from EDGAR.

1. Introducing XBRL

A primary role of the US Securities and Exchange Commission (SEC) is to ensure that investors have reliable information with which to make decisions. To this end, the SEC requires that publicly-traded corporations submit reports that accurately portray their financial state. Corporations have traditionally provided these reports in regular text, but as computerized stock analysis became popular, the SEC decided on a more structured, computer-readable format.

The SEC selected the eXtensible Business Reporting Language (XBRL) for structured corporate reporting. As of April 2009, the SEC requires that corporations provide financial reports in XBRL format in addition to text. Since then, India and the United Kingdom have also adopted XBRL for corporate reporting.

XBRL is based on the eXtensible Markup Language (XML), but uses special tags to mark financial data. This section presents the basics of XML and namespaces, and then provides an overview of XBRL.

1.1 XML, Schema, and Namespaces

A good way to introduce XML is to compare it with HTML. An HTML document structures its content using nested tags that take the form <xyz>...</xyz>. For example, HTML uses <b>...</b> tags to display text in boldface, as in <b>Hi there!</b>. HTML lets you control a tag's behavior with attributes, such as the id attribute in <b id="...">...</b>.

I like to think of XML as generic HTML. An XML document contains tags and attributes similar to those in HTML, but XML doesn't define any specific tags or attributes. Instead, implementers define their own tags and attributes by creating a schema. Schemas are written in special XML documents that follow the XML Schema Definition (XSD) language, and for this reason, schema documents have the suffix *.xsd instead of *.xml.

An XML document can access the tags and attributes of a schema using a namespace declaration. As an example, the following declaration specifies that the XML document will access the tags and attributes defined in the schema whose namespace is http://www.example.com. Strictly speaking, the URI is an identifier for the namespace rather than an address that must resolve to a file:

XML

xmlns:ex="http://www.example.com"

The xmlns portion stands for XML Namespace, and must be present in every namespace declaration. The ex is optional, and serves as a prefix for tags obtained from the schema. For example, if the schema defines an element named apple, the XML document can access the element using <ex:apple>...</ex:apple> tags.
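
To make the prefix mechanics concrete, here is a minimal sketch that uses Python's standard xml.etree.ElementTree module (not BeautifulSoup, which this article turns to later). The basket element and the apple's text are made up for illustration; only the ex prefix and the http://www.example.com URI come from the declaration above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <ex:basket> and the text content are illustrative only
doc = """<ex:basket xmlns:ex="http://www.example.com">
  <ex:apple>Granny Smith</ex:apple>
</ex:basket>"""

root = ET.fromstring(doc)

# ElementTree replaces each prefix with its namespace URI in braces
print(root.tag)      # {http://www.example.com}basket

# Map a prefix of our choosing to the same URI, then search by prefixed name
ns = {'ex': 'http://www.example.com'}
apple = root.find('ex:apple', ns)
print(apple.text)    # Granny Smith
```

The prefix used in the search dictionary doesn't have to match the one in the document. What ties the search to the element is the URI.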

1.2 XBRL Reports and Schema

An XBRL document is an XML document that structures its content using XBRL's tags and attributes. This may sound straightforward, but a single document may need to access features from many different schemas. For example, different countries have different reporting requirements, so an American report will access a different set of elements than a British report. Similarly, different types of reports will require different schemas, so an annual report will use different tags than a prospectus.

A thorough discussion of the tags/attributes in an American corporation's annual report would take up a sizable book. In this discussion, my goal is to present some of the namespaces that are commonly accessed in American reports:

Base XBRL Schema - Provides the overall structure of an XBRL document
US Document and Entity Information (DEI) - Sets a document's type and characteristics
US Generally Accepted Accounting Principles (GAAP) - Defines required elements of American reports
Entity-specific Schema - Defines elements specific to the entity providing the report

You don't need to memorize the elements of these namespaces, but the more familiar you are, the better you'll be able to extract data from XBRL documents.

1.2.1 The Base XBRL Schema

The fundamental tags and attributes of XBRL are provided in the schema located at http://www.xbrl.org/2003/instance. Documents commonly access these elements through the xbrli prefix, as given in the following namespace declaration:

XML

xmlns:xbrli="http://www.xbrl.org/2003/instance"

Of the many elements defined by the schema, xbrli:xbrl is particularly important. This is because the content of every XBRL document must be contained inside <xbrli:xbrl>...</xbrli:xbrl> tags.

To understand other tags provided by the base schema, you should be familiar with the following terms:

instance - an XBRL document whose root element is <xbrli:xbrl>
fact - an individual detail in a report, such as $20M
concept - the meaning associated with a fact, such as the cost of goods sold
entity - the company or individual that a fact describes
context - a data structure that identifies the entity and the reporting period to which a fact applies

Many XBRL documents start by defining a long list of contexts. Each context is represented by an <xbrli:context> element and each has an id attribute. Each <xbrli:context> element contains an <xbrli:entity> subelement that identifies an entity. The following markup defines a context with an identifier of FD2013Q4YTD:

XML

<xbrli:context id="FD2013Q4YTD">
  <xbrli:entity>
    <xbrli:identifier scheme="http://www.sec.gov/CIK">0001065088</xbrli:identifier>
  </xbrli:entity>
  <xbrli:period>
    <xbrli:startDate>2013-01-01</xbrli:startDate>
    <xbrli:endDate>2013-12-31</xbrli:endDate>
  </xbrli:period>
</xbrli:context>

Later elements in the document can reference this context by setting their contextRef attribute to the context's ID. This is shown in the following markup:

XML

<us-gaap:IncomeTaxDisclosureTextBlock contextRef="FD2013Q4YTD" ...>
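
As a rough illustration of how contexts can be read in code, the sketch below collects each context's ID, CIK, and period into a dictionary. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, and the read_contexts function name is my own. It assumes every context has a startDate/endDate period like the one above; contexts that describe a single instant would yield None for both dates.

```python
import xml.etree.ElementTree as ET

NS = {'xbrli': 'http://www.xbrl.org/2003/instance'}

# The context shown above, wrapped in a minimal <xbrli:xbrl> root
instance = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance">
  <xbrli:context id="FD2013Q4YTD">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.sec.gov/CIK">0001065088</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period>
      <xbrli:startDate>2013-01-01</xbrli:startDate>
      <xbrli:endDate>2013-12-31</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>
</xbrli:xbrl>"""

def read_contexts(xbrl_text):
    """Return {context id: {'cik', 'start', 'end'}} for each xbrli:context."""
    root = ET.fromstring(xbrl_text)
    contexts = {}
    for ctx in root.findall('xbrli:context', NS):
        contexts[ctx.get('id')] = {
            'cik': ctx.findtext('xbrli:entity/xbrli:identifier', namespaces=NS),
            'start': ctx.findtext('xbrli:period/xbrli:startDate', namespaces=NS),
            'end': ctx.findtext('xbrli:period/xbrli:endDate', namespaces=NS),
        }
    return contexts

contexts = read_contexts(instance)
print(contexts['FD2013Q4YTD'])
```

With a dictionary like this, the contextRef value on any later element can be looked up directly to find the period it belongs to.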

1.2.2 US Document and Entity Information (DEI)

Every XBRL document submitted to the SEC needs to provide information about its content. A submitter can meet this requirement by including elements from the US Document and Entity Information (DEI) schema. These elements are commonly prefixed with dei and a document can access them with the following declaration:

XML

xmlns:dei="http://xbrl.sec.gov/dei/2014-01-31"

The elements defined in this schema identify the XBRL report's type and provide information about the entity submitting the report. Table 1 lists eleven of the many elements available.

Table 1: Elements Provided by the US Document and Entity Information Schema (Abridged)

`DocumentType`	Type of document being reported
`EntityCentralIndexKey`	CIK of the entity submitting the report
`TradingSymbol`	Exchange symbol of the entity submitting the report
`EntityCurrentReportingStatus`	Identifies if the entity is subject to filing requirements
`EntityFilerCategory`	Identifies the entity's filing category (large, small, ...)
`EntityRegistrantName`	Exact name of the entity as given in its charter
`DocumentFiscalPeriodFocus`	The document's focus fiscal period
`DocumentFiscalYearFocus`	The document's focus fiscal year
`CurrentFiscalYearEndDate`	End of the current fiscal year
`AmendmentFlag`	Identifies if the document is an amendment to a previously-filed document
`AmendmentDescription`	Description of changes in amended document

It's important to see the difference between EntityCentralIndexKey, TradingSymbol, and EntityRegistrantName. The EntityCentralIndexKey element identifies the submitter's CIK code, the TradingSymbol identifies the submitter's trading (ticker) symbol, and EntityRegistrantName provides the entity's formal name.

The following markup, taken from an eBay annual report, demonstrates how DEI elements are used:

XML

<dei:DocumentType contextRef="..." id="Fact-...">
  10-K
</dei:DocumentType>
<dei:EntityCentralIndexKey contextRef="..." id="Fact-...">
  0001065088
</dei:EntityCentralIndexKey>
<dei:TradingSymbol contextRef="..." id="Fact-...">
  EBAY
</dei:TradingSymbol>
<dei:EntityRegistrantName contextRef="..." id="Fact-...">
  EBAY INC
</dei:EntityRegistrantName>
<dei:EntityFilerCategory contextRef="..." id="Fact-...">
  Large Accelerated Filer
</dei:EntityFilerCategory>

As shown, each DEI element has an id attribute and a contextRef that refers to an <xbrli:context> element defined earlier in the document.
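
One way to gather the DEI facts is to collect every top-level element that belongs to the DEI namespace. The sketch below does this with the standard library's xml.etree.ElementTree. The read_dei function name, the c1 context ID, and the f1/f2 fact IDs are placeholders of mine, since the original markup elides those values. The namespace URI is the 2014 one given above, and other filing years use a different URI.

```python
import xml.etree.ElementTree as ET

DEI = 'http://xbrl.sec.gov/dei/2014-01-31'

# Abridged sample; contextRef and id values are placeholders
sample = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
    xmlns:dei="http://xbrl.sec.gov/dei/2014-01-31">
  <dei:DocumentType contextRef="c1" id="f1">
    10-K
  </dei:DocumentType>
  <dei:TradingSymbol contextRef="c1" id="f2">
    EBAY
  </dei:TradingSymbol>
</xbrli:xbrl>"""

def read_dei(xbrl_text, dei_uri=DEI):
    """Return {element name: stripped text} for top-level DEI elements."""
    prefix = '{' + dei_uri + '}'
    facts = {}
    for elem in ET.fromstring(xbrl_text):
        # ElementTree stores tags as {namespace-uri}LocalName
        if elem.tag.startswith(prefix):
            facts[elem.tag[len(prefix):]] = (elem.text or '').strip()
    return facts

print(read_dei(sample))
```

Stripping the text matters here because the values in the report are surrounded by newlines and indentation.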

1.2.3 US Generally Accepted Accounting Principles (GAAP)

To ensure that businesses use common terminology in their accounting reports, the US Financial Accounting Standards Board (FASB) provides a set of standards called the Generally Accepted Accounting Principles, or GAAP. Entities can provide GAAP data in their XBRL reports by accessing the FASB's schema definitions. GAAP elements are commonly preceded with the us-gaap prefix:

XML

xmlns:us-gaap="http://fasb.org/us-gaap/2014-01-31"

This schema provides thousands of elements related to accounting, and Table 2 lists a small but important subset. You can look through a more complete table here.

Table 2: Elements of the US Generally Accepted Accounting Principles Schema (Abridged)

`AccountsPayableCurrent`	Liabilities payable to vendors as of the balance sheet date
`AccountsReceivableGross`	Amounts due from customers or clients
`AccountsReceivableNet`	Amounts due from customers or clients, reduced to estimated realizable value
`AccruedIncomeTaxes`	Unpaid sum of known and estimated tax obligations
`AccruedInsuranceCurrent`	Obligations payable to insurance entities to mitigate loss
`AssetManagementCosts`	Aggregate costs related to asset management
`AssetsCurrent`	Sum of all assets expected to be realized within one year
`BorrowedFunds`	Sum of all debt amounts
`Cash`	Unrestricted cash available for operating needs
`CommercialPaper`	Value of short-term borrowings using unsecured obligations issued by banks and corporations
`CommonStockNoParValue`	Issuance value per share of no-par value stock
`CommonStockSharesIssued`	Total number of common shares that have been sold or granted to shareholders
`CommonStockValue`	Aggregate par or stated value of issued common stock
`SalariesAndWages`	Expenditures for salaries other than officers
`ConvertibleDebt`	Amount of debt that can be converted into another form of financial instrument, such as common stock
`CostOfGoodsSold`	Aggregate costs related to goods sold during the period
`CostOfServices`	Total costs related to services rendered during the period
`CostsAndExpenses`	Total costs of sales and operating expenses for the period
`DebtCurrent`	Sum of short-term debt and maturities of long-term debt
`DeferredRevenue`	Cash or other assets that have not yet been realized
`Depreciation`	Amount of expense related to the cost of tangible assets over the assets' useful lives
`DirectOperatingCosts`	Aggregate expenses directly related to operations
`Dividends`	Equity impact of cash, stock, and dividends declared for all securities during the period
`EarningsPerShareBasic`	Net income (loss) for the period per share of common stock
`GrossProfit`	Aggregate revenue minus the cost of goods/services sold and operating expenses
`IntangibleAssetsCurrent`	Current portion of non-physical assets, excluding financial assets
`InterestAndDebtExpense`	Expenses related to interest and debt payments
`InventoryGross`	Merchandise, goods, or supplies held for future sale or used in manufacturing or production
`Land`	Real estate held for productive use, not held for sale
`Liabilities`	Sum of all recognized liabilities
`LiabilitiesAndStockholdersEquity`	Total of liabilities and stockholders' equity, including the portion of equity attributable to noncontrolling interests
`NetIncomeLoss`	Portion of profit or loss for the period, net of income taxes
`ProfitLoss`	Consolidated profit or loss for the period
`NotesPayable`	Aggregate amount of notes payable, with initial maturities beyond one year or the normal operating cycle
`OfficersCompensation`	Expenditures for salaries of officers
`OperatingCycle`	Entity's operating cycle if less than 12 months
`OperatingExpenses`	Recurring costs associated with normal operations except expenses included in the cost of sales or services
`PreferredStockValue`	Stated value of issued nonredeemable preferred stock
`ResearchAndDevelopmentExpense`	Costs incurred during research and development activities
`Revenues`	Aggregate revenue recognized during the period
`SharesIssued`	Number of shares of stock issued
`SharesOutstanding`	Number of shares issued and outstanding
`StockholdersEquity`	Total of stockholders' equity items, net of receivables from officers, directors, owners, and affiliates

You can find accounting data in a report by searching for the appropriate us-gaap element. For example, eBay's 2014 annual report identifies its aggregate liabilities with the following markup:

XML

<us-gaap:Liabilities contextRef="..." decimals="..." id="..." unitRef="usd">
  25226000000
</us-gaap:Liabilities>

The us-gaap schema has many elements that closely resemble one another in name and purpose. If you're searching for specific accounting data, be sure not to confuse the elements.
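
To turn a fact such as the Liabilities element above into a number, you can read the element's text along with its contextRef and unitRef attributes. The following is a minimal sketch using the standard library's xml.etree.ElementTree; read_amounts is a name I made up, and the c1, f1, and decimals="-6" values are placeholders for the attributes elided in the markup above.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

US_GAAP = 'http://fasb.org/us-gaap/2014-01-31'

# Sample fact; contextRef, id, and decimals values are placeholders
sample = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
    xmlns:us-gaap="http://fasb.org/us-gaap/2014-01-31">
  <us-gaap:Liabilities contextRef="c1" decimals="-6" id="f1" unitRef="usd">
    25226000000
  </us-gaap:Liabilities>
</xbrli:xbrl>"""

def read_amounts(xbrl_text, concept, gaap_uri=US_GAAP):
    """Return a list of (contextRef, unitRef, Decimal value) tuples."""
    root = ET.fromstring(xbrl_text)
    results = []
    # Search by fully qualified name: {namespace-uri}Concept
    for elem in root.iter('{%s}%s' % (gaap_uri, concept)):
        text = (elem.text or '').strip()
        if not text:
            continue  # skip empty facts
        results.append((elem.get('contextRef'), elem.get('unitRef'), Decimal(text)))
    return results

for context_ref, unit, value in read_amounts(sample, 'Liabilities'):
    print(context_ref, unit, value)
```

Matching on the exact element name also helps with the look-alike problem mentioned above: a search for Liabilities will not pick up LiabilitiesAndStockholdersEquity.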

2. Parsing XBRL with BeautifulSoup

After you've downloaded an XBRL document, you can extract its data using a number of methods. If you know what element you're interested in, you can perform a brute-force search for the text, as in us-gaap:Assets. At the opposite extreme, the python-xbrl library was specially created for parsing XBRL documents, but I've never gotten it to work properly.

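This section explains how to parse XBRL using the BeautifulSoup package introduced in the previous article. You don't need to learn any new classes or methods, but it is important to tell BeautifulSoup which parser to use. If you install the lxml library (pip install lxml), then you can create the BeautifulSoup instance with the following code. Be aware that the 'lxml' argument selects lxml's HTML parser, which converts tag and attribute names to lowercase; passing 'xml' (or 'lxml-xml') instead selects lxml's XML parser, which preserves the original case: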

Python

soup = BeautifulSoup(..., 'lxml')

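When I call the find_all method to search for an XBRL tag, the returned list is always empty. I haven't pinned down the cause, but one possible explanation is that the 'lxml' parser treats the document as HTML and stores tag names in lowercase, so a search for a mixed-case name such as us-gaap:Liabilities finds nothing. When I call find_all without arguments, the returned list contains Tags that represent XBRL tags, so I compare each tag's name against its lowercase form with code like the following: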

Python

soup = BeautifulSoup(xbrl_string, 'lxml')
tag_list = soup.find_all()
for tag in tag_list:
    if tag.name == 'us-gaap:liabilities':
        print('Liabilities: ' + tag.text)

An annual report may contain multiple <us-gaap:liabilities> elements, each corresponding to a different reporting period. Each period corresponds to a <context> element, so you can distinguish between GAAP elements by checking their contextRef attributes.
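
The idea of matching facts to periods through contextRef can be sketched as follows. This example uses the standard library's xml.etree.ElementTree instead of BeautifulSoup, so tag names keep their original case. The values_by_period function, the context IDs, and the amounts 100 and 200 are all made up for illustration, and the contexts omit their entity elements for brevity.

```python
import xml.etree.ElementTree as ET

XBRLI = 'http://www.xbrl.org/2003/instance'
US_GAAP = 'http://fasb.org/us-gaap/2014-01-31'
NS = {'xbrli': XBRLI}

# Hypothetical instance with two reporting dates (placeholder values)
sample = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
    xmlns:us-gaap="http://fasb.org/us-gaap/2014-01-31">
  <xbrli:context id="ctx2013">
    <xbrli:period><xbrli:instant>2013-12-31</xbrli:instant></xbrli:period>
  </xbrli:context>
  <xbrli:context id="ctx2014">
    <xbrli:period><xbrli:instant>2014-12-31</xbrli:instant></xbrli:period>
  </xbrli:context>
  <us-gaap:Liabilities contextRef="ctx2013" unitRef="usd">100</us-gaap:Liabilities>
  <us-gaap:Liabilities contextRef="ctx2014" unitRef="usd">200</us-gaap:Liabilities>
</xbrli:xbrl>"""

def values_by_period(xbrl_text, concept):
    """Map each fact's instant (or period end) date to its text value."""
    root = ET.fromstring(xbrl_text)
    # Index each context's date by its id
    dates = {}
    for ctx in root.iter('{%s}context' % XBRLI):
        period = ctx.find('xbrli:period', NS)
        if period is None:
            continue
        dates[ctx.get('id')] = (
            period.findtext('xbrli:instant', namespaces=NS)
            or period.findtext('xbrli:endDate', namespaces=NS)
        )
    # Look up each fact's contextRef in that index
    result = {}
    for fact in root.iter('{%s}%s' % (US_GAAP, concept)):
        result[dates.get(fact.get('contextRef'))] = (fact.text or '').strip()
    return result

print(values_by_period(sample, 'Liabilities'))
# {'2013-12-31': '100', '2014-12-31': '200'}
```

In a real filing, several contexts can share the same date while differing in other details, so a dictionary keyed only by date may overwrite values. Keying by the context ID is safer if you need every fact.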

3. Complete EDGAR-XBRL Example

If you followed the previous article and the content of this article, you shouldn't have any trouble understanding how to access a company's EDGAR reports and parse them in Python. To demonstrate this, the code in Listing 1 searches EDGAR for the 2014 annual report (10-K) from IBM (CIK: 0000051143) and then parses the XBRL to determine the stockholders' equity (us-gaap:stockholdersequity).

Listing 1: Reading Stockholders' Equity from IBM's Annual Report (xbrl_reader.py)

Python

from bs4 import BeautifulSoup
import requests
import sys

# Search parameters: IBM's CIK, the filing type, and the latest filing date
cik = '0000051143'
filing_type = '10-K'  # named filing_type to avoid shadowing the built-in type()
dateb = '20160101'

# Assumption: SEC.gov expects automated clients to identify themselves.
# The value below is a placeholder; replace it with your own details.
headers = {'User-Agent': 'Sample Company Name admin@example.com'}

# Obtain HTML for search page
base_url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type={}&dateb={}"
edgar_resp = requests.get(base_url.format(cik, filing_type, dateb), headers=headers)
edgar_str = edgar_resp.text

# Find the document link (the row whose filing date falls in 2015)
doc_link = ''
soup = BeautifulSoup(edgar_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile2')
rows = table_tag.find_all('tr') if table_tag else []
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if '2015' in cells[3].text:
            doc_link = 'https://www.sec.gov' + cells[1].a['href']

# Exit if document link couldn't be found
if doc_link == '':
    print("Couldn't find the document link")
    sys.exit()

# Obtain HTML for document page
doc_resp = requests.get(doc_link, headers=headers)
doc_str = doc_resp.text

# Find the XBRL link (the instance document in the Data Files table)
xbrl_link = ''
soup = BeautifulSoup(doc_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile', summary='Data Files')
rows = table_tag.find_all('tr') if table_tag else []
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if 'INS' in cells[3].text:
            xbrl_link = 'https://www.sec.gov' + cells[2].a['href']

# Exit if XBRL link couldn't be found
if xbrl_link == '':
    print("Couldn't find the XBRL link")
    sys.exit()

# Obtain XBRL text from document
xbrl_resp = requests.get(xbrl_link, headers=headers)
xbrl_str = xbrl_resp.text

# Find and print stockholders' equity (tag names are lowercase with 'lxml')
soup = BeautifulSoup(xbrl_str, 'lxml')
tag_list = soup.find_all()
for tag in tag_list:
    if tag.name == 'us-gaap:stockholdersequity':
        print("Stockholders' equity: " + tag.text)

This code only works properly if the SEC doesn't change the markup for the EDGAR website. Of course, the markup is unlikely to remain constant over time, so keep in mind that you may have to dig into the markup to update the code.

History

2nd February, 2018 - Initial article submission

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)