Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles
(untagged)

Example of converting tables in a series of PDFs to an XML database

0.00/5 (No votes)
13 Jun 2013 1  
A walk-through of how to convert a series of PDFs containing tables output by Microsoft Access to an XML database.

Introduction

An example application of how to convert a series of PDFs containing tables outputted by Microsoft Access 2007 to an XML database.

Background

The Gerald R. Ford Presidential Library and Museum is partnering with the Wikimedia Commons (the repository for media files for Wikimedia projects including Wikipedia) under GLAM to donate content related to the life and presidency of Gerald R. Ford. One of these media donations are a series of 200dpi contact sheets that document Ford's legacy.

The contact sheets are available online at the Gerald R. Ford Presidential Digital Library and contain ~290,000 photographs taken by the White House Photographic Office Photographs from 1974-77. The ~10,000 contact sheets are available online and are listed chronologically with an identifier for each roll of photographic film. No description of the film is provided on the contact sheet page. Instead, the descriptions are provided separately in the form of a chronological series of PDFs which contain tables outputted from Microsoft Access 2007 that contain the contact sheet metadata. The original database from which the tables were output are not available.  

In order for files uploaded to the Wikimedia Commons to be found and used, they should include all relevant metadata such as description, author, date, etc. These PDFs contain tables which hold this data which should be associated with the relevant contact sheet. The solution is to download all the PDFs to create a database so that the relevant metadata can be made machine readable and used in the upload of the contact sheets.

PDF tables

The PDF tables appear in the following manner for this example file :

Page 1:

Page 2:

The tables can be output to xml in Adobe Acrobat 9 and be parsed and stitched together once in XML.

Initial Relation

The easiest way to reconstruct the relational database is to find the the unique key. Each contact sheet does have a unique identifier. An example contact sheet URL is:  

http://www.fordlibrarymuseum.gov/library/whphotos/A0032_NLGRF.jpg 

The format for the contact sheet url is:  

http://www.fordlibrarymuseum.gov/library/whphotos/<identifier>_NLGRF.jpg  

This identifier is present in the table and can be used to obtain the contact sheet.

Table Conversion

In order to use the tables in the PDFs, they need to be read. Reading in the PDF format is a challenge. To make the data more accessible, PDFs can be exported to different formats. In this case, the PDFs will be exported as HTML 3.2. HTML 4 w/ CSS introduces color markup which which needlessly grow the file. HTML3.2 supports tables, which is what we're after. XML 1.0 export is lossy, with many of the columns coming out empty.

Instead, the PDFs will be batch exported to HTML3.2 (following a method from here), and the HTML output will be read. Fortunately, the html export preserves the table structure which allows for easy parsing.

To batch export the PDFs to HTML 3.2, you'll need at least Adobe Acrobat Pro 9. Here we'll will use Acrobat Pro 9.1  

  1. Select File->Export->Export Multiple Files
  2. Click Add Files->add files and add the files to convert. You may also drag 'n drop the files. Press OK.
  3. Specify the output folder and change the Export To to XML 1.0. And press OK. The files will now be converted.

Because the PDFs were output by Microsoft Access, they are tagged PDFs and the tables are explicitly tagged. In the HTML output, this is evident by the structuring. 

<TABLE border=0 cellSpacing=0 cellPadding=2 align=center><TBODY>
<TR>
<TH height=18 vAlign=center width=40 align=left>
   <FONT color=#00214d size=+1>Roll # </FONT></TH>
<TH height=18 vAlign=center width=48 align=left>Frames </TH>
<TH height=18 vAlign=center width=38 align=left>Tone </TH>
<TH height=18 vAlign=center width=235 align=left>Subject -Proper </TH>
<TH height=18 vAlign=center width=144 align=left>Subject -Generic </TH>
<TH height=18 vAlign=center width=161 align=left>Names </TH>
<TH height=18 vAlign=center width=100 align=left>Geographic </TH>
<TH height=18 vAlign=center width=82 align=left>Location </TH>
<TH height=18 vAlign=center width=80 align=left>Photographer </TH></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
  <A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
  <FONT color=#0000ff></B>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>2 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
  Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=top width=144 align=left>greeting each other </TD>
<TD height=35 vAlign=top width=161 align=left>Kissinger, Carl Albert </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
  <A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
  <FONT color=#0000ff>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>3-5 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
  Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=top width=144 align=left>talking </TD>
<TD height=35 vAlign=center width=161 align=left>GRF, Chief Justice Warren Burger </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
  <A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
  <FONT color=#0000ff>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>6-7 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
  Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=top width=144 align=left>arm raised </TD>
<TD height=35 vAlign=center width=161 align=left>GRF, Chief Justice Warren Burger </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
  <A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
  <FONT color=#0000ff>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>8 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
  Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=top width=144 align=left>arm raised </TD>
<TD height=35 vAlign=center width=161 align=left>GRF, Chief Justice Warren Burger </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
  <A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
  <FONT color=#0000ff>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>9-13 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
  Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=top width=144 align=left>acknowledging applause </TD>
<TD height=35 vAlign=top width=161 align=left>GRF, Betty Ford </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
  <A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
  <FONT color=#0000ff>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>14-18 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
  Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=top width=144 align=left>Acceptance Speech </TD>
<TD height=35 vAlign=top width=161 align=left>GRF </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
  <A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
  <FONT color=#0000ff>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>19-21 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
  Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=center width=144 align=left>acknowledging applause; walking off stage </TD>
<TD height=35 vAlign=top width=161 align=left>GRF, Betty Ford </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=34 vAlign=top width=40 align=left>
  <A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0002_NLGRF.jpg">
  <FONT color=#0000ff>A0002 </A></FONT></TD>
<TD height=34 vAlign=top width=48 align=left><FONT color=#000000>1A-4A </FONT></TD>
<TD height=34 vAlign=top width=38 align=left>Color </TD>
<TD height=34 vAlign=center width=235 align=left>
  Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=34 vAlign=top width=144 align=left>Swearing In </TD>
<TD height=34 vAlign=center width=161 align=left>GRF, Chief Justice Warren Burger, Audience </TD>
<TD height=34 vAlign=top width=100 align=left></TD>
<TD height=34 vAlign=top width=82 align=left>East Room </TD>
<TD height=34 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
  <A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0002_NLGRF.jpg">
  <FONT color=#0000ff>A0002 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>5A-13A </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
  Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=center width=144 align=left>Acceptance Speech, long shots </TD>
<TD height=35 vAlign=center width=161 align=left>
  GRF, Betty Ford, Chief Justice Warren Burger, Audience </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore</TD></TR></TBODY></TABLE>

These tables will need to be stitched together to recreate the database. 

Obtaining the PDFs  

A chronological list of links to the PDFs is available online.

An example link is:

http://www.fordlibrarymuseum.gov/library/whphotos/19740809whpo.pdf

The format of the links is:

http://www.fordlibrarymuseum.gov/library/whphotos/<year><month><day>whpo.pdf

To download all of the PDFs on the page, we'll use the free Firefox extension DownloadThemAll!

  1. After going to the web page, right click on the web page and select "DownloadThemAll" from the context menu.
  2. Enter where you want to save the files. Under Filters, check documents and uncheck everything else. If you scroll down, you'll see the files to be downloaded in green. Click Start! to begin downloading.
  3. Wait for the files to all download. 

Putting it all together

Once all the PDFs are downloaded, it is time to put it together. The PDFs will be batch converted to XML. A program will then go through each XML file, extract the table, and build the table in another XML file.

The created XML database will have the outline:

<root>
  <roll id="x">
    <frames name="" tone="" subjectproper="" 
      subjectgeneric="" names="" geographic="" 
      location="" photographer="" date=""/>
    <frames name="" tone="" subjectproper="" 
      subjectgeneric="" names="" geographic="" 
      location="" photographer="" date=""/>
  </roll>
</root>

The DTD will be:

<!DOCTYPE root [
<!ELEMENT root  ( roll+ )>
 
<!ELEMENT roll  ( frames+ )>
<!ATTLIST roll
id ID #REQUIRED
>
 
<!ELEMENT frames  EMPTY>
<!ATTLIST frames
name CDATA #REQUIRED
tone CDATA #REQUIRED
subjectproper CDATA #REQUIRED
subjectgeneric CDATA #REQUIRED
names CDATA #REQUIRED
geographic CDATA #REQUIRED
location CDATA #REQUIRED
photographer CDATA #REQUIRED
date CDATA #REQUIRED
>
]>

The data will be extracted from the tables in the PDF and the PDF name which provides the date. The date format of the PDFs is:

<year><month><day>whpo.pdf

For the program, we'll use VBScript. You use MSHTML to read the HTML files, Msxml2.DOMDocument.6.0 to create the XML document. Msxml2.SAXXMLReader.6.0 and Msxml2.SAXXMLWriter.6.0 are used to pretty print the XML document.

Option Explicit
 
Dim htmlin: Set htmlin = CreateObject("htmlfile")'mshtml
Dim fs: Set fs = CreateObject("Scripting.FileSystemObject")
 
Dim XMLout: Set XMLout = CreateObject("Msxml2.DOMDocument.6.0")
XMLout.async = False
 
Dim root: Set root = XMLout.createElement("root")
XMLout.appendChild(root)
 
Dim FSO: Set FSO = CreateObject("Scripting.FileSystemObject")
Dim folder: Set folder = FSO.GetFolder("C:\Ford\HTML\")
Dim file
For Each file In folder.Files
   'If XMLin.load(file.Path) Then
    Dim oFile: Set oFile = FSO.OpenTextFile(file)
    htmlin.open("about:blank")
    htmlin.write(oFile.ReadAll())
    oFile.Close
        
    'date format is year-month-day
    Dim filedate: filedate = FSO.GetBaseName(file)
    WScript.Echo filedate
    filedate = Mid(filedate,1,4) & "-" & _
                Mid(filedate,5,2) & "-" & Mid(filedate,7,2) 
    
    Dim lastgoodnode

    Dim tables: Set tables = htmlin.getElementsByTagName("Table")
    Dim table
    For Each table In tables
        
        Dim rows: Set rows = table.getElementsByTagName("TR")
        Dim row
        For Each row In rows
            Dim columns: Set columns = row.getElementsByTagName("TD")'Header is TH
            If columns.length = 0 Then 'Header
                'ignore
            ElseIf columns.length <> 9 Then
                WScript.StdOut.WriteLine "Column Error: " & _
                        file.Path & "|" & row.xml
            Else 
                'columns.item(0).text
                '1
                Dim id: id = Trim(columns.item(0).innerText)                    
                Dim rollid: Set rollid= Nothing
                If(id = "") Then 'continue from last node
                
                    Dim subjectproper: subjectproper = LCase(columns.item(3).innerText)
                    If Not(InStr(subjectproper,"roll") Or _
                        InStr(subjectproper,"taken on") Or _
                        InStr(subjectproper,"empty folder")) Then
                        
                    Set rollid = lastgoodnode
                    'WScript.Echo "empty node"
                    lastgoodnode.setAttribute "name", _
                      lastgoodnode.getAttribute("name") & columns.item(1).innerText
                    lastgoodnode.setAttribute "tone", _
                      lastgoodnode.getAttribute("tone") & columns.item(2).innerText
                    lastgoodnode.setAttribute "subjectproper", _
                      lastgoodnode.getAttribute("subjectproper") & columns.item(3).innerText
                    lastgoodnode.setAttribute "subjectgeneric", _
                      lastgoodnode.getAttribute("subjectgeneric") & columns.item(4).innerText
                    lastgoodnode.setAttribute "names", _
                      lastgoodnode.getAttribute("names") & columns.item(5).innerText
                    lastgoodnode.setAttribute "geographic", _
                      lastgoodnode.getAttribute("geographic") & columns.item(6).innerText
                    lastgoodnode.setAttribute "location", _
                      lastgoodnode.getAttribute("location") & columns.item(7).innerText
                    lastgoodnode.setAttribute "photographer", _
                      lastgoodnode.getAttribute("photographer") & columns.item(8).innerText
                    End If
                        
                Else
                    Set rollid = XMLout.SelectSingleNode("/root/roll[@id='" & _
                        id & "']")'nodeFromID(id)'SelectSingleNode("//roll[@id='" & _
                        id & "']")'nodeFromID(id)
                    If rollid Is Nothing Then 'Create new roll if not there
                        'WScript.Echo "id: " & id
                        Set rollid = XMLout.createElement("roll")
                        rollid.setAttribute "id", id
                        root.appendChild(rollid)
                    End If
                        
                    'WScript.Echo "f: " & columns.item(1).innerText
                    Dim frames: Set frames = XMLout.createElement("frames")
                    
                    frames.setAttribute "name", columns.item(1).innerText
                    frames.setAttribute "tone", columns.item(2).innerText
                    frames.setAttribute "subjectproper", columns.item(3).innerText
                    frames.setAttribute "subjectgeneric", columns.item(4).innerText
                    frames.setAttribute "names", columns.item(5).innerText
                    frames.setAttribute "geographic", columns.item(6).innerText
                    frames.setAttribute "location", columns.item(7).innerText
                    frames.setAttribute "photographer", columns.item(8).innerText
                    frames.setAttribute "date", filedate
                    
                    Set lastgoodnode = frames
                    rollid.appendChild(frames)
                End If
            End If
        Next
    Next
Next
 
WScript.Echo "Saving xml file"
 
Dim writer: Set writer = CreateObject("Msxml2.MXXMLWriter.6.0")
With writer
    .indent = True
    .omitXMLDeclaration = False
    .standalone = True
    '.encoding = "utf-8"
End With
Dim reader: set reader = CreateObject("Msxml2.SAXXMLReader.6.0")
With reader
    .contentHandler = writer
    .putProperty "http://xml.org/sax/properties/lexical-handler", writer
    .parse(XMLout)
End With
 
Dim Stream : Set Stream = CreateObject("ADODB.Stream")
With Stream
    .Open
    .WriteText writer.output
    .SaveToFile ("C:\Ford\data1.xml")
    .Close
End With

The resulting database, ~36 MB, can be found at: dropbox.

To Do

Further explain the VBScript code.

History

  • December 14, 2012: Initial post.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here