Introduction
An example application of how to convert a series of PDFs containing tables outputted by Microsoft Access 2007 to an XML database.
Background
The Gerald R. Ford Presidential Library and Museum is partnering with the Wikimedia Commons (the repository for media files for Wikimedia projects including Wikipedia) under GLAM to donate content related to the life and presidency of Gerald R. Ford. One of these media donations are a series of 200dpi contact sheets that document Ford's legacy.
The contact sheets are available online at the Gerald R. Ford Presidential Digital Library and contain ~290,000 photographs taken by the White House Photographic Office Photographs from 1974-77. The ~10,000 contact sheets are available online and are listed chronologically with an identifier for each roll of photographic film. No description of the film is provided on the contact sheet page. Instead, the descriptions are provided separately in the form of a chronological series of PDFs which contain tables outputted from Microsoft Access 2007 that contain the contact sheet metadata. The original database from which the tables were output are not available.
In order for files uploaded to the Wikimedia Commons to be found and used, they should include all relevant metadata such as description, author, date, etc. These PDFs contain tables which hold this data which should be associated with the relevant contact sheet. The solution is to download all the PDFs to create a database so that the relevant metadata can be made machine readable and used in the upload of the contact sheets.
PDF tables
The PDF tables appear in the following manner for this example file :
Page 1:
Page 2:
The tables can be output to xml in Adobe Acrobat 9 and be parsed and stitched together once in XML.
Initial Relation
The easiest way to reconstruct the relational database is to find the the unique key. Each contact sheet does have a unique identifier. An example contact sheet URL is:
http://www.fordlibrarymuseum.gov/library/whphotos/A0032_NLGRF.jpg
The format for the contact sheet url is:
http://www.fordlibrarymuseum.gov/library/whphotos/<identifier>_NLGRF.jpg
This identifier is present in the table and can be used to obtain the contact sheet.
Table Conversion
In order to use the tables in the PDFs, they need to be read. Reading in the PDF format is a challenge. To make the data more accessible, PDFs can be exported to different formats. In this case, the PDFs will be exported as HTML 3.2. HTML 4 w/ CSS introduces color markup which which needlessly grow the file. HTML3.2 supports tables, which is what we're after. XML 1.0 export is lossy, with many of the columns coming out empty.
Instead, the PDFs will be batch exported to HTML3.2 (following a method from here), and the HTML output will be read. Fortunately, the html export preserves the table structure which allows for easy parsing.
To batch export the PDFs to HTML 3.2, you'll need at least Adobe Acrobat Pro 9. Here we'll will use Acrobat Pro 9.1
- Select File->Export->Export Multiple Files
- Click Add Files->add files and add the files to convert. You may also drag 'n drop the files. Press OK.
- Specify the output folder and change the Export To to XML 1.0. And press OK. The files will now be converted.
Because the PDFs were output by Microsoft Access, they are tagged PDFs and the tables are explicitly tagged. In the HTML output, this is evident by the structuring.
<TABLE border=0 cellSpacing=0 cellPadding=2 align=center><TBODY>
<TR>
<TH height=18 vAlign=center width=40 align=left>
<FONT color=#00214d size=+1>Roll # </FONT></TH>
<TH height=18 vAlign=center width=48 align=left>Frames </TH>
<TH height=18 vAlign=center width=38 align=left>Tone </TH>
<TH height=18 vAlign=center width=235 align=left>Subject -Proper </TH>
<TH height=18 vAlign=center width=144 align=left>Subject -Generic </TH>
<TH height=18 vAlign=center width=161 align=left>Names </TH>
<TH height=18 vAlign=center width=100 align=left>Geographic </TH>
<TH height=18 vAlign=center width=82 align=left>Location </TH>
<TH height=18 vAlign=center width=80 align=left>Photographer </TH></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
<A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
<FONT color=#0000ff></B>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>2 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=top width=144 align=left>greeting each other </TD>
<TD height=35 vAlign=top width=161 align=left>Kissinger, Carl Albert </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
<A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
<FONT color=#0000ff>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>3-5 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=top width=144 align=left>talking </TD>
<TD height=35 vAlign=center width=161 align=left>GRF, Chief Justice Warren Burger </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
<A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
<FONT color=#0000ff>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>6-7 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=top width=144 align=left>arm raised </TD>
<TD height=35 vAlign=center width=161 align=left>GRF, Chief Justice Warren Burger </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
<A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
<FONT color=#0000ff>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>8 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=top width=144 align=left>arm raised </TD>
<TD height=35 vAlign=center width=161 align=left>GRF, Chief Justice Warren Burger </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
<A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
<FONT color=#0000ff>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>9-13 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=top width=144 align=left>acknowledging applause </TD>
<TD height=35 vAlign=top width=161 align=left>GRF, Betty Ford </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
<A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
<FONT color=#0000ff>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>14-18 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=top width=144 align=left>Acceptance Speech </TD>
<TD height=35 vAlign=top width=161 align=left>GRF </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
<A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
<FONT color=#0000ff>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>19-21 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=center width=144 align=left>acknowledging applause; walking off stage </TD>
<TD height=35 vAlign=top width=161 align=left>GRF, Betty Ford </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=34 vAlign=top width=40 align=left>
<A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0002_NLGRF.jpg">
<FONT color=#0000ff>A0002 </A></FONT></TD>
<TD height=34 vAlign=top width=48 align=left><FONT color=#000000>1A-4A </FONT></TD>
<TD height=34 vAlign=top width=38 align=left>Color </TD>
<TD height=34 vAlign=center width=235 align=left>
Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=34 vAlign=top width=144 align=left>Swearing In </TD>
<TD height=34 vAlign=center width=161 align=left>GRF, Chief Justice Warren Burger, Audience </TD>
<TD height=34 vAlign=top width=100 align=left></TD>
<TD height=34 vAlign=top width=82 align=left>East Room </TD>
<TD height=34 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
<A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0002_NLGRF.jpg">
<FONT color=#0000ff>A0002 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>5A-13A </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=center width=144 align=left>Acceptance Speech, long shots </TD>
<TD height=35 vAlign=center width=161 align=left>
GRF, Betty Ford, Chief Justice Warren Burger, Audience </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore</TD></TR></TBODY></TABLE>
These tables will need to be stitched together to recreate the database.
Obtaining the PDFs
A chronological list of links to the PDFs is available online.
An example link is:
http://www.fordlibrarymuseum.gov/library/whphotos/19740809whpo.pdf
The format of the links is:
http://www.fordlibrarymuseum.gov/library/whphotos/<year><month><day>whpo.pdf
To download all of the PDFs on the page, we'll use the free Firefox extension DownloadThemAll!
- After going to the web page, right click on the web page and select "DownloadThemAll" from the context menu.
- Enter where you want to save the files. Under Filters, check documents and uncheck everything else. If you scroll down, you'll see the files to be downloaded in green. Click Start! to begin downloading.
- Wait for the files to all download.
Putting it all together
Once all the PDFs are downloaded, it is time to put it together. The PDFs will be batch converted to XML. A program will then go through each XML file, extract the table, and build the table in another XML file.
The created XML database will have the outline:
<root>
<roll id="x">
<frames name="" tone="" subjectproper=""
subjectgeneric="" names="" geographic=""
location="" photographer="" date=""/>
<frames name="" tone="" subjectproper=""
subjectgeneric="" names="" geographic=""
location="" photographer="" date=""/>
</roll>
</root>
The DTD will be:
<!DOCTYPE root [
<!ELEMENT root ( roll+ )>
<!ELEMENT roll ( frames+ )>
<!ATTLIST roll
id ID #REQUIRED
>
<!ELEMENT frames EMPTY>
<!ATTLIST frames
name CDATA #REQUIRED
tone CDATA #REQUIRED
subjectproper CDATA #REQUIRED
subjectgeneric CDATA #REQUIRED
names CDATA #REQUIRED
geographic CDATA #REQUIRED
location CDATA #REQUIRED
photographer CDATA #REQUIRED
date CDATA #REQUIRED
>
]>
The data will be extracted from the tables in the PDF and the PDF name which provides the date. The date format of the PDFs is:
<year><month><day>whpo.pdf
For the program, we'll use VBScript. You use MSHTML to read the HTML files, Msxml2.DOMDocument.6.0 to create the XML document. Msxml2.SAXXMLReader.6.0 and Msxml2.SAXXMLWriter.6.0 are used to pretty print the XML document.
Option Explicit
Dim htmlin: Set htmlin = CreateObject("htmlfile")Dim fs: Set fs = CreateObject("Scripting.FileSystemObject")
Dim XMLout: Set XMLout = CreateObject("Msxml2.DOMDocument.6.0")
XMLout.async = False
Dim root: Set root = XMLout.createElement("root")
XMLout.appendChild(root)
Dim FSO: Set FSO = CreateObject("Scripting.FileSystemObject")
Dim folder: Set folder = FSO.GetFolder("C:\Ford\HTML\")
Dim file
For Each file In folder.Files
Dim oFile: Set oFile = FSO.OpenTextFile(file)
htmlin.open("about:blank")
htmlin.write(oFile.ReadAll())
oFile.Close
Dim filedate: filedate = FSO.GetBaseName(file)
WScript.Echo filedate
filedate = Mid(filedate,1,4) & "-" & _
Mid(filedate,5,2) & "-" & Mid(filedate,7,2)
Dim lastgoodnode
Dim tables: Set tables = htmlin.getElementsByTagName("Table")
Dim table
For Each table In tables
Dim rows: Set rows = table.getElementsByTagName("TR")
Dim row
For Each row In rows
Dim columns: Set columns = row.getElementsByTagName("TD") If columns.length = 0 Then ElseIf columns.length <> 9 Then
WScript.StdOut.WriteLine "Column Error: " & _
file.Path & "|" & row.xml
Else
Dim id: id = Trim(columns.item(0).innerText)
Dim rollid: Set rollid= Nothing
If(id = "") Then
Dim subjectproper: subjectproper = LCase(columns.item(3).innerText)
If Not(InStr(subjectproper,"roll") Or _
InStr(subjectproper,"taken on") Or _
InStr(subjectproper,"empty folder")) Then
Set rollid = lastgoodnode
lastgoodnode.setAttribute "name", _
lastgoodnode.getAttribute("name") & columns.item(1).innerText
lastgoodnode.setAttribute "tone", _
lastgoodnode.getAttribute("tone") & columns.item(2).innerText
lastgoodnode.setAttribute "subjectproper", _
lastgoodnode.getAttribute("subjectproper") & columns.item(3).innerText
lastgoodnode.setAttribute "subjectgeneric", _
lastgoodnode.getAttribute("subjectgeneric") & columns.item(4).innerText
lastgoodnode.setAttribute "names", _
lastgoodnode.getAttribute("names") & columns.item(5).innerText
lastgoodnode.setAttribute "geographic", _
lastgoodnode.getAttribute("geographic") & columns.item(6).innerText
lastgoodnode.setAttribute "location", _
lastgoodnode.getAttribute("location") & columns.item(7).innerText
lastgoodnode.setAttribute "photographer", _
lastgoodnode.getAttribute("photographer") & columns.item(8).innerText
End If
Else
Set rollid = XMLout.SelectSingleNode("/root/roll[@id='" & _
id & "']") id & "']") If rollid Is Nothing Then Set rollid = XMLout.createElement("roll")
rollid.setAttribute "id", id
root.appendChild(rollid)
End If
Dim frames: Set frames = XMLout.createElement("frames")
frames.setAttribute "name", columns.item(1).innerText
frames.setAttribute "tone", columns.item(2).innerText
frames.setAttribute "subjectproper", columns.item(3).innerText
frames.setAttribute "subjectgeneric", columns.item(4).innerText
frames.setAttribute "names", columns.item(5).innerText
frames.setAttribute "geographic", columns.item(6).innerText
frames.setAttribute "location", columns.item(7).innerText
frames.setAttribute "photographer", columns.item(8).innerText
frames.setAttribute "date", filedate
Set lastgoodnode = frames
rollid.appendChild(frames)
End If
End If
Next
Next
Next
WScript.Echo "Saving xml file"
Dim writer: Set writer = CreateObject("Msxml2.MXXMLWriter.6.0")
With writer
.indent = True
.omitXMLDeclaration = False
.standalone = True
End With
Dim reader: set reader = CreateObject("Msxml2.SAXXMLReader.6.0")
With reader
.contentHandler = writer
.putProperty "http://xml.org/sax/properties/lexical-handler", writer
.parse(XMLout)
End With
Dim Stream : Set Stream = CreateObject("ADODB.Stream")
With Stream
.Open
.WriteText writer.output
.SaveToFile ("C:\Ford\data1.xml")
.Close
End With
The resulting database, ~36 MB, can be found at: dropbox.
To Do
Further explain the VBScript code.
History
- December 14, 2012: Initial post.