Introduction
In my last post, Blend PDF with HTML5, we discussed how PDF.js can be extended with interactive form elements and how to render them in HTML5. In this article, I'll discuss how to port the client side PDF.js to Node.js, so that we can parse PDF on the server and output a lightweight JSON data format.
Moving PDF parsing to Node.js simplifies client side rendering and makes it more flexible, because the browser no longer needs to load the entire PDF.js library, and JavaScript Typed Array, XHR Level 2, Canvas and other HTML5 features are no longer prerequisites. In a later post, I'll discuss how to make the client HTML5 renderer work in browsers where PDF.js currently doesn't; this post focuses on porting PDF.js to Node.js and on how the parsing result's data format is defined and structured.
When PDF.js is ported to Node.js, the interactive form elements parsing extension still works without much change, since we effectively just swapped the browser's JavaScript virtual machine for Node.js and the Google V8 engine. Another bonus of running in Node.js is that we can run the extended PDF.js as a command line utility to convert a PDF to a JSON file. When dynamic PDF parsing is desired, you can also run the library in a web service.
This project is open sourced on GitHub under the module name pdf2json. You can also install pdf2json via npm to try it out.
Background
One of my projects has literally hundreds of e-forms already created in PDF. A form based user interface for data collection and presentation is desired, and it must run as a web application. Those PDF e-form files can be updated quite often during the data collection season. Since PDF is already a standard electronic format, we don't have to re-create those forms with other tools or processes; all we need is a generic form processor that can parse and render them directly, in order to bring hundreds of forms online together with other data processing and service integrations.
My first attempt at this "generic form processor" was a pure client side solution: it extends PDF.js with form element parsing and rendering, as documented in my earlier post. It's a very efficient and practical approach that enabled us to bring lots of PDF forms into our web application in a timely manner, and it handles interface integration, form live updates, scalability and data service integration very well. Although it runs well in modern browsers, as the project grew we found it cumbersome and troublesome to support older browsers, because the client renderer relies on new HTML5 capabilities, including JavaScript Typed Array, XHR Level 2, Web Worker and HTML5 Canvas. This prompted me to come up with a broader-reach approach that builds upon what we already have and supports older browsers in a seamless and transparent way.
The idea is to move PDF parsing to the server: when the client needs a form template, it sends a request with a form ID to a web service. The service locates the PDF, parses it, then sends back a response with a JSON payload that represents the PDF form. This way, the client can focus on processing the JSON response rather than parsing PDF binaries in the browser, so it can better handle user experience and data service integration via Ajax in a cross platform and cross browser way: all browsers, whether modern ones with the latest HTML5 capabilities or older ones like IE7 and IE8, can deliver a consistent user experience and interaction.
Architecturally, moving PDF parsing to the server follows the "separation of concerns" principle. With a form template service, the client focuses on cross browser rendering, while the server focuses on how to retrieve the form definition without worrying about how that information is presented. Between the client and server sits a data contract: JSON that represents the PDF form in text format. Technically, as long as the JSON format stays the same, forms that are not defined in PDF have no impact on the client; only the parsing provider needs an update.
Additionally, moving PDF parsing to the server has other benefits. For frequently updated PDF forms, we can compare or diff the parsing responses, since JSON is text while PDF binaries are difficult to diff between versions. For relatively stable forms, we can pre-process them by converting PDF to JSON files and simply deploy the JSON files to the web server; this saves server side CPU cycles at runtime, offers higher scalability, and requires no change to the client renderer.
Another advantage is that it enables separating user data from template data (form templates are the same for all users). Form template data can be taken out of the session, simplifying and minimizing the user session to improve scalability. One parsing output for a PDF form can be cached and reused across users and sessions.
Porting PDF.js to Node.js solves the older browser problem while keeping the benefits above. Extending PDF.js ensures we still have interactive form element parsing: form content can still be defined and edited in PDF while rendering can still run in HTML5. The challenges are the dependencies of PDF.js and how to define a concise text based data contract between client and server.
Because PDF.js is designed and developed as a client side library, it has dependencies that are not available in the Node.js runtime, like XHR Level 2, Web Worker and HTML5 Canvas. Within PDF.js, parsing and rendering are intertwined. Plus, the parsing output should be as concise as possible to reduce bandwidth usage, while still being sufficient to represent all the information needed to render the form on the client. Let's talk about how to handle these in detail.
Handling Dependencies
As a client side library, PDF.js depends on some new HTML5 capabilities. We need to address all of them when porting to Node.js, since neither Node.js nor the Google V8 engine implements them, including:
- XHR Level 2 - transporting binary data via Ajax
- DOMParser - parsing embedded XML metadata from PDF
- Web Worker - running the parsing work in a separate thread
- Canvas - drawing lines, fills, colors, shapes and text in browser
- Others - like web fonts, canvas image, DOM manipulations, etc.
In order to port PDF.js to Node.js, we have to address those dependencies and also extend and modify PDF.js itself. Below is a brief introduction to the work done in pdf2json:
Move Global Variables to Module
Without the global window object, all global variables in PDF.js (like PDFJS and globalScope) need to be wrapped in the node module's scope. The global variables defined in core.js are moved to /pdf.js:
var PDFJS = {};
var globalScope = {};
The entire PDF.js library is wrapped in one Node.js module, named PDFJSClass, implemented in /pdf.js:
var PDFJSClass = (function () {
'use strict';
var _nextId = 1;
var _name = 'PDFJSClass';
var cls = function () {
nodeEvents.EventEmitter.call(this);
var _id = _nextId++;
this.get_id = function() { return _id; };
this.get_name = function() { return _name + _id; };
this.pdfDocument = null;
this.formImage = null;
};
nodeUtil.inherits(cls, nodeEvents.EventEmitter);
cls.prototype.parsePDFData = function(arrayBuffer) {
var parameters = {password: '', data: arrayBuffer};
this.pdfDocument = null;
this.formImage = null;
var self = this;
PDFJS.getDocument(parameters).then(
function getDocumentCallback(pdfDocument) {
self.load(pdfDocument, 1);
},
function getDocumentError(message, exception) {
nodeUtil._logN.call(self, "An error occurred while parsing the PDF: " + message);
},
function getDocumentProgress(progressData) {
nodeUtil._logN.call(self, "Loading progress: " + (progressData.loaded / progressData.total * 100) + "%");
}
);
};
cls.prototype.load = function(pdfDocument, scale) {
this.pdfDocument = pdfDocument;
var pages = this.pages = [];
this.pageWidth = 0;
var pagesCount = pdfDocument.numPages;
var pagePromises = [];
for (var i = 1; i <= pagesCount; i++)
pagePromises.push(pdfDocument.getPage(i));
var self = this;
var pagesPromise = PDFJS.Promise.all(pagePromises);
nodeUtil._logN.call(self, "PDF loaded. pagesCount = " + pagesCount);
pagesPromise.then(function(promisedPages) {
self.parsePage(promisedPages, 0, 1.5);
});
pdfDocument.getMetadata().then(function(data) {
var info = data.info, metadata = data.metadata;
self.documentInfo = info;
self.metadata = metadata;
var pdfTile = "";
if (metadata && metadata.has('dc:title')) {
pdfTile = metadata.get('dc:title');
}
else if (info && info['Title'])
pdfTile = info['Title'];
self.emit("pdfjs_parseDataReady", {Agency:pdfTile, Id: info});
});
};
cls.prototype.parsePage = function(promisedPages, id, scale) {
nodeUtil._logN.call(this, "start to parse page:" + (id+1));
var self = this;
var pdfPage = promisedPages[id];
var pageParser = new PDFPageParser(pdfPage, id, scale);
pageParser.parsePage(function() {
if (!self.pageWidth)
self.pageWidth = pageParser.width;
PDFField.checkRadioGroup(pageParser.Boxsets);
var page = {Height: pageParser.height,
HLines: pageParser.HLines,
VLines: pageParser.VLines,
Fills:pageParser.Fills,
Texts: pageParser.Texts,
Fields: pageParser.Fields,
Boxsets: pageParser.Boxsets
};
self.pages.push(page);
if (id == self.pdfDocument.numPages - 1) {
nodeUtil._logN.call(self, "complete parsing page:" + (id+1));
self.emit("pdfjs_parseDataReady", {Pages:self.pages, Width: self.pageWidth});
}
else {
process.nextTick(function(){
self.parsePage(promisedPages, ++id, scale);
});
}
});
};
cls.prototype.destroy = function() {
this.removeAllListeners();
if (this.pdfDocument)
this.pdfDocument.destroy();
this.pdfDocument = null;
this.formImage = null;
};
return cls;
})();
module.exports = PDFJSClass;
Replace XHR Level 2 with FS
We don't need Ajax to load the PDF binary asynchronously in Node.js; it's replaced with node's fs (File System) module, which loads the PDF file from the file system. pdfparser.js is the entry point of the pdf2json module; here is its code:
var nodeUtil = require("util"),
nodeEvents = require("events"),
_ = require("underscore"),
fs = require('fs'),
PDFJS = require("./pdf.js");
nodeUtil._logN = function logWithClassName(msg) { nodeUtil.log(this.get_name() + " - " + msg);};
nodeUtil._backTrace = function logCallStack() {
try {
throw new Error();
} catch (e) {
var msg = e.stack ? e.stack.split('\n').slice(2).join('\n') : '';
nodeUtil.log(msg);
}
};
var PDFParser = (function () {
'use strict';
var _nextId = 1;
var _name = 'PDFParser';
var _binBuffer = {};
var _maxBinBufferCount = 10;
var cls = function (context) {
nodeEvents.EventEmitter.call(this);
var _id = _nextId++;
this.get_id = function() { return _id; };
this.get_name = function() { return _name + _id; };
this.context = context;
this.pdfFilePath = null;
this.data = null;
this.PDFJS = new PDFJS();
this.parsePropCount = 0;
};
nodeUtil.inherits(cls, nodeEvents.EventEmitter);
cls.get_nextId = function () {
return _name + _nextId;
};
var _onPDFJSParseDataReady = function(data) {
_.extend(this.data, data);
this.parsePropCount++;
if (this.parsePropCount >= 2) {
this.emit("pdfParser_dataReady", this);
nodeUtil._logN.call(this, "PDF parsing completed.");
}
};
var startPasringPDF = function() {
this.data = {};
this.parsePropCount = 0;
this.PDFJS.on("pdfjs_parseDataReady", _.bind(_onPDFJSParseDataReady, this));
this.PDFJS.parsePDFData(_binBuffer[this.pdfFilePath]);
};
var processBinaryCache = function() {
if (_.has(_binBuffer, this.pdfFilePath)) {
startPasringPDF.call(this);
return true;
}
var allKeys = _.keys(_binBuffer);
if (allKeys.length > _maxBinBufferCount) {
var idx = this.get_id() % _maxBinBufferCount;
var key = allKeys[idx];
_binBuffer[key] = null;
delete _binBuffer[key];
nodeUtil._logN.call(this, "re-cycled cache for " + key);
}
return false;
};
var processPDFContent = function(err, data) {
nodeUtil._logN.call(this, "Load PDF file status:" + (!!err ? "Error!" : "Success!") );
if (err) {
this.data = err;
this.emit("pdfParser_dataError", this);
}
else {
_binBuffer[this.pdfFilePath] = data;
startPasringPDF.call(this);
}
};
cls.prototype.loadPDF = function (pdfFilePath) {
var self = this;
self.pdfFilePath = pdfFilePath;
nodeUtil._logN.call(this, " is about to load PDF file " + pdfFilePath);
if (processBinaryCache.call(this))
return;
fs.readFile(pdfFilePath, _.bind(processPDFContent, self));
};
cls.prototype.destroy = function() {
this.removeAllListeners();
if (this.context) {
this.context.destroy();
this.context = null;
}
this.pdfFilePath = null;
this.data = null;
this.PDFJS.destroy();
this.PDFJS = null;
this.parsePropCount = 0;
};
return cls;
})();
module.exports = PDFParser;
Use XMLDOM for DOMParser
pdf.js instantiates DOMParser to parse XML based PDF metadata, I replace it with xmldom module; PDFJSClass
has method of load
, when it invokes pdfDocument.getMetadata
(see above for details), xmldom
will kick in to parse out XML metadata.
var DOMParser = require('xmldom').DOMParser;
function Metadata(meta) {
if (typeof meta === 'string') {
meta = fixMetadata(meta);
var parser = new DOMParser();
meta = parser.parseFromString(meta, 'application/xml');
} else if (!(meta instanceof Document)) {
error('Metadata: Invalid metadata object');
}
this.metaDocument = meta;
this.metadata = {};
this.parse();
}
Fake Worker for Web Worker
PDF.js falls back to a "fake worker" when Web Worker is not available. That code is built in, so not much work needs to be done; just be aware that the parsing occurs in the same thread, no longer in a background worker thread. From my tests, parsing performance is not a concern: whether running as a web service or from the command line, a regular PDF form (fewer than 8 pages) is usually parsed and serialized within a couple of hundred milliseconds.
'use strict';
globalScope.postMessage = function WorkerTransport_postMessage(obj) {
console.log("Inside globalScope.postMessage:" + JSON.stringify(obj));
};
The idea of the "fake worker" is to create a JavaScript object that has the same APIs as Web Worker, like postMessage, onmessage, terminate, etc., so the caller code can invoke the same API without change, while the callee simply does the work in the same thread. This "same API on a plain object" technique also applies to Canvas.
PDFCanvas with HTML5 Canvas API
This is where I spent most of my time, because PDF.js relies heavily on canvas to draw lines, fills, colors, shapes and text for screen output, while there is no canvas in Node.js. The goal of the port is to change the output from the screen to in-memory objects, which we can then serialize to a JSON string. In order to keep the pdf.js "drawing" code as intact as possible, PDFCanvas was created to handle all "drawing" instructions; it intercepts each operation and creates JavaScript objects rather than drawing onto the screen.
For example, in pdfcanvas.js, we have helper methods like:
var _drawPDFLine = function(p1, p2, lineWidth) {
var pL = new PDFLine(p1.x, p1.y, p2.x, p2.y, lineWidth);
pL.processLine(this.canvas);
};
var _drawPDFFill = function(cp, min, max, color) {
var width = max.x - min.x;
var height = max.y - min.y;
var pF = new PDFFill(cp.x, cp.y, width, height, color);
pF.processFill(this.canvas);
};
var contextPrototype = CanvasRenderingContext2D_.prototype;
contextPrototype.setFont = function(fontObj) {
if ((!!this.currentFont) && _.isFunction(this.currentFont.clean)) {
this.currentFont.clean();
this.currentFont = null;
}
this.currentFont = new PDFFont(fontObj);
};
contextPrototype.fillText = function(text, x, y, maxWidth, fontSize) {
var str = text.trim();
if (str.length < 1)
return;
var p = this.getCoords_(x, y);
var a = processStyle(this.fillStyle || this.strokeStyle);
var color = (!!a) ? a.color : '#000000';
this.currentFont.processText(p, text, maxWidth, color, fontSize, this.canvas, this.m_);
};
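The interception idea reduces to a tiny self-contained sketch: an object with the same method names as CanvasRenderingContext2D that records operations into in-memory collections instead of painting pixels. The method names below mirror the canvas API, while the data shapes are simplified from pdf2json's:

```javascript
// A recording "canvas context": same method names as the HTML5 canvas
// 2D context, but each call appends a JavaScript object to a target
// collection instead of drawing to a screen.
function RecordingContext(targetData) {
  this.canvas = targetData; // { HLines: [], VLines: [], Fills: [], Texts: [] }
}
RecordingContext.prototype.fillRect = function (x, y, w, h) {
  // A filled rectangle becomes a "fill" object.
  this.canvas.Fills.push({ x: x, y: y, w: w, h: h });
};
RecordingContext.prototype.fillText = function (text, x, y) {
  // Drawn text becomes a text block with one text run.
  this.canvas.Texts.push({ x: x, y: y, R: [{ T: text }] });
};

var page = { HLines: [], VLines: [], Fills: [], Texts: [] };
var ctx = new RecordingContext(page);
ctx.fillRect(0, 0, 10, 5);
ctx.fillText('Hello', 2, 3);
```

After "drawing", the page object holds everything needed for serialization, which is exactly what PDFCanvas hands back to the parser.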
Take PDFFill for example: it's implemented in pdffill.js, and when processFill is invoked, it generates a "fill" object and inserts it into the targetData.Fills collection:
var nodeUtil = require("util"),
_ = require("underscore"),
PDFUnit = require('./pdfunit.js');
var PDFFill = (function PFPLineClosure() {
'use strict';
var _nextId = 1;
var _name = 'PDFFill';
var cls = function (x, y, width, height, color) {
var _id = _nextId++;
this.get_id = function() { return _id; };
this.get_name = function() { return _name + _id; };
this.x = x;
this.y = y;
this.width = width;
this.height = height;
this.color = color;
};
cls.get_nextId = function () {
return _name + _nextId;
};
cls.prototype.processFill = function (targetData) {
var clrId = PDFUnit.findColorIndex(this.color);
var oneFill = {x:PDFUnit.toFormX(this.x),
y:PDFUnit.toFormY(this.y),
w:PDFUnit.toFormX(this.width),
h:PDFUnit.toFormY(this.height),
clr: clrId};
targetData.Fills.push(oneFill);
};
return cls;
})();
module.exports = PDFFill;
Other PDF processing classes are implemented in a similar way. I won't list more code here, but here are some pointers to them:
- pdfcanvas.js: replacement for HTML5 Canvas in Node.js;
- pdffield.js: generating interactive form fields (text input, radio button, push button, check boxes, combo boxes, etc.);
- pdffill.js: creating fill data structure (a rectangular area with color)
- pdffont.js: matching font families and processing text content;
- pdfline.js: data structure for horizontal and vertical lines;
- pdfunit.js: unit conversion and color definition
Extension and modifications to PDF.js
In addition to changing or replacing dependencies, I also needed to extend or modify some code in PDF.js to fit the general purpose of parsing PDF in Node.js, including:
Fonts
There is no need to call ensureFonts to make sure fonts are downloaded, since we're not running in a browser. We only need to parse out the font info and set it in the JSON's texts array. Embedded/glyph fonts are ignored and mapped to a generic font family based on the font name, so the parsing output doesn't always match the original PDF fonts, although font styles (size, bold, italic) are preserved. More details are in pdffont.js.
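The name-to-family mapping can be pictured with a short sketch. pdffont.js uses its own tables and rules, so the function name and the matching rules below are purely illustrative:

```javascript
// Hypothetical sketch: map an embedded PDF font name to a generic
// font family while preserving style flags. The rules here are
// illustrative only; pdffont.js has its own mapping tables.
function toGenericFontInfo(fontObj) {
  var name = (fontObj.name || '').toLowerCase();
  var family = 'sans-serif'; // default family
  if (name.indexOf('times') >= 0 || name.indexOf('georgia') >= 0) {
    family = 'serif';
  } else if (name.indexOf('courier') >= 0 || name.indexOf('mono') >= 0) {
    family = 'monospace';
  }
  return {
    family: family,
    size: fontObj.size,              // font size is preserved
    bold: name.indexOf('bold') >= 0, // style flags are preserved
    italic: name.indexOf('italic') >= 0 || name.indexOf('oblique') >= 0
  };
}

var info = toGenericFontInfo({ name: 'TimesNewRoman-BoldItalic', size: 12 });
```

The point is the trade-off: the exact glyphs are lost, but the family, size and style survive into the JSON output, which is enough for form rendering.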
DOM
All DOM manipulation code in pdf.js is commented out, including the creation of canvas and div elements for screen rendering and font downloading. We leave rendering related tasks to the client renderer, so pdf2json can focus on providing form template data.
Form Elements
We extended PDF.js with interactive form element parsing in my earlier post on Blend PDF with HTML5. Although that was done as a client side library, it's still applicable when moving to Node.js, so I won't repeat those details here. When outputting form elements to in-memory objects, pdffield.js has all the data structures and operations to handle them.
Embedded Images
Since my use cases primarily focus on parsing PDF based electronic forms, I intentionally left out all embedded media, including embedded fonts and images. You can add them back if your project requires them.
After the changes and extensions listed above, the pdf2json Node.js module works either in a server environment or as a standalone command line tool. I have a RESTful web service built with restify and pdf2json that has been running on an Amazon EC2 instance, while the command line utility works similarly to the Vows unit tests.
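The service side mostly amounts to mapping a requested form ID to a PDF path, running the parser, and writing the JSON payload back. The restify wiring is routine, so only the synchronous ID-to-path step is shown runnable here; the route shape and the '/var/forms' directory layout are hypothetical:

```javascript
// Map a requested form ID to a PDF file path. Rejecting IDs with
// unexpected characters keeps a request from escaping the forms
// directory. The '/var/forms' layout is a made-up example.
function formIdToPdfPath(baseDir, formId) {
  if (!/^[A-Za-z0-9_-]+$/.test(formId)) {
    return null; // invalid or potentially malicious ID
  }
  return baseDir + '/' + formId + '.pdf';
}

// In a restify (or plain http) handler, the flow would roughly be:
//   1. var pdfPath = formIdToPdfPath('/var/forms', req.params.id);
//   2. pdfParser.loadPDF(pdfPath);
//   3. on "pdfParser_dataReady", respond with the parser's JSON data.

var good = formIdToPdfPath('/var/forms', 'w4-2013');
var bad = formIdToPdfPath('/var/forms', '../etc/passwd');
```

Keeping this mapping strict matters because the form ID arrives from the network, while the parser reads straight from the file system.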
Output Format
Once we run PDF.js in Node.js, the parsing output's JSON format becomes the data contract between the client renderer and the PDF parser. Generally, each PDF parsing output has the following data structured in JSON:
- 'Agency': the main text identifier for the PDF document
- 'Id': the XML metadata that is embedded in the PDF document
- 'Pages': an array of 'Page' objects that describe each page in the PDF, including sizes, lines, fills and texts within the page. More info about the 'Page' object can be found in the 'Page Object Reference' section
- 'Width': the PDF page width in page units
Each page object within the 'Pages' array describes the page elements and attributes with 5 main fields:
- 'Height': height of the page in page units
- 'HLines': horizontal line array; each line has 'x' and 'y' in relative coordinates for positioning, plus 'w' for width and 'l' for length, both in page units
- 'VLines': vertical line array; each line has 'x' and 'y' in relative coordinates for positioning, plus 'w' for width and 'l' for length, both in page units
- 'Fills': an array of rectangular areas with solid color fills; as with lines, each 'fill' object has 'x' and 'y' in relative coordinates for positioning, 'w' and 'h' for width and height in page units, plus 'clr' referencing a color by index in the color dictionary. More info about the 'color dictionary' can be found in the 'Dictionary Reference' section
- 'Texts': an array of text blocks with position, actual text and styling information:
  - 'x' and 'y': relative coordinates for positioning
  - 'clr': a color index into the color dictionary, the same 'clr' field as in the 'Fill' object
  - 'A': text alignment
  - 'R': an array of text runs, each with two main fields:
    - 'T': the actual text
    - 'S': a style index into the style dictionary. More info about the 'Style Dictionary' can be found in the 'Dictionary Reference' section
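Putting the fields together, a hand-written sample payload (not produced from a real PDF; all values and dictionary indices are made up for illustration) looks like this:

```javascript
// A hand-written sample of the output shape described above.
var sample = {
  Agency: 'Sample Form',                        // main text identifier
  Id: { Title: 'Sample Form' },                 // embedded metadata
  Width: 40,                                    // page width, page units
  Pages: [{
    Height: 52,
    HLines: [{ x: 2.5, y: 5.0, w: 1, l: 30 }],  // one horizontal line
    VLines: [{ x: 2.5, y: 5.0, w: 1, l: 10 }],  // one vertical line
    Fills:  [{ x: 0, y: 0, w: 40, h: 2, clr: 1 }],
    Texts:  [{ x: 3, y: 6, clr: 0, A: 'left',
               R: [{ T: 'Employee name', S: 2 }] }],
    Fields: [],
    Boxsets: []
  }]
};
// Serialized with JSON.stringify, this object is the payload the
// parser hands to the client renderer.
var payload = JSON.stringify(sample);
```

Because everything is plain text, two versions of a form can be diffed line by line, which was one of the motivations for moving parsing to the server in the first place.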
More documentation on the output format, including the style dictionary (used to reduce payload size), form element data definitions, text input formatters, styling without the style dictionary, rotated text support and a list of known issues, can be found on the pdf2json project page.
Wrap Up
Porting PDF.js to Node.js enables use cases like parsing PDF forms in a web service or with a command line utility. Extending PDF.js with form element parsing brings interactivity to PDF.js, and ultimately enables a generic form based user interface whose experience can be built efficiently from existing PDF forms; it also makes integrating data services via Ajax more efficient and flexible. It serves my project with hundreds of PDF forms very well, and I hope it's useful for you too.