Introduction
pdf2json extends pdf.js with interactive form elements and runs as a node.js module. It takes a PDF file as input, parses it, and converts it to in-memory objects in node.js. The command-line utility included in the pdf2json module takes the in-memory parsing results and writes them out as a JSON file. This article presents a different runtime context: running pdf2json in a RESTful web service.
When pdf2json runs behind a web service, PDF files can be located and parsed on demand on the server side. The client application, whether a web client, a desktop app, or a mobile app, receives the PDF content in JSON format rather than as a PDF binary, so it can focus on form presentation and data binding/integration instead of loading and parsing PDF binaries. This architecture separates data parsing from presentation, and it also separates the data template (the service JSON payload) from user data (what the user enters into the form). As a result, session data and cache size on the app server can be reduced significantly, and a stateless, session-less service becomes possible for higher scalability and availability.
This project is open source on GitHub under the module name p2jsvc. It is built with pdf2json v0.1.23, restify v2.3.5 and node.js v0.10.1.
Background
To run pdf2json in a REST web service, the node.js built-in web server is leveraged and restify is chosen as the REST API framework. Although restify borrows heavily from express, it gives full control over HTTP interactions with a strictly RESTful service API. The service endpoints of p2jsvc are very simple:
HTTP GET: http:
HTTP POST: http:
content-type: application/json
body: {"folderName":"", "pdfId":""}
The JSON format in the response body is well documented, so I won't repeat it here. Let's dive in to see how the service is built.
Context and Response Class
Before getting to the actual service code, let's briefly look at two helper classes. The first one is the response class:
'use strict';
var SvcResponse = (function () {
var _svcStatusMsg = {200: "OK", 400: "Bad Request", 404: "Not Found"};
var cls = function (code, message, fieldName, fieldValue) {
this.status = {
code: code,
message: message || _svcStatusMsg[code],
fieldName: fieldName,
fieldValue: fieldValue
};
};
cls.prototype.setStatus = function(code, message, fieldName, fieldValue) {
this.status.code = code;
this.status.message = message || _svcStatusMsg[code];
this.status.fieldName = fieldName;
this.status.fieldValue = fieldValue;
};
cls.prototype.destroy = function() {
this.status = null;
};
return cls;
})();
module.exports = SvcResponse;
The actual response class derives from it, so that status is always part of the response payload for both success and error cases. The client always checks status.code before trying to read other properties. In case of an application error (as opposed to a network exception), the client code can construct user-friendly messages based on status.message, status.fieldName and status.fieldValue. One example is a failed user login: the HTTP status from XHR is 200, while status.code in the response body is 401, so the client shows a "try again" message.
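As a rough illustration of the client-side check described above, here is a minimal sketch. The helper name describeStatus and the message wording are hypothetical and not part of p2jsvc; only the shape of the status object is taken from SvcResponse.

```javascript
'use strict';

// Hypothetical client-side helper (not part of p2jsvc): turns the status
// envelope of an already-parsed response body into a simple result object.
function describeStatus(body) {
    var status = body && body.status;
    if (!status) {
        return { ok: false, text: 'Malformed response: no status object' };
    }
    if (status.code === 200) {
        return { ok: true, text: status.message };
    }
    // Application error: the HTTP exchange succeeded, but the service
    // reported a problem inside the payload.
    var text = status.message || 'Unknown error';
    if (status.fieldName) {
        text += ' (' + status.fieldName + ': ' + status.fieldValue + ')';
    }
    return { ok: false, text: text };
}

// Usage with a body shaped like what SvcResponse produces (values made up)
var failed = describeStatus({
    status: { code: 404, message: 'Not Found', fieldName: 'pdfId', fieldValue: 'no_such_form' }
});
console.log(failed.ok, failed.text);
```

The point of the envelope is that the client needs only one code path for all application-level outcomes, whatever the HTTP status was.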
The second helper class is the context class. It wraps the request and response objects and the next function from restify:
'use strict';
var SvcContext = (function () {
var cls = function (req, res, next) {
this.req = req;
this.res = res;
this.next = next;
};
cls.prototype.completeResponse = function(jsObj) {
this.res.send(200, jsObj);
this.next();
};
cls.prototype.destroy = function() {
this.req = null;
this.res = null;
this.next = null;
};
return cls;
})();
module.exports = SvcContext;
Our web service layer sits on top of pdf2json, and pdf2json has no knowledge of web service requests and responses, nor should it. Communication between the two layers therefore relies on node.js events for asynchronous operations. We instantiate a new instance of pdf2json and of SvcContext for each request, and the new SvcContext instance is injected into the pdf2json instance. When the parsing-complete event is raised, the event handler in the service layer uses the SvcContext instance from the event data to complete the response in node.js' non-blocking, asynchronous fashion, so the service can continuously serve other requests while waiting for events from earlier ones.
With SvcResponse and SvcContext, writing a REST service for pdf2json becomes a simple and fun task.
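The context-injection idea can be tried in isolation with node's own EventEmitter. The sketch below is only an illustration under stated assumptions: FakeParser, its finish method and the 'dataReady' event name are hypothetical stand-ins, not pdf2json's API, and the contexts are plain objects rather than SvcContext instances.

```javascript
'use strict';
var EventEmitter = require('events').EventEmitter;
var util = require('util');

// Hypothetical stand-in for a parser: it knows nothing about HTTP and only
// carries an opaque context object along with the events it emits.
function FakeParser(context) {
    EventEmitter.call(this);
    this.context = context;
}
util.inherits(FakeParser, EventEmitter);

// Simulates the end of a parse; a real parser would emit after async I/O.
FakeParser.prototype.finish = function (data) {
    this.emit('dataReady', { context: this.context, data: data });
};

// Service-layer handler: one shared function can serve every request,
// because the per-request state arrives inside the event data.
function onDataReady(evtData) {
    evtData.context.completed = evtData.data;
}

// Two "requests", each with its own context and its own parser instance
var ctxA = { id: 'A' };
var ctxB = { id: 'B' };
var parserA = new FakeParser(ctxA);
var parserB = new FakeParser(ctxB);
parserA.on('dataReady', onDataReady);
parserB.on('dataReady', onDataReady);

// Completion order does not have to match creation order
parserB.finish('result for B');
parserA.finish('result for A');
console.log(ctxA.completed, '|', ctxB.completed);
```

Because each event carries its own context, the handler never needs a lookup table to find the request a result belongs to.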
Create and Configure the Server
restify does the heavy lifting to create and configure the server:
var server = restify.createServer({
name: self.get_name(),
version: self.get_version()
});
server.use(restify.acceptParser(server.acceptable));
server.use(restify.authorizationParser());
server.use(restify.dateParser());
server.use(restify.queryParser());
server.use(restify.bodyParser());
server.use(restify.jsonp());
server.use(restify.gzipResponse());
server.pre(restify.pre.userAgentConnection());
Some of restify's built-in handlers are configured to handle requests, including:
- Accept header parsing
- Authorization header parsing
- Date header parsing
- JSONP support
- Gzip Response
- Query string parsing
- Body parsing (JSON/URL-encoded/multipart form)
Since I'm using curl to test the service APIs, pre.userAgentConnection() is configured to check whether the user agent is curl. If it is, it sets the Connection header to "close" and removes the "Content-Length" header. Without it, curl uses Connection: keep-alive by default.
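To make that behavior concrete, a handler with roughly the same effect could look like the sketch below. This is not restify's source code: closeForCurl and the mock response object are hypothetical, and the sketch only mirrors the behavior described above.

```javascript
'use strict';

// Hypothetical pre-handler, NOT restify's implementation: it mimics the
// described effect for requests whose user agent starts with "curl".
function closeForCurl(req, res, next) {
    var ua = req.headers['user-agent'] || '';
    if (/^curl/.test(ua)) {
        res.setHeader('Connection', 'close');
        res.removeHeader('Content-Length');
    }
    next();
}

// Minimal mock of a response object, just enough to observe the headers
function mockRes() {
    var headers = { 'Content-Length': '42' };
    return {
        headers: headers,
        setHeader: function (name, value) { headers[name] = value; },
        removeHeader: function (name) { delete headers[name]; }
    };
}

var curlRes = mockRes();
var nextCalled = false;
closeForCurl({ headers: { 'user-agent': 'curl/7.29.0' } }, curlRes, function () {
    nextCalled = true;
});
console.log(curlRes.headers);
```

Requests from other user agents pass through untouched, so browsers keep their keep-alive connections.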
Route the Request and Start the Server
As discussed earlier, we'd like to support both GET and POST for a PDF resource. We also want to create a new SvcContext instance for each request and then call pdf2json to parse the PDF asynchronously, so that the server is not blocked while earlier requests are still in progress:
server.get('/p2jsvc/:folderName/:pdfId', function(req, res, next) {
_gfilter(new SvcContext(req, res, next));
});
server.post('/p2jsvc', function(req, res, next) {
_gfilter(new SvcContext(req, res, next));
});
server.get('/p2jsvc/status', function(req, res, next) {
var jsObj = new SvcResponse(200, "OK", server.name, server.version);
res.send(200, jsObj);
return next();
});
server.listen(8001, function() {
nodeUtil.log(nodeUtil.format('%s listening at %s', server.name, server.url));
});
Each GET or POST request is routed to the same _gfilter function with a new instance of SvcContext. The '/p2jsvc/status' route simply returns an HTTP 200 response without parsing a PDF; it can be used for health-check calls from service monitoring tools.
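The two PDF routes accept the same two parameters in different places: in the URL path for GET and in the JSON body for POST. The sketch below builds both request shapes side by side. The helper names buildGet and buildPost are hypothetical, and host and port are deliberately left out; only the paths and the body fields come from the routes and endpoint description above.

```javascript
'use strict';

// Hypothetical helpers: describe equivalent GET and POST requests for one PDF.
function buildGet(folderName, pdfId) {
    return {
        method: 'GET',
        // Parameters travel as path segments, matching '/p2jsvc/:folderName/:pdfId'
        path: '/p2jsvc/' + encodeURIComponent(folderName) + '/' + encodeURIComponent(pdfId)
    };
}

function buildPost(folderName, pdfId) {
    return {
        method: 'POST',
        path: '/p2jsvc',
        headers: { 'Content-Type': 'application/json' },
        // Parameters travel in the JSON body instead of the path
        body: JSON.stringify({ folderName: folderName, pdfId: pdfId })
    };
}

var getReq = buildGet('data', 'xfa_1040ez');
var postReq = buildPost('data', 'xfa_1040ez');
console.log(getReq.method, getReq.path);
console.log(postReq.method, postReq.path, postReq.body);
```

Presumably the same handler can serve both because the configured body parser makes the POST fields available alongside the path parameters, but that is an assumption about restify's defaults rather than something this article's code states.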
Process the Request
Every PDF parsing request is processed with a new instance of pdf2json, whose class name is PDFParser:
var _gfilter = function(svcContext) {
var req = svcContext.req;
var folderName = req.params.folderName;
var pdfId = req.params.pdfId;
nodeUtil.log(self.get_name() + " received request:" + req.method + ":" + folderName + "/" + pdfId);
var pdfParser = new PFParser(svcContext);
_customizeHeaders(svcContext.res);
pdfParser.on("pdfParser_dataReady", _.bind(_onPFBinDataReady, self));
pdfParser.on("pdfParser_dataError", _.bind(_onPFBinDataError, self));
pdfParser.loadPDF(_pdfPathBase + folderName + "/" + pdfId + ".pdf");
};
When a new instance of PDFParser is created, the svcContext instance is passed into it. When the "pdfParser_dataReady" or "pdfParser_dataError" event is raised, the event handler can access the original request and response objects to complete the response. This setup, with a new instance per request, an injected context, and event-based completion, is essential to the throughput and performance of our service.
Complete the Response
The response is completed when either the "pdfParser_dataReady" or the "pdfParser_dataError" event is raised from the pdf2json instance. This is done via a new instance of SvcResponse:
var _onPFBinDataReady = function(evtData) {
var resData = new SvcResponse(200, "OK", evtData.pdfFilePath, "FormImage JSON");
resData.formImage = evtData.data;
evtData.context.completeResponse(resData);
};
var _onPFBinDataError = function(evtData){
nodeUtil.log(this.get_name() + " 500 Error: " + JSON.stringify(evtData.data));
evtData.context.completeResponse(new SvcResponse(500, JSON.stringify(evtData.data)));
};
If parsing is successful, the parsing result is serialized to JSON when context.completeResponse(resData) is invoked. The service layer code handles all service-related tasks, including the server, request and response handling, invoking PDFParser asynchronously, and serializing the parsing result to JSON. The pdf2json instance, meanwhile, works in a context-agnostic way, so that it can be reused either in a web service project or as a command-line tool.
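To see what the two handlers put on the wire, the sketch below serializes a success payload and an error payload. It uses a compact stand-in for SvcResponse, and the event data (file path, the Pages key, the error message) is made up for illustration rather than taken from real pdf2json output.

```javascript
'use strict';

// Compact stand-in for the SvcResponse class shown earlier
function SvcResponse(code, message, fieldName, fieldValue) {
    this.status = { code: code, message: message, fieldName: fieldName, fieldValue: fieldValue };
}

// Hypothetical event data, shaped like what the two handlers receive
var okEvt = { pdfFilePath: 'data/sample.pdf', data: { Pages: [] } };
var errEvt = { data: { message: 'file not found' } };

// Success: status envelope plus the parsing result under formImage
var okBody = new SvcResponse(200, 'OK', okEvt.pdfFilePath, 'FormImage JSON');
okBody.formImage = okEvt.data;

// Error: only the envelope, with the error details stringified into message
var errBody = new SvcResponse(500, JSON.stringify(errEvt.data));

var okJson = JSON.stringify(okBody);
var errJson = JSON.stringify(errBody);
console.log(okJson);
console.log(errJson);
```

Two details are worth noticing. JSON.stringify drops properties whose value is undefined, so an error payload carries no fieldName or fieldValue keys. And because the error details are stringified before they go into message, a client that wants them as an object has to parse message a second time.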
Cross Domain Support
In my project, the web server and app server run on separate VMs with different host names and sub-domains. p2jsvc is deployed to the app server, while my Backbone-based web client is deployed to the web server and communicates with the app server through Ajax. To support this cross-domain (or cross-sub-domain) server configuration, an Apache proxy is configured in httpd.conf on the web server:
<IfModule proxy_module>
ProxyRequests Off
ProxyPass /p2jsvc/ http:
ProxyPassReverse /p2jsvc/ http:
</IfModule>
Additionally, p2jsvc supports JSONP (in the server configuration) and Cross-Origin Resource Sharing (CORS):
var _customizeHeaders = function(res) {
res.header("Access-Control-Allow-Origin", "*");
res.header("Access-Control-Allow-Headers", "X-Requested-With");
res.header("Cache-Control", "no-cache, must-revalidate");
};
Run and Test the Service
Here is a quick command reference for running and testing the service. For installation:
git clone https:
cd p2jsvc
npm install
To start the server for development:
cd p2jsvc
node index
If the server starts successfully, you should see this message in the console:
[time_stamp] - PDFFORMServer1 listening at http:
When starting the server on a production machine, I use forever to run it as a background process:
cd p2jsvc
forever start index.js
To run the test with HTTP GET:
curl -isv http:
curl -isv http:
curl -isv http:
Those xfa_xxx.pdf files are test PDFs; you can replace them with your own under the data directory. Similarly, you can test with POST:
curl -isv -H "Content-Type: application/json" -X POST -d '{"folderName":"data", "pdfId":"xfa_1040ez"}' http:
curl -isv -H "Content-Type: application/json" -X POST -d '{"folderName":"data", "pdfId":"xfa_1040a"}' http:
curl -isv -H "Content-Type: application/json" -X POST -d '{"folderName":"data", "pdfId":"xfa_1040"}' http:
Lastly, here is the curl command to check the service status:
curl -isv http:
When the service is up and running correctly, the response JSON body should be:
{"status":{"code":200,"message":"OK","fieldName":"PDFFORMServer1"}}
The following commands each send 10 concurrent requests to parse PDFs as a concurrency benchmark test:
ab -n 10 -c 10 http:
ab -n 10 -c 10 http:
ab -n 10 -c 10 http:
Wrap Up
Exposing pdf2json through a REST interface is fairly simple yet powerful with restify. Although this article is all about running pdf2json in a RESTful web service project, its context- and event-based asynchronous model is applicable to other restify-based web service projects. I hope you find it useful too.