Introduction
In this article, I walk through how I built a simple Windows service that watches a folder for incoming PDF documents (from a scanner, for example), then renames each file and moves it to a designated folder depending on its contents. The solution uses regular expressions both to decide where to move the document (identification) and to extract information that is useful for naming the file, such as invoice date, customer name, etc.
Background
After buying a new scanner (the excellent ScanSnap IX500) to digitize over 2000 pages of old invoices and other documents, I was faced with the problem of sorting all the scanned files. Doing it by hand would have been far too boring, so I decided to solve the problem programmatically instead.
After playing around with the scanner and the software that came with it, I found that using high resolution scanning for the OCR and then scaling down the image for the actual PDF was the best way to go to get good OCR quality.
The OCR is done by the ABBYY engine, which places a transparent text layer over the corresponding parts of the image, creating a PDF in which you can select and copy text.
So, where ABBYY leaves off, I'm left with a "searchable PDF", which in turn needed parsing for my project. After investigating the open source PDF libraries, I found that Apache PDFBox suited my needs best, and it so happened that there was an article here on codeproject.com (Converting-PDF-to-Text-in-C) with precompiled binaries containing everything you need to use it in a .NET project, so I went ahead and used the sample from there.
Using the code
Compiling
To be able to compile my project, you need to download the binaries from here and include the following files in your project's resource folder:
- IKVM.OpenJDK.Core.dll
- IKVM.OpenJDK.SwingAWT.dll
- pdfbox-1.7.0.dll
Also, be sure to copy the following files to the project's bin folder (otherwise it won't run):
- commons-logging.dll
- fontbox-1.7.0.dll
- IKVM.OpenJDK.Util.dll
- IKVM.Runtime.dll
Architecture
The solution creates a Windows service that needs to be installed with the installutil.exe command found in the corresponding .NET Framework folder. In debug mode the code runs with F5 as usual, but a release build runs as a service.
General flow
The whole idea of the project is to:
- Watch a folder for new PDF files.
- When a new file shows up, search the file for certain identifiers to decide what to do with it.
- When an identifier is matched, select important information in the file and use this to name the file appropriately.
- Move the file to a destination depending on the identifier.
Setting up the file watchers
Because of how ABBYY (the OCR software I use) is set up, the file is named <prefix>_OCR.pdf after it has gone through the OCR process, so a FileSystemWatcher object is set up like this:
FileSystemWatcher watcher = new FileSystemWatcher(watchFolder, "*_OCR.pdf");
watcher.NotifyFilter = NotifyFilters.LastWrite | NotifyFilters.FileName | NotifyFilters.DirectoryName;
watcher.Created += new FileSystemEventHandler(OnCreated);
watcher.EnableRaisingEvents = true;
While testing the software, I often ended up with files in the wrong directory with the wrong file name due to poorly written identifiers. To be able to reprocess files regardless of file name, I also set up a watcher for a rematchFolder where the filter is just "*.pdf". That way you can change the configuration and then drop any file into the rematch folder to run it through the processing again with the new rules.
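The two watchers described above could be created along the following lines. This is only a sketch: the class name WatcherSketch and the helper WaitUntilReadable are hypothetical and not part of the original project. The helper is there because a Created event can fire while the OCR software is still writing the file, so it may be worth retrying until the file can be opened exclusively.

```csharp
using System;
using System.IO;
using System.Threading;

public static class WatcherSketch
{
    // Creates a watcher that raises Created events for files matching the filter.
    public static FileSystemWatcher CreateWatcher(string folder, string filter,
                                                  FileSystemEventHandler onCreated)
    {
        var watcher = new FileSystemWatcher(folder, filter);
        watcher.NotifyFilter = NotifyFilters.LastWrite | NotifyFilters.FileName
                               | NotifyFilters.DirectoryName;
        watcher.Created += onCreated;
        watcher.EnableRaisingEvents = true;
        return watcher;
    }

    // One watcher per folder: only OCR output in the watch folder,
    // any PDF in the rematch folder.
    public static FileSystemWatcher[] CreateBoth(string watchFolder, string rematchFolder,
                                                 FileSystemEventHandler onCreated)
    {
        return new[]
        {
            CreateWatcher(watchFolder, "*_OCR.pdf", onCreated),
            CreateWatcher(rematchFolder, "*.pdf", onCreated)
        };
    }

    // Hypothetical helper: retries until the file can be opened with no sharing,
    // which suggests the writer has finished. Returns false if it never succeeds.
    public static bool WaitUntilReadable(string path, int attempts, int delayMs)
    {
        for (int i = 0; i < attempts; i++)
        {
            try
            {
                using (new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.None))
                {
                    return true;
                }
            }
            catch (IOException)
            {
                // Still being written (or not there yet): wait and try again.
                Thread.Sleep(delayMs);
            }
        }
        return false;
    }
}
```

In an OnCreated handler, calling WaitUntilReadable(e.FullPath, 20, 500) before parsing would give the OCR software up to about ten seconds to finish; those numbers are arbitrary.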
Configuration
Two pieces of configuration drive the service. One is the app.config, which specifies where to find the input folders, the no-match folder and the rules configuration file, and where to put the log file.
The other is the rules configuration, which is stored in an XML file and loaded into a list (PDFTemplates) of PDFTemplate objects. The PDFTemplate class simply holds:
- A list of strings in identifiers, where each string is a regular expression. ALL identifiers must match in a file for the rule to apply.
- A list of strings in contentSelectors, where each string is a regular expression containing capturing groups (denoted in the regular expression with "(...)"). The first content selector to match something is used for renaming the file.
- A string in fileNamePrefix, setting the prefix of the file name the rule should rename the file to.
- A string in destinationFolder, holding the full path to the directory the rule should move the file to.
The XML file looks like this:
<ArrayOfPDFTemplate xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <PDFTemplate>
    <identifiers>
      <string>[Ss]ome company</string>
    </identifiers>
    <contentSelectors>
      <string>\bInvoice date\W+(?:\w+\W+){0,20}?([0-9] *[0-9] *[0-9] *[0-9] *- *[0-9] *[0-9] *- *[0-9] *[0-9])\b</string>
      <string>([0-9] *[0-9] *[0-9] *[0-9] *- *[0-9] *[0-9] *- *[0-9] *[0-9])</string>
    </contentSelectors>
    <fileNamePrefix>Some Company</fileNamePrefix>
    <destinationFolder>C:\Sorted PDF Files\Some Company</destinationFolder>
  </PDFTemplate>
  ...
</ArrayOfPDFTemplate>
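The ArrayOfPDFTemplate root element is the name XmlSerializer gives a List<PDFTemplate>, so a file like the one above can be read roughly as follows. This is a sketch under assumptions: the member names are taken from the XML elements, but the use of public fields and the TemplateLoader class are my own guesses, not necessarily how the project does it.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Member names mirror the element names in the rules file.
public class PDFTemplate
{
    public List<string> identifiers = new List<string>();
    public List<string> contentSelectors = new List<string>();
    public string fileNamePrefix;
    public string destinationFolder;
}

// Hypothetical helper class, not part of the original project.
public static class TemplateLoader
{
    // Deserializes an <ArrayOfPDFTemplate> document into a list of templates.
    public static List<PDFTemplate> Load(TextReader reader)
    {
        var serializer = new XmlSerializer(typeof(List<PDFTemplate>));
        return (List<PDFTemplate>)serializer.Deserialize(reader);
    }

    // Convenience overload that reads the rules from a file on disk.
    public static List<PDFTemplate> LoadFile(string path)
    {
        using (var reader = new StreamReader(path))
        {
            return Load(reader);
        }
    }
}
```

Because the rules live in a plain XML file, they can be edited and reloaded without recompiling the service.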
Running the match and renaming the file
When a file is processed, its text is checked against the identifiers of each object in the PDFTemplates list, and the first rule whose identifiers all match is applied.
If no rule is matched at all, the file is moved (but not renamed) to a designated "noMatch" folder for manual processing.
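The "first rule wins, otherwise noMatch" selection can be sketched on its own, independent of the PDF handling. The class and method names below are hypothetical. To keep the example self-contained, each rule is represented only by its list of identifier patterns.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original project.
public static class TemplateMatcher
{
    // A rule applies only when every one of its identifier patterns matches.
    public static bool MatchesAll(string text, IEnumerable<string> identifiers)
    {
        return identifiers.All(id => Regex.IsMatch(text, id, RegexOptions.IgnoreCase));
    }

    // Returns the index of the first rule whose identifiers all match, or -1
    // when nothing matches (the caller then moves the file to the noMatch folder).
    public static int IndexOfFirstMatch(string text, IList<List<string>> identifierSets)
    {
        for (int i = 0; i < identifierSets.Count; i++)
        {
            if (MatchesAll(text, identifierSets[i]))
            {
                return i;
            }
        }
        return -1;
    }
}
```

Note that in this sketch a rule with an empty identifier list matches every file, so the order of the rules in the XML file matters: more specific rules should come first.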
The code for searching through the file for identifiers and renaming it goes like this:
...
// Extract the text layer of the PDF with PDFBox (via IKVM).
org.apache.pdfbox.pdmodel.PDDocument doc = org.apache.pdfbox.pdmodel.PDDocument.load(fullPath);
try
{
    org.apache.pdfbox.util.PDFTextStripper stripper = new org.apache.pdfbox.util.PDFTextStripper();
    text = stripper.getText(doc);
}
finally
{
    // Release the file even if text extraction throws.
    doc.close();
}
...
// Every identifier must match for the template to apply.
foreach (string identifier in thisTemplate.identifiers)
{
    if (!Regex.IsMatch(text, identifier, RegexOptions.IgnoreCase))
    {
        identifiersFound = false;
        break;
    }
}
...
// Use the first content selector that matches; its first capturing group
// is appended to the new file name.
foreach (string contentSelector in thisTemplate.contentSelectors)
{
    Match thisMatch = Regex.Match(text, contentSelector,
        RegexOptions.IgnoreCase | RegexOptions.Multiline);
    if (thisMatch.Success)
    {
        string selection = thisMatch.Groups[1].Value;
        newFileName = newFileName + "_" + selection;
        break;
    }
}
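To see what the content selectors from the sample configuration actually capture, the selection step can be isolated into a small function and run against plain strings. The ContentSelection class is a hypothetical wrapper for illustration. The regular expressions in the usage note are the two from the XML file.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original project.
public static class ContentSelection
{
    // Returns group 1 of the first selector that matches, or null if none does.
    public static string SelectFirst(string text, IEnumerable<string> contentSelectors)
    {
        foreach (string selector in contentSelectors)
        {
            Match m = Regex.Match(text, selector,
                RegexOptions.IgnoreCase | RegexOptions.Multiline);
            if (m.Success)
            {
                return m.Groups[1].Value;
            }
        }
        return null;
    }
}
```

With the two sample selectors, a text containing "Invoice date: 2013-01-15" yields "2013-01-15" from the first selector, even if another date appears earlier in the document. A text without the words "Invoice date" falls through to the second selector, which picks the first date it finds. Because the patterns allow spaces between digits (OCR sometimes inserts them), the captured value can look like "2 0 1 2 - 1 2 - 3 1" and may need its spaces removed before it goes into a file name.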
And then there is just the matter of renaming and moving the file being processed.
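The article does not show the rename-and-move step, so here is one way it might look. Everything in this sketch is an assumption: the FileMover class, the choice to strip spaces and invalid file name characters from the captured value, and the " (2)", " (3)" suffix used to avoid overwriting an existing file.

```csharp
using System;
using System.IO;

// Hypothetical helper class, not part of the original project.
public static class FileMover
{
    // Builds "<prefix>_<selection>.pdf", dropping the spaces that OCR may leave
    // between digits and any characters that are not valid in a file name.
    public static string BuildFileName(string prefix, string selection)
    {
        string name = prefix + "_" + selection.Replace(" ", "");
        foreach (char c in Path.GetInvalidFileNameChars())
        {
            name = name.Replace(c.ToString(), "");
        }
        return name + ".pdf";
    }

    // Moves the file into destinationFolder, appending " (2)", " (3)", ...
    // to the name if a file with the target name already exists.
    // Returns the full path the file ended up at.
    public static string MoveWithoutOverwrite(string sourcePath, string destinationFolder,
                                              string fileName)
    {
        Directory.CreateDirectory(destinationFolder);
        string target = Path.Combine(destinationFolder, fileName);
        string stem = Path.GetFileNameWithoutExtension(fileName);
        string ext = Path.GetExtension(fileName);
        for (int n = 2; File.Exists(target); n++)
        {
            target = Path.Combine(destinationFolder, stem + " (" + n + ")" + ext);
        }
        File.Move(sourcePath, target);
        return target;
    }
}
```

The collision handling matters in practice because several invoices from the same company can share a date, and a plain File.Move throws when the destination already exists.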
Points of Interest
This article is not so much about solving things elegantly (the code needs some rework for that; it's just a hack) but rather a starting point if you're facing a similar situation. I tried to find existing software that would do this for me, but I drew a blank when it came to tools that use regular expressions and let me set up my own collection of rules to apply to a file.