Introduction
My company Red Cell Innovation Inc. provides a document scanning service. Often we require a page count of a collection of PDF files for the purpose of billing, quality control, scheduling, and estimating.
I quickly wrote this application to produce those counts. It uses a procedural style and takes about 200 lines of code, including XAML and comments, in a single simple codebehind class.
Features
- Simple: Drop a directory in the application.
- Fast: Scanned 20GB of PDFs and counted 53877 pages in 499 files in 7 seconds on an SSD (270 seconds on a network drive).
How It Works
Language | C# 5.0
.NET Framework | 4.5
UI Framework | WPF
Libraries | iTextSharp
Pattern | Codebehind procedural
When the application starts, the user is prompted to drop files and/or folders into the application's window.
When files or folders are dropped, the Start method is invoked, switching the UI to the count screen.
The async Analyze method is then invoked. It uses Task.Run to move the work off the UI thread and traverses the filesystem recursively. A thread-pool thread is requested for each directory to be enumerated and its files counted.
private async Task Analyze(IEnumerable<string> filenames)
{
    // Run the traversal on a thread-pool thread so the UI stays responsive.
    await Task.Run(async () =>
    {
        foreach (string filename in filenames)
        {
            if (this._cancel)
                break;
            Dispatcher.Invoke(Update); // refresh the counters on the UI thread
            if (Directory.Exists(filename))
            {
                // Recurse into the directory's PDF files, then move on:
                // the directory itself is not a file and must not be counted.
                string[] nestedFilenames = Directory.GetFiles(filename, "*.pdf", SearchOption.AllDirectories);
                await Analyze(nestedFilenames);
                continue;
            }
            this._files++;
            if (new FileInfo(filename).Extension.ToLower() != ".pdf")
                continue;
            this._filesPdf++;
            int pages = Count(filename);
            this._pages += pages;
        }
        Dispatcher.Invoke(Update);
    });
}
private int Count(string filename)
{
    // The using block disposes the reader, so no explicit Close() is needed.
    using (var reader = new PdfReader(filename))
    {
        return reader.NumberOfPages;
    }
}
The Count method uses the iTextSharp library to read the PDF files. Since PDF files are internally indexed, the document does not need to be scanned (see PDF Syntax). Instead a PdfReader object is instantiated and its NumberOfPages property is read.
The system resources used are negligible.
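The traversal and counting steps described in this section can also be sketched without any UI, which makes them easier to test. This is a minimal sketch and not the application's actual code: the Totals and PdfTally names are hypothetical, the page counter is passed in as a delegate (in the application it would be the Count method), and cancellation uses a CancellationToken instead of the _cancel field. Unlike the application, this sketch counts every file found in a directory, not only the PDFs.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

// Hypothetical container for the three counters shown on the count screen.
public sealed class Totals
{
    public int Files;
    public int PdfFiles;
    public long Pages;
}

public static class PdfTally
{
    // Expands dropped paths into file paths; directories are walked recursively and lazily.
    public static IEnumerable<string> Expand(IEnumerable<string> paths)
    {
        foreach (string path in paths)
        {
            if (Directory.Exists(path))
            {
                foreach (string file in Directory.EnumerateFiles(path, "*", SearchOption.AllDirectories))
                    yield return file;
            }
            else if (File.Exists(path))
            {
                yield return path;
            }
        }
    }

    // Counts files, PDF files, and pages; countPages is supplied by the caller.
    public static Totals Tally(IEnumerable<string> paths, Func<string, int> countPages, CancellationToken token)
    {
        var totals = new Totals();
        foreach (string file in Expand(paths))
        {
            if (token.IsCancellationRequested)
                break;
            totals.Files++;
            // Case-insensitive extension check, so "report.PDF" is counted too.
            if (!string.Equals(Path.GetExtension(file), ".pdf", StringComparison.OrdinalIgnoreCase))
                continue;
            totals.PdfFiles++;
            totals.Pages += countPages(file);
        }
        return totals;
    }
}
```

A caller could wrap PdfTally.Tally in Task.Run to keep it off the UI thread, as the application does with Analyze.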
PDF Syntax
This could have been done quite easily without iTextSharp by writing a simple PDF parser; however, this would have increased the development time, which was only about an hour because I was already familiar with iTextSharp.
To accomplish this without iTextSharp we would read the PDF and follow the references.
The following is a syntactically correct and complete PDF file. To find the Pages section, we first check the trailer, which specifies reference 1 as the Root. We can see that object 1 contains the Catalog, which points to reference 3 as the Pages section. Note how the Pages resource has a Count of 1 and describes a single page, defined in object 4.
%PDF-1.4
1 0 obj
<< /Type /Catalog
/Outlines 2 0 R
/Pages 3 0 R
>>
endobj
2 0 obj
<< /Type /Outlines
/Count 0
>>
endobj
3 0 obj
<< /Type /Pages
/Kids [ 4 0 R ]
/Count 1
>>
endobj
4 0 obj
<< /Type /Page
/Parent 3 0 R
/MediaBox [ 0 0 612 792 ]
/Contents 5 0 R
/Resources << /ProcSet 6 0 R >>
>>
endobj
5 0 obj
<< /Length 35 >>
stream
<-- Page-marking operators -->
endstream
endobj
6 0 obj
[ /PDF ]
endobj
xref
0 7
0000000000 65535 f
0000000009 00000 n
0000000074 00000 n
0000000120 00000 n
0000000179 00000 n
0000000300 00000 n
0000000384 00000 n
trailer
<< /Size 7
/Root 1 0 R
>>
startxref
408
%%EOF
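Following the references by hand, as described above, could be sketched roughly like this. It is a deliberately naive illustration and not how iTextSharp works: the NaivePdfPageCounter class and its methods are hypothetical names, and the code assumes a simple, uncompressed PDF with a classic trailer (no cross-reference streams, object streams, or encryption). For brevity it locates objects with a regular expression over the whole text, whereas a real reader would jump to each object using the startxref and xref offsets.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class NaivePdfPageCounter
{
    // Matches "N G obj ... endobj" and captures the object number, generation, and body.
    private static readonly Regex ObjectPattern =
        new Regex(@"(?<!\d)(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj\b", RegexOptions.Singleline);

    public static int CountPagesInFile(string path)
    {
        // ISO-8859-1 maps each byte to one character, so binary content does not break decoding.
        return CountPages(File.ReadAllText(path, Encoding.GetEncoding("ISO-8859-1")));
    }

    public static int CountPages(string pdfText)
    {
        // Index every object body by its "number generation" key.
        var objects = new Dictionary<string, string>();
        foreach (Match m in ObjectPattern.Matches(pdfText))
            objects[m.Groups[1].Value + " " + m.Groups[2].Value] = m.Groups[3].Value;

        // Step 1: the trailer names the Root (catalog) object.
        int trailerIndex = pdfText.LastIndexOf("trailer", StringComparison.Ordinal);
        if (trailerIndex < 0)
            throw new InvalidDataException("No trailer found.");
        string rootKey = ReadReference(pdfText.Substring(trailerIndex), "Root");

        // Step 2: the catalog names the Pages object.
        string pagesKey = ReadReference(Lookup(objects, rootKey), "Pages");

        // Step 3: the root Pages object carries the total page count.
        Match count = Regex.Match(Lookup(objects, pagesKey), @"/Count\s+(\d+)");
        if (!count.Success)
            throw new InvalidDataException("Pages object has no /Count entry.");
        return int.Parse(count.Groups[1].Value);
    }

    // Reads an indirect reference such as "/Root 1 0 R" and returns the key "1 0".
    private static string ReadReference(string dictionary, string key)
    {
        Match m = Regex.Match(dictionary, "/" + key + @"\s+(\d+)\s+(\d+)\s+R\b");
        if (!m.Success)
            throw new InvalidDataException("Missing /" + key + " reference.");
        return m.Groups[1].Value + " " + m.Groups[2].Value;
    }

    private static string Lookup(Dictionary<string, string> objects, string key)
    {
        string body;
        if (!objects.TryGetValue(key, out body))
            throw new InvalidDataException("Object " + key + " not found.");
        return body;
    }
}
```

For the sample file above, the lookup goes from the trailer to object 1, then to object 3, whose /Count is 1.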
Acknowledgements
iTextSharp is the work of Paulo Soares, Bruno Lowagie, et al.
PDF file syntax example is from PDF Reference, sixth edition. © 2006 Adobe® Systems Incorporated.
History
January 18, 2014 | Application written
October 24, 2014 | Article written