PDF Page Counter

24 Oct 2014
Quickly count the number of pages in a collection of PDF documents

Introduction

My company Red Cell Innovation Inc. provides a document scanning service. Often we require a page count of a collection of PDF files for the purpose of billing, quality control, scheduling, and estimating.

This is an application that I quickly wrote to automate the count. It uses a procedural style and comes to about 200 lines of code, including XAML and comments, in a single simple code-behind class.

Features

  • Simple: Drop a directory onto the application.
  • Fast: Scanned 20 GB of PDFs, counting 53,877 pages in 499 files, in 7 seconds on an SSD (270 seconds on a network drive).

How It Works

Language: C# 5.0
.NET Framework: 4.5
UI Framework: WPF
Libraries: iTextSharp
Pattern: Code-behind, procedural

When the application starts, the user is prompted to drop files and/or folders into the application's window.

UI: Drop files and/or folders to be counted.

When files or folders are dropped, the Start method is invoked, changing the visibility of UI elements to the count screen.

UI: File and page counts

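The async Analyze method is then invoked. It uses Task.Run to move the work onto a thread-pool thread, which keeps the UI responsive while the filesystem is traversed. For each dropped folder, Directory.GetFiles with SearchOption.AllDirectories enumerates every PDF beneath it recursively, and Analyze is called again on that list to count the files.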

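private async Task Analyze (IEnumerable<string> filenames)
{
    // Run the traversal on a thread-pool thread so the UI stays responsive.
    await Task.Run(async () =>
    {
        foreach (string filename in filenames)
        {
            if (this._cancel)
                break;

            // Refresh the on-screen counters on the UI thread.
            Dispatcher.Invoke(Update);
            if (Directory.Exists(filename))
            {
                // A dropped folder: collect every PDF beneath it and count those.
                string[] nestedFilenames = Directory.GetFiles(filename, "*.pdf", SearchOption.AllDirectories);
                await Analyze(nestedFilenames);
                continue; // The directory itself is not a file to count.
            }

            this._files++;
            if (!string.Equals(Path.GetExtension(filename), ".pdf", StringComparison.OrdinalIgnoreCase))
                continue;

            this._filesPdf++;
            this._pages += Count(filename);
        }
        Dispatcher.Invoke(Update);
    });
}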

private int Count (string filename)
{
    // Disposing the reader closes it, so no explicit Close() call is needed.
    using (var reader = new PdfReader(filename))
    {
        return reader.NumberOfPages;
    }
}

The Count method uses the iTextSharp library to read the PDF files. Since PDF files are internally indexed, the document does not need to be scanned (see PDF Syntax). Instead, a PdfReader object is instantiated and its NumberOfPages property is read.
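The traversal and counting steps can also be separated from the UI. The following is only a sketch, not the application's code: PdfWalker, ExpandToPdfs, and CountPagesAsync are hypothetical names, the countPages delegate stands in for the Count method, and a CancellationToken is swapped in for the _cancel flag.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of the original application.
public static class PdfWalker
{
    // Expands dropped paths into individual PDF file paths.
    // Directories are searched recursively; loose files are filtered by extension.
    public static IEnumerable<string> ExpandToPdfs (IEnumerable<string> paths)
    {
        foreach (string path in paths)
        {
            if (Directory.Exists(path))
            {
                foreach (string file in Directory.EnumerateFiles(path, "*.pdf", SearchOption.AllDirectories))
                    yield return file;
            }
            else if (string.Equals(Path.GetExtension(path), ".pdf", StringComparison.OrdinalIgnoreCase))
            {
                yield return path;
            }
        }
    }

    // Sums page counts on a thread-pool thread.
    // countPages is an assumed stand-in for a method such as Count(filename).
    public static Task<int> CountPagesAsync (IEnumerable<string> paths, Func<string, int> countPages, CancellationToken token)
    {
        return Task.Run(() =>
        {
            int pages = 0;
            foreach (string file in ExpandToPdfs(paths))
            {
                // Stops the loop by throwing OperationCanceledException when cancelled.
                token.ThrowIfCancellationRequested();
                pages += countPages(file);
            }
            return pages;
        }, token);
    }
}
```

A caller would await PdfWalker.CountPagesAsync(droppedPaths, Count, tokenSource.Token) and call tokenSource.Cancel() from a cancel button. Unlike the application, this sketch reports no intermediate progress to the UI.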

The system resources used are negligible.

Task Manager Performance

PDF Syntax

This could have been done quite easily without iTextSharp by writing a simple PDF parser; however, that would have increased the development time. Because I was already familiar with iTextSharp, the application took about an hour to write.

To accomplish this without iTextSharp, we would read the PDF and follow the object references.

The file below is a syntactically correct and complete PDF. To find the page count, we first check the trailer, which specifies object 1 as the Root. Object 1 contains the Catalog, which points to object 3 as the root of the page tree (Pages). The Pages object has a /Count of 1 and a single entry in /Kids: the page described in object 4.

%PDF-1.4
1  0  obj
 <<  /Type /Catalog
  /Outlines  2 0 R
  /Pages  3 0 R
 >>
endobj
2  0  obj
 <<  /Type  /Outlines
  /Count  0
 >>
endobj
3  0  obj
 <<  /Type  /Pages
  /Kids  [ 4 0 R ]
  /Count  1
 >>
endobj
4  0  obj
 <<  /Type  /Page
  /Parent  3 0 R
  /MediaBox  [ 0  0  612  792 ]
  /Contents  5 0 R
  /Resources  <<  /ProcSet  6 0 R  >>
 >>
endobj
5  0  obj
 <<  /Length  35  >>
 stream
 <-- Page-marking operators -->
 endstream
endobj
6  0  obj
 [ /PDF ]
endobj
xref
0  7
0000000000  65535  f
0000000009  00000  n
0000000074  00000  n
0000000120  00000  n
0000000179  00000  n
0000000300  00000  n
0000000384  00000  n
trailer
 <<  /Size  7
  /Root  1 0 R
 >>
startxref
408
%%EOF
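
The reference-following steps described above (trailer, then Root, then Pages, then /Count) can be sketched in a few lines. This is a deliberately naive illustration and not how iTextSharp works: NaivePdfPageCounter and its methods are hypothetical names, and the sketch finds objects by text search instead of using the xref offsets. It assumes a simple, uncompressed file like the example; it would not handle object streams, cross-reference streams, or encryption.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; only handles simple, uncompressed PDFs like the example.
public static class NaivePdfPageCounter
{
    public static int CountPages (string pdfText)
    {
        // 1. The last trailer names the Root (Catalog) object.
        int trailerAt = pdfText.LastIndexOf("trailer", StringComparison.Ordinal);
        if (trailerAt < 0)
            throw new FormatException("No trailer found.");
        string catalog = FollowReference(pdfText.Substring(trailerAt), "Root", pdfText);

        // 2. The Catalog names the root of the page tree.
        string pages = FollowReference(catalog, "Pages", pdfText);

        // 3. The /Count of the root Pages node is the total number of pages.
        Match count = Regex.Match(pages, @"/Count\s+(\d+)");
        if (!count.Success)
            throw new FormatException("No /Count entry found.");
        return int.Parse(count.Groups[1].Value);
    }

    // Finds "/key N G R" in a dictionary and returns the body of object "N G obj".
    private static string FollowReference (string dictionary, string key, string pdfText)
    {
        Match reference = Regex.Match(dictionary, "/" + key + @"\s+(\d+)\s+(\d+)\s+R");
        if (!reference.Success)
            throw new FormatException("No /" + key + " reference found.");

        string pattern = @"(?<!\d)" + reference.Groups[1].Value + @"\s+"
            + reference.Groups[2].Value + @"\s+obj\b(.*?)endobj";
        Match obj = Regex.Match(pdfText, pattern, RegexOptions.Singleline);
        if (!obj.Success)
            throw new FormatException("Referenced object not found.");
        return obj.Groups[1].Value;
    }
}
```

For the example file, CountPages follows /Root 1 0 R to the Catalog and /Pages 3 0 R to the page tree, then reads /Count 1. The file should be read into the string with a single-byte encoding such as ISO-8859-1 so that binary stream data does not disturb the text.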

Acknowledgements

iTextSharp is the work of Paulo Soares, Bruno Lowagie, et al.

The PDF file syntax example is from PDF Reference, sixth edition. © 2006 Adobe® Systems Incorporated.

History

January 18, 2014: Application written
October 24, 2014: Article written

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
