I recently posted about using PdfBox.net to manipulate Pdf documents in your C# application. This time, I take a quick look at iTextSharp, another library for working with Pdf documents from within the .NET framework.
What is iTextSharp?
iTextSharp is a direct .NET port of the open source iText Java library for PDF generation and manipulation. As the project’s summary page on SourceForge states, iText “ . . . can be used to create PDF Documents from scratch, to convert XML to PDF . . . to fill out interactive PDF forms, to stamp new content on existing PDF documents, to split and merge existing PDF documents, and much more.”
iTextSharp presents a formidable set of tools for developers who need to create and/or manipulate Pdf files. This does come with a cost, however. The Pdf file format itself is complex; therefore, programming libraries which seek to provide a flexible interface for working with Pdf files become complex by default. iText is no exception.
I noted in my previous post on PdfBox that PdfBox was a little easier for me to get up and running with, at least for rather basic tasks such as splitting and merging existing Pdf files. I also noted that iText looked to be a little more complex, and I was correct. However, iTextSharp does not suffer some of the performance drawbacks inherent to PdfBox, at least on the .NET platform.
As I observed in my previous post, PdfBox.net is NOT a direct port of the PdfBox Java library, but instead is a Java library running within .NET using IKVM. While I found it very cool to be able to run Java code in a .NET context, there was a serious performance hit. This was most noticeable the first time the PdfBox library was called, when the massive IKVM library spun up what amounts to a .NET implementation of the Java Virtual Machine, within which the Java code of the PdfBox library is then executed.
Needless to say, iTextSharp does not suffer this limitation. The library itself is relatively lightweight and fast.
One of the most common tasks we need to do is extract pages from one Pdf into a new file. We’ll take a look at some relatively basic sample code which does just that, and get a feel for using the iTextSharp programming model.
In the following code sample, the primary iTextSharp classes we will be using are the PdfReader, Document, PdfCopy, and PdfImportedPage classes.
My simplified understanding of how this works is as follows: The PdfReader instance contains the content of the source PDF file. The Document class, initialized with the page size of the page being extracted, essentially becomes a container into which pages extracted from the source file represented in the PdfReader instance will be copied. The PdfCopy instance ties the Document to a new output FileStream. Each page is pulled from the reader as a PdfImportedPage and added to the PdfCopy instance. When the Document is closed, the result is output to the FileStream and saved to disk at the location specified by the destination file name.
You can download the iTextSharp source code and binaries as a single package from the Files page at the iTextSharp project site. Just click on the “Download itextsharp-all-5.4.0.zip” link. Extract the files from the .zip archive, and stash them somewhere convenient. Next, set a reference in your project to itextsharp.dll. You will need to browse to the folder where you stashed the extracted contents of the iTextSharp download.
NOTE: The complete example code for this post is available at my Github Repo.
I went ahead and created a project named iTextTools, with a class file named PdfExtractorUtility. Add the following using statements at the top of the file:
using iTextSharp.text;
using iTextSharp.text.pdf;
using System;

namespace iTextTools
{
    public class PdfExtractorUtility
    {
    }
}
First, I’ll add a simple method to extract a single page from an existing PDF file and save to a new file:
public void ExtractPage(string sourcePdfPath, string outputPdfPath,
    int pageNumber, string password = "")
{
    // Note: the password parameter is not used in this example.
    PdfReader reader = null;
    Document document = null;
    PdfCopy pdfCopyProvider = null;
    PdfImportedPage importedPage = null;

    try
    {
        // Load the source PDF:
        reader = new PdfReader(sourcePdfPath);

        // Size the output document to match the page being extracted:
        document = new Document(reader.GetPageSizeWithRotation(pageNumber));

        // PdfCopy writes the pages added to it to the output stream:
        pdfCopyProvider = new PdfCopy(document,
            new System.IO.FileStream(outputPdfPath, System.IO.FileMode.Create));
        document.Open();

        // Import the requested page and add it to the output:
        importedPage = pdfCopyProvider.GetImportedPage(reader, pageNumber);
        pdfCopyProvider.AddPage(importedPage);

        document.Close();
        reader.Close();
    }
    catch (Exception)
    {
        // Rethrow with "throw;" rather than "throw ex;" so the
        // original stack trace is preserved.
        throw;
    }
}
As you can see, simply pass in the path to the source document, the page number to be extracted, and an output file path, and you’re done.
If we want to be able to extract a range of contiguous pages, we might add another method defining a start and end point:
public void ExtractPages(string sourcePdfPath, string outputPdfPath,
    int startPage, int endPage)
{
    PdfReader reader = null;
    Document sourceDocument = null;
    PdfCopy pdfCopyProvider = null;
    PdfImportedPage importedPage = null;

    try
    {
        // Load the source PDF:
        reader = new PdfReader(sourcePdfPath);

        // Size the output document to match the first page in the range:
        sourceDocument = new Document(reader.GetPageSizeWithRotation(startPage));

        // PdfCopy writes the pages added to it to the output stream:
        pdfCopyProvider = new PdfCopy(sourceDocument,
            new System.IO.FileStream(outputPdfPath, System.IO.FileMode.Create));
        sourceDocument.Open();

        // Copy each page in the range (inclusive) to the output:
        for (int i = startPage; i <= endPage; i++)
        {
            importedPage = pdfCopyProvider.GetImportedPage(reader, i);
            pdfCopyProvider.AddPage(importedPage);
        }

        sourceDocument.Close();
        reader.Close();
    }
    catch (Exception)
    {
        // Rethrow with "throw;" rather than "throw ex;" so the
        // original stack trace is preserved.
        throw;
    }
}
What if we want non-contiguous pages from the source document? Well, we might overload the above method with one which accepts an array of ints representing the desired pages:
public void ExtractPages(string sourcePdfPath,
    string outputPdfPath, int[] extractThesePages)
{
    PdfReader reader = null;
    Document sourceDocument = null;
    PdfCopy pdfCopyProvider = null;
    PdfImportedPage importedPage = null;

    try
    {
        // Load the source PDF:
        reader = new PdfReader(sourcePdfPath);

        // Size the output document to match the first requested page:
        sourceDocument = new Document(reader.GetPageSizeWithRotation(extractThesePages[0]));

        // PdfCopy writes the pages added to it to the output stream:
        pdfCopyProvider = new PdfCopy(sourceDocument,
            new System.IO.FileStream(outputPdfPath, System.IO.FileMode.Create));
        sourceDocument.Open();

        // Copy each requested page to the output, in the order given:
        foreach (int pageNumber in extractThesePages)
        {
            importedPage = pdfCopyProvider.GetImportedPage(reader, pageNumber);
            pdfCopyProvider.AddPage(importedPage);
        }

        sourceDocument.Close();
        reader.Close();
    }
    catch (Exception)
    {
        // Rethrow with "throw;" rather than "throw ex;" so the
        // original stack trace is preserved.
        throw;
    }
}
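One thing the array version takes on faith is that every requested page actually exists in the source file, and that the array is not empty (it reads extractThesePages[0] right away). As a rough sketch of how you might guard against that, here is a small helper which filters the requested pages before they are handed to ExtractPages. Note that PageSelection and KeepValidPages are hypothetical names of my own, not part of iTextSharp, and the choice to silently drop invalid pages rather than fail on them is just one possible design.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of iTextSharp.
public static class PageSelection
{
    // Returns the requested page numbers that fall within 1..pageCount,
    // in their original order. Throws if nothing usable remains.
    public static int[] KeepValidPages(int[] requestedPages, int pageCount)
    {
        if (requestedPages == null)
        {
            throw new ArgumentNullException("requestedPages");
        }

        var valid = new List<int>();
        foreach (int page in requestedPages)
        {
            // PDF page numbers are 1-based.
            if (page >= 1 && page <= pageCount)
            {
                valid.Add(page);
            }
        }

        if (valid.Count == 0)
        {
            throw new ArgumentException(
                "None of the requested pages exist in the source document.",
                "requestedPages");
        }

        return valid.ToArray();
    }
}
```

The page count would presumably come from the NumberOfPages property of a PdfReader opened on the source file, and the filtered array could then be passed to ExtractPages as before.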
Scratching the Surface
Obviously, the examples above are a simplistic first exploration of what appears to be a powerful library. What I notice about iText in general is that, unlike some APIs, the path to achieving your desired result is often not intuitive. I believe this has as much to do with the nature of the PDF file format as with the structure of the lower-level libraries upon which iTextSharp is built.
That said, there is without a doubt much to be discerned by exploring the iTextSharp source code. Additionally, there are a number of resources to assist the aspiring developer in using this library:
Lastly, there is a book authored by one of the primary contributors to the iText project, Bruno Lowagie: