Introduction
This article demonstrates how to use the iTextSharp .NET library to convert a PDF file to text.
Background
It seems like I was always searching for a better way to convert a PDF file to text (so I could edit it, parse it with regex, etc.). And we are not talking about a couple of pages of PDF here: I was receiving daily reports in PDF format that were 200-300 pages in length.
I started with a Python library that I found to do the PDF-to-text conversion. This seemed like a good choice because I was planning on using Python to parse the PDF anyway. Unfortunately, converting a single 200+ page PDF with this method was taking on the order of several minutes (on a pretty fast machine). Unacceptable.
Code Project to the rescue! My next solution was the original article regarding PDF-to-text that used PDFBox. By using this method, my PDF conversion went down from several minutes to about 10 seconds (again for a 200+ page PDF). All good, right?
Well... it was a great improvement, to be sure. But something about knowing that the code was piggybacking on the Java VM and that it was, therefore, slower than the Java version rubbed me the wrong way. So between that and the fact that I am a huge nerd when I get a free weekend, I decided to revisit the potential solutions listed in the original article. I was able to convert the Java source code that uses the iText library so that it uses iTextSharp, the .NET version of the same library. The end result is that I can now convert a 250-page PDF file to text in less than a second.
Using the code
This is the full C# code for my project. As I mentioned in the Background section, I just converted it from Java, so you may see a little design weirdness. Nevertheless, the code is quite short. And the resulting app is fast!
This code references a few of the iTextSharp DLLs. I have included them in the project download files, but you can also find them on SourceForge (for future updates, etc.).
using System;
using System.Diagnostics;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public class ParsingPDF
{
    static string PDF;
    static string TEXT2;

    // Writes every string token found in each page's content stream to dest,
    // one token per line.
    public void parsePdf(String src, String dest)
    {
        PdfReader reader = new PdfReader(src);
        try
        {
            // The using block flushes and closes the output file, even on error.
            using (StreamWriter output = new StreamWriter(new FileStream(dest, FileMode.Create)))
            {
                int pageCount = reader.NumberOfPages;
                for (int pg = 1; pg <= pageCount; pg++)
                {
                    byte[] streamBytes = reader.GetPageContent(pg);
                    PRTokeniser tokenizer = new PRTokeniser(streamBytes);
                    while (tokenizer.NextToken())
                    {
                        if (tokenizer.TokenType == PRTokeniser.TokType.STRING)
                        {
                            output.WriteLine(tokenizer.StringValue);
                        }
                    }
                }
            }
        }
        finally
        {
            reader.Close();
        }
    }

    static void Main(string[] args)
    {
        if (args.Length < 1 || args.Length > 2)
        {
            Console.WriteLine("USAGE: ParsePDF infile.pdf [outfile.txt]");
            return;
        }
        else if (args.Length == 1)
        {
            // No output name given: use the input name with a .txt extension.
            PDF = args[0];
            TEXT2 = Path.GetFileNameWithoutExtension(PDF) + ".txt";
        }
        else
        {
            PDF = args[0];
            TEXT2 = args[1];
        }

        try
        {
            // Stopwatch is better suited to measuring elapsed time than DateTime.Now.
            Stopwatch timer = Stopwatch.StartNew();
            ParsingPDF example = new ParsingPDF();
            example.parsePdf(PDF, TEXT2);
            timer.Stop();
            Console.WriteLine("Parsing completed in {0:0.00} seconds.", timer.Elapsed.TotalSeconds);
        }
        catch (Exception ex)
        {
            Console.WriteLine("ERROR: " + ex.Message);
        }
    }

    // Render listener carried over from the Java source.
    // Note: parsePdf above does not call it.
    public class MyTextRenderListener : IRenderListener
    {
        protected StreamWriter output;

        public MyTextRenderListener(StreamWriter output)
        {
            this.output = output;
        }

        public void BeginTextBlock()
        {
            output.Write("<");
        }

        public void EndTextBlock()
        {
            output.WriteLine(">");
        }

        public void RenderImage(ImageRenderInfo renderInfo)
        {
        }

        public void RenderText(TextRenderInfo renderInfo)
        {
            output.Write("<");
            output.Write(renderInfo.GetText());
            output.Write(">");
        }
    }
}
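The listing above writes out the raw string tokens from each page's content stream. For comparison, here is a minimal sketch of a different route: iTextSharp's built-in PdfTextExtractor helper in the iTextSharp.text.pdf.parser namespace. This is not the method used in this article, and I have not timed it against the tokenizer version. The class name ExtractorSketch and the method name ExtractText are made up for illustration.

```csharp
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper, not part of the article's project.
public static class ExtractorSketch
{
    // Returns the text of every page in src, using the library's
    // default extraction strategy, with a line break after each page.
    public static string ExtractText(string src)
    {
        PdfReader reader = new PdfReader(src);
        try
        {
            StringBuilder text = new StringBuilder();
            for (int pg = 1; pg <= reader.NumberOfPages; pg++)
            {
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, pg));
            }
            return text.ToString();
        }
        finally
        {
            reader.Close();
        }
    }
}
```

A call such as File.WriteAllText("out.txt", ExtractorSketch.ExtractText("in.pdf")) would produce an output file comparable to the one written by parsePdf, which makes it easy to compare the two outputs on your own PDFs.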
Points of Interest
It was interesting to see some Java code again. I haven't done anything serious in Java for over five years, but what struck me was how close the Java code is to the C# code. This made the conversion relatively easy.
I initially planned to incorporate the Task Parallel Library to try to speed up the results, but that was before I realized that the non-parallel version was performing in under half a second. Just for fun, I may look into the TPL anyway. It would be a good learning exercise, and it would be interesting to see how TPL performs. In the case of a many-page PDF document, I'm sure TPL is not going to launch hundreds of threads, for example. But how many will it launch?
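One way to get a feel for the "how many threads?" question without touching any PDF code is to run a dummy workload through Parallel.For and count the distinct managed thread IDs that show up. The sketch below does only that. The class name TplThreadCount, the method name CountThreadsUsed, and the spin-wait standing in for per-page parsing are all made up for illustration, and the number you get will depend on your machine and on the state of the thread pool.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical experiment, not part of the article's project.
public static class TplThreadCount
{
    // Runs pageCount simulated "page" jobs through Parallel.For and returns
    // how many distinct managed threads ended up executing them.
    public static int CountThreadsUsed(int pageCount)
    {
        ConcurrentDictionary<int, bool> threadIds = new ConcurrentDictionary<int, bool>();

        // Pages are numbered from 1, as in the PDF code; the upper bound is exclusive.
        Parallel.For(1, pageCount + 1, pg =>
        {
            // Record which thread is handling this page.
            threadIds.TryAdd(Thread.CurrentThread.ManagedThreadId, true);

            // Stand-in for real per-page work (assumption: some CPU-bound parsing).
            Thread.SpinWait(100000);
        });

        return threadIds.Count;
    }
}
```

Printing TplThreadCount.CountThreadsUsed(250) next to Environment.ProcessorCount gives a rough comparison between the threads actually used for a 250-page job and the number of logical cores available.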
So I'll keep this article updated if I pursue the TPL version. Also, I want to implement this in my new favorite baby: F#.
History
Original version posted Oct 22, 2012.