Introduction
PDF documents are widely used, and their content is usually compressed. This
article presents simple C code that can be used to extract plain text from a
PDF file.
Why?
Adobe allows you to submit PDF files and will extract the text or HTML
and mail it back to you. But there are times when you need to extract the text
yourself or do it inside an application. You may also want to apply special
formatting (e.g., add tabs) so that the text can be easily imported into Excel,
for example when your PDF document mostly contains tables that you need to port
to Excel (which is how this code came to be developed).
There are several projects on "The Code Project" that show how to create PDF
documents, but none that provide free code that shows how to extract text
without using a commercial library. In the reader comments, a need was expressed
for code just like what is being supplied here.
There are several libraries out there that read or create PDF files, but you
have to register them for commercial use or sign various agreements. The code
supplied here is very simple and basic, but it is entirely free. It only uses the
ZLIB library, which is also free.
Basics
You can download documents such as PDFReference15_v5.pdf that explain some of the
internals of PDF files. In short, each PDF file contains a number of objects. Each
object may require one or more filters to decompress it and may also provide a
stream of data. Text streams are usually compressed using the FlateDecode filter
and may be uncompressed using code from the ZLIB (http://www.zlib.org/) library.
The data for each object can be found between "stream" and "endstream"
sections. Once inflated, the data needs to be processed to extract the text. The
data usually contains one or more text objects (starting with BT and ending with
ET) with formatting instructions inside. You can learn a lot about the structure
of a PDF file by stepping through this application.
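
To make the BT/ET structure described above concrete, here is a minimal sketch that pulls the literal strings out of an already inflated content stream. The names `extract_literals` and `is_delim` are hypothetical and not part of the article's code. The sketch assumes the content is NUL-terminated text, keeps an escaped character as-is, and does not handle nested parentheses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: whitespace or end of string counts as a delimiter. */
static int is_delim(char c)
{
    return c == ' ' || c == '\r' || c == '\n' || c == '\t' || c == '\0';
}

/* Copies the text of every ( ... ) string found between BT and ET into out.
   Returns the number of characters written, not counting the final NUL. */
static size_t extract_literals(const char *content, char *out, size_t outsize)
{
    size_t len, i, n = 0;
    int in_text = 0;    /* inside a BT ... ET text object */
    int in_string = 0;  /* inside a ( ... ) literal string */

    assert(content != NULL && out != NULL && outsize > 0);
    len = strlen(content);
    for (i = 0; i < len; i++) {
        char c = content[i];
        if (in_string) {
            if (c == '\\' && i + 1 < len) {
                c = content[++i];          /* simplification: keep escaped char */
            } else if (c == ')') {
                in_string = 0;
                continue;
            }
            if (n + 1 < outsize) out[n++] = c;
        } else if (in_text && c == '(') {
            in_string = 1;
        } else if (i + 1 < len && (i == 0 || is_delim(content[i - 1]))
                   && is_delim(content[i + 2])) {
            /* A two-letter token surrounded by delimiters. */
            if (c == 'B' && content[i + 1] == 'T') in_text = 1;
            else if (c == 'E' && content[i + 1] == 'T') in_text = 0;
        }
    }
    out[n] = '\0';
    return n;
}
```

Given a content stream such as `BT /F1 12 Tf (Hello) Tj ( World) Tj ET`, the function copies the two strings one after the other into the output buffer and ignores the font and show operators around them.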
About Code
This single source code file contains very simple, very basic C code. It
initially reads in the entire PDF file into one buffer and then repeatedly scans
for "stream" and "endstream" sections. It does not check which filter should be
applied and always assumes FlateDecode. (If it gets it wrong, usually no output
is generated for that section of the file, so it is not a big issue.) Once the
data stream is inflated (uncompressed), it is processed. During the processing,
the code searches for the BT and ET tokens that signify text objects. The
contents of each text object are processed to extract the text, and a guess is
made as to whether tabs or new line characters are needed.
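
The article's code does not check the filter, but if you want to skip streams that are not Flate-compressed, one possible approach is to look for the name /FlateDecode in the bytes of the object dictionary that precedes the "stream" keyword. The helper below is a hypothetical sketch, not part of pdf.c, and it assumes the caller has already located the dictionary bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if dict[0..len) contains the name
   /FlateDecode, 0 otherwise. memcmp is used rather than strstr because
   a PDF buffer may contain NUL bytes. */
static int dict_uses_flate(const char *dict, size_t len)
{
    static const char name[] = "/FlateDecode";
    const size_t nlen = sizeof(name) - 1;
    size_t i;

    assert(dict != NULL);
    for (i = 0; i + nlen <= len; i++) {
        if (memcmp(dict + i, name, nlen) == 0) return 1;
    }
    return 0;
}
```

A plain substring test like this also matches when the filter appears inside an array of several filters, in which case inflating alone would not be enough to recover the text.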
The code is far from complete and is not any sort of general utility class, but
it does demonstrate how you can extract the text yourself. It is enough to show
you how and get you going.
The code is, however, fully functional, so when it is applied to a PDF
document, it generally does a fair job of extracting the text. It has been
tested on several PDF files.
This code is supplied as is, no warranties. Use at your own risk.
Using The Code
The download contains one C file. To use it, create a simple Win32
console project and add the pdf.c file to the project. You also need to
go here (bless them!) and
download the free "zlib compiled DLL" zip file. Extract zdll.lib to your
project directory and add it as a project dependency (link against it). Also put
zlib1.dll, zconf.h and zlib.h in your project directory, and add the two
header files to the project.
Now, step through the application and note that the input PDF and output text
file names are hardwired at the start of the main function.
Future Enhancements
If there is enough interest, the author may consider uploading a release
version with a Windows interface. The code is quite good for extracting data
from tables in a form that can be readily imported into Excel, with the columns
preserved (because of the tabs that get added).
Code Snippets
Stream sections are initially located using:
size_t streamstart = FindStringInBuffer (buffer, "stream", filelen);
size_t streamend = FindStringInBuffer (buffer, "endstream", filelen);
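
The article does not show FindStringInBuffer itself. A minimal sketch of such a search might look like the following. The name `find_in_buffer`, the argument order, and the `NOT_FOUND` return value are assumptions made for illustration. The one important point is that the search must be length-based, because compressed PDF data contains NUL bytes and strstr would stop at the first one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NOT_FOUND ((size_t)-1)  /* assumed "no match" value */

/* Hypothetical stand-in for the article's FindStringInBuffer: returns the
   offset of the first occurrence of needle in buf[0..buflen), or NOT_FOUND. */
static size_t find_in_buffer(const char *buf, size_t buflen, const char *needle)
{
    size_t nlen, i;

    assert(buf != NULL && needle != NULL);
    nlen = strlen(needle);
    if (nlen == 0 || nlen > buflen) return NOT_FOUND;
    for (i = 0; i + nlen <= buflen; i++) {
        if (memcmp(buf + i, needle, nlen) == 0) return i;
    }
    return NOT_FOUND;
}
```

Note that "stream" is also a suffix of "endstream", so a scanning loop built on a search like this has to advance past each "endstream" before looking for the next "stream".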
And then once the data portion is identified, it is inflated as follows:
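    z_stream zstrm; ZeroMemory(&zstrm, sizeof(zstrm));
    // Input: the bytes between "stream" and "endstream".
    zstrm.avail_in = streamend - streamstart + 1;
    zstrm.avail_out = outsize;
    zstrm.next_in = (Bytef*)(buffer + streamstart);
    zstrm.next_out = (Bytef*)output;
    int rsti = inflateInit(&zstrm);
    if (rsti == Z_OK)
    {
        // Z_FINISH: try to inflate everything in a single call.
        int rst2 = inflate (&zstrm, Z_FINISH);
        if (rst2 >= 0)
        {
            size_t totout = zstrm.total_out;
            ProcessOutput(fileo, output, totout);
        }
        // Release zlib's internal state before the next stream is handled.
        inflateEnd(&zstrm);
    }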
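
The snippet above assumes the data portion has already been identified. In a PDF file, the "stream" keyword is followed by either a carriage return and line feed or a single line feed before the data begins, so the start of the data can be found as in the sketch below. The helper `stream_data_start` is hypothetical and is not taken from pdf.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given the offset of the "stream" keyword in buf,
   returns the offset of the first byte of stream data. */
static size_t stream_data_start(const char *buf, size_t buflen, size_t kw_start)
{
    size_t pos;

    assert(buf != NULL && kw_start + 6 <= buflen);
    pos = kw_start + 6;                 /* skip the six letters of "stream" */
    if (pos + 1 < buflen && buf[pos] == '\r' && buf[pos + 1] == '\n')
        pos += 2;                       /* CR LF end-of-line marker */
    else if (pos < buflen && buf[pos] == '\n')
        pos += 1;                       /* LF end-of-line marker */
    return pos;
}
```

The end of the data needs less care in this setting: inflate stops once it reaches the end of the zlib data, so an end-of-line marker left in front of "endstream" is simply not consumed.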
The main work gets done in the ProcessOutput function, which processes the
uncompressed stream to extract the text portion of any text object. It
looks as follows:
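void ProcessOutput(FILE* file, char* output, size_t len)
{
    bool intextobject = false;  // inside a BT ... ET text object?
    bool nextliteral = false;   // was the previous character an escaping backslash?
    int rbdepth = 0;            // round-bracket depth: 1 while inside ( ... )
    char oc[oldchar];           // the most recent oldchar characters seen
    int j=0;
    for (j=0; j<oldchar; j++) oc[j]=' ';

    for (size_t i=0; i<len; i++)
    {
        char c = output[i];
        if (intextobject)
        {
            if (rbdepth==0 && seen2("TD", oc))
            {
                // Positioning operator: guess between a new line and a tab.
                float num = ExtractNumber(oc,oldchar-5);
                if (num>1.0)
                {
                    fputc(0x0d, file);
                    fputc(0x0a, file);
                }
                if (num<1.0)
                {
                    fputc('\t', file);
                }
            }
            if (rbdepth==0 && seen2("ET", oc))
            {
                // End of the text object: finish the line.
                intextobject = false;
                fputc(0x0d, file);
                fputc(0x0a, file);
            }
            else if (c=='(' && rbdepth==0 && !nextliteral)
            {
                // Start of a string. A large number just before it is
                // taken as a gap and becomes a tab or a space.
                rbdepth=1;
                int num = ExtractNumber(oc,oldchar-1);
                if (num>0)
                {
                    if (num>1000.0)
                    {
                        fputc('\t', file);
                    }
                    else if (num>100.0)
                    {
                        fputc(' ', file);
                    }
                }
            }
            else if (c==')' && rbdepth==1 && !nextliteral)
            {
                // End of the string.
                rbdepth=0;
            }
            else if (rbdepth==1)
            {
                // Inside a string: output printable characters only.
                if (c=='\\' && !nextliteral)
                {
                    nextliteral = true;
                }
                else
                {
                    nextliteral = false;
                    // Compare as unsigned: with a signed char, c>=128 can never be true.
                    unsigned char uc = (unsigned char)c;
                    if ( ((uc>=' ') && (uc<='~')) || ((uc>=128) && (uc<255)) )
                    {
                        fputc(c, file);
                    }
                }
            }
        }

        // Shift the current character into the history buffer.
        for (j=0; j<oldchar-1; j++) oc[j]=oc[j+1];
        oc[oldchar-1]=c;
        if (!intextobject)
        {
            if (seen2("BT", oc))
            {
                intextobject = true;
            }
        }
    }
}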
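
ProcessOutput relies on oldchar, seen2 and ExtractNumber, none of which are shown here. The sketch below illustrates one way such helpers could work. It is not the author's implementation, and the names `WINDOW`, `window_ends_with_op` and `number_ending_at` are hypothetical. It assumes a 15-character history window, assumes an operator only counts when it has whitespace on both sides, and ignores any sign on the number.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define WINDOW 15  /* assumed history size; the article's oldchar is not shown */

static int is_pdf_space(char c)
{
    return c == ' ' || c == '\r' || c == '\n' || c == '\t';
}

/* Returns 1 if the window ends with: whitespace, op[0], op[1], whitespace. */
static int window_ends_with_op(const char *op, const char *win)
{
    assert(op != NULL && win != NULL);
    return win[WINDOW - 3] == op[0]
        && win[WINDOW - 2] == op[1]
        && is_pdf_space(win[WINDOW - 1])
        && is_pdf_space(win[WINDOW - 4]);
}

/* Parses the unsigned number whose last character is at or just before
   index last, scanning backwards. Returns -1.0 if no number is found. */
static double number_ending_at(const char *win, int last)
{
    char tmp[WINDOW + 1];
    int start, n;

    assert(win != NULL && last >= 0 && last < WINDOW);
    while (last >= 0 && is_pdf_space(win[last])) last--;   /* skip blanks */
    start = last;
    while (start >= 0 &&
           (isdigit((unsigned char)win[start]) || win[start] == '.'))
        start--;
    n = last - start;              /* the number occupies win[start+1..last] */
    if (n <= 0) return -1.0;
    memcpy(tmp, win + start + 1, (size_t)n);
    tmp[n] = '\0';
    return strtod(tmp, NULL);
}
```

With the window holding `    0 -14.4 TD ` (15 characters), the operator test matches "TD", and the number ending at index WINDOW-5 is read as 14.4. Under these assumptions that value would pass the `num>1.0` test in ProcessOutput and produce a new line.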