Introduction
I have always wondered about Adobe Reader and PDF files. Have you ever tried to open a PDF file in a text editor? It's amazing! In this project I try to bring the hidden things behind PDF files to light. This simple application lets you create PDF files just as you create txt files in Notepad (hence the name pdfpad). Type your text in the editor and save it as a PDF file. Of course, you need Acrobat Reader to view the created PDF file. You cannot open an existing PDF file in this editor; you can only create one, and once it is created, it's done. The greatest feature of this project is the digital signature: pdfpad.exe automatically adds an invisible digital signature to every PDF file it creates, which teaches you the very basics of signing a PDF.
Due to lack of time, many of the details couldn't be included. Please bear with us.
Background
First, you should know the basics of the PDF format. I recommend downloading a copy of the PDF reference manual from PDF Reference and going through it (oops, it's 1000 pages!).
Download the application demo, enter some text, and save the file as PDF. Now open the file in Notepad and read on...
If one says C++ is object oriented, I would say PDF is more object oriented. In a PDF everything is treated as an object, and every object has its own properties and can refer to other objects. This lets a viewer navigate large PDF files (say, a 1000-page book you just downloaded) randomly and quickly.
A PDF file is read from the end. There is a token called startxref, and this is where everything begins. A viewer application reads this entry to get the byte offset of a table called xref. The table lists the objects used in the file along with their byte offsets within the PDF file. The format of the entries matters greatly here: each entry must be exactly 20 bytes long, including the carriage return and the line feed.
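As a rough illustration of that first step, here is a minimal sketch in standard C++ (not part of pdfpad) that searches backwards for startxref and reads the offset that follows it. The function name findXrefOffset is my own invention, and the sketch assumes the whole file is already loaded into a std::string.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper (not from pdfpad): returns the byte offset written
// after the last "startxref" keyword, or -1 if it cannot be found.
long findXrefOffset(const std::string& pdf)
{
    const std::string keyword = "startxref";

    // Search backwards, because the keyword sits near the end of the file.
    std::string::size_type pos = pdf.rfind(keyword);
    if (pos == std::string::npos)
        return -1;
    pos += keyword.size();

    // Skip the end-of-line characters between the keyword and the number.
    while (pos < pdf.size() &&
           (pdf[pos] == '\r' || pdf[pos] == '\n' || pdf[pos] == ' '))
        ++pos;

    // The offset must start with a decimal digit.
    if (pos >= pdf.size() || pdf[pos] < '0' || pdf[pos] > '9')
        return -1;

    long offset = std::strtol(pdf.c_str() + pos, nullptr, 10);
    assert(offset >= 0); // a digit was seen, so the value cannot be negative
    return offset;
}
```

A viewer would then seek to the returned offset and expect to find the xref keyword there.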
Objects are numbered sequentially from 0 to n (though this is not required). If you look at the xref entry you will find a "0" and a number n. This means that the table contains n objects starting from 0. Take a look at one of them: 0000000074 is the byte offset, 00000 is the generation number, and n means the object is in use. Only the first entry has a generation number that is not zero, and it is marked f. Read the reference manual for more details.
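To make the 20-byte rule concrete, here is a small sketch in standard C++ that formats one table entry. The helper name formatXrefEntry is hypothetical; pdfpad itself builds these lines with CString::Format.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper (not from pdfpad): formats one cross-reference entry
// as 10-digit offset, space, 5-digit generation, space, 'n' or 'f', CR, LF.
std::string formatXrefEntry(unsigned long offset, unsigned generation, bool inUse)
{
    assert(generation <= 65535); // the generation number must fit in 5 digits

    char buf[32];
    std::snprintf(buf, sizeof(buf), "%010lu %05u %c\r\n",
                  offset, generation, inUse ? 'n' : 'f');

    std::string entry(buf);
    assert(entry.size() == 20); // 10 + 1 + 5 + 1 + 1 + 2 bytes
    return entry;
}
```

Calling formatXrefEntry(74, 0, true) yields the in-use entry discussed above, and formatXrefEntry(0, 65535, false) yields the free entry that opens the table.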
A PDF document can be regarded as a hierarchy of objects contained in the body section of a PDF file. At the root of the hierarchy is the document's catalog dictionary. Most of the objects in the hierarchy are dictionaries. Each page of the document is represented by a page object, which is a dictionary that includes references to the page contents and other attributes such as its thumbnail image and any annotations associated with it. The individual page objects are tied together in a structure called the page tree, which in turn is located via an indirect reference in the document catalog.
The root of a document object hierarchy is the catalog dictionary, located via the Root entry in the trailer of the PDF file. The catalog contains references to other objects that define the document's contents, outline, article threads, named destinations and other attributes.
To start with, the reader reads the value of the Root entry in the trailer. This is the root of all the references that are to be made. The reader then looks up the byte offset of the root object and moves to it. The root object is a catalog dictionary, which in turn contains many other references. Our application writes only the minimum entries so that the file is easy to understand.
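The lookup just described can be sketched in a few lines of standard C++. Both helper names are hypothetical, and the sketch assumes the xref offsets have already been read into a vector indexed by object number; a real reader would parse the trailer dictionary properly rather than search for a substring.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the object number named by the /Root entry
// of a trailer dictionary, or -1 if no such entry is found.
int rootObjectNumber(const std::string& trailer)
{
    const std::string key = "/Root";
    std::string::size_type pos = trailer.find(key);
    if (pos == std::string::npos)
        return -1;
    pos += key.size();

    // Skip spaces between the key and the object number.
    while (pos < trailer.size() && trailer[pos] == ' ')
        ++pos;
    if (pos >= trailer.size() || trailer[pos] < '0' || trailer[pos] > '9')
        return -1;

    // Read the decimal object number, e.g. the 1 in "1 0 R".
    int number = 0;
    while (pos < trailer.size() && trailer[pos] >= '0' && trailer[pos] <= '9')
        number = number * 10 + (trailer[pos++] - '0');
    return number;
}

// Hypothetical helper: looks up an object's byte offset in a table that was
// read from the xref section (index = object number). Returns -1 if absent.
long objectOffset(const std::vector<long>& xrefOffsets, int objectNumber)
{
    assert(objectNumber >= 0);
    if (static_cast<std::size_t>(objectNumber) >= xrefOffsets.size())
        return -1;
    return xrefOffsets[objectNumber];
}
```

With the trailer that pdfpad writes (Root 1 0 R), rootObjectNumber returns 1, and objectOffset then gives the position of the catalog dictionary.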
Now let's see what happens to the text that we enter in the edit box. First, every end of line is replaced with the PDF operators for a line feed. Then all the operators for showing the text on the page are added to the page's contents. This content is added as a stream, which is called a content stream. To compress the text I have used zlib, a freely downloadable library. The text is compressed with the Flate (deflate) algorithm, which is supported by the Adobe viewer.
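The end-of-line replacement can be sketched as follows in standard C++. The helper name textToOperators is hypothetical. Unlike pdfpad, this sketch also escapes parentheses and backslashes, because an unescaped or unbalanced parenthesis in the text would otherwise end the PDF string early; that escaping is my addition, not something the application does.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from pdfpad): converts editor text into PDF
// text-showing operators. Each CR LF closes the current string, shows it
// with Tj, moves to the next line with T*, and opens a new string.
std::string textToOperators(const std::string& text)
{
    std::string out = "(";
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (c == '\r' && i + 1 < text.size() && text[i + 1] == '\n') {
            out += ")Tj T*(";
            ++i; // skip the '\n' of the CR LF pair
        } else if (c == '(' || c == ')' || c == '\\') {
            // Assumption beyond pdfpad: escape characters that are special
            // inside a PDF literal string.
            out += '\\';
            out += c;
        } else {
            out += c;
        }
    }
    out += ")Tj";
    assert(out.size() >= 4); // even empty input produces "()Tj"
    return out;
}
```

The result is what gets placed between BT and ET in the content stream before compression.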
The most amazing thing is the digital signature. I haven't implemented a real-life digital signature using cryptographic libraries. All I intend to show is how to add a digital signature to the PDF document. The entries here are all dummy entries. The signature can be made a real digital signature if you replace the Contents entry in the signature dictionary with the real signed hash of the document.
I won't be covering the details of digital signatures here; I will stick to the details of PDF. PDF has two types of digital signatures, invisible and visible. Our application uses invisible signatures. The signature can be viewed in the signature panel. The entries in the signature dictionary can be changed programmatically to put in your name, the time of signing, the location, etc., using the user's input. This is left to you.
When a digital signature is added to a document, Adobe Acrobat's signature handler calculates a checksum that is based on the content of the document at that time and embeds the checksum in the signature. When the signature is validated, the handler recalculates the checksum for that signed version of the document and compares it with the value in the signature. If the signed version has changed in any way, the signature handler detects the change and marks the signature as invalid.
You can also use the Crypto API to create the hash, sign using digital certificates, etc., which I hope to cover in my next article. While creating the hash, the byte range must be specified correctly. The byte range is an array of integer pairs, each pair giving a starting offset and a number of bytes. The byte range array is used to exclude the Contents entry of the signature dictionary from the hash. This entry is initially filled with a temporary value so that the total file size is known when the hash is calculated. After the hash is created, the real Contents value is written. This is why the byte range must exclude the Contents entry: otherwise the hash would cover the signature itself, and verification would fail.
Once you get a grip on the reference manual you can modify the code below to add more pages, add drawings, etc.
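Here is a minimal sketch of the byte range arithmetic in standard C++. The helper names are hypothetical, and the sketch assumes you already know where the Contents value (including its angle brackets) starts and how long it is. It computes no real hash; it only shows which bytes a hash would cover.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: returns the four ByteRange integers
// [0, lengthBefore, startAfter, lengthAfter] that cover the whole file
// except the Contents value.
std::vector<long> computeByteRange(const std::string& file,
                                   std::string::size_type contentsStart,
                                   std::string::size_type contentsLength)
{
    assert(contentsStart + contentsLength <= file.size());

    const long firstLength  = static_cast<long>(contentsStart);
    const long secondStart  = static_cast<long>(contentsStart + contentsLength);
    const long secondLength = static_cast<long>(file.size()) - secondStart;
    return {0L, firstLength, secondStart, secondLength};
}

// Hypothetical helper: concatenates the two ranges, i.e. the bytes that a
// real signer would feed to the hash function.
std::string bytesToHash(const std::string& file, const std::vector<long>& range)
{
    assert(range.size() == 4);
    return file.substr(range[0], range[1]) + file.substr(range[2], range[3]);
}
```

For a toy file "AAAA<1234>BBB" with the Contents value at offset 4 and length 6, the byte range is [0 4 10 3] and the hashed bytes are "AAAABBB".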
Using the code
The main function that creates the PDF files is added to the ***doc class; it's called CreatePdf(CString text). I enjoy manipulating CString objects rather than using char buffers. You can modify the code accordingly to make it more efficient. The code is well commented to explain the details.
This is the part of the Doc class that should be modified to write the files in PDF format:
void CPdfPadDoc::Serialize(CArchive& ar)
{
if(ar.IsStoring())
{
CString strFull;
CEdit &edit =((CEditView*)m_viewList.GetHead())->GetEditCtrl();
edit.GetWindowText(strFull);
ar.WriteString(CreatePdf(strFull));
ar.Flush();
}
}
The main function written in the doc class is CreatePdf(). It takes the text and returns the formatted PDF to be written to the file:
CString CPdfPadDoc::CreatePdf(CString text)
{
text.Replace("\r\n",")Tj T*(");
int objArray[10];
int fontSize=1;
int hPos=50;
int vPos=750;
CString fileBuff;
CString header="%PDF-1.5\r%\xC3\xBE\r\n";
fileBuff=header;
objArray[0]=fileBuff.GetLength();
CString catalog="1 0 obj<</Pages "
"2 0 R/Type /Catalog/AcroForm 6 0 R>>\nendobj\r";
fileBuff+=catalog;
objArray[1]=fileBuff.GetLength();
CString pageTree="2 0 obj<</Count 1/"
"Kids [3 0 R]/Type /Pages>>\nendobj\r";
fileBuff+=pageTree;
objArray[2]=fileBuff.GetLength();
CString page="3 0 obj<</Annots[7 0 R]/"
"Contents [5 0 R]/Type /Page/Parent 2 0 R/Rotate 0/"
"MediaBox[0 0 612 792]/CropBox[0 0 612 792]/"
"Resources<</Font<</T1_0 4 0 R>>/"
"ProcSet[/PDF/Text]>>>>\nendobj\n";
fileBuff+=page;
objArray[3]=fileBuff.GetLength();
CString font="4 0 obj<</Type/Font/BaseFont/"
"Times-Roman/Subtype/Type1>>\nendobj\n";
fileBuff+=font;
objArray[4]=fileBuff.GetLength();
CString stream;
stream.Format("%s%d%s%d%s%d%s%s%s","0 g\r1 i \rBT\r/T1_0 ",
fontSize," Tf\r0 Tc 0 Tw 0 Ts 100 Tz 0 Tr 1.2 TL 12 0 0 12 ",
hPos," ",vPos," Tm \rT* (",text,")Tj \rET");
CString compressedStream=FlateCompress(stream);
int len=compressedStream.GetLength();
CString contents;
contents.Format("%s%d%s%s%s",
"5 0 obj<</Filter /FlateDecode/Length ",len,
">>stream\r\n",compressedStream,
"\r\nendstream\rendobj\n");
fileBuff+=contents;
objArray[5]=fileBuff.GetLength();
CString acroForm;
acroForm="6 0 obj<</Fields[7 0 R]/SigFlags 3/"
"DA(/Helv 0 Tf 0 g )>>\nendobj\n";
fileBuff+=acroForm;
objArray[6]=fileBuff.GetLength();
CString annotation;
annotation="7 0 obj<</Type /Annot/Subtype /"
"Widget/FT /Sig/Rect[0 0 0 0]/P 3 0 R/"
"T(signature)/V 8 0 R/MK<<>>>>\n"
"endobj\n";
fileBuff+=annotation;
objArray[7]=fileBuff.GetLength();
CString sign;
sign="8 0 obj<</Type /Sig/Filter/ICM.SignDoc/Contents";
int byteRange[2];
fileBuff+=sign;
byteRange[0]=fileBuff.GetLength();
//This is just a dummy signature. Actually it should be
//taken after using cryptographic library.
//In the next version...
CString signature="<AE423B23FE56>";
fileBuff+=signature;
byteRange[1]=fileBuff.GetLength();
//We don't know the actual byte range yet, i.e. the end of file.
//Therefore it is a dummy entry now.
//We will replace it after we get the length of file.
sign.Format("%s%d%s%d%s","/ByteRange [0 ",byteRange[0],
" ",byteRange[1],"XXX]/Name(Shiraz)/"
"M(D:20040524100433+05'30')/"
"Location(Cordiant)/Reason(ICM Library)"
"/Date(Nov 3 200314:27:40)>>\nendobj\n");
fileBuff+=sign;
/*This table will contain the objects used in the
file and their byte offsets*/
objArray[8]=fileBuff.GetLength();
CString xref;
/*Please look in article*/
xref.Format("%s%d%s","xref\r\n0 ",9,
"\r\n0000000000 65535 f\r\n");
int numObj=8;
CString offsets;
for(int i=0;i<numObj;i++)
{
//This field should be 20 bytes long.
offsets.Format("%0.10d",objArray[i]);
xref+=offsets+" 00000 n\r\n";
}
fileBuff+=xref;
CString trailer;
trailer.Format("%s%d%s","trailer\n<</Size 9/"
"Root 1 0 R/ID[<5181383ede94727bcb32ac27ded71c68"
"><5181383ede94727bcb32ac27ded71c68>]>>\r\n"
"startxref\r\n",objArray[8],"\r\n%%EOF\r\n");
fileBuff+=trailer;
/*We have finished with the pdf file.One thing
left is the actual byte range.*/
CString byteRangeEnd;
byteRangeEnd.Format("%d",fileBuff.GetLength()-byteRange[1]);
fileBuff.Replace("XXX",byteRangeEnd);
/*return the final string*/
return fileBuff;
}
This method compresses the content stream with zlib. For how to use the DLL, see zlib.dll.
CString CPdfPadDoc::FlateCompress(CString inputStream)
{
CMemFile *pInput=new CMemFile();
CMemFile *pOutput=new CMemFile();
z_stream zstream;
memset(&zstream,0,sizeof(z_stream));
DWORD inputLength=inputStream.GetLength();
//Write the text straight from the CString; no separate copy is needed.
pInput->Write((LPCTSTR)inputStream,inputLength);
pInput->SeekToBegin();
BYTE zBufIn[20000];
BYTE zBufOut[4000];
deflateInit(&zstream, Z_DEFAULT_COMPRESSION);
int error = Z_OK;
while ( TRUE )
{
UINT cbRead = 0;
cbRead = pInput->Read(zBufIn, sizeof(zBufIn));
if ( cbRead == 0 )
break;
zstream.next_in = (Bytef*)zBufIn;
zstream.avail_in = (uInt)cbRead;
while ( TRUE )
{
zstream.next_out = (Bytef*)zBufOut;
zstream.avail_out = sizeof(zBufOut);
error = deflate(&zstream, Z_SYNC_FLUSH);
if (error != Z_OK)
break;
UINT cbWrite = sizeof(zBufOut) - zstream.avail_out;
if ( cbWrite == 0 )
break;
pOutput->Write(zBufOut, cbWrite);
if ( zstream.avail_out != 0 )
break;
}
}
error = deflateEnd(&zstream);
DWORD szOutBuff=pOutput->GetLength();
char *outBuffer=(char*)malloc(szOutBuff);
pOutput->SeekToBegin();
pOutput->Read(outBuffer,szOutBuff);
//Pass the length explicitly: compressed data may contain zero bytes.
CString outStream(outBuffer,szOutBuff);