Introduction
PDFSharp is a very nice free library and a convenient alternative to iTextSharp, especially since the owners of the latter effectively restricted its license to non-commercial use. The current stable version of PDFSharp, 1.32, contains a known issue in the code responsible for converting 1-bit images to CCITT Fax Group 4 encoding. The issue (and the corresponding fix) is described here, and it appears to be fixed in version 1.50 of the library (currently in beta). But what if you need to fix a batch of PDFs that were created with the faulty PDFSharp version?
The issue is nasty because it affects only a fairly small fraction of images (e.g., scans), and its visible symptom is annoying: Adobe Reader throws an "Insufficient data for image" error and displays a blank image. A few other PDF readers open the 'broken' PDFs without any problem; nevertheless, it is useful to be able to fix the broken files.
Using the Code
I had to create a basic utility that fixes a given PDF file. Note that the PDFs were very simple: they contained only images, with no text or other objects. Still, the code below should give an idea of the approach.
The tricky part is reading the 'broken' image content. As mentioned above, a number of codecs are able to handle the broken bitstream, and another great library, LibTiff.NET, is one of them.
The approach is:
- Read the PDF file
- Iterate through the images in the file
- Decode the broken bitstream
- Encode the image correctly
- Save the corrected image into a PDF file again using a fixed version of PDFSharp
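The first two steps can be sketched as follows. This is a minimal sketch modelled on the well-known PDFsharp "export images" sample, not the code from the package: the class name `CcittImageFinder` and its methods are hypothetical, and it assumes that images are referenced directly from each page's /Resources dictionary and that /Filter is stored as a direct name or array.

```csharp
using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;

// Hypothetical helper: not part of PDFsharp or of the article's package.
static class CcittImageFinder
{
    // Yields every image XObject whose /Filter mentions /CCITTFaxDecode.
    public static IEnumerable<PdfDictionary> FindCcittImages(PdfDocument document)
    {
        foreach (PdfPage page in document.Pages)
        {
            PdfDictionary resources = page.Elements.GetDictionary("/Resources");
            PdfDictionary xObjects = resources?.Elements.GetDictionary("/XObject");
            if (xObjects == null) continue;

            foreach (PdfItem item in xObjects.Elements.Values)
            {
                // XObjects are normally stored as indirect references.
                var xObject = (item as PdfReference)?.Value as PdfDictionary;
                if (xObject == null) continue;
                if (xObject.Elements.GetString("/Subtype") != "/Image") continue;
                if (UsesCcitt(xObject.Elements["/Filter"])) yield return xObject;
            }
        }
    }

    // /Filter is either a single name or an array of names (stacked filters).
    static bool UsesCcitt(PdfItem filter)
    {
        if (filter is PdfName name) return name.Value == "/CCITTFaxDecode";
        if (filter is PdfArray array)
        {
            foreach (PdfItem entry in array.Elements)
                if (entry is PdfName n && n.Value == "/CCITTFaxDecode") return true;
        }
        return false;
    }
}
```

A document opened with `PdfReader.Open(path)` can be passed to `FindCcittImages`. For each returned dictionary, `Elements.GetInteger("/Width")`, `Elements.GetInteger("/Height")`, `Elements.GetInteger("/BitsPerComponent")` and `Stream.Value` supply the dimensions and the raw bytes that the decoding step needs.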
The package contains the PDFFixProcessor class, which does the decoding and fixing. Here are the 'clever' bits:
Code that decodes the faulty bitstream by wrapping it in a TIFF header and then reading it using the LibTiff.NET library:
...
// Wrap the raw CCITT Group 4 data in a single-strip TIFF file on disk.
string name = Path.GetTempFileName();
var tiff = Tiff.Open(name, "w");
tiff.SetField(TiffTag.IMAGEWIDTH, width);
tiff.SetField(TiffTag.IMAGELENGTH, height);
tiff.SetField(TiffTag.COMPRESSION, Compression.CCITTFAX4);
tiff.SetField(TiffTag.BITSPERSAMPLE, bitsPerComponent);
tiff.SetField(TiffTag.SAMPLESPERPIXEL, 1);
tiff.SetField(TiffTag.FAXMODE, FaxMode.CLASSF);
// Write the stream bytes as-is, without re-encoding them.
tiff.WriteRawStrip(0, stream, stream.Length);
tiff.Close();
...
// Read the image back row by row; LibTiff.NET copes with the faulty bitstream.
int i = 0, j = 0;
var tiff2 = Tiff.Open(name, "r");
int rowsize = tiff2.ScanlineSize();
int imageSize = destinationData.Stride * height;
byte[] destinationBuffer = new byte[imageSize];
byte[] scanline = new byte[rowsize];
int destinationIndex = 0;
for (i = 0; i < height; i++)
{
    // Each bitmap row starts on a Stride boundary, which may be wider than rowsize.
    destinationIndex = i * destinationData.Stride;
    var readResult = tiff2.ReadScanline(scanline, i);
    for (j = 0; j < rowsize; j++)
    {
        // Invert the bits while copying.
        destinationBuffer[destinationIndex] = (byte)~(scanline[j]);
        destinationIndex++;
    }
}
...
Also, note how the resulting bitmap is written straight into the memory buffer of a System.Drawing.Bitmap:
...
Bitmap destination = new Bitmap(width, height, PixelFormat.Format1bppIndexed);
destination.SetResolution(xres, yres);
BitmapData destinationData = destination.LockBits(
    new Rectangle(0, 0, destination.Width, destination.Height),
    ImageLockMode.WriteOnly, PixelFormat.Format1bppIndexed);
...
Marshal.Copy(destinationBuffer, 0, destinationData.Scan0, imageSize);
destination.UnlockBits(destinationData);
...
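The reason the copy loop indexes by Stride rather than by the TIFF scanline size is that GDI+ pads each bitmap row to a multiple of 4 bytes, whereas a 1-bit TIFF scanline occupies only as many bytes as the width requires. The following library-free sketch isolates that row copy. The names `ScanlinePacker`, `StrideFor1Bpp` and `CopyInverted` are made up for illustration, and the stride formula assumes a 1 bpp bitmap allocated by System.Drawing itself; in real code, the Stride reported by LockBits should be used.

```csharp
using System;

// Hypothetical helper illustrating the stride-aware, bit-inverting row copy.
static class ScanlinePacker
{
    // Row length in bytes of a 1 bpp bitmap, padded to a multiple of 4 bytes.
    public static int StrideFor1Bpp(int width)
    {
        return ((width + 31) / 32) * 4;
    }

    // Copies one decoded scanline into the given row of a stride-aligned
    // buffer, inverting every byte on the way. Padding bytes stay untouched.
    public static void CopyInverted(byte[] scanline, int rowSize,
                                    byte[] destination, int row, int stride)
    {
        if (rowSize > stride)
            throw new ArgumentException("Scanline is wider than the bitmap row.");

        int offset = row * stride;
        for (int j = 0; j < rowSize; j++)
        {
            destination[offset + j] = (byte)~scanline[j];
        }
    }
}
```

For a 10-pixel-wide image, for example, the TIFF scanline is 2 bytes long while the bitmap row is 4 bytes long, so row 1 starts at offset 4 in the destination buffer rather than at offset 2.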
The class also includes code that handles the combinations of image encodings generated by PDFSharp (e.g., some 1-bit images may be encoded using CCITT Fax Group 4 and bit compression on top of that). The class seems to handle these quite neatly, although some attributes are hard-coded (note: the set of PDFs I had to fix followed a certain pattern).
Points of Interest
Fortunately, it isn't necessary to know the intricacies of decoding and encoding CCITT Fax Group 4 bitstreams, as the libraries do that for us.
History
- 18/05/2015 - Source code uploaded