
Fix CCITT Fax G4 bi-level Images Broken by PDFSharp

Mikolaj Barwicki


20 May 2015

Code to fix PDFs containing 1-bit CCITT Fax Group 4 images broken by PDFSharp

Download source code from GitHub

Introduction

PDFSharp is a very nice free library and a convenient alternative to iTextSharp, especially after the owners of the latter effectively restricted its license to non-commercial use. The current stable version of PDFSharp, 1.32, contains a known issue in the code that converts 1-bit images to CCITT Fax Group 4 encoding. The issue (and the corresponding fix) is described here, and it appears to be fixed in version 1.50 of the library (currently in beta). But what if you need to fix a bunch of PDFs that were created with the faulty PDFSharp version?

The issue is nasty because it shows up in only a fairly small fraction of images (e.g., scans). Its most annoying visible symptom is that Adobe Reader throws an "Insufficient data for image" error and displays a blank image. A few other PDF readers open the 'broken' PDFs without problems; nevertheless, it is useful to be able to fix the broken files.

Using the Code

I had to create a basic utility that fixes a given PDF file. Note that my PDFs were very simple: they contained only images, with no text or other objects. Still, the code below should give an idea of the approach.

The tricky part is to read the 'broken' image content. As mentioned above, a number of codecs are actually able to handle the broken bitstream, and another great library - LibTiff.NET - is one of them.

The approach is:

1. Read the PDF file
2. Iterate through the images in the file
3. Decode the broken bitstream
4. Encode the image correctly
5. Save the corrected image into a PDF file again using a fixed version of PDFSharp
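The first two steps can be sketched as follows. This is a minimal illustration based on the dictionary-walking pattern from the PDFSharp image export sample, not the code from the package; the names CcittImageFinder, Find and FilterNames are hypothetical, and the sketch assumes that images are referenced directly from each page's /Resources /XObject dictionary (as in simple, image-only PDFs).

```csharp
using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;

// Hypothetical helper, not part of the article's package.
static class CcittImageFinder
{
    // Returns every image XObject whose /Filter entry mentions /CCITTFaxDecode.
    public static List<PdfDictionary> Find(PdfDocument document)
    {
        var found = new List<PdfDictionary>();
        foreach (PdfPage page in document.Pages)
        {
            PdfDictionary resources = page.Elements.GetDictionary("/Resources");
            PdfDictionary xObjects = resources?.Elements.GetDictionary("/XObject");
            if (xObjects == null) continue;

            foreach (PdfItem item in xObjects.Elements.Values)
            {
                // XObjects are normally stored as indirect references.
                var image = (item as PdfReference)?.Value as PdfDictionary;
                if (image == null) continue;
                if (image.Elements.GetString("/Subtype") != "/Image") continue;
                if (FilterNames(image).Contains("/CCITTFaxDecode")) found.Add(image);
            }
        }
        return found;
    }

    // /Filter may be a single name or an array of names (a filter chain).
    public static List<string> FilterNames(PdfDictionary image)
    {
        var names = new List<string>();
        PdfItem filter = image.Elements["/Filter"];
        if (filter is PdfName single)
        {
            names.Add(single.Value);
        }
        else if (filter is PdfArray array)
        {
            foreach (PdfItem entry in array.Elements)
                if (entry is PdfName name) names.Add(name.Value);
        }
        return names;
    }
}
```

A caller would open the file with PdfReader.Open(path, PdfDocumentOpenMode.ReadOnly) from PdfSharp.Pdf.IO, pass the document to Find, and then read each image's Elements.GetInteger("/Width"), Elements.GetInteger("/Height") and Stream.Value to obtain the dimensions and the raw (still encoded) bytes for the decoding step.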

The package contains the PDFFixProcessor class, which does the decoding and fixing. Here are the 'clever' bits:

Code that decodes the faulty bitstream by wrapping it in a TIFF header and then reading it using the LibTiff.NET library:

...

// Generate temp file name
string name = Path.GetTempFileName();

// Stream the bits into a tiff file so that they can be decoded later
// I know this looks cheesy, but I don't have time to do it in-memory...
var tiff = Tiff.Open(name, "w");
// Pass the numeric values directly; there is no need to convert them to strings
tiff.SetField(TiffTag.IMAGEWIDTH, width);
tiff.SetField(TiffTag.IMAGELENGTH, height);
tiff.SetField(TiffTag.COMPRESSION, (int)Compression.CCITTFAX4);
tiff.SetField(TiffTag.BITSPERSAMPLE, bitsPerComponent);
tiff.SetField(TiffTag.SAMPLESPERPIXEL, 1);

tiff.SetField(TiffTag.FAXMODE, (int)FaxMode.CLASSF);

tiff.WriteRawStrip(0, stream, stream.Length);
tiff.Close();

...

int i = 0, j = 0;

var tiff2 = Tiff.Open(name, "r");

int rowsize = tiff2.ScanlineSize();
int imageSize = destinationData.Stride * height;
byte[] destinationBuffer = new byte[imageSize];
byte[] scanline = new byte[rowsize];

int destinationIndex = 0;

for (i = 0; i < height; i++)
{
    destinationIndex = i * destinationData.Stride;

    var readResult = tiff2.ReadScanline(scanline, i);

    for (j = 0; j < rowsize; j++)
    {
        destinationBuffer[destinationIndex] =
          (byte)~(scanline[j]); // "not" in order to get correct colors
        destinationIndex++;
    }
}
...
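The loop above works because each row of a 1 bpp System.Drawing.Bitmap is padded to a multiple of 4 bytes, so the bitmap's Stride is at least as large as the TIFF scanline size. The following sketch isolates that arithmetic in a library-free form; ScanlinePacker and its methods are hypothetical names used for illustration only, and the stride formula assumes the usual 4-byte row alignment of GDI+ bitmaps.

```csharp
using System;

// Hypothetical helper, not part of the article's package.
static class ScanlinePacker
{
    // Bytes per row of a 1 bpp bitmap when rows are padded to 4 bytes.
    public static int Stride1Bpp(int width)
    {
        return ((width + 31) / 32) * 4;
    }

    // Copies rows of rowSize bytes into a buffer that uses stride bytes
    // per row, inverting every copied byte; padding bytes stay zero.
    public static byte[] InvertIntoStride(byte[][] rows, int rowSize, int stride)
    {
        if (stride < rowSize)
            throw new ArgumentException("stride must be >= rowSize");

        var destination = new byte[stride * rows.Length];
        for (int y = 0; y < rows.Length; y++)
        {
            int offset = y * stride;
            for (int x = 0; x < rowSize; x++)
            {
                // XOR with 0xFF flips all eight pixels, like (byte)~value.
                destination[offset + x] = (byte)(rows[y][x] ^ 0xFF);
            }
        }
        return destination;
    }
}
```

For example, a 10-pixel-wide image needs 2 bytes per TIFF scanline but 4 bytes per bitmap row, which is why the destination index is reset to i * Stride at the start of every row instead of simply running on.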

Also, note how the resulting bitmap is written straight into a memory buffer of a System.Drawing.Bitmap:

...
// Create destination bitmap
Bitmap destination = new Bitmap(width, height, PixelFormat.Format1bppIndexed);
destination.SetResolution(xres, yres);

// Lock destination bitmap in memory
BitmapData destinationData = destination.LockBits(
    new Rectangle(0, 0, destination.Width, destination.Height),
    ImageLockMode.WriteOnly, PixelFormat.Format1bppIndexed);

...

// Copy binary image data to destination bitmap
Marshal.Copy(destinationBuffer, 0, destinationData.Scan0, imageSize);

// Unlock destination bitmap
destination.UnlockBits(destinationData);
...

The class also includes code that handles the combinations of image encodings generated by PDFSharp (e.g., some 1-bit images may be encoded using CCITT Fax Group 4 with bit compression on top of that). The class seems to handle these cases quite neatly, although some attributes are hard-coded (note: I had to fix a set of PDFs that followed a certain pattern).
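A layered encoding of this kind can be unwrapped before the Group 4 data is handed to the decoder. The sketch below is an illustration only: it assumes that the extra layer is the standard PDF /FlateDecode filter (the article does not confirm which filter is involved), that the filter names are listed in decoding order as the PDF format specifies, and that the code runs on .NET 6 or later, where System.IO.Compression.ZLibStream is available. FilterChain, Inflate and UnwrapToCcitt are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, not part of the article's package.
static class FilterChain
{
    // Inflates zlib-wrapped data, the format used by /FlateDecode.
    public static byte[] Inflate(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var zlib = new ZLibStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            zlib.CopyTo(output);
            return output.ToArray();
        }
    }

    // Undoes every filter listed before /CCITTFaxDecode and returns
    // the raw Group 4 bitstream, which is left encoded.
    public static byte[] UnwrapToCcitt(byte[] data, IReadOnlyList<string> filters)
    {
        foreach (string filter in filters)
        {
            if (filter == "/CCITTFaxDecode") return data;
            if (filter == "/FlateDecode") data = Inflate(data);
            else throw new NotSupportedException("Unhandled filter: " + filter);
        }
        throw new InvalidDataException("No /CCITTFaxDecode filter found.");
    }
}
```

Throwing on an unknown filter, instead of silently passing the data through, keeps a file that does not follow the expected pattern from being "fixed" into something worse.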

Points of Interest

Fortunately, it isn't mandatory to know the intricacies of the mechanism to decode/encode CCITT Fax Group 4 bitstreams as the libraries do that for us.

History

18/05/2015 - Source code uploaded

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
