Data URI Image Extractor

Chad Z. Hower aka Kudzu

5 Jul 2011

This short article will show an easy way to extract HTML data URI images and convert the HTML to use external images.

Download source code - 3.51 KB

Introduction

HTML allows raw image data to be embedded directly in a page, so that separate image files are not needed. This can speed up HTTP transfers in many cases, but there are compatibility issues, especially with Internet Explorer up to and including version 8. This short article shows an easy way to extract these images and convert the HTML to use external images.

Data URI

Normally images are included using this syntax:

HTML

<img src="Image1.png">

The data URI syntax, however, allows images to be embedded directly, which reduces the number of HTTP requests and also allows the page to be saved to disk as a single file. While this article deals only with the img tag, the same method can be applied to other tags as well. Here is an example of data URI usage:

HTML

<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==">
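
To make the structure of the example above concrete, here is a minimal sketch of building and taking apart a Base64 data URI. It is not part of the downloadable project; the class name `DataUriSketch` and its two methods are hypothetical names chosen for illustration, and it only handles the `data:<mime type>;base64,<payload>` form shown above.

```csharp
using System;

static class DataUriSketch {
  // Builds a Base64 data URI from raw bytes, e.g. "data:image/png;base64,...".
  public static string ToDataUri(string aMimeType, byte[] aData) {
    return "data:" + aMimeType + ";base64," + Convert.ToBase64String(aData);
  }

  // Splits a Base64 data URI into its MIME type and decoded bytes.
  // Returns false if the text does not have the expected shape.
  public static bool TryParse(string aUri, out string aMimeType, out byte[] aData) {
    aMimeType = null;
    aData = null;
    const string xPrefix = "data:";
    const string xMarker = ";base64,";
    if (!aUri.StartsWith(xPrefix, StringComparison.Ordinal)) {
      return false;
    }
    int xMarkerIdx = aUri.IndexOf(xMarker, StringComparison.Ordinal);
    if (xMarkerIdx == -1) {
      return false;
    }
    try {
      // FromBase64String ignores white space, so a payload that is
      // wrapped over several lines decodes without extra work.
      aData = Convert.FromBase64String(aUri.Substring(xMarkerIdx + xMarker.Length));
    } catch (FormatException) {
      return false;
    }
    aMimeType = aUri.Substring(xPrefix.Length, xMarkerIdx - xPrefix.Length);
    return true;
  }
}
```

Calling `DataUriSketch.TryParse` on the value of the src attribute above should yield the MIME type `image/png` and the raw PNG bytes, which start with the byte 0x89 followed by the letters PNG.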

More information on data URIs is available in RFC 2397, which defines the data URL scheme.

SeaMonkey 2.1

Most editors do not use the data URI syntax. However, starting with SeaMonkey 2.1 Composer (the Mozilla HTML editor), images that are dragged and dropped are imported using this syntax. In my opinion this is quite a bad change, especially since it is not obvious and differs from the behavior in 2.0. In my case, I built a large HTML file with over 50 images before I discovered that it was embedding them instead of linking them.

Utilities

Amazingly, there are quite a few online utilities that convert images to the data URI format, but I could not find any that do the reverse. Because I did not want to hand-edit my document, I wrote a quick utility to extract the images to disk and change the HTML to use external images. This allows the document to be loaded by any standard browser, including Internet Explorer 8.

About the Source Code

The source code is targeted at my specific need and has a lot of limitations. I have published it anyway, so that it is available as a foundation for you to expand should you have the same need.

The parsing is very basic, but works fine with SeaMonkey output.
It only supports PNG format currently.
There is no exception handling.
The code has not been optimized in any way.
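
As one possible way to lift the PNG-only limitation, the sketch below uses a regular expression to pick up the image subtype along with the Base64 payload. It is an illustration under stated assumptions, not the code from the download: the names `MultiFormatSketch`, `ExtensionFor` and `Extract` are hypothetical, the subtype-to-extension mapping is a guess that covers only a few common cases, and only double-quoted src attributes are matched. Unlike the published program, it does not check for existing files and does not write anything to disk.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class MultiFormatSketch {
  // Matches src="data:image/<subtype>;base64,<payload>" (double quotes only).
  // The payload class allows white space because editors may wrap long lines.
  static readonly Regex sDataUri = new Regex(
    "src=\"data:image/(?<type>[a-zA-Z0-9.+-]+);base64,(?<b64>[A-Za-z0-9+/=\\s]+)\"");

  // Hypothetical mapping from MIME subtype to file extension.
  static string ExtensionFor(string aSubtype) {
    switch (aSubtype.ToLowerInvariant()) {
      case "jpeg": return ".jpg";
      case "svg+xml": return ".svg";
      default: return "." + aSubtype.ToLowerInvariant();
    }
  }

  // Replaces each embedded image with a generated file name and stores the
  // decoded bytes in aImages, keyed by that name. Returns the rewritten HTML.
  public static string Extract(string aHtml, IDictionary<string, byte[]> aImages) {
    int xIdx = 0;
    return sDataUri.Replace(aHtml, m => {
      xIdx++;
      string xName = "Image" + xIdx.ToString("0000")
                   + ExtensionFor(m.Groups["type"].Value);
      aImages[xName] = Convert.FromBase64String(m.Groups["b64"].Value);
      return "src=\"" + xName + "\"";
    });
  }
}
```

The caller would then loop over the dictionary and write each entry with `File.WriteAllBytes`, ideally after the same "does this name already exist" check that the published program performs.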

Usage

ImageExtract is a console application that accepts one parameter: the path of the HTML file to process. The images are written to the same directory as the input file, and the new HTML file gets a -New suffix. So if the input is index.html, the output HTML will be index-New.html.
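
The naming rule can be sketched in isolation as follows. The published program does no argument checking, so the `CheckArgs` helper here is purely a hypothetical addition, as are the class name `UsageSketch` and the wording of the messages; `GetDestPathname` mirrors how the listing in this article derives the output name by appending "-New.html".

```csharp
using System;
using System.IO;

static class UsageSketch {
  // Derives the output path: same directory, same base name,
  // with "-New.html" appended in place of the original extension.
  public static string GetDestPathname(string aSrcPathname) {
    string xPath = Path.GetDirectoryName(aSrcPathname) ?? "";
    return Path.Combine(xPath,
      Path.GetFileNameWithoutExtension(aSrcPathname) + "-New.html");
  }

  // Returns null when the arguments are usable, otherwise a message
  // that the caller could print before exiting.
  public static string CheckArgs(string[] aArgs) {
    if (aArgs.Length != 1) {
      return "Usage: ImageExtract <file.html>";
    }
    if (!File.Exists(aArgs[0])) {
      return "File not found: " + aArgs[0];
    }
    return null;
  }
}
```

A `Main` method could call `CheckArgs` first and only go on to read the file when it returns null.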

Source Code

I have made the project available for download, but it is quite simple. It is a C# .NET Console application. For easy viewing, here is the class:

C#

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace ImageExtract {
  class Program {
    // NOTE - This program is rough and dirty - I designed
    // it to accomplish an urgent task. I have not built in
    // normal error handling etc.
    //
    // It also has not been optimized at all
    // and certainly is not very efficient.
    //
    // It also assumes all images are png files.
    static void Main(string[] aArgs) {
      string xSrcPathname = aArgs[0];
      string xPath = Path.GetDirectoryName(xSrcPathname);
      string xDestPathname = Path.Combine(xPath, 
             Path.GetFileNameWithoutExtension(xSrcPathname) + "-New.html");
      int xImgIdx = 0;

      Console.WriteLine("Processing " + Path.GetFileName(xSrcPathname));
      string xSrc = File.ReadAllText(xSrcPathname);
      var xDest = new StringBuilder();

      string xStart = @"data:image/png;base64,";
      string xB64;
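      int x = 0; // Index where the current data URI starts
      int y = 0; // Index of the quote that closes the src attribute
      int z = 0; // Index where the next search and copy begins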
      do {
        x = xSrc.IndexOf(xStart, z);
        if (x == -1) {
          break;
        }
        // Write out preceding HTML
        xDest.Append(xSrc.Substring(z, x - z));

        // Get the Base64 string
        y = xSrc.IndexOf('"', x + 1);
        xB64 = xSrc.Substring(x + xStart.Length, y - x - xStart.Length);
        // Convert the Base64 string to binary data
        byte[] xImgData = System.Convert.FromBase64String(xB64);

        string xImgName;
        // Get Image name and replace it in the HTML
        // We don't want to overwrite images that might already exist on disk,
        // so cycle till we find a non used name
        do {
          xImgIdx++;
          xImgName = "Image" + xImgIdx.ToString("0000") + ".png";
        } while (File.Exists(Path.Combine(xPath, xImgName)));

        Console.WriteLine("Extracting " + xImgName);

        // Write image name into HTML
        xDest.Append(xImgName);
        // Write the binary data to disk
        File.WriteAllBytes(Path.Combine(xPath, xImgName), xImgData);

        z = y;
      } while (true);
      // Write out remaining HTML
      xDest.Append(xSrc.Substring(z));

      // Write out result
      File.WriteAllText(xDestPathname, xDest.ToString());
      Console.WriteLine("Output to " + Path.GetFileName(xDestPathname));
    }
  }
}

History

5th July, 2011: Initial version

License

This article, along with any associated source code and files, is licensed under The BSD License.