Introduction
In HTML, it is actually possible to embed raw image data directly into HTML so that separate image files are not needed. This can speed up HTTP transfers in many cases, but there are compatibility issues especially with Internet Explorer up to and including version 8. This short article will show an easy way to extract these images and convert the HTML to use external images.
Data URI
Normally images are included using this syntax:
<img src="Image1.png">
The data URI syntax however allows images to be directly embedded, reducing the number of HTTP requests and also allowing for saving on disk as a single file. While this article deals with only the img
tag, this method can be applied to other tags as well. Here is an example of data URI usage:
<img src="
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==">
More information on data URIs is available here:
SeaMonkey 2.1
Most editors do not use the data URI syntax. However, starting with SeaMonkey 2.1 Composer (Mozilla HTML Editor), images which are dragged and dropped are imported using this syntax. This is quite a bad change in my opinion, especially since it is not obvious and because it is a change in behavior from 2.0. In my case, I made a large HTML file with over 50 images before I discovered it was not linking them, but instead embedding them.
Utilities
Amazingly, there are quite a few online utilities to convert images to the data URI format, but none that I could find that could do the reverse. Because I did not want to hand-edit my document, I wrote a quick utility to extract the images to disk and change the HTML to use external images. This allows the document to be loaded by any standard browser including Internet Explorer 8.
About the Source Code
The source code is quite targeted to my specific need. It has a lot of limitations. I have published it however so that it is available as a foundation for you to expand should you have the same need.
- The parsing is very basic, but works fine with SeaMonkey output.
- It only supports PNG format currently.
- There is no exception handling.
- The code has not been optimized in any way.
Usage
ImageExtract
is a console application and accepts one parameter. The parameter is the HTML file for input. The images will be output in the same directory, and the new HTML file will have a -new suffix. So if the input is index.html, the output HTML will be index-new.html.
Source Code
I have made the project available for download, but it is quite simple. It is a C# .NET Console application. For easy viewing, here is the class:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
namespace ImageExtract {
class Program {
static void Main(string[] aArgs) {
string xSrcPathname = aArgs[0];
string xPath = Path.GetDirectoryName(xSrcPathname);
string xDestPathname = Path.Combine(xPath,
Path.GetFileNameWithoutExtension(xSrcPathname) + "-New.html");
int xImgIdx = 0;
Console.WriteLine("Processing " + Path.GetFileName(xSrcPathname));
string xSrc = File.ReadAllText(xSrcPathname);
var xDest = new StringBuilder();
string xStart = @"data:image/png;base64,";
string xB64;
int x = 0;
int y = 0;
int z = 0;
do {
x = xSrc.IndexOf(xStart, z);
if (x == -1) {
break;
}
xDest.Append(xSrc.Substring(z, x - z));
y = xSrc.IndexOf('"', x + 1);
xB64 = xSrc.Substring(x + xStart.Length, y - x - xStart.Length);
byte[] xImgData = System.Convert.FromBase64String(xB64);
string xImgName;
do {
xImgIdx++;
xImgName = "Image" + xImgIdx.ToString("0000") + ".png";
} while (File.Exists(Path.Combine(xPath, xImgName)));
Console.WriteLine("Extracting " + xImgName);
xDest.Append(xImgName);
File.WriteAllBytes(Path.Combine(xPath, xImgName), xImgData);
z = y;
} while (true);
xDest.Append(xSrc.Substring(z));
File.WriteAllText(xDestPathname, xDest.ToString());
Console.WriteLine("Output to " + Path.GetFileName(xDestPathname));
}
}
}
History
- 5th July, 2011: Initial version