Introduction
The articles on our web site are mainly written in HTML 4.0. However, many of them don't conform to the W3C standard and contain a lot of invalid tags, so I wanted to convert these files to XHTML in order to bring them into conformance.
Sometimes, I also want to extract some information from web pages. If a web page is in XHTML format, this is much easier, because I can load the file into an XML document and query it.
Background
There are several tools that can convert HTML to XHTML. Dreamweaver can convert a file through the File-->Convert menu, but it has some drawbacks: it's not free, and it is sometimes unable to fix certain errors. Alternatively, you can use a well-known free tool called "HTML Tidy". However, HTML Tidy can only process some languages directly. This article is based on HTML Tidy.
Since XHTML 2.0 is not compatible with HTML and XHTML 1.0, it's not widely used. For example, the default schema for .NET web applications is XHTML 1.0. In this article, XHTML refers to the XHTML 1.0 Transitional format.
Step I. Convert HTML file to UTF-8 format
In order to process all languages, we first have to convert the file to UTF-8 format. (Note: If the source file is already in UTF-8 format, you can skip this step.)
We can use the FileStream and BinaryReader classes to read the HTML file as a byte array, and then decode it into a String, which is written out as UTF-8 in Step II.
Here, we suppose that the HTML file is encoded in the default encoding of the operating system.
public static byte[] ReadFileAsBytes(String strFilePath)
{
System.IO.FileStream fs = new System.IO.FileStream(strFilePath,
System.IO.FileMode.Open, System.IO.FileAccess.Read,
System.IO.FileShare.ReadWrite);
System.IO.BinaryReader br = new System.IO.BinaryReader(fs);
byte[] baResult = null;
try
{
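// ReadBytes keeps reading until the requested count or the end of the file is reached,
// whereas a single Read call may return fewer bytes than requested.
baResult = br.ReadBytes((int)fs.Length);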
}
finally
{
br.Close();
fs.Close();
}
return baResult;
}
public static String BytesToString(byte[] bData)
{
return System.Text.Encoding.GetEncoding(0).GetString(bData);
}
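The default-encoding assumption above will not hold for every file. As a minimal sketch, assuming the source encoding is known by name, the conversion to UTF-8 bytes could look like the following. The Utf8Converter class, the ToUtf8 method, and the "iso-8859-1" example are illustrative names, not part of the original code.

```csharp
using System;
using System.Text;

public static class Utf8Converter
{
    // Hypothetical helper: re-encodes bytes from a named source encoding to UTF-8.
    // Throws ArgumentException if the encoding name is not known to the runtime.
    public static byte[] ToUtf8(byte[] source, string sourceEncodingName)
    {
        Encoding sourceEncoding = Encoding.GetEncoding(sourceEncodingName);
        return Encoding.Convert(sourceEncoding, Encoding.UTF8, source);
    }
}
```

For example, Utf8Converter.ToUtf8(bytes, "iso-8859-1") turns Latin-1 input into UTF-8 bytes that can be written straight to Tidy's input file.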
Step II. Convert File to XHTML
We use HTML Tidy to convert HTML files to XHTML files. Tidy has many options; if you want to know the details, you can read the manual.
To convert a UTF-8 HTML file to an XHTML file, you can run it like this:
tidy.exe -raw -utf8 -asxhtml -i -f logfilename -o outputfilename inputfilename
By using the System.Diagnostics.Process class, you start a process, specify the names of the input file and output file, and then read the entire output file as the converted XHTML. If the output file does not exist, there may be a severe error in the input file, in which case you may have to check it manually.
public static String HTML2XHTML(String strOriginalContent,String strOutputPath)
{
String strTempPath = strOutputPath != null ? strOutputPath :
System.IO.Path.GetTempPath();
String strFileName = String.Format("{0}tidy.exe",strTempPath);
if (!System.IO.File.Exists(strFileName))
{
ChinaCars.Util.SysUtil.WriteFile(strFileName,
ChinaCars.Util.App_GlobalResources.Resource.tidy);
}
System.Diagnostics.ProcessStartInfo psiInfo =
new System.Diagnostics.ProcessStartInfo();
psiInfo.FileName = strFileName;
psiInfo.CreateNoWindow = true;
psiInfo.WindowStyle = System.Diagnostics.ProcessWindowStyle.Hidden;
psiInfo.WorkingDirectory = strTempPath;
String strMainFileName = System.Guid.NewGuid().ToString("N");
String strInFileName = String.Format("{0}{1}.in",
strTempPath,strMainFileName);
String strOutFileName = String.Format("{0}{1}.out",
strTempPath,strMainFileName);
String strErrorFileName = String.Format("{0}{1}.log",
strTempPath,strMainFileName);
System.IO.File.Delete(strInFileName);
// Encode the string directly as UTF-8; going through the default code page first
// would lose any characters that the code page cannot represent.
byte[] baUTF8Data = Encoding.UTF8.GetBytes(strOriginalContent);
ChinaCars.Util.SysUtil.WriteFile(strInFileName, baUTF8Data);
psiInfo.Arguments = String.Format(
" -raw -utf8 -asxhtml -i -f {0}.log -o {0}.out {0}.in", strMainFileName);
System.IO.File.Delete(strOutFileName);
System.Diagnostics.Process proc =
System.Diagnostics.Process.Start(psiInfo);
proc.WaitForExit();
System.IO.File.Delete(strInFileName);
System.IO.File.Delete(strErrorFileName);
byte[] baResult = ChinaCars.Util.SysUtil.ReadFileAsBytes(strOutFileName);
// Tidy wrote its output as UTF-8 (-utf8), so decode it as UTF-8.
String strContent = Encoding.UTF8.GetString(baResult);
strContent = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" " +
"\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">" + strContent;
System.IO.File.Delete(strOutFileName);
return strContent;
}
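The method above does not look at Tidy's exit status. As a hedged sketch, the process launch could be isolated as follows. The TidyRunner class and its method names are illustrative. The exit codes (0 for success, 1 for warnings, 2 for errors) are what Tidy's manual describes as far as I know, so please verify them against your Tidy version.

```csharp
using System;
using System.Diagnostics;

public static class TidyRunner
{
    // Builds the same options as the command line shown earlier,
    // using one base name for the .in, .out and .log files.
    public static string BuildArguments(string baseName)
    {
        return String.Format(
            "-raw -utf8 -asxhtml -i -f {0}.log -o {0}.out {0}.in", baseName);
    }

    // Assumption: Tidy exits with 0 (no problems) or 1 (warnings only) when it
    // wrote an output file, and with 2 when errors prevented any output.
    public static bool ProducedOutput(int exitCode)
    {
        return exitCode == 0 || exitCode == 1;
    }

    // Runs tidy.exe in the given working directory and returns its exit code.
    public static int Run(string tidyPath, string workingDirectory, string baseName)
    {
        ProcessStartInfo psi = new ProcessStartInfo();
        psi.FileName = tidyPath;
        psi.Arguments = BuildArguments(baseName);
        psi.WorkingDirectory = workingDirectory;
        // CreateNoWindow is only honored when the shell is not used to start the process.
        psi.UseShellExecute = false;
        psi.CreateNoWindow = true;
        using (Process proc = Process.Start(psi))
        {
            proc.WaitForExit();
            return proc.ExitCode;
        }
    }
}
```

Checking ProducedOutput(exitCode) before reading the .out file gives a clearer failure than a missing-file exception when the input cannot be repaired.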
Step III. Developing Your Own XHTML Resolver
Now you can use the System.Xml.XmlDocument class to load the XHTML document, but you may find that loading takes a really long time, and sometimes LoadXml even fails. Why?
The DOCTYPE header in the XHTML tells the .NET XML parser to load the corresponding DTD files from the World Wide Web Consortium (W3C) web site, which may take several round trips or more. Fortunately, the .NET Framework allows us to resolve these external resources ourselves. By overriding ResolveUri and GetEntity of XmlResolver, we can reduce the XHTML loading time. The code is shown below:
public class XHTMLResolver:XmlResolver
{
override public ICredentials Credentials
{
set { }
}
public XHTMLResolver()
{
}
public override Uri ResolveUri(Uri baseUri, String relativeUri)
{
// Map each known XHTML public identifier to the URI of its DTD.
if (String.Compare(relativeUri, "-//W3C//DTD XHTML 1.0 Transitional//EN", true) == 0)
{
return new Uri("http://www.w3.org/tr/xhtml1/DTD/xhtml1-transitional.dtd");
}
else if (String.Compare(relativeUri, "-//W3C//DTD XHTML 1.0 Strict//EN", true) == 0)
{
return new Uri("http://www.w3.org/tr/xhtml1/DTD/xhtml1-strict.dtd");
}
else if (String.Compare(relativeUri, "-//W3C//DTD XHTML 1.0 Frameset//EN", true) == 0)
{
return new Uri("http://www.w3.org/tr/xhtml1/DTD/xhtml1-frameset.dtd");
}
else if (String.Compare(relativeUri, "-//W3C//DTD XHTML 1.1//EN", true) == 0)
{
return new Uri("http://www.w3.org/tr/xhtml11/DTD/xhtml11.dtd");
}
return base.ResolveUri(baseUri,relativeUri);
}
override public object GetEntity(Uri absoluteUri, string role,
Type ofObjectToReturn)
{
Object entityObj = null;
String strURI = absoluteUri.AbsoluteUri;
System.IO.MemoryStream msStream=null;
switch (strURI.ToLower())
{
case "http://www.w3.org/tr/xhtml1/dtd/xhtml1-transitional.dtd":
msStream = new MemoryStream(Resource.xhtml1_transitional);
break;
case "http://www.w3.org/tr/xhtml1/dtd/xhtml1.dcl":
msStream = new MemoryStream(Resource.xhtml1);
break;
case "http://www.w3.org/tr/xhtml1/dtd/xhtml-lat1.ent":
msStream = new MemoryStream(Resource.xhtml_lat1);
break;
case "http://www.w3.org/tr/xhtml1/dtd/xhtml-special.ent":
msStream = new MemoryStream(Resource.xhtml_special);
break;
case "http://www.w3.org/tr/xhtml1/dtd/xhtml-symbol.ent":
msStream = new MemoryStream(Resource.xhtml_symbol);
break;
case "http://www.w3.org/tr/xhtml1/dtd/xhtml1-strict.dtd":
msStream = new MemoryStream(Resource.xhtml1_strict);
break;
case "http://www.w3.org/tr/xhtml1/dtd/xhtml1-frameset.dtd":
msStream = new MemoryStream(Resource.xhtml1_frameset);
break;
case "http://www.w3.org/tr/xhtml11/dtd/xhtml11.dtd":
msStream = new MemoryStream(Resource.xhtml11);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-inlstyle-1.mod":
msStream = new MemoryStream(Resource.xhtml_inlstyle_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-framework-1.mod":
msStream = new MemoryStream(Resource.xhtml_framework_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-datatypes-1.mod":
msStream = new MemoryStream(Resource.xhtml_datatypes_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-qname-1.mod":
msStream = new MemoryStream(Resource.xhtml_qname_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-events-1.mod":
msStream = new MemoryStream(Resource.xhtml_events_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-attribs-1.mod":
msStream = new MemoryStream(Resource.xhtml_attribs_1);
break;
case "http://www.w3.org/tr/xhtml11/dtd/xhtml11-model-1.mod":
msStream = new MemoryStream(Resource.xhtml11_model_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-charent-1.mod":
msStream = new MemoryStream(Resource.xhtml_charent_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-lat1.ent":
msStream = new MemoryStream(Resource.xhtml_lat11);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-symbol.ent":
msStream = new MemoryStream(Resource.xhtml_symbol11);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-special.ent":
msStream = new MemoryStream(Resource.xhtml_special11);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-text-1.mod":
msStream = new MemoryStream(Resource.xhtml_text_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-inlstruct-1.mod":
msStream = new MemoryStream(Resource.xhtml_inlstruct_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-inlphras-1.mod":
msStream = new MemoryStream(Resource.xhtml_inlphras_1);
break;
case "http://www.w3.org/tr/ruby/xhtml-ruby-1.mod":
msStream = new MemoryStream(Resource.xhtml_ruby_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-blkstruct-1.mod":
msStream = new MemoryStream(Resource.xhtml_blkstruct_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-blkphras-1.mod":
msStream = new MemoryStream(Resource.xhtml_blkphras_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-hypertext-1.mod":
msStream = new MemoryStream(Resource.xhtml_hypertext_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-list-1.mod":
msStream = new MemoryStream(Resource.xhtml_list_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-edit-1.mod":
msStream = new MemoryStream(Resource.xhtml_edit_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-bdo-1.mod":
msStream = new MemoryStream(Resource.xhtml_bdo_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-pres-1.mod":
msStream = new MemoryStream(Resource.xhtml_pres_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-inlpres-1.mod":
msStream = new MemoryStream(Resource.xhtml_inlpres_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-blkpres-1.mod":
msStream = new MemoryStream(Resource.xhtml_blkpres_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-link-1.mod":
msStream = new MemoryStream(Resource.xhtml_link_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-meta-1.mod":
msStream = new MemoryStream(Resource.xhtml_meta_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-base-1.mod":
msStream = new MemoryStream(Resource.xhtml_base_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-script-1.mod":
msStream = new MemoryStream(Resource.xhtml_script_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-style-1.mod":
msStream = new MemoryStream(Resource.xhtml_style_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-image-1.mod":
msStream = new MemoryStream(Resource.xhtml_image_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-csismap-1.mod":
msStream = new MemoryStream(Resource.xhtml_csismap_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-ssismap-1.mod":
msStream = new MemoryStream(Resource.xhtml_ssismap_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-param-1.mod":
msStream = new MemoryStream(Resource.xhtml_param_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-object-1.mod":
msStream = new MemoryStream(Resource.xhtml_object_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-table-1.mod":
msStream = new MemoryStream(Resource.xhtml_table_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-form-1.mod":
msStream = new MemoryStream(Resource.xhtml_form_1);
break;
case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-struct-1.mod":
msStream = new MemoryStream(Resource.xhtml_struct_1);
break;
}
if (msStream != null)
{
entityObj = msStream;
}
else
{
XmlUrlResolver xur = new XmlUrlResolver();
entityObj = xur.GetEntity(absoluteUri, role, ofObjectToReturn);
}
return entityObj;
}
}
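On .NET Framework 4.0 and later there is also a built-in alternative to the hand-written resolver above: the System.Xml.Resolvers.XmlPreloadedResolver class, which ships with the XHTML 1.0 DTDs preloaded. This is a different technique from the one in this article and, as far as I know, it covers XHTML 1.0 but not XHTML 1.1. The sketch below uses an illustrative PreloadedXhtmlLoader class name.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Resolvers;

public static class PreloadedXhtmlLoader
{
    // Loads an XHTML 1.0 string without fetching the DTDs from www.w3.org.
    public static XmlDocument Load(string xhtml)
    {
        XmlPreloadedResolver resolver = new XmlPreloadedResolver(XmlKnownDtds.Xhtml10);

        XmlReaderSettings settings = new XmlReaderSettings();
        // DTD processing must be enabled so that entities such as &nbsp; are expanded.
        settings.DtdProcessing = DtdProcessing.Parse;
        settings.XmlResolver = resolver;

        XmlDocument doc = new XmlDocument();
        doc.XmlResolver = resolver;
        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            doc.Load(reader);
        }
        return doc;
    }
}
```

The custom XHTMLResolver remains useful when you need XHTML 1.1 or want full control over which files are served from embedded resources.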
Using the Code
By using the HTML2XHTML method, you can convert an HTML file to an XHTML file.
System.Net.WebClient webClient = new System.Net.WebClient();
String strHTMLContent = webClient.DownloadString("http://www.codeproject.com");
// Passing null as the output path makes HTML2XHTML work in the system temp directory.
String strXHTMLContent = ChinaCars.Util.XMLUtil.HTML2XHTML(strHTMLContent, null);
By using the XHTMLResolver, you can resolve the XHTML file as XML very quickly.
System.Xml.XmlDocument xmlDoc=new System.Xml.XmlDocument();
xmlDoc.XmlResolver =new ChinaCars.Util.XHTMLResolver();
xmlDoc.LoadXml(strXHTMLContent);
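Once the document is loaded, extracting information is a matter of running XPath queries. One thing that is easy to miss is that XHTML elements live in the http://www.w3.org/1999/xhtml namespace, so the query needs a namespace prefix. The following is a minimal sketch; the XhtmlQuery class, the GetLinkTargets method, and the "x" prefix are illustrative choices.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XhtmlQuery
{
    // Returns the href value of every <a> element that has one.
    public static List<string> GetLinkTargets(XmlDocument doc)
    {
        // Bind a prefix to the XHTML namespace; an unprefixed "//a" would
        // only match elements that are in no namespace at all.
        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        List<string> targets = new List<string>();
        foreach (XmlNode anchor in doc.SelectNodes("//x:a[@href]", ns))
        {
            targets.Add(anchor.Attributes["href"].Value);
        }
        return targets;
    }
}
```

Calling XhtmlQuery.GetLinkTargets(xmlDoc) on the document loaded above would list the link targets of the converted page.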
History
- 12th March, 2007: Published first version
License
This article has no explicit license attached to it, but may contain usage terms in the article text or the download files themselves.