How to do xHTML to xHTML Transformations for PDF Conversion Purposes in ASP.NET.

Michael Simonov

1 Mar 2011

This article discusses dynamic xHTML to xHTML XSL transformations for PDF output.

Download example - 2.7 MB

Introduction

Converting HTML pages to other formats, especially PDF, has become a common routine for web developers. The process itself is fairly straightforward, because there are plenty of PDF libraries and services available. However, one day you may need not just to make a PDF copy of a page, but to automatically modify the resulting PDF output (for example, you may want to access SVG data on the page). In this article, I'm going to show a simple example of accomplishing this task in ASP.NET using a few .NET and XSL tips and the PD4ML PDF library.

Step 1: Searching for xHTML Markup

ASP.NET is great for easily creating complicated pages. However, the server controls you work with have very little in common with the resulting xHTML markup, which is rendered and sent to the client. That's why the first thing we are going to do is capture that markup. It is produced by the "Render" method during the page's life cycle, so we need to override this method.

// Requires: using System.IO; using System.Web.UI;
protected override void Render(HtmlTextWriter output)
{
   // Create string and HTML writers to capture the generated markup
   StringWriter writer = new StringWriter();
   HtmlTextWriter htmlWriter = new HtmlTextWriter(writer);
   // Render the page into our substitute HtmlTextWriter instead of the response
   base.Render(htmlWriter);
   // Copy the markup to a string and save it to disk
   // (same file name that Step 3 reads)
   string htmlMarkup = writer.ToString();
   // The using block closes the file even if Write throws
   using (StreamWriter xmlWriter = new StreamWriter(Server.MapPath("HTMLoutput.xml")))
   {
      xmlWriter.Write(htmlMarkup);
   }
   // Send the captured markup to the client as usual
   output.Write(htmlMarkup);
}
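Before handing the captured markup to XSLT, it can help to confirm that it really parses as XML. The following is a minimal sketch of such a check, not part of the article's download: the class and method names are hypothetical, and it assumes .NET 4.0 or later, where XmlReaderSettings.DtdProcessing is available. Because the DOCTYPE is ignored here, named entities such as &nbsp; would be reported as errors.

```csharp
using System;
using System.IO;
using System.Xml;

public static class MarkupCheck
{
    // Hypothetical helper: returns true when the markup parses as well-formed XML.
    public static bool IsWellFormedXml(string markup)
    {
        var settings = new XmlReaderSettings
        {
            // Skip the DOCTYPE instead of rejecting it; nothing is downloaded.
            DtdProcessing = DtdProcessing.Ignore,
            XmlResolver = null
        };
        try
        {
            using (var reader = XmlReader.Create(new StringReader(markup), settings))
            {
                while (reader.Read()) { } // walk the whole document
            }
            return true;
        }
        catch (XmlException)
        {
            return false; // unclosed tag, undeclared entity, and so on
        }
    }

    public static void Main()
    {
        // Self-closed <br /> is fine; a bare <br> is not well-formed XML.
        Console.WriteLine(IsWellFormedXml(
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><br /></body></html>"));
        Console.WriteLine(IsWellFormedXml("<html><body><br></body></html>"));
    }
}
```

In the overridden Render method, such a check could be called on htmlMarkup before the file is written, so that a malformed page is noticed early rather than at transformation time.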

Step 2: Getting Ready for XSL Transformation

Now we need to prepare our XSLT file. ASP.NET produces valid xHTML markup, so we only need to change it according to our needs. Still, there are some problems you may face:

First, don't forget that xHTML markup uses the default namespace xmlns="http://www.w3.org/1999/xhtml", and XPath expressions in XSLT 1.0 cannot match namespaced elements without a prefix. That's why we declare xmlns:xhtml="http://www.w3.org/1999/xhtml" in our XSLT file and add xhtml to "exclude-result-prefixes" to keep the prefix out of the result document.
Second, now we are able to do transformations, but there is another problem: lots of xmlns="" attributes in the output document. They appear because elements written literally in the style sheet are in no namespace, while the elements around them are in the xHTML namespace. To get rid of them, add xmlns="http://www.w3.org/1999/xhtml" to the namespace declarations of the XSLT file.
Third, XSLT's built-in template rule copies every text node to the output, so text that your templates don't handle explicitly still shows up in the result. To get rid of these text nodes, put an empty <xsl:template match="xhtml:body//text()"/> template in your XSLT style sheet.

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
               xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
               xmlns:msxsl="urn:schemas-microsoft-com:xslt"
               xmlns:xhtml="http://www.w3.org/1999/xhtml"
               xmlns="http://www.w3.org/1999/xhtml"
               exclude-result-prefixes="msxsl xhtml">
<!-- the rest of the xsl file -->
</xsl:stylesheet>
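To see the three tips working together, here is a small self-contained sketch. The sample page, the templates and the class name XhtmlTransformDemo are made up for illustration and are not taken from the article's download. The style sheet copies the page as is, adds a heading to the body, rebuilds paragraphs explicitly, and drops any other text inside the body.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XhtmlTransformDemo
{
    // Hypothetical input: a tiny xHTML page with loose text inside <body>.
    private const string Xhtml =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\">" +
        "<head><title>Demo</title></head>" +
        "<body>loose text<p>Kept paragraph</p></body></html>";

    private const string Xslt = @"<xsl:stylesheet version=""1.0""
    xmlns:xsl=""http://www.w3.org/1999/XSL/Transform""
    xmlns:xhtml=""http://www.w3.org/1999/xhtml""
    xmlns=""http://www.w3.org/1999/xhtml""
    exclude-result-prefixes=""xhtml"">
  <xsl:output method=""xml"" omit-xml-declaration=""yes""/>
  <!-- Identity template: copy everything not matched below -->
  <xsl:template match=""@*|node()"">
    <xsl:copy><xsl:apply-templates select=""@*|node()""/></xsl:copy>
  </xsl:template>
  <!-- Tip 1: the xhtml prefix reaches the unprefixed body element -->
  <xsl:template match=""xhtml:body"">
    <xsl:copy>
      <!-- Tip 2: this literal element lands in the xHTML namespace -->
      <h1>Added by XSLT</h1>
      <xsl:apply-templates select=""@*|node()""/>
    </xsl:copy>
  </xsl:template>
  <!-- Paragraphs are rebuilt explicitly, so their text survives -->
  <xsl:template match=""xhtml:p"">
    <p><xsl:value-of select="".""/></p>
  </xsl:template>
  <!-- Tip 3: any other text inside body is suppressed -->
  <xsl:template match=""xhtml:body//text()""/>
</xsl:stylesheet>";

    // Runs the style sheet over the sample page and returns the result markup.
    public static string Run()
    {
        var transform = new XslCompiledTransform();
        using (var xsltReader = XmlReader.Create(new StringReader(Xslt)))
        {
            transform.Load(xsltReader);
        }
        var result = new StringWriter();
        using (var input = XmlReader.Create(new StringReader(Xhtml)))
        {
            transform.Transform(input, null, result);
        }
        return result.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Run());
    }
}
```

If you remove the xmlns="http://www.w3.org/1999/xhtml" line from the style sheet in this sketch, the added h1 and p elements end up in no namespace and are serialized with xmlns="", which is the problem the second tip addresses.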

Step 3: Creating PDF File

This is where we reach our final goal. All we need to do is perform the XSL transformation and create the PDF file. I'll use the PD4ML HTML to PDF conversion library, because it can be used from different programming languages, such as Java, PHP, Ruby, etc. I'm going to use a MemoryStream because I don't want to save any intermediate data to the hard drive.

// Requires: using System; using System.IO; using System.Xml; using System.Xml.Xsl;
// plus the namespace of the PD4ML library
protected void MakePDFButton_Click(object sender, EventArgs e)
{
   // Locate the style sheet and the markup saved in Step 1
   string XSLTFile = Server.MapPath("XSLTFile.xslt");
   string XMLFile = Server.MapPath("HTMLoutput.xml");
   // Allow the DTD in our xHTML markup.
   // In .NET 4.0 and later, ProhibitDtd is obsolete; use
   // settings.DtdProcessing = DtdProcessing.Parse instead.
   XmlReaderSettings settings = new XmlReaderSettings();
   settings.ProhibitDtd = false;
   XmlReader reader = XmlReader.Create(XMLFile, settings);
   // Transform the initial HTML markup and write the result to a
   // MemoryStream for further PDF conversion
   XslCompiledTransform XSLTransform = new XslCompiledTransform();
   XSLTransform.Load(XSLTFile);
   MemoryStream memoryStream = new MemoryStream();
   XSLTransform.Transform(reader, null, memoryStream);
   // Flush the stream and move the cursor back to the beginning
   // of the data in the stream
   memoryStream.Flush();
   memoryStream.Position = 0;
   reader.Close();

   // Show the transformed markup on the page
   StreamReader streamReader = new StreamReader(memoryStream);
   string output = streamReader.ReadToEnd();
   HTMLoutput.Text = Server.HtmlEncode(output);

   // Convert the resulting HTML page to PDF
   PD4ML PDFcreator = new PD4ML();
   PDFcreator.PageSize = PD4Constants.A4;
   PDFcreator.DocumentTitle = "The result PDF file";
   string path = Server.MapPath("Output.pdf");
   StreamWriter streamWriter = new StreamWriter(path);
   // Rewind again, because ReadToEnd left the cursor at the end
   memoryStream.Position = 0;
   PDFcreator.render(memoryStream, streamWriter);
   // Close the streams; closing streamReader also closes memoryStream,
   // so this must happen after rendering
   streamReader.Close();
   streamWriter.Close();
}
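The stream handling above is easy to get wrong, so here is a minimal sketch of the same in-memory pattern without PD4ML. The class and method names are hypothetical, the tiny style sheet is made up for the example, and the StreamReader overload with leaveOpen assumes .NET 4.5 or later. It transforms into a MemoryStream, reads the result for display without closing the stream, and rewinds it so that a second consumer (a PDF converter, for instance) can read the same bytes.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

public static class InMemoryTransform
{
    // Hypothetical helper: runs the style sheet and returns a rewound stream.
    public static MemoryStream TransformToStream(TextReader xml, TextReader xslt)
    {
        var transform = new XslCompiledTransform();
        using (var xsltReader = XmlReader.Create(xslt))
        {
            transform.Load(xsltReader);
        }
        var buffer = new MemoryStream();
        using (var input = XmlReader.Create(xml))
        {
            transform.Transform(input, null, buffer);
        }
        buffer.Position = 0; // Transform leaves the cursor at the end
        return buffer;
    }

    // Reads the whole stream as text, keeps it open, and rewinds it again.
    public static string ReadAndRewind(Stream stream)
    {
        string text;
        stream.Position = 0;
        using (var reader = new StreamReader(stream, Encoding.UTF8, true, 1024, leaveOpen: true))
        {
            text = reader.ReadToEnd();
        }
        stream.Position = 0; // ready for the next consumer
        return text;
    }

    public static void Main()
    {
        const string xslt =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">" +
            "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\" encoding=\"utf-8\"/>" +
            "<xsl:template match=\"/\"><out><xsl:value-of select=\"/in\"/></out></xsl:template>" +
            "</xsl:stylesheet>";

        using (MemoryStream result = TransformToStream(
            new StringReader("<in>hello</in>"), new StringReader(xslt)))
        {
            Console.WriteLine(ReadAndRewind(result)); // first pass: display
            Console.WriteLine(result.Position);       // rewound for a second pass
        }
    }
}
```

Keeping the stream open between the two passes removes the ordering constraint in the handler above, where the StreamReader may only be closed after the converter has finished.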

Conclusion

That's it! Here is a short summary:

Override the "Render" method to obtain and manipulate the xHTML markup.
Use a custom XML namespace prefix to reach non-prefixed xHTML nodes.
Declare xmlns="http://www.w3.org/1999/xhtml" in the XSLT file to get rid of the numerous xmlns="" attributes.
Use an empty <xsl:template match="xhtml:body//text()"/> template if you need to get rid of text inside the body that your templates don't handle explicitly.

I hope that the combination of valid xHTML markup, which Visual Studio developers tend to take for granted, and the few simple tips described above will give you plenty of possibilities for manipulating your document's data.

History

1st March, 2011: Initial post

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
