Introduction
We have around 20 websites. Management is expanding overseas and wants all the sites to be multi-lingual, yesterday, of course. The hard part is that some of the content is user driven. Faced with a future of endless tagging for resource files, I needed to come up with a better solution. Being a programmer, I wanted to create something that would make all my problems go away. I needed something I could use on any site without messing much with the source code. It also needed to translate the pages itself, and to learn new words when users updated content.
The Teacher
The sites happen to be behind a firewall, so using the translation services of Google or BabelFish isn’t really an option. But using Machine Translation (MT) sounds like a good idea (better than doing all of those resource files). Machine Translation is pretty good, but it isn’t an exact science, so we will need a way to change what it comes up with. I found a company called Systran which has an API. So now, I have a teacher for the pages to learn a new language from. Most machine translation programs have a function where you pass in a URL and it translates the whole page at once. The problem is that this is too slow, and it often messes up your JavaScript or something else on the page. This approach is too big a hammer for me. I need a way to surgically change just the content: labels, combo box data, list data, etc.
Making it Universal
I decided I didn’t want to write a new control or a wrapper for each control we use. I didn’t want to change the code on the sites. All this stuff we do in ASP.NET eventually gets turned into just plain ol’ HTML at some point, where there is a much smaller set of controls to deal with. So, if I get the HTML after ASP.NET has made it and translate it just before it goes to the client, I can do so without messing up the JavaScript, translating just what I want. An HTTP Module is perfect for this. Not only can I translate the raw HTML, but I can slap it on the front of any website without changing the inner workings of the site.
Like it Raw
Actually, getting the raw output turns out to be a bit of a trick. The only way to get hold of it is by creating a Filter object and assigning it to the Response object, as follows…
Dim f = _Context.Response.Filter
Dim sr As TranslateFilterToDom = New TranslateFilterToDom(f)
sr.Language = GetUserLanguage()
_Context.Response.Filter = sr
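For context, here is a rough sketch of how that snippet might be wired into an HTTP module. The class name TranslationModule, the choice of the ReleaseRequestState event, the content-type check, and the body of GetUserLanguage are all assumptions made for illustration, not the actual code behind this article. TranslateFilterToDom and its Language property are the filter class described here, so the sketch only compiles alongside it.

```vbnet
Imports System
Imports System.Web

' Hypothetical module; only TranslateFilterToDom comes from the article.
Public Class TranslationModule
    Implements IHttpModule

    Public Sub Init(ByVal app As HttpApplication) Implements IHttpModule.Init
        ' Assumption: hook in late enough that the content type is known.
        AddHandler app.ReleaseRequestState, AddressOf OnReleaseRequestState
    End Sub

    Private Sub OnReleaseRequestState(ByVal sender As Object, ByVal e As EventArgs)
        Dim app As HttpApplication = DirectCast(sender, HttpApplication)
        Dim response As HttpResponse = app.Context.Response

        ' Assumption: only HTML pages are translated; images, scripts, etc. pass through.
        If String.Equals(response.ContentType, "text/html", StringComparison.OrdinalIgnoreCase) Then
            Dim sr As TranslateFilterToDom = New TranslateFilterToDom(response.Filter)
            sr.Language = GetUserLanguage(app.Context.Request)
            response.Filter = sr
        End If
    End Sub

    ' Hypothetical helper: take the browser's first preferred language.
    Private Shared Function GetUserLanguage(ByVal request As HttpRequest) As String
        Dim langs As String() = request.UserLanguages
        If langs Is Nothing OrElse langs.Length = 0 Then
            Return "en"
        End If
        ' Entries look like "ja", "en-US" or "fr;q=0.8"; keep the primary subtag only.
        Dim primary As String = langs(0).Split(";"c)(0).Split("-"c)(0)
        Return primary.Trim().ToLowerInvariant()
    End Function

    Public Sub Dispose() Implements IHttpModule.Dispose
        ' Nothing to clean up in this sketch.
    End Sub
End Class
```

A module like this is registered in web.config (the httpModules section, or system.webServer/modules under the IIS integrated pipeline), which is what lets it sit in front of a site without touching the site's own code.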
You create an object that inherits from Stream. Most of it you want to leave alone, just fill in the footprint. What isn’t going to stay the same is the Write method. This is a stream, so the HTML arrives in chunks. What we are looking for is the closing HTML tag...
Public Overrides Sub Write(ByVal Buffer() As Byte, _
        ByVal offset As Integer, ByVal count As Integer)
    Dim sBuffer As String = System.Text.Encoding.UTF8.GetString(Buffer, offset, count)
    Dim rHTML As Regex = New Regex("</html>", RegexOptions.IgnoreCase)
    ' Keep buffering every chunk until the closing </html> tag shows up.
    responseHtml.Append(sBuffer)
    If rHTML.IsMatch(sBuffer) Then
        Dim finalHtml As String = responseHtml.ToString()
        If Language <> "en" Then
            xml = New XmlDocument
            Dim xmldecl As XmlDeclaration
            xmldecl = xml.CreateXmlDeclaration("1.0", "UTF-8", Nothing)
            ' Drop the DOCTYPE and anything else ahead of the <html> tag.
            finalHtml = finalHtml.Substring(finalHtml.ToLower.IndexOf("<html"))
            ' Patch up the constructs that would make LoadXml choke.
            finalHtml = finalHtml.Replace("&nbsp;", " ")
            finalHtml = finalHtml.Replace("<br>", "<br/>")
            finalHtml = finalHtml.Replace("<hr>", "<hr/>")
            finalHtml = finalHtml.Replace("&", "&amp;")
            finalHtml = finalHtml.Replace("<head>", "<head>" & META)
            xml.LoadXml(finalHtml)
            xml.InsertBefore(xmldecl, xml.DocumentElement)
            finalHtml = Translate(xml)
            ' Undo the ampersand escaping applied before parsing.
            finalHtml = finalHtml.Replace("&amp;", "&")
        End If
        ' Send the whole (possibly translated) page on to the real response stream.
        Dim data As Byte() = System.Text.Encoding.UTF8.GetBytes(finalHtml)
        _sink.Write(data, 0, data.Length)
    End If
End Sub
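The rest of the class, the "footprint" mentioned earlier, is mostly boilerplate. The following is a minimal sketch of what it might look like, not the article's actual class: the name TranslateFilterSkeleton is made up, the field name _sink is borrowed from the Write method, and throwing NotSupportedException from the read and seek members is an assumption that fits a write-only response filter. Its Write simply passes bytes through; the real filter replaces that one method with the buffering version.

```vbnet
Imports System
Imports System.IO

' Hypothetical skeleton of a write-only response filter.
Public Class TranslateFilterSkeleton
    Inherits Stream

    ' The original Response.Filter stream that leads to the client.
    Private ReadOnly _sink As Stream

    Public Sub New(ByVal sink As Stream)
        _sink = sink
    End Sub

    Public Overrides ReadOnly Property CanRead() As Boolean
        Get
            Return False
        End Get
    End Property

    Public Overrides ReadOnly Property CanSeek() As Boolean
        Get
            Return False
        End Get
    End Property

    Public Overrides ReadOnly Property CanWrite() As Boolean
        Get
            Return True
        End Get
    End Property

    Public Overrides ReadOnly Property Length() As Long
        Get
            Throw New NotSupportedException()
        End Get
    End Property

    Public Overrides Property Position() As Long
        Get
            Throw New NotSupportedException()
        End Get
        Set(ByVal value As Long)
            Throw New NotSupportedException()
        End Set
    End Property

    Public Overrides Sub Flush()
        _sink.Flush()
    End Sub

    Public Overrides Sub Close()
        _sink.Close()
        MyBase.Close()
    End Sub

    Public Overrides Function Read(ByVal buffer() As Byte, _
            ByVal offset As Integer, ByVal count As Integer) As Integer
        Throw New NotSupportedException()
    End Function

    Public Overrides Function Seek(ByVal offset As Long, ByVal origin As SeekOrigin) As Long
        Throw New NotSupportedException()
    End Function

    Public Overrides Sub SetLength(ByVal value As Long)
        Throw New NotSupportedException()
    End Sub

    ' Placeholder: pass the bytes through untouched.
    Public Overrides Sub Write(ByVal buffer() As Byte, _
            ByVal offset As Integer, ByVal count As Integer)
        _sink.Write(buffer, offset, count)
    End Sub
End Class
```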
The idea here is to parse through the HTML tags on the way to the client. This way, we don’t have to worry about server controls, and don’t have to change code in the application. The problem is how to go about parsing through it. If the HTML is XHTML compliant, we can just dump it into an XmlDocument and parse that. Then, we pass it to the Translate method…
Private Function Translate(ByVal poxml As XmlDocument) As String
    ' Walk the whole document, starting at the <html> element.
    TranslateNode(poxml.DocumentElement)
    If _IsManaul Then
        ' MT could not handle this language: inject the script that lets users suggest translations.
        Dim xmlNode As XmlNode = poxml.GetElementsByTagName("head")(0)
        Dim xmlScript As XmlElement = poxml.CreateElement("Script")
        xmlScript.InnerXml = GetManualScript()
        xmlNode.AppendChild(xmlScript)
    End If
    Return poxml.OuterXml
End Function
So, we are just kicking off the recursion to parse the XmlDocument here. In addition, we handle the case where the MT program can’t translate the user’s selected language, which is what the IsManual flag (I know it’s misspelled as _IsManaul in the code) is all about.
So, here comes the surgical part. We need to get just the text that users can see. We care about text in the HTML between tags. We care about text in value attributes. What we don’t want is textboxes or textareas, since users will be writing in their own language there. We don’t care about what’s in script or style tags.
Private Sub TranslateNode(ByVal poNode As XmlNode)
    Dim lbTraverse As Boolean = True
    Select Case poNode.Name.ToLower
        Case "#text"
            ' Visible text between tags.
            poNode.Value = TranslateText(poNode.Value, Nothing)
        Case "input"
            ' An input with no type attribute defaults to a textbox.
            Dim lsType As String
            If poNode.Attributes.ItemOf("type") Is Nothing Then
                lsType = "text"
            Else
                lsType = poNode.Attributes.ItemOf("type").Value.ToLower
            End If
            If Not poNode.Attributes.ItemOf("value") Is Nothing Then
                Select Case lsType
                    Case "button", "submit"
                        ' Only button captions are translated; typed-in values are left alone.
                        poNode.Attributes.ItemOf("value").Value = _
                            TranslateText(poNode.Attributes.ItemOf("value").Value, poNode)
                End Select
            End If
        Case "script", "style", "textarea"
            ' Skip code, styles, and user-entered text entirely.
            lbTraverse = False
    End Select
    If lbTraverse Then
        For Each loNode As XmlNode In poNode.ChildNodes
            TranslateNode(loNode)
        Next
    End If
End Sub
So, now that we have the text we want to translate, how do we translate it? The downside of using MT is that it’s slow, at least slower than just looking the word up in a database. So, once we translate a word or phrase, we save it off for use by our website or any other website that uses this HTTP module.
Private Function TranslateText(ByVal psString As String, _
        ByVal pnode As XmlNode) As String
    Dim result As String = String.Empty
    ' Normalize the phrase so the same text always maps to the same lookup key.
    psString = psString.Replace(vbCrLf, "")
    psString = psString.Replace(vbCr, "")
    psString = psString.Replace(vbLf, "")
    psString = psString.Replace(vbTab, "")
    psString = psString.Trim
    ' Try the saved translations first; fall back to MT only for new phrases.
    result = Lookup(psString)
    If result.Trim = String.Empty Then
        Dim loTranslator As ITranslator = New Systran
        If Not loTranslator Is Nothing AndAlso _
                Array.Exists(loTranslator.AvailableLanguages, AddressOf HasLanguage) Then
            result = loTranslator.Translate(psString, "en", Language)
            SaveWord(psString, result)
        Else
            ' No MT for this language: show the English and switch on manual translation.
            _IsManaul = True
            result = psString
        End If
    End If
    Return result
End Function
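Lookup and SaveWord are not listed in this article; they read from and write to the shared database table. Purely as an illustration of the contract TranslateText relies on (an empty string means "not learned yet"), here is a minimal in-memory stand-in. The class name TranslationCache, the explicit language parameter, and the dictionary storage are all assumptions; the real methods use the filter's Language field and a database.

```vbnet
Imports System
Imports System.Collections.Generic

' Hypothetical in-memory stand-in for the shared translation table.
Public Class TranslationCache
    Private Shared ReadOnly _words As New Dictionary(Of String, String)(StringComparer.Ordinal)
    Private Shared ReadOnly _lock As New Object()

    ' Assumption: "|" never occurs in a language code, so it is safe as a separator.
    Private Shared Function MakeKey(ByVal language As String, ByVal phrase As String) As String
        Return language & "|" & phrase
    End Function

    ' Returns the saved translation, or String.Empty when the phrase is unknown.
    Public Shared Function Lookup(ByVal language As String, ByVal phrase As String) As String
        SyncLock _lock
            Dim result As String = Nothing
            If _words.TryGetValue(MakeKey(language, phrase), result) Then
                Return result
            End If
        End SyncLock
        Return String.Empty
    End Function

    ' Stores or overwrites the translation for a phrase in the given language.
    Public Shared Sub SaveWord(ByVal language As String, ByVal phrase As String, _
            ByVal translation As String)
        SyncLock _lock
            _words(MakeKey(language, phrase)) = translation
        End SyncLock
    End Sub
End Class
```

An in-memory cache like this only lives as long as the application domain and is not shared between sites, which is exactly why the article stores the phrases in a database instead.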
Here is where the magic happens. I set this up so you can add your own translator by just implementing ITranslator. We are looking at using Systran, which has a Web Service API.
Public Function Translate(ByVal psString As String, _
        ByVal psFromLanguage As String, ByVal psToLanguage As String) _
        As String Implements ITranslator.Translate
    ' The configured URL carries an @Language placeholder for the target language.
    Dim lsURL As String = My.Settings.TransaltionURL.Trim.Replace("@Language", psToLanguage)
    ' WebClient is disposable, so release it as soon as the call returns.
    Using loWebclient As WebClient = New WebClient
        loWebclient.Encoding = Encoding.UTF8
        Dim result As String = loWebclient.UploadString(lsURL, psString).Trim
        ' Strip the "body=" prefix from the response.
        Return result.Replace("body=", "")
    End Using
End Function
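The ITranslator interface itself is not listed here, but its shape can be inferred from how TranslateText and the Systran class use it. The sketch below is that inference, not the original source: the element type of AvailableLanguages is assumed to be String, and FixedPhraseTranslator with its one-entry phrase list is a made-up example of plugging in a different translator.

```vbnet
Imports System
Imports System.Collections.Generic

' Inferred from usage in the article; the String() element type is an assumption.
Public Interface ITranslator
    ReadOnly Property AvailableLanguages() As String()
    Function Translate(ByVal psString As String, _
        ByVal psFromLanguage As String, ByVal psToLanguage As String) As String
End Interface

' Hypothetical translator backed by a fixed English-to-French phrase list.
Public Class FixedPhraseTranslator
    Implements ITranslator

    Private ReadOnly _phrases As New Dictionary(Of String, String)

    Public Sub New()
        _phrases.Add("Hello", "Bonjour")
    End Sub

    Public ReadOnly Property AvailableLanguages() As String() _
            Implements ITranslator.AvailableLanguages
        Get
            Return New String() {"fr"}
        End Get
    End Property

    ' Returns the known French phrase, or the input unchanged when there is no match.
    Public Function Translate(ByVal psString As String, _
            ByVal psFromLanguage As String, ByVal psToLanguage As String) _
            As String Implements ITranslator.Translate
        Dim result As String = Nothing
        If psToLanguage = "fr" AndAlso _phrases.TryGetValue(psString, result) Then
            Return result
        End If
        Return psString
    End Function
End Class
```

With an interface like this, swapping translators comes down to changing the one line in TranslateText that creates the Systran object.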
So What If…
What if you don’t have an MT server, or the user is using a language which the MT can’t translate? You need a way of letting the users translate the pages for you. It’s the Wikipedia approach to site translation. By creating a list of the phrases commonly used on the site and in the controls of the web pages, you can have all of that translated up front, either by using one of the online translators or by having a user fill in the translations. That will handle the static content well enough. Then, there is the user driven content. We need a way for users to select content presented in English and translate it for us. So, if the manual translation switch is on, we load into the HTML the JavaScript needed to let the user select content, and on mouse-up, we open a dialog which allows them to offer a suggestion. On our site, we capture the user’s login since we use single sign-on, so we can fire them if they put in anything inappropriate.
This is the English version of the demo. No translation takes place.
This is translated to Japanese using Systran MT.
Pardon the spelling. Systran does not translate Vietnamese. So, the user will need to offer a translation. They just select the English text and a dialog pops up asking for the translation. This is then stored in the database for use by all websites.
This is the database table. Across a site or sites of any size, this will get pretty big, but it can be used by all sites. Once a site learns a word, all sites know it at that point, and don't need to go look it up.
Of course, if the user misspells something, as is demonstrated here, it will show up misspelled, so adding a multi-lingual spell checker would be a nice touch.
Conclusion
Get rid of the RESX files and let your websites learn from each other using MT as a teacher. I hope this is an idea you can expand on. The code will probably need some massaging to get it baked into your site, but hopefully, it is an idea you can use.