Automatic Site Translation Using HTTPModules and Machine Translation (MT)

26 Jan 2009
HTTP module and machine translation for automatic site translation.

Introduction

We have around 20 websites. Management is expanding overseas and wants all of the sites to be multilingual, yesterday, of course. The hard part is that some of the content is user driven. Faced with a future of endless tagging for resource files, I needed to come up with a better solution. Being a programmer, I wanted to create something that would make all my problems go away: something I could use on any site without messing much with the source code. It needed to translate the pages itself, and it also needed to learn new words when users updated content.

The Teacher

The sites happen to be behind a firewall, so using the translation services of Google or BabelFish isn’t really an option. But using Machine Translation (MT) sounds like a good idea (better than doing all of those resource files). Machine Translation is pretty good, but it isn’t an exact science, so we will need a way to correct what it comes up with. I found a company called Systran which has an API. So now, I have a teacher for the pages to learn a new language from. Most machine translation programs have a function where you pass in a URL and the whole page is translated at once. The problem is that this is too slow, and it often messes up your JavaScript or something else on the page. This approach is too big a hammer for me. I need a way to surgically change just the content: labels, combo box data, list data, etc…

Making it Universal

I decided I didn’t want to write a new control or a wrapper for each control we use, and I didn’t want to change the code on the sites. Everything we do in ASP.NET eventually gets rendered as plain ol’ HTML, where there is a much smaller set of controls to deal with. So, if I grab the HTML after ASP.NET has produced it and translate it just before it goes to the client, I can translate only what I want without messing up the JavaScript. An HTTP Module is perfect for this. Not only can I translate the raw HTML, but I can slap it on the front of any website without changing the inner workings of the site.

Like it Raw

Actually, getting the raw output turns out to be a bit of a trick. The only place you can get a hold of it is by creating a Filter object and assigning it to the Response object, as follows…

Dim f = _Context.Response.Filter
Dim sr As TranslateFilterToDom = New TranslateFilterToDom(f)
sr.Language = GetUserLanguage()
_Context.Response.Filter = sr

You create an object that inherits from Stream. Most of its members can be left as simple stubs that just satisfy the Stream contract. The one that needs real work is the Write method. Because this is a stream, the HTML arrives in chunks, so we buffer each chunk and wait for the one that contains the closing HTML tag...

Public Overrides Sub Write(ByVal Buffer() As Byte, _
       ByVal offset As Integer, ByVal count As Integer)
    Dim sBuffer As String = System.Text.Encoding.UTF8.GetString(Buffer, offset, count)
    Dim rHTML As Regex = New Regex("</html>", RegexOptions.IgnoreCase)
    'Buffer every chunk; nothing is sent on until the closing tag arrives
    responseHtml.Append(sBuffer)
    If rHTML.IsMatch(sBuffer) Then
        Dim finalHtml As String = responseHtml.ToString()
        ' Transform the response and write it back out
        If Language <> "en" Then ' Don't do anything if it's English
            xml = New XmlDocument 'Create a place to put the XHTML
            Dim xmldecl As XmlDeclaration
            xmldecl = xml.CreateXmlDeclaration("1.0", "UTF-8", Nothing)
            'Drop the DOCTYPE and anything else before the <html> tag
            finalHtml = finalHtml.Substring(finalHtml.ToLower.IndexOf("<html"))
            'Help out the HTML a bit to make it XHTML
            finalHtml = finalHtml.Replace("&nbsp;", " ")
            finalHtml = finalHtml.Replace("<br>", "<br/>")
            finalHtml = finalHtml.Replace("<hr>", "<hr/>")
            finalHtml = finalHtml.Replace("&", "&amp;")
            'Add a META tag to make sure it's UTF-8
            finalHtml = finalHtml.Replace("<head>", "<head>" & META)
            xml.LoadXml(finalHtml)
            xml.InsertBefore(xmldecl, xml.DocumentElement)
            finalHtml = Translate(xml) 'Start the Translation
            finalHtml = finalHtml.Replace("&amp;", "&") 'Put the & back in
        End If
        Dim data As Byte() = System.Text.Encoding.UTF8.GetBytes(finalHtml)
        _sink.Write(data, 0, data.Length)
    End If
End Sub
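
The buffering idea in the Write method is independent of ASP.NET, so here is a minimal sketch of it in Python rather than VB.NET, purely as an illustration. The names BufferingFilter, sink, and transform are hypothetical and not part of the article's code. Unlike the listing above, this sketch keeps the raw bytes and decodes them only once the whole page has arrived, which avoids splitting a multi-byte UTF-8 character across two chunks, and it searches the tail of the buffer so a closing tag that straddles two chunks is still detected.

```python
import io
import re


class BufferingFilter:
    """Collects written chunks and forwards one transformed page to the sink.

    Hypothetical stand-in for a response filter stream; not the article's class.
    """

    _END_OF_PAGE = re.compile(rb"</html\s*>", re.IGNORECASE)

    def __init__(self, sink, transform):
        self._sink = sink            # downstream binary stream
        self._transform = transform  # callable: str -> str, applied to the full page
        self._buffer = bytearray()

    def write(self, data: bytes) -> None:
        self._buffer.extend(data)
        # Inspect the newest chunk plus a few earlier bytes, so a closing
        # tag split across two chunks is still found.
        tail = bytes(self._buffer[-(len(data) + 16):])
        if self._END_OF_PAGE.search(tail):
            page = self._buffer.decode("utf-8")  # decode once, on the full page
            self._sink.write(self._transform(page).encode("utf-8"))
            self._buffer.clear()


if __name__ == "__main__":
    out = io.BytesIO()
    page_filter = BufferingFilter(out, str.upper)  # str.upper stands in for translation
    for chunk in (b"<html><body>hi</bo", b"dy></ht", b"ml>"):
        page_filter.write(chunk)
    print(out.getvalue())
```

As in the article's version, a response that never contains a closing html tag (an image or a JSON payload, say) would stay in the buffer, so a real filter would also need to flush on close.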

The idea here is to parse through the HTML tags on the way to the client. This way, we don’t have to worry about server controls, and don’t have to change code in the application. The problem is how to go about parsing it. If the HTML is XHTML compliant, or can be nudged into compliance as the replacements above do, we can load it into an XmlDocument and walk that. Then, we pass it to the Translate method…

Private Function Translate(ByVal poxml As XmlDocument) As String
    'Kick off the recursive walk at the root <html> element
    TranslateNode(poxml.DocumentElement)
    If _IsManaul Then
        'The manual switch puts the script on the page to allow for user translation.
        Dim xmlNode As XmlNode = poxml.GetElementsByTagName("head")(0)
        Dim xmlScript As XmlElement = poxml.CreateElement("script")
        xmlScript.InnerXml = GetManualScript()
        xmlNode.AppendChild(xmlScript)
    End If
    Return poxml.OuterXml
End Function

So, we are just kicking off the recursion that walks the XmlDocument here. In addition, we handle the case where the MT program can’t translate into the user's selected language, which is what the _IsManaul flag is all about (I know it’s misspelled in the code).

So, here comes the surgical part. We need to get just the text that users can see. We care about text in the HTML between tags, and we care about text in the value attributes of buttons. What we don’t want to touch is textboxes or textareas, since users will be writing in their own language there. We also don’t care about what’s in script or style tags.

Private Sub TranslateNode(ByVal poNode As XmlNode)
    'Find where the text we want to translate is
    Dim lbTraverse As Boolean = True
    Select Case poNode.Name.ToLower
        Case "#text" 'catches text between tags
            poNode.Value = TranslateText(poNode.Value, Nothing)
        Case "input"
            Dim lsType As String
            If poNode.Attributes.ItemOf("type") Is Nothing Then
                lsType = "text"
            Else
                lsType = poNode.Attributes.ItemOf("type").Value.ToLower
            End If
            If Not poNode.Attributes.ItemOf("value") Is Nothing Then
                Select Case lsType
                    Case "button", "submit" 'only button captions, never user input
                        poNode.Attributes.ItemOf("value").Value = _
                            TranslateText(poNode.Attributes.ItemOf("value").Value, poNode)
                End Select
            End If
        Case "script", "style", "textarea" 'don't care about anything in here.
            lbTraverse = False
    End Select
    If lbTraverse Then 'Go on down the recursion
        For Each loNode As XmlNode In poNode.ChildNodes
            TranslateNode(loNode)
        Next
    End If
End Sub
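
For readers who want to experiment with the traversal rules outside of .NET, the following is a rough equivalent sketched in Python with the standard library's xml.dom.minidom. It is an illustration only: translate_node, SKIP_TAGS, and TRANSLATABLE_INPUTS are names made up for this sketch, and the translate argument is any function you supply. One small difference from the listing above is that whitespace-only text nodes are skipped.

```python
from xml.dom import minidom

SKIP_TAGS = {"script", "style", "textarea"}   # never descend into these
TRANSLATABLE_INPUTS = {"button", "submit"}    # only button captions are translated


def translate_node(node, translate):
    """Recursively translate visible text, leaving user input and code alone."""
    if node.nodeType == node.TEXT_NODE:
        if node.data.strip():                 # skip whitespace-only nodes
            node.data = translate(node.data)
        return
    if node.nodeType == node.ELEMENT_NODE:
        name = node.tagName.lower()
        if name in SKIP_TAGS:
            return
        if name == "input":
            # A missing type attribute means a text box, as in HTML.
            input_type = (node.getAttribute("type") or "text").lower()
            if input_type in TRANSLATABLE_INPUTS and node.hasAttribute("value"):
                node.setAttribute("value", translate(node.getAttribute("value")))
    for child in node.childNodes:
        translate_node(child, translate)


if __name__ == "__main__":
    doc = minidom.parseString(
        "<html><head><script>var a = 1;</script></head>"
        '<body><p>hello</p><input type="submit" value="send"/></body></html>'
    )
    translate_node(doc, str.upper)            # str.upper stands in for a translator
    print(doc.documentElement.toxml())
```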

So, now that we have the text we want to translate, how do we translate it? The downside of using MT is that it’s slow, at least slower than just looking the word up in a database. So, once we translate a word or phrase, we save it to the database for use by this website or any other website that uses this HTTP module.

Private Function TranslateText(ByVal psString As String, _
       ByVal pnode As XmlNode) As String

    Dim result As String = String.Empty
    'Clean the text up so we can translate it.
    'vbCrLf goes first; otherwise removing vbCr alone leaves nothing for it to match.
    psString = psString.Replace(vbCrLf, "")
    psString = psString.Replace(vbCr, "")
    psString = psString.Replace(vbLf, "")
    psString = psString.Replace(vbTab, "")
    psString = psString.Trim
    result = Lookup(psString) 'Try looking it up out of the database
    If result.Trim = String.Empty Then 'If it's not there then translate it
        Dim loTranslator As ITranslator = New Systran 'Set up your own translator interface
        'Make sure you have a translator and that translator will translate the user's language
        If loTranslator IsNot Nothing AndAlso _
                Array.Exists(loTranslator.AvailableLanguages, AddressOf HasLanguage) Then
            result = loTranslator.Translate(psString, "en", Language) 'Get the word
            SaveWord(psString, result) 'And save it off so you don't ever have
            'to go get it again
        Else
            'If you don't support the language or don't have a translator,
            'set the Manual switch so users can translate it themselves
            _IsManaul = True
            result = psString
        End If
    End If
    Return result
End Function
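
The lookup-first, translate-second, save-afterwards flow above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the article's implementation: CachingTranslator is a hypothetical name, the cache is a plain dictionary rather than the shared database, the translator is any callable you pass in, and whitespace is collapsed to single spaces rather than deleted as the VB.NET listing does.

```python
class CachingTranslator:
    """Cache-first translation with a manual-translation fallback flag."""

    def __init__(self, translator, languages, cache=None):
        self._translator = translator       # callable(text, source, target) -> str
        self._languages = set(languages)    # targets the MT engine supports
        self._cache = {} if cache is None else cache
        self.needs_manual = False           # set when MT cannot handle the language

    def translate(self, text, target):
        cleaned = " ".join(text.split())    # collapse CR, LF, tabs, extra spaces
        if not cleaned:
            return text                     # nothing visible to translate
        key = (cleaned, target)
        if key in self._cache:              # learned earlier: no MT call needed
            return self._cache[key]
        if target not in self._languages:
            self.needs_manual = True        # let users supply the translation
            return text
        result = self._translator(cleaned, "en", target)
        self._cache[key] = result           # remember it for every later request
        return result


if __name__ == "__main__":
    def fake_mt(text, source, target):      # stand-in for a real MT service
        return f"[{target}] {text}"

    mt = CachingTranslator(fake_mt, ["ja"])
    print(mt.translate("Save changes", "ja"))
    print(mt.translate("Save changes", "vi"), mt.needs_manual)
```

Passing one shared cache object to several instances mimics the effect described in the article, where a phrase learned by one site is immediately available to the others.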

Here is where the magic happens. I set this up so you can add your own translator by just implementing ITranslator. We are looking at using Systran, which has a Web Service API.

Public Function Translate(ByVal psString As String, _
        ByVal psFromLanguage As String, ByVal psToLanguage As String) _
        As String Implements ITranslator.Translate

    'The configured URL contains an @Language placeholder for the target language
    Dim lsURL As String = My.Settings.TransaltionURL.Trim.Replace("@Language", psToLanguage)
    Using loWebclient As New WebClient
        'Make sure you tell the web client to be Unicode compliant
        loWebclient.Encoding = Encoding.UTF8
        Dim result As String = loWebclient.UploadString(lsURL, psString).Trim
        Return result.Replace("body=", "") 'Strip the "body=" marker from the response
    End Using
End Function

So What If…

What if you don’t have an MT server, or the user is using a language which the MT can’t translate? You need a way of letting the users translate the pages for you. It’s the Wikipedia approach to site translation. By creating a list of the phrases commonly used on the site and in the controls of the web pages, you can have all of that translated up front, either with one of the online translators or by having users fill in the translations. That will handle the static content well enough.

Then, there is the user-driven content. We need a way for users to select content presented in English and translate it for us. So, if the manual translation switch is on, we load into the HTML the JavaScript needed to let the user select content, and on mouse-up, we open a dialog where they can offer a suggestion. On our sites, we capture the user’s login since we use single sign-on, so we can fire them if they put in anything inappropriate.

en.jpg

This is the English version of the demo. No translation takes place.

jp.jpg

This is translated to Japanese using Systran MT.

Manual.jpg

Pardon the spelling. Systran does not translate Vietnamese. So, the user will need to offer a translation. They just select the English text and a dialog pops up asking for the translation. This is then stored in the database for use by all websites.

db.jpg

This is the database table. Across a site or sites of any size, this will get pretty big, but it can be used by all sites. Once a site learns a word, all sites know it at that point, and don't need to go look it up.
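
The actual table layout is only visible in the screenshot, so the following is a guess at the general shape rather than the real schema. It is a small Python sketch using the standard sqlite3 module, with hypothetical column names (source_text, language, translation) and hypothetical helper names that mirror the Lookup and SaveWord calls in the VB.NET code. Keying on the English phrase plus the target language is what lets every site share a single entry per phrase.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (and create if needed) a hypothetical shared translation table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS translations (
               source_text TEXT NOT NULL,
               language    TEXT NOT NULL,
               translation TEXT NOT NULL,
               PRIMARY KEY (source_text, language)
           )"""
    )
    return conn


def save_word(conn, source_text, language, translation):
    """Store a translation, replacing any earlier one (e.g. a user correction)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO translations VALUES (?, ?, ?)",
            (source_text, language, translation),
        )


def lookup(conn, source_text, language):
    """Return the stored translation, or an empty string if none is known."""
    row = conn.execute(
        "SELECT translation FROM translations"
        " WHERE source_text = ? AND language = ?",
        (source_text, language),
    ).fetchone()
    return row[0] if row else ""


if __name__ == "__main__":
    store = open_store()
    save_word(store, "Hello", "ja", "こんにちは")
    print(lookup(store, "Hello", "ja"))
```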

Of course, if the user misspells something, as in the manual translation screenshot, it will show up misspelled, so adding a multilingual spell checker would be a nice touch.

Conclusion

Get rid of the RESX files and let your websites learn from each other using MT as a teacher. I hope this is an idea you can expand on. The code will probably need some massaging to get it baked into your site, but hopefully, it is an idea you can use.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
