Introduction
This article presents code for an IDNA-client (Internationalising Domain Names in Applications).
With the growing influence of the Internet on society, it is increasingly important to present brand-names etc. correctly in the native language. It is more straightforward for most people to access a web-page or send an email using the correct name instead of some ASCII'fied version of it. And with the depletion of domain names in the various top-level domains (especially .COM), support for names using non-ASCII characters is needed more than ever.
Most TLDs are now open for registering domain-names with international characters. But you can most likely not register a name in, e.g. the Danish domain .dk containing Hungarian accented characters. The .com, .org, .net and .biz domains probably accept ISO-Latin and all East-Asian character sets. Check your local registrar for details.
For several reasons, the IETF designers wanted to keep UTF-8 and other encoded strings out of the Domain Name protocol (port 53, RFC-1035), mainly to stay backward compatible with the existing protocol and DNS servers. No new software on the server side should be required to support IDNA. So pure US-ASCII was needed to represent national characters in IDNs (Internationalised Domain Names). Hence they came up with a pretty slick scheme called Punycode. The details are in RFC-3490 and RFC-3492.
Side-note: Windows-2000/XP do send a UTF-8 encoded query for IDNs (ref. RFC-2044 and RFC-2181). And Windows/2003-Server seems to support Stuart Kwan's draft. These methods will most likely not work since very few DNS servers support UTF-8 directly. But there is a remedy for IE/OE users. See references below.
Domain name conversion
In order to query a name server for an IDN, the name must be converted to ACE (ASCII Compatible Encoding). Here is an example. Suppose you want to resolve the host-name www.blåbærsyltetøy.no (www.blueberryjam.no for you English speakers. And yes, the name really exists).
The algorithm goes like this:
As can be seen, the converted name is longer than the original. My code uses fixed size buffers and just makes a conservative guess on the maximum sized result. 2 times MAX_HOST_LEN (2*256) should hopefully be enough for most names. I don't know what would happen with East-Asian characters converted to ACE, e.g. Big5, GBK or Shift-JIS. I cannot easily test this since my Windows box does not support these encodings.
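To make the Punycode step of the conversion concrete, here is a minimal sketch of encoding a single label, following the encoding procedure in RFC-3492. It is not the punycode.c shipped with this article: the names sketch_label_to_ace() and put() are hypothetical, the input is assumed to be already converted to Unicode code points, and nameprep, overflow checks and mixed-case annotation are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Punycode parameters from RFC-3492, section 5. */
enum { BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700,
       INITIAL_BIAS = 72, INITIAL_N = 128 };

/* Map a digit value 0..35 to 'a'..'z', '0'..'9'. */
static char encode_digit(uint32_t d)
{
    return (char)(d < 26 ? 'a' + d : '0' + (d - 26));
}

/* Bias adaptation, RFC-3492 section 6.1. */
static uint32_t adapt(uint32_t delta, uint32_t numpoints, int first)
{
    uint32_t k = 0;

    delta = first ? delta / DAMP : delta / 2;
    delta += delta / numpoints;
    while (delta > ((BASE - TMIN) * TMAX) / 2) {
        delta /= BASE - TMIN;
        k += BASE;
    }
    return k + (BASE - TMIN + 1) * delta / (delta + SKEW);
}

/* Append one character, always keeping room for the terminating NUL. */
static int put(char *out, size_t cap, size_t *pos, char c)
{
    if (*pos + 1 >= cap)
        return -1;
    out[(*pos)++] = c;
    return 0;
}

/*
 * Hypothetical helper (not part of idna.h): encode one label given as
 * Unicode code points. A pure ASCII label is copied unchanged; anything
 * else becomes "xn--" followed by its Punycode form.
 * Returns 0 on success, -1 if 'out' is too small.
 */
int sketch_label_to_ace(const uint32_t *cp, size_t len, char *out, size_t cap)
{
    uint32_t n = INITIAL_N, delta = 0, bias = INITIAL_BIAS;
    size_t i, basic = 0, handled, pos = 0;

    assert(cp != NULL && out != NULL);

    for (i = 0; i < len; i++)
        if (cp[i] < 0x80)
            basic++;

    if (basic < len)                      /* non-ASCII present: add ACE prefix */
        for (i = 0; i < 4; i++)
            if (put(out, cap, &pos, "xn--"[i]) < 0)
                return -1;

    for (i = 0; i < len; i++)             /* copy the basic code points first */
        if (cp[i] < 0x80 && put(out, cap, &pos, (char)cp[i]) < 0)
            return -1;

    if (basic < len && basic > 0 && put(out, cap, &pos, '-') < 0)
        return -1;

    for (handled = basic; handled < len; delta++, n++) {
        uint32_t m = UINT32_MAX;

        for (i = 0; i < len; i++)         /* smallest code point not yet handled */
            if (cp[i] >= n && cp[i] < m)
                m = cp[i];
        delta += (m - n) * (uint32_t)(handled + 1);
        n = m;

        for (i = 0; i < len; i++) {
            if (cp[i] < n)
                delta++;
            if (cp[i] == n) {             /* emit delta as a variable-length integer */
                uint32_t q = delta, k, t;

                for (k = BASE; ; k += BASE) {
                    t = k <= bias ? TMIN : (k >= bias + TMAX ? TMAX : k - bias);
                    if (q < t)
                        break;
                    if (put(out, cap, &pos, encode_digit(t + (q - t) % (BASE - t))) < 0)
                        return -1;
                    q = (q - t) / (BASE - t);
                }
                if (put(out, cap, &pos, encode_digit(q)) < 0)
                    return -1;
                bias = adapt(delta, (uint32_t)(handled + 1), handled == basic);
                delta = 0;
                handled++;
            }
        }
    }
    if (pos >= cap)
        return -1;
    out[pos] = '\0';
    return 0;
}
```

A full host-name would be handled by splitting it at the dots and passing each label through such a function. The sketch writes the encoded part in lower case only.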
Using the code
The public interface to the conversion functions is in idna.h. The punycode.* files are only for internal use. punycode.* come straight from RFC-3492 with some adaptations for Windows. getopt.* are only needed by the demo source files to parse the command-line. I chose not to make this an all C++ implementation since that would exclude C users. It is callable from any type of Windows program (MFC-based, console-mode etc.). Later versions may include an option to use libiconv in addition to Windows' NLS functions.
The main functions:
IDNA_init()
BOOL IDNA_init (WORD code_page);
Initialize the IDNA converter using the requested code_page. This can be 0 to use the system's default ANSI codepage (CP_ACP). If you use the C++ wrapper class, there's currently no way to specify other codepages.
IDNA_convert_to_ACE()
BOOL IDNA_convert_to_ACE (char *name, size_t *size);
Tries to convert name to the ACE-form using the codepage specified in IDNA_init(). name points to the buffer to convert. On input, *size must specify the maximum size of the buffer to convert. On output, *size tells you the size of the ACE-converted buffer.
Note: if name contains only US-ASCII (0x7F and below), no conversion is done and name will remain unchanged.
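The US-ASCII test in the note above can be pictured as a simple byte scan. This is only a sketch of the idea; sketch_is_all_ascii() is a hypothetical name and not a function in idna.h, and the real check inside the library may differ.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if every byte of 'name' is 0x7F or
 * below (so no ACE conversion would be needed), 0 otherwise.
 */
int sketch_is_all_ascii(const char *name)
{
    const unsigned char *p = (const unsigned char *)name;

    assert(name != NULL);
    for (; *p != '\0'; p++)
        if (*p > 0x7F)
            return 0;
    return 1;
}
```

Casting to unsigned char matters here: with a signed char, bytes above 0x7F become negative and a plain "> 0x7F" comparison would never be true.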
IDNA_convert_from_ACE()
BOOL IDNA_convert_from_ACE (char *name, size_t *size);
Tries to convert name from the ACE-form to a string in the codepage specified in IDNA_init(). On input, *size must specify the maximum size of the buffer to convert. On output, *size tells you the size of the converted buffer.
Note: if name contains no labels with a "xn--" prefix, no conversion is done and name will remain unchanged.
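As an illustration of the "xn--" test in the note above, the sketch below walks the dot-separated labels of a name and looks for the ACE prefix. sketch_has_ace_label() is a hypothetical helper, not part of idna.h. It compares the prefix without regard to case, which is what RFC-3490 asks for; whether the library does the same is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if any dot-separated label of 'name'
 * starts with "xn--" (compared case-insensitively), 0 otherwise.
 */
int sketch_has_ace_label(const char *name)
{
    static const char prefix[] = "xn--";
    const char *label = name;

    assert(name != NULL);
    while (*label != '\0') {
        size_t i = 0;

        /* Compare up to 4 characters; stops at once on NUL or mismatch. */
        while (i < 4 && tolower((unsigned char)label[i]) == prefix[i])
            i++;
        if (i == 4)
            return 1;

        /* Skip to the start of the next label, if there is one. */
        label = strchr(label, '.');
        if (label == NULL)
            break;
        label++;
    }
    return 0;
}
```

Only the start of each label is checked, so a name like "axn--b.com" is not mistaken for an ACE name.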
IDNA_strerror()
Returns an error-string for the supplied error-number, which is _idna_errno in most cases.
Status codes
The above functions (except IDNA_strerror()) return TRUE on success or FALSE on any error (no surprise here). Use IDNA_strerror(_idna_errno) to check why. If _idna_errno == IDNAERR_WINNLS, the error was in one of the WinNLS functions. Use GetLastError() to obtain the error. Note that _idna_errno is not cleared on a subsequent successful call.
C++ wrapper class
The CIDNA_resolver class is a simple wrapper for the C-code implementation. Using it resembles the ::gethostbyname() function.
Minimal example:
CIDNA_resolver idna;
struct hostent *he = idna.gethostbyname (name);
If you want to know the ACE name of your supplied name, extract it from he->h_aliases[0] (but only if he is non-NULL of course). This erases the original alias (if one should exist), but it was the best I could do at the moment.
CIDNA_resolver::gethostbyaddr() is provided to convert an IPv4 address into a name of your codepage. I haven't found any ACE domain-name with a PTR record, so this function was tested with the hosts file only.
The accompanying Makefile is for MSVC, MingW and CygWin. Issue one of these commands:
make msvc
make gcc
to build the demos using Visual-C or MingW/CygWin respectively. And yes, the Makefile requires GNU's make (since I'm tired of the limitations in nmake). I'm too lazy to provide a VC6/7 project file (since I'm a MingW addict). Using the code in your project should be as easy as adding punycode.* and idna.* to your project. The code has been tested on Win-XP only. If it doesn't work on your OS, I'd be happy to hear why. But first, run one of the demo programs with full debug. E.g. "demo-1-vc.exe -c850 -dddd www.tromsø.no", and study the printed trace output.
Considerations
Some protocols that expose hostnames in the application layer will have problems with IDNs, most notably HTTP 1.1. If you try to fetch the URL http://www.blåbærsyltetøy.no/some/document, and you're able to resolve it (via the hosts file etc), this will be sent:
GET /some/document HTTP/1.1
User-Agent: whatever
Host: www.blåbærsyltetøy.no
This will probably not work if the Web-server serves multiple domains. It will simply not match the "Host:" header-line against the domains it serves. Therefore, applications should send the ACE form of the hostname instead:
GET /some/document HTTP/1.1
User-Agent: whatever
Host: www.xn--blbrsyltety-y8aO3x.no
At least Apache 1.3 returns different results depending on what Host header is sent. Using HTTP 1.0 could be a solution for these cases, but I've not tested this (and don't understand virtual servers that well either). Expect such problems to arise more often as IDNs become more popular.
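A client could guard against sending a raw 8-bit name in the "Host:" header along the lines of the sketch below. It formats the same minimal request as shown above and refuses any host name that still contains bytes above 0x7F, i.e. one that has not been through the ACE conversion. sketch_build_get() is a hypothetical helper and not part of the accompanying code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a minimal HTTP/1.1 GET request into 'out'
 * using 'ace_host' in the Host header. Returns the request length, or
 * -1 if the host contains non-ASCII bytes or 'out' is too small.
 */
int sketch_build_get(char *out, size_t cap, const char *ace_host, const char *path)
{
    const unsigned char *p;
    int n;

    assert(out != NULL && ace_host != NULL && path != NULL);

    /* A host name that is not pure US-ASCII was not converted to ACE. */
    for (p = (const unsigned char *)ace_host; *p != '\0'; p++)
        if (*p > 0x7F)
            return -1;

    n = snprintf(out, cap,
                 "GET %s HTTP/1.1\r\n"
                 "User-Agent: whatever\r\n"
                 "Host: %s\r\n"
                 "\r\n",
                 path, ace_host);
    if (n < 0 || (size_t)n >= cap)
        return -1;                /* output error or truncated request */
    return n;
}
```

The caller is expected to pass the name through the ACE conversion first and hand the result to this function.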
The DNS system is generally not case-sensitive when matching a normal domain-name. This is not the case in IDNA. A domain-name in the ACE-form is normally only stored in its lower case version. E.g. the DNS system has this entry for www.tromsø.no in the A record:
www.xn--troms-zuA.no. IN A 195.159.151.136
To handle the uppercase version www.tromsØ.no, an A record for "www.xn--troms-ipA.no" would have to be stored too. That would lead to too many combinations of ACE-forms (and MX records), so this is not generally done.
Note: This code does nothing to display the converted name (in IDNA_convert_from_ACE) correctly. It just assumes you're using the correct font for whatever codepage is used. demo-1.c has a codepage (-c) option. Use that to experiment or call IDNA_init() with the proper codepage.
References