(untagged)

Optimized IP to ISO3166 Country Code Mapping in C#

R. Reyes

0.00/5 (No votes)

12 Feb 2003

A .Net class that converts an IP Address to the Country Code where the computer is physically located.

Download source files and executable - 431 Kb

Introduction

I started the development of the GameWatch project in December 2002 with the idea of combining both my need for a free (open-source) game browser and experimenting with c# development. In abstract, GameWatch is an online game servers browser for games like Half-Life or Unreal Tournament 2003 (in other words, it's not unlike GameSpy or the All-Seeing-Eye).

One of its numerous internet-related feature is getting the country code from the IP address of a server, so that the users can filter out the servers displayed according to the country they live in (in the hope of lowering the latency to the game server).

In this short article, I'll first introduce some features of IP addresses, then discuss the algorithm and data structure used for the implementation of the GameWatch's IP-to-CountryCode converter.

What is an IP address?

Nowadays, few people can claim they have never seen an IP address. This four-number long string with dots actually represents a single 32-bit unsigned integer, where each number in the four-number format is a byte from this integer. In the common representation one can see daily, the four bytes are extracted in network order from the integer. Network order is equivalent to big-endian order, which basically means that an IP address which 32-bit integer value is ABCD is mapped to A.B.C.D, and not D.C.B.A as it is usually the case on intel-family computers.

So, the first interesting property we get from this, is that we can handle those IP addresses as simply as unsigned 32-bit integers.

Another interesting property is that IP addresses have a semantic: the left part of the address specifies the network number, the right part of the address represents the local address inside this network. The size of the left and right parts depends on the kind of network. In IPV4, there are three kinds of network type:

Class A: the network code is 8 bits and the local address is 24 bits.
Class B: the network code is 16 bits and the local address is 16 bits.
Class C: the network code is 24 bits and the local address is 8 bits.

To be very precise, there are not really 8, 16, and 24 bits for the network codes, as a few bits are taken to help distinguish the network types, but you got the idea, and this is not relevant for our purpose in this article.

Well, what we learned here, is that we won't need all of the 32 bits of the IP address to get the country, as we can omit the local address, which value is always specific to its network. Thus, the significant part of an IP address for our IP-to-Country conversion can vary from 8 to 24 bits.

The standard internet database

The IP addresses of the internet are not allocated by a single organization, but by four Regional Internet Registries (RIRs), each one being responsible for a region of the world.

The RIPE "R�seaux IP Europ�ens", which name means in english Europeans IP Networks.
APNIC is the Asia Pacific Network Information Centre.
ARIN is the American Registry for Internet Numbers.
LACNIC is the Latin American and Caribbean Internet Addresses Registry.

Each of these entities manage their IP allocation database, which are publicly available on their web site in text format. Those databases contain a list of lines of the form:

apnic|nz|ipv4|202.0.32.0|20|19930101|allocated
apnic|nz|ipv4|202.0.48.0|22|19930101|allocated
apnic|jp|ipv4|202.0.65.0|24|19930101|assigned

As you can see, the tokens are separated by '|' characters, and the elements of particular interest for us are the country (second token) and the network address (fourth token). The other tokens represent the RIR responsible for the allocation, the version of the IP protocol used (IPv4 is described in RFC 791), the number of possible local address to be used with this network code, and the date of the initial allocation.

The four databases can be FTP'ed from the ftp site of the four RIRs.

Now that we know how an IP address is composed, and that we have a database to find the country of an IP address, what do we do next ? Well, we must find a way to store the information and retrieve it as fast as possible.

Finding the best data structure

The .Net framework provides a few data container and lookup class, like ArrayList, HashTable, or Stack for instance, but none of these data structure are suitable for our needs. Let's recapitulate our design goals:

We need a data structure that can store thousands of entries with a low space complexity (our so expensive memory is used for other tasks). There are currently more than 47000 networks defined in the RIRs databases.
The data structure must be able to match a variable prefix (8 to 24 bits of network code) for a 32-bit integer (we don't know in advance how long is the network code in a given IP address).
We need it to be fast (because we display thousands of country codes in a user interface, which must stay reactive).

The Hashtable class does not fit, as it cannot match keys with part of the key entries (the Hashtable algorithm would test the IP address against each possible network code size, which is not acceptable).

The arraylist (and all the flavours of Linked List one can think about) does not fit either, as each IP address test would require at most O(N) look-ups, with N the number of entries in the database (and we have a lot of them).

If we dive in the world of the internet software programming, we can easely find a domain that has a very similar issue: IP routing table lookup. IP routing involves finding the best network mask for a given target IP address, in order to find the IP address of a connected gateway that can reach the target.

Well, we are not doing routing, but we have a similar goal: finding the network code that matches our IP address. Note that there are a number of differences as well: we are not interested in frequently updating our table (adding or removing a network mask), and we have only one possible matching network code, while IP routing usually have several ones (this adds the constraint that only the longest-matching network must be selected).

So, how are our friends from IP Routing doing ? In fact, many solutions have been proposed in hundreds of papers. It goes from the simplest binary-trees (with the space complexity drawbacks) to the most sophisticated Peano Count Trees with partitionning based on prefix length [HABIB].

However the most commonly used structures are taken in the family of the Tries (pronounced like "tree", most urban legends say the name comes from reTRIEval trees). Tries are tree-based structures where each node represents a part of the item key. For instance, a character-based trie would store the two words "eater" and "eating" as follows:

Figure 1: eater/eating strings in a Trie

There are several variants of the Trie data structure, one of the most efficient being the PATRICIA (Practical Algorithm To Retrieve Information Coded In Alphanumeric) Trie [MORRISON68].

The main characteristic of the PATRICIA trie is the way it eliminates unnecessary nodes by grouping the sequences of tokens whenever possible.

Using a PATRICIA trie, the previous example would be encoded as:

Figure 2: eater/eating strings in a PATRICIA Trie

FreeBSD for instance uses a radix trie structure (very similar to the PATRICIA one) to hold its ip routing table, and many software routers use a PATRICIA trie to hold their ip tables (FutureSoft's IPV6 Routers, for instance).

The reader can easely find a lot of descriptions on the Patricia Trie around the web. Basically, let me stress the following property of the PATRICIA Trie: it is a compact data structure that can give you the longest prefix of an entry key in O(N) steps (in the worst case), with N the length of the longest prefix (see [PACKER2000] for a good performance analysis).

Implementing a Patricia Trie to store IP Tables

First, we need a class to store the IP addresses and manage them at the bit-level. Unfortunately, the BitArray class provided by the .Net Framework lacks a few methods that we really need. The GameWatch project provides a very simple BitVector class that implements the main following methods:

LongestCommonPrefix() returns the longest common prefix of the BitVector object and the one provided as argument.
AddData() and AddAscii() that add data to the bitvector using standard data types.

The BitVectorTrie class implements a PATRICIA Trie using BitVector objects as keys for the indexing of the data. It contains mainly three methods:

public void Add(BitVector key, object data)
public object Get(BitVector key)
public object GetBest(BitVector key)

The Add() method stores the "data" object using the BitVector "key". Conversely, the Get() method retrieves an object stored using a BitVector key.

GetBest() is the most interesting method for our purpose of mapping an IP address to a country code: it walks the nodes of the trie, following the path provided in the BitVector key. The method walks as much of the trie as it can, and return the value stored at the end of the path (in our case, the BitVector object describes the path to follow).

For instance, the GetBest() method used with the following BitVector {00110100101 1100} as argument retrieves the object highlighted in the figure below.

Figure 3: Look-up of {00110100101 1100} in a PATRICIA Trie

In the example above, the trie look-up stops at the 00110100101 prefix, and returns the object stored in the last node of the path. This node represents the longest prefix for the BitVector given as argument.

As the source code provided is quite simple and documented, the reader can refer to it for the details of the PATRICIA trie implementation.

The IPToCountry class

Now we have the data needed for the mapping (provided by the RIRs) as well as the structure and algorithm to hold it. So, let's build our IP-to-CountryCode converter: The IPToCountry class. This class basically contains a BitVectorTrie member, and provides a method to load a standard RIR database as described at the beginning of this article. The GetCountry() method takes a string containing an IP address in the "aaa.bbb.ccc.ddd" format, and return the country code as defined by the ISO3166 standard. The Load() method is so simple I won't bother the reader with a description of it.

Benchmark results

Benchmarks are always criticized, but here are the results I get on my standard Athlon XP 2000:

f:\Documents and Settings\Rodrigo\Mes documents\projects\iptocountry>bin\itctest.exe
IP tables loaded in 00:00:02.2532400 (47203 networks)
IP list loaded in 00:00:00.0200288 (25000 ip addresses)
Testing ip look-up speed...
25000 addresses were translated in 00:00:01.2417856 (20132 ip/s)
Country code for 195.149.21.72: UK
Country code for 81.3.59.134: DE
Country code for 194.134.233.72: NL

The benchmark program loads a list of 25000 IP addresses that run a game server (taken from the unreal tournament 2003 and Half-Life servers lists) and performs a look-up for each of them.

The results show an average of 20'000 IP addresses processed per second with a nice space complexity, which fits perfectly our requirements.

A few notes on the provided source code

The code provided with this article mainly comes from the GameWatch project (an entry point with some benchmarks was added though). It is released under the terms of the GNU General Public License (see http://www.gnu.org/copyleft/gpl.html for details). According to this license, you can use, modify, and redistribute the software as long as you keep it under the GPL license.

Implementation Note: In the implemented algorithm it is not taken into account the fact that the local addresses in a same network can be dispatched in several countries. Indeed, international organisations can use their network as they wish, and a company can dispatch its IP over several countries. There is unfortunately nothing we can do against it, as this is the private domain of the owner of the network code.

Bibliography

[GWATCH03] The GameWatch Project
[MORRISON68] PATRICIA-- Practical Algorithm To Retrieve Information Coded in Alphanumeric (Donald R. Morrison)
Journal of the ACM (JACM), Volume 15, Issue 4 (October 1968)
[RFC791] RFC 791: INTERNET PROTOCOL DARPA INTERNET PROGRAM PROTOCOL SPECIFICATION
[PACKER2000] Performance Analysis of IP Routing Lookup Algorithms : Patricia Tree based vs. Hashing based (Packer Ali, Dhamim)
[HABIB] Prefix Lookups Algorithm for Fast IP Routing using P-Tree (Md. Ahsan Habib)
[WALDVOGEL&AL97] Scalable High Speed IP Routing Lookups (Marcel Waldvogely, George Varghesez, Jon Turnerz, Bernhard Plattnery)
ACM SIGCOMM'97
APNIC Region Database
ARIN Region Database
LACNIC Region Database
RIPE Region Database

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here