Introduction
I started the development of the GameWatch project in December 2002 with
the idea of combining both my need for a free (open-source) game browser
and experimenting with c# development. In abstract, GameWatch is an online game servers
browser for games like Half-Life or Unreal Tournament 2003 (in other words, it's not unlike
GameSpy or the All-Seeing-Eye).
One of its numerous internet-related feature is getting the country
code from the IP address of a server, so that the users can filter out
the servers displayed according to the country they live in (in the
hope of lowering the latency to the game server).
In this short article, I'll first introduce some features of IP
addresses, then discuss the algorithm and data structure used for the
implementation of the GameWatch's IP-to-CountryCode converter.
What is an IP address?
Nowadays, few people can claim they have never seen an IP address. This
four-number long string with dots actually represents a single 32-bit
unsigned integer, where each number in the four-number format is a byte
from this integer. In the common representation one can see daily, the
four bytes are extracted in network order from the integer. Network
order is equivalent to big-endian order, which basically means that an
IP address which 32-bit integer value is ABCD is mapped to A.B.C.D, and
not D.C.B.A as it is usually the case on intel-family computers.
So, the first interesting property we get from this, is that we can
handle those IP addresses as simply as unsigned 32-bit integers.
Another interesting property is that IP addresses have a semantic:
the left part of the address specifies the network number, the right
part of the address represents the local address inside this
network. The size of the left and right parts depends on the kind of
network. In IPV4, there are three kinds of network type:
- Class A: the network code is 8 bits and the local address is
24 bits.
- Class B: the network code is 16 bits and the local address is 16
bits.
- Class C: the network code is 24 bits and the local address is 8 bits.
To be very precise, there are not really 8, 16, and 24 bits for the
network codes, as a few bits are taken to help distinguish the network
types, but you got the idea, and this is not relevant for our purpose in
this article.
Well, what we learned here, is that we won't need all of the 32 bits
of the IP address to get the country, as we can omit the local address,
which value is always specific to its network. Thus, the significant
part of an IP address for our IP-to-Country conversion can vary from 8
to 24 bits.
The standard internet database
The IP addresses of the internet are not allocated by a single
organization, but by four Regional Internet Registries (RIRs), each one
being responsible for a region of the world.
- The RIPE "R�seaux IP Europ�ens", which name means in english Europeans IP
Networks.
- APNIC is the Asia Pacific Network Information Centre.
- ARIN is the American Registry for Internet Numbers.
- LACNIC is the Latin American and Caribbean Internet Addresses
Registry.
Each of these entities manage their IP allocation database, which are
publicly available on their web site in text format. Those databases
contain a list of lines of the form:
apnic|nz|ipv4|202.0.32.0|20|19930101|allocated
apnic|nz|ipv4|202.0.48.0|22|19930101|allocated
apnic|jp|ipv4|202.0.65.0|24|19930101|assigned
As you can see, the tokens are separated by '|'
characters,
and the elements of particular interest for us are the country (second
token) and the network address (fourth token). The other tokens
represent the RIR responsible for the allocation, the version of the IP
protocol used (IPv4 is described in RFC 791), the number of possible
local address to be used with this network code, and the date of the
initial allocation.
The four databases can be FTP'ed from the ftp site of the four RIRs.
Now that we know how an IP address is composed, and that we have a
database to find the country of an IP address, what do we do next ?
Well, we must find a way to store the information and retrieve it as
fast as possible.
Finding the best data structure
The .Net framework provides a few data container and lookup class, like
ArrayList, HashTable, or Stack for instance, but none of these data
structure are suitable for our needs. Let's recapitulate our design
goals:
- We need a data structure that can store thousands of entries with a
low space complexity (our so expensive memory is used for other
tasks). There are currently more than 47000 networks defined in the RIRs
databases.
- The data structure must be able to match a variable prefix (8 to 24
bits of network code) for a 32-bit integer (we don't know in advance how
long is the network code in a given IP address).
- We need it to be fast (because we display thousands of country
codes in a user interface, which must stay reactive).
The Hashtable
class does not fit, as it cannot match
keys with part of the key entries (the Hashtable algorithm would test
the IP address against each possible network code size, which is not
acceptable).
The arraylist
(and all the flavours of Linked List one
can think about) does not fit either, as each IP address test would
require at most O(N) look-ups, with N the number of entries in the
database (and we have a lot of them).
If we dive in the world of the internet software programming, we can
easely find a domain that has a very similar issue: IP routing
table lookup. IP routing involves finding the best network mask for a
given target IP address, in order to find the IP address of a connected
gateway that can reach the target.
Well, we are not doing routing, but we have a similar goal: finding
the network code that matches our IP address. Note that there are a
number of differences as well: we are not interested in frequently
updating our table (adding or removing a network mask), and we have only
one possible matching network code, while IP routing usually have
several ones (this adds the constraint that only the longest-matching
network must be selected).
So, how are our friends from IP Routing doing ? In fact, many solutions
have been proposed in hundreds of papers. It goes from the simplest
binary-trees (with the space complexity drawbacks) to the most
sophisticated Peano Count Trees with partitionning based on prefix
length [HABIB].
However the most commonly used structures are taken in the family of
the Tries (pronounced like "tree", most urban legends say the name comes
from reTRIEval trees). Tries are tree-based structures where each node
represents a part of the item key. For instance, a character-based trie
would store the two words "eater" and "eating" as follows:
Figure 1: eater/eating strings in a Trie
There are several variants of the Trie data structure, one of the
most efficient being the PATRICIA (Practical Algorithm To Retrieve
Information Coded In Alphanumeric) Trie [MORRISON68].
The main
characteristic of the PATRICIA trie is the way it eliminates
unnecessary nodes by grouping the sequences of tokens whenever possible.
Using a PATRICIA trie, the previous example would be encoded
as:
Figure 2: eater/eating strings in a PATRICIA Trie
FreeBSD for instance uses a radix trie structure (very similar to
the PATRICIA one) to hold its ip routing table, and many software
routers use a PATRICIA trie to hold their ip tables (FutureSoft's IPV6
Routers, for instance).
The reader can easely find a lot of descriptions on the Patricia Trie
around the web. Basically, let me stress the following property of the
PATRICIA Trie: it is a compact data structure that can give you the
longest prefix of an entry key in O(N) steps (in the worst case), with N
the length of the longest prefix (see [PACKER2000] for a good
performance analysis).
Implementing a Patricia Trie to store IP Tables
First, we need a class to store the IP addresses and manage them at the
bit-level. Unfortunately, the BitArray class provided by the .Net
Framework lacks a few methods that we really need. The GameWatch
project provides a very simple BitVector class that implements the main
following methods:
-
LongestCommonPrefix()
returns the longest common
prefix of the BitVector object and the one provided as argument.
-
AddData()
and AddAscii()
that add data to the bitvector using
standard data types.
The BitVectorTrie class implements a PATRICIA Trie using
BitVector objects as keys for the indexing of the data. It contains mainly three methods:
-
public void Add(BitVector key, object data)
-
public object Get(BitVector key)
-
public object GetBest(BitVector key)
The Add()
method stores the "data" object using the
BitVector "key". Conversely, the Get()
method retrieves an
object stored using a BitVector key.
GetBest()
is the most interesting method for our
purpose of mapping an IP address to a country code: it walks the
nodes of the trie, following the path provided in the
BitVector
key. The method walks as much of the trie as it
can, and return the value stored at the end of the path (in our case,
the BitVector
object describes the path to follow).
For instance, the GetBest()
method used with the
following BitVector
{00110100101 1100}
as
argument retrieves the object highlighted in the figure below.
Figure 3: Look-up of {00110100101 1100} in a PATRICIA Trie
In the example above, the trie look-up stops at the
00110100101
prefix, and returns the object stored in the
last node of the path. This node represents the longest prefix for the
BitVector
given as argument.
As the source code provided is quite simple and documented, the
reader can refer to it for the details of the PATRICIA trie
implementation.
The IPToCountry class
Now we have the data needed for the mapping (provided by the RIRs)
as well as the structure and algorithm to hold it. So, let's build our
IP-to-CountryCode converter: The IPToCountry class.
This class basically contains a
BitVectorTrie
member, and
provides a method to load a standard RIR database as described at the
beginning of this article. The
GetCountry()
method takes a
string containing an IP address in the "aaa.bbb.ccc.ddd" format, and
return the country code as defined by the ISO3166 standard. The
Load()
method is so simple I won't bother the reader with a
description of it.
Benchmark results
Benchmarks are always criticized, but here are the results I get on my
standard Athlon XP 2000:
f:\Documents and Settings\Rodrigo\Mes documents\projects\iptocountry>bin\itctest.exe
IP tables loaded in 00:00:02.2532400 (47203 networks)
IP list loaded in 00:00:00.0200288 (25000 ip addresses)
Testing ip look-up speed...
25000 addresses were translated in 00:00:01.2417856 (20132 ip/s)
Country code for 195.149.21.72: UK
Country code for 81.3.59.134: DE
Country code for 194.134.233.72: NL
The benchmark program loads a list of 25000 IP addresses that run a
game server (taken from the unreal tournament 2003 and Half-Life servers
lists) and performs a look-up for each of them.
The results show an average of 20'000 IP addresses processed per
second with a nice space complexity, which fits perfectly our
requirements.
A few notes on the provided source code
The code provided with this article mainly comes from the
GameWatch project (an entry point with
some benchmarks was added though). It is released under the terms of the
GNU General Public License (see
http://www.gnu.org/copyleft/gpl.html
for details). According to this license, you can use, modify, and
redistribute the software as long as you keep it under the GPL license.
Implementation Note: In the implemented algorithm it is not taken into account the fact
that the local addresses in a same network can be dispatched in several
countries. Indeed, international organisations can use their network as
they wish, and a company can dispatch its IP over several
countries. There is unfortunately nothing we can do against it, as this
is the private domain of the owner of the network code.
Bibliography