
The web architecture of The Internet map

8 Aug 2012
How Amazon's cloud solutions help tackle a huge load, and what it may cost

Introduction

 

You must have heard about The Internet map by now. If not, you can take a look at it here. Roughly speaking, The Internet map displays websites according to users' behavior: similar websites visited by the same people are placed close to one another, while websites that share no visitors end up far apart. The size of a website on the map is determined by its average click rate, and its color reflects the country it belongs to. You can find a more detailed description in the About section of the Map's website.

 

 

In this article I would like to explain how the website of The Internet map is organized, which technologies keep it running day to day, and what it took to sustain a massive surge of visitors wishing to have a look at the map.

The Internet map runs on technologies from today's internet giants: the Map's visual display is powered by the Google Maps engine, web requests are processed with Microsoft's .NET stack, and Amazon Web Services is responsible for hosting and content delivery. All three components are vital for the Map's normal operation. With some effort, alternatives could be found, but I am not convinced they would bring much benefit.

Amazon CloudFront and Google Maps

Google Maps technology is built on tiles – small 256x256-pixel images from which the map image is assembled. The main thing about these images is that there are a lot of them. Whatever part of the Map you see on your screen, at whatever zoom level, consists of these little pictures. That means the server has to handle a great many small requests and serve tiles concurrently, so that the user does not watch the mosaic assemble piece by piece. The total number of tiles needed to display the map equals sum(4^i), where i runs from 0 to N and N is the maximum zoom level. For The Internet map N = 14, so the number of tiles would be approximately 358 million. Fortunately, this astronomical number was reduced to 30 million by not generating empty tiles. If you open your browser console, you will see a huge number of 403 errors – those are precisely the omitted tiles, but you will not notice it on the Map: where a tile is missing, the square is automatically filled with the black background. At any rate, 30 million tiles is still a considerable number. Even copying or deleting the directory with the tiles in Windows took about two days. o_O
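
To sanity-check that arithmetic, here is a minimal C# sketch (mine, not from the Map's codebase) that evaluates the tile-count formula for N = 14:

using System;

class TileCount
{
    static void Main()
    {
        const int maxZoom = 14;        // zoom levels 0..14, as on The Internet map
        long total = 0;
        for (int i = 0; i <= maxZoom; i++)
            total += 1L << (2 * i);    // level i is a 2^i x 2^i grid, i.e. 4^i tiles
        Console.WriteLine(total);      // prints 357913941, roughly 358 million
    }
}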

Therefore, the conventional approach of placing content on a single dedicated server is not applicable here. There are plenty of tiles and plenty of users, so there should also be plenty of servers, located close to the users so that they do not notice any delay. Otherwise, users in Russia will get a good response time while Japanese users are reminded of the dial-up era every time they look at your map. Luckily, Amazon has a solution for this (and so does Akamai, but that is a different subject). It is called CloudFront, and it is essentially a global CDN (content delivery network). You deploy your content somewhere (called the origin) and create a distribution in CloudFront. Whenever a user requests your content, CloudFront automatically finds the node of its network closest to the user; if that node does not have a copy of your data yet, it fetches one either from another node that has it or from the origin. In effect, your data gets replicated many times over and is highly likely to be delivered from CloudFront servers rather than from your own expensive, feeble and unsafe data store.

For The Internet map, tapping into CloudFront meant that the data from my hard drive was copied into the Singapore region of Simple Storage Service (S3), and a distribution was then created in CloudFront via the AWS console, with S3 specified as the origin. If you look at the code of The Internet map's webpage, you can see that the tiles are served from the following CloudFront address: http://d2h9tsxwphc7ip.cloudfront.net/; determining the closest node, updating the content and so on is done by CloudFront automatically. Yippee!
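
For illustration, uploading the tiles to S3 from .NET looks roughly like this. This is a hedged sketch using the AWS SDK for .NET, not the code actually used for the Map; the bucket name, key scheme and local paths are my assumptions:

using System.Threading.Tasks;
using Amazon;
using Amazon.S3;
using Amazon.S3.Model;

class TileUploader
{
    static async Task Main()
    {
        // Singapore region, as mentioned above; the bucket name is hypothetical
        var s3 = new AmazonS3Client(RegionEndpoint.APSoutheast1);

        // Tiles are conventionally keyed as zoom/x/y so the map engine can request them directly
        await s3.PutObjectAsync(new PutObjectRequest
        {
            BucketName = "internet-map-tiles",
            Key = "14/8190/8191.png",
            FilePath = @"C:\tiles\14\8190\8191.png",
            ContentType = "image/png"
        });
        // A CloudFront distribution with this bucket as its origin then serves the tile at
        // http://d2h9tsxwphc7ip.cloudfront.net/14/8190/8191.png
    }
}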

In the picture you can see how the original map breaks down into tiles; the tiles are stored in S3, picked up from there by CloudFront, and finally served from its nodes to users.

Amazon Relational Database Service

To support searching for a website on the Map, a database is needed to store information about the websites and their coordinates. In this case it is an MS SQL Express instance in Amazon's cloud, a service called Relational Database Service (RDS). We do not really need the relational part here, as there is only one table, but it is better to have a fully featured database from the start than to reinvent the wheel later. RDS supports not only MS SQL but also Oracle, MySQL and probably something else.
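
To make the search concrete, here is a hedged sketch of what a lookup against that RDS instance might look like from .NET. The article only says there is a single table of websites and coordinates, so the table name, columns and connection string below are my assumptions:

using System;
using System.Data.SqlClient;

class SiteSearch
{
    // The server is the RDS endpoint; all values here are placeholders
    const string ConnStr =
        "Server=mapdb.example.ap-southeast-1.rds.amazonaws.com;" +
        "Database=InternetMap;User Id=admin;Password=secret;";

    static void Main()
    {
        using (var conn = new SqlConnection(ConnStr))
        using (var cmd = new SqlCommand(
            "SELECT TOP 1 X, Y FROM Sites WHERE Host = @host", conn))
        {
            cmd.Parameters.AddWithValue("@host", "codeproject.com");
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                if (reader.Read())
                    Console.WriteLine("x={0}, y={1}",
                        reader.GetDouble(0), reader.GetDouble(1));
            }
        }
    }
}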

In the picture you can see the original map turn into a table in the RDS database.

Amazon Elastic Beanstalk

Of all Amazon's "cloudy" services, this is probably the one that astonished me most. Elastic Beanstalk lets you deploy a project under load in virtually a single click, with minimal downtime or none at all. Knowing how difficult a release can be, especially when the infrastructure contains several servers and a load balancer, I was flat out amazed at how easily and gracefully Elastic Beanstalk manages it! During the first deployment, it creates the entire environment your application needs: a load balancer (Elastic Load Balancer - ELB) and computing units (Elastic Compute Cloud - EC2), and it also sets the scaling parameters. Roughly speaking, if you have a single server and all requests go straight to it, then beyond a certain load your server will fail to cope and will most probably crash. Sometimes it will not even be able to get back up under a load it used to handle perfectly well, since returning to normal operation takes time and becomes next to impossible under a constant stream of requests. Those who have been there and done that will know what I mean.

You can rely on Elastic Beanstalk for all infrastructure matters. In fact, you can just install a plug-in for MS Visual Studio and forget about the details: it controls and updates versions, deploys, and so on, by itself. And if the load increases, it creates the necessary number of EC2 instances.
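
The scaling parameters can also be adjusted programmatically rather than through the console. Below is a hedged sketch using the AWS SDK for .NET to raise the Auto Scaling limits of a Beanstalk environment; the environment name and threshold values are my assumptions, while the option namespaces are standard Elastic Beanstalk ones:

using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.ElasticBeanstalk;
using Amazon.ElasticBeanstalk.Model;

class ScaleUp
{
    static async Task Main()
    {
        var eb = new AmazonElasticBeanstalkClient();
        await eb.UpdateEnvironmentAsync(new UpdateEnvironmentRequest
        {
            EnvironmentName = "internet-map-prod",   // hypothetical environment name
            OptionSettings = new List<ConfigurationOptionSetting>
            {
                // Let the fleet grow up to 10 instances, as it did on launch day
                new ConfigurationOptionSetting { Namespace = "aws:autoscaling:asg", OptionName = "MinSize", Value = "1" },
                new ConfigurationOptionSetting { Namespace = "aws:autoscaling:asg", OptionName = "MaxSize", Value = "10" },
                // Add instances when average CPU stays above 70%
                new ConfigurationOptionSetting { Namespace = "aws:autoscaling:trigger", OptionName = "MeasureName", Value = "CPUUtilization" },
                new ConfigurationOptionSetting { Namespace = "aws:autoscaling:trigger", OptionName = "Unit", Value = "Percent" },
                new ConfigurationOptionSetting { Namespace = "aws:autoscaling:trigger", OptionName = "UpperThreshold", Value = "70" }
            }
        });
    }
}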

On the diagram, Elastic Beanstalk is marked with a dotted line; inside it you can see the ELB accepting incoming requests and passing them to IIS instances running on EC2.

Performance and its cost

The Internet map went public on 24 July, when I published an article about it on Habrahabr.ru, the biggest Russian IT resource.

Shortly after the article was published, The Internet map experienced a flood of visitors. As the graph shows, traffic rose dramatically: within the first 6 hours the site was visited by 30,000 people, and the number almost reached 50,000 by the end of the day, mostly from Russia and other former USSR countries. Sensing trouble, Elastic Beanstalk created 10 EC2 instances, and they handled the load well: no complaints about accessing the website were received, and the Map could be browsed freely. As for RDS, it failed almost immediately: first the search became really slow, then erratic, and eventually it stopped working altogether. The bill for the first day came to around 200 USD: approximately 100 USD for S3 + CloudFront, with EC2 and RDS costing about 50 USD each.

Taking the lesson to heart, I optimized the code and readjusted the auto scaling parameters, and it worked: during the following week the website was visited by 30 to 50 thousand people a day, on average, from various countries, and nothing went wrong. I must admit, though, that a sudden traffic spike like the one on the first day did not happen again either.

Then somebody posted the Map on reddit.com, which caused an explosive surge of traffic. On Sunday the site was visited by around half a million people, and it coped well with only a Small EC2 instance and a Small RDS instance running. There was one complaint that the Map was not loading, but I believe that is acceptable for a load like that.

When I saw all this traffic, I expected a huge bill from Amazon and put up a donation pop-up. Traffic has since returned to normal, and donations are nearly enough to foot the bill.

This is the bill for the first week (~1 000 000 unique visitors):

Conclusion

I got into information technology back when the word "cloud" had nothing to do with IT. Things have changed since then, and standalone servers are on their way out. Undoubtedly, hosting in a cloud has its downsides (ask Instagram, for example). However, in my opinion, the ability to delegate most of your worries to a cloud service more than makes up for the risks. If you are at the start of a project and quality, affordability, reliability and scalability are your priorities, you should probably head straight for the cloud.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)