A Technique to Publish Texts That Can't Be Crawled

16 Apr 2015 · CPOL · 3 min read
Presenting TuringFonts, a way to publish text that is uncrawlable and uncopyable

Introduction

Publishing sensitive information (such as e-mail addresses, telephone numbers, or personal details) on public websites has always been complicated, either because search engines normally index everything they find (the “nofollow” and “noindex” instructions are advisory, and not every crawler honors them) or because there are always hackers willing to build web crawlers whose sole objective is to collect the exact information that we want to protect.

Thinking about this problem (and after watching The Imitation Game), I came up with a simple solution to make a text unintelligible to computers but readable by humans.

Basic Principle

When developing this solution, I considered two facts:

  • Search engines and web crawlers usually care only about the HTML code of a page. Normally, they do not consider colors, sizes or fonts. So, if we put white text on a white background, the text is still going to be indexed or crawled even though it is illegible.
  • When writing texts (in websites or word processors), we can use any font that we want, and fonts are very flexible. For example, there are symbol fonts that are illegible on purpose, because they draw icons or symbols instead of letters and numbers.
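The first point can be illustrated with a minimal sketch of how a naive crawler might read a page. It uses Python's standard `html.parser` module; the `TextExtractor` class and the `extract_text` helper are hypothetical names chosen for this illustration, and real crawlers are considerably more elaborate.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores every attribute, including style."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Only the characters in the markup are seen; colors and fonts are not.
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text a markup-only crawler would collect from the page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# White text on a white background is invisible to people but not to the parser.
page = '<p style="color: white; background: white">secret@example.com</p>'
print(extract_text(page))  # prints: secret@example.com
```

Whatever font or color the `style` attribute names, the extracted string is the one written in the markup.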

And the idea is very simple:

  1. First, we encode the text that we want to protect using some simple substitution cipher, such as ROT13.

    So, for example, if we want to publish our e-mail address, instead of writing johndoe@awesome.com, we are going to write wbuaqbr@njrfbzr.pbz. In that way, our e-mail address is going to be protected from search engines and web crawlers, because they are going to incorrectly think that wbuaqbr@njrfbzr.pbz is our address.

  2. And secondly, we apply to that encoded text a special font, whose glyphs have been rearranged so that they reverse the substitution made when encoding the text.

    So, in our previous example, we should use a font that draws a 'j' when encountering a 'w', an 'o' when encountering a 'b', an 'h' when encountering a 'u', etc. In that way, the e-mail address is going to be clearly readable by humans, but unintelligible to computers, since they do not take into account the font used.

    Note that this technique is not limited to webpages; it can also be used in PDF files (since PDF files can embed the fonts used in them). It also makes the text uncopyable (at least via the clipboard), because copying yields the encoded characters rather than the displayed ones.
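The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `rot13_map`, `encode` and `glyph_map` are hypothetical helper names, not part of TuringFonts, and the glyph table merely describes what a decoding font would have to draw for each encoded character.

```python
import string


def rot13_map():
    """Build the ROT13 substitution table for ASCII letters."""
    table = {}
    for alphabet in (string.ascii_lowercase, string.ascii_uppercase):
        for i, ch in enumerate(alphabet):
            table[ch] = alphabet[(i + 13) % 26]
    return table


def encode(text, cipher):
    """Substitute every character found in the cipher table; keep the rest."""
    return "".join(cipher.get(ch, ch) for ch in text)


def glyph_map(cipher):
    """Invert the cipher: encoded character -> glyph the font must draw."""
    return {encoded: plain for plain, encoded in cipher.items()}


cipher = rot13_map()
encoded = encode("johndoe@awesome.com", cipher)
glyphs = glyph_map(cipher)

print(encoded)                     # prints: wbuaqbr@njrfbzr.pbz
print(glyphs["w"], glyphs["b"])    # prints: j o
```

ROT13 happens to be its own inverse, so the glyph table equals the cipher table. For any other substitution cipher the two differ, and the font has to follow the inverted table.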

Using the Code

To simplify the use of this technique, I created a project on GitHub called TuringFonts. It provides an online encoder that encodes your text with a simple substitution cipher, along with fonts that you can download and use to 'decode' the encoded text.

So let's say that you want to encode your text using ROT13 and that you are going to publish it in your site.

First, declare the font that will decode the text in your CSS file.

CSS
@font-face {
    font-family: 'arial_rot13';
    src: url('fonts/arial/arial_rot13.eot');
    src: url('fonts/arial/arial_rot13.eot?#iefix') format('embedded-opentype'),
        url('fonts/arial/arial_rot13.woff2') format('woff2'),
        url('fonts/arial/arial_rot13.woff') format('woff'),
        url('fonts/arial/arial_rot13.ttf') format('truetype'),
        url('fonts/arial/arial_rot13.svg#arialregular') format('svg');
    font-weight: normal;
    font-style: normal;
}

Then you apply this font to your encoded text.

HTML
<h2>Encoded text (illegible for both humans and computers)</h2>
<p style="font-family: Arial">Fbzr frafvgvir grkg gung bayl uhznaf zhfg or noyr gb ernq</p>
<h2>Decoded text (illegible only for computers, readable for humans)</h2>
<p style="font-family: arial_rot13">Fbzr frafvgvir grkg gung bayl uhznaf zhfg or noyr gb ernq</p>
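If the markup is generated rather than written by hand, the encoding step can be automated. The following is a minimal sketch, assuming the `arial_rot13` font family declared above; `protected_paragraph` is a hypothetical helper name, and the ROT13 step relies on the `rot_13` codec from Python's standard library.

```python
import codecs
import html


def protected_paragraph(plaintext, font_family="arial_rot13"):
    """Return a <p> element with ROT13-encoded text styled with the decoding font."""
    # ROT13 only remaps the letters A-Z and a-z; everything else is unchanged.
    encoded = codecs.encode(plaintext, "rot_13")
    return '<p style="font-family: {}">{}</p>'.format(
        html.escape(font_family, quote=True),
        html.escape(encoded),
    )


print(protected_paragraph("johndoe@awesome.com"))
# prints: <p style="font-family: arial_rot13">wbuaqbr@njrfbzr.pbz</p>
```

Because ROT13 leaves digits and punctuation untouched, a telephone number would pass through this sketch unprotected. Protecting it would need a cipher, and a matching font, that also remaps digits.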

And that's all. It's simple, easy, and effective.

You can see this example working at JSFiddle.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)