Introduction
Let's look at the Wikipedia definition of TF-IDF:
«tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.
One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.»
Background
In this tip, we will use TF-IDF to rank the most important words (tokens) of each document in a document collection. The code is intentionally simple; for a large number of documents, you would want a more efficient implementation.
Let's See the Data Source
To explain this code, I have used a document collection about diseases (what the disease is, its causes, etc.). This data is stored in a MongoDB database. The MongoDB collection is named Documents. Each document stores the disease name (Title) and a list of sections, each of which stores the section title (title) and the main text (text). In this example, I will use the text of every section. The collection has approximately 120 documents, which is not a lot, but is enough to test the example.
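The Document and Section classes are not listed in this tip; a minimal sketch of how they could be mapped, based only on the properties used by the code below (Title, Sections, title and text), might look like this. The BSON attributes and class layout are assumptions.

using System.Collections.Generic;
using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes;

// Hypothetical shape of the stored documents; only Title, Sections,
// title and text are actually referenced by the code in this tip.
[BsonIgnoreExtraElements]
public class Document
{
    [BsonElement("Title")]
    public string Title { get; set; }           // disease name

    [BsonElement("Sections")]
    public List<Section> Sections { get; set; } // sections of the document
}

[BsonIgnoreExtraElements]
public class Section
{
    [BsonElement("title")]
    public string Title { get; set; }           // section title

    [BsonElement("text")]
    public string Text { get; set; }            // main text used for TF-IDF
}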
Using the Code
Each token has this structure in our code:
using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes;

public class Token
{
    // Title of the document the token belongs to.
    [BsonRepresentation(BsonType.String)]
    [BsonRequired]
    [BsonElement("Document")]
    public string Document { get; set; }

    // The token (word) itself.
    [BsonRepresentation(BsonType.String)]
    [BsonRequired]
    [BsonElement("Token")]
    public string Word { get; set; }

    // Number of times the token appears in the document.
    [BsonRepresentation(BsonType.Int32)]
    [BsonRequired]
    [BsonElement("Count")]
    public int Count { get; set; }

    // Count of the most frequent token in the same document (used to normalize TF).
    [BsonRepresentation(BsonType.Int32)]
    [BsonRequired]
    [BsonElement("Max")]
    public int Max { get; set; }

    // Normalized term frequency: Count / Max.
    [BsonRepresentation(BsonType.Double)]
    [BsonRequired]
    [BsonElement("TF")]
    public double TF { get; set; }

    // Number of documents in the collection.
    [BsonRepresentation(BsonType.Int32)]
    [BsonRequired]
    [BsonElement("DN")]
    public int DN { get; set; }

    // Number of documents in which the token appears.
    [BsonRepresentation(BsonType.Int32)]
    [BsonRequired]
    [BsonElement("CN")]
    public int CN { get; set; }

    // Inverse document frequency: log10(DN / CN).
    [BsonRepresentation(BsonType.Double)]
    [BsonRequired]
    [BsonElement("IDF")]
    public double IDF { get; set; }

    // TF * IDF.
    [BsonRepresentation(BsonType.Double)]
    [BsonRequired]
    [BsonElement("TFIDF")]
    public double TFIDF { get; set; }
}
The class that does the magic is shown below:
using System;
using System.Collections.Generic;
using System.Linq;

public class Parser
{
    public List<Document> Documents { get; set; }

    public Parser()
    {
        this.Initialize();
    }

    private void Initialize()
    {
        // Load every document of the collection into memory.
        DocumentProvider documentProvider = new DocumentProvider();
        this.Documents = documentProvider.GetAll();
    }

    internal void Execute()
    {
        List<Token> tokenList = new List<Token>();

        // First step: count how many times each token appears in each document.
        foreach (Document doc in this.Documents)
        {
            foreach (Section secc in doc.Sections)
            {
                List<string> tokens = Generics.ExtractTokens(secc.Text);
                foreach (string word in tokens)
                {
                    if (tokenList.Any(x => x.Document == doc.Title && x.Word == word))
                    {
                        tokenList.First(x => x.Document == doc.Title && x.Word == word).Count++;
                    }
                    else
                    {
                        tokenList.Add(new Token()
                        {
                            Document = doc.Title,
                            Word = word,
                            Count = 1
                        });
                    }
                }
            }
        }

        // Second step: compute TF, IDF and TF-IDF for every token.
        foreach (Token item in tokenList)
        {
            if (item.Max == 0)
            {
                // Highest token count inside the same document, used to normalize TF.
                item.Max = (from elto in tokenList
                            where elto.Document == item.Document
                            select elto.Count).Max();
                item.TF = Convert.ToDouble(item.Count) / Convert.ToDouble(item.Max);
            }
            // DN: number of documents in the collection.
            item.DN = tokenList.GroupBy(x => x.Document).Count();
            // CN: number of documents in which the token appears.
            item.CN = tokenList.Where(x => x.Word == item.Word).Count();
            // Cast to double to avoid integer division before taking the logarithm.
            item.IDF = Math.Log10(Convert.ToDouble(item.DN) / Convert.ToDouble(item.CN));
            item.TFIDF = item.TF * item.IDF;
        }

        // Finally, store the results in the database.
        TokenProvider tokenProvider = new TokenProvider();
        tokenList.ForEach(item => tokenProvider.Insert(item));
    }
}
First, we extract all the documents and store them in a list. The Execute method does the processing.
The first step is to create and fill the list of tokens. To do this, we extract all the tokens of each section of every document and either add a new entry to the token list or increment the count of an existing one.
foreach (Document doc in this.Documents)
{
    foreach (Section secc in doc.Sections)
    {
        List<string> tokens = Generics.ExtractTokens(secc.Text);
        foreach (string word in tokens)
        {
            if (tokenList.Any(x => x.Document == doc.Title && x.Word == word))
            {
                tokenList.First(x => x.Document == doc.Title && x.Word == word).Count++;
            }
            else
            {
                tokenList.Add(new Token()
                {
                    Document = doc.Title,
                    Word = word,
                    Count = 1
                });
            }
        }
    }
}
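The Generics.ExtractTokens helper is not shown in this tip; a minimal sketch, assuming a simple tokenizer that lower-cases the text and splits on non-letter characters, could be:

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class Generics
{
    // Hypothetical tokenizer: lower-cases the text and splits on
    // anything that is not a letter. The real helper may differ.
    public static List<string> ExtractTokens(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
        {
            return new List<string>();
        }

        return Regex.Split(text.ToLowerInvariant(), @"[^\p{L}]+")
                    .Where(t => t.Length > 0)
                    .ToList();
    }
}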
The second step is to process each token and fill in the remaining properties. We fill all of them so the data can also be used in other studies.
To compute the term frequency, we normalize the token count within the document. This technique avoids a bias towards long documents. To do this, we divide the token count by the count of the most frequent token in the same document:
       Count
TF = ---------
        Max
We obtain the inverse document frequency of the token by dividing the number of documents in the collection by the number of documents in which the token appears, and taking the logarithm of that quotient:
            DN
IDF = log ------
            CN
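As an illustrative example (the token counts are made up; only the roughly 120 documents come from our collection): suppose a token appears 3 times in a document whose most frequent token appears 10 times, and the token occurs in 12 of the 120 documents. Then:

TF    = 3 / 10           = 0.3
IDF   = log (120 / 12)   = 1
TFIDF = 0.3 * 1          = 0.3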
All of this is done in a second pass over the token list:
foreach (Token item in tokenList)
{
    if (item.Max == 0)
    {
        // Highest token count inside the same document, used to normalize TF.
        item.Max = (from elto in tokenList
                    where elto.Document == item.Document
                    select elto.Count).Max();
        item.TF = Convert.ToDouble(item.Count) / Convert.ToDouble(item.Max);
    }
    // DN: number of documents in the collection.
    item.DN = tokenList.GroupBy(x => x.Document).Count();
    // CN: number of documents in which the token appears.
    item.CN = tokenList.Where(x => x.Word == item.Word).Count();
    // Cast to double to avoid integer division before taking the logarithm.
    item.IDF = Math.Log10(Convert.ToDouble(item.DN) / Convert.ToDouble(item.CN));
    item.TFIDF = item.TF * item.IDF;
}
Finally, we store these results in the database.
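The TokenProvider class is not listed in this tip; a minimal sketch using the official MongoDB C# driver, with the connection string, database and collection names as assumptions, could be:

using MongoDB.Driver;

public class TokenProvider
{
    private readonly IMongoCollection<Token> collection;

    public TokenProvider()
    {
        // Connection string, database and collection names are assumptions.
        MongoClient client = new MongoClient("mongodb://localhost:27017");
        IMongoDatabase database = client.GetDatabase("Diseases");
        this.collection = database.GetCollection<Token>("Tokens");
    }

    public void Insert(Token token)
    {
        // Store one computed token with its TF, IDF and TF-IDF values.
        this.collection.InsertOne(token);
    }
}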
Points of Interest
Once the process has ended, we have a ranking of tokens for each document. We can see which tokens are the most descriptive for a document and which are less so. As you can see, we could use a stop-word list to remove common words, or apply stemming to work only with word stems.
We can use a threshold value to select the top tokens of a document and use them as its tags. This threshold could be calculated using machine learning. Most interestingly, we could extract the most descriptive attributes of a set of products for an online shop and use them in a recommender system, or extract stop-words for mail filtering.
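As a minimal sketch of the tagging idea, assuming a fixed cutoff of 5 tokens per document instead of a learned threshold, the top tokens could be selected like this:

using System.Collections.Generic;
using System.Linq;

public static class TagExtractor
{
    // Returns, for each document, its N highest-ranked tokens by TF-IDF.
    // N = 5 is an arbitrary illustrative cutoff.
    public static Dictionary<string, List<string>> TopTokens(List<Token> tokens, int n = 5)
    {
        return tokens
            .GroupBy(t => t.Document)
            .ToDictionary(
                g => g.Key,
                g => g.OrderByDescending(t => t.TFIDF)
                      .Take(n)
                      .Select(t => t.Word)
                      .ToList());
    }
}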
Original Post (Spanish Language)