Introduction
Let's look at the Wikipedia definition of TF-IDF:
«tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.
One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.»
Background
In this tip, we will use TF-IDF to rank the most important words (tokens) of each document in a document collection. The code is intentionally simple; for a large number of documents, you would want a more efficient implementation.
Let's See the Data Source
To explain this code, I have used a document collection about diseases (what the disease is, its causes, etc.). This data is stored in a MongoDB database. The MongoDB collection is named Documents. Each document stores the disease name (Title) and a list of sections, each of which stores the section title (title) and the main text (text). In this example, I will use the text of every section. The collection has approximately 120 documents, which is not a lot, but is enough to test the example.
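The Document and Section classes are not listed in this tip; a minimal sketch of how they could be mapped, based only on the properties used by the code below (Title, Sections, title and text), might look like this. The BSON attributes and class layout are assumptions.

using System.Collections.Generic;
using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes;

// Hypothetical shape of the stored documents; only Title, Sections,
// title and text are actually referenced by the code in this tip.
[BsonIgnoreExtraElements]
public class Document
{
    [BsonElement("Title")]
    public string Title { get; set; }           // disease name

    [BsonElement("Sections")]
    public List<Section> Sections { get; set; } // sections of the document
}

[BsonIgnoreExtraElements]
public class Section
{
    [BsonElement("title")]
    public string Title { get; set; }           // section title

    [BsonElement("text")]
    public string Text { get; set; }            // main text used for TF-IDF
}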
Using the Code
Each token has this structure in our code:
using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes;

public class Token
{
    // Title of the document the token belongs to.
    [BsonRepresentation(BsonType.String)]
    [BsonRequired]
    [BsonElement("Document")]
    public string Document { get; set; }

    // The token (word) itself.
    [BsonRepresentation(BsonType.String)]
    [BsonRequired]
    [BsonElement("Token")]
    public string Word { get; set; }

    // Number of times the token appears in the document.
    [BsonRepresentation(BsonType.Int32)]
    [BsonRequired]
    [BsonElement("Count")]
    public int Count { get; set; }

    // Count of the most frequent token in the same document (used to normalize TF).
    [BsonRepresentation(BsonType.Int32)]
    [BsonRequired]
    [BsonElement("Max")]
    public int Max { get; set; }

    // Normalized term frequency: Count / Max.
    [BsonRepresentation(BsonType.Double)]
    [BsonRequired]
    [BsonElement("TF")]
    public double TF { get; set; }

    // Number of documents in the collection.
    [BsonRepresentation(BsonType.Int32)]
    [BsonRequired]
    [BsonElement("DN")]
    public int DN { get; set; }

    // Number of documents in which the token appears.
    [BsonRepresentation(BsonType.Int32)]
    [BsonRequired]
    [BsonElement("CN")]
    public int CN { get; set; }

    // Inverse document frequency: log10(DN / CN).
    [BsonRepresentation(BsonType.Double)]
    [BsonRequired]
    [BsonElement("IDF")]
    public double IDF { get; set; }

    // TF * IDF.
    [BsonRepresentation(BsonType.Double)]
    [BsonRequired]
    [BsonElement("TFIDF")]
    public double TFIDF { get; set; }
}
The class that does the magic is shown below:
using System;
using System.Collections.Generic;
using System.Linq;

public class Parser
{
    public List<Document> Documents { get; set; }

    public Parser()
    {
        this.Initialize();
    }

    private void Initialize()
    {
        // Load every document of the collection into memory.
        DocumentProvider documentProvider = new DocumentProvider();
        this.Documents = documentProvider.GetAll();
    }

    internal void Execute()
    {
        List<Token> tokenList = new List<Token>();

        // First step: count how many times each token appears in each document.
        foreach (Document doc in this.Documents)
        {
            foreach (Section secc in doc.Sections)
            {
                List<string> tokens = Generics.ExtractTokens(secc.Text);
                foreach (string word in tokens)
                {
                    if (tokenList.Any(x => x.Document == doc.Title && x.Word == word))
                    {
                        tokenList.First(x => x.Document == doc.Title && x.Word == word).Count++;
                    }
                    else
                    {
                        tokenList.Add(new Token()
                        {
                            Document = doc.Title,
                            Word = word,
                            Count = 1
                        });
                    }
                }
            }
        }

        // Second step: compute TF, IDF and TF-IDF for every token.
        foreach (Token item in tokenList)
        {
            if (item.Max == 0)
            {
                // Highest token count inside the same document, used to normalize TF.
                item.Max = (from elto in tokenList
                            where elto.Document == item.Document
                            select elto.Count).Max();
                item.TF = Convert.ToDouble(item.Count) / Convert.ToDouble(item.Max);
            }
            // DN: number of documents in the collection.
            item.DN = tokenList.GroupBy(x => x.Document).Count();
            // CN: number of documents in which the token appears.
            item.CN = tokenList.Where(x => x.Word == item.Word).Count();
            // Cast to double to avoid integer division before taking the logarithm.
            item.IDF = Math.Log10(Convert.ToDouble(item.DN) / Convert.ToDouble(item.CN));
            item.TFIDF = item.TF * item.IDF;
        }

        // Finally, store the results in the database.
        TokenProvider tokenProvider = new TokenProvider();
        tokenList.ForEach(item => tokenProvider.Insert(item));
    }
}
First, we extract all the documents and store them in a list. The Execute method does the processing.
The first step is to create and fill the list of tokens. To do this, we extract all the tokens of each section of every document and either add a new entry to the token list or increment the count of an existing one.
foreach (Document doc in this.Documents)
{
    foreach (Section secc in doc.Sections)
    {
        List<string> tokens = Generics.ExtractTokens(secc.Text);
        foreach (string word in tokens)
        {
            if (tokenList.Any(x => x.Document == doc.Title && x.Word == word))
            {
                tokenList.First(x => x.Document == doc.Title && x.Word == word).Count++;
            }
            else
            {
                tokenList.Add(new Token()
                {
                    Document = doc.Title,
                    Word = word,
                    Count = 1
                });
            }
        }
    }
}
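The Generics.ExtractTokens helper is not shown in this tip; a minimal sketch, assuming a simple tokenizer that lower-cases the text and splits on non-letter characters, could be:

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class Generics
{
    // Hypothetical tokenizer: lower-cases the text and splits on
    // anything that is not a letter. The real helper may differ.
    public static List<string> ExtractTokens(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
        {
            return new List<string>();
        }

        return Regex.Split(text.ToLowerInvariant(), @"[^\p{L}]+")
                    .Where(t => t.Length > 0)
                    .ToList();
    }
}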
The second step is to process each token and fill in the remaining properties. We fill all of them so the data can also be used in other studies.
To compute the term frequency, we normalize the token count within the document. This technique avoids a bias towards long documents. To do this, we divide the token count by the count of the most frequent token in the same document:
       Count
TF = ---------
        Max
We obtain the inverse document frequency of the token by dividing the number of documents in the collection by the number of documents in which the token appears, and taking the logarithm of that quotient:
            DN
IDF = log ------
            CN
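As an illustrative example (the token counts are made up; only the roughly 120 documents come from our collection): suppose a token appears 3 times in a document whose most frequent token appears 10 times, and the token occurs in 12 of the 120 documents. Then:

TF    = 3 / 10           = 0.3
IDF   = log (120 / 12)   = 1
TFIDF = 0.3 * 1          = 0.3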
All of this is done in a second pass over the token list:
foreach (Token item in tokenList)
{
    if (item.Max == 0)
    {
        // Highest token count inside the same document, used to normalize TF.
        item.Max = (from elto in tokenList
                    where elto.Document == item.Document
                    select elto.Count).Max();
        item.TF = Convert.ToDouble(item.Count) / Convert.ToDouble(item.Max);
    }
    // DN: number of documents in the collection.
    item.DN = tokenList.GroupBy(x => x.Document).Count();
    // CN: number of documents in which the token appears.
    item.CN = tokenList.Where(x => x.Word == item.Word).Count();
    // Cast to double to avoid integer division before taking the logarithm.
    item.IDF = Math.Log10(Convert.ToDouble(item.DN) / Convert.ToDouble(item.CN));
    item.TFIDF = item.TF * item.IDF;
}
Finally, we store these results in the database.
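The TokenProvider class is not listed in this tip; a minimal sketch using the official MongoDB C# driver, with the connection string, database and collection names as assumptions, could be:

using MongoDB.Driver;

public class TokenProvider
{
    private readonly IMongoCollection<Token> collection;

    public TokenProvider()
    {
        // Connection string, database and collection names are assumptions.
        MongoClient client = new MongoClient("mongodb://localhost:27017");
        IMongoDatabase database = client.GetDatabase("Diseases");
        this.collection = database.GetCollection<Token>("Tokens");
    }

    public void Insert(Token token)
    {
        // Store one computed token with its TF, IDF and TF-IDF values.
        this.collection.InsertOne(token);
    }
}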
Points of Interest
Once the process has ended, we have a ranking of tokens for each document. We can see which tokens are the most descriptive for a document and which are less so. As you can see, we could use a stop-word list to remove common words, or apply stemming to work only with word stems.
We can use a threshold value to select the top tokens of a document and use them as its tags. This threshold could be calculated using machine learning. Most interestingly, we could extract the most descriptive attributes of a set of products for an online shop and use them in a recommender system, or extract stop-words for mail filtering.
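As a minimal sketch of the tagging idea, assuming a fixed cutoff of 5 tokens per document instead of a learned threshold, the top tokens could be selected like this:

using System.Collections.Generic;
using System.Linq;

public static class TagExtractor
{
    // Returns, for each document, its N highest-ranked tokens by TF-IDF.
    // N = 5 is an arbitrary illustrative cutoff.
    public static Dictionary<string, List<string>> TopTokens(List<Token> tokens, int n = 5)
    {
        return tokens
            .GroupBy(t => t.Document)
            .ToDictionary(
                g => g.Key,
                g => g.OrderByDescending(t => t.TFIDF)
                      .Take(n)
                      .Select(t => t.Word)
                      .ToList());
    }
}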
Original Post (Spanish Language)