In this series, we build a chatbot using the retrieval method, which is both simple and rewarding, as it produces really good results. In this tutorial, we discuss how to use document retrieval to make Lina chat with users.
Series Introduction
This is a series of multiple tuts about natural language processing and machine learning, in which we manually create a chatbot that is able to:
- chat with users
- answer their questions
- extract data about users
- perform some actions (recommending a movie, telling the time, ...)
This series' goal is not just to create a chatbot; it is about walking through one practical project that can be achieved using NLP & ML. My primary goal is to present both ML & NLP in an interesting way, which is to create a chatbot.
In this series, we cover several techniques, both simple and advanced, to reach our final goal:
- Document retrieval using TF-IDF
- Document retrieval using Word2Vec
- CKY parsing
- Web scraping
I hope that, through this fun project, you come to love NLP & ML.
We will call our chatbot Lina :D
Tut Introduction
There are two primary methods to build a chatbot: a generative method and a retrieval method. For this series, we work with the retrieval method, as it is both simple and rewarding, and it outputs really good results.
In this tut, we discuss how to use document retrieval to make Lina chat with users; in coming tuts, we will discuss how to make her more interactive.
To build using the retrieval method, there are some general steps:
- Define some sentences and their responses (the data gathering step). I have attached a simple sentence/response dataset from a repo available online called rDanny.
- Teach your chatbot this dataset using TF-IDF (term frequency - inverse document frequency).
- Compare the input sentence with the learnt sentences using cosine similarity.
The attached zip contains both the code and the dataset.
Agenda
In this tut, we would:
- (Optional) Begin by briefly discussing the concept of TF-IDF (term frequency - inverse document frequency)
- (Optional) Then discuss what cosine similarity is and why we use it
- Then we come to the interesting part, which is practically building Lina (our chatbot :D). (You can skip the technical background and get straight to creating your chatbot.)
Most of the technicalities come from Dan Jurafsky & Chris Manning's YouTube playlist.
The dataset used comes from rDanny.
So let's begin!
TF-IDF stands for term frequency - inverse document frequency. It is a simple method for scoring documents according to how similar they look to the user's query.
It is built over two main concepts:
- How many times this word (in the query) appeared in each document ==> term frequency
- How unique this word is ==> inverse document frequency
1.a Term Freq
It is a simple count of how many times a specific word appears in a given document; we count all words in all given documents to give scores to the documents.
Example:
In this example, we have multiple documents (here, Shakespeare's plays), and we test for specific word occurrences throughout these documents. When working on term frequency only, we are only interested in word counts:
Now we would measure the score just according to term frequency (word count); in the coming sections, we will refine the way we calculate the score. BUT
Raw term frequency is not what we want as:
- A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
- But not 10 times more relevant.
Relevance does not increase proportionally with term frequency, so we use the log frequency, not just a simple word count:
Remember that tf is just the word count, so whenever tf > 0, we calculate the score as 1 + log10(tf); else, if the word count is zero (tf = 0), the score is zero:

w(t, d) = 1 + log10(tf(t, d))   if tf(t, d) > 0, otherwise 0

So now we can calculate the score.
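As a quick illustration (a minimal sketch, not part of the original article), this log-frequency weighting can be written in a few lines of Python:

import math

def log_tf_weight(term_count):
    # 1 + log10(tf) when the term appears, 0 when it does not
    return 1 + math.log10(term_count) if term_count > 0 else 0

print(log_tf_weight(0))     # 0
print(log_tf_weight(1))     # 1.0
print(log_tf_weight(10))    # 2.0
print(log_tf_weight(1000))  # 4.0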
Term Freq is not enough
But we must also account for how rare a certain word is compared to other words across all documents. So to complete TF-IDF, we must take document frequency into consideration, or to be more precise, its inverse, to capture the rarity of words.
1.b Inverse Document Freq
Here, we measure the rarity of words by calculating how many documents a word appears in. A word is counted once per document, even if it appears multiple times in that document, as our main goal here is counting documents, not words.
Example:
Assume a million documents, and you look at a list of words and see how many documents these words appear in.
df here stands for document frequency: the number of documents a word appears in. A word like "calpurnia" appears in only one document (very rare), so it should get a high score; a small df (document count) gets a big score, so we take the inverse of the document frequency:

idf(t) = log10( N / df(t) ), where N is the total number of documents.
So by calculating this for all the above terms (words), results would be:
Note:
As you have seen here, IDF (inverse document frequency) is independent of the query, as it only depends on the documents, so we don't need to calculate idf for each new query; there is one idf value per term. This can be used for optimization, by calculating idf prior to any query.
So if we want to measure score just according to IDF (inverse document freq):
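As a minimal sketch of this calculation (the document frequencies below are illustrative values assumed for this example, with N = 1,000,000 documents):

import math

N = 1000000  # assume a collection of one million documents

def idf(df, n_docs=N):
    # idf = log10(N / df): the rarer the word (small df), the bigger the weight
    return math.log10(float(n_docs) / df)

# illustrative document frequencies (assumed, in the spirit of the example above)
for term, df in [("calpurnia", 1), ("animal", 100), ("the", 1000000)]:
    print("%s -> idf = %.1f" % (term, idf(df)))
# calpurnia -> idf = 6.0, animal -> idf = 4.0, the -> idf = 0.0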
1.c Combining Both TF & IDF in a Simple Score Calculation
This score calculation needs to:
- Increase with the number of occurrences (term frequency)
- Increase with the rarity of words (inverse document frequency)
We first calculate the weight for a specific term in a specific document:

w(t, d) = (1 + log10(tf(t, d))) * log10( N / df(t) )
Applying this to the previous example of Shakespeare's Play:
Now each document is represented by a real-valued vector of tf-idf weights.
Now, to calculate the score of a specific query versus a specific document:

score(q, d) = sum over t in (q ∩ d) of tf-idf(t, d)

where t ==> term (word), q ==> query, d ==> document.
So you calculate the score for all terms (t) in the query (q) (tf = 0 if the term is not found in the given document), and you sum the scores of all the query terms to get the score of this document.
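Putting the two together, here is a minimal sketch of this score calculation (the term counts and document frequencies below are made up for illustration):

import math

def tf_idf_weight(tf, df, n_docs):
    # w(t, d) = (1 + log10(tf)) * log10(N / df), and 0 when the term is absent
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(float(n_docs) / df)

def score(query_terms, doc_term_counts, df_counts, n_docs):
    # sum the tf-idf weights of every query term over the given document
    return sum(tf_idf_weight(doc_term_counts.get(t, 0), df_counts[t], n_docs)
               for t in query_terms if t in df_counts)

# made-up counts for one document and for the whole collection
doc_term_counts = {"brutus": 4, "caesar": 232}
df_counts = {"brutus": 8, "caesar": 20, "calpurnia": 1}
print(score(["brutus", "calpurnia"], doc_term_counts, df_counts, n_docs=37))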
Note:
Now that we know how to calculate the score of a specific document, to do document retrieval we would need to apply this method to all documents. However, we tend to use the concept of a vector space, not just the simple score formula, to get similarities: we represent both the query and the documents in a vector space, and then use similarity calculations.
1.d Why Vector Space Model ??
In the previous example, we can say that each document now becomes a vector of real values.
Taking Julius Caesar, for example, its vector would be ==> [ 3.18 6.1 2.54 1.54 0 0 0 ]
We can take words (terms) to be the dimensions of our space, so a document becomes a point (or a vector) in this space, and we can think of queries the same way, as points (vectors) in the space.
BUT this space can become very high dimensional (tens of millions of terms); remember, the dimensions depend on the size of your vocabulary, so we need something more efficient than running the previous formula ten million times for our ten thousand documents (assuming our vocabulary is 10 million terms and we have 10 thousand documents).
Here comes the need for methods that use the concept of a vector space. One of them can simply be regarded as a way to calculate the smallest distance between two vectors (a specific document and a query) in a high-dimensional space; this is called the Euclidean distance.
Here, as we see, this is a tiny space of just two words (Gossip & Jealous); we have 3 documents and 1 query. The 3 documents:
- d1 ==> has more to do with gossiping
- d3 ==> has more to do with jealousy
- d2 ==> contains both terms
It is shown here that q (our query) wants a document that contains both Gossip & Jealous (d2), but Euclidean distance won't return d2, as the distance from q (the tip of its vector) to d2 (the tip of d2) is bigger than the distances to d1 and d3.
So we use the concept of measuring angle rather than measuring distance.
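To make this concrete, here is a small sketch comparing the two measures on made-up term-count vectors for the tiny gossip/jealous space above (the numbers are assumptions, not taken from the original figure):

import numpy as np

# hypothetical term-count vectors over the two axes [gossip, jealous]
d1 = np.array([10.0, 1.0])   # mostly about gossip
d2 = np.array([20.0, 20.0])  # a long document containing both terms
d3 = np.array([1.0, 10.0])   # mostly about jealousy
q  = np.array([1.0, 1.0])    # the query mentions both terms

for name, d in [("d1", d1), ("d2", d2), ("d3", d3)]:
    euclid = np.linalg.norm(q - d)
    cos = np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))
    print("%s  euclidean distance = %5.2f   cosine similarity = %.2f" % (name, euclid, cos))

# Euclidean distance favours d1/d3 (they simply lie closer to the short query),
# while cosine similarity correctly ranks d2 (which covers both topics) highest.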
So here comes the use of Cosine Similarity
2.a Why Cosine Similarity?
There are many ways to use the concept of a vector space, but we specifically use cosine similarity because, when using a vector space, we face a problem: documents of different lengths would result in wrong scores (as discussed before). To solve this problem, we must use the concept of length normalization, and this is why we use cosine similarity.
2.b Length Normalization
A vector can be length-normalized by dividing each of its components by its length; for this, we use the L2 norm:

||x||2 = sqrt( sum_i xi^2 )
Now Long and short documents have comparable weights.
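For example, a minimal sketch of length normalization with numpy:

import numpy as np

v = np.array([3.0, 4.0])
v_normalized = v / np.linalg.norm(v)  # divide by the L2 norm: sqrt(3^2 + 4^2) = 5
print(v_normalized)                   # [0.6 0.8], a unit-length vector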
2.c Why Use the Cosine Function?
We use cosine similarity because cosine is a monotonically decreasing function over the interval [0°, 180°], ranging from 1 → -1.
The following two notions are equivalent:
- Rank documents in increasing order of the angle between the query and the document
- Rank documents in decreasing order of cosine(query, document), from the very nature of the cosine function
2.d Cosine Similarity Computation
Now let's really compute the similarity using cosine similarity:
Remember that we calculate TF-IDF for both the query and a specific document, then use the cosine formula to get the similarity, so we primarily need two vectors, q and d:

cos(q, d) = (q · d) / (|q| * |d|) = sum_i(qi * di) / ( sqrt(sum_i qi^2) * sqrt(sum_i di^2) )

where:
- qi is the TF-IDF weight of term i in the query
- di is the TF-IDF weight of term i in the document
BUT
We are working with length-normalized vectors, so the equation can be simplified to just the dot product (or scalar product):

cos(q, d) = q · d = sum_i(qi * di)
Example:
You can visualize it to only be:
Example:
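As a minimal sketch (with made-up weight vectors), once both vectors are length-normalized, the dot product alone gives the same value as the full cosine formula and as sklearn's cosine_similarity:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

q = np.array([0.0, 2.3, 1.5])  # made-up tf-idf weights of the query
d = np.array([1.2, 0.0, 3.1])  # made-up tf-idf weights of a document

# full cosine formula: dot product divided by the product of the lengths
cos_full = np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))

# with length-normalized vectors, the dot product alone gives the same value
q_n = q / np.linalg.norm(q)
d_n = d / np.linalg.norm(d)
cos_dot = np.dot(q_n, d_n)

print("%.4f %.4f %.4f" % (cos_full, cos_dot, cosine_similarity([q], [d])[0][0]))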
All of the above explanation was just to get to know the concept behind TF-IDF; there are already many free tools that implement it. The art is not in creating TF-IDF again, but in putting it to good use.
For our chatbot, we need a dataset in the form of sentences and their responses.
Our general steps would be:
- Each sentence is assumed to be a document on its own
- When a user inputs his chat, we consider it to be our query
- Then we use TF-IDF and cosine similarity to compare the query with the dataset, get the sentence that is most similar to the query, and output its response (see the minimal sketch below)
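Here is a minimal sketch of that retrieval loop on a tiny made-up dataset (the real code below reads the sentence/response pairs from the rDanny CSV file instead):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# tiny made-up dataset: each sentence is treated as a document of its own
sentences = ["hi how are you", "what is your name", "tell me a joke"]
responses = ["I am fine, thanks!", "My name is Lina.", "Knock knock..."]

vectorizer = TfidfVectorizer()
tfidf_train = vectorizer.fit_transform(sentences)  # learn vocabulary + idf from the dataset

query = "hey how are you doing"                    # the user's chat is our query
tfidf_query = vectorizer.transform([query])        # project it into the same space

similarities = cosine_similarity(tfidf_query, tfidf_train)[0]
best = similarities.argmax()                       # index of the most similar sentence
print(responses[best])                             # -> "I am fine, thanks!"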
So Let's Start Coding!
First, we would need to import:
- We primarily use the sklearn & numpy libraries (the TF-IDF libraries)
- We also use some other helper libraries; their function will appear later
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pickle
import csv
import timeit
import random
Now let's create a function that takes a string as input and outputs the response; our whole work will be inside this function.
Let's define the path of the CSV file; we put it inside a folder named DataSet.
def talk_to_lina_primary(test_set_sentence):
    csv_file_path = "DataSet/randychat.csv"
To work with sklearn, we must wrap the chat string in an iterable (here, a tuple), and we create an array called sentences to load the CSV file into.
def talk_to_lina_primary(test_set_sentence):
    csv_file_path = "DataSet/randychat.csv"

    i = 0
    sentences = []
    test_set = (test_set_sentence, "")
We have two sections in our code: one for training, and one for running. We use pickle to save our learnt objects in temp files, which we keep in the DataSet folder. So we use a try/except clause for this, and put the running phase in the try section so that, when no temp files are present, we go to the except section (the training phase).
def talk_to_lina_primary(test_set_sentence):
    csv_file_path = "DataSet/randychat.csv"
    tfidf_vectorizer_pikle_path = "DataSet/tfidf_vectorizer.pickle"
    tfidf_matrix_train_pikle_path = "DataSet/tfidf_matrix_train.pickle"

    i = 0
    sentences = []
    test_set = (test_set_sentence, "")

    sentences.append(" No you.")
    sentences.append(" No you.")

    try:
        pass  # running phase: filled in below
    except:
        pass  # training phase: filled in below
For training, we notice that:
- Identifying the dimensions of the space is independent of the query
- Learning the TF-IDF of the dataset is also independent of the query
So we can learn them once and then save them to two temp files:
def talk_to_lina_primary(test_set_sentence):
    csv_file_path = "DataSet/randychat.csv"
    tfidf_vectorizer_pikle_path = "DataSet/tfidf_vectorizer.pickle"
    tfidf_matrix_train_pikle_path = "DataSet/tfidf_matrix_train.pickle"

    i = 0
    sentences = []
    test_set = (test_set_sentence, "")

    sentences.append(" No you.")
    sentences.append(" No you.")

    try:
        pass  # running phase: filled in below
    except:
        # training phase: learn TF-IDF from the dataset and save it to temp files
        start = timeit.default_timer()

        with open(csv_file_path, "r") as sentences_file:
            reader = csv.reader(sentences_file, delimiter=',')
            for row in reader:
                sentences.append(row[0])
                i += 1

        tfidf_vectorizer = TfidfVectorizer()
        tfidf_matrix_train = tfidf_vectorizer.fit_transform(sentences)

        stop = timeit.default_timer()
        print("training time took was: ")
        print(stop - start)

        f = open(tfidf_vectorizer_pikle_path, 'wb')
        pickle.dump(tfidf_vectorizer, f)
        f.close()

        f = open(tfidf_matrix_train_pikle_path, 'wb')
        pickle.dump(tfidf_matrix_train, f)
        f.close()
Then we work on the running section (using what was learnt):
def talk_to_lina_primary(test_set_sentence):
    csv_file_path = "DataSet/randychat.csv"
    tfidf_vectorizer_pikle_path = "DataSet/tfidf_vectorizer.pickle"
    tfidf_matrix_train_pikle_path = "DataSet/tfidf_matrix_train.pickle"

    i = 0
    sentences = []
    test_set = (test_set_sentence, "")

    sentences.append(" No you.")
    sentences.append(" No you.")

    try:
        # running phase: load the previously learnt objects from the temp files
        f = open(tfidf_vectorizer_pikle_path, 'rb')
        tfidf_vectorizer = pickle.load(f)
        f.close()

        f = open(tfidf_matrix_train_pikle_path, 'rb')
        tfidf_matrix_train = pickle.load(f)
        f.close()
    except:
        # training phase: learn TF-IDF from the dataset and save it to temp files
        start = timeit.default_timer()

        with open(csv_file_path, "r") as sentences_file:
            reader = csv.reader(sentences_file, delimiter=',')
            for row in reader:
                sentences.append(row[0])
                i += 1

        tfidf_vectorizer = TfidfVectorizer()
        tfidf_matrix_train = tfidf_vectorizer.fit_transform(sentences)

        stop = timeit.default_timer()
        print("training time took was: ")
        print(stop - start)

        f = open(tfidf_vectorizer_pikle_path, 'wb')
        pickle.dump(tfidf_vectorizer, f)
        f.close()

        f = open(tfidf_matrix_train_pikle_path, 'wb')
        pickle.dump(tfidf_matrix_train, f)
        f.close()
Both the running and training sections produce the same two objects; we then use those two objects to actually perform the similarity calculation.
We need to:
- Run TF-IDF on the query using the learnt dimension space
- Run cosine similarity between the TF-IDF of the query & the TF-IDF of the dataset
def talk_to_lina_primary(test_set_sentence):
    csv_file_path = "DataSet/randychat.csv"
    tfidf_vectorizer_pikle_path = "DataSet/tfidf_vectorizer.pickle"
    tfidf_matrix_train_pikle_path = "DataSet/tfidf_matrix_train.pickle"

    i = 0
    sentences = []
    test_set = (test_set_sentence, "")

    sentences.append(" No you.")
    sentences.append(" No you.")

    try:
        # running phase: load the previously learnt objects from the temp files
        f = open(tfidf_vectorizer_pikle_path, 'rb')
        tfidf_vectorizer = pickle.load(f)
        f.close()

        f = open(tfidf_matrix_train_pikle_path, 'rb')
        tfidf_matrix_train = pickle.load(f)
        f.close()
    except:
        # training phase: learn TF-IDF from the dataset and save it to temp files
        start = timeit.default_timer()

        with open(csv_file_path, "r") as sentences_file:
            reader = csv.reader(sentences_file, delimiter=',')
            for row in reader:
                sentences.append(row[0])
                i += 1

        tfidf_vectorizer = TfidfVectorizer()
        tfidf_matrix_train = tfidf_vectorizer.fit_transform(sentences)

        stop = timeit.default_timer()
        print("training time took was: ")
        print(stop - start)

        f = open(tfidf_vectorizer_pikle_path, 'wb')
        pickle.dump(tfidf_vectorizer, f)
        f.close()

        f = open(tfidf_matrix_train_pikle_path, 'wb')
        pickle.dump(tfidf_matrix_train, f)
        f.close()

    # project the user's query into the learnt TF-IDF space
    tfidf_matrix_test = tfidf_vectorizer.transform(test_set)

    # cosine similarity between the query and every sentence of the dataset
    cosine = cosine_similarity(tfidf_matrix_test, tfidf_matrix_train)
    cosine = np.delete(cosine, 0)
    max = cosine.max()
    response_index = 0

    # if the best match is strong, pick randomly among the near-best matches,
    # otherwise just take the single best match
    if (max > 0.7):
        new_max = max - 0.01
        list = np.where(cosine > new_max)
        response_index = random.choice(list[0])
    else:
        response_index = np.where(cosine == max)[0][0] + 2

    j = 0
    # read the CSV again to fetch the response stored next to the matched sentence
    with open(csv_file_path, "r") as sentences_file:
        reader = csv.reader(sentences_file, delimiter=',')
        for row in reader:
            j += 1
            if j == response_index:
                return row[1], response_index
                break
Let's just call this function from a simple interface. We created the function to return both the response and the line number of the response; we only show the response string to the user:
while 1:
    sentence = raw_input("talk to Lina : ")
    response_primary, line_id_primary = talk_to_lina_primary(sentence)
    print(response_primary)
    print("")
Hope this was fun; wait for the next tut. It will be even more exciting: we will make Lina able to answer users' questions, using:
- CKY parsing
- simple regex
- Web scraping using Beautiful Soup
- DuckDuckGo
- Yahoo Answers
History
- 10th September, 2017: Initial version
- 11th March, 2021: Updated