Series Introduction
tut1: retrieval-based chatbot
This is Part 2 of a multi-part series of tuts about natural language processing and machine learning, in which we manually create a chatbot that is able to:
- chat with users
- answer their questions
- extract data about users
- perform some actions (recommending a movie, telling the time, ...)
The goal of this series is not just to create a chatbot; it is really about walking through one practical project that can be achieved using NLP & ML. My primary goal is to present both ML & NLP in an interesting way, which is by creating a chatbot.
In this series, we cover several techniques, both simple and advanced, to reach our final goal:
- Document retrieval using TF-IDF
- Document retrieval using Word2Vec
- CKY parsing
- Web scraping
I hope that through this fun project, you will come to love NLP & ML.
We would like to call our Chatbot Lina :D
Introduction
This is Part 2 of the series describing how to build a chatbot. Today, we discuss how to build a system for:
- identifying questions among normal sentences
- looking for answers online
To see the previous tut, describing how to create a retrieval-based chatbot, check this link.
We primarily use two techniques for this:
- natural language parsing using an algorithm named CKY
- web scraping from multiple websites using the Beautiful Soup library
In this tut, we give a quick brief on parsing in general and on how the CKY natural language parsing algorithm works, and then we start the fun part of making our chatbot use parsing and web scraping.
Attached are two files:
- tut2 only
- tut1 + tut2
Technical Background Agenda (a)
This section is optional; read it if you want to know more about the subject. If you want to get started with the code directly, jump to the Implementation section.
- Why not classification & not simple regex
- Why parsing
- What is a PCFG
- What is the CKY parsing algorithm
Implementation Agenda (b)
If you would like to skip the technical background and start with the implementation, take a look at the links below:
- CKY algorithm from stat_parser
- Parsing to get a tree output, then traversing it to get a string
- Regexing through already-known parses
- Calling the DuckDuckGo API
- Web scraping from Yahoo Answers
- Calling the functions in a while-1 loop
- Connecting to the previous tut
References
- CKY implementation by emilmont
- Dan Jurafsky & Chris Manning NLP playlist
Libraries
- NLTK
- CKY implementation by emilmont
- Beautiful Soup
Technical Background
We need to identify questions. If we used classification techniques (any classification algorithm, Naive Bayes for example), we would have to train our classifier on question structures (questions like "how many", "what is the color of", "who is", ...).
But it doesn't stop there, as the classifier must also understand that:
- "how much" is in the same class as "how many"
- "how old" is in the same class as "how big"
- "what is" is in the same class as "who is"
We would end up needing an enormous number of training sentences, or, if we tried regex, an enormous number of rules that are effectively the same.
So neither classification nor simple regular expressions would be highly efficient.
As we have seen in the previous examples:
- "how much" & "how many" are actually the same, as much & many are both adjectives
- "how old" & "how big" are actually the same, for exactly the same reason: old & big are both adjectives
- "what is" & "who is" are actually the same, as what & who are both question words
So we can replace the actual word (big, old, ...) with another symbol (a part-of-speech tag) that says this word is an adjective.
This is precisely what parsing does: it tries to determine what each word is, be it a noun, verb, adjective, or adverb.
This task is called natural language parsing. Parsing has many other uses; any high-level programming language has its own parsing system on the way down to assembly code, but for our task here, we apply parsing to natural language.
There is a well-known algorithm for this called CKY, but it works with a PCFG. So what is a PCFG?
PCFG stands for probabilistic context-free grammar:
- grammar: the rules of a language, describing how words are arranged within it
- context-free: each rule expands a single symbol on its own, independently of the surrounding context
- probabilistic: each rule carries a probability of being chosen over the others
All these rules together form what we call the grammar.
It simply says that there are two ways to create a sentence (S):
- An NP (noun phrase) followed by a VP (verb phrase) creates a sentence; this rule carries a probability of 0.9.
- A verb phrase alone creates a sentence, with a probability of 0.1.
The same goes for all the other non-terminals.
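For illustration, such a grammar can be written down as plain data. Below is a minimal sketch, where the two S rules carry the probabilities above and the remaining probabilities are assumptions made up for the example:

```python
# A toy PCFG: each left-hand side maps to a list of
# (right-hand side, probability) pairs. The probabilities of all
# rules sharing the same left-hand side must sum to 1.
toy_pcfg = {
    'S':  [(('NP', 'VP'), 0.9),   # S -> NP VP (prob 0.9, as above)
           (('VP',),      0.1)],  # S -> VP    (prob 0.1, as above)
    'VP': [(('V', 'NP'),  0.5),   # assumed probabilities below
           (('V',),       0.5)],
    'NP': [(('NP', 'NP'), 0.1),
           (('N',),       0.9)],
}

def check_pcfg(grammar):
    """Verify that each non-terminal's rule probabilities sum to 1."""
    return all(abs(sum(p for _, p in rules) - 1.0) < 1e-9
               for rules in grammar.values())

print(check_pcfg(toy_pcfg))  # -> True
```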
The CKY parsing algorithm is a method for parsing natural language sentences. It is built on dynamic programming, and it uses the PCFG we have just discussed.
We will go briefly over the core concept of how it works. Building the CKY algorithm itself is not our goal; our goal is to put it to good use in our chatbot. Still, it is important to know how it works, so let's begin!
There are two main points for CKY to work properly:
- CKY only works with PCFGs that are binary (each rule contains at most 2 terms on the right-hand side), so if you have a rule with more than 2 terms on the right-hand side, you must break it into multiple binary rules.
- We assume we have a dictionary that contains, for each word, the probability that it is a noun, verb, adjective, ...
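Breaking a longer rule into a chain of binary ones can be sketched as follows (the intermediate symbol naming is an assumption for illustration; real implementations differ):

```python
def binarize(lhs, rhs):
    """Split a rule LHS -> X1 X2 ... Xn (n > 2) into a chain of
    binary rules by introducing fresh intermediate symbols."""
    rules = []
    while len(rhs) > 2:
        # name the fresh symbol after the remaining right-hand side
        new_sym = lhs + "|" + "_".join(rhs[1:])
        rules.append((lhs, (rhs[0], new_sym)))
        lhs, rhs = new_sym, rhs[1:]
    rules.append((lhs, tuple(rhs)))
    return rules

# VP -> V NP PP becomes VP -> V VP|NP_PP and VP|NP_PP -> NP PP
print(binarize('VP', ['V', 'NP', 'PP']))
```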
Assume we need to parse the sentence "people fish in the sea". To make it simpler, we will parse as if the sentence were just "people fish".
Assume the given (PCFG) grammar is:
CKY works with something called a parse triangle (or chart). It is just a way of visualizing how CKY parses the sentence:
Here, we assume that:
- People is (from the predefined dictionary of words with their probabilities):
  - a noun phrase (NP) with prob 0.35
  - a verb (V) with prob 0.1
  - a noun (N) with prob 0.5 ==> this makes sense, as "people" is, after all, a noun
- Fish is (from the same dictionary):
  - a verb phrase (VP) with prob 0.06
  - a noun phrase (NP) with prob 0.14
  - a verb (V) with prob 0.6 ==> this makes sense, as "fish" here is, after all, a verb
  - a noun (N) with prob 0.2
But when choosing our pair of non-terminals (VP, NP, V, N), we don't just choose the pair with the maximum prob; we test all possible grammar rule pairs to get the max.
So in the previous example, we chose to begin testing with NP → NP NP, since both "people" and "fish" can be NP (with different probs, of course).
We then go to the next higher level to get the prob that such a rule could be generated. To do so, we multiply:
- the prob of "people" being NP
- the prob of "fish" being NP
- the prob of the rule NP → NP NP itself
This results in 0.35 x 0.14 x 0.1 = 0.0049.
But we won't just stop here; we must continue until we have looped through all possible rules that "people fish" can make.
So let's choose VP → V NP. This outputs 0.007 (we simply reuse the NP from our previous guess):
Then we continue with choosing NP & VP to create S.
S → NP VP creates a prob of 0.0189 (again reusing the previous guesses).
BUT in CKY we also take unary PCFG rules into consideration: at each step (after the first, pre-terminal level), we check whether the output of a rule can also be generated by a unary rule.
S → VP gives a prob of 0.0007.
So we have two methods for creating S:
- either using both NP and VP, with prob 0.0189
- or using only the already generated VP, with prob 0.0007
We keep only the one with the highest prob and discard the other.
So now, after looping through all possible combinations, we have 3 non-terminals as our output, 3 possible guesses for how to parse "people fish":
- a noun phrase with prob 0.0049
- a verb phrase with prob 0.007
- a complete sentence with prob 0.0189
Here, CKY chooses to parse "people fish" as a complete sentence composed of the noun phrase "people" and the verb phrase "fish".
We then apply this concept across the whole parse triangle, moving one level up at each step until we reach the top. This way, we parse sentences in polynomial rather than exponential time.
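The walkthrough above can be condensed into code. Below is a minimal sketch of CKY over the toy grammar of this example; the probabilities of VP → V NP and NP → NP NP are assumptions chosen to reproduce the numbers above, and this is for illustration only, since later we use a ready-made parser:

```python
# Lexicon: word -> {non-terminal: probability}, as assumed above.
lexicon = {
    'people': {'NP': 0.35, 'V': 0.1, 'N': 0.5},
    'fish':   {'VP': 0.06, 'NP': 0.14, 'V': 0.6, 'N': 0.2},
}
# Binary rules: (left child, right child) -> [(parent, probability)]
binary_rules = {
    ('NP', 'VP'): [('S', 0.9)],   # S  -> NP VP
    ('V', 'NP'):  [('VP', 0.5)],  # VP -> V NP   (assumed prob)
    ('NP', 'NP'): [('NP', 0.1)],  # NP -> NP NP  (assumed prob)
}
# Unary rules: child -> [(parent, probability)]
unary_rules = {'VP': [('S', 0.1)]}  # S -> VP

def apply_unary(cell):
    # After filling a cell, check whether a unary rule improves any entry.
    for child in list(cell):
        for parent, p_rule in unary_rules.get(child, []):
            p = cell[child] * p_rule
            if p > cell.get(parent, 0.0):
                cell[parent] = p

def cky(words):
    n = len(words)
    # chart[i][j] holds the best probability of each non-terminal
    # spanning words[i:j].
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = dict(lexicon[word])
        apply_unary(chart[i][i + 1])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = chart[i][j]
            for k in range(i + 1, j):  # every possible split point
                for left, p_left in chart[i][k].items():
                    for right, p_right in chart[k][j].items():
                        for parent, p_rule in binary_rules.get((left, right), []):
                            p = p_left * p_right * p_rule
                            if p > cell.get(parent, 0.0):
                                cell[parent] = p
            apply_unary(cell)
    return chart[0][n]
```

Running `cky(['people', 'fish'])` reproduces the three guesses above: S with 0.0189, VP with 0.007, and NP with 0.0049.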
There is a truly smart implementation of CKY parsing by emilmont, found on GitHub. It is quite unique in that it doesn't contain a predefined PCFG; instead, it learns the PCFG from the QuestionBank and Penn Treebank corpora.
This makes the learnt PCFG truly realistic: not arbitrary probabilities for each rule, but genuinely learnt values, and this reflects well on the results the code outputs.
First, we import the code for CKY & the NLTK library, as NLTK helps in storing the tree:
from stat_parser.parser import Parser, display_tree
import nltk
Let's define a function that takes a sentence and displays the parse output as a tree:
def parse(sentence):
    parser = Parser()
    try:
        tree = parser.parse(sentence)
    except:
        return False, ""
    print tree
    display_tree(tree)
This would be the tree GUI when parsing "can you tell me who is Obama".
But to actually work on the output, it would be much easier to have it as a flat string of tags:
- MD - PRP - VB - PRP - WP - VBZ - NNP
So we need to traverse the tree, using a recursive function:
def parse(sentence):
    parser = Parser()
    try:
        tree = parser.parse(sentence)
    except:
        return False, ""
    print tree
    display_tree(tree)
    for i in range(len(tree)):
        traverse(tree[i], 0)
Now let's see the function itself:
tree_output = []
imp_list_array = {'Noun': []}

def traverse(parent, x):
    try:
        for node in parent:
            if type(node) is nltk.Tree:
                if node.label() == 'ROOT':
                    pass
                else:
                    element_type = node.label()
                    element_value = node.leaves()[0]
                    element_sentence = node.leaves()
                    if str(element_type) == 'NN' or str(element_type) == \
                            'NNS' or str(element_type) == 'NNP' or str(
                            element_type) == 'NNPS':
                        imp_list_array['Noun'].append(str(element_value))
                    traverse(node, x)
            else:
                tree_output.append(parent.label())
    except:
        tree_output.append('NN')
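The recursion may be easier to follow without the parser in the loop. Here is the same idea on a hand-built tree of nested tuples standing in for the nltk.Tree the parser returns (the tree itself is an assumption for illustration):

```python
# (label, children...) tuples stand in for nltk.Tree nodes;
# a single string child marks a pre-terminal holding a word.
toy_tree = ('S',
            ('NP', ('NNP', 'Obama')),
            ('VP', ('VBZ', 'is'), ('NP', ('NN', 'president'))))

def collect_tags(node, tags, nouns):
    """Walk the tree, recording pre-terminal tags in word order and
    keeping noun words (NN/NNS/NNP/NNPS) aside."""
    label, children = node[0], node[1:]
    if len(children) == 1 and isinstance(children[0], str):
        tags.append(label)               # pre-terminal: record its tag
        if label in ('NN', 'NNS', 'NNP', 'NNPS'):
            nouns.append(children[0])    # remember nouns separately
    else:
        for child in children:
            collect_tags(child, tags, nouns)

tags, nouns = [], []
collect_tags(toy_tree, tags, nouns)
print(tags)   # -> ['NNP', 'VBZ', 'NN']
print(nouns)  # -> ['Obama', 'president']
```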
Since we are using global variables, it is better to clear them at each parse. We then join the traversal output into a single string:
def parse(sentence):
    while len(tree_output) > 0:
        tree_output.pop()
    parser = Parser()
    try:
        tree = parser.parse(sentence)
    except:
        return False, ""
    for i in range(len(tree)):
        traverse(tree[i], 0)
    # join the traversal output into a single string of tags
    tree_output_str = ""
    for a in tree_output:
        tree_output_str += " - " + a
    print tree_output_str
We then define the list of tag patterns that identify fact questions:
special_parses = [
    "WRB - JJ - NNS",
    "WRB - JJ - JJ",
    "WRB - JJ - VBP - DT - NN",
    "WRB - JJ - VBZ - JJ",
    "WRB - JJ - VBZ - NNP",
    "WRB - JJ - NN - VBP - NNP - VBP",
    "WRB - VBP - DT - NN",
    "WP - VBD - DT - NN",
    "WP - VBP - DT - NN",
    "WP - VBZ - DT - NN - IN - NN",
    "WDT - NNS",
    "WRB - VBZ - NN",
    "WRB - VBZ - DT - NN",
    "WP - VBZ - NNP",
    "WP - VBZ - JJ",
    "WRB - VBD - NNP",
    "WP - VBZ - NN",
    "WP - VBZ - DT - JJ - NN - IN - NN",
]
    # still inside parse(): check whether the tag string matches a known pattern
    try:
        regex = reduce(lambda x, y: x + "|" + y, special_parses)
        pos_tree_output = tree_output_str.index(
            re.search(regex, tree_output_str).group(0))
        pos_var = len(tree_output_str.replace('-', '').split()) - len(
            tree_output_str[pos_tree_output:].replace('-', '').split())
        fact_question = ' '.join(sentence.split()[pos_var:])
        print("it is a fact question")
        answr_text = look_in_duck_go(fact_question)
        if answr_text == "":
            answr_text = look_in_yahoo_answers(fact_question)
        return True, answr_text
    except:
        print("not fact question")
        return False, "not fact question"
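To see the matching in isolation, here is a small self-contained run over a hand-written tag string (the tag string matches the earlier example sentence; the two patterns are just a subset of the full list):

```python
import re
from functools import reduce  # built in on Python 2, imported on Python 3

# A subset of the known question patterns
special_parses = ["WP - VBZ - NNP", "WRB - JJ - NNS"]
# Tag string produced earlier for this sentence
tree_output_str = " - MD - PRP - VB - PRP - WP - VBZ - NNP"
sentence = "can you tell me who is Obama"

# Join the patterns into one alternation and search the tag string
regex = reduce(lambda x, y: x + "|" + y, special_parses)
pos_tree_output = tree_output_str.index(
    re.search(regex, tree_output_str).group(0))
# Count how many words precede the matched pattern ...
pos_var = len(tree_output_str.replace('-', '').split()) - len(
    tree_output_str[pos_tree_output:].replace('-', '').split())
# ... and keep only the words from the pattern onward
fact_question = ' '.join(sentence.split()[pos_var:])
print(fact_question)  # -> who is Obama
```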
DuckDuckGo provides a simple API (without even needing an API key) to get answers to questions, but we first must import some libraries:
import re
from bs4 import BeautifulSoup
import requests
DuckDuckGo returns the answer either as XML or JSON; we choose XML, and we use BeautifulSoup to extract the data out of it.
The XML response looks like:
We extract the Abstract tag and the Image tag.
But if the Abstract tag is empty, we try the Text tag instead.
Now let's write the function itself:
def look_in_duck_go(fact_question):
    try:
        base_address = "https://api.duckduckgo.com/?q=" + fact_question + "&format=xml"
        super_page = requests.get(base_address)
        soup_super_page = BeautifulSoup(super_page.content, "xml")
        answer = soup_super_page.findAll('Abstract')[0].text
        image = soup_super_page.findAll('Image')[0].text
        if answer == "":
            answer = soup_super_page.findAll('Text')[0].text
        print("found in duck duck go")
    except:
        answer = ""
    return answer
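Since hitting the live API is not always possible, here is an offline illustration of the same extraction using only the standard library's ElementTree; the XML below is a made-up stand-in for a real response, not actual API output:

```python
import xml.etree.ElementTree as ET

# A made-up response carrying the tags we extract
fake_xml = """<DDG version="1.0">
  <Abstract>Barack Obama is an American politician.</Abstract>
  <Image>https://example.com/obama.png</Image>
  <Text></Text>
</DDG>"""

root = ET.fromstring(fake_xml)
answer = root.findtext('Abstract', default='')
image = root.findtext('Image', default='')
if answer == "":                     # fall back to the Text tag
    answer = root.findtext('Text', default='')
print(answer)  # -> Barack Obama is an American politician.
```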
DuckDuckGo only answers what & who questions, so many other questions would be left unanswered. For those we use Yahoo Answers; given its huge community, most questions can be found there.
This is how you query Yahoo Answers:
https://answers.search.yahoo.com/search?fr=uh3_answers_vert_gs&type=2button&p=your_question
This returns:
We use the answer found inside the first link.
This is the link in HTML:
<a class="lh-17 fz-m"
   href="https://answers.yahoo.com/question/index;_ylt=AwrC1C7U5rdZzxMA1QFPmolQ;
   _ylu=X3oDMTEybTFvb2wxBGNvbG8DYmYxBHBvcwMxBHZ0aWQDQjI1NTlfMQRzZWMDc3I-?
   qid=20091221094648AAFlRbE"
   referrerpolicy="unsafe-url" target="_blank" id="yui_3_10_0_1_1505224373621_408">
  <b>how</b>
  <b id="yui_3_10_0_1_1505224373621_431">many</b>
  <b>pyramids</b>
  <b>in</b>
  <b id="yui_3_10_0_1_1505224373621_407">Egypt</b>
  ?
</a>
What is really important is the class name, which is lh-17.
We extract the link (href).
So our function would be:
def look_in_yahoo_answers(fact_question):
    fact_question = fact_question.replace(' ', '%20')
    base_address = "https://answers.search.yahoo.com/search?" \
                   "fr=uh3_answers_vert_gs&type=2button&p=" + fact_question
    super_page = requests.get(base_address)
    soup_super_page = BeautifulSoup(super_page.content, "lxml")
    answr_link = soup_super_page.findAll('a', {'class': 'lh-17'})[0].attrs["href"]
    return answr_link
Then, after getting the link of the answer, this is how the answer page looks:
We need to get the best answer, which in HTML looks like below:
<span itemprop="text" class="ya-q-full-text"
      id="yui_3_17_2_5_1505225015420_1884">
  There are more than one hundred pyramids. <br>
  The reason the exact number is not fixed is that
  there are unfinished or not-in-good-shape pyramids.
  So it is debatable for some of them whether they would be counted as pyramids or not. <br>
  The perfect ones are the three pyramids of Giza.
</span>
We are interested in the class name, which is ya-q-full-text, so we extract its text:
def look_in_yahoo_answers(fact_question):
    fact_question = fact_question.replace(' ', '%20')
    base_address = "https://answers.search.yahoo.com/search?" \
                   "fr=uh3_answers_vert_gs&type=2button&p=" + fact_question
    super_page = requests.get(base_address)
    soup_super_page = BeautifulSoup(super_page.content, "lxml")
    answr_link = soup_super_page.findAll('a', {'class': 'lh-17'})[0].attrs["href"]
    answer_page = requests.get(answr_link)
    soup_answer_page = BeautifulSoup(answer_page.content, "lxml")
    answr_text = soup_answer_page.findAll('span',
                                          {'class': 'ya-q-full-text'})[0].text
    print("found in yahoo answers")
    return answr_text
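For completeness, the class-based link extraction can also be sketched without BeautifulSoup, using the standard library's html.parser (Python 3) on a made-up snippet; the URL and text below are assumptions for illustration:

```python
from html.parser import HTMLParser

class LinkFinder(HTMLParser):
    """Collect href attributes of <a> tags whose class contains 'lh-17'."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # class attributes may hold several space-separated names
        if tag == 'a' and 'lh-17' in attrs.get('class', '').split():
            self.links.append(attrs.get('href'))

snippet = '<a class="lh-17 fz-m" href="https://example.com/answer">how many pyramids</a>'
finder = LinkFinder()
finder.feed(snippet)
print(finder.links[0])  # -> https://example.com/answer
```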
Let's put all the above code behind a user-friendly interface:
while 1:
    tree_output = []
    tree_output_str = ""
    fact_question = raw_input("talk to Lina : ")
    answr_text = parse(fact_question)
    print answr_text
    print
Here, we create a simple Python script that uses tut1 & tut2 to build a more complete chatbot. This chatbot will now be able to:
- chat with users using the retrieval model with TF-IDF (see the previous tut for more info)
- answer users' questions using parsing & web scraping
So let's start with a Python script called tut_1_2.
import tf_idf
import parsing_scrapping
We comment out every while 1 in both tf_idf & parsing_scrapping (disabling the loops in the inner code) and create a new while 1 in tut_1_2.
We start with the parsing module before TF-IDF: if the input is a fact question, the parsing module easily identifies it as such, while TF-IDF would try its best to respond as if it were a normal chat, which we don't want. So the parsing module goes first:
while 1:
    sentence = raw_input("talk to Lina : ")
    parsing_scrapping.tree_output = []
    parsing_scrapping.tree_output_str = ""
    answr_text = parsing_scrapping.parse(sentence)
    if answr_text[0]:
        print answr_text[1]
    else:
        response_primary, line_id_primary = tf_idf.talk_to_lina_primary(sentence)
        print response_primary
    print
Note:
Please tell me what you think about this tut; leave your review and vote. I also suggest seeing the previous tut, and wait for the coming tuts isA.