Text mining is one of the most important ways of analyzing and processing unstructured data, which makes up nearly 80% of the world's data. Today, most organizations and institutions gather and store massive amounts of data in data warehouses and cloud platforms, and this data keeps growing rapidly as new data pours in from multiple sources. As a result, it has become a challenge for companies and organizations to store, process, and analyze vast amounts of textual data with traditional tools.
Introduction
Web search needs no introduction. Due to its convenience and the richness of information on the Web, searching the Web is increasingly becoming the dominant information seeking method. People make fewer and fewer trips to libraries, but more and more searches on the Web. Web search has its root in information retrieval (or IR for short), a field of study that helps the user find needed information from a large collection of text documents. Traditional IR assumes that the basic information unit is a document, and a large collection of documents is available to form the text database. On the Web, the documents are Web pages.
Retrieving information simply means finding a set of documents that is relevant to the user query. A ranking of the set of documents is usually also performed according to their relevance scores to the query. The most commonly used query format is a list of keywords, which are also called terms. IR is different from data retrieval in databases using SQL queries because the data in databases are highly structured and stored in relational tables, while information in text is unstructured. There is no structured query language like SQL for text retrieval.
It is safe to say that Web search is the single most important application of IR. To a great extent, Web search also helped IR. Indeed, the tremendous success of search engines has pushed IR to the center stage. Search is, however, not simply a straightforward application of traditional IR models. It uses some IR results, but it also has its unique techniques and presents many new problems for IR research.
First of all, efficiency is a paramount issue for Web search, but is only secondary in traditional IR systems mainly due to the fact that document collections in most IR systems are not very large. However, the number of pages on the Web is huge. For example, Google indexed more than 8 billion pages when this article was written. Web users also demand very fast responses. No matter how effective an algorithm is, if the retrieval cannot be done efficiently, few people will use it.
Web pages are also quite different from conventional text documents used in traditional IR systems. First, Web pages have hyperlinks and anchor texts, which do not exist in traditional documents (except citations in research publications). Hyperlinks are extremely important for search and play a central role in search ranking algorithms as we will see in the next section. Anchor texts associated with hyperlinks too are crucial because a piece of anchor text is often a more accurate description of the page that its hyperlink points to. Second, Web pages are semi-structured. A Web page is not simply a few paragraphs of text like in a traditional document. A Web page has different fields, e.g., title, metadata, body, etc. The information contained in certain fields (e.g., the title field) is more important than in others. Furthermore, the content in a page is typically organized and presented in several structured blocks (of rectangular shapes). Some blocks are important and some are not (e.g., advertisements, privacy policy, copyright notices, etc.). Effectively detecting the main content block(s) of a Web page is useful to Web search because terms appearing in such blocks are more important.
Finally, spamming is a major issue on the Web, but not a concern for traditional IR. This is so because the rank position of a page returned by a search engine is extremely important. If a page is relevant to a query but is ranked very low (e.g., below top 30), then the user is unlikely to look at the page. If the page sells a product, then this is bad for the business. In order to improve the ranking of some target pages, “illegitimate” means, called spamming, are often used to boost their rank positions. Detecting and fighting Web spam is a critical issue as it can push low quality (even irrelevant) pages to the top of the search rank, which harms the quality of the search results and the user’s search experience.
(Figure: a general IR system architecture)
Basic Concepts of Information Retrieval
Information retrieval (IR) is the study of helping users to find information that matches their information needs. Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information. Historically, IR is about document retrieval, emphasizing document as the basic unit. The user with information need issues a query (user query) to the retrieval system through the query operations module. The retrieval module uses the document index to retrieve those documents that contain some query terms (such documents are likely to be relevant to the query), compute relevance scores for them, and then rank the retrieved documents according to the scores. The ranked documents are then presented to the user. The document collection is also called the text database, which is indexed by the indexer for efficient retrieval.
A user query represents the user’s information needs, which is in one of the following forms:
- Keyword queries: The user expresses his/her information needs with a list of (at least one) keywords (or terms) aiming to find documents that contain some (at least one) or all the query terms. The terms in the list are assumed to be connected with a “soft” version of the logical AND. For example, if one is interested in finding information about Web mining, one may issue the query ‘Web mining’ to an IR or search engine system. ‘Web mining’ is treated as ‘Web AND mining’. The retrieval system then finds those likely relevant documents and ranks them suitably to present to the user. Note that a retrieved document does not have to contain all the terms in the query. In some IR systems, the ordering of the words is also significant and will affect the retrieval results.
- Boolean queries: The user can use Boolean operators, AND, OR, and NOT to construct complex queries. Thus, such queries consist of terms and Boolean operators. For example, ‘data OR Web’ is a Boolean query, which requests documents that contain the word ‘data’ or ‘Web’. A page is returned for a Boolean query if the query is logically true in the page (i.e., exact match). Although one can write complex Boolean queries using the three operators, users seldom write such queries. Search engines usually support a restricted version of Boolean queries.
- Phrase queries: Such a query consists of a sequence of words that makes up a phrase. Each returned document must contain at least one instance of the phrase. In a search engine, a phrase query is normally enclosed with double quotes. For example, one can issue the following phrase query (including the double quotes), “Web mining techniques and applications” to find documents that contain the exact phrase.
- Proximity queries: The proximity query is a relaxed version of the phrase query and can be a combination of terms and phrases. Proximity queries seek the query terms within close proximity to each other. The closeness is used as a factor in ranking the returned documents or pages. For example, a document that contains all query terms close together is considered more relevant than a page in which the query terms are far apart. Some systems allow the user to specify the maximum allowed distance between the query terms. Most search engines consider both term proximity and term ordering in retrieval. A small sketch illustrating keyword, phrase, and proximity matching is given after this list.
- Full document queries: When the query is a full document, the user wants to find other documents that are similar to the query document. Some search engines (e.g., Google) allow the user to issue such a query by providing the URL of a query page. Additionally, in the returned results of a search engine, each snippet may have a link called “more like this” or “similar pages.” When the user clicks on the link, a set of pages similar to the page in the snippet is returned.
- Natural language questions: This is the most complex case, and also the ideal case. The user expresses his/her information need as a natural language question. The system then finds the answer. However, such queries are still hard to handle due to the difficulty of natural language understanding. Nevertheless, this is an active research area, called question answering. Some search systems are starting to provide question answering services on some specific types of questions, e.g., definition questions, which ask for definitions of technical terms. Definition questions are usually easier to answer because there are strong linguistic patterns indicating definition sentences, e.g., “defined as”, “refers to”, etc. Definitions can usually be extracted offline.
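To make the query types above concrete, the toy sketch below checks a keyword query (soft AND), a phrase query, and a term-proximity score against a single tokenized document. This is illustration only; a real engine works against an index rather than raw token lists, and the names here are made up for the example.

doc = "web usage mining applies data mining techniques to web log data".split()

def keyword_match(doc, terms):
    """Soft AND: count how many of the query terms occur in the document."""
    return sum(1 for t in terms if t in doc)

def phrase_match(doc, phrase):
    """Exact phrase: the terms must occur consecutively and in order."""
    n = len(phrase)
    return any(doc[i:i + n] == phrase for i in range(len(doc) - n + 1))

def proximity(doc, term_a, term_b):
    """Smallest distance between an occurrence of term_a and one of term_b."""
    pos_a = [i for i, t in enumerate(doc) if t == term_a]
    pos_b = [i for i, t in enumerate(doc) if t == term_b]
    return min(abs(i - j) for i in pos_a for j in pos_b)

print(keyword_match(doc, ["web", "mining"]))   # 2 -> both terms occur
print(phrase_match(doc, ["data", "mining"]))   # True -> occurs as an exact phrase
print(proximity(doc, "web", "mining"))         # 2 -> closest occurrences are 2 words apart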
The query operations module can range from very simple to very complex. In the simplest case, it does nothing but just pass the query to the retrieval engine after some simple pre-processing, e.g., removal of stopwords (words that occur very frequently in text but have little meaning, e.g., “the”, “a”, “in”, etc). We will discuss text pre-processing in the next section. In more complex cases, it needs to transform natural language queries into executable queries. It may also accept user feedback and use it to expand and refine the original queries.
The indexer is the module that indexes the original raw documents in some data structures to enable efficient retrieval. The result is the document index. In the next section, we study a particular type of indexing scheme, called the inverted index, which is used in search engines and most IR systems. An inverted index is easy to build and very efficient to search.
The retrieval system computes a relevance score for each indexed document to the query. According to their relevance scores, the documents are ranked and presented to the user. Note that it usually does not compare the user query with every document in the collection, which is too inefficient. Instead, only a small subset of the documents that contains at least one query term is first found from the index and relevance scores with the user query are then computed only for this subset of documents.
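To make the idea concrete, here is a toy sketch of an inverted index in Python (illustration only, not the index format used by a real engine): each term maps to the set of documents that contain it, so only documents sharing at least one term with the query ever get scored.

from collections import defaultdict

documents = {
    1: "web mining extracts knowledge from web data",
    2: "information retrieval ranks documents for a query",
    3: "data mining finds patterns in large data sets",
}

inverted_index = defaultdict(set)            # term -> set of document ids
for doc_id, text in documents.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

query = ["web", "mining"]
candidates = set().union(*[inverted_index.get(t, set()) for t in query])
print(candidates)                            # {1, 3} -- only these documents get scored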
Text and Web Page Pre-Processing
Before the documents in a collection are used for retrieval, some preprocessing tasks are usually performed. For traditional text documents (no HTML tags), the tasks are stopword removal, stemming, and handling of digits, hyphens, punctuation, and the case of letters. For Web pages, additional tasks such as HTML tag removal and identification of main content blocks also require careful consideration. We discuss them in this section.
Stopword Removal
Stopwords are frequently occurring and insignificant words in a language that help construct sentences but do not represent any content of the documents. Articles, prepositions and conjunctions and some pronouns are natural candidates. Common stopwords in English include: a, about, an, are, as, at, be, by, for, from, how, in, is, of, on, or, that, the, these, this, to, was, what, when, where, who, will, with. Such words should be removed before documents are indexed and stored. Stopwords in the query are also removed before retrieval is performed.
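As an illustration, stopword removal is just a filter over the token stream. The sketch below uses the list above; production systems use longer, language-specific lists.

STOPWORDS = {"a", "about", "an", "are", "as", "at", "be", "by", "for", "from",
             "how", "in", "is", "of", "on", "or", "that", "the", "these",
             "this", "to", "was", "what", "when", "where", "who", "will", "with"}

def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("what is the use of web mining".split()))
# ['use', 'web', 'mining']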
Stemming
In many languages, a word has various syntactic forms depending on the context in which it is used. For example, in English, nouns have plural forms, verbs have gerund forms (by adding “ing”), and verbs used in the past tense are different from the present tense. These are considered syntactic variations of the same root form. Such variations cause low recall for a retrieval system because a relevant document may contain a variation of a query word but not the exact word itself. This problem can be partially dealt with by stemming.
Stemming refers to the process of reducing words to their stems or roots. A stem is the portion of a word that is left after removing its prefixes and suffixes. In English, most variants of a word are generated by the introduction of suffixes (rather than prefixes). Thus, stemming in English usually means suffix removal, or stripping. For example, “computer”, “computing”, and “compute” are reduced to “comput”. “walks”, “walking” and “walker” are reduced to “walk”. Stemming enables different variations of the word to be considered in retrieval, which improves the recall. There are several stemming algorithms, also known as stemmers.
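One widely used stemmer is Porter's. The snippet below uses the implementation shipped with NLTK (an extra dependency, not something the crawler in this article requires) to reproduce most of the reductions mentioned above; stemmers differ in details, and Porter, for example, leaves “walker” unchanged.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["computer", "computing", "compute", "walks", "walking"]:
    print(word, "->", stemmer.stem(word))
# computer -> comput, computing -> comput, compute -> comput,
# walks -> walk, walking -> walk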
Over the years, many researchers evaluated the advantages and disadvantages of using stemming. Clearly, stemming increases the recall and reduces the size of the indexing structure. However, it can hurt precision because many irrelevant documents may be considered relevant. For example, both “cop” and “cope” are reduced to the stem “cop”. However, if one is looking for documents about police, a document that contains only “cope” is unlikely to be relevant. Although many experiments have been conducted by researchers, there is still no conclusive evidence one way or the other. In practice, one should experiment with the document collection at hand to see whether stemming helps.
Other Pre-Processing Tasks for Text
Digits: Numbers and terms that contain digits are removed in traditional IR systems except some specific types, e.g., dates, times, and other prespecified types expressed with regular expressions. However, in search engines, they are usually indexed.
Hyphens: Hyphens are usually broken (removed) to deal with inconsistent usage. For example, some people write “state-of-the-art”, while others write “state of the art”. If the hyphens in the first case are removed, we eliminate the inconsistency. However, some words may have a hyphen as an integral part of the word, e.g., “Y-21”. Thus, in general, the system can follow a general rule (e.g., removing all hyphens) and also keep some exceptions. Note that there are two types of removal: (1) each hyphen is replaced with a space, and (2) each hyphen is simply removed without leaving a space, so that “state-of-the-art” becomes either “state of the art” or “stateoftheart”. In some systems both forms are indexed, as it is hard to determine which is correct; e.g., if “pre-processing” is converted to “pre processing”, then some relevant pages will not be found if the query term is “preprocessing”.
Punctuation Marks: Punctuation can be dealt with similarly as hyphens.
Case of Letters: All the letters are usually converted to either the upper or lower case.
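Taken together, these steps form a small normalization pass over the raw text. The sketch below applies one possible policy (policies differ between systems): lower-case everything, strip punctuation, drop standalone numbers, and index both variants of every hyphenated word.

import re

def normalize(text):
    """One possible policy: lower-case, strip punctuation, drop standalone
    numbers, and index both variants of every hyphenated word."""
    text = text.lower()
    extra = []
    for word in re.findall(r"\w+(?:-\w+)+", text):   # e.g. "state-of-the-art"
        extra.append(word.replace("-", ""))          # "stateoftheart"
        extra.append(word.replace("-", " "))         # "state of the art"
    text = re.sub(r"[^\w\s]", " ", text)             # punctuation (and hyphens) -> space
    tokens = [t for t in text.split() if not t.isdigit()]
    return tokens + " ".join(extra).split()

print(normalize("State-of-the-art pre-processing was proposed in 2021."))
# ['state', 'of', 'the', 'art', 'pre', 'processing', 'was', 'proposed', 'in',
#  'stateoftheart', 'state', 'of', 'the', 'art', 'preprocessing', 'pre', 'processing']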
Web Page Pre-Processing
We have indicated at the beginning of the section that Web pages are different from traditional text documents. Thus, additional pre-processing is needed. We describe some important ones below.
Identifying different text fields: In HTML, there are different text fields, e.g., title, metadata, and body. Identifying them allows the retrieval system to treat terms in different fields differently. For example, in search engines terms that appear in the title field of a page are regarded as more important than terms that appear in other fields and are assigned higher weights because the title is usually a concise description of the page. In the body text, emphasized terms (e.g., under header tags <h1>, <h2>, …, the bold tag <b>, etc.) are also given higher weights.
Identifying anchor text: Anchor text associated with a hyperlink is treated specially in search engines because the anchor text often represents a more accurate description of the information contained in the page pointed to by its link. In the case that the hyperlink points to an external page (not in the same site), it is especially valuable because it is a summary description of the page given by other people rather than the author/owner of the page, and is thus more trustworthy.
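Both tasks are straightforward with an HTML parser. The sketch below uses BeautifulSoup (the same library the crawler later in this article uses) to extract the title field, emphasized body text, and anchor texts together with the URLs they point to; the HTML and base URL are made up for the example.

from urllib.parse import urljoin
from bs4 import BeautifulSoup

html = """<html><head><title>Web Mining</title></head>
<body><h1>Web mining techniques</h1>
<p>Text with <b>important</b> words and a
<a href="/page.html">link to another page</a>.</p></body></html>"""

soup = BeautifulSoup(html, "html.parser")
title = soup.title.get_text()                                  # title field
emphasized = [t.get_text() for t in soup.find_all(["h1", "h2", "h3", "b", "strong"])]
anchors = [(a.get_text(), urljoin("http://example.com/", a.get("href")))
           for a in soup.find_all("a")]                        # anchor text + target URL

print(title)       # Web Mining
print(emphasized)  # ['Web mining techniques', 'important']
print(anchors)     # [('link to another page', 'http://example.com/page.html')]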
Removing HTML tags: The removal of HTML tags can be dealt with similarly to punctuation. One issue needs careful consideration because it affects phrase queries and proximity queries. HTML is inherently a visual presentation language: in a typical commercial page, information is presented in many rectangular blocks, and simply removing HTML tags may join pieces of text that should not be joined, which in turn causes problems for phrase and proximity queries. This problem had not been dealt with satisfactorily by search engines at the time of writing.
Identifying main content blocks: A typical Web page, especially a commercial page, contains a large amount of information that is not part of the main content of the page. For example, it may contain banner ads, navigation bars, copyright notices, etc., which can lead to poor results for search and mining. On the Wikipedia home page, for instance, the main content block is the block containing “Today’s featured article”; it is not desirable to index the anchor texts of the navigation links as part of the content of that page. Several researchers have studied the problem of identifying main content blocks and showed that search and data mining results can be improved significantly if only the main content blocks are used.
Duplicate Detection
Duplicate documents or pages are not a problem in traditional IR. However, in the context of the Web, duplication is a significant issue. There are different types of duplication of pages and content on the Web. Copying a page is usually called duplication or replication, and copying an entire site is called mirroring. Duplicate pages and mirror sites are often used to improve the efficiency of browsing and file downloading worldwide due to limited bandwidth across different geographic regions and poor or unpredictable network performance. Of course, some duplicate pages are the result of plagiarism. Detecting such pages and sites can reduce the index size and improve search results. Several methods can be used to find duplicate information. The simplest method is to hash the whole document, e.g., using the MD5 algorithm, or by computing an aggregate value (e.g., a checksum). However, these methods are only useful for detecting exact duplicates. On the Web, one seldom finds exact duplicates. For example, even different mirror sites may have different URLs, different webmasters, different contact information, different advertisements to suit local needs, etc.
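For exact duplicates, a content hash is enough. The sketch below fingerprints pages with MD5 from the Python standard library (near-duplicates, as noted above, need more sophisticated techniques); the page contents here are made up for the example.

import hashlib

def content_fingerprint(text):
    """MD5 digest of the page text; identical pages produce identical digests."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

pages = {
    "http://a.example/index.html": "web mining tutorial",
    "http://b.example/mirror.html": "web mining tutorial",    # exact copy
    "http://c.example/other.html": "something else entirely",
}

seen = {}
for url, text in pages.items():
    digest = content_fingerprint(text)
    if digest in seen:
        print(url, "is an exact duplicate of", seen[digest])
    else:
        seen[digest] = url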
Web Search
We now put it all together and describe the working of a search engine. Since it is difficult to know the internal details of a commercial search engine, most of the content in this section is based on research papers, especially the early Google paper. Due to efficiency concerns, latent semantic indexing is probably not used in Web search yet. Current search algorithms are still mainly based on the vector space model and term matching.
A search engine starts with the crawling of pages on the Web. The crawled pages are then parsed, indexed, and stored. At the query time, the index is used for efficient retrieval. We will not discuss crawling here. The subsequent operations of a search engine are described below:
- Parsing: A parser is used to parse the input HTML page, which produces a stream of tokens or terms to be indexed. The parser can be constructed with standard tools such as the lexical analyzer generator Flex together with the parser generator YACC. Some pre-processing tasks described in previous sections may also be performed before or after parsing.
- Indexing: This step produces an inverted index, which can be done using any of the methods described in previous sections. For retrieval efficiency, a search engine may build multiple inverted indices. For example, since the titles and anchor texts are often very accurate descriptions of the pages, a small inverted index may be constructed based on the terms that appear in them alone. Note that here the anchor text is for indexing the page that its link points to, not the page containing it. A full index is then built based on all the text in each page, including anchor texts (a piece of anchor text is indexed both for the page that contains it, and for the page that its link points to). In searching, the algorithm may search in the small index first and then the full index. If a sufficient number of relevant pages are found in the small index, the system may not search in the full index.
- Searching and Ranking: Given a user query, searching involves the following steps (a toy end-to-end sketch is given after this list):
- pre-processing the query terms using some of the methods described in previous sections, e.g., stopword removal and stemming;
- finding pages that contain all (or most of) the query terms in the inverted index;
- ranking the pages and returning them to the user.
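Reusing the toy inverted index sketched earlier, these three steps fit in a few lines. Scoring here is a simple count of matched query terms, standing in for the weighting scheme described below; the function and variable names are made up for the example.

def search(query, inverted_index, stopwords):
    # 1. pre-process the query terms (lower-case, drop stopwords)
    terms = [t.lower() for t in query.split() if t.lower() not in stopwords]
    # 2. find pages containing at least one query term via the inverted index
    candidates = set().union(*[inverted_index.get(t, set()) for t in terms])
    # 3. rank pages by how many query terms they contain (a stand-in for real scoring)
    scores = {d: sum(1 for t in terms if d in inverted_index.get(t, set())) for d in candidates}
    return sorted(scores, key=scores.get, reverse=True)

print(search("the web mining", inverted_index, {"the"}))   # [1, 3]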
The ranking algorithm is the heart of a search engine. However, little is known about the algorithms used in commercial search engines. We give a general description based on the algorithm in the early Google system. As we discussed earlier, traditional IR uses cosine similarity values or any other related measures to rank documents. These measures only consider the content of each document. For the Web, such content based methods are not sufficient. The problem is that on the Web, there are too many relevant documents for almost any query. For example, using “web mining” as the query, the search engine Google estimated that there were 46,500,000 relevant pages. Clearly, there is no way that any user will look at this huge number of pages. Therefore, the issue is how to rank the pages and present the user the “best” pages at the top.
An important ranking factor on the Web is the quality of the pages, which was hardly studied in traditional IR because most documents used in IR evaluations are from reliable sources. However, on the Web, anyone can publish almost anything, so there is no quality control. Although a page may be 100% relevant, it may not be a quality page for several reasons. For example, the author may not be an expert on the query topic, the information given in the page may be unreliable or biased, etc.
However, the Web does have an important mechanism, the hyperlinks (links), that can be used to assess the quality of each page to some extent. A link from page x to page y is an implicit conveyance of authority of page x to page y. That is, the author of page x believes that page y contains quality or authoritative information. One can also regard the fact that page x points to page y as a vote of page x for page y. This democratic nature of the Web can be exploited to assess the quality of each page. In general, the more votes a page receives, the more likely it is a quality page. The actual algorithms are more involved than simply counting the number of votes or links pointing to a page (called in-links). We will describe the algorithms in the next chapter. PageRank is the most well known such algorithm. It makes use of the link structure of Web pages to compute a quality or reputation score for each page. Thus, a Web page can be evaluated based on both its content factors and its reputation. Content-based evaluation depends on two kinds of information:
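The algorithms themselves are beyond this section, but the voting intuition behind PageRank can be sketched in a few lines of Python: repeatedly redistribute each page's score along its out-links until the scores stabilize. This is a toy power-iteration for illustration, not the production algorithm, and the link graph is made up.

def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it points to. Returns a score per page."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(links))   # C collects the most votes, so it gets the highest score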
- Occurrence Type: There are several types of occurrences of query terms in a page:
- Title: a query term occurs in the title field of the page.
- Anchor text: a query term occurs in the anchor text of a page pointing to the current page being evaluated.
- URL: a query term occurs in the URL of the page. Many URL addresses contain some descriptions of the page. For example, a page on Web mining may have the URL http://www.domain.edu/Web-mining.html.
- Body: a query term occurs in the body field of the page. In this case, the prominence of each term is considered. Prominence means whether the term is emphasized in the text with a large font, or bold and/or italic tags. Different prominence levels can be used in a system. Note that anchor texts in the page can be treated as plain texts for the evaluation of the page.
- Count: The number of occurrences of a term of each type. For example, a query term may appear in the title field of the page 2 times. Then, the title count for the term is 2.
- Position: This is the position of each term in each type of occurrence. The information is used in proximity evaluation involving multiple query terms. Query terms that are near to each other are better than those that are far apart. Furthermore, query terms appearing in the page in the same sequence as they are in the query are also better.
For the computation of the content based score (also called the IR score), each occurrence type is given an associated weight. All type weights form a fixed vector. Each raw term count is converted to a count weight, and all count weights also form a vector.
Let us now look at two kinds of queries, single word queries and multi-word queries. A single word query is the simplest case with only a single term. After obtaining the pages containing the term from the inverted index, we compute the dot product of the type weight vector and the count weight vector of each page, which gives us the IR score of the page. The IR score of each page is then combined with its reputation score to produce the final score of the page.
For a multi-word query, the situation is similar, but more complex since there is now the issue of considering term proximity and ordering. Let us simplify the problem by ignoring the term ordering in the page. Clearly, terms that occur close to each other in a page should be weighted higher than those that occur far apart. Thus multiple occurrences of terms need to be matched so that nearby terms can be identified. For every matched set, a proximity value is calculated, which is based on how far apart the terms are in the page. Counts are also computed for every type and proximity. Each type and proximity pair has a type-proximity-weight. The counts are converted into count-weights. The dot product of the count-weights and the type-proximity-weights gives an IR score to the page. Term ordering can be considered similarly and included in the IR score, which is then combined with the page reputation score to produce the final rank score.
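As a small numeric illustration of the content-based score (the weights below are hypothetical, not the values used by any real engine): suppose the type-weight vector covers (title, anchor text, URL, body) occurrences and a page's count-weight vector records how often the query term occurs in each field; their dot product is the page's IR score.

# Hypothetical type weights for (title, anchor text, URL, body) occurrences.
type_weights  = [10.0, 8.0, 6.0, 1.0]
# Count weights for one page: the query term occurs once in the title,
# twice in anchor texts pointing to the page, and five times in the body.
count_weights = [1.0, 2.0, 0.0, 5.0]

ir_score = sum(t * c for t, c in zip(type_weights, count_weights))
print(ir_score)   # 10*1 + 8*2 + 6*0 + 1*5 = 31.0

# The final rank score then combines the IR score with the page's
# reputation (e.g., PageRank) score, for instance as a weighted sum.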
A search engine maintains the following processes in near real time: Web Crawling, Indexing, Searching.
(Figures: the Web Crawler's architecture, and the Web Crawler in action)
Step by Step Walk-throughs
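The scripts below are excerpts from a single crawler module; they rely on a handful of imports, connection constants, and global variables that are declared once at the top. The header below is a sketch of what that preamble might look like: the connection values are placeholders, and the choice of NLTK's word_tokenize for the tokenize() helper is an assumption, not necessarily what the original script uses.

import time
import urllib.error
from urllib.parse import urljoin
from urllib.request import Request, urlopen

import mysql.connector
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize as tokenize  # assumed tokenizer for analyze_keyword()

# Placeholder connection settings -- replace with your own values.
HOSTNAME = "localhost"
DATABASE = "text_mining"
USERNAME = "username"
PASSWORD = "password"

# Global crawler state shared by the functions below.
visited_urls = []      # URLs already crawled
frontier_array = []    # URLs waiting to be crawled
frontier_score = {}    # URL -> number of in-links seen so far
keyword_array = []     # keywords already inserted into the database
webpage_count = 0      # number of pages stored so far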
Python script to create the MySQL database:
def create_database():
    # Initialized up front so the finally block below can check them safely.
    connection = None
    cursor = None
    try:
connection = mysql.connector.connect(host=HOSTNAME, database=DATABASE,\
user=USERNAME, password=PASSWORD,\
autocommit=True)
server_info = connection.get_server_info()
print("MySQL connection is open on", server_info)
sql_drop_table = "DROP TABLE IF EXISTS `occurrence`"
cursor = connection.cursor()
cursor.execute(sql_drop_table)
sql_drop_table = "DROP TABLE IF EXISTS `keyword`"
cursor.execute(sql_drop_table)
sql_drop_table = "DROP TABLE IF EXISTS `webpage`"
cursor.execute(sql_drop_table)
sql_create_table = "CREATE TABLE `webpage` \
(`webpage_id` BIGINT NOT NULL AUTO_INCREMENT, " \
"`url` VARCHAR(256) NOT NULL, `title` VARCHAR(256) NOT NULL, " \
"`content` TEXT NOT NULL, PRIMARY KEY(`webpage_id`)) ENGINE=InnoDB"
cursor.execute(sql_create_table)
sql_create_table = "CREATE TABLE `keyword` \
(`keyword_id` BIGINT NOT NULL AUTO_INCREMENT, " \
"`name` VARCHAR(256) NOT NULL, \
PRIMARY KEY(`keyword_id`)) ENGINE=InnoDB"
cursor.execute(sql_create_table)
sql_create_table = "CREATE TABLE `occurrence` (`webpage_id` BIGINT NOT NULL, " \
"`keyword_id` BIGINT NOT NULL, `counter` BIGINT NOT NULL, " \
"`pagerank` REAL NOT NULL, \
PRIMARY KEY(`webpage_id`, `keyword_id`), " \
"FOREIGN KEY webpage_fk(webpage_id) \
REFERENCES webpage(webpage_id), " \
"FOREIGN KEY keyword_fk(keyword_id) \
REFERENCES keyword(keyword_id)) ENGINE=InnoDB"
cursor.execute(sql_create_table)
sql_create_index = "CREATE OR REPLACE UNIQUE INDEX index_name ON `keyword`(`name`)"
cursor.execute(sql_create_index)
sql_no_of_words = "CREATE OR REPLACE FUNCTION no_of_words\
(token VARCHAR(256)) RETURNS " \
"REAL READS SQL DATA RETURN (SELECT MAX(`counter`) \
FROM `occurrence` " \
"INNER JOIN `keyword` USING(`keyword_id`) WHERE `name` = token)"
cursor.execute(sql_no_of_words)
sql_no_of_pages = "CREATE OR REPLACE FUNCTION \
no_of_pages(token VARCHAR(256)) RETURNS " \
"REAL READS SQL DATA RETURN \
(SELECT COUNT(`webpage_id`) FROM `occurrence` " \
"INNER JOIN `keyword` USING(`keyword_id`) WHERE `name` = token)"
cursor.execute(sql_no_of_pages)
sql_total_pages = "CREATE OR REPLACE FUNCTION total_pages() \
RETURNS REAL READS SQL DATA " \
"RETURN (SELECT COUNT(`webpage_id`) FROM `webpage`)"
cursor.execute(sql_total_pages)
sql_data_mining = "CREATE OR REPLACE FUNCTION data_mining\
(webpage_no BIGINT, token VARCHAR(256)) " \
"RETURNS REAL READS SQL DATA RETURN \
(SELECT SUM(`counter`)/no_of_words(token)*" \
"LOG((1+total_pages())/no_of_pages(token)) \
FROM `occurrence` INNER JOIN `keyword` " \
"USING(`keyword_id`) WHERE `name` = token \
AND `webpage_id` = webpage_no)"
cursor.execute(sql_data_mining)
    except mysql.connector.Error as err:
        print("MySQL connector error:", str(err))
        return False
    finally:
        # Close the cursor/connection if they were opened; the success value is
        # returned after the finally block so errors still yield False.
        if connection and connection.is_connected():
            if cursor:
                cursor.close()
            connection.close()
            print("MySQL connection is now closed")
    return True
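Note that the data_mining() SQL function created above implements a TF-IDF-style weight: the keyword's count on the page, normalized by the largest count of that keyword on any page, multiplied by the logarithm of (1 + total number of pages) divided by the number of pages containing the keyword. The same formula in plain Python, for illustration only (the live score is computed inside MySQL):

import math

def tfidf_score(count_in_page, max_count_of_term, pages_with_term, total_pages):
    """TF-IDF-like weight matching the data_mining() SQL function above."""
    tf = count_in_page / max_count_of_term                # normalized term frequency
    idf = math.log((1 + total_pages) / pages_with_term)   # inverse document frequency
    return tf * idf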
Python script to add and remove URLs from Queue:
def add_url_to_frontier(url):
global visited_urls
global frontier_array
global frontier_score
found = False
if url.find('#') > 0:
url = url.split('#')[0]
    # Skip URLs that point to non-HTML resources, identified by file extension.
    # str.endswith() accepts a tuple, so one test replaces the long if-chain.
    SKIP_EXTENSIONS = (
        '.3g2', '.3gp', '.7z', '.ai', '.aif', '.apk', '.arj', '.avi', '.bat',
        '.bin', '.bmp', '.cda', '.com', '.csv', '.dat', '.db', '.dbf', '.deb',
        '.dmg', '.doc', '.docx', '.email', '.eml', '.emlx', '.exe', '.flv',
        '.fnt', '.fon', '.gadget', '.gif', '.h264', '.ico', '.iso', '.jar',
        '.jpeg', '.jpg', '.log', '.m4v', '.mdb', '.mid', '.midi', '.mov',
        '.mp3', '.mp4', '.mpa', '.mpeg', '.mpg', '.msg', '.msi', '.ods',
        '.odt', '.oft', '.ogg', '.ost', '.otf', '.pdf', '.pkg', '.png',
        '.ppt', '.pptx', '.ps', '.psd', '.pst', '.rar', '.rpm', '.rtf',
        '.sql', '.svg', '.swf', '.tar', '.tar.gz', '.tex', '.tif', '.tiff',
        '.toast', '.ttf', '.txt', '.vcd', '.vcf', '.vob', '.wav', '.wma',
        '.wmv', '.wpd', '.wpl', '.wsf', '.xls', '.xlsx', '.xml', '.z', '.zip')
    if url.endswith(SKIP_EXTENSIONS):
        return
if url not in visited_urls:
if url in frontier_array:
found = True
frontier_score[url] = frontier_score.get(url) + 1
if not found:
frontier_array.append(url)
frontier_score[url] = 1
def extract_url_from_frontier():
global frontier_array
global frontier_score
score = 0
url = None
for item in frontier_array:
if score < frontier_score.get(item):
url = item
score = frontier_score.get(url)
if url:
frontier_array.remove(url)
del frontier_score[url]
visited_urls.append(url)
return url
def download_page_from_url(url):
html_title = None
plain_text = None
try:
req = Request(url)
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "html.parser")
html_title = soup.title.get_text().strip()
plain_text = soup.get_text().strip()
plain_text = " ".join(plain_text.split())
for hyperlink in soup.find_all('a'):
hyperlink = urljoin(url, hyperlink.get('href'))
add_url_to_frontier(hyperlink)
    # HTTPError and ContentTooShortError are subclasses of URLError,
    # so they must be caught before the more general URLError.
    except urllib.error.HTTPError as err:
        print(str(err))
    except urllib.error.ContentTooShortError as err:
        print(str(err))
    except urllib.error.URLError as err:
        print(str(err))
    finally:
        return html_title, plain_text
Python script to crawl the Internet:
def web_search_engine():
global webpage_count
print("Starting Web Search Engine thread...")
try:
connection = mysql.connector.connect(host=HOSTNAME, database=DATABASE,\
user=USERNAME, password=PASSWORD,\
autocommit=True)
server_info = connection.get_server_info()
print("MySQL connection is open on", server_info)
while True:
url = extract_url_from_frontier()
if url:
print("Crawling %s... [%d]" % (url, webpage_count + 1))
html_title, plain_text = download_page_from_url(url)
if html_title and plain_text:
if len(html_title) > 0:
connection = analyze_webpage(connection, url, html_title, plain_text)
if (webpage_count > 0) and ((webpage_count % 1000) == 0):
if connection.is_connected():
connection.close()
print("MySQL connection is now closed")
data_mining()
else:
break
except mysql.connector.Error as err:
print("MySQL connector error:", str(err))
finally:
if connection.is_connected():
connection.close()
print("MySQL connection is now closed")
def analyze_webpage(connection, url, html_title, plain_text):
global webpage_count
while not connection.is_connected():
try:
time.sleep(30)
connection = mysql.connector.connect(host=HOSTNAME, database=DATABASE, \
user=USERNAME, password=PASSWORD,
autocommit=True)
server_info = connection.get_server_info()
print("MySQL connection is open on", server_info)
except mysql.connector.Error as err:
print("MySQL connector error:", str(err))
finally:
pass
try:
sql_statement = "INSERT INTO `webpage` (`url`, `title`, `content`) \
VALUES ('%s', '%s', '%s')" % \
(url, html_title.replace("'", "\""), plain_text.replace("'", "\""))
cursor = connection.cursor()
cursor.execute(sql_statement)
if cursor.rowcount == 0:
return connection
sql_last_id = "SET @last_webpage_id = LAST_INSERT_ID()"
cursor = connection.cursor()
cursor.execute(sql_last_id)
cursor.close()
webpage_count = webpage_count + 1
return analyze_keyword(connection, plain_text)
except mysql.connector.Error as err:
print("MySQL connector error:", str(err))
finally:
pass
return connection
def analyze_keyword(connection, plain_text):
global webpage_count
global keyword_array
new_keyword = {}
old_keyword = {}
tokenize_list = tokenize(plain_text)
for keyword in tokenize_list:
if keyword.isascii() and keyword.isalnum():
keyword = keyword.lower()
if keyword not in keyword_array:
keyword_array.append(keyword)
new_keyword[keyword] = 1
else:
if new_keyword.get(keyword) is not None:
new_keyword[keyword] = new_keyword[keyword] + 1
else:
if old_keyword.get(keyword) is None:
old_keyword[keyword] = 1
else:
old_keyword[keyword] = old_keyword[keyword] + 1
try:
for keyword in new_keyword.keys():
while not connection.is_connected():
time.sleep(30)
connection = mysql.connector.connect(host=HOSTNAME, \
database=DATABASE, user=USERNAME,
password=PASSWORD, autocommit=True)
server_info = connection.get_server_info()
print("MySQL connection is open on", server_info)
sql_last_id = "SET @last_webpage_id = %d" % webpage_count
cursor = connection.cursor()
cursor.execute(sql_last_id)
sql_statement = "INSERT INTO `keyword` (`name`) VALUES ('%s')" % keyword
cursor = connection.cursor()
cursor.execute(sql_statement)
if cursor.rowcount == 0:
keyword_array.remove(keyword)
continue
sql_last_id = "SET @last_keyword_id = LAST_INSERT_ID()"
cursor = connection.cursor()
cursor.execute(sql_last_id)
sql_statement = "INSERT INTO `occurrence` (`webpage_id`, `keyword_id`, \
`counter`, `pagerank`) " \
"VALUES (@last_webpage_id, @last_keyword_id, %d, 0.0)" \
% new_keyword[keyword]
cursor = connection.cursor()
cursor.execute(sql_statement)
cursor.close()
for keyword in old_keyword.keys():
while not connection.is_connected():
time.sleep(30)
connection = mysql.connector.connect(host=HOSTNAME, database=DATABASE,
user=USERNAME,
password=PASSWORD, autocommit=True)
server_info = connection.get_server_info()
print("MySQL connection is open on", server_info)
sql_last_id = "SET @last_webpage_id = %d" % webpage_count
cursor = connection.cursor()
cursor.execute(sql_last_id)
sql_last_id = "SET @last_keyword_id = (SELECT `keyword_id` FROM `keyword` \
WHERE `name` = '%s')" % keyword
cursor = connection.cursor()
cursor.execute(sql_last_id)
sql_statement = "INSERT INTO `occurrence` \
(`webpage_id`, `keyword_id`, `counter`, `pagerank`) " \
"VALUES (@last_webpage_id, \
@last_keyword_id, %d, 0.0)" % old_keyword[keyword]
cursor = connection.cursor()
cursor.execute(sql_statement)
cursor.close()
except mysql.connector.Error as err:
print("MySQL connector error:", str(err))
finally:
pass
return connection
def data_mining():
records = None
connection = None
rowcount = 0
try:
print("Starting Data Mining thread... [cleanup]")
connection = mysql.connector.connect(host=HOSTNAME, database=DATABASE,
user=USERNAME, password=PASSWORD,
autocommit=True)
server_info = connection.get_server_info()
print("MySQL connection is open on", server_info)
sql_select_query = "SELECT * FROM `keyword` ORDER BY `keyword_id`"
cursor = connection.cursor()
cursor.execute(sql_select_query)
records = cursor.fetchall()
print("Total number of rows in table:", cursor.rowcount)
rowcount = cursor.rowcount
cursor.close()
except mysql.connector.Error as err:
print("MySQL connector error:", str(err))
finally:
pass
    for row in records or []:  # records may be None if the SELECT above failed
done = False
while not done:
try:
if not connection.is_connected():
time.sleep(30)
connection = mysql.connector.connect(host=HOSTNAME, database=DATABASE,
user=USERNAME,
password=PASSWORD, autocommit=True)
server_info = connection.get_server_info()
print("MySQL connection is open on", server_info)
data_update = connection.cursor()
sql_update_query = "UPDATE `occurrence` \
INNER JOIN `keyword` USING(`keyword_id`)" \
"SET `pagerank` = data_mining(`webpage_id`, `name`) \
WHERE `name` = '%s'" % row[1]
print("applying data mining for '%s'... [%d/%d]" % (row[1],
records.index(row) + 1, rowcount))
data_update.execute(sql_update_query)
data_update.close()
done = True
except mysql.connector.Error as err:
print("MySQL connector error:", str(err))
finally:
pass
try:
if connection.is_connected():
connection.close()
print("MySQL connection is now closed")
except mysql.connector.Error as err:
print("MySQL connector error:", str(err))
finally:
pass
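To run the crawler end to end, the frontier must be seeded with at least one start URL before web_search_engine() is called. A minimal driver could look like this (the seed URL is only an example):

if __name__ == "__main__":
    create_database()                                    # (re)create the MySQL schema
    add_url_to_frontier("https://www.text-mining.ro/")   # seed the frontier
    web_search_engine()                                  # crawl, index, and score pages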
INDEX.HTML file from https://www.text-mining.ro/:
<!DOCTYPE html>
<html>
<head>
<title>text-mining.ro</title>
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />
<link rel="icon" type="image/png" href="romania-flag-square-icon-256.png">
<style type="text/css">
html, body, .container {
height: 100%;
}
.container {
display: -webkit-flexbox;
display: -ms-flexbox;
display: -webkit-flex;
display: flex;
-webkit-flex-align: center;
-ms-flex-align: center;
-webkit-align-items: center;
align-items: center;
justify-content: center;
}
#tfheader {
background-color: #c3dfef;
}
#tfnewsearch {
float: right;
padding: 20px;
}
.tftextinput {
margin: 0;
padding: 5px 15px;
font-family: Arial, Helvetica, sans-serif;
font-size: 14px;
border: 1px solid #0076a3;
border-right: 0px;
border-top-left-radius: 5px 5px;
border-bottom-left-radius: 5px 5px;
}
.tfbutton {
margin: 0;
padding: 5px 15px;
font-family: Arial, Helvetica, sans-serif;
font-size: 14px;
outline: none;
cursor: pointer;
text-align: center;
text-decoration: none;
color: #ffffff;
border: solid 1px #0076a3;
border-right: 0px;
background: #0095cd;
background: -webkit-gradient(linear, left top, left bottom, from(#00adee), to(#0078a5));
background: -moz-linear-gradient(top, #00adee, #0078a5);
border-top-right-radius: 5px 5px;
border-bottom-right-radius: 5px 5px;
}
.tfbutton:hover {
text-decoration: none;
background: #007ead;
background: -webkit-gradient(linear, left top, left bottom, from(#0095cc), to(#00678e));
background: -moz-linear-gradient(top, #0095cc, #00678e);
}
.tfbutton::-moz-focus-inner {
border: 0;
}
.tfclear {
clear: both;
}
</style>
</head>
<body>
<div class="container">
<div id="tfheader">
<form id="tfnewsearch" method="get" action="search.php">
<input type="text" class="tftextinput" name="q" size="50" maxlength="120"><input type="submit" value="search" class="tfbutton">
<br/>Website developed by <a href="https://www.emvs.site/curriculum-vitae/" target="_blank" >Stefan-Mihai MOGA</a> as part of his dissertation.
</form>
<div class="tfclear"></div>
</div>
</div>
</body>
</html>
SEARCH.PHP file from https://www.text-mining.ro/:
<?php
$servername = "localhost";
$username = "r46882text_engine";
$password = "TextMining2021!@#$";
$dbname = "r46882text_mining";
echo "<!DOCTYPE html>\n";
echo "<html>\n";
echo "\t<head>\n";
echo "\t\t<title>" . $_GET['q'] . "</title>\n";
echo "\t\t<meta charset=\"utf-8\">\n";
echo "\t\t<link rel=\"icon\" type=\"image/png\" href=\"romania-flag-square-icon-256.png\">\n";
echo "\t\t<meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\n";
echo "\t\t<link rel=\"stylesheet\" href=\"https://maxcdn.bootstrapcdn.com/bootstrap/3.4.0/css/bootstrap.min.css\">\n";
echo "\t\t<script src=\"https://ajax.googleapis.com/ajax/libs/jquery/3.4.0/jquery.min.js\"></script>\n";
echo "\t\t<script src=\"https://maxcdn.bootstrapcdn.com/bootstrap/3.4.0/js/bootstrap.min.js\"></script>\n";
echo "\t</head>\n";
echo "\t<body>\n";
$search = strtolower($_GET['q']);
$counter = 0;
$mysql_clause = "";
$mysql_select = "";
$token = strtok($search, "\t\n\r\"\' !?#$%&|(){}[]*/+-:;<>=.,");
while ($token !== false) {
if ($counter == 0) {
$mysql_clause = "SELECT DISTINCT `webpage_id` FROM `occurrence` INNER JOIN `keyword` USING (`keyword_id`) WHERE `name` = '$token'";
$mysql_select = "(`name` = '$token')";
}
else {
$mysql_clause = "SELECT DISTINCT `webpage_id` FROM `occurrence` INNER JOIN `keyword` USING (`keyword_id`) WHERE `name` = '$token' AND `webpage_id` IN (" . $mysql_clause . ")";
$mysql_select = $mysql_select . " OR (`name` = '$token')";
}
$counter++;
$token = strtok("\t\n\r\"\' !?#$%&|(){}[]*/+-:;<>=.,");
};
if ($counter > 0)
{
$conn = mysqli_connect($servername, $username, $password, $dbname);
if (!$conn) {
die("Connection failed: " . mysqli_connect_error());
}
$statement = "SELECT DISTINCT `webpage_id`, `title`, `url`, `content`, AVG(`pagerank`) AS score FROM `occurrence` INNER JOIN `webpage` USING(`webpage_id`) INNER JOIN `keyword` USING(`keyword_id`) WHERE `webpage_id` IN (" . $mysql_clause . ") AND (" . $mysql_select . ") GROUP BY `webpage_id` ORDER BY score DESC LIMIT 100;";
$result = mysqli_query($conn, $statement);
if (mysqli_num_rows($result) > 0) {
while($row = mysqli_fetch_assoc($result)) {
echo "<div class=\"container-fluid\">" . $row["webpage_id"] . ". <b>" . $row["title"] . "</b> Score: " . $row["score"] . "<br />";
echo "<a href=\"" . $row["url"] . "\">" . $row["url"] . "</a><br />";
echo "<i>" . substr($row["content"], 0, 1024) . "</i></div><br />\n";
}
} else {
echo "0 results";
}
mysqli_close($conn);
}
echo "\t</body>\n";
echo "</html>\n";
?>
Conclusion and Points of Interest
I'm working on a Kotlin version of the Web Crawler, with the same implementation and functionality as the Python script.
Works Cited
- Radu G. Crețulescu, Text mining. Tehnici de clasificare și clustering al documentelor, Ed. Albastră, 2012;
- Smaranda Belciug, Marina Gorunescu, Data mining. Modele predictive și de clasificare, Ed. Albastră, 2012;
- Florin Gorunescu, Data Mining. Concepte, Modele și Tehnici, Ed. Albastră, 2006;
- Bing Liu, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, 2nd Edition, Springer, 2011;
- Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan, Searching the Web, ACM Transactions on Internet Technology, 2001, pp. 2-43;
- Junghoo Cho, Hector Garcia-Molina, Lawrence Page, Efficient crawling through URL ordering, Computer Networks, 1998, pp. 161-172;
- Monika R. Henzinger, Allan Heydon, Michael Mitzenmacher, Marc Najork, Measuring index quality using random walks on the Web, Computer Networks, 1999, pp. 1291-1303;
- Alexandros Ntoulas, Junghoo Cho, Christopher Olston, What’s New on the Web? The Evolution of the Web from a Search Engine Perspective, in Proceedings of the International Conference on World Wide Web, 2004;
- Dennis Fetterly, Mark Manasse, Marc Najork, Janet Wiener, A large-scale study of the evolution of Web pages, Software: Practice and Experience, 2004, pp. 213-237;
- Zdravko Markov, Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Central Connecticut State University, 2006;
- Soumen Chakrabarti, Martin van den Berg, Byron Dom, Focused crawling: a new approach to topic-specific Web resource discovery, Computer Networks, 1999, pp. 1623-1640;
- Michelangelo Diligenti, Frans Coetzee, Steve Lawrence, C. Lee Giles, Marco Gori, Focused crawling using context graphs, in Proceedings of the International Conference on Very Large Data Bases, 2000;
- Gautam Pant, Padmini Srinivasan, Learning to crawl: Comparing classification schemes, ACM Transactions on Information Systems, 2005, pp. 430-462;
- Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Stanford University, 1998;
- Chris Buckley, Implementation of the SMART information retrieval system, Technical Report 85-686, Cornell University, Ithaca, NY, 1985.
History
- December 9th, 2021: Initial version.
- December 20th, 2021: Added works cited
- January 14th, 2022: Added index.html and search.php
- December 23rd, 2022: Moved source code from GitLab to GitHub.
- March 23rd, 2023: Replaced NULL throughout the codebase with nullptr. This means that the minimum requirement for the application is now VC 2010.
- April 16th, 2023: Updated MFC application to work with the latest MySQL ODBC 8.0 Unicode Driver.
- May 27th, 2023: Added Python version of web crawler to Web Search Engine repository.
- June 22nd, 2023: Updated PJ Naughter's ODBCWrappers library to the latest version available.
- July 28th, 2023:
- Replaced old CHyperlinkStatic class with PJ Naughter's CHLinkCtrl library.
- Updated the Python script (frontier is now persistent in database) and the PHP search script.
- October 22nd, 2023:
- Switched to Visual Studio Enterprise 2022 (some changes were made in the source code).
- Added social media links: Twitter, LinkedIn, Facebook, and Instagram.
- Added shortcuts to GitHub repository's Issues, Discussions, and Wiki.
- January 1st, 2024: Updated PJ Naughter's ODBCWrappers library to the latest version available. Updated module to remove usage of _if_exists by now using ODBCVER and _ATL_MODULES preprocessor macro checks along with SFINAE.
- January 21st, 2024 (same version): Added ReleaseNotes.html and SoftwareContentRegister.html to the GitHub repo.