Implementation of XML Information Retrieval by LINQ

farzaneh ansari

31 May 2012

Introduction

This article describes an XML corpus by its full set of words, taken from the tagged documents. Term frequencies, inverse document frequencies, and the words of each given document are used. Language Integrated Query (LINQ, pronounced "link") is a Microsoft .NET Framework component that adds native data querying capabilities to .NET languages. One of its interesting features, LINQ to XML, is used in this project. To use it, add using directives for the System.Linq, System.Xml.Linq, and System.Collections.Generic namespaces to the project.

Using the code

The primary classes for building a new XML document are XDocument, XElement, and XAttribute. The Elements() method looks for elements only among the direct children of the XElement being searched, whereas Descendants() traverses all of the XML markup below that XElement and looks for matching nodes at any depth.
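The difference between the two methods can be seen in a minimal sketch. The corpus layout used here (a `docs` root, `doc` elements with an `ID` attribute, and a `group` wrapper) is a hypothetical example, not the structure of the article's IR2.xml file.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ElementsVsDescendants
{
    // Hypothetical corpus: one <doc> directly under the root, one nested inside <group>.
    public static XElement BuildCorpus()
    {
        return new XElement("docs",
            new XElement("doc", new XAttribute("ID", "1"), "first document"),
            new XElement("group",
                new XElement("doc", new XAttribute("ID", "2"), "second document")));
    }

    public static void Main()
    {
        XElement root = BuildCorpus();
        // Elements("doc") sees only the direct children of root.
        Console.WriteLine(root.Elements("doc").Count());     // prints 1
        // Descendants("doc") also finds the <doc> nested inside <group>.
        Console.WriteLine(root.Descendants("doc").Count());  // prints 2
    }
}
```

If every document sits directly under the root, both calls return the same elements, and Elements() is the cheaper choice.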

Document pre-processing is divided into three tasks:

Elimination of stop words (such as articles and connectives)
Tokenization
Stemming

This involves checking each term against a generic list of strings (List<string>) that holds the stop words. A term that does not occur in this list is kept as a token, and tokens are then conflated to their root by stemming (attachment -> attach). Tokenization uses the Regex and MatchCollection classes from the System.Text.RegularExpressions namespace with the pattern "\\w+", which matches runs of word characters: letters, digits, and the underscore.
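The three pre-processing tasks can be sketched as follows. The stop-word list and the suffix-stripping stemmer are illustrative assumptions; the article does not say which stop words or which stemming algorithm it uses, and a real system would more likely use an established stemmer such as Porter's.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class Preprocessor
{
    // Illustrative stop-word list (assumption); a real list would be much longer.
    static readonly List<string> StopWords =
        new List<string> { "the", "a", "an", "and", "or", "of", "is" };

    // Toy suffix list (assumption): not a real stemming algorithm.
    static readonly string[] Suffixes = { "ment", "ing", "ed", "s" };

    // Strips the first matching suffix if at least three characters remain.
    public static string Stem(string word)
    {
        foreach (string suffix in Suffixes)
        {
            if (word.EndsWith(suffix, StringComparison.Ordinal) &&
                word.Length - suffix.Length >= 3)
                return word.Substring(0, word.Length - suffix.Length);
        }
        return word;
    }

    public static List<string> Process(string text)
    {
        return Regex.Matches(text, @"\w+")
            .Cast<Match>()                          // tokenization
            .Select(m => m.Value.ToLowerInvariant())
            .Where(t => !StopWords.Contains(t))     // stop-word elimination
            .Select(t => Stem(t))                   // stemming
            .ToList();
    }

    public static void Main()
    {
        // prints: attach, file, report
        Console.WriteLine(string.Join(", ",
            Process("The attachment and the files of a report")));
    }
}
```

For a long stop-word list, a HashSet<string> would make the membership check faster than List<string>.Contains.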

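Document representation: The N distinct words across all documents are called index terms (or the vocabulary). XElement.Descendants().Count() returns the total number of documents, provided that every descendant element of the root is a document element, because Descendants() counts elements at any depth.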

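C#

// Load the XML file (Server.MapPath resolves the path inside the ASP.NET application)
XElement xIr = XElement.Load(Server.MapPath("~\\IR2.xml"));
// Count all documents (assumes every descendant element of the root is a document)
lblTotalDocs.Text = Convert.ToString(xIr.Descendants().Count());
// \w+ matches runs of word characters: letters, digits, and the underscore
Regex regex = new Regex("\\w+");
// el is the XElement of the document currently being processed
MatchCollection matchCollection = regex.Matches(el.Value);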

Term Frequency (TF): The number of occurrences of each distinct term in the collection of documents. TF is sometimes also used to measure how often a word appears in a specific document. Every matched term is added to a generic Dictionary, taking its repetitions into account. The Dictionary<TKey, TValue> generic class has two type parameters, TKey and TValue, and provides a mapping from a set of keys to a set of values. Each addition to the dictionary consists of a value and its associated key. For the purposes of enumeration, each item in the dictionary is treated as a KeyValuePair structure representing a value and its key.

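C#

// Inverted index: term -> IDs of the documents in which it occurs
Dictionary<string, List<string>> docsWord = new Dictionary<string, List<string>>();
string docId = el.Attribute("ID").Value;
// First occurrence of a term; Add throws if the key is already present
docsWord.Add(match.Value.ToLower(), new List<string> { docId });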
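To show how such a dictionary might be filled for a whole corpus, here is a minimal sketch. The class name, the method names, and the two-document sample corpus are hypothetical. It assumes, as the snippets above do, that each document is a child element carrying an `ID` attribute.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class InvertedIndex
{
    // term -> IDs of the documents containing it (each ID listed once per term)
    public static Dictionary<string, List<string>> Build(XElement corpus)
    {
        var docsWord = new Dictionary<string, List<string>>();
        foreach (XElement el in corpus.Elements())
        {
            string docId = el.Attribute("ID").Value;
            foreach (Match match in Regex.Matches(el.Value, @"\w+"))
            {
                string term = match.Value.ToLower();
                List<string> ids;
                // Create the posting list only the first time a term is seen.
                if (!docsWord.TryGetValue(term, out ids))
                {
                    ids = new List<string>();
                    docsWord.Add(term, ids);
                }
                if (!ids.Contains(docId)) ids.Add(docId);
            }
        }
        return docsWord;
    }

    // term -> total number of occurrences in the whole corpus
    public static Dictionary<string, int> CorpusFrequency(XElement corpus)
    {
        return corpus.Elements()
            .SelectMany(el => Regex.Matches(el.Value, @"\w+").Cast<Match>())
            .GroupBy(m => m.Value.ToLower())
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        XElement corpus = XElement.Parse(
            "<docs><doc ID='1'>xml retrieval with linq</doc>" +
            "<doc ID='2'>linq to xml and xml trees</doc></docs>");
        Console.WriteLine(string.Join(",", Build(corpus)["xml"]));  // prints 1,2
        Console.WriteLine(CorpusFrequency(corpus)["xml"]);           // prints 3
    }
}
```

The length of each posting list gives the document frequency of a term, which is the quantity needed for inverse document frequency.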

Zipf's Law: Zipf's observation on the distribution of words in natural languages is called Zipf's law. It describes word behavior across an entire corpus: in natural language there are a few very frequent terms and a great many rare ones. According to Zipf's law, the i-th most frequent term has a frequency proportional to 1/i, so the frequency of a word is inversely proportional to its rank. Equivalently, if we compute the frequencies of the words in a large corpus and arrange them in decreasing order of frequency, the product of a word's frequency and its rank (its position in the list) is approximately the same for every word.
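A small sketch makes the rank-times-frequency relationship concrete. The helper name and the frequency values are made up for illustration: the counts are idealized so that the product comes out exactly constant, which real corpora only approximate.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ZipfCheck
{
    // Returns rank * frequency for each term, most frequent term first.
    public static List<long> RankTimesFrequency(IDictionary<string, int> freq)
    {
        return freq.Values
            .OrderByDescending(f => f)
            .Select((f, i) => (long)(i + 1) * f)   // rank is 1-based
            .ToList();
    }

    public static void Main()
    {
        // Idealized, made-up counts that follow frequency = 1200 / rank.
        var freq = new Dictionary<string, int>
        {
            { "the", 1200 }, { "of", 600 }, { "and", 400 }, { "to", 300 }
        };
        // prints: 1200, 1200, 1200, 1200
        Console.WriteLine(string.Join(", ", RankTimesFrequency(freq)));
    }
}
```

Taking logarithms turns the constant product into a constant sum, log(frequency) + log(rank), which is easier to compare across ranks.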

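C#

// Zipf's law: log(frequency) + log(rank) should be roughly constant; computed here at rank 200
double zipf = Math.Log(docFreqSorted.Take(200).Last().Value) + Math.Log(200);
// Number of words that occur in exactly one document
int numWordsOccurOnly = docFreq.Count(p => p.Value.Equals(1));
// Show the count followed by its percentage of all distinct terms
lblOneDocWords.Text = Convert.ToString(numWordsOccurOnly) +
    " (" + (numWordsOccurOnly * 100) / corpusFreq.Count() + " %)";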

Query Processing: Queries in structured retrieval can be either structured or unstructured. Unstructured retrieval works on "raw" text, by which we mean text without markup.

An XML document is a database commonly represented as a tree structure, and it is frequently used in information retrieval systems. Once the XML file is loaded into the LINQ to XML API as an XDocument object, you can write queries over that tree. The select clause of a LINQ expression indicates what data we want returned, and LINQ to XML returns a sequence of XElement objects representing each XML element node that matches our filter. The query syntax is easier than XPath or XQuery for developers who do not use XPath or XQuery on a daily basis. XPath is a language that describes a way to locate and process items in XML documents by using an addressing syntax based on a path through the document's logical structure or hierarchy.

An inverted file contains records of the form <word, document>: for each term we keep the list of IDs of all documents in which the term occurs. The inverted file is stored in a generic Dictionary and sorted with the LINQ OrderBy method.

Query processing consists of two steps:

1. Identifying the documents containing the path of the query term.
2. Identifying the XML documents containing the raw text of the query term.
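As an illustration of the select clause in query syntax, the following sketch returns the IDs of the documents whose raw text contains a term. The class name, the method name, and the sample corpus are hypothetical. It assumes each document is a child element with an `ID` attribute.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class QuerySyntaxSearch
{
    // Returns the IDs of all documents whose text contains the term (case-insensitive).
    public static List<string> FindDocIds(XElement corpus, string term)
    {
        string needle = term.ToLower();
        var query = from doc in corpus.Elements()
                    where doc.Value.ToLower().Contains(needle)
                    orderby (string)doc.Attribute("ID")
                    select (string)doc.Attribute("ID");
        return query.ToList();
    }

    public static void Main()
    {
        XElement corpus = XElement.Parse(
            "<docs>" +
            "<doc ID='1'>written by Farzaneh</doc>" +
            "<doc ID='2'>xml retrieval</doc>" +
            "<doc ID='3'>farzaneh uses linq</doc>" +
            "</docs>");
        // prints 1,3
        Console.WriteLine(string.Join(",", FindDocIds(corpus, "farzaneh")));
    }
}
```

The compiler translates this query expression into the same Where, OrderBy, and Select method calls that can be written directly with lambda expressions.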

Let's consider a simple query that searches for documents mentioning the term "farzaneh". This query can be expressed with lambda expressions (method syntax) as follows:

C#

xIr.Elements().Where(p => p.Value.ToLower().Contains("farzaneh")).Select(p => p.Attribute("ID").Value)

To retrieve the text of the document with a specific ID:

C#

// Text of the document whose ID is "1", or null if no such document exists
(string)xIr.Elements().SingleOrDefault(p =>
    (string)p.Attribute("ID") == "1")

Lambda expressions are similar to anonymous methods. All lambda expressions use the lambda operator =>, which is read as "goes to". The left side of the lambda operator specifies the input parameters and the right side holds the expression or statement block. The body can combine multiple conditions with && or || to express a conjunctive or disjunctive Boolean path query. To select the first N elements of an XML document, use the Take(N) method. To select the last two elements, call Reverse(), which inverts the order of the elements in a sequence, and then Take(2) to return two elements.
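The Boolean conditions and the Reverse/Take combination can be sketched as follows. The class, the helper methods, and the three-document corpus are hypothetical examples. Note that Reverse().Take(2) yields the last two documents in reverse document order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class BooleanQueries
{
    static bool Has(XElement doc, string term)
    {
        return doc.Value.ToLower().Contains(term);
    }

    static List<string> Ids(IEnumerable<XElement> docs)
    {
        return docs.Select(p => (string)p.Attribute("ID")).ToList();
    }

    // Conjunctive query: documents containing both terms.
    public static List<string> And(XElement corpus, string a, string b)
    {
        return Ids(corpus.Elements().Where(p => Has(p, a) && Has(p, b)));
    }

    // Disjunctive query: documents containing at least one of the terms.
    public static List<string> Or(XElement corpus, string a, string b)
    {
        return Ids(corpus.Elements().Where(p => Has(p, a) || Has(p, b)));
    }

    // Last two documents, last one first.
    public static List<string> LastTwo(XElement corpus)
    {
        return Ids(corpus.Elements().Reverse().Take(2));
    }

    public static void Main()
    {
        XElement corpus = XElement.Parse(
            "<docs><doc ID='1'>xml retrieval</doc>" +
            "<doc ID='2'>linq queries</doc>" +
            "<doc ID='3'>linq to xml</doc></docs>");
        Console.WriteLine(string.Join(",", And(corpus, "xml", "linq")));  // prints 3
        Console.WriteLine(string.Join(",", Or(corpus, "xml", "linq")));   // prints 1,2,3
        Console.WriteLine(string.Join(",", LastTwo(corpus)));             // prints 3,2
    }
}
```

Calling Reverse() once more after Take(2) restores document order if that matters for the result.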

Points of Interest

XML retrieval exploits the XML markup to identify and return the most relevant and specific documents in answer to a query.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)