Introduction
We describe every concept with respect to an XML corpus, in which each document is represented by its full set of words. The term frequencies and inverse document frequencies of the words in a given document are used. One of the interesting features of Language Integrated Query (LINQ, pronounced "link") is LINQ to XML, which is used in this project. LINQ is a Microsoft .NET Framework component that adds native data querying capabilities to .NET languages. For this approach, we add using directives for the System.Linq, System.Xml.Linq, and System.Collections.Generic namespaces to the project.
Using the code
The primary classes used to build a new XML document are XDocument, XElement, and XAttribute. The .Elements() method looks only at the child nodes of the XElement being searched, whereas .Descendants() traverses all of the XML markup below the XElement being searched, looking for matching nodes.
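The difference between the two methods is easiest to see on a small tree. The following is a minimal sketch; the element names CORPUS, DOC, and TITLE and the class name are assumptions made for illustration and are not taken from the project's IR2.xml file.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class ElementsVsDescendants
{
    // Hypothetical sample corpus: two DOC elements, the first with a nested TITLE.
    public static XElement BuildSample()
    {
        return new XElement("CORPUS",
            new XElement("DOC", new XAttribute("ID", "1"),
                new XElement("TITLE", "linq to xml"),
                "querying xml with linq"),
            new XElement("DOC", new XAttribute("ID", "2"), "information retrieval"));
    }

    // Counts only the direct children of the given element.
    public static int CountChildren(XElement root) => root.Elements().Count();

    // Counts every element below the given element, at any depth.
    public static int CountDescendants(XElement root) => root.Descendants().Count();
}
```

On this sample, CountChildren sees the two DOC elements, while CountDescendants also counts the nested TITLE element.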
Document pre-processing consists of three tasks:
- Elimination of stop words (such as articles and connectives)
- Tokenization
- Stemming
This involves checking each term against a generic list of strings that holds the stop words. If the term does not occur in this list, it is kept as a token, and the tokens are then conflated to their root form by stemming (attachment -> attach). Tokenization uses the Regex and MatchCollection classes from the System.Text.RegularExpressions namespace with the pattern "\\w+", which matches runs of word characters: the digits 0-9, the letters A-Z and a-z, and the underscore (in .NET, \w also matches other Unicode letters and digits by default).
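The three tasks can be sketched together as one small pipeline. This is only an illustration: the stop-word list is a short hypothetical sample, and Stem is a naive suffix stripper standing in for a real stemming algorithm such as Porter's, which the article does not show. The sketch tokenizes first so that each token can be checked against the stop-word list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class Preprocessor
{
    // Hypothetical stop-word list; a real system would load a much longer one.
    private static readonly List<string> StopWords =
        new List<string> { "a", "an", "the", "and", "or", "of", "to", "is" };

    private static readonly Regex WordPattern = new Regex("\\w+");

    // Naive suffix stripper (an assumption for illustration, not a real stemmer).
    public static string Stem(string token)
    {
        foreach (string suffix in new[] { "ment", "ing", "s" })
        {
            if (token.Length > suffix.Length + 2 &&
                token.EndsWith(suffix, StringComparison.Ordinal))
            {
                return token.Substring(0, token.Length - suffix.Length);
            }
        }
        return token;
    }

    // Tokenize, lower-case, drop stop words, then stem what remains.
    public static List<string> Process(string text)
    {
        return WordPattern.Matches(text)
            .Cast<Match>()
            .Select(m => m.Value.ToLowerInvariant())
            .Where(t => !StopWords.Contains(t))
            .Select(t => Stem(t))
            .ToList();
    }
}
```

For example, Preprocessor.Process("The attachment of documents") drops "the" and "of" and returns the two tokens "attach" and "document".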
Document representation: The N distinct words across all documents are called index terms (or the vocabulary). Provided that every document is a single element directly under the root with no nested elements, the XElement.Descendants().Count() call returns the total number of documents.
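// Load the corpus and show the number of documents.
XElement xIr = XElement.Load(Server.MapPath("~\\IR2.xml"));
lblTotalDocs.Text = Convert.ToString(xIr.Descendants().Count());
// Tokenize the text of one document element (el) into words.
Regex regex = new Regex("\\w+");
MatchCollection matchCollection = regex.Matches(el.Value);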
Term Frequency (TF): The number of occurrences of each distinct term in the collection of documents. TF is sometimes also used to measure how often a word appears in a specific document. For every matched term, the number of repetitions is counted and the term is added to a generic Dictionary. The Dictionary generic class has two type parameters: TKey and TValue. It provides a mapping from a set of keys to a set of values. Each addition to the dictionary consists of a value and its associated key. For the purposes of enumeration, each item in the dictionary is treated as a KeyValuePair structure representing a value and its key.
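// Inverted index: term -> IDs of the documents that contain it.
Dictionary<string, List<string>> docsWord = new Dictionary<string, List<string>>();
string docId = el.Attribute("ID").Value;
// Add throws if the term is already a key, so check ContainsKey first when looping.
docsWord.Add(match.Value.ToLower(), new List<string> { docId });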
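Put into a loop over all documents, the idea might look like the sketch below. It assumes that each document is a direct child element of the corpus root and carries an ID attribute; the class and method names are hypothetical. TryGetValue is used here so that a term seen for the second time extends its existing list instead of triggering the exception that Add throws for a duplicate key.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class InvertedIndexBuilder
{
    // Maps each lower-cased term to the IDs of the documents containing it.
    public static Dictionary<string, List<string>> Build(XElement corpus)
    {
        var docsWord = new Dictionary<string, List<string>>();
        var regex = new Regex("\\w+");

        foreach (XElement el in corpus.Elements())
        {
            string docId = el.Attribute("ID").Value;
            foreach (Match match in regex.Matches(el.Value))
            {
                string term = match.Value.ToLower();
                // Create the document list the first time a term is seen.
                if (!docsWord.TryGetValue(term, out List<string> ids))
                {
                    ids = new List<string>();
                    docsWord.Add(term, ids);
                }
                // Record each document at most once per term.
                if (!ids.Contains(docId))
                {
                    ids.Add(docId);
                }
            }
        }
        return docsWord;
    }
}
```

With this layout, the document frequency of a term is simply the length of its list, for example index["xml"].Count.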
ZIPF’S LAW: Zipf’s observation on the distribution of words in natural languages is called Zipf’s law. It describes word behavior across an entire corpus: in natural language, there are a few very frequent terms and many very rare terms. According to Zipf’s law, the ith most frequent term has a frequency proportional to 1/i. That is, if we compute the frequencies of the words in a large corpus and arrange them in decreasing order of frequency, the product of a word's frequency and its rank (its position in the list) is approximately constant, so it is more or less the same for every word. The frequency of a word is therefore inversely proportional to its rank.
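Written as a formula, the statement reads as follows. The symbols are introduced here for illustration only: f_i stands for the frequency of the ith most frequent term and c for a corpus-dependent constant.

```latex
f_i \propto \frac{1}{i}
\quad\Longrightarrow\quad
f_i \cdot i \approx c
\quad\Longrightarrow\quad
\log f_i + \log i \approx \log c
```

The logarithmic form is convenient for a numerical check, because the sum of the log-frequency and the log-rank should stay roughly the same from one rank to another.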
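// Zipf check at rank 200: log(frequency) + log(rank) should be roughly constant.
double zipf = Math.Log(docFreqSorted.Take(200).Last().Value) + Math.Log(200);
// Count the words that occur in only one document and show them as a percentage.
int numWordsOccurOnly = docFreq.Count(p => p.Value.Equals(1));
lblOneDocWords.Text = Convert.ToString(numWordsOccurOnly) +
" (" + (numWordsOccurOnly * 100) / corpusFreq.Count() + " %)";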
Query Processing: Queries in structured retrieval can be either structured or unstructured. Unstructured retrieval works on unstructured text, by which we mean “raw” text without markup. With the XML file loaded as an XDocument object, a LINQ query uses the select clause to indicate what data we want returned. LINQ to XML returns a sequence of XElement objects that represent the XML element nodes matching our filter. An XML document is a database commonly represented as a tree structure and frequently used in information retrieval systems. Once these files are loaded into the LINQ to XML API, you can write queries over that tree. The query syntax is easier than XPath or XQuery for developers who do not use XPath or XQuery on a daily basis. XPath is a language that describes a way to locate and process items in XML documents by using an addressing syntax based on a path through the document's logical structure or hierarchy.
An inverted file contains records of the form <word, document>: for each term, we keep the list of the IDs of all documents in which the term occurs. The inverted file is stored in a generic Dictionary and sorted with the LINQ orderby clause (orderby is a LINQ query keyword, not a member of the Dictionary class). Query processing consists of two steps:
1. Identifying the documents containing the path of the query term.
2. Identifying the XML documents containing the raw text of the query term.
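Sorting an inverted file held in a Dictionary could be sketched as follows. The class and method names are hypothetical, and the two sort orders are only examples of what orderby can express; the article does not state which order the project uses.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class InvertedFileSorting
{
    // Entries in ascending term order, each with its list of document IDs.
    public static List<KeyValuePair<string, List<string>>> SortByTerm(
        Dictionary<string, List<string>> invertedFile)
    {
        var sorted = from entry in invertedFile
                     orderby entry.Key
                     select entry;
        return sorted.ToList();
    }

    // Terms ordered by document frequency (highest first), ties broken by term.
    public static List<string> TermsByDocumentFrequency(
        Dictionary<string, List<string>> invertedFile)
    {
        var terms = from entry in invertedFile
                    orderby entry.Value.Count descending, entry.Key
                    select entry.Key;
        return terms.ToList();
    }
}
```

Because a Dictionary does not guarantee any enumeration order, the sorted result is returned as a new list rather than written back into the dictionary.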
Let's consider a simple query that searches for documents mentioning the term “farzaneh”. Using lambda (method) syntax, the query can be written as follows:
xIr.Elements().Where(p => p.Value.ToLower().Contains("farzaneh")).Select(p => p.Attribute("ID").Value)
To retrieve the text of the document with a specific ID:
xIr.Elements().SingleOrDefault(p =>
p.Attribute("ID").Value.Equals("1")).Value
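The same two lookups can also be wrapped in small helper methods. The sketch below is an assumption-laden illustration (the class and method names are hypothetical): it uses query syntax for the term search, and it returns null instead of throwing when no document has the requested ID, since calling .Value on the result of SingleOrDefault fails if nothing matches.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class CorpusQueries
{
    // IDs of the documents whose text contains the term (case-insensitive).
    public static List<string> FindDocumentIds(XElement corpus, string term)
    {
        string needle = term.ToLower();
        var ids = from doc in corpus.Elements()
                  where doc.Value.ToLower().Contains(needle)
                  select (string)doc.Attribute("ID");
        return ids.ToList();
    }

    // Text of the document with the given ID, or null if there is none.
    public static string GetDocumentText(XElement corpus, string id)
    {
        XElement doc = corpus.Elements()
            .SingleOrDefault(p => (string)p.Attribute("ID") == id);
        return doc?.Value;
    }
}
```

The cast (string)doc.Attribute("ID") yields null when the attribute is missing, so elements without an ID do not cause an exception either.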
Lambda expressions are similar to anonymous methods. All lambda expressions use the lambda operator =>, which is read as “goes to”. The left side of the lambda operator specifies the input parameters, and the right side holds the expression or statement block. The expression can combine multiple conditions with && or || to form a conjunctive or disjunctive Boolean path query. To select the first N elements of an XML document, use the .Take(N) method. To select the last two elements, call Reverse() on the element sequence, which inverts the order of the elements, and then use Take(2) to return two elements (in reversed order).
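These operations might be combined as in the following sketch. The class and method names are hypothetical, and the documents are again assumed to be direct children of the root with an ID attribute.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SequenceSelection
{
    private static string Id(XElement doc) => (string)doc.Attribute("ID");

    private static bool Contains(XElement doc, string term) =>
        doc.Value.ToLower().Contains(term.ToLower());

    // IDs of the first n documents, in document order.
    public static List<string> FirstIds(XElement corpus, int n) =>
        corpus.Elements().Take(n).Select(d => Id(d)).ToList();

    // IDs of the last n documents; the second Reverse restores document order.
    public static List<string> LastIds(XElement corpus, int n) =>
        corpus.Elements().Reverse().Take(n).Reverse().Select(d => Id(d)).ToList();

    // Conjunctive query: documents containing both terms.
    public static List<string> ContainingBoth(XElement corpus, string a, string b) =>
        corpus.Elements()
              .Where(d => Contains(d, a) && Contains(d, b))
              .Select(d => Id(d))
              .ToList();

    // Disjunctive query: documents containing at least one of the terms.
    public static List<string> ContainingEither(XElement corpus, string a, string b) =>
        corpus.Elements()
              .Where(d => Contains(d, a) || Contains(d, b))
              .Select(d => Id(d))
              .ToList();
}
```

Reverse() here is the LINQ extension method on the sequence returned by Elements(); it does not modify the XML tree itself.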
Points of Interest
XML retrieval exploits the XML markup to identify and return the most relevant and specific documents in answer to a query.