Introduction
A search for 'Data, Wiki' yields:
"Data refers to information or facts usually collected as the result of experience, observation, or experiment, or processes within a computer system, or premises. Data may consist of numbers, words, or images, particularly as measurements or observations of a set of variables. Data is often viewed as the lowest level of abstraction from which information and knowledge are derived."
"Data is king", wrote a famous author when starting to explain LINQ. As MSDN suggests, to explain it in simple terms, LINQ is a general-purpose query facility (or feature) for querying data: not just relational data or XML, but all sources of informational data. Over the past few years, we have had several LINQ flavors: LINQ to SQL, LINQ to XML, LINQ to CSV files, LINQ to text files, etc. Each of these targets a specific medium (or form) of data. The goal of LINQ is to provide a common, simple interface to query data.
Today, data comes in many different formats (database, XML, raw, text, binary, RSS feeds) through a variety of channels such as TCP, UDP, HTTP, and FTP. 'LINQ to www' is LINQ to query data from complying (REST like) web sites. To give an example: let's say you have a favorite car listing web site that displays thousands of pages of car listings. Using the LINQ2www tool, you can query that web site to get the desired information. For example, you can query for cars priced greater than $15,000 but less than $30,000. This means you do not need to browse manually through hundreds of consecutive web pages to extract the list of cars that match your interest ($15,000 to $30,000). All you need to do is write an appropriate LINQ query to extract this information. The same principle can be applied to pages of financial data, corporate accounts data, a web telephone directory, and so on.
How would you like to read?
It is best to read the article in an orderly fashion. However, not everyone is in the same situation. So, here is a brief guideline to get what you are looking for quickly:
- If you are a good reader who is curious to see how this is done -- Welcome, just skip this section and continue reading on from the next section.
- If you are a geek who can understand an article's content by just reading the title, and all you need to know today is just: how to use the LINQ2www assembly -- Jump to the 'Using the code' section. You may be interested in the 'Points of interest' section too.
- If you are a management person who would like to see a working example, and all you are looking for is a demonstration of the usage of this library -- Jump to 'The Web spider application' section: A WPF based Web spider to model data into graphical 3D information. Download the demo and just press the 'Go' button.
- If you do not belong to any of the above three interest categories, just leave me a message using the link at the bottom of this article. Well, don't forget to detail your interest category.
The Web spider application
Prerequisite: To run the demo, you need .NET Framework 3.5 installed on your machine. If it is not already installed, you may download it for free from here: http://www.microsoft.com/downloads/details.aspx?FamilyId=333325FD-AE52-4E35-B531-508D977D32A6&displaylang=en
The web spider is a sample WPF application that uses the LINQ2www assembly. This application crawls through the Who's who @ CodeProject, page by page. To see a demo of this application, download the demo project from the first line of this article, run the application, and press the 'Go' button. Give it 2 to 3 minutes to see the flow of data from the CodeProject server to your computer's 3D bar chart. You can observe a continuous update in the status bar at the bottom of the WPF application.
What is REST and its relationship to this library?
'REpresentational State Transfer' is a style of software architecture for systems such as the World Wide Web. All credits go to the doctoral dissertation of Roy Fielding (Reference 1). LINQ2www is a tool capable of querying REST based (complying) web sites. For example, consider the first page of 'Who's who @ CodeProject': http://www.codeproject.com/script/Membership/Profiles.aspx?ml_ob=MemberId&mgtid=-1&mgm=False&pgnum=1.
Now, to get the second page, all we need to do is replace page number 1 with page number 2. I.e., pgnum=1 should be replaced with pgnum=2 to get to the next page, and so on. This is a RESTful web interface.
Using the above web link, we can browse up to 1000 members from page 1 to page 1000. Now, suppose your requirement is to get just the list of Gold members from these 1000 web pages. Manually browsing (traversing) through all 1000 pages is tiresome and error prone. Writing a small program to perform this operation automatically is a good idea. Generalizing such a program so that it can solve similar problems is the next step. How about generalizing to this extent: all you have to do is write 'two lines of code' to get all Gold members from 1000 web pages? This is the power of LINQ. We are specializing LINQ to achieve our 'two lines of code' goal, and the result is called LINQ2www - LINQ 2 World Wide Web. One more important point: we do not check for, or need, a 100% REST-compliant web site. LINQ2www can work with REST like web links. The basic requirement for LINQ2www to work is that all the pages are linked to each other through an href link. Even if the site does not comply with REST in other areas, LINQ2www will work.
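The page-number substitution described above can be sketched in a few lines. This is only an illustration: the PageLinks class and its Build method are hypothetical names and are not part of the LINQ2www assembly, which discovers the next page by following href links rather than by counting.

```csharp
using System;
using System.Collections.Generic;

public static class PageLinks
{
    // Hypothetical helper: appends each page number to a link template ending in "pgnum=".
    public static IEnumerable<string> Build(string template, int firstPage, int lastPage)
    {
        for (int page = firstPage; page <= lastPage; page++)
        {
            yield return template + page;
        }
    }

    public static void Main()
    {
        string template = "http://www.codeproject.com/script/Membership/Profiles.aspx" +
                          "?ml_ob=MemberId&mgtid=-1&mgm=False&pgnum=";

        // Prints the links for pages 1, 2 and 3.
        foreach (string link in Build(template, 1, 3))
        {
            Console.WriteLine(link);
        }
    }
}
```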
How LINQ works - Basic understanding
Understanding how LINQ works is necessary to specialize it to our needs. Let us start with a very simple example. Suppose we have a pipe that carries some liquid: Green, Red, and Blue liquids, mixed, flowing in this pipe. Someone sends this mixed liquid through the pipe continuously. All you know from your side is that if you open the faucet (tap), you will get this mixed liquid flowing. Suppose you need just the Green liquid. You prepare a filter that lets only the Green liquid pass. Now, when we fix this filter to the faucet (tap), only the Green liquid flows out of the pipe. This filter is LINQ, and the mixed liquid is raw data. LINQ helps you extract information from a raw flow of data. Based on the liquid type, you will need a different filter. Similarly, based on the variety of data, you will need a different LINQ. That is how we have LINQ to SQL, LINQ to XML, etc.
How LINQ works - A little LINQ code
Let us do a small code example. Suppose we have a list of names, and all we need is the names that start with 'S'. Let us write the one-line LINQ code to get what we want.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace TestLinq4
{
    class Program
    {
        static void Main(string[] args)
        {
            List<string> listOfNames = new List<string>()
            {
                "Nina", "Kyle", "Steven",
                "Joe", "Neal", "Sanjay", "Elijah",
                "Steen", "Stech", "Donn",
                "Thomas", "Peter", "Steinberg"
            };
            // The query is only defined here; nothing is filtered yet.
            var Result = from Name in listOfNames where Name.StartsWith("S") select Name;
            // Enumerating the result executes the query.
            foreach (string NameReturned in Result)
            {
                System.Console.WriteLine(NameReturned);
            }
        }
    }
}
Here, all we do is write a simple LINQ statement to get a query result. When we enumerate the Result using foreach, the query actually gets executed. Each item is tested against our where condition. If it passes the where condition, it is selected and returned to the NameReturned variable in the foreach.
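The fact that the query runs only when it is enumerated can be made visible with a small sketch. The DeferredDemo class below is a made-up name for illustration; it adds a name to the list after the query is defined but before it is enumerated, and the added name still shows up in the result.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredDemo
{
    public static List<string> Run()
    {
        List<string> names = new List<string> { "Nina", "Steven", "Joe" };

        // Defining the query does not run it yet.
        IEnumerable<string> query = from name in names
                                    where name.StartsWith("S")
                                    select name;

        // Added after the query was defined, but before it is enumerated.
        names.Add("Sanjay");

        // Enumeration (here through ToList) is what actually executes the filter.
        return query.ToList();
    }

    public static void Main()
    {
        // Prints "Steven" and then "Sanjay".
        foreach (string name in Run())
        {
            Console.WriteLine(name);
        }
    }
}
```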
How LINQ works - Where is the 'where' method?
Let's take a closer look. Several questions may arise when seeing the above code for the first time. One important question may be: where is the where method? And where is the select method, and what is going on? Where is a method available on any IEnumerable<T>; it is defined in the static System.Linq.Enumerable class. It is a special kind of method called an 'Extension Method'. As the name suggests, you can extend a type (even a sealed class) by writing static methods. We will see how to write one soon. Where is an important method, as this is our filter.
First, let us see what an extension method is and how to write one. Then, we will supply our own version of the default where method (this article calls it an override, or specialization) to add an additional feature to it.
Consider a well known .NET class - String. As we know, String is a sealed class. This means we cannot specialize (derive from) the String class. But we can extend the String class by adding an Extension Method. This extension method will appear as if it is an exposed method of the String class. Let us write one now.
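public static class StringExtension
{
    // The 'this' modifier on the first parameter makes this an extension method of string.
    public static bool CompareCaseInsensitive(this string strSource, string strTarget)
    {
        // The third argument (true) asks string.Compare to ignore case.
        return string.Compare(strSource, strTarget, true) == 0;
    }
}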
Note that the extension method is defined inside a static class, StringExtension. Also note that the static extension method's first parameter starts with the this keyword, i.e., it is extending the String class. The picture below shows the intellisense when we have the above extension method defined.
After learning about extension methods, it is not difficult to figure out that where is an extension method for the IEnumerable<T> interface. Now, let us see how our LINQ statement is converted, which improves our understanding:
var Result = from Name in listOfNames where Name.StartsWith("S") select Name;
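is converted to:
var Result = listOfNames.Where(delegate(string item) { return item.StartsWith("S"); });
Because the select clause simply returns the range variable unchanged, the compiler does not emit a separate Select call after Where. With a non-trivial projection (for example, select Name.ToUpper()), a call to Select would follow.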
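The same method call is more commonly written with a lambda expression than with an anonymous delegate. The sketch below (TranslationDemo is a made-up name) runs the query syntax and the method syntax side by side and checks that both produce the same sequence.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TranslationDemo
{
    public static bool SameResults()
    {
        List<string> names = new List<string> { "Nina", "Steven", "Joe", "Sanjay" };

        // Query syntax.
        IEnumerable<string> viaQuery = from name in names
                                       where name.StartsWith("S")
                                       select name;

        // Method syntax with a lambda expression.
        IEnumerable<string> viaMethod = names.Where(name => name.StartsWith("S"));

        return viaQuery.SequenceEqual(viaMethod);
    }

    public static void Main()
    {
        Console.WriteLine(SameResults());
    }
}
```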
LINQ2www - Challenges
After getting enough background, let us see the challenges in writing LINQ to www (World Wide Web).
1. The data is not readily available to filter. Consider browsing through thousands of web pages. The library first needs to fetch each web page and then filter the page's data using the filter condition (the where condition in the LINQ statement). This may block the query execution for a long period of time, which is not acceptable.
2. Suppose everything in (1) goes fine and the user is observing a continuous flow of filtered information. Now the user thinks she has enough information and would like to stop the update. This can happen very often, not only because a continuous update from a web site is time consuming, but also because the information already obtained will be sufficient to make a decision.
3. We do not have a standard language to query this type of data. Consider databases; we have a SQL like language to query them. But what about HTML data from web sites? We do not have a standard query language. However, if it is a REST like page, we can expect it to be of a certain standard format.
4. We should provide the interface in such a way that more than one query can execute simultaneously using one HTTP connection. This improves performance on the client side and conserves server resources.
Using the code
Let us see how to use the code. This will help us to understand how some of the challenges are met.
Let us take CodeProject's Who's who web link for our example. Consider our goal is to fetch different types of membership statuses available in CodeProject. The code to perform this query is as below:
Linq2www linq2wwwUrl = new Linq2www(
    "http://www.codeproject.com/script/" +
    "Membership/Profiles.aspx?mgtid=1&%3bmgm=True" +
    "&ml_ob=MemberId&mgm=False&pgnum=1",
    "http://www.codeproject.com/script/Membership/Profiles.aspx?" +
    "mgtid=1&%3bmgm=True&ml_ob=MemberId&mgm=False&pgnum=");
int CancelId = from webItem in linq2wwwUrl
               where webItem.GetMatchDyn(
                   @"class=""Member(?<name>.*?)"">(\k<name>)",
                   this.CallThisMethod)
               select webItem;
The constructor in the first statement takes two arguments. The first argument is the starting web address. The second is an optional template: after the first page is fetched, the library searches it for a next-page link that looks like this template.
The next statement is a slightly different LINQ statement compared to regular ones. As you might have assumed, we are using a Regular Expression to extract the desired information from this web site. Another difference is that we pass in a callback method to be called with the updates. This means that, while traversing the web pages like a web spider/crawler, whenever the library finds anything that matches this Regular Expression, it calls the supplied callback method (CallThisMethod). This method will be called continuously until all the pages and their links are visited. However, the user can cancel this query at any time using the integer return value, CancelId. The code for doing this is:
linq2wwwUrl.CancelUpdate(CancelId);
So now, we know how challenges (1), (2), and (3) mentioned in the previous section are resolved. We are using a callback method to update the query caller. This will enable the user to cancel the update whenever it is necessary. All we need out of this LINQ query is filtered information, which is received asynchronously. As we do not have any standard language to query HTML data, we use a Regular Expression, which is a powerful tool to query any raw, text like data.
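To see what the Regular Expression in the query extracts, here is a minimal sketch that runs the same pattern over a made-up fragment of markup and reports each match to a callback. The ReportMatches method, the Action<string> callback type, and the sample HTML are assumptions for illustration only; the actual signature of GetMatchDyn's callback is not shown in this article.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexCallbackDemo
{
    // Hypothetical stand-in for the matching step: runs the pattern over one page
    // and passes every value of the "name" group to the callback.
    public static void ReportMatches(string pageContent, string pattern, Action<string> callback)
    {
        foreach (Match match in Regex.Matches(pageContent, pattern))
        {
            callback(match.Groups["name"].Value);
        }
    }

    public static List<string> Run()
    {
        // Made-up markup, only shaped like a membership listing.
        string page = "<span class=\"MemberGold\">Gold</span>" +
                      "<span class=\"MemberSilver\">Silver</span>";

        // Same pattern as in the query above: the text after "Member" in the class
        // attribute must be repeated as the element text (the \k<name> backreference).
        string pattern = @"class=""Member(?<name>.*?)"">(\k<name>)";

        List<string> found = new List<string>();
        ReportMatches(page, pattern, found.Add);
        return found;
    }

    public static void Main()
    {
        // Prints "Gold" and then "Silver".
        foreach (string status in Run())
        {
            Console.WriteLine(status);
        }
    }
}
```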
LINQ2www - Override of 'where' method
Why is this LINQ a different flavor? LINQ2www is a different flavor because we do two unconventional things in it, which are explained in this section.
The where method is the only method we override in LINQ2www. The purpose of the where method is a little unconventional here. The where method just sets the condition and the callback. The unconventional part is that it does not enumerate through the data. This is because, as you know, the complete data is not available at the time of the where method's invocation. However, it sets the necessary information in the class so that the callback will start receiving the filtered information. The implementation of the where method is as follows:
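public static class LinqExtnsnOvrride
{
    // Returns an int (the cancel ID) instead of a filtered sequence.
    public static int Where<Linq2www>(
        this IEnumerable<Linq2www> enumLinq2www, Func<Linq2www, int> predicate)
    {
        // Position on the first item in the collection.
        enumLinq2www.GetEnumerator().Reset();
        Linq2www Item = enumLinq2www.First();
        // Invoking the predicate runs GetMatchDyn, which registers the
        // Regular Expression and the callback and returns the cancel ID.
        return predicate(Item);
    }
}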
The second unconventional thing about the where method is that it returns an integer. This is the ID that needs to be passed to stop getting updates in the callback method. The first two lines of the method body position on the first item in the collection. The next line invokes the predicate (the expression in the where clause, which calls GetMatchDyn), thereby setting the Regular Expression to filter with and the callback function.
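The reason such a query compiles is that the compiler turns the where clause into a plain call to a method named Where and accepts whatever that method returns. The sketch below shows this with made-up types: FakeSource stands in for the Linq2www class and Register for GetMatchDyn. For brevity it extends FakeSource directly rather than IEnumerable, which is a simplification and not how the article's code is written.

```csharp
using System;

// Hypothetical stand-in for the article's Linq2www class.
public class FakeSource
{
    private int nextId = 1;
    public string LastPattern;

    // Hypothetical stand-in for GetMatchDyn: stores the pattern and returns a cancel ID.
    public int Register(string pattern)
    {
        LastPattern = pattern;
        return nextId++;
    }
}

public static class FakeSourceExtensions
{
    // A Where that returns int: it never enumerates, it only invokes the "predicate" once.
    public static int Where(this FakeSource source, Func<FakeSource, int> predicate)
    {
        return predicate(source);
    }
}

public static class WhereDemo
{
    public static int Run(FakeSource source)
    {
        // Translated by the compiler to: source.Where(item => item.Register("Member.*"))
        int cancelId = from item in source
                       where item.Register("Member.*")
                       select item;
        return cancelId;
    }

    public static void Main()
    {
        FakeSource source = new FakeSource();
        Console.WriteLine(Run(source));          // the ID returned by Register
        Console.WriteLine(source.LastPattern);   // the pattern that was registered
    }
}
```

Note that this sketch deliberately has no using System.Linq directive, so the only Where method in scope is the custom one.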
LINQ2www - Putting it all together
Let us put together everything we have walked through to explain the LINQ2www class. Let us follow a top-down approach.
Linq2www linq2wwwUrl = new Linq2www(WebLink, webLinkTemplate);
When the LINQ2www constructor is called, we create a background thread. The job of this thread is to get the content of the web link. We store it in a multimap (Reference 3). Multimap is a sophisticated dictionary like collection class. It can store key-value pairs. First, we store the web link and its content in the multimap. Next, we parse (traverse through the contents of) the page to find the connecting next page. If the optional second parameter is provided, we use that as a template to find the next page. Otherwise, we use the web link to create a connecting template. Once we find the next page, we do the same thing again: store the link and contents in the multimap. Then, we parse through the contents to find the next link. We do this until there are no more *new* links to browse.
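The fetch, store, and find-next-link loop described above can be sketched as follows. This is not the library's implementation: it uses a plain Dictionary instead of the multimap, runs on the calling thread instead of a background thread, and takes a hypothetical fetch delegate in place of a real HTTP request so that the example can run without a network. All names in it are made up.

```csharp
using System;
using System.Collections.Generic;

public static class CrawlSketch
{
    // Follows "next page" links until no unseen link is found.
    // fetch is a hypothetical page loader; a real one would issue an HTTP request.
    public static Dictionary<string, string> Crawl(
        string startLink, string linkTemplate, Func<string, string> fetch)
    {
        Dictionary<string, string> pages = new Dictionary<string, string>();
        string current = startLink;
        while (current != null && !pages.ContainsKey(current))
        {
            string content = fetch(current);
            pages[current] = content;                 // store the link and its content
            current = FindNextLink(content, linkTemplate, pages);
        }
        return pages;
    }

    // Returns the first link in the content that starts with the template and is new.
    private static string FindNextLink(
        string content, string linkTemplate, Dictionary<string, string> seen)
    {
        int position = content.IndexOf(linkTemplate, StringComparison.Ordinal);
        while (position >= 0)
        {
            int end = content.IndexOf('"', position);   // assume the link ends at a quote
            if (end < 0) end = content.Length;
            string link = content.Substring(position, end - position);
            if (!seen.ContainsKey(link)) return link;
            position = content.IndexOf(linkTemplate, position + 1, StringComparison.Ordinal);
        }
        return null;
    }

    // Crawls a made-up three-page site held in memory and returns the page count.
    public static int Demo()
    {
        Dictionary<string, string> site = new Dictionary<string, string>
        {
            { "p?n=1", "<a href=\"p?n=2\">next</a>" },
            { "p?n=2", "<a href=\"p?n=1\">prev</a> <a href=\"p?n=3\">next</a>" },
            { "p?n=3", "<a href=\"p?n=2\">prev</a>" }
        };
        Dictionary<string, string> pages =
            Crawl("p?n=1", "p?n=", link => site.ContainsKey(link) ? site[link] : "");
        return pages.Count;
    }

    public static void Main()
    {
        Console.WriteLine(Demo());
    }
}
```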
Shown below is the second statement of the sample explained above. This statement actually fetches useful filtered information for us from the whole data. As you will notice, we are passing two parameters to GetMatchDyn. The first parameter is the Regular Expression - the filter. The second parameter is the callback. This callback receives the filtered information continuously.
int CancelId = from webItem in linq2wwwUrl where
webItem.GetMatchDyn(MyRegularExpression, this.CallThisMethod) select webItem;
Let us see how we do it. In the previous section, we saw the where method override. Our overridden where is called when the above statement is executed. As we saw in that section, the where method calls the delegate, which in turn calls GetMatchDyn. GetMatchDyn creates a thread. This thread reads the data stored in the multimap. It moves (enumerates) item by item through the multimap to read each web link's data, and filters the data using the Regular Expression passed to GetMatchDyn by the caller. Whenever the Regular Expression matches, it calls the callback method passed in by the user. Remember, this is the second parameter to the GetMatchDyn method.
The above picture explains what we described in this section.
Last but not least, we should provide a method to cancel the LINQ call. As we saw before, since this is a continuous update from HTTP, it can be really time consuming. The user should be able to cancel the update anytime. This can be performed easily by using the return value (integer) from the LINQ call we made. The line below does this:
linq2wwwUrl.CancelUpdate(CancelId);
This simple method is defined as below:
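public bool CancelUpdate(int ThreadId)
{
    bool retVal = false;
    // Look up the Regular Expression registered for this ID.
    Regex regDet = threadDetails.GetFirstItem(ThreadId);
    if (regDet != null)
    {
        retVal = threadDetails.Remove(ThreadId);
    }
    // Wake the waiting worker threads so they notice the removal.
    lock (this) Monitor.PulseAll(this);
    return retVal;
}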
All we do is remove the Regular Expression that we stored when the GetMatchDyn method was called previously. Then we trigger the thread created by the GetMatchDyn method. This thread, when trying to read the Regular Expression that it is tied to, will get a null value, because we removed it just before triggering the thread. The thread that was created by the GetMatchDyn method will then close down gracefully. Hence, the callback will stop receiving any more updates.
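The remove-then-pulse idea can be sketched in isolation as below. This is an illustration under stated assumptions, not the library's code: CancelSketch, Start, and WaitForExit are made-up names, a Dictionary replaces threadDetails, and the worker only waits instead of scanning pages. The sketch also locks on a private object rather than on this, which is a common precaution so that outside code cannot take the same lock.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public class CancelSketch
{
    private readonly object gate = new object();
    private readonly Dictionary<int, string> patterns = new Dictionary<int, string>();
    private readonly Dictionary<int, Thread> workers = new Dictionary<int, Thread>();
    private int nextId = 1;

    // Registers a pattern, starts a worker tied to it, and returns the cancel ID.
    public int Start(string pattern)
    {
        lock (gate)
        {
            int id = nextId++;
            patterns[id] = pattern;
            Thread worker = new Thread(() => WorkerLoop(id));
            worker.IsBackground = true;
            workers[id] = worker;
            worker.Start();
            return id;
        }
    }

    private void WorkerLoop(int id)
    {
        lock (gate)
        {
            // Keep waiting while the pattern is still registered. A real worker
            // would scan newly fetched pages each time it wakes up.
            while (patterns.ContainsKey(id))
            {
                Monitor.Wait(gate);
            }
        }
    }

    // Removes the pattern and wakes all workers so the affected one can exit.
    public bool CancelUpdate(int id)
    {
        lock (gate)
        {
            bool removed = patterns.Remove(id);
            Monitor.PulseAll(gate);
            return removed;
        }
    }

    // Waits up to timeoutMs for the worker with the given ID to finish.
    public bool WaitForExit(int id, int timeoutMs)
    {
        Thread worker;
        lock (gate)
        {
            if (!workers.TryGetValue(id, out worker)) return false;
        }
        return worker.Join(timeoutMs);
    }

    public static void Main()
    {
        CancelSketch sketch = new CancelSketch();
        int id = sketch.Start("Member.*");
        Console.WriteLine(sketch.CancelUpdate(id));       // pattern was removed
        Console.WriteLine(sketch.WaitForExit(id, 5000));  // worker has shut down
    }
}
```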
Points of interest
- This project can be easily extended to search any type of web page, not just REST like www.
- If someone creative can come up with an easy language (easier than Regular Expressions) to query HTML data, this project can be extended to support that language by implementing an IQueryable interface.
- LINQ2www is a kind of web robot, so we need to comply with certain standards (Reference 5).
- Unfortunately, we use Regular Expressions as the tool to filter data, and not everyone is familiar with Regular Expressions. In case you are new to them, Reference 6 is a good introductory read. After reading that, you may consider reading Reference 7 or 8. I found Reference 7 to be a useful quick reference.
- The WPF 3D bar chart used in the sample application source code is also available in CodeProject - Reference 4.
- The multimap source code is available in Reference 3.
Important reminder!
I would like to hear from you guys on your feedback. Please leave a detailed message irrespective of your vote. Thanks!
References
1. http://en.wikipedia.org/wiki/Representational_State_Transfer
2. http://msdn.microsoft.com/en-us/library/bb308959.aspx
3. http://www.codeproject.com/KB/cs/MultiMap_P_2.aspx
4. http://www.codeproject.com/KB/WPF/WPF_3D_Bar_chart_control.aspx
5. http://en.wikipedia.org/wiki/Representational_State_Transfer
6. http://www.radsoftware.com.au/articles/regexlearnsyntax.aspx
7. http://www.boost.org/doc/libs/1_38_0/libs/regex/doc/html/index.html
8. http://msdn.microsoft.com/en-us/library/2k3te2cs(VS.80).aspx
History
- 26th April 2009 - First version.