Introduction
Many websites provide search: you type a few words, press a "Search" button, and receive a list of pages that contain those words. It's simple. But how can you implement this feature in your own web application? You need an indexing service that indexes your files or web pages. Once they are indexed, you can run full text searches against them.
There are many solutions that provide this functionality. One of them is Microsoft Indexing Service, which is part of Windows 2000 and later Windows versions. So, if you only build Windows solutions (ASP.NET web applications, Windows Forms applications, etc.), this Microsoft product is worth a look.
One of the biggest advantages of Indexing Service is that it's free: you can use it without additional licenses. I think this matters a lot, because other indexing products cost a lot of money. If you are developing a small or medium sized application, you don't want to pay thousands of dollars for a full text search tool.
If you choose to use the Indexing Service, you should remember that it can only index file systems. For example, you can't use it for indexing files stored in your database. This is a big minus of the Microsoft Indexing Service, but I believe that you can easily work around this limitation.
In this article, I'll try to describe how to install, configure, and use the Microsoft Indexing Service. We'll develop a simple application which will allow us to use full text search features for web pages located on our local file system.
Installing and configuring the Microsoft Indexing Service
If you are using Windows XP or later, you'll be using Microsoft Indexing Service 3.0; on Windows 2000, it is version 2.0. The service is installed by default, but its installation may have been disabled when the Operating System was set up, so check that it is present. To do this, go to "Add or Remove Programs" in your Control Panel and choose "Add/Remove Windows Components". Check that "Indexing Service" is selected there; if it isn't, install it.
Once Microsoft Indexing Service is installed, you can configure it. Open the "Computer Management" tool and choose "Services and Applications", then "Indexing Service". This entry is where you manage Microsoft Indexing Service.
First of all, create a new catalog in Indexing Service; its location is the folder which will contain the indexes. Open the context menu for "Indexing Service" and choose "Catalog" in the "New" submenu. Type a "Name", choose a "Location", and press "OK".
After that, add the folders which will be indexed. Choose the "Directories" entry, open its context menu, and choose "Directory" from the "New" submenu. In the dialog box that opens, choose the folder with your documents and press "OK" to include the selected directory in the index. If you want to exclude a folder from the existing index instead, choose "No" for the "Include in Index?" parameter in this dialog window. This parameter is "Yes" by default.
If Indexing Service is already running, it will index the new catalog. Otherwise, start Indexing Service and it will index the catalog automatically. You can also rebuild the index for a folder manually. To do this, open the context menu for that folder in the existing catalog and choose "Rescan (Full)" or "Rescan (Incremental)" in the "All Tasks" submenu. Of course, Microsoft Indexing Service has to be running at this time.
If you choose the "Indexing Service" entry in "Computer Management", you will see the state of the Indexing Service. This information can help when you have a large document store and can't find a file in it.
There is another important setting for Indexing Service: "Indexing Service Usage". This setting tells Indexing Service how often it should update the indexes. For example, if your application only uses static storage, the service does not need to update the index often; if you use dynamic data storage, where the data changes frequently, the index has to be updated more often. To configure this parameter, open the context menu for the "Indexing Service" entry and choose "Tune Performance" in the "All Tasks" submenu.
Now, you can check the index. To do this, choose "Query the Catalog" in your catalog. You'll see a form which allows you to search for something in your index. First of all, test a simple full text search: enter something in the query field and press the "Search" button. You will see the files which contain the entered words. Of course, you can execute more complex queries using this tool; choose "Advanced query" for that. You can use Microsoft Indexing Service queries to get the required information. This query language is similar to SQL, but it contains some syntax extensions.
Query Microsoft Indexing Service
You can use SQL to query Microsoft Indexing Service. But, there are several extensions for Indexing Service's SQL dialect which you have to know about.
The most useful command, when you use the Microsoft Indexing Service, is the SELECT command. That makes sense, because you shouldn't add, delete, or update information in your indexes. You use SELECT to query the Indexing Service and retrieve information about indexed files. Let's see an example query:
SELECT Path FROM SCOPE() WHERE FREETEXT(Contents, 'Hello World')
This query returns the paths of all files which contain the "Hello World" text. It also helps to illustrate Microsoft Indexing Service's SQL extensions.
First of all, let's look at the FROM expression. In this example, we query all the data which the index contains. The SCOPE() function tells the Indexing Service which data to examine. By default, if you don't pass any parameters, it examines all the data in your index. This function can optimize your queries, because it limits the part of the index that is searched. For example, you can use SCOPE ('"/books"'). Here, you query only the "/books" folder, not all the folders in your index, so the query runs faster than with a plain SCOPE() call. To narrow the search further, you can specify a traversal type. For example, with SCOPE ('DEEP TRAVERSAL OF "/books"'), Indexing Service searches the "/books" directory and all the directories beneath it. If you use SHALLOW TRAVERSAL, as in SCOPE('SHALLOW TRAVERSAL OF "/books"'), Microsoft Indexing Service examines only the "/books" directory itself.
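The three SCOPE() forms described above can be produced from code. The following is a minimal sketch, not part of the article's sample: the ScopeClause class and its Build method are hypothetical names introduced here for illustration, and the helper only assembles text in the forms shown above.

```csharp
using System;

// Hypothetical helper, not part of the article's sample code.
public static class ScopeClause
{
    // Builds the SCOPE(...) text for a FROM clause.
    // An empty path means "search the whole catalog".
    public static string Build(string virtualPath, bool includeSubdirectories)
    {
        if (string.IsNullOrEmpty(virtualPath))
        {
            return "SCOPE()";
        }

        // Quote characters would break out of the literal, so reject them.
        if (virtualPath.IndexOf('\'') >= 0 || virtualPath.IndexOf('"') >= 0)
        {
            throw new ArgumentException(
                "Path must not contain quote characters.", "virtualPath");
        }

        string traversal = includeSubdirectories ? "DEEP" : "SHALLOW";
        return string.Format(
            "SCOPE('{0} TRAVERSAL OF \"{1}\"')", traversal, virtualPath);
    }
}
```

A query would then be assembled as "SELECT Path FROM " + ScopeClause.Build("/books", false) + " WHERE ...".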
The WHERE expression is the same as in SQL, but there are a few extensions for it too. It supports the comparison predicates shown in this table:
Operator | Symbol | Example |
Equals | = | WHERE DocAuthor = 'John Doe' |
Not equals | != or <> | WHERE DocTitle != 'Finance' |
Less than | < | WHERE DocWordCount < 1000 |
Greater than | > | WHERE DocWordCount > 500 |
Less than or equal to | <= | WHERE DocWordCount <= 500 |
Greater than or equal to | >= | WHERE DocWordCount >= 500 |
You also can use Boolean operators which are evaluated using the following rules:
- NOT is evaluated before AND. NOT can only occur after AND (as in AND NOT; the combination OR NOT is not allowed).
- AND is evaluated before OR.
- AND expressions are associative, and can be applied in any order. For example, A AND B AND C is the same as (A AND B) AND C, which is the same as A AND (B AND C).
- OR expressions are associative, and can be applied in any order.
There is a LIKE predicate too. In addition, several predicates extend the SQL language:
- ARRAY. This predicate compares two arrays using logical operators. For example, ... WHERE username = SOME ARRAY ['Admin' , 'root']. This example returns files whose username property is 'Admin' or 'root'.
- CONTAINS. This predicate is used for full text search. For example, ... WHERE CONTAINS(country,'"USA" OR "Russia"'). This example returns files whose country property contains "USA" or "Russia".
- FREETEXT. This predicate finds words and phrases in indexed files. It's the better choice when you need to find something in the contents of your files. For example, ... WHERE FREETEXT(Contents,'Hello World !!!').
- MATCHES. This predicate performs queries using a regular-expression pattern. It's more powerful than the LIKE predicate. For example, ... WHERE MATCHES (Contents, '|(USA|)|{1|}' ). This example matches any string in which exactly one instance of the pattern "USA" occurs.
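As an illustration of the CONTAINS form above, here is a small sketch that builds such a condition from a list of terms. The ContainsClause class and its AnyOf method are hypothetical names, not part of the article's sample or of any Microsoft API; the helper simply rejects terms with quote characters rather than trying to escape them.

```csharp
using System;

// Hypothetical helper, not part of the article's sample code.
public static class ContainsClause
{
    // Builds: CONTAINS(property, '"term1" OR "term2" ...')
    public static string AnyOf(string property, params string[] terms)
    {
        if (terms == null || terms.Length == 0)
        {
            throw new ArgumentException("At least one term is required.", "terms");
        }

        string[] quoted = new string[terms.Length];
        for (int i = 0; i < terms.Length; i++)
        {
            string term = terms[i];
            // Quote characters would end the literal early, so reject them.
            if (term == null || term.IndexOf('\'') >= 0 || term.IndexOf('"') >= 0)
            {
                throw new ArgumentException(
                    "Terms must not be null or contain quote characters.", "terms");
            }
            quoted[i] = "\"" + term + "\"";
        }

        return "CONTAINS(" + property + ", '" + string.Join(" OR ", quoted) + "')";
    }
}
```

For the country example, ContainsClause.AnyOf("country", "USA", "Russia") produces the same condition as the one shown in the list, apart from a space after the comma.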
For additional information, see the Indexing Service articles on the MSDN website.
Now you know how to prepare queries for the Microsoft Indexing Service, but you still need a list of the properties which can be used in your queries. Each index has many default properties, which are listed in the following table.
Friendly Name | Data type | Description |
A_HRef | DBTYPE_WSTR | DBTYPE_BYREF | Text of HTML HREF. This property name was created for the Microsoft® Site Server, and corresponds with the Indexing Service property name HtmlHRef . Can be queried, but not retrieved. |
Access | VT_FILETIME | Last time a file was accessed. |
All | (not applicable) | Searches every property for a string. Can be queried, but not retrieved. |
AllocSize | DBTYPE_I8 | Size of disk allocation for a file. |
Attrib | DBTYPE_UI4 | File attributes. Documented in the Win32 SDK. |
ClassId | DBTYPE_GUID | Class ID of an object, for example, WordPerfect, Word, and so on. |
Characterization | DBTYPE_WSTR | DBTYPE_BYREF | Characterization, or abstract, of a document. Computed by Indexing Service. |
Contents | (not applicable) | Main contents of the file. Can be queried, but not retrieved. |
Create | VT_FILETIME | The time the file was created. |
Directory | DBTYPE_WSTR | DBTYPE_BYREF | The physical path to the file, not including the file name. |
DocAppName | DBTYPE_WSTR | DBTYPE_BYREF | Name of the application that created the file. |
DocAuthor | DBTYPE_WSTR | DBTYPE_BYREF | Author of the document. |
DocByteCount | DBTYPE_I4 | Number of bytes in a document. |
DocCategory | DBTYPE_STR | DBTYPE_BYREF | Type of a document such as a memo, schedule, or whitepaper. |
DocCharCount | DBTYPE_I4 | Number of characters in a document. |
DocComments | DBTYPE_WSTR | DBTYPE_BYREF | Comments about the document. |
DocCompany | DBTYPE_STR | DBTYPE_BYREF | Name of the company for which the document was written. |
DocCreatedTm | VT_FILETIME | The time the document was created. |
DocEditTime | VT_FILETIME | Total time spent editing the document. |
DocHiddenCount | DBTYPE_I4 | Number of hidden slides in a Microsoft® PowerPoint document. |
DocKeywords | DBTYPE_WSTR | DBTYPE_BYREF | Document keywords. |
DocLastAuthor | DBTYPE_WSTR | DBTYPE_BYREF | Most recent user who edited the document. |
DocLastPrinted | VT_FILETIME | The time the document was last printed. |
DocLastSavedTm | VT_FILETIME | The time the document was last saved. |
DocLineCount | DBTYPE_I4 | Number of lines contained in a document. |
DocManager | DBTYPE_STR | DBTYPE_BYREF | Name of the manager of the document's author. |
DocNoteCount | DBTYPE_I4 | Number of pages with notes in a PowerPoint document. |
DocPageCount | DBTYPE_I4 | Number of pages in a document. |
DocParaCount | DBTYPE_I4 | Number of paragraphs in a document. |
DocPartTitles | DBTYPE_STR | DBTYPE_VECTOR | Names of document parts. For example, in Excel, part titles are the names of spreadsheets; in PowerPoint, slide titles; and in Word for Windows, the names of the documents in the master document. |
DocPresentationTarget | DBTYPE_STR | DBTYPE_BYREF | Target format (35mm, printer, video, and so on) for a presentation in PowerPoint. |
DocRevNumber | DBTYPE_WSTR | DBTYPE_BYREF | Current version number of the document. |
DocSlideCount | DBTYPE_I4 | Number of slides in a PowerPoint document. |
DocSubject | DBTYPE_WSTR | DBTYPE_BYREF | Subject of the document. |
DocTemplate | DBTYPE_WSTR | DBTYPE_BYREF | Name of template for a document. |
DocTitle | DBTYPE_WSTR | DBTYPE_BYREF | Title of the document. |
DocWordCount | DBTYPE_I4 | Number of words in the document. |
FileIndex | DBTYPE_I8 | Unique ID of the file. |
FileName | DBTYPE_WSTR | DBTYPE_BYREF | Name of the file. |
HitCount | DBTYPE_I4 | Number of hits (words matching a query) in the file. |
HtmlHRef | DBTYPE_WSTR | DBTYPE_BYREF | Text of HTML HREF. Can be queried, but not retrieved. |
HtmlHeading1 | DBTYPE_WSTR | DBTYPE_BYREF | Text of HTML document in style H1. Can be queried, but not retrieved. |
HtmlHeading2 | DBTYPE_WSTR | DBTYPE_BYREF | Text of HTML document in style H2. Can be queried, but not retrieved. |
HtmlHeading3 | DBTYPE_WSTR | DBTYPE_BYREF | Text of HTML document in style H3. Can be queried, but not retrieved. |
HtmlHeading4 | DBTYPE_WSTR | DBTYPE_BYREF | Text of HTML document in style H4. Can be queried, but not retrieved. |
HtmlHeading5 | DBTYPE_WSTR | DBTYPE_BYREF | Text of HTML document in style H5. Can be queried, but not retrieved. |
HtmlHeading6 | DBTYPE_WSTR | DBTYPE_BYREF | Text of HTML document in style H6. Can be queried, but not retrieved. |
Img_Alt | DBTYPE_WSTR | DBTYPE_BYREF | Alternate text for <IMG> tags. Can be queried, but not retrieved. |
Path | DBTYPE_WSTR | DBTYPE_BYREF | Full physical path to a file, including file name. |
Rank | DBTYPE_I4 | Rank of row. Ranges from 0 to 1000. Larger numbers indicate better matches. |
RankVector | DBTYPE_I4 | DBTYPE_VECTOR | Ranks of individual components of a vector query. |
ShortFileName | DBTYPE_WSTR | DBTYPE_BYREF | Short (8.3) file name. |
Size | DBTYPE_I8 | Size of file, in bytes. |
USN | DBTYPE_I8 | Update Sequence Number. NTFS drives only. |
VPath | DBTYPE_WSTR | DBTYPE_BYREF | Full virtual path to a file, including the file name. If more than one possible path, then the best match for the specific query is chosen. |
WorkId | DBTYPE_I4 | Internal ID for a file. Used within Indexing Service. |
Write | VT_FILETIME | Last time the file was written. |
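Several rows in the table are marked "Can be queried, but not retrieved", which means those properties may appear in a WHERE condition but not in the SELECT list. The sketch below shows one way to guard against that mistake when assembling a statement. The SelectStatement class is a hypothetical helper introduced for illustration; its list of query-only names is copied from the table above.

```csharp
using System;

// Hypothetical helper, not part of the article's sample code.
public static class SelectStatement
{
    // Properties the table marks as "Can be queried, but not retrieved".
    private static readonly string[] QueryOnlyProperties =
    {
        "A_HRef", "All", "Contents", "HtmlHRef",
        "HtmlHeading1", "HtmlHeading2", "HtmlHeading3",
        "HtmlHeading4", "HtmlHeading5", "HtmlHeading6", "Img_Alt"
    };

    // Builds "SELECT <columns> FROM SCOPE() WHERE <whereClause>",
    // rejecting columns that cannot be retrieved.
    public static string Build(string whereClause, params string[] columns)
    {
        if (columns == null || columns.Length == 0)
        {
            throw new ArgumentException("At least one column is required.", "columns");
        }

        foreach (string column in columns)
        {
            foreach (string queryOnly in QueryOnlyProperties)
            {
                if (string.Equals(column, queryOnly, StringComparison.OrdinalIgnoreCase))
                {
                    throw new ArgumentException(
                        "Property '" + column + "' can be queried, but not retrieved.",
                        "columns");
                }
            }
        }

        return "SELECT " + string.Join(", ", columns)
            + " FROM SCOPE() WHERE " + whereClause;
    }
}
```

For example, SelectStatement.Build("DocAuthor = 'John Doe'", "Path", "Size") is accepted, while asking for "Contents" in the column list throws an ArgumentException.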
As you can see, there are a lot of indexed properties for each file, but sometimes, you want to extend this list.
How to add new properties for an indexed file
First of all, this feature works only for web pages, because it is based on the HTML <meta> tag.
Let's say you have several indexed web pages and you want to add several special properties for them. For example, to add "country" and "city" properties, add <meta> tags carrying these new properties to every file:
<meta name="country" content="Russia" />
<meta name="city" content="Moscow" />
After these changes, you have to restart Indexing Service. Now, you can open the "Properties" entry and see that Microsoft Indexing Service knows about your new properties. But you still can't use them in your queries.
Select the "Properties" node of your catalog and choose the property which you added to the files using the <meta> tag. Double click on the property, turn on the "Cached" checkbox, and choose the data type for the new property in the dialog box that opens.
After that, create a Column Definition File which contains information about your newly added properties. The file could have an ".idq" extension, but the extension isn't important. A Column Definition File uses the following format:
[Names]
Propertyname( Data type ) = GUID ["Name" | Property ID]
The data type is optional. If you don't specify it, Microsoft Indexing Service takes the data type from the property's definition in your catalog.
For my example, the file contains this:
[Names]
country = d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1 "country"
city = d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1 "city"
All of this data can be taken from the property's configuration dialog box.
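If you have many <meta> properties, the file contents shown above can be generated rather than typed. This is a minimal sketch under stated assumptions: the ColumnDefinitionFile class is a hypothetical helper, and it assumes every property uses the same GUID as the two entries in the example; check the GUID shown in your own properties dialog box before relying on it.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the article's sample code.
public static class ColumnDefinitionFile
{
    // GUID taken from the article's example entries for <meta> properties.
    public const string MetaPropertySet = "d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1";

    // Returns the text of a [Names] section with one line per property,
    // in the form: name = GUID "name"
    public static string Build(params string[] metaNames)
    {
        StringBuilder text = new StringBuilder();
        text.Append("[Names]\r\n");

        foreach (string name in metaNames)
        {
            text.Append(name)
                .Append(" = ")
                .Append(MetaPropertySet)
                .Append(" \"")
                .Append(name)
                .Append("\"\r\n");
        }

        return text.ToString();
    }
}
```

The result can be saved with System.IO.File.WriteAllText, for example File.WriteAllText(path, ColumnDefinitionFile.Build("country", "city")), where path is wherever you keep the file.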
After the Column Definition File is created, information about this file has to be added to the Indexing Service Registry settings. Add a string entry named "DefaultColumnFile" to the Registry key "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndexCommon". "DefaultColumnFile" should contain the full path to your Column Definition File.
Restart Microsoft Indexing Service. After that, run a full rescan of your indexed folder. Now, you will be able to use the new properties in your queries.
Using Microsoft Indexing Service in WinForms applications
Microsoft Indexing Service exposes itself to the developer as an OLE DB provider named MSIDXS. You can use ADO.NET to query Indexing Service. To do this, create a new System.Data.OleDb.OleDbConnection object using a connection string like this one:
Provider= "MSIDXS";Data Source="Documents"
In the Data Source parameter, specify the name of your Indexing Service catalog.
Let's write some sample code which queries Indexing Service for a few words from the file contents. The sample uses a queryString variable, which is an instance of the SearchParameters structure. This structure holds the data source and the query string. Here is its definition:
struct SearchParameters
{
    private string storage;
    public string Storage
    {
        get { return storage; }
        set { storage = value; }
    }

    private string query;
    public string Query
    {
        get { return query; }
        set { query = value; }
    }
}
First of all, create a new OleDbConnection object:
string connectionString =
    string.Format("Provider= \"MSIDXS\";Data Source=\"{0}\";",
                  queryString.Storage);
OleDbConnection connection = new OleDbConnection(connectionString);
After that, create a new OleDbCommand associated with this connection:
string query = string.Format(@"SELECT Path FROM scope() " +
    @"WHERE FREETEXT(Contents, '{0}')", queryString.Query);
OleDbCommand command = new OleDbCommand(query, connection);
Note that the MSIDXS provider doesn't support commands with parameters, so the search text has to be embedded in the command text itself. This is bad. I hope that Microsoft will fix this issue in the next version of the Microsoft Indexing Service.
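Because the search text is pasted into the command text, a single quote typed by the user ends the string literal early and breaks the query. The sketch below shows one way to guard against that. The FreeTextQuery class is a hypothetical helper, and it assumes that the Indexing Service SQL dialect follows the usual SQL convention of doubling a single quote inside a literal; that assumption is worth verifying against the MSDN documentation.

```csharp
using System;

// Hypothetical helper, not part of the article's sample code.
public static class FreeTextQuery
{
    // Doubles single quotes so the text stays inside its SQL literal.
    // Assumption: the dialect accepts '' as an escaped quote.
    public static string EscapeLiteral(string text)
    {
        if (text == null)
        {
            throw new ArgumentNullException("text");
        }
        return text.Replace("'", "''");
    }

    // Builds the same query as the article's sample, with escaped text.
    public static string Build(string searchText)
    {
        return string.Format(
            "SELECT Path FROM SCOPE() WHERE FREETEXT(Contents, '{0}')",
            EscapeLiteral(searchText));
    }
}
```

In the sample, the command would then be created as new OleDbCommand(FreeTextQuery.Build(queryString.Query), connection).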
You are now able to execute this command and retrieve a list of files which contain the selected text:
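ArrayList result = new ArrayList();
connection.Open();
try
{
    // Each row has one column: the Path of a matching file.
    OleDbDataReader reader = command.ExecuteReader();
    while (reader.Read())
    {
        result.Add(reader.GetString(0));
    }
    reader.Close();
}
finally
{
    connection.Close(); // runs even if the query throws
}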
In this code, checking the returned value for NULL is not necessary, because Indexing Service always returns a path to a found file.
Summary
Microsoft Indexing Service is a free and powerful product which is included with Windows 2000 and later versions. It's simple to use: you can easily create indexes and query them through an OLE DB data provider, which is straightforward if you are working with Microsoft .NET. In this article, I have tried to describe how to install, configure, and query the Microsoft Indexing Service. I also recommend you see my example, which I have attached to this article; it shows how to use the full text search features. I hope that this article will help you to start using Indexing Service effectively.
When I prepared this article, I used these materials: