Introduction
The IFilter interface is an important component of the Microsoft Indexing Service. The intent of this project is to provide a solution that performs well and has a low memory footprint. This is accomplished by using a TextReader and a wrapper around IFilter that interops with managed code.
This article targets readers who have development experience with COM in a managed environment. It aims to create a solution that can extract the contents of various file types using an IFilter implementation.
The solution in the source code package contains a test project, IFilterTest, which includes a unit test and a load test. These were created using a higher edition of Visual Studio. If you cannot open the test project, you can remove it from the solution.
Background
The IFilter component is an in-process COM server that extracts the text and values for a specific file type. The appropriate IFilter component for a file type is called by the Filtering component.
A customized IFilter component can be developed for almost any selected file type. The standard IFilter components supplied with Indexing Service include the following:
| File Name | Description |
|---|---|
| mimefilt.dll | Filters Multipurpose Internet Mail Extension (MIME) files. |
| nlhtml.dll | Filters HTML 3.0 or earlier files. |
| offfilt.dll | Filters Microsoft Office files: Microsoft Word, Microsoft Excel, and Microsoft PowerPoint®. |
| query.dll | Filters plain text files (default filter) and binary files (null filter). |
The IFilter interface scans documents for text and properties (also called attributes). It extracts chunks of text from these documents, filtering out embedded formatting and retaining information about the position of the text. It also extracts chunks of values, which are properties of an entire document or of well-defined parts of a document. IFilter provides the foundation for building higher-level applications such as document indexers and application-independent viewers.
Main Classes Diagram
IFilter Interface and Mixed Class
The IFilter interface code:
[ComImport, Guid(Constants.IFilterGUID),
 InterfaceType(ComInterfaceType.InterfaceIsIUnknown),
 SuppressUnmanagedCodeSecurity, ComVisible(true), AutomationProxy(false)]
public interface IFilter
{
    [PreserveSig]
    [MethodImpl(MethodImplOptions.InternalCall,
        MethodCodeType = MethodCodeType.Runtime)]
    IFilterReturnCodes Init(
        [MarshalAs(UnmanagedType.U4)] IFILTER_INIT grfFlags,
        uint cAttributes,
        FULLPROPSPEC[] aAttributes,
        out IFILTER_FLAGS pdwFlags);

    [PreserveSig]
    [MethodImpl(MethodImplOptions.InternalCall,
        MethodCodeType = MethodCodeType.Runtime)]
    IFilterReturnCodes GetChunk(out STAT_CHUNK pStat);

    [PreserveSig]
    [MethodImpl(MethodImplOptions.InternalCall,
        MethodCodeType = MethodCodeType.Runtime)]
    IFilterReturnCodes GetText(
        ref uint pcwcBuffer,
        [Out] IntPtr awcBuffer);

    [PreserveSig]
    [MethodImpl(MethodImplOptions.InternalCall,
        MethodCodeType = MethodCodeType.Runtime)]
    IFilterReturnCodes GetValue(
        out PROPVARIANT PropVal);

    [PreserveSig]
    [MethodImpl(MethodImplOptions.InternalCall,
        MethodCodeType = MethodCodeType.Runtime)]
    IFilterReturnCodes BindRegion(ref FILTERREGION origPos, ref Guid riid,
        ref object ppunk);
}
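The interface above returns IFilterReturnCodes, which is not listed in this article. The following is a minimal sketch of what such an enum could look like, restricted to the codes that the reading loop in this article tests for. The numeric values are the HRESULTs defined for IFilter in the Windows SDK; the underlying type, the member subset, and the ReturnCodeDemo helper are assumptions made for illustration, and the enum in the source package may differ.

```csharp
using System;

// Sketch only: a subset of the IFilter HRESULT values from the Windows SDK,
// plus the two generic COM codes used by the reading loop in this article.
// The project's real IFilterReturnCodes enum may declare more members.
public enum IFilterReturnCodes : uint
{
    S_OK                   = 0x00000000, // success
    E_INVALIDARG           = 0x80070057, // invalid argument
    FILTER_E_END_OF_CHUNKS = 0x80041700, // GetChunk: no chunks left
    FILTER_E_NO_MORE_TEXT  = 0x80041701, // GetText: current chunk is exhausted
    FILTER_E_NO_TEXT       = 0x80041705, // GetText: current chunk holds no text
    FILTER_S_LAST_TEXT     = 0x00041709  // GetText: last text of the current chunk
}

// Hypothetical helper, not part of the article's source.
public static class ReturnCodeDemo
{
    // An HRESULT signals failure when its top (severity) bit is set.
    public static bool IsFailure(IFilterReturnCodes code)
    {
        return ((uint)code & 0x80000000u) != 0;
    }

    public static void Main()
    {
        foreach (IFilterReturnCodes code in Enum.GetValues(typeof(IFilterReturnCodes)))
        {
            Console.WriteLine("{0,-24} 0x{1:X8} failure={2}",
                code, (uint)code, IsFailure(code));
        }
    }
}
```

Note that FILTER_S_LAST_TEXT is a success code even though it ends the current chunk, which is why the reading loop has to compare against specific values rather than only checking for failure.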
The mixed class:
public class MixedIFilterClass : IFilterClass, IDisposable
{
    public override string TmpFilePath
    {
        get;
        set;
    }

    public override Object InternalObj
    {
        get;
        set;
    }

    ~MixedIFilterClass()
    {
        Dispose(false);
    }

    protected virtual void Dispose(bool disposing)
    {
        // Release the underlying COM object, if any.
        if (null != InternalObj)
        {
            Marshal.ReleaseComObject(InternalObj);
            InternalObj = null;
        }

        // Best-effort cleanup of the temporary file; failures are ignored.
        if (null != TmpFilePath)
        {
            try
            {
                File.Delete(TmpFilePath);
                TmpFilePath = null;
            }
            catch { }
        }

        // Cleanup is done, so the finalizer no longer needs to run.
        if (disposing)
            GC.SuppressFinalize(this);
    }

    public void Dispose()
    {
        Dispose(true);
    }
}
How it Works
The process takes two steps:
- Get the current chunk
- Call GetText() on the chunk
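The two steps above can be sketched with an in-memory stand-in for the filter. FakeFilter and ChunkLoopDemo are hypothetical names that do not appear in the source package, and boolean results replace the real return codes, so this only illustrates the control flow: an outer loop over chunks and an inner loop that drains the text of each chunk.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for a COM IFilter: each inner array is one chunk,
// and each string is one piece of text that a GetText call would return.
public sealed class FakeFilter
{
    private readonly string[][] _chunks;
    private int _chunk = -1;
    private int _piece;

    public FakeFilter(params string[][] chunks)
    {
        _chunks = chunks;
    }

    // Step 1: advance to the next chunk.
    // Returning false plays the role of FILTER_E_END_OF_CHUNKS.
    public bool GetChunk()
    {
        _chunk++;
        _piece = 0;
        return _chunk < _chunks.Length;
    }

    // Step 2: return the next piece of text of the current chunk.
    // Returning false plays the role of FILTER_E_NO_MORE_TEXT.
    public bool GetText(out string text)
    {
        if (_piece < _chunks[_chunk].Length)
        {
            text = _chunks[_chunk][_piece++];
            return true;
        }
        text = null;
        return false;
    }
}

public static class ChunkLoopDemo
{
    public static string ReadAll(FakeFilter filter)
    {
        var sb = new StringBuilder();
        while (filter.GetChunk())            // step 1: next chunk, stop at the end
        {
            string text;
            while (filter.GetText(out text)) // step 2: drain the current chunk
            {
                sb.Append(text);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        var filter = new FakeFilter(
            new[] { "Hello, ", "wor" },
            new[] { "ld" },
            new string[0]);                  // a chunk without any text
        Console.WriteLine(ReadAll(filter));
    }
}
```

A real filter also returns value chunks, which GetText cannot read, so production code checks the chunk state before entering the inner loop.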
Step 1: Get the current chunk
Call GetChunk to move to the next chunk. If the return code reports that no chunks remain, terminate the reading process.
var returnCode = _filter.GetChunk(out chunk);
Step 2: Call GetText() on the chunk
Depending on the state returned by the GetChunk method, call the GetText method on the text chunk. When GetText signals the end of the current chunk, repeat step 1.
while (true)
{
    if (remaining <= _topSize)
        return;

    // Choose between the internal buffer and the caller's array.
    bool useBuffer = !forceDirectlyWrite && remaining < BufferSize;
    var size = BufferSize;
    if (useBuffer)
        size -= _topSize;
    else
    {
        if (remaining < BufferSize)
            size = (uint)remaining;
    }
    if (size < ResBufSize)
        size = ResBufSize;

    // Pin the target array so that the filter can write into it directly.
    var handle = GCHandle.Alloc(useBuffer ? _buffer : array,
        GCHandleType.Pinned);
    var ptr = Marshal.UnsafeAddrOfPinnedArrayElement(
        useBuffer ? _buffer : array, useBuffer ? (int)_topSize : offset);
    IFilterReturnCodes returnCode;
    try
    {
#if DEBUG
        Trace.Write(size);
#endif
        returnCode = _filter.GetText(ref size, ptr);
#if DEBUG
        Trace.WriteLine("->" + size);
#endif
    }
    finally
    {
        handle.Free();
    }

    // Account for the characters that GetText wrote.
    if (returnCode != IFilterReturnCodes.FILTER_E_NO_TEXT)
    {
        if (useBuffer)
            _topSize += size;
        else
        {
            offset += (int)size;
            remaining -= (int)size;
        }
        if (_topSize > BufferSize)
        {
            _resTopSize = _topSize - BufferSize;
            _topSize = BufferSize;
        }
    }

    // End of the current chunk: leave the loop and go back to step 1.
    if (returnCode == IFilterReturnCodes.FILTER_S_LAST_TEXT ||
        returnCode == IFilterReturnCodes.FILTER_E_NO_MORE_TEXT ||
        (returnCode == IFilterReturnCodes.FILTER_E_NO_TEXT && size != 0) ||
        (null == FileName && IgnoreError && returnCode ==
            IFilterReturnCodes.E_INVALIDARG))
    {
        _endOfCurrChunk = true;
        if (remaining <= _topSize)
            return;
        break;
    }

    if (returnCode != IFilterReturnCodes.S_OK)
    {
        throw new Exception(
            "An error occurred while getting text from the current filter",
            new Exception(returnCode.ToString()));
    }
}
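The loop above pins a managed array and hands its address to GetText. The sketch below isolates that pinning technique so it can be run without a COM filter. NativeStyleFill is a hypothetical stand-in for the native call: it writes UTF-16 code units through a raw pointer and reports the count it wrote through a ref parameter, in the same way that GetText updates pcwcBuffer. PinnedBufferDemo and FillAt are also illustrative names.

```csharp
using System;
using System.Runtime.InteropServices;

public static class PinnedBufferDemo
{
    // Hypothetical stand-in for a native routine such as IFilter::GetText.
    // On entry 'count' is the capacity in characters; on exit it is the
    // number of characters actually written to 'destination'.
    private static void NativeStyleFill(IntPtr destination, ref uint count, string source)
    {
        uint n = Math.Min(count, (uint)source.Length);
        for (int i = 0; i < n; i++)
        {
            // Each UTF-16 code unit occupies two bytes.
            Marshal.WriteInt16(destination, i * 2, source[i]);
        }
        count = n;
    }

    // Fills 'buffer' starting at 'offset' and returns the text written.
    public static string FillAt(char[] buffer, int offset, string source)
    {
        uint size = (uint)(buffer.Length - offset);

        // Pinning keeps the garbage collector from moving the array while
        // unmanaged code holds a raw pointer into it.
        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            IntPtr ptr = Marshal.UnsafeAddrOfPinnedArrayElement(buffer, offset);
            NativeStyleFill(ptr, ref size, source);
        }
        finally
        {
            // Always unpin, even if the call throws.
            handle.Free();
        }
        return new string(buffer, offset, (int)size);
    }

    public static void Main()
    {
        var buffer = new char[16];
        Console.WriteLine(FillAt(buffer, 4, "chunk text"));
    }
}
```

Writing straight into the destination array at an offset avoids an intermediate copy per GetText call, which fits the low memory footprint goal stated in the introduction.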
Using the Code
The following code uses just a filename:
var fileName = ""; // path of the file to read
using (var reader = new FilterReader(fileName))
{
    reader.Init();
}
This code specifies a file and an extension; the second overload passes a numeric argument (0x1000) instead:
using (var reader = new FilterReader(fileName, ".docx"))
{
    reader.Init();
}
using (var reader = new FilterReader(fileName, 0x1000))
{
    reader.Init();
}
The code below shows how to pass a byte array into a FilterReader:
byte[] bytes = null;
using (var reader = new FilterReader(bytes, ".docx"))
{
    reader.Init();
}
using (var reader = new FilterReader(bytes, ".docx", 0x1000))
{
    reader.Init();
}
using (var reader = new FilterReader(bytes))
{
    reader.Init();
}
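The snippets above stop after Init(). Assuming FilterReader derives from TextReader, as the introduction suggests, the extracted text can then be consumed through a fixed-size buffer instead of being loaded into one large string. The sketch below uses a StringReader as a stand-in so that it runs without the library; ChunkedReadDemo and CountChars are hypothetical names.

```csharp
using System;
using System.IO;

public static class ChunkedReadDemo
{
    // Drains any TextReader through a fixed-size buffer and returns the
    // number of characters seen. Memory use stays bounded by bufferSize.
    public static long CountChars(TextReader reader, int bufferSize)
    {
        var buffer = new char[bufferSize];
        long total = 0;
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            // A real consumer would index or write buffer[0..read) here.
            total += read;
        }
        return total;
    }

    public static void Main()
    {
        // StringReader stands in for the article's FilterReader.
        using (TextReader reader = new StringReader(new string('x', 10000)))
        {
            Console.WriteLine(CountChars(reader, 0x1000));
        }
    }
}
```

With a real FilterReader, the same loop would follow the reader.Init() call inside the using block.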
History
- 2/9/2010 - Updated download files.
- 3/03/2009 - v2.0: Replaced Copyright and added OLE FMTIDs, Windows Search Schema, and OLE property interfaces for further needs.
- 2/05/2009 - v2.0: Reconstructed some phrases (thanks Sean Kenney for reviewing this article).
- 1/07/2009 - v2.0: Added comments for Adobe's PDF filter and small changes.
- 12/09/2008 - v2.0 [Stable release].
- 11/24/2008 - v1.0 [Initial release].