
Read Document Text Directly from Microsoft Word File

6 Jan 2008
A simple way to obtain document text from a *.doc file.

Introduction

In this article, we'll take a brief look at the Microsoft Word binary file format and present a simple way to extract the document text from *.doc files.

Links:

  • Cellbi.DocFramework - Read/Write Microsoft Word documents, complex formatting, sections and tables are supported.
  • GetWordTextSrc - The complete source code for this article.

You may also take a look at the project we are currently working on.

The *.doc format created by Microsoft is one of the most widely used file formats for rich text. The most common way to work with Microsoft Word documents programmatically is OLE Automation (OLE stands for Object Linking and Embedding), but this approach has several disadvantages: it is slow, it requires Microsoft Office to be installed, and its document model is inconvenient.

Another way to work with Microsoft Word files is to read and write them directly, which avoids the disadvantages described above: direct access is much faster and lets you use your own document model. The main difficulty is that it requires knowledge of the binary file format.

OLE Structured Storage

A Word file is organized as a file system within a file. This technology, called OLE structured storage, allows multiple kinds of objects to be stored in a single document. A structured storage file is built from two object types: storages, which act like directories, and streams, which act like files.
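Before opening a file as structured storage, it can be useful to check that it looks like one. The sketch below is not part of the article's source: it assumes only the well-known 8-byte signature that begins every structured storage (compound) file, and the class and method names are hypothetical.

```csharp
using System;
using System.IO;

static class CompoundFileCheck
{
    // Signature found at the start of every structured storage file.
    static readonly byte[] Signature =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Returns true when the stream starts with the signature.
    public static bool HasCompoundSignature(Stream stream)
    {
        byte[] header = new byte[Signature.Length];
        int read = 0;
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0)
                return false; // stream is shorter than the signature
            read += n;
        }

        for (int i = 0; i < Signature.Length; i++)
            if (header[i] != Signature[i])
                return false;
        return true;
    }
}
```

A matching signature only shows that the file is a structured storage container. It does not prove that the file is a Word document, because other applications use the same container.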

The StgOpenStorage WinAPI function provides access to the root storage object of a structured storage file. Here is its declaration:

// Ole32Dll is a string constant naming the library that exports the function (ole32.dll).
[DllImport(Ole32Dll, CharSet = CharSet.Unicode)]
public static extern int StgOpenStorage(string pwcsName,
    IStorage pstgPriority,
    int grfMode,
    IntPtr snbExclude,
    int reserved,
    out IStorage ppstgOpen);

Of these parameters, we mainly use pwcsName, the path of the file that contains the storage object to open, and ppstgOpen, which receives the IStorage implementation used to work with the root storage. The following example shows how to call StgOpenStorage:

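// Open for reading and writing, with no sharing while the file is open.
const int _DefaultFlags = (int)(STGMFlags.STGM_READWRITE |
                                STGMFlags.STGM_SHARE_EXCLUSIVE);

// Opens the root storage of the file at the given path.
// Returns null if the file cannot be opened as structured storage.
public static IStorage OpenRootStorage(string path)
{
  IStorage storage;
  int result = NativeMethods.StgOpenStorage(
      path,
      null,
      _DefaultFlags,
      IntPtr.Zero,
      0,
      out storage);

  // StgOpenStorage returns an HRESULT; 0 (S_OK) means success.
  if (result != 0)
    return null;
  return storage;
}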

The C# declaration of the IStorage interface and the supporting structures can be found in the article's source code.
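The STGMFlags enum used above is not shown in the article. As a rough guide, here is a hypothetical reconstruction of the relevant members; the numeric values are the STGM_* constants from the Windows SDK, but the enum layout and the StorageModes helper class are assumptions made for illustration.

```csharp
using System;

// Hypothetical reconstruction of the STGMFlags enum referenced in the article.
[Flags]
enum STGMFlags
{
    STGM_READ             = 0x00,
    STGM_WRITE            = 0x01,
    STGM_READWRITE        = 0x02,
    STGM_SHARE_EXCLUSIVE  = 0x10,
    STGM_SHARE_DENY_WRITE = 0x20
}

static class StorageModes
{
    // The combination used by the article's code.
    public const int ReadWriteExclusive =
        (int)(STGMFlags.STGM_READWRITE | STGMFlags.STGM_SHARE_EXCLUSIVE);

    // A read-only alternative (an assumption, not what the article uses).
    public const int ReadOnlyExclusive =
        (int)(STGMFlags.STGM_READ | STGMFlags.STGM_SHARE_EXCLUSIVE);
}
```

Since the code in this article only reads from the file, a read-only mode may be sufficient and would also work for files on read-only media. Treat that as a suggestion to verify rather than as the behavior of the article's source.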

Reading Document Streams

The root storage of a Word file contains the document stream and the table stream, which we'll use to read the document text. The document stream contains the Word file header (the FIB, or File Information Block), the document text and formatting information. The document text and its formatting are stored as a set of pieces, and each piece has an offset into the document stream. The table stream describes where the text is located, as a collection of piece descriptors.

To get access to any stream in the root storage, we'll use the OpenStream method of the IStorage interface:

  int OpenStream(string pwcsName,
          IntPtr reserved1,
          int grfMode,
          int reserved2,
          out UCOMIStream ppstm);

In this declaration, pwcsName is the name of the stream to open and ppstm receives the resulting stream. To get access to the document and table streams of the file, we'll use the following code:

const int _DefaultFlags = (int)(STGMFlags.STGM_READWRITE |
                                STGMFlags.STGM_SHARE_EXCLUSIVE);

UCOMIStream OpenStream(IStorage storage, string name)
{
  UCOMIStream stream;
  int result = storage.OpenStream(name, IntPtr.Zero, _DefaultFlags, 0, out stream);
  if (result != 0)
    return null;
  return stream;
}

byte[] GetStreamData(UCOMIStream stream)
{
  STATSTG stat;
  stream.Stat(out stat, 0);
  long size = stat.cbSize;
  byte[] buffer = new byte[size];

  stream.Read(buffer, (int)size, IntPtr.Zero);
  return buffer;
}

BinaryReader GetStreamReader(UCOMIStream stream)
{
  if (stream == null)
    return null;

  byte[] streamData = GetStreamData(stream);
  MemoryStream memoryStream = new MemoryStream(streamData);
  return new BinaryReader(memoryStream);
}

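void GetStreamsData(string path, out BinaryReader documentStreamReader,
        out BinaryReader tableStreamReader)
{
  documentStreamReader = null;
  tableStreamReader = null;

  IStorage rootStorage = OpenRootStorage(path);
  if (rootStorage == null)
    return; // the file could not be opened as structured storage

  UCOMIStream documentStream = OpenStream(rootStorage, "WordDocument");
  // Note: Word names the table stream either "0Table" or "1Table", and a flag
  // in the FIB selects which one is in use. This sample always opens "0Table".
  UCOMIStream tableStream = OpenStream(rootStorage, "0Table");

  documentStreamReader = GetStreamReader(documentStream);
  tableStreamReader = GetStreamReader(tableStream);
}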
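The code above always opens "0Table", but a Word file may keep its table data in a stream named "1Table" instead. The following sketch shows one way to pick the name from the start of the document stream. It relies on Microsoft's published description of the format (a 16-bit identifier 0xA5EC at offset 0 and a flags word at offset 0x0A whose 0x0200 bit selects the table stream); the FibHeader class and its members are hypothetical and do not appear in the article's source.

```csharp
using System;

static class FibHeader
{
    const ushort WordIdent = 0xA5EC;         // identifier at the start of the FIB
    const int FlagsOffset = 0x0A;            // 16-bit flags field in the FIB
    const ushort WhichTableStream = 0x0200;  // set: "1Table", clear: "0Table"

    // Reads a little-endian 16-bit value from the buffer.
    static ushort ReadUInt16(byte[] data, int offset)
    {
        return (ushort)(data[offset] | (data[offset + 1] << 8));
    }

    // Returns the name of the table stream that belongs to this document stream.
    public static string GetTableStreamName(byte[] documentStream)
    {
        if (documentStream == null || documentStream.Length < FlagsOffset + 2)
            throw new ArgumentException("Too short to contain a FIB.");
        if (ReadUInt16(documentStream, 0) != WordIdent)
            throw new ArgumentException("Not a Word binary document stream.");

        ushort flags = ReadUInt16(documentStream, FlagsOffset);
        return (flags & WhichTableStream) != 0 ? "1Table" : "0Table";
    }
}
```

With such a helper, the bytes returned by GetStreamData for the "WordDocument" stream could be passed to GetTableStreamName, and the result used in place of the hard-coded "0Table".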

Reading Document Text

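Now that we have access to the main file streams, we can read the information about where the document text is located. The FIB at the start of the document stream stores it in two fields: fib.clxOffset, the offset of the piece descriptor collection within the table stream, and fib.clxLength, its length in bytes:

void GetDataFromFib(BinaryReader documentStreamReader, out int pieceCollOffset,
    out uint pieceCollLength)
{
  // The two fields are stored next to each other, starting at byte 418 of the FIB.
  documentStreamReader.BaseStream.Seek(418, SeekOrigin.Begin);
  pieceCollOffset = documentStreamReader.ReadInt32();
  pieceCollLength = documentStreamReader.ReadUInt32();
}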
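The offset 418 may look arbitrary. The sketch below shows how it can be derived, assuming the FIB layout used by Word 97 and later as described in Microsoft's published format documentation. The FibLayout class and its constant names are hypothetical and serve only to make the arithmetic explicit.

```csharp
static class FibLayout
{
    const int FibBaseSize = 32;      // fixed-size start of the FIB
    const int CountFieldSize = 2;    // each array is preceded by a 16-bit count
    const int FibRgWSize = 14 * 2;   // array of 14 16-bit values
    const int FibRgLwSize = 22 * 4;  // array of 22 32-bit values
    const int PairSize = 8;          // each entry is a 32-bit offset plus a 32-bit length
    const int ClxPairIndex = 33;     // zero-based position of the piece table entry

    // Start of the array of offset/length pairs: 154.
    public const int FcLcbStart =
        FibBaseSize + CountFieldSize + FibRgWSize +
        CountFieldSize + FibRgLwSize + CountFieldSize;

    // Position of the piece table offset field: 418.
    public const int ClxOffsetField = FcLcbStart + ClxPairIndex * PairSize;

    // Position of the piece table length field: 422.
    public const int ClxLengthField = ClxOffsetField + 4;
}
```

Files written by versions of Word older than Word 97 use a different header layout, so a fixed offset of 418 should not be expected to work for them.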

With this information, we can read all piece descriptors from the table stream. Each piece descriptor describes one run of text stored in the document stream. The following code shows how to do this:

PieceDescriptorCollection GetPieceDescriptors(BinaryReader tableStreamReader,
    int pieceCollOffset, uint pieceCollLength)
{
  PieceDescriptorCollection result =
        new PieceDescriptorCollection(pieceCollOffset, pieceCollLength);
  result.Read(tableStreamReader);
  return result;
}

Note that all the work of reading piece descriptors is done inside the PieceDescriptorCollection class. See this article's source code for the complete implementation.
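To give an idea of what such a class has to do, here is a minimal sketch of reading piece descriptors. It is not the article's PieceDescriptorCollection: the Piece and PieceTable types are hypothetical, and the layout follows Microsoft's published description of the format (optional formatting blocks marked with 1, then a block marked with 2 that holds n+1 character positions followed by n 8-byte descriptors).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for one entry of the piece descriptor collection.
struct Piece
{
    public uint FileStart;  // byte offset of the piece in the document stream
    public uint FileEnd;    // byte offset just past the end of the piece
    public bool IsUnicode;  // true: two bytes per character, false: one byte
}

static class PieceTable
{
    public static List<Piece> Read(BinaryReader table, int clxOffset, uint clxLength)
    {
        List<Piece> pieces = new List<Piece>();
        long end = clxOffset + (long)clxLength;
        table.BaseStream.Seek(clxOffset, SeekOrigin.Begin);

        while (table.BaseStream.Position < end)
        {
            byte kind = table.ReadByte();
            if (kind == 1)
            {
                // Formatting block: skip its 16-bit size and its payload.
                ushort size = table.ReadUInt16();
                table.BaseStream.Seek(size, SeekOrigin.Current);
                continue;
            }
            if (kind != 2)
                throw new InvalidDataException("Unexpected block type.");

            uint tableSize = table.ReadUInt32();
            if (tableSize < 4)
                throw new InvalidDataException("Piece table is too small.");
            // n+1 character positions (4 bytes each) plus n descriptors (8 bytes each).
            int count = (int)((tableSize - 4) / 12);

            uint[] cp = new uint[count + 1];
            for (int i = 0; i <= count; i++)
                cp[i] = table.ReadUInt32();

            for (int i = 0; i < count; i++)
            {
                table.ReadUInt16();            // descriptor flags, unused here
                uint fc = table.ReadUInt32();  // packed offset and compression bit
                table.ReadUInt16();            // property modifier, unused here

                bool compressed = (fc & 0x40000000) != 0;
                uint offset = fc & 0x3FFFFFFF;
                uint chars = cp[i + 1] - cp[i];

                Piece piece;
                piece.IsUnicode = !compressed;
                // Compressed pieces store the offset doubled and use one byte per character.
                piece.FileStart = compressed ? offset / 2 : offset;
                piece.FileEnd = piece.FileStart + (compressed ? chars : chars * 2);
                pieces.Add(piece);
            }
            break; // the piece table is the last block
        }
        return pieces;
    }
}
```

The FileStart, FileEnd and IsUnicode values correspond to what GetPieceFileBounds is expected to return for each piece in the code that follows.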

The last step is to read the document text. Here is how to do this:

string LoadText(BinaryReader documentReader, PieceDescriptorCollection pieces)
{
  if (documentReader == null || pieces == null)
    return string.Empty;

  // StringBuilder (System.Text) avoids repeated string copies on large documents.
  StringBuilder text = new StringBuilder();
  int count = pieces.Count;
  for (int i = 0; i < count; i++)
  {
    // Get the byte range of the piece within the document stream.
    uint pieceStart;
    uint pieceEnd;
    bool isUnicode = pieces.GetPieceFileBounds(i, out pieceStart, out pieceEnd);

    documentReader.BaseStream.Seek(pieceStart, SeekOrigin.Begin);
    text.Append(ReadString(documentReader, pieceEnd - pieceStart, isUnicode));
  }
  return text.ToString();
}

The LoadText method iterates over all document pieces, gets the bounds of each piece and reads its text. The ReadString method is simple:

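string ReadString(BinaryReader reader, uint length, bool isUnicode)
{
  if (length == 0)
    return string.Empty;

  // length is given in bytes; Unicode pieces use two bytes per character.
  if (isUnicode)
    length = length / 2;

  StringBuilder result = new StringBuilder((int)length);
  for (uint i = 0; i < length; i++)
  {
    // Read one character: a 16-bit code unit or a single byte.
    char ch = isUnicode ? (char)reader.ReadUInt16() : (char)reader.ReadByte();
    result.Append(ch);
  }
  return result.ToString();
}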
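As an alternative to reading one character at a time, the bytes of a piece can be decoded in a single call. The sketch below is an illustration and is not taken from the article's source. The PieceText class is hypothetical, and treating single-byte text as ISO-8859-1 is an assumption that mirrors the byte-to-char cast in ReadString; it can misrepresent some characters in the 0x80 to 0x9F range. It also converts the carriage return that Word uses as a paragraph mark into the platform's line break.

```csharp
using System;
using System.Text;

static class PieceText
{
    // Decodes the raw bytes of one piece into a string.
    public static string Decode(byte[] data, bool isUnicode)
    {
        Encoding encoding = isUnicode
            ? Encoding.Unicode                     // UTF-16, little-endian
            : Encoding.GetEncoding("iso-8859-1");  // assumption: one byte maps to one character

        string text = encoding.GetString(data);

        // Word ends each paragraph with a single carriage return.
        return text.Replace("\r", Environment.NewLine);
    }
}
```

To use it, LoadText would read pieceEnd - pieceStart bytes with documentReader.ReadBytes and pass them to Decode together with the isUnicode flag.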

That's all. Thanks for your attention. I hope this article is useful. Please let me know if you run into any problems.

History

  • January 7th, 2008: Initial release

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
