Introduction
For one of my projects, data had to be extracted from Word documents and exported to a database. The biggest challenge was that I had to support the existing Word documents: there were thousands of them, all in the same format, each containing chunks of data. This document format was never designed to be read by another system. That means no bookmarks, no merge fields, and no styles to distinguish the actual data from the standard instructions. Luckily, all the input fields were in tables. But these tables came in different formats: some had a single row or cell, others a varying number of rows and cells.
I use Aspose Words extensively for creating and manipulating Word documents, and given my experience with the component, I decided to go with it. To solve the problem, I created a similar table model in C# that I could use later while reading the documents.
Below, you can see that I created a class called WordDocumentTable with three properties: TableID, RowID and ColumnID. As I explained earlier, the documents contain nothing that identifies a table or row, so these properties simply hold the position in the Word document. The start index is assumed to be 0.
// Position-based address of a table, row and cell in a Word document.
public class WordDocumentTable
{
    // Whole table.
    public WordDocumentTable(int PiTableID)
    {
        MiTableID = PiTableID;
    }

    // A cell of the table; note the argument order: table, then column.
    public WordDocumentTable(int PiTableID, int PiColumnID)
    {
        MiTableID = PiTableID;
        MiColumnID = PiColumnID;
    }

    // A cell in a specific row; argument order: table, column, row.
    public WordDocumentTable(int PiTableID, int PiColumnID, int PiRowID)
    {
        MiTableID = PiTableID;
        MiColumnID = PiColumnID;
        MiRowID = PiRowID;
    }

    private int MiTableID = 0;
    public int TableID
    {
        get { return MiTableID; }
        set { MiTableID = value; }
    }

    private int MiRowID = 0;
    public int RowID
    {
        get { return MiRowID; }
        set { MiRowID = value; }
    }

    private int MiColumnID = 0;
    public int ColumnID
    {
        get { return MiColumnID; }
        set { MiColumnID = value; }
    }
}
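For comparison, the same model could be written more compactly with auto-properties and optional constructor parameters. This is only a sketch: the class name TableCellAddress and the ToString override are illustrative and not part of the original code. The argument order (table, column, row) mirrors the constructors above.

```csharp
using System;

// Hypothetical compact equivalent of WordDocumentTable; the name is illustrative.
public class TableCellAddress
{
    public int TableID { get; set; }
    public int RowID { get; set; }
    public int ColumnID { get; set; }

    // Same argument order as the original constructors: table, column, row.
    // Omitted arguments default to 0, just like the original backing fields.
    public TableCellAddress(int tableId, int columnId = 0, int rowId = 0)
    {
        TableID = tableId;
        ColumnID = columnId;
        RowID = rowId;
    }

    public override string ToString()
    {
        return string.Format("Table {0}, Row {1}, Column {2}", TableID, RowID, ColumnID);
    }
}

public static class TableCellAddressDemo
{
    public static void Main()
    {
        Console.WriteLine(new TableCellAddress(0));       // Table 0, Row 0, Column 0
        Console.WriteLine(new TableCellAddress(1, 1));    // Table 1, Row 0, Column 1
        Console.WriteLine(new TableCellAddress(2, 1, 1)); // Table 2, Row 1, Column 1
    }
}
```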
Now comes the extraction part. Below is the collection of table cells which I want to read from the document: the whole of table 0, cell 1 of table 1, and the cell at row 1, column 1 of table 2.
private List<WordDocumentTable> WordDocumentTables
{
    get
    {
        List<WordDocumentTable> wordDocTable = new List<WordDocumentTable>();
        wordDocTable.Add(new WordDocumentTable(0));
        wordDocTable.Add(new WordDocumentTable(1, 1));
        wordDocTable.Add(new WordDocumentTable(2, 1, 1));
        return wordDocTable;
    }
}
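Since the entries in this list are pure positions, it helps to know which index each table, row and cell has in a sample document. The following is a minimal sketch of a helper that prints them. The class name TableIndexDump and the path argument are hypothetical. It also assumes an Aspose.Words version that offers Node.ToString(SaveFormat), and documents without nested tables, so that the running cell counter lines up with a per-table cell index.

```csharp
using System;
using Aspose.Words;
using Aspose.Words.Tables;

// Hypothetical helper, not part of the original code.
public static class TableIndexDump
{
    // Prints the position-based indexes of every table, row and cell.
    public static void Dump(string path)
    {
        Document doc = new Document(path);

        // Deep search, in document order, for all tables.
        NodeCollection tables = doc.GetChildNodes(NodeType.Table, true);
        for (int t = 0; t < tables.Count; t++)
        {
            Table table = (Table)tables[t];
            int cellIndex = 0; // running cell index within this table (assumes no nested tables)
            for (int r = 0; r < table.Rows.Count; r++)
            {
                Row row = table.Rows[r];
                for (int c = 0; c < row.Cells.Count; c++)
                {
                    string text = row.Cells[c].ToString(SaveFormat.Text).Trim();
                    Console.WriteLine("Table {0}, Row {1}, Column {2} (cell #{3}): {4}",
                        t, r, c, cellIndex, text);
                    cellIndex++;
                }
            }
        }
    }
}
```

Running this once against a representative document should give the numbers to put into the list above.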
Below is the method which extracts the data from the Aspose Words document based on the table, row and cell.
// Place these at the top of the file:
using System;
using System.Collections.Generic;
using System.IO;
using Aspose.Words;
using Aspose.Words.Tables;

public void ExtractTableData(byte[] PobjData)
{
    using (MemoryStream LobjStream = new MemoryStream(PobjData))
    {
        Document LobjAsposeDocument = new Document(LobjStream);
        foreach (WordDocumentTable wordDocTable in WordDocumentTables)
        {
            // Locate the table by its position in the document.
            Aspose.Words.Tables.Table table = (Aspose.Words.Tables.Table)
                LobjAsposeDocument.GetChild(NodeType.Table, wordDocTable.TableID, true);
            if (table == null)
            {
                continue; // No table at this position; skip the entry.
            }

            // Default: the text of the whole table.
            string cellData = table.Range.Text;
            if (wordDocTable.ColumnID > 0)
            {
                if (wordDocTable.RowID == 0)
                {
                    // No row given: ColumnID indexes all cells of the table.
                    NodeCollection LobjCells =
                        table.GetChildNodes(NodeType.Cell, true);
                    cellData = LobjCells[wordDocTable.ColumnID].ToTxt();
                }
                else
                {
                    // Row given: pick the cell within that row.
                    NodeCollection LobjRows =
                        table.GetChildNodes(NodeType.Row, true);
                    cellData = ((Row)(LobjRows[wordDocTable.RowID])).
                        Cells[wordDocTable.ColumnID].ToTxt();
                }
            }
            Console.WriteLine(String.Format("Data in Table {0}, Row {1}, Column {2} : {3}",
                wordDocTable.TableID,
                wordDocTable.RowID,
                wordDocTable.ColumnID,
                cellData));
        }
    }
}
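The lookup rules in ExtractTableData can be easier to follow without the Aspose types. Here is a rough sketch of the same three cases, using nested string arrays (tables, rows, cells) as a stand-in for the document. The LookupRules class, the sample data and the bounds checks are illustrative additions, not part of the original code.

```csharp
using System;
using System.Linq;

// Illustrative stand-in: a document modelled as tables -> rows -> cells.
public static class LookupRules
{
    public static string Lookup(string[][][] tables, int tableId, int columnId, int rowId)
    {
        if (tableId < 0 || tableId >= tables.Length) return null;
        string[][] table = tables[tableId];

        // Case 1: no column given, so return the text of the whole table.
        if (columnId <= 0)
            return string.Join(" ", table.SelectMany(row => row));

        // Case 2: column but no row, so columnId counts cells across the whole table.
        if (rowId == 0)
        {
            string[] allCells = table.SelectMany(row => row).ToArray();
            return columnId < allCells.Length ? allCells[columnId] : null;
        }

        // Case 3: row and column given, so pick the cell inside that row.
        if (rowId < 0 || rowId >= table.Length || columnId >= table[rowId].Length)
            return null;
        return table[rowId][columnId];
    }

    public static void Main()
    {
        string[][][] tables =
        {
            new[] { new[] { "Name: Alice" } },
            new[] { new[] { "Label", "Value" } },
            new[] { new[] { "H1", "H2" }, new[] { "A", "B" } }
        };

        Console.WriteLine(Lookup(tables, 0, 0, 0)); // Name: Alice
        Console.WriteLine(Lookup(tables, 1, 1, 0)); // Value
        Console.WriteLine(Lookup(tables, 2, 1, 1)); // B
    }
}
```

One consequence of using 0 as "not specified" is that column 0, and any cell of row 0 by row and column, cannot be addressed individually with this scheme. In the second case the ColumnID acts as a cell counter over the whole table rather than a column within a row.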
The post Extracting Table Data from Word Document using Aspose Words appeared first on Manas Bhardwaj's Stream.