Extracting Table Data from Word Document using Aspose Words

Manas Bhardwaj

10 Aug 2013

How to extract table data from a Word document using Aspose Words

Introduction

For one of my projects, I had to extract data from Word documents and export it to a database. The biggest challenge was that I had to support the existing documents: there were thousands of them, all in the same format, each containing chunks of data. The format was never designed to be read by another system, which meant there were no bookmarks, merge fields, or styles to distinguish the actual data from the standard instructions. Luckily, all the input fields were in tables. These tables, however, came in different layouts: some had a single row or cell, and others had a varying number of rows and cells.

I use Aspose Words extensively for creating and manipulating Word documents, and given my experience with the component, I decided to use it here as well. To solve the problem, I created a matching table model in C# that I could use later while reading the documents.

Below, you can see that I created a class called WordDocumentTable with three properties: TableID, RowID and ColumnID. As explained earlier, the documents contain no table or row identifiers, so these properties simply denote positions in the Word document. Indexes start at 0. Note that the constructors take the column index before the row index.

public class WordDocumentTable
{ 
	public WordDocumentTable(int PiTableID) 
	{  
		MiTableID = PiTableID; 
	}

	public WordDocumentTable(int PiTableID, int PiColumnID) 
	{  
		MiTableID = PiTableID;  
		MiColumnID = PiColumnID; 
	}

	public WordDocumentTable(int PiTableID, int PiColumnID, int PiRowID) 
	{  
		MiTableID = PiTableID;  
		MiColumnID = PiColumnID;  
		MiRowID = PiRowID; 
	}

	private int MiTableID = 0;

	public int TableID 
	{  
		get { return MiTableID; }  
		set { MiTableID = value; } 
	}        

	private int MiRowID = 0;    
	public int RowID 
	{  
		get { return MiRowID; }  
		set { MiRowID = value; } 
	}

	private int MiColumnID = 0;    
	public int ColumnID 
	{  
		get { return MiColumnID; }  
		set { MiColumnID = value; } 
	}
}
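
One limitation of this model is that 0 serves both as a valid index and as the default for "not set", so the first row or column cannot be told apart from an unspecified one. Below is a minimal sketch of one way around this, assuming nullable integers are acceptable. The class name TableCellAddress and the TargetsSingleCell property are hypothetical and not part of the original project.

```csharp
using System;

// Hypothetical variant of WordDocumentTable (not from the original project).
// A null RowID or ColumnID means "not specified", so index 0 stays addressable.
public class TableCellAddress
{
    // Same argument order as the original constructors: table, column, row.
    public TableCellAddress(int tableId, int? columnId = null, int? rowId = null)
    {
        TableID = tableId;
        ColumnID = columnId;
        RowID = rowId;
    }

    public int TableID { get; private set; }
    public int? RowID { get; private set; }
    public int? ColumnID { get; private set; }

    // True when a single cell is requested rather than the whole table.
    public bool TargetsSingleCell
    {
        get { return ColumnID.HasValue; }
    }
}
```

With this variant, new TableCellAddress(1, 0) addresses the first column of the second table, while new TableCellAddress(1) still means the whole table.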

Now comes the extraction part. Below, you will see the collection of table cells which I want to read from the document.

private List<WordDocumentTable> WordDocumentTables
{  
	get  
	{    
		List<WordDocumentTable> wordDocTable = new List<WordDocumentTable>();      
		//Reads the data from the first Table of the document.    
		wordDocTable.Add(new WordDocumentTable(0));      
		//Reads the data from the second table and its second column. 
		//This table has only one row.    
		wordDocTable.Add(new WordDocumentTable(1, 1));      
		//Reads the data from third table, second row and second cell.    
		wordDocTable.Add(new WordDocumentTable(2, 1, 1));  
		return wordDocTable;  
	}
}

Below is the method that extracts the data from the Aspose Words Document object based on the table, row and cell positions.

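// Requires: using System; using System.IO; using Aspose.Words; using Aspose.Words.Tables;
// ToTxt() is the call used in the original code; if your Aspose Words version
// no longer has it, ToString(SaveFormat.Text) should be the equivalent.
public void ExtractTableData(byte[] PobjData)
{
	using (MemoryStream LobjStream = new MemoryStream(PobjData))
	{
		Document LobjAsposeDocument = new Document(LobjStream);
		foreach (WordDocumentTable wordDocTable in WordDocumentTables)
		{
			// Find the table by its zero-based position in the document.
			Aspose.Words.Tables.Table table = (Aspose.Words.Tables.Table)
				LobjAsposeDocument.GetChild(NodeType.Table, wordDocTable.TableID, true);
			if (table == null)
			{
				// GetChild returns null when no table exists at that index.
				Console.WriteLine("Table {0} was not found.", wordDocTable.TableID);
				continue;
			}

			// By default, take the text of the whole table.
			string cellData = table.Range.Text;

			// A ColumnID of 0 is treated as "no specific cell requested".
			if (wordDocTable.ColumnID > 0)
			{
				if (wordDocTable.RowID == 0)
				{
					// No row given: index into the flat list of the table's cells.
					NodeCollection LobjCells =
						table.GetChildNodes(NodeType.Cell, true);
					cellData = LobjCells[wordDocTable.ColumnID].ToTxt();
				}
				else
				{
					// Row given: pick the row first, then the cell within it.
					NodeCollection LobjRows =
						table.GetChildNodes(NodeType.Row, true);
					cellData = ((Row)(LobjRows[wordDocTable.RowID])).
						Cells[wordDocTable.ColumnID].ToTxt();
				}
			}

			// The format string must stay on one line, and both calls must be closed.
			Console.WriteLine(String.Format(
				"Data in Table {0}, Row {1}, Column {2} : {3}",
				wordDocTable.TableID,
				wordDocTable.RowID,
				wordDocTable.ColumnID,
				cellData));
		}
	}
}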

The post Extracting Table Data from Word Document using Aspose Words appeared first on Manas Bhardwaj's Stream.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)