In my quest to get a rather hack-ey work project DONE and off my machine, I discovered an exceptionally useful OSS tool for working with Microsoft Excel data, and at the same time, I also found a handy library, DocX, for working with Microsoft Word files which not only spares one the issues of Interop/COM, but in fact does not require Interop assemblies at all. We'll take a look at DocX in my next post.
Mostly, I need to get data out of Excel, which admittedly, can be done using ADO.NET. However, LINQ to Excel makes this pretty disgustingly easy.
Image by Elif Ayiter - Some Rights Reserved
LinqToExcel by Paul Yoder is a .NET library which facilitates querying Excel spreadsheets and .CSV files using the familiar LINQ syntax. The source is available at Yoder's Github Repository, and there is also a Nuget package available. The source is available with a broadly permissive license which essentially allows you to do whatever you want with the code, so long as a copy of the license is included.
It is important to note that you need to use the 64 bit version of LinqToExcel if you are targeting the x64 processor architecture with your application. Also in this case, you will need the 64 bit version of the Access Database Engine installed. This means you are taking a dependency which may not be present or installable on a remote server.
Update 1018/2013 – As it turns out, as cool as LinqToExcel is for most cases, for MY specific project there was an issue. That dependency on the ACE Database engine. My project was to be deployed to Windows Azure Websites, which does not have support for the ACE Engine. Back to the drawing board. I found another library, ExcelDataReader, which is not as slick as LinqToExcel for simply reading data, but which carries NO DEPENDENCIES on proprietary MS technology which would create issues deploying to a remote server.
To get started, either go to the Solution Explorer, you can simply enter the following in the Nuget Package Manager Console:
NOTE: When adding Nuget packages to your project, consider keeping source control bloat down by using Nuget Package Restore so that packages are downloaded automatically on build rather than cluttering up your repo.
Get LinqToExcel Using the Nuget Package Manager Console
PM> Install-Package LinqToExcel
Or, you can right-click on "Manage Nuget Packages for Solution" and search for "LinqToExcel" in the online packages in the Package Manager window:
Get LinqToExcel Using Solutions Explorer GUI Menu Item
Wanna get some data from an Excel spreadsheet? It's this easy:
Consider the following spreadsheet:
Example Spreadsheet Data
Here is the very simplest way we can query this data and (for the sake of this example) write it out to the console:
Straight and Simple Query Using Linq Syntax
void PrintArtistAlbums()
{
string pathToExcelFile = ""
+ @"D:\Code\Blog Projects\BlogSandbox\ArtistAlbums.xlsx";
string sheetName = "Sheet1";
var excelFile = new ExcelQueryFactory(pathToExcelFile);
var artistAlbums = from a in excelFile.Worksheet(sheetName) select a;
foreach (var a in artistAlbums)
{
string artistInfo = "Artist Name: {0}; Album: {1}";
Console.WriteLine(string.Format(artistInfo, a["Name"], a["Title"]));
}
}
That's it. Initialize an ExcelQueryFactory
object, and use Linq the way you are accustomed to.
In the previous example, we are accepting what LinqToExcel sends us by default, which is an ExcelQueryable<LinqToExcel.Row>
object (which implements IQueryable
and other useful interfaces). Each Row
object is composed of (take a guess . . .) LinqToExcel.Cell
objects.
But it gets better. LinqToExcel will also materialize our data to our objects, if the column names match:
LinqToExcel Query Materializing a Class
public class ArtistAlbum
{
public string Name { get; set; }
public string Title { get; set; }
}
void PrintArtistAlbums()
{
string pathToExcelFile = ""
+ @"D:\Code\Blog Projects\BlogSandbox\ArtistAlbums.xlsx";
string sheetName = "Sheet1";
var excelFile = new ExcelQueryFactory(pathToExcelFile);
var artistAlbums = from a in excelFile.Worksheet<ArtistAlbum>(sheetName) select a;
foreach (var a in artistAlbums)
{
string artistInfo = "Artist Name: {0}; Album: {1}";
Console.WriteLine(artistInfo, a.Name, a.Title);
}
}
Note the items called out in red above. In this case, we have told LinqToExcel that the objects it expects to get back from the query should map to the ArtistAlbum
class. If your column names match, you're all set – pull your data in and populate the class like you would with any other Linq query.
If your columns don't match, you still have some options. Let's say our boss gives us a spreadsheet each week to do something important within our application. He is also fond of crazy column names. We can take advantage of the AddMapping
method provided by the ExcelQueryFactory
object to map spreadsheet columns to the right property on our class.
Let's say that our ArtistAlbum
class is defined as before, but the boss insists on naming the "Name
" column on HIS spreadsheet as "Artist Name," and the "Title
" column as "Album Title". In this case, we can do this:
Use the AddMapping Method to Map Column Names to Property Names
void PrintArtistAlbums()
{
string pathToExcelFile = ""
+ @"D:\Code\Blog Projects\BlogSandbox\ArtistAlbums.xlsx";
string sheetName = "Sheet1";
var excelFile = new ExcelQueryFactory(pathToExcelFile);
excelFile.AddMapping("Name", "Artist Name");
excelFile.AddMapping("Title", "Album Title");
var artistAlbums = from a in excelFile.Worksheet<ArtistAlbum>(sheetName) select a;
foreach (var a in artistAlbums)
{
string artistInfo = "Artist Name: {0}; Album: {1}";
Console.WriteLine(artistInfo, a.Name, a.Title);
}
}
We can use the GetWorksheetNames
method to query Excel and return an IEnumerable<string>
containing the names of all the Worksheets within a Workbook:
Retrieve the Worksheet Names Contained within an Excel Workbook
public IEnumerable<string> getWorkSheets()
{
string pathToExcelFile = ""
+ @"D:\Code\Blog Projects\BlogSandbox\ArtistAlbums.xlsx";
var excelFile = new ExcelQueryFactory(pathToExcelFile);
var worksheetsList = excelFile.GetWorksheetNames();
return excelFile.GetWorksheetNames();
}
Often, it's handy to be able to grab the column names available within a particular worksheet. That's also easy:
var columnNames = _excel.GetColumnNames(selectedSheetName);
This returns a simple IEnumerable<string>
containing the column names in the Worksheet.
This might be useful as a way to check and see if the properties exposed by one of our classes are all represented by columns in a spreadsheet. We can use reflection to walk the public
properties exposed by a class instance, then compare the property names with our column names (case-insensitively, if we choose). This is a really basic example, but you get the idea.
Using the GetColumns Method to Validate Worksheet Columns Against Object Properties
public IEnumerable<string> ValidateSheetColumns(string selectedSheetName)
{
string pathToExcelFile = ""
+ @"D:\Code\Blog Projects\BlogSandbox\ArtistAlbums.xlsx";
var excelFile = new ExcelQueryFactory(pathToExcelFile);
var SheetColumnNames = excelFile.GetColumnNames(selectedSheetName);
var prototype = new ArtistAlbum();
var ClassProperties = prototype.GetType().GetProperties();
var requiredColumns = (from p in ClassProperties select p.Name);
var missingColumns = requiredColumns.ToList();
foreach (string column in requiredColumns)
{
foreach (var candidate in SheetColumnNames)
{
if (column.Equals(candidate, StringComparison.InvariantCultureIgnoreCase))
{
missingColumns.Remove(column);
}
}
}
return missingColumns;
}
Of course, the above only works well with simple POCO classes. If you have read-only properties on your class, you might run into problems. But for the sake of example, you get the point.
Queries against Excel using LinqToExcel are nice and fast. However, it is important to remember that the initialization of an instance of ExcelQueryFactory
is relatively expensive (you are essentially initiating a database connection from a cold start, then pulling a large amount of XML data into memory). This should be considered in designing your application's use of this library.
Where possible, I recommend doing your best to initialize the ExcelQueryFactory
as few times as possible within the context of a given operation. My examples above include initialization in each example so that they can stand on their own. Most likely, though, in a real-world context, I would try to make the ExcelQueryFactory
instance a class member variable where it made sense.
Update 10-4-2013 - Also, pay close attention to dates when consuming Excel data, and make sure the Excel 1900/1904 date funkiness is dealt with properly with whatever framework you are using! I haven't checked this with LinqToExcel yet, but this is always an important bugaboo to keep in mind when working with Excel data.
LinqToExcel is a cool enough tool that I decided to write this quick article. While I personally avoid working with Excel data wherever possible (Excel is for analysis of properly persisted data, dammit!), there is often no getting around it. While there are certainly other ways to work with Excel data in your application, LinqToExcel makes this about as simple as it can get.
Update: 10/15/2013 - As indicated above, use of this library requires that the ACE Database Engine be installed on the host machine, which makes LinqToExcel less than ideal for deployment to a server which does not have this .exe installed. While the ACE engine is re-distributable, is requires installation as opposed to inclusion as a dll. This requirement effectively tanked LinqToExcel for my current project, which is to be deployed as a Windows Azure Website.
There is an alternative solution, ExcelDataReader, which takes no dependency on proprietary MS technology, and which can be used to read data from Excel files. It is not as smooth to use as LinqToExcel, nor as fully featured, but it gets the job done. I will take a look at that in an upcoming article.