Introduction
Language Integrated Query (LINQ) provides an excellent interface to query data objects stored in various forms. LINQ to SQL, LINQ to objects, LINQ to XML are few examples. LINQ provides common interface for querying any type of data source. As LINQ queries are integrated into C# language, it expedites development. This article is about an approach for LINQ to CSV (Comma separated) file data.
Background
The main design objective is to facilitate rapid business application development. Consider you have a CSV file that contains business data. You need to write feature, say Business_Feature1
to consume this data to provide aggregate. Now, it's not very difficult to develop such Business_Feature1
in a day or two. Then, again few days later, say you need to write Business_Feature2
to consume another CSV file to provide top sales region. Again, the developer needs to spend another 1 or 2 days to develop this feature. The smartest approach is to separate concerns while developing Business_Feature1
. That is, Business_Feature1
is developed as two modules. Module 1, consumes CSV file to convert it into objects, while Module 2 performs business actions on these objects to provide neat business feature. This would most likely take same time. However, if this smart approach is followed, developing Business_Feature2
becomes snappy, just few hours. This is because Module 1 from Business_Feature1
is reused for Rapid application development. This not only shortens development time but also increases quality output. As Module 1 has gone through rigorous testing phase, it is more stable and error free.
Now, with this background, picture writing that reusable module, Module 1 as one of C# classic language feature LINQ. Then, development of Business feature would not take hours, but minutes because all you need to write is a simple LINQ query to get the desired feature development.
Comparison of other LINQ 2 CSV with this library:
Using the Code
The main advantage of this LINQ to CSV library is, developers need not write any special code at all. They can import the library linq2csv
in their project and just write LINQ query to CSV file. All data binding happens runtime.
A sample project is included to use this library (Linq2CSV
). Let's say we have a CSV file like below. Region, NA is North America (not, Not Applicable). Our business objective is to find net profit by region.
Input CSV File
REGION | COUNTRY | ACCOUNT_ID | CURRENCY | GROSS_PROFIT | TAX_RATE |
NA | USA | 1 | USD | 1234232.76 | 10 |
NA | Canada | 2 | CAN | 453232.4576 | 10 |
APAC | Hong Kong | 3 | HKD | 124342.1 | 1 |
EMEA | Saudi Arabia | 4 | SAR | 2345234535 | 0 |
Given LINQ to CSV library in our hand, this task is just going to be few minutes of development. This is because number of lines required for this development is just 1 line, i.e., 1 LINQ query as below:
result = from data in dataToOperate
group data by ((dynamic)data).region
into profitByRegion
select new
{
Region = profitByRegion.Key,
NetProfitUSD = profitByRegion.SumOfDecimals(dataItem =>
{
decimal netProfit = 0.0M;
decimal tax = 0.0M;
if (((dynamic)dataItem).gross_profit > 0)
tax = ((dynamic)dataItem).gross_profit *
(((dynamic)dataItem).tax_rate / 100.0M);
netProfit = ((dynamic)dataItem).gross_profit - tax;
return netProfit * ((dynamic)dataItem).conversion_factor;
})
};
If you are familiar with LINQ, this will look easy.
Output looks as below based on CSV file input given in this article:
### Total Net Profit (Tax deducted) by Region in USD ###
Region : NA | NetProfitUSD : 127285002.6245864620 |
Region : APAC | NetProfitUSD : 27333822.56032211400 |
Region : EMEA | NetProfitUSD : 1259879792.90627115830 |
Region : APC | NetProfitUSD : 434647.7561869611525 |
Download the sample from the link at the top of this article to see a quick run output. You may change content of CSV files to see how sample output changes in response to your changes.
Sample output looks like this:
Design Motivation
The main design objective is to facilitate rapid business application development. To achieve this design goal, components of software that are common for similar use-case are precisely separated as re-usable modules. This enables dev teams to focus on just developing business functionality while boiler plate modules are ready.
Data is loaded from CSV file into C# dynamic object. This part is boiler plate component that can be re-used. CSV file content is loaded as array of objects into memory with CSV headers as object property, i.e., item[0].gross_profit
will get first object in array, property named gross_profit
. Business feature can be easily developed with just one line of C# LINQ query. This module is essentially to LINQ for CSV file. New feature development involves just writing LINQ query for business-case.
- Trade-off between Pure LINQ vs Hybrid LINQ (see Trade-off section) is analyzed
- Interface based design for cleaner abstractions
- Generic enough to work on any CSV file with headers
- CSV file is loaded as array of objects with object’s member properties to operate. This enables writing LINQ queries on these objects.
Type Safety - Runtime Type Safety Check
This library is totally type safe and checks each and every element in CSV for type safety. If there is a type safety mismatch found, an exception is thrown. This helps users to handle issues immediately instead of wrong data or exception while processing query result. Yes, remember, LINQ is pretty late binding while query results are processed.
This is how CSV file feed type information to the linq2csv
library:
The Alter Ego, Dynamic Type
In C# dynamic type would resolve at runtime. The dynamic type changes its type at runtime based on type at RHS (Right hand side). For example, in the below code:
dynamic runTimeType = "Hello World";
Console.Writeline("Type of runTimeType: {0}", runTimeType.GetType().ToString());
This will print System.String
.
If you assign any class to dynamic type, compiler will not check for correctness of method, property or any operation on the object. This is because this is resolved at runtime. Dynamic types do not have support of intellisense in Visual Studio IDE. However, despite compile time check is not being done, runtime exception is thrown if method or property is incorrect.
Trade-offs and Performance
- Unlike other LINQ to CSV libraries, you do not need to write even single line of code to consume your CSV file. You can import the library and write LINQ query targeting contents in CSV file. All binding happens magically at runtime.
- Trade-off between Pure LINQ vs Hybrid LINQ is analyzed. Pure LINQ is termed as implementing LINQ to CSV functionality as purely LINQ extension function. Hybrid LINQ is termed as implementing LINQ to CSV functionality via dynamic objects. Hybrid LINQ approach is taken for the below reasons:
- Pure LINQ needs to load CSV file for every query vs Hybrid LINQ loads CSV file into memory once to use for all queries.
- Pure LINQ is inefficient due to “C# closure” issue in pure LINQ queries vs Hybrid approach has no closure issue.
- Fast CSV file reader using .NET parallel library
- Log4net, a 3rd party logger is used. Log4net is feature rich to suit such projects
- Usage of LINQ helps performance as unnecessary memory allocations are avoided. Query results are obtained only when read.
Writing new LINQ extensions are kept to a minimal while using most out of built in LINQ functionalities.
Folder Structure
Main folder contains bin, doc, sample, source and specification. As the name suggests, each directory contains appropriate content. Specification directory contains requirements document.
- Bin – binaries in both release and debug configuration
- Sample – working sample program in both release and debug mode
- Source – source code of library, sample program and unit test
- Specification – Requirement documents
- Readme.docx – The file that give details about project
Brief Code Walk-through
Here is the main functional part of code. Surprisingly, it is just one line of code that completes business requirement.
One-line LINQ query for Gross Profit Total by Country in USD:
result = from data in dataToOperate
group data by ((dynamic)data).country
into profitByCountry
select new
{
Country = profitByCountry.Key,
GrossProfitUSD =
profitByCountry.SumOfDecimals(item =>
((dynamic)item).gross_profit *(((dynamic)item).conversion_factor))
};
Here, dataToOperate
is data structure exposed by CSV reader. This data is grouped by country. Then Total
is added using SumOfDecimals
method. Gross profit is converted to USD using conversion rate.
Limitations and Assumptions
Below are assumptions and limitations of this module:
- Tax rate is computed without any tax slabs or rules specific to country.
- Tax rate is calculated to be zero for loss making accounts (i.e., negative profit accounts).
- Type safety of input CSV file items are not checked, but desired. This feature can be added, given type information in CSV file.
- Modification of CSV file content will not flow to output result set at runtime.
Even if type-safety is implemented, it is runtime rather than compile time. This is because we get input data (CSV file) at run-time.
Conclusion and Future Scope
This is an interesting module that uses one of C# language’s finest feature LINQ. This module can be easily extended to live display of CSV data, the ability to use LINQ for any underlying data structure, Type safety check, Run time querying (from config file, user input, etc.). In a similar fashion, IOutput
interface can be extended to support html5 push data, Text output, Custom output for thick client.
If you are planning to use this library (assembly) in your project, please drop me an email. I will be very happy to provide you with the professional version of this library. Also, do not forget to vote for this article/project.
History
- 3rd February, 2019 - Version 1
- 16th February 2019 -
RuntimeType
safety check is added - Version 1.1