Introduction
Hi .. this is the continuation for crazy stuff with LINQ that we have started from years ago since I created LINQ to XML. I come up today with
LINQ to CSV as a continuation to the dream LINQ to Everything.
Background
The CSV stands for Comma Separated Values, which is a text-based file that
contains a bunch of values separated by commas, it has either .csv or a .txt
extension.
The CSV file format is often used to
exchange data between disparate applications. The file format, as it is used
in Microsoft Excel, has become a pseudo standard throughout the industry, even
among non-Microsoft platforms.
The CSV Format
- Each record is one line, but fields may contain
embedded line-breaks, so a record may span more than one line
- The first record in a CSV file may be a header record
containing column (field) names.
- Fields are separated with commas.
1997,Ford,E350
- Leading and trailing space-characters adjacent to
comma field separators are ignored.
"luxurious truck"
- Fields with embedded commas must be delimited with double-quote
characters.
1997,Ford,E350,"Super, luxurious truck"
- Fields that contain double quote characters must be
surrounded by
double-quotes, and the embedded double-quotes must each be represented
by a pair of consecutive double quotes.
1997,Ford,E350,"Super, ""luxurious"" truck"
- A field that contains embedded line-breaks must be
surrounded by
double-quotes.
1997,Ford,E350,"Go get one now
they are going fast"
This is an example of how the CSV looks like:
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended
Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very
Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
I quote the above sample from
Wikipedia, so for more details please check this link.
I saw many implementations to read and write into csv
files, even LINQ to CSV, nothing but I didn't see much of flexibility during reading
the values or querying them also many of the articles which i have seen parse the csv files
line by line which is good to achieve their goals but almost fails when we have
multi-line values. So i struggle to do it myself in LINQy style. Out of doubt we know that the CSV
fields depend on the the file itself and there's no CLR object that contains such
properties. Instead some guys have a great job in LINQ to CSV and they create a CLR
objects that contain the same properties, it's kind of mapping .. believe me this way
it 'll bother the programmers after awhile, coz whenever the fields have changed in the files he/she
needs to create another POCO to map these fields.
So i was thinking for a couple of days to do it in a better way .. it 'll be nice if we write record.Name instead of magic
string record["Name"].
And as we know in .NET 4 the Dynamic and DLR becomes
increasingly important for almost programmers to do dynamic programming stuff. I think what we have is a kind of dynamic programming, as Anders Hejlsberg said "Use the
strongly types when you can, but when you can't the dynamic comes for the rescue"
i still remember those words from DevDays 2010 Conference. And that's the big challenge for me
personally to do such thing.
In my tiny library i 'll focus mainly into Dynamic Programming &
Regular Expressions to accomplish my goal, there 're many materials you can find
them on the internet talking about each one of them.
Using the code
The Record
class is very important class to represent the CSV record
dynamically, in other words it contains all the fields and values for specific record
in CSV file, and these fields and vaules become a member for this object at the
runtime, which is leverage from the complexity of reflection and related stuff. This
class seems as ExpandoObject
implementation, which contains a data dictionary to
store the keys and values for each record in the CSV file.
Public Class Record Inherits DynamicObject
Private dictionary As New Dictionary(Of String, Object)
Public Sub AddPair(key As String, value As Object)
dictionary(key) = value
End Sub
Public Overrides Function TryGetMember(binder As GetMemberBinder, _
ByRef result As Object) As Boolean
result = dictionary(binder.Name)
Return True
End Function
Public Overrides Function TrySetMember(binder As SetMemberBinder, _
value As Object) As Boolean
dictionary(binder.Name) = value
Return True
End Function
End Class
If we dig into the Record class we'll notice that implement DynamicObject
class which is enable us to create a dynamic object that can discover its properties
and methods at run-time. And simply we override the TryGetMember
and
TrySetMember
methods to implement the getter and setter for the Record class members.
One of the coolest function is AddPair()
which allow us to add properties to this class at runtime?!! may be the sound is strange, this and much more is kind
o goodness from DLR and dynamic support in the framework itself. Which is something tough in the "Reflection Days".
The CSVDataContext
is the container of all the records that 'll be fetch from the CSV file. It's straightforward accept the file path to be parsed. Fetch the fields to the Key
property as well as the values into the Records property.
In the code below we notice that we have two properties Keys which store the fields names, and the second one is Records which hold the actual data of each record.
The main logic happen in the constructor which accept the file name for the CSV file, second thing get the fields from the file and iterate through each line for that file
and apply the following regular expression
"[" & vbTab & ",](?=(?:[^""]|""[^""]*"")*$)"
which is return the tokens that matches the above pattern, which means give me all the tokens that separated by commas, and if they 're inside the double quotation please
treat them as data. Last thing store them in the Records
property which was already created.
Public Class CSVDataContext
Private Property keys As List(Of String)
Public Property Records As List(Of Object)
Public Sub New(filePath As String)
If File.Exists(filePath) Then
Try
Dim lines = File.ReadAllLines(filePath)
keys = New List(Of String)(lines(0).Replace(" ", "").Split(","))
Records = New List(Of Object)()
Dim tokens As IEnumerable(Of String)
Dim flag As Boolean
For i As Integer = 1 To lines.Length - 1
Dim record As New Record()
flag = lines(i).Where(Function(c) c = """").Count() Mod 2 = 1
If flag Then lines(i) = String.Concat(lines(i), lines(i + 1))
tokens = Regex.Split(lines(i), "[" & vbTab & ",](?=(?:[^""]|""[^""]*"")*$)")
For j = 0 To keys.Count - 1
record.AddPair(keys(j), tokens)
Next
If flag Then i += 1
Records.Add(record)
Next
Catch ex As Exception
Throw New FormatException()
End Try
Else
Throw New FileNotFoundException()
End If
End Sub
End Class
The remaining stuff is instantiate the CSVDataContext
and play with the Records property in the LINQy fashion :)
Dim db As New CSVDataContext("test.csv")
Dim query = From r In db.Records
Select r
For Each q In query
Console.WriteLine("{0} {1} {2} {3}", q.Year, q.Make.ToString().PadRight(10, _
" "), q.Model.ToString().PadRight(45, " "), q.Price)
Next
In the above snippet we create an object from our DataContext
, after that we create a simple query that fetches all the record from
test.csv file. Last thing we iterate
the query result and print them in the console. And if we take a deep look to that query we 'll notice that's seem to be a normal LINQ to SQL, there's no magic strings or something
like that and that's the beauty of dynamic objects that allow us to create custom properties and discover them at run-time.
Points of Interest
Dynamic programming makes the developers life much easier than before, and with LINQ the life becomes too sweet :) and the combination for different technologies
and features explore the power of the .NET.