Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / VB

LINQ to CSV

4.82/5 (6 votes)
13 Jul 2013CPOL5 min read 36.3K   431  
LINQ to CSV files.

Introduction 

Hi .. this is the continuation for crazy stuff with LINQ that we have started from years ago since I created LINQ to XML. I come up today with LINQ to CSV as a continuation to the dream LINQ to Everything.

Background

The CSV stands for Comma Separated Values, which is a text-based file that contains a bunch of values separated by commas, it has either .csv or a .txt extension.

The CSV file format is often used to exchange data between disparate applications. The file format, as it is used in Microsoft Excel, has become a pseudo standard throughout the industry, even among non-Microsoft platforms.

The CSV Format

  • Each record is one line, but fields may contain embedded line-breaks, so a record may span more than one line
  • The first record in a CSV file may be a header record containing column (field) names.
  • Fields are separated with commas.
    1997,Ford,E350   
  • Leading and trailing space-characters adjacent to comma field separators are ignored.
    "luxurious truck" 
  •  Fields with embedded commas must be delimited with double-quote characters.
    1997,Ford,E350,"Super, luxurious truck"  
  •  Fields that contain double quote characters must be surrounded by double-quotes, and the embedded double-quotes must each be represented by a pair of consecutive double quotes.
    1997,Ford,E350,"Super, ""luxurious"" truck"  
  • A field that contains embedded line-breaks must be surrounded by double-quotes.

    1997,Ford,E350,"Go get one now
    they are going fast" 

This is an example of how the CSV looks like:

Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended 

Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very 

Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00  

I quote the above sample from Wikipedia, so for more details please check this link.

I saw many implementations to read and write into csv files, even LINQ to CSV, nothing but I didn't see much of flexibility during reading the values or querying them also many of the articles which i have seen parse the csv files line by line which is good to achieve their goals but almost fails when we have multi-line values. So i struggle to do it myself in LINQy style. Out of doubt we know that the CSV fields depend on the the file itself and there's no CLR object that contains such properties. Instead some guys have a great job in LINQ to CSV and they create a CLR objects that contain the same properties, it's kind of mapping .. believe me this way it 'll bother the programmers after awhile, coz whenever the fields have changed in the files he/she needs to create another POCO to map these fields.

So i was thinking for a couple of days to do it  in a better way .. it 'll be nice if we write record.Name instead of magic string record["Name"].

And as we know in .NET 4 the Dynamic and DLR becomes increasingly important for almost programmers to do dynamic programming stuff. I think what we have is a kind of dynamic programming, as Anders Hejlsberg said "Use the strongly types when you can, but when you can't the dynamic comes for the rescue" i still remember those words from DevDays 2010 Conference. And that's the big challenge for me personally to do such thing.

In my tiny library i 'll focus mainly into Dynamic Programming & Regular Expressions to accomplish my goal, there 're many materials you can find them on the internet talking about each one of them.

Using the code

The Record class is very important class to represent the CSV record dynamically, in other words it contains all the fields and values for specific record in CSV file, and these fields and vaules become a member for this object at the runtime, which is leverage from the complexity of reflection and related stuff. This class seems as ExpandoObject implementation, which contains a data dictionary to store the keys and values for each record in the CSV file.  

VB
Public Class Record Inherits DynamicObject

    Private dictionary As New Dictionary(Of String, Object)
    Public Sub AddPair(key As String, value As Object)
        dictionary(key) = value
    End Sub
    Public Overrides Function TryGetMember(binder As GetMemberBinder, _
                     ByRef result As Object) As Boolean
        result = dictionary(binder.Name)
        Return True
    End Function

    Public Overrides Function TrySetMember(binder As SetMemberBinder, _
           value As Object) As Boolean
        dictionary(binder.Name) = value
        Return True
    End Function
End Class

If we dig into the Record class we'll notice  that implement DynamicObject class which is enable us to create a dynamic object that can discover its properties and methods at run-time. And simply we override the TryGetMember and TrySetMember methods to implement the getter and setter for the Record class members.

One of the coolest function is AddPair() which allow us to add properties to this class at runtime?!! may be the sound is strange, this and much more is kind o goodness from DLR and dynamic support in the framework itself. Which is something tough in the "Reflection Days".

The CSVDataContext is the container of all the records that 'll be fetch from the CSV file. It's straightforward accept the file path to be parsed. Fetch the fields to the Key property as well as the values into the Records property.

In the code below we notice that we have two properties Keys which store the fields names, and the second one is Records which hold the actual data of each record.

The main logic happen in the constructor which accept the file name for the CSV file, second thing get the fields from the file and iterate through each line for that file and apply the following regular expression

VB
"[" & vbTab & ",](?=(?:[^""]|""[^""]*"")*$)" 

which is return the tokens that matches the above pattern, which means give me all the tokens that separated by commas, and if they 're inside the double quotation please treat them as data. Last thing store them in the Records property which was already created.

VB
Public Class CSVDataContext
    Private Property keys As List(Of String)
    Public Property Records As List(Of Object)

    Public Sub New(filePath As String)
        If File.Exists(filePath) Then
            Try
                Dim lines = File.ReadAllLines(filePath)
                keys = New List(Of String)(lines(0).Replace(" ", "").Split(","))
                Records = New List(Of Object)()
                Dim tokens As IEnumerable(Of String)
                Dim flag As Boolean
                For i As Integer = 1 To lines.Length - 1
                    Dim record As New Record()
                    flag = lines(i).Where(Function(c) c = """").Count() Mod 2 = 1
                    If flag Then lines(i) = String.Concat(lines(i), lines(i + 1))
                    tokens = Regex.Split(lines(i), "[" & vbTab & ",](?=(?:[^""]|""[^""]*"")*$)")
                    For j = 0 To keys.Count - 1
                        record.AddPair(keys(j), tokens)
                    Next
                    If flag Then i += 1
                    Records.Add(record)
                Next
            Catch ex As Exception
                Throw New FormatException()
            End Try
        Else
            Throw New FileNotFoundException()
        End If
    End Sub
End Class

The remaining stuff is instantiate the CSVDataContext and play with the Records property in the LINQy fashion :)

VB
Dim db As New CSVDataContext("test.csv")
Dim query = From r In db.Records
            Select r

For Each q In query
    Console.WriteLine("{0} {1} {2} {3}", q.Year, q.Make.ToString().PadRight(10, _
      " "), q.Model.ToString().PadRight(45, " "), q.Price)
Next

In the above snippet we create an object from our DataContext, after that we create a simple query that fetches all the record from test.csv file. Last thing we iterate the query result and print them in the console. And if we take a deep look to that query we 'll notice that's seem to be a normal LINQ to SQL, there's no magic strings or something like that and that's the beauty of dynamic objects that allow us to create custom properties and discover them at run-time.

Points of Interest 

Dynamic programming makes the developers life much easier than before, and with LINQ the life becomes too sweet :) and the combination for different technologies and features explore the power of the .NET.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)