Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / VB

A powerful CSV document wrapper library

5.00/5 (2 votes)
19 Jun 2014CPOL5 min read 22.5K   391  
I need a tool for exchange the data between my .NET program and R program, yeah this powerful library makes things better!

Download DocumentFormat.Csv-noexe.zip

Download DocumentFormat.Csv.zip

 

Why chose csv?

  1. First, csv document just a formatted plant text file that can be easily read and write by any program.
  2. Secondly, if we just want a data file but not a well formatted Excel file with the extract style data, then the csv document is the best choice for the cross-platform software as the MS Office Excel will not works on LINUX and the .NET program on LINUX mono environment is not easily using the Excel library C++ interface from the OpenOffice without any modification of our source code.
  3. Third, as a biological researcher, the R program is our most use experiment data statics software, and when we develop a .NET program interact with R on LINUX/Windows via RDotNET, the csv document is the best choice for the data exchange between the R and the tools that I’ve develop. It’s convenient using Visual Basic operate the R to create a complex object in R using RDotNET via exchange the data in csv:
VB.NET
Call DataCollection.SaveTo(CsvPath)
R = R << Push(String.Format("ExperimentData <- read.csv(""{0}"")", CsvPath))
R operation statements, balabalabala....

Image 1

Picture1. The advantages of csv document

 

Important Components in this library

Implementation of Csv file object

CsvFile Object

The CsvFile object instance stands for a CSV file in the file system. Create a csv file object in your code just easy:

Dim CsvFile As Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.File =
    Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.File.Load(Path:=CsvPath)

The Load function in the Csv.File class define shows below:

Public Shared Function Load(Path As String, Optional encoding As System.Text.Encoding = Nothing) As File
   If encoding Is Nothing Then
      encoding = System.Text.Encoding.Default
   End If

   Dim Csv = CType(System.IO.File.ReadAllLines(Path, encoding:=encoding), File)
   Csv.File = Path
   Return Csv
End Function

The save method for save the data in the csv object into the filesystem.

VB.NET
''' <summary>
'''
''' </summary>
''' <param name="Path"></param>
''' <param name="LazySaved">Optional, this is for the consideration of performance and memory consumption.
''' When a data file is very large, then you may encounter a out of memory exception on a 32 bit platform,
''' then you should set this parameter to True to avoid this problem. Defualt is False for have a better
''' performance.
''' (当估计到文件的数据量很大的时候,请使用本参数,以避免内存溢出致使应用程序崩溃,默认为False,不开启缓存)
''' </param>
''' <remarks>当目标保存路径不存在的时候,会自动创建文件夹</remarks>
Public Sub Save(Optional Path As String = "", Optional LazySaved As Boolean = False, Optional encoding As System.Text.Encoding = Nothing)

There is an encoding optional parameter; this is for the consideration of avoid the messy code for Chinese character on different operation system, as an example is that Windows is using ANSI as default but the LINUX is using Unicode as default.

Here is the major data structure definition of the Csv.File class:

 

VB.NET
''' <summary>
''' A comma character seperate table file that can be read and write in the EXCEL.(一个能够被Excel程序所读取的表格文件)
''' </summary>
''' <remarks></remarks>
Public Class File : Implements Generic.IEnumerable(Of Csv.File.Row)
    Implements Generic.IList(Of DocumentFormat.Csv.File.Row)

    ''' <summary>
    ''' First line in the table is the column name definition line.
    ''' </summary>
    ''' <remarks></remarks>
    Protected Friend _InnerListTable As List(Of Row) = New List(Of Row)

And some useful method which are defined in this Csv.File class for data manipulation:

VB.NET
''' <summary>
''' Add a data row collection into this Csv file object instance and then return the total row number after the add operation.
''' (向CSV文件之中批量添加行记录,之后返回当前所打开的文件在添加纪录之后的总行数)
''' </summary>
''' <param name="RowCollection"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Function AppendRange(RowCollection As Generic.IEnumerable(Of Row)) As Long

''' <summary>
''' Get a data row in the specific row number, when a row is not exists in current csv file then the function will return a empty row.
''' (当目标对象不存在的时候,会返回一个空行对象)
''' </summary>
''' <param name="line"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Function Get_DataRow(line As Integer) As Row

''' <summary>
''' Using the content in a specific column as the target for search using a specific keyword, and then return all of the rows that have the query keyword.
''' (以指定的列中的内容搜索关键词,并返回检索成功的行的集合)
''' </summary>
''' <param name="KeyWord"></param>
''' <param name="Column"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Function FindAtColumn(KeyWord As String, Column As Integer) As Row()

''' <summary>
''' Get and set the string content in a specific table cell.(设置或者获取某一个指定的单元格中的字符串内容)
''' </summary>
''' <param name="X"></param>
''' <param name="Y"></param>
''' <value></value>
''' <returns></returns>
''' <remarks></remarks>
Public Property Cell(X As Integer, Y As Integer) As String

''' <summary>
''' Delete all of the row that meet the delete condition.(将所有满足条件的行进行删除)
''' </summary>
''' <param name="Condition"></param>
''' <remarks></remarks>
Public Function Remove(Condition As Func(Of Row, Boolean)) As Row()

''' <summary>
''' Remove the item in a specific row collection.
''' </summary>
''' <param name="RowCollection"></param>
''' <remarks></remarks>
Public Sub RemoveRange(RowCollection As Generic.IEnumerable(Of Row))

''' <summary>
''' Insert a new empty line of row data before the specific row number.(在指定列标号的列之前插入一行空列)
''' </summary>
''' <param name="column"></param>
''' <remarks></remarks>
Public Function InsertEmptyColumnBefore(column As Integer) As Integer

Row Object

VB.NET
''' <summary>
''' A line of data in the csv file.(Csv表格文件之中的一行)
''' </summary>
''' <remarks></remarks>
Public Class Row : Implements Generic.IEnumerable(Of String)
    Implements Generic.IList(Of System.String)

    ''' <summary>
    ''' 本行对象中的所有的单元格的数据集合
    ''' </summary>
    ''' <remarks></remarks>
    Protected Friend _InnerColumnList As List(Of String) = New List(Of String)

How to parse the data string into a row object

As we can see, the row object is major consist of a string collection, so that we just need to split the string into a string collection, and a magical regular expression makes this work easy:

VB.NET
''' <summary>
''' A regex expression string that use for split the line text.
''' </summary>
''' <remarks></remarks>
Protected Friend Const SplitRegxExpression As String = "[" & vbTab & ",](?=(?:[^""]|""[^""]*"")*$)"

And the detail of the code implementation of the parsing job:

VB.NET
''' <summary>
''' Row parsing into column tokens
''' </summary>
''' <param name="Line"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Shared Widening Operator CType(Line As String) As Row
    Dim Row = Regex.Split(Line, SplitRegxExpression)

    For i As Integer = 0 To Row.Count - 1
        If Not String.IsNullOrEmpty(Row(i)) Then
            If Row(i).First = """"c AndAlso Row(i).Last = """"c Then
                Row(i) = Mid(Row(i), 2, Len(Row(i)) - 2)
            End If
        End If
    Next

    Return Row
End Operator

And some major method for the row data manipulation:

VB.NET
''' <summary>
''' Get the cell data in a specific column number. if the column is not exists in this row then will return a empty string.
''' (获取某一列中的数据,若该列不存在则返回空字符串)
''' </summary>
''' <param name="Index"></param>
''' <returns></returns>
''' <remarks></remarks>
Default Public Property Column(Index As Integer) As String Implements IList(Of String).Item

''' <summary>
''' is this row object contains any data?
''' </summary>
''' <value></value>
''' <returns></returns>
''' <remarks></remarks>
Public ReadOnly Property IsNullOrEmpty As Boolean

''' <summary>
''' insert the data into a spercific column 
''' </summary>
''' <param name="value"></param>
''' <param name="column"></param>
''' <returns>仅为LINQ查询使用的一个无意义的值</returns>
''' <remarks></remarks>
Public Function InsertAt(value As String, column As Integer) As Integer

''' <summary>
''' Takes the data in the specific number of columns, if columns is not exists in this row object, then a part of returned data will be the empty string.
''' </summary>
''' <param name="Count"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Function Takes(Count As Integer) As String()

''' <summary>
''' Takes the data in the specific column index collection, if the column is not exists in the row object, then a part of the returned data will be the empty string.
''' </summary>
''' <param name="Cols"></param>
''' <param name="retNullable">(当不存在数据的时候是否返回空字符串,默认返回空字符串)</param>
''' <returns></returns>
''' <remarks></remarks>
Public Function Takes(Cols As Integer(), Optional retNullable As Boolean = True) As String()
Public Function AddRange(values As Generic.IEnumerable(Of String)) As Integer
Public Sub Add(columnValue As String) Implements ICollection(Of String).Add

Imports operations

As the csv file format is using the comma character to separate the cell data, but some text file is not using the comma as delimiter, it may use TAB or several space characters, then we can imports the data in the target text file to create a csv file object, just like the Excel application data imports operation does.

There are two types of data imports operation are define in the DataImports module:

VB.NET
''' <summary>
''' Imports the data in a well formatted text file using a specific delimiter, default delimiter is comma character.
''' </summary>
''' <param name="txtPath">The file path for the data imports text file.(将要进行数据导入的文本文件)</param>
''' <returns></returns>
''' <remarks></remarks>
Public Function [Imports](txtPath As String, Optional delimiter As Char = ","c, Optional encoding As System.Text.Encoding = Nothing) As Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.File
    Dim Expression As String = String.Format(SplitRegxExpression, delimiter)

    If encoding Is Nothing Then
        encoding = System.Text.Encoding.Default
    End If

    Dim Lines As String() = IO.File.ReadAllLines(txtPath, encoding)
    Dim LQuery = (From line As String In Lines Select RowParsing(line, Expression)).ToArray

    Return LQuery
End Function

''' <summary>
''' Imports the data in a well formatted text file using the fix length as the data separate method.
''' </summary>
''' <param name="txtPath"></param>
''' <param name="length">固定的列字符数的宽度</param>
''' <returns></returns>
''' <remarks></remarks>
Public Function FixLengthImports(txtPath As String, Optional length As Integer = 10, Optional encoding As System.Text.Encoding = Nothing) As Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.File
    If encoding Is Nothing Then
        encoding = System.Text.Encoding.Default
    End If

    Dim Lines As String() = IO.File.ReadAllLines(txtPath, encoding)
    Dim LQuery = (From line As String In Lines Select RowParsing(line, length:=length)).ToArray
    Return LQuery
End Function

The amazing Reflection Wrapper

This part of my job in this library is the most powerful and amazing works! It makes my coding job happier and I believe it will do the same on you! Let’s see an example of code on it! All of these wrapper operations are define in the namespace Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection

How it works?

The reflection wrapper operations are based on the reflection operation in .NET. There three step to implement this function:

Image 2

Picture2. Workflow for the wrapper operation

First, Create Data Schema

Required custom attribute:

VB.NET
Namespace Csv.Reflection

    <AttributeUsage(AttributeTargets.Property, allowmultiple:=True, inherited:=False)>
    Public Class ColumnAttribute : Inherits Attribute

        Protected Friend _ElementDataType As Type = Type.String
        Protected Friend _Name As String
        Protected Friend _bindProperty As System.Reflection.PropertyInfo

        Sub New(Name As String, Optional Type As Type = Type.String)

    ''' <summary>
    ''' 并不建议使用本Csv属性来储存大量的文本字符串,极容易出错
    ''' </summary>
    ''' <remarks></remarks>
    <AttributeUsage(AttributeTargets.Property, allowmultiple:=True, inherited:=False)>
    Public Class ArrayColumn : Inherits Attribute

        Protected Friend _Delimiter As String
        Protected Friend _ElementDataType As ColumnAttribute.Type
        Protected Friend _Name As String
        Protected Friend _bindProperty As System.Reflection.PropertyInfo

        Sub New(Name As String, Optional Delimiter As String = "; ", Optional ElementDataType As ColumnAttribute.Type = ColumnAttribute.Type.String)

Tow function for creating the data schema:

VB.NET
Private Shared Function GetArrayColumns(ItemTypeProperties As System.Reflection.PropertyInfo()) As Reflection.ArrayColumn()
    Dim LQuery = (From [PropertyInfo] As System.Reflection.PropertyInfo
                  In ItemTypeProperties
                  Let attrs As Object() = [PropertyInfo].GetCustomAttributes(Reflection.ArrayColumn.TypeInfo, inherit:=False)
                  Where Not attrs.IsNullOrEmpty
                  Select DirectCast(attrs.First, Reflection.ArrayColumn).Bind(PropertyInfo)).ToArray
    Return LQuery
End Function

Private Shared Function GetColumns(Type As System.Type) As Reflection.ColumnAttribute()
    Dim ItemTypeProperties = Type.GetProperties(BindingFlags.Public Or BindingFlags.Instance)
    Dim [Property] As PropertyInfo
    Dim ColumnList As List(Of Reflection.ColumnAttribute) = New List(Of ColumnAttribute)

    For i As Integer = 0 To ItemTypeProperties.Length - 1
        [Property] = ItemTypeProperties(i)
        Dim ColumnAttrList = (From attr In [Property].GetCustomAttributes(ColumnAttribute.TypeInfo, inherit:=False) Select DirectCast(attr, ColumnAttribute)).ToArray
 
        If Not ColumnAttrList.IsNullOrEmpty Then
            For idx As Integer = 0 To ColumnAttrList.Count - 1
                Dim columnAttr = ColumnAttrList(idx)
                If Not Len(columnAttr._Name) > 0 Then
                    columnAttr._Name = [Property].Name
                End If

                ColumnList.Add(columnAttr.Bind([Property]))
            Next
        End If
     Next

     Return ColumnList.ToArray
End Function

As just some basic data type can be convert into string or convert string into the target type directly, then I define a type enums to specific the dat type of target property, and this code make the program performance better than we directly get the type information from the property using reflection.

VB.NET
Public Enum Type
     [Object] = -1
     [Integer] = 0
     [Long]
     [String]
     [Double]
     [DateTime]
     [Bool]
End Enum

Secondary, Data Convection

After we create a data schema from the reflection operation above, then we are able to make a data type convention between the data object and the row object.

Here is the data convert method:

VB.NET
Friend Shared ReadOnly CTypers As System.Func(Of String, Object)() = {
      Function(s As String) CInt(Val(s)),
      Function(s As String) CType(Val(s), Long),
      Function(s As String) s,
      Function(s As String) Val(s),
      Function(s As String) CType(s, DateTime),
      Function(s As String) CType(s, Boolean)}

Public Shared Function Convert(s As String, DataType As Type) As Object
      Return CTypers(DataType)(s)
End Function

There is another problem that the wrapper operation can convert a data array object, so that I create this method for array convert into a row string and convert the array string into an object array:

VB.NET
Public Function CreateObject(cellData As String) As String()
    Dim Tokens As String() = Split(cellData, _Delimiter)
    Return Tokens
End Function

Private ReadOnly Property DelimiterLength As Integer
    Get
        Return Len(_Delimiter)
    End Get
End Property

Public Function CreateObject(Of T)(DataCollection As Generic.IEnumerable(Of T)) As String
    Dim Type As System.Type = GetType(T)
    Dim sBuilder As StringBuilder = New StringBuilder(1024)

    If Not (Type.IsValueType OrElse Type = GetType(String)) Then
        Call Console.WriteLine("[WARNNING] DataType ""{0}"" is not a validated value type, trying to get string data from its Object.ToString() method!", Type.FullName)
    End If

    Dim StringCollection As String() = (From item In DataCollection Let _create = Function() item.ToString Select _create()).ToArray

    For Each item In StringCollection
        Call sBuilder.AppendFormat("{0}{1}", item, _Delimiter)
    Next

    Call sBuilder.Remove(sBuilder.Length - DelimiterLength, DelimiterLength)

    Return sBuilder.ToString
End Function

There is a problem that we passing the collection data into the Property.SetValue method, as we parse a string array load from the text file but the data type of the target property may be a Double array,then If we directly pass this object collection parameter into the Property.SetValue method then we will get a data type miss match exception, so that I using a lambda expression to make the thing work.

VB.NET
Protected Friend Shared ReadOnly FillObjects As Dictionary(Of Reflection.ColumnAttribute.Type, System.Action(Of PropertyInfo, Object, String())) =
    New Dictionary(Of ColumnAttribute.Type, Action(Of PropertyInfo, Object, String())) 
    From {
         {ColumnAttribute.Type.Bool, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
              Call bindProperty.SetValue(FilledObject, (From strData As String In strValues Select CType(strData, Boolean)).ToArray, Nothing)},
         {ColumnAttribute.Type.DateTime, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
              Call bindProperty.SetValue(FilledObject, (From strData As String In strValues Select CType(strData, DateTime)).ToArray, Nothing)},
         {ColumnAttribute.Type.Double, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
              Call bindProperty.SetValue(FilledObject, (From strData As String In strValues Select CType(strData, Double)).ToArray, Nothing)},
         {ColumnAttribute.Type.Integer, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
              Call bindProperty.SetValue(FilledObject, (From strData As String In strValues Select CType(strData, Integer)).ToArray, Nothing)},
         {ColumnAttribute.Type.Long, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
              Call bindProperty.SetValue(FilledObject, (From strData As String In strValues Select CType(strData, Long)).ToArray, Nothing)},
         {ColumnAttribute.Type.String, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
              Call bindProperty.SetValue(FilledObject, strValues, Nothing)}}

All of the wrapper operation code can be found at Csv.Reflection.Reflector class:

VB.NET
''' <summary>
''' Method for load a csv data file into a specific type of object collection.
''' </summary>
''' <typeparam name="ItemType"></typeparam>
''' <param name="Explicit">当本参数值为False的时候,所有的简单属性值都将被解析出来,而忽略掉其是否带有<see cref="Csv.Reflection.ColumnAttribute"></see>自定义属性</param>
''' <param name="Path"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Shared Function Load(Of ItemType)(Path As String, Optional Explicit As Boolean = True, Optional encoding As System.Text.Encoding = Nothing) As List(Of ItemType)

''' <summary>
''' Save the specifc type object collection into the csv data file
''' </summary>
''' <typeparam name="ItemType"></typeparam>
''' <param name="Collection"></param>
''' <param name="Explicit"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Shared Function Save(Of ItemType)(Collection As Generic.IEnumerable(Of ItemType), Optional Explicit As Boolean = True) As Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.File

Those two methods above have a optional parameter Explicit, which means True for just parse the property have the column custom attribute and False for parse all property even they have no column custom attribute, I just want makes thing better and easier control. This is just like that when you do the xml serialization all of the property without xml serialization custom attribute also can be serialize into the xml file.

And I have created two extension methods for makes the IO operation more easily:

VB.NET
''' <summary>
''' Load a csv data file document using a specific object type.(将某一个Csv数据文件加载仅一个特定类型的对象集合中)
''' </summary>
''' <typeparam name="T"></typeparam>
''' <param name="Path"></param>
''' <param name="explicit"></param>
''' <param name="encoding"></param>
''' <returns></returns>
''' <remarks></remarks>
<Extension> Public Function LoadCsv(Of T)(Path As String, Optional explicit As Boolean = True, Optional encoding As System.Text.Encoding = Nothing) As List(Of T)
    Return Csv.Reflection.Reflector.Load(Of T)(Path, explicit, encoding)
End Function

<Extension> Public Sub SaveTo(Of T)(Collection As Generic.IEnumerable(Of T), path As String, Optional explicit As Boolean = True, Optional encoding As System.Text.Encoding = Nothing)
    Call Csv.Reflection.Reflector.Save(Collection, explicit).Save(path, False, encoding)
End Sub

 

A Simple Example: How to use?

If we have a collection of experiment data in such data structure, and after we preprocessing in our .NET program and then passing this data package to the R program and create a complex structure object in R, how we goanna do? The most amazing thing is using the reflection wrapper in this library to complete this job:

As we have such an example data structure in the .NET program, and the custom attribute have already define on the target property:

VB.NET
Public Class ExampleExperimentData
    Public Property Id As String
    <Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection.Column("RPKM", DataVisualization.DocumentFormat.Csv.Reflection.ColumnAttribute.Type.Double)>
    Public Property ExpressionRPKM As Double
    <Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection.ArrayColumn("tags")>
    Public Property Tags As String()
End Class

Then the target csv data file were load into a specific type of data collection object! Things just simple and quite smart!

VB.NET
Dim CsvPath As String = "csv/data/file/path"
Dim DataCollection = CsvPath.LoadCsv(Of ExampleExperimentData)(explicit:=True)

You see, As the same as read data, save the data in the data collection object is simple too!

Call DataCollection.SaveTo(CsvPath, explicit:=True)

 

A simple test example

VB.NET
Imports Microsoft.VisualBasic.DataVisualization.DocumentFormat.Extensions

Module TestMain
    Sub Main()
        Dim DataCollection As ExampleExperimentData() = New ExampleExperimentData() {
            New ExampleExperimentData With {.Id = "GeneId_0001", .ExpressionRPKM = 0, .Tags = New String() {"Up", "Regulator"}},
            New ExampleExperimentData With {.Id = "GeneId_0002", .ExpressionRPKM = 1, .Tags = New String() {"Up", "PathwayA"}},
            New ExampleExperimentData With {.Id = "GeneId_0003", .ExpressionRPKM = 2, .Tags = New String() {"Down", "Virulence"}}}

        Dim CsvPath As String = "./TestData.csv"

        Call DataCollection.SaveTo(CsvPath, explicit:=False)

        DataCollection = Nothing
        DataCollection = CsvPath.LoadCsv(Of ExampleExperimentData)(explicit:=False).ToArray

        Dim File = Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection.Reflector.Save(Of ExampleExperimentData)(DataCollection, Explicit:=False)

        For Each row In File
            Call Console.WriteLine(row.ToString)
        Next

        Console.WriteLine("Press any key to continute...")
        Console.Read()
    End Sub

    Public Class ExampleExperimentData
        Public Property Id As String
        <Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection.Column("RPKM", DataVisualization.DocumentFormat.Csv.Reflection.ColumnAttribute.Type.Double)>
        Public Property ExpressionRPKM As Double
        <Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection.ArrayColumn("tags")>
        Public Property Tags As String()
    End Class
End Module

Image 3

Image 4

Picture3. Test example console output and auto generate csv document

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)