Download DocumentFormat.Csv-noexe.zip
Download DocumentFormat.Csv.zip
Why chose csv?
- First, csv document just a formatted plant text file that can be easily read and write by any program.
- Secondly, if we just want a data file but not a well formatted Excel file with the extract style data, then the csv document is the best choice for the cross-platform software as the MS Office Excel will not works on LINUX and the .NET program on LINUX mono environment is not easily using the Excel library C++ interface from the OpenOffice without any modification of our source code.
- Third, as a biological researcher, the R program is our most use experiment data statics software, and when we develop a .NET program interact with R on LINUX/Windows via RDotNET, the csv document is the best choice for the data exchange between the R and the tools that I’ve develop. It’s convenient using Visual Basic operate the R to create a complex object in R using RDotNET via exchange the data in csv:
Call DataCollection.SaveTo(CsvPath)
R = R << Push(String.Format("ExperimentData <- read.csv(""{0}"")", CsvPath))
R operation statements, balabalabala....
Picture1. The advantages of csv document
Important Components in this library
Implementation of Csv file object
CsvFile Object
The CsvFile object instance stands for a CSV file in the file system. Create a csv file object in your code just easy:
Dim CsvFile As Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.File =
Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.File.Load(Path:=CsvPath)
The Load function in the Csv.File class define shows below:
Public Shared Function Load(Path As String, Optional encoding As System.Text.Encoding = Nothing) As File
If encoding Is Nothing Then
encoding = System.Text.Encoding.Default
End If
Dim Csv = CType(System.IO.File.ReadAllLines(Path, encoding:=encoding), File)
Csv.File = Path
Return Csv
End Function
The save method for save the data in the csv object into the filesystem.
Public Sub Save(Optional Path As String = "", Optional LazySaved As Boolean = False, Optional encoding As System.Text.Encoding = Nothing)
There is an encoding optional parameter; this is for the consideration of avoid the messy code for Chinese character on different operation system, as an example is that Windows is using ANSI as default but the LINUX is using Unicode as default.
Here is the major data structure definition of the Csv.File class:
Public Class File : Implements Generic.IEnumerable(Of Csv.File.Row)
Implements Generic.IList(Of DocumentFormat.Csv.File.Row)
Protected Friend _InnerListTable As List(Of Row) = New List(Of Row)
And some useful method which are defined in this Csv.File class for data manipulation:
Public Function AppendRange(RowCollection As Generic.IEnumerable(Of Row)) As Long
Public Function Get_DataRow(line As Integer) As Row
Public Function FindAtColumn(KeyWord As String, Column As Integer) As Row()
Public Property Cell(X As Integer, Y As Integer) As String
Public Function Remove(Condition As Func(Of Row, Boolean)) As Row()
Public Sub RemoveRange(RowCollection As Generic.IEnumerable(Of Row))
Public Function InsertEmptyColumnBefore(column As Integer) As Integer
Row Object
Public Class Row : Implements Generic.IEnumerable(Of String)
Implements Generic.IList(Of System.String)
Protected Friend _InnerColumnList As List(Of String) = New List(Of String)
How to parse the data string into a row object
As we can see, the row object is major consist of a string collection, so that we just need to split the string into a string collection, and a magical regular expression makes this work easy:
Protected Friend Const SplitRegxExpression As String = "[" & vbTab & ",](?=(?:[^""]|""[^""]*"")*$)"
And the detail of the code implementation of the parsing job:
Public Shared Widening Operator CType(Line As String) As Row
Dim Row = Regex.Split(Line, SplitRegxExpression)
For i As Integer = 0 To Row.Count - 1
If Not String.IsNullOrEmpty(Row(i)) Then
If Row(i).First = """"c AndAlso Row(i).Last = """"c Then
Row(i) = Mid(Row(i), 2, Len(Row(i)) - 2)
End If
End If
Next
Return Row
End Operator
And some major method for the row data manipulation:
Default Public Property Column(Index As Integer) As String Implements IList(Of String).Item
Public ReadOnly Property IsNullOrEmpty As Boolean
Public Function InsertAt(value As String, column As Integer) As Integer
Public Function Takes(Count As Integer) As String()
Public Function Takes(Cols As Integer(), Optional retNullable As Boolean = True) As String()
Public Function AddRange(values As Generic.IEnumerable(Of String)) As Integer
Public Sub Add(columnValue As String) Implements ICollection(Of String).Add
Imports operations
As the csv file format is using the comma character to separate the cell data, but some text file is not using the comma as delimiter, it may use TAB or several space characters, then we can imports the data in the target text file to create a csv file object, just like the Excel application data imports operation does.
There are two types of data imports operation are define in the DataImports module:
Public Function [Imports](txtPath As String, Optional delimiter As Char = ","c, Optional encoding As System.Text.Encoding = Nothing) As Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.File
Dim Expression As String = String.Format(SplitRegxExpression, delimiter)
If encoding Is Nothing Then
encoding = System.Text.Encoding.Default
End If
Dim Lines As String() = IO.File.ReadAllLines(txtPath, encoding)
Dim LQuery = (From line As String In Lines Select RowParsing(line, Expression)).ToArray
Return LQuery
End Function
Public Function FixLengthImports(txtPath As String, Optional length As Integer = 10, Optional encoding As System.Text.Encoding = Nothing) As Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.File
If encoding Is Nothing Then
encoding = System.Text.Encoding.Default
End If
Dim Lines As String() = IO.File.ReadAllLines(txtPath, encoding)
Dim LQuery = (From line As String In Lines Select RowParsing(line, length:=length)).ToArray
Return LQuery
End Function
The amazing Reflection Wrapper
This part of my job in this library is the most powerful and amazing works! It makes my coding job happier and I believe it will do the same on you! Let’s see an example of code on it! All of these wrapper operations are define in the namespace Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection
How it works?
The reflection wrapper operations are based on the reflection operation in .NET. There three step to implement this function:
Picture2. Workflow for the wrapper operation
First, Create Data Schema
Required custom attribute:
Namespace Csv.Reflection
<AttributeUsage(AttributeTargets.Property, allowmultiple:=True, inherited:=False)>
Public Class ColumnAttribute : Inherits Attribute
Protected Friend _ElementDataType As Type = Type.String
Protected Friend _Name As String
Protected Friend _bindProperty As System.Reflection.PropertyInfo
Sub New(Name As String, Optional Type As Type = Type.String)
<AttributeUsage(AttributeTargets.Property, allowmultiple:=True, inherited:=False)>
Public Class ArrayColumn : Inherits Attribute
Protected Friend _Delimiter As String
Protected Friend _ElementDataType As ColumnAttribute.Type
Protected Friend _Name As String
Protected Friend _bindProperty As System.Reflection.PropertyInfo
Sub New(Name As String, Optional Delimiter As String = "; ", Optional ElementDataType As ColumnAttribute.Type = ColumnAttribute.Type.String)
Tow function for creating the data schema:
Private Shared Function GetArrayColumns(ItemTypeProperties As System.Reflection.PropertyInfo()) As Reflection.ArrayColumn()
Dim LQuery = (From [PropertyInfo] As System.Reflection.PropertyInfo
In ItemTypeProperties
Let attrs As Object() = [PropertyInfo].GetCustomAttributes(Reflection.ArrayColumn.TypeInfo, inherit:=False)
Where Not attrs.IsNullOrEmpty
Select DirectCast(attrs.First, Reflection.ArrayColumn).Bind(PropertyInfo)).ToArray
Return LQuery
End Function
Private Shared Function GetColumns(Type As System.Type) As Reflection.ColumnAttribute()
Dim ItemTypeProperties = Type.GetProperties(BindingFlags.Public Or BindingFlags.Instance)
Dim [Property] As PropertyInfo
Dim ColumnList As List(Of Reflection.ColumnAttribute) = New List(Of ColumnAttribute)
For i As Integer = 0 To ItemTypeProperties.Length - 1
[Property] = ItemTypeProperties(i)
Dim ColumnAttrList = (From attr In [Property].GetCustomAttributes(ColumnAttribute.TypeInfo, inherit:=False) Select DirectCast(attr, ColumnAttribute)).ToArray
If Not ColumnAttrList.IsNullOrEmpty Then
For idx As Integer = 0 To ColumnAttrList.Count - 1
Dim columnAttr = ColumnAttrList(idx)
If Not Len(columnAttr._Name) > 0 Then
columnAttr._Name = [Property].Name
End If
ColumnList.Add(columnAttr.Bind([Property]))
Next
End If
Next
Return ColumnList.ToArray
End Function
As just some basic data type can be convert into string or convert string into the target type directly, then I define a type enums to specific the dat type of target property, and this code make the program performance better than we directly get the type information from the property using reflection.
Public Enum Type
[Object] = -1
[Integer] = 0
[Long]
[String]
[Double]
[DateTime]
[Bool]
End Enum
Secondary, Data Convection
After we create a data schema from the reflection operation above, then we are able to make a data type convention between the data object and the row object.
Here is the data convert method:
Friend Shared ReadOnly CTypers As System.Func(Of String, Object)() = {
Function(s As String) CInt(Val(s)),
Function(s As String) CType(Val(s), Long),
Function(s As String) s,
Function(s As String) Val(s),
Function(s As String) CType(s, DateTime),
Function(s As String) CType(s, Boolean)}
Public Shared Function Convert(s As String, DataType As Type) As Object
Return CTypers(DataType)(s)
End Function
There is another problem that the wrapper operation can convert a data array object, so that I create this method for array convert into a row string and convert the array string into an object array:
Public Function CreateObject(cellData As String) As String()
Dim Tokens As String() = Split(cellData, _Delimiter)
Return Tokens
End Function
Private ReadOnly Property DelimiterLength As Integer
Get
Return Len(_Delimiter)
End Get
End Property
Public Function CreateObject(Of T)(DataCollection As Generic.IEnumerable(Of T)) As String
Dim Type As System.Type = GetType(T)
Dim sBuilder As StringBuilder = New StringBuilder(1024)
If Not (Type.IsValueType OrElse Type = GetType(String)) Then
Call Console.WriteLine("[WARNNING] DataType ""{0}"" is not a validated value type, trying to get string data from its Object.ToString() method!", Type.FullName)
End If
Dim StringCollection As String() = (From item In DataCollection Let _create = Function() item.ToString Select _create()).ToArray
For Each item In StringCollection
Call sBuilder.AppendFormat("{0}{1}", item, _Delimiter)
Next
Call sBuilder.Remove(sBuilder.Length - DelimiterLength, DelimiterLength)
Return sBuilder.ToString
End Function
There is a problem that we passing the collection data into the Property.SetValue method, as we parse a string array load from the text file but the data type of the target property may be a Double array,then If we directly pass this object collection parameter into the Property.SetValue method then we will get a data type miss match exception, so that I using a lambda expression to make the thing work.
Protected Friend Shared ReadOnly FillObjects As Dictionary(Of Reflection.ColumnAttribute.Type, System.Action(Of PropertyInfo, Object, String())) =
New Dictionary(Of ColumnAttribute.Type, Action(Of PropertyInfo, Object, String()))
From {
{ColumnAttribute.Type.Bool, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
Call bindProperty.SetValue(FilledObject, (From strData As String In strValues Select CType(strData, Boolean)).ToArray, Nothing)},
{ColumnAttribute.Type.DateTime, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
Call bindProperty.SetValue(FilledObject, (From strData As String In strValues Select CType(strData, DateTime)).ToArray, Nothing)},
{ColumnAttribute.Type.Double, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
Call bindProperty.SetValue(FilledObject, (From strData As String In strValues Select CType(strData, Double)).ToArray, Nothing)},
{ColumnAttribute.Type.Integer, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
Call bindProperty.SetValue(FilledObject, (From strData As String In strValues Select CType(strData, Integer)).ToArray, Nothing)},
{ColumnAttribute.Type.Long, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
Call bindProperty.SetValue(FilledObject, (From strData As String In strValues Select CType(strData, Long)).ToArray, Nothing)},
{ColumnAttribute.Type.String, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
Call bindProperty.SetValue(FilledObject, strValues, Nothing)}}
All of the wrapper operation code can be found at Csv.Reflection.Reflector class:
Public Shared Function Load(Of ItemType)(Path As String, Optional Explicit As Boolean = True, Optional encoding As System.Text.Encoding = Nothing) As List(Of ItemType)
Public Shared Function Save(Of ItemType)(Collection As Generic.IEnumerable(Of ItemType), Optional Explicit As Boolean = True) As Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.File
Those two methods above have a optional parameter Explicit, which means True for just parse the property have the column custom attribute and False for parse all property even they have no column custom attribute, I just want makes thing better and easier control. This is just like that when you do the xml serialization all of the property without xml serialization custom attribute also can be serialize into the xml file.
And I have created two extension methods for makes the IO operation more easily:
<Extension> Public Function LoadCsv(Of T)(Path As String, Optional explicit As Boolean = True, Optional encoding As System.Text.Encoding = Nothing) As List(Of T)
Return Csv.Reflection.Reflector.Load(Of T)(Path, explicit, encoding)
End Function
<Extension> Public Sub SaveTo(Of T)(Collection As Generic.IEnumerable(Of T), path As String, Optional explicit As Boolean = True, Optional encoding As System.Text.Encoding = Nothing)
Call Csv.Reflection.Reflector.Save(Collection, explicit).Save(path, False, encoding)
End Sub
A Simple Example: How to use?
If we have a collection of experiment data in such data structure, and after we preprocessing in our .NET program and then passing this data package to the R program and create a complex structure object in R, how we goanna do? The most amazing thing is using the reflection wrapper in this library to complete this job:
As we have such an example data structure in the .NET program, and the custom attribute have already define on the target property:
Public Class ExampleExperimentData
Public Property Id As String
<Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection.Column("RPKM", DataVisualization.DocumentFormat.Csv.Reflection.ColumnAttribute.Type.Double)>
Public Property ExpressionRPKM As Double
<Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection.ArrayColumn("tags")>
Public Property Tags As String()
End Class
Then the target csv data file were load into a specific type of data collection object! Things just simple and quite smart!
Dim CsvPath As String = "csv/data/file/path"
Dim DataCollection = CsvPath.LoadCsv(Of ExampleExperimentData)(explicit:=True)
You see, As the same as read data, save the data in the data collection object is simple too!
Call DataCollection.SaveTo(CsvPath, explicit:=True)
A simple test example
Imports Microsoft.VisualBasic.DataVisualization.DocumentFormat.Extensions
Module TestMain
Sub Main()
Dim DataCollection As ExampleExperimentData() = New ExampleExperimentData() {
New ExampleExperimentData With {.Id = "GeneId_0001", .ExpressionRPKM = 0, .Tags = New String() {"Up", "Regulator"}},
New ExampleExperimentData With {.Id = "GeneId_0002", .ExpressionRPKM = 1, .Tags = New String() {"Up", "PathwayA"}},
New ExampleExperimentData With {.Id = "GeneId_0003", .ExpressionRPKM = 2, .Tags = New String() {"Down", "Virulence"}}}
Dim CsvPath As String = "./TestData.csv"
Call DataCollection.SaveTo(CsvPath, explicit:=False)
DataCollection = Nothing
DataCollection = CsvPath.LoadCsv(Of ExampleExperimentData)(explicit:=False).ToArray
Dim File = Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection.Reflector.Save(Of ExampleExperimentData)(DataCollection, Explicit:=False)
For Each row In File
Call Console.WriteLine(row.ToString)
Next
Console.WriteLine("Press any key to continute...")
Console.Read()
End Sub
Public Class ExampleExperimentData
Public Property Id As String
<Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection.Column("RPKM", DataVisualization.DocumentFormat.Csv.Reflection.ColumnAttribute.Type.Double)>
Public Property ExpressionRPKM As Double
<Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection.ArrayColumn("tags")>
Public Property Tags As String()
End Class
End Module
Picture3. Test example console output and auto generate csv document