Introduction
Where is a general CSV serializer when you need one? Not there, right?
This article builds a general CSV serialization solution.
Design
We would expect such a serializer to deliver the following abilities:
- Serialize lists of a specified Type to a flat CSV file.
- Reconstitute objects from the same (or a similarly constructed) file.
- Mimic the usage of the .NET XML Serializer, or Binary Formatter, in writing-to and reading-from a stream object.
- Nominate properties to ignore from the target Type to be serialized using a
CsvIgnore
attribute. - Allow the user to specify a delimiter character other than the default comma character.
Concerns
The following are concerns that represent risks and may not be apparent at the outset:
- The separator character may appear in the value of a property being serialized. This will cause the resulting row to appear to have more columns than expected.
- Likewise, a NewLine may appear in the value of a property. This will cause the resulting row to span multiple rows.
Both of these problems will render the row (if not the entire CSV file) unreadable, and so must be dealt with.
We solve both these problems with the same approach: replace the offending character sequence with some User-Specified string. On serialization each
occurrence of the problem characters is to be replaced with a dummy character, and then on deserialization, the original character replaces the dummy characters, and the original string is restored.
The user needs to be aware that in each of these cases the replacement string must not possibly appear in the normal data to be serialized. Otherwise deserialization will replace valid occurrences of the replacement string with either the Separator or NewLine characters.
CSV Serialization
Given that CSV is a flat-file structure, we do not have to consider deep object serialization. Thus, associations will be rendered as a single text value.
The CSV structure also implies that the data to be serialized should be a list of objects, so that the Serialization method is to receive an
IList
of the target Type. The Deserialization method will return such a list.
The Class
From Design Requirements 1, 2, 3, and 5, and the discussion so far, we can define some of the main requirements for the class.
The class will:
- Expose a delimiter character (comma by default).
- Expose a replacement string
- Reference the target Type to be serialized/deserialized by the generic parameter
T
. - Build and maintain a list of
PropertyInfo
for the properties to be serialized/deserialized. - Expose a Serialization method that accepts a
Stream
object and an
IList
of data. - Expose a Deserialization method that accepts a
Stream
object and returns an
IList
of data.
We can now define this rough outline as follows:
public class CsvSerializer<T> where T : class, new()
{
public char Separator { get; set; }
public string Replacement { get; set; }
private List<PropertyInfo> _properties;
public void Serialize(Stream stream, IList<T> data) { }
public IList<T> Deserialize(Stream stream) { }
.
.
.
}
List of PropertyInfo
The first step carried-out by the class will be to assemble a list of properties that are to be rendered to CSV (in the form of a
List
of PropertyInfo
). This step will be executed on construction.
This list of properties is used for three purposes:
- To create the CSV header.
- To get instance values by reflection on serialization.
- To set instance values by reflection on deserialization.
public CsvSerializer()
{
var type = typeof(T);
var properties = type.GetProperties(BindingFlags.Public | BindingFlags.Instance
| BindingFlags.GetProperty | BindingFlags.SetProperty);
_properties = (from a in properties
where a.GetCustomAttribute<CsvIgnoreAttribute>() == null
orderby a.Name
select a).ToList();
}
CsvIgnoreAttribute
As with XmlSerializer
, we want to be able to nominate properties to exclude from serialization by decorating the property with a
CsvIgnore
attribute.
The CsvIgnore
attribute is defined simply as follows:
public class CsvIgnoreAttribute : Attribute { }
CSV Header
The header row for the file will be a comma-separated string of the names of the properties. The list of properties will already be sorted by name, so we can be confident that the column order will match the order of values when it comes time to render each row.
We create a private GetHeader
method that returns the header row as a string:
private string GetHeader()
{
var columns = Properties.Select(a => a.Name).ToArray();
var header = string.Join(Separator.ToString(), columns);
return header;
}
The Serialization Method
Once the CSV header row is created, the serializer iterates through the enumerable list of data items, and for each item retrieves the values for all properties in the list of PropertyInfo
(discussed above). The values are added to a string array which is finally joined by the
separator character.
The serialization method itself accepts a stream object and an IList
of objects of Type
T
.
After iterating over the list and generating each row, it uses a stream writer to write the final CSV text to the stream object.
public void Serialize(Stream stream, IList<T> data)
{
var sb = new StringBuilder();
var values = new List<string>();
sb.AppendLine(GetHeader());
var row = 1;
foreach (var item in data)
{
values.Clear();
foreach (var p in _properties)
{
var raw = p.GetValue(item);
var value = raw == null ?
"" :
raw.ToString().Replace(Separator.ToString(), Replacement);
values.Add(value);
}
sb.AppendLine(string.Join(Separator.ToString(), values.ToArray()));
}
using (var sw = new StreamWriter(stream))
{
sw.Write(sb.ToString().Trim());
}
}
The Deserialization Method
The deserialization method accepts a stream object that represents a CSV text file.
We read the first line as the CSV header row. The header is split with the
separator character, then stored in a columns array, and later indexed to reference the
properties by name.
The remaining text is split by NewLine and stored in 'rows'. Each row is then split using the
separator character. The resulting parts then represent the values to be deserialized to a new instance of the Type
T
. The order of these values will appear in the same order as the columns array so that we can retrieve the column name by index, and use the column name to retrieve the corresponding
PropertyInfo
from the PropertyInfo
list.
Each string part can then be converted to the correct value by using the .NET
TypeConverter
for the property and stored in the new instance of the target object using the
PropertyInfo
SetValue
method.
public IList<T> Deserialize(Stream stream)
{
string[] columns;
string[] rows;
try
{
using (var sr = new StreamReader(stream))
{
columns = sr.ReadLine().Split(Separator);
rows = sr.ReadToEnd().Split(new string[] { Environment.NewLine }, StringSplitOptions.None);
}
}
catch (Exception ex)
{
throw new InvalidCsvFormatException(
"The CSV File is Invalid. See Inner Exception for more inoformation.", ex);
}
var data = new List<T>();
for (int row = 0; row < rows.Length; row++)
{
var line = rows[row];
if (string.IsNullOrWhiteSpace(line))
{
throw new InvalidCsvFormatException(string.Format(
@"Error: Empty line at line number: {0}", row));
}
var parts = line.Split(Separator);
var datum = new T();
for (int i = 0; i < parts.Length; i++)
{
var value = parts[i];
var column = columns[i];
value = value.Replace(Replacement, Separator.ToString());
var p = _properties.First(a => a.Name == column);
var converter = TypeDescriptor.GetConverter(p.PropertyType);
var convertedvalue = converter.ConvertFrom(value);
p.SetValue(datum, convertedvalue);
}
data.Add(datum);
}
return data;
}
Further Considerations and Enhancements
Here we discuss some useful settings that can be used to customize serialization for specific situations.
Syntactical and Functional Options
Separator
The CSV Separator character
default: comma Replacement
Replacement string if the Separator character appears in a field value.
default:
(char)255
NewLineReplacement
Replacement string if a NewLine appears in a field value
default:
(char)254
UseLineNumbers
An additional column can be inserted into the CSV for Row Number
default:
true
UseEofLiteral
An EOF literal can be used to mark the end-of-file
default:
false
IgnoreEmptyLines
If not true then exceptions will be thrown on encountering blank lines during deserialization
default:
true
IgnoreReferenceTypesExceptString
Exclude reference type properties other than string.
(default:
true
). RowNumberColumnTitle
Title of the Row-Number column if
UseLineNumbers
is true
default: "RowNumber
"UseTextQualifier
Place double inverted commas around all values. Default is false.
The Separator
and Replacement
options are already considered by exposing the
Separator
and Replacement
properties, so that the consumer is able to use any other character for the delimiter and set the
replacement string.
It is important that the consumer understand the range of values for the types being serialized such that they specify a
replacement string that cannot occur in the data range, otherwise valid instances of the
replacement string appearing in normal data will be erroneously converted to the separator character on deserialization.
RowNumberColumnTitle
will work in conjunction with the UseLineNumbers
option and introduce an additional first-column into the CSV file containing the row number.
It was found that CsvIgnore
was consistently applied to Reference Types such that Reference Types make little sense in most CSV serialization contexts. This statement does not apply to the
String
type however. So it appeared to be sensible to include this property to allow the automatic exclusion of all Reference Types other than
String
.
It was specified in the requirements that prompted the creating of this class, that the last line of the CSV file contain the string literal "EOF". This is not
a common CSV practice and so an option is introduced to allow for this possibility. The implementation simply adds
an additional line to the CSV text during serialization. And allows for the possibility of an EOF line during deserialization.
The IgnoreEmptyLines
option allows the deserializer to ignore empty or corrupt lines of text. Otherwise an
InvalidCsvFormatException
is thrown.
Usage
As with the XmlSerializer
, it is a requirement that the target Type exposes a default paramaterless constructor.
Usage is then a two stage process. Firstly, create a CsvSerializer
instance of the target Type. Then secondly, call either the Serialization or Deserialization methods with a valid stream object.
The Serialization method will of course require the list of items to be serialized:
var data = GetPeople();
using (var stream = new FileStream("persons.csv", FileMode.Create, FileAccess.Write))
{
var cs = new CsvSerializer<Person>();
cs.Serialize(stream, data);
}
While the Deserialization method will return a list of objects of the target Type:
IList<Person> data = null;
using (var stream = new FileStream("persons.csv", FileMode.Open, FileAccess.Read))
{
var cs = new CsvSerializer<Person>();
data = cs.Deserialize(stream);
}
Note
Although the code provided here is geared towards a general solution it was prompted by the criteria for a particular use-case.
Please test and modify the code for other particular cases not considered
here.
Further Work
Here is a list of ideas indicating possible further work and extension.
- Performance: the solution may not be sufficiently optimized for speed so that further work will be required to address any performance issues.
ColumnAttribute
: Introduce a Col
umn attribute to name columns other than by
property names. IgnoreList
: Rather than decorate properties with the
CsvIgnore
attribute, it may be desirable, rather, to reference a list of property names to ignore.- Use a character array to specify a list of separator characters for deserialization where the stream may contain various delimiters.
Updates
- Add
UseTextQualifier
Property, by user Gogowater.