This article is about how to read a large CSV (or any character separated values) file chunk by chunk, and populate a DataTable or an entity list representing each chunk.
Background
Imagine a scenario where we have to process or store the contents of a large character separated values file in a database. I had a large .CSV file with 9-12 million rows; the file size was around 700-800 MB. Initially, I tried GenericParser, CsvHelper and a few other things, but ended up with an out-of-memory error or a very slow solution. So I made a parser of my own to chunk the data as DataTable. A few days later, I also had to validate the source data, so I had to change the parser to chunk the data as List<TSource>.
Today, I am going to share both solutions:
- read data as mapped entities
- read data as DataTable
Aren't There Other Good Libraries?
Yes, there are some libraries. Most of them load the entire file into a list/DataTable; a few chunk the data, but they lack validation features. If we intend to log data validation errors and error line details to a separate file/DB during processing, we either will not be able to do it or will need to maintain two different processes.
Why Do We Need This?
- Uploading large data file content as a DataTable to a database using SqlBulkCopy (a sketch appears in the DataTable section below)
- Uploading large data file content as an entity list List<TSource> to a database using Entity Framework
- Validating large file content
Here, we are going to focus more on the data batching or chunking process.
Good Practices!!!
- In these types of scenarios, use yield return and IEnumerable<TSource>.
- During insertion to a database, we should maintain a transaction for the entire file.
- For Entity Framework, it is good practice to commit small lists at a time and reinitialize the DB context periodically using the existing connection and transaction (see the sketch after this list).
- Maintain a file process log and error logs, if any. Move a processed file to an archive location after processing.
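To illustrate the Entity Framework point above, here is a minimal sketch of the "commit small lists, keep one transaction" pattern. It assumes EF6 and an illustrative SchoolContext with a Students set and a constructor that passes an existing connection to base(DbConnection, bool); those names are not part of this article's code, and ReaderAsEntity is introduced later in the article.
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    using (var transaction = connection.BeginTransaction())
    {
        foreach (var batch in new ReaderAsEntity(path).Rows<StudentMapper, Student>(1000))
        {
            // a fresh context per batch keeps the change tracker small,
            // while the outer connection and transaction are reused
            using (var db = new SchoolContext(connection, contextOwnsConnection: false))
            {
                db.Database.UseTransaction(transaction);
                db.Students.AddRange(batch);
                db.SaveChanges(); // flush this small batch to the database
            }
        }
        transaction.Commit(); // one transaction covering the entire file
    }
}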
Here, I am using CSV examples, but this will work for any character separated values file.
Core
Here are some of the interfaces to be used in our helper project.
using System.Collections.Generic;
namespace CsvBatch
{
public interface ICsvLine
{
long LineNo { get; }
string Line { get; }
char Splitter { get; }
List<string> LineValues();
void Set(long lineNo, string line, char splitter);
}
public interface ICsvDataMapper : ICsvLine
{
}
public interface ICsvDataMapper<T> : ICsvDataMapper
{
T Map();
bool Map(out T entity, out List<string> errors);
}
}
This class represents a data line of a file.
using System.Collections.Generic;
namespace CsvBatch
{
public class CsvLine : ICsvLine
{
public long LineNo { get; protected set; }
public string Line { get; protected set; }
public char Splitter { get; protected set; }
public List<string> LineValues()
{
var values = string.IsNullOrEmpty(Line) ?
new List<string>() : new List<string>(Line.Split(Splitter));
return values;
}
public void Set(long lineNo, string line, char splitter)
{
LineNo = lineNo;
Line = line;
Splitter = splitter;
}
}
}
This class represents the file.
using System;
using System.Collections.Generic;
using System.Data;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Text;
using System.Threading.Tasks;
namespace CsvBatch
{
public class FileReader
{
public readonly string Path;
public readonly char Splitter;
public readonly bool HasHeaderLine;
private List<string> _header;
private FileReader(char splitter, bool hasHeaderLine)
{
Splitter = splitter;
HasHeaderLine = hasHeaderLine;
}
public FileReader(string path, bool hasHeaderLine, char splitter) :
this(splitter, hasHeaderLine)
{
Path = path;
}
public virtual List<string> Headers()
{
if (_header != null)
{
return _header;
}
if (!HasHeaderLine)
{
_header = new List<string>();
return _header;
}
_header = new List<string>();
using (var reader = new StreamReader(File.OpenRead(Path)))
{
if (HasHeaderLine && !reader.EndOfStream)
{
_header = reader.ReadLine().Split(Splitter).ToList();
}
}
return _header;
}
public virtual IEnumerable<string> Lines()
{
using (var reader = new StreamReader(File.OpenRead(Path)))
{
if (HasHeaderLine && !reader.EndOfStream)
{
reader.ReadLine();
}
while (!reader.EndOfStream)
{
var line = reader.ReadLine();
yield return line;
}
}
}
public virtual IEnumerable<CsvLine> Rows()
{
using (var reader = new StreamReader(File.OpenRead(Path)))
{
long lineNo = 0;
if (HasHeaderLine && !reader.EndOfStream)
{
++lineNo;
reader.ReadLine();
}
while (!reader.EndOfStream)
{
var line = new CsvLine();
line.Set(++lineNo, reader.ReadLine(), Splitter);
yield return line;
}
}
}
public IEnumerable<List<CsvLine>> Rows(int batchSize)
{
List<CsvLine> list = new List<CsvLine>();
foreach (var row in Rows())
{
list.Add(row);
if (list.Count == batchSize)
{
yield return list;
list = new List<CsvLine>();
}
}
if (list.Count > 0)
{
yield return list;
}
}
}
}
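Because Rows(batchSize) is built on yield return, only one batch is materialized in memory at a time. A minimal usage sketch (the file path is illustrative):
var reader = new FileReader(@"C:\Data\Student.csv", hasHeaderLine: true, splitter: ',');
foreach (List<CsvLine> batch in reader.Rows(1000))
{
    // at most 1000 CsvLine items are held in memory here
    foreach (CsvLine row in batch)
    {
        List<string> values = row.LineValues(); // split values of this line
        // process values; row.LineNo gives the line number within the file
    }
}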
This is the mapper base class for entity mapping. We are going to use this configuration class to convert a data line into the expected entity. Here, we have the option to validate a data line before getting the mapped entity.
using System;
using System.Collections.Generic;
using System.Linq;
namespace CsvBatch
{
public abstract class EntityLineMapper<TEntity> : CsvLine, ICsvDataMapper<TEntity>
{
    public TEntity Map()
    {
        TEntity entity;
        List<string> mapErrors;
        if (!Map(out entity, out mapErrors))
        {
            throw new Exception("File to entity map error.");
        }
        return entity;
    }
    public bool Map(out TEntity entity, out List<string> mapErrors)
    {
        entity = Map(LineValues(), out mapErrors);
        return !mapErrors.Any();
    }
    protected abstract TEntity Map(List<string> lineValues, out List<string> mapErrors);
}
}
Read As Entity
We are going to use this class to read data as an entity list. This class inherits from the previously shared FileReader.
using System;
using System.Collections.Generic;
using System.Data;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Text;
using System.Threading.Tasks;
namespace CsvBatch
{
public class ReaderAsEntity : FileReader
{
public ReaderAsEntity(string path, bool hasHeaderLine = true,
char splitter = ',') : base(path, hasHeaderLine, splitter)
{
}
public IEnumerable<List<TEntity>> Rows<TMapper, TEntity>(int batchSize)
where TMapper : ICsvDataMapper<TEntity>, new()
{
TMapper mapper = new TMapper();
foreach (var batch in Rows(batchSize))
{
List<TEntity> batchQueue = new List<TEntity>(batchSize);
foreach (var row in batch)
{
mapper.Set(row.LineNo, row.Line, row.Splitter);
batchQueue.Add(mapper.Map());
}
yield return batchQueue;
}
}
public IEnumerable<List<TMapper>> Rows<TMapper>(int batchSize)
where TMapper : ICsvDataMapper, new()
{
foreach (var batch in Rows(batchSize))
{
List<TMapper> batchQueue = new List<TMapper>(batchSize);
foreach (var row in batch)
{
TMapper mapper = new TMapper();
mapper.Set(row.LineNo, row.Line, row.Splitter);
batchQueue.Add(mapper);
}
yield return batchQueue;
}
}
}
}
Read As DataTable
This class will be used to read data as a list of DataTables.
using System;
using System.Collections.Generic;
using System.Data;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace CsvBatch
{
public class ReaderAsDataTable : FileReader
{
public int? ExpectedColumnNumber { get; protected set; }
public bool IncludeLineNumber { get; protected set; }
public string TableName { get; protected set; }
public string DefaultColumnName { get; protected set; }
public string DefaultLineNumberColumnName { get; protected set; }
private List<string> _expectedHeaders;
private ReaderAsDataTable(string path, bool hasHeaderLine,
char splitter, string tableName, string defaultColumnName,
bool includeLineNumber, string defaultLineNumberColumnName)
: base(path, hasHeaderLine, splitter)
{
IncludeLineNumber = includeLineNumber;
TableName = tableName;
DefaultColumnName = defaultColumnName;
DefaultLineNumberColumnName = defaultLineNumberColumnName;
}
public ReaderAsDataTable(string path, char splitter = ',',
string tableName = "TableName", string defaultColumnName = "ColumnName",
bool includeLineNumber = true,
string defaultLineNumberColumnName = "ColumnRowNumber")
: this(path, true, splitter, tableName, defaultColumnName,
includeLineNumber, defaultLineNumberColumnName)
{
}
public ReaderAsDataTable(string path, int expectedColumnNumber,
char splitter = ',', string tableName = "TableName",
string defaultColumnName = "ColumnName",
bool includeLineNumber = true,
string defaultLineNumberColumnName = "ColumnRowNumber")
: this(path, false, splitter, tableName, defaultColumnName,
includeLineNumber, defaultLineNumberColumnName)
{
ExpectedColumnNumber = expectedColumnNumber;
}
public List<string> ExpectedHeaders()
{
if (_expectedHeaders != null)
{
return _expectedHeaders;
}
List<string> headers = new List<string>();
if (ExpectedColumnNumber != null)
{
for (int i = 1; i <= ExpectedColumnNumber; i++)
{
string columnName = DefaultColumnName + i;
headers.Add(columnName);
}
}
else
{
headers.AddRange(Headers());
}
if (IncludeLineNumber)
{
string columnName = DefaultLineNumberColumnName;
headers.Add(columnName);
}
_expectedHeaders = headers;
return _expectedHeaders;
}
public DataTable DataTable()
{
DataTable table = NewDataTable();
table.BeginLoadData();
foreach (var item in Rows())
{
var row = NewDataRow(table, item);
table.Rows.Add(row);
}
table.EndLoadData();
return table;
}
public IEnumerable<DataTable> DataTables(int batchSize)
{
int tableCount = 0;
foreach (var batch in Rows(batchSize))
{
++tableCount;
DataTable table = NewDataTable(tableCount);
table.BeginLoadData();
foreach (var item in batch)
{
var row = NewDataRow(table, item);
table.Rows.Add(row);
}
table.EndLoadData();
yield return table;
}
}
private DataTable NewDataTable(int tableCount = 0)
{
DataTable table = new DataTable();
table.TableName = tableCount == 0 ? TableName : TableName + tableCount;
foreach (string header in ExpectedHeaders())
{
string columnName = header;
if (table.Columns[columnName] == null)
{
table.Columns.Add(columnName);
}
}
return table;
}
private DataRow NewDataRow(DataTable table, CsvLine item)
{
DataRow row = table.NewRow();
var lineValues = item.LineValues();
if (IncludeLineNumber)
{
lineValues.Add(item.LineNo.ToString());
}
row.ItemArray = lineValues.ToArray();
return row;
}
}
}
Using: Read As Entity
Entity Model and Data Mapping
Model
To read data as an entity, we need a model and a model mapping configuration. Here is our model class:
namespace CsvBatch
{
public class Student
{
public string Name { get; internal set; }
public int Age { get; internal set; }
public int Level { get; internal set; }
}
}
Data Mapping
Let's create the model mapping configuration class by inheriting the EntityLineMapper<TSource> class. The base class will force the child class to implement the method TSource Map(List<string> lineValues, out List<string> mapErrors).
- List<string> lineValues contains the split line values
- List<string> mapErrors holds the validation error messages
Here, I am validating data and adding error messages to List<string> mapErrors. This is not mandatory if the data source always provides valid data. Trimming values is a good practice, and we can do it here if needed (a small sketch follows the mapper below).
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;
namespace CsvBatch
{
public class StudentMapper : EntityLineMapper<Student>
{
protected override Student Map(List<string> lineValues, out List<string> mapErrors)
{
Student entity = new Student();
mapErrors = new List<string>();
if (lineValues.Count < 3)
{
mapErrors.Add("Not enough enough");
return entity;
}
entity.Name = lineValues[0];
string ageString = lineValues[1];
int age;
if (int.TryParse(ageString, out age))
{
entity.Age = age;
}
else
{
mapErrors.Add("Age is not a number");
}
string levelString = lineValues[2];
int level;
if (int.TryParse(levelString, out level))
{
entity.Level = level;
}
else
{
mapErrors.Add("Level is not a number");
}
return entity;
}
}
}
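As mentioned above, trimming can be added at the top of Map before the values are read. A one-line sketch of this optional tweak (it also needs using System.Linq; in the file):
// inside Map, before accessing lineValues[0], [1] and [2]:
lineValues = lineValues.Select(v => v.Trim()).ToList();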
Read File Without Validation
Here, all the rows of the source file are valid.
File Sample
Name,Age,Level
Dan,8,1
Ben,1,1
Henrik,9,1
Dan,8,2
Ben,1,2
Henrik,9,2
Dan,8,3
Load Data
string path = FullPath(@"Student.csv");
bool hasHeader = true;
char separator = ',';
int batchSize = 3;
IEnumerable<List<Student>> enumerable =
new ReaderAsEntity(path, hasHeader, separator).Rows<StudentMapper, Student>(batchSize);
List<List<Student>> list = enumerable.ToList();
Assert.AreEqual(3, list.Count);
Assert.AreEqual(3, list[0].Count);
Assert.AreEqual(3, list[1].Count);
Assert.AreEqual(1, list[2].Count);
Read File With Validation
Here, only the last row of the source file is valid.
File Sample
Name,Age,Level
Ben,,1
Henrik,1,
Dan,8,2
Load Data
The usage is almost the same as before, but this time the Map function outputs the validation error messages.
- Student entity: the line mapped to the entity model
- List<string> errors: validation error messages; for a valid line, its count should be zero
string path = FullPath(@"Student_Validation.csv");
bool hasHeader = true;
char separator = ',';
int batchSize = 10;
var reader = new ReaderAsEntity(path, hasHeader, separator);
Student entity; List<string> errors;
foreach (var list in reader.Rows<StudentMapper>(batchSize))
{
foreach (var item in list)
{
if (item.Map(out entity, out errors))
{
    // valid line: use the mapped entity
}
else
{
    // invalid line: errors holds the validation messages
}
}
}
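To follow the earlier good practice of logging errors and their line details to a separate file, the empty branches above can be filled in. A minimal sketch (the log file path is illustrative):
using (var errorLog = new StreamWriter(@"C:\Logs\Student_Validation_Errors.log", append: true))
{
    foreach (var list in reader.Rows<StudentMapper>(batchSize))
    {
        foreach (var item in list)
        {
            if (item.Map(out entity, out errors))
            {
                // valid: collect the entity for saving
            }
            else
            {
                errorLog.WriteLine($"Line {item.LineNo}: {string.Join("; ", errors)}");
            }
        }
    }
}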
Using: Read As DataTable
Currently, while reading file data as a DataTable, we don't have validation options like we do for entity mapping. I will add a validation option soon.
The data type of all columns will be string. An additional column, ColumnRowNumber, will be added to the DataTable to hold the data line number; this is configurable via the constructor. If we check the constructors of the ReaderAsDataTable class, we will see a few default parameters. Please switch between the constructors, or pass values to the default parameters, depending on the need.
Read File With Header
File Sample
Name,Age,Level
Dan,8,1
Ben,1,1
Henrik,9,1
Dan,8,2
Ben,1,2
Henrik,9,2
Dan,8,3
Load Data
string path = FullPath(@"Student.csv");
int batchSize = 3;
char separator = ',';
List<DataTable> dataTables =
new ReaderAsDataTable(path, separator).DataTables(batchSize).ToList();
Assert.AreEqual(3, dataTables.Count);
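Since every batch shares the same schema, each DataTable can go straight to the database with SqlBulkCopy, as mentioned at the beginning. A minimal sketch (the connection string and the destination table dbo.Student are illustrative):
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    foreach (DataTable table in new ReaderAsDataTable(path, separator).DataTables(batchSize))
    {
        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            bulkCopy.DestinationTableName = "dbo.Student";
            // map columns by name, skipping the auto-added line number
            // column if the destination table does not have it
            foreach (DataColumn column in table.Columns)
            {
                if (column.ColumnName != "ColumnRowNumber")
                {
                    bulkCopy.ColumnMappings.Add(column.ColumnName, column.ColumnName);
                }
            }
            bulkCopy.WriteToServer(table);
        }
    }
}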
Read File Without Header
File Sample
Dan,8,1
Ben,1,1
Henrik,9,1
Dan,8,2
Ben,1,2
Henrik,9,2
Dan,8,3
Load Data
string path = FullPath(@"Student_No_Header.csv");
int batchSize = 3;
char separator = ',';
int numberOfColumnsInFile = 3;
List<DataTable> dataTables = new ReaderAsDataTable
(path, numberOfColumnsInFile, separator).DataTables(batchSize).ToList();
Assert.AreEqual(3, dataTables.Count);
As we don't have a header in the file, column names will be auto-generated like:
ColumnName1, ColumnName2, ColumnName3 ...
Bonus: Split IEnumerable/List In Batches
This extension method will split any IEnumerable<TSource> into evenly sized chunks (the last chunk may be smaller).
IEnumerable Helper
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace CsvBatch
{
public static class ListExtensions
{
public static IEnumerable<List<TSource>> Batch<TSource>
(this IEnumerable<TSource> sourceList, int size)
{
List<TSource> batchList = new List<TSource>(size);
foreach (TSource obj in sourceList)
{
batchList.Add(obj);
if (batchList.Count == size)
{
yield return batchList;
batchList = new List<TSource>();
}
}
if (batchList.Count > 0)
{
yield return batchList;
}
}
}
}
Using the Helper
List<Student> source = new List<Student>()
{
new Student() { Name = "Dan", Age = 10, Level = 1},
new Student() { Name = "Ben", Age = 11, Level = 1},
new Student() { Name = "Dan", Age = 10, Level = 2},
new Student() { Name = "Ben", Age = 11, Level = 2},
new Student() { Name = "Dan", Age = 10, Level = 3},
};
IEnumerable<List<Student>> value = source.Batch(2);
List<List<Student>> list = new List<List<Student>>(value);
Assert.AreEqual(3, list.Count);
Assert.AreEqual(2, list[0].Count);
Assert.AreEqual(2, list[1].Count);
Assert.AreEqual(1, list[2].Count);
Projects
It is a Visual Studio 2017 solution with a single unit test project. The folder structure is as below:
- Model (Student model and mapper)
- Helper (reader classes, i.e., the actual helper code)
- Test (code examples, unit tests, and test files)
The code may throw unexpected errors for untested inputs. If you find any, just let me know.
History
- 2nd August, 2020: Initial version