Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / C#

Cinchoo ETL - Sorting Large CSV File

4.67/5 (7 votes)
20 Sep 2021CPOL3 min read 16.5K   72  
Easy way to sort large CSV files using Cinchoo ETL
In this tip, you will learn how to sort very large CSV file using Cinchoo ETL framework. It is very simple to use, with few lines of code, the conversion can be done. You can convert large files as the conversion process is stream based, quite fast and with low memory footprint.

1. Introduction

ChoETL is an open source ETL (extract, transform and load) framework for .NET. It is a code based library for extracting data from multiple sources, transforming, and loading into your very own data warehouse in .NET environment. You can have data in your data warehouse in no time.

This article talks about sorting large CSV files using ChoCSVReader/ChoCSVWriter components offered by ChoETL framework. Large CSV files present a challenge when need arises to sort. It fails with OutOfMemory exception if you try to do with .NET application. With Microsoft Excel for sorting on large files, you will promptly receive an error the first time.

If a string sort will work for you, then just use the Windows SORT command. Sort the file and be done with it. It'll happily sort your large CSV file, and it's simple to use.

If you need to filter and convert the file, specifically the date / custom type fields, then you would need to write a small conversion program that converts the fields whatever you like, and rewrites the record. This is where Cinchoo ETL framework places a role to ease up the conversion of large CSV file the way you want with all possible requirements with minimal memory footprint and ultra-fast performance.

In this article, let's learn how to do that using a Cinchoo ETL framework.

To learn more about Cinchoo ETL - CSVReader, please visit the CodeProject article here.

2. Requirement

This framework library is written in C# using .NET 4.5 / .NET Core Framework.

3. How to Use

3.1 Sample Data

Let's begin by looking into a simple example of reading the below CSV file having 4 columns.

Listing 1. Sample CSV Data File (customers.csv)

CustId,Name,Balance,AddedDate
1732,Tom Perez,435.00,05/11/2002
5541,Pedro Gomez,12342.30,06/02/2000
1122,Mark Politti,0.00,01/02/2004
1924,Pablo Ramirez,3321.30,11/24/2002

The first thing to do is to install ChoETL.JSON / ChoETL.JSON.NETStandard nuget package. To do this, run the following command in the Package Manager Console.

.NET Framework

Install-Package ChoETL.JSON

.NET Core

Install-Package ChoETL.JSON.NETStandard

Now add ChoETL namespace to the program.

using ChoETL;

3.2 Quick Sort- Zero Configuration Approach

This approach shows you how to do the sorting of large CSV file without any mapping class.

Listing 2. Sorting CSV Files without Mapping Class

JavaScript
public static void QuickSort()
{
    using (var dr = new ChoCSVReader(@"customers.csv").WithFirstLineHeader())
    {
        using (var dw = new ChoCSVWriter(Console.Out))
            dw.Write(dr.ExternalSort((c1, c2) => String.Compare(c1.Name, c2.Name)));
    }
}

In the above, open the input CSV file for reading using ChoCSVReader object. Then open ChoCSVWriter for writing sorted CSV file. Use the ExternalSort() method on reader object to sort the items by Name using String.Compare method. This will produce the sorted CSV file.

Likewise, you can perform any type of projection, filtering or conversion on csv collection with minimal memory footprint.

Sample fiddle: https://dotnetfiddle.net/cLQMKb

3.3 Define Mapping Class

This approach talks about sorting of large CSV file using Mapping (POCO) Class. More type safe approach than the previous step. First, create a class which has properties with the same name of the column headings found in the CSV file. Make sure to mark the class as Serializable using SerializableAttribute. Below, you will find an example of a class which does this.

Listing 3. Mapping Class

C#
[Serializable]
public class Customer
{
	public int CustId { get; set; }
	public string Name { get; set; }
	public decimal Balance { get; set; }
	public DateTime AddedDate { get; set; }
}

3.3 Sort with ChoCSVReader/ChoCSVWriter

Finally, create an instance of ChoCSVReader class to read the input file. Create an instance of ChoCSVWriter for writing sorted CSV file as below.

Listing 4. Sorting CSV File with Mapping Class

JavaScript
public static void POCOSort()
{
    using (var dr = new ChoCSVReader<Customer>(@"customers.csv").WithFirstLineHeader())
    {
        using (var dw = new ChoCSVWriter<Customer>(Console.Out))
            dw.Write(dr.ExternalSort((c1, c2) => DateTime.Compare(c1.AddedDate, c2.AddedDate)));
    }
}

In the above, the code uses the ExternalSort() method on reader object to sort the CSV file by AddedDate column and saved to CSV output file. This will produce the sorted CSV file.

Sample fiddle: https://dotnetfiddle.net/1dFvSx

 

For more information about Cinchoo ETL, please visit the below CodeProject article:

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)