ChoETL is an open source ETL (extract, transform and load) framework for .NET. It is a code-based library for extracting data from multiple sources, transforming it, and loading it into your own data warehouse in a .NET environment. You can have data in your data warehouse in no time.
Apache Parquet is an open source file format for Hadoop. Parquet stores nested data structures in a flat columnar format. Compared to the traditional row-oriented approach, Parquet is more efficient in terms of storage and performance.
This article talks about using the ChoParquetReader component offered by the ChoETL framework. It is a simple utility class to extract Parquet data from a file / source to objects.
Features
- Uses the Parquet.NET parser under the hood, parses Parquet files in seconds and also handles large files without any memory issues
- Stream based parsers allow for ultimate performance, low resource usage, and nearly unlimited versatility scalable to any size data file, even tens or hundreds of gigabytes
- Event based data manipulation and validation allows total control over the flow of data during the bulk insert process
- Exposes IEnumerable list of objects, which is often used with LINQ queries for projection, aggregation, filtration, etc.
- Supports deferred reading
- Supports processing files with culture specific date, currency and number formats
- Recognizes a wide variety of date, currency, enum, boolean and number formats when reading files
- Provides fine control of date, currency, enum, boolean, number formats when writing files
- Detailed and robust error handling, allowing you to quickly find and fix the problems
This framework library is written in C# using .NET 4.5 Framework / .NET core 2.x.
- Open VS.NET 2013 or higher
- Create a sample VS.NET (.NET Framework 4.5/.NET core 2.x) Console Application project
- Install ChoETL via Package Manager Console using Nuget Command based on the .NET environment:
Install-Package ChoETL.Parquet
- Use the ChoETL namespace
Let's begin by looking into a simple example of reading a Parquet file having 2 fields.
Image 3.1 Sample Parquet data file (emp.parquet)
There are a number of ways you can get the Parquet file parsing started with minimal setup.
The first is the zero config, quick way to load a Parquet file in no time. No POCO object is required. The sample code below shows how to load the file.
Listing 3.1.1 Load Parquet file using iterator
foreach (dynamic rec in new ChoParquetReader("emp.parquet"))
{
Console.WriteLine($"Id: {rec.Id}, Name: {rec.Name}");
}
Sample fiddle: https://dotnetfiddle.net/4dJk4G
Listing 3.1.2 Load Parquet file using loop
var reader = new ChoParquetReader("emp.parquet");
dynamic rec;
while ((rec = reader.Read()) != null)
{
Console.WriteLine($"Id: {rec.Id}, Name: {rec.Name}");
}
Sample fiddle: https://dotnetfiddle.net/XAtppL
This is another zero config way to parse and load a Parquet file, this time using a POCO class. First, define a simple data class to match the underlying Parquet file layout.
Listing 3.2.1 Simple POCO entity class
public partial class EmployeeRec
{
public int Id { get; set; }
public string Name { get; set; }
}
In the above, the class defines two properties matching the sample Parquet file template.
Listing 3.2.2 Load Parquet file
foreach (var rec in new ChoParquetReader<EmployeeRec>("emp.parquet"))
{
Console.WriteLine($"Id: {rec.Id}, Name: {rec.Name}");
}
Sample fiddle: https://dotnetfiddle.net/00baoy
In this model, we define the Parquet configuration with all the necessary parsing parameters along with Parquet fields matching with the underlying Parquet file.
Listing 3.3.1 Define Parquet configuration
ChoParquetRecordConfiguration config = new ChoParquetRecordConfiguration();
config.ParquetRecordFieldConfigurations.Add(new ChoParquetRecordFieldConfiguration("Id"));
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Name"));
In the above, the configuration defines two fields matching the sample Parquet file template.
Listing 3.3.2 Load Parquet file without POCO object
foreach (dynamic rec in new ChoParquetReader("emp.parquet", config))
{
Console.WriteLine($"Id: {rec.Id}, Name: {rec.Name}");
}
Sample fiddle: https://dotnetfiddle.net/V5ts04
Listing 3.3.3 Load Parquet file with POCO object
foreach (var rec in new ChoParquetReader<EmployeeRec>("emp.parquet", config))
{
Console.WriteLine($"Id: {rec.Id}, Name: {rec.Name}");
}
Sample fiddle: https://dotnetfiddle.net/mwd0EK
This is the combined approach to define a POCO entity class along with Parquet configuration parameters decorated declaratively. Id is a required field and Name is an optional value field with default value "XXXX". If Name is not present, it will take the default value.
Listing 3.4.1 Define POCO Object
public class EmployeeRec
{
[ChoParquetRecordField]
[Required]
public int Id
{
get;
set;
}
[ChoParquetRecordField]
[DefaultValue("XXXX")]
public string Name
{
get;
set;
}
public override string ToString()
{
return "{0}. {1}".FormatString(Id, Name);
}
}
The code above illustrates defining a POCO object to carry the values of each record in the input file. First, define a property for each record field and decorate it with ChoParquetRecordFieldAttribute to qualify it for Parquet record mapping. ParquetPath is an optional property. If it is not specified, the framework automatically discovers and loads the values from the Parquet property. Id is decorated with RequiredAttribute; if the value is missing, an exception is thrown. Name is given a default value using DefaultValueAttribute. It means that if the Name Parquet field contains an empty value in the file, it will be defaulted to 'XXXX'.
It is very simple and ready to extract Parquet data in no time.
Listing 3.4.2 Main Method
foreach (var rec in new ChoParquetReader<EmployeeRec>("emp.parquet"))
{
Console.WriteLine($"Id: {rec.Id}, Name: {rec.Name}");
}
We start by creating a new instance of the ChoParquetReader object. That's all. All the heavy lifting of parsing and loading the Parquet data stream into the objects is done by the parser under the hood.
By default, ChoParquetReader discovers and uses default configuration parameters while loading a Parquet file. These can be overridden according to your needs. The following sections give details about each configuration attribute.
Once a POCO object is set up to match the Parquet file structure, you can read the whole file as an enumerable. Reading is deferred, but take care when performing any aggregate operation on the records: that will load all the records of the file into memory.
Listing 4.1 Read Parquet File
foreach (var rec in new ChoParquetReader<EmployeeRec>("emp.parquet"))
{
Console.WriteLine($"Id: {rec.Id}, Name: {rec.Name}");
}
or:
Listing 4.2 Read Parquet file stream
foreach (var rec in new ChoParquetReader<EmployeeRec>(textReader))
{
Console.WriteLine($"Id: {rec.Id}, Name: {rec.Name}");
}
This model keeps your code elegant, clean, and easy to read and maintain. It also lets you leverage LINQ extension methods to perform grouping, joining, projection, aggregation, etc.
Listing 4.3 Using LINQ
var list = (from o in new ChoParquetReader<EmployeeRec>("emp.parquet")
where o.Name != null && o.Name.StartsWith("R")
select o).ToArray();
foreach (var rec in list)
{
Console.WriteLine($"Id: {rec.Id}, Name: {rec.Name}");
}
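The caution about aggregate operations can be illustrated without ChoETL at all. The following is a minimal sketch that uses a plain C# iterator as a stand-in for a streaming reader; DeferredReadSketch, ReadNames and RecordsProduced are hypothetical names introduced for illustration and are not part of the ChoETL API. It shows that building a LINQ query reads nothing, that First() stops at the first match, and that ToArray() pulls every record through the pipeline.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a streaming record reader (not a ChoETL type).
public static class DeferredReadSketch
{
    // Counts how many records the iterator has actually produced so far.
    public static int RecordsProduced;

    // Yields records one at a time; nothing runs until the caller enumerates.
    public static IEnumerable<string> ReadNames()
    {
        foreach (var name in new[] { "Raj", "Tom", "Ravi", "Mark" })
        {
            RecordsProduced++;
            yield return name;
        }
    }

    // Streams: enumeration stops as soon as the first match is found.
    public static string FirstStartingWithR()
    {
        return ReadNames()
            .Where(n => n.StartsWith("R", StringComparison.Ordinal))
            .First();
    }

    // Materializes: every record is read before the array is returned.
    public static string[] AllStartingWithR()
    {
        return ReadNames()
            .Where(n => n.StartsWith("R", StringComparison.Ordinal))
            .ToArray();
    }
}
```

By the same reasoning, the ToArray() call in Listing 4.3 would be expected to materialize all matching records, which is fine for small files but worth avoiding for very large ones.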
Once a POCO object is set up to match the Parquet file structure, you can also read the records one at a time by calling Read() in a loop.
Listing 5.1 Read Parquet file
var reader = new ChoParquetReader<EmployeeRec>("emp.parquet");
EmployeeRec rec;
while ((rec = reader.Read()) != null)
{
Console.WriteLine($"Id: {rec.Id}, Name: {rec.Name}");
}
Using ChoParquetRecordObjectAttribute, you can customize the POCO entity object declaratively.
Listing 6.1 Customizing POCO object for each record
[ChoParquetRecordObject]
public class EmployeeRec
{
[ChoParquetRecordField]
public int Id { get; set; }
[ChoParquetRecordField]
[Required]
[DefaultValue("ZZZ")]
public string Name { get; set; }
}
Here are the available attributes to carry out customization of Parquet load operation on a file.
- CultureName - The culture name (e.g., en-US, en-GB) used to read and write Parquet data.
- Encoding - The encoding of the Parquet file.
- ColumnCountStrict - This flag indicates if an exception should be thrown when an expected field is missing while reading.
- ErrorMode - This flag indicates if an exception should be thrown when an expected field fails to load while reading. This can be overridden per property. Possible values are:
  - IgnoreAndContinue - Ignore the error; the record is skipped and the load continues with the next one.
  - ReportAndContinue - Report the error to the POCO entity if it is of IChoNotifyRecordRead type.
  - ThrowAndStop - Throw the error and stop the execution.
- IgnoreFieldValueMode - A flag to let the reader know if a record should be skipped when reading if it's empty / null. This can be overridden per property. Possible values are:
  - Null - skipped if the record value is null
  - DBNull - N/A
  - Empty - skipped if the record value is empty
  - WhiteSpace - skipped if the record value contains only whitespaces
- ObjectValidationMode - A flag to let the reader know about the type of validation to be performed with the record object. Possible values are:
  - Off - No object validation performed (Default)
  - MemberLevel - Validation performed at the time each Parquet property gets loaded with a value
  - ObjectLevel - Validation performed after all the properties are loaded to the POCO object
For each Parquet field, you can specify the mapping in the POCO entity property using ChoParquetRecordFieldAttribute. Only use this attribute if you want to use a custom ParquetPath to map to this field.
Listing 7.1 Customizing POCO object for Parquet fields
public class EmployeeRec
{
[ChoParquetRecordField]
public int Id { get; set; }
[ChoParquetRecordField]
[Required]
[DefaultValue("ZZZ")]
public string Name { get; set; }
}
Here are the available members to add some customization to it for each property:
- FieldName - When mapping by name, you specify the name of the Parquet field that you want to use for that property.
The default value is the value set to the property when the Parquet value is empty or whitespace (controlled via IgnoreFieldValueMode).
Any POCO entity property can be given a default value using System.ComponentModel.DefaultValueAttribute.
The fallback value is the value set to the property when the Parquet value fails to set. The fallback value is only set when ErrorMode is either IgnoreAndContinue or ReportAndContinue.
Any POCO entity property can be given a fallback value using ChoETL.ChoFallbackValueAttribute.
Most of the primitive types are automatically converted and set to the properties. If the value of the Parquet field can't automatically be converted into the type of the property, you can specify custom / built-in .NET converters to convert the value. These can be either IValueConverter or TypeConverter converters.
There are a couple of ways you can specify the converters for each field:
- Declarative Approach
- Configuration Approach
The declarative approach is applicable to POCO entity objects only. If you have a POCO class, you can attach converters to each property to carry out the necessary conversion. The samples below show the way to do it.
Listing 7.3.1.1 Specifying type converters
public class EmployeeRec
{
[ChoParquetRecordField]
[ChoTypeConverter(typeof(IntConverter))]
public int Id { get; set; }
[ChoParquetRecordField]
[Required]
[DefaultValue("ZZZ")]
public string Name { get; set; }
}
Listing 7.3.1.2 IntConverter implementation
public class IntConverter : IValueConverter
{
public object Convert(object value, Type targetType,
object parameter, CultureInfo culture)
{
return value;
}
public object ConvertBack(object value, Type targetType,
object parameter, CultureInfo culture)
{
return value;
}
}
In the example above, we defined a custom IntConverter class and showed how to use it with the 'Id' Parquet property.
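The IntConverter above returns the value unchanged. As a sketch of a converter that does real work, the class below derives from the standard System.ComponentModel.TypeConverter, one of the two converter kinds named earlier, and parses strings such as " 1,234 " into integers. LenientIntConverter is a hypothetical name; whether a given file needs such a converter, and the form in which values reach it, are assumptions here.

```csharp
using System;
using System.ComponentModel;
using System.Globalization;

// Hypothetical converter: accepts strings with surrounding whitespace
// and thousands separators and converts them to int.
public class LenientIntConverter : TypeConverter
{
    public override bool CanConvertFrom(ITypeDescriptorContext context, Type sourceType)
    {
        return sourceType == typeof(string) || base.CanConvertFrom(context, sourceType);
    }

    public override object ConvertFrom(ITypeDescriptorContext context,
        CultureInfo culture, object value)
    {
        var text = value as string;
        if (text != null)
        {
            // Fall back to the invariant culture when none is supplied.
            return int.Parse(text.Trim(),
                NumberStyles.Integer | NumberStyles.AllowThousands,
                culture ?? CultureInfo.InvariantCulture);
        }
        return base.ConvertFrom(context, culture, value);
    }
}
```

Assuming the attribute accepts TypeConverter types as the text states, such a class would be attached in the same way as IntConverter, via [ChoTypeConverter(typeof(LenientIntConverter))].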
The configuration approach is applicable to both dynamic and POCO entity objects. It gives you the freedom to attach converters to each property at runtime. It takes precedence over the declarative converters on POCO classes.
Listing 7.3.2.2 Specifying TypeConverters
ChoParquetRecordConfiguration config = new ChoParquetRecordConfiguration();
ChoParquetRecordFieldConfiguration idConfig =
new ChoParquetRecordFieldConfiguration("Id");
idConfig.AddConverter(new IntConverter());
config.ParquetRecordFieldConfigurations.Add(idConfig);
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Name"));
In the above, we construct and attach the IntConverter to the 'Id' field using the AddConverter helper method of the ChoParquetRecordFieldConfiguration object.
Likewise, if you want to remove any converter from it, you can use RemoveConverter on the ChoParquetRecordFieldConfiguration object.
ChoParquetReader leverages both System.ComponentModel.DataAnnotations and Validation Block validation attributes to specify validation rules for individual fields of the POCO entity. Refer to the MSDN site for a list of available DataAnnotations validation attributes.
Listing 7.4.1 Using validation attributes in POCO entity
[ChoParquetRecordObject]
public partial class EmployeeRec
{
[ChoParquetRecordField(FieldName = "id")]
[ChoTypeConverter(typeof(IntConverter))]
[Range(1, int.MaxValue, ErrorMessage = "Id must be > 0.")]
[ChoFallbackValue(1)]
public int Id { get; set; }
[ChoParquetRecordField(FieldName = "Name")]
[Required]
[DefaultValue("ZZZ")]
[ChoFallbackValue("XXX")]
public string Name { get; set; }
}
In the example above, the Range validation attribute is used for the Id property and the Required validation attribute for the Name property. ChoParquetReader performs validation on them during load when Configuration.ObjectValidationMode is set to ChoObjectValidationMode.MemberLevel or ChoObjectValidationMode.ObjectLevel.
Sometimes you may want to override the declarative validation behavior that comes with the POCO class. You can do so with Cinchoo ETL via the configuration approach. The sample below shows the way to override it.
static void ValidationOverridePOCOTest()
{
ChoParquetRecordConfiguration config = new ChoParquetRecordConfiguration();
var idConfig = new ChoParquetRecordFieldConfiguration("Id");
idConfig.Validators = new ValidationAttribute[] { new RequiredAttribute() };
config.ParquetRecordFieldConfigurations.Add(idConfig);
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Name"));
using (var parser = new ChoParquetReader<EmployeeRec>("emp.parquet", config))
{
object rec;
while ((rec = parser.Read()) != null)
{
Console.WriteLine(rec.ToStringEx());
}
}
}
In some cases, you may want to take control and perform manual self validation within the POCO entity class. This can be achieved by implementing the IChoValidatable interface on the POCO object.
Listing 7.4.2 Manual validation on POCO entity
[ChoParquetRecordObject]
public partial class EmployeeRec : IChoValidatable
{
[ChoParquetRecordField(FieldName = "id")]
[ChoTypeConverter(typeof(IntConverter))]
[Range(1, int.MaxValue, ErrorMessage = "Id must be > 0.")]
[ChoFallbackValue(1)]
public int Id { get; set; }
[ChoParquetRecordField(FieldName = "Name")]
[Required]
[DefaultValue("ZZZ")]
[ChoFallbackValue("XXX")]
public string Name { get; set; }
public bool TryValidate
(object target, ICollection<ValidationResult> validationResults)
{
return true;
}
public bool TryValidateFor
(object target, string memberName, ICollection<ValidationResult> validationResults)
{
return true;
}
public void Validate(object target)
{
}
public void ValidateFor(object target, string memberName)
{
}
}
The sample above shows how to implement custom self-validation in a POCO object.
The IChoValidatable interface exposes the below methods:
- TryValidate - Validate entire object, return true if all validation passed. Otherwise, return false.
- Validate - Validate entire object, throw exception if validation is not passed.
- TryValidateFor - Validate specific property of the object, return true if all validation passed. Otherwise, return false.
- ValidateFor - Validate specific property of the object, throw exception if validation is not passed.
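The method bodies in Listing 7.4.2 are empty placeholders. One possible way to fill in the object-level methods is to delegate to the standard System.ComponentModel.DataAnnotations.Validator class, as in the minimal sketch below. EmployeeSketch is a hypothetical class that deliberately omits the ChoETL attributes and the IChoValidatable interface so that it compiles on its own; only the two method signatures are taken from the listing above.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;

// Hypothetical record class; ChoETL attributes are left out on purpose.
public class EmployeeSketch
{
    [Range(1, int.MaxValue, ErrorMessage = "Id must be > 0.")]
    public int Id { get; set; }

    [Required]
    public string Name { get; set; }

    // Returns false and fills validationResults when any attribute rule fails.
    public bool TryValidate(object target, ICollection<ValidationResult> validationResults)
    {
        return Validator.TryValidateObject(
            target, new ValidationContext(target), validationResults,
            validateAllProperties: true);
    }

    // Throws ValidationException when any attribute rule fails.
    public void Validate(object target)
    {
        Validator.ValidateObject(
            target, new ValidationContext(target), validateAllProperties: true);
    }
}
```

Custom checks that cannot be expressed as attributes (for example, rules spanning several properties) could be added to the same methods before or after the Validator call.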
ChoParquetReader offers industry standard Parquet parsing out of the box to handle most of the parsing needs. If the parsing does not handle some of your needs, you can use the callback mechanism offered by ChoParquetReader to handle such situations. In order to participate in the callback mechanism, you can use either of the following models:
- Using event handlers exposed by ChoParquetReader via the IChoReader interface.
- Inheriting POCO entity object from IChoNotifyRecordRead / IChoNotifyFileRead / IChoNotifyRecordFieldRead interfaces.
- Inheriting DataAnnotation's MetadataType type object by IChoNotifyRecordRead / IChoNotifyFileRead / IChoNotifyRecordFieldRead interfaces.
- Inheriting IChoNotifyRecordConfigurable / IChoNotifyRecordFieldConfigurable configuration interfaces.
Note: Any exceptions raised out of these interface methods will be ignored.
IChoReader exposes the below events:
- BeginLoad - Invoked at the beginning of the Parquet file load
- EndLoad - Invoked at the end of the Parquet file load
- BeforeRecordLoad - Raised before the Parquet record load
- AfterRecordLoad - Raised after the Parquet record load
- RecordLoadError - Raised when the Parquet record load errors out
- BeforeRecordFieldLoad - Raised before the Parquet field value load
- AfterRecordFieldLoad - Raised after the Parquet field value load
- RecordFieldLoadError - Raised when the Parquet field value errors out
- SkipUntil - Raised before the Parquet parsing kicks off to add custom logic to skip records.
- DoWhile - Raised during Parquet parsing where you can add custom logic to stop the parsing.
IChoNotifyRecordRead exposes the below methods:
- BeforeRecordLoad - Raised before the Parquet record load
- AfterRecordLoad - Raised after the Parquet record load
- RecordLoadError - Raised when the Parquet record load errors out
IChoNotifyFileRead exposes the below methods:
- BeginLoad - Invoked at the beginning of the Parquet file load
- EndLoad - Invoked at the end of the Parquet file load
- SkipUntil - Raised before the Parquet parsing kicks off to add custom logic to skip records.
- DoWhile - Raised during Parquet parsing where you can add custom logic to stop the parsing.
IChoNotifyRecordFieldRead exposes the below methods:
- BeforeRecordFieldLoad - Raised before the Parquet field value load
- AfterRecordFieldLoad - Raised after the Parquet field value load
- RecordFieldLoadError - Raised when the Parquet field value errors out
IChoNotifyRecordConfigurable exposes the below methods:
- RecondConfigure - Raised for Parquet record configuration
IChoNotifyRecordFieldConfigurable exposes the below methods:
- RecondFieldConfigure - Raised for each Parquet record field configuration
Using the event handlers is the most direct and simplest way to subscribe to the callback events and handle odd situations in parsing Parquet files. The downside is that the code is not reusable the way it is when you implement IChoNotifyRecordRead on the POCO record object.
The sample below shows how to use the BeforeRecordLoad callback method to skip records that do not contain a 'Name' field.
Listing 10.1.1 Using ChoParquetReader callback events
static void IgnoreLineTest()
{
using (var parser = new ChoParquetReader("emp.parquet"))
{
parser.BeforeRecordLoad += (o, e) =>
{
if (e.Source != null)
{
e.Skip = !((IDictionary<string, object>)e.Source).ContainsKey("Name");
}
};
foreach (var e in parser)
Console.WriteLine(e.Dump());
}
}
Likewise, you can use other callback methods as well with ChoParquetReader.
The sample below shows how to implement the IChoNotifyRecordRead interface directly on the POCO class.
Listing 10.2.1 Direct POCO callback mechanism implementation
[ChoParquetRecordObject]
public partial class EmployeeRec : IChoNotifyRecordRead
{
[ChoParquetRecordField(FieldName = "Id")]
[ChoTypeConverter(typeof(IntConverter))]
[Range(1, int.MaxValue, ErrorMessage = "Id must be > 0.")]
[ChoFallbackValue(1)]
public int Id { get; set; }
[ChoParquetRecordField(FieldName = "Name")]
[Required]
[DefaultValue("ZZZ")]
[ChoFallbackValue("XXX")]
public string Name { get; set; }
public bool AfterRecordLoad(object target, int index, object source)
{
throw new NotImplementedException();
}
public bool BeforeRecordLoad(object target, int index, ref object source)
{
throw new NotImplementedException();
}
public bool RecordLoadError(object target, int index, object source, Exception ex)
{
throw new NotImplementedException();
}
}
The sample below shows how to attach a Metadata class to the POCO class by using MetadataTypeAttribute on it.
Listing 10.2.2 MetaDataType based callback mechanism implementation
[ChoParquetRecordObject]
public class EmployeeRecMeta : IChoNotifyRecordRead
{
[ChoParquetRecordField(FieldName = "Id")]
[ChoTypeConverter(typeof(IntConverter))]
[Range(1, int.MaxValue, ErrorMessage = "Id must be > 0.")]
[ChoFallbackValue(1)]
public int Id { get; set; }
[ChoParquetRecordField(FieldName = "Name")]
[Required]
[DefaultValue("ZZZ")]
[ChoFallbackValue("XXX")]
public string Name { get; set; }
public bool AfterRecordLoad(object target, int index, object source)
{
throw new NotImplementedException();
}
public bool BeforeRecordLoad(object target, int index, ref object source)
{
throw new NotImplementedException();
}
public bool RecordLoadError(object target, int index, object source, Exception ex)
{
throw new NotImplementedException();
}
}
[MetadataType(typeof(EmployeeRecMeta))]
public partial class EmployeeRec
{
public int Id { get; set; }
public string Name { get; set; }
}
The sample below shows how to attach a Metadata class to a sealed or third party POCO class by using ChoMetadataRefTypeAttribute on it.
Listing 10.2.3 ChoMetaDataRefType based callback mechanism implementation
[ChoMetadataRefType(typeof(EmployeeRec))]
[ChoParquetRecordObject]
public class EmployeeRecMeta : IChoNotifyRecordRead
{
[ChoParquetRecordField(FieldName = "id")]
[ChoTypeConverter(typeof(IntConverter))]
[Range(1, int.MaxValue, ErrorMessage = "Id must be > 0.")]
[ChoFallbackValue(1)]
public int Id { get; set; }
[ChoParquetRecordField(FieldName = "Name")]
[Required]
[DefaultValue("ZZZ")]
[ChoFallbackValue("XXX")]
public string Name { get; set; }
public bool AfterRecordLoad(object target, int index, object source)
{
throw new NotImplementedException();
}
public bool BeforeRecordLoad(object target, int index, ref object source)
{
throw new NotImplementedException();
}
public bool RecordLoadError(object target, int index, object source, Exception ex)
{
throw new NotImplementedException();
}
}
public partial class EmployeeRec
{
public int Id { get; set; }
public string Name { get; set; }
}
This callback is invoked once at the beginning of the Parquet file load. source is the Parquet file stream object. Here, you have a chance to inspect the stream. Return true to continue the Parquet load, or return false to stop the parsing.
Listing 10.1.1 BeginLoad Callback Sample
public bool BeginLoad(object source)
{
StreamReader sr = source as StreamReader;
return true;
}
This callback is invoked once at the end of the Parquet file load. source is the Parquet file stream object. Here, you have a chance to inspect the stream and perform any post-load steps on it.
Listing 10.2.1 EndLoad Callback Sample
public void EndLoad(object source)
{
StreamReader sr = source as StreamReader;
}
This callback is invoked before each record in the Parquet file is loaded. target is the instance of the POCO record object. index is the record index in the file. source is the Parquet record object. Here, you have a chance to inspect the object and override it with new values if you want to.
TIP: If you want to skip the record from loading, set the source to null.
Return true to continue the load process, otherwise return false to stop the process.
Listing 10.5.1 BeforeRecordLoad Callback Sample
public bool BeforeRecordLoad(object target, int index, ref object source)
{
IDictionary<string, object> obj = source as IDictionary<string, object>;
return true;
}
This callback is invoked after each record in the Parquet file is loaded. target is the instance of the POCO record object. index is the record index in the file. source is the Parquet record object. Here, you have a chance to do any post step operation with the record.
Return true to continue the load process, otherwise return false to stop the process.
Listing 10.6.1 AfterRecordLoad Callback Sample
public bool AfterRecordLoad(object target, int index, object source)
{
IDictionary<string, object> obj = source as IDictionary<string, object>;
return true;
}
This callback is invoked if an error is encountered while loading a record. target is the instance of the POCO record object. index is the record index in the file. source is the Parquet record object. ex is the exception object. Here, you have a chance to handle the exception. This method is invoked only when Configuration.ErrorMode is ReportAndContinue.
Return true to continue the load process, otherwise return false to stop the process.
Listing 10.7.1 RecordLoadError Callback Sample
public bool RecordLoadError(object target, int index, object source, Exception ex)
{
IDictionary<string, object> obj = source as IDictionary<string, object>;
return true;
}
This callback is invoked before each Parquet record field is loaded. target is the instance of the POCO record object. index is the record index in the file. propName is the Parquet record property name. value is the Parquet field value. Here, you have a chance to inspect the Parquet record property value and perform any custom validations, etc.
Return true to continue the load process, otherwise return false to stop the process.
Listing 10.8.1 BeforeRecordFieldLoad Callback Sample
public bool BeforeRecordFieldLoad
(object target, int index, string propName, ref object value)
{
return true;
}
This callback is invoked after each Parquet record field is loaded. target is the instance of the POCO record object. index is the record index in the file. propName is the Parquet record property name. value is the Parquet field value. Any post field operation can be performed here, like computing other properties, validations, etc.
Return true to continue the load process, otherwise return false to stop the process.
Listing 10.9.1 AfterRecordFieldLoad Callback Sample
public bool AfterRecordFieldLoad(object target, int index, string propName, object value)
{
return true;
}
This callback is invoked when an error is encountered while loading a Parquet record field value. target is the instance of the POCO record object. index is the record index in the file. propName is the Parquet record property name. value is the Parquet field value. ex is the exception object. Here, you have a chance to handle the exception. This method is invoked only after the below two steps are performed by ChoParquetReader.
- ChoParquetReader looks for the FallbackValue of each Parquet property. If present, it tries to assign its value to it.
- If the FallbackValue is not present and Configuration.ErrorMode is specified as ReportAndContinue, this callback will be executed.
Return true to continue the load process, otherwise return false to stop the process.
Listing 10.10.1 RecordFieldLoadError Callback Sample
public bool RecordFieldLoadError
(object target, int index, string propName, object value, Exception ex)
{
return true;
}
This callback is invoked at the start of the Parquet parsing with custom logic to skip records. index is the record index in the file.
Return true to skip the record, otherwise return false.
Listing 10.11.1 SkipUntil Callback Sample
public bool SkipUntil(long index, object source)
{
return false;
}
This callback is invoked during the Parquet parsing with custom logic to stop the parsing. index is the record index in the file.
Return true to stop the parsing, otherwise return false.
Listing 10.12.1 DoWhile Callback Sample
public bool DoWhile(long index, object source)
{
return false;
}
ChoParquetReader automatically detects and loads the configured settings from the POCO entity. At runtime, you can customize and tweak these parameters before Parquet parsing. ChoParquetReader exposes a Configuration property of type ChoParquetRecordConfiguration. Using this property, you can customize them.
Listing 10.1 Customizing ChoParquetReader at run-time
class Program
{
static void Main(string[] args)
{
using (var parser = new ChoParquetReader<EmployeeRec>("emp.parquet"))
{
object row = null;
parser.Configuration.ColumnCountStrict = true;
while ((row = parser.Read()) != null)
Console.WriteLine(row.ToString());
}
}
}
ChoParquetReader exposes the AsDataReader helper method to retrieve the Parquet records as a .NET datareader object. DataReaders are fast-forward streams of data. This datareader can be used in a few places, like bulk copying data to a database using SqlBulkCopy, loading a disconnected DataTable, etc.
Listing 11.1 Reading as DataReader sample
static void AsDataReaderTest()
{
using (var parser = new ChoParquetReader<EmployeeRec>("emp.parquet"))
{
IDataReader dr = parser.AsDataReader();
while (dr.Read())
{
Console.WriteLine("Id: {0}, Name: {1}", dr[0], dr[1]);
}
}
}
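To illustrate the "loading a disconnected DataTable" use mentioned above, the sketch below feeds an IDataReader into the standard DataTable.Load method. So that the example runs without a Parquet file, the reader is produced from an in-memory table; DataReaderSketch and CreateSampleReader are hypothetical names standing in for the reader returned by parser.AsDataReader().

```csharp
using System;
using System.Data;

public static class DataReaderSketch
{
    // Hypothetical stand-in for parser.AsDataReader(): an in-memory
    // forward-only reader over two sample employee rows.
    public static IDataReader CreateSampleReader()
    {
        var source = new DataTable("emp");
        source.Columns.Add("Id", typeof(int));
        source.Columns.Add("Name", typeof(string));
        source.Rows.Add(1, "Tom");
        source.Rows.Add(2, "Mark");
        return source.CreateDataReader();
    }

    // Reads every row from the reader into a new disconnected DataTable.
    public static DataTable LoadDisconnected(IDataReader reader)
    {
        var table = new DataTable();
        table.Load(reader);
        return table;
    }
}
```

A SqlBulkCopy call would consume the reader in a similar forward-only fashion through its WriteToServer(IDataReader) overload, given a live database connection.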
ChoParquetReader exposes the AsDataTable helper method to retrieve the Parquet records as a .NET DataTable object. It can then be persisted to disk, displayed in a grid/controls or stored in memory like any other object.
Listing 12.1 Reading as DataTable sample
static void AsDataTableTest()
{
using (var parser = new ChoParquetReader<EmployeeRec>("emp.parquet"))
{
DataTable dt = parser.AsDataTable();
foreach (DataRow dr in dt.Rows)
{
Console.WriteLine("Id: {0}, Name: {1}", dr[0], dr[1]);
}
}
}
So far, the article explained using ChoParquetReader with a POCO object. ChoParquetReader also supports loading a Parquet file without a POCO object. It leverages the .NET dynamic feature. The sample below shows how to read a Parquet stream without a POCO object.
If you have a Parquet file, you can parse and load the file with minimal/zero configuration.
The sample below shows it:
Listing 13.1 Loading Parquet file
class Program
{
static void Main(string[] args)
{
dynamic row;
using (var parser = new ChoParquetReader("emp.parquet"))
{
while ((row = parser.Read()) != null)
{
Console.WriteLine(row.Id);
}
}
}
}
The above example automatically discovers the Parquet object members and parses the file.
You can override the default behavior of discovering fields automatically by adding field configurations manually and passing them to ChoParquetReader for parsing the file.
The sample shows how to do it:
Listing 13.3 Loading Parquet file with configuration
class Program
{
static void Main(string[] args)
{
ChoParquetRecordConfiguration config = new ChoParquetRecordConfiguration();
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Id"));
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Name"));
dynamic row;
using (var parser = new ChoParquetReader("emp.parquet", config))
{
while ((row = parser.Read()) != null)
{
Console.WriteLine(row.Name);
}
}
}
}
To completely turn off the auto field discovery, you will have to set ChoParquetRecordConfiguration.AutoDiscoverColumns to false.
The default value is the value set to the property when the Parquet value is empty or whitespace (controlled via IgnoreFieldValueMode).
Any POCO entity property can be given a default value using System.ComponentModel.DefaultValueAttribute.
For dynamic object members or to override the declarative POCO object member's default value specification, you can do so through configuration as shown below:
ChoParquetRecordConfiguration config = new ChoParquetRecordConfiguration();
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Id"));
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Name") { DefaultValue = "NoName" });
The fallback value is the value assigned to the property when the Parquet value fails to be set. The fallback value is applied only when ErrorMode is either IgnoreAndContinue or ReportAndContinue.
Any POCO entity property can be given a fallback value using ChoETL.ChoFallbackValueAttribute.
For dynamic object members, or to override the fallback value declared on a POCO object member, you can specify it through configuration as shown below:
ChoParquetRecordConfiguration config = new ChoParquetRecordConfiguration();
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Id"));
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Name") { FallbackValue = "Tom" });
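The difference between the default value and the fallback value can be illustrated with a minimal sketch in plain .NET. The Resolve helper below is hypothetical and is not part of ChoETL; it only mirrors the rules described above, assuming an integer field: empty or whitespace input yields the default value, and a failed conversion yields the fallback value.

```csharp
using System;
using System.Globalization;

public static class DefaultFallbackSketch
{
    // Hypothetical helper (not a ChoETL API): resolves one raw field value.
    public static int Resolve(string raw, int defaultValue, int fallbackValue)
    {
        // Empty or whitespace input: use the default value.
        if (string.IsNullOrWhiteSpace(raw))
            return defaultValue;

        // Conversion failure: use the fallback value instead of throwing.
        int parsed;
        return int.TryParse(raw, NumberStyles.Integer, CultureInfo.InvariantCulture, out parsed)
            ? parsed
            : fallbackValue;
    }

    public static void Main()
    {
        Console.WriteLine(Resolve("42", -1, 0));   // 42
        Console.WriteLine(Resolve("   ", -1, 0));  // -1 (default value)
        Console.WriteLine(Resolve("abc", -1, 0));  // 0  (fallback value)
    }
}
```

In the library itself, whether the fallback value is applied depends on the ErrorMode setting, which this sketch does not model.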
In the typeless dynamic object model, the reader reads individual field values and populates the dynamic object members with them as 'string' values. If you want to enforce the type and perform extra type checking during load, declare the field type in the field configuration.
Listing 8.5.1 Defining FieldType
ChoParquetRecordConfiguration config = new ChoParquetRecordConfiguration();
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Id") { FieldType = typeof(int) });
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Name"));
The sample above defines the field type of the 'Id' field as 'int'. This instructs ChoParquetReader to parse and convert the value to an integer before assigning it. This extra type safety prevents incorrect values from being loaded into the object while parsing.
Most primitive types are automatically converted and assigned to the properties by ChoParquetReader. If the value of a Parquet field can't be converted automatically into the type of the property, you can specify custom or built-in .NET converters to convert the value. These can be either IValueConverter or TypeConverter converters.
In the dynamic object model, you can specify these converters via configuration. The example below shows how to specify type converters for Parquet fields.
Listing 13.4.1 Specifying TypeConverters
ChoParquetRecordConfiguration config = new ChoParquetRecordConfiguration();
ChoParquetRecordFieldConfiguration idConfig =
new ChoParquetRecordFieldConfiguration("Id");
idConfig.AddConverter(new IntConverter());
config.ParquetRecordFieldConfigurations.Add(idConfig);
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Name"));
In the above, we construct an IntConverter and attach it to the 'Id' field using the AddConverter helper method of the ChoParquetRecordFieldConfiguration object.
Likewise, if you want to remove a converter from it, you can use RemoveConverter on the ChoParquetRecordFieldConfiguration object.
ChoParquetReader
leverages both System.ComponentModel.DataAnnotations
and Validation Block
validation attributes to specify validation rules for individual Parquet fields. Refer to the MSDN site for a list of available DataAnnotations
validation attributes.
Listing 13.5.1 Specifying Validations
ChoParquetRecordConfiguration config = new ChoParquetRecordConfiguration();
ChoParquetRecordFieldConfiguration idConfig =
new ChoParquetRecordFieldConfiguration("Id");
idConfig.Validators = new ValidationAttribute[] { new RangeAttribute(0, 100) };
config.ParquetRecordFieldConfigurations.Add(idConfig);
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Name"));
In the example above, we used the Range validation attribute for the Id property. ChoParquetReader performs validation during load when Configuration.ObjectValidationMode is set to ChoObjectValidationMode.MemberLevel or ChoObjectValidationMode.ObjectLevel.
P.S.: Self validation is NOT supported in the dynamic object model.
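To see what a DataAnnotations validator such as RangeAttribute does with a single value, here is a small sketch that uses only the standard System.ComponentModel.DataAnnotations API. It does not show how ChoParquetReader invokes the validators internally; the IsValid helper and the "Id" member name are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;

public static class RangeValidationSketch
{
    // Validates one value against a set of DataAnnotations attributes,
    // the same attribute types that go into the Validators array above.
    public static bool IsValid(object value, ValidationAttribute[] validators,
        List<ValidationResult> results)
    {
        // The context only supplies a member name for error messages here.
        var context = new ValidationContext(new object()) { MemberName = "Id" };
        return Validator.TryValidateValue(value, context, results, validators);
    }

    public static void Main()
    {
        var validators = new ValidationAttribute[] { new RangeAttribute(0, 100) };
        var results = new List<ValidationResult>();

        Console.WriteLine(IsValid(50, validators, results));   // True
        Console.WriteLine(IsValid(150, validators, results));  // False

        // Each failed check adds one ValidationResult with an error message.
        foreach (var result in results)
            Console.WriteLine(result.ErrorMessage);
    }
}
```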
If you already have an existing sealed POCO object, or the object is in a 3rd party library, you can use it with ChoParquetReader.
Listing 14.1 Existing sealed POCO Object
public sealed class ThirdPartyRec
{
public int Id
{
get;
set;
}
public string Name
{
get;
set;
}
}
Listing 14.2 Consuming Parquet file
class Program
{
static void Main(string[] args)
{
using (var parser = new ChoParquetReader<ThirdPartyRec>("emp.parquet"))
{
object row = null;
while ((row = parser.Read()) != null)
Console.WriteLine(row.ToString());
}
}
}
In this case, ChoParquetReader reverse discovers the Parquet fields from the Parquet file and loads the data into the POCO object. If the Parquet file structure and the POCO object match, the load succeeds and populates all corresponding data into its properties. If a property is missing for any Parquet field, ChoParquetReader silently ignores the field and continues with the rest.
You can override this behavior by setting the ChoParquetRecordConfiguration.ThrowAndStopOnMissingField property to true. In this case, ChoParquetReader will throw a ChoMissingRecordFieldException if a property is missing for a Parquet field.
ChoParquetReader throws different types of exceptions in different situations.
- ChoParserException - The Parquet file is bad and the parser is not able to recover.
- ChoRecordConfigurationException - Raised when any invalid configuration settings are specified.
- ChoMissingRecordFieldException - Raised when a property is missing for a Parquet field.
Cinchoo ETL works well with the data annotations MetadataType model. It is a way to attach a metadata class to a data model class. In this associated class, you provide additional metadata that is not in the data model, so attributes can be added to a class without modifying it. The MetadataType attribute takes a single parameter: the type of the class that carries all the attributes. This is useful when the POCO classes are auto-generated by tools (Entity Framework, MVC, etc.), because you can add new information without touching the generated file. It also promotes modularization by separating the concerns into multiple classes.
For more information about it, please search MSDN.
Listing 17.1 MetadataType annotation usage sample
[MetadataType(typeof(EmployeeRecMeta))]
public class EmployeeRec
{
public int Id { get; set; }
public string Name { get; set; }
}
[ChoParquetRecordObject]
public class EmployeeRecMeta : IChoNotifyRecordRead, IChoValidatable
{
[ChoParquetRecordField(FieldName = "id",
ErrorMode = ChoErrorMode.ReportAndContinue )]
[ChoTypeConverter(typeof(IntConverter))]
[Range(1, int.MaxValue, ErrorMessage = "Id must be > 0.")]
[ChoFallbackValue(1)]
public int Id { get; set; }
[ChoParquetRecordField(FieldName = "Name")]
[StringLength(1)]
[DefaultValue("ZZZ")]
[ChoFallbackValue("XXX")]
public string Name { get; set; }
public bool AfterRecordLoad(object target, int index, object source)
{
throw new NotImplementedException();
}
public bool BeforeRecordLoad(object target, int index, ref object source)
{
throw new NotImplementedException();
}
public bool RecordLoadError(object target, int index, object source, Exception ex)
{
throw new NotImplementedException();
}
public bool TryValidate
(object target, ICollection<ValidationResult> validationResults)
{
return true;
}
public bool TryValidateFor
(object target, string memberName, ICollection<ValidationResult> validationResults)
{
return true;
}
public void Validate(object target)
{
}
public void ValidateFor(object target, string memberName)
{
}
}
In the above, EmployeeRec is the data class. It contains only domain-specific properties and operations, which keeps it a very simple class to look at.
We separate the validation, callback mechanism, configuration, etc. into the metadata type class, EmployeeRecMeta.
If the POCO entity class is auto-generated, exposed via a library, or sealed, you cannot attach the Parquet schema definition to it declaratively. In such a case, you can choose one of the options below to specify the Parquet layout configuration:
- Manual Configuration
- Auto Map Configuration
- Attaching MetadataType class
I'm going to show you how to configure the below POCO entity class with each approach.
Listing 18.1 Sealed POCO entity class
public sealed class EmployeeRec
{
public int Id { get; set; }
public string Name { get; set; }
}
Define a brand new configuration object from scratch and add all the necessary Parquet fields to the ChoParquetRecordConfiguration.ParquetRecordFieldConfigurations collection property. This option gives you the greatest flexibility to control the configuration of Parquet parsing. The downside is that it is easy to make mistakes, and the configuration is hard to manage if the Parquet file layout is large.
Listing 18.1.1 Manual Configuration
ChoParquetRecordConfiguration config = new ChoParquetRecordConfiguration();
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Id"));
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Name"));
This is an alternative and far less error-prone approach: auto mapping the Parquet fields for the POCO entity class.
First, define a schema class for the EmployeeRec POCO entity class as below:
Listing 18.2.1 Auto Map class
public class EmployeeRecMap
{
[ChoParquetRecordField(FieldName = "Id")]
public int Id { get; set; }
[ChoParquetRecordField(FieldName = "Name")]
public string Name { get; set; }
}
Then you can use it to auto map Parquet fields by using ChoParquetRecordConfiguration.MapRecordFields
method.
Listing 18.2.2 Using Auto Map configuration
ChoParquetRecordConfiguration config = new ChoParquetRecordConfiguration();
config.MapRecordFields<EmployeeRecMap>();
foreach (var e in new ChoParquetReader<EmployeeRec>("emp.parquet", config))
Console.WriteLine(e.ToString());
Another approach is to attach a MetadataType class to the POCO entity object. The previous approach takes care of auto mapping the Parquet fields only. Other configuration properties like property converters, parser parameters, default/fallback values, etc. are not considered.
This model accounts for everything by defining a MetadataType class and specifying the Parquet configuration parameters declaratively. This is useful when your POCO entity is sealed and not a partial class. It is also one of the more favorable and less error-prone approaches to configuring Parquet parsing of a POCO entity.
Listing 18.3.1 Define MetadataType class
[ChoParquetRecordObject]
public class EmployeeRecMeta : IChoNotifyRecordRead, IChoValidatable
{
[ChoParquetRecordField
(FieldName = "Id", ErrorMode = ChoErrorMode.ReportAndContinue )]
[ChoTypeConverter(typeof(IntConverter))]
[Range(1, int.MaxValue, ErrorMessage = "Id must be > 0.")]
public int Id { get; set; }
[ChoParquetRecordField(FieldName = "Name")]
[StringLength(1)]
[DefaultValue("ZZZ")]
[ChoFallbackValue("XXX")]
public string Name { get; set; }
public bool AfterRecordLoad(object target, int index, object source)
{
throw new NotImplementedException();
}
public bool BeforeRecordLoad(object target, int index, ref object source)
{
throw new NotImplementedException();
}
public bool RecordLoadError(object target, int index, object source, Exception ex)
{
throw new NotImplementedException();
}
public bool TryValidate
(object target, ICollection<ValidationResult> validationResults)
{
return true;
}
public bool TryValidateFor
(object target, string memberName, ICollection<ValidationResult> validationResults)
{
return true;
}
public void Validate(object target)
{
}
public void ValidateFor(object target, string memberName)
{
}
}
Listing 18.3.2 Attaching MetadataType class
ChoMetadataObjectCache.Default.Attach<EmployeeRec>(new EmployeeRecMeta());
foreach (var e in new ChoParquetReader<EmployeeRec>("emp.parquet"))
Console.WriteLine(e.ToString());
This is a nifty little helper method to parse and load a Parquet text string into objects.
Listing 19.1 Using LoadText method
string txt = @"
[
{
""Id"": 1,
""Name"": ""Jeanette""
},
{
""Id"": 2,
""Name"": ""Giavani""
}
]";
foreach (var e in ChoParquetReader.LoadText(txt))
Console.WriteLine(e.ToStringEx());
Cinchoo ETL automatically parses and converts each Parquet field value to the corresponding Parquet field's underlying data type seamlessly. Most of the basic .NET types are handled automatically without any setup needed.
This is achieved through two key settings in the ETL system:
- ChoParquetRecordConfiguration.CultureInfo - Represents information about a specific culture including the names of the culture, the writing system, and the calendar used, as well as access to culture-specific objects that provide information for common operations, such as formatting dates and sorting strings. Default is 'en-US'.
- ChoTypeConverterFormatSpec - A global format specifier class that holds the formatting specs for all intrinsic .NET types.
In this section, I'm going to talk about changing the default format specs for each .NET intrinsic data type according to parsing needs.
ChoTypeConverterFormatSpec is a singleton class; the instance is exposed via the 'Instance' static member. It is thread local, which means that a separate instance copy is kept on each thread.
There are two sets of format spec members for each intrinsic type, one for loading and another for writing the value, except for the Boolean, Enum and DateTime types. These types have only one member for both loading and writing operations.
Specifying the format specs of an intrinsic data type through ChoTypeConverterFormatSpec has a system-wide impact. For example, setting ChoTypeConverterFormatSpec.IntNumberStyle = NumberStyles.AllowParentheses makes all integer members of Parquet objects allow parentheses. If you want to override this behavior so that a specific Parquet data member handles its own unique parsing of the Parquet value independently of the global system-wide setting, specify a TypeConverter at the Parquet field member level. Refer to section 13.4 for more information.
NumberStyles (optional) is used for loading values from the Parquet stream, and Format strings are used for writing values to the Parquet stream.
In this article, I'll briefly cover using NumberStyles for loading Parquet data from a stream. These values are optional. They determine the styles permitted for each type during parsing of the Parquet file. The system automatically figures out how to parse and load the values from the underlying Culture. In odd situations, you may want to override this and set the styles the way you want in order to load the file successfully. Refer to MSDN for more about NumberStyles and its values.
Listing 20.1.1 ChoTypeConverterFormatSpec Members
public class ChoTypeConverterFormatSpec
{
public static readonly ThreadLocal<ChoTypeConverterFormatSpec> Instance =
new ThreadLocal<ChoTypeConverterFormatSpec>(() => new ChoTypeConverterFormatSpec());
public string DateTimeFormat { get; set; }
public ChoBooleanFormatSpec BooleanFormat { get; set; }
public ChoEnumFormatSpec EnumFormat { get; set; }
public NumberStyles? CurrencyNumberStyle { get; set; }
public string CurrencyFormat { get; set; }
public NumberStyles? BigIntegerNumberStyle { get; set; }
public string BigIntegerFormat { get; set; }
public NumberStyles? ByteNumberStyle { get; set; }
public string ByteFormat { get; set; }
public NumberStyles? SByteNumberStyle { get; set; }
public string SByteFormat { get; set; }
public NumberStyles? DecimalNumberStyle { get; set; }
public string DecimalFormat { get; set; }
public NumberStyles? DoubleNumberStyle { get; set; }
public string DoubleFormat { get; set; }
public NumberStyles? FloatNumberStyle { get; set; }
public string FloatFormat { get; set; }
public string IntFormat { get; set; }
public NumberStyles? IntNumberStyle { get; set; }
public string UIntFormat { get; set; }
public NumberStyles? UIntNumberStyle { get; set; }
public NumberStyles? LongNumberStyle { get; set; }
public string LongFormat { get; set; }
public NumberStyles? ULongNumberStyle { get; set; }
public string ULongFormat { get; set; }
public NumberStyles? ShortNumberStyle { get; set; }
public string ShortFormat { get; set; }
public NumberStyles? UShortNumberStyle { get; set; }
public string UShortFormat { get; set; }
}
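Because the listing above exposes Instance as a ThreadLocal, a change made on one thread is not visible on another. The sketch below demonstrates that behavior with the standard System.Threading.ThreadLocal type. The FormatSpecSketch class is a hypothetical stand-in with a single property, not the real ChoTypeConverterFormatSpec.

```csharp
using System;
using System.Globalization;
using System.Threading;

// Hypothetical stand-in for a thread-local format spec holder.
public class FormatSpecSketch
{
    public static readonly ThreadLocal<FormatSpecSketch> Instance =
        new ThreadLocal<FormatSpecSketch>(() => new FormatSpecSketch());

    public NumberStyles? IntNumberStyle { get; set; }

    // Sets the style on the calling thread, then reports whether
    // a second thread observes that change.
    public static bool OtherThreadSeesChange()
    {
        Instance.Value.IntNumberStyle = NumberStyles.AllowParentheses;

        bool seen = false;
        var worker = new Thread(() => seen = Instance.Value.IntNumberStyle.HasValue);
        worker.Start();
        worker.Join();
        return seen;
    }

    public static void Main()
    {
        // The worker thread gets its own fresh instance, so the change is not seen.
        Console.WriteLine(OtherThreadSeesChange()); // False
    }
}
```

A practical consequence is that format specs should be set on the same thread that performs the read.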
The sample below shows how to load a Parquet data stream containing 'sv-SE' (Swedish) culture-specific data using ChoParquetReader. Also, the input feed comes with 'EmployeeNo' values containing parentheses. In order to make the load successful, we have to set ChoTypeConverterFormatSpec.IntNumberStyle to NumberStyles.AllowParentheses.
Listing 20.1.2 Using ChoTypeConverterFormatSpec in code
static void UsingFormatSpecs()
{
ChoParquetRecordConfiguration config = new ChoParquetRecordConfiguration();
config.Culture = new System.Globalization.CultureInfo("sv-SE");
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Id") { FieldType = typeof(int) });
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Name"));
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Salary")
{ FieldType = typeof(ChoCurrency) });
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("JoinedDate")
{ FieldType = typeof(DateTime) });
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("EmployeeNo") { FieldType = typeof(int) });
ChoTypeConverterFormatSpec.Instance.IntNumberStyle = NumberStyles.AllowParentheses;
using (var parser = new ChoParquetReader("emp.parquet", config))
{
object row = null;
while ((row = parser.Read()) != null)
Console.WriteLine(row.ToStringEx());
}
}
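The effect of NumberStyles.AllowParentheses can be checked in isolation with the standard .NET parsing API, independent of ChoETL. In this minimal sketch the helper names are illustrative; it shows that a parenthesized number is rejected under the default integer style and parsed as a negative value once parentheses are allowed.

```csharp
using System;
using System.Globalization;

public static class NumberStylesSketch
{
    // Parses text such as "(42)", where parentheses denote a negative number.
    public static int ParseWithParentheses(string text)
    {
        return int.Parse(text, NumberStyles.AllowParentheses, CultureInfo.InvariantCulture);
    }

    // Reports whether the text parses under the default integer style.
    public static bool ParsesWithDefaultStyle(string text)
    {
        int ignored;
        return int.TryParse(text, NumberStyles.Integer, CultureInfo.InvariantCulture, out ignored);
    }

    public static void Main()
    {
        Console.WriteLine(ParsesWithDefaultStyle("(42)")); // False
        Console.WriteLine(ParseWithParentheses("(42)"));   // -42
    }
}
```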
Cinchoo ETL provides the ChoCurrency object to read and write currency values in Parquet files. ChoCurrency is a wrapper class that holds the currency value as a decimal, with support for serializing it in text format during Parquet load.
Listing 20.2.1 Using Currency members in dynamic model
static void CurrencyDynamicTest()
{
ChoParquetRecordConfiguration config = new ChoParquetRecordConfiguration();
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Id"));
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Name"));
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Salary")
{ FieldType = typeof(ChoCurrency) });
using (var parser = new ChoParquetReader("emp.parquet", config))
{
object rec;
while ((rec = parser.Read()) != null)
{
Console.WriteLine(rec.ToStringEx());
}
}
}
The sample above shows how to load currency values using the dynamic object model. By default, all members of a dynamic object are treated as string type, unless specified explicitly via ChoParquetRecordFieldConfiguration.FieldType. By specifying the field type as ChoCurrency for the 'Salary' Parquet field, ChoParquetReader loads the values as currency objects.
PS: The format of the currency value is determined by ChoParquetReader through ChoRecordConfiguration.Culture and ChoTypeConverterFormatSpec.CurrencyNumberStyle.
The sample below shows how to use a ChoCurrency Parquet field in a POCO entity class.
Listing 20.2.2 Using Currency members in POCO model
public class EmployeeRecWithCurrency
{
public int Id { get; set; }
public string Name { get; set; }
public ChoCurrency Salary { get; set; }
}
static void CurrencyTest()
{
using (var parser = new ChoParquetReader<EmployeeRecWithCurrency>("emp.parquet"))
{
object rec;
while ((rec = parser.Read()) != null)
{
Console.WriteLine(rec.ToStringEx());
}
}
}
Cinchoo ETL implicitly handles parsing of enum field values from Parquet files. If you want fine control over the parsing of these values, you can specify them globally via ChoTypeConverterFormatSpec.EnumFormat. The default is ChoEnumFormatSpec.Value.
FYI, changing this value has a system-wide impact.
There are three possible values that can be used:
- ChoEnumFormatSpec.Value - Enum value is used for parsing.
- ChoEnumFormatSpec.Name - Enum key name is used for parsing.
- ChoEnumFormatSpec.Description - If each enum key is decorated with DescriptionAttribute, its value will be used for parsing.
Listing 20.3.1 Specifying Enum format specs during parsing
public enum EmployeeType
{
[Description("Full Time Employee")]
Permanent = 0,
[Description("Temporary Employee")]
Temporary = 1,
[Description("Contract Employee")]
Contract = 2
}
static void EnumTest()
{
ChoTypeConverterFormatSpec.Instance.EnumFormat = ChoEnumFormatSpec.Description;
ChoParquetRecordConfiguration config = new ChoParquetRecordConfiguration();
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Id") { FieldType = typeof(int) });
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Name"));
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Salary")
{ FieldType = typeof(ChoCurrency) });
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("JoinedDate")
{ FieldType = typeof(DateTime) });
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("EmployeeType")
{ FieldType = typeof(EmployeeType) });
ChoTypeConverterFormatSpec.Instance.IntNumberStyle = NumberStyles.AllowParentheses;
using (var parser = new ChoParquetReader("emp.parquet", config))
{
object row = null;
while ((row = parser.Read()) != null)
Console.WriteLine(row.ToStringEx());
}
}
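The three enum formats (value, name, description) correspond to three different ways of mapping text to an enum member. The sketch below reproduces each mapping with plain .NET reflection. The EmployeeKind enum and the helper methods are hypothetical and only illustrate the idea; they are not how the library implements it.

```csharp
using System;
using System.ComponentModel;
using System.Globalization;
using System.Reflection;

// Hypothetical enum used only for this illustration.
public enum EmployeeKind
{
    [Description("Full Time Employee")]
    Permanent = 0,
    [Description("Contract Employee")]
    Contract = 2
}

public static class EnumFormatSketch
{
    // "Value" style: the text holds the underlying numeric value.
    public static EmployeeKind FromValue(string text)
    {
        int number = int.Parse(text, CultureInfo.InvariantCulture);
        return (EmployeeKind)Enum.ToObject(typeof(EmployeeKind), number);
    }

    // "Name" style: the text holds the enum key name.
    public static EmployeeKind FromName(string text)
    {
        return (EmployeeKind)Enum.Parse(typeof(EmployeeKind), text);
    }

    // "Description" style: the text holds the DescriptionAttribute value.
    public static EmployeeKind FromDescription(string text)
    {
        foreach (FieldInfo field in typeof(EmployeeKind)
            .GetFields(BindingFlags.Public | BindingFlags.Static))
        {
            var attr = field.GetCustomAttribute<DescriptionAttribute>();
            if (attr != null && attr.Description == text)
                return (EmployeeKind)field.GetValue(null);
        }
        throw new ArgumentException("No enum member has description: " + text);
    }

    public static void Main()
    {
        Console.WriteLine(FromValue("2"));                       // Contract
        Console.WriteLine(FromName("Permanent"));                // Permanent
        Console.WriteLine(FromDescription("Contract Employee")); // Contract
    }
}
```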
Cinchoo ETL implicitly handles parsing of boolean Parquet field values from Parquet files. If you want fine control over the parsing of these values, you can specify them globally via ChoTypeConverterFormatSpec.BooleanFormat. The default value is ChoBooleanFormatSpec.ZeroOrOne.
FYI, changing this value has a system-wide impact.
There are four possible values that can be used:
- ChoBooleanFormatSpec.ZeroOrOne - '0' for false, '1' for true.
- ChoBooleanFormatSpec.YOrN - 'Y' for true, 'N' for false.
- ChoBooleanFormatSpec.TrueOrFalse - 'True' for true, 'False' for false.
- ChoBooleanFormatSpec.YesOrNo - 'Yes' for true, 'No' for false.
Listing 20.4.1 Specifying boolean format specs during parsing
static void BoolTest()
{
ChoTypeConverterFormatSpec.Instance.BooleanFormat = ChoBooleanFormatSpec.ZeroOrOne;
ChoParquetRecordConfiguration config = new ChoParquetRecordConfiguration();
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Id") { FieldType = typeof(int) });
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Name"));
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Salary")
{ FieldType = typeof(ChoCurrency) });
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("JoinedDate")
{ FieldType = typeof(DateTime) });
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Active") { FieldType = typeof(bool) });
ChoTypeConverterFormatSpec.Instance.IntNumberStyle = NumberStyles.AllowParentheses;
using (var parser = new ChoParquetReader("emp.parquet", config))
{
object row = null;
while ((row = parser.Read()) != null)
Console.WriteLine(row.ToStringEx());
}
}
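The four boolean formats amount to four pairs of true/false text tokens. Here is a rough sketch of such a mapping in plain C#. The BoolFormatSketch enum and the Parse helper are hypothetical, and the case-insensitive comparison is an assumption rather than documented library behavior.

```csharp
using System;

// Hypothetical mirror of the four formats listed above.
public enum BoolFormatSketch { ZeroOrOne, YOrN, TrueOrFalse, YesOrNo }

public static class BooleanFormatSketch
{
    // Maps text to a bool according to the selected token pair.
    public static bool Parse(string text, BoolFormatSketch format)
    {
        string trueText, falseText;
        switch (format)
        {
            case BoolFormatSketch.ZeroOrOne: trueText = "1"; falseText = "0"; break;
            case BoolFormatSketch.YOrN: trueText = "Y"; falseText = "N"; break;
            case BoolFormatSketch.TrueOrFalse: trueText = "True"; falseText = "False"; break;
            case BoolFormatSketch.YesOrNo: trueText = "Yes"; falseText = "No"; break;
            default: throw new ArgumentOutOfRangeException("format");
        }

        // Assumption: tokens are matched without regard to case.
        if (string.Equals(text, trueText, StringComparison.OrdinalIgnoreCase)) return true;
        if (string.Equals(text, falseText, StringComparison.OrdinalIgnoreCase)) return false;
        throw new FormatException("Unexpected boolean text: " + text);
    }

    public static void Main()
    {
        Console.WriteLine(Parse("1", BoolFormatSketch.ZeroOrOne)); // True
        Console.WriteLine(Parse("N", BoolFormatSketch.YOrN));      // False
        Console.WriteLine(Parse("yes", BoolFormatSketch.YesOrNo)); // True
    }
}
```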
Cinchoo ETL implicitly handles parsing of datetime Parquet field values from Parquet files using the system Culture or a custom set culture. If you want fine control over the parsing of these values, you can specify them globally via ChoTypeConverterFormatSpec.DateTimeFormat. The default value is 'd'.
FYI, changing this value has a system-wide impact.
You can use any valid standard or custom .NET datetime format specification to parse the datetime Parquet values from the file.
Listing 20.5.1 Specifying datetime format specs during parsing
static void DateTimeTest()
{
ChoTypeConverterFormatSpec.Instance.DateTimeFormat = "MMM dd, yyyy";
ChoParquetRecordConfiguration config = new ChoParquetRecordConfiguration();
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Id") { FieldType = typeof(int) });
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Name"));
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Salary")
{ FieldType = typeof(ChoCurrency) });
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("JoinedDate")
{ FieldType = typeof(DateTime) });
config.ParquetRecordFieldConfigurations.Add
(new ChoParquetRecordFieldConfiguration("Active") { FieldType = typeof(bool) });
ChoTypeConverterFormatSpec.Instance.IntNumberStyle = NumberStyles.AllowParentheses;
using (var parser = new ChoParquetReader("emp.parquet", config))
{
object row = null;
while ((row = parser.Read()) != null)
Console.WriteLine(row.ToStringEx());
}
}
The sample above shows how to parse custom datetime Parquet values from a Parquet file.
Note: If the datetime values contain the separator character, they must be enclosed in double quotes to be parsed successfully.
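A custom format string such as "MMM dd, yyyy" follows the standard .NET custom date and time format rules, so it can be tried out with DateTime.ParseExact before it is used in a configuration. The sketch below uses only the base class library; the helper name and the invariant culture are assumptions made for the example.

```csharp
using System;
using System.Globalization;

public static class DateTimeFormatSketch
{
    // Parses text that must match the given custom format exactly.
    public static DateTime Parse(string text, string format)
    {
        return DateTime.ParseExact(text, format, CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        DateTime joined = Parse("Dec 30, 2003", "MMM dd, yyyy");

        // Print in a culture-independent form.
        Console.WriteLine(joined.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture)); // 2003-12-30
    }
}
```

Text that does not match the format, for example "30-12-2003" against "MMM dd, yyyy", makes ParseExact throw a FormatException.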
ChoParquetReader exposes a few frequently used configuration parameters via fluent API methods. This makes programming the parsing of Parquet files quicker.
This API method specifies the list of Parquet fields to be considered for parsing and loading. Other fields in the Parquet file will be discarded.
foreach (var e in new ChoParquetReader<EmployeeRec>
("emp.parquet").WithFields("Id", "Name"))
Console.WriteLine(e.ToString());
This API method is used to add a Parquet node with ParquetPath, data type and other parameters. It is helpful in the dynamic object model, where each individual Parquet node can be specified with the appropriate data type.
foreach (var e in new ChoParquetReader<EmployeeRec>("emp.parquet").WithField
("Id", fieldType: typeof(int)))
Console.WriteLine(e.ToString());
This API method is used to set the ChoParquetReader to perform a check on the field count before reading the Parquet file.
foreach (var e in new ChoParquetReader<EmployeeRec>("emp.parquet").ColumnCountStrict())
Console.WriteLine(e.ToString());
This API method is used to define the number of rows to be processed before generating a notification event. This property is designed for user interface components that illustrate the Parquet loading progress. Notifications are sent to subscribers of the RowsLoaded event.
static void NotifyAfterTest()
{
using (var parser = new ChoParquetReader("emp.parquet")
.NotifyAfter(1000)
)
{
parser.RowsLoaded += (o, e) => Console.WriteLine(e.RowsLoaded);
foreach (var rec in parser)
{
Console.WriteLine(String.Format("Id: {0}", rec.Id));
Console.WriteLine(String.Format("Name: {0}", rec.Name));
Console.WriteLine(String.Format("Salary: {0}", rec.Salary));
}
}
}
This API method is used to configure all configuration parameters which are not exposed via the fluent API.
static void ConfigureTest()
{
using (var parser = new ChoParquetReader("emp.parquet")
.Configure(c => c.ErrorMode = ChoErrorMode.ThrowAndStop)
)
{
foreach (var rec in parser)
{
Console.WriteLine(String.Format("Id: {0}", rec.Id));
Console.WriteLine(String.Format("Name: {0}", rec.Name));
Console.WriteLine(String.Format("Salary: {0}", rec.Salary));
}
}
}
This API method is used to set up the reader's parameters / events via the fluent API.
static void SetupTest()
{
using (var parser = new ChoParquetReader("emp.parquet")
.Setup(r => r.BeforeRecordLoad += (o, e) =>
{
if (e.Source.CastTo<JObject>().ContainsKey("Name1"))
e.Skip = true;
})
)
{
foreach (var rec in parser)
{
Console.WriteLine(String.Format("Id: {0}", rec.Id));
Console.WriteLine(String.Format("Name: {0}", rec.Name));
Console.WriteLine(String.Format("Salary: {0}", rec.Salary));
}
}
}
ChoParquetReader
implicitly handles the conversion of the enum
text to enum
value. The sample below shows how to load Parquet
with POCO object:
public enum Gender { Male, Female }
public class Employee
{
public int Age { get; set; }
public Gender Gender { get; set; }
}
static void EnumTest()
{
using (var r = new ChoParquetReader<Employee>("emp.parquet"))
{
foreach (var rec in r)
Console.WriteLine(rec.Dump());
}
}
The sample below shows how to parse the Parquet
with enum
values in dynamic object model approach:
static void DynamicEnumTest()
{
using (var r = new ChoParquetReader<Employee>("emp.parquet")
.WithField("Age")
.WithField("Gender", fieldType: typeof(Gender))
)
{
foreach (var rec in r)
Console.WriteLine(rec.Dump());
}
}
ChoParquetReader
does this implicitly in the dynamic object model.
static void DynamicEnumTest()
{
using (dynamic r = new ChoParquetReader("emp.parquet")
.WithField("Age")
.WithField("Gender", fieldType: typeof(Gender))
)
{
foreach (var rec in r)
{
Console.WriteLine(rec.Age);
Console.WriteLine(rec.Gender);
}
}
}
In the above, the parser loads the .parquet file, constructs and returns dynamic
object.
Cinchoo ETL provides ChoXmlWriter
to generate XML file from objects. With ChoParquetReader
along with ChoXmlWriter
, you can convert Parquet
to XML format easily.
static void Parquet2XmlTest()
{
StringBuilder xml = new StringBuilder();
using (var r = new ChoParquetReader("emp.parquet"))
{
using (var w = new ChoXmlWriter(xml)
.WithRootName("Emps")
.WithNodeName("Emp")
)
w.Write(r);
}
Console.WriteLine(xml.ToString());
}
Output
<Emps>
<Emp>
<Id>1</Id>
<Name>Mark</Name>
</Emp>
<Emp>
<Id>2</Id>
<Name>Tom</Name>
</Emp>
</Emps>
Cinchoo ETL provides ChoCSVWriter
to generate CSV file from objects. With ChoParquetReader
along with ChoCSVWriter
, you can convert Parquet
to CSV format easily.
static void Parquet2CSVTest()
{
StringBuilder csv = new StringBuilder();
using (var r = new ChoParquetReader("emp.parquet"))
{
using (var w = new ChoCSVWriter(csv).WithFirstLineHeader())
w.Write(r);
}
Console.WriteLine(csv.ToString());
}
Output
Id, Name
1, Tom
2, Mark
Cinchoo ETL provides ChoJSONWriter
to generate JSON file from objects. With ChoParquetReader
along with ChoJSONWriter
, you can convert Parquet
to JSON format easily.
static void Parquet2JSONTest()
{
StringBuilder json = new StringBuilder();
using (var r = new ChoParquetReader("emp.parquet"))
{
using (var w = new ChoJSONWriter(json))
w.Write(r);
}
Console.WriteLine(json.ToString());
}
Output
[
{
"Id" : 1,
"Name" : "Tom"
},
{
"Id" : 2,
"Name" : "Mark"
}
]
Using WithField()
fluent API method, you can specify selective fields to be loaded from Parquet feed.
static void SelectiveFieldTest()
{
using (dynamic r = new ChoParquetReader("emp.parquet")
.WithField("Age")
.WithField("Gender", fieldType: typeof(Gender))
)
{
foreach (var rec in r)
{
Console.WriteLine(rec.Age);
Console.WriteLine(rec.Gender);
}
}
}
ChoParquetReader provides a little helper method, AsDataTable(), to convert Parquet to a DataTable.
static void ConvertToDataTableTest()
{
using (var r = new ChoParquetReader<UserInfo>("emp.parquet"))
{
var dt = r.AsDataTable();
}
}
ChoParquetReader provides a little helper method, AsDataReader(), to convert Parquet to a DataReader.
static void ConvertToDataReaderTest()
{
using (var r = new ChoParquetReader<UserInfo>("emp.parquet"))
{
var dr = r.AsDataReader();
}
}
This sample deserializes Parquet
to an object:
public class Account
{
public string Email { get; set; }
public bool Active { get; set; }
public DateTime CreatedDate { get; set; }
public IList<string> Roles { get; set; }
}
static void DeserializeObject()
{
Account account = ChoParquetReader.Deserialize<Account>
("emp.parquet").FirstOrDefault();
Console.WriteLine(account.Email);
}
This sample deserializes Parquet to a collection:
static void DeserializeCollection()
{
List<EmployeeRec> emps =
ChoParquetReader.Deserialize<EmployeeRec>("emp.parquet").ToList();
}
This sample deserializes Parquet
to a Dictionary
:
static void DeserializeDictionary()
{
Dictionary<string, object> htmlAttributes =
ChoParquetReader.Deserialize<Dictionary<string, object>>
("emp.parquet").FirstOrDefault();
Console.WriteLine(htmlAttributes["Key"]);
Console.WriteLine(htmlAttributes["Value"]);
}
This sample deserializes Parquet from a file:
public class Movie
{
public string Name { get; set; }
public int Year { get; set; }
}
static void DeserializeFromFile()
{
Movie movie1 = ChoParquetReader.Deserialize<Movie>("movie.parquet").FirstOrDefault();
}
This sample deserializes Parquet with a custom factory that instantiates an Employee instance for the Person type.
public class Person
{
public string FirstName { get; set; }
public string LastName { get; set; }
public DateTime BirthDate { get; set; }
}
public class Employee : Person
{
public string Department { get; set; }
public string JobTitle { get; set; }
}
static void CustomCreationTest()
{
ChoActivator.Factory = (type, args) =>
{
if (type == typeof(Person))
return new Employee();
else
return null;
};
Person person = ChoParquetReader.Deserialize<Person>("emp.parquet").FirstOrDefault();
Console.WriteLine(person.GetType().Name);
}
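To make the mechanics of such a hook easier to follow, here is a minimal sketch of the general factory-hook pattern in plain C#. It is not ChoETL's implementation: FactoryHookSketch, its Factory field and the Create method are hypothetical names, and falling back to the default constructor when the factory returns null is an assumption suggested by the return null branch in the sample above.

```csharp
using System;
using System.Text;

public class Person { public string FirstName { get; set; } }
public class Employee : Person { public string Department { get; set; } }

public static class FactoryHookSketch
{
    // Hypothetical hook with the same delegate shape as the factory assigned above.
    public static Func<Type, object[], object> Factory;

    // Asks the hook first; if it is unset or returns null, falls back to
    // normal constructor-based creation (assumed behavior, see lead-in).
    public static object Create(Type type, params object[] args)
    {
        object instance = Factory != null ? Factory(type, args) : null;
        return instance ?? Activator.CreateInstance(type, args);
    }

    public static void Main()
    {
        Factory = (type, args) => type == typeof(Person) ? new Employee() : null;

        Console.WriteLine(Create(typeof(Person)).GetType().Name);        // prints Employee
        Console.WriteLine(Create(typeof(StringBuilder)).GetType().Name); // prints StringBuilder
    }
}
```

A hook like this is process-wide state, so it is usually set once at startup rather than changed between reads.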
ChoParquetReader can automatically convert datetime values using the current system culture. If the Parquet file contains datetime values in a custom format, you can specify that format so the values are parsed successfully.
Sample Parquet record with a custom datetime format value:
{
'Department': 'Furniture',
'JobTitle': 'Carpenter',
'FirstName': 'John',
'LastName': 'Joinery',
'BirthDate': '30-12-2003'
}
Define the POCO class as below to handle the custom datetime format:
// DisplayFormat is defined in System.ComponentModel.DataAnnotations
public class Employee
{
    public string Department { get; set; }
    public string JobTitle { get; set; }
    [DisplayFormat(DataFormatString = "dd-MM-yyyy")]
    public DateTime BirthDate { get; set; }
}
or:
public class Employee
{
[ChoParquetRecordField]
public string Department { get; set; }
[ChoParquetRecordField]
public string JobTitle { get; set; }
[ChoParquetRecordField(FormatText = "dd-MM-yyyy")]
public DateTime BirthDate { get; set; }
}
Use the parser to load the Parquet file as below:
using (var r = new ChoParquetReader<Employee>("emp.parquet"))
{
foreach (var rec in r)
Console.WriteLine(rec.Dump());
}
In the dynamic model, you can set the custom datetime format as below:
using (var r = new ChoParquetReader("emp.parquet")
.WithField("Department")
.WithField("JobTitle")
.WithField("BirthDate", fieldType: typeof(DateTime), formatText: "dd-MM-yyyy")
)
{
foreach (var rec in r)
Console.WriteLine(rec.Dump());
}
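The format string used in these samples, "dd-MM-yyyy", follows the standard .NET custom date and time format syntax. As a rough illustration of what it accepts, the sketch below parses the sample BirthDate value with DateTime.TryParseExact. That ChoETL applies the format with equivalent semantics is an assumption, and ParseWithFormat is a hypothetical helper, not part of the library.

```csharp
using System;
using System.Globalization;

public static class CustomDateFormatSketch
{
    // Hypothetical helper: parses text strictly against one format and
    // returns null when the text does not match it.
    public static DateTime? ParseWithFormat(string text, string format)
    {
        DateTime value;
        bool ok = DateTime.TryParseExact(
            text, format, CultureInfo.InvariantCulture, DateTimeStyles.None, out value);
        return ok ? value : (DateTime?)null;
    }

    public static void Main()
    {
        // The day-first format matches the sample BirthDate value.
        DateTime? parsed = ParseWithFormat("30-12-2003", "dd-MM-yyyy");
        Console.WriteLine(parsed.Value.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture)); // prints 2003-12-30

        // A month-first format rejects the same text, because 30 is not a valid month.
        Console.WriteLine(ParseWithFormat("30-12-2003", "MM-dd-yyyy").HasValue); // prints False
    }
}
```

Note that "MM" means month and "mm" means minutes, so the casing in the format string matters.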
- 4th June, 2020: Initial version