In this tip, you will learn how to split a large JSON file based on a deeply nested array property using the Cinchoo ETL framework. The framework is simple to use: the split can be done in a few lines of code. Because the process is stream based, it handles large files quickly and with a low memory footprint.
ChoETL is an open source ETL (extract, transform, and load) framework for .NET. It is a code-based library for extracting data from multiple sources, transforming it, and loading it into your own data warehouse in a .NET environment. You can have data in your data warehouse in no time.
This article covers splitting a large JSON file based on a deeply nested array property using the Cinchoo ETL framework. This method parses the large JSON file without running into an out-of-memory exception by splitting it into multiple smaller files.
The framework library is written in C# for .NET Framework 4.5 and .NET Core 3.x.
3.1 Sample Data
Let's begin by looking at the sample JSON file below. Assume the file is large: the InstrumentData nodes are repeated and contain thousands of elements. The objective of this exercise is to produce split files, each containing all the root nodes along with a single InstrumentData node.
Listing 3.1.1. Input JSON file (sample.json)
{
  "Job": {
    "Keys": {
      "JobID": "test123",
      "DeviceID": "TEST01"
    },
    "Props": {
      "FileType": "Measurements",
      "InstrumentDescriptions": [
        {
          "InstrumentID": "1723007",
          "InstrumentType": "Actual1",
          "Name": "U",
          "DataType": "Double",
          "Units": "degC"
        },
        {
          "InstrumentID": "2424009",
          "InstrumentType": "Actual2",
          "Name": "VG03",
          "DataType": "Double",
          "Units": "Pa"
        }
      ]
    },
    "Steps": [
      {
        "Keys": {
          "StepID": "START",
          "StepResult": "NormalEnd"
        },
        "InstrumentData": [
          {
            "Keys": {
              "InstrumentID": "1723007"
            },
            "Measurements": [
              {
                "DateTime": "2021-11-16 21:18:37.000",
                "Value": 540
              },
              {
                "DateTime": "2021-11-16 21:18:37.100",
                "Value": 539
              },
              {
                "DateTime": "2021-11-16 21:18:37.200",
                "Value": 540
              },
              {
                "DateTime": "2021-11-16 21:18:37.300",
                "Value": 540
              }
            ]
          },
          {
            "Keys": {
              "InstrumentID": "2424009"
            },
            "Measurements": [
              {
                "DateTime": "2021-11-16 21:18:37.000",
                "Value": 1333.22
              },
              {
                "DateTime": "2021-11-16 21:18:37.100",
                "Value": 1333.22
              }
            ]
          }
        ]
      }
    ]
  }
}
Step 1: In the above, Job.Keys and Job.Props are common to all elements and must be included in every split file.
Step 2: The input file carries Job.Steps[*].InstrumentData[*] nodes, so two levels of parsing are needed to split the file at the InstrumentData level.
Step 3: Break the file by each Steps[*] node (i.e., Steps_0.json, Steps_1.json, etc.).
Step 4: Then take each steps file and break it by each InstrumentData[*] node (i.e., InstrumentData_0.json, InstrumentData_1.json, etc.).
The final split files are expected to look as below.
Listing 3.1.2. Output split JSON file (InstrumentData_0.json)
{
  "Job": {
    "Keys": {
      "JobID": "test123",
      "DeviceID": "TEST01"
    },
    "Props": {
      "FileType": "Measurements",
      "InstrumentDescriptions": [
        {
          "InstrumentID": "1723007",
          "InstrumentType": "Actual1",
          "Name": "U",
          "DataType": "Double",
          "Units": "degC"
        },
        {
          "InstrumentID": "2424009",
          "InstrumentType": "Actual2",
          "Name": "VG03",
          "DataType": "Double",
          "Units": "Pa"
        }
      ]
    },
    "Steps": {
      "Keys": {
        "InstrumentID": "1723007"
      },
      "Measurements": [
        {
          "DateTime": "2021-11-16 21:18:37.000",
          "Value": 540
        },
        {
          "DateTime": "2021-11-16 21:18:37.100",
          "Value": 539
        },
        {
          "DateTime": "2021-11-16 21:18:37.200",
          "Value": 540
        }
      ]
    }
  }
}
The first thing to do is install the ChoETL.JSON / ChoETL.JSON.NETStandard NuGet package. To do this, run the following command in the Package Manager Console.
.NET Framework
Install-Package ChoETL.JSON
.NET Core
Install-Package ChoETL.JSON.NETStandard
Now, add the ChoETL namespace to the program.
using ChoETL;
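To verify the package is wired up, you can parse a small JSON snippet with ChoJSONReader. This is a minimal smoke test, not part of the split process; the snippet and property names below are illustrative:

```csharp
using System;
using System.Linq;
using ChoETL;

class PackageSmokeTest
{
    static void Main()
    {
        // Illustrative snippet; any small JSON document works here.
        string json = @"{ ""InstrumentID"": ""1723007"", ""Units"": ""degC"" }";

        // ChoJSONReader streams records as dynamic objects.
        using (var r = ChoJSONReader.LoadText(json))
        {
            dynamic rec = r.FirstOrDefault();
            Console.WriteLine(rec.InstrumentID);
        }
    }
}
```

If this compiles and prints the instrument ID, the package reference and namespace import are in place.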
3.2 Split Operation
Since the JSON file is large, we need to deserialize the InstrumentData nodes in a streaming manner rather than loading the entire file into memory, to avoid memory pressure.
First, split the file by Steps nodes as below:
- Capture the value of the Job.Keys node.
- Capture the value of the Job.Props node.
- Then loop through each Job.Steps node and generate output split files (aka Steps_0.json, Steps_1.json, etc.) using ChoJObjectWriter (a utility class to write JSON values in a streaming manner).
- Finally, the method returns the list of steps split filenames, which are then consumed to split by InstrumentData node.
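Before walking through the full method, here is a minimal sketch of ChoJObjectWriter on its own, using only the members that appear in the listings in this article (the output filename and property values are illustrative):

```csharp
using ChoETL;

class JObjectWriterSketch
{
    static void Main()
    {
        // Streams { "Job": { "JobID": "test123", "DeviceID": "TEST01" } }
        // to demo.json without building the whole document in memory.
        using (var topJo = new ChoJObjectWriter("demo.json")) // illustrative filename
        {
            topJo.Formatting = Newtonsoft.Json.Formatting.Indented;

            // A nested writer emits a child object under the "Job" property.
            using (var jo = new ChoJObjectWriter("Job", topJo))
            {
                jo.WriteProperty("JobID", "test123");
                jo.WriteProperty("DeviceID", "TEST01");
            }
        }
    }
}
```

Disposing the writers closes the JSON objects in order, which is what lets the split methods below emit each output file incrementally.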
Listing 3.2.1. Split by Steps nodes
static string[] SplitBySteps(string inputFilePath)
{
    List<string> stepsFiles = new List<string>();
    dynamic keys = null;
    dynamic props = null;

    using (var r = new ChoJSONReader(inputFilePath).WithJSONPath("$..Job.Keys"))
    {
        keys = r.FirstOrDefault();
    }
    using (var r = new ChoJSONReader(inputFilePath).WithJSONPath("$..Job.Props"))
    {
        props = r.FirstOrDefault();
    }

    int fileCount = 0;
    using (var r = new ChoJSONReader(inputFilePath).WithJSONPath("$..Job.Steps")
        .NotifyAfter(1)
        .Setup(s => s.RowsLoaded += (o, e) =>
            $"Step Nodes loaded: {e.RowsLoaded} <- {DateTime.Now}".Print())
        .Configure(c => c.CustomJObjectLoader = (sr, s) =>
        {
            string outFilePath = $"Steps_{fileCount++}.json";
            $"Writing to `{outFilePath}` file...".Print();

            using (var topJo = new ChoJObjectWriter(outFilePath))
            {
                topJo.Formatting = Newtonsoft.Json.Formatting.Indented;
                using (var jo = new ChoJObjectWriter("Job", topJo))
                {
                    jo.WriteProperty("Keys", keys);
                    jo.WriteProperty("Props", props);
                    jo.WriteProperty("Steps", sr);
                }
            }
            stepsFiles.Add(outFilePath);
            return ChoJSONObjects.EmptyJObject;
        })
    )
    {
        r.Loop();
    }
    return stepsFiles.ToArray();
}
The next step is to consume the above generated steps split files one at a time and split each of them by InstrumentData node, as below:
- Capture the value of the Job.Keys node.
- Capture the value of the Job.Props node.
- Capture the value of the Job.Steps.Keys node.
- Then loop through each Job.Steps.InstrumentData node and generate output split files (aka InstrumentData_0.json, InstrumentData_1.json, etc.) using ChoJObjectWriter.
- Finally, the method returns the list of InstrumentData split filenames.
Listing 3.2.2. Split by InstrumentData nodes
static string[] SplitByInstrumentData(string stepsFilePath)
{
    List<string> instrumentDataFiles = new List<string>();
    dynamic keys = null;
    dynamic props = null;
    dynamic stepsKeys = null;

    using (var r = new ChoJSONReader(stepsFilePath).WithJSONPath("$..Job.Keys"))
    {
        keys = r.FirstOrDefault();
    }
    using (var r = new ChoJSONReader(stepsFilePath).WithJSONPath("$..Job.Props"))
    {
        props = r.FirstOrDefault();
    }
    using (var r = new ChoJSONReader(stepsFilePath).WithJSONPath("$..Job.Steps.Keys"))
    {
        stepsKeys = r.FirstOrDefault();
    }

    int fileCount = 0;
    using (var r = new ChoJSONReader(stepsFilePath).WithJSONPath("$..Job.Steps.InstrumentData")
        .NotifyAfter(1)
        .Setup(s => s.RowsLoaded += (o, e) =>
            $"InstrumentData Nodes loaded: {e.RowsLoaded} <- {DateTime.Now}".Print())
        .Configure(c => c.CustomJObjectLoader = (sr, s) =>
        {
            string outFilePath = $"InstrumentData_{fileCount++}.json";
            $"Writing to `{outFilePath}` file...".Print();

            using (var topJo = new ChoJObjectWriter(outFilePath))
            {
                topJo.Formatting = Newtonsoft.Json.Formatting.Indented;
                using (var jo = new ChoJObjectWriter("Job", topJo))
                {
                    jo.WriteProperty("Keys", keys);
                    jo.WriteProperty("Props", props);
                    jo.WriteProperty("Steps", sr);
                }
            }
            File.ReadAllText(outFilePath).Print();
            "".Print();

            instrumentDataFiles.Add(outFilePath);
            return ChoJSONObjects.EmptyJObject;
        })
    )
    {
        r.Loop();
    }
    return instrumentDataFiles.ToArray();
}
Finally, use the above two methods to accomplish the split by InstrumentData, as below.
Listing 3.2.3. Main() method
public static void Main()
{
    var inputFilePath = "input.json";

    foreach (var stepsFilePath in SplitBySteps(inputFilePath))
    {
        SplitByInstrumentData(stepsFilePath);
        File.Delete(stepsFilePath);
    }
}
Sample fiddle: https://dotnetfiddle.net/j3Y03m
For more information about Cinchoo ETL, please visit the other CodeProject articles.
History
- 12th January, 2022: Initial version