I have a .NET Core console application that reads parts of JSON files stored in a folder. The part of the JSON I'm reading contains more than 500,000 records. I'm able to read all the data, but the only challenge I'm facing is that it is very slow: it takes approximately 10 minutes to fetch and insert all the data. How can I modify my code below to improve performance and reduce the time it takes to process and insert all the data?

What I have tried:

using (var context = new ApplicationDbContext())
{
    // Remove any existing records for this facility before re-importing.
    var dispenseToDelete = context.tblDispense.Where(p => p.HfrCode == facilityName);

    if (dispenseToDelete.Any())
    {
        context.tblDispense.RemoveRange(dispenseToDelete);
        context.SaveChanges();
    }
}

using (StreamReader sr = new StreamReader(filePath))
using (JsonTextReader reader = new JsonTextReader(sr))
{
    while (reader.Read())
    {
        if (reader.TokenType == JsonToken.StartArray && reader.Path == "tblDispense")
        {
            using (var context = new ApplicationDbContext())
            {
                List<TblDispense> tblDispenseBatch = new List<TblDispense>();
                int batchSize = 1000;

                while (reader.Read() && reader.TokenType != JsonToken.EndArray)
                {
                    if (reader.TokenType == JsonToken.StartObject)
                    {
                        JObject item = JObject.Load(reader);
                        TblDispense tblDispense = item.ToObject<TblDispense>();
                        tblDispense.UUID = Guid.NewGuid().ToString() + "." + facilityName;
                        tblDispense.MSDCode = tblDispense.MSDCode + "." + facilityName;
                        tblDispense.DispenseID = tblDispense.DispenseID + "." + facilityName;
                        tblDispense.DispenseRecordId = tblDispense.DispenseRecordId + "." + facilityName;
                        tblDispense.HfrCode = facilityName;
                        tblDispenseBatch.Add(tblDispense);

                        // Batch insert to avoid memory overload
                        if (tblDispenseBatch.Count >= batchSize)
                        {
                            try
                            {
                                context.tblDispense.AddRange(tblDispenseBatch);
                                context.SaveChanges();
                                tblDispenseBatch.Clear();
                            }
                            catch (DbUpdateException ex)
                            {
                                Console.WriteLine($"[{facilityName}] an error occurred while saving the batch: {ex.InnerException?.Message}");
                                logger.LogError($"[{facilityName}] an error occurred while saving the batch: {ex.InnerException?.Message}");
                            }
                        }
                    }
                }

                // Insert remaining items in the batch
                if (tblDispenseBatch.Count > 0)
                {
                    try
                    {
                        context.tblDispense.AddRange(tblDispenseBatch);
                        context.SaveChanges();
                        tblDispenseBatch.Clear();
                    }
                    catch (DbUpdateException ex)
                    {
                        Console.WriteLine($"[{facilityName}] an error occurred while saving the batch: {ex.InnerException?.Message}");
                        logger.LogError($"[{facilityName}] an error occurred while saving the batch: {ex.InnerException?.Message}");
                    }
                }
            }
            break;
        }
    }
}


Any suggestions will be much appreciated.
Posted
Comments
Richard Deeming 8-Jul-24 3:46am    
When you profiled your code, which part did you identify as the bottleneck?

And if you haven't profiled it, then why not? Without profiling, all you can do is try to guess what the problem might be. You'll spend days optimizing parts of the code that are not relevant to the problem you're trying to solve.
Office Systems 8-Jul-24 4:02am    
Thanks @Richard Deeming for your comment. After profiling my code, I found that the part that takes the most time is the loop below: the JSON has many sections, so it takes a while to reach tblDispense, and the array itself holds over 500,000 records:
while (reader.Read())
{
if (reader.TokenType == JsonToken.StartArray && reader.Path == "tblDispense")
{
Dave Kreskowiak 8-Jul-24 9:49am    
That doesn't seem to make sense. Just because some line of code executes 500,000 times doesn't make it the bottleneck. It's the TIME it takes to execute each line of code, not the number of executions.
PIEBALDconsult 8-Jul-24 8:41am    
If multiple files, read them in parallel.
If SQL Server, use an SqlBulkCopy object.
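For reference, a minimal, hedged sketch of the SqlBulkCopy idea (this assumes SQL Server with Microsoft.Data.SqlClient, the tblDispense table from the question, and a DataTable whose column names already match the destination table; the connection string, batch size, and helper name are illustrative only):
using System.Data;
using Microsoft.Data.SqlClient;

static void BulkInsertDispense(string connectionString, DataTable rows)
{
    using (var bulkCopy = new SqlBulkCopy(connectionString))
    {
        bulkCopy.DestinationTableName = "tblDispense";
        bulkCopy.BatchSize = 5000;

        // Map columns by name; assumes the DataTable columns match the table's columns.
        foreach (DataColumn column in rows.Columns)
        {
            bulkCopy.ColumnMappings.Add(column.ColumnName, column.ColumnName);
        }

        bulkCopy.WriteToServer(rows); // streams the rows to SQL Server in bulk
    }
}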

There really isn't a major way to improve the performance here, because JSON isn't a "random access" data storage format. It's designed as a data transfer format, which means it's meant to be read in a linear manner: there is no way to predict how long a data item in a JSON file is other than to actually read it, so you can't easily split the data into multiple segments and process each segment in a separate thread - which is the only way you would get significant performance advances with your code.

The problem is the storage medium - you should really consider switching to a proper data storage format such as a database (SQL Server, MySQL, or even a single-user DB like Access would be much more appropriate). The storage engines for these are optimised for queries and will be significantly faster than any code you want to write!

[edit]
I was just going to start writing an article explaining why JSON is a bad choice for data storage when I realized that I wrote one a couple of years ago ... :O
That's not a database, dude![^]
[/edit]
 
There are so many parts to this that represent potential bottlenecks. You probably want to rethink your architecture here, and take care of a couple of basics while you're at it. What you are going to have to do is a lot of investigative work, identifying the slow areas section by section. Some things I would consider: you have a List created without a capacity. You know that you're going to be adding 1000 records to it at a time, so create the list sized to 1000 entries. Something else I would look at is not reusing the same list for every batch. What you could do is create, effectively, a List of Lists and build all of your batches before you actually update the database. Something like this:
C#
List<List<TblDispense>> batches = new List<List<TblDispense>>(1000); // pre-size for up to 1000 batches
List<TblDispense> tblDispense = new List<TblDispense>(1000);         // pre-size each batch to 1000 entries
batches.Add(tblDispense);
// ... do the work of adding data into tblDispense, but don't clear it down at the end.
At the end, just iterate over your batches and add them, as in the sketch below.
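A minimal sketch of that final step, assuming the batches list above and the question's ApplicationDbContext:
C#
// Once every batch has been built, push them to the database in one pass.
using (var context = new ApplicationDbContext())
{
    foreach (var batch in batches)
    {
        context.tblDispense.AddRange(batch);
    }

    context.SaveChanges();
}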

However, playing around with this isn't going to get around the fact that your biggest bottleneck is going to be physically loading and reading the file. You might want to investigate faster file-reading capabilities in the framework, and create something like a ReadOnlySpan<T> to hold the contents of the file (if you're using a late enough version of .NET).
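A rough, hedged sketch of that span-based idea (this assumes System.Text.Json's Utf8JsonReader is available, that the whole file fits comfortably in memory, and it reuses the filePath variable from the question):
C#
// Requires: using System.IO; using System.Text.Json;
byte[] fileBytes = File.ReadAllBytes(filePath);           // one sequential read of the whole file
var jsonReader = new Utf8JsonReader(fileBytes.AsSpan());  // parse directly over the in-memory bytes

while (jsonReader.Read())
{
    // Find the "tblDispense" property, then walk the array that follows it.
    if (jsonReader.TokenType == JsonTokenType.PropertyName && jsonReader.ValueTextEquals("tblDispense"))
    {
        // ... advance into the array and materialize each record here
    }
}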

Note: When I am investigating performance, I normally use dotTrace Profiler: .NET Profiler[^]
 
 
Comments
Office Systems 8-Jul-24 11:17am    
Thanks @Pete O'Hanlon. I have tried the above scenario, but it seems to overload the memory.
Pete O'Hanlon 8-Jul-24 11:56am    
Reduce the initial batch size for batches to 100 and try that.
Office Systems 9-Jul-24 6:18am    
I have tried all the suggestions, but it still takes a lot of time to load all 500,000 records into the database.
Pete O'Hanlon 9-Jul-24 6:39am    
But have you tried profiling the application? What is the hotspot? Ultimately, you are going to reach a point where you aren't going to make the code any faster, but you have to look at the profiling first.
Office Systems 9-Jul-24 8:30am    
Thanks @Pete O'Hanlon, let me try profiling the application again.
