Introduction
Let's imagine you are working for a company that deals with big data collections and you must come up with the best data structure to handle billions of records. How do we know which data structure is the best? That is the challenge I tried to answer with this benchmark. Before starting to read, please take the following points into account:
- The operating system is not a real-time operating system, so I am presenting the results as an average.
- BS (Binary Search) requires sorting, and the sorting time complexity is left out of my study; you can use any convenient sorting algorithm and add its complexity on top of my results (a minimal sketch of this follows the list). For more information about sorting time complexity, see the link.
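As a minimal sketch of that point (illustration only, with made-up mock labels, not part of the benchmark code), the sort has to run before the binary search and its cost comes on top of the reported search times:

// Illustration only: the sort is a prerequisite of Array.BinarySearch,
// so its O(n log n) cost must be added on top of the measured search time.
using System;

class BinarySearchSketch
{
    static void Main()
    {
        string[] data = { "lbl4", "lbl1", "lbl3", "lbl0", "lbl2" }; // unsorted mock data

        Array.Sort(data);                              // sorting cost, not measured in this article
        int index = Array.BinarySearch(data, "lbl3");  // O(log n) search on the sorted array

        Console.WriteLine(index >= 0 ? $"Found at index {index}" : "Not found");
    }
}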
What is the Good Solution?
A good solution is one that gives you maximum speed and capacity at a minimum time cost. Obviously, I put a fully deterministic comparison out of the question because my operating system is not a real-time operating system! But if we can come up with a more deterministic solution, that's like apple pie and Chevrolet.
Machines Specifications
The machine that will execute all unit tests has the following hardware and software configuration:
- CPU: Intel(R) Core(TM) i7-6650U @ 2.2 GHz (4 CPUs)
- RAM: 8 GB
- OS: Windows 10 Pro
- IDE: Visual Studio 2017
Background
In this article, I will do some performance benchmarking for dealing with big data collections in C# (.NET) from the point of view of capacity, speed, and time cost. My selected collections are Array, List, and Dictionary.
The term managed code was coined by Microsoft and refers to code that executes under the management of a CLR, such as .NET or Mono. For more information about managed and unmanaged code, please see the related links.
Execution of managed code can be somewhat non-deterministic because of CLR interference, so a benchmark that provides a rough estimation can be very helpful. However, before considering the results of this benchmark, it would be good to answer the following two questions:
- What is the maximum size of your big data collection?
- What is the data type? What kind of data will you deal with: integers, text, etc.?
As you will see from the results, the performance of each collection depends heavily on the size of the collection and the algorithms that are run over it, such as search.
Note 1: Please keep in mind that all test beds run on a non-real-time operating system, which means the results are not deterministic. To mitigate this, I have executed every test 100 times, so what you will see is the average time over 100 iterations.
The operations that I have considered are creation and search because these are the two most expensive and important operations.
Search Operations
The following method presents the schema of the search tests. As you can see from the code, I always consider the worst-case scenario, meaning the item I am looking for is at the very end of the collection. The search is surrounded by the Monitor object, which contains a clock that starts before entering the search body and stops immediately after the item is found in the collection:
[TestMethod]
[TestCategory("Owner [C# Array]")]
public void FindByForInArray()
{
    try
    {
        // Worst case: the item we look for is the last element in the array.
        string find = mocklbl + (capacity - 1);
        Monitor.Reset();
        for (int j = 0; j < iteration; j++)
        {
            string found = "";
            Monitor.Start();
            for (int i = 0; i < capacity; i++)
            {
                if (array[i] == find)
                {
                    found = array[i];
                    break;
                }
            }
            Monitor.Stop();
            Assert.AreEqual(find, found);
        }
        Monitor.LogAvgElapsed("FindByForInArray");
    }
    catch (Exception ex)
    {
        Monitor.WriteLog(ex.Message);
        throw;
    }
}
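The Monitor object used above is not listed in this article. The following is only a minimal sketch of what such a stopwatch wrapper could look like, reconstructed from the calls that appear in the tests (Reset, Start, Stop, LogAvgElapsed, WriteLog, AvgElapsedInMiliseconds, and BySize); the real implementation in the benchmark project may well be different:

// Hypothetical sketch of the Monitor helper; only the members referenced by
// the test code are shown, and their behavior is assumed, not taken from the article.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class Monitor
{
    // Collection size used by the tests (assumed to be configured elsewhere).
    public static long BySize { get; set; } = 1000000;

    private static readonly Stopwatch watch = new Stopwatch();
    private static readonly List<double> samples = new List<double>();

    public static void Reset()
    {
        watch.Reset();
        samples.Clear();
    }

    public static void Start() => watch.Restart();

    public static void Stop()
    {
        watch.Stop();
        samples.Add(watch.Elapsed.TotalMilliseconds);
    }

    // Average over all Start/Stop pairs recorded since the last Reset
    // (identifier spelling kept as in the article).
    public static double AvgElapsedInMiliseconds =>
        samples.Count == 0 ? 0 : samples.Average();

    public static void LogAvgElapsed(string testName) =>
        WriteLog($"{testName} average: {AvgElapsedInMiliseconds} ms");

    public static void WriteLog(string message) => Console.WriteLine(message);
}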
Search Time Complexity
There are several interesting facts from the above chart which I would like to list:
- The Binary-search (the second column from the above table) always takes 0 milliseconds to find the item.
- The time cost between 1000000 (1 million) records and 10000000 (10 million) records grows exponentially for all search algorithms (compare the green line with the yellow line).
- The dictionary data structure is not able to handle 10000000 (10 million) records (this is why the green line does not go beyond FindbyFindInArray).
- The find by enumeration is the worst search algorithm (see the third column).
- The order of the search algorithms over a list and an array, from best to worst, is: Binary-search, Find, the For loop, and the enumerator object.
- At 100000 (100 thousand) records, there is no significant difference between the data structures and the search algorithms.
- If there is no need for the key-value pair, then a list with the Binary-search is the best option.
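As a minimal, hedged illustration of that last point (the identifiers and sizes below are made up, not taken from the benchmark project), a sorted List<string> can be searched with its built-in BinarySearch method:

// Sketch only: List<T>.BinarySearch on a sorted list of mock labels.
using System;
using System.Collections.Generic;

class ListBinarySearchSketch
{
    static void Main()
    {
        const int capacity = 1000000;
        var list = new List<string>(capacity);

        for (int i = 0; i < capacity; i++)
        {
            list.Add("lbl" + i);
        }

        list.Sort();                                 // required: BinarySearch needs a sorted list
        int index = list.BinarySearch("lbl999999");  // O(log n) lookup of a late item

        Console.WriteLine(index >= 0 ? $"Found at index {index}" : "Not found");
    }
}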
The next chart presents the time complexity of creating and writing into the same collections (array, list, and dictionary). The code below represents a sample schema of these tests:
[TestMethod]
[TestCategory("Owner [C# Array]")]
public void FetchMockData()
{
    try
    {
        // The collection size is taken from the Monitor configuration.
        long capacity = Monitor.BySize;
        for (int j = 0; j < 100; j++)
        {
            Monitor.Reset();
            Monitor.Start();
            // Creation and writing: allocate the array and fill it with mock data.
            array = new string[capacity];
            for (int i = 0; i < capacity; i++)
            {
                array[i] = mocklbl + i;
            }
            Monitor.Stop();
        }
        Monitor.WriteLog($"The creation of array took:{Monitor.AvgElapsedInMiliseconds}");
    }
    catch (Exception ex)
    {
        Monitor.WriteLog(ex.Message);
        throw;
    }
}
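The article shows only the array variant of the creation test. Assuming the list and dictionary variants mirror it (this is a sketch, not the author's actual test code, and the 100-iteration averaging loop is omitted for brevity), they might look roughly like this:

[TestMethod]
[TestCategory("Owner [C# List and Dictionary]")]
public void FetchMockDataIntoListAndDictionary()
{
    // mocklbl and Monitor are assumed to be the same members used by the array test.
    long capacity = Monitor.BySize;

    // List<string> creation and writing.
    Monitor.Reset();
    Monitor.Start();
    var list = new List<string>((int)capacity);
    for (int i = 0; i < capacity; i++)
    {
        list.Add(mocklbl + i);
    }
    Monitor.Stop();
    Monitor.WriteLog($"The creation of the list took: {Monitor.AvgElapsedInMiliseconds}");

    // Dictionary<string, string> creation and writing (one key-value pair per record).
    Monitor.Reset();
    Monitor.Start();
    var dictionary = new Dictionary<string, string>((int)capacity);
    for (int i = 0; i < capacity; i++)
    {
        dictionary.Add(mocklbl + i, mocklbl + i);
    }
    Monitor.Stop();
    Monitor.WriteLog($"The creation of the dictionary took: {Monitor.AvgElapsedInMiliseconds}");
}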
Creation Time Complexity
Following are the facts from the above chart:
- The same exponential growth between 1000000 (1 million) records and 10000000 (10 million) records also appears in the creation time.
- The List data structure is almost on par with the Array and beats it after reaching 1000000 (1 million) records.
- The Dictionary has the worst time complexity and should be used only if it is really necessary.
Conclusion
Magically, the List data structure works very well, and in combination with Binary-search it can provide a very significant performance optimization.