Introduction
Every now and then on .NET code fora, questions are asked about how to manage big collections over a million items. The usual reaction by the guru's is along the line of: "why would you want to show over a million items to the user?" and other implicit reproaches of bad coding practice. Fact is sometimes you do need to browse (quickly) through a huge set of items, because sometimes that's the best way to analyze the data. Or: "Ask not what the data can do for your software; ask what your software can do for the data".
Background
In my professional life, I work a lot with seismic data which comes in the shape of big binary files (> 10 Gb is not exceptional). Sometimes, these data are corrupt and won't load in standard software. It is then efficient to browse in a structured, but quick way through the whole data file to find out where the file falls over upon loading. However, there is a size limit in the Microsoft Common Language Runtime (CLR) of 2Gb which a single object can hold. Also, out-of-memory (OOM) errors occur when a single list gets too big and the software is unable to find a large enough section of contiguous pages in its virtual address space to store the data. Please note: this applies to each single object. And because it applies to single objects, it should not be too difficult to create a work around using a composite list of some sort. There's a very readable article here, from which I quote and for more background. This tip shows a class I wrote to handle composite lists as if it is a single list. It works for my seismic data (especially in combination with a ListView
in VirtualMode
); it may help you with your 'big data'...
Using the Code
The workaround is a single class which I will walk through in 'blocks'. At the end, I will list the full class. In the attached source code, you will find this class with a little form to show use of the class.
The class is called 'BigList
', no surprise, and implements IEnumerator(<span style="color: rgb(0, 0, 255);">Of</span> T)
and IEnumerable(<span style="color: rgb(0, 0, 255);">Of</span> T)
, like:
Public Class BigList(Of T)
Implements IEnumerator(Of T), IEnumerable(Of T)
Private _listoflists As List(Of List(Of T))
Private _enumindex As Int64 = -1
Private _listindex As Integer = 0
Private _current As T = Nothing
Public Sub New()
_listoflists = New List(Of List(Of T))
_listoflists.Add(New list(Of T))
_enumindex = -1
_listindex = 0
_current = CType(Nothing, T)
End Sub
Public MaxItemsPerSublist As Integer = 500000
End Class
Implementing these abstracts forces you to implement the GetEnumerator, Reset, Dispose, Current
and MoveNext
methods and properties. We'll get to that later. Main point here is that I declare a local Private _listoflists <span style="color: rgb(0, 0, 255);">As</span> List(<span style="color: rgb(0, 0, 255);">Of</span> List(<span style="color: rgb(0, 0, 255);">Of</span> T))
which is the composed list of List(Of T)
. Each of these lists can have a max of MaxItemsPerSublist
items, in this case initiated to 500,000, but this can be changed to any suitable number once the class is instantiated. Three more local privates are of importance: _enumindex
is the iterator index for the BigList
as a whole, _listindex
is the iterator of the sublist within the composite list and _current
is the object which is stored of a user specified type T
. The class is constructed by setting the indexes to their appropriate values and adding the first List(Of T)
to the _listoflists
composite.
Now the Add
function is the next method to discuss because it dives right into managing the different lists and indices. It simply adds the item and then checks if the current list _listoflists(_listindex)
is at its maximum. If so, it will add another list and increment the sublist index _listindex
.
Public Function Add(ByVal item As T) As Int64
_listoflists(_listindex).Add(item)
If _listoflists(_listindex).Count = Me.MaxItemsPerSublist Then
_listoflists.Add(New list(Of T))
_listindex += 1
End If
Return Count
End Function
Public ReadOnly Property Count() As Int64
Get
Dim sum As Int64 = 0
For Each l In _listoflists
sum += l.Count
Next
Return sum
End Get
End Property
The function returns Count
which is the read-only property defined in the above code block; it loops over all sublists and sums the amount of items in each.
Now to retrieve any item back from the list, the method Current
is implemented (also needed to do a For Each
loop):
Public ReadOnly Property Current As T Implements IEnumerator(Of T).Current
Get
Dim li As Integer = 0
Dim ti As Integer = 0
getSubindexes(_enumindex,li,ti)
_current = _listoflists(li)(ti)
Return _current
End Get
End Property
Now the heart of this property (in fact, of the whole class) is the function getSubindexes
which takes in the index of the item in the BigList
and returns the sublist and sublist index where it is stored. Here is the code:
Private Function getSubindexes(ByVal index As int64, _
ByRef listindex As Integer, ByRef itemindex As Integer) As Boolean
Dim li As Integer = 0
Dim ri As Integer = 0
Dim nid As Int64 = index
Dim sum As Int64 = 0
Dim indexfound As Boolean = False
For l As Integer = 0 To _listindex
If index < sum + _listoflists(l).Count Then
listindex = li
itemindex = CInt(index - sum)
indexfound = True
Exit For
Else
li +=1
sum += _listoflists(l).Count
End If
Next
If Not indexfound Then
Throw New IndexOutOfRangeException
Return False
Else
Return True
End If
End Function
The method will loop over the sublists and break down the _enumindex
into a list and item index (returned ByRef
). The method will also throw an IndexOutOfRangeException
like any normal List(Of )
would do. Similarly, the Item
property (default to any collection class) makes use of getSubindexes
:
Default Property Item(index As Int64) As T
Get
Dim li As Integer = 0
Dim ti As Integer = 0
getSubindexes(index,li,ti)
Return _listoflists(li)(ti)
End Get
Set
Dim li As Integer = 0
Dim ti As Integer = 0
getSubindexes(index,li,ti)
_listoflists(li)(ti) = value
End Set
End Property
And this is basically all there is to it...
For the implementation, you do need to code the GetEnumerator
, Reset
, Dispose
and MoveNext
methods and properties. They are worked out in the attached source file. I have added myself the Clear
, IndexOf
and Remove
methods to have the basic List(Of )
functionality in. You can extend it with other methods to your heart's desire. As an example, I include here a screen dump of an application I'm coding to view and edit a seismic file (or more popularly known SegY
file) showing the tail end of a 13.5Gb file with 2,025,780 seismic records (or traces):
Points of Interest
I did try initially to implement the IList(Of ) abstract
class but ran into trouble. I needed the Int64
variable to be able to have a huge index over the BigList
. I may be wrong... Also, when coding the getSubindexes
method, I was wondering if the False
is actually returned after throwing the IndexOutOfRangeException
? Anybody knows?
History
- January 16, 2016: First draft