By Andrew Whiddett
Overview
See http://alphachitech.wordpress.com/2015/01/31/virtualizing-observable-collection/ for ongoing updates.
First off: the implementation runs to a couple of thousand lines of code, so we have put it together in a NuGet package, and this article focuses on how to use it.
The Observer design pattern has been one of the lynchpins of almost every modern software platform for years. In fact, Observer is at the core of the MVC/MVVM design patterns. In .NET, ObservableCollection is one of the most frequently used objects in any system that involves data binding (which is pretty much any non-trivial application).
As great as ObservableCollection is, it does have some limitations:
- It doesn’t implement any kind of data virtualization, making large or unbounded data sets challenging to work with.
- It requires a reset to load changes, making dynamic data sets and multi-user scenarios challenging to work with.
There are some frameworks and tools out there that provide some form of virtualized observable collection, but typically these come with limitations like only working with controls from the same framework, or requiring certain coding paradigms to function properly. They’re not a general-use solution. Plus, as far as we know, no virtualized observable collection actually supports virtualized read/write; instead they use a “read and reload” pattern, which can be very inefficient with large data sets.
So, what to do about it? We made one ourselves – the VirtualizingObservableCollection, which does the following:
- Implements the same interfaces and methods as ObservableCollection<T> so you can use it anywhere you’d use an ObservableCollection<T> – no need to change any of your existing controls.
- Supports true multi-user read/write without resets (maximizing performance for large-scale concurrency scenarios).
- Manages memory on its own so it never runs out of memory, no matter how large the data set is (especially important for mobile devices).
- Natively works asynchronously – great for slow network connections and occasionally-connected models.
- Works great out of the box, but is flexible and extendable enough to customize for your needs.
- Has a data access performance curve so good it’s nearly as fast as the regular ObservableCollection – the cost of using it is negligible.
- Works in any .NET project because it’s implemented in a Portable Class Library (PCL).
Interested? Read on.
First, a bit about ObservableCollection
We have all seen the myriad of examples of using ObservableCollection with various ItemsControls (like ListBox and ComboBox), such as:
<ListBox x:Name="lb" ItemsSource="{Binding Path=MyData}" Grid.Row="0">
    <ListBox.ItemTemplate>
        <DataTemplate>
            <TextBlock Text="{Binding Path=Name}" />
        </DataTemplate>
    </ListBox.ItemTemplate>
</ListBox>
Where the ViewModel contains something like:
public class SimpleViewModel
{
    private ObservableCollection<SimpleDataItem> _MyData = null;

    public ObservableCollection<SimpleDataItem> MyData
    {
        get
        {
            if (_MyData == null)
            {
                _MyData = new ObservableCollection<SimpleDataItem>();
                foreach (var o in SimpleDataSource.Instance.Items)
                {
                    _MyData.Add(o);
                }
            }
            return _MyData;
        }
    }
}
You just fill the data up in the ObservableCollection from the data source, and off you go. Perfect, right?
Well, not so fast. It’s a great solution if the ObservableCollection is a manageable size, but in most large-scale systems the data can run to thousands or even hundreds of thousands of rows. Loading that volume of data will seriously slow down the application. If the call is to a remote server, it’s even worse – that simple load can take minutes, or even hours. Mobile devices, with limited CPU, memory and bandwidth, compound the problem even further.
Clearly this is not a workable solution for these cases (which pretty well every one of our projects seems to have).
So how are people dealing with big data sets today? Typically we see a few common solutions:
- Implement a ‘page next/previous’ system in the UI, so we only load one page of items at a time. This is the most common large-data UI, but it’s seriously flawed when the data we want is on page 250: the user has to press the “next” button 249 times.
- Limit the number of rows returned. That’s not very usable, and how do we tell the user that the limit has been reached?
- Extending option 2, we could limit the list but add a “load more” option at the end. This is better, and it’s a UI paradigm we see frequently on mobile devices. But again, if the data is on page 250, users still have to press “load more” 249 times.
Those are the three most common solutions used today. There is a fourth, which is a degree more sophisticated:
- Create a pagination system inside the collection, so it appears to be the full data set, but it loads chunks (or pages) at a time as needed in the background. That way you have the same UI whether you have a hundred data items or a hundred million.
The problem with this approach is that, to date, the available implementations are read-only. The exceptions are framework providers such as Telerik, which do support paging and CRUD (Create, Read, Update and Delete) operations in their own special ways, but typically only with their own controls.
If we want to update the collection with changes in the data set, we effectively have to reset/refresh the entire list and fetch the data again. Every paginating collection we’ve found follows this same pattern (see the end of this article for more details about that).
If there’s only one user, or the data doesn’t change very often, this may be okay. However, in multi-user and/or high-data-frequency scenarios, where lots of users are making changes to the underlying data store (like a SQL database, for example) or the data is changing very often (like a near-realtime monitoring system), this becomes a serious problem because the only way to ensure you have up-to-date data is to reload, causing your visuals to flicker as they’re reloaded, scrolling lists back up to the top, and losing your current selection, not to mention the cost of needing to get all the pages of data all over again.
This is a less-than-ideal solution. You can work around it by using Behaviors to intercept the “reset” and do tricky things with the visuals, scroll position and selection to give the impression of seamlessness, but these are ugly hacks we’ve all used for years. What would be better is a solution to the underlying problem.
That’s what VirtualizingObservableCollection is – an ObservableCollection that paginates, and that you can update just like an ordinary ObservableCollection. That’s a real solution to this problem.
Where to get it
The latest package can be found on NuGet (Package Manager Console command: Install-Package VirtualizingObservableCollection). We will be putting up the source shortly, as well as a comprehensive example.
How to use it
VirtualizingObservableCollection is implemented in a Portable Class Library, so you can use it in all kinds of .NET solutions. Consequently, some of the async work it does in the background needs to be able to call back to the dispatcher thread. Also, the page reclamation system needs to run on a periodic basis, so you need to set up some sort of timer for it.
Setting up the VirtualizationManager
Both of these are platform-specific needs, so you’ll need to write a tiny bit of boilerplate code for the VirtualizationManager to use. However, you only need to do it once.
Here’s an example in WPF:
public MainWindow()
{
    InitializeComponent();

    if (!VirtualizationManager.IsInitialized)
    {
        // Marshal the collection's background callbacks onto the UI thread
        VirtualizationManager.Instance.UIThreadExcecuteAction =
            (a) => Dispatcher.Invoke(a);

        // Give the manager a periodic tick to process pending actions
        // (page reclamation, etc.)
        new DispatcherTimer(
            TimeSpan.FromSeconds(1),
            DispatcherPriority.Background,
            delegate(object s, EventArgs a)
            {
                VirtualizationManager.Instance.ProcessActions();
            },
            this.Dispatcher).Start();
    }
}
Here’s the same thing in Xamarin.Forms:
public class App
{
    public static Page GetMainPage()
    {
        if (!VirtualizationManager.IsInitialized)
        {
            VirtualizationManager.Instance.UIThreadExcecuteAction =
                (a) => Device.BeginInvokeOnMainThread(a);

            Device.StartTimer(
                TimeSpan.FromSeconds(1),
                () =>
                {
                    VirtualizationManager.Instance.ProcessActions();
                    return true;
                });
        }

        return new ContentPage
        {
            Content = new Label
            {
                Text = "Hello, Forms!",
                VerticalOptions = LayoutOptions.CenterAndExpand,
                HorizontalOptions = LayoutOptions.CenterAndExpand,
            },
        };
    }
}
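If you are targeting another platform, the pattern is the same: point UIThreadExcecuteAction at that platform’s UI-thread dispatcher and start a one-second timer that calls ProcessActions. As a hedged sketch only (the MainPage constructor and the one-second interval are just illustrative), here is what that might look like in a Windows Runtime/UWP page using the standard CoreDispatcher and DispatcherTimer APIs:

public MainPage()
{
    this.InitializeComponent();

    if (!VirtualizationManager.IsInitialized)
    {
        // Marshal background callbacks onto the UI thread via the CoreDispatcher
        VirtualizationManager.Instance.UIThreadExcecuteAction =
            (a) =>
            {
                var _ = Dispatcher.RunAsync(
                    Windows.UI.Core.CoreDispatcherPriority.Normal, () => a());
            };

        // Let the manager process pending actions (page reclamation, etc.) once a second
        var timer = new Windows.UI.Xaml.DispatcherTimer { Interval = TimeSpan.FromSeconds(1) };
        timer.Tick += (s, e) => VirtualizationManager.Instance.ProcessActions();
        timer.Start();
    }
}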
Now that we have set that up, we need to create a service that supplies the data. We don’t want to fetch all the data at once, so we’ll use something more like a callback service, called by our virtualizing collection as needed. We do this by implementing the IPagedSourceProvider<T> interface.
Implementing a provider
IPagedSourceProvider<T> defines three methods (two of which are optional) and one property:
- public PagedSourceItemsPacket<T> GetItemsAt(int pageoffset, int count, bool usePlaceholder)
This returns a list of items (in the IEnumerable<T> Items property) starting at item pageoffset, for a maximum of count items (ignore the usePlaceholder argument; that’s used internally by the system).
NOTE – The pageoffset is the number of items into the enumeration, not the page number. This is to allow pages to be different sizes (see the “how it works” section for details).
- public int IndexOf(T item)
This returns the index of a specific item. This method is optional – you can just return -1 if you don’t need to use IndexOf. It’s not strictly required if you don’t need to seek to a specific item, but if you are selecting items, implementing this method is recommended.
- public void OnReset(int count)
This is a callback that runs when a Reset is called on a provider. Implementing this is also optional. If you don’t need to do anything in particular when resets occur, you can leave this method body empty.
- public int Count
This is an integer with the total number of items in the data set.
So, let’s look at our super simple example:
public class TesterSource : IPagedSourceProvider<SimpleDataItem>
{
    public PagedSourceItemsPacket<SimpleDataItem> GetItemsAt(int pageoffset, int count, bool usePlaceholder)
    {
        return new PagedSourceItemsPacket<SimpleDataItem>()
        {
            LoadedAt = DateTime.Now,
            Items = (from items in SimpleDataSource.Instance.Items select items).Skip(pageoffset).Take(count)
        };
    }

    public int Count
    {
        get { return SimpleDataSource.Instance.Items.Count; }
    }

    public int IndexOf(SimpleDataItem item)
    {
        return SimpleDataSource.Instance.Items.IndexOf(item);
    }

    public void OnReset(int count)
    {
    }
}
And finally, using the provider in the VirtualizingObservableCollection
Once we have our IPagedSourceProvider set up, we can create a VirtualizingObservableCollection and use it just like we do a normal ObservableCollection:
public class SimpleViewModel
{
    private VirtualizingObservableCollection<SimpleDataItem> _MyDataVirtualized = null;

    public VirtualizingObservableCollection<SimpleDataItem> MyDataVirtualized
    {
        get
        {
            if (_MyDataVirtualized == null)
            {
                _MyDataVirtualized =
                    new VirtualizingObservableCollection<SimpleDataItem>(
                        new PaginationManager<SimpleDataItem>(new TesterSource()));
            }
            return _MyDataVirtualized;
        }
    }
}
As you can see from the above example, we are actually creating a few objects. Let’s unwrap it to see how they’re related:
- The IPagedSourceProvider<T> (TesterSource) accesses the underlying data source, providing essential information about the data source like the number of items.
- The PaginationManager<T> wraps the IPagedSourceProvider<T> and manages it. Most of the heavy lifting for pagination, memory management and data insertion is done here.
- The VirtualizingObservableCollection<T> wraps the PaginationManager<T> and can be used anywhere you can use an ObservableCollection<T>.
The PaginationManager constructor has a number of optional parameters to make it more flexible:
public PaginationManager(
    IPagedSourceProvider<T> provider,
    IPageReclaimer<T> reclaimer = null,
    IPageExpiryComparer expiryComparer = null,
    int pageSize = 100,
    int maxPages = 100,
    int maxDeltas = -1,
    int maxDistance = -1,
    string sectionContext = ""
)
- IPageReclaimer<T> reclaimer
The default implementation uses PageReclaimOnTouched<T>, which removes pages based on the last time they were accessed. However, there may be times when you want to use a different strategy for reclaiming pages. In those cases, you can override the IPageReclaimer<T> with one of your own.
- IPageExpiryComparer expiryComparer
This is a really useful system that allows you to check the update stamp of the page against the update stamp of the update itself (so it does not reapply an update to a page that already has it). There is a DateTime-based implementation in DateBasedPageExpiryComparer.
- int pageSize
The default size of a page is 100 items. You can change that to any positive integer you like.
- int maxPages
By default, the VirtualizingObservableCollection will begin to reclaim pages once there are 100 pages in memory. You can change this to any positive integer you like.
- int maxDeltas
The default is -1, which means the VirtualizingObservableCollection will not release pages based on a delta limit being exceeded. A forthcoming post about how the VirtualizingObservableCollection works under the hood will detail this argument further.
- int maxDistance
The default is -1, which means the VirtualizingObservableCollection does not take distance into account. A forthcoming post about how the VirtualizingObservableCollection works under the hood will detail this argument further.
- string sectionContext
The default is an empty string. A forthcoming post about how the VirtualizingObservableCollection works under the hood will detail what this argument does.
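For example, if you wanted smaller pages and a tighter memory cap, you could pass the optional parameters by name. This is just an illustrative sketch (the values are arbitrary, and TesterSource is the provider from the earlier example):

var collection =
    new VirtualizingObservableCollection<SimpleDataItem>(
        new PaginationManager<SimpleDataItem>(
            new TesterSource(),
            pageSize: 50,   // fetch 50 items per page instead of the default 100
            maxPages: 20)); // start reclaiming once 20 pages are held in memory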
Once we have our collection we can quite happily do anything we want to, for example:
public void InsertAt(int index, SimpleDataItem newItem)
{
    SimpleDataSource.Instance.Items.Insert(index, newItem);
    this.MyDataVirtualized.Insert(index, newItem);
}

public void RemoveAt(int index)
{
    SimpleDataSource.Instance.Items.RemoveAt(index);
    this.MyDataVirtualized.RemoveAt(index);
}
Yes, those are actual write operations on a data-virtualized observable collection. Once the edit is performed – whether done locally, via some messaging system such as SignalR, or when someone else has modified the underlying data source – the VirtualizingObservableCollection won’t reset, so you won’t have to worry about hiding flicker, re-scrolling, or caching selections.
Using UpdateAt when you are sharing a database
When you are working against a latent, shared data source such as a SQL database and receiving updates from other clients, it’s a great idea to check whether an update should be applied to your copy of the virtual collection. This is done at the page level (hence the LoadedAt member on the PagedSourceItemsPacket) – simply return DateTime.Now, or even better, store a TimeStamp in each row in the database and return LoadedAt = (from i in Items select i.TimeStamp).Max();.
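As a hedged sketch of that second option (assuming a hypothetical TimeStamp property on SimpleDataItem that mirrors an update-stamp column in the database), the provider’s GetItemsAt might look like this:

public PagedSourceItemsPacket<SimpleDataItem> GetItemsAt(int pageoffset, int count, bool usePlaceholder)
{
    var items = SimpleDataSource.Instance.Items.Skip(pageoffset).Take(count).ToList();

    return new PagedSourceItemsPacket<SimpleDataItem>()
    {
        // The newest row stamp on the page; an incoming update older than this
        // should not be re-applied to the page. TimeStamp is a hypothetical property.
        LoadedAt = items.Count > 0 ? (from i in items select i.TimeStamp).Max() : DateTime.Now,
        Items = items
    };
}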
Then, pass in the expiration implementation you wish to use, for example:
new VirtualizingObservableCollection<SimpleDataItem>(
    new PaginationManager<SimpleDataItem>(
        new TesterSource(),
        expiryComparer: DateBasedPageExpiryComparer.DefaultInstance));
To aid with this, when you get an update from another client (for example from SignalR), use the extended CRUD operators, e.g. InsertAt(int index, T newItem, object UpdatedAt), where UpdatedAt is the DateTime of the insert operation. That way, your copy will always be in perfect sync with the database.
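For instance, a handler for a pushed change might look like the sketch below. The OnRemoteInsert method and the shape of the notification are hypothetical, and it assumes the extended InsertAt overload mentioned above is exposed on the collection itself; the important part is passing the update’s timestamp through:

public void OnRemoteInsert(int index, SimpleDataItem newItem, DateTime updatedAt)
{
    // The other client already changed the underlying store, so we only touch
    // our collection. Passing the update's timestamp means pages loaded after
    // the change was made are not touched again.
    this.MyDataVirtualized.InsertAt(index, newItem, updatedAt);
}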
Understanding how it works
Before we can really understand how it works, we need to understand how controls such as ListBox call the interfaces.
When you have a control that has an ItemsSource, it typically uses IList<T> (specifically the T this[int index] getter) to populate the items, with an occasional IndexOf thrown in.
By using a sparse array of pages, we only cache a limited number of elements, and we can release them if the page cache fills up. That way, the collection can look like it’s any size – but we only keep a smaller subset of the data as it’s traversed. This is the Dictionary<int page, ISourcePage<T>> _Pages collection in the page manager.
These pages that the PaginationManager stores contain (initially) one “pageSize” worth of items (100 by default). This is typically how all the read-only virtualizing collections do it. What we realized was that we needed a way to change the page size for pages that have had CRUD operations applied. But we also needed an efficient way to work out what page, and what offset into that page, a given index actually falls at.
The secret to this implementation is using a combination of recursive seeks to get the page and offset instead of working out the page by summing the length of every page before it. We also introduced optimized paths where the control is asking for the very next item, or the current item.
Let’s explain this a little more.
We keep a Dictionary<int page, PageDelta> _Deltas. This means that we can compute a logical adjustment, like this:
var adjustment = (from deltas in _Deltas.Values where deltas.PageNumber < page select deltas.Delta).Sum();
So, basically we are only storing the pages that are non-conforming in size.
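To make that concrete, here’s a simplified, standalone sketch (not the library’s actual CalculateFromIndex code) of how the stored deltas adjust the plain page arithmetic; the delta values are made up for illustration:

// requires using System.Collections.Generic and System.Linq
var pageSize = 100;

// hypothetical deltas: page 1 lost one item, page 3 gained two
var deltas = new Dictionary<int, int> { { 1, -1 }, { 3, +2 } };

// number of items that precede a given page, taking earlier pages' deltas into account
int ItemsBeforePage(int page)
{
    var adjustment = deltas.Where(d => d.Key < page).Sum(d => d.Value);
    return (page * pageSize) + adjustment;
}

// e.g. ItemsBeforePage(4) == 4 * 100 + (-1 + 2) == 401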
Now, it’s not quite as simple as applying (page * pageSize) + adjustment. Depending on whether the adjustment is positive or negative, we may need to recurse, taking into account any adjustments for the pages the index effectively offsets into. This carries a small performance cost, but these initial seeks are infrequent compared with single-increment seeks (basically the control asks for [100], [101], etc.), which are optimized so they require neither recursion nor access to any of the previous deltas.
If you want to understand the specifics, the seeking is all contained in the CalculateFromIndex method in the PaginationManager class.
We should also mention something about the IPageReclaimer<T> interface, and more specifically the class PageReclaimOnTouched (which is the default implementation used by the PaginationManager). On a periodic basis, if the total number of pages held in the cache exceeds the maxPages limit, it will remove and release the overflow pages, dropping the “oldest” first, where “oldest” means least recently accessed.
So, we get the best of both worlds – huge collections that load fast, an efficient memory model, small data packets, and the ability to edit the observable collection.
Still to do
Async: I have started on this pattern, but I still need to make the count work asynchronously, and also address the issue where you do an insert into a page before it’s fetched. It might take a couple of weeks to implement this, so feel free to ping me on NuGet if you have any needs or insights.
About Alpha Chi Technology
You can always find out more on what we are doing at www.alphachitech.com.
Credits
I have to thank Bea Stollnitz for providing the inspiration to actually address this issue (http://www.zagstudio.com/blog/498 and http://www.zagstudio.com/blog/378).