Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles
(untagged)

C# LINQ: Ordered Joining and Grouping Lazy Operators

0.00/5 (No votes)
18 May 2017 1  
Lazy Joining and Grouping IEnumerable extensions for ordered sequences

Introduction

This article introduces one of possible implementations of Join and GroupJoin LINQ extensions assuming that the source data is ordered (pre-sorted). Exploiting this assumption, let us build Joining and Grouping logic using lazy evaluation as opposed to standard LINQ Join and GroupJoin operators that require the whole sequence to be preloaded into memory.

Although C# is used for this article, the technique applies to all .NET languages.

Background

Let’s say we have two datasets – Master and Detail. Both of them contain lots of records ordered by Master ID field and have master/detail relationship. The datasets are not necessarily stored in database, they could be anywhere (in our example, we will generate fake data sequences on the fly). Now the task is to read data from both sequences, join results together by Master ID, and perform some operations on resulting sequence (just print in our example). Of course, if data was stored in database, we could have written a SQL JOIN statement and let SQL Server perform joining. But for the sake of demonstration, let’s try implementing joining and grouping ourselves, right in the C# code.

Data Structures

In our demo, we will use the following data structures called MasterData and DetailData:

public struct MasterData
{
    public int MasterId { get; set; }
 
    public string Data
    {
        get { return string.Format("MASTER(Master ID: {0})", MasterId); }
    }
}
 
public struct DetailData
{
    public int MasterId { get; set; }
    public int DetailId { get; set; }
 
    public string Data
    {
        get { return string.Format("DETAIL(Master ID: {0}, Detail ID: {1})", MasterId, DetailId); }
    }
}

We will also use PrintData helper extension methods to print data:

public static class PrintingExtensions
{
    public static void PrintData(this IEnumerable<Tuple<MasterData, MasterData>> data)
    {
        foreach (var masterItem in data)
            Console.WriteLine("{0} <===> {1}", masterItem.Item1.Data, masterItem.Item2.Data);
    }
 
    public static void PrintData(this IEnumerable<Tuple<MasterData, IEnumerable<DetailData>>> data)
    {
        foreach (var masterItem in data)
        {
            Console.WriteLine(masterItem.Item1.Data);
            foreach (var detailItem in masterItem.Item2)
                Console.WriteLine("\t{0}", detailItem.Data);
        }
    } 
}

And finally two methods to generate sequences of sample data:

private static IEnumerable<MasterData> GetMasterData(int count)
{
    return Enumerable.Range(1, count).Select(m => new MasterData {MasterId = m});
}
 
private static IEnumerable<DetailData> GetDetailData(int count)
{
    return Enumerable.Range(1, count).SelectMany(m => 
    Enumerable.Range(1, 5).Select(d => new DetailData {MasterId = m, DetailId = d}));
}

GroupJoin - Standard Approach

Here is how we could do it using standard GroupJoin LINQ operator:

private const int c_count = 10*1000*1000;
private const int c_skipCount = 1*1000*1000;
private const int c_takeCount = 3;
 
void Main()
{    
    var masterData = GetMasterData(c_count);
    var detailData = GetDetailData(c_count);
    
    masterData.GroupJoin(detailData, m => m.MasterId, d => d.MasterId, Tuple.Create)
        .Skip(c_skipCount).Take(c_takeCount).PrintData();
}

This works perfectly well with relatively short sequences, but when the number of items is big, like in this example, execution becomes not very optimal because standard LINQ GroupJoin operator requires the entire sequences to be loaded into memory first, so that joining and grouping can work correctly for even un-ordered sequences.

However, according to our initial requirements, both sequences should be sorted by Master ID. If that’s true, why read the entire sequence if we can achieve the same result in a lazy manner? Let’s see how it might look like.

GroupJoin - Optimized Approach

In order to exploit the fact that the datasets are ordered, I have created a few extension methods on IEnumerable<T>. Here is an example that gives results identical to the previous one, though faster and more optimal:

private const int c_count = 10*1000*1000;
private const int c_skipCount = 1*1000*1000;
private const int c_takeCount = 3;
 
void Main()
{    
    var masterData = GetMasterData(c_count);
    var detailData = GetDetailData(c_count);
    
    masterData.OrderedEqualityGroupJoin(detailData, m => m.MasterId, d => d.MasterId, Tuple.Create)
        .Skip(c_skipCount).Take(c_takeCount).PrintData();
}

OrderedEqualityGroupJoin operator, as well as OrderedCompareGroupJoin, internally uses the following class to achieve lazy evaluation:

private class FilteredEnumerator<T> : IDisposable
{
    private bool m_hasData;
    private readonly IEnumerator<T> m_enumerator;
 
    public FilteredEnumerator(IEnumerable<T> sequence)
    {
        m_enumerator = sequence.GetEnumerator();
        m_hasData = m_enumerator.MoveNext();
    }
 
    public IEnumerable<T> SkipAndTakeWhile(Predicate<T> filter)
    {
        while (m_hasData && !filter(m_enumerator.Current))
            m_hasData = m_enumerator.MoveNext();
 
        while (m_hasData && filter(m_enumerator.Current))
        {
            yield return m_enumerator.Current;
            m_hasData = m_enumerator.MoveNext();
        }
    }
 
    public IEnumerable<T> SkipAndTakeWhile(Func<T, int> comparer)
    {
        while (m_hasData && comparer(m_enumerator.Current) > 0)
            m_hasData = m_enumerator.MoveNext();
 
        while (m_hasData && comparer(m_enumerator.Current) == 0)
        {
            yield return m_enumerator.Current;
            m_hasData = m_enumerator.MoveNext();
        }
    }
 
    public void Dispose()
    {
        m_enumerator.Dispose();
    }
}

FilteredEnumerator will skip all Detail records until it finds the one that matches current Master key, and will only pick up those that match. The difference between the two is that OrderedEqualityGroupJoin is based on equality of keys, whereas OrderedCompareGroupJoin is based on comparing keys. This becomes relevant when some Master records exist that do not have corresponding Detail records. Operator based on equality of keys might skip the entire sequence in that case, but the one based on comparison of keys would only skip detail records with keys less than the current Master ID.

Inner and Outer Joins

So far, we talked about Group Joining and it’s time to look into Join operators now. The attached source code contains implementation of Inner, Outer Full, Left, and Right joins, again, assuming that both inner and outer sequences are ordered.

Implementation is based on OrderedJoinIterator, which basically walks through both sequences and depending on comparison of keys and joining type yield returns either inner or outer records, or their default values.

static IEnumerable<TResult> OrderedJoinIterator<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer, 
    IEnumerable<TInner> inner, 
    Func<TOuter, TKey> outerKeySelector, 
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector, 
    JoinType jt, IComparer<TKey> comparer)
{
    if (comparer == null)
        comparer = Comparer<TKey>.Default;
 
    var l = outer.GetEnumerator();
    var r = inner.GetEnumerator();
 
    var lHasData = l.MoveNext();
    var rHasData = r.MoveNext();
 
    while (lHasData || rHasData)
    {
        if (!lHasData)
        {
            if (jt == JoinType.Inner || jt == JoinType.Left)
                yield break;
            yield return resultSelector(default(TOuter), r.Current);
            rHasData = r.MoveNext();
            continue;
        }
 
        if (!rHasData)
        {
            if (jt == JoinType.Inner || jt == JoinType.Right)
                yield break;
            yield return resultSelector(l.Current, default(TInner));
            lHasData = l.MoveNext();
            continue;
        }
 
        var comp = comparer.Compare(outerKeySelector(l.Current), innerKeySelector(r.Current));
 
        if (comp < 0)
        {
            if (jt == JoinType.Left || jt == JoinType.Full)
                yield return resultSelector(l.Current, default(TInner));
            lHasData = l.MoveNext();
        }
        else if (comp > 0)
        {
            if (jt == JoinType.Right || jt == JoinType.Full)
                yield return resultSelector(default(TOuter), r.Current);
            rHasData = r.MoveNext();
        }
        else
        {
            yield return resultSelector(l.Current, r.Current);
            lHasData = l.MoveNext();
            rHasData = r.MoveNext();
        }
    }
}

Signatures and behavior of presented OrderedJoin operators are identical to the standard LINQ Join operator. Below is the list of those operators:

  • OrderedInnerJoin: Returns inner and outer records that have matching keys.
  • OrderedFullJoin: Returns all inner and outer records. If keys are not matching on either side, type’s default value will be returned.
  • OrderedLeftJoin: Returns all outer records plus either matching inner records or their type’s defaults.
  • OrderedRightJoin: Returns all inner records plus either matching outer records or their type’s defaults.

Conclusion

LINQ is a very powerful technology and allows quite easily to achieve desired functionality hiding added complexity away from the main code that uses it. Although behavior of standard joining and grouping operators in LINQ-to-Objects is not targeting ordered sequences, the architecture of extension methods, lambdas, and the concept of yield returning IEnumerable<T> elements, play together very nicely and allow creating new operators with desired functionality.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here