Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / C#

Combination Generator

4.88/5 (29 votes)
11 Jun 2006CPOL7 min read 1   3.4K  
A combination generator that works properly with duplicate symbols in its input.

Introduction

This article presents a code library for generating combinations. A "combination" is a subset of the input sequence with a certain length. For example, given an input of {1, 2, 3, 4}, requesting all the length-2 combinations gives {1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, and {3, 4}. The number of combinations of size r drawn from an input of size n is referred to as nCr, which is equal to n! ÷ [r!(n-r)!]. However, this formula is only correct if the input is actually a set (all the elements are distinct). I needed an algorithm that would properly handle duplicates in the input, and the code in this article does just that. By coincidence more than by design, each combination's elements are returned in sorted order, and the combinations themselves are also returned in order.

Using the Code

To generate all the length-3 combinations of integers from the set {1, 2, 3, 4, 5}:

C#
int[] input = new int[] { 1, 2, 3, 4, 5 };
Combinations<int> combinations = new Combinations<int>(input, 3);
foreach (int[] combination in combinations) 
{
  // Do something with "combination".
}

This will result in the body of the foreach being run with combination set to each of the following values: {1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 3, 4}, {1, 3, 5}, {1, 4, 5}, {2, 3, 4}, {2, 3, 5}, {2, 4, 5}, and {3, 4, 5}.

Note that the input need not be an array; it can be any implementation of IEnumerable, including most of the standard collections classes. Also, the input need not be ints; simply by changing the generic type parameter on the Combinations object, any data type can be used.

If the inputs contain duplicates, they will be handled properly. For example, the following code:

C#
int[] input = new int[] { 1, 2, 2, 3 };
Combinations<int> combinations = new Combinations<int>(input, 3);
foreach (int[] combination in combinations) 
{
  // Do something with "combination".
}

will yield the following values of combination: {1, 2, 2}, {1, 2, 3}, {2, 2, 3}.

It is possible to use an alternative IComparer for equality and sorting. For example:

C#
string[] input = new string[] { "alpha", "ALPHA", "beta", "gamma" };
Combinations<string> combinations = new Combinations<string>(input,
  StringComparer.CurrentCultureIgnoreCase, 2);
foreach (string[] combination in combinations) 
{
  // Do something with "combination".
}

results in combination taking on the following values: {"alpha", "alpha"}, {"alpha", "beta"}, {"alpha", "gamma"}, {"beta", "gamma"}. "alpha" and "ALPHA" are considered equivalent, and whichever version appears first in the input list (in this case "alpha") is considered canonical; only that form will appear in the output.

Certain procedures can be carried out when the algorithm only knows the n value (the set of inputs), but before the r value (the output size) is known. These procedures are actually carried out in the ElementSet class. An ElementSet is constructed internally by the two Combinations constructors which accept IEnumerables, but it is also possible to construct an ElementSet separately and pass it into a Combinations constructor. This is useful if the same input will be processed for more than one output length. For example:

C#
int[] input = new int[] { 1, 2, 3, 4, 5 };
ElementSet<int> elements = new ElementSet<int>(input);
for (int i = 0; i <= 5; i++) {
  Combinations<int> combos = new Combinations<int>(elements, i);
  foreach (int[] combo in combos) 
  {
    // Do something with "combo".
  }
}

This code will efficiently generate all the subsets of {1, 2, 3, 4, 5}. Any processing which does not depend on the length of the subsets is done outside the loop. Also, this example demonstrates that the corner cases of r=0 and r=n are handled properly. In the r=0 case, a single array of length zero is returned; in the r=n case, a single array containing the entire input is returned.

It is possible to use both a custom IComparer and a preconstructed ElementSet. The IComparer must be passed into the ElementSet constructor along with the input data. The IComparer is not passed to the Combinations constructor in this case.

Complexity Analysis

For this discussion, n is the size of the input, r is the size of the output, and m is the number of "distinct" elements in the input.

Cycles

ElementSet constructor

Assuming the IEnumerable passed to the constructor can enumerate its contents in insignificant time, the constructor operates in O(nlog2n).

Combinations constructor

O(1), given an already-constructed ElementSet; otherwise, O(nlog2n).

Combinations.GetEnumerator()

O(r)

IEnumerator.Reset()

O(r)

IEnumerator.Current (get)

O(r)

IEnumerator.MoveNext()

O(r×m)

Storage

Only large quantities of storage are mentioned here.

ElementSet<T>

  • SortedList<T, int> of size m
  • int[r]
  • SortedDictionary<T, int> of size m only during construction

Combinations<T>

This class has no significant storage requirements.

IEnumerator<T[]>

  • int[r]
  • int[m]
  • T[r] only during a call to Current property-getter
  • Stack space for recursion to depth O(r) only during a call to MoveNext()

Diagnostics

The source archive contains a solution with two projects and three configurations. In the "Test" configuration, only the "Combinations" project will be built, and 13 NUnit tests will be compiled into the assembly. In the "Debug" and "Release" configurations, the "Combinations" project's test cases will not be compiled, while the "Benchmark" project will be compiled, which provides a simple console-mode benchmarking program that exercises the library. My benchmarks on a 1.7GHz processor in release mode range from over four million combinations per second for sets of 3 integers from a set of 10 distinct down to 66,571 combinations per second for sets of 700 integers from a set of 1000 with 10% duplicate entries. The input and output sizes impact the benchmarks much more than the percentage duplicates. Benchmarks for constructing ElementSets are also provided, passing in arrays of integers in both ascending and random order. My figures range from about 1.5 million input elements per second for a dataset of 10 integers to about 500,000 elements per second for an input of 1000 integers. Sorted-vs-random ordering is insignificant.

How It Works

The ElementSet constructor first gathers all the input elements. A SortedList<T, int> is populated such that each key maps to the number of times it occurs. Actually, a SortedDictionary is populated first, then copied to a SortedList, as this method yields better complexity. Also, the ElementSet calculates a cumulative sum backwards over the elements, storing the values into an int[], such that one can, in O(1), determine how many elements (not distinct elements) exist at and after a given position in the list. This is needed later.

The Combinations class does very little except to contain an ElementSet and the r value. All the work happens in the IEnumerator.

The IEnumerator uses "indices" to generate combinations. Imagining all the input elements lined up from left to right, with duplicates stacked above each other, the indices are initially "pushed" as far to the left as possible, such that each index points to its own element (not distinct element). Moving to the next value consists of sliding an index to the right. If it falls off the edge or hits another index, the next index is advanced one element to the right, and the current index is then pushed back to the left until it hits another index. Once all the indices have moved as far to the right as possible, there are no more combinations to generate. Other code is dedicated to making sure that an index never moves so far to the right that it doesn't leave enough room for other indices which reside to the right of it; this is what the backwards cumulative sum is used for. Once the indices have settled on their positions, a combination is generated in the Current property-getter by simply indexing the SortedList<T, int>.Keys collection with each index in turn, storing the values into an array.

Other Similar Code

There is already a Combinatorial Algorithms in C# article which contains a combinations algorithm (as well as others), but that algorithm apparently fails when presented with duplicates in its input. For my purposes that was a problem. Also, that article's code implements IEnumerator, while mine implements IEnumerable; hence my code can be used in a foreach but the other article's cannot. gybrush's explanation is that "If you have requested enumerator over the same sequence previously it must not become invalid, because it still may be in use. This is impossible with the current implementation of the Combinatorial classes and I don't see a way to easily modify this." My algorithm, on the other hand, presents a very clear division of code which makes supporting multiple IEnumerators easy. Finally, my code is generic, so it generates combinations in a typesafe way.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)