Introduction
Gathering a random sample of data from a bigger source has many uses, from testing and debugging to marketing. Some examples are as follows:
- Whilst developing or debugging a LINQ query, it is easier to analyze a smaller result set from a much larger source
- Running abbreviated unit tests to improve multi-developer integration time: use a small sample during business hours and the full test data during the nightly builds
- Marketing - "Let's call 10% of our users and ask them for some feedback or references"
I want to be able to write a query like this to get a random sample of 10 names from a bigger list of names in a string[]:
var sample = listOfNames.RandomSample(10);
The code described in this article adds an extension method to IEnumerable<T> that allows you to generate a random sampling of elements from a collection of any type.
Background - Extension Methods
C# 3.0, the next release of the language, introduces a new feature called Extension Methods. Normally you can call a method on an object instance only if its class, or one of its ancestor classes, provides an accessible method of that name. Extension Methods allow us to "add" methods to types without changing the original class or creating a subclass of our own.
In the following example, I added a new method to .NET's string class that returns a new string with Hello as the prefix.
public static class Extensions
{
    public static string Hello(this string s) {
        return "Hello " + s;
    }
}
After creating the extension method, I can use the following syntax to prepend Hello … to any string instance and, in this case, write the resulting string to the console window.
string name = "Troy";
Console.WriteLine(name.Hello());
The important syntax change above is that the first parameter has the this modifier before its type in the parameter list.
The other key points when defining extension methods are:
- The extension method must be in a class marked as static
- Each extension method must be marked as static
- The this modifier must be on the first parameter
At compile time, the C# compiler first looks for an instance method that matches the name and parameter signature. If no matching instance method is found, the search continues through any namespaces imported with a using directive. If one of those namespaces contains a static method with the same name whose first parameter has the this modifier and a type matching the instance object's type, that method is used.
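The lookup order described above can be seen in a small sketch. The Greeter class and its two extension methods are hypothetical names invented for this illustration; they are not part of the article's download.

```csharp
using System;

public class Greeter
{
    // Instance method: the compiler binds to this first.
    public string Describe() { return "instance method"; }
}

public static class GreeterExtensions
{
    // Same name and signature as the instance method, so the normal
    // call syntax greeter.Describe() never selects this one.
    public static string Describe(this Greeter g) { return "extension method"; }

    // No instance method is called Shout, so the extension is used.
    public static string Shout(this Greeter g) { return g.Describe().ToUpperInvariant(); }
}

public static class Program
{
    public static void Main()
    {
        var greeter = new Greeter();
        Console.WriteLine(greeter.Describe());                   // instance method
        Console.WriteLine(greeter.Shout());                      // INSTANCE METHOD
        Console.WriteLine(GreeterExtensions.Describe(greeter));  // extension method
    }
}
```

The last call shows that an extension method is still an ordinary static method and can be invoked through its class when an instance method hides it.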
Our RandomSample extension method will allow any object that implements IEnumerable<T> to return another IEnumerable<T> containing the number of random sequence elements we request.
Using the Code
When using the RandomSample extension method, you have two method signatures to choose from:
[IEnumerable object].RandomSample( count, allowDuplicates )
[IEnumerable object].RandomSample( count, seed, allowDuplicates )
count: The number of elements to return, or fewer if the source list has fewer elements.
allowDuplicates: true or false. If true, an element may be returned more than once if the random generator picks it more than once.
seed: The initial integer seed for the random sequence generator. If you don't specify a seed, the system tick count is used. If you specify an explicit seed, the sequence is identical for each call given the same input source list, which is useful for repeating tests with a specific random sequence. In the implementation shown in this article, a negative seed is treated as unspecified.
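The repeatability that an explicit seed provides comes from System.Random itself. The following is a minimal sketch of that behaviour on its own; SeedDemo and DrawIndexes are hypothetical names used only here, and the actual numbers drawn depend on the runtime, so none are shown.

```csharp
using System;

public static class SeedDemo
{
    // Draws 'count' indexes in [0, upperBound) from a Random built with 'seed'.
    public static int[] DrawIndexes(int seed, int count, int upperBound)
    {
        var random = new Random(seed);
        var picks = new int[count];
        for (int i = 0; i < count; i++)
            picks[i] = random.Next(upperBound);
        return picks;
    }

    public static void Main()
    {
        int[] first = DrawIndexes(42, 5, 9);
        int[] second = DrawIndexes(42, 5, 9);

        // Both lines print the same sequence because the seed is the same.
        Console.WriteLine(string.Join(",", first));
        Console.WriteLine(string.Join(",", second));
    }
}
```

Two generators created with the same seed produce the same indexes in the same order, which is why a seeded sample over an unchanged source list repeats exactly.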
To use this code in your project, download the source for this article and add the following using directive to the code where you wish to call the RandomSample method:
using Aspiring.Query;
With this using directive in place, our extension method is in scope, and any object that implements IEnumerable<T> can call it. The following code returns three random names from a list, allowing duplicates (the last argument is true to allow an element to be returned more than once if the random sequence picks it again, or false to return each element at most once):
string[] firstNames = new string[] {"Paul", "Peter", "Mary", "Janet",
    "Troy", "Adam", "Nick", "Tatham", "Charles" };
var randomNames = firstNames.RandomSample(3, true);
foreach(var name in randomNames) {
    Console.WriteLine(name);
}
Here is the code that implements our RandomSample extension method for IEnumerable<T> objects:
using System;
using System.Collections.Generic;
using System.Text;
using System.Query; // LINQ preview namespace; released as System.Linq
namespace Aspiring.Query
{
    public static class RandomSampleExtensions
    {
        // Unseeded overload: passes -1 so the iterator uses a default Random.
        public static IEnumerable<T> RandomSample<T>(
            this IEnumerable<T> source, int count, bool allowDuplicates) {
            if (source == null) throw new ArgumentNullException("source");
            return RandomSampleIterator<T>(source, count, -1, allowDuplicates);
        }
        // Seeded overload: the same non-negative seed repeats the same sample.
        public static IEnumerable<T> RandomSample<T>(
            this IEnumerable<T> source, int count, int seed,
            bool allowDuplicates)
        {
            if (source == null) throw new ArgumentNullException("source");
            return RandomSampleIterator<T>(source, count, seed,
                allowDuplicates);
        }
        static IEnumerable<T> RandomSampleIterator<T>(IEnumerable<T> source,
            int count, int seed, bool allowDuplicates) {
            // Copy the source so elements can be indexed and removed.
            List<T> buffer = new List<T>(source);
            // A negative seed means "no seed specified".
            Random random;
            if (seed < 0)
                random = new Random();
            else
                random = new Random(seed);
            // Never return more elements than the source contains.
            count = Math.Min(count, buffer.Count);
            if (count > 0)
            {
                for (int i = 1; i <= count; i++)
                {
                    int randomIndex = random.Next(buffer.Count);
                    yield return buffer[randomIndex];
                    // Removing the element prevents it being picked again.
                    if (!allowDuplicates)
                        buffer.RemoveAt(randomIndex);
                }
            }
        }
    }
}
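When duplicates are not allowed, the listing above removes each picked element from the buffer, which shifts every later element down by one. A possible variation, shown here only as a sketch and not as the article's implementation, is a partial Fisher-Yates shuffle that swaps the picked element to the front instead. ShuffleSampleSketch and SampleWithoutDuplicates are hypothetical names, and the Random instance is passed in by the caller as an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;

public static class ShuffleSampleSketch
{
    // Hypothetical helper, not part of the article's library.
    public static IEnumerable<T> SampleWithoutDuplicates<T>(
        IEnumerable<T> source, int count, Random random)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (random == null) throw new ArgumentNullException("random");
        return SampleIterator(source, count, random);
    }

    static IEnumerable<T> SampleIterator<T>(
        IEnumerable<T> source, int count, Random random)
    {
        List<T> buffer = new List<T>(source);
        count = Math.Min(count, buffer.Count);
        for (int i = 0; i < count; i++)
        {
            // Pick from the not-yet-chosen range [i, buffer.Count).
            int j = random.Next(i, buffer.Count);
            T picked = buffer[j];
            buffer[j] = buffer[i];   // keep the unchosen element available
            buffer[i] = picked;      // positions 0..i now hold the sample
            yield return picked;
        }
    }

    public static void Main()
    {
        string[] names = { "Paul", "Peter", "Mary", "Janet", "Troy" };
        foreach (string name in SampleWithoutDuplicates(names, 3, new Random()))
            Console.WriteLine(name);
    }
}
```

Each pick is a constant-time swap rather than a list removal, which may matter for large sources; for short lists like the examples in this article the difference is unlikely to be noticeable.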
Points of Interest
The cornerstone of writing extension methods for LINQ is the yield return statement, a C# 2.0 feature. Each time the enumerator's MoveNext() method is called, which happens once per iteration of a foreach statement, our routine resumes execution from the line after the previous yield return statement. State is maintained between calls, so authoring interesting enumerators like this takes a fraction of the work that would have been required in .NET 1.1.
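The pause-and-resume behaviour can be made visible with a small sketch that logs each step. IteratorDemo, CountTo and Log are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;

public static class IteratorDemo
{
    // Records each step so the order of execution is visible.
    public static readonly List<string> Log = new List<string>();

    public static IEnumerable<int> CountTo(int limit)
    {
        Log.Add("iterator started");
        for (int i = 1; i <= limit; i++)
        {
            Log.Add("yielding " + i);
            yield return i;   // execution pauses here until the next MoveNext()
        }
        Log.Add("iterator finished");
    }

    public static void Main()
    {
        IEnumerable<int> numbers = CountTo(2);   // no iterator code has run yet
        Log.Add("before foreach");
        foreach (int n in numbers)
            Log.Add("consumer got " + n);

        foreach (string line in Log)
            Console.WriteLine(line);
    }
}
```

The log begins with "before foreach" rather than "iterator started", because the body of an iterator method does not run until the first MoveNext() call. After that, the "yielding" and "consumer got" entries alternate, and "iterator finished" comes last.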
I have written a library with many more useful extension methods and posted them on my blog, but they all follow the same pattern: extend IEnumerable<T> and build an iterator using the yield return statement.