Using a DataReader like a List in F#

gjvdkamp

4.71/5 (12 votes)

7 Oct 2011CPOL8 min read

39.7K

162

This article demonstrates a technique how you can write recursive algorithms in F# using a Reader instead of a list, so you don't have to load all your data into memory first

Download source - 2.11 KB

Introduction

A lot of algorithms in F# use the List datastructure and recursion to solve tasks. This is very well for smaller datasets, but when they get bigger, the overhead of first converting them to a list can become significant if not prohibitive. This article presents a simple technique on how an IDataReader can be used just like you would a List, letting you stream large datasets across your algorithm, so it will use far less memory.

Background

When you compare the List and DataReader, they're actually pretty similar in how they work. A list has a Head that is the current element and a Tail with the other elements. The DataReader has a current element and a set of next elements that you can access using MoveNext(). This made me wonder if you couldn't just use a DataReader like a list and do away with first loading the whole thing into memory in a List. Turns out, it's pretty easy.

Update

I wrote this article when I was relatively new to F# and still rather dogmatic about functional programming, I don't do this anymore. To me, the elegance of F# is that you use functional coding where it fits best, and can mix in imperative coding where that just works better than purely functional code. IEnumerables in F# are handled by sequences that are well known. So the way to go is to wrap your IDataReader in a seq that exposes the elements in a strongly typed way.

I'll leave the rest of the article because tail recursion is still a useful concept to learn and the internet should be written in ink, but here's a quick snippet of how I would do this today:

open System.Data.SqlClient
open System.Data

// a simple record that mirrors the table in the DB
type SomeRecord = {
    id   : int
    val1 : double
    val2 : double
} with 
    // actually these functions could exist on their own outside the record, 
    // I prefer to keep them together
    // create a new record from the current rdr position
    static member fromRdr(rdr:IDataReader) = {
                    id      = rdr.GetInt32  0; 
                    val1    = rdr.GetDouble 1; 
                    val2    = rdr.GetDouble 2
                  }
    // project an IDataReader into a seq<SomeRecord>
    static member asSeq (rdr:IDataReader) = seq { while rdr.Read() 
	do yield SomeRecord.fromRdr rdr} // here we mix in a simple while loop to 
        //free us from recursing down the stack to traverse the IDataReader. MUCH simpler.

// open up connection (I didn't wrap any of this into functions here to just 
// show how to use this)
let cn = new SqlConnection "server=.;database=SomeDatabase;integrated security=true"
cn.Open()

// open DataReader
let cmd = new SqlCommand("select id, val1, val2 from table1", cn)
let rdr = cmd.ExecuteReader()

rdr
|> SomeRecord.asSeq                     // project datareader into seq<SomeRecord>
|> Seq.sumBy (fun r-> r.val1 * r.val2)  // now you can use all the Seq operators 
                                        // that everybody knows.

// finish up
rdr.Close()
cn.Close()

This is the rest of the original article...

A Textbook F# Recursive Algorithm

Let's begin by looking at a basic F# recursive algorithm using a List. This one just takes a list of numbers and adds them up.

 // create a list of numbers
let l = [1..10]

// create a function to add them together
let rec SumList l =
    match l with       
        head::tail -> head + Sum tail  	// peel off the head, and add 
					// its value to the value of the tail 
        | [] -> 0 				// until there are no more elements left, 
					// return zero for that

// call the function on the list, returns 55 in this case
SumList l

This is a typical example of how recursion is used in F# to elegantly describe an algorithm. The way the numbers are added here is the list is divided into two parts: the first element: the head, and the other elements: the tail. The value of the whole list is the value of the head plus the value of the tail. Duh. The function calls itself again with the tail (which each time is one element shorter than the list passed in) until there is no more tail. That's what the [] stands for, the empty list. When it comes across that, it returns 0. Now it goes back up the stack adding up all the numbers and tadaa: the result.

Typically when you'd write an algorithm like the one above, the data isn't hardcoded in there, you load that data from somewhere into a list and then apply the logic. The problem with this is however that now that entire set has to be loaded into memory. This can be a challenge when the dataset is big. It would be nicer on memory if we could load only a portion of that data into memory to run the algorithm on, dispose of that and then read the next portion.

Doing That with a DataReader

Here's the same algorithm, but now using a DataReader instead of a list. (Note: Do not run off now and use this code in production because it will fail miserably as we will demonstrate).

let rec SumReader (dr:IDataReader) =
   match dr.Read() with
       true -> dr.GetInt32(1) + SumReader dr
       | false -> 0

This works pretty much the same as the example above. It accepts an IDataReader, but instead of splitting that into a head and a tail, it calls movenext on it. That returns true if there is a new current element, and false if not. If there was a next element, it adds the value of that to the value of the rest of the elements in the reader. Just like the example above, it keeps doing that until there is no next element, then it unwinds the stack to add all the numbers together.

Oops... No Tail Call

Using the algorithm above works fine for smaller datasets, but throws a StackOverflowException on larger ones. As far as Exceptions go, that's about as bad as they come. You can't catch them, when one is thrown in your code, it's lights out. The whole point of using a reader over a list was to be able to handle larger datasets, so we're clearly not there yet.

The reason we get this exception is because we're not making a tail call. F# uses a technique called Tail Call Optimization (TCO, explained here, here and here) to make code run more efficiently. Basically, what that does is take a recursive function and rewrite that as an iterative one. When the original function meets a couple of criteria, this can be done pretty easily, the major criterion being that the call to itself must be the last thing the function does in that branch. When rewritten, the function now no longer calls itself but does a loop, which is more efficient because there are no more pushes to, and pops from, the stack. Pushes to the stack is what killed us, TCO is what we want.

If you look carefully at the code of our DataReader example, you see that the recursive call to itself is actually not the last thing it does, even when it's the last thing on that line, because it adds the result of that call to the value of the current element. THAT is the last thing it does and that makes it impossible for F# to rewrite it as a tail call. (Thanks to Tomas Petricek for pointing that out). So what can we do about it?

Enter the Accumulator Pattern

What we need to do is pass an extra variable that holds the result so far. Each time we add the value of the current element to this result so far and then pass that to the next call. This way, we don't have to do anything after the call returns. This technique is called the accumulator pattern. When we do it like that, the call to itself IS the last thing it does and the F# compiler can do the Tail Call Optimization that will let us run larger datasets across. Here's what that looks like:

let rec SumReaderAcc (dr:IDataReader, accum) =
    match dr.Read() with
        true -> SumReaderAcc(dr, accum + dr.GetInt32(1)) // now this is the last thing 
							// it does, enabling TCO
        |false -> accum		// when there are no more elements, 
				// return the accumulator

It's ok now to run off and fix your production issue, but if you stick around a little longer I'll show you how you can turn this into a more generic approach that you can reuse in your code. In general, typing is bad, reuse is good.

Making Things a Bit More Generic

So far the functionality of this code is pretty fixed, it adds the first column of the query. Since F# is a functional language, why not pass a function to be applied to every element in the reader to get a value (functional dependency injection, so geeky). That way, it will be a bit more generic already. Here's the code for that:

// passing the function to apply to each element as a parameter 
// already makes it more generic
let rec SumReaderF (dr:IDataReader, func, accum) =
    match dr.Read() with
        true -> SumReaderF(dr, func, accum + func(dr)) // pass reader and func 
	// on to next call, and also apply func to element and add to accumulator
        |false -> accum

// create a function that takes an IDataReader and returns an int
let myAdd(element:IDataReader) = element.GetInt32(1)

// apply the function to each element in the reader
SumReaderF(rdr, myAdd, 0)

Now that's better already, but we can still do better. For one, the user now has to enter a starting zero for accum for this to work, and it only adds up the calls to the elements, no way to for instance multiply or average. The code below fixes those two issues and is as far as we will go today. It uses an inner function to handle the recursion and accumulator, so the user only has the pass the reader and the functions. There are two functions passed to it: one function that is applied to the current element of the IDataReader to extract a value, and a second aggregate function that is applied to the result of the element function and the accumulator.

let ApplyFuncToReader (dr:IDataReader, func, agg) =    
    let rec ApplyInternal (accum) = 	// create inner function to handle 
				//the recursion and accumulator. 
				// No need to pass everything for each call, 
				// just the accumulator will do
        match dr.Read() with
            true -> ApplyInternal(agg(accum, func(dr))) // apply func to element 
						// and agg to func
            |false -> 
                dr.Close()  // be nice, close the reader
                accum       // and return the result
    ApplyInternal 0 	// and execute that inner function passing 0 
			// as starting value for accumulator

Using the Code

Here's an example of how to call the code above:

// create a function to apply to each element
let myElementFunction(element:IDataReader) = element.GetInt32(1)

// create an aggregate function to apply between each call to elements
let myAggregate(x, y) = x + y

// create a SqlDataReader (NewReader is a helper function 
// I wrote that returns an open SqlDataReader)
let rdr = NewReader()

// and pass all three to ApplyFuncToReader
ApplyFuncToReader (rdr, myElementFunction, myAggregate)

First we create 2 functions, one to be applied to every element in the IDataReader to extract a value and a second that aggregates the calls to the first function. Here we just add them up, but we could also multiply them or do something more complicated. We don't have to change the implementation of ApplyFuncToReader itself to change its behaviour.

Points of Interest

The code above will work fine in most situations encountered in daily work, especially when using F# interactive. When used in multithreaded code, it's not exactly the same as using a List because someone else could be fiddling with your Reader making your results unreliable. Although the resultset is protected by the SQL transaction of your choice, the Reader itself could be advanced to a next element outside of your control. These things can be easily avoided obviously but it is important to understand that you are not protected by immutability like you would be with a List.

I went for an IDataReader here because databases is typically where my data lives. The same technique can of course be applied to a FileReader or StreamReader when your data is stored in a file on disk, or comes to you from the cloud in a stream. You no longer need to buffer it all into memory before you can work on it and that's a big plus for large datavolumes. I hope this helps you as it helps me. If it does, please drop me a line or rate my article.

History

21-07-2010: Version 1.0
07-10-2011: Updated article

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)