Introduction
A lot of algorithms in F# use the List
datastructure and recursion to solve tasks. This is very well for smaller datasets, but when they get bigger, the overhead of first converting them to a list can become significant if not prohibitive. This article presents a simple technique on how an IDataReader
can be used just like you would a List
, letting you stream large datasets across your algorithm, so it will use far less memory.
Background
When you compare the List
and DataReader
, they're actually pretty similar in how they work. A list
has a Head
that is the current element and a Tail
with the other elements. The DataReader
has a current element and a set of next elements that you can access using MoveNext()
. This made me wonder if you couldn't just use a DataReader
like a list
and do away with first loading the whole thing into memory in a List
. Turns out, it's pretty easy.
Update
I wrote this article when I was relatively new to F# and still rather dogmatic about functional programming, I don't do this anymore. To me, the elegance of F# is that you use functional coding where it fits best, and can mix in imperative coding where that just works better than purely functional code. IEnumerables
in F# are handled by sequences that are well known. So the way to go is to wrap your IDataReader
in a seq
that exposes the elements in a strongly typed way.
I'll leave the rest of the article because tail recursion is still a useful concept to learn and the internet should be written in ink, but here's a quick snippet of how I would do this today:
open System.Data.SqlClient
open System.Data
type SomeRecord = {
id : int
val1 : double
val2 : double
} with
static member fromRdr(rdr:IDataReader) = {
id = rdr.GetInt32 0;
val1 = rdr.GetDouble 1;
val2 = rdr.GetDouble 2
}
static member asSeq (rdr:IDataReader) = seq { while rdr.Read()
do yield SomeRecord.fromRdr rdr}
let cn = new SqlConnection "server=.;database=SomeDatabase;integrated security=true"
cn.Open()
let cmd = new SqlCommand("select id, val1, val2 from table1", cn)
let rdr = cmd.ExecuteReader()
rdr
|> SomeRecord.asSeq
|> Seq.sumBy (fun r-> r.val1 * r.val2)
rdr.Close()
cn.Close()
This is the rest of the original article...
A Textbook F# Recursive Algorithm
Let's begin by looking at a basic F# recursive algorithm using a List
. This one just takes a list of numbers and adds them up.
let l = [1..10]
let rec SumList l =
match l with
head::tail -> head + Sum tail
| [] -> 0
SumList l
This is a typical example of how recursion is used in F# to elegantly describe an algorithm. The way the numbers are added here is the list is divided into two parts: the first element: the head, and the other elements: the tail. The value of the whole list is the value of the head plus the value of the tail. Duh. The function calls itself again with the tail (which each time is one element shorter than the list passed in) until there is no more tail. That's what the []
stands for, the empty list. When it comes across that, it returns 0
. Now it goes back up the stack adding up all the numbers and tadaa: the result.
Typically when you'd write an algorithm like the one above, the data isn't hardcoded in there, you load that data from somewhere into a list and then apply the logic. The problem with this is however that now that entire set has to be loaded into memory. This can be a challenge when the dataset is big. It would be nicer on memory if we could load only a portion of that data into memory to run the algorithm on, dispose of that and then read the next portion.
Doing That with a DataReader
Here's the same algorithm, but now using a DataReader
instead of a list
. (Note: Do not run off now and use this code in production because it will fail miserably as we will demonstrate).
let rec SumReader (dr:IDataReader) =
match dr.Read() with
true -> dr.GetInt32(1) + SumReader dr
| false -> 0
This works pretty much the same as the example above. It accepts an IDataReader
, but instead of splitting that into a head and a tail, it calls movenext
on it. That returns true
if there is a new current element, and false
if not. If there was a next element, it adds the value of that to the value of the rest of the elements in the reader. Just like the example above, it keeps doing that until there is no next element, then it unwinds the stack to add all the numbers together.
Oops... No Tail Call
Using the algorithm above works fine for smaller datasets, but throws a StackOverflowException
on larger ones. As far as Exceptions go, that's about as bad as they come. You can't catch them, when one is thrown in your code, it's lights out. The whole point of using a reader over a list was to be able to handle larger datasets, so we're clearly not there yet.
The reason we get this exception is because we're not making a tail call. F# uses a technique called Tail Call Optimization (TCO, explained here, here and here) to make code run more efficiently. Basically, what that does is take a recursive function and rewrite that as an iterative one. When the original function meets a couple of criteria, this can be done pretty easily, the major criterion being that the call to itself must be the last thing the function does in that branch. When rewritten, the function now no longer calls itself but does a loop, which is more efficient because there are no more pushes to, and pops from, the stack. Pushes to the stack is what killed us, TCO is what we want.
If you look carefully at the code of our DataReader
example, you see that the recursive call to itself is actually not the last thing it does, even when it's the last thing on that line, because it adds the result of that call to the value of the current element. THAT is the last thing it does and that makes it impossible for F# to rewrite it as a tail call. (Thanks to Tomas Petricek for pointing that out). So what can we do about it?
Enter the Accumulator Pattern
What we need to do is pass an extra variable that holds the result so far. Each time we add the value of the current element to this result so far and then pass that to the next call. This way, we don't have to do anything after the call returns. This technique is called the accumulator pattern. When we do it like that, the call to itself IS the last thing it does and the F# compiler can do the Tail Call Optimization that will let us run larger datasets across. Here's what that looks like:
let rec SumReaderAcc (dr:IDataReader, accum) =
match dr.Read() with
true -> SumReaderAcc(dr, accum + dr.GetInt32(1))
|false -> accum
It's ok now to run off and fix your production issue, but if you stick around a little longer I'll show you how you can turn this into a more generic approach that you can reuse in your code. In general, typing is bad, reuse is good.
Making Things a Bit More Generic
So far the functionality of this code is pretty fixed, it adds the first column of the query. Since F# is a functional language, why not pass a function to be applied to every element in the reader to get a value (functional dependency injection, so geeky). That way, it will be a bit more generic already. Here's the code for that:
let rec SumReaderF (dr:IDataReader, func, accum) =
match dr.Read() with
true -> SumReaderF(dr, func, accum + func(dr))
|false -> accum
let myAdd(element:IDataReader) = element.GetInt32(1)
SumReaderF(rdr, myAdd, 0)
Now that's better already, but we can still do better. For one, the user now has to enter a starting zero for accum
for this to work, and it only adds up the calls to the elements, no way to for instance multiply or average. The code below fixes those two issues and is as far as we will go today. It uses an inner function to handle the recursion and accumulator, so the user only has the pass the reader and the functions. There are two functions passed to it: one function that is applied to the current element of the IDataReader
to extract a value, and a second aggregate function that is applied to the result of the element function and the accumulator.
let ApplyFuncToReader (dr:IDataReader, func, agg) =
let rec ApplyInternal (accum) =
match dr.Read() with
true -> ApplyInternal(agg(accum, func(dr)))
|false ->
dr.Close()
accum
ApplyInternal 0
Using the Code
Here's an example of how to call the code above:
let myElementFunction(element:IDataReader) = element.GetInt32(1)
let myAggregate(x, y) = x + y
let rdr = NewReader()
ApplyFuncToReader (rdr, myElementFunction, myAggregate)
First we create 2 functions, one to be applied to every element in the IDataReader
to extract a value and a second that aggregates the calls to the first function. Here we just add them up, but we could also multiply them or do something more complicated. We don't have to change the implementation of ApplyFuncToReader
itself to change its behaviour.
Points of Interest
The code above will work fine in most situations encountered in daily work, especially when using F# interactive. When used in multithreaded code, it's not exactly the same as using a List
because someone else could be fiddling with your Reader
making your results unreliable. Although the resultset
is protected by the SQL transaction of your choice, the Reader
itself could be advanced to a next element outside of your control. These things can be easily avoided obviously but it is important to understand that you are not protected by immutability like you would be with a List
.
I went for an IDataReader
here because databases is typically where my data lives. The same technique can of course be applied to a FileReader
or StreamReader
when your data is stored in a file on disk, or comes to you from the cloud in a stream. You no longer need to buffer it all into memory before you can work on it and that's a big plus for large datavolumes. I hope this helps you as it helps me. If it does, please drop me a line or rate my article.
History
- 21-07-2010: Version 1.0
- 07-10-2011: Updated article