Introduction
Ordinary DataReader methods which
use SQL statements to retrieve data directly from databases are most straightforward, and
are probably the ultimate function blocks beneath all other higher
architectures. However, maintaining a SQL statement written in plain text is a
pain. Whenever there is a change in the database structure, all SQL statements
must be reviewed for correctness. A small typo of field name can only be
revealed by running all tests or even only after running in the real production
environment, which means a very high maintenance cost.
Frameworks like Hibernate do a
great job in linking class definitions in source code with tables in the database.
However, since relations defined are not an integral part of the source
code, change of database requires both change in relation definition and
change of logic in source code. All of the above changes will not be
“syntax-checked” by the compiler, meaning errors are more often found only by
running the program.
The emergence of LINQ-to-SQL brought
light to this maintenance problem. The relations of tables, the conditions of
query, and many other parts around SQL are now an integral part of the source
code. They will be compiler checked, and will be turned into SQL statement at
run-time with little human intervention. Small typos and ‘miss change’ problems
are prevented before running by the compiler.
However, LINQ-to-SQL does have its problems, mainly its
execution overhead and caching. Executing a LINQ-to-SQL statement takes much
longer time than executing a direct SQL statement through DataReader. The
probable cost might be from converting LINQ expressions to SQL commands,
and retrieving query data into objects, and creating other necessary meta data.
This cost is partly saved by using cache to store retrieved data for later use.
However, a cache may cause discrepancies between cached data with actual data in
the database, when multiple applications are modifying the database.
The purpose of this article is to
provide a data retrieving method with the high maintainability of LINQ-to-SQL
with the minimum overhead cost of executing.
Microsoft SQL LINQ-to-SQL is the target system of this
article. Other LINQ-to-SQL systems such as DbLinq and ALinq can be used too,
but will need a slightly different coding to make it work. The reason is that
all vendors of LINQ-to-SQL have their implementation of IQueryable
and
DataContext
.
Approach
IQueryable
is the core class to
LINQ-to-SQL. It holds a LINQ expression tree constructed along the executing
path, and ready to be turned into a SQL command when required. For those LINQ
methods such as Where()
and Select()
, they do nothing but modify the
IQueryable
’s expression tree to take on more information. Following is the
expression tree from the LINQ statement “from p in Persons where p.Id > id
select p” defined in a method body.
1. Call Where()
2. __Constant Table(Person)
3. __Quote
4. ____Lambda
5. ______Equal
6. ________MemberAccess Id
7. __________Parameter p
8. ________MemberAccess id
9. __________Constant value(Prototype.Database.AllQueries+<>c__DisplayClass2
Expression Tree #1
DisplayClass2
is an “anonymous” method-context class
defined by the compiler and created at runtime to help store values necessary for
a LINQ expression. For every method call encountered during IQueryable
construction, there will be such a DisplayClass
created. (Call to one method
multiple times will introduce multiple instances of the same type of DisplayClass
).
They are constant to one particular instant of IQueryable
, even though they may
have different values within different instances of IQueryable
.
The above expression tree will be transformed into the following
SQL statements before executing reading.
SELECT [t0].[id] AS [Id],
[t0].[name] AS [Name], [t0].[birthday] AS [Birthday], [t0].[value] AS [Value]
FROM [person] AS [t0]
WHERE [t0].[id] = @p0
At this point, it is not hard to
see that the only variable of the above SQL command is its parameter @p0
, which
is associated with the expression branched at line 8 in Expression Tree #1.
From this observation, there could be a less-overhead operation if the
relationship between IQueryable
expression branches and SQL command parameters
is stored, along with the SQL command itself. When such an IQueryable
is to be
executed, its associated SQL command is reused, and the SQL command parameters
are re-evaluated from the IQueryable
, eliminating the effort of recreating a new
SQL command object from the expression tree.
Therefore, the descriptive steps
to execute a cached LINQ-to-SQL query is as follows:
- Find
the matching cached query
- Build
one if no match is found
- Get the SQL command (template for future use) using
DataContext.GetCommand
- Build the display class instance getters
- Build the SQL parameter getters using display class instances as input
- Run
the display class instance getters to obtain the actual display class instances
- Set
the parameters in the SQL template using values executing parameter getters
- Execute
the modified SQL statement and retrieve the DataReader
- Read from the DataReader and do the conversion
The key steps involved in this
approach are:
- Differentiate
IQueryable
s - Retrieve the display class instances from
the current
IQueryable
- Reconstruct the SQL command parameters using the
retrieved display class instances
- Reading the DataReader into objects with conversions
Compare IQueryables
Since every IQueryable
stores an
expression tree, comparing IQueryable
s is nothing but comparing their
expression trees. The comparison is performed node by node and branch by
branch. Any difference found means the two IQueryable
s are different in
essence. A Constant expression, such as line 8 in the example, will only need to
compare to the type, not to the value, with one exception.
One particular interesting case
happens when there is a Contains()
method call involving an array of values.
For example
LINQ:
var ids = new List<int> {1, 2, 3, 4, 5, 20, 70 };
from p in persons where
ids.Contains(p.Id) select p;
Its expression tree:
1. Call Where()
2. __Constant Table(Person)
3. __Quote
4. ____Lambda
5. ______Call ids.Contains()
6. ________MemberAccess p.Id
7. __________Parameter p
Expression Tree #2
Its SQL command:
SELECT [t0].[id] AS [Id],
[t0].[name] AS [Name], [t0].[birthday] AS [Birthday], [t0].[value] AS [Value]
FROM [person] AS [t0]
WHERE [t0].[id] IN (@p0, @p1, @p2, @p3, @p4, @p5, @p6)
The form of SQL command will be different if the number of
values in the array (list) is different. The simplest way is to treat
IQueryable
s with different element counts as unique individuals, but the
downside of it is more cached commands in the memory. One alternative is to
reuse a cached command with a longer list. By doing this, setting the actual values
for the command parameters need additional logic to using existing array values
in place of non-existing array values. Comparing the increased memory demands
of the simplest solution, the extra execution time of the latter seems to be an
acceptable cost. Therefore, the longer list alternative is used in this
approach to deal with the Contains()
query.
Retrieve display class instances
From this step up, all actions
will be accelerated by using compiled delegates. The composing and compilation
of these delegates is costly. But it saves from the second time on, and it is
essential to the performance of this method.
Interestingly, IQueryable
does
not hold a list of display class instances explicitly, instead, those instances
are buried inside the expression tree. To find out all the instances requires a
recursive walk over all levels of child expressions. Since there could be
multiple instances of the same display type (explained before, as in the case of
multiple calls to the same method during composing of a query), strict
comparisons of display class instance are used, and a full walk is necessary.
The walk results are logical “paths” from the top node of the expression tree
down to the Constant expression node holding one particular display class
instance. The logical path to retrieve the display class instance in the first
LINQ example is:
((((((ExpressionRoot).Arguments.get_Item(1)).Operand).Body).Right).Expression).Value
Such logical paths found for one
IQueryable
are composed into expression trees and compiled into executable
delegates. They are ready to be used in the next phase.
Load command parameters
As the query expression tree is so constructed, there are
certain mapping relationships of one command parameter from one branch in the
expression tree. By observation, a parameter branch is the largest branch
without Parameter expression or Constant Table expression. As in Expression
Tree #1, one parameter exists from line 8 and onwards.
In case of multiple parameters, the sequence of the
parameters is determined by the LINQ-to-SQL implementation, vendor by vendor,
and specific to each expression node. For example, in a Binary expression in
a Microsoft implementation, any parameter found with in the Left branch will have
precedence over parameters found within the Right branch.
For the Contains()
method on the array/list, each individual value
of the array/list is rated as a parameter and the sequence is usually their
sequence appearing in the array/list.
As mentioned before, a cached query with a longer list is used
in this approach. Therefore the actual array/list in a running query always
contains equal or less elements than the array/list used to build the cache.
Retrieving the elements can not always be straightforward by its index.
Instead, a length check is performed before retrieving. If the current index is
larger than the length of the actual array/list, index 0 is used, else use the
index directly. The final parameter expression looks like the following:
Param_0.ids.get_Item(IIF((Param_0.ids.Count > 0), 0, 0))
Param_0.ids.get_Item(IIF((Param_0.ids.Count > 1), 1, 0))
Instead of:
Param_0.ids.get_Item(0)
Param_0.ids.get_Item(1)
Running these expressions on a list with only one element is
equivalent to
Param_0.ids.get_Item(0)
Param_0.ids.get_Item(0)
Even though it is redundant to
have two identical values in an IN
clause, it is still logically correct.
Performance is not greatly affected, and some memory space is saved by not
creating a separate cached query for this instance with only one element.
The final objects to be stored and cached are an executable
lambda expression, which takes the actual display class instances as parameters,
and the returned values ready to be inserted into the SQL statement as parameters. From
the expression obtained for each parameter, replace any Constant expression
holding a display class instance, with a Parameter expression of the exact type
of the display class. Put a LambdaExpression on top of the processed expression
and compile. That will become the executable expression ready to extract actual
parameters.
Read the data and convert
It is often a full object that is retrieved for a row of data
read, it is equally often a conversion or multiple conversions performed
upon the raw data from the database. Reading without a conversion performs a direct
match on object properties with data columns using, mostly, the object property
name or Column attribute attached. Reading with conversion comes in generally
two forms typically as:
from p in Persons select new OtherForm { Face = p.Name }
or
from p in Persons select ConvertToOtherForm(p)
For direct object creation, LINQ-to-SQL will add the field
names of the target object into the SELECT list of the SQL statement, which
looks similar to:
SELECT [t0].[name] AS [Face] FROM [person] AS [t0]
then, a direct field to property read is enough to get the
data.
For conversion using a method, there is no obvious way to
know what will be performed within the method, a full read is performed to get
the starting object and then invoke the conversion method.
Therefore, an executable LambdaExpression is needed in the
final object. The first part of the expression is a new object statement with
a specific DataReader.Get()
method to retrieve the corresponding values. The rest of
the expression is further conversions using static methods. Just for example,
the first query shown in this section will have a reading LambdaExpression
expressed as:
reader => new OtherForm() {Face = reader.GetString(0)}
Sample code
A set of sample codes, contained in a Visual Studio 2010
Test project, is provided as attachment. To get a sample database ready, we
need to run the test case “Preparation.RebuildDatabase” on a fully controlled
data connection to SQL Express. This test case will insert pseudo random values
for testing purposes.
The implementation of this approach is all included in the Prototype.Database.Utility
class, with the most important entry being the
extension method ExecuteList<T>()
. The sample code is a demonstration
of the practicality of the approach. Thus most of the implementation is just full
enough to make all test cases in the project working. Especially, the data
reading LambdaExpression is built on the assumption that object property names
are case-in-sensitively equal to field names, and property types match the
corresponding field types. This assumption may not always be valid in
a production environment.
Benchmark of the approach
A simple benchmark of the various reading methods is provided,
too, in the test project, under the name “Benchmark.SQLRepeatSingleLoad”. It is
to measure the average time executing a query with only one record returned.
The benchmark is performed on five methods: direct DataReader reading, direct
LINQ (with possible DataContext cache), compiled LINQ, LINQ to command to
DataReader, and finally the approach mentioned here.
Running the benchmark on a virtual machine gives the following
results:
0.114759ms Reader
0.186484ms ExecuteList (this approach)
0.315588ms Linq
0.615396ms Linq-Get-DataReader
2.300208ms Compiled Linq
which reveals the performance improvement over an ordinary LINQ implementation.
Final words
This approach sits some way between the DataReader way and
the LINQ-to-SQL way, it provides better maintainability through LINQ over
direct DataReader implementation and better performance over ordinary
LINQ-to-SQL in reading. And, the sample code is to show the feasibility of this approach, and is not flexible enough to be in a production environment without
a change.
Sorry for any bad naming and hardly any comments. Hope you enjoy reading the article and enjoy playing with the
sample.