Cache IQueryable for Better LINQ-to-SQL Performance

Zimin Max Yang

4.69/5 (10 votes)

21 May 2012CPOL10 min read

61.4K

448

An approach to improve LINQ-to-SQL performance while preserving maintainability over DataReader.

Download prototype - 18 KB

Introduction

Ordinary DataReader methods which use SQL statements to retrieve data directly from databases are most straightforward, and are probably the ultimate function blocks beneath all other higher architectures. However, maintaining a SQL statement written in plain text is a pain. Whenever there is a change in the database structure, all SQL statements must be reviewed for correctness. A small typo of field name can only be revealed by running all tests or even only after running in the real production environment, which means a very high maintenance cost.

Frameworks like Hibernate do a great job in linking class definitions in source code with tables in the database. However, since relations defined are not an integral part of the source code, change of database requires both change in relation definition and change of logic in source code. All of the above changes will not be “syntax-checked” by the compiler, meaning errors are more often found only by running the program.

The emergence of LINQ-to-SQL brought light to this maintenance problem. The relations of tables, the conditions of query, and many other parts around SQL are now an integral part of the source code. They will be compiler checked, and will be turned into SQL statement at run-time with little human intervention. Small typos and ‘miss change’ problems are prevented before running by the compiler.

However, LINQ-to-SQL does have its problems, mainly its execution overhead and caching. Executing a LINQ-to-SQL statement takes much longer time than executing a direct SQL statement through DataReader. The probable cost might be from converting LINQ expressions to SQL commands, and retrieving query data into objects, and creating other necessary meta data. This cost is partly saved by using cache to store retrieved data for later use. However, a cache may cause discrepancies between cached data with actual data in the database, when multiple applications are modifying the database.

The purpose of this article is to provide a data retrieving method with the high maintainability of LINQ-to-SQL with the minimum overhead cost of executing.

Microsoft SQL LINQ-to-SQL is the target system of this article. Other LINQ-to-SQL systems such as DbLinq and ALinq can be used too, but will need a slightly different coding to make it work. The reason is that all vendors of LINQ-to-SQL have their implementation of IQueryable and DataContext.

Approach

IQueryable is the core class to LINQ-to-SQL. It holds a LINQ expression tree constructed along the executing path, and ready to be turned into a SQL command when required. For those LINQ methods such as Where() and Select(), they do nothing but modify the IQueryable’s expression tree to take on more information. Following is the expression tree from the LINQ statement “from p in Persons where p.Id > id select p” defined in a method body.

1. Call Where()
2. __Constant Table(Person) 
3. __Quote
4. ____Lambda 
5. ______Equal 
6. ________MemberAccess Id
7. __________Parameter p 
8. ________MemberAccess id 
9. __________Constant value(Prototype.Database.AllQueries+<>c__DisplayClass2

Expression Tree #1

DisplayClass2 is an “anonymous” method-context class defined by the compiler and created at runtime to help store values necessary for a LINQ expression. For every method call encountered during IQueryable construction, there will be such a DisplayClass created. (Call to one method multiple times will introduce multiple instances of the same type of DisplayClass). They are constant to one particular instant of IQueryable, even though they may have different values within different instances of IQueryable.

The above expression tree will be transformed into the following SQL statements before executing reading.

SQL

SELECT [t0].[id] AS [Id],
[t0].[name] AS [Name], [t0].[birthday] AS [Birthday], [t0].[value] AS [Value] 
FROM [person] AS [t0] 
WHERE [t0].[id] = @p0

At this point, it is not hard to see that the only variable of the above SQL command is its parameter @p0, which is associated with the expression branched at line 8 in Expression Tree #1. From this observation, there could be a less-overhead operation if the relationship between IQueryable expression branches and SQL command parameters is stored, along with the SQL command itself. When such an IQueryable is to be executed, its associated SQL command is reused, and the SQL command parameters are re-evaluated from the IQueryable, eliminating the effort of recreating a new SQL command object from the expression tree.

Therefore, the descriptive steps to execute a cached LINQ-to-SQL query is as follows:

Find the matching cached query

Build one if no match is found

Get the SQL command (template for future use) using DataContext.GetCommand
Build the display class instance getters
Build the SQL parameter getters using display class instances as input

Run the display class instance getters to obtain the actual display class instances
Set the parameters in the SQL template using values executing parameter getters
Execute the modified SQL statement and retrieve the DataReader
Read from the DataReader and do the conversion

The key steps involved in this approach are:

Differentiate IQueryables
Retrieve the display class instances from the current IQueryable
Reconstruct the SQL command parameters using the retrieved display class instances
Reading the DataReader into objects with conversions

Compare IQueryables

Since every IQueryable stores an expression tree, comparing IQueryables is nothing but comparing their expression trees. The comparison is performed node by node and branch by branch. Any difference found means the two IQueryables are different in essence. A Constant expression, such as line 8 in the example, will only need to compare to the type, not to the value, with one exception.

One particular interesting case happens when there is a Contains() method call involving an array of values. For example

LINQ:

var ids = new List<int> {1, 2, 3, 4, 5, 20, 70 };
from p in persons where
ids.Contains(p.Id) select p;

Its expression tree:

1. Call Where()
2. __Constant Table(Person) 
3. __Quote
4. ____Lambda
5. ______Call ids.Contains()
6. ________MemberAccess p.Id
7. __________Parameter p

Expression Tree #2

Its SQL command:

SQL

SELECT [t0].[id] AS [Id],
[t0].[name] AS [Name], [t0].[birthday] AS [Birthday], [t0].[value] AS [Value] 
FROM [person] AS [t0] 
WHERE [t0].[id] IN (@p0, @p1, @p2, @p3, @p4, @p5, @p6)

The form of SQL command will be different if the number of values in the array (list) is different. The simplest way is to treat IQueryables with different element counts as unique individuals, but the downside of it is more cached commands in the memory. One alternative is to reuse a cached command with a longer list. By doing this, setting the actual values for the command parameters need additional logic to using existing array values in place of non-existing array values. Comparing the increased memory demands of the simplest solution, the extra execution time of the latter seems to be an acceptable cost. Therefore, the longer list alternative is used in this approach to deal with the Contains() query.

Retrieve display class instances

From this step up, all actions will be accelerated by using compiled delegates. The composing and compilation of these delegates is costly. But it saves from the second time on, and it is essential to the performance of this method.

Interestingly, IQueryable does not hold a list of display class instances explicitly, instead, those instances are buried inside the expression tree. To find out all the instances requires a recursive walk over all levels of child expressions. Since there could be multiple instances of the same display type (explained before, as in the case of multiple calls to the same method during composing of a query), strict comparisons of display class instance are used, and a full walk is necessary. The walk results are logical “paths” from the top node of the expression tree down to the Constant expression node holding one particular display class instance. The logical path to retrieve the display class instance in the first LINQ example is:

((((((ExpressionRoot).Arguments.get_Item(1)).Operand).Body).Right).Expression).Value

Such logical paths found for one IQueryable are composed into expression trees and compiled into executable delegates. They are ready to be used in the next phase.

Load command parameters

As the query expression tree is so constructed, there are certain mapping relationships of one command parameter from one branch in the expression tree. By observation, a parameter branch is the largest branch without Parameter expression or Constant Table expression. As in Expression Tree #1, one parameter exists from line 8 and onwards.

In case of multiple parameters, the sequence of the parameters is determined by the LINQ-to-SQL implementation, vendor by vendor, and specific to each expression node. For example, in a Binary expression in a Microsoft implementation, any parameter found with in the Left branch will have precedence over parameters found within the Right branch.

For the Contains() method on the array/list, each individual value of the array/list is rated as a parameter and the sequence is usually their sequence appearing in the array/list.

As mentioned before, a cached query with a longer list is used in this approach. Therefore the actual array/list in a running query always contains equal or less elements than the array/list used to build the cache. Retrieving the elements can not always be straightforward by its index. Instead, a length check is performed before retrieving. If the current index is larger than the length of the actual array/list, index 0 is used, else use the index directly. The final parameter expression looks like the following:

Param_0.ids.get_Item(IIF((Param_0.ids.Count > 0), 0, 0))
Param_0.ids.get_Item(IIF((Param_0.ids.Count > 1), 1, 0))

Instead of:

Param_0.ids.get_Item(0) 
Param_0.ids.get_Item(1)

Running these expressions on a list with only one element is equivalent to

Param_0.ids.get_Item(0)
Param_0.ids.get_Item(0)

Even though it is redundant to have two identical values in an IN clause, it is still logically correct. Performance is not greatly affected, and some memory space is saved by not creating a separate cached query for this instance with only one element.

The final objects to be stored and cached are an executable lambda expression, which takes the actual display class instances as parameters, and the returned values ready to be inserted into the SQL statement as parameters. From the expression obtained for each parameter, replace any Constant expression holding a display class instance, with a Parameter expression of the exact type of the display class. Put a LambdaExpression on top of the processed expression and compile. That will become the executable expression ready to extract actual parameters.

Read the data and convert

It is often a full object that is retrieved for a row of data read, it is equally often a conversion or multiple conversions performed upon the raw data from the database. Reading without a conversion performs a direct match on object properties with data columns using, mostly, the object property name or Column attribute attached. Reading with conversion comes in generally two forms typically as:

from p in Persons select new OtherForm { Face = p.Name }

from p in Persons select ConvertToOtherForm(p)

For direct object creation, LINQ-to-SQL will add the field names of the target object into the SELECT list of the SQL statement, which looks similar to:

SQL

SELECT [t0].[name] AS [Face] FROM [person] AS [t0]

then, a direct field to property read is enough to get the data.

For conversion using a method, there is no obvious way to know what will be performed within the method, a full read is performed to get the starting object and then invoke the conversion method.

Therefore, an executable LambdaExpression is needed in the final object. The first part of the expression is a new object statement with a specific DataReader.Get() method to retrieve the corresponding values. The rest of the expression is further conversions using static methods. Just for example, the first query shown in this section will have a reading LambdaExpression expressed as:

reader => new OtherForm() {Face = reader.GetString(0)}

Sample code

A set of sample codes, contained in a Visual Studio 2010 Test project, is provided as attachment. To get a sample database ready, we need to run the test case “Preparation.RebuildDatabase” on a fully controlled data connection to SQL Express. This test case will insert pseudo random values for testing purposes.

The implementation of this approach is all included in the Prototype.Database.Utility class, with the most important entry being the extension method ExecuteList<T>(). The sample code is a demonstration of the practicality of the approach. Thus most of the implementation is just full enough to make all test cases in the project working. Especially, the data reading LambdaExpression is built on the assumption that object property names are case-in-sensitively equal to field names, and property types match the corresponding field types. This assumption may not always be valid in a production environment.

Benchmark of the approach

A simple benchmark of the various reading methods is provided, too, in the test project, under the name “Benchmark.SQLRepeatSingleLoad”. It is to measure the average time executing a query with only one record returned. The benchmark is performed on five methods: direct DataReader reading, direct LINQ (with possible DataContext cache), compiled LINQ, LINQ to command to DataReader, and finally the approach mentioned here.

Running the benchmark on a virtual machine gives the following results:

0.114759ms Reader
0.186484ms ExecuteList (this approach)
0.315588ms Linq
0.615396ms Linq-Get-DataReader
2.300208ms Compiled Linq

which reveals the performance improvement over an ordinary LINQ implementation.

Final words

This approach sits some way between the DataReader way and the LINQ-to-SQL way, it provides better maintainability through LINQ over direct DataReader implementation and better performance over ordinary LINQ-to-SQL in reading. And, the sample code is to show the feasibility of this approach, and is not flexible enough to be in a production environment without a change.

Sorry for any bad naming and hardly any comments. Hope you enjoy reading the article and enjoy playing with the sample.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)