Introduction
This article presents a few techniques for working with an 'in-clause' in queries, ADO.NET, Stored Procedures, and SQL Server.
Background
Over years of development, it can be interesting to try different ways to solve the same problem. In this case, it's searching for the best way to deal with a tricky issue in SQL Server and other RDBMSs: querying with an "in-clause" when the number of values for the clause isn't known ahead of time.
A typical scenario builds up a set of row IDs to query for. It could be a screen where the user picks a few values in a listbox or checks a bunch of boxes in a grid, or a batch process that identifies a bunch of IDs to work on. The IDs are usually unique values, though they sometimes show up more than once in the set.
So, how many different ways can we set up something like:
select * from sometable where id in ( X number of ids )
The part that makes this tricky is that each value in an in-clause has to be listed independently:
Select * from sometable where id in (1, 400, 33, 333)
So, using one variable to represent '1,400,33,333' just won't work:
DECLARE @ids varchar(100) = '1,400,33,333'
select * from sometable where id in (@ids)
Approaches
Okay, so how about a few approaches to work around this issue: dynamic SQL, parameterized SQL, Stored Procedure using delimited string and temp table, Stored Procedure using delimited string to parameter table, Stored Procedure using delimited string and variable table, Stored Procedure using XML, and Stored Procedure using value type. That should be enough to build on. I'm focusing on SQL Server 2008 here. There are other techniques for SQL Server and other RDBMSs.
When working with these approaches, there are a few things to keep in mind. How much network bandwidth do they use? How easy are they to prepare on the client? How much memory will they use? How well does the RDBMS work with it?
To work with these approaches, we have a simple database, In-Clause. There's one table, SomeString, that has an ID field and a string field. There are a few indexes for performance and uniqueness. There are two other tables used for the parameter tables. The database has a few Stored Procedures, a function, and even a user defined type. For code, I tried to keep the project simple with a Windows project and a MS Unit Test project. If setting this up for your own testing, make sure you set up enough space in the database files: tempdb, tempdb log, In-Clause, and In-Clause log. We want there to be enough space to make things run and not get hung up with files needing to grow (I eventually set them all to 100 MB).
A Little Code Discussion
Before seeing some results, let's talk a little about the approaches. There's a general class, TestClassDac, that has all the data calls. A Windows Form and Unit Tests run code in TestClassDac.
To make sure it works, TestClassDacTest in the test project has a few tests for each approach. The names follow the approach, like GetUsingDynamic_expect50, GetUsingDynamic_expect5000, and GetUsingDynamic_expect50000. "50,000!" Yep, that's a lot of IDs. Does it work? Depends. To help find the maximum input sizes, the Windows Form RunIt has buttons that will run code looking to find the maximum number of IDs that each approach can handle.
Dynamic SQL
Pretty simple approach here. Basically, a function, TestClassDac.GetUsingDynamic, takes a passed-in list of IDs to make a SQL statement. A helper function builds up the portion for the in-clause. A DataReader is used to execute the statement and return the data. A familiar set of commands is used to work with the results and create the collection of objects that the function returns.
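To give a feel for the shape of this approach, here's a minimal sketch. It is not the actual TestClassDac code; the method name, the "InClause" connection string key, and the TestClass property names are assumptions for illustration.

// Requires System.Data, System.Data.SqlClient, System.Configuration, System.Linq.
// Hedged sketch of the dynamic SQL approach; names noted above are assumptions.
public List<TestClass> GetUsingDynamicSketch(List<long> ids)
{
    List<TestClass> results = new List<TestClass>();

    // Build "1,400,33,333"; the IDs are numeric, so no quoting or escaping is needed.
    string inClause = string.Join(",", ids.Select(i => i.ToString()).ToArray());
    string sql = "SELECT Id, SomeString FROM SomeString WHERE Id IN (" + inClause + ")";

    using (SqlConnection con = new SqlConnection(
               ConfigurationManager.ConnectionStrings["InClause"].ConnectionString))
    using (SqlCommand cmd = new SqlCommand(sql, con))
    {
        con.Open();
        using (IDataReader dr = cmd.ExecuteReader())
        {
            while (dr.Read())
            {
                // Assumed TestClass properties; the article's code uses FillObject(dr, obj).
                results.Add(new TestClass { Id = dr.GetInt64(0), SomeString = dr.GetString(1) });
            }
        }
    }
    return results;
}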
Parameterized SQL
What is parameterized SQL? It's very similar to dynamic SQL. The method builds a SQL statement on the client and passes the string to the RDBMS to execute as a query. Instead of embedding the values in the SQL, though, the values are changed to parameters, and the parameters are set much like when working with Stored Procedures. The expectation is that the RDBMS will turn the SQL into something similar to a temporary Stored Procedure that can be re-used. Different RDBMSs handle parameterized SQL differently. The big step for performance is recognizing that a similar query already exists in the procedure cache, so the query information in the cache can be re-used. The code has functions and tests similar to dynamic SQL.
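As a rough sketch of the client-side build (the real code also applies a parameter schedule, discussed later in the results; the method name here is illustrative):

// Requires System.Data and System.Data.SqlClient.
// Illustrative sketch: one parameter per ID.
static SqlCommand BuildInClauseCommand(SqlConnection con, List<long> ids)
{
    SqlCommand cmd = con.CreateCommand();
    List<string> names = new List<string>();

    for (int i = 0; i < ids.Count; i++)
    {
        string name = "@p" + i;
        names.Add(name);
        cmd.Parameters.Add(name, SqlDbType.BigInt).Value = ids[i];
    }

    cmd.CommandText = "SELECT Id, SomeString FROM SomeString WHERE Id IN ("
                      + string.Join(",", names.ToArray()) + ")";
    return cmd;
}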
Delimited String to Temp Table
This approach uses a Stored Procedure that accepts a delimited list of IDs. So, the .NET code uses the list of IDs to create a delimited list of IDs, i.e., "1,4,6,33", in a string. The string is passed to the Stored Procedure. The Stored Procedure uses another procedure to create a temporary table populated with the values. The query, instead of using an in-clause, can now use an exists statement. As the building of the delimited string happens in a few places, a helper function has been made to create the string.
Delimited String to Variable Table
Similar to the Delimited String to Temp Table approach, this approach uses a function to return a Table variable populated with the values from the string. The Table variable can then be used much like the temporary table.
Delimited String to Parm Table
Creating temporary tables and table variables has an impact on the temporary database, and can cause a few more recompiles than desired, depending on how it's coded. A parameter table can be used to avoid using the temporary database. Values from the delimited string are inserted into the parameter table, grouped by a common identifier to differentiate the lists. There are a few ways to create the grouping, so two were tried here. One approach used a GUID/UniqueIdentifier to group. Another approach used a bigint to group. Again, the goal here is that the temporary database won't be used as much and the Stored Procedure won't recompile as often.
XML
This one might appear a little spooky at first. The documentation needs some work to make it a little easier to use; it reads as if the reader already knows XQuery, and as if the folks writing it hadn't seen typical XML generated in .NET. So, this example could save somebody a few hours of pecking. A helper function builds an XML string from the list of IDs. The XML is passed to a Stored Procedure as an XML parameter. The Stored Procedure, using a simple XQUERY, can use the XML as a table. Using XML this way is amazing.
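A sketch of the StringBuilder-based helper might look like this (the element names match the sample XML shown later in the article; the method name is illustrative):

// Requires System.Text and System.Collections.Generic.
// Builds the <L><I>id</I>...</L> shape used by the Stored Procedure.
// Numeric IDs need no XML escaping, so a StringBuilder is enough here.
public static string ToIdXml(IEnumerable<long> ids)
{
    StringBuilder sb = new StringBuilder("<L>");
    foreach (long id in ids)
    {
        sb.Append("<I>").Append(id).Append("</I>");
    }
    sb.Append("</L>");
    return sb.ToString();
}

The string can then be assigned to a SqlParameter of type SqlDbType.Xml on the command.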
Typed Value
New kids on the block are Typed Values and structured parameters. They appear to be built to speed up passing DataTables, but we can use them too. In this example, a helper function builds up a list of structured data. A simple user defined type describes the structured data as a Typed Value for the Stored Procedure to use. The collection is passed to the Stored Procedure all at once. It's almost like passing classes to a Stored Procedure, but not quite. In this case, once at the Stored Procedure level, the data can be treated like a table, but it's not really a table.
Critiquing Results: Performance vs. Maintainability
Whenever reviewing a tool or a technique in an engineering environment, it's important to look at its performance and maintainability. When working with the techniques here, performance factors include first-run time, average-run time, capability, and footprint. A couple of factors for maintainability include complexity and accessibility.
So, let's see how the techniques did.
There are unit tests written to run the same methods over and over with a fixed number of random inputs, and to run the same methods over and over with a random number of random inputs. There's a Windows Forms application that has a few buttons that try to find the maximum number each approach can support. The Windows Form was used to watch each method progress toward its maximum in increments of 1,000, usually letting it run to an exception. The fixed-number-of-random-inputs tests start with SetTimesForSetRandom, and run 10,000 times with 50 random numbers between 0 and 2,500. The random-number-of-random-inputs tests start with SetTimesForRandom; they run 10,000 times with a random number of inputs between 2 and 200 and random numbers between 0 and 2,500. We tried to use similar techniques to time building the inputs, executing the query, and collecting the data.
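As a rough illustration of the fixed-random timing loop (not the actual test code; the TestClassDac usage and the GetUsingDynamic signature are assumptions here):

// Requires System, System.Collections.Generic, and System.Diagnostics.
// Illustrative timing loop: 10,000 runs of 50 random IDs between 0 and 2,500.
Random random = new Random();
TestClassDac dac = new TestClassDac();                  // data access class from the article
System.Diagnostics.Stopwatch watch = System.Diagnostics.Stopwatch.StartNew();

for (int run = 0; run < 10000; run++)
{
    List<long> ids = new List<long>();
    for (int i = 0; i < 50; i++)
    {
        ids.Add(random.Next(0, 2500));                  // random ID between 0 and 2,500
    }
    dac.GetUsingDynamic(ids);                           // swap in the approach under test
}

watch.Stop();
Console.WriteLine("Total: {0} ms", watch.ElapsedMilliseconds);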
Dynamic SQL
This technique performs "OK" for this scenario, basically passing in IDs to retrieve a set of rows from one table. There are RDBMS limits on how big the SQL statement can be, so they have to be kept in mind. Now, what's interesting with later versions of SQL Server is that it will make an attempt to parameterize the statement. Check out this article at http://www.sqlteam.com/article/introduction-to-parameterization-in-sql-server or even the help books. In summary, SQL Server 2005 and 2008 try to change fixed values to parameters before creating a plan and running the query. If a query shows up that changes only the fixed values, it will re-use the old plan. Now, all this work may cost a little, but that cost may be less than the cost for SQL Server to start all over and figure out a plan. While running the SQL Profiler, it's possible to see SQL Server re-using a dynamic SQL statement that only changes its values; to see this, turn on the cache hit events. SQL Server's approach also helps keep the amount of memory in the procedure cache to a minimum. Some RDBMSs, or older versions, will place the SQL statement itself into the procedure cache, so unless a statement matches exactly, it results in another entry in the procedure cache. In the testing here, the dynamic SQL statement reached a limit on the number of IDs that could be passed in. Other RDBMSs may reach different limits because of how they work with the in-clause. A problem with this approach is that if the number of IDs changes, the same plan won't be re-used by SQL Server and the procedure cache will start to fill.

From a maintainability point of view, dynamic SQL is in the code and so it goes with the code. Being in the code, installers have one less installation to worry about in the environment; there are no Stored Procedures. If the query is written in ANSI-compliant SQL, most RDBMSs will accept it, making the code RDBMS-transportable. Another interesting point is that with different dynamic SQL, each call will be compiled using the latest statistics. Compiling with the latest statistics can be good, and it can be bad.
Moving on, how well did Dynamic SQL do in this simple test environment:
Fixed Number of Random | 93,271 ms
Random Number of Random | 232,745 ms
Max inputs | 31,000 (stopped due to error)
As can be seen, SQL Server works well when a dynamic query has a similar signature to a previously run query, so SQL Server's parameterizing appears to work. When the queries are not similar, its performance is that of running a query for the first time, over and over. The maximum input count makes sense, as the query sent over is at least 62,000 characters long. In fact, the error message is "insufficient system memory".
Parameterized SQL
Similar to Dynamic SQL, the client creates the query. Now, something to notice about this code is that a schedule is in use. The schedule breaks the SQL in-clause into a couple of in-clauses, per the schedule. Why? Because we need to try to make the query look the same to the RDBMS over and over. To make it look the same and re-use procedure caching, the number of parameters needs to be the same. To keep the number of parameters and the number of query signatures to a minimum, the in-clause can be broken up and OR'd together. The hard question is how to set up the schedule. If we set the number of inputs in the schedule to 2,100 parameters, that would be a large query to send over and over. On the downside, to make a schedule work, we have to "OR" the in-clauses together to find the result, and using "OR" to tie the in-clauses together can be slow. To make it even more difficult, some RDBMSs will only support so many values or parameters in an in-clause, e.g., 1,000 items. Thinking in terms of scenarios, consider a schedule of 10, 15, and 25. With 10 inputs, there would be only one in-clause. With 14 inputs, there would be two in-clauses, with the remaining 11 parameters padded with the value of the last input. With 20 inputs, there would be two in-clauses, with the remaining 5 parameters padded with the value of the last input. At this point, there would be only two query signatures, and the 20-input query would re-use the 14-input query plan if it's still available. So, if you try a schedule approach, this at least is a start.
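Here's a small sketch of the padding idea using the 10/15/25 schedule from the example. The names are illustrative, and the sketch assumes a non-empty list that fits within the schedule's total capacity.

// Requires System.Collections.Generic.
// Pads the ID list out to the next schedule boundary so the query text, and
// therefore the cached plan, stays the same across a range of input counts.
static readonly int[] Schedule = { 10, 15, 25 };

public static List<long> PadToSchedule(List<long> ids)
{
    // Find how many parameter slots the padded query will have.
    int capacity = 0;
    foreach (int block in Schedule)
    {
        capacity += block;
        if (capacity >= ids.Count) break;
    }

    // Repeat the last value to fill the unused slots; duplicates don't change an IN(...) result.
    List<long> padded = new List<long>(ids);
    long last = ids[ids.Count - 1];
    while (padded.Count < capacity)
    {
        padded.Add(last);
    }
    return padded;
}

The padded list is then split into one in-clause per block, and the blocks are OR'd together, e.g., WHERE Id IN (@p0...@p9) OR Id IN (@p10...@p24).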
Fixed Number of Random | 17,716 ms
Random Number of Random | 71,401 ms
Max inputs | 2,100 (per the documentation)
As we can see, Parameterized SQL did better than Dynamic SQL, but has a much lower limit on the number of parameters it supports.
Delimited String to Temp Table
The old tried and trusted technique. Accept a delimited string as a parameter, put the values into a temp table, and work with it. The approach makes a lot of temporary tables. It uses Stored Procedures. The temporary tables can be re-used in the Stored Procedure. If the Stored Procedure is written well, it won't need to re-compile and so the query plan can be re-used. The nasty part about this approach is the time it takes to write to the temporary table. So, if you use this technique, make sure you have a fast tempdb set up.
Fixed Number of Random | 115,582 ms
Random Number of Random | 211,766 ms
Max inputs | 87,000 (stopped at this value due to a timeout error)
Your environment will probably produce different values. We can see the performance drop from having to populate another table, but the procedure can support many more inputs.
Delimited String to Variable Table
A variation on a temporary table, but using a table returned from a function. Table variables act a little differently from temp tables. Here's a blog discussing the latest version of this variable: SQL Server Storage Engine.
Fixed Number of Random | 26,534 ms
Random Number of Random | 50,430 ms
Max inputs | 88,000 (stopped at this value due to a timeout error)
The results show table variables in this scenario perform fairly well compared to temp tables. However, as the size reaches a certain point, the variable starts to act like a temporary table.
A function helps to make the table variable:
CREATE FUNCTION [dbo].[ufnDelimitedBigIntToTable]
(
    @List varchar(max), @Delimiter varchar(10)
)
RETURNS @Ids TABLE (Id bigint)
AS
BEGIN
    DECLARE @list1 VARCHAR(MAX), @Pos INT

    -- Append a trailing delimiter so the last value is picked up by the loop
    SET @List = LTRIM(RTRIM(@List)) + @Delimiter
    SET @Pos = CHARINDEX(@Delimiter, @List, 1)

    WHILE @Pos > 0
    BEGIN
        -- Take the next value off the front of the string
        SET @list1 = LTRIM(RTRIM(LEFT(@List, @Pos - 1)))
        IF @list1 <> ''
            INSERT INTO @Ids(Id) VALUES (CAST(@list1 AS bigint))

        SET @List = SUBSTRING(@List, @Pos + 1, LEN(@List))
        SET @Pos = CHARINDEX(@Delimiter, @List, 1)
    END
    RETURN
END
Once made, the table-function can be used in a query:
CREATE PROCEDURE [dbo].[GetUsingDelimitedFunctionTable]
@Ids varchar(max)
AS
BEGIN
SET NOCOUNT ON
SELECT s.Id,s.SomeString
FROM SomeString s (NOLOCK)
WHERE EXISTS ( SELECT *
FROM ufnDelimitedBigIntToTable(@Ids,',') Ids
WHERE s.Id = Ids.id )
END
This code produces an interesting query plan, but it works fine. EXISTS helps logically, as all we care about is whether an instance of a value exists. One thought is to use an INNER JOIN. INNER JOIN, though, would return multiple rows if a value shows up more than once in the delimited string. If multiple rows are not desired, a DISTINCT will be needed somewhere with the INNER JOIN. It's really up to the scenario to help decide the approach. In this case, we're trying to simulate an in-clause.
Delimited String to Parm Table
Much like the temporary table approach, but it writes to a permanent table instead. The permanent table acts as a parm table. At first, this was done the same way as the temporary tables, inserting one row at a time. In the second attempt, the function that converts the input delimited string to a table was used, and the values from the returned table were inserted into the parm table all at once. To keep the records grouped, the inserted group is accompanied by a GUID. Here, the GUID goes into the field JobId. There is another set of tests that uses a bigint instead of a GUID. The bigint approach gets its value from another table to ensure uniqueness and to know when it was created. The performance was slower and sometimes a little faster than a temporary table. Though the use of a table variable does somewhat hit the tempdb, there are a lot fewer temporary tables made. One nice point of having the data in a permanent table is that the data can be re-used, for example for paging or for making different views of the results. Every once in a while, the parm table should be cleared of rows so that it won't grow too much. The one thing to watch out for is keeping enough data in the table so that the procedure won't recompile or use outdated statistics. It's either that or forcing a plan.
Some results using GUID:
Fixed Number of Random | 137,808 ms
Random Number of Random | 203,800 ms
Max inputs | 87,000 (stopped at this value due to a timeout error)
Some results using BIGINT:
Fixed Number of Random | 74,307 ms
Random Number of Random | 134,114 ms
Max inputs | 83,000 (stopped at this value due to a timeout error)
Again, your numbers will vary from these. Lowering the JobId column size appears to have helped. Lowering the JobId further to an integer should help more. If lowering to an integer, keep in mind that the maximum value of an integer (about 2 billion) means the need to reset it comes around a little sooner.
XML
This approach passes in an XML string. One approach for creating the XML string used the XmlTextWriter, another used a StringBuilder. If you are feeling adventurous, there's one that uses streams. There are a few test methods for comparing their performance. Here, we settled on the StringBuilder to make the XML. With the XML being easy to create, most .NET programmers knowing how to use it, and the StringBuilder working faster than the XmlTextWriter, the choice was easy. The stream approach works very well, but it would add an advantage that, for this testing, might not keep the playing field level. After seeing the stream approach, you may decide to use it instead of the StringBuilder in many instances; just watch the encoding. The XML built is fairly basic. Getting it to work in a Stored Procedure takes a little poking and pecking. Most .NET coders are not familiar with XML in Stored Procedures, the documentation is rough (in our humble opinion), and there isn't much of an Internet community discussing it. Here are a couple of articles: Shred XML data with XQuery in SQL Server 2005, Introduction to XQuery in SQL Server 2005. One big advantage of XML is that it's not a table. By not being a table, there's no locking, and it's in memory. There are different ways to work with XML. The code here basically changes the XML to a table so it's easy to work with. So, there isn't much of a need to learn a ton of XQUERY, though 'exists' might be something to try.
Fixed Number of Random | 27,935 ms
Random Number of Random | 49,701 ms
Max inputs | 150,000 (manually stopped at this value)
Yes! That's a severalfold improvement in performance. Working in memory can be wonderful. Now, where is my Distributed Main Memory RDBMS? Oh yeah, not out yet. Working on data in memory could eventually become a scalability issue, but with smaller sets of data, it should work well.
In this approach, we passed in XML like:
'<L><I>1</I><I>2</I><I>3</I><I>4</I><I>5</I><I>6</I> <I>7</I><I>8</I><I>99999</I><I>2</I><I>3</I><I>4</I></L>'
The Stored Procedure uses the XML with:
SELECT s.Id, s.SomeString
FROM SomeString s (NOLOCK)
WHERE EXISTS ( SELECT *
               FROM ( SELECT T.c.value('.','bigint') AS Id
                      FROM @Ids.nodes('/L/I') T(c) ) Ids
               WHERE s.Id = Ids.Id )
There may be a few other ways to tweak this, but this worked fine. Maybe later on, you can mess around with something like this (note the sql:variable() syntax for referencing a T-SQL variable from XQuery):
select @test.exist('/L/I[text()[1] = sql:variable("@somevariable")]')
Typed Value
A new tool. It's a little different to work with on the client, but still fairly easy. A big advantage of this approach is that it keeps the passed-in data in memory, not in a table. Here's a link to help with documentation on them: Typed Value Parameter.
Fixed Number of Random | 28,194 ms
Random Number of Random | 37,443 ms
Max inputs | 150,000 (manually stopped at this value)
The Typed Value approach performed about the same as the XML approach. Many will find them easier to work with than XML. An issue with this approach is the transportability of the code; it's not ANSI. Other RDBMSs and their .NET clients may have similar approaches, like arrays, and they may be worth looking into for performance. Since Typed Values reside in memory, there may be a few concerns on scalability.
The Typed Value was easy to create with:
CREATE TYPE [dbo].[IdsTableType] AS TABLE(
[Id] [bigint] NULL
)
The code in C# to fill in a collection was easy:
public List<SqlDataRecord> ListToSqlDataRecord(List<long> items)
{
    List<SqlDataRecord> records = new List<SqlDataRecord>();
    SqlDataRecord recordProto;
    SqlMetaData[] metas = new SqlMetaData[] { new SqlMetaData("Id", SqlDbType.BigInt) };

    foreach (long item in items)
    {
        recordProto = new SqlDataRecord(metas);
        recordProto.SetInt64(0, item);
        records.Add(recordProto);
    }
    return records;
}
Then, it's a little code to use this procedure and get results:
const string sqlGetUsingTypeValue = "GetUsingTypeValue";

public List<TestClass> GetUsingSPTypeValue(List<long> ids)
{
    List<TestClass> results = new List<TestClass>();
    List<SqlDataRecord> idsDataRecords = ListToSqlDataRecord(ids);

    using (SqlConnection con = new SqlConnection())
    {
        con.ConnectionString =
            ConfigurationManager.ConnectionStrings[DatabaseKey].ConnectionString;
        SqlCommand cmd = con.CreateCommand();
        cmd.CommandText = sqlGetUsingTypeValue;
        cmd.CommandType = CommandType.StoredProcedure;
        SqlParameter parm = cmd.Parameters.Add(parmIds, SqlDbType.Structured);
        parm.TypeName = "dbo.IdsTableType";
        parm.SqlValue = idsDataRecords;
        con.Open();

        using (IDataReader dr = cmd.ExecuteReader())
        {
            while (dr.Read())
            {
                TestClass obj = new TestClass();
                FillObject(dr, obj);
                results.Add(obj);
            }
        }
    }
    return results;
}
The tricky part is matching the parameter's TypeName to the user-defined table type.
Conclusion
The code here provides a few examples of ways to work with the in-clause, SQL Server, and .NET. The XML and Typed Value approaches worked great for performance. All approaches have positives and negatives. The times listed here are general values, and not official by any means. One set of information not included here is the first-run times; the environment used produced results too erratic to be reliable. So, make sure you test to find what's best for your environment. Hopefully, some of the code can help improve decisions you make in development.
And, of course, after working on this write-up, I found there's a decent discussion on this subject and other SQL Server common programming issues at Erland Sommarskog's home page.
To run the demo
Restore the back-up of the InClause database from InClause.bak. Change the connection string in the config files in the Windows project and Test Project. Load up SomeString with rows of data by pressing the 'makedata' button on the form. The input next to the button is for setting the number of rows to create. Then, use the buttons and Unit Tests to try things out.