Custom Aggregates in SQL Server

Wendelius

4.95/5 (24 votes)

26 Mar 2011CPOL8 min read

97.1K

575

SQL Server not having the aggregates you need? Why not build your own.

Download source - version 2 - 9.61 KB

Introduction

SQL Server doesn’t have so many aggregates to use. The basics, such as COUNT, MIN, MAX, etc., are implemented but still the list is quite small. This may lead to a situation where some of the calculations must be done in procedures, functions or even at client side.

However, SQL Server includes both CLR integration and the ability to define an aggregate implemented in .NET project. This combination makes it possible to create custom aggregates.

The First Aggregate, Calculating Product

C# Implementation

Creating an aggregate is quite straightforward. We need a simple class library (DLL) project targeting .NET Framework 3.5 (maximum for SQL Server 2008), implement the necessary structures and then register the assembly and the aggregates into the SQL Server.

An aggregate is created by defining a struct with SqlUserDefinedAggregate – attribute. Using the attribute, we can define for example the following options:

Format: Serialization format for the struct. This is typically either Native or UserDefined. In case of Native format, the framework handles all the necessary steps to serialize and deserialize the structure.
IsInvariantToDuplicates (bool): Does receiving the same value twice or more affect the result
IsInvariantToNulls (bool): Does receiving a NULL value change the result
IsInvariantToOrder (bool): Does the order of values affect the result
IsNullIfEmpty (bool): Does an empty set result to a NULL value
Name (string): Name of the aggregate

The struct itself must contain at least the following methods:

Init: This is called when a new group of values is going to be handled using an instance of the structure
Accumulate: Each value is passed to the Accumulate method which is responsible for making the necessary calculations, etc.
Merge: This method is used when the original set of values is divided into several independent groups and after accumulating the group specific values, the group is merged to another group.
Terminate: And finally when all values have been handled, Terminate returns the result.

Note that the structure may be re-used while using the aggregate. For this reason, it’s important to do all necessary initializations in the Init method and not to trust that the instance of a struct is a fresh one.

Merging is best explained with a small (simplified) diagram:

The query processor may divide a set of values to a smaller subsets called groups. When the aggregation is done, each group has its own instance of the structure to handle the subset. Each instance is first initialized and the accumulation is done for each value in the group. After this, the group is merged to another group. Finally, when all the groups have been merged, the aggregation is terminated and the result is returned to the consumer. For this reason, the aggregate must be designed to support operating on partial sets at accumulation time.

So far, lots of text and no code. So let’s have a look at the structure:

/// <summary>
/// Calculates the product of numerical values
/// </summary>
[System.Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(
   Microsoft.SqlServer.Server.Format.Native,
   IsInvariantToDuplicates = false, // receiving the same value again 
				 // changes the result
   IsInvariantToNulls = false,      // receiving a NULL value changes the result
   IsInvariantToOrder = true,       // the order of the values doesn't 
				 // affect the result
   IsNullIfEmpty = true,            // if no values are given the result is null
   Name = "Product"                 // name of the aggregate

)]
public struct Product {
   /// <summary>
   /// Used to store the product
   /// </summary>
   public System.Data.SqlTypes.SqlDouble Result { get; private set; }

   /// <summary>
   /// Used to inform if the accumulation has received values
   /// </summary>
   public bool HasValue { get; private set; }

   /// <summary>
   /// Initializes a new Product for a group
   /// </summary>
   public void Init() {
      this.Result = System.Data.SqlTypes.SqlDouble.Null;
      this.HasValue = false;
   }

   /// <summary>
   /// Calculates the product of the previous values and the value received
   /// </summary>
   /// <param name="number">Value to include</param>
   public void Accumulate(System.Data.SqlTypes.SqlDouble number) {
      if (!this.HasValue) {
         // if this is the first value received
         this.Result = number;
      } else if (this.Result.IsNull) {
         //if the calculated value is null, stay that way
      } else if (number.IsNull) {
         //if the value received is null the result is null
         this.Result = System.Data.SqlTypes.SqlDouble.Null;
      } else {
         //multiply the values
         this.Result = System.Data.SqlTypes.SqlDouble.Multiply(this.Result, number);
      }
      this.HasValue = true;
   }

   /// <summary>
   /// Merges this group to another group instantiated for the calculation
   /// </summary>
   /// <param name="group"></param>
   public void Merge(Product group) {
      // Count the product only if the other group has values
      if (group.HasValue) {
         this.Result = System.Data.SqlTypes.SqlDouble.Multiply
				(this.Result, group.Result);
      }
   }

   /// <summary>
   /// Ends the calculation and returns the result
   /// </summary>
   /// <returns></returns>
   public System.Data.SqlTypes.SqlDouble Terminate() {
      return this.Result;
   }
}

This aggregate is using only blittable data types which means that both SQL Server and the .NET Framework have common representation for the fields defined in the struct. Because of this, the aggregate is defined as Native in format and no additional steps need to be taken for the serialization.

The four mandatory methods are implemented and since this is a very trivial calculation, I believe that the code needs no thorough explanation. Only one note: Because of possible NULL values in the value set, NULL is handled as a special case so that NULL times anything always yields to NULL.

Registering to SQL Server

After the project has been successfully built, the next step is to register the assembly into the SQL Server. Before adding the assembly to the database, I chose to create a new schema. This is not mandatory but I felt that it would be nice to have the custom aggregates in one, separate place:

SQL

-- Create a new schema for the aggregates
CREATE SCHEMA Aggregates;

After creating the schema, let’s upload the assembly:

SQL

-- Add the assembly into SQL Server
--
-- NOTE: Change the disk and the path!
--
CREATE ASSEMBLY CustomAggregates
   AUTHORIZATION dbo
   FROM '?:\???\CustomAggregates.dll'
   WITH PERMISSION_SET SAFE;

Now the assembly is stored within SQL Server so we can register the newly created aggregate to the database:

SQL

-- Add the aggregate into SQL Server
CREATE AGGREGATE Aggregates.Product (@number float) RETURNS float
   EXTERNAL NAME CustomAggregates.Product;

The aggregate is created in the Aggregates –schema. Data type float is used since this is the equivalent data type in SQL Server for the SQLDouble data type.

So now we’re ready to test the aggregate. First, a simple test-run:

SQL

-- Test-run 1
SELECT Aggregates.Product(a.Val) AS Result
FROM (SELECT 2.1 AS Val UNION ALL
      SELECT 5.3 AS Val) a

And the result is:

Result
------
11,13

Now the second test run, what if the set contains a NULL:

SQL

-- Test-run 2, NULL in the set
SELECT Aggregates.Product(a.Val) AS Result
FROM (SELECT 2.1  AS Val UNION ALL
      SELECT NULL AS Val UNION ALL
      SELECT 5.3  AS Val) a

The result is NULL as discussed earlier:

Result
------
NULL

And the last test-run is to use an empty set:

SQL

-- Test-run 3, empty set
SELECT Aggregates.Product(NULL) AS Result

This also results NULL as it should.

A Slight Enhance, Geometric Mean

The next aggregate is very similar to the product since its geometric mean. Geometric mean is defined:

Now since this calculation cannot be done in accumulation (in this form) we calculate the product in accumulation and merge steps and the final result is calculated in the Terminate method. The structure could look like the following:

/// <summary>
/// Calculates the geometric mean of numerical values
/// </summary>
[System.Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(
   Microsoft.SqlServer.Server.Format.Native, 
   IsInvariantToDuplicates = false, // receiving the same value again 
				 // changes the result
   IsInvariantToNulls = false,      // receiving a NULL value changes the result
   IsInvariantToOrder = true,       // the order of the values doesn't 
				 // affect the result
   IsNullIfEmpty = true,            // if no values are given the result is null
   Name = "GeometricMean"           // name of the aggregate
)]
public struct GeometricMean {
   /// <summary>
   /// Used to store the product
   /// </summary>
   public System.Data.SqlTypes.SqlDouble Product { get; private set; }

   /// <summary>
   /// Number of values in the set
   /// </summary>
   public double ValueCount { get; private set; }

   /// <summary>
   /// Initializes a new Product for a group
   /// </summary>
   public void Init() {
      this.Product = System.Data.SqlTypes.SqlDouble.Null;
      this.ValueCount = 0;
   }

   /// <summary>
   /// Calculates the product of the previous values and the value received
   /// </summary>
   /// <param name="number">Value to include</param>
   public void Accumulate(System.Data.SqlTypes.SqlDouble number) {
      if (this.ValueCount == 0) {
         // if this is the first value received
         this.Product = number;
      } else if (this.Product.IsNull) {
        //if the calculated value is null, stay that way
      } else if (number.IsNull) {
         //if the value is null the result is null
         this.Product = System.Data.SqlTypes.SqlDouble.Null;
      } else {
         //multiply the values
         this.Product = System.Data.SqlTypes.SqlDouble.Multiply(this.Product, number);
      }
      this.ValueCount++;
   }

   /// <summary>
   /// Merges this group to another group instantiated for the calculation
   /// </summary>
   /// <param name="group"></param>
   public void Merge(GeometricMean group) {
      //Count the product only if the other group has values
      if (group.ValueCount > 0) {
         this.Product = System.Data.SqlTypes.SqlDouble.Multiply(
            this.Product, group.Product);
      }
      this.ValueCount += group.ValueCount;
   }

   /// <summary>
   /// Ends the calculation for this group and returns the result
   /// </summary>
   /// <returns></returns>
   public System.Data.SqlTypes.SqlDouble Terminate() {
      return this.ValueCount > 0 && !this.Product.IsNull
         ? System.Math.Pow(this.Product.Value, 1 / this.ValueCount) 
         : System.Data.SqlTypes.SqlDouble.Null;
   }
}

Let’s register the aggregate:

SQL

-- Add the aggregate into SQL Server
CREATE AGGREGATE Aggregates.GeometricMean (@number float) RETURNS float
   EXTERNAL NAME CustomAggregates.GeometricMean;

And then the test:

SQL

-- Test-run 1
SELECT Aggregates.GeometricMean(a.Val) AS Result
FROM (SELECT 34 AS Val UNION ALL
      SELECT 27 AS Val UNION ALL
      SELECT 45 AS Val UNION ALL
      SELECT 55 AS Val UNION ALL
      SELECT 22 AS Val UNION ALL
      SELECT 34 AS Val) a

The result is:

Result
------
34,5451100372458

Using Partitioning

Partitioning can be used normally also with custom aggregates. For example, if we divide the previous data to two different categories and we want to have the geometric mean for each category, the query looks like:

SQL

-- Test-run 4, using partitioning
SELECT DISTINCT 
       a.Cat, 
       Aggregates.GeometricMean(a.Val) OVER (PARTITION BY a.Cat) AS Result
FROM (SELECT 1 AS Cat, 34 AS Val UNION ALL
      SELECT 1 AS Cat, 27 AS Val UNION ALL
      SELECT 1 AS Cat, 45 AS Val UNION ALL
      SELECT 2 AS Cat, 55 AS Val UNION ALL
      SELECT 2 AS Cat, 22 AS Val UNION ALL
      SELECT 2 AS Cat, 34 AS Val) a

When this is run, the result is:

Cat  Result
---  ------
1    34,5688605753326
2    34,5213758169679

Using Multiple Parameters, Concatenate

When creating aggregates, float is definitely not the only data type that can be used. So in this last example, let’s look at few other things:

Using multiple parameters
Using UserDefined format
Using SQLString and SQLBoolean

The aggregate concatenates strings using the given delimiter. The third parameter controls the behavior of the aggregate. If NullYieldsToNull is true, a NULL in the values will result to NULL. When NullYieldsToNull is false, NULLs are completely ignored. The implementation:

/// <summary>
/// Concatenates the strings with a given delimiter
/// </summary>
[System.Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(
   Microsoft.SqlServer.Server.Format.UserDefined,
   IsInvariantToDuplicates = false, // Receiving the same value again 
				 // changes the result
   IsInvariantToNulls = false,      // Receiving a NULL value changes the result
   IsInvariantToOrder = false,      // The order of the values affects the result
   IsNullIfEmpty = true,            // If no values are given the result is null
   MaxByteSize = -1,                // Maximum size of the aggregate instance. 
                                    // -1 represents a value larger than 8000 bytes,
                                    // up to 2 gigabytes
   Name = "Concatenate"             // Name of the aggregate
)]
public struct Concatenate : Microsoft.SqlServer.Server.IBinarySerialize {
   /// <summary>
   /// Used to store the concatenated string
   /// </summary>
   public System.Text.StringBuilder Result { get; private set; }

   /// <summary>
   /// Used to store the delimiter
   /// </summary>
   public System.Data.SqlTypes.SqlString Delimiter { get; private set; }

   /// <summary>
   /// Used to inform if the string has a value
   /// </summary>
   public bool HasValue { get; private set; }

   /// <summary>
   /// Used to inform if the string is NULL
   /// </summary>
   public bool IsNull { get; private set; }

   /// <summary>
   /// Is the concatenation resulting a NULL if some of the values contain NULL
   /// </summary>
   public bool NullYieldsToNull { get; private set; }

   /// <summary>
   /// Initializes a new Concatenate for a group
   /// </summary>
   public void Init() {
      this.Result = new System.Text.StringBuilder("");
      this.HasValue = false;
      this.IsNull = false;
   }

   /// <summary>
   /// Inserts a new string into the existing already concatenated string
   /// </summary>
   /// <param name="stringval">Value to include</param>
   /// <param name="delimiter">Delimiter to use</param>
   /// <param name="nullYieldsToNull">Is the concatenation resulting a NULL 
   ///                                if some of the values contain NULL</param>
   public void Accumulate(System.Data.SqlTypes.SqlString stringval, 
                          System.Data.SqlTypes.SqlString delimiter, 
                          System.Data.SqlTypes.SqlBoolean nullYieldsToNull) {
      if (!this.HasValue) {
         // if this is the first value received
         if (nullYieldsToNull && stringval.IsNull) {
            this.IsNull = true;
         } else if (stringval.IsNull) {
         } else {
            this.Result.Append(stringval.Value);
         }
         this.Delimiter = delimiter;
         this.NullYieldsToNull = nullYieldsToNull.Value;
      } else if (this.IsNull && nullYieldsToNull.Value) {
         //if the concatenated value is null, stay that way
      } else if (stringval.IsNull && nullYieldsToNull.Value) {
         //if the value is null the result is null
         this.IsNull = true;
      } else {
         //concatenate the values (only if the new value is not null)
         if (!stringval.IsNull) {
            this.Result.AppendFormat("{0}{1}", delimiter.Value, stringval.Value);
         }
      }
      // true if a value has already been set or the string to be added is not null
      this.HasValue = this.HasValue || 
		!(stringval.IsNull && !nullYieldsToNull.Value);
   }

   /// <summary>
   /// Merges this group to another group instantiated for the concatenation
   /// </summary>
   /// <param name="group"></param>
   public void Merge(Concatenate group) {
      // Merge only if the group has a value
      if (group.HasValue) {
         this.Accumulate(group.Result.ToString(), 
		this.Delimiter, this.NullYieldsToNull);
      }
   }

   /// <summary>
   /// Ends the operation and returns the result
   /// </summary>
   /// <returns></returns>
   public System.Data.SqlTypes.SqlString Terminate() {
      return this.IsNull ? System.Data.SqlTypes.SqlString.Null : 
					this.Result.ToString();
   }

   #region IBinarySerialize
   /// <summary>
   /// Writes the values to the stream in order to be stored
   /// </summary>
   /// <param name="writer">The BinaryWriter stream</param>
   public void Write(System.IO.BinaryWriter writer) {
      writer.Write(this.Result.ToString());
      writer.Write(this.Delimiter.Value);
      writer.Write(this.HasValue);
      writer.Write(this.NullYieldsToNull);
      writer.Write(this.IsNull);
   }

   /// <summary>
   /// Reads the values from the stream
   /// </summary>
   /// <param name="reader">The BinaryReader stream</param>
   public void Read(System.IO.BinaryReader reader) {
      this.Result = new System.Text.StringBuilder(reader.ReadString());
      this.Delimiter = new System.Data.SqlTypes.SqlString(reader.ReadString());
      this.HasValue = reader.ReadBoolean();
      this.NullYieldsToNull = reader.ReadBoolean();
      this.IsNull = reader.ReadBoolean();
   }
   #endregion IBinarySerialize
}

This needs a bit explanation. Since we’re using data types in fields that are not present in SQL Server, the format is marked as UserDefined. In this situation, the IBinarySerialize interface must be implemented in order to serialize and deserialize the results. Also note that the MaxByteSize is set now that we have to do our own serialization.

NULL is now handled separately (in IsNull property) so, that the StringBuilder can always contain an instance, but we’re still able to control if NULL should be returned.

Let’s see what happens with the tests. First registration:

SQL

---------------------------------------------
-- Concatenation
---------------------------------------------
CREATE AGGREGATE Aggregates.Concatenate (@string nvarchar(max), 
                                         @delimiter nvarchar(max), 
                                         @nullYieldsToNull bit) RETURNS nvarchar(max)
   EXTERNAL NAME CustomAggregates.Concatenate;

And the test:

SQL

-- Test-run 1
SELECT Aggregates.Concatenate(a.Val, ', ', 0) AS Result
FROM (SELECT 'A' AS Val UNION ALL
      SELECT 'B' AS Val UNION ALL
      SELECT 'C' AS Val) a

The result is:

Result
------
A, B, C

So, this simple case works. What about NULLs:

SQL

-- Test-run 2, NULL in the set, NullYieldsToNull = false
SELECT Aggregates.Concatenate(a.Val, ', ', 0) AS Result
FROM (SELECT 'A'  AS Val UNION ALL
      SELECT 'B'  AS Val UNION ALL
      SELECT NULL AS Val UNION ALL
      SELECT 'C'  AS Val) a

Resulting to:

Result
------
A, B, C

and

SQL

-- Test-run 3, NULL in the set, NullYieldsToNull = true
SELECT Aggregates.Concatenate(a.Val, ', ', 1) AS Result
FROM (SELECT 'A'  AS Val UNION ALL
      SELECT 'B'  AS Val UNION ALL
      SELECT NULL AS Val UNION ALL
      SELECT 'C'  AS Val) a

results to:

Result
------
NULL

More test runs are included in the downloadable script.

About the Performance

I was asked about the performance several times, so I decided to add some discussion about it into the article.

Using custom aggregates can never match the performance of built-in aggregates. One reason is that custom aggregates use MSIL (or CIL) since the logic is implemented using .NET languages. This means that the JIT compiler is involved so there's an overhead because of the compiling.

On the other hand, comparing built-in and custom aggregates isn't quite fair. You wouldn't create a new SUM aggregate because it already exists. Custom aggregates are handy to create new functionality into the database that do not exist. So when comparing the performance, it should be done as a whole. This means that the comparison should include at least:

execution time
the amount of pages read
CPU usage (the DBMS server and the consumer)
network traffic, etc.

But to give an example of the difference between built-in and custom aggregates, let's use both AVG and GeometricMean. First, we need some data so let's create a table:

SQL

CREATE TABLE SomeNumbers (
   Value decimal not null
);

And then fill the table with 100'000 rows with random data. This will take a while, so be patient.

SQL

SET NOCOUNT ON
TRUNCATE TABLE SomeNumbers;
DECLARE @counter int;
BEGIN
   SET @counter = 1;
   WHILE @counter <= 100000 BEGIN
      INSERT INTO SomeNumbers VALUES (RAND() * CAST(10 AS DECIMAL));
      SET @counter = @counter + 1;
   END;
END;

If we take the arithmetic mean (AVG), the amount of page reads is 234:

SQL

SELECT AVG(Value)
FROM SomeNumbers;

Statistics:

Table 'SomeNumbers'. Scan count 1, logical reads 234, physical reads 0,
      read-ahead reads 0, lob logical reads 0, lob physical reads 0,
      lob read-ahead reads 0.

SQL Server Execution Times:
  CPU time = 31 ms,  elapsed time = 26 ms.

Now let's compare this to the geometric mean:

SQL

SELECT Aggregates.GeometricMean(Value)
FROM SomeNumbers;

This will result to:

Table 'SomeNumbers'. Scan count 1, logical reads 234, physical reads 0,
      read-ahead reads 0, lob logical reads 0, lob physical reads 0,
      lob read-ahead reads 0.

SQL Server Execution Times:
  CPU time = 78 ms,  elapsed time = 83 ms.

So it's safe to say that the custom aggregate in this case was roughly 4 times slower than the built-in aggregate. But again calculating a different thing.

When running this example, you may encounter an arithmetic overflow. This is because the geometric mean calculates the product of elements and if there's no zero present, this calculation quickly overflows. With greater amounts of data or with large numbers, the calculation itself should be done differently.

Common Pitfalls

Few words about common pitfalls that may cause gray hair:

When you make modifications to the source code and recompile the assembly, it’s not automatically refreshed in the SQL Server since the database has its own copy of the assembly. Use ALTER ASSEMBLY command to ’refresh’ the DLL in the database without having to drop the aggregates.
If the signature of a method is changed (for example, parameter data type changes or parameters are added), ALTER ASSEMBLY cannot be used until the aggregate changed is first dropped from the database.
The matching data types in SQL Server and CLR are not always so easy to know. Use data type mapping documentation provided by Microsoft: System.Data.SqlTypes Namespace
The SQL Server equivalent for SQLString is nvarchar, not varchar. Using varchar in the CREATE AGGREGATE statement results to error 6552.
Also calling SQLString.Concat(some SQLStringInstance from database, c# string) may fail with error 6552 since the ordering is different.

That’s it this time. Hopefully, you find custom aggregates useful. I’d be grateful if you would have the extra time for comments and votes. Thank you.

History

March 18, 2011: Created
March 27, 2011:
- Correction: Merge step for GeometricMean didn't include the addition of elements. Thanks to Marc Brooks for noticing this.
- Added discussion about the performance

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)