In this article, you will see how to calculate progressive sums in T-SQL.
Introduction
A problem that a business developer often encounters is the calculation of a progressive sum (running total) over a number of cash inflows/outflows by date. By progressive sum, we mean the cumulative sum of cash inflows and outflows, such as:
Date | Inflow/Outflow | Amount | Progressive Sum |
1/1/2016 | Inflow | 10000 | 10000 |
3/1/2016 | Outflow | 5000 | 5000 |
3/1/2016 | Inflow | 3000 | 8000 |
4/1/2016 | Outflow | 12000 | -4000 |
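In other words, each row's progressive sum is the previous row's progressive sum, plus the amount for an inflow or minus the amount for an outflow: 10000, then 10000 - 5000 = 5000, then 5000 + 3000 = 8000, then 8000 - 12000 = -4000.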
Setup
For example, suppose we create the following Moves table in SQL Server:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Moves](
[RecID] [int] IDENTITY(1,1) NOT NULL,
[MoveDate] [datetime] NULL,
[MoveDescr] [nvarchar](2000) NULL,
[MoveSignID] [int] NULL,
[MoveProjValue] [decimal](18, 4) NULL,
[MoveActualValue] [decimal](18, 4) NULL,
[MoveComments] [nvarchar](2000) NULL,
CONSTRAINT [PK_Moves] PRIMARY KEY CLUSTERED
(
[RecID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [MovesIndex] ON [dbo].[Moves]
(
[MoveDate] ASC,
[MoveSignID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
MoveSignID has the value 1 for inflows (i.e., cash coming in, for example, cash paid to us by a customer) and 2 for outflows (i.e., cash going out, for example, paying a supplier).
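For illustration only (these two rows are not part of the test data generated below), an inflow and an outflow could be recorded like this:
INSERT INTO dbo.Moves (MoveDate, MoveDescr, MoveSignID, MoveProjValue)
VALUES ('20160101', N'Cash paid by a customer', 1, 10000)  -- inflow
INSERT INTO dbo.Moves (MoveDate, MoveDescr, MoveSignID, MoveProjValue)
VALUES ('20160103', N'Payment to a supplier', 2, 5000)     -- outflow
GO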
Let’s then create some test data using the create sample Moves.zip query, which will create a test load of 40,000 records with random values, so that we can test with a realistic load of roughly 6 years of data, such as:
RecID MoveDate MoveSignID MoveProjValue
1 2016-04-17 1 6776.8625
2 2016-04-17 1 7399.8839
...
10 2016-04-17 1 7423.4752
11 2016-04-18 1 6769.2806
...
20001 2016-04-17 2 237.0888
20002 2016-04-17 2 7865.3622
...
20011 2016-04-18 2 1348.9914
20012 2016-04-18 2 5248.4128
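The actual generation script is included in the download; purely as an illustrative sketch (not the code from create sample Moves.zip, which, judging by the RecID ranges above, appears to insert inflows and outflows in two separate batches), 40,000 rows with random dates and values spread over roughly 6 years could be produced along these lines:
SET NOCOUNT ON
DECLARE @i int = 0
WHILE @i < 40000
BEGIN
    INSERT INTO dbo.Moves (MoveDate, MoveSignID, MoveProjValue)
    VALUES (DATEADD(day, ABS(CHECKSUM(NEWID())) % 2192, '20160101'), -- a random day within roughly 6 years
            1 + ABS(CHECKSUM(NEWID())) % 2,                          -- 1 = inflow, 2 = outflow
            ABS(CHECKSUM(NEWID())) % 100000 / 10.0)                  -- a random amount up to about 10000
    SET @i = @i + 1
END
GO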
Processing
Suppose now that we want to calculate the running total of this data. The most obvious query would be:
SELECT a.moveDate, a.moveprojValue,
SUM(CASE WHEN b.moveSignID=1 THEN b.moveprojValue ELSE -b.moveprojValue END) AS RunningTotal
FROM moves a CROSS JOIN moves b
WHERE (b.moveDate <= a.moveDate)
GROUP BY a.moveDate, a.moveprojValue, a.moveSignID
ORDER BY a.moveDate
The query takes roughly 29 seconds to complete on my W2012R2 virtual machine with SQL Server 2014 Express, with the following statistics:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
(40000 row(s) affected)
Table 'Worktable'. Scan count 1, logical reads 804070,
physical reads 0, read-ahead reads 0, lob logical reads 0,
lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0,
physical reads 0, read-ahead reads 0, lob logical reads 0,
lob physical reads 0, lob read-ahead reads 0.
Table 'Moves'. Scan count 2, logical reads 540,
physical reads 0, read-ahead reads 0, lob logical reads 0,
lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 28422 ms, elapsed time = 29155 ms.
Of course, there are more optimised solutions, like the following:
SELECT T1.*,
(SELECT SUM(CASE WHEN MoveSignID=1 THEN MoveProjValue ELSE -MoveProjValue END )
FROM Moves T2
WHERE T2.MoveDate <= T1.MoveDate) AS RunningTotal
FROM Moves T1
ORDER BY MoveDate
GO
The query takes roughly 12-13 seconds to complete on the same VM, with the following statistics:
(40000 row(s) affected)
Table 'Worktable'. Scan count 44000, logical reads 449343,
physical reads 0, read-ahead reads 0, lob logical reads 0,
lob physical reads 0, lob read-ahead reads 0.
Table 'Moves'. Scan count 2, logical reads 82890,
physical reads 0, read-ahead reads 0, lob logical reads 0,
lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 8890 ms, elapsed time = 13407 ms.
Not very satisfactory: it still takes a noticeable amount of time to produce a result that seems trivial.
Maybe we can try a cursor solution?
For example:
CREATE TABLE #Moves (MoveDate datetime, Amount decimal(18,4), RunningTotal decimal(18,4))
DECLARE @MoveDate datetime, @SignID int, @Amount decimal(18,4), @RunningTotal decimal(18,4)
SET @RunningTotal = 0
DECLARE rt_cursor CURSOR FOR SELECT MoveDate, MoveSignID,
MoveProjValue FROM Moves ORDER BY MoveDate, MoveSignID
OPEN rt_cursor
FETCH NEXT FROM rt_cursor INTO @MoveDate, @SignID, @Amount
WHILE @@FETCH_STATUS = 0
BEGIN
IF @SignID=1
SET @RunningTotal = @RunningTotal + @Amount
ELSE SET @RunningTotal = @RunningTotal - @Amount
INSERT #Moves VALUES (@MoveDate, @Amount,@RunningTotal)
FETCH NEXT FROM rt_cursor INTO @MoveDate, @SignID, @Amount
END
CLOSE rt_cursor
DEALLOCATE rt_cursor
SELECT * FROM #Moves ORDER BY MoveDate
DROP TABLE #Moves
This produces results in about 9 seconds; the statistics are not very meaningful, since we are doing an insert for every row. However, it is still not optimal and has one major disadvantage: it cannot be incorporated into a view, since cursors are only allowed in stored procedures and functions.
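If the cursor version has to be reused, the usual workaround is to wrap it in a stored procedure; a minimal sketch, with a hypothetical procedure name:
CREATE PROCEDURE dbo.usp_MovesRunningTotal
AS
BEGIN
    SET NOCOUNT ON
    -- the cursor code shown above goes here, unchanged,
    -- ending with: SELECT * FROM #Moves ORDER BY MoveDate
END
GO
-- called as:
EXEC dbo.usp_MovesRunningTotal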
If we have access to SQL Server 2012 or later, we can try using the OVER clause:
SELECT T1.*,
SUM(CASE WHEN T1.MoveSignID=1 THEN T1.MoveProjValue ELSE -T1.MoveProjValue END )
OVER(ORDER BY moveDate) AS RunningTotal
FROM Moves T1
order by T1.MoveDate
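One point to be aware of: with ORDER BY moveDate alone, rows sharing the same date receive the same running-total value, because the default window frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. If a strictly row-by-row total is wanted, a tiebreaker column and an explicit ROWS frame can be added, for example (the timings reported below refer to the original query above):
SELECT T1.*,
SUM(CASE WHEN T1.MoveSignID=1 THEN T1.MoveProjValue ELSE -T1.MoveProjValue END)
OVER(ORDER BY T1.MoveDate, T1.RecID ROWS UNBOUNDED PRECEDING) AS RunningTotal
FROM Moves T1
ORDER BY T1.MoveDate, T1.RecID
The ROWS frame also generally performs better than the default RANGE frame, since it avoids spooling the window to an on-disk worktable.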
The OVER clause was added to SQL in order to define a window, i.e. a user-specified set of rows, over which an aggregate function such as SUM, AVG, etc. is applied. The query above completes in only 0.6 seconds, nearly 15 times faster than the next best solution (the cursor query) and more than 20 times faster than the correlated subquery solution presented above. See also the relevant statistics:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
(40000 row(s) affected)
Table 'Worktable'. Scan count 2001, logical reads 174260,
physical reads 0, read-ahead reads 0, lob logical reads 0,
lob physical reads 0, lob read-ahead reads 0.
Table 'Moves'. Scan count 1, logical reads 82611,
physical reads 0, read-ahead reads 0, lob logical reads 0,
lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 375 ms, elapsed time = 588 ms.
Additionally, it uses a much more elegant syntax that is extremely clear and concise. The main disadvantage is that this syntax is only available on SQL Server 2012 or later; even SQL Server 2008 R2 will produce a syntax error. So, either you have to upgrade your SQL Server or use one of the less efficient solutions.
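Unlike the cursor solution, the OVER-based query can be put straight into a view; a minimal sketch, with a hypothetical view name:
CREATE VIEW dbo.vwMovesRunningTotal
AS
SELECT T1.*,
SUM(CASE WHEN T1.MoveSignID=1 THEN T1.MoveProjValue ELSE -T1.MoveProjValue END)
OVER(ORDER BY T1.MoveDate) AS RunningTotal
FROM dbo.Moves T1
GO
-- used like any other view:
SELECT * FROM dbo.vwMovesRunningTotal ORDER BY MoveDate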
Final Results
To sum up in a table:
Solution | SQL compatibility | CPU time (ms) | Elapsed time (ms) |
CROSS JOIN | All | 28422 | 29155 |
SUBQUERY | All | 8890 | 13407 |
CURSOR | All | N/A | ~9000 |
OVER | SQL 2012+ | 375 | 588 |
All comments are welcome; I hope some people will find it useful!
Using the Code
Just create a blank database, run Moves.SQL, and then run create sample moves.SQL.
After that, you can use any of the sample queries provided, and especially progressive sum partition order by - only for SQL2012+.SQL.
Points of Interest
SQL Server and T-SQL keep improving. However, they still lack one particular mission-critical feature: scaling out...
Copyright [2016] [by E. Gimissis]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
History
- 17th April, 2016: Initial version