Introduction
SQL is a language which is set-based. It's good to do something with all rows in one step. It's not so good to do something with one row, then move to the next row and do the same. That is the classical programming, useful if the values are stored in arrays or lists. But SQL has own tables and indices and want to find the best way to optimize the code. So we have to use SQL set-based. We begin with a sample which is row-based and change the code step by step into a set-based version.
Sample
Look at the following example:
Given a table tbl_rangeList
with sportsmen or books. There is an ID
, a NAME
and the RANGE
of this object. Now sportsmen stop to do sport, books are no longer available, so some rows have to been removed from the list. All IDs to delete may be listed in a table tbl_toRemove
. And - this is the job - you must correct the ranges.
Sample-table tbl_rangeList
:
Id |
Name |
range |
1055 |
Mueller |
1 |
7138 |
Smith |
2 |
916 |
John Public (*) |
3 |
3024 |
Keller |
4 |
6211 |
Jeanni Public (*) |
5 |
4809 |
O'Neill |
6 |
The two entries marked with (*) have to be removed, so the result should be:
Id |
Name |
range |
1055 |
Mueller |
1 |
7138 |
Smith |
2 |
3024 |
Keller |
3 |
4809 |
O'Neill |
4 |
First solution: Cursor-based with computed range per row
The sample is written in Transact-SQL, but it could also be written in PHP with a mySql-Connection. A SQL-newcomer may write:
Declare @id int,
@range int
Declare cs_rangeList Cursor
Local Fast_Forward
For
Select A.Id, A.range From tbl_rangeList
Open cs_rangeList
Fetch Next From cs_rangeList Into @id, @range
While (@@Fetch_Status = 0)
Begin
If ((Select Count(*) From tbl_toRemove As A
Where A.Id = @id) = 0)
Begin
Select @new_range = Count(*)
From tbl_rangeList As A
Where A.range <= @range
And A.Id Not In
(Select B.Id From tbl_toRemove As B)
Update tbl_rangeList
Set range = @new_range
Where Id = @Id
End
Else
Delete From tbl_rangeList
Where Id = @Id
Fetch Next From cs_rangeList Into @id, @range
End
Close cs_rangeList
Deallocate cs_rangeList
Now the SQL-newcomer is happy, says to himself, 'great, tables are like arrays and I am an expert of arrays' - and does not understand why it is slow when using it on a table with 10,000 rows.
The solution is terrible, it's a job-creation measure. But it works and it is good to understand what to do.
Second solution: Cursor-based with sort, additional index and removing all outdated rows first
- First idea: Remove all rows before doing the rest.
- Second idea: Order the cursor. So the first range is 1, place the value in a variable and increment it.
Declare @id int,
@new_range int
Delete From tbl_rangeList
Where Id In
(Select Id From tbl_toRemove)
Declare cs_rangeList Cursor
Local Fast_Forward
For
Select A.Id
From tbl_rangeList As A
Order By A.range
Open cs_rangeList
Fetch Next From cs_rangeList Into @id
Set @new_range = 1
While (@@Fetch_Status = 0)
Begin
Update tbl_rangeList
Set range = @new_range
Where Id = @id
Set @new_range = @new_range + 1
Fetch Next From cs_rangeList Into @id
End
This is better, but it's also row-oriented, not set-based. If it would be part of a procedural oriented language and if the tables would be arrays, this may be a good solution. But SQL - never, there it is like a dead-end street.
But why not with SQL? The SQLl-code is parsed. But then it is not clear in which order the tables are scanned, the ON
clauses are computed, if a WHERE
should produce a complete table-scan or should use an index. So an execution plan has to be created. And this plan will be compiled, cached and re-used later. So - and this is a difference to other programming languages - the same stored procedure or batch may produce completely different execution plans. But if the code is row-oriented, the execution plan will be every (maybe: most) time the same, no internal optimization exists. So we have to create SQL-code which is far away from such a do-it-step-by-step, so that the optimizer can put a layer - the execution plan - between the SQL-code and the tables.
Third solution: Set-oriented without cursor
Now, we will remove the cursor. We have to compute the new range for all rows in one step. So let's create a self-join with the ON
-part B.range >= C.range
and count all rows. The result is the new range of this id.
Select B.Id, Count(*) As new_range
From tbl_rangeList As B Inner Join tbl_rangeList As C
On B.range >= C.range
Group By B.Id
This produces the following output:
Id |
new_range |
1055 |
1 |
7138 |
2 |
3024 |
3 |
4809 |
4 |
Every Id
is listed with its new range. This table can be used as subquery in a join.
Delete From tbl_rangeList
Where Id In
(Select Id From tbl_toRemove)
Update tbl_rangeList
Set range = D.new_range
From tbl_rangeList As A Inner Join
(Select B.Id,
Count(*) As new_range
From tbl_rangeList As B Inner Join
tbl_rangeList As C
On B.range >= C.range
Group By B.Id) As D
On A.Id = D.Id
The inner query holds all new ranges and is allowed in an update-command. But there is one problem: All rows are updated, not only these with new ranges.
Fourth solution: Skip rows where the range is not changed - with HAVING
With a Having
-Part, we can test if the new range is lower then the old range. And the subquery collects only these rows.
Update tbl_rangeList
Set range = D.new_range
From tbl_rangeList As A Inner Join
(Select B.Id,
Count(*) As new_range
From tbl_rangeList As B Inner Join
tbl_rangeList As C
On B.range >= C.range
Group By B.Id, B.range
Having Count(*) < B.range) As D
On A.Id = D.Id
Now the inner table holds only those rows which need a range-update. Ok, this looks better - but not good enough. If we have 10,000 rows and the first row which is removed has the range 9,523, we must compute the new range for 9,522 rows and throw these results away. But this allows to look after the first removed range and use this to reduce the subquery with a WHERE
-part. A WHERE
is better - because it's done before the GROUP BY
.
Fifth (and last) solution: Skip rows where the range is not changed - use the minimum in WHERE
Declare @i int
Select @i = Min(A.range)
From tbl_rangeList As A
Inner Join tbl_toRemove As B
On A.Id = B.Id
Delete From tbl_rangeList
Where Id In
(Select B.Id From tbl_toRemove As B)
Update tbl_rangeList
Set range = D.new_range
From tbl_rangeList As A Inner Join
(Select B.Id,
Count(*) As new_range
From tbl_rangeList As B Inner Join
tbl_rangeList As C
On B.range >= C.range
Where B.range >= @i
Group By B.Id) As D
On A.Id = D.Id
Now the Count(*)
is only done for rows which are needed. And the optimizer can compute the best execution plan, he checks if it is better to scan tables directly or use an index. Our SQL-code is really set-oriented: The first command stores the minimal range, the second deletes all outdated rows and the third does the job with a single command.
Thanks for your reading - and use it.
A German version of this article can be found at SQL as set-oriented language. The SQL-Tutorial has a lot of Group By
and subquery samples.