(untagged)

SQL as a set-oriented language

j.Auer

0.00/5 (No votes)

17 Apr 2005

SQL works if the commands are set-based. Procedural elements may helpful, but shouldn't be used on the wrong place. The article begins with a procedural sample and changes it into a set-based version.

Introduction

SQL is a language which is set-based. It's good to do something with all rows in one step. It's not so good to do something with one row, then move to the next row and do the same. That is the classical programming, useful if the values are stored in arrays or lists. But SQL has own tables and indices and want to find the best way to optimize the code. So we have to use SQL set-based. We begin with a sample which is row-based and change the code step by step into a set-based version.

Sample

Look at the following example:

Given a table tbl_rangeList with sportsmen or books. There is an ID, a NAME and the RANGE of this object. Now sportsmen stop to do sport, books are no longer available, so some rows have to been removed from the list. All IDs to delete may be listed in a table tbl_toRemove. And - this is the job - you must correct the ranges.

Sample-table tbl_rangeList:

Id	Name	range
1055	Mueller	1
7138	Smith	2
916	John Public (*)	3
3024	Keller	4
6211	Jeanni Public (*)	5
4809	O'Neill	6

The two entries marked with (*) have to be removed, so the result should be:

Id	Name	range
1055	Mueller	1
7138	Smith	2
3024	Keller	3
4809	O'Neill	4

First solution: Cursor-based with computed range per row

The sample is written in Transact-SQL, but it could also be written in PHP with a mySql-Connection. A SQL-newcomer may write:

Declare @id int,
    @range int

Declare cs_rangeList Cursor
    Local Fast_Forward
For
Select A.Id, A.range From tbl_rangeList

Open cs_rangeList

Fetch Next From cs_rangeList Into @id, @range

While (@@Fetch_Status = 0)
Begin
    -- check, if the row has to remove

    If ((Select Count(*) From tbl_toRemove As A
        Where A.Id = @id) = 0)
    Begin
        -- compute the new range

        Select @new_range = Count(*)
        From tbl_rangeList As A
        Where A.range <= @range
        And A.Id Not In
            (Select B.Id From tbl_toRemove As B)

        -- and update it

        Update tbl_rangeList
            Set range = @new_range
        Where Id = @Id
    End
    Else
        -- remove it

        Delete From tbl_rangeList
        Where Id = @Id

    Fetch Next From cs_rangeList Into @id, @range
End
Close cs_rangeList
Deallocate cs_rangeList

Now the SQL-newcomer is happy, says to himself, 'great, tables are like arrays and I am an expert of arrays' - and does not understand why it is slow when using it on a table with 10,000 rows.

The solution is terrible, it's a job-creation measure. But it works and it is good to understand what to do.

Second solution: Cursor-based with sort, additional index and removing all outdated rows first

First idea: Remove all rows before doing the rest.
Second idea: Order the cursor. So the first range is 1, place the value in a variable and increment it.

Declare @id int,
    @new_range int

Delete From tbl_rangeList
Where Id In
    (Select Id From tbl_toRemove)

Declare cs_rangeList Cursor
    Local Fast_Forward
For
Select A.Id
From tbl_rangeList As A
-- new: Order the list

Order By A.range

Open cs_rangeList

Fetch Next From cs_rangeList Into @id
-- new: begin with 1

Set @new_range = 1

While (@@Fetch_Status = 0)
Begin
    Update tbl_rangeList
        Set range = @new_range
    Where Id = @id

    -- great computing

    Set @new_range = @new_range + 1

    Fetch Next From cs_rangeList Into @id
End
-- close and deallocate the cursor - skipped

This is better, but it's also row-oriented, not set-based. If it would be part of a procedural oriented language and if the tables would be arrays, this may be a good solution. But SQL - never, there it is like a dead-end street.

But why not with SQL? The SQLl-code is parsed. But then it is not clear in which order the tables are scanned, the ON clauses are computed, if a WHERE should produce a complete table-scan or should use an index. So an execution plan has to be created. And this plan will be compiled, cached and re-used later. So - and this is a difference to other programming languages - the same stored procedure or batch may produce completely different execution plans. But if the code is row-oriented, the execution plan will be every (maybe: most) time the same, no internal optimization exists. So we have to create SQL-code which is far away from such a do-it-step-by-step, so that the optimizer can put a layer - the execution plan - between the SQL-code and the tables.

Third solution: Set-oriented without cursor

Now, we will remove the cursor. We have to compute the new range for all rows in one step. So let's create a self-join with the ON-part B.range >= C.range and count all rows. The result is the new range of this id.

Select B.Id, Count(*) As new_range
From tbl_rangeList As B Inner Join tbl_rangeList As C
On B.range >= C.range
Group By B.Id

This produces the following output:

Id	new_range
1055	1
7138	2
3024	3
4809	4

Every Id is listed with its new range. This table can be used as subquery in a join.

Delete From tbl_rangeList
Where Id In
    (Select Id From tbl_toRemove)

Update tbl_rangeList

    Set range = D.new_range

From tbl_rangeList As A Inner Join

    (Select B.Id,
        Count(*) As new_range
    From tbl_rangeList As B Inner Join
        tbl_rangeList As C
    On B.range >= C.range
    Group By B.Id) As D

On A.Id = D.Id

The inner query holds all new ranges and is allowed in an update-command. But there is one problem: All rows are updated, not only these with new ranges.

Fourth solution: Skip rows where the range is not changed - with HAVING

With a Having-Part, we can test if the new range is lower then the old range. And the subquery collects only these rows.

Update tbl_rangeList

    Set range = D.new_range

From tbl_rangeList As A Inner Join

    (Select B.Id,
        Count(*) As new_range
    From tbl_rangeList As B Inner Join
        tbl_rangeList As C
    On B.range >= C.range
    Group By B.Id, B.range
    Having Count(*) < B.range) As D

On A.Id = D.Id

Now the inner table holds only those rows which need a range-update. Ok, this looks better - but not good enough. If we have 10,000 rows and the first row which is removed has the range 9,523, we must compute the new range for 9,522 rows and throw these results away. But this allows to look after the first removed range and use this to reduce the subquery with a WHERE-part. A WHERE is better - because it's done before the GROUP BY.

Fifth (and last) solution: Skip rows where the range is not changed - use the minimum in WHERE

Declare @i int

Select @i = Min(A.range)
From tbl_rangeList As A
    Inner Join tbl_toRemove As B
On A.Id = B.Id

-- @i holds now the lowest range to remove


Delete From tbl_rangeList
Where Id In
    (Select B.Id From tbl_toRemove As B)


Update tbl_rangeList

    Set range = D.new_range

From tbl_rangeList As A Inner Join

    (Select B.Id,
        Count(*) As new_range
    From tbl_rangeList As B Inner Join
        tbl_rangeList As C
    On B.range >= C.range
    Where B.range >= @i
    Group By B.Id) As D

On A.Id = D.Id

Now the Count(*) is only done for rows which are needed. And the optimizer can compute the best execution plan, he checks if it is better to scan tables directly or use an index. Our SQL-code is really set-oriented: The first command stores the minimal range, the second deletes all outdated rows and the third does the job with a single command.

Thanks for your reading - and use it.

A German version of this article can be found at SQL as set-oriented language. The SQL-Tutorial has a lot of Group By and subquery samples.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here