Tuesday, May 29, 2012

Recursive CTE is evil. And cursor is an angel

Someone made a comment on Stackoverflow that I'm painting the Recursive CTE as evil and should always be avoided on my post Running Total. Here and Now

Well, like Stackoverflow, I like facts, not opinions, give me numbers any time of day, numbers don't lie. Numbers could lie if you are a boxer though. How? By cherry-picking opponents. Ok, I digress ;-)

First, I would never declare a too encompassing statement that this and that thing is evil. What I actually conveyed on that previous post is, using recursive CTE for running total is slow. How could I even make a fast hierarchical report in this 21st century if I don't use recursive CTE then?

Anyway, I don't have too many blog post declaring things as evil, in fact since I do blogging, I have only one blog post that mention the word evil, two if you counted Dracula as evil, three after I deliberately title this post with headline-grabbing four-letter word evil. With this in mind, I would give my blog a favor and give it a traffic-grabbing title, XXX is EVIL. I will then title this article as Recursive CTE is Evil, unmindful of the fact that what I really conveyed is Running Total with Recursive CTE is Evil :-)


This blog post has twofold purposes. First, I find using *recursive* CTE for running total is really dog-slow^H^H^H^H^H^H^H^H evil; second, I'm ashamed that quirky update approach for making running total semi-works only, and yeah it's really quirky. I put a guard statement on the UPDATE statement, and that guard statement gives a warning(divide-by-zero) if UPDATE update rows unpredictably(i.e. it run out-of-order). And unpredictable it is, sometimes the query work, sometimes it don't(divide-by-zero is invoked), I have yet to explore the suggestions by many to make a clustered index on temporary table to make the update be predictable.


And before I forgot, a third purpose of this post, I would like to re-introduce to you our old friend cursor; he's nice, he's dependable, he's our most unsung hero. I would hazard a guess that for every smooth-running system, there's a cursor-using report that is lurking around that is not given much recognition or accolade, because you know, cursors are evil. While our new-found friend(CTE) is being given all the mad props, even with the mere fact that it is just doing a glorified row_numbering chore, yes folks, SQL Server 2008 don't have decent windowing capability, that a running total blog post such as this exists because of that limitation. This blog post will be moot when most of us will be using the latest Sql Server 2012 by then, but that's neither here nor there, even we are already at the end of the second quarter of year 2012, most of us are still using Sql Server 2008.


Ok, so let's start with some data to verify our query correctness:

create table Test(
    OrderID int primary key,
    CustomerCode varchar(50),
    Qty int not null
);


declare @i int = 1;

while @i <= 20 begin
    insert into Test(OrderID, CustomerCode, Qty) values (
        @i * 2
        ,case @i % 4 
        when 0 then 'JOHN' 
        when 1 then 'PAUL'
        when 2 then 'GEORGE'
        when 3 then 'RINGO'
        end
        ,rand() * 10);    
    set @i = @i + 1;
end;


Then make a Recursive CTE approach:

with T AS
(
    select ROW_NUMBER() over(partition by CustomerCode order by OrderID) as rn, * from test
)
,R(CustomerCode, Rn, OrderId, Qty, RunningTotal) as
(
    select CustomerCode, Rn, OrderID, Qty, Qty
    from t 
    where rn = 1
    
    union all
    
    select t.CustomerCode, t.Rn, t.OrderId, t.Qty, p.RunningTotal + t.Qty
    from t t
    join r p on p.CustomerCode = t.CustomerCode and t.rn = p.rn + 1
    
)
select R.CustomerCode, R.OrderId, R.Qty, R.RunningTotal from r
order by R.CustomerCode, R.OrderId 
option(maxrecursion 0);

Output:

CustomerCode                                       OrderId     Qty         RunningTotal
-------------------------------------------------- ----------- ----------- ------------
GEORGE                                             4           7           7
GEORGE                                             12          8           15
GEORGE                                             20          4           19
GEORGE                                             28          4           23
GEORGE                                             36          9           32
JOHN                                               8           2           2
JOHN                                               16          9           11
JOHN                                               24          0           11
JOHN                                               32          3           14
JOHN                                               40          3           17
PAUL                                               2           5           5
PAUL                                               10          8           13
PAUL                                               18          3           16
PAUL                                               26          7           23
PAUL                                               34          0           23
RINGO                                              6           7           7
RINGO                                              14          9           16
RINGO                                              22          2           18
RINGO                                              30          3           21
RINGO                                              38          3           24

(20 row(s) affected)


Then make a Quirky Update approach:

create function TestRunningTotalGuarded()
returns @ReturnTable table(
    CustomerCode varchar(50), OrderId int, Qty int, RunningTotal int not null, RN int identity(1,1) not null
)
as begin
 
    insert into @ReturnTable(CustomerCode, OrderID, Qty, RunningTotal)
    select CustomerCode, OrderID, Qty, 0 from Test
    order by CustomerCode, OrderID;
     
    declare @RunningTotal int;
     
    declare @RN_check INT = 0;
    declare @PrevCustomerCode varchar(50) = NULL;
     
    update @ReturnTable set
            @RN_check = @RN_check + 1,
            @RunningTotal = 
                (case when RN = @RN_check then 
                    case when @PrevCustomerCode = CustomerCode then
                        @RunningTotal + Qty 
                    else
                        Qty
                    end
                else 
                    1/0 
                end),
            @PrevCustomerCode = CustomerCode,
            RunningTotal = @RunningTotal;
 
    return;
end;

Output:

CustomerCode                                       OrderId     Qty         RunningTotal RN
-------------------------------------------------- ----------- ----------- ------------ -----------
GEORGE                                             4           7           7            1
GEORGE                                             12          8           15           2
GEORGE                                             20          4           19           3
GEORGE                                             28          4           23           4
GEORGE                                             36          9           32           5
JOHN                                               8           2           2            6
JOHN                                               16          9           11           7
JOHN                                               24          0           11           8
JOHN                                               32          3           14           9
JOHN                                               40          3           17           10
PAUL                                               2           5           5            11
PAUL                                               10          8           13           12
PAUL                                               18          3           16           13
PAUL                                               26          7           23           14
PAUL                                               34          0           23           15
RINGO                                              6           7           7            16
RINGO                                              14          9           16           17
RINGO                                              22          2           18           18
RINGO                                              30          3           21           19
RINGO                                              38          3           24           20

(20 row(s) affected)


Then make a Running Total via Cursor approach:

create function TestRunningTotalCursor()
returns @ReturnTable table(
    CustomerCode varchar(50), OrderId int, Qty int, RunningTotal int not null
) as
begin

    declare @c_CustomerCode varchar(50);
    declare @c_OrderID int;
    declare @c_qty int;
    
    declare @PrevCustomerCode varchar(50) = null;
    declare @RunningTotal int = 0;
    
    

    declare o_cur cursor for
    select CustomerCode, OrderID, Qty from Test
    order by CustomerCode, OrderID;

    open o_cur;


    fetch next from o_cur
    into @c_CustomerCode, @c_OrderID, @c_Qty;

    while @@FETCH_STATUS = 0 begin
                
        if @c_CustomerCode = @PrevCustomerCode begin
            set @RunningTotal = @RunningTotal + @c_qty;
        end
        else begin
            set @RunningTotal = @c_Qty;
        end
        
        set @PrevCustomerCode = @c_CustomerCode;
        
        insert into @ReturnTable(CustomerCode, OrderId, Qty, RunningTotal)
        values(@c_CustomerCode, @c_OrderID, @c_Qty, @RunningTotal);
                    
        
        fetch next from o_cur
        into @c_CustomerCode, @c_OrderID, @c_Qty;
    end


    close o_cur;
    deallocate o_cur;
    
    return;    
end;

Output:

CustomerCode                                       OrderId     Qty         RunningTotal
-------------------------------------------------- ----------- ----------- ------------
GEORGE                                             4           7           7
GEORGE                                             12          8           15
GEORGE                                             20          4           19
GEORGE                                             28          4           23
GEORGE                                             36          9           32
JOHN                                               8           2           2
JOHN                                               16          9           11
JOHN                                               24          0           11
JOHN                                               32          3           14
JOHN                                               40          3           17
PAUL                                               2           5           5
PAUL                                               10          8           13
PAUL                                               18          3           16
PAUL                                               26          7           23
PAUL                                               34          0           23
RINGO                                              6           7           7
RINGO                                              14          9           16
RINGO                                              22          2           18
RINGO                                              30          3           21
RINGO                                              38          3           24

(20 row(s) affected)


After verifying all those queries are correct, now let's bump the number of rows to 5,000. Here are the metrics:


* Recursive CTE : 49 seconds
* Quirky Update : 0 second
* Cursor        : 0 second


Those 0 seconds are not meaningful, let's bump the rows to 50,000 to see a meaningful metrics:
* Quirky Update : 1 second
* Cursor        : 3 seconds
* Recursive CTE : An hour

Caveat, quirky update look like a speed demon, but there's some problem with it, with Sql Server(and other RDBMS might be too. Sybase was the first RDBMS that uses the quirky update approach for running total and encourages it, so YMMV if you are using Sybase. Microsoft SQL Server was a Sybase before), the order of updating rows is unpredictable when it is given large number of rows. I now see a divide-by-error intermittently on 50,000 rows for about one time in every five run.

CustomerCode                                       OrderId     Qty         RunningTotal RN
-------------------------------------------------- ----------- ----------- ------------ -----------
Msg 8134, Level 16, State 1, Line 1
Divide by zero error encountered.
The statement has been terminated.


From the metrics and correctness, we should be now comfortable using cursor for running total. No one could paint FUD on this baby.


If someone could enlighten me if making quirky update not quirky is really a lost cause, I would stop here and would not find a way to make the quirky update reliable. Until further notice, I'm scouring the web if there's a way to make the quirky update reliable.



UPDATE

The pure CTE approach is the root of all evil :-) When a row-numbering query(that is being joined to a recursive CTE afterwards) is materialized to an actual table first before being joined to a recursive CTE, the speed goes up

select ROW_NUMBER() over(partition by CustomerCode order by OrderID) as rn, * into #xxx 
from test;

with T AS
(
    select * from #xxx
)
,R(CustomerCode, Rn, OrderId, Qty, RunningTotal) as
(
    select CustomerCode, Rn, OrderID, Qty, Qty
    from t 
    where rn = 1
    
    union all
    
    select t.CustomerCode, t.Rn, t.OrderId, t.Qty, p.RunningTotal + t.Qty
    from t t
    join r p on p.CustomerCode = t.CustomerCode and t.rn = p.rn + 1
    
)
select R.CustomerCode, R.OrderId, R.Qty, R.RunningTotal from r
order by R.CustomerCode, R.OrderId 
option(maxrecursion 0);

drop table #xxx;



Using view is also as slow as the pure CTE, also an hour:
create view xxx 
as
    select ROW_NUMBER() over(partition by CustomerCode order by OrderID) as rn, *
    from test;
go



with T AS
(
    select * from xxx
)
,R(CustomerCode, Rn, OrderId, Qty, RunningTotal) as
(
    select CustomerCode, Rn, OrderID, Qty, Qty
    from t 
    where rn = 1
     
    union all
     
    select t.CustomerCode, t.Rn, t.OrderId, t.Qty, p.RunningTotal + t.Qty
    from t t
    join r p on p.CustomerCode = t.CustomerCode and t.rn = p.rn + 1
     
)
select R.CustomerCode, R.OrderId, R.Qty, R.RunningTotal from r
order by R.CustomerCode, R.OrderId 
option(maxrecursion 0);



To speed up that view, materialize the view to an actual table:
create view xxx 
as
    select ROW_NUMBER() over(partition by CustomerCode order by OrderID) as rn, *
    from test;
go


select * into #xxx from xxx;


with T AS
(
    select * from #xxx
)
,R(CustomerCode, Rn, OrderId, Qty, RunningTotal) as
(
    select CustomerCode, Rn, OrderID, Qty, Qty
    from t 
    where rn = 1
     
    union all
     
    select t.CustomerCode, t.Rn, t.OrderId, t.Qty, p.RunningTotal + t.Qty
    from t t
    join r p on p.CustomerCode = t.CustomerCode and t.rn = p.rn + 1
     
)
select R.CustomerCode, R.OrderId, R.Qty, R.RunningTotal from r
order by R.CustomerCode, R.OrderId 
option(maxrecursion 0);




CTE is not slow, not understanding how queries work is what makes your query slow.



To recap, prior to materializing the row numbering to an actual table(i.e. temporary table), here are the metrics on 50,000 rows:

* Quirky Update       : 1 second
* Cursor              : 3 seconds
* Recursive CTE(Pure) : An hour

Here are the new metrics after row numbering table materialization (50,000 rows):

* Quirky Update         : 1 second
* Cursor                : 3 seconds
* Recursive CTE(Hybrid) : 2 seconds (inclusive of row numbering table materialization)


UPDATE May 29, 2012 8:00 PM

The quirky update is not so quirky after all. Just by putting a clustered primary key on sequential row number, SQL Server UPDATE is able to do an orderly update of rows.

The only change needed be done is to put a not null primary key clustered
alter function TestRunningTotalGuarded()
returns @ReturnTable table(
    CustomerCode varchar(50), OrderId int, Qty int, RunningTotal int not null, RN int identity(1,1) not null primary key clustered
)

I tried looping it 100 times, didn't encounter any divide-by-zero error
declare @i int = 0;
while @i < 100 begin
 select * from dbo.TestRunningTotalGuarded() order by CustomerCode, OrderId
 set @i = @i + 1;
end
And it's still the same 1 second query, and it's faster than Hybrid Recursive CTE. Here's the metric for 100,000 rows:
Quirky Update        : 3 seconds
Hybrid Recursive CTE : 5 seconds
Cursor               : 6 seconds
Quirky update is not so quirky after all.
UPDATE July 27, 2013 10:47 AM Table variable is faster by a few milliseconds...
declare @t table
(
    RN INT,
    OrderID int,
    CustomerCode varchar(50),
    Qty int,
    primary key(CustomerCode, RN)
);

insert into @t(RN, OrderID, CustomerCode, Qty)
select ROW_NUMBER() over(partition by CustomerCode order by OrderID) as rn, * from test;

with T AS
(
    select * from @t
)
,R(CustomerCode, Rn, OrderId, Qty, RunningTotal) as
(
    select CustomerCode, Rn, OrderID, Qty, Qty
    from t 
    where rn = 1
     
    union all
     
    select t.CustomerCode, t.Rn, t.OrderId, t.Qty, p.RunningTotal + t.Qty
    from t t
    join r p on p.CustomerCode = t.CustomerCode and t.rn = p.rn + 1
     
)
select R.CustomerCode, R.OrderId, R.Qty, R.RunningTotal from r
order by R.CustomerCode, R.OrderId 
option(maxrecursion 0);
...than temporary table:
create table #t
(
    RN INT,
    OrderID int,
    CustomerCode varchar(50),
    Qty int,
    primary key(CustomerCode, RN)
);

insert into #t(RN, OrderID, CustomerCode, Qty)
select ROW_NUMBER() over(partition by CustomerCode order by OrderID) as rn, * from test;

with T AS
(
    select * from #t
)
,R(CustomerCode, Rn, OrderId, Qty, RunningTotal) as
(
    select CustomerCode, Rn, OrderID, Qty, Qty
    from t 
    where rn = 1
     
    union all
     
    select t.CustomerCode, t.Rn, t.OrderId, t.Qty, p.RunningTotal + t.Qty
    from t t
    join r p on p.CustomerCode = t.CustomerCode and t.rn = p.rn + 1
     
)
select R.CustomerCode, R.OrderId, R.Qty, R.RunningTotal from r
order by R.CustomerCode, R.OrderId 
option(maxrecursion 0);


drop table #t
It's not possible to create a temporary table inside a function, only a table variable is possible:
create function TestRunningTotalMaterialized()
returns @ReturnTable table(
    CustomerCode varchar(50), OrderId int, Qty int, RunningTotal int not null, RN int identity(1,1) not null
)
as
begin

    declare @t table
    (
        RN INT,
        OrderID int,
        CustomerCode varchar(50),
        Qty int,
        primary key(CustomerCode, RN)
    );

    insert into @t(RN, OrderID, CustomerCode, Qty)
    select ROW_NUMBER() over(partition by CustomerCode order by OrderID) as rn, * from test;

    
    with T AS
    (
        select * from @t
    )
    ,R(CustomerCode, Rn, OrderId, Qty, RunningTotal) as
    (
        select CustomerCode, Rn, OrderID, Qty, Qty
        from t 
        where rn = 1
     
        union all
     
        select t.CustomerCode, t.Rn, t.OrderId, t.Qty, p.RunningTotal + t.Qty
        from t t
        join r p on p.CustomerCode = t.CustomerCode and t.rn = p.rn + 1
     
    )
    insert into @ReturnTable(CustomerCode, OrderId, Qty, RunningTotal)
    select R.CustomerCode, R.OrderId, R.Qty, R.RunningTotal from r
    order by R.CustomerCode, R.OrderId 
    option(maxrecursion 0);
    
    return;
end;

In terms of performance, even we use table variable on Hybrid Recursive CTE, things stay the same: Quirky Update > Hybrid Recursive CTE > Cursor

1 comment:

  1. I was going to suggest that you had written the Qirky Update incorrectly but I see you've discovered the joy there. There are a couple of other things you might want to employ for even greater speed like using the TABLOCK hint and forcing no parallelism.

    ReplyDelete