Wednesday, July 31, 2013

Why so fond of UNION?



I'm wearing my sargchitect hat in our current sprint. Aside from improving the speed of the existing queries by making them sargable and devoid of unintended cartesian-yielding joins, the optimization team need to make sure the queries are properly refactored, easy to maintain, smooth and suave to the eyes, so the developers could have a pleasant time reading and understanding the queries.


I saw some UNION queries that can be made simpler using OR. I don't think it's ok to have a union of the same table, union like that tend to be DRY-violating, tend to have many lines, hence harder to debug and harder to optimize


I'm thinking, why the inordinate fondness for UNION if the query can be made simpler with OR? Last I checked, Microsoft haven't disabled the OR functionality. So let's rejoice to the fact that Microsoft don't have any plans to disable OR functionality for the foreseeable future.




Happy Sarging! Scratch that, Happy SARGing! ツ

Thursday, July 25, 2013

Tuple Design Pattern for SQL Server

If your RDBMS doesn't support tuples (just an RDBMS-fancy speak for record), instead of this:
SELECT a,b FROM aTable
WHERE 
    (aTable.a,aTable.b) IN
    
    (SELECT anotherTable.a,anotherTable.b 
    FROM anotherTable
    WHERE anotherTable.IsActive = 1);


You have to do this:
SELECT a,b FROM aTable
WHERE     
    EXISTS
    (
        SELECT *
        FROM anotherTable
        WHERE anotherTable.a = aTable.a AND anotherTable.b = aTable.b
            AND anotherTable.IsActive = 1
    );


I prefer to write it this way though, so as to make the intent clearer:
SELECT a,b FROM aTable
WHERE 
-- (aTable.a,aTable.b) IN -- leave this commented, it makes the intent more clear
    EXISTS
    (
        SELECT anotherTable.a,anotherTable.b -- do not remove this too, perfectly fine for self-documenting code, i.e., tuple presence check
        FROM anotherTable
        WHERE anotherTable.a = aTable.a AND anotherTable.b = aTable.b
        
            AND anotherTable.IsActive = 1 -- put a blank line above, to emphasize that the above condition is filter for the tuple we are looking for
    );


Design patterns are bug reports against your programming language — Peter Norvig



Happy Computing! ツ

Debunking the myth that JOIN is faster than IN

We see now that contrary to the popular opinion, IN / EXISTS queries are not less efficient than a JOIN query in SQL Server. In fact, JOIN queries are less efficient on non-indexed tables, since Semi Join methods allow aggregation and matching against a single hash table, while a JOIN needs to do these two operations in two steps. -- http://explainextended.com/2009/06/16/in-vs-join-vs-exists/




Converted this JOIN (with DISTINCT) … (From 1 minute and 24 seconds) :

INNER JOIN dbo.fn_ssrs_ConstrainID(@LanguageCultureCode,@NeutralLanguageCode,@UserID,@PersonID,@Application,@Module,'PEOPLE','Review') cons
 
ON rpp.PersonID = cons.id


…to IN expression (optimized, down to 43 seconds) :

and rpp.PersonID in
    (select cons.id
    from dbo.fn_ssrs_ConstrainID(@LanguageCultureCode,@NeutralLanguageCode,@UserID,@PersonID,@Application,@Module,'PEOPLE','Review') cons)


Advised the two colleagues to further optimize the query's existence filters by converting all the JOIN+DISTINCT combo to IN. They were able to optimized the original query of 1 minute and 24 seconds down to 1 second.



Using the same technique as above, another colleague and I were able to optimized a stored proc that is very crucial to the web app’s performance, as that stored proc determines the visibility of some buttons on the web page. The original stored proc was 5 seconds slow, on one page there are two buttons that are using the stored proc, that makes the web page to appear only after 10 seconds or so, we were able to tame it down to sub-second response time just by converting some of the JOINs to INs.





Happy Computing! ツ

Wednesday, July 24, 2013

SQL Server GREATEST function optimization

PostgreSQL has a GREATEST function, SQL Server don't have a built-in equivalent of it. So we have to create one ourselves. This was the first iteration:

CREATE FUNCTION dbo.fn_GetTheLatestDate
(
    @DateTime1 SMALLDATETIME = NULL,
    @DateTime2 SMALLDATETIME = NULL,
    @DateTime3 SMALLDATETIME = NULL,
    @DateTime4 SMALLDATETIME = NULL,
    @DateTime5 SMALLDATETIME = NULL,
    @DateTime6 SMALLDATETIME = NULL,
    @DateTime7 SMALLDATETIME = NULL,
    @DateTime8 SMALLDATETIME = NULL
)
RETURNS SMALLDATETIME
AS
BEGIN
    DECLARE @temp TABLE (dt SMALLDATETIME)
     
    IF @DateTime1 IS NOT NULL
    BEGIN
        INSERT INTO @temp ( dt )
        SELECT @DateTime1
    END
     
    IF @DateTime2 IS NOT NULL
    BEGIN
        INSERT INTO @temp ( dt )
        SELECT @DateTime2
    END
    IF @DateTime3 IS NOT NULL
    BEGIN
        INSERT INTO @temp ( dt )
        SELECT @DateTime3
    END
     
    IF @DateTime4 IS NOT NULL
    BEGIN
        INSERT INTO @temp ( dt )
        SELECT @DateTime4
    END
     
    IF @DateTime5 IS NOT NULL
    BEGIN
        INSERT INTO @temp ( dt )
        SELECT @DateTime5
    END
     
    IF @DateTime6 IS NOT NULL
    BEGIN
        INSERT INTO @temp ( dt )
        SELECT @DateTime6
    END
     
    IF @DateTime7 IS NOT NULL
    BEGIN
        INSERT INTO @temp ( dt )
        SELECT @DateTime7
    END
     
    IF @DateTime8 IS NOT NULL
    BEGIN
        INSERT INTO @temp ( dt )
        SELECT @DateTime8
    END
     
         
    -- Return the result of the function
    RETURN (SELECT MAX(dt) FROM @temp)
END

The query that uses that function took 4 seconds to complete, and that function has an execution cost of 13% on that query



Sans table variable:

CREATE FUNCTION dbo.fn_GetTheLatestDate
(
    @DateTime1 SMALLDATETIME = NULL,
    @DateTime2 SMALLDATETIME = NULL,
    @DateTime3 SMALLDATETIME = NULL,
    @DateTime4 SMALLDATETIME = NULL,
    @DateTime5 SMALLDATETIME = NULL,
    @DateTime6 SMALLDATETIME = NULL,
    @DateTime7 SMALLDATETIME = NULL,
    @DateTime8 SMALLDATETIME = NULL
)
RETURNS SMALLDATETIME
AS
BEGIN
    
    
    return
    (
        
        select max(v.d)
        from
        (
            select @DateTime1 as d
            where @DateTime1 is not null
            
            union all
            select @DateTime2
            where @DateTime2 is not null
 
            union all
            select @DateTime3
            where @DateTime3 is not null
 
            union all
            select @DateTime4
            where @DateTime4 is not null
 
            union all
            select @DateTime5
            where @DateTime5 is not null
 
            union all
            select @DateTime6
            where @DateTime6 is not null
            
            union all
            select @DateTime7
            where @DateTime7 is not null
 
            union all
            select @DateTime8
            where @DateTime8 is not null
        ) as v
    )        
    
 
END

The query that uses that function took 0 second to complete, and that function has an execution cost of 0% on that query

Tuesday, July 23, 2013

Debunking the myth that CTE is slow

There are some folks who believe that CTE is slow. I don't know where they got that impression. We might as well say that our SELECT statement is slow because our code is using er.. SELECT statement. And there are some folks who believe CTEs underpinning is a cursor, really, no joking. I'm starting to lose faith in humanity ^_^ lol


For those who believe CTE is made out of cursor, here's how to partner cursor to a CTE:

http://stackoverflow.com/questions/14601407/using-a-cursor-with-a-cte


So if one can define a cursor out of CTE, what is a cursor on CTE? a cursor on cursor? Head-scratching! :D There's a lot of engineering effort that goes into CTE and all some folks can say about CTE is that it is a cursor, this kind of folks is dangerous to our industry. It might be their excuse for not learning CTE and windowing functions, which by the way can be used separately, but commonly seen together as they come out at the same time on most RDBMSes.


As with derived tables, CTEs allow you to create complex queries that would otherwise require the use of temporary tables and cursors, which can be inefficient. CTEs remove the overhead of creating and dropping those temporary tables, and of the INSERT and UPDATE statements required to populate them. This can make for faster execution, though you should always compare all available options when performance is critical -- http://www.blackwasp.co.uk/SQLCTEs.aspx


What we do with CTE is what determines its performance. CTE is nothing but just a programming functionality that makes our query more manageable, it helps when we are simplifying very complex and slow queries; and granted that you are doing the right approach, it aids the developer on making the query performant.


If you have tried programming with C or C++, you might be familiar with macro. Yes, CTE is just like that, a macro. It is expanded to full form when you use it. It will not magically make your query faster nor slower.



CTE is just an inline view:

with product_count_query as
(
    select product_id, product_count = count(*)
    from tbl
    group by product_id
)

select p.*, the_count = isnull(c.product_count, 0)
from product p 
left join product_count_query c -- derive table from CTE
on p.product_id = c.product_id;

CTE is just a shorthand for an explicit view:

create view product_count_query as

    select product_id, product_count = count(*)
    from tbl
    group by product_id;
go


select p.*, the_count = isnull(c.product_count, 0)
from product p 
left join product_count_query c -- derive table from VIEW
on p.product_id = c.product_id;


A view, and a CTE for that matter, don't have their own execution plan, both are ultimately just being expanded and bolted to the main query:

select p.*, the_count = isnull(c.product_count, 0)
from product p
left join
(
    select product_id, product_count = count(*)
    from tbl
    group by product_id
) as c -- derive table from an inline view, er.. a CTE? ;-)
 on p.product_id = c.product_id


All of them, same banana, exactly the same execution plan, same speed.


If you believe CTE is just a cursor, you might as well believe that view is just a cursor too. A darndest and silly belief.


Time to stop spewing the nonsense that CTE is just a cursor, all those three queries have exactly the same execution plan and same speed. I don't know what magic sauce that could give the table-deriving query a performance boost over its CTE cousin query, there's none.


If you believe that there's something inherently slow with CTE, thus you prefer to write your query using table-deriving approach because you believe it is faster than CTE, you are merely doing Cargo Cult Programming. You are depriving yourself of the benefits of abstractions, readability and manageability of CTE offers to the table by having a cult-like belief that CTE is slow.


It's what you do with CTE that makes it slow or fast. If you will use JOIN+DISTINCT combo on a CTE or any form of query abstraction for that matter, expect bad performance and rewrites.


CTE is just one of the vast array of SQL tools at your disposal that can make your daily craft easier. You don't need to use it exclusively, you can mix-and-match it with other SQL tools too.


So what makes a query fast?


Start with the right assumption, it's just a macro:
@crokusek they(CTE) don't get materialized at all AFAIK - they are just a table expression -- http://dba.stackexchange.com/questions/13112/whats-the-difference-between-a-cte-and-a-temp-table#comment69061_13117



This is what gives CTE a bad name: wrong assumption:

One reason CTE's can be employable for performance reasons is because their (possibly forced) materialization could be more easily parallelized whereby temp tables are generally computed serially before their use. -- http://dba.stackexchange.com/questions/13112/whats-the-difference-between-a-cte-and-a-temp-table#comment69060_13117


Until MATERIALIZED is supported...
with 
T MATERIALIZED AS
(
    select ROW_NUMBER() over(partition by CustomerCode order by OrderID) as rn, * from test
)
,R(CustomerCode, Rn, OrderId, Qty, RunningTotal) as
(
    select CustomerCode, Rn, OrderID, Qty, Qty
    from t 
    where rn = 1
     
    union all
     
    select t.CustomerCode, t.Rn, t.OrderId, t.Qty, p.RunningTotal + t.Qty
    from t t
    join r p on p.CustomerCode = t.CustomerCode and t.rn = p.rn + 1
     
)
select R.CustomerCode, R.OrderId, R.Qty, R.RunningTotal from r
order by R.CustomerCode, R.OrderId 
option(maxrecursion 0);

...we should mix-and-match CTE with temporary table to give our query a performance boost. That is, there are times we need to materialize the result of a query, e.g., we should materialize the results of a row_numbering query to an actual table(temporary table or variable table will do) first when we wanted to use that row-numbered set to a recursive query, this is fast:

select ROW_NUMBER() over(partition by CustomerCode order by OrderID) as rn, * into #xxx 
from test;
 
with 
T AS
(
    select * from #xxx
)
,R(CustomerCode, Rn, OrderId, Qty, RunningTotal) as
(
    select CustomerCode, Rn, OrderID, Qty, Qty
    from t 
    where rn = 1
     
    union all
     
    select t.CustomerCode, t.Rn, t.OrderId, t.Qty, p.RunningTotal + t.Qty
    from t t
    join r p on p.CustomerCode = t.CustomerCode and t.rn = p.rn + 1
     
)
select R.CustomerCode, R.OrderId, R.Qty, R.RunningTotal from r
order by R.CustomerCode, R.OrderId 
option(maxrecursion 0);
 
drop table #xxx;

Happy Computing! ツ