Thursday, September 26, 2013

Debunking the myth that CTE is just a cursor

Some devs who have a fervent belief that a CTE is just a cursor got me worked up.


Hearing that broken theory from them, I decided to create a proof-of-concept that would prove otherwise. As we only have a limited number of keystrokes left in our hands, I decided to practice first at play.typeracer.com; being able to type fast is certainly a win when one is worked up enough to quickly prove something by making a proof-of-concept and blogging it right after, all in one sitting. So that's how seriously worked up I am to prove something lol


Heheh, scratch that. I already made an analysis and a proof an eon ago that a CTE is not slow.


A CTE is just an inline view, which in turn is just a table-deriving query. Blogged it here: http://www.ienablemuch.com/2013/07/debunking-myth-that-cte-is-slow.html


A CTE is not slow; not understanding the intent and how the query works is what makes a query slow. Blogged it here: http://www.ienablemuch.com/2012/05/recursive-cte-is-evil-and-cursor-is.html



Please do note that a **recursive** CTE is executed sequentially, hence the "cursor-like" belief of some devs. However, the recursive CTE's sequential steps are not performed through a cursor. A CTE's loop runs inside the database engine, which is written in C++, and that is what makes a CTE faster; a CURSOR's loop, on the other hand, runs in T-SQL, and T-SQL loops are slow. Even a C# loop is an order of magnitude faster than a T-SQL loop. Push the loop down to C++, i.e., push the loop into CTEs. For a non-recursive CTE, rest assured that the query still operates in a set-based manner, not in a cursor-like manner.
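To make the contrast concrete, here is a minimal sketch that generates the numbers 1 to 1000 both ways (a WHILE loop stands in for the cursor here, since it iterates in interpreted T-SQL the same way a cursor loop does):

```sql
-- Recursive CTE: the iteration happens inside the engine.
with Numbers as
(
    select 1 as N
    union all
    select N + 1 from Numbers where N < 1000
)
select N from Numbers
option (maxrecursion 1000); -- default recursion limit is 100

-- T-SQL loop equivalent: each iteration is interpreted T-SQL.
declare @n int = 1;
declare @result table (N int);
while @n <= 1000
begin
    insert into @result (N) values (@n);
    set @n = @n + 1;
end
select N from @result;
```

On non-trivial row counts, the gap between the two grows quickly in favor of the CTE.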



Time to stop spewing the nonsense that CTE is just a cursor.



Happy Coding!

Monday, September 16, 2013

Using Partial Index On SQL Server To Enforce Uniqueness On Denormalized Design

In my last post, I showed how, using a partial index (aka filtered index), we can still enforce a unique constraint even when soft deletion is the chosen mechanism for deleting records.


We will again put a unique filtered index to good use. Given this table:

create table PeriodReviewers
(
     PeriodId int not null references Period(PeriodId),
     RevieweePersonId int not null references Person(PersonId),
     ReviewerPersonId int not null references Person(PersonId),

     IsPrimaryReviewer bit not null default(0),

     constraint pk_PeriodReviewers primary key(PeriodId, RevieweePersonId, ReviewerPersonId)
);


These are all valid data:

PeriodId    RevieweePersonId    ReviewerPersonId    IsPrimaryReviewer
1           Marcus              Buddy               1
1           Marcus              Raymund             0
1           Marcus              Ely                 0
1           Buddy               Raymund             1
1           Buddy               Ely                 0
1           Raymund             Ely                 1

Invalid data such as the following can be blocked:

PeriodId    RevieweePersonId    ReviewerPersonId    IsPrimaryReviewer
1           Marcus              Buddy               1
1           Marcus              Raymund             0   
1           Marcus              Raymund             0   <-- this is blocked, two Raymunds
1           Marcus              Ely                 0
1           Buddy               Raymund             1
1           Buddy               Ely                 0
1           Raymund             Ely                 1
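The blocking above comes straight from the composite primary key; a quick sketch with hypothetical integer ids (the names in the tables above stand in for PersonId values):

```sql
-- A reviewer row for reviewee 10 in period 1: succeeds.
insert into PeriodReviewers (PeriodId, RevieweePersonId, ReviewerPersonId, IsPrimaryReviewer)
values (1, 10, 30, 0);

-- The exact same (PeriodId, RevieweePersonId, ReviewerPersonId) triple again:
-- fails with a primary key violation on pk_PeriodReviewers.
insert into PeriodReviewers (PeriodId, RevieweePersonId, ReviewerPersonId, IsPrimaryReviewer)
values (1, 10, 30, 0);
```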

However, the above unique constraint can't defend the system against invalid data such as the following:

PeriodId  RevieweePersonId    ReviewerPersonId    IsPrimaryReviewer
1           Marcus            Buddy               1
1           Marcus            Raymund             1 <-- should error, only one primary reviewer for each person
1           Marcus            Ely                 0
1           Buddy             Raymund             1
1           Buddy             Ely                 0
1           Raymund           Ely                 1

There should be only one primary reviewer for each reviewee, but since our unique constraint is applied to these three fields, PeriodId + RevieweePersonId + ReviewerPersonId, invalid data such as the above could creep into our system.


This is the ideal solution:

create table PeriodReviewers
(
     PeriodId int not null references Period(PeriodId),
     RevieweePersonId int not null references Person(PersonId),
     ReviewerPersonId int not null references Person(PersonId),

     constraint pk_PeriodReviewers primary key(PeriodId, RevieweePersonId, ReviewerPersonId)
);
 
create table PeriodPrimaryReviewers
(
     PeriodId int not null references Period(PeriodId),
     RevieweePersonId int not null references Person(PersonId),
     ReviewerPersonId int not null references Person(PersonId),

     constraint pk_PeriodPrimaryReviewers primary key(PeriodId, RevieweePersonId)
);

That's a fully normalized design. The advantage of normalization is that we save space otherwise spent on redundant data. In our business case there can be only one primary reviewer for each person in a given period, so if a person has, say, 20 reviewers in a period, the other 19 non-primary reviewers would all have IsPrimaryReviewer set to false; those 19 fields are just wasted space, so the IsPrimaryReviewer field should be removed from the table. To normalize, we just create a table that maintains each person's primary reviewer in a given period, as illustrated in the DDL above.
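Under the normalized design, the one-primary-reviewer rule is enforced by the primary key itself; a sketch with hypothetical ids:

```sql
-- First primary reviewer for reviewee 10 in period 1: succeeds.
insert into PeriodPrimaryReviewers (PeriodId, RevieweePersonId, ReviewerPersonId)
values (1, 10, 20);

-- A second primary reviewer for the same reviewee and period:
-- fails, since (PeriodId, RevieweePersonId) is the table's primary key.
insert into PeriodPrimaryReviewers (PeriodId, RevieweePersonId, ReviewerPersonId)
values (1, 10, 30);
```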


Normalize until it hurts, denormalize until it works -- Jeff Atwood

However, fully normalizing the database might not be the norm, nor even an option, so a denormalized schema like the first one shown in this article might be present in your system. As we all know, a denormalized design can be faster for queries. Say we want to list all of a person's reviewers and indicate whether each reviewer is the person's primary reviewer. With the denormalized table it's as simple as:


select
    PeriodId,
    RevieweePersonId,
    ReviewerPersonId,
    IsPrimaryReviewer
from PeriodReviewers


Now, to do that on fully normalized tables:

select 

    emp.PeriodId, 
    emp.RevieweePersonId, 
    emp.ReviewerPersonId, 
    IsPrimaryReviewer =
        case when pri.ReviewerPersonId is not null then
            convert(bit, 1)
        else
            convert(bit, 0)
        end
        
from PeriodReviewers emp
left join PeriodPrimaryReviewers pri 
on  emp.PeriodId = pri.PeriodId
    and emp.RevieweePersonId = pri.RevieweePersonId
    and emp.ReviewerPersonId = pri.ReviewerPersonId


With a heavily normalized design, queries can become slower. A denormalized table makes for faster queries, as there's no need to join tables. However, a denormalized design can cause data integrity problems: a concurrent update, or an improperly implemented application, could let bad data slip in. So in our denormalized design, this bad data could slip in:


PeriodId    RevieweePersonId    ReviewerPersonId    IsPrimaryReviewer
1           Marcus              Buddy               1
1           Marcus              Raymund             1
1           Marcus              Ely                 0
1           Buddy               Raymund             1
1           Buddy               Ely                 0
1           Raymund             Ely                 1


As we can see, Marcus now has two primary reviewers. This kind of error doesn't happen with a normalized database design. But if the denormalized design is already in place, it could be far more costly to redesign the database, and there's no guarantee that fully normalizing the design won't hurt query performance. So how can we prevent bad data like the above from slipping in while keeping the denormalized design?

Enter the partial index, er.. filtered index. With SQL Server's filtered index (available since version 2008), we can create a unique filtered index to prevent two or more primary reviewers per person.


To note, the following, a valid business case, is still allowed even after we add the unique filtered index.

PeriodId    RevieweePersonId    ReviewerPersonId    IsPrimaryReviewer
1           Marcus              Buddy               1
1           Marcus              Ely                 0
1           Marcus              Robin               0


However, adding another reviewer for the same person in the same period whose IsPrimaryReviewer is also set to true should be blocked by the database.

PeriodId    RevieweePersonId    ReviewerPersonId    IsPrimaryReviewer
1           Marcus              Buddy               1
1           Marcus              Ely                 0
1           Marcus              Robin               0
1           Marcus              Raymund             1   <-- adding this fourth row will not be allowed by the database.


To facilitate a robust database design for the scenario above, we just need to add a unique index on primary reviewers only, hence the "filtered index" nomenclature; in most other RDBMSes this is called a partial index. To wit, this will be the table definition:

create table PeriodReviewers
(
     PeriodId int not null references Period(PeriodId),
     RevieweePersonId int not null references Person(PersonId),
     ReviewerPersonId int not null references Person(PersonId),

     IsPrimaryReviewer bit not null default(0),

     constraint pk_PeriodReviewers primary key(PeriodId, RevieweePersonId, ReviewerPersonId)
);

 
create unique index ix_PeriodReviewers_PrimaryReviewer
on PeriodReviewers(PeriodId, RevieweePersonId)
where IsPrimaryReviewer = 1;


We just need to create a unique index on PeriodId and RevieweePersonId covering primary reviewers only. When we try to add another primary reviewer for the same person in the same period, the database stops it; no bad data can creep in. That's how easy it is to prevent bad data on SQL Server. Business logic should be validated in the application layer, but it should be enforced in the database too.
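To see the filtered index in action, here is a sketch with hypothetical ids:

```sql
-- Reviewee 10 gets reviewer 20 as primary reviewer: succeeds.
insert into PeriodReviewers (PeriodId, RevieweePersonId, ReviewerPersonId, IsPrimaryReviewer)
values (1, 10, 20, 1);

-- A non-primary reviewer for the same reviewee: still allowed, because the
-- unique filtered index only covers rows where IsPrimaryReviewer = 1.
insert into PeriodReviewers (PeriodId, RevieweePersonId, ReviewerPersonId, IsPrimaryReviewer)
values (1, 10, 30, 0);

-- A second primary reviewer for the same reviewee and period:
-- fails with a duplicate key error on ix_PeriodReviewers_PrimaryReviewer.
insert into PeriodReviewers (PeriodId, RevieweePersonId, ReviewerPersonId, IsPrimaryReviewer)
values (1, 10, 40, 1);
```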



Happy Computing! ツ