Sunday, July 29, 2012

From raw parameters to structured data

To avoid stringly-typed processing, convert string-based parameters to structured data. Then do further processing on that internal data structure, where the information can be easily processed.


public class Node
{
    public Node ChildOf { get; set; }
    public string Path { get; set; }

    public IList<Node> Children { get; set; }

}



public static class Helper
{

    public static Node ArrayToTree(params string[] paths)
    {
        Node root = new Node
        {
            ChildOf = null,
            Path = null,
            Children = new List<Node>()
        };

        

        foreach (string path in paths)
        {
            string[] subPaths = path.Split('.');

            Node addTo = root;

            foreach (string subPath in subPaths)
            {
                Node child = addTo.Children.FirstOrDefault(x => x.Path == subPath);

                if (child == null)
                {
                    child = new Node
                    {
                        ChildOf = addTo,
                        Path = subPath,
                        Children = new List<Node>()
                    };

                    addTo.Children.Add(child);
                }//if

                addTo = child;
            }//foreach

        }//foreach


        return root;
    }
}


Unit test:


public void TestStructuredData()
{
    Node n = Helper.ArrayToTree(
        new[]
        { 
            "Answers", 
            "Answers.Comments", 
            "Answers.Greats", 
            "Answers.Balls.Of.Fire", 
            "Comments" 
        
        });


    Assert.AreEqual(2, n.Children.Count);
    Assert.AreEqual(3, n.Children[0].Children.Count);


    Assert.AreEqual("Answers", n.Children[0].Path);

    Assert.AreEqual("Comments", n.Children[0].Children[0].Path);
    Assert.AreEqual("Greats", n.Children[0].Children[1].Path);
    Assert.AreEqual("Balls", n.Children[0].Children[2].Path);
    Assert.AreEqual("Of", n.Children[0].Children[2].Children[0].Path);
    Assert.AreEqual("Fire", n.Children[0].Children[2].Children[0].Children[0].Path);

    Assert.AreEqual("Comments", n.Children[1].Path);
}


Avoid stringly-typed programming, use proper data structure.

http://www.google.com/search?q=stringly-typed+programming+-strongly

Sunday, July 15, 2012

CASE WHEN ELSE is faster than OR approach

Common denominator feature. When ideals get trumped by reality.


I encountered an SQL query where I thought it would be ideal to make it ORM-friendly, so when time comes you need to port it to an ORM query, porting shall be easier. OR so I thought.


I saw this question on http://stackoverflow.com/questions/10646018/can-you-use-case-to-specify-a-join-to-use-in-mysql:

I am wondering if it is possible to use some sort of case in a mysql query to effectively achieve the following and reduce the amount of data that has to be looked up each time ..

SELECT um.message, UNIX_TIMESTAMP(um.time), um.read, user.username
FROM usermessages um
CASE
WHEN um.friendId = 1
INNER JOIN user ON um.sourceUserId = user.id
WHEN um.sourceUserId = 1
INNER JOIN user ON um.friendId = user.id
END
WHERE (um.friendId = 1 OR um.sourceUserId = 1)



I offered this solution: http://stackoverflow.com/questions/10646018/can-you-use-case-to-specify-a-join-to-use-in-mysql/10646233#10646233

SELECT um.message, UNIX_TIMESTAMP(um.time), um.read, user.username
FROM usermessages um
INNER JOIN user ON 
    (um.friendId = 1 AND um.sourceUserId = user.id)
    OR
    um.friendId = user.id
WHERE (um.friendId = 1 OR um.sourceUserId = 1);


Then another stackoverfellow offer this solution: http://stackoverflow.com/questions/10646018/can-you-use-case-to-specify-a-join-to-use-in-mysql/10646078#10646078


SELECT um.message, UNIX_TIMESTAMP(um.time), um.read, user.username
FROM usermessages um
INNER JOIN user ON 
    CASE 
        WHEN um.friendId = 1 THEN um.sourceUserId
                             ELSE um.friendId
    END = user.id
WHERE (um.friendId = 1 OR um.sourceUserId = 1)



I disliked his solution initially, as I embrace ORM more and more, I'm of the opinion that we should write portable code, and what's more portable than writing the query with constructs that exists on both SQL and ORM? I disliked the CASE WHEN approach to the problem as it don't have any analogous construct on ORM, Linq in particular.

With that in mind, I tried to dissuade the user from using the CASE WHEN approach. But we need to wise-up though, portability is not a panacea that can magically make our applications have more customers. How can a customer be happy if bad application performance spoils the fun on using your application. With this in mind, I tried to profile both queries to see which one is faster. I'm expecting the OR to be slower(RDBMS don't do short-circuit, it uses cost-based analysis on determining how it should perform the query) than CASE WHEN, but I'm not expecting very big difference in performance, then much to my chagrin when I see a staggering difference in performance, the CASE WHEN approach can finish the job in 88 milliseconds, while the OR approach took 4.7 seconds. The performance of OR is not acceptable




CASE WHEN approach's Query Cost is 3% only, while the OR approach is 97%. So there goes our portability ideals, it's not ideal. We need to tap the strengths of a given tool in order to get our job done in an efficient manner. I will not be dead set anymore to make a given code construct be very compatible on another platform or API.



Expand to see the code used for benchmarking: http://www.sqlfiddle.com/#!6/29531/2
create table usermessages(
  message varchar(30),
  friendId int,
  sourceUserId int
);

create table "user"(
  userId int identity(1,1) primary key,
  userName varchar(30)
);


create index ix_usermessages__friendid 
on usermessages(friendid);

create index ix_usermessages__combined
on usermessages(friendid,sourceUserId);


create index ix_usermessages__sourceuserid
on usermessages(sourceuserid);



insert into usermessages
select 'hello',row_number() over(order by x.message_id) % 50,1 from sys.messages x;


insert into "user"(userName)
select 'hello' from sys.messages;


select count(*)
from usermessages um
join "user" u

on
(case when um.friendId = 1 then um.sourceUserId else um.friendId end) = u.userId;


select count(*)
from usermessages um
join "user" u

on 
(um.friendId = 1 and um.sourceUserId = u.userId)
or 
(um.friendId = u.userId);

Wednesday, July 11, 2012

Monitoring perfect attendance. Consecutive calendar date query

In this article, I will show you how to get the longest attendance streak a given student had achieved. This will also take the weekends and holidays into consideration.





Given a student has this attendance:

July 3 Tuesday
July 4 Holiday -- Independence Day
July 5 Thursday
July 6 Friday
July 9 Monday
July 12 Thursday
July 13 Friday


The query should show the student had achieved 4 consecutive attendance on these dates:
July 3 Tuesday
July 5 Thursday
July 6 Friday
July 9 Monday


July 4 is a holiday and should not affect or penalize the student's attendance performance, the query should show the student has good four consecutive attendance from July 3 to July 9; likewise, the weekends(Saturday and Sunday) should not have an effect on attendance performance.


Student gets an attendance performance of 11 from July 18 to August 1 when she attend classes on following dates:

July 18 Wednesday
July 19 Thursday
July 20 Friday
July 23 Monday
July 24 Tuesday
July 25 Wednesday
July 26 Thursday
July 27 Friday
July 30 Monday
July 31 Tuesday
August 1 Wednesday
August 3 Friday


How to determine if the given dates are consecutive? This is a solved problem, and it has a terminology to it, it's called islands and gaps algorithm.

The core thing with islands and gaps problem is we must have a mechanism to group consecutive rows so in the final query we can count them. One mechanism is to use row_number windowing function, though row numbering approach will fail if there's duplicate rows, row_number would suffice if the rows are unique, e.g. attendance.

So given this numbers, we have 5 consecutive numbers(1 to 5) and another 3 consecutive numbers(8 to 10):
1
2
3
4
5
8
9
10
12


With row_number we can determine the consecutive-ness of those numbers
N    R#
1    1
2    2   
3    3 
4    4
5    5
8    6
9    7
10   8
12   9


Subtract row number from N, this will give us related rows
N    R#   GRP
1    1    0
2    2    0
3    3    0
4    4    0
5    5    0
8    6    2
9    7    2
10   8    2
12   9    3


Given the GRP column, we can do this query to count the longest attendance streak a given student has:


with a as
(
  select n, n -  row_number() over(order by n) as grp
  from tbl
)
select 
  *
  ,dense_rank() over(order by grp)  as nth_streak
  ,count(*) over(partition by grp)  as streak
from a;

|  N | GRP | NTH_STREAK | STREAK |
----------------------------------
|  1 |   0 |          1 |      5 |
|  2 |   0 |          1 |      5 |
|  3 |   0 |          1 |      5 |
|  4 |   0 |          1 |      5 |
|  5 |   0 |          1 |      5 |
|  8 |   2 |          2 |      3 |
|  9 |   2 |          2 |      3 |
| 10 |   2 |          2 |      3 |
| 12 |   3 |          3 |      1 |    


Now, for the final hurdle to tackle, if 5 is Friday, 6 and 7 are Saturday and Sunday respectively. So on the above data, the student should have an 8 consecutive attendance instead of separate 5 and 3. To do that, we need to shift all the date if it is over 7.


Here's the formula for that:

n - ((n/7) * 2) - row_number()over(order by n) as grp


To illustrate, use this query:

with a as
(
  select n, n - ((n/7) * 2) as shift_up, row_number() over(order by n) as rn
  from tbl
)
select 
  *
  ,n - rn as grp
from a;

Here's the output:

|  N | SHIFT_UP | RN | GRP |
----------------------------
|  1 |        1 |  1 |   0 |
|  2 |        2 |  2 |   0 |
|  3 |        3 |  3 |   0 |
|  4 |        4 |  4 |   0 |
|  5 |        5 |  5 |   0 |
|  8 |        6 |  6 |   2 |
|  9 |        7 |  7 |   2 |
| 10 |        8 |  8 |   2 |
| 12 |       10 |  9 |   3 |    

Notice that the value Monday's value(8) shifted to 6, likewise with Tuesday(9, this shifted to 7) and so on.

With this logic at hand, we can now easily determine the consecutive-ness of adjacent rows. We just need to subtract row_number from SHIFT_UP in order to make the current Monday consecutive to last week's Friday. This is the query:

with a as
(
  select n, n - ((n/7) * 2) - row_number()over(order by n) as grp
  from tbl
)
select 
  *
  ,dense_rank() over(order by grp)  as nth_streak
  ,count(*) over(partition by grp)  as streak
from a;


This is the output:
|  N | GRP | NTH_STREAK | STREAK |
----------------------------------
|  1 |   0 |          1 |      8 |
|  2 |   0 |          1 |      8 |
|  3 |   0 |          1 |      8 |
|  4 |   0 |          1 |      8 |
|  5 |   0 |          1 |      8 |
|  8 |   0 |          1 |      8 |
|  9 |   0 |          1 |      8 |
| 10 |   0 |          1 |      8 |
| 12 |   1 |          2 |      1 |    


It now yields the date from last week's Monday to Friday(1 to 5) and this week's Monday to Friday(8 to 10) as consecutive 8 days(indicated by GRP=0). Hence yielding a 5+3 consecutive attendance, this yields 8 consecutive attendance.


Another way to visualize how the date are being shifted(to determine date consecutive-ness) two days back:

Original data:
                  Weekends are not in attendance(e.g. 6,7,13,14)
 1  2  3  4  5     6  7
 8  9 10 11 12    13 14
15

If a given date falls on 15th, this will be computed as 15/7 * 2 == 4; then subtract 4 from the original number, 15 - 4 == 11. 15 will become the 11th day. Likewise the 8th day becomes the 6th day; 8 - (8/7 * 2) == 6. The above data yields the following output when the computation is applied:

 1  2  3  4  5    
 6  7  8  9 10    
11

The above attendance yields 11 consecutive good attendance. For holidays, you need to slot them on attendance, so the consecutive-ness of dates could be easily determined, then just remove them from the final query.


To apply the above logic on determining consecutive attendance even there are holidays and weekends, use this:


drop table tx;
drop table holiday;

create table tx
(
i int identity(1,1) not null primary key,
n varchar(10), d date,
constraint ux_tx unique(n,d)
);

insert into tx(n,d) values
('john','2012-7-3'),
('john','2012-7-5'),
('john','2012-7-6'),
('john','2012-7-9'),
('john','2012-7-12'),
('john','2012-7-13'),
('john','2012-7-16'),
('john','2012-7-17'),
('john','2012-7-18'),
('john','2012-7-20'),
('john','2012-7-30'),
('john','2012-7-31'),

('paul','2012-7-3'),
('paul','2012-7-5'),
('paul','2012-7-18'),
('paul','2012-7-19'),
('paul','2012-7-20'),
('paul','2012-7-23'),
('paul','2012-7-24'),
('paul','2012-7-25'),
('paul','2012-7-26'),
('paul','2012-7-27'),
('paul','2012-7-30'),
('paul','2012-7-31'),
('paul','2012-8-1'),
('paul','2012-8-3'),
('paul','2012-8-6'),
('paul','2012-8-7');

create table holiday(d date);

insert into holiday(d) values
('2012-7-4');



--- query

with first_date as
(
 select dateadd( ww, datediff(ww,0,min(d)), 0 ) as first_date -- get the monday of the earliest date
 from tx 
)
,shifted as
(
 select 
  tx.n, tx.d, 
    
  diff = datediff(day, fd.first_date, tx.d) - (datediff(day, fd.first_date, tx.d)/7 * 2) 
  
 from tx
 cross join first_date fd
 union
 select 
  xxx.n, h.d, 
  
  diff = datediff(day, fd.first_date, h.d) - (datediff(day, fd.first_date, h.d)/7 * 2) 
 from holiday h 
 cross join first_date fd
 cross join (select distinct n from tx) as xxx
)
,grouped as
(
 select 
  *
  , grp = diff - row_number() over(partition by n order by d)
 from shifted
)
select 
    
    -- remove staging columns from the output...
    -- * 
    
    -- ...just output what the user will see:
    d, n
    
 ,dense_rank() over (partition by n order by grp) as nth_streak
 ,count(*) over (partition by n, grp) as streak
from grouped
where d not in (select d from holiday)  -- remove the holidays

Output:
|          D |    N | NTH_STREAK | STREAK |
-------------------------------------------
| 2012-07-03 | john |          1 |      4 |
| 2012-07-05 | john |          1 |      4 |
| 2012-07-06 | john |          1 |      4 |
| 2012-07-09 | john |          1 |      4 |
| 2012-07-12 | john |          2 |      5 |
| 2012-07-13 | john |          2 |      5 |
| 2012-07-16 | john |          2 |      5 |
| 2012-07-17 | john |          2 |      5 |
| 2012-07-18 | john |          2 |      5 |
| 2012-07-20 | john |          3 |      1 |
| 2012-07-30 | john |          4 |      2 |
| 2012-07-31 | john |          4 |      2 |
| 2012-07-03 | paul |          1 |      2 |
| 2012-07-05 | paul |          1 |      2 |
| 2012-07-18 | paul |          2 |     11 |
| 2012-07-19 | paul |          2 |     11 |
| 2012-07-20 | paul |          2 |     11 |
| 2012-07-23 | paul |          2 |     11 |
| 2012-07-24 | paul |          2 |     11 |
| 2012-07-25 | paul |          2 |     11 |
| 2012-07-26 | paul |          2 |     11 |
| 2012-07-27 | paul |          2 |     11 |
| 2012-07-30 | paul |          2 |     11 |
| 2012-07-31 | paul |          2 |     11 |
| 2012-08-01 | paul |          2 |     11 |
| 2012-08-03 | paul |          3 |      3 |
| 2012-08-06 | paul |          3 |      3 |
| 2012-08-07 | paul |          3 |      3 |    

Live test: http://www.sqlfiddle.com/#!3/709a0/1

DATEFIRST settings side-effect on DATEPART

Upon reading this comment:

DATEPART(DW is dependant upon @@datefirst – Martin Smith Aug 24 '11 at 0:03


I tried this code:


set nocount on;

declare @d date = '2012-07-4';
 
set language italian;
-- SET DATEFIRST 1; -- Italy is Monday-first week. this is equivalent to above
select @@DATEFIRST as df;
select DATEPART(DW, @d) as x;
SELECT DATEADD(ww, DATEDIFF(ww,0,@d), 0) as y;
SELECT DATEADD(DD, 1 - DATEPART(DW, @d), @d) as z;
 
set language us_english;
-- SET DATEFIRST 7; -- US is Sunday-first week. this is equivalent to above
select @@DATEFIRST as df;
select DATEPART(DW, @d) as x;
SELECT DATEADD(ww, DATEDIFF(ww,0,@d), 0) as y;
SELECT DATEADD(DD, 1 - DATEPART(DW, @d), @d) as z;


Output
L'impostazione della lingua è stata sostituita con Italiano.
df
----
1

x
-----------
3

y
-----------------------
2012-07-02 00:00:00.000

z
----------
2012-07-02

Changed language setting to us_english.
df
----
7

x
-----------
4

y
-----------------------
2012-07-02 00:00:00.000

z
----------
2012-07-01

There's a discrepancy(the last queries, z column) when using datepart on obtaining first day of week. On countries(e.g. USA) that start their calendar week with Sunday, the first day of week for July 4, 2012 is July 1, 2012(Sunday). For countries that start their calendar week with Monday, the first day of week for July 4, 2012 is July 2, 2012(Monday).


To avoid this discrepancy, instead of using date functions(e.g. DATEPART) that has a dependency on @@DATEFIRST settings, use date functions that doesn't rely on @@DATEFIRST settings e.g. DATEDIFF