Wednesday, July 11, 2012

Monitoring perfect attendance. Consecutive calendar date query

In this article, I will show you how to get the longest attendance streak a given student had achieved. This will also take the weekends and holidays into consideration.





Given a student has this attendance:

July 3 Tuesday
July 4 Holiday -- Independence Day
July 5 Thursday
July 6 Friday
July 9 Monday
July 12 Thursday
July 13 Friday


The query should show the student had achieved 4 consecutive attendance on these dates:
July 3 Tuesday
July 5 Thursday
July 6 Friday
July 9 Monday


July 4 is a holiday and should not affect or penalize the student's attendance performance, the query should show the student has good four consecutive attendance from July 3 to July 9; likewise, the weekends(Saturday and Sunday) should not have an effect on attendance performance.


Student gets an attendance performance of 11 from July 18 to August 1 when she attend classes on following dates:

July 18 Wednesday
July 19 Thursday
July 20 Friday
July 23 Monday
July 24 Tuesday
July 25 Wednesday
July 26 Thursday
July 27 Friday
July 30 Monday
July 31 Tuesday
August 1 Wednesday
August 3 Friday


How to determine if the given dates are consecutive? This is a solved problem, and it has a terminology to it, it's called islands and gaps algorithm.

The core thing with islands and gaps problem is we must have a mechanism to group consecutive rows so in the final query we can count them. One mechanism is to use row_number windowing function, though row numbering approach will fail if there's duplicate rows, row_number would suffice if the rows are unique, e.g. attendance.

So given this numbers, we have 5 consecutive numbers(1 to 5) and another 3 consecutive numbers(8 to 10):
1
2
3
4
5
8
9
10
12


With row_number we can determine the consecutive-ness of those numbers
N    R#
1    1
2    2   
3    3 
4    4
5    5
8    6
9    7
10   8
12   9


Subtract row number from N, this will give us related rows
N    R#   GRP
1    1    0
2    2    0
3    3    0
4    4    0
5    5    0
8    6    2
9    7    2
10   8    2
12   9    3


Given the GRP column, we can do this query to count the longest attendance streak a given student has:


with a as
(
  select n, n -  row_number() over(order by n) as grp
  from tbl
)
select 
  *
  ,dense_rank() over(order by grp)  as nth_streak
  ,count(*) over(partition by grp)  as streak
from a;

|  N | GRP | NTH_STREAK | STREAK |
----------------------------------
|  1 |   0 |          1 |      5 |
|  2 |   0 |          1 |      5 |
|  3 |   0 |          1 |      5 |
|  4 |   0 |          1 |      5 |
|  5 |   0 |          1 |      5 |
|  8 |   2 |          2 |      3 |
|  9 |   2 |          2 |      3 |
| 10 |   2 |          2 |      3 |
| 12 |   3 |          3 |      1 |    


Now, for the final hurdle to tackle, if 5 is Friday, 6 and 7 are Saturday and Sunday respectively. So on the above data, the student should have an 8 consecutive attendance instead of separate 5 and 3. To do that, we need to shift all the date if it is over 7.


Here's the formula for that:

n - ((n/7) * 2) - row_number()over(order by n) as grp


To illustrate, use this query:

with a as
(
  select n, n - ((n/7) * 2) as shift_up, row_number() over(order by n) as rn
  from tbl
)
select 
  *
  ,n - rn as grp
from a;

Here's the output:

|  N | SHIFT_UP | RN | GRP |
----------------------------
|  1 |        1 |  1 |   0 |
|  2 |        2 |  2 |   0 |
|  3 |        3 |  3 |   0 |
|  4 |        4 |  4 |   0 |
|  5 |        5 |  5 |   0 |
|  8 |        6 |  6 |   2 |
|  9 |        7 |  7 |   2 |
| 10 |        8 |  8 |   2 |
| 12 |       10 |  9 |   3 |    

Notice that the value Monday's value(8) shifted to 6, likewise with Tuesday(9, this shifted to 7) and so on.

With this logic at hand, we can now easily determine the consecutive-ness of adjacent rows. We just need to subtract row_number from SHIFT_UP in order to make the current Monday consecutive to last week's Friday. This is the query:

with a as
(
  select n, n - ((n/7) * 2) - row_number()over(order by n) as grp
  from tbl
)
select 
  *
  ,dense_rank() over(order by grp)  as nth_streak
  ,count(*) over(partition by grp)  as streak
from a;


This is the output:
|  N | GRP | NTH_STREAK | STREAK |
----------------------------------
|  1 |   0 |          1 |      8 |
|  2 |   0 |          1 |      8 |
|  3 |   0 |          1 |      8 |
|  4 |   0 |          1 |      8 |
|  5 |   0 |          1 |      8 |
|  8 |   0 |          1 |      8 |
|  9 |   0 |          1 |      8 |
| 10 |   0 |          1 |      8 |
| 12 |   1 |          2 |      1 |    


It now yields the date from last week's Monday to Friday(1 to 5) and this week's Monday to Friday(8 to 10) as consecutive 8 days(indicated by GRP=0). Hence yielding a 5+3 consecutive attendance, this yields 8 consecutive attendance.


Another way to visualize how the date are being shifted(to determine date consecutive-ness) two days back:

Original data:
                  Weekends are not in attendance(e.g. 6,7,13,14)
 1  2  3  4  5     6  7
 8  9 10 11 12    13 14
15

If a given date falls on 15th, this will be computed as 15/7 * 2 == 4; then subtract 4 from the original number, 15 - 4 == 11. 15 will become the 11th day. Likewise the 8th day becomes the 6th day; 8 - (8/7 * 2) == 6. The above data yields the following output when the computation is applied:

 1  2  3  4  5    
 6  7  8  9 10    
11

The above attendance yields 11 consecutive good attendance. For holidays, you need to slot them on attendance, so the consecutive-ness of dates could be easily determined, then just remove them from the final query.


To apply the above logic on determining consecutive attendance even there are holidays and weekends, use this:


drop table tx;
drop table holiday;

create table tx
(
i int identity(1,1) not null primary key,
n varchar(10), d date,
constraint ux_tx unique(n,d)
);

insert into tx(n,d) values
('john','2012-7-3'),
('john','2012-7-5'),
('john','2012-7-6'),
('john','2012-7-9'),
('john','2012-7-12'),
('john','2012-7-13'),
('john','2012-7-16'),
('john','2012-7-17'),
('john','2012-7-18'),
('john','2012-7-20'),
('john','2012-7-30'),
('john','2012-7-31'),

('paul','2012-7-3'),
('paul','2012-7-5'),
('paul','2012-7-18'),
('paul','2012-7-19'),
('paul','2012-7-20'),
('paul','2012-7-23'),
('paul','2012-7-24'),
('paul','2012-7-25'),
('paul','2012-7-26'),
('paul','2012-7-27'),
('paul','2012-7-30'),
('paul','2012-7-31'),
('paul','2012-8-1'),
('paul','2012-8-3'),
('paul','2012-8-6'),
('paul','2012-8-7');

create table holiday(d date);

insert into holiday(d) values
('2012-7-4');



--- query

with first_date as
(
 select dateadd( ww, datediff(ww,0,min(d)), 0 ) as first_date -- get the monday of the earliest date
 from tx 
)
,shifted as
(
 select 
  tx.n, tx.d, 
    
  diff = datediff(day, fd.first_date, tx.d) - (datediff(day, fd.first_date, tx.d)/7 * 2) 
  
 from tx
 cross join first_date fd
 union
 select 
  xxx.n, h.d, 
  
  diff = datediff(day, fd.first_date, h.d) - (datediff(day, fd.first_date, h.d)/7 * 2) 
 from holiday h 
 cross join first_date fd
 cross join (select distinct n from tx) as xxx
)
,grouped as
(
 select 
  *
  , grp = diff - row_number() over(partition by n order by d)
 from shifted
)
select 
    
    -- remove staging columns from the output...
    -- * 
    
    -- ...just output what the user will see:
    d, n
    
 ,dense_rank() over (partition by n order by grp) as nth_streak
 ,count(*) over (partition by n, grp) as streak
from grouped
where d not in (select d from holiday)  -- remove the holidays

Output:
|          D |    N | NTH_STREAK | STREAK |
-------------------------------------------
| 2012-07-03 | john |          1 |      4 |
| 2012-07-05 | john |          1 |      4 |
| 2012-07-06 | john |          1 |      4 |
| 2012-07-09 | john |          1 |      4 |
| 2012-07-12 | john |          2 |      5 |
| 2012-07-13 | john |          2 |      5 |
| 2012-07-16 | john |          2 |      5 |
| 2012-07-17 | john |          2 |      5 |
| 2012-07-18 | john |          2 |      5 |
| 2012-07-20 | john |          3 |      1 |
| 2012-07-30 | john |          4 |      2 |
| 2012-07-31 | john |          4 |      2 |
| 2012-07-03 | paul |          1 |      2 |
| 2012-07-05 | paul |          1 |      2 |
| 2012-07-18 | paul |          2 |     11 |
| 2012-07-19 | paul |          2 |     11 |
| 2012-07-20 | paul |          2 |     11 |
| 2012-07-23 | paul |          2 |     11 |
| 2012-07-24 | paul |          2 |     11 |
| 2012-07-25 | paul |          2 |     11 |
| 2012-07-26 | paul |          2 |     11 |
| 2012-07-27 | paul |          2 |     11 |
| 2012-07-30 | paul |          2 |     11 |
| 2012-07-31 | paul |          2 |     11 |
| 2012-08-01 | paul |          2 |     11 |
| 2012-08-03 | paul |          3 |      3 |
| 2012-08-06 | paul |          3 |      3 |
| 2012-08-07 | paul |          3 |      3 |    

Live test: http://www.sqlfiddle.com/#!3/709a0/1

No comments:

Post a Comment