Wednesday, April 28, 2010

Debunking the myth that COUNT('Dracula') is faster than COUNT(*)

Ok, that was an attempt at humor ;-)

How to debunk the myth that COUNT(1) is faster than COUNT(*) ?

The most simple mistaken assumption is that the asterisk sign makes the RDBMS read all the columns. Fact is, asterisk just represent the cardinality(an unfortunate name, very computer-sciencey-sounding) of the set

Thinking that COUNT(1) or COUNT(anyNonNullConstantHere) is faster than COUNT(*), is the most common programming-by-superstition done by many. First point, if this language construct had made it into SQL standard...

SELECT COUNT(cat)
FROM cats

...we will not need COUNT(*) (and COUNT(1) for that matter) in SQL language, but as RDBMS don't have an AI to infer that the entity(cat) you want to count is the singular form of the cats table, this didn't make it into the final ANSI SQL spec


Second point, a compromise, if this language construct permits counting the table name directly...

SELECT users.user_id, COUNT(answers)
FROM users
JOIN answers ON answers.user_id = users.user_id
GROUP BY users.user_id

...we will not need COUNT(*) in SQL languages, it's very natural to read. But only Postgres supports this principle. But that will still fail if there's a nullable answers(same name as its table) column in either of the two table, as Postgres gives more priority to field than table; though can be worked around if we put asterisk after the name, so the RDBMS can detect that the thing we we wanted to count is the table rows rather than the column name:

SELECT users.user_id, COUNT(answers.*)
FROM users
JOIN answers ON answers.user_id = users.user_id
GROUP BY users.user_id

The ultimate proof that the database isn't reading all columns, this script outputs 4, not 0.

create table person
(
lastname varchar(50),
firstname varchar(50),
age int
);

insert into person(lastname,firstname,age) values(null,null,null);
insert into person(lastname,firstname,age) values(null,null,null);
insert into person(lastname,firstname,age) values(null,null,null);
insert into person(lastname,firstname,age) values(null,null,null);

select count(*) from person; -- output is 4, not 0

If the RDBMS reads all the columns, why the count still resolves to 4? Shouldn't it be 0 since all the values are null?

What makes the COUNT(1) different from COUNT(0), COUNT(2)? COUNT('your blessings')? COUNT('Dracula')? Answer: None. This will all output 4 on the above data. There's nothing special about COUNT(1) or COUNT(anyNonNullValues) that makes it different from COUNT(*); there's nothing slow in COUNT(*) as we can plainly see that from the output above, that it don't care whatever the column values are. How can COUNT(1) be faster than COUNT(*) when COUNT(*) doesn't evaluate any values on all columns?


And one could argue as well that COUNT(*) is faster than COUNT(fieldhere) or COUNT(1) for that matter, since it doesn't need interpreting any when it count rows.


The asterisk inside of COUNT is just a directive for the RDBMS to fetch the cardinality of the set, not to evaluate all columns of the row.

And SELECT COUNT(1) is a one trick pony, it's only pleasant and possible to use on one table query only.

SELECT COUNT(1) FROM person

You cannot mold the COUNT(1) for LEFT JOINed tables

Won't work on other scenarios(e.g. counting all bids of person, regardless if he/she has bids):
SELECT person.person_id, COUNT(bid.1) -- ugh! won't work, syntax error
FROM person
LEFT JOIN bid ON bid.person_id = person.person_id
GROUP BY person.person_id

This is the proper way to do it:
SELECT person.person_id, COUNT(bid.person_id) -- works on all RDBMS(Sql Server, Postgresql, Mysql, Access, etc)
FROM person
LEFT JOIN bid ON bid.person_id = person.person_id
GROUP BY person.person_id

This is the proper way to do it:

-- works on Postgresql only(afaik). More intuitive, you don't count the column, you count the cardinality of the set
SELECT person.person_id, COUNT(bid.*) 
FROM person
LEFT JOIN bid ON bid.person_id = person.person_id
GROUP BY person.person_id


/* COUNT(1) is a mistake on queries like below, 
this will always return at least 1 bid belonging to person, even that person don't have any bids. 
Not because the parameter is 1, even if you change the parameter of COUNT to COUNT(2), it won't double the rows count. 

You will never get a COUNT of 0 with COUNT(1), or even COUNT(0) will not result to a COUNT of 0 when you are using LEFT JOIN 


*/
SELECT person.person_id, COUNT(1) 
FROM person
LEFT JOIN bid ON bid.person_id = person.person_id
GROUP BY person.person_id

So if COUNT(1) is just a one-trick pony, why others are still using it on their queries? Cargo cult programming perhaps?


COUNT(*) is such a common programming idiom to be misinterpreted by the database vendor's engineers; to instill one a confidence that COUNT(*) is not slow compared to COUNT(1), one must visualize that the asterisk inside of COUNT(*)indicates cardinality of the set, it doesn't pertain to columns, full stop. Asterisk on COUNT(*) has no bearing with the asterisk on SELECT *, they just share a common token, i.e. asterisk. Database vendor's engineers are smarter than you and I will ever be, they won't dumbly implement the asterisk inside of COUNT to perform reading all columns, asterisk inside of COUNT indicates set cardinality.


Well it's just an unfortunate thing also that this doesn't make it to ANSI SQL:

SELECT COUNT() FROM person;

That will surely make debating 1 vs asterisk inside a function a moot point. The recommended way to give the RDBMS the hint that you want to count the cardinality of set is to use asterisk, rather than using 1; the parameterless COUNT was not able to made its way on ANSI SQL language constructs though

Saturday, April 24, 2010

How to speed up ORDER BY COALESCE(col,9999999) in Sql Server?

No we won't, Sql Server 2008 doesn't support creating index on function operations yet.

So how can we speed up that expression? First of all, that expression is a common idiom and a common given answer when the query need to display null values last. Though some RDBMS already has a construct for putting nulls at the end of the sorted list, Sql Server 2008 still hasn't.





For reference, here's the construct for displaying nulls last on RDBMS that supports it, works on Postgres (with or without NULLS LAST, the database still uses the col's index)

SELECT * FROM tbl ORDER BY col NULLS LAST

Now let's do the same for Sql Server. Performance aside, how we will solve it without using sentinel/magic number? Use CASE WHEN:

SELECT * FROM tbl ORDER BY CASE WHEN col IS NULL THEN 1 ELSE 0 END, col

Well it is longer than sentinel approach, but one less magic numbers to remember(e.g. 9999999,ZZZZZZZ). Though it could be a lot simpler using other RDBMS which has boolean type:

SELECT * FROM tbl ORDER BY col IS NULL, col

That works on both Postgres and Mysql, true value sorts last, false first.

So if we have the following data...

create table xtemp(Col int)
INSERT INTO xtemp VALUES(1)
INSERT INTO xtemp VALUES(5)
INSERT INTO xtemp VALUES(4)
INSERT INTO xtemp VALUES(NULL)
INSERT INTO xtemp VALUES(3)
INSERT INTO xtemp VALUES(6)
INSERT INTO xtemp VALUES(NULL)
INSERT INTO xtemp VALUES(9999999)
INSERT INTO xtemp VALUES(2)

...the Sql Server ORDER BY CASE WHEN construct, sorts like this under the hood:

Sort Col
0    1
0    2
0    3
0    4
0    5
0    6
0    9999999
1    NULL
1    NULL

Now, let's go back to our performance goal on Sql Server. First, let's check if there's an ROI for using an extra field for sorting our list:

create table ytemp(Col int, IsColNull int)
INSERT INTO ytemp VALUES(1,0)
INSERT INTO ytemp VALUES(5,0)
INSERT INTO ytemp VALUES(4,0)
INSERT INTO ytemp VALUES(NULL,1)
INSERT INTO ytemp VALUES(3,0)
INSERT INTO ytemp VALUES(6,0)
INSERT INTO ytemp VALUES(NULL,1)
INSERT INTO ytemp VALUES(9999999,0)
INSERT INTO ytemp VALUES(2,0)

Then let's create index on both xtemp and ytemp
create index ix_xtemp on xtemp(Col)
create index ix_ytemp on ytemp(IsColNull, Col)

Now let's compare their performance relative to each other. Highlight these two statements, then press Ctrl+L...

select * from ytemp order by IsColNull, col
select * from xtemp order by case when col is null then 1 else 0 end, col


...here's the output:




As we can see, the first one is faster than the CASE WHEN approach

We can conclude that an extra field is a sure win in terms of sorting performance, so let's automate that extra column(IsColNull) so it's not a chore to populate them.

First, let's drop the existing column IsColNull:

drop index ix_ytemp on ytemp;
alter table ytemp drop column IsColNull;

Then re-add the column, but this time we specify it as computed/formula column:

ALTER TABLE dbo.ytemp ADD IsColNull  AS 
(case when col is null then 1 else 0 end) PERSISTED


Don't forget the index:
create index ix_ytemp on ytemp(iscolnull, col) 

Now let's recheck the execution plan of these two statements...

select * from ytemp order by IsColNull, col
select * from xtemp order by case when col is null then 1 else 0 end, col

...here's the output:




As we can see, they have the same performance as our first test. Well it should be, the column IsColNull is a real column on our table, and it is automatically assigned the formula's result

Now if you are curious how the COALESCE fares against the extra field approach, here is the execution plan:



The extra field approach is faster than coalesce approach. Until Sql Server supported NULLS LAST, we can use the extra field approach to speed up our query that needs to put nulls last.

Final note, we can use the extra field approach on sorting that we need to override, let's say we wanted to put US, UK on the top of our dropdown list, just do this in computed column:

ALTER TABLE dbo.ytemp ADD
 CustomSort AS 
(case 
when country = 'US' then 1
when country = 'UK' then 2
when country is NULL then 4 
else 3 
end) PERSISTED

That will put US on top of list, then UK, then other countries, then other blank(null) countries last

Related: http://www.ienablemuch.com/2010/12/postgresql-speeding-up-slow-coalesce.html

Tuesday, April 20, 2010

Getting the default value of column

Somebody in Stackoverflow wanted to do something like this:

UPDATE emp SET subordinate_of = COALESCE(subordinate_of, DEFAULT), other = 'somethingelse'
WHERE conditionhere

But it cannot yet be done in Postgres. So we will just obtain the column's default value from information_schema.

Recalling quick sort while downloading Visual Studio Service Pack



using System;
using System;
using System.Collections.Generic;
using System.Text;

namespace TestQuicksort
{
    class Program
    {

        
        static void Main(string[] args)
        {
            int[] r = new int[] { 6, 4, 5, 1, 8, 9, 3, 7, 2, 0 };


            Action print = () =>
            { for (int i = 0; i < r.Length; ++i) 
                  Console.Write("{0,4}", r[i]); }; 

            print();

            Qsort(r, 0, r.Length - 1);

            Console.WriteLine("\n");

            print();

            Console.ReadLine();
            
        }

        private static void Qsort(int[] r, int a, int z)
        {
            int i = a;
            int j = z;

            int segregator = r[new Random().Next(a,z)];            

            while (i < j)
            {
                while (r[j] > segregator) --j;
                while (r[i] < segregator) ++i;

                if (i <= j)
                {
                    if (i < j)
                    {
                        int x = r[i];
                        r[i] = r[j];
                        r[j] = x;
                    }

                    --j;
                    ++i;
                }

            }

            if (j > a)
                Qsort(r, a, j);

            if (i < z)
                Qsort(r, i, z);


        }
    }//class Program
}

Thursday, April 15, 2010

Simple hierarchical query display

How to output this in Postgresql and Sql Server?

 emp_id |            emp_name
--------+---------------------------------
      1 | President
      2 |   Vice President
      3 |     CEO
      4 |     CTO
      5 |       Group Project Manager
      6 |         Project Manager 1
      8 |           Team Leader 1
      9 |             Software Engineer 1
     10 |             Software Engineer 2
     11 |           Test Lead 1
     12 |             Tester 1
     13 |             Tester 2
      7 |         Project Manager 2
     14 |           Team Leader 2
     15 |             Software Engineer 3
     16 |             Software Engineer 4
     17 |           Test Lead 2
     18 |             Tester 3
     19 |             Tester 4
     20 |             Tester 5

Monday, April 12, 2010

How to know if all of A is inside B


This is my answer on http://stackoverflow.com/questions/2577500/how-understand-one-result-set-is-subset-of-another-in-twomysql-select-result-set/


A's set: 3, 5, 10

B's set: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

I wanted to do this, count b when it equals a:

select count( b.id when b.id = a.id ) = count(a.id) 
    as is_a_subset_of_b
from a
left join b using(id)

Thursday, April 8, 2010

Find two particular fish in same tank

How to query which container contains two particular items?



First, let's create test data

create table tank as 
select * from (values('01',7),('02',8),('03',9),('04',10),('05',11)) as x(tank_num, volume);

create table tank_fish as 
select * from 
(values
('01','Black Moor'),
('01','Bubble Eye'),
('01','Comet'),
('02','Bubble Eye'),
('02','Lion Head'),
('02','Pompom'),
('02','Ryukin'),
('03','Comet'),
('03','Lion Head'),
('03','Ranchu'),
('04','Betta')) as x(tank_num, fish);

Some wrong voted answer on stackoverflow

I love stackoverflow, it's just that sometimes it has its share of wrong voted answers. Actually, that someone's answer is correct, it yields the correct output; but choosing that answer is akin to choosing bubble sort over quick sort.



Let's take an example on http://stackoverflow.com/questions/497241/how-do-i-perform-a-group-by-on-an-aliased-column-in-ms-sql-server/497251#497251

The question:


"I'm trying to perform a group by action on an aliased column (example below) but can't determine the proper syntax."

SELECT       LastName + ', ' + FirstName AS 'FullName'
FROM         customers
GROUP BY     'FullName'

Then one stacker answer this:

SELECT       LastName + ', ' + FirstName AS 'FullName'
FROM         customers
GROUP BY      LastName + ', ' + FirstName

Then I suggested to him to change the concatenation in grouping:  GROUP BY LastName + ', ' + FirstName

To its natural grouping: GROUP BY Lastname, Firstname

Then somebody defended his answer:

You should leave it in, at least to differentiate between 'x, yz' and 'xy, z' which would roll up to the same string without the comma. – ConcernedOfTunbridgeWells

Now, to refute the argument that "x, yz" and "xy, z" will roll up to same string when concatenation is removed from GROUP BY clause, let's produce sample data that is purported to roll up when there is no comma in GROUP BY clause

create table customers(lastname varchar(50),firstname varchar(50));
insert into customers
select 'robot', 'share' union
select 'robots', 'hare';

If by virtue of not concatenating strings in GROUP BY will roll up those two fields to same thing, that is this query...

select lastname + firstname as fullname
from customers
group by lastname, firstname

...will produce one row only:

Output:
fullname
-----------
robotshare


But no, the above query did not produce one row, despite the two entities have same caption, they are still grouped to two rows:
Output:
fullname
-----------
robotshare
robotshare




And how about this:

select lastname + ', ' + firstname as fullname
from customers
group by lastname, firstname

Output:
fullname
---------------
robot, share
robots, hare

Was it reduced to one row? Again No


In fact, the only requirement in GROUP BY clause is to repeat the field(s) from SELECT clause to GROUP BY clause, but if it is not a field, and just an auxiliary information, let's say a caption:

select 'yikes'
from customers
group by lastname, firstname

Output:
(No column name)
----------------
yikes
yikes

your RDBMS won't deem that as invalid query, and result will not be reduced to one row.

Logic-wise, grouping by concatenated result is the one that has the chances of having two or more dissimilar rows be reduced to one row when grouping, the following query is a contrived example but you'll get the point:

select lastname + firstname as fullname
from customers
group by lastname + firstname

fullname
----------------
robotshare

Now, that is a faulty query, there are two entities, yet they are reduced to one row when grouping.


The following is the correct and proper query. And performant query, not because we don't repeat the concatenation expression, it's because RDBMS properly use the table's index, any RDBMS will cease to use index on function or expression results (unless your RDBMS can put index on function or expression), see the figures below:

select lastname + ', ' +  firstname as fullname, 'yikes' as reaction
from customers
group by lastname, firstname

Output:
fullname           reaction
-----------------  --------
robot, share       yikes
robots, hare       yikes

Did the above query reduce the result to one row? Again NO

And in fact, you don't need to repeat non-field values in GROUP BY as seen on query above, the 'yikes' is not repeated in GROUP BY


So this is the correct query(and performant query):

select lastname + ', ' +  firstname as fullname
from customers
group by lastname, firstname

Outputs:
fullname
-----------------
robot, share
robots, hare


Now let's put some index on fields Lastname, Firstname to see how those query differs in execution when run.


create index ux_customers on customers (lastname, firstname)


And let's see how the RDBMS will execute the GROUP BY concatenation:




And how it will execute the natural grouping:


As we can see from the execution plan of concatenated grouping, it spent some time on sorting(78%), while the natural grouping will just rip through the indexed fields directly when it executes the query

Saturday, April 3, 2010

Get Row Size in SQL Server

I don't know if there's a built-in storedproc or function to determine each row size. Here's one way to do it. Get all column names and add each column's data length.






declare @cs varchar(8000);

select @cs = coalesce(@cs + ' + DATALENGTH(' + column_name + ')', 'DATALENGTH(' + column_name + ')')
from INFORMATION_SCHEMA.COLUMNS i where TABLE_NAME = 'Members';

declare @qry nvarchar(4000);

set @qry = 'select Length = ' + @cs + ', * from members ';
select @qry;
exec sp_executesql @qry;


Length  MemberID  LoginName    ReputationPoints
16      1         John         1
16      2         Paul         1
20      3         George       1
18      4         Ringo        1

Friday, April 2, 2010

Clean the table before deleting duplicates

Clean foreign tables and delete duplicates in referenced table, 8 steps











Test database tables:
create table item
(
item_id serial not null,
item text not null,
constraint pk_item primary key(item_id),
constraint uk_item unique (item)
);

   
create table purchased
(
purchased_id serial not null,
item_id int not null,
qty int not null,
constraint pk_purchased primary key(purchased_id),
constraint fk_purchased__item foreign key(item_id) references item(item_id) 
);

Test data:
insert into item (item) values
('cpu'),
('keyboard'),
('keyboard '),
('mouse');

insert into purchased(item_id,qty) values
(1,2),
(2,26),
(3,19),
(4,51),
(3,5);

Thursday, April 1, 2010

My stackoverflow answer on: Check if string is a rotation of other string

Found a curious puzzle on stackoverflow, how to check if the string is a rotation of another string.




My provided answer uses the modulo approach.  Hats off to the voted answer(uses concatenation approach), it's the best, I think even Jon Skeet was impressed :-)