Wednesday, April 28, 2010

Debunking the myth that COUNT('Dracula') is faster than COUNT(*)

Ok, that was an attempt at humor ;-)

How to debunk the myth that COUNT(1) is faster than COUNT(*) ?

The most simple mistaken assumption is that the asterisk sign makes the RDBMS read all the columns. Fact is, asterisk just represent the cardinality(an unfortunate name, very computer-sciencey-sounding) of the set

Thinking that COUNT(1) or COUNT(anyNonNullConstantHere) is faster than COUNT(*), is the most common programming-by-superstition done by many. First point, if this language construct had made it into SQL standard...

SELECT COUNT(cat)
FROM cats

...we will not need COUNT(*) (and COUNT(1) for that matter) in SQL language, but as RDBMS don't have an AI to infer that the entity(cat) you want to count is the singular form of the cats table, this didn't make it into the final ANSI SQL spec


Second point, a compromise, if this language construct permits counting the table name directly...

SELECT users.user_id, COUNT(answers)
FROM users
JOIN answers ON answers.user_id = users.user_id
GROUP BY users.user_id

...we will not need COUNT(*) in SQL languages, it's very natural to read. But only Postgres supports this principle. But that will still fail if there's a nullable answers(same name as its table) column in either of the two table, as Postgres gives more priority to field than table; though can be worked around if we put asterisk after the name, so the RDBMS can detect that the thing we we wanted to count is the table rows rather than the column name:

SELECT users.user_id, COUNT(answers.*)
FROM users
JOIN answers ON answers.user_id = users.user_id
GROUP BY users.user_id

The ultimate proof that the database isn't reading all columns, this script outputs 4, not 0.

create table person
(
lastname varchar(50),
firstname varchar(50),
age int
);

insert into person(lastname,firstname,age) values(null,null,null);
insert into person(lastname,firstname,age) values(null,null,null);
insert into person(lastname,firstname,age) values(null,null,null);
insert into person(lastname,firstname,age) values(null,null,null);

select count(*) from person; -- output is 4, not 0

If the RDBMS reads all the columns, why the count still resolves to 4? Shouldn't it be 0 since all the values are null?

What makes the COUNT(1) different from COUNT(0), COUNT(2)? COUNT('your blessings')? COUNT('Dracula')? Answer: None. This will all output 4 on the above data. There's nothing special about COUNT(1) or COUNT(anyNonNullValues) that makes it different from COUNT(*); there's nothing slow in COUNT(*) as we can plainly see that from the output above, that it don't care whatever the column values are. How can COUNT(1) be faster than COUNT(*) when COUNT(*) doesn't evaluate any values on all columns?


And one could argue as well that COUNT(*) is faster than COUNT(fieldhere) or COUNT(1) for that matter, since it doesn't need interpreting any when it count rows.


The asterisk inside of COUNT is just a directive for the RDBMS to fetch the cardinality of the set, not to evaluate all columns of the row.

And SELECT COUNT(1) is a one trick pony, it's only pleasant and possible to use on one table query only.

SELECT COUNT(1) FROM person

You cannot mold the COUNT(1) for LEFT JOINed tables

Won't work on other scenarios(e.g. counting all bids of person, regardless if he/she has bids):
SELECT person.person_id, COUNT(bid.1) -- ugh! won't work, syntax error
FROM person
LEFT JOIN bid ON bid.person_id = person.person_id
GROUP BY person.person_id

This is the proper way to do it:
SELECT person.person_id, COUNT(bid.person_id) -- works on all RDBMS(Sql Server, Postgresql, Mysql, Access, etc)
FROM person
LEFT JOIN bid ON bid.person_id = person.person_id
GROUP BY person.person_id

This is the proper way to do it:

-- works on Postgresql only(afaik). More intuitive, you don't count the column, you count the cardinality of the set
SELECT person.person_id, COUNT(bid.*) 
FROM person
LEFT JOIN bid ON bid.person_id = person.person_id
GROUP BY person.person_id


/* COUNT(1) is a mistake on queries like below, 
this will always return at least 1 bid belonging to person, even that person don't have any bids. 
Not because the parameter is 1, even if you change the parameter of COUNT to COUNT(2), it won't double the rows count. 

You will never get a COUNT of 0 with COUNT(1), or even COUNT(0) will not result to a COUNT of 0 when you are using LEFT JOIN 


*/
SELECT person.person_id, COUNT(1) 
FROM person
LEFT JOIN bid ON bid.person_id = person.person_id
GROUP BY person.person_id

So if COUNT(1) is just a one-trick pony, why others are still using it on their queries? Cargo cult programming perhaps?


COUNT(*) is such a common programming idiom to be misinterpreted by the database vendor's engineers; to instill one a confidence that COUNT(*) is not slow compared to COUNT(1), one must visualize that the asterisk inside of COUNT(*)indicates cardinality of the set, it doesn't pertain to columns, full stop. Asterisk on COUNT(*) has no bearing with the asterisk on SELECT *, they just share a common token, i.e. asterisk. Database vendor's engineers are smarter than you and I will ever be, they won't dumbly implement the asterisk inside of COUNT to perform reading all columns, asterisk inside of COUNT indicates set cardinality.


Well it's just an unfortunate thing also that this doesn't make it to ANSI SQL:

SELECT COUNT() FROM person;

That will surely make debating 1 vs asterisk inside a function a moot point. The recommended way to give the RDBMS the hint that you want to count the cardinality of set is to use asterisk, rather than using 1; the parameterless COUNT was not able to made its way on ANSI SQL language constructs though

No comments:

Post a Comment