class Programmer : IEnableMuch {: Cargo Cult Programming

Showing posts with label Cargo Cult Programming. Show all posts

Monday, June 21, 2010

Unboxing nullable value type from object is slow

Contrary to common perception, as is not faster than is-then-cast approach. It's easy to debunk that perception. Just profile

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;

class Program
{
    const int Size = 30000000;

    static void Main(string[] args)
    {
        object[] values = new object[Size];
        for (int i = 0; i < Size - 2; i += 3)
        {
            values[i] = null;
            values[i + 1] = "";
            values[i + 2] = 1;
        }

        FindSumWithIsThenCast(values);
            
        FindSumWithAsThenHasThenValue(values);
        FindSumWithAsThenHasThenCast(values);

        FindSumWithManualAs(values);
        FindSumWithAsThenManualHasThenValue(values);

        Console.ReadLine();
    }

    static void FindSumWithIsThenCast(object[] values)
    {
        Stopwatch sw = Stopwatch.StartNew();
        int sum = 0;
        foreach (object o in values)
        {
            if (o is int)
            {
                int x = (int)o;
                sum += x;
            }
        }
        sw.Stop();
        Console.WriteLine("Is then Cast: {0} : {1}", sum,
                            (long)sw.ElapsedMilliseconds);
    }

    static void FindSumWithAsThenHasThenValue(object[] values)
    {
        Stopwatch sw = Stopwatch.StartNew();
        int sum = 0;
        foreach (object o in values)
        {
            int? x = o as int?;

            if (x.HasValue)
            {
                sum += x.Value;
            }
        }
        sw.Stop();
        Console.WriteLine("As then Has then Value: {0} : {1}", sum,
                            (long)sw.ElapsedMilliseconds);
    }

    static void FindSumWithAsThenHasThenCast(object[] values)
    {
        Stopwatch sw = Stopwatch.StartNew();
        int sum = 0;
        foreach (object o in values)
        {
            int? x = o as int?;

            if (x.HasValue)
            {
                sum += (int)o;
            }
        }
        sw.Stop();
        Console.WriteLine("As then Has then Cast: {0} : {1}", sum,
                            (long)sw.ElapsedMilliseconds);
    }

    static void FindSumWithManualAs(object[] values)
    {
        Stopwatch sw = Stopwatch.StartNew();
        int sum = 0;
        foreach (object o in values)
        {
            bool hasValue = o is int;
            int x = hasValue ? (int)o : 0;

            if (hasValue)
            {
                sum += x;
            }
        }
        sw.Stop();
        Console.WriteLine("Manual As: {0} : {1}", sum,
                            (long)sw.ElapsedMilliseconds);
    }

    static void FindSumWithAsThenManualHasThenValue(object[] values)
    {
        Stopwatch sw = Stopwatch.StartNew();
        int sum = 0;
        foreach (object o in values)
        {
            int? x = o as int?;

            if (o is int)
            {
                sum += x.Value;
            }
        }
        sw.Stop();
        Console.WriteLine("As then Manual Has then Value: {0} : {1}", sum,
                            (long)sw.ElapsedMilliseconds);
    }

}

Is then Cast: 10000000 : 303
As then Has then Value: 10000000 : 3524
As then Has then Cast: 10000000 : 3272
Manual As: 10000000 : 395
As then Manual Has then Value: 10000000 : 3282

What can we infer from these figures?

First, is-then-cast approach is significantly faster than as approach. 303 vs 3524
Second, .Value is marginally slower than casting. 3524 vs 3272
Third, .HasValue is marginally slower than using manual has(i.e. using is). 3524 vs 3282
Fourth, doing an apple-to-apple comparison(i.e. both assigning of simulated HasValue and converting simulated Value happens together) between simulated as and real as approach, we can see simulated as is still significantly faster than real as. 395 vs 3524
Lastly, based on first and fourth conclusion, there's something wrong with as
implementation ^_^

Sunday, May 9, 2010

Why is EXISTS (SELECT 1 ...) a cargo cult programming?

Why application programmers are using this kind of data existence checking...

select author_id, post_id, post from posts 
where exists(select 1 from comments where author_id = posts.author_id and post_id = posts.post_id)

 author_id | post_id |         post
-----------+---------+----------------------
       777 |       1 | what is optimization
(1 row)

...instead of this?

select author_id, post_id, post from posts 
where exists(select * from comments where author_id = posts.author_id and post_id = posts.post_id)

To create test data, expand the following:

create table posts
(
author_id int not null, -- composite pk
post_id int not null, -- composite pk
post varchar(50)  not null
);

insert into posts(author_id, post_id, post) values(777, 1, 'what is optimization');
insert into posts(author_id, post_id, post) values(777, 2, 'how to profile');

create table comments
(
author_id int not null,
post_id int not null,
commenter_id int not null,
comment varchar(50)
);

insert into comments(author_id, post_id, commenter_id, comment) values(777,1, 8, 'magnificent');
insert into comments(author_id, post_id, commenter_id, comment) values(777,1, 8, 'great');
insert into comments(author_id, post_id, commenter_id, comment) values(777,1, 8, 'excellent');

Why use 1 instead of asterisk? Most application programmers when are in sight of asterisk inside EXIST clause, they cringe for its purported inefficiency, they thought that asterisk always project data, and as such can incur delays to query. They have this ill-conceived notion that the asterisk fetches columns. If there's a way to avoid the SELECT asterisk altogether, all will be happy, no one will have a nagging feeling that the engineers of their favorite database are not smart.

Any takers of this syntax?

select * from posts 
where exists( from comments where author_id = posts.author_id and post_id = posts.post_id)

That EXIST clause doesn't have a look and feel of projecting unnecessary data. Now, you might not immediately like that (though might be if you are C# programmer that is accustomed to Linq syntax), so for better or for worse, we have to contend with the fact that in order to make parsing job easier, the language designer have designed that the query inside of EXISTS clause must conform to normal SQL syntax too. Language aesthetics or language symmetry also played a factor here.

Now are the irrational feelings that the asterisk inside EXISTS project data warranted? It's not, contrary to popular belief, database engineers are smarter than you and I will ever be. Do application programmers think that the database vendors wanted to lose their product mindshares to their competitors?

Database is a complex product that potential bottleneck like that construct won't let go past unnoticed, the keyword EXISTS is already a directive to RDBMS that data projecting should be suppressed from query inside of EXISTS. Why should they blindly project data inside of EXISTS?

To disprove that irrational fear that EXISTS clause is projecting data(it isn't), try this in your favorite database:

select * from posts 
where exists(select 1/0 from comments where author_id = posts.author_id and post_id = posts.post_id)

Did it result to divide by zero error? If it is, try to hurl criticisms to the vendor of your favorite database. But rest assured, we can be practically sure that our database won't blindly project data inside of EXISTS clause.

If only all database supports tuple testing, we can do this:

select * from posts where (author_id,post_id) in (select author_id, post_id from comments)

But some database don't support that tuple testing. So we have to do this:

select * from posts 
where /* (author_id,post_id) in */
exists 
    (select author_id, post_id 
    from comments 
    where author_id = posts.author_id and post_id = posts.post_id)

If I would have to translate the tuple testing to WHERE EXISTS clause, I would retain the two fields in first WHERE clause and put them in comment brackets; and inside of EXISTS clause, I would preserve the tuples to be tested upon, it makes for great documentation. It's better than EXISTS(SELECT 1) and EXISTS(SELECT *)

On some database, they have sleep function to simulate delay. To prove that our database discard data projection of EXISTS clause. Try to do this (change the pg_sleep with your database's equivalent):

select * from posts 
where exists(
select pg_sleep(5000) -- 5 seconds delay
from comments where author_id = posts.author_id and post_id = posts.post_id)

test=# select * from posts 
where exists(
    select pg_sleep(5000) 
    from comments 
    where author_id = posts.author_id and post_id = posts.post_id);

 author_id | post_id |         post
-----------+---------+----------------------
       777 |       1 | what is optimization
(1 row)

Time: 1.548 ms

As we can see, it didn't incur any delay

Strive for readability first before applying cargo cult programming. Don't have that irrational fear that asterisk always gets expanded to data projection

Now, I'm not suggesting to application programmers that they convert all their EXISTS(SELECT 1 FROM ..) to EXISTS(SELECT * FROM ..). Just remove the irrational fear, eliminate or minimize your Cargo Cult Programming.

I don't mind if application programmers or colleagues are used to with EXISTS(SELECT 1..), but I would do mind if they will go out their way to bikeshed and point out that EXISTS(SELECT * ) is not efficient. I trust the engineers of database more than the programmers who just merely use database.

Wednesday, April 28, 2010

Debunking the myth that COUNT('Dracula') is faster than COUNT(*)

Ok, that was an attempt at humor ;-)

How to debunk the myth that COUNT(1) is faster than COUNT(*) ?

The most simple mistaken assumption is that the asterisk sign makes the RDBMS read all the columns. Fact is, asterisk just represent the cardinality(an unfortunate name, very computer-sciencey-sounding) of the set

Thinking that COUNT(1) or COUNT(anyNonNullConstantHere) is faster than COUNT(*), is the most common programming-by-superstition done by many. First point, if this language construct had made it into SQL standard...

SELECT COUNT(cat)
FROM cats

...we will not need COUNT(*) (and COUNT(1) for that matter) in SQL language, but as RDBMS don't have an AI to infer that the entity(cat) you want to count is the singular form of the cats table, this didn't make it into the final ANSI SQL spec

Second point, a compromise, if this language construct permits counting the table name directly...

SELECT users.user_id, COUNT(answers)
FROM users
JOIN answers ON answers.user_id = users.user_id
GROUP BY users.user_id

...we will not need COUNT(*) in SQL languages, it's very natural to read. But only Postgres supports this principle. But that will still fail if there's a nullable answers(same name as its table) column in either of the two table, as Postgres gives more priority to field than table; though can be worked around if we put asterisk after the name, so the RDBMS can detect that the thing we we wanted to count is the table rows rather than the column name:

SELECT users.user_id, COUNT(answers.*)
FROM users
JOIN answers ON answers.user_id = users.user_id
GROUP BY users.user_id

The ultimate proof that the database isn't reading all columns, this script outputs 4, not 0.

create table person
(
lastname varchar(50),
firstname varchar(50),
age int
);

insert into person(lastname,firstname,age) values(null,null,null);
insert into person(lastname,firstname,age) values(null,null,null);
insert into person(lastname,firstname,age) values(null,null,null);
insert into person(lastname,firstname,age) values(null,null,null);

select count(*) from person; -- output is 4, not 0

If the RDBMS reads all the columns, why the count still resolves to 4? Shouldn't it be 0 since all the values are null?

What makes the COUNT(1) different from COUNT(0), COUNT(2)? COUNT('your blessings')? COUNT('Dracula')? Answer: None. This will all output 4 on the above data. There's nothing special about COUNT(1) or COUNT(anyNonNullValues) that makes it different from COUNT(*); there's nothing slow in COUNT(*) as we can plainly see that from the output above, that it don't care whatever the column values are. How can COUNT(1) be faster than COUNT(*) when COUNT(*) doesn't evaluate any values on all columns?

And one could argue as well that COUNT(*) is faster than COUNT(fieldhere) or COUNT(1) for that matter, since it doesn't need interpreting any when it count rows.

The asterisk inside of COUNT is just a directive for the RDBMS to fetch the cardinality of the set, not to evaluate all columns of the row.

And SELECT COUNT(1) is a one trick pony, it's only pleasant and possible to use on one table query only.

SELECT COUNT(1) FROM person

You cannot mold the COUNT(1) for LEFT JOINed tables

Won't work on other scenarios(e.g. counting all bids of person, regardless if he/she has bids):

SELECT person.person_id, COUNT(bid.1) -- ugh! won't work, syntax error
FROM person
LEFT JOIN bid ON bid.person_id = person.person_id
GROUP BY person.person_id

This is the proper way to do it:

SELECT person.person_id, COUNT(bid.person_id) -- works on all RDBMS(Sql Server, Postgresql, Mysql, Access, etc)
FROM person
LEFT JOIN bid ON bid.person_id = person.person_id
GROUP BY person.person_id

This is the proper way to do it:

-- works on Postgresql only(afaik). More intuitive, you don't count the column, you count the cardinality of the set
SELECT person.person_id, COUNT(bid.*) 
FROM person
LEFT JOIN bid ON bid.person_id = person.person_id
GROUP BY person.person_id

/* COUNT(1) is a mistake on queries like below, 
this will always return at least 1 bid belonging to person, even that person don't have any bids. 
Not because the parameter is 1, even if you change the parameter of COUNT to COUNT(2), it won't double the rows count. 

You will never get a COUNT of 0 with COUNT(1), or even COUNT(0) will not result to a COUNT of 0 when you are using LEFT JOIN 


*/
SELECT person.person_id, COUNT(1) 
FROM person
LEFT JOIN bid ON bid.person_id = person.person_id
GROUP BY person.person_id

So if COUNT(1) is just a one-trick pony, why others are still using it on their queries? Cargo cult programming perhaps?

COUNT(*) is such a common programming idiom to be misinterpreted by the database vendor's engineers; to instill one a confidence that COUNT(*) is not slow compared to COUNT(1), one must visualize that the asterisk inside of COUNT(*)indicates cardinality of the set, it doesn't pertain to columns, full stop. Asterisk on COUNT(*) has no bearing with the asterisk on SELECT *, they just share a common token, i.e. asterisk. Database vendor's engineers are smarter than you and I will ever be, they won't dumbly implement the asterisk inside of COUNT to perform reading all columns, asterisk inside of COUNT indicates set cardinality.

Well it's just an unfortunate thing also that this doesn't make it to ANSI SQL:

SELECT COUNT() FROM person;

That will surely make debating 1 vs asterisk inside a function a moot point. The recommended way to give the RDBMS the hint that you want to count the cardinality of set is to use asterisk, rather than using 1; the parameterless COUNT was not able to made its way on ANSI SQL language constructs though

Open Source

Monday, June 21, 2010

Unboxing nullable value type from object is slow

Sunday, May 9, 2010

Why is EXISTS (SELECT 1 ...) a cargo cult programming?

Wednesday, April 28, 2010

Debunking the myth that COUNT('Dracula') is faster than COUNT(*)