Sunday, May 9, 2010

Why is EXISTS (SELECT 1 ...) a cargo cult programming?

Why application programmers are using this kind of data existence checking...
select author_id, post_id, post from posts 
where exists(select 1 from comments where author_id = posts.author_id and post_id = posts.post_id)


 author_id | post_id |         post
-----------+---------+----------------------
       777 |       1 | what is optimization
(1 row)

...instead of this?

select author_id, post_id, post from posts 
where exists(select * from comments where author_id = posts.author_id and post_id = posts.post_id)


To create test data, expand the following:
create table posts
(
author_id int not null, -- composite pk
post_id int not null, -- composite pk
post varchar(50)  not null
);

insert into posts(author_id, post_id, post) values(777, 1, 'what is optimization');
insert into posts(author_id, post_id, post) values(777, 2, 'how to profile');

create table comments
(
author_id int not null,
post_id int not null,
commenter_id int not null,
comment varchar(50)
);

insert into comments(author_id, post_id, commenter_id, comment) values(777,1, 8, 'magnificent');
insert into comments(author_id, post_id, commenter_id, comment) values(777,1, 8, 'great');
insert into comments(author_id, post_id, commenter_id, comment) values(777,1, 8, 'excellent');


Why use 1 instead of asterisk? Most application programmers when are in sight of asterisk inside EXIST clause, they cringe for its purported inefficiency, they thought that asterisk always project data, and as such can incur delays to query. They have this ill-conceived notion that the asterisk fetches columns. If there's a way to avoid the SELECT asterisk altogether, all will be happy, no one will have a nagging feeling that the engineers of their favorite database are not smart.

Any takers of this syntax?
select * from posts 
where exists( from comments where author_id = posts.author_id and post_id = posts.post_id)

That EXIST clause doesn't have a look and feel of projecting unnecessary data. Now, you might not immediately like that (though might be if you are C# programmer that is accustomed to Linq syntax), so for better or for worse, we have to contend with the fact that in order to make parsing job easier, the language designer have designed that the query inside of EXISTS clause must conform to normal SQL syntax too. Language aesthetics or language symmetry also played a factor here.

Now are the irrational feelings that the asterisk inside EXISTS project data warranted? It's not, contrary to popular belief, database engineers are smarter than you and I will ever be. Do application programmers think that the database vendors wanted to lose their product mindshares to their competitors?

Database is a complex product that potential bottleneck like that construct won't let go past unnoticed, the keyword EXISTS is already a directive to RDBMS that data projecting should be suppressed from query inside of EXISTS. Why should they blindly project data inside of EXISTS?

To disprove that irrational fear that EXISTS clause is projecting data(it isn't), try this in your favorite database:

select * from posts 
where exists(select 1/0 from comments where author_id = posts.author_id and post_id = posts.post_id)

Did it result to divide by zero error? If it is, try to hurl criticisms to the vendor of your favorite database. But rest assured, we can be practically sure that our database won't blindly project data inside of EXISTS clause.

If only all database supports tuple testing, we can do this:

select * from posts where (author_id,post_id) in (select author_id, post_id from comments)


But some database don't support that tuple testing. So we have to do this:

select * from posts 
where /* (author_id,post_id) in */
exists 
    (select author_id, post_id 
    from comments 
    where author_id = posts.author_id and post_id = posts.post_id)

If I would have to translate the tuple testing to WHERE EXISTS clause, I would retain the two fields in first WHERE clause and put them in comment brackets; and inside of EXISTS clause, I would preserve the tuples to be tested upon, it makes for great documentation. It's better than EXISTS(SELECT 1) and EXISTS(SELECT *)

On some database, they have sleep function to simulate delay. To prove that our database discard data projection of EXISTS clause. Try to do this (change the pg_sleep with your database's equivalent):

select * from posts 
where exists(
select pg_sleep(5000) -- 5 seconds delay
from comments where author_id = posts.author_id and post_id = posts.post_id)

test=# select * from posts 
where exists(
    select pg_sleep(5000) 
    from comments 
    where author_id = posts.author_id and post_id = posts.post_id);

 author_id | post_id |         post
-----------+---------+----------------------
       777 |       1 | what is optimization
(1 row)

Time: 1.548 ms

As we can see, it didn't incur any delay

Strive for readability first before applying cargo cult programming. Don't have that irrational fear that asterisk always gets expanded to data projection

Now, I'm not suggesting to application programmers that they convert all their EXISTS(SELECT 1 FROM ..) to EXISTS(SELECT * FROM ..). Just remove the irrational fear, eliminate or minimize your Cargo Cult Programming.

I don't mind if application programmers or colleagues are used to with EXISTS(SELECT 1..), but I would do mind if they will go out their way to bikeshed and point out that EXISTS(SELECT * ) is not efficient. I trust the engineers of database more than the programmers who just merely use database.

No comments:

Post a Comment