Monday, August 9, 2010

Postgresql recognizing functional dependency on primary keys for GROUP BY is plain awesome!

This is how I write a query like this...

SELECT
    p.id,
    p.firstname, p.lastname, p.address,
    count(*)
FROM
    people p
    JOIN visits v ON p.id = v.person_id
GROUP BY p.id, p.firstname, p.lastname, p.address;

...when it involves a GROUP BY:

SELECT
    p.id,
    p.firstname, p.lastname, p.address, -- these auxiliary fields is helpful to program users
    count(*)
FROM
    people p
    JOIN visits v ON p.id = v.person_id
GROUP BY p.id
    ,p.firstname, p.lastname, p.address; -- these ancillary fields aids the RDBMS on preventing
    -- programmers from accidentally committing the same mistake as MySQL programmers.
    -- see the Mysql code near the bottom of this post.


Notice the p.firstname, p.lastname and p.address are wrapped to new line. I do that to emphasize that the system is semantically grouping person's record(s) based on its ID, not on its firstname, lastname nor address.

Starting from Postgres 9.1, we can now do this...

SELECT
    p.id,
    p.firstname, p.lastname, p.address, -- these auxiliary fields is helpful to users
    count(*)
FROM
    people p
    JOIN visits v ON p.id = v.person_id
GROUP BY p.id


-- the ancillary fields(i.e. fields from the same table of primary key) are not needed anymore 
-- when you group on a primary key field with Postgres 9.1. Primary key field is sufficient enough 
-- for Postgres 9.1 that you are grouping the entity on its internal representation(i.e. primary key) 

Notice that we can omit the additional fields p.lastname, p.firstname, p.address on GROUP BY clause. Postgres 9.1 can now infer that we are grouping by primary key. That's a nice capability added to Postgres, we can now avoid SQL design pattern for GROUP BY, i.e. wrapping other fields on new line when we are not semantically grouping on those ancillary fields. Though you can still add the p.lastname, p.firstname, p.address fields on Postgres 9.1's GROUP BY primary key, you can hardly find a need for doing so, especially if the grouped entity has a gazillion fields to be presented to program users.

And since it only recognize primary key on GROUP BY, this will not work on Postgres and any sane RDBMS

SELECT
    p.firstname,
    p.lastname,
    count(*)
FROM
    people p
    JOIN visits v ON p.id = v.person_id
GROUP BY p.firstname

That will not work, as people can have multiple lastname sharing a given firstname. But that will work on MySQL, which is really bad, as it will just randomly select any lastname from the grouped firstname.

To prove how absurd MySQL design choice(or lack thereof) is, this query...

SELECT region, name, COUNT(*) 
FROM bbc 
GROUP BY region



...produces:


regionnameCOUNT(*)
AfricaAngola47
AmericasAntigua and Barbuda20
Asia-PacificAustralia36
EuropeAlbania48
Middle EastAlgeria19
North AmericaCanada3
South AmericaArgentina12
South AsiaAfghanistan8


MySQL will allow that absurd query, and produce results. While other sane RDBMS like Postgresql, Sql Server, Oracle, etc will prevent that absurd scenario.

If you want to COUNT how many people has similar firstname, you must put firstname only on GROUP BY's clause and firstname only on SELECT clause too; if you really want to COUNT people with similar firstname+lastname, you must group on both firstname+lastname.

To test that query on other database visit sqlzoo.net

Postgres really has saner design choices than other RDBMSes. It's really awesome!


Full article here: http://www.depesz.com/index.php/2010/08/08/waiting-for-9-1-recognize-functional-dependency-on-primary-keys/

Interesting: Auxiliary vs Ancillary

No comments:

Post a Comment