Thursday, April 8, 2010

Some wrong voted answer on stackoverflow

I love stackoverflow, it's just that sometimes it has its share of wrong voted answers. Actually, that someone's answer is correct, it yields the correct output; but choosing that answer is akin to choosing bubble sort over quick sort.



Let's take an example on http://stackoverflow.com/questions/497241/how-do-i-perform-a-group-by-on-an-aliased-column-in-ms-sql-server/497251#497251

The question:


"I'm trying to perform a group by action on an aliased column (example below) but can't determine the proper syntax."

SELECT       LastName + ', ' + FirstName AS 'FullName'
FROM         customers
GROUP BY     'FullName'

Then one stacker answer this:

SELECT       LastName + ', ' + FirstName AS 'FullName'
FROM         customers
GROUP BY      LastName + ', ' + FirstName

Then I suggested to him to change the concatenation in grouping:  GROUP BY LastName + ', ' + FirstName

To its natural grouping: GROUP BY Lastname, Firstname

Then somebody defended his answer:

You should leave it in, at least to differentiate between 'x, yz' and 'xy, z' which would roll up to the same string without the comma. – ConcernedOfTunbridgeWells

Now, to refute the argument that "x, yz" and "xy, z" will roll up to same string when concatenation is removed from GROUP BY clause, let's produce sample data that is purported to roll up when there is no comma in GROUP BY clause

create table customers(lastname varchar(50),firstname varchar(50));
insert into customers
select 'robot', 'share' union
select 'robots', 'hare';

If by virtue of not concatenating strings in GROUP BY will roll up those two fields to same thing, that is this query...

select lastname + firstname as fullname
from customers
group by lastname, firstname

...will produce one row only:

Output:
fullname
-----------
robotshare


But no, the above query did not produce one row, despite the two entities have same caption, they are still grouped to two rows:
Output:
fullname
-----------
robotshare
robotshare




And how about this:

select lastname + ', ' + firstname as fullname
from customers
group by lastname, firstname

Output:
fullname
---------------
robot, share
robots, hare

Was it reduced to one row? Again No


In fact, the only requirement in GROUP BY clause is to repeat the field(s) from SELECT clause to GROUP BY clause, but if it is not a field, and just an auxiliary information, let's say a caption:

select 'yikes'
from customers
group by lastname, firstname

Output:
(No column name)
----------------
yikes
yikes

your RDBMS won't deem that as invalid query, and result will not be reduced to one row.

Logic-wise, grouping by concatenated result is the one that has the chances of having two or more dissimilar rows be reduced to one row when grouping, the following query is a contrived example but you'll get the point:

select lastname + firstname as fullname
from customers
group by lastname + firstname

fullname
----------------
robotshare

Now, that is a faulty query, there are two entities, yet they are reduced to one row when grouping.


The following is the correct and proper query. And performant query, not because we don't repeat the concatenation expression, it's because RDBMS properly use the table's index, any RDBMS will cease to use index on function or expression results (unless your RDBMS can put index on function or expression), see the figures below:

select lastname + ', ' +  firstname as fullname, 'yikes' as reaction
from customers
group by lastname, firstname

Output:
fullname           reaction
-----------------  --------
robot, share       yikes
robots, hare       yikes

Did the above query reduce the result to one row? Again NO

And in fact, you don't need to repeat non-field values in GROUP BY as seen on query above, the 'yikes' is not repeated in GROUP BY


So this is the correct query(and performant query):

select lastname + ', ' +  firstname as fullname
from customers
group by lastname, firstname

Outputs:
fullname
-----------------
robot, share
robots, hare


Now let's put some index on fields Lastname, Firstname to see how those query differs in execution when run.


create index ux_customers on customers (lastname, firstname)


And let's see how the RDBMS will execute the GROUP BY concatenation:




And how it will execute the natural grouping:


As we can see from the execution plan of concatenated grouping, it spent some time on sorting(78%), while the natural grouping will just rip through the indexed fields directly when it executes the query

No comments:

Post a Comment