Thursday, October 14, 2010

Sql Server IS DISTINCT FROM

How to test if two NULLS are equal in SQL Server, ala-C#. Before our dear DBAs waxes philosophical that NULL = NULL is unknown, NULL <> NULL is also unknown, we must note that comparing things that results to knowns has uses, say for example we want to make audit logger in trigger code, that trigger code must be optimized. We will have database performance problem if we don't detect the field when logging changes, as this will do unconditional logging, whether the pertinent field(s) changes or not. We will also have problem in our logging logic if we cannot properly detect if there is a change that have occured in nullable field.

As typified by this:
IF new.sub_account_of <> old.sub_account_of THEN
    // log changes here
ELSE
    // do nothing
END

That will not work, if one of them has null. We must use this(doesn't work on Sql Server, works on Postgres, DB2):

IF new.sub_account_of IS DISTINCT FROM old.sub_account_of THEN
    // log changes here
ELSE
    // do nothing
END


Truth table for C#-like null comparison:
   A ==    B
NULL == NULL : True
NULL ==    1 : False
NULL ==    2 : False
   1 == NULL : False
   1 ==    1 : True
   1 ==    2 : False
   2 == NULL : False
   2 ==    1 : False
   2 ==    2 : True

   A !=    B
NULL != NULL : False
NULL !=    1 : True
NULL !=    2 : True
   1 != NULL : True
   1 !=    1 : False
   1 !=    2 : True
   2 != NULL : True
   2 !=    1 : True
   2 !=    2 : False

For the code above, expand the following:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace NullComparison
{
    class Program
    {
        struct Rec
        {
            public int? a, b;

            public Rec(int? a, int? b)
            {
                this.a = a;
                this.b = b;
            }            
        }


        static void Main(string[] args)
        {
            Console.WriteLine("{0,4} == {1,4}", "A", "B");
            foreach (Rec r in Values())
            {
                Console.WriteLine("{0,4} == {1,4} : {2}", 
                    r.a != null ? r.a.ToString() : "NULL", 
                    r.b != null ? r.b.ToString() : "NULL", 
                    r.a == r.b);                
            }

            Console.WriteLine();
            Console.WriteLine("{0,4} != {1,4}", "A", "B");
            foreach (Rec r in Values())
            {
                Console.WriteLine("{0,4} != {1,4} : {2}", 
                    r.a != null ? r.a.ToString() : "NULL", 
                    r.b != null ? r.b.ToString() : "NULL", r.a != r.b);                
            }

            Console.ReadLine();
            
        }

        static IEnumerable<Rec> Values()
        {
            yield return new Rec(null, null);
            yield return new Rec(null, 1);
            yield return new Rec(null, 2);
            yield return new Rec(1, null);
            yield return new Rec(1, 1);
            yield return new Rec(1, 2);
            yield return new Rec(2, null);
            yield return new Rec(2, 1);
            yield return new Rec(2, 2);
        }

    }//class Program

}//namespace NullComparison

If you want to go directly to Sql Server logic, click this, otherwise continue reading...

Now let's do it first in PostgreSQL to verify the logic of our code, since it has first class support for C#-like nullable values comparison behavior. For equality, it uses IS NOT DISTINCT FROM, for inequality it uses IS DISTINCT FROM, pretty intuitive isn't it? If I remember correctly, MySQL uses <=> for its IS DISTINCT FROM, I don't recall if there's an IS NOT DISTINCT FROM operator in MySQL.

To create the table(also works in SQL Server 2008):
select a,b into x_m
from 
(
 values
 (null,null),
 (null,1),
 (null,2),
 (1,null),
 (1,1),
 (1,2),
 (2,null),
 (2,1),
 (2,2)
)as x(a,b)

Without further ado, this is the logic of equality: coalesce(a = b or (a is null and b is null), false) as manual_comp

Why coalesce is needed there? It's when either of a or b is null, the equality comparison will result to null. So we need to translate null to false, for that we will use coalesce.

To test:
select a,b
,coalesce(a = b or (a is null and b is null), false) as manual_comp
,a is not distinct from b as built_in_comp
from x_m

Outputs:
 a | b | manual_comp | built_in_comp
---+---+-------------+---------------
   |   | t           | t
   | 1 | f           | f
   | 2 | f           | f
 1 |   | f           | f
 1 | 1 | t           | t
 1 | 2 | f           | f
 2 |   | f           | f
 2 | 1 | f           | f
 2 | 2 | t           | t
(9 rows)

Now how to check if b is different from a? use IS DISTINCT FROM, now how do we do it in manual SQL? One approach is we just put NOT on the whole expression, another one is we rewrite the query:

select a,b
,coalesce(a <> b and (a is not null or b is not null), true) as manual_comp
,a is distinct from b as built_in_comp
from x_m

Outputs:
 a | b | manual_comp | built_in_comp
---+---+-------------+---------------
   |   | f           | f
   | 1 | t           | t
   | 2 | t           | t
 1 |   | t           | t
 1 | 1 | f           | f
 1 | 2 | t           | t
 2 |   | t           | t
 2 | 1 | t           | t
 2 | 2 | f           | f
(9 rows)


As we can see, re-wiring rewriting the query seems a bit tedious to read:

coalesce(a <> b and (a is not null or b is not null), true) as manual_comp

Programmers also tend to re-read their logic if it involves multiple NOT EQUAL comparison. I cannot just help myself to skim-read that query. Maybe one of those NOT EQUAL comparisons could launch a missile!

Let's find one example how using only one NOT happens in real life. If an interviewer asked the interviewee about her favorite singer, and the answer is neither John, Paul, George nor Ringo, then the interviewer should go find another interviewee.

Now how most programmers(not all programmers) translate that to program?

IF ans != 'John' AND ans != 'Paul' AND ans != 'George' AND ans != 'Ringo' THEN 
    FindAnotherInterviewee
END

See the pattern there? You never read written instructions like this in real life "Please interview another person if the interviewee's answer is not john and not paul and not george and not ringo." In real life, you just use one NOT. The code above doesn't mimic the original English sentence anymore, it's hard to read that logic, it compels the programmer to re-read the logic, multiple NOT EQUALs promotes doubtful assertion, perhaps you've seen others prefer this (translation of if answer is neither Y, N nor C):

IF ans != 'Y' AND ans != 'N' AND ans != 'C' THEN ask again END

Over this readable one:
IF NOT(ans = 'Y' OR ans = 'N' or ans = 'C') THEN ask again END


So I shun multiple NOT EQUALs in logic, they make the program harder to read. So the English sentence of neither John,Paul,George nor Ringo, should be written like this:

IF NOT(ans = 'John' OR ans = 'Paul' OR ans = 'George' OR ans = 'Ringo') THEN     
    FindAnotherInterviewee
END

That directly mimics the english sentence with only one NOT. Actually the word nor from actual English sentence is a signal in itself that one should not use AND, should use OR instead when translating the English sentence to code.

So let's stick with readable programs.

To continue, for inequality comparisons, we just stick NOT in front of the equality expression.

select a,b
,not coalesce(a = b or (a is null and b is null), false) as manual_comp
,a is distinct from b as built_in_comp
from x_m

Here's the output:
 a | b | manual_comp | built_in_comp
---+---+-------------+---------------
   |   | f           | f
   | 1 | t           | t
   | 2 | t           | t
 1 |   | t           | t
 1 | 1 | f           | f
 1 | 2 | t           | t
 2 |   | t           | t
 2 | 1 | t           | t
 2 | 2 | f           | f
(9 rows)



Now let's go back to the main gist of this article, since up to this time of writing, Microsoft SQL Server 2008 still doesn't support IS DISTINCT FROM and IS NOT DISTINCT FROM operators. We must find our way to support those conditions.


So how to translate those IS DISTINCT FROM and IS NOT DISTINCT FROM to Sql Server?

Postgresql version:

select a,b
,coalesce(a = b or (a is null and b is null), false) as manual_comp
,a is not distinct from b as built_in_comp
from x_m

To translate to Sql Server-consumable one, use CASE WHEN:

select a,b,
    case 
    when a = b then 1
    when a is null and b is null then 1 
    else 0
    end as mssql_manual_comp
from x_m

Outputs:
a           b           mssql_manual_comp
----------- ----------- -----------------
NULL        NULL        1
NULL        1           0
NULL        2           0
1           NULL        0
1           1           1
1           2           0
2           NULL        0
2           1           0
2           2           1

(9 row(s) affected)

Working properly, now how to check if a IS DISTINCT FROM b? Since Sql Server don't have first-class support for booleans, we cannot just put NOT on front of CASE statement. This will result to incorrect syntax:
select a,b,
      NOT
      case 
      when a = b then 1
      when a is null and b is null then 1 
      else 0
      end as mssql_manual_comp
from x_m


So let's get some help from XOR operator to toggle the logic:
select a,b,
    case 
    when a = b then 1
    when a is null and b is null then 1 
    else 0
    end ^ 1 as mssql_manual_comp
from x_m

So what's the output for our inequality comparison?
a           b           mssql_manual_comp
----------- ----------- -----------------
NULL        NULL        0
NULL        1           1
NULL        2           1
1           NULL        1
1           1           0
1           2           1
2           NULL        1
2           1           1
2           2           0

(9 row(s) affected)

So our logic is correct, so now let's give some IS DISTINCT FROM love to Sql Server, which if I may add can already be done on other major RDBMS

To recap, how to do IS NOT DISTINCT FROM in Sql Server?
select a,b
from x_m
where a IS NOT DISTINCT FROM b

Here's the equivalent in Sql Server:
select a,b
from x_m
where  
    (case 
    when a = b then 1
    when a is null and b is null then 1 
    else 0
    end) = 1

Outputs:
a           b
----------- -----------
NULL        NULL
1           1
2           2

(3 row(s) affected)


And how to do IS DISTINCT FROM in Sql Server?
select a,b
from x_m
where a IS DISTINCT FROM b


Here's the equivalent in Sql Server:
select a,b
from x_m
where  
    (case 
    when a = b then 1
    when a is null and b is null then 1 
    else 0
    end) = 0

Outputs:
a           b
----------- -----------
NULL        1
NULL        2
1           NULL
1           2
2           NULL
2           1

(6 row(s) affected)


Please do note that we don't need to use XOR in our CASE expression in WHERE clause, we can just directly compare the CASE expression to 1 or 0. 1 denotes equality, 0 denotes inequality.


Regarding DeMorgan logic simplifier and why multiple != (aka <>) should be suspected of code smell http://stackoverflow.com/questions/4615113/how-can-the-and-and-or-key-word-used-together-in-a-select-query/4615182#4615182


UPDATE 2011-08-08

Here's another approach for IS NOT DISTINCT FROM:
select a,b
from x_m
where 
    (
        a = b
        or
        (a is null and b is null)
    )

For IS DISTINCT FROM:
select a,b
from x_m
where 
    a <> b
    or 
    ( 
        not (a is null and b is null )
        and ( a is null or b is null )
    )   


UPDATE 2014-04-08

Here's another approach for IS DISTINCT FROM: http://sqlfiddle.com/#!15/d41d8/1731
with x as
(
    select a, b
    from
    (
        values
        (null,null),
        (null,1),
        (null,2),
        (1,null),
        (1,1),
        (1,2),
        (2,null),
        (2,1),
        (2,2)
    )as x(a,b)
)
select *,
 
 
    x.a IS DISTINCT FROM x.b as perfect,
 
 
    x.a <> x.b as flawed,
 
 
    a <> b
    or
    (
        not (a is null and b is null )
        and ( a is null or b is null )
    )  
        as good_workaround,
 
 
 
    (x.a <> x.b or x.a is null or x.b is null)
    and not (x.a is null and x.b is null)
 
        as perfect_workaround
 
 
 
from x;

Output:
|      A |      B | PERFECT | FLAWED | GOOD_WORKAROUND | PERFECT_WORKAROUND |
|--------|--------|---------|--------|-----------------|--------------------|
| (null) | (null) |       0 | (null) |          (null) |                  0 |
| (null) |      1 |       1 | (null) |               1 |                  1 |
| (null) |      2 |       1 | (null) |               1 |                  1 |
|      1 | (null) |       1 | (null) |               1 |                  1 |
|      1 |      1 |       0 |      0 |               0 |                  0 |
|      1 |      2 |       1 |      1 |               1 |                  1 |
|      2 | (null) |       1 | (null) |               1 |                  1 |
|      2 |      1 |       1 |      1 |               1 |                  1 |
|      2 |      2 |       0 |      0 |               0 |                  0 |  

2 comments:

  1. You mentioned "MySQL uses <=> for its IS DISTINCT FROM, I don't recall if there's an IS NOT DISTINCT FROM operator in MySQL".

    It's actually the other way around. <=> is NULL-safe equality: ie IS NOT DISTINCT FROM.

    NULL-safe inequality is NOT (a <=> b) because there is no yet in MySQL.

    http://dev.mysql.com/doc/refman/5.6/en/comparison-operators.html#operator_equal-to

    ReplyDelete
    Replies
    1. You are correct. Why I didn't read the MySQL reference. I won't commit the same mistake again, there's sqlfiddle now :-) http://sqlfiddle.com/#!2/f9d5f5/4

      Delete