Thursday, June 20, 2013

Distinct() nuances

You are unique just like everyone else are, it's just hard to convince the computer. According to computer, you are just a series of bits of 1s and 0s


Using a straightforward Distinct() would work if the elements share the same memory address.

using System;
using System.Linq;
using System.Collections.Generic;
 
public class Test
{
        public static void Main()
        {
                var rose = new Fact { Thing = "Roses are", Color = 0xFF0000 };
                var violet = new Fact { Thing = "Violets are", Color = 0x0000FF };
                var sugar = new Fact { Thing = "Sugar is", Color = 0x1337 };
 
                var list = new List<Fact> () {
                    rose, rose, violet, sugar, sugar
                };
 
                foreach (var item in list.Distinct()) {
                    Console.WriteLine ("{0:x} {1:x}", item.Thing, item.Color);
                }
        }
}
 
 
public class Fact
{
    public string Thing { get; set; }
    public int Color { get; set; }
}




Live code: http://ideone.com/psEvjS
Output:
Roses are ff0000
Violets are ff
Sugar is 1337




In real application, it is not that simple, even some elements have the same content, those elements are not slotted to same memory, by default the Distinct() would not work on that kind of scenario. To wit, the following doesn't work:

using System;
using System.Linq;
using System.Collections.Generic;
 
public class Test
{
        public static void Main()
        {
                var list = new List<Fact> () {
                            new Fact { Thing = "Roses are", Color = 0xFF0000 },
                                    new Fact { Thing = "Roses are", Color = 0xFF0000 },
                                    new Fact { Thing = "Violets are", Color = 0x0000FF },
                                    new Fact { Thing = "Sugar is", Color = 0x1337 },
                                    new Fact { Thing = "Sugar is", Color = 0x1337 },
                            };
                
                foreach (var item in list.Distinct()) {
                    Console.WriteLine ("{0:x} {1:x}", item.Thing, item.Color);
                }
        }
}
 
 
public class Fact
{
    public string Thing { get; set; }
    public int Color { get; set; }
}

Live code: http://ideone.com/58gynx

Thus the above Distinct() incorrectly returns duplicate objects:
Roses are ff0000
Roses are ff0000
Violets are ff
Sugar is 1337
Sugar is 1337



You can use anonymous type on Distinct though. Anonymous types has implicit implementation of Equals and GetHashCode, that's why Distinct() on anonymous type can detect objects with similar content:


using System;
using System.Linq;
using System.Collections.Generic;
 
public class Test
{
        public static void Main()
        {
                var list = new List<Fact> () {
                            new Fact { Thing = "Roses are", Color = 0xFF0000 },
                                    new Fact { Thing = "Roses are", Color = 0xFF0000 },
                                    new Fact { Thing = "Violets are", Color = 0x0000FF },
                                    new Fact { Thing = "Sugar is", Color = 0x1337 },
                                    new Fact { Thing = "Sugar is", Color = 0x1337 },
                            };
                
                foreach (var item in list.Select(x => new { x.Thing, x.Color } ).Distinct()) {
                    Console.WriteLine ("{0} {1}", item.Thing, item.Color);
                }
        }
}
 
 
public class Fact
{
    public string Thing { get; set; }
    public int Color { get; set; }
}


Live code: http://ideone.com/8K79Dn
Hence that shall return distinct elements:
Roses are ff0000
Violets are ff
Sugar is 1337



However, you cannot use that technique when you need to return a list from a function, a function requires a non-anonymous type. Thus you cannot do the following, this will have a compilation error:


public static IEnumerable<Fact> GetFacts(IEnumerable<Fact> list) {
    return list.Select(x => new { x.Thing, x.Color } ).Distinct();
}

Live Code: http://ideone.com/1v6x0i



The following works, it felt kludgy and DRY-violating though:

public static IEnumerable<Fact> GetFacts(IEnumerable<Fact> list) {
            return list.Select(x => new { x.Thing, x.Color }).Distinct()
                       .Select(x => new Fact { Thing = x.Thing, Color = x.Color } );
}

Live Code: http://ideone.com/N4iKTx
Output:
Roses are ff0000
Violets are ff
Sugar is 1337


So what does a good developer shall do to be able to write an elegant code? The dev shall learn how to do a proper object similarity comparison. Do you remember the implicit implementation of Equals and GetHashCode on anonymous type I mentioned a while ago? That's what we will do, we will do it explicitly on the concerning class. To wit, we need to have this code on Fact class:


public override bool Equals (object obj)
{
    if (obj == null) return false;
    
    var x = obj as Fact;
    if (x == null) return  false;
    
    return this.Thing == x.Thing && this.Color == x.Color;
}


public override int GetHashCode ()
{        
    // http://stackoverflow.com/questions/263400/what-is-the-best-algorithm-for-an-overridden-system-object-gethashcode
    unchecked 
    {
        int hash = 17;
        hash = hash * 23 + this.Thing.GetHashCode ();
        hash = hash * 23 + this.Color.GetHashCode ();

        return hash;
    }
}   


When we use Distinct() on that class, that returns the distinct elements even each object has different memory addresses:

Live Code: http://ideone.com/eMrO4y
Output:
Roses are ff0000
Violets are ff
Sugar is 1337


That's it folks!

Happy Coding! ツ

No comments:

Post a Comment