
Thursday, January 10, 2013

My musings on database deadlock issues part 2

Sometimes we can just throw (not waste) money at the problem to solve our programming woes. It's good that Stack Overflow didn't have to buy Oracle to solve its database deadlock issues. Stack Overflow was built at a fortunate time (2007), when SQL Server already had an MVCC option (SQL Server 2005); contrast that with Oracle, which has used MVCC since version 3 (released in 1983). The decision-makers at Stack Overflow just had to flick a switch (set their SQL Server database to use MVCC) to solve their database deadlock issues.


Oracle using MVCC as its only transaction concurrency model tells a lot: I can't see MVCC as just being about row versioning. Oracle knows the advantages far outweigh the disadvantages, hence making MVCC the only concurrency model its database supports. If Oracle thought the disadvantages could outweigh the advantages, it would not have hardwired the MVCC architecture into its database.

We can't fault companies like airlines and financial institutions if they choose Oracle over other RDBMSes; information correctness matters far more than anything else, and MVCC delivers it properly: there are no dirty reads with MVCC. And performance is the natural byproduct of MVCC's writers not blocking readers and vice versa. Sweet deal. MVCC delivers the trifecta of a very scalable database: correct information (it delivers the I in ACID), no intermittent deadlocks, and performance.


We can see for ourselves the mayhem that nolock can do to our queries (to providing correct reports, for that matter); it doesn't even solve the deadlock issue, the very problem most of us are trying to solve with it: http://blogs.msdn.com/b/craigfr/archive/2007/06/12/query-failure-with-read-uncommitted.aspx

And we are already aware of the evil that SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED can commit in the name of performance.
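
To make that danger concrete, here is a minimal sketch of a dirty read across two sessions (the Account table and its values are made up for illustration):

-- Session 1: deducts from a balance inside a transaction, but does not commit yet.
begin transaction;
update Account set Balance = Balance - 500 where AccountId = 1;

-- Session 2: READ UNCOMMITTED happily reads the in-flight value.
set transaction isolation level read uncommitted;
select Balance from Account where AccountId = 1; -- sees the uncommitted deduction

-- Session 1: changes its mind.
rollback; -- session 2 has now reported a deduction that never really happened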


<microsoft-bashing> I don't expect Microsoft to put MVCC literature in the SQL Server BOL; Microsoft doesn't want to be seen as a mere copycat or a johnny-come-lately, so it avoids things that would give that impression. Case in point: Materialized Views is the de facto nomenclature for, er, materialized views, but what did Microsoft name its materialized views feature? Microsoft wants to be different, so it named its materialized views Indexed Views! I'm not surprised Microsoft put neither MVCC nor Materialized Views in its SQL Server documentation. For good information on things that cannot be found in Microsoft's official documentation, we can just use our trusted search engines. </microsoft-bashing>


I’ll stop here, I’m turning into an MVCC salesman now, hahah :D



As with most things in life, we just have to choose the option whose advantages far outweigh its disadvantages. Some are even loading their whole database into memory to achieve performance; hardware is getting cheaper, and I don't see that trend reversing ツ


Happy Coding! ツ

Wednesday, January 9, 2013

My musings on database deadlock issues

Dirty reads yield dirty information (pardon my tautology)



Making a correct program fast is easier than making a fast program correct. The latter (fast, yet incorrect information) is what SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED delivers; it's Microsoft's workaround for deadlocks, instead of delivering true isolation.



Think of MVCC transaction isolation as each user having their own copy of the database (a scratchpad, if you will; I'm not good at analogies) for their transactions: whatever writes they do to their copy of the database (deletes, inserts, updates, etc.) happen in their own copy only and don't affect the official state of the database. This yields vast improvements for concurrent access (which is the norm) on your database: the writers (INSERT, UPDATE, DELETE) won't ever block other users' SELECTs, hence giving a performance boost when accessing the official state of your database. Good for reports and information browsing, good for developers and clients alike.
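
Here is a minimal sketch of that behavior, assuming a database with READ_COMMITTED_SNAPSHOT ON (the Account table is hypothetical):

-- Session 1: holds an open write transaction on a row.
begin transaction;
update Account set Balance = 0 where AccountId = 1;
-- (not committed yet)

-- Session 2: with row versioning this returns immediately with the last
-- committed Balance; under plain locking READ COMMITTED it would block
-- until session 1 commits or rolls back.
select Balance from Account where AccountId = 1;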



And MVCC yields correct information to boot. It's hard not to swoon over that point alone (correct information); plus it's fast, as there's no waiting or deadlocking between writers and readers. It eludes me why SQL Server doesn't make MVCC the default.



Correctness should rule everything around us



“I find that deadlocks are difficult to understand and even more difficult to troubleshoot. Fortunately, it's easy enough to fix by setting read committed snapshot on the database for our particular workload. But I can't help thinking our particular database vendor just isn't as optimistic as they perhaps should be.” -- http://www.codinghorror.com/blog/2008/08/deadlocked.html



SET READ_COMMITTED_SNAPSHOT ON (which turns on MVCC) is what fixed the Stack Overflow website's deadlock issues. Jeff Atwood is the author of www.codinghorror.com and http://www.stackoverflow.com




nolock is no joy either; being able to read a row is not even guaranteed with nolock (a more granular SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED):
http://blogs.msdn.com/b/craigfr/archive/2007/06/12/query-failure-with-read-uncommitted.aspx

Now try to reproduce the same scenario from the above link without nolock, using the MVCC settings instead: It Just Works™ :


-- enable MVCC-style row versioning on the database
ALTER DATABASE testDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;
ALTER DATABASE testDatabase SET READ_COMMITTED_SNAPSHOT ON;
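
As a side note, the two settings differ: READ_COMMITTED_SNAPSHOT changes the behavior of the default READ COMMITTED level for every session, while ALLOW_SNAPSHOT_ISOLATION merely allows a session to opt in explicitly, roughly like this (testTable is a placeholder):

-- opt a single session into full snapshot isolation
set transaction isolation level snapshot;
begin transaction;
select * from testTable; -- reads a consistent snapshot, without blocking on writers
commit;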

Tuesday, October 30, 2012

Alphabet Soup 101: Sargable

The programming industry is notorious for an alphabet soup of acronyms and jargon. If some concept or phrase doesn't have an acronym or catchy word yet, someone will make one up.



When I'm talking to someone about database optimization, I refer to an index-friendly condition as an index-friendly condition. How hard could that be? But lo and behold, there's a handy word for that programming discipline: it's called SARGable.



In relational databases, a condition (or predicate) in a query is said to be sargable if the DBMS engine can take advantage of an index to speed up the execution of the query. The term is derived from a contraction of Search ARGument ABLE.

A query failing to be sargable is known as a non-sargable query, and it has an effect on query time, so one of the steps in query optimization is to convert non-sargable queries to sargable ones.

The typical thing that makes an SQL query non-sargable is including a function on the left side of a condition in the WHERE clause.
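
For example (Orders and OrderDate are hypothetical names), this condition is non-sargable because the function hides the column from any index on it:

select * from Orders where year(OrderDate) = 2012;

while this rewrite is sargable and lets the engine do an index range seek:

select * from Orders
where OrderDate >= '20120101' and OrderDate < '20130101';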



Read the rest of the article at: http://en.wikipedia.org/wiki/Sargable

http://www.sql-server-performance.com/2007/t-sql-where/

Sunday, July 15, 2012

CASE WHEN ELSE is faster than OR approach

Common denominator feature. When ideals get trumped by reality.


I encountered an SQL query where I thought it would be ideal to make it ORM-friendly, so that when the time comes that you need to port it to an ORM query, porting will be easier. Or so I thought.


I saw this question on http://stackoverflow.com/questions/10646018/can-you-use-case-to-specify-a-join-to-use-in-mysql:

I am wondering if it is possible to use some sort of case in a mysql query to effectively achieve the following and reduce the amount of data that has to be looked up each time ..

SELECT um.message, UNIX_TIMESTAMP(um.time), um.read, user.username
FROM usermessages um
CASE
WHEN um.friendId = 1
INNER JOIN user ON um.sourceUserId = user.id
WHEN um.sourceUserId = 1
INNER JOIN user ON um.friendId = user.id
END
WHERE (um.friendId = 1 OR um.sourceUserId = 1)



I offered this solution: http://stackoverflow.com/questions/10646018/can-you-use-case-to-specify-a-join-to-use-in-mysql/10646233#10646233

SELECT um.message, UNIX_TIMESTAMP(um.time), um.read, user.username
FROM usermessages um
INNER JOIN user ON 
    (um.friendId = 1 AND um.sourceUserId = user.id)
    OR
    um.friendId = user.id
WHERE (um.friendId = 1 OR um.sourceUserId = 1);


Then another stackoverfellow offered this solution: http://stackoverflow.com/questions/10646018/can-you-use-case-to-specify-a-join-to-use-in-mysql/10646078#10646078


SELECT um.message, UNIX_TIMESTAMP(um.time), um.read, user.username
FROM usermessages um
INNER JOIN user ON 
    CASE 
        WHEN um.friendId = 1 THEN um.sourceUserId
                             ELSE um.friendId
    END = user.id
WHERE (um.friendId = 1 OR um.sourceUserId = 1)



I disliked his solution initially. As I embrace ORMs more and more, I'm of the opinion that we should write portable code, and what's more portable than writing the query with constructs that exist in both SQL and the ORM? I disliked the CASE WHEN approach to the problem because it doesn't have an analogous construct in ORMs, Linq in particular.

With that in mind, I tried to dissuade the user from using the CASE WHEN approach. But we need to wise up: portability is not a panacea that can magically get our applications more customers. How can a customer be happy if bad application performance spoils the fun of using your application? So I profiled both queries to see which one is faster. I expected the OR to be slower than the CASE WHEN (an RDBMS doesn't short-circuit; it uses cost-based analysis to determine how it should perform the query), but I wasn't expecting a very big difference in performance. Then, much to my chagrin, I saw a staggering difference: the CASE WHEN approach can finish the job in 88 milliseconds, while the OR approach took 4.7 seconds. The performance of OR is not acceptable.




The CASE WHEN approach's query cost is only 3%, while the OR approach's is 97%; the OR in the join condition prevents a straightforward index seek. So there go our portability ideals; they're not so ideal after all. We need to tap the strengths of a given tool in order to get our job done in an efficient manner. I will no longer be dead set on making a given code construct compatible with another platform or API.



The code used for benchmarking (also at http://www.sqlfiddle.com/#!6/29531/2):
create table usermessages(
  message varchar(30),
  friendId int,
  sourceUserId int
);

create table "user"(
  userId int identity(1,1) primary key,
  userName varchar(30)
);


create index ix_usermessages__friendid 
on usermessages(friendid);

create index ix_usermessages__combined
on usermessages(friendid,sourceUserId);


create index ix_usermessages__sourceuserid
on usermessages(sourceuserid);



insert into usermessages
select 'hello',row_number() over(order by x.message_id) % 50,1 from sys.messages x;


insert into "user"(userName)
select 'hello' from sys.messages;


-- the CASE WHEN approach
select count(*)
from usermessages um
join "user" u
on (case when um.friendId = 1 then um.sourceUserId else um.friendId end) = u.userId;


-- the OR approach
select count(*)
from usermessages um
join "user" u
on (um.friendId = 1 and um.sourceUserId = u.userId)
   or (um.friendId = u.userId);

Wednesday, May 23, 2012

Running total. Here and now

Sql Server 2012 is upon us; it's 2012 already. But for some people, the here and now is still Sql Server 2008.


Read first how not to do a (set-based) running total at http://sqlblog.com/blogs/adam_machanic/archive/2006/07/12/running-sums-redux.aspx


Given Sql Server 2008's windowing limitations, most would give their new-fangled CTE skills a shot by writing the running total query in a recursive manner:

with T AS
(
 select ROW_NUMBER() over(order by OrderID) as rn, * from test
)
,R(Rn, OrderId, Qty, RunningTotal) as
(
 select Rn, OrderID, Qty, Qty
 from t 
 where rn = 1
 
 union all
 
 select t.Rn, t.OrderId, t.Qty, p.RunningTotal + t.Qty
 from t t
 join r p on t.rn = p.rn + 1
 
)
select R.OrderId, R.Qty, R.RunningTotal from r
option(maxrecursion 0)


All is fine and dandy, except that when that query is run on a production database, it will not scale; it took a good 9 seconds on a 5,000-row table. (DDL at the bottom of this post)

Now let's try another approach. Let's imagine for a while that we are back in time, say in Sql Server 2000. What would your DBA grandpa do to facilitate that running total report?

create function TestRunningTotal()
returns @ReturnTable table(
    OrderId int, Qty int, RunningTotal int
)
as begin

    insert into @ReturnTable(OrderID, Qty, RunningTotal)
    select OrderID, Qty, 0 from Test
    order by OrderID;

    declare @RunningTotal int = 0;

    -- quirky update: accumulates Qty into @RunningTotal row by row,
    -- relying on the rows being updated in insert order
    update @ReturnTable set 
           RunningTotal = @RunningTotal, 
           @RunningTotal = @RunningTotal + Qty;

    return;
end;

And that query took 0 seconds.
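
Using the function is straightforward:

select OrderId, Qty, RunningTotal
from TestRunningTotal()
order by OrderId;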


And going back further in time, say to Sql Server 7, what would he do? He would follow Adam Machanic's approach: http://sqlblog.com/blogs/adam_machanic/archive/2006/07/12/running-sums-redux.aspx. A running total is one of the rare cases where a cursor is a very good choice.


The following is every developer's heaven; it works on Postgresql 8.4, Sql Server 2012 and Oracle 9 (or 8.1.7?):

select OrderID, Qty, sum(Qty) over(order by OrderID) as RunningTotal from test

Now back to our regular programming (Sql Server 2008).

Some call that kind of update, which relies on physical sort order, a quirky update. If you feel uneasy with the quirky update, you might want to put in a guard statement to prevent a wrong update order.

create function TestRunningTotalGuarded()
returns @ReturnTable table(
    OrderId int, Qty int, RunningTotal int not null, RN int identity(1,1) not null
)
as begin

    insert into @ReturnTable(OrderID, Qty, RunningTotal)
    select OrderID, Qty, 0 from Test
    order by OrderID;
    
    declare @RunningTotal int = 0;
    
    declare @RN_check INT = 0;
    
    -- guard: @RN_check must match the identity column RN on every row;
    -- if rows are updated out of insert order, the CASE evaluates 1/0
    -- and raises a divide-by-zero error instead of computing bad totals
    update @ReturnTable set 
            @RN_check = @RN_check + 1,
            @RunningTotal = (case when RN = @RN_check then @RunningTotal + Qty else 1/0 end),
            RunningTotal = @RunningTotal;

    return;
end;


If the UPDATE really does update rows in an unpredictable order, @RN_check will no longer be equal to RN (the identity order), and the code will then raise a divide-by-zero error.



DDL

create table Test(
 OrderID int primary key,
 Qty int not null
);


declare @i int = 1;

while @i <= 5000 begin
 insert into Test(OrderID, Qty) values (@i * 2,rand() * 10); 
 set @i = @i + 1;
end;


Running total example results:
OrderId     Qty         RunningTotal
----------- ----------- ------------
2           4           4
4           8           12
6           4           16
8           5           21
10          3           24
12          8           32
14          2           34
16          9           43
18          1           44
20          2           46
22          0           46
24          2           48
26          6           54
28          2           56
30          8           64
32          6           70
34          0           70
36          4           74
38          2           76
40          5           81
42          4           85



UPDATE: May 29, 2012

Use a cursor; the quirky update is, well.. quirky. Until further notice, please use something predictable, like a cursor. http://www.ienablemuch.com/2012/05/recursive-cte-is-evil-and-cursor-is.html

UPDATE: May 29, 2012 8:27 PM

Just by putting a clustered primary key on the table variable, the rows get updated in order. Check the test (looped 100 times) at the bottom of this article; it no longer raises the divide-by-zero (guard statement) error: http://www.ienablemuch.com/2012/05/recursive-cte-is-evil-and-cursor-is.html

Until further notice, I would say the quirky update is not really quirky.