Sunday, June 5, 2011

Using of GUID as primary key in MS SQL is antipattern

We've decided to persist our domain model into database probably a year ago so beside the xml persistence our product uses any database system as a primary storage. The next step was key failure because we've designed the database schema according to current state of our domain model. Why not as NHibernate is so powerful that you are able to map almost every relation type, every scope of accessibility etc.

Beside slow performance, I've already written those articles before, which we've fixed yet the another major issue appeared before a few months. The performance of MS SQL database during continuous insertion of medium-large rows slows down until the system appears almost dead. Why?

As I've already written above, we've decided to not adjust the model according common database recommendations and just persist it. The lifecycle of our domain model needn't be usually coupled with database persistence. So all primary keys in all domain entities were absolutely unique GUIDs.

It doesn't sound so dramatic, we've hoped that it would only occupied more space than simple integer and maybe slightly slows the performance but now I know that choose GUID as primary key is terrible decision.

GUID as primary key decreases performance a lot

I've made PoC how to increase the performance of a few complicated selects statements. They were no "join hell" but they were slow according to our acceptance criteria. I've discover clustered index as very powerful feature which can improve performance itself a lot. I've made further inquiry and found that simple select above sorted table using clustered index is 35% faster with using int as indexed key with respect to previous GUID. Very nice!

Notice that the index was already ordered, for one milion of rows it took almost four minutes so it was useless for us anyway.

GUID as primary key kills the insertion performance

I believe that you can tune the performance for the all time so 35% up was great but expected when you start to play with all those nice profiler apps. The worst issue we've met was that already referenced above.

The primary key was unique GUID generated during the model creation. The consequent inserts just use those IDs and use them in the particular insert, no big deal. The blocker problem was absolutely heavy decreasing performance of those inserts. Lets look at real numbers: first thousand of inserts took approximately one second, when the database table contains almost one million of rows, one thousand of additional inserts took almost 20 seconds!

The database was dead, processor was not doing anything, hard drive as well. No dead-lock was found, it just slept. The first suspicious was cluster primary key, it's cut-off only postponed the problem. After a few tests we've discovered that the clue is primary key. Cut-off the primary key notation returns the performance into original and requested numbers. Unfortunately our system also reads during these insertions so we are unable to switch all keys off.

The current solution is to rewrite model to use auto-generated integer primary keys which also clarified our view to domain driven design's value object to avoid the meaningless use of entity type of object everywhere it's possible.

Do not use GUID within database

I know that there are certainly places where GUID approach is usable but be aware of use them as any key. It:
  • takes significantly more space (4 bytes vs. 38 bytes), it can be serious issue when you use database with space limit, e.g. MS SQL Express
  • kills the database during the insertion
  • slows the performance of select clause, 35% down with respect to int
  • slows joins to tables 
  • almost exclude your application from the use of clustered index