Skip to main content

Using of GUID as primary key in MS SQL is antipattern

We've decided to persist our domain model into database probably a year ago so beside the xml persistence our product uses any database system as a primary storage. The next step was key failure because we've designed the database schema according to current state of our domain model. Why not as NHibernate is so powerful that you are able to map almost every relation type, every scope of accessibility etc.

Beside slow performance, I've already written those articles before, which we've fixed yet the another major issue appeared before a few months. The performance of MS SQL database during continuous insertion of medium-large rows slows down until the system appears almost dead. Why?

As I've already written above, we've decided to not adjust the model according common database recommendations and just persist it. The lifecycle of our domain model needn't be usually coupled with database persistence. So all primary keys in all domain entities were absolutely unique GUIDs.

It doesn't sound so dramatic, we've hoped that it would only occupied more space than simple integer and maybe slightly slows the performance but now I know that choose GUID as primary key is terrible decision.

GUID as primary key decreases performance a lot

I've made PoC how to increase the performance of a few complicated selects statements. They were no "join hell" but they were slow according to our acceptance criteria. I've discover clustered index as very powerful feature which can improve performance itself a lot. I've made further inquiry and found that simple select above sorted table using clustered index is 35% faster with using int as indexed key with respect to previous GUID. Very nice!

Notice that the index was already ordered, for one milion of rows it took almost four minutes so it was useless for us anyway.

GUID as primary key kills the insertion performance

I believe that you can tune the performance for the all time so 35% up was great but expected when you start to play with all those nice profiler apps. The worst issue we've met was that already referenced above.

The primary key was unique GUID generated during the model creation. The consequent inserts just use those IDs and use them in the particular insert, no big deal. The blocker problem was absolutely heavy decreasing performance of those inserts. Lets look at real numbers: first thousand of inserts took approximately one second, when the database table contains almost one million of rows, one thousand of additional inserts took almost 20 seconds!

The database was dead, processor was not doing anything, hard drive as well. No dead-lock was found, it just slept. The first suspicious was cluster primary key, it's cut-off only postponed the problem. After a few tests we've discovered that the clue is primary key. Cut-off the primary key notation returns the performance into original and requested numbers. Unfortunately our system also reads during these insertions so we are unable to switch all keys off.

The current solution is to rewrite model to use auto-generated integer primary keys which also clarified our view to domain driven design's value object to avoid the meaningless use of entity type of object everywhere it's possible.

Do not use GUID within database

I know that there are certainly places where GUID approach is usable but be aware of use them as any key. It:
  • takes significantly more space (4 bytes vs. 38 bytes), it can be serious issue when you use database with space limit, e.g. MS SQL Express
  • kills the database during the insertion
  • slows the performance of select clause, 35% down with respect to int
  • slows joins to tables 
  • almost exclude your application from the use of clustered index 

Comments

Nikola Malovic said…
If we speak about *normal* Guids - you are right, but if we speak about sequential ones (generated by SQL server with newsequetialid() or COMB.guid (used by NHibernate or genrated manually) then there's no real performance hit occuring by usage of guids neither in seek nor in insert operations.

That been said even without any perf hits occuring, GUIDS have disadvantage in amount of space they occupy and the fact they are nto human readable and thus was HiLo placed. They makes sense in scenarios (like the one I have) where you merge in cloud the data of multiple different databases of the disconected clients (kind a like Git approach) in which case they are really great.

Very old (perf measured on SQL 2000) but still very valid article on the subject of guid performance http://www.informit.com/articles/article.aspx?p=25862&seqNum=7

Popular posts from this blog

Performance Battle of NoSQL blob storages #1: Cassandra

Preface We spend last five years on HP Service Virtualization using MsSQL database . Non-clustered server. Our app utilizes this system for all kinds of persistence. No polyglot so far. As we tuned the performance of the response time - we started at 700ms/call and we achieved couple milliseconds per call at the end when DB involved - we had to learn a lot of stuff. Transactions, lock escalation , isolation levels , clustered and non clustered indexes, buffered reading, index structure and it's persistence, GUID ids in clustered indexes , bulk importing , omit slow joins, sparse indexes, and so on. We also rewrite part of NHibernate to support multiple tables for one entity type which allows use scaling up without lock escalation. It was good time. The end also showed us that famous Oracle has half of our favorite features once we decided to support this database. Well, as I'm thinking about all issues which we encountered during the development, unpredictive behavio

ETCD: POST vs. PUT understanding

ETCD is distributed key value store used as a core component in CoreOS . I've already send a post earlier this week. Here is a page describing how to use ETCD basic commands = ETCD API. Code snippets placed in a page mostly use put , but ETCD allows to use post as well.  Most of us understand differences between those two commands in a notion of a REST(ful) service, but how does it work in key value store? POST Example over many words. curl -v http://127.0.0.1:2379/v2/keys/test -XPOST -D value="some value" curl -v http://127.0.0.1:2379/v2/keys/test -XPOST -D value="some value" Two same command result into following content: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 { "action" : "get" , "node" : { "key" : "/test" , "dir" : true , "nodes" : [ { "key" : "/test/194" , "value" : &