Skip to main content

Using of GUID as primary key in MS SQL is antipattern

We've decided to persist our domain model into database probably a year ago so beside the xml persistence our product uses any database system as a primary storage. The next step was key failure because we've designed the database schema according to current state of our domain model. Why not as NHibernate is so powerful that you are able to map almost every relation type, every scope of accessibility etc.

Beside slow performance, I've already written those articles before, which we've fixed yet the another major issue appeared before a few months. The performance of MS SQL database during continuous insertion of medium-large rows slows down until the system appears almost dead. Why?

As I've already written above, we've decided to not adjust the model according common database recommendations and just persist it. The lifecycle of our domain model needn't be usually coupled with database persistence. So all primary keys in all domain entities were absolutely unique GUIDs.

It doesn't sound so dramatic, we've hoped that it would only occupied more space than simple integer and maybe slightly slows the performance but now I know that choose GUID as primary key is terrible decision.

GUID as primary key decreases performance a lot

I've made PoC how to increase the performance of a few complicated selects statements. They were no "join hell" but they were slow according to our acceptance criteria. I've discover clustered index as very powerful feature which can improve performance itself a lot. I've made further inquiry and found that simple select above sorted table using clustered index is 35% faster with using int as indexed key with respect to previous GUID. Very nice!

Notice that the index was already ordered, for one milion of rows it took almost four minutes so it was useless for us anyway.

GUID as primary key kills the insertion performance

I believe that you can tune the performance for the all time so 35% up was great but expected when you start to play with all those nice profiler apps. The worst issue we've met was that already referenced above.

The primary key was unique GUID generated during the model creation. The consequent inserts just use those IDs and use them in the particular insert, no big deal. The blocker problem was absolutely heavy decreasing performance of those inserts. Lets look at real numbers: first thousand of inserts took approximately one second, when the database table contains almost one million of rows, one thousand of additional inserts took almost 20 seconds!

The database was dead, processor was not doing anything, hard drive as well. No dead-lock was found, it just slept. The first suspicious was cluster primary key, it's cut-off only postponed the problem. After a few tests we've discovered that the clue is primary key. Cut-off the primary key notation returns the performance into original and requested numbers. Unfortunately our system also reads during these insertions so we are unable to switch all keys off.

The current solution is to rewrite model to use auto-generated integer primary keys which also clarified our view to domain driven design's value object to avoid the meaningless use of entity type of object everywhere it's possible.

Do not use GUID within database

I know that there are certainly places where GUID approach is usable but be aware of use them as any key. It:
  • takes significantly more space (4 bytes vs. 38 bytes), it can be serious issue when you use database with space limit, e.g. MS SQL Express
  • kills the database during the insertion
  • slows the performance of select clause, 35% down with respect to int
  • slows joins to tables 
  • almost exclude your application from the use of clustered index 
1 comment

Popular posts from this blog

Http and TCP Load Balancing with Kubernetes

Kubernetes allows to manage large clusters which are composed of docker containers. And where is large computation power there is large amount of data throughput. However, there is still a need to spread the data throughput so it can reach and utilize particular docker container(s). This is called load balancingKubernetes supports two basic forms of load balancing. Internal among docker containers themselves and external for clients which call you application from outside environment - just users.

Internal Load Balancing with Kubernetes Usual approach during the modeling of an application in kubernetes is to provide domain models for pods, replications controllers and services. Look at the great documentation if you are not aware of all these principles.

For simplification, pod is a set of docker containers which are always located on the same node. Replication controller allows and guarantees scalability. The last but not least is the service which hides all instances of pods - cre…

Validating nginx config file using docker approach

I try to setup nginx as a load balancer. The configuration is just a file with a lot of constrains so I need a validation. There is no online validation service, as e.g. CoreOS has, and I don't want to install nginx on my laptop as I work on a distributed app.

Docker is right approach for me. Let say I have following config:

In short, I'm going to pass nginx config to running nginx instance and look to the logs.

Put you nginx.config to the temp and start the docker image:

sudo docker run --name nginx -v /tmp/nginx.config:/etc/nginx/nginx.conf:ro -d nginx It uses volume mapping so the command just starts a new docker container and mounts a local /tmp/nginx.config to the given in-container path. You can obviously change the volume path to your personal path. Is it working or not? Look at logs.

sudo docker logs nginx If there is no entry, your file is fine. In the case of an error, you can see something like this:

2016/01/08 11:37:31 [emerg] 1#1: unexpected "}" in /etc/…