Preface
We spent the last five years on HP Service Virtualization using an MS SQL database on a non-clustered server. Our app uses this system for all kinds of persistence; no polyglot persistence so far. As we tuned response times - we started at 700 ms per call and ended up at a couple of milliseconds per call when the DB was involved - we had to learn a lot of stuff.
Transactions, lock escalation, isolation levels, clustered and non-clustered indexes, buffered reading, index structure and its persistence, GUID ids in clustered indexes, bulk importing, avoiding slow joins, sparse indexes, and so on. We also rewrote part of NHibernate to support multiple tables for one entity type, which allowed us to scale up without lock escalation. It was a good time. In the end we also found that the famous Oracle had only half of our favorite features, once we decided to support that database as well.
Well, as I think back over all the issues we encountered during development, unpredictable behavior is the feeling left from working through all of this with my DB specialist colleague. As we cared about performance more and more, we had to understand those features, and we usually replaced the default behavior with a specialized one - we could do that because we understood the essentials of the system. This is the way.
As for the features described above, a lot of NoSQL stores let you implement them on your own. You can easily build an inverted index on top of a key-value store even if the database itself does not support such functionality. What you need to do is write error-free code and maintain the indexes with your own code. Look at the Twitter example for the famous key-value store Redis. It is not as simple as three or four tables in MySQL with built-in indexing, is it?
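To make the "maintain indexes by your own code" point concrete, here is a minimal sketch of a hand-maintained inverted index on top of a key-value store. The store is simulated with a plain `HashMap` (no Redis client involved), and the `msg:` key prefix is a hypothetical naming convention, not anything from the article:

```java
import java.util.*;

// A hand-rolled inverted index over a key-value store.
// The KV store is simulated with a HashMap; in a real deployment both maps
// would live in the store itself (e.g. Redis strings plus sets).
public class InvertedIndexSketch {
    private final Map<String, String> kv = new HashMap<>();          // primary records
    private final Map<String, Set<String>> index = new HashMap<>();  // word -> ids

    // Every write has to update the record AND the index - the store gives
    // you no transaction, so "error-free code" is on you.
    public void put(String id, String value) {
        kv.put("msg:" + id, value);
        for (String word : value.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(word, k -> new HashSet<>()).add(id);
        }
    }

    // Lookups by word go through the index we maintain ourselves.
    public Set<String> findByWord(String word) {
        return index.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch store = new InvertedIndexSketch();
        store.put("1", "hello cassandra world");
        store.put("2", "hello redis");
        System.out.println(store.findByWord("hello")); // ids of both messages
    }
}
```

Compare this with a relational database, where a single `CREATE INDEX` plus the engine's transactional guarantees gives you the same thing for free.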
There is nothing bad about SQL databases. I still believe they fit the purpose of 80% of programs, as they are best for complex querying and/or less-than-massive scaling. But who needs really massive scaling? Everyone?
Why Are We Going to Use NoSQL?
Here in this article, and in the whole series, I'm going to evaluate certain attributes of storages for persisting binary data of variable length. The goal is to persist network traffic in a binary format.
No relational decomposition, no querying of particular attributes of particular messages, no implicit ACID transactions, no indexes, and so on. You can find the reasons in the preface; we just hope we know what we are doing :-)
Cassandra
The first storage evaluated is Cassandra. It is a famous NoSQL database, and its commercial support gives reasonable confidence in fast help when needed.
Cassandra sits somewhere in the middle between schemaless and schema-aware (SQL-like) databases, as it supports a schema and the CQL query language. I was pretty surprised when I saw features like time series with dynamic columns and built-in lists and sets. Another good thing is the OpsCenter monitoring tool; I always show some screenshots from that application when I want to demonstrate an easy-to-use monitor for a distributed app. Pretty nice.
Hardware
Hardware used within the tests:
- DB cluster:
- HP ProLiant BL460c Gen8, 32 cores @ 2.6 GHz, 32 GB RAM, W12k server
- HP ProLiant BL460c Gen8, 32 cores @ 2.6 GHz, 192 GB RAM, Ubuntu 12 server
- Tests executor:
- Xeon, 16 cores, 32 GB RAM, W8k server
- 10Gbps network
- Java 7
Performance Measurement of Cassandra Cluster
The setup used one seed and one replica; the Java tests use the DataStax driver. The replication factor was two when not explicitly defined. After some issues:
#GC and #cassandra ...ufff :-( pic.twitter.com/ohTciftP1e
— Martin Podval (@MartinPodval) February 17, 2014
... my final CQL create script was:
CREATE COLUMNFAMILY recording ( PARTKEY int, ID uuid, VS_ID uuid, URL ascii, MESSAGE_TIME bigint, DATA blob, PRIMARY KEY((PARTKEY, VS_ID), ID) );
Batch Size
The DataStax driver allows executing queries in batches. How does it scale?
256 rows seems to be the best batch size.
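A minimal sketch of the batching idea: group rows into fixed-size chunks before handing each chunk to the driver (e.g. as one `BatchStatement` per chunk). The chunking helper below is self-contained and hypothetical - it is not the article's test code:

```java
import java.util.*;

// Group rows into fixed-size batches before sending them to the cluster.
public class Batching {
    static final int BATCH_SIZE = 256; // the sweet spot measured in the tests

    // Split rows into chunks of at most batchSize elements each.
    static <T> List<List<T>> toBatches(List<T> rows, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += batchSize) {
            batches.add(rows.subList(i, Math.min(i + batchSize, rows.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) rows.add(i);
        // 1000 rows -> batches of 256, 256, 256 and 232
        System.out.println(toBatches(rows, BATCH_SIZE).size()); // 4
    }
}
```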
Number of Connections
The tests utilize a variable number of connections to the Cassandra cluster. Note that the batch size was 256 and the consistency level was set to ONE.
You can see that 2 or 4 connections per client are enough, while a single connection slows down the overall performance.
Number of Partitions
The table contains a partition key, see the CQL code snippet. The test generates variable values for the partition key. The batch size was again 256 and consistency ONE.
The partition key determines where the data will reside; it defines the target machine. As two nodes are involved in my tests, it is better to use more than one partition.
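One possible way to spread writes across partitions is to derive the `PARTKEY` value from the row id. The modulo scheme below is an assumption for illustration, not something taken from the article's test code:

```java
import java.util.UUID;

// Derive a partition key from a row id so inserts spread across nodes.
public class Partitioning {
    // Map an id onto one of `partitions` buckets; with 8 partitions and
    // 2 nodes, both machines receive a share of the writes.
    static int partitionKey(UUID id, int partitions) {
        // modulo first, then abs: the result is always in [0, partitions)
        return Math.abs(id.hashCode() % partitions);
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        System.out.println(partitionKey(id, 8)); // always in [0, 7]
    }
}
```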
Variable Consistency
Tunable consistency is the most beautiful attribute of Cassandra. You can define the consistency level for every interaction with the Cassandra driver, in any language. There are various levels; see the documentation.
The test used 8 partitions, 8 connections and 256 batch size.
Well, I'm surprised. There is almost no difference between consistency levels for such a blob size, despite their totally different purposes and implementations. This could be due to the almost ideal environment: a very fast network and fast machines.
Variable Replicas
I used two replicas by default, but this test tries to reveal the price of a second copy of the data.
There is an apparent penalty for the second replica.
Long Running
The previous tests used a few million inserts/rows, but the current one uses hundreds of millions. This is the way to definitely eliminate the advantage of various caches and clever algorithms. It simulates a heavyweight load test against the server.
8 partitions, 8 connections, 256 batch size.
Large Messages
The largest message in these tests was 20 kB so far, but what is the behavior when our app meets some large file as an attachment?
Consistency ONE, 2 partitions, 2 connections, batch size 2.
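One common way to cope with large attachments is to split the blob into fixed-size chunks so that each row stays small; the chunk size and the whole splitting scheme below are assumptions for illustration, not part of the article's schema:

```java
import java.util.*;

// Split a large attachment into fixed-size chunks, one row per chunk.
public class BlobChunker {
    static final int CHUNK_SIZE = 64 * 1024; // 64 kB per row, hypothetical

    static List<byte[]> split(byte[] data, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < data.length; off += chunkSize) {
            chunks.add(Arrays.copyOfRange(data, off,
                    Math.min(off + chunkSize, data.length)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        byte[] attachment = new byte[200 * 1024]; // a 200 kB "file"
        // 200 kB split into 64 kB chunks -> 64 + 64 + 64 + 8 kB
        System.out.println(split(attachment, CHUNK_SIZE).size()); // 4
    }
}
```

Reassembly then means reading all chunk rows of one message in order, which fits the clustering-by-ID layout of the table.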
Occupied Space
Apache Cassandra has enabled compression automatically since version 1.1.
Here is the structure of the serialized entity.
CREATE COLUMNFAMILY recording (
PARTKEY INT,
ID uuid,
VS_ID uuid,
URL ascii,
MESSAGE_TIME bigint,
DATA blob,
PRIMARY KEY((PARTKEY, VS_ID), ID)
);
Conclusion
This kind of testing is almost the opposite of the Netflix tests. They analyzed scalability, while we tested a simple cluster with only two machines. We needed to figure out the performance of a single node (within a cluster deployment, obviously).
I was impressed, as the numbers above are good; the overall network throughput coming out of one client reached hundreds of Mbps.
What next? The same kinds of tests against Kafka, Redis, Couchbase, and Aerospike.