Saturday, October 25, 2014

Performance Battle of NoSQL blob storages #2: Apache Kafka

The first article of this series presented scaling factors for blob-based content on Apache Cassandra. It's a well-known piece of software, requiring a full installation on the nodes, a management application and so on. You also need to tune various configs to achieve the best performance results. I've spent a nice time playing with YAML files on Ubuntu :-)

The configuration is sometimes tricky. I was a little bit confused once or twice, so I planned to hire a Cassandra guru for our team, as the issues I encountered seemed really complicated :-)

Well, we do not need much functionality in HP Service Virtualization. The core is to replicate messages to achieve reliability. The next step is to process them, and the last part is to aggregate the results. Sounds exactly like map and reduce à la Hadoop. The evaluations of Apache Storm and Apache Samza are different stories, but they led me to Apache Kafka.

Kafka is a pretty nice piece of software, much simpler than Cassandra as I've described it above. The only thing you need to do to use it is to depend on the Kafka jars in your pom files. That's it! Maven downloads a couple of dependent jar files and your environment is (almost) ready.
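For illustration, sending a message is only a few lines of code. This is a minimal sketch against the 0.8 Java producer API; the broker address and the topic name are placeholders of my own, not something taken from my real setup.

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class KafkaQuickStart {
    public static void main(String[] args) {
        Properties props = new Properties();
        // address of (one of) the brokers; host and port are placeholders
        props.put("metadata.broker.list", "localhost:9092");
        // we send plain byte arrays, so the default encoder is sufficient
        props.put("serializer.class", "kafka.serializer.DefaultEncoder");

        Producer<byte[], byte[]> producer = new Producer<byte[], byte[]>(new ProducerConfig(props));
        // "blobs" is a hypothetical topic name
        producer.send(new KeyedMessage<byte[], byte[]>("blobs", "hello kafka".getBytes()));
        producer.close();
    }
}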

As you can see below, Kafka is incredibly fast, much faster than Cassandra. Last year I read an article whose author labeled Redis as the most incredible software he had met so far. I agreed. Now there are two candidates for this label :-)

It was also very beneficial for me, as an engineer, to read their technical documentation. The guys did performance research on how to store data on hard drives and employed the findings in the Kafka implementation. Their documentation contains interesting technical material, much like the LMAX paper does for the Disruptor.

Setup

The setup was identical to the one described in the first article.

Performance Measurement of Kafka Cluster

Batch Size

  • 8 topics
  • two partitions per topic
Blob size \ Batch Size [TPS] | 128  | 256  | 512  | 1024 | 32768
100 b                        | 392k | 432k | 433k | 495k | 396k
20 kb                        | 9k   | --   | --   | --   | --

There is little difference between the various batch sizes, so you can tune the value according to your needs. Note that the overall throughput is incredible: 400 Mbit/s of raw data.
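The batch size itself is just a producer property. Here is a hedged sketch of the relevant knobs in the 0.8 producer config; the concrete values are examples only, not tuned recommendations.

import java.util.Properties;

import kafka.producer.ProducerConfig;

public class BatchingConfig {
    // Batching applies to the asynchronous producer; the values below are examples only.
    static ProducerConfig asyncConfig(String brokers, int batchSize) {
        Properties props = new Properties();
        props.put("metadata.broker.list", brokers);                  // broker address(es)
        props.put("serializer.class", "kafka.serializer.DefaultEncoder");
        props.put("producer.type", "async");                         // buffer and send in batches
        props.put("batch.num.messages", String.valueOf(batchSize));  // messages grouped into one request
        props.put("queue.buffering.max.ms", "100");                  // flush a partial batch after this delay
        return new ProducerConfig(props);
    }
}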



Number of Connections

  • batch size is 512

Blob size \ Connections [TPS per connection] | 1    | 2    | 4    | 8    | 16  | 32
100 b                                        | 84k  | 74k  | 62k  | 58k  | 37k | 17k
20 kb                                        | 3.2k | 2.5k | 1.7k | --   | --  | --

You can see that increasing the number of connections significantly increases Kafka's overall write throughput. Two connections handle about 150k messages per second in total, while eight connections allow 464k messages per second across all connections.
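A "connection" here is essentially an independent producer instance fed by its own thread. Below is a rough sketch of how such parallel writers could be spun up; the thread count, message count and topic name are arbitrary, and it reuses the asyncConfig helper from the sketch above.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;

public class ParallelWriters {
    // Each "connection" is an independent producer writing from its own thread.
    static void run(int connections, final byte[] payload) {
        ExecutorService pool = Executors.newFixedThreadPool(connections);
        for (int i = 0; i < connections; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    Producer<byte[], byte[]> producer =
                            new Producer<byte[], byte[]>(BatchingConfig.asyncConfig("localhost:9092", 512));
                    for (int j = 0; j < 1000000; j++) {
                        producer.send(new KeyedMessage<byte[], byte[]>("blobs", payload));
                    }
                    producer.close();
                }
            });
        }
        pool.shutdown();
    }
}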

Number of Partitions

  • 8 connections

Blob size \ Partitions [TPS] | 1    | 2    | 4    | 8    | 16 | 64
100 b                        | 535k | 457k | 493k | 284k | -- | --

Two partitions brought little difference in the overall score. It follows a similar pattern to hyper-threading.
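For the record, this is roughly how a keyed message ends up on a partition: the default partitioner hashes the key modulo the number of partitions, so more partitions simply spread the same stream over more logs. A simplified illustration of the idea, not Kafka's actual class:

public class PartitionIllustration {
    // Simplified: pick a partition from the key hash, like the default partitioner does;
    // messages sent without a key are spread over partitions by the producer itself.
    static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}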

Long Running


The goal of this test is to verify the behavior once the cluster is under continual heavy load for a couple of minutes - until the underlying storage runs out of space (150 GB). As Kafka is really fast, such space lasts only a couple of minutes.

Blob size | TPS
100 b     | 490k

The test generated 150 GB of data successfully stored by Kafka. The throughput result is almost the same as in the short test.

Replication


This is a crucial verification of how Kafka is affected by replication.

Blob size \ Replicas [TPS] | 1    | 2
100 b                      | 577k | 536k

The difference is less than 10% when two nodes are involved in the replication. Great! Both nodes handle 400 Mbit/s of throughput.
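How much replication costs the producer also depends on how many acknowledgements it waits for. In the 0.8 producer this is the request.required.acks property; the small sketch below only shows the knob and its options, it says nothing about what the measurements above used.

import java.util.Properties;

public class ReplicationAcks {
    // request.required.acks controls how many replicas must confirm a write:
    //   "0"  - fire and forget, fastest, no delivery guarantee
    //   "1"  - the partition leader has persisted the message
    //   "-1" - all in-sync replicas have persisted the message, slowest but safest
    static Properties withAcks(Properties props, String acks) {
        props.put("request.required.acks", acks);
        return props;
    }
}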

Large Messages


The previous tests used relatively small messages; how does it behave with larger ones?

Blob size | TPS
100 b     | 640k
20 kb     | 6k
500 kb    | 314

The best byte throughput was achieved by the last - the largest - message size. Even for such a large entity, Kafka handles an unbelievable 1.2 Gbit/s. Note that all of this is remote communication; we have a 10 Gbps network.

Occupied Space


As Kafka stores byte arrays, the occupied space for this system depends on the serialization framework. I used the famous Kryo.

Blob size | Occupied space per message
100 b     | 183 b

Here is the structure of the serialized entity.
class Message {
    private UUID messageId;
    private UUID virtualServiceId;
    private String targetUrlSuffix;
    private int responseStatusCode;
    private long time;
    private byte[] data;
}
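
For completeness, this is roughly how the Kryo round trip looked before handing the bytes to the Kafka producer. It's a sketch only; depending on your Kryo version you may need to register custom serializers (e.g. for the UUID fields), and the buffer sizes are arbitrary.

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

public class MessageCodec {
    // Kryo instances are not thread-safe; use one per thread.
    private final Kryo kryo = new Kryo();

    // Serialize the Message into the byte[] that is handed to the Kafka producer.
    byte[] toBytes(Message message) {
        Output output = new Output(256, -1);   // small initial buffer, unlimited growth
        kryo.writeObject(output, message);
        output.close();
        return output.toBytes();
    }

    // Deserialize a byte[] fetched from Kafka back into a Message.
    Message fromBytes(byte[] bytes) {
        return kryo.readObject(new Input(bytes), Message.class);
    }
}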


Conclusion


Kafka surprised me a lot. Its performance is incredible. The installation is just a dependency on a jar file, and the configuration is very easy. The API is really simple; it's up to you what kind and form of content you prefer.

The only drawback is that this piece of software is pretty young. At the time this article was being written, the 0.8 beta version was out. For example, the async API is still only a proposal.

On the other hand, there is a large set of articles, videos and other materials describing how people have used it in their projects, especially along with Apache Storm.

Well, if you want to use messaging within your new solution, you should definitely have a look at Apache Kafka.
