Performance Battle of NoSQL blob storages #2: Apache Kafka

The first article of this series brought scaling factors for blob-based content on Apache Cassandra. It’s well know piece of software, requiring full installation on nodes, management application and so on. You also need to tune various configs to achieve best performance results. I’ve spend nice time playing with yamls on ubuntu 🙂

The configuration is sometimes tricky. I was little bit confused once or twice so I planned to hire Cassandra guru to our team as issues I encountered seems really complicated 🙂

Well, we do not need much functionality in HP Service Virtualization. The core is to replicate messages to achieve reliability. The next steps is to process them. Yes, the last part is to aggregate the results. Sounds exactly like map and reduce ala hadoop. Evaluation of Apache Storm or Apache Samza are different stories, but they allow me to find Apache Kafka.

Kafka is pretty nice software, much more simpler comparing to Cassandra how I’ve described it above. The only operation to use it is to depend on kafka jars in your pom files. That’s it! Maven downloads couple of dependent jar files and your environment is (almost) ready.

As you can see below, Kafka is incredibly fast, much more faster than Cassandra. Last year I read some article. An author labels Redis as most incredible software he had met so far. I agreed. Now there are two candidates for this label 🙂

It was also very beneficial to read their technical documentation for me as for technical engineer. The guys did performance research how to store data on hard drives and they also employed founded approaches in Kafka implementation. Their documentation contains interesting technical papers, like e.g. lmax pdf does with disruptor.

Setup

The setup was identical to the described in first article.

Performance Measurement of Kafka Cluster

Batch Size

  • 8 topics
  • two partitions per one topic
Blob size \ Batch Size [TPS] 128 256 512 1024 32768
100 b 392k 432k 433k 495k 396k
20 kb 9k

There is little difference between various batch sizes so you can tune the value according your needs. Note that overall throughput is incredible: raw 400 mbits/s.

Number of Connections

  • batch size is 512

Blob size \ Connections [TPS per connection] 1 2 4 8 16 32
100 b 84k 74k 62k 58k 37k 17k
20 kb 3.2k 2.5k 1.7k

You can see that number of connections significantly increases the throughput of Kafka during writes. Two connections handles 150k but, 8 ones allows 464k messages per all connections.

Number of Partitions

  • 8 connections

Blob size \ Partitions [TPS] 1 2 4 8 16 64
100 b 535k 457k 493k 284k

Two partitions brought little difference in the overall score. There is same approach or pattern like in hyper threading.

Long Running

The goal of this test is to verify the behavior once the cluster is under the continual heavy load for couple of minutes – as the underlying storage goes out of space (150GB). As Kafka is really fast, such space lasts couple of minutes.

Blob size \ [TPS]
100b 490k

The test generates 150GB of data successfully stored by Kafka. The throughput result is almost the same as in short test.

Replication

This is crucial verification how is Kafka affected by the replication.

Blob size \ Replicas [TPS] 1 2
100 b 577k 536k

There is difference less than 10% when two nodes are involved in the replication. Great! Both nodes handles 400mbits throughput.

Large Messages

Previous tests use relatively small messages, how does it behave with larger message?

Blob size \ [TPS]
100b 640k
20 kb 6k
500kb 314

The best throughput achieved last – large – message. Even for so large entity, it handles unbelievable 1.2Gbits/s. Note that all of this is a remote communication, we have 10gbps network.

Occupied Space

As the kafka stores byte array, the occupied space for this system depends on the serialization framework. I used famous kryo.

Blob size \ [Occupied space in bytes per message]
100b 183b

Here is the structure of serialized entity.
class Message {

    private UUID messageId;
private UUID virtualServiceId;
private String targetUrlSuffix;
private int responseStatusCode;
private long time;
private byte[] data;
}

Conclusion

Kafka surprised me a lot. It’s performance is incredible. The installation is just the dependency on a jar file, the configuration is very easy. The API is really simple, It’s up to you what kind and form of the content you prefer.

The only drawback is that this piece of software is pretty young. In the time this article was being written, beta version 0.8 is out. For example, async API is now in a proposal only.

On the other hand, there is large set of various articles, videos and other materials, how one used it in his project, especially along with Apache Storm.

Well,  if you want to use messaging within your new solution you should definitely look at Apache Kafka.

2 thoughts on “Performance Battle of NoSQL blob storages #2: Apache Kafka

  1. There are lots of information about latest technology, like Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. This information seems to be more unique and interesting. Thanks for sharing.Big Data Training | Big Data Hadoop Training in Chennai | Big Data Course in Chennai

    Like

  2. This is one such interesting and useful article that i have ever read. The way you have structured the content is so realistic and meaningful. Thank you so much for sharing this in here. Keep up this good work and I'm expecting more contents like this from you in future. Big Data Training Chennai | Best hadoop training institute in chennai | hadoop course in Chennai

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this:
search previous next tag category expand menu location phone mail time cart zoom edit close