
Performance Battle of NoSQL blob storages #2: Apache Kafka

The first article of this series brought scaling factors for blob-based content on Apache Cassandra. It's a well-known piece of software, requiring a full installation on the nodes, a management application, and so on. You also need to tune various configs to achieve the best performance. I spent a nice time playing with YAMLs on Ubuntu :-)

The configuration is sometimes tricky. I was a little bit confused once or twice, so I planned to hire a Cassandra guru for our team, as the issues I encountered seemed really complicated :-)

Well, we do not need much functionality in HP Service Virtualization. The core is to replicate messages to achieve reliability. The next step is to process them, and the last part is to aggregate the results. That sounds exactly like map and reduce à la Hadoop. The evaluation of Apache Storm and Apache Samza is a different story, but it's what led me to Apache Kafka.

Kafka is pretty nice software, much simpler than Cassandra as I've described it above. The only thing needed to use it is to depend on the Kafka jars in your pom files. That's it! Maven downloads a couple of dependent jar files and your environment is (almost) ready.
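To give an idea of how simple it is, here is a minimal producer sketch against the 0.8-era Java API, as I recall it from the documentation; the broker address and topic name are illustrative placeholders:

import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class KafkaQuickStart {
    public static void main(String[] args) {
        Properties props = new Properties();
        // address of at least one broker; the client discovers the rest of the cluster
        props.put("metadata.broker.list", "localhost:9092");
        // send raw byte arrays without any extra encoding
        props.put("serializer.class", "kafka.serializer.DefaultEncoder");

        Producer<byte[], byte[]> producer = new Producer<>(new ProducerConfig(props));
        byte[] payload = new byte[100]; // a 100 b blob, as in the tests below
        producer.send(new KeyedMessage<byte[], byte[]>("blob-topic", payload));
        producer.close();
    }
}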

As you can see below, Kafka is incredibly fast, much faster than Cassandra. Last year I read an article whose author labeled Redis as the most incredible piece of software he had met so far. I agreed. Now there are two candidates for that label :-)

As an engineer, I also found it very beneficial to read the technical documentation. The authors did performance research on how to store data on hard drives and employed their findings in the Kafka implementation. The documentation reads like an interesting technical paper, much like the LMAX PDF does for the Disruptor.

Setup

The setup was identical to the one described in the first article.

Performance Measurement of Kafka Cluster

Batch Size

  • 8 topics
  • two partitions per topic
Blob size \ Batch size [TPS]    128     256     512     1024    32768
100 b                           392k    432k    433k    495k    396k
20 kb                           9k      --      --      --      --

There is little difference between the various batch sizes, so you can tune the value according to your needs. Note that the overall throughput is incredible: 495k messages of 100 b each amount to roughly 400 Mbit/s of raw payload.
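For reference, this is roughly how the batch size was set; a sketch assuming the 0.8-era producer property names (batching applies to the async producer):

import java.util.Properties;
import kafka.producer.ProducerConfig;

public class BatchTuning {
    // 0.8-era property names, as I recall them from the documentation
    static ProducerConfig batchedConfig() {
        Properties props = new Properties();
        props.put("metadata.broker.list", "localhost:9092");
        props.put("producer.type", "async");     // batching is performed by the async producer
        props.put("batch.num.messages", "512");  // messages per batch, e.g. the 512 column above
        return new ProducerConfig(props);
    }
}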



Number of Connections

  • batch size is 512

Blob size \ Connections [TPS per connection]    1       2       4       8       16      32
100 b                                           84k     74k     62k     58k     37k     17k
20 kb                                           3.2k    2.5k    1.7k    --      --      --

You can see that the number of connections significantly increases the overall write throughput of Kafka, even though the per-connection rate drops. Two connections handle about 150k messages per second in total, while eight connections reach 464k across all connections.
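Each "connection" here presumably maps to an independent producer instance. A minimal sketch of parallel writers, assuming the 0.8-era API; thread and message counts are arbitrary:

import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class ParallelWriters {
    public static void main(String[] args) throws InterruptedException {
        int connections = 8; // one producer (and thus one connection) per thread
        Thread[] writers = new Thread[connections];
        for (int i = 0; i < connections; i++) {
            writers[i] = new Thread(() -> {
                Properties props = new Properties();
                props.put("metadata.broker.list", "localhost:9092");
                props.put("serializer.class", "kafka.serializer.DefaultEncoder");
                Producer<byte[], byte[]> producer = new Producer<>(new ProducerConfig(props));
                byte[] blob = new byte[100]; // 100 b payload as in the measurement
                for (int n = 0; n < 100_000; n++) {
                    producer.send(new KeyedMessage<byte[], byte[]>("blob-topic", blob));
                }
                producer.close();
            });
            writers[i].start();
        }
        for (Thread writer : writers) {
            writer.join(); // wait until all connections have finished
        }
    }
}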

Number of Partitions

  • 8 connections

Blob size \ Partitions [TPS]    1       2       4       8       16      64
100 b                           535k    457k    493k    284k    --      --

Adding partitions brought little difference to the overall score. The pattern resembles hyper-threading: more parallel units do not automatically multiply throughput.
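Partitions are selected through the message key: with the 0.8 producer, as far as I recall, the default partitioner hashes the key to pick a partition, so a stable key keeps related messages together. A sketch (the key encoder property and the key choice are illustrative):

import java.util.Properties;
import java.util.UUID;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class KeyedWriter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "localhost:9092");
        props.put("serializer.class", "kafka.serializer.DefaultEncoder");
        props.put("key.serializer.class", "kafka.serializer.StringEncoder"); // keys are strings here

        Producer<String, byte[]> producer = new Producer<>(new ProducerConfig(props));
        String key = UUID.randomUUID().toString(); // e.g. a virtual service id
        byte[] blob = new byte[100];
        // the default partitioner hashes the key to choose one of the topic's partitions
        producer.send(new KeyedMessage<String, byte[]>("blob-topic", key, blob));
        producer.close();
    }
}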

Long Running


The goal of this test is to verify the behavior once the cluster is under continual heavy load for a couple of minutes, until the underlying storage (150 GB) runs out of space. As Kafka is really fast, that space lasts only a couple of minutes.

Blob size    [TPS]
100 b        490k

The test generated 150 GB of data, successfully stored by Kafka. The throughput is almost the same as in the short test.

Replication


This is a crucial verification of how Kafka is affected by replication.

Blob size \ Replicas [TPS]    1       2
100 b                         577k    536k

The difference is less than 10% when two nodes are involved in the replication. Great! Both nodes handle the roughly 400 Mbit/s throughput.
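On the producer side, the replication guarantee is, if I remember the 0.8 documentation correctly, driven by a single property saying how many broker acknowledgements a write waits for:

import java.util.Properties;
import kafka.producer.ProducerConfig;

public class ReplicationAcks {
    // 0.8-era property name, as I recall it from the documentation
    static ProducerConfig replicatedConfig() {
        Properties props = new Properties();
        props.put("metadata.broker.list", "localhost:9092");
        // 0 = fire and forget, 1 = wait for the leader, -1 = wait for all in-sync replicas
        props.put("request.required.acks", "1");
        return new ProducerConfig(props);
    }
}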

Large Messages


The previous tests used relatively small messages; how does Kafka behave with larger ones?

Blob size    [TPS]
100 b        640k
20 kb        6k
500 kb       314

The best raw throughput was achieved by the last, largest message: 314 messages of 500 kb per second amount to an unbelievable 1.2 Gbit/s. Note that all of this is remote communication; we have a 10 Gbit/s network.

Occupied Space


As Kafka stores plain byte arrays, the occupied space depends on the serialization framework used. I used the famous Kryo. For the 100 b payload, the envelope below costs an extra 83 bytes per message.

Blob size    Occupied space per message
100 b        183 b

Here is the structure of the serialized entity:
class Message {
    private UUID messageId;         // unique id of this message
    private UUID virtualServiceId;  // the virtual service this message belongs to
    private String targetUrlSuffix; // suffix of the target URL
    private int responseStatusCode; // status code of the response
    private long time;              // timestamp
    private byte[] data;            // the blob payload itself (100 b in the test above)
}
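A minimal sketch of the Kryo round trip for this entity, assuming the Kryo 2.x-style API (registration is optional there, but it keeps the serialized form smaller):

import java.io.ByteArrayOutputStream;
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

public class MessageCodec {
    private final Kryo kryo = new Kryo();

    public MessageCodec() {
        kryo.register(Message.class); // registered classes serialize to fewer bytes
    }

    public byte[] serialize(Message message) {
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        Output output = new Output(stream);
        kryo.writeObject(output, message);
        output.close();
        return stream.toByteArray(); // this byte[] is what gets handed to Kafka
    }

    public Message deserialize(byte[] bytes) {
        return kryo.readObject(new Input(bytes), Message.class);
    }
}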


Conclusion


Kafka surprised me a lot. Its performance is incredible. The installation is just a dependency on a jar file, and the configuration is very easy. The API is really simple; it's up to you what kind and form of content you prefer.

The only drawback is that this piece of software is pretty young. At the time this article was being written, beta version 0.8 was out. For example, an async API exists only as a proposal.

On the other hand, there is a large set of articles, videos, and other materials describing how people have used it in their projects, especially along with Apache Storm.

Well, if you want to use messaging within your new solution, you should definitely look at Apache Kafka.

