The first article of this series presented scaling factors for blob-based content on Apache Cassandra. It is a well-known piece of software, requiring a full installation on the nodes, a management application, and so on. You also need to tune various configs to achieve the best performance. I spent a nice time playing with yamls on Ubuntu :-)
The configuration is sometimes tricky. I was a little bit confused once or twice, and the issues I encountered seemed so complicated that I planned to hire a Cassandra guru for our team :-)
Well, we do not need much functionality in HP Service Virtualization. The core requirement is to replicate messages to achieve reliability. The next step is to process them, and the last part is to aggregate the results. That sounds exactly like map and reduce à la Hadoop. Evaluating Apache Storm and Apache Samza are different stories, but they led me to Apache Kafka.
Kafka is a pretty nice piece of software, much simpler than Cassandra as I described it above. The only thing you need to do to use it is depend on the Kafka jars in your pom files. That's it! Maven downloads a couple of dependent jar files and your environment is (almost) ready.
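For orientation, the dependency looks roughly like this; the exact artifact name carries the Scala version, and the coordinates below (for the 0.8 beta) are shown for illustration only:

```xml
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.8.0</artifactId>
    <version>0.8.0-beta1</version>
</dependency>
```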
As you can see below, Kafka is incredibly fast, much faster than Cassandra. Last year I read an article whose author labeled Redis the most incredible software he had met so far. I agreed. Now there are two candidates for that label :-)
As an engineer, I also found it very beneficial to read the technical documentation. The authors did performance research on how to store data on hard drives, and they employed their findings in the Kafka implementation. The documentation contains interesting technical papers, much like the LMAX paper does for the Disruptor.
Setup
The setup was identical to the one described in the first article.
Performance Measurement of Kafka Cluster
Batch Size
- 8 topics
- two partitions per topic
There is little difference between the various batch sizes, so you can tune the value according to your needs. Note that the overall throughput is incredible: 400 Mbit/s of raw data.
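As a sketch of where the batch size plugs in, here is a minimal producer configuration; the property names follow my understanding of the 0.8 producer API, and the broker address is a placeholder:

```java
import java.util.Properties;

public class ProducerSettings {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; the 0.8 producer takes a broker list directly
        props.put("metadata.broker.list", "kafka-host:9092");
        // Send raw byte arrays; serialization is up to the application
        props.put("serializer.class", "kafka.serializer.DefaultEncoder");
        // Batching applies to the async producer
        props.put("producer.type", "async");
        // The batch size tuned in this test
        props.put("batch.num.messages", "512");
        System.out.println(props.getProperty("batch.num.messages")); // prints 512
    }
}
```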
Number of Connections
- batch size is 512

The number of connections significantly increases Kafka's write throughput: two connections handle 150k messages, while 8 connections allow 464k messages across all connections.
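The parallelism model of this test can be sketched as one sender thread per connection; the increment below is just a stand-in for a real producer send call, so the sketch stays self-contained:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ParallelSenders {
    // Counts messages "sent" by N concurrent senders; the increment is a
    // stand-in for a producer send over a dedicated connection.
    static long sendAll(int connections, int messagesPerConnection) throws InterruptedException {
        AtomicLong sent = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(connections);
        for (int c = 0; c < connections; c++) {
            pool.submit(() -> {
                for (int i = 0; i < messagesPerConnection; i++) {
                    sent.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return sent.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(sendAll(8, 1000)); // prints 8000
    }
}
```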
Number of Partitions
- 8 connections
Two partitions brought little difference in the overall score. It follows the same pattern as hyper-threading.
Long Running
The goal of this test is to verify the behavior when the cluster is under continuous heavy load for a couple of minutes, until the underlying storage (150 GB) runs out of space. As Kafka is really fast, such space lasts only a couple of minutes.
The test generated 150 GB of data, successfully stored by Kafka. The throughput result is almost the same as in the short test.
Replication
This is a crucial verification of how Kafka is affected by replication.
There is less than a 10% difference when two nodes are involved in the replication. Great! Both nodes handle 400 Mbit/s of throughput.
Large Messages
The previous tests used relatively small messages; how does Kafka behave with larger ones?
The best throughput was achieved by the last, largest message. Even for such a large entity, Kafka handles an unbelievable 1.2 Gbit/s. Note that all of this is remote communication; we have a 10 Gbps network.
Occupied Space
As Kafka stores byte arrays, the occupied space for this system depends on the serialization framework. I used the famous Kryo.
Here is the structure of the serialized entity.
class Message {
    private UUID messageId;
    private UUID virtualServiceId;
    private String targetUrlSuffix;
    private int responseStatusCode;
    private long time;
    private byte[] data;
}
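Kryo's exact framing differs, but a back-of-the-envelope lower bound for the serialized size follows directly from the field types; the URL suffix length and payload size below are hypothetical values:

```java
public class MessageSize {
    // Rough lower bound for one serialized Message, assuming the serializer
    // writes the fields compactly (per-object class metadata not counted)
    static int estimate(int urlSuffixLength, int payloadBytes) {
        int uuid = 16;                       // a UUID is two longs
        int fixed = 2 * uuid                 // messageId + virtualServiceId
                  + 4                        // responseStatusCode (int)
                  + 8;                       // time (long)
        return fixed + urlSuffixLength + payloadBytes;
    }

    public static void main(String[] args) {
        // hypothetical 10-character ASCII suffix and 1 KB payload
        System.out.println(estimate(10, 1024)); // prints 1078
    }
}
```

In other words, the per-message overhead on top of the payload itself is a few dozen bytes, so the occupied space is dominated by the data array.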
Conclusion
Kafka surprised me a lot. Its performance is incredible. The installation is just a dependency on a jar file, and the configuration is very easy. The API is really simple; it's up to you what kind and form of content you prefer.
The only drawback is that this piece of software is pretty young. At the time this article was being written, beta version 0.8 was out. For example, the async API exists only as a proposal.
On the other hand, there is a large set of articles, videos, and other materials describing how people have used it in their projects, especially along with Apache Storm.
Well, if you want to use messaging in your new solution, you should definitely look at Apache Kafka.