We have designed cluster-aware and cloud-based solution of service virtualization for some time. Incredible work as distributed system is next dimension of thinking – and learning as well 🙂
We was deciding to split atomic piece of our domain work to multiple nodes. Such step would hopefully provide us to point certain messages to certain nodes. We can than store data in main memory and avoid distributed locking above some remote store, probably Redis. This is usual approach how to achieve better response time of the whole solution to avoid database search for all calls. Obviously with well designed replication.
I was really worried about latency. We need really low response time in couple of milliseconds at most under heavy load. I’m talking about the whole solution which consumes CPU a lot because of the calculations of simulated response under the hood, so no space for any wasting.
What is the best latency of such transmission between two nodes? What is the throughput of one or multiple connections? Except theoretical analysis I also did our own proof of concept to verify our thoughts in the reality.
I like kryo serialization framework. It’s very fast, very simple and the boot of anybody is in level of minutes. Hopefully I discovered that guys from Esoteric Software think about remote communication and they did very logic step. They developed framework allowing TCP communication using kryo serialization – kryonet.
Well, I believe that nobody is surprised that the solution is very fast and super easy as well. Like kryo itself.
There are some benchmarks how TCP communication can be fast, in level o microseconds, but what is the real response time, performance, when I need to transfer some domain object to another node in real environment using gigabit network. I mean no laboratory environment but just real one where our app will probably run in couple of months?
Well, in short it’s hundred of microseconds.
Kryonet via Kryo
It is built above java nio, kryo serialization and asynchronous approach when the response is being received.
It’s better to read the manual on github as coding around the sending and the receiving is very easy.
Kryo preserves the way that the serializators are separated classes, later registered to the system, so you do no have to affect your domain classes. But obviously, you have to open them to support injections of fields from outside. Or you can use the reflection too.
The response is received asynchronously to the processing of the original request. Kryonet exposes TCP port on the server side and uses TCP under the hood. But don’t worry, you almost can’t touch complexity of TCP messaging.
Number meaning and setup
What was my message? It contains identifier, uuid, and map of elements. The element contains the data but the important thing is the overall size of the message.
Response time is the whole round-trip:
- Serialization on the client side
- To send the request to the server
- Deserialization of the message on server side
- To respond the message identifier back to the client
- The deserialization of received message on the client size
Well, the response time is just the number you can expect when you want to push some message to remote component and wait for the acknowledgment – ACK.
Both client and server are HP Proliant, 24 and 32 cores, >1 gigabit network.
One TCP Connection
First of all, I used one TCP connection.
|Message Size [bytes]/Number of elements||Average RT [micro-seconds]||Median RT [micro-seconds]||Buffer size||Throughput [msgs/s]||Rough Load [msgs/s]|
|48 / 2||190||160||256 B||5400||100k|
|48 / 2||202||170||16 kB||5400||100k|
|48 / 2||672||157||1kB||18300||1M|
|48 / 2||418||130||16kB||19300||4M|
|3216 / 200||316||313||16 kB||650||1k|
|3216 / 200||332||331||256 B||–||1k|
Message size: size of serialized message into raw byte.
Average vs. median: as my test keeps all response time value, it was very interesting to analyze them. Average value reflects all edge cases, mostly the divergence during the flushing of buffers etc., but median value does not. What does it mean in real? The difference between average and median value indicates growing number of slower responses.
Buffer: kryo uses serialization and object buffer, see the documentation.
Rough load: all calls are asynchronous which means that I had to limit incoming load generated by my generator. The approach I chose was to call standard java thread sleep after certain messages. The operation in milliseconds is not quite accurate but it was fine for rough limitation of input load. 4M means that the generator thread sleeps for 1 milliseconds once it generates four thousand of messages.
- there is only two times slower response time for 100 times larger message
- there is almost no different response time for various buffer sizes
- increasing load means almost the same median value but highly increased average response time. Well, the response time is not stable at all which indicates that you can’t rely on response time in the case of massive load
Multiple Client Connections to One TCP Server Port
The approach having multiple clients using one TCP port usually scales well, but not for kryonet as it uses limited buffer size, see following table. I used 10 concurrent clients for the measure.
|Message Size [bytes]/Number of elements||Average RT [micro-seconds]||Median RT [micro-seconds]||Buffer size (client/server)||Throughput [msgs/s]||Rough Load [msgs/s]|
|48 / 2||256||254||16k/16k||9000||1k|
|48 / 2||580||535||16k/160k||42000||5k|
|48 / 2||33000||23000||16k/300k||80000||10k|
- kryo itself is very fast but single threaded deserialization brings certains limits so huge throughput was achieved for the cost of really high latency
- the increasing of the buffer size within the server allows to serve much more requests as they just fall into buffer and were not rejected as before for lower buffer size
Multiple Connections to Multiple Ports
The last includes multiple clients connected to the multiple server ports. It brought the best performance as expected.
|Message Size [bytes]/Number of elements||Connections||Average RT [micro-seconds]||Median RT [micro-seconds]||Buffer size (client/server)||Throughput [msgs/s]||Rough Load [msgs/s]|
|48 / 2||10||542||486||256/256||54000||100k|
|48 / 2||10||5000||518||16k/16k||155000||500k|
|48 / 2||20||2008||1423||16k/32k||210000||500k|
|48 / 2||50||1460||404||16k/32k||182000||100k|
|48 / 2||50||2024||378||256k/256k||220000||100m|
|48 / 2||100||3120||404||256k/256k||231000||100m|
|3216 / 200||10||405||368||8k/32k||21000||3k|
|3216 / 200||10||415||353||16k/32k||21000||3k|
|3216 / 200||20||399||350||16k/32k||32300||2k|
|3216 / 200||50||380||327||16k/32k||43700||1k|
|3216 / 200||50||410||344||64k/128k||78000||2k|
- incredible load for 100/50 clients per second in no bad time as I would originally expected
- as I’ve already discuss the stuff above, there is huge different between median and average RT as well
- there is no such difference between 20 and 100 clients in throughtput but in average/median RT