Tuning Tansu: 600,000 record/s with 13MB of RAM

This article is the first in a series about the process of tuning Tansu, an Apache-licensed open source Kafka-compatible broker, proxy and (early) client API written in async Rust. This article focuses on tuning the protocol layer using a null storage engine.


As preparation for tuning:

Both were a great source of inspiration during some of the wrong turns I took while tuning. Be prepared to have completely wrong assumptions about what is actually slow. Measure (at least) twice, change, measure again (at least twice).

The first step was to write a self-contained performance test for the protocol layer. Bench is a CLI tool (using the awesome clap for argument parsing) that decodes and encodes a series of protocol frames representing different API requests and versions (original, flexible and tag buffered). It is reasonably quick to run at ~5 seconds, so the feedback loop of using hyperfine and cargo flamegraph has a quick turnaround. The bench CLI allows changing the number of iterations and the API keys being tested (making it simpler to narrow the tuning to a single Kafka API's encoding/decoding).
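As a rough sketch of that feedback loop (the bench flags for selecting iterations and API keys are omitted here, since their exact names are not shown in this article), each tuning cycle looked something like:

hyperfine ./target/release/bench
cargo flamegraph --bin bench

hyperfine reports the mean and spread over repeated runs, while cargo flamegraph produces the flame graphs shown later in this article.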

Background

The protocol implementation uses the Kafka project's JSON message definitions, which are fed into a proc macro generating the many structures that represent the protocol (~60k lines of code). It uses a custom-written serde data format to convert a stream of bytes into those structures and vice versa (the deserializer and serializer).
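Purely as an illustration of the shape of that generated code (the field names below come from Kafka's Produce message definition, but this struct is a hand-written guess, not Tansu's actual proc macro output), a version-spanning request might look something like:

#[derive(Debug, Default)]
struct ProduceRequest {
    transactional_id: Option<String>, // only present in later API versions
    acks: i16,
    timeout_ms: i32,
    // ... topic data, partition data and record batches elided
}

fn main() {
    let request = ProduceRequest {
        acks: -1,
        timeout_ms: 30_000,
        ..Default::default()
    };
    println!(
        "acks={} timeout_ms={} transactional_id={:?}",
        request.acks, request.timeout_ms, request.transactional_id
    );
}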

This means that Tansu can understand every version of ~88 Kafka APIs. While a broker can guide a client using its ApiVersions response, not every client complies: a broker cannot dictate which of the 18 possible versions of Fetch a client is going to choose.

Tansu fully supports (a short encoding sketch follows this list):

  • Original (fixed-length sizes) and flexible (variable-length sizes) messages, two formats whose use depends on the API version; and
  • Tag buffers (optional tagged fields that can be added without an API version change).
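As a minimal sketch of the difference (a standalone illustration of the Kafka wire format itself, not Tansu's generated code): an original string field carries a fixed big-endian INT16 length prefix, a flexible "compact" string carries an unsigned varint of length + 1, and flexible messages end with a tag buffer, here empty.

fn put_uvarint(buf: &mut Vec<u8>, mut v: u32) {
    // unsigned varint: 7 bits per byte, high bit set while more bytes follow
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            buf.push(byte);
            break;
        }
        buf.push(byte | 0x80);
    }
}

fn put_string(buf: &mut Vec<u8>, s: &str) {
    // original encoding: fixed INT16 length prefix
    buf.extend_from_slice(&(s.len() as i16).to_be_bytes());
    buf.extend_from_slice(s.as_bytes());
}

fn put_compact_string(buf: &mut Vec<u8>, s: &str) {
    // flexible encoding: unsigned varint of length + 1
    put_uvarint(buf, s.len() as u32 + 1);
    buf.extend_from_slice(s.as_bytes());
}

fn main() {
    let mut original = Vec::new();
    put_string(&mut original, "test");

    let mut flexible = Vec::new();
    put_compact_string(&mut flexible, "test");
    put_uvarint(&mut flexible, 0); // empty tag buffer: zero tagged fields

    assert_eq!(original, [0, 4, b't', b'e', b's', b't']);
    assert_eq!(flexible, [5, b't', b'e', b's', b't', 0]);
}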

The protocol crate is tansu_sans_io, with:

Tansu is built using composable layers handling the networking, routing, layering and processing of Kafka messages, in the client, broker and proxy.

Before Tuning

Before tuning, hyperfine ran the codec produce benchmark in just under 4s (Mac Mini M4):

Benchmark 1: ./target/release/bench
  Time (mean ± σ):      3.703 s ±  0.091 s    [User: 3.679 s, System: 0.009 s]
  Range (min … max):    3.629 s …  3.918 s    10 runs

The cargo flamegraph revealed several areas to investigate:

  • Deserialization was where the initial major gains could be made (starting with the widest frame in the flame graph).
  • Allocation, reallocation and freeing of memory while growing various Vecs.
  • CRC32 checks were eating through a lot of samples.
  • Several places where the or argument of Option::ok_or or Result::map_or was being eagerly evaluated (and then unused... wasted CPU cycles!); see the sketch below.

Bench produce flame graph before tuning
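Two of those patterns are easy to show in isolation (a generic sketch, not Tansu's actual code): Option::ok_or evaluates its argument even when the value is Some, so any allocation done to build the error is wasted on the happy path, while ok_or_else defers that work behind a closure; and growing a Vec from empty reallocates repeatedly, whereas Vec::with_capacity allocates once when the final size is known up front.

fn frame_length(header: Option<u32>) -> Result<u32, String> {
    // eager: ok_or would allocate the error String even when header is Some
    // header.ok_or(String::from("missing frame length"))

    // lazy: the closure only runs on the None path
    header.ok_or_else(|| String::from("missing frame length"))
}

fn copy_payload(payload: &[u8]) -> Vec<u8> {
    // reserve the full size up front instead of growing (and reallocating)
    let mut out = Vec::with_capacity(payload.len());
    out.extend_from_slice(payload);
    out
}

fn main() {
    assert_eq!(frame_length(Some(1024)), Ok(1024));
    assert_eq!(copy_payload(b"abc").len(), 3);
}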

After Tuning

After tuning, hyperfine ran the codec produce benchmark with a mean of just under 2s (down from a 3.7s mean before tuning):

Benchmark 1: ./target/release/bench
  Time (mean ± σ):      1.994 s ±  0.051 s    [User: 1.976 s, System: 0.004 s]
  Range (min … max):    1.965 s …  2.136 s    10 runs

The cargo flamegraph was looking better:

Bench produce flame graph after tuning

Once the basic codec was tuned up, I moved on to tuning the end-to-end performance of the broker. I decided to use kafka-producer-perf-test to run end-to-end tuning runs: something I could test consistently without introducing any noise (whether that was IO or memory usage). I decided to write a /dev/null storage engine that could later act as a baseline for the IO-heavy (SQLite, PostgreSQL, memory and S3) storage engines.

/dev/null storage

The null storage engine responds to Kafka API requests but does almost nothing with them (a minimal sketch follows this list):

  • creating a topic just puts its name in a hash map of table metadata
  • metadata requests respond with the table metadata stored in the map
  • produce requests return as if they had been stored (without storing anything)
  • fetch requests return an empty fetch
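A minimal sketch of the idea (the method names and signatures here are illustrative assumptions, not Tansu's actual storage API):

use std::collections::HashMap;

#[derive(Default)]
struct NullStorage {
    // topic name -> partition count: all the metadata this engine keeps
    topics: HashMap<String, u32>,
}

impl NullStorage {
    fn create_topic(&mut self, name: &str, partitions: u32) {
        self.topics.insert(name.to_owned(), partitions);
    }

    fn metadata(&self, name: &str) -> Option<u32> {
        self.topics.get(name).copied()
    }

    fn produce(&mut self, _topic: &str, _partition: u32, _batch: &[u8]) -> i64 {
        0 // pretend the batch was stored at base offset 0
    }

    fn fetch(&self, _topic: &str, _partition: u32, _offset: i64) -> Vec<u8> {
        Vec::new() // always an empty fetch
    }
}

fn main() {
    let mut storage = NullStorage::default();
    storage.create_topic("test", 3);
    assert_eq!(storage.metadata("test"), Some(3));
    assert_eq!(storage.produce("test", 0, b"records"), 0);
    assert!(storage.fetch("test", 0, 0).is_empty());
}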

Tansu can also act as a Kafka API Proxy, and the null storage engine can be used as a high-performance origin broker, without the network hop and latency of getting to a real origin; that could be tested separately by the Tansu Kafka API Client.

Running kafka-producer-perf-test with a record size of 1024 bytes over localhost, with both the broker and the perf test running on a Mac Mini M4:

throughput (records/sec) | bandwidth (MB/sec) | average latency (ms) | max latency (ms) | 50th % (ms) | 95th % (ms) | 99th % (ms) | 99.9th % (ms)
10000   |   9.76 | 0.98 |  90 | 0 | 3 | 3 |  5
20000   |  19.52 | 0.65 |  91 | 0 | 3 | 3 |  6
30000   |  29.28 | 0.52 |  91 | 0 | 2 | 4 |  7
100000  |  97.61 | 0.29 |  93 | 0 | 1 | 3 | 12
200000  | 195.26 | 0.20 | 109 | 0 | 1 | 2 | 13
300000  | 292.86 | 0.17 |  98 | 0 | 1 | 1 | 14
600000  | 585.80 | 0.12 |  91 | 0 | 1 | 1 | 11

For some reason the first row of every performance run showed a ~90ms max latency, which is strange, because the broker was left running between tests.

The performance tops out at ~600k records per second:

kafka-producer-perf-test --topic test --num-records 25000000 --record-size 1024 --throughput 1000000 --producer-props bootstrap.servers=${ADVERTISED_LISTENER}
3043618 records sent, 608723.6 records/sec (594.46 MB/sec), 0.2 ms avg latency, 92.0 ms max latency.
3291383 records sent, 658276.6 records/sec (642.85 MB/sec), 0.1 ms avg latency, 1.0 ms max latency.
3291189 records sent, 658237.8 records/sec (642.81 MB/sec), 0.1 ms avg latency, 1.0 ms max latency.
3290017 records sent, 658003.4 records/sec (642.58 MB/sec), 0.1 ms avg latency, 1.0 ms max latency.
3277629 records sent, 655525.8 records/sec (640.16 MB/sec), 0.1 ms avg latency, 2.0 ms max latency.
3279412 records sent, 655882.4 records/sec (640.51 MB/sec), 0.1 ms avg latency, 1.0 ms max latency.
3279941 records sent, 655988.2 records/sec (640.61 MB/sec), 0.1 ms avg latency, 1.0 ms max latency.
25000000 records sent, 650567.3 records/sec (635.32 MB/sec), 0.10 ms avg latency, 92.00 ms max latency, 0 ms 50th, 1 ms 95th, 1 ms 99th, 2 ms 99.9th.

Using iperf3 on localhost on the same Mac Mini M4 suggests that we should be able to use a lot more bandwidth:

% iperf3-darwin -c localhost 
Connecting to host localhost, port 5201
[ ID] Interval           Transfer     Bitrate
[  7]   0.00-10.00  sec   128 GBytes   110 Gbits/sec    0             sender
[  7]   0.00-10.00  sec   128 GBytes   110 Gbits/sec                  receiver
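As a rough back-of-the-envelope comparison: ~650,000 records/s at 1,024 bytes each works out to around 0.67 GB/s of payload, or a little over 5 Gbit/s, against the ~110 Gbit/s that iperf3 measures over loopback, so the broker is using only around 5% of the available bandwidth (before accounting for Kafka protocol framing).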

As a result of some of the earlier tuning, the RSS of the broker process while responding under load was:

ps -p $(pgrep tansu) -o rss= | awk '{print $1/1024 " MB"}'
13 MB

Conclusion

We now have a performance baseline:

  • Tests were performed using a null storage engine to get consistent results independent of any storage latency.
  • The broker has sub-millisecond latency when responding to several hundred thousand records per second.
  • The tests also emulate a Kafka API Proxy with a high-performance origin broker.
  • It looks like something is capping performance, but that is something for another day.
  • RSS memory usage of 13MB looks good under a load of 600,000 records per second.
  • The iperf3 metric could be a useful North Star.

Want to try it out for yourself? Clone (and ⭐) Tansu at https://github.com/tansu-io/tansu.

