Tuning Tansu: 600,000 record/s with 13MB of RAM

This article is the first in a series about the process of tuning Tansu, an Apache-licensed open source Kafka-compatible broker, proxy and (early) client API written in async Rust. This article focuses on tuning the protocol layer using a null storage engine.


As preparation for tuning:

Both were a great source of inspiration during some of the wrong turns I took while tuning. Be prepared to have completely wrong assumptions about what is actually slow. Measure (at least) twice, change, measure again (at least twice).

The first step was to write a self-contained performance test for the protocol layer. Bench is a CLI tool (using the awesome clap for argument parsing) that decodes and encodes a series of protocol frames representing different API requests and versions (original, flexible and tag buffered). It is reasonably quick to run at ~5 seconds, so the feedback loop of using hyperfine and cargo flamegraph has a quick turnaround. The bench CLI allows changing the number of iterations and the API keys being tested (making it simpler to narrow the tuning to a single Kafka API's encoding/decoding).
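As a rough sketch of that feedback loop (the bench flags for selecting iterations and API keys are omitted here, since their exact names are not shown in this article), each tuning cycle looked something like:

hyperfine ./target/release/bench
cargo flamegraph --bin bench

hyperfine reports the mean and spread over repeated runs, while cargo flamegraph produces the flame graphs shown later in this article.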

Background

The protocol implementation uses the Kafka project's JSON message definitions, which are fed into a proc macro generating the many structures that represent the protocol (~60k lines of code). It uses a custom-written serde data format to convert a stream of bytes into those structures and vice versa (the deserializer and serializer).
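Purely as an illustration of the shape of that generated code (the field names below come from Kafka's Produce message definition, but this struct is a hand-written guess, not Tansu's actual proc macro output), a version-spanning request might look something like:

#[derive(Debug, Default)]
struct ProduceRequest {
    transactional_id: Option<String>, // only present in later API versions
    acks: i16,
    timeout_ms: i32,
    // ... topic data, partition data and record batches elided
}

fn main() {
    let request = ProduceRequest {
        acks: -1,
        timeout_ms: 30_000,
        ..Default::default()
    };
    println!(
        "acks={} timeout_ms={} transactional_id={:?}",
        request.acks, request.timeout_ms, request.transactional_id
    );
}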

This means that Tansu can understand every version of ~88 Kafka APIs. While a broker can guide a client using its ApiVersions response, not every client complies: a broker cannot dictate which of the 18 possible versions of Fetch a client is going to choose.

Tansu fully supports (a short encoding sketch follows this list):

  • Original (fixed-length sizes) and flexible (variable-length sizes) messages, two formats whose use depends on the API version; and
  • Tag buffers (optional tagged fields that can be added without an API version change).
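As a minimal sketch of the difference (a standalone illustration of the Kafka wire format itself, not Tansu's generated code): an original string field carries a fixed big-endian INT16 length prefix, a flexible "compact" string carries an unsigned varint of length + 1, and flexible messages end with a tag buffer, here empty.

fn put_uvarint(buf: &mut Vec<u8>, mut v: u32) {
    // unsigned varint: 7 bits per byte, high bit set while more bytes follow
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            buf.push(byte);
            break;
        }
        buf.push(byte | 0x80);
    }
}

fn put_string(buf: &mut Vec<u8>, s: &str) {
    // original encoding: fixed INT16 length prefix
    buf.extend_from_slice(&(s.len() as i16).to_be_bytes());
    buf.extend_from_slice(s.as_bytes());
}

fn put_compact_string(buf: &mut Vec<u8>, s: &str) {
    // flexible encoding: unsigned varint of length + 1
    put_uvarint(buf, s.len() as u32 + 1);
    buf.extend_from_slice(s.as_bytes());
}

fn main() {
    let mut original = Vec::new();
    put_string(&mut original, "test");

    let mut flexible = Vec::new();
    put_compact_string(&mut flexible, "test");
    put_uvarint(&mut flexible, 0); // empty tag buffer: zero tagged fields

    assert_eq!(original, [0, 4, b't', b'e', b's', b't']);
    assert_eq!(flexible, [5, b't', b'e', b's', b't', 0]);
}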

The protocol crate is tansu_sans_io, with:

Tansu is built using composable layers handling the networking, routing, layering and processing of Kafka messages, in the client, broker and proxy.

Before Tuning

Before tuning, hyperfine ran the codec produce benchmark in just under 4s (Mac Mini M4):

Benchmark 1: ./target/release/bench
  Time (mean ± σ):      3.703 s ±  0.091 s    [User: 3.679 s, System: 0.009 s]
  Range (min … max):    3.629 s …  3.918 s    10 runs

The cargo flamegraph revealed several areas to investigate:

  • Deserialization was where the initial major gains could be made (starting with the widest frame in the flame graph).
  • Allocation, reallocation and freeing of memory while growing various Vecs.
  • CRC32 checks were eating through a lot of samples.
  • Several places where the or argument of Option::ok_or or Result::map_or was being eagerly evaluated (and then unused... wasted CPU cycles!); see the sketch below.

Bench produce flame graph before tuning
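Two of those patterns are easy to show in isolation (a generic sketch, not Tansu's actual code): Option::ok_or evaluates its argument even when the value is Some, so any allocation done to build the error is wasted on the happy path, while ok_or_else defers that work behind a closure; and growing a Vec from empty reallocates repeatedly, whereas Vec::with_capacity allocates once when the final size is known up front.

fn frame_length(header: Option<u32>) -> Result<u32, String> {
    // eager: ok_or would allocate the error String even when header is Some
    // header.ok_or(String::from("missing frame length"))

    // lazy: the closure only runs on the None path
    header.ok_or_else(|| String::from("missing frame length"))
}

fn copy_payload(payload: &[u8]) -> Vec<u8> {
    // reserve the full size up front instead of growing (and reallocating)
    let mut out = Vec::with_capacity(payload.len());
    out.extend_from_slice(payload);
    out
}

fn main() {
    assert_eq!(frame_length(Some(1024)), Ok(1024));
    assert_eq!(copy_payload(b"abc").len(), 3);
}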

After Tuning

After tuning, hyperfine ran the codec produce benchmark with a mean of just under 2s (down from a 3.7s mean before tuning):

Benchmark 1: ./target/release/bench
  Time (mean ± σ):      1.994 s ±  0.051 s    [User: 1.976 s, System: 0.004 s]
  Range (min … max):    1.965 s …  2.136 s    10 runs

The cargo flamegraph was looking better:

Bench produce flame graph after tuning

Once the basic codec was tuned up, I moved on to tuning the end-to-end performance of the broker. I decided to use kafka-producer-perf-test to run end-to-end tuning runs: something I could test consistently without introducing any noise (whether that was IO or memory usage). I decided to write a /dev/null storage engine that could later act as a baseline for the IO-heavy (SQLite, PostgreSQL, memory and S3) storage engines.

/dev/null storage

The null storage engine responds to Kafka API requests but does almost nothing with them (a minimal sketch follows this list):

  • creating a topic just puts its name in a hash map of table metadata
  • metadata requests respond with the table metadata stored in the map
  • produce requests return as if they had been stored (without storing anything)
  • fetch requests return an empty fetch
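A minimal sketch of the idea (the method names and signatures here are illustrative assumptions, not Tansu's actual storage API):

use std::collections::HashMap;

#[derive(Default)]
struct NullStorage {
    // topic name -> partition count: all the metadata this engine keeps
    topics: HashMap<String, u32>,
}

impl NullStorage {
    fn create_topic(&mut self, name: &str, partitions: u32) {
        self.topics.insert(name.to_owned(), partitions);
    }

    fn metadata(&self, name: &str) -> Option<u32> {
        self.topics.get(name).copied()
    }

    fn produce(&mut self, _topic: &str, _partition: u32, _batch: &[u8]) -> i64 {
        0 // pretend the batch was stored at base offset 0
    }

    fn fetch(&self, _topic: &str, _partition: u32, _offset: i64) -> Vec<u8> {
        Vec::new() // always an empty fetch
    }
}

fn main() {
    let mut storage = NullStorage::default();
    storage.create_topic("test", 3);
    assert_eq!(storage.metadata("test"), Some(3));
    assert_eq!(storage.produce("test", 0, b"records"), 0);
    assert!(storage.fetch("test", 0, 0).is_empty());
}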

Tansu can also act as a Kafka API Proxy, and the null storage engine can be used as a high-performance origin broker, without the network hop and latency of getting to a real origin; that could be tested separately by the Tansu Kafka API Client.

Running kafka-producer-perf-test with a record size of 1024 bytes over localhost, with both the broker and the perf test running on a Mac Mini M4:

throughput (records/sec) | bandwidth (MB/sec) | average latency (ms) | max latency (ms) | 50th % (ms) | 95th % (ms) | 99th % (ms) | 99.9th % (ms)
10000   |   9.76 | 0.98 |  90 | 0 | 3 | 3 |  5
20000   |  19.52 | 0.65 |  91 | 0 | 3 | 3 |  6
30000   |  29.28 | 0.52 |  91 | 0 | 2 | 4 |  7
100000  |  97.61 | 0.29 |  93 | 0 | 1 | 3 | 12
200000  | 195.26 | 0.20 | 109 | 0 | 1 | 2 | 13
300000  | 292.86 | 0.17 |  98 | 0 | 1 | 1 | 14
600000  | 585.80 | 0.12 |  91 | 0 | 1 | 1 | 11

For some reason the first row of every performance run showed a ~90ms max latency, which is strange, because the broker was left running between tests.

The performance tops out at ~600k records per second:

kafka-producer-perf-test --topic test --num-records 25000000 --record-size 1024 --throughput 1000000 --producer-props bootstrap.servers=${ADVERTISED_LISTENER}
3043618 records sent, 608723.6 records/sec (594.46 MB/sec), 0.2 ms avg latency, 92.0 ms max latency.
3291383 records sent, 658276.6 records/sec (642.85 MB/sec), 0.1 ms avg latency, 1.0 ms max latency.
3291189 records sent, 658237.8 records/sec (642.81 MB/sec), 0.1 ms avg latency, 1.0 ms max latency.
3290017 records sent, 658003.4 records/sec (642.58 MB/sec), 0.1 ms avg latency, 1.0 ms max latency.
3277629 records sent, 655525.8 records/sec (640.16 MB/sec), 0.1 ms avg latency, 2.0 ms max latency.
3279412 records sent, 655882.4 records/sec (640.51 MB/sec), 0.1 ms avg latency, 1.0 ms max latency.
3279941 records sent, 655988.2 records/sec (640.61 MB/sec), 0.1 ms avg latency, 1.0 ms max latency.
25000000 records sent, 650567.3 records/sec (635.32 MB/sec), 0.10 ms avg latency, 92.00 ms max latency, 0 ms 50th, 1 ms 95th, 1 ms 99th, 2 ms 99.9th.

Using iperf3 on localhost on the same Mac Mini M4 suggests that we should be able to use a lot more bandwidth:

% iperf3-darwin -c localhost 
Connecting to host localhost, port 5201
[ ID] Interval           Transfer     Bitrate
[  7]   0.00-10.00  sec   128 GBytes   110 Gbits/sec    0             sender
[  7]   0.00-10.00  sec   128 GBytes   110 Gbits/sec                  receiver
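As a rough back-of-the-envelope comparison: ~650,000 records/s at 1,024 bytes each works out to around 0.67 GB/s of payload, or a little over 5 Gbit/s, against the ~110 Gbit/s that iperf3 measures over loopback, so the broker is using only around 5% of the available bandwidth (before accounting for Kafka protocol framing).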

As a result of some of the earlier tuning, the RSS of the broker process while responding under load was:

ps -p $(pgrep tansu) -o rss= | awk '{print $1/1024 " MB"}'
13 MB

Conclusion

We now have a performance baseline:

  • Tests were performed using a null storage engine to get consistent results independent of any storage latency.
  • The broker has sub-millisecond latency when responding to several hundred thousand records per second.
  • The tests also emulate a Kafka API Proxy with a high-performance origin broker.
  • It looks like something is capping performance, but that is something for another day.
  • RSS memory usage of 13MB looks good under a load of 600,000 records per second.
  • The iperf3 metric could be a useful North Star.

Want to try it out for yourself? Clone (and ⭐) Tansu at https://github.com/tansu-io/tansu.

