Benchmarking and writing a concurrent mini Redis in Elixir

Apr 05, 2022

Estimated Reading Time: 11 minutes (2266 words)

In the previous post of this series, we wrote a mini Redis integrated with our own RESP parser and a KV store.

In this post, we are going to benchmark it and make changes along the way to make our mini Redis more performant!

This post is inspired by the Rust Tokio Mini-Redis Tutorial, which walks the reader through implementing a mini Redis with tokio. It is part of a series on implementing a mini Redis in Elixir.


Benchmarking with redis-benchmark

We can benchmark our mini Redis server using redis-benchmark.

Before the benchmark sends the actual commands, it first sends some CONFIG commands to fetch the server configuration. Let’s make sure we handle those as well so that our server doesn’t crash because of unmatched patterns:

defp handle_command(socket, command) do
  case command do
    ["SET", key, value] ->
      MiniRedis.KV.set(key, value)
      reply(socket, "+OK\r\n")

    ["GET", key] ->
      case MiniRedis.KV.get(key) do
        {:ok, value} -> reply(socket, "+#{value}\r\n")
        {:error, :not_found} -> reply(socket, "$-1\r\n")
      end
+   # Reply +OK to any command we don't support yet (e.g. CONFIG),
+   # so the benchmark's probes don't crash the connection
+   _ ->
+     reply(socket, "+OK\r\n")
  end
end
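To see what the catch-all buys us, here is a standalone sketch of the same pattern match. The anonymous `reply` function below is illustrative only (it returns the reply instead of writing it to a socket) and is not part of the server:

```elixir
# Illustrative stand-in for handle_command/2: replies are returned
# instead of being written to a socket.
reply = fn
  ["SET", _key, _value] -> "+OK\r\n"
  ["GET", _key] -> "$-1\r\n"
  # Any unrecognized command (such as the CONFIG probes that
  # redis-benchmark sends) falls through here instead of raising.
  _other -> "+OK\r\n"
end

reply.(["CONFIG", "GET", "save"])
```

Without the wildcard clause, that last call would raise a clause-match error and crash the serving process.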

With that, we can run the benchmark with the following commands:

# In project terminal
mix run --no-halt

# In another terminal
# -t set: benchmark only the SET command
# -c 1: limit to 1 parallel client
redis-benchmark -t set -c 1

Upon running this on my local machine, this is the output I get:

╰─➤  redis-benchmark -t set -c 1

ERROR: failed to fetch CONFIG from 127.0.0.1:6379
WARN: could not fetch server CONFIG
====== SET ======
  100000 requests completed in 6.70 seconds
  1 parallel clients
  3 bytes payload
  keep alive: 1
  multi-thread: no

Latency by percentile distribution:
0.000% <= 0.031 milliseconds (cumulative count 3499)
50.000% <= 0.063 milliseconds (cumulative count 53168)
# ....
100.000% <= 12.223 milliseconds (cumulative count 100000)

Cumulative distribution of latencies:
# ....
99.999% <= 11.103 milliseconds (cumulative count 99999)
100.000% <= 13.103 milliseconds (cumulative count 100000)

Summary:
  throughput summary: 14927.60 requests per second
  latency summary (msec):
        avg       min       p50       p95       p99       max
        0.063     0.024     0.063     0.087     0.159    12.223

Results may vary based on your hardware specification. Around 14k requests per second: not bad.

If we bump up -c to multiple clients, weird things happen. On my local machine, it starts with:

SET: rps=12500.0 (overall: 13326.7) avg_msec=0.076 (overall: 0.071)

then the RPS slowly drops to:

...
SET: rps=0.0 (overall: 2423.4) avg_msec=nan (overall: 0.069)
...
...
SET: rps=0.0 (overall: 1817.5) avg_msec=nan (overall: 0.069)

and it seems to take forever to complete the benchmark. After several minutes, here’s the output I obtained:

╰─➤  redis-benchmark -t set -c 2

ERROR: failed to fetch CONFIG from 127.0.0.1:6379
WARN: could not fetch server CONFIG
====== SET ======
  100000 requests completed in 682.69 seconds
  2 parallel clients
  3 bytes payload
  keep alive: 1
  multi-thread: no

Latency by percentile distribution:
0.000% <= 0.031 milliseconds (cumulative count 12)
50.000% <= 0.071 milliseconds (cumulative count 78123)
# ....
100.000% <= 7.671 milliseconds (cumulative count 100000)

Cumulative distribution of latencies:
# ....
99.998% <= 5.103 milliseconds (cumulative count 99998)
99.999% <= 6.103 milliseconds (cumulative count 99999)
100.000% <= 8.103 milliseconds (cumulative count 100000)

Summary:
  throughput summary: 146.48 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.069     0.024     0.071     0.095     0.151     7.671

146 requests per second… Seems like something is not right with our implementation.

What’s making it slow?

Can you guess what it is? The hint is in the title of this post.

Remember the code that we wrote below?

defp loop_acceptor(socket) do
  {:ok, client} = :gen_tcp.accept(socket)
  Logger.info("Accepting client #{inspect(client)}")
  serve(client, "")
  loop_acceptor(socket)
end

Here, we only accept the next client (connection) after we have finished serving the previous one. In short, we handle incoming requests sequentially; our code is not written to handle concurrent requests.

Let’s update our server to deal with it better.

Handling concurrent requests in our TCP server

I won’t go into details here, as we are essentially just following the Task Supervisor section of the guide on the Elixir website.

Essentially, we will serve each client in a separate process using Task.Supervisor:

defp loop_acceptor(socket) do
  {:ok, client} = :gen_tcp.accept(socket)
  Logger.info("Accepting client #{inspect(client)}")

  {:ok, pid} =
    Task.Supervisor.start_child(MiniRedis.TaskSupervisor, fn ->
      serve(client, "")
    end)

  Logger.info("Serving new client with pid: #{inspect(pid)}...")
  :ok = :gen_tcp.controlling_process(client, pid)

  loop_acceptor(socket)
end

The need for calling :gen_tcp.controlling_process is also explained in the guide mentioned above. Here’s a direct quote:

You might notice that we added a line, :ok = :gen_tcp.controlling_process(client, pid). This makes the child process the “controlling process” of the client socket. If we didn’t do this, the acceptor would bring down all the clients if it crashed because sockets would be tied to the process that accepted them (which is the default behaviour).

Also, let’s not forget to include the Task.Supervisor in our application supervisor, in lib/mini_redis/application.ex:

children = [
  # Starts a worker by calling: MiniRedis.Worker.start_link(arg)
  # {MiniRedis.Worker, arg}
  MiniRedis.KV,
+ {Task.Supervisor, name: MiniRedis.TaskSupervisor},
  {Task, fn -> MiniRedis.Server.accept(String.to_integer(System.get_env("PORT") || "6379")) end},
]

Let’s run the benchmark again with more concurrency to see the outcome of our small changes:

redis-benchmark -t set -c 5

With 5 clients, we are now at around 52k requests per second:

╰─➤  redis-benchmark -t set -c 5
ERROR: failed to fetch CONFIG from 127.0.0.1:6379
WARN: could not fetch server CONFIG
====== SET ======
  100000 requests completed in 1.92 seconds
  5 parallel clients
  3 bytes payload
  keep alive: 1
  multi-thread: no

Latency by percentile distribution:
0.000% <= 0.031 milliseconds (cumulative count 65)
50.000% <= 0.087 milliseconds (cumulative count 58111)
# ....
100.000% <= 6.751 milliseconds (cumulative count 100000)

Cumulative distribution of latencies:
73.668% <= 0.103 milliseconds (cumulative count 73668)
# ....
100.000% <= 7.103 milliseconds (cumulative count 100000)

Summary:
  throughput summary: 51975.05 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.089     0.024     0.087     0.151     0.199     6.751

It seems our requests per second (RPS) scale roughly linearly with the number of clients. (Okay, not exactly linearly, but the point is, throughput does scale as we increase the number of clients.)

Pushing it a little further

However, according to the Universal Scalability Law, at some point a system stops scaling with added concurrency and instead loses performance. So let’s push our system further and see how far we can go before it scales backward, starting with just one additional client:

redis-benchmark -t set -c 6

Upon running it, this is the output I get:

╰─➤  redis-benchmark -t set -c 6
ERROR: failed to fetch CONFIG from 127.0.0.1:6379
WARN: could not fetch server CONFIG
SET: rps=0.0 (overall: nan) avg_msec=nan (overall: nan)

Oops, it doesn’t even work… What could be wrong this time?

When I was implementing this, I was quite lost, as I couldn’t tell at first sight what went wrong.

After benchmarking different parts of my code, I reached a conclusion: the bottleneck was probably in our TCP server implementation, since multiple clients could talk to my KV store directly and it still performed well.

Knowing that, I researched and studied gen_tcp further, and finally found one of the most important configuration options that I needed to set.

It turns out our bottleneck this time is :gen_tcp, and there is an option we need to tweak to make it work.

Tuning the gen_tcp configuration to improve performance

While researching, I came across this StackOverflow question about why gen_tcp performance drops under too many concurrent requests. The last answer pointed out the backlog option in gen_tcp and suggested tuning it.

The backlog is a queue where our TCP server buffers incoming connections that can’t be accepted yet. Here’s what the documentation says:

{backlog, B}

B is an integer >= 0. The backlog value defines the maximum length that the queue of pending connections can grow to. Defaults to 5.

Since the backlog defaults to 5, that explains why our Redis server starts facing issues once there are 6 clients.

Hence, resolving it is as simple as updating the options we pass to :gen_tcp.listen:

- case :gen_tcp.listen(port, [:binary, packet: :line, active: false, reuseaddr: true]) do
+ case :gen_tcp.listen(port, [:binary, packet: :line, backlog: 50, active: false, reuseaddr: true]) do
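As a quick standalone sanity check (a minimal sketch: port 0 asks the OS for any free port, so it won’t clash with a running server), the backlog option can be exercised in isolation:

```elixir
# Open a throwaway listening socket with an explicit backlog.
# backlog: 50 is an arbitrary choice; size it to the burst of
# simultaneous connection attempts you expect.
{:ok, listen_socket} =
  :gen_tcp.listen(0, [:binary, packet: :line, backlog: 50, active: false, reuseaddr: true])

# :inet.port/1 reports which ephemeral port the OS assigned.
{:ok, port} = :inet.port(listen_socket)
:ok = :gen_tcp.close(listen_socket)
```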

With this change, we can rerun the benchmark, and it should work as expected:

╰─➤  redis-benchmark -t set -c 6
ERROR: failed to fetch CONFIG from 127.0.0.1:6379
WARN: could not fetch server CONFIG
====== SET ======
  100000 requests completed in 1.45 seconds
  6 parallel clients
  3 bytes payload
  keep alive: 1
  multi-thread: no

Latency by percentile distribution:
0.000% <= 0.031 milliseconds (cumulative count 63)
50.000% <= 0.079 milliseconds (cumulative count 59243)
# ....
100.000% <= 1.599 milliseconds (cumulative count 100000)

Cumulative distribution of latencies:
81.055% <= 0.103 milliseconds (cumulative count 81055)
# ....
100.000% <= 1.607 milliseconds (cumulative count 100000)

Summary:
  throughput summary: 68775.79 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.080     0.024     0.079     0.143     0.183     1.599

68k requests per second. Seems great!

Comparing with the real Redis server

It’s not clear yet whether 68k requests per second in a synthetic benchmark is good or not. So let’s compare it with the actual Redis implementation:

# Make sure we stop our mini Redis server before starting the real one.

# In a separate terminal, start the real Redis server
redis-server

# In the current terminal
redis-benchmark -t set -c 6

Here’s the output running on my local machine:

╰─➤  redis-benchmark -t set  -c 6

====== SET ======
  100000 requests completed in 0.83 seconds
  6 parallel clients
  3 bytes payload
  keep alive: 1
  host configuration "save": 3600 1 300 100 60 10000
  host configuration "appendonly": no
  multi-thread: no

Latency by percentile distribution:
0.000% <= 0.015 milliseconds (cumulative count 846)
50.000% <= 0.039 milliseconds (cumulative count 61555)
# ....
100.000% <= 0.455 milliseconds (cumulative count 100000)

Cumulative distribution of latencies:
# ....
99.994% <= 0.303 milliseconds (cumulative count 99994)
100.000% <= 0.503 milliseconds (cumulative count 100000)

Summary:
  throughput summary: 120772.95 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.038     0.008     0.039     0.063     0.087     0.455

120k requests per second, roughly 52k requests per second faster than our implementation. Given how little code we have written, and how easy it was, I think that’s fair enough.

Note about our benchmark

Since this is a synthetic benchmark, don’t take it too seriously. Our system might behave differently under different workloads. The main reason for benchmarking here is to get a better picture of the performance of our implementation. This helps us understand the boundaries and limits of our system, and hence helps us improve it further.

For instance, by comparing the benchmark results of our mini Redis and the real Redis, I found that our implementation’s performance degrades past a certain number of clients. With 15 clients:

redis-benchmark -t set -c 15
# ....
Summary:
  throughput summary: 44883.30 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.326     0.048     0.319     0.519     0.719     4.031

I end up with much worse performance than before. That is not the case when I run the benchmark against the real Redis server:

redis-benchmark -t set -c 15
...
Summary:
  throughput summary: 176056.33 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.061     0.016     0.055     0.103     0.167     0.695

This indicates that our current implementation introduces some overhead or contention under more concurrent requests, which is something we can improve further.

What’s next?

As mentioned, our current implementation still has some limitations. For instance, we spawn a new Task for every incoming connection; this could potentially be replaced by a pool of processes, using a library like nimble_pool or poolboy.

Would it result in better performance and resource utilization? I’m not sure. In theory, it would reduce the resources needed, but who knows? If you want to dive in further, try using a pool and run some benchmarks to see if it works better.
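As a rough, dependency-free sketch of the pooling idea, one classic BEAM pattern (used by acceptor pool libraries such as ranch) is to start a fixed number of acceptor processes that all block on the same listen socket, so each accepted connection is served by an already-running process instead of a freshly spawned Task. Everything below, including the module and function names, is hypothetical:

```elixir
defmodule AcceptorPool do
  # Start pool_size acceptor processes sharing one listen socket.
  # Concurrency is bounded by pool_size: each acceptor serves one
  # client at a time, then goes back to accepting.
  def start(listen_socket, pool_size) do
    for _ <- 1..pool_size do
      spawn_link(fn -> accept_loop(listen_socket) end)
    end
  end

  defp accept_loop(listen_socket) do
    {:ok, client} = :gen_tcp.accept(listen_socket)
    # serve(client, "") from the earlier post would go here.
    :gen_tcp.close(client)
    accept_loop(listen_socket)
  end
end
```

Whether this beats a Task per connection depends on the workload: a pool avoids spawn overhead but caps concurrency at the pool size.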


Sometime in the future, maybe there will be a part 4 where we further optimize our implementation to improve the performance of our mini Redis server. But for now, that’s the end of this series.

Wrap Up

Can you believe we were able to write a mini Redis server in so little time and code? I didn’t believe it until I tried.

A lot of systems seem complex and complicated at first. But once you reduce the scope of the system and break it down into smaller parts, it starts to become more manageable. Repeat the process, slowly increasing the scope back, and with time, I believe anyone can understand complex systems.

Will those systems become less complex and complicated? Probably not, but we can always improve our domain knowledge and skills to make them more understandable. Only by understanding a system can we make it simple.

Anyway, thanks for reading until the end; hopefully you have learned a thing or two throughout this series.