Recommended design for ingress-networking for multi-threaded server? #951

RedBeard0531 · 2023-09-21T12:53:35Z

RedBeard0531
Sep 21, 2023

I've seen a ton of examples (including on your wiki) for using io_uring for a single-threaded IO-bound server, but none for a multi-threaded, compute-bound server¹ that is expected to handle 10s of thousands of long-lived, mostly-idle connections. One of our constraints is that requests have a wide variety of compute requirements, from sub-100 microseconds up to minutes or even hours, and it can be difficult/impossible to predict in advance how long a request will take to process. For the fast requests, any attempt to dispatch from a networking thread to a compute thread pool will kill us with the overhead of context switching. For the slow requests, by the time we know they are slow we are already deep in the call stack of application code, and it can be tricky to switch threads at that point. So we've found that it is best to have the thread/core that runs the recv() also process the request and send() the reply. This of course also has a side benefit of having the request and reply buffers already hot in that core's cache, which we wouldn't necessarily have if we dispatched to another thread.

I think an ideal loop for our use case would look something like this (very much pseudo-code, eg all error handling is elided):

atomicInt threadsDoingNetworking = 0

start:
    set up uring
    submit:
        multi-shot accept() on each listening socket
        timeout(1 second) # used for reaping excess threads
    launch N-cores processing threads

thread:
    threadsDoingNetorking++
    enter processing loop

processing loop:
    io_uring_enter() # all submits inside the loop are sent to kernel here
    switch (consume 1 completion event from uring):
        case ACCEPT(conn):
            submit recv(conn, 1k buffer) # possibly using provided buffers
        
        case RECV(conn, requestBuffer, readSoFar):
            if client closed their end:
                close(conn)
            else if not have whole request in initial recv:
                size = computeFullSize(requestBuffer)
                requestBuffer.copyGrow(fullSize);
                submit recv(conn, requestBuffer + readSoFar, fullSize - readSoFar, MSG_WAITALL) # no more user-space waking until we have full request
            else:
                if --threadsDoingNetorking == 0:
                    launch additional thread
                replyBuffer = process(requestBuffer) # may take a lot of CPU or do blocking IO
                threadsDoingNetorking++
                # eliding some accounting to prevent multiple in-flight sends on same connection
                submit:
                    send(conn, replyBuffer, MSG_WAITALL) # use send_zc if big enough
                    recv(conn, 1k buffer, IO_RECVSEND_POLL_FIRST) # possibly using provided buffers

        case SEND(conn, replyBuffer):
            free(replyBuffer)

        case TIMEOUT:
            if threadsDoingNetorking > N:
                threadsDoingNetworking-- # really a CAS loop including the if
                end current thread
            submit timeout(1 second)

This is essentially a single-threaded server design, but with 2 main changes: we only consume a single queued event so that other threads can consume other events, and some additional logic to ensure we have some threads, but not too many, ready to process new incoming requests. These are both to ensure that fast requests can be serviced with reasonable latency (assuming available cores), without getting stuck behind slow requests.

However, I have a few questions/concerns:

I know you have stated in many places that it is strongly discouraged to share urings between threads. While I could easily have a submission queue per thread, I don't think a completion queue per thread really works. We really don't want independent requests to block behind each other as long as there is an available CPU, so we don't want to eg, have each thread waiting for recvs on a subset of connections. The extreme version with a thread per connection would work, but is wasteful when the connection is idle, and probably no better than just using blocking IO.
- One possible solution to this would be some flag to allow CQEs to be sent to any ring that is sharing a common worker pool via IORING_SETUP_ATTACH_WQ, or allowing separate submission queues with a common completion queue.
- A more "out there" solution would be to allow the kernel to manage the user-space thread pool similarly to how it manages the kernel-space worker pool. I've often thought that the kernel is the best place to manage thread pools since it knows when threads get stuck in a blocking sys call and can pre-empt them if they spend too long in compute, things that are hard/impossible to do in user space.
If we decide to share a single ring for all threads, is there a way to guarantee that if N threads are blocked in io_uring_enter() that exactly 1 will be woken for each event that becomes ready? The docs seem unclear to me about whether passing 1 for min_complete is sufficient for this. The use of "min" implies that a thread may be expected to consume multiple CQEs. Alternatively, if a single event becomes ready, I don't see any guarantee that it won't wake all threads blocked on the ring (eg if it independently checks n_complete >= min_complete for each thread without decrementing n_complete between checks).
- Just to be clear, I'm totally fine if a "live" thread that happens to finish its request at the right time wins the race and "steals" the CQE intended for the thread to be woken. But if I have 1000-threads in io_uring_enter (eg after a burst of slow ops that has cooled off) I don't want them all to wake up all the time just for 999 go right back to sleep.
Are we still likely to maintain the property that the kernel and user-space processing of send and recv buffers will happen on the same cores, or are we likely to see core handoffs between the kernel filling the request buffer and the userspace processing it (and vice-versa for send buffers).
This is largely based on a design we found works well with epoll but adapted to work with completions rather than readiness (which is better for us anyway). But maybe a radically different approach is better with io_uring?

This is half a question, half a discussion opener. If there is already an obviously correct way to use io_uring for our use case and you could just explain it or link to some docs, that would be great. Even if we need to change our design. If not, I'd love to discuss potential improvements to io_uring that would make it work for us. I'd also understand if this use case is out of scope for what io_uring can reasonably be expected to work well for, and we should continue using alternatives such as epoll.

¹In this specific case I am working on a database server, but I believe this question would apply to many types of multi-threaded RPC servers, including http servers serving dynamic content.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recommended design for ingress-networking for multi-threaded server? #951

{{title}}

Replies: 0 comments

Select a reply

Recommended design for ingress-networking for multi-threaded server? #951

RedBeard0531 Sep 21, 2023

Replies: 0 comments

RedBeard0531
Sep 21, 2023