Problem Summary
I am working on a project that requires streaming data to disk at very high speeds on a single Linux server. An fio benchmark shows that I should be able to reach the desired write speeds (> 40 GB/s) using io_uring.
However, I am not able to replicate this performance with my own code using liburing: my current write speed is about 9 GB/s. I suspect that the extra overhead of liburing compared to raw io_uring might be the bottleneck, but I have a few questions about my approach before I give up on the much prettier liburing code.
My approach
- Utilizing submission queue polling.
- NOT queueing gather/scatter I/O requests with writev(), but rather queueing requests that use the normal write() function to write to disk. (I tried gather/scatter I/O requests, but they did not seem to have a major impact on my write speeds.)
- Multithreading, with one ring per thread.
Additional Information
- Running a simplified version of this code that makes no use of threading yields similar results.
- My debugger shows that I am creating the number of threads specified in the NUM_JOBS macro. However, it does not tell me about the threads the kernel creates for sq polling.
- My performance declines when running more than four threads.
- The Linux server has 96 physical cores to work with (no hyperthreading).
- The data is being written to a RAID0 configuration.
- Running bpftrace -e 'tracepoint:io_uring:io_uring_submit_sqe {printf("%s(%d)\n", comm, pid);}' in a separate terminal shows that the kernel thread(s) dedicated to sq polling are active.
- I have verified that the data written to disk exactly matches what I expect in both size and contents.
- I have tried the IORING_SETUP_ATTACH_WQ flag when setting up the rings. It helps marginally, but by no means gets me to the write speed I need.
- I have tried various block sizes; 128k seems to be the sweet spot.
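On the sq polling threads specifically: on mainline kernels since 5.12, each SQPOLL ring gets its own kernel thread named iou-sqp-&lt;pid&gt; (older kernels used a shared io_uring-sq thread), so they can be counted directly while the program runs. A sketch, assuming that naming scheme:

```shell
# Each iou-sqp-* entry is a kernel thread doing submission-queue polling;
# expect one per ring, unless rings share a poll thread via
# IORING_SETUP_ATTACH_WQ. The [i] trick keeps grep from matching itself.
ps -eLo pid,comm | grep '[i]ou-sqp' || echo "no SQPOLL threads found"
```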
Questions
- I expect the kernel to spin up a single thread per ring to handle sq polling, but I do not know how to verify that this is actually happening. Can I assume that it is?
- Why does my performance decrease when running more than two jobs? Is this due to contention between the threads for the file being written to? Maybe there is actually only a single thread working on sq polling that gets bogged down handling requests from multiple rings?
- Are there other flags or options I should be using that might help?
- Is it time to bite the bullet and use direct io_uring calls?
The Code
The code below is a simplified version that removes a lot of error handling code for the sake of brevity. However, the performance and function of this simplified version is the same as the full-featured code.
The main function
```cpp
#include <fcntl.h>
#include <unistd.h>
#include <liburing.h>
#include <cstring>
#include <thread>
#include <vector>
#include "utilities.h"

#define NUM_JOBS 4                          // number of single-ring threads
#define QUEUE_DEPTH 128                     // size of each ring
#define IO_BLOCK_SIZE (128 * 1024)          // write block size
#define WRITE_SIZE (IO_BLOCK_SIZE * 10000)  // total number of bytes to write
#define FILENAME "/mnt/md0/test.txt"        // file to write to

char incomingData[WRITE_SIZE]; // will contain the data to write to disk

void writeToFile(int fd, io_uring* ring, char* buffer, int size, int fileIndex); // defined below

int main()
{
    // Initialize variables
    std::vector<std::thread> threadPool;
    std::vector<io_uring*> ringPool;
    io_uring_params params;
    int fds[2];
    int bytesPerThread = WRITE_SIZE / NUM_JOBS;
    int bytesRemaining = WRITE_SIZE % NUM_JOBS;
    int bytesAssigned = 0;

    utils::generate_data(incomingData, WRITE_SIZE); // fills incomingData with known data

    // Open the file, store its descriptor (O_CREAT requires a mode argument)
    fds[0] = open(FILENAME, O_WRONLY | O_TRUNC | O_CREAT, 0644);

    // Initialize rings
    ringPool.resize(NUM_JOBS);
    for (int i = 0; i < NUM_JOBS; i++)
    {
        io_uring* ring = new io_uring;

        // Configure the io_uring parameters and init the ring
        memset(&params, 0, sizeof(params));
        params.flags |= IORING_SETUP_SQPOLL;
        params.sq_thread_idle = 2000;
        io_uring_queue_init_params(QUEUE_DEPTH, ring, &params);
        io_uring_register_files(ring, fds, 1); // required for sq polling

        // Add the ring to the pool
        ringPool.at(i) = ring;
    }

    // Spin up threads to write to the file
    threadPool.resize(NUM_JOBS);
    for (int i = 0; i < NUM_JOBS; i++)
    {
        int bytesToAssign = (i != NUM_JOBS - 1) ? bytesPerThread : bytesPerThread + bytesRemaining;
        threadPool.at(i) = std::thread(writeToFile, 0, ringPool[i], incomingData + bytesAssigned, bytesToAssign, bytesAssigned);
        bytesAssigned += bytesToAssign;
    }

    // Wait for the threads to finish
    for (int i = 0; i < NUM_JOBS; i++)
    {
        threadPool[i].join();
    }

    // Clean up the rings
    for (int i = 0; i < NUM_JOBS; i++)
    {
        io_uring_queue_exit(ringPool[i]);
        delete ringPool[i];
    }

    // Close the file
    close(fds[0]);
    return 0;
}
```
The writeToFile() function
```cpp
void writeToFile(int fd, io_uring* ring, char* buffer, int size, int fileIndex)
{
    io_uring_cqe* cqe;
    io_uring_sqe* sqe;

    int bytesRemaining = size;
    int bytesToWrite;
    int bytesWritten = 0;
    int writesPending = 0;

    while (bytesRemaining || writesPending)
    {
        /* In this first inner loop,
         * queue up to QUEUE_DEPTH blocks on the submission queue
         */
        while (writesPending < QUEUE_DEPTH && bytesRemaining)
        {
            bytesToWrite = bytesRemaining > IO_BLOCK_SIZE ? IO_BLOCK_SIZE : bytesRemaining;

            sqe = io_uring_get_sqe(ring);
            if (!sqe) break; // if we can't get an sqe, break out and wait for the next round

            io_uring_prep_write(sqe, fd, buffer + bytesWritten, bytesToWrite, fileIndex + bytesWritten);
            sqe->flags |= IOSQE_FIXED_FILE;

            writesPending++;
            bytesWritten += bytesToWrite;
            bytesRemaining -= bytesToWrite;
        }
        io_uring_submit(ring);

        /* In this second inner loop,
         * handle completions.
         * Additional error handling removed for brevity; the functionality
         * is the same as with error handling in the case that nothing goes wrong.
         */
        while (writesPending)
        {
            int status = io_uring_peek_cqe(ring, &cqe);
            if (status == -EAGAIN) break; // no completions available; wait for the next round

            io_uring_cqe_seen(ring, cqe);
            writesPending--;
        }
    }
}
```