Problem Summary
I am working on a project that requires streaming data to disk at very high speeds on a single Linux server. An fio benchmark shows that I should be able to reach the desired write speeds (> 40 GB/s) using io_uring.
However, I am not able to replicate this performance with my own code using liburing: my current write speed is about 9 GB/s. I suspect that the extra overhead of liburing compared to raw io_uring might be the bottleneck, but I have a few questions about my approach before I give up on the much prettier liburing code.
My approach
- Utilizing submission queue polling.
- NOT queueing gather/scatter I/O requests with writev(), but rather queueing requests that use the normal write() function to write to disk. (I tried gather/scatter I/O requests, but they did not seem to have a major impact on my write speeds.)
- Multithreading, with one ring per thread.
Additional Information
- Running a simplified version of this code that makes no use of threading yields similar results.
- My debugger shows that I am creating the number of threads specified in the NUM_JOBS macro. However, it does not tell me about the threads the kernel creates for sq polling.
- My performance declines when running more than four threads.
- The Linux server has 96 physical cores to work with (no hyperthreading).
- The data is being written to a RAID0 configuration.
- Running bpftrace -e 'tracepoint:io_uring:io_uring_submit_sqe {printf("%s(%d)\n", comm, pid);}' in a separate terminal shows that the kernel thread(s) dedicated to sq polling are active.
- I have verified that the data written to disk exactly matches what I expect in both size and contents.
- I have tried the IORING_SETUP_ATTACH_WQ flag when setting up the rings. It helps marginally, but by no means gets me to the write speed I need.
- I have tried various block sizes; 128k seems to be the sweet spot.
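On the sq polling threads specifically: on mainline kernels since 5.12, each SQPOLL ring gets its own kernel thread named iou-sqp-&lt;pid&gt; (older kernels used a shared io_uring-sq thread), so they can be counted directly while the program runs. A sketch, assuming that naming scheme:

```shell
# Each iou-sqp-* entry is a kernel thread doing submission-queue polling;
# expect one per ring, unless rings share a poll thread via
# IORING_SETUP_ATTACH_WQ. The [i] trick keeps grep from matching itself.
ps -eLo pid,comm | grep '[i]ou-sqp' || echo "no SQPOLL threads found"
```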
Questions
- I expect the kernel to spin up a single thread per ring to handle sq polling, but I do not know how to verify that this is actually happening. Can I assume that it is?
- Why does my performance decrease when running more than two jobs? Is this due to contention between the threads for the file being written to? Maybe there is actually only a single thread working on sq polling that gets bogged down handling requests from multiple rings?
- Are there other flags or options I should be using that might help?
- Is it time to bite the bullet and use direct io_uring calls?
The Code
The code below is a simplified version that removes a lot of error handling code for the sake of brevity. However, the performance and function of this simplified version is the same as the full-featured code.
The main function
```cpp
#include <fcntl.h>
#include <unistd.h>
#include <liburing.h>
#include <cstring>
#include <thread>
#include <vector>
#include "utilities.h"

#define NUM_JOBS 4                          // number of single-ring threads
#define QUEUE_DEPTH 128                     // size of each ring
#define IO_BLOCK_SIZE (128 * 1024)          // write block size
#define WRITE_SIZE (IO_BLOCK_SIZE * 10000)  // total number of bytes to write
#define FILENAME "/mnt/md0/test.txt"        // file to write to

char incomingData[WRITE_SIZE]; // will contain the data to write to disk

void writeToFile(int fd, io_uring* ring, char* buffer, int size, int fileIndex); // defined below

int main()
{
    // Initialize variables
    std::vector<std::thread> threadPool;
    std::vector<io_uring*> ringPool;
    io_uring_params params;
    int fds[2];
    int bytesPerThread = WRITE_SIZE / NUM_JOBS;
    int bytesRemaining = WRITE_SIZE % NUM_JOBS;
    int bytesAssigned = 0;

    utils::generate_data(incomingData, WRITE_SIZE); // fills incomingData with known data

    // Open the file, store its descriptor (O_CREAT requires a mode argument)
    fds[0] = open(FILENAME, O_WRONLY | O_TRUNC | O_CREAT, 0644);

    // Initialize rings
    ringPool.resize(NUM_JOBS);
    for (int i = 0; i < NUM_JOBS; i++)
    {
        io_uring* ring = new io_uring;

        // Configure the io_uring parameters and init the ring
        memset(&params, 0, sizeof(params));
        params.flags |= IORING_SETUP_SQPOLL;
        params.sq_thread_idle = 2000;
        io_uring_queue_init_params(QUEUE_DEPTH, ring, &params);
        io_uring_register_files(ring, fds, 1); // required for sq polling

        // Add the ring to the pool
        ringPool.at(i) = ring;
    }

    // Spin up threads to write to the file
    threadPool.resize(NUM_JOBS);
    for (int i = 0; i < NUM_JOBS; i++)
    {
        int bytesToAssign = (i != NUM_JOBS - 1) ? bytesPerThread : bytesPerThread + bytesRemaining;
        threadPool.at(i) = std::thread(writeToFile, 0, ringPool[i], incomingData + bytesAssigned, bytesToAssign, bytesAssigned);
        bytesAssigned += bytesToAssign;
    }

    // Wait for the threads to finish
    for (int i = 0; i < NUM_JOBS; i++)
    {
        threadPool[i].join();
    }

    // Clean up the rings
    for (int i = 0; i < NUM_JOBS; i++)
    {
        io_uring_queue_exit(ringPool[i]);
        delete ringPool[i];
    }

    // Close the file
    close(fds[0]);
    return 0;
}
```
The writeToFile() function
```cpp
void writeToFile(int fd, io_uring* ring, char* buffer, int size, int fileIndex)
{
    io_uring_cqe* cqe;
    io_uring_sqe* sqe;

    int bytesRemaining = size;
    int bytesToWrite;
    int bytesWritten = 0;
    int writesPending = 0;

    while (bytesRemaining || writesPending)
    {
        /* In this first inner loop,
         * queue up to QUEUE_DEPTH blocks on the submission queue
         */
        while (writesPending < QUEUE_DEPTH && bytesRemaining)
        {
            bytesToWrite = bytesRemaining > IO_BLOCK_SIZE ? IO_BLOCK_SIZE : bytesRemaining;

            sqe = io_uring_get_sqe(ring);
            if (!sqe) break; // if we can't get an sqe, break out and wait for the next round

            io_uring_prep_write(sqe, fd, buffer + bytesWritten, bytesToWrite, fileIndex + bytesWritten);
            sqe->flags |= IOSQE_FIXED_FILE;

            writesPending++;
            bytesWritten += bytesToWrite;
            bytesRemaining -= bytesToWrite;
        }
        io_uring_submit(ring);

        /* In this second inner loop,
         * handle completions.
         * Additional error handling removed for brevity; the functionality
         * is the same as with error handling in the case that nothing goes wrong.
         */
        while (writesPending)
        {
            int status = io_uring_peek_cqe(ring, &cqe);
            if (status == -EAGAIN) break; // no completions available; wait for the next round

            io_uring_cqe_seen(ring, cqe);
            writesPending--;
        }
    }
}
```