1BRC in Rust #57

tumdum · 2024-01-03T19:36:17Z

tumdum
Jan 3, 2024

Nothing too exciting - mmap whole file, split it into slices (one per core), collect partial results and merge.

Code is here: https://github.com/tumdum/1brc/blob/main/src/main.rs

Runs in ~15.5s on an M2 Pro
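A minimal sketch of the split/collect/merge approach described above. It is not the code from the linked repository: it works on an in-memory byte slice instead of a real mmap, uses only the standard library, and the names `Stats`, `split_on_newlines`, `process_chunk` and `aggregate` are made up for illustration.

```rust
use std::collections::HashMap;
use std::thread;

/// Running statistics for one station.
#[derive(Clone, Copy, Debug)]
struct Stats {
    min: f64,
    max: f64,
    sum: f64,
    count: u64,
}

impl Stats {
    fn new(v: f64) -> Self {
        Stats { min: v, max: v, sum: v, count: 1 }
    }

    fn add(&mut self, v: f64) {
        self.min = self.min.min(v);
        self.max = self.max.max(v);
        self.sum += v;
        self.count += 1;
    }

    fn merge(&mut self, other: &Stats) {
        self.min = self.min.min(other.min);
        self.max = self.max.max(other.max);
        self.sum += other.sum;
        self.count += other.count;
    }
}

/// Split `data` into at most `n` slices, each ending right after a newline
/// (or at the end of the input).
fn split_on_newlines(data: &[u8], n: usize) -> Vec<&[u8]> {
    let n = n.max(1);
    let target = ((data.len() + n - 1) / n).max(1);
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let mut end = (start + target).min(data.len());
        // Grow the chunk until it ends on a line boundary.
        while end < data.len() && data[end - 1] != b'\n' {
            end += 1;
        }
        chunks.push(&data[start..end]);
        start = end;
    }
    chunks
}

/// Build the partial result for one chunk; keys borrow from the input.
fn process_chunk(chunk: &[u8]) -> HashMap<&str, Stats> {
    let text = std::str::from_utf8(chunk).expect("input is valid UTF-8");
    let mut map = HashMap::new();
    for line in text.lines() {
        if let Some((name, value)) = line.split_once(';') {
            if let Ok(v) = value.parse::<f64>() {
                map.entry(name)
                    .and_modify(|s: &mut Stats| s.add(v))
                    .or_insert_with(|| Stats::new(v));
            }
        }
    }
    map
}

/// One scoped thread per chunk, then merge the partial maps.
fn aggregate(data: &[u8], threads: usize) -> HashMap<&str, Stats> {
    let chunks = split_on_newlines(data, threads);
    let partials: Vec<HashMap<&str, Stats>> = thread::scope(|scope| {
        let handles: Vec<_> = chunks
            .iter()
            .map(|&chunk| scope.spawn(move || process_chunk(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    });
    let mut merged: HashMap<&str, Stats> = HashMap::new();
    for partial in partials {
        for (name, stats) in partial {
            merged
                .entry(name)
                .and_modify(|m| m.merge(&stats))
                .or_insert(stats);
        }
    }
    merged
}

fn main() {
    let data = b"Hamburg;12.0\nBulawayo;8.9\nPalembang;38.8\nHamburg;-3.5\nBulawayo;10.1\n";
    let result = aggregate(data, 4);
    let mut names: Vec<&str> = result.keys().copied().collect();
    names.sort_unstable();
    for name in names {
        let s = result[name];
        println!("{name}={:.1}/{:.1}/{:.1}", s.min, s.sum / s.count as f64, s.max);
    }
}
```

In a real run the byte slice would come from the memory-mapped file and `threads` from `std::thread::available_parallelism()`.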

PurpleMyst · 2024-01-03T21:06:45Z

PurpleMyst
Jan 3, 2024

I had the same idea! Runs in 127.3ms on a 13600k, however I am utilizing both rayon and fxhash.

https://github.com/PurpleMyst/1brc.rs/blob/main/src/main.rs

4 replies

tumdum Jan 3, 2024
Author

Interesting, for me it takes ~50s to run just read_to_string.

PurpleMyst Jan 3, 2024

Looking more closely at my measurements.txt, it seems to be cut off. That's strange.

PurpleMyst Jan 3, 2024

I'm probably just going to implement a criterion benchmark that measures throughput, as it seems I've not got the space for 1B rows.

koyeung Feb 1, 2024

not sure if it is feasible to run on CI, with iai?

name: benchmark

on:
  push:
    branches:
      - main
  pull_request:

permissions:
  contents: read

env:
  CARGO_TERM_COLOR: always

  RUSTFLAGS: -Dwarnings

  # setup sccache for Rust; see https://github.com/Mozilla-Actions/sccache-action
  SCCACHE_GHA_ENABLED: "true"
  RUSTC_WRAPPER: "sccache"

jobs:

  iai:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - uses: mozilla-actions/sccache-action@<version>
    - uses: dtolnay/rust-toolchain@stable

    - run: sudo apt-get -y install valgrind

    - run: cargo bench --bench benchmark_by_cachegrind

Butch78 · 2024-01-04T10:21:38Z

Butch78
Jan 4, 2024

@coriolinus
Solution:
https://github.com/coriolinus/1brc
Implemented a full application using only the standard library that runs in around 17 seconds on my laptop (i7-1185G7 @ 3.00GHz, 16GB of RAM).

0 replies

mtb0x1 · 2024-01-04T14:56:43Z

mtb0x1
Jan 4, 2024

@Lucretiel's version, slightly modified, runs in under 10s (https://github.com/mtb0x1/1brc). I am pretty sure we can speed it up further by using a custom hasher with FxHashMap that makes assumptions about the keys (city names): instead of comparing every byte, check the first two and the last two, or something similar.

0 replies

thebracket · 2024-01-04T18:41:49Z

thebracket
Jan 4, 2024

I gave it a go, too. https://github.com/thebracket/one_billion_rows
I'm getting about 3.2 seconds on my workstation (Intel i7, 17 cores, 32gb RAM, 64-bit Linux). It's quite likely that I've made a mistake somewhere, but that was fun. :-)

My biggest speedups came from using memmap2, chunking slices on a newline boundary, forward-only parsing (to keep each core's cache happy) and using one ahash map per thread - returning vectors for merging. Keying the hashmap on the slice of name bytes rather than a string of some sort helped a LOT.
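To illustrate the last two points (forward-only parsing, and keying the map on the slice of name bytes), here is a small standard-library sketch. It is not taken from the linked repository; `Stats` and `scan_chunk` are hypothetical names, the standard `HashMap` stands in for ahash, and the input is assumed to be well-formed `name;value` lines.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
struct Stats {
    min: f64,
    max: f64,
    sum: f64,
    count: u64,
}

/// Scan one chunk front to back, keying the map on the raw name bytes.
/// No String is allocated per row: every key borrows from `chunk`.
fn scan_chunk(chunk: &[u8]) -> Vec<(&[u8], Stats)> {
    let mut map: HashMap<&[u8], Stats> = HashMap::new();
    let mut rest = chunk;
    while let Some(semi) = rest.iter().position(|&b| b == b';') {
        let name = &rest[..semi];
        let after = &rest[semi + 1..];
        let end = after
            .iter()
            .position(|&b| b == b'\n')
            .unwrap_or(after.len());
        let value = &after[..end];
        // Advance past the newline (or to the end of the chunk).
        rest = &after[(end + 1).min(after.len())..];

        let parsed = std::str::from_utf8(value)
            .ok()
            .and_then(|s| s.parse::<f64>().ok());
        if let Some(v) = parsed {
            map.entry(name)
                .and_modify(|s| {
                    s.min = s.min.min(v);
                    s.max = s.max.max(v);
                    s.sum += v;
                    s.count += 1;
                })
                .or_insert_with(|| Stats { min: v, max: v, sum: v, count: 1 });
        }
    }
    // Hand back a vector so the caller can merge per-thread results.
    map.into_iter().collect()
}

fn main() {
    let mut rows = scan_chunk(b"Abha;18.0\nAden;29.1\nAbha;-3.0\n");
    rows.sort_by_key(|row| row.0);
    for (name, s) in rows {
        println!(
            "{}: min {:.1}, mean {:.1}, max {:.1}",
            String::from_utf8_lossy(name),
            s.min,
            s.sum / s.count as f64,
            s.max
        );
    }
}
```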

7 replies

thebracket Jan 5, 2024

Interesting! I've been testing with the create_measurements program in the same workspace (a quick and dirty copy of the Java program. I don't have a Java dev setup to play with) and I'm getting the same results with the naieve_create_average program (close to the original Java) and my having_fun program. I'm not seeing zeroes!

With having_fun:

{Abha=-33.8/67.8/18.0, Abidjan=-21.6/76.1/26.0, Abéché=-20.4/80.6/29.4, Accra=-27.0/74.7/26.4, Addis Ababa=-36.3/66.1/16.0, Adelaide=-30.9/70.5/17.3, Aden=-23.4/78.1/29.1, Ahvaz=-27.8/75.5/25.4, Albuquerque=-36.0/62.4/14.0,

With the naieve program:

{Abha=-33.8/67.8/18.0, Abidjan=-21.6/76.1/26.0, Abéché=-20.4/80.6/29.4, Accra=-27.0/74.7/26.4, Addis Ababa=-36.3/66.1/16.0, Adelaide=-30.9/70.5/17.3, Aden=-23.4/78.1/29.1, Ahvaz=-27.8/75.5/25.4, Albuquerque=-36.0/62.4/14.0,

If I can find the reference file, I'll download it. My suspicion is that I generated the values file slightly differently from the original - and then built my parser around that mistake.

thebracket Jan 5, 2024

Found the issue!

My measurements generator is adding a space:

Singapore; 23.6
Yinchuan; 24.6
Monaco; 9.5

Whereas the original data lacks the space after the semicolon:

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8

That should be an easy fix.

thebracket Jan 5, 2024

Ok, I corrected the measurement creator and having_fun. My measurements now look sane:

Baguio;24.4
Tripoli;19.2
Omaha;13.3
Arkhangelsk;-6.9
Niamey;21.4

And I get the same results on both of my test programs. Still clocking around 3.2 seconds. (The last push also includes some stuff I did before; reworking it to run in cargo bench as well as running the binary).

thebracket Jan 5, 2024

Corrected another minor issue. I was printing out 10 when I should have printed out 10.0.

thebracket Jan 5, 2024

And it keeps getting better! Replaced the native f32 parsing with my own parser, used integer math throughout (multiplied by 10, since we only have one decimal place of data).

 Beach=-33.2/66.8/157.9, Vladivostok=-45.8/55.7/48.8, Warsaw=-41.1/59.8/85.2, Washington, D.C.=-36.2/59.9/146.0, Wau=-21.3/79.1/278.0, Wellington=-39.1/67.9/129.0, Whitehorse=-51.0/53.2/-0.9, Wichita=-35.5/65.2/139.0, Willemstad=-20.7/76.5/279.9, Winnipeg=-45.9/55.2/30.0, Wrocław=-43.8/57.7/96.0, Xi'an=-37.3/64.7/141.0, Yakutsk=-59.2/41.3/-88.0, Yangon=-18.4/76.0/275.0, Yaoundé=-26.8/74.8/238.0, Yellowknife=-52.6/45.0/-42.9, Yerevan=-39.6/69.2/124.0, Yinchuan=-50.9/59.4/90.0, Zagreb=-39.8/60.7/107.0, Zanzibar City=-27.8/76.0/260.0, Zürich=-38.6/60.5/93.1, Ürümqi=-41.3/55.4/74.0, İzmir=-32.5/72.7/179.0}
read_it                 time:   [2.7701 s 2.8382 s 2.9057 s]
                        change: [-25.155% -23.092% -21.067%] (p = 0.00 < 0.05)
                        Performance has improved.
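For readers curious what such an integer parser can look like, here is a rough sketch. It is not the parser from the repository: `parse_tenths` and `format_tenths` are hypothetical names, and the code assumes the 1BRC value format (optional minus sign, digits, one decimal place).

```rust
/// Parse a temperature such as b"-12.3" into tenths of a degree (-123).
/// Assumes an optional '-', then digits with exactly one decimal place.
fn parse_tenths(bytes: &[u8]) -> i32 {
    let negative = bytes.first() == Some(&b'-');
    let digits = if negative { &bytes[1..] } else { bytes };
    let mut value = 0i32;
    for &b in digits {
        // Skipping the '.' is what multiplies the value by 10.
        if b.is_ascii_digit() {
            value = value * 10 + (b - b'0') as i32;
        }
    }
    if negative {
        -value
    } else {
        value
    }
}

/// Turn tenths of a degree back into text with one decimal place.
fn format_tenths(v: i32) -> String {
    let sign = if v < 0 { "-" } else { "" };
    let abs = v.abs();
    format!("{sign}{}.{}", abs / 10, abs % 10)
}

fn main() {
    let rows: [&[u8]; 4] = [b"24.4", b"19.2", b"-6.9", b"10.0"];
    let min = rows.iter().map(|r| parse_tenths(r)).min().unwrap();
    let max = rows.iter().map(|r| parse_tenths(r)).max().unwrap();
    println!("min {} max {}", format_tenths(min), format_tenths(max));
}
```

Keeping min, max and sum as integers means floating point is only needed, if at all, when the mean is printed.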

thebracket · 2024-01-05T00:50:30Z

thebracket
Jan 5, 2024

Not yet (I actually generated my own input file) - that'll be tomorrow, ran out of time for today.

…

On Thu, Jan 4, 2024, 6:48 PM Corlin Palmer wrote:
Did you check to see that your output was identical to the output of the reference on the same file? If so, you might have the fastest solution around!

6 replies

dannyvankooten Jan 5, 2024

Nice work! Would you be so kind to measure the runtime of my C implementation here: https://github.com/dannyvankooten/1brc? Your solution runs in about 9s on my machine so I’m lacking some hardware. :(

thebracket Jan 5, 2024

Here you go:

real    0m5.889s
user    0m28.202s
sys     0m2.897s

thebracket Jan 5, 2024

Running it a few more times (to ensure it's in the disk cache):

herbert@bertix23:~/Rust/tmp/1brc$ time bin/analyze measurements.txt > /dev/null

real    0m1.909s
user    0m27.498s
sys     0m0.722s

dannyvankooten Jan 5, 2024

Awesome, thank you! And wow, that cache difference is huge on your machine. ~~For some reason I’m not seeing any improvements in consecutive runs on mine.~~

EDIT: Scratch that. It's just that my cache was pretty much always full because of running it all the time. Calling echo 3 > /proc/sys/vm/drop_caches and then running it on a cold cache yields a runtime of ~5s for me as well. With a warm cache, I'm getting sub-2 seconds now too!

joaoeinride Jan 15, 2024

You can run your solution with https://crates.io/crates/hyperfine to get a consistent result

k0nserv · 2024-01-07T22:03:49Z

k0nserv
Jan 7, 2024

~3.5s on M1 Max MacBook Pro
~6.437s on Hetzner CX33

Strategy

mmap the input file.

Treat the first eight bytes of the name as a hash key into a fixed size 10k item hash table. Resolve collisions by finding the next unoccupied slot. Parse temperatures as u16 with -99.9 as 0 and then each increment representing 0.1.

Split input based on core count and run it with a thread per core.

https://github.com/k0nserv/brc
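The table and temperature encoding described in the strategy could look roughly like this. It is a sketch under assumptions, not the code from the repository: the slot count, the mixing constant and all names (`Slot`, `Table`, `key8`, `encode`, `decode`) are made up, and each slot also stores the full name so that two stations sharing their first eight bytes are kept apart.

```rust
const SLOTS: usize = 10_000; // assumption: more slots than distinct stations

#[derive(Clone, Copy)]
struct Slot<'a> {
    name: &'a [u8],
    min: u16,
    max: u16,
    sum: u64,
    count: u32, // 0 marks an unoccupied slot
}

/// Pack up to the first eight bytes of the name into a u64.
fn key8(name: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    let n = name.len().min(8);
    buf[..n].copy_from_slice(&name[..n]);
    u64::from_le_bytes(buf)
}

/// Spread the key over the table; the multiplier is an arbitrary odd constant.
fn slot_index(name: &[u8]) -> usize {
    (key8(name).wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 40) as usize % SLOTS
}

/// Encode tenths of a degree so that -99.9 becomes 0 and each step is 0.1.
fn encode(tenths: i32) -> u16 {
    (tenths + 999) as u16
}

fn decode(code: u16) -> i32 {
    code as i32 - 999
}

struct Table<'a> {
    slots: Vec<Slot<'a>>,
}

impl<'a> Table<'a> {
    fn new() -> Self {
        let empty = Slot { name: &[], min: 0, max: 0, sum: 0, count: 0 };
        Table { slots: vec![empty; SLOTS] }
    }

    fn add(&mut self, name: &'a [u8], temp: u16) {
        let mut i = slot_index(name);
        loop {
            let slot = &mut self.slots[i];
            if slot.count == 0 {
                *slot = Slot { name, min: temp, max: temp, sum: temp as u64, count: 1 };
                return;
            }
            if slot.name == name {
                slot.min = slot.min.min(temp);
                slot.max = slot.max.max(temp);
                slot.sum += temp as u64;
                slot.count += 1;
                return;
            }
            // Occupied by another station: try the next slot.
            i = (i + 1) % SLOTS;
        }
    }

    fn get(&self, name: &[u8]) -> Option<&Slot<'a>> {
        let mut i = slot_index(name);
        loop {
            let slot = &self.slots[i];
            if slot.count == 0 {
                return None;
            }
            if slot.name == name {
                return Some(slot);
            }
            i = (i + 1) % SLOTS;
        }
    }
}

fn main() {
    let mut table = Table::new();
    table.add(b"Hamburg", encode(120));
    table.add(b"Hamburg", encode(-35));
    let s = table.get(b"Hamburg").unwrap();
    let mean_tenths = s.sum as f64 / s.count as f64 - 999.0;
    println!("min {} max {} mean {:.1}", decode(s.min), decode(s.max), mean_tenths / 10.0);
}
```

The probing loops assume the table never fills up; with a bounded set of station names that holds as long as `SLOTS` is larger than the number of stations.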

2 replies

k0nserv Jan 7, 2024

Reduced it to 3.24s on the Macbook by switching from MAP_PRIVATE to MAP_SHARED

k0nserv Jan 8, 2024

With some SIMD it's now ~2.5s on the Macbook

gabrieledarrigo · 2024-01-11T20:54:09Z

gabrieledarrigo
Jan 11, 2024

Hi guys,
My implementation:

https://github.com/gabrieledarrigo/1brc

I'm learning the Rust basics, so my implementation is very naive (and quite OO) and uses a single thread.

~200s on an i7-3930K @ 3.20GHz, with 64 GB of DDR3 on Windows, running the binary on WSL.

$ time 

target/release/one-billion-row-count  200.48s user 5.96s system 99% cpu 3:26.44 total

If a kind soul would like to review my code I'll be grateful forever!

4 replies

joaoeinride Jan 15, 2024

Looks good! I didn't know about split_once. What about not using a BufReader since it creates a new string per call to lines()?

k0nserv Jan 15, 2024

I had a quick look and there are two things that stand out to me:

Needless allocation

You always allocate a String for the name of the station. This can be avoided and would likely improve performance quite a bit. Return Option<(&str, f32)> from parse_line and don't use the entry API in Station::add_measurement, because it forces premature allocation of the key. Instead, use HashMap::get_mut and insert only if the key is missing.
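A sketch of what this suggestion could look like. The `Station` fields and the free function `add_measurement` are assumptions made for the example; the reviewed code may be structured differently.

```rust
use std::collections::HashMap;

#[derive(Debug)]
struct Station {
    min: f32,
    max: f32,
    sum: f32,
    count: u32,
}

impl Station {
    fn new(v: f32) -> Self {
        Station { min: v, max: v, sum: v, count: 1 }
    }

    fn add(&mut self, v: f32) {
        self.min = self.min.min(v);
        self.max = self.max.max(v);
        self.sum += v;
        self.count += 1;
    }
}

/// Borrow the station name from the line instead of allocating a String.
fn parse_line(line: &str) -> Option<(&str, f32)> {
    let (name, value) = line.split_once(';')?;
    Some((name, value.parse().ok()?))
}

fn add_measurement(stations: &mut HashMap<String, Station>, name: &str, value: f32) {
    if let Some(station) = stations.get_mut(name) {
        // Common case: the station is already known, nothing is allocated.
        station.add(value);
    } else {
        // Rare case: allocate the owned key once per distinct station.
        stations.insert(name.to_owned(), Station::new(value));
    }
}

fn main() {
    let mut stations = HashMap::new();
    for line in ["Hamburg;12.0", "Hamburg;-3.5", "Omaha;13.3"] {
        if let Some((name, value)) = parse_line(line) {
            add_measurement(&mut stations, name, value);
        }
    }
    for (name, s) in &stations {
        println!("{name}: {:.1}/{:.1}/{:.1}", s.min, s.sum / s.count as f32, s.max);
    }
}
```

With a billion rows and only a few hundred distinct stations, the `else` branch runs a few hundred times in total.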

Error handling

The error handling is a bit unidiomatic. Read the error handling chapter in the Rust Book and refactor based on that. Consider using anyhow to simplify error handling.
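As a rough illustration of the style the Rust Book chapter describes, here is a standard-library sketch that propagates errors with `?`. The `ParseError` type and the `parse_line` and `count_rows` functions are hypothetical. With anyhow, `Box<dyn Error>` would typically be replaced by `anyhow::Result`, which is not shown here so the example stays dependency-free.

```rust
use std::error::Error;
use std::fmt;

#[derive(Debug, PartialEq)]
enum ParseError {
    MissingSeparator,
    BadTemperature(String),
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParseError::MissingSeparator => write!(f, "missing ';' separator"),
            ParseError::BadTemperature(s) => write!(f, "invalid temperature {s:?}"),
        }
    }
}

impl Error for ParseError {}

fn parse_line(line: &str) -> Result<(&str, f32), ParseError> {
    let (name, value) = line
        .split_once(';')
        .ok_or(ParseError::MissingSeparator)?;
    let temp = value
        .parse::<f32>()
        .map_err(|_| ParseError::BadTemperature(value.to_owned()))?;
    Ok((name, temp))
}

/// `?` converts ParseError into Box<dyn Error> automatically.
fn count_rows(input: &str) -> Result<usize, Box<dyn Error>> {
    let mut rows = 0;
    for line in input.lines() {
        parse_line(line)?;
        rows += 1;
    }
    Ok(rows)
}

fn main() -> Result<(), Box<dyn Error>> {
    let rows = count_rows("Hamburg;12.0\nOmaha;13.3\n")?;
    println!("parsed {rows} rows");
    Ok(())
}
```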

gabrieledarrigo Jan 17, 2024

@joaoeinride OK, and what can I use instead?

@k0nserv Thank you for your suggestions 😊
Just not allocating a String in parse_line improved performance a little, and a cold run on WSL Ubuntu was a little faster:

target/release/one-billion-row-count  185.21s user 4.75s system 94% cpu 3:20.19 total

Thank you so much guys, very appreciated 🙏 ❤️

lnicola Jan 17, 2024

joaoeinride OK, and what can I use instead?

You can use https://doc.rust-lang.org/std/io/trait.BufRead.html#method.read_line, but make sure to .clear() the string after each line, since read_line won't do it.
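A small sketch of that pattern, with one String reused for every line. The function name `sum_temperatures` is made up, and the example reads from an in-memory byte slice; in practice a `BufReader` wrapping the opened file would be passed in instead.

```rust
use std::io::{self, BufRead};

/// Returns (row count, sum of temperatures), reusing one String for every line.
fn sum_temperatures<R: BufRead>(mut reader: R) -> io::Result<(u64, f64)> {
    let mut line = String::new();
    let mut rows = 0u64;
    let mut sum = 0.0f64;
    loop {
        // read_line appends, so empty the buffer first; its capacity is kept.
        line.clear();
        if reader.read_line(&mut line)? == 0 {
            break; // end of input
        }
        if let Some((_name, value)) = line.trim_end().split_once(';') {
            if let Ok(v) = value.parse::<f64>() {
                rows += 1;
                sum += v;
            }
        }
    }
    Ok((rows, sum))
}

fn main() -> io::Result<()> {
    let input = "Hamburg;12.0\nBulawayo;8.9\nPalembang;38.8\n";
    let (rows, sum) = sum_temperatures(input.as_bytes())?;
    println!("{rows} rows, mean {:.1}", sum / rows as f64);
    Ok(())
}
```

Clearing at the top of the loop has the same effect as clearing after each line, and it also covers the first iteration.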

Namit-Nayan · 2024-01-27T14:43:48Z

Namit-Nayan
Jan 27, 2024

Hi, my attempt at the problem https://github.com/Namit-Nayan/one_billion_row
Runs in under a minute. Learnt many things along the way. Started single-threaded with BufReader, then moved on to shared-state and message-passing concurrency.

0 replies

arthurlm · 2024-01-29T15:08:10Z

arthurlm
Jan 29, 2024

Hi,

Here you can find my attempt: https://github.com/arthurlm/one-brc-rs
When the input file is cached in RAM, it runs in under 800ms on my workstation 😁.

I have explained my code in detail, and how I achieved this.
Details on how I ran the benchmark are also in the README file of my repository.

I am also very curious about this program's performance on other computers.
So if anyone can run it and share the stats, that would be really nice 😀.

4 replies

DalenW Jan 29, 2024

I tried on my 32GB M2 Max, got a segmentation fault... But I'm in class so I haven't looked into it, but I'm willing to take a closer look if you're curious.

arthurlm Jan 30, 2024

It is true that I used a lot of unsafe code and mostly targeted the x86_64 architecture, but I did not expect a segfault 🤔.

Here are a few things you may try:

Run the program with the debug profile, so that all the debug_assert! checks are enabled again.
Disable CPU-specific optimizations.
Increase the per-thread stack size. I reduced it, but maybe too much.

I am really interested in your feedback, so feel free to ask more questions if needed.

DalenW Feb 4, 2024

I opened up an issue on the repo with a screenshot of the error.

GeistInDerSH Mar 29, 2024

Running on a Ryzen 7 5800X 8-Core & 32GB of RAM, I got:

# Cold Cache (likely due to the SATA drive)
Inside main total duration: 26.360960134s
Elapsed: 0:26.74

# Warm Cache
Inside main total duration: 1.644969495s
Elapsed: 0:01.98

GeistInDerSH · 2024-03-29T03:22:26Z

GeistInDerSH
Mar 29, 2024

Certainly late to the party, but here is my attempt nonetheless: https://github.com/GeistInDerSH/1brc
Overall a fun challenge, and a good excuse to play with parts of Rust I hadn't used yet :)

8 replies

GeistInDerSH Apr 7, 2024

I figured there may be some collision, but adding the following to the and_modify method never printed anything:

// ...
                .entry(hash) // The source code for this function does the collision handling
                .and_modify(|data: &mut Data<'a>| {
                    if data.name != station { // Added for debugging
                        let data_name = std::str::from_utf8(data.name).unwrap();
                        let station = std::str::from_utf8(station).unwrap();
                        eprintln!("Names do not match!!! {} vs {}", data_name, station);
                    }
                    data.add_value(value)
                })
// ...
// Same was true for the other and_modify, skipping for brevity

Diffing some of the other results with my own shows that mine is indeed incorrect, but only for a single key. If the collision was an error, I'd think it would impact more than just that key.
It looks like the actual error was in the next_newline function, which would return the end of the mmap region. However, because there is a newline at the end of the file, this would get included when parsing the temperature value, i.e. amN;9.5 vs amN;9.5\n. Simply subtracting one from the file length seems to fix that.

tivrfoa Apr 7, 2024

Your last change fixed the error that I posted.

The point is that even if there's no collision, the code still needs to check if the station names are the same, and if not get another key.
All Java solutions had to do that.
This thread explains more: #186

How does this check that you added for debugging impact your time?
Still 0:01.61 for the 10k?

tivrfoa Apr 7, 2024

And if the code has to consider a possible collision, then the code below would not work anymore, because two local maps (local_values) could have different keys for the same station:

    // We have to send any data we have that has not already been sent
    if let Ok(mut shared_map) = entries.lock() {
        for (station, data) in local_values.into_iter() {
            shared_map
                .entry(station)
                .and_modify(|map_data| map_data.add_data(&data))
                .or_insert(data);
        }
    }

GeistInDerSH Apr 12, 2024

I've updated all HashMaps to use the station name as the key, and the run times as well

univta0001 Apr 15, 2024

The code can be further improved by removing the unnecessary construction in or_insert(Data::new(value)). Arguments are evaluated eagerly, so Data::new(value) runs on every call, whether the entry is found or not.

            local_values
                .entry(station)
                .and_modify(|data: &mut Data| data.add_value(value))
                .or_insert(Data::new(value));

To fix it, the code should be

            local_values
                .entry(station)
                .and_modify(|data: &mut Data| data.add_value(value))
                .or_insert_with(|| Data::new(value));
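The difference can be made visible by counting constructions. The sketch below uses a hypothetical `Data` type and a counter that exist only for this demonstration.

```rust
use std::cell::Cell;
use std::collections::HashMap;

struct Data {
    sum: i64,
    count: u64,
}

/// Feed `values` into one map entry and report how many `Data` values were
/// constructed along the way, plus the final sum and count.
fn run(values: &[i64], lazy: bool) -> (usize, i64, u64) {
    let constructed = Cell::new(0usize);
    let new_data = |v: i64| {
        constructed.set(constructed.get() + 1);
        Data { sum: v, count: 1 }
    };

    let mut map: HashMap<&str, Data> = HashMap::new();
    for &v in values {
        let entry = map.entry("Hamburg").and_modify(|d| {
            d.sum += v;
            d.count += 1;
        });
        if lazy {
            entry.or_insert_with(|| new_data(v)); // closure runs only on a miss
        } else {
            entry.or_insert(new_data(v)); // argument is built on every call
        }
    }
    let (sum, count) = map.get("Hamburg").map_or((0, 0), |d| (d.sum, d.count));
    (constructed.get(), sum, count)
}

fn main() {
    let values = [1, 2, 3];
    println!("or_insert:      {:?}", run(&values, false));
    println!("or_insert_with: {:?}", run(&values, true));
}
```

Both variants produce the same sum and count; only the number of constructed values differs.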

ANKerD · 2024-04-12T03:53:34Z

ANKerD
Apr 12, 2024

Never wrote any rust code in my life before this challenge.

I broke the rules about not using external dependencies on this but anyway here's my code https://github.com/ANKerD/1brc

It runs in about 15s on a MacBook Air M1 8GB. I used an external heap allocator (mimalloc), FxHashMap for faster HashMap access, itertools to iterate over the sorted keys, and Profile-Guided Optimization for roughly 10~20% faster execution times.

0 replies

ayebear · 2024-04-25T17:25:12Z

ayebear
Apr 25, 2024

A concise solution using rayon:
https://github.com/ayebear/ayebear-1brc/blob/main/src/main.rs

Runs in 27s on ryzen 7 3800x. No optimizations were done other than adding par_lines from rayon. Shows how simple and high-level Rust code can be while still utilizing available hardware.

The one-liner in main is just:

fs::read_to_string("measurements.txt")?
    .par_lines()
    .flat_map(parse_line)
    .fold(Stations::default, Stations::insert_line)
    .reduce(Stations::default, Stations::merge)
    .print();

1 reply

ayebear Jul 21, 2024

Did some optimizations and made a new version without rayon, still quite simple IMHO, now down to ~1.22s on ryzen 9 5950x:
https://github.com/ayebear/1brc

Memory mapping, ditching rayon for a custom thread scheduler with chunking, using hashbrown HashMap, custom float parser, some simd, and removing unnecessary allocations helped make this possible.

SametHaymana · 2024-05-29T15:57:04Z

SametHaymana
May 29, 2024

~50s on Ryzen 7 3700U 8 core 2.3 GHz

Using Mmap and a thread pool.

Implementation:
1brc impl with Rust on Github - Samet Haymana

I am open to your support and development ideas, please comment.

2 replies

GeistInDerSH May 30, 2024

Hi!
I think there may be an issue in the code, because the final printout is quite different from other submissions' results. It doesn't seem like the whole file is being read.

    let thread_count = std::thread::available_parallelism().unwrap().into(); // Note: Likely 8 on a R7 3700U
    // ...
    for i in 0..thread_count { // Note: Only reading the first 8 1048576 byte chunks of the file
        let mmap_arc_clone = mmap_arc.clone(); // Clone the Arc, not the Mmap
        let results_clone = results.clone();
        let start = i * CHUNK_SIZE;
        let mut end = start + CHUNK_SIZE;
        if end > file_len {
            end = file_len;
        }
        // ...
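One possible way to make sure every chunk is visited is to derive the chunk list from the file length rather than from the thread count, and then hand chunks out to workers. This is only a sketch with made-up helper names (`chunk_ranges`, `ranges_for_worker`); it ignores the additional step of moving chunk boundaries onto newlines.

```rust
/// Half-open byte ranges of at most `chunk_size` bytes covering `0..file_len`.
fn chunk_ranges(file_len: usize, chunk_size: usize) -> Vec<(usize, usize)> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    (0..file_len)
        .step_by(chunk_size)
        .map(|start| (start, (start + chunk_size).min(file_len)))
        .collect()
}

/// Ranges handled by worker `i` out of `workers`: every `workers`-th chunk.
fn ranges_for_worker(
    ranges: &[(usize, usize)],
    i: usize,
    workers: usize,
) -> Vec<(usize, usize)> {
    ranges.iter().copied().skip(i).step_by(workers).collect()
}

fn main() {
    const CHUNK_SIZE: usize = 1_048_576;
    let file_len = 10 * CHUNK_SIZE + 5; // stand-in for the real file length
    let workers = 8;

    let ranges = chunk_ranges(file_len, CHUNK_SIZE);
    let covered: usize = (0..workers)
        .flat_map(|i| ranges_for_worker(&ranges, i, workers))
        .map(|(start, end)| end - start)
        .sum();
    println!("{} chunks, {covered} of {file_len} bytes covered", ranges.len());
}
```

With eight workers and eleven chunks, workers 0 to 2 each get two chunks, so the part of the file past the first 8 MiB is no longer skipped.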

SametHaymana May 30, 2024

Omg, how did I make this mistake :( I was incredibly surprised that it ran so fast; I should have understood.

I have now fixed the issue, and it runs in under 1 minute (about 50s) on my machine.

Thanks for the response

smabbasht · 2024-06-12T12:24:00Z

smabbasht
Jun 12, 2024

Could someone be kind enough to review my code? I am using only the std library. It currently takes 27s on a 13th Gen Intel(R) Core(TM) i7-13700, of which 24s are spent just reading the file, which indicates that I am definitely doing something wrong. However, I am not sure how to improve it using only the std library. I build partial hashmaps local to each thread so that no thread holds the lock on the shared hashmap longer than it should. This gave me significant speedups: the computation part takes just 3s on this machine, which has 16 cores.

0 replies

SuperioOne · 2024-07-01T02:17:18Z

SuperioOne
Jul 1, 2024

Update: Optimized float parsing and unchecked UTF-8 str usage a little more; it now completes the task in 0.95107 seconds on the same hardware.

I'm able to parse measurements.txt correctly in 1.37268 seconds on a Ryzen 5950x. I think it can achieve sub-second timings on newer x86-64 hardware 🤔

I used libc, cityhash64 and Rust std. Feel free to test/roast my code: https://github.com/SuperioOne/1brc

0 replies

sehnryr · 2024-07-21T13:19:42Z

sehnryr
Jul 21, 2024

Here's my attempt using only the std library: https://github.com/sehnryr/1brc/ 🙌. It runs in 4.5s or so and uses only ~20MB of RAM :D.

Not many people seem to mention this: my SSD, which the manufacturer rates at 3,500MB/s sequential read, would theoretically let me read the 14 GB file in ~4s (14,000MB / 3,500MB/s), so attaining sub-1s is kind of a lost hope for now :')

2 replies

sehnryr Sep 29, 2024

Now runs in 2.3s 🚀, without any dependencies or unsafe calls 🙌. I've improved my code (hash function and table) and mapped the file in memory after moving it to a tmpfs partition to bypass the SSD speed limitations.

sehnryr Oct 6, 2024

Down to 1.08s after using SIMD (through the experimental portable_simd feature) for finding the indexes of bytes, and removing the limit on threads (I had it at 8 threads instead of the 16 my CPU has)

Cobinja · 2024-07-27T13:46:15Z

Cobinja
Jul 27, 2024

I know I'm late to the game but here is my rusty solution: https://github.com/Cobinja/onebrc
It uses no 3rd party crates and solves it in ~10.1 seconds on my Ryzen 5900x with 64 GB RAM.

NOTE: Don't run this on 16GB machines, it reads the whole 13.8 GB to RAM.

2 replies

Cobinja Jul 27, 2024

Brought it down to ~8.3 seconds

Cobinja Jul 27, 2024

Down to 7.8 seconds after dropping caches, 6.0 seconds with caches

nicolube · 2024-09-12T15:47:44Z

nicolube
Sep 12, 2024

Well this is my go at it.
On my laptop: 7.868s (CPU: 12th Gen Intel i5-1245U (12) @ 4.400GHz)
And on my PC: 3.336s (AMD Ryzen 9 5900X (24) @ 3.700GHz)

https://github.com/nicolube/1BRC

1 reply

nicolube Sep 12, 2024

@sehnryr
I stole your HashTable, and now it is at 1.448s on my PC xD

Replies: 18 comments · 43 replies
