
Alternate node store #590

Merged: 61 commits into systemed:master on Dec 9, 2023

Conversation

cldellow
Contributor

I'm slowly remembering how to write C++. This is what #571 probably should have been.

This PR:

  • extracts a NodeStore interface
  • renames the old NodeStore to BinarySearchNodeStore
  • adds SortedNodeStore, a store that exploits PBFs that have the Sort.Type_then_ID feature.

This allows the efficient construction of a store that arranges nodes in a three-level hierarchy:

  • Level 1 is groups: there are 256K groups
  • Level 2 is chunks: each group has 256 chunks
  • Level 3 is nodes: each chunk has 256 nodes

This allows us to store 2^34 nodes, with a fixed overhead of only 2M -- the space required for the level 1 pointers. (And 2^35 nodes with fixed overhead of 4M, and so on.)
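For concreteness, here is the arithmetic behind those numbers (a sketch; the 8-byte size of a level-1 pointer is my assumption, not something stated in the PR):

```cpp
// Back-of-the-envelope check of the hierarchy sizing described above.
#include <cstdint>

constexpr uint64_t GROUPS           = 256 * 1024;  // level 1
constexpr uint64_t CHUNKS_PER_GROUP = 256;         // level 2
constexpr uint64_t NODES_PER_CHUNK  = 256;         // level 3

static_assert(GROUPS * CHUNKS_PER_GROUP * NODES_PER_CHUNK == (1ULL << 34),
              "256K groups x 256 chunks x 256 nodes = 2^34 addressable node IDs");
static_assert(GROUPS * 8 == 2 * 1024 * 1024,
              "fixed overhead: one 8-byte level-1 pointer per group = 2MB");
```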

Groups and chunks store their data sparsely, e.g. if a group has 7 chunks, it only uses storage for 7 chunks.

The precise storage cost varies based on the underlying PBF. Nova Scotia uses ~8.61 bytes per node, Great Britain uses ~9.13 bytes per node.

Measurements for reading Great Britain, master:

peak memory 13,890MB
real 1m39.446s
user 20m21.833s
sys 0m53.539s

This branch:

peak memory 12,900MB
real 1m33.271s
user 18m41.343s
sys 0m58.368s

This store should also perform better in the face of swapping, as lookups will have much better locality than the binary search of the previous store.

It should also be straightforward to use something like simdcomp to extend this to store delta-encoded, zigzag-encoded, bitpacked data, with each worker keeping a per-thread uncompressed cache of its recently used chunks.

SortedNodeStore was complex enough that I wanted to give it its own files vs ramming it into osm_store -- this entailed refactoring some existing files. I also moved a few things around so that the node stores could build without pulling in the boost geometry templates, as they have a significant impact on compile times. Sorry for the impact that the churn has on reviewability :(

Open questions:

1 - Can we require that everyone only uses PBFs that support Sort.Type_then_ID? If yes, we can delete BinarySearchNodeStore in favour of SortedNodeStore. If not, I'll do some work to intelligently pick which one to use based on the first PBF we read.

2 - I feel like there must be a more elegant way to do the work distribution in read_pbf. Perhaps the change for ReadPhase::Nodes should just be the default for all phases?

3 - This adds libpopcnt.h, a library to do fast population counts of bitvectors. It's BSD-licensed. How would you like attribution handled? Looks like other libraries are just cited in Tilemaker's README?

I'd like to add an alternative NodeStore that can be used when the
`Type_then_ID` property is present in the PBF.

First, a small (?) refactor:

- make `NodeStore` an interface, with two concrete implementations
- extract the NodeStore related things to their own files
- this will cause some churn, as they'll depend on things that also
  need to get extracted to their own files. Short term pain, hopefully
  long term gain in faster compile times.

Changing the invocations of the functions to be virtual may have an impact
on performance. I'll need to revisit that before committing to virtual
methods.
Currently, when a worker needs work, it gets the next unprocessed block.
This means blocks are read sequentially at a global level, but from
the perspective of each worker, there are gaps in the blocks they see.

For nodes, we'd prefer to give each worker thread contiguous blocks
from the underlying PBF. This will enable a more efficient storage
for PBFs with the `Sort.Type_then_ID` flag.
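As a rough illustration of that idea (not tilemaker's actual read_pbf code; names here are made up), each worker could claim a contiguous batch of blocks from a shared atomic counter:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <utility>

// Hands each worker a contiguous run of block indices; blocks within a batch
// are processed by the same thread, which is what a sorted store wants.
class BlockBatcher {
public:
    BlockBatcher(size_t totalBlocks, size_t batchSize)
        : totalBlocks(totalBlocks), batchSize(batchSize) {}

    // Returns a [begin, end) range of contiguous blocks for one worker,
    // or an empty range once everything has been claimed.
    std::pair<size_t, size_t> claim() {
        size_t begin = next.fetch_add(batchSize);
        if (begin >= totalBlocks) return {totalBlocks, totalBlocks};
        return {begin, std::min(begin + batchSize, totalBlocks)};
    }

private:
    std::atomic<size_t> next{0};
    size_t totalBlocks;
    size_t batchSize;
};
```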
SortedNodeStore is useful for PBFs with the `Sort.Type_then_ID`
property, e.g. the planet and Geofabrik exports.

It stores nodes in a hierarchy:

- Level 1 is groups: there are 256K groups
- Level 2 is chunks: each group has 256 chunks
- Level 3 is nodes: each chunk has 256 nodes

This allows us to store 2^34 nodes, with a fixed overhead of
only 2M -- the space required for the level 1 pointers.

Groups and chunks store their data sparsely. If a group has 7 chunks,
it only uses storage for 7 chunks.

On Great Britain's 184M node PBF, it needs ~9.13 bytes per node.

Looking up a node can be done in fixed time:

First, get some offsets:
- Group: `nodeID / 65536`
- Chunk: `(nodeID % 65536) / 256`
- Position within chunk: `nodeID % 256`

For example, Cape Chignecto Provincial Park has ID 4855703, giving:
- Group 74
- Chunk 23
- Offset 151

Group 74's chunks may be stored sparsely. To map logical chunk 23 to its
physical location, each group has a 256-bit bitmask indicating which
chunks are present.

Counting the set bits below position 23 gives the chunk's physical index;
use that index to look up its `chunkOffset`, which gets you to the
`ChunkInfo` struct.

From there, do the same thing to get the node data.
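A minimal sketch of that lookup (illustrative only, not the PR's actual code; it assumes each group keeps a 256-bit chunk bitmask plus a dense array of offsets for the chunks that exist):

```cpp
#include <algorithm>
#include <bitset>
#include <cstdint>

// Rank: number of set bits strictly below position `upTo` in a 256-bit mask.
// libpopcnt (or a hardware popcnt) would make this faster; this version is portable.
inline uint32_t rank256(const uint64_t mask[4], uint32_t upTo) {
    uint32_t count = 0;
    for (uint32_t word = 0; word * 64 < upTo; word++) {
        uint32_t bits = std::min<uint32_t>(64, upTo - word * 64);
        uint64_t keep = bits == 64 ? ~0ULL : ((1ULL << bits) - 1);
        count += std::bitset<64>(mask[word] & keep).count();
    }
    return count;
}

struct GroupInfo {
    uint64_t chunkMask[4];         // which of the 256 logical chunks are present
    const uint16_t* chunkOffsets;  // offsets of the present chunks, densely packed
};

// Decompose a node ID, e.g. 4855703 -> group 74, chunk 23, position 151.
inline void decompose(uint64_t id, uint32_t& group, uint32_t& chunk, uint32_t& pos) {
    group = id / 65536;          // 256 chunks * 256 nodes per group
    chunk = (id % 65536) / 256;  // logical chunk within the group
    pos   = id % 256;            // node within the chunk
}

// Map a logical chunk to its physical slot by counting how many earlier chunks
// are present (callers should first check that the chunk's own bit is set).
inline uint32_t physicalChunk(const GroupInfo& g, uint32_t logicalChunk) {
    return rank256(g.chunkMask, logicalChunk);
}
```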

This design should also let us do some interesting things down the road,
like efficiently compressing each chunk using something like delta
encoding, zigzag encoding and bit packing. Then, to avoid paying a
decompression cost, we'd likely give each worker a cache of uncompressed
chunks.
I don't understand why these can't be passed as a copy in the Windows
and Mac builds. Whatever, try passing a reference.
@cldellow
Contributor Author

In particular, this code doesn't need to sort any of the mmapped memory, so I'd expect it to perform much better than the old NodeStore for datasets that don't fit in memory.

@systemed
Owner

Wow - super interesting. I'll try and benchmark this tonight on some reasonably sized extracts.

@systemed
Owner

This works really nicely for GB and Germany: it's a 6.5% and 5% (respectively) memory saving over master.

It currently segfaults for Europe using --store:

/usr/bin/time -v tilemaker --input /media/data3/planet/europe-latest.osm.pbf --output /media/data4/europe-omt.mbtiles --store /media/ssd/tmp/
[...]
Reading .shp glacier
Sorting loaded shapes
Reading .pbf /media/data3/planet/europe-latest.osm.pbf
Command terminated by signal 11 ways 0 relations 0        

(I'll try with smaller areas using --store later. For now I'm benchmarking Europe with master!)

> 1 - Can we require that everyone only uses PBFs that support Sort.Type_then_ID? If yes, we can delete BinarySearchNodeStore in favour of SortedNodeStore. If not, I'll do some work to intelligently pick which one to use based on the first PBF we read.

Experience is that people will try and throw all sorts of wild and wacky PBFs at tilemaker ;) so we should probably support both. But you should just be able to read the option flag from the PBF - PbfReader::ReadPbfFile already does that for LocationsOnWays.

> 2 - I feel like there must be a more elegant way to do the work distribution in read_pbf. Perhaps the change for ReadPhase::Nodes should just be the default for all phases?

Worth trying?

> 3 - This adds libpopcnt.h, a library to do fast population counts of bitvectors. It's BSD-licensed. How would you like attribution handled? Looks like other libraries are just cited in Tilemaker's README?

Yes - that should be fine!

@cldellow
Contributor Author

Oops, I'm embarrassed to admit I didn't even test --store, I assumed it would work for free by virtue of the mmap allocators. :(

Thank you for noticing! Will have a look at investigating + fixing that.

I think nested containers may not be wired up quite correctly.
Instead, manage the char* buffers directly, rather than as
`std::vector<char>`

I'll fix up the other aspects (attributing libpopcnt, picking
Sorted vs BinarySearch on the fly) later.
All read phases use the same striding-over-batches-of-blocks approach.

This required changing how progress is reported, as block IDs are no
longer globally monotonically increasing.

Rather than thread the state into ReadBlock, I just adopted 2 atomic
counters for the whole class -- the progress reporter already assumes
that it's the only thing dumping to stdout, so the purity of avoiding
class-global state doesn't buy us anything.
@systemed
Owner

systemed commented Nov 21, 2023

With europe-latest.osm.pbf (today's download from Geofabrik) and --store it now says:

terminate called after throwing an instance of 'std::runtime_error'
  what():  SortedNodeStore: scaledOffset too big

Let me know if there's anything that'd be helpful for me to do to debug!

D'oh, if you get a full group where each chunk is full, you need to be
able to express a value _ever so slightly_ larger than 65,536.

North America and Europe have examples of this.

Use a scale factor of 16, not 8. This'll mean some chunks have up to 15
wasted bytes, but it's not a huge deal. (And I have some thoughts on how
to claw it back.)
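Roughly the arithmetic behind the fix (a sketch; the 8-bytes-per-node and 16-bit scaled-offset assumptions are mine, inferred from the 65,536 limit and the "up to 15 wasted bytes" note, and per-chunk headers are ignored, which only makes scale 8 worse):

```cpp
#include <cstdint>

constexpr uint32_t NODES_PER_GROUP = 256 * 256;  // a full group, every chunk full
constexpr uint32_t NODE_BYTES      = 8;          // e.g. two int32s: latp, lon
constexpr uint32_t GROUP_BYTES     = NODES_PER_GROUP * NODE_BYTES;  // 524,288

static_assert(GROUP_BYTES / 8 > UINT16_MAX,
              "scale 8: the largest offset just overflows a 16-bit field");
static_assert(GROUP_BYTES / 16 <= UINT16_MAX,
              "scale 16: offsets fit, at the cost of up to 15 bytes of padding per chunk");
```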
@cldellow
Contributor Author

Oops, yes I see that also happens with the North America extract. I've pushed a fix. (This still isn't ready to merge but should be suitable for testing.)

If the user passes `--compress-nodes`, we use [streamvbyte](https://github.com/lemire/streamvbyte)
to compress chunks of nodes in memory.

The impact on read time is small:
- GB with `--compress-nodes`: 1m42s
- without: 1m35s

But the impact on memory is worthwhile, even across very different
extracts:

North America - 5.52 bytes/node vs 8.48 bytes/node
  compressed:   169482 groups, 18364343 chunks, 1757589784 nodes, needed 9706167278 bytes
  uncompressed: 169482 groups, 18364343 chunks, 1757589784 nodes, needed 14916095182 bytes

Great Britain - 5.97 bytes/node vs 9.25 bytes/node
  compressed:   163074 groups, 4871807 chunks, 184655287 nodes, needed 1104024510 bytes
  uncompressed: 163074 groups, 4871807 chunks, 184655287 nodes, needed 1708093150 bytes

Nova Scotia - 5.81 bytes/node vs 8.7 bytes/node
  compressed:   26777 groups, 157927 chunks, 12104733 nodes, needed 70337950 bytes
  uncompressed: 26777 groups, 157927 chunks, 12104733 nodes, needed 105367598 bytes

Monaco - 10.43 bytes/node vs 13.52 bytes/node
  compressed:   1196 groups, 2449 chunks, 30477 nodes, needed 318114 bytes
  uncompressed: 1196 groups, 2449 chunks, 30477 nodes, needed 412258 bytes
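For reference, a rough sketch of what compressing one chunk with streamvbyte looks like (function names and signatures follow streamvbyte's README; the buffer handling and the idea of feeding it a chunk's 32-bit values are assumptions, not necessarily the PR's exact scheme):

```cpp
#include <cstdint>
#include <vector>
#include "streamvbyte.h"

// Compress `count` 32-bit values (e.g. one chunk's packed coordinates) into `out`.
// Returns the number of compressed bytes actually used.
size_t compressChunk(const uint32_t* values, uint32_t count, std::vector<uint8_t>& out) {
    out.resize(streamvbyte_max_compressedbytes(count));  // worst-case bound
    size_t used = streamvbyte_encode(values, count, out.data());
    out.resize(used);
    return used;
}

// Decompress back into a caller-provided buffer of `count` values.
void decompressChunk(const uint8_t* compressed, uint32_t count, uint32_t* values) {
    streamvbyte_decode(compressed, values, count);
}
```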
@cldellow
Contributor Author

Ok! I think this is probably good now. The most recent commit buys a bit more memory savings - ~6 bytes/node or so for most extracts.

@cldellow
Contributor Author

cldellow commented Dec 3, 2023

This lays more groundwork for lazy geometries. Once this and refactor_geometries are merged into master, it should be possible to change OsmMemTiles to compute geometries on the fly and get a large memory savings.

Changes since my last comment:

  • extract a WayStore interface
  • provide a SortedWayStore that can be used in the same circumstances as SortedNodeStore 
    • stores node IDs compactly
      • omits final node for closed ways, instead stores a bit to track that the way is closed
      • packs bits 32..35 into a single nibble, rather than allocating an entire 4-byte integer
      • special case of the above: if all nodes in the way have the same bits 32..35, stores them only once
      • supports compressing the lower 32-bit values via streamvbyte
    • adds minunit for unit testing
      • simplistic, but useful for faster dev cycles, available via make test
      • only available via Makefile, not via CMake, so doesn't run as part of CI
  • extract mmap_allocator.cpp so that unit tests don't have to drag in osm_store.cpp

SortedWayStore with --compress-ways stores way information at a cost of between 2.3 bytes/node (North America) and 2.9 bytes/node (GB). master achieves about 8.1 bytes/node (plus overhead from std::deque that I'm too lazy to estimate).

In general, its impact is not very meaningful yet, as we currently only store those ways that are later used in relations, but it's an important step on the way to lazy geometries.
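A simplified sketch of the node-ID packing described above (illustrative, not the PR's exact layout): each ID is split into a 32-bit low word plus its bits 32..35, the high nibbles are stored only once when they all match, and the final node of a closed way is dropped in favour of a flag:

```cpp
#include <cstdint>
#include <vector>

struct PackedWay {
    bool closed = false;            // final node omitted; it equals the first
    bool uniformHighNibble = false; // all node IDs share the same bits 32..35
    uint8_t sharedNibble = 0;       // stored once when uniformHighNibble is true
    std::vector<uint8_t> nibbles;   // otherwise one nibble per node, two per byte
    std::vector<uint32_t> lowBits;  // lower 32 bits; candidates for streamvbyte
};

PackedWay pack(const std::vector<uint64_t>& nodeIDs) {
    PackedWay way;
    if (nodeIDs.empty()) return way;

    way.closed = nodeIDs.size() > 1 && nodeIDs.front() == nodeIDs.back();
    size_t count = way.closed ? nodeIDs.size() - 1 : nodeIDs.size();

    way.sharedNibble = static_cast<uint8_t>((nodeIDs[0] >> 32) & 0xF);
    way.uniformHighNibble = true;
    for (size_t i = 0; i < count; i++) {
        uint8_t nibble = static_cast<uint8_t>((nodeIDs[i] >> 32) & 0xF);
        if (nibble != way.sharedNibble) way.uniformHighNibble = false;
        if (i % 2 == 0) way.nibbles.push_back(nibble);
        else way.nibbles.back() |= nibble << 4;
        way.lowBits.push_back(static_cast<uint32_t>(nodeIDs[i]));
    }
    if (way.uniformHighNibble) way.nibbles.clear();  // one shared nibble suffices
    return way;
}
```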

Timings and memory use for GB:

  • alternate-node-store: 13,000MB, 2m48s
  • alternate-node-store with --compress-nodes --compress-ways: 12,840MB, 2m55s
    • I'm a little skeptical of my memory measurements. They show the resident memory of the process as reported by htop...but I know separately that --compress-nodes uses 600MB less memory, so it's confusing to me that the resident memory only decreased by 160MB. Ah, well.
  • master: 13,950MB, 3m1s

@systemed
Owner

systemed commented Dec 4, 2023

Looks great!

I've just merged refactor_geometries into master so we can iterate on that from now on.

@cldellow
Contributor Author

cldellow commented Dec 4, 2023

Great, I've merged master into this branch & will update my other PRs to target master.

New measurements:

master: 10,500 MB

real 2m29.892s
user 35m52.622s
sys 0m24.567s

this branch: 9,500 MB

real 2m14.724s
user 32m8.778s
sys 0m22.840s

this branch, --compress-nodes --compress-ways: 9,400 MB

real 2m17.028s
user 32m43.232s
sys 0m22.265s

@cldellow
Contributor Author

cldellow commented Dec 5, 2023

It looks like I'm now able to process North America on my 32GB laptop, which is great. Previously, even Canada was too much, and would cause me to OOM. I say "looks like", because I aborted at 66% done, after 188 minutes.

With --compress-nodes --compress-ways, it needed ~12GB of RAM. Presumably that's mainly the output objects + attributes, which can't spill to disk yet.

During the reading phase, I mostly kept my 16 cores pegged - there was the odd stutter due to running out of I/O.

During the writing phase, my cores were always pegged. Disk reads were mostly in the 5-25MB/sec range, with very occasional spikes to 300-600MB/sec.

The new z6 progress indicator did draw my eye to something fishy... I'll write up my observations in an issue before I forget.

@systemed
Owner

systemed commented Dec 5, 2023

I'll run a few benchmarks for Europe over the next couple of days. I have actually managed to run Europe in-memory with this branch but that peaked at 128GB usage which is probably a bit niche!

@systemed
Owner

systemed commented Dec 6, 2023

As per #595 we're down below 144GB RAM usage for Europe, so whatever I specify it'll be in-memory anyway.

  • With --compress-nodes --compress-ways it's 3h11m48, 127.3GB RAM used.
  • Without, it's 3hr10m23, 136GB.

The difference in execution time is small enough, and the memory saving significant enough, that I wonder if we should just default to --compress-nodes --compress-ways.

The generated map looks good from a quick scout around and free of geometry issues.

We're 20 minutes slower compared to when #595 was merged (2hr50). That's not the end of the world but I wonder what's causing it. I tried removing the backtracking from #602 but that didn't make any difference.

@cldellow
Contributor Author

cldellow commented Dec 6, 2023

> I wonder if we should just default to --compress-nodes --compress-ways.

I can make that change and instead offer --no-compress-nodes and --no-compress-ways flags. I think the streamvbyte authors expect that the code should work on any CPU from 2010 or later, which is pretty broad.

> We're 20 minutes slower

Hmm, two ideas:

  1. is it possible that some of it is I/O on initial read?

It wouldn't explain the whole 20 minutes, but for me, if I haven't got the PBF entirely in my kernel buffers already, I see a substantial delay on initial startup while it reads through the PBF to find the blocks. For North America, it's ~1 min or so on my pretty fast SSD. You can generally rule this in/out by either forcing the PBF into buffers before running tilemaker (run `time cat pbf > /dev/null` until it's instantaneous) or forcing the PBF out of buffers by flushing them (on Linux, `echo 3 | sudo tee /proc/sys/vm/drop_caches`).

(Separately, I notice that for me, Tilemaker is slower at single-threaded reading of an unbuffered file than cat is - so there's probably some room to improve this.)

  2. I vaguely recall your reference system might be somewhat older -- does its CPU support an efficient popcnt instruction?

I assume the slow down is from the node store. I would normally expect this store to be faster than the old store, because the old store does a binary search. Extrapolating from North America, I assume Europe has ~3.8B nodes, so you need approx 32 searches to find any given node. This store should be able to find the node by doing 2 indirections, but each indirection needs a popcnt. As I'm typing this out, I'm skeptical, because even a naive, non-optimized popcnt should be relatively fast.

Maybe a cheap test: do you see a similar slow down with GB? For me, this store was a little faster on GB than master.

@cldellow
Contributor Author

cldellow commented Dec 6, 2023

Although, if your CPU is an X5650 (mentioned in #315), I think it supports popcnt. You can confirm with `grep popcnt /proc/cpuinfo` on Linux.

@systemed
Owner

systemed commented Dec 8, 2023

It is indeed an X5650 and it looks like it does support popcnt. I'm not 100% sure the slowdown is necessarily a result of this branch - it's possible I've got confused between branches and it was another recent commit (it's also possible we've since obsoleted it!). I'll do a bit more digging but it might not be anything to worry about.

Running this branch on Europe with #607 and #608 (plus 0dbbe6e which speeds up .shp processing, and 4bd28f1 which speeds up geometry output):

  • wall clock 2:18:46 (down from 3:11:48 in previous comment, 5:54:49 before this whole batch of changes) 🎉
  • memory 130GB

systemed merged commit 8300b0c into systemed:master on Dec 9, 2023
5 checks passed
@systemed
Owner

systemed commented Dec 9, 2023

Merged - thank you. Amazing gains.

We should probably move minunit.h into include/external/ and move sorted_way_store.test.cpp into a new test/ directory but that's not urgent.

cldellow added a commit to cldellow/tilemaker that referenced this pull request Jan 6, 2024
In systemed#590, I took the path of least resistance and compiled these as C++
files.

This was wrong, as they're meant to be compiled as C files. Depending on
your compiler, you would hit errors to do with `restrict` or
`__restrict__` references.

This reverts the streamvbyte code to be a more vanilla copy of upstream,
and updates the `Makefile` build to build them as C.

Still to do: update CMakeLists.txt.
cldellow added a commit to cldellow/tilemaker that referenced this pull request Jan 6, 2024
In systemed#590, I compiled these as C++ files.

This was wrong, as they're meant to be compiled as C files. Depending on
your compiler, you would hit errors to do with `restrict` or
`__restrict__` references.

This reverts the streamvbyte code to be a more vanilla copy of upstream,
and updates the `Makefile` and `CMakeLists.txt` files to build them as C.