Alternate node store #590
Conversation
I'd like to add an alternative NodeStore that can be used when the `Type_then_ID` property is present in the PBF. First, a small (?) refactor:

- make `NodeStore` an interface, with two concrete implementations
- extract the NodeStore-related things to their own files - this will cause some churn, as they'll depend on things that also need to get extracted to their own files. Short-term pain, hopefully long-term gain in faster compile times.

Changing the invocations of the functions to be virtual may have an impact on performance. Will need to revisit that before committing to virtual methods.
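Roughly the shape I have in mind for the interface split (illustrative only; the method names and coordinate type here are placeholders, not the actual tilemaker API):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

using NodeID = uint64_t;
using LatpLon = std::pair<int32_t, int32_t>; // placeholder; tilemaker has its own coordinate struct

// Abstract base class; the two concrete stores override these methods.
class NodeStore {
public:
	virtual ~NodeStore() = default;
	virtual void insert(NodeID id, LatpLon coords) = 0;
	virtual LatpLon at(NodeID id) const = 0;
	virtual std::size_t size() const = 0;
};

class BinarySearchNodeStore : public NodeStore { /* existing sorted-vector store */ };
class SortedNodeStore : public NodeStore { /* new store for Sort.Type_then_ID PBFs */ };
```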
Currently, when a worker needs work, it gets the next unprocessed block. This means blocks are read sequentially at a global level, but from the perspective of each worker, there are gaps in the blocks they see. For nodes, we'd prefer to give each worker thread contiguous blocks from the underlying PBF. This will enable more efficient storage for PBFs with the `Sort.Type_then_ID` flag.
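A minimal sketch of the batching idea, with hypothetical names (the real reader tracks more state than this): each worker atomically claims a contiguous run of blocks rather than one block at a time.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <utility>

// Workers claim contiguous runs of blocks so that, for Sort.Type_then_ID PBFs,
// each thread sees a contiguous slice of the file within its batch.
struct BlockBatcher {
	std::atomic<std::size_t> nextBlock{0};
	std::size_t totalBlocks;
	std::size_t batchSize;

	BlockBatcher(std::size_t total, std::size_t batch) : totalBlocks(total), batchSize(batch) {}

	// Returns [first, last) block indices for this worker, or an empty range when done.
	std::pair<std::size_t, std::size_t> claim() {
		std::size_t first = nextBlock.fetch_add(batchSize);
		if (first >= totalBlocks) return {totalBlocks, totalBlocks};
		return {first, std::min(first + batchSize, totalBlocks)};
	}
};
```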
SortedNodeStore is useful for PBFs with the `Sort.Type_then_ID` property, e.g. the planet and Geofabrik exports. It stores nodes in a hierarchy:

- Level 1 is groups: there are 256K groups
- Level 2 is chunks: each group has 256 chunks
- Level 3 is nodes: each chunk has 256 nodes

This allows us to store 2^34 nodes, with a fixed overhead of only 2M -- the space required for the level 1 pointers. Groups and chunks store their data sparsely. If a group has 7 chunks, it only uses storage for 7 chunks. On Great Britain's 184M node PBF, it needs ~9.13 bytes per node.

Looking up a node can be done in fixed time. First, get some offsets:

- Group: `nodeID / 65536`
- Chunk: `(nodeID % 65536) / 256`
- Position within chunk: `nodeID % 256`

For example, Cape Chignecto Provincial Park has ID 4855703, giving:

- Group 74
- Chunk 23
- Offset 151

Group 74's chunks may be sparse. To map chunk 23 to its physical location, each group has a 256-bit bitmask indicating which chunks are present. Use its physical location to get its `chunkOffset`, which gets you to the `ChunkInfo` struct. From there, do the same thing to get the node data.

This design should also let us do some interesting things down the road, like efficiently compressing each chunk using something like delta encoding, zigzag encoding and bit packing. Then, to avoid paying a decompression cost, we'd likely give each worker a cache of uncompressed chunks.
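To make the fixed-time lookup concrete, a rough sketch (not the actual code; the bitmask rank would use libpopcnt or `__builtin_popcountll` rather than this naive loop):

```cpp
#include <cstdint>

// Illustrative only: compute the three indices for a node ID.
struct NodeAddress { uint32_t group, chunk, offset; };

inline NodeAddress addressOf(uint64_t nodeID) {
	NodeAddress a;
	a.group  = nodeID / 65536;          // 256 chunks * 256 nodes per group
	a.chunk  = (nodeID % 65536) / 256;  // chunk within the group
	a.offset = nodeID % 256;            // node within the chunk
	return a;  // e.g. nodeID 4855703 -> group 74, chunk 23, offset 151
}

// Map a logical chunk index to its physical slot using the group's 256-bit
// presence bitmask: count the set bits below `chunk`.
inline uint32_t physicalChunk(const uint64_t mask[4], uint32_t chunk) {
	uint32_t rank = 0;
	for (uint32_t bit = 0; bit < chunk; bit++)
		rank += (mask[bit / 64] >> (bit % 64)) & 1;
	return rank;
}
```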
I don't understand why these can't be passed as a copy in the Windows and Mac builds. Whatever, try passing a reference.
In particular, this code doesn't need to sort any of the mmapped memory, so I'd expect it to perform much better than the old store.
Wow - super interesting. I'll try and benchmark this tonight on some reasonably sized extracts.
This works really nicely for GB and Germany: it's a 6.5% and 5% (respectively) memory saving over master. It currently segfaults for Europe using `--store`:
(I'll try with smaller areas using
Experience is that people will try and throw all sorts of wild and wacky PBFs at tilemaker ;) so we should probably support both. But you should just be able to read the option flag from the PBF -
Worth trying?
Yes - that should be fine!
Oops, I'm embarrassed to admit I didn't even test that. Thank you for noticing! Will have a look at investigating + fixing it.
I think nested containers may not be wired up quite correctly. Instead, manage the char* buffers directly, rather than as `std::vector<char>`. I'll fix up the other aspects (attributing libpopcnt, picking Sorted vs BinarySearch on the fly) later.
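A minimal sketch of direct buffer ownership (hypothetical, and only an illustration of the general idea, not the actual change):

```cpp
#include <cstddef>
#include <cstdlib>

// Own a raw byte buffer directly instead of std::vector<char>, so the chunk
// data isn't entangled with a nested container's allocator.
struct RawBuffer {
	char* data = nullptr;
	std::size_t size = 0;

	explicit RawBuffer(std::size_t n) : data(static_cast<char*>(std::malloc(n))), size(n) {}
	~RawBuffer() { std::free(data); }

	RawBuffer(const RawBuffer&) = delete;
	RawBuffer& operator=(const RawBuffer&) = delete;
	RawBuffer(RawBuffer&& o) noexcept : data(o.data), size(o.size) { o.data = nullptr; o.size = 0; }
};
```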
All read phases use the same striding-over-batches-of-blocks approach. This required changing how progress is reported, as block IDs are no longer globally monotonically increasing. Rather than thread the state into ReadBlock, I just adopted two atomic counters for the whole class -- the progress reporter already assumes that it's the only thing dumping to stdout, so the purity of avoiding class-global state doesn't buy us anything.
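Roughly the shape of that accounting, with illustrative names (not the actual class):

```cpp
#include <atomic>
#include <cstdint>
#include <iostream>

// Block IDs are no longer globally monotonic, so progress is tracked with two
// class-wide atomic counters rather than the highest block ID seen so far.
class ReadProgress {
public:
	std::atomic<uint64_t> blocksProcessed{0};
	std::atomic<uint64_t> blocksToProcess{0};

	void blockFinished() {
		uint64_t done = ++blocksProcessed;
		uint64_t total = blocksToProcess.load();
		// The progress reporter is assumed to be the only writer to stdout.
		if (total != 0)
			std::cout << "\rReading blocks: " << (100 * done / total) << "%" << std::flush;
	}
};
```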
With europe-latest.osm.pbf (today's download from Geofabrik) and
Let me know if there's anything I can do to help debug!
D'oh, if you get a full group where each chunk is full, you need to be able to express a value _ever so slightly_ larger than 65,536. North America and Europe have examples of this. Use a scale factor of 16, not 8. This'll mean some chunks have up to 15 wasted bytes, but it's not a huge deal. (And I have some thoughts on how to claw it back.)
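For illustration, here's a hypothetical scaled 16-bit offset field of the kind described (the real layout may differ):

```cpp
#include <cassert>
#include <cstdint>

// Offsets within a group are stored in 16 bits, scaled. With a scale of 8 the
// largest representable value is 65535 * 8, which a completely full group can
// exceed; a scale of 16 doubles the range, at the cost of padding each chunk
// to a 16-byte boundary (up to 15 wasted bytes).
constexpr uint32_t SCALE = 16;

inline uint16_t encodeOffset(uint32_t byteOffset) {
	assert(byteOffset % SCALE == 0);          // chunks are aligned to SCALE
	assert(byteOffset / SCALE <= UINT16_MAX); // must fit in 16 bits after scaling
	return static_cast<uint16_t>(byteOffset / SCALE);
}

inline uint32_t decodeOffset(uint16_t scaled) {
	return static_cast<uint32_t>(scaled) * SCALE;
}
```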
Oops, yes I see that also happens with the North America extract. I've pushed a fix. (This still isn't ready to merge but should be suitable for testing.)
If the user passes `--compress-nodes`, we use [streamvbyte](https://github.com/lemire/streamvbyte) to compress chunks of nodes in memory.

The impact on read time is not much:

- GB with `--compress-nodes`: 1m42s
- without: 1m35s

But the impact on memory is worthwhile, even across very different extracts:

North America - 5.52 bytes/node vs 8.48 bytes/node
169482 groups, 18364343 chunks, 1757589784 nodes, needed 9706167278 bytes
169482 groups, 18364343 chunks, 1757589784 nodes, needed 14916095182 bytes

Great Britain - 5.97 bytes/node vs 9.25 bytes/node
163074 groups, 4871807 chunks, 184655287 nodes, needed 1104024510 bytes
163074 groups, 4871807 chunks, 184655287 nodes, needed 1708093150 bytes

Nova Scotia - 5.81 bytes/node vs 8.7 bytes/node
26777 groups, 157927 chunks, 12104733 nodes, needed 70337950 bytes
26777 groups, 157927 chunks, 12104733 nodes, needed 105367598 bytes

Monaco - 10.43 bytes/node vs 13.52 bytes/node
1196 groups, 2449 chunks, 30477 nodes, needed 318114 bytes
1196 groups, 2449 chunks, 30477 nodes, needed 412258 bytes
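For reference, as I understand it the core streamvbyte API is just an encode/decode pair over `uint32_t` arrays; a rough sketch of how a chunk might be round-tripped (the wrapper functions and manual buffer sizing here are mine):

```cpp
#include <cstdint>
#include <vector>
#include "streamvbyte.h"  // https://github.com/lemire/streamvbyte

// Compress a chunk of 32-bit values (e.g. transformed node coordinates).
// Worst case: 1 control byte per 4 values plus 4 data bytes per value.
std::vector<uint8_t> compressChunk(const std::vector<uint32_t>& values) {
	std::vector<uint8_t> out((values.size() + 3) / 4 + values.size() * 4);
	std::size_t used = streamvbyte_encode(values.data(),
	                                      static_cast<uint32_t>(values.size()),
	                                      out.data());
	out.resize(used);
	return out;
}

std::vector<uint32_t> decompressChunk(const std::vector<uint8_t>& bytes, std::size_t count) {
	std::vector<uint32_t> out(count);
	streamvbyte_decode(bytes.data(), out.data(), static_cast<uint32_t>(count));
	return out;
}
```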
Ok! I think this is probably good now. The most recent commit buys a bit more memory savings - ~6 bytes/node or so for most extracts.
very unclear why this is needed, but we seem to be getting C2131 on this line.
Workaround for MSVC
This lays more groundwork for lazy geometries. Once this and

Changes since my last comment:
In general, its impact is not very meaningful yet, as we currently only store those ways that are later used in relations, but it's an important step on the way to lazy geometries. Timings and memory use for GB:
Looks great! I've just merged
Great, I've merged master into this branch & will update my other PRs to target master. New measurements:

- master: 10,500 MB
- this branch: 9,500 MB
- this branch,
It looks like I'm now able to process North America on my 32GB laptop, which is great. Previously, even Canada was too much, and would cause me to OOM. I say "looks like" because I aborted at 66% done, after 188 minutes.

During the reading phase, I mostly kept my 16 cores pegged - there was the odd stutter due to running out of I/O. During the writing phase, my cores were always pegged. Disk reads were mostly in the 5-25 MB/sec range, with very occasional spikes to 300-600 MB/sec.

The new z6 progress indicator did draw my eye to something fishy... I'll write up my observations in an issue before I forget.
I'll run a few benchmarks for Europe over the next couple of days. I have actually managed to run Europe in-memory with this branch, but that peaked at 128GB usage, which is probably a bit niche!
As per #595 we're down below 144GB RAM usage for Europe, so whatever I specify it'll be in-memory anyway.
The difference in execution time is sufficiently small, and the memory saving significant enough, that I wonder if we should just default to it.

The generated map looks good from a quick scout around and is free of geometry issues. We're 20 minutes slower compared to when #595 was merged (2hr50). That's not the end of the world, but I wonder what's causing it. I tried removing the backtracking from #602 but that didn't make any difference.
I can make that change and instead offer
Hmm, two ideas:
It wouldn't explain the whole 20 minutes, but for me, if I haven't got the PBF entirely in my kernel buffers already, I see a substantial delay on initial startup while it reads through the PBF to find the blocks. For North America, it's ~1 min or so on my pretty fast SSD. You can generally rule this in/out by either forcing the PBF into buffers before running tilemaker (run

(Separately, I notice that for me, Tilemaker is slower at single-threaded reading of an unbuffered file than
I assume the slowdown is from the node store. I would normally expect this store to be faster than the old store, because the old store does a binary search. Extrapolating from North America, I assume Europe has ~3.8B nodes, so a binary search needs approximately 32 comparisons to find any given node. This store should be able to find the node by doing 2 indirections, but each indirection needs a popcnt. As I'm typing this out, I'm skeptical, because even a naive, non-optimized popcnt should be relatively fast. Maybe a cheap test: do you see a similar slowdown with GB? For me, this store was a little faster on GB than master.
Although, if your CPU is an X5650 (mentioned in #315), I think it supports popcnt. You can confirm with
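(If a programmatic check is easier: on x86 with GCC or Clang, `__builtin_cpu_supports` can report the flag. This is just an illustrative snippet, not necessarily the check referred to above.)

```cpp
#include <cstdio>

int main() {
	// GCC/Clang builtin; returns non-zero if the running CPU reports the POPCNT feature.
	std::printf("popcnt supported: %s\n", __builtin_cpu_supports("popcnt") ? "yes" : "no");
	return 0;
}
```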
It is indeed an X5650 and it looks like it does support popcnt.

Running this branch on Europe with #607 and #608 (plus 0dbbe6e, which speeds up .shp processing, and 4bd28f1, which speeds up geometry output):
Merged - thank you. Amazing gains. We should probably move
In systemed#590, I took the path of least resistance and compiled these as C++ files. This was wrong, as they're meant to be compiled as C files. Depending on your compiler, you would hit errors to do with `restrict` or `__restrict__` references. This reverts the streamvbyte code to be a more vanilla copy of upstream, and updates the `Makefile` build to build them as C. Still to do: update CMakeLists.txt.
In systemed#590, I compiled these as C++ files. This was wrong, as they're meant to be compiled as C files. Depending on your compiler, you would hit errors to do with `restrict` or `__restrict__` references. This reverts the streamvbyte code to be a more vanilla copy of upstream, and updates the `Makefile` and `CMakeLists.txt` files to build them as C.
I'm slowly remembering how to write C++. This is what #571 probably should have been.
This PR:

- makes `NodeStore` an interface
- renames the existing `NodeStore` to `BinarySearchNodeStore`
- adds `SortedNodeStore`, a store that exploits PBFs that have the `Sort.Type_then_ID` feature.

This allows the efficient construction of a store that arranges nodes in a three-level hierarchy:
This allows us to store 2^34 nodes, with a fixed overhead of only 2M -- the space required for the level 1 pointers. (And 2^35 nodes with fixed overhead of 4M, and so on.)
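Spelling out the arithmetic (assuming 8-byte level-1 pointers):

```cpp
#include <cstdint>

// 256K groups x 256 chunks x 256 nodes = 2^18 * 2^8 * 2^8 = 2^34 addressable nodes.
constexpr uint64_t GROUPS = 256 * 1024;
constexpr uint64_t CHUNKS_PER_GROUP = 256;
constexpr uint64_t NODES_PER_CHUNK = 256;
static_assert(GROUPS * CHUNKS_PER_GROUP * NODES_PER_CHUNK == (1ULL << 34), "2^34 nodes");

// The fixed overhead is the level-1 pointer array: 2^18 pointers * 8 bytes = 2 MiB.
static_assert(GROUPS * 8 == 2 * 1024 * 1024, "2 MiB of level-1 pointers");
```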
Groups and chunks store their data sparsely, e.g. if a group has 7 chunks, it only uses storage for 7 chunks.
The precise storage cost varies based on the underlying PBF. Nova Scotia uses ~8.61 bytes per node, Great Britain uses ~9.13 bytes per node.
Measurements for reading Great Britain, master:
This branch:
This store should also perform better in the face of swapping, as lookups will have much better locality than the binary search of the previous store.
It should also be straightforward to use something like simdcomp to extend this to store delta-encoded, zigzag-encoded, bitpacked data, with each worker keeping a per-thread uncompressed cache of its recently used chunks.
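To make the delta/zigzag part concrete, these are the standard transforms (not tilemaker code):

```cpp
#include <cstdint>
#include <vector>

// Zigzag maps small signed deltas to small unsigned values so that bit packing
// (or a byte-oriented codec) can store them compactly.
inline uint32_t zigzagEncode(int32_t n) {
	return (static_cast<uint32_t>(n) << 1) ^ static_cast<uint32_t>(n >> 31);
}
inline int32_t zigzagDecode(uint32_t z) {
	return static_cast<int32_t>(z >> 1) ^ -static_cast<int32_t>(z & 1);
}

// Delta-encode a slowly varying sequence (e.g. coordinates within a chunk),
// then zigzag so negative deltas stay small.
std::vector<uint32_t> deltaZigzag(const std::vector<int32_t>& values) {
	std::vector<uint32_t> out;
	out.reserve(values.size());
	int32_t prev = 0;
	for (int32_t v : values) {
		out.push_back(zigzagEncode(v - prev));
		prev = v;
	}
	return out;
}
```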
`SortedNodeStore` was complex enough that I wanted to give it its own files vs ramming it into `osm_store` -- this entailed refactoring some existing files. I also moved a few things around so that the node stores could build without pulling in the Boost geometry templates, as they have a significant impact on compile times. Sorry for the impact that the churn has on reviewability :(

Open questions:
1 - Can we require that everyone only uses PBFs that support `Sort.Type_then_ID`? If yes, we can delete `BinarySearchNodeStore` in favour of `SortedNodeStore`. If not, I'll do some work to intelligently pick which one to use based on the first PBF we read.

2 - I feel like there must be a more elegant way to do the work distribution in `read_pbf`. Perhaps the change for `ReadPhase::Nodes` should just be the default for all phases?

3 - This adds `libpopcnt.h`, a library to do fast population counts of bitvectors. It's BSD-licensed. How would you like attribution handled? Looks like other libraries are just cited in Tilemaker's README?