
Alternate node store #590

Merged: 61 commits into systemed:master on Dec 9, 2023

Conversation

cldellow
Contributor

I'm slowly remembering how to write C++. This is what #571 probably should have been.

This PR:

  • extracts a NodeStore interface
  • renames the old NodeStore to BinarySearchNodeStore
  • adds SortedNodeStore, a store that exploits PBFs that have the Sort.Type_then_ID feature.

This allows the efficient construction of a store that arranges nodes in a three-level hierarchy:

  • Level 1 is groups: there are 256K groups
  • Level 2 is chunks: each group has 256 chunks
  • Level 3 is nodes: each chunk has 256 nodes

This allows us to store 2^34 nodes, with a fixed overhead of only 2M -- the space required for the level 1 pointers. (And 2^35 nodes with fixed overhead of 4M, and so on.)
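For concreteness, here is the arithmetic behind those numbers (a sketch; the 8-byte size of a level-1 pointer is my assumption, not something stated in the PR):

```cpp
// Back-of-the-envelope check of the hierarchy sizing described above.
#include <cstdint>

constexpr uint64_t GROUPS           = 256 * 1024;  // level 1
constexpr uint64_t CHUNKS_PER_GROUP = 256;         // level 2
constexpr uint64_t NODES_PER_CHUNK  = 256;         // level 3

static_assert(GROUPS * CHUNKS_PER_GROUP * NODES_PER_CHUNK == (1ULL << 34),
              "256K groups x 256 chunks x 256 nodes = 2^34 addressable node IDs");
static_assert(GROUPS * 8 == 2 * 1024 * 1024,
              "fixed overhead: one 8-byte level-1 pointer per group = 2MB");
```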

Groups and chunks store their data sparsely, e.g. if a group has 7 chunks, it only uses storage for 7 chunks.

The precise storage cost varies based on the underlying PBF. Nova Scotia uses ~8.61 bytes per node, Great Britain uses ~9.13 bytes per node.

Measurements for reading Great Britain, master:

peak memory 13,890MB
real 1m39.446s
user 20m21.833s
sys 0m53.539s

This branch:

peak memory 12,900MB
real 1m33.271s
user 18m41.343s
sys 0m58.368s

This store should also perform better in the face of swapping, as lookups will have much better locality than the binary search of the previous store.

It should also be straightforward to use something like simdcomp to extend this to store delta-encoded, zigzag-encoded, bitpacked data, with each worker keeping a per-thread uncompressed cache of its recently used chunks.

SortedNodeStore was complex enough that I wanted to give it its own files vs ramming it into osm_store -- this entailed refactoring some existing files. I also moved a few things around so that the node stores could build without pulling in the boost geometry templates, as they have a significant impact on compile times. Sorry for the impact that the churn has on reviewability :(

Open questions:

1 - Can we require that everyone only uses PBFs that support Sort.Type_then_ID? If yes, we can delete BinarySearchNodeStore in favour of SortedNodeStore. If not, I'll do some work to intelligently pick which one to use based on the first PBF we read.

2 - I feel like there must be a more elegant way to do the work distribution in read_pbf. Perhaps the change for ReadPhase::Nodes should just be the default for all phases?

3 - This adds libpopcnt.h, a library to do fast population counts of bitvectors. It's BSD-licensed. How would you like attribution handled? Looks like other libraries are just cited in Tilemaker's README?

I'd like to add an alternative NodeStore that can be used when the
`Type_then_ID` property is present in the PBF.

First, a small (?) refactor:

- make `NodeStore` an interface, with two concrete implementations
- extract the NodeStore related things to their own files
- this will cause some churn, as they'll depend on things that also
  need to get extracted to their own files. Short term pain, hopefully
  long term gain in faster compile times.

Changing the invocations of the functions to be virtual may have an impact
on performance. I'll need to revisit that before committing to virtual
methods.
Currently, when a worker needs work, it gets the next unprocessed block.
This means blocks are read sequentially at a global level, but from
the perspective of each worker, there are gaps in the blocks they see.

For nodes, we'd prefer to give each worker thread contiguous blocks
from the underlying PBF. This will enable a more efficient storage
for PBFs with the `Sort.Type_then_ID` flag.
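As a rough illustration of that idea (not tilemaker's actual read_pbf code; names here are made up), each worker could claim a contiguous batch of blocks from a shared atomic counter:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <utility>

// Hands each worker a contiguous run of block indices; blocks within a batch
// are processed by the same thread, which is what a sorted store wants.
class BlockBatcher {
public:
    BlockBatcher(size_t totalBlocks, size_t batchSize)
        : totalBlocks(totalBlocks), batchSize(batchSize) {}

    // Returns a [begin, end) range of contiguous blocks for one worker,
    // or an empty range once everything has been claimed.
    std::pair<size_t, size_t> claim() {
        size_t begin = next.fetch_add(batchSize);
        if (begin >= totalBlocks) return {totalBlocks, totalBlocks};
        return {begin, std::min(begin + batchSize, totalBlocks)};
    }

private:
    std::atomic<size_t> next{0};
    size_t totalBlocks;
    size_t batchSize;
};
```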
SortedNodeStore is useful for PBFs with the `Sort.Type_then_ID`
property, e.g. the planet and Geofabrik exports.

It stores nodes in a hierarchy:

- Level 1 is groups: there are 256K groups
- Level 2 is chunks: each group has 256 chunks
- Level 3 is nodes: each chunk has 256 nodes

This allows us to store 2^34 nodes, with a fixed overhead of
only 2M -- the space required for the level 1 pointers.

Groups and chunks store their data sparsely. If a group has 7 chunks,
it only uses storage for 7 chunks.

On Great Britain's 184M node PBF, it needs ~9.13 bytes per node.

Looking up a node can be done in fixed time:

First, get some offsets:
- Group: `nodeID / 65536`
- Chunk: `(nodeID % 65536) / 256`
- Position within chunk: `nodeID % 256`

For example, Cape Chignecto Provincial Park has ID 4855703, giving:
- Group 74
- Chunk 23
- Offset 151

Group 74's chunks may be stored sparsely. To map logical chunk 23 to its
physical location, each group has a 256-bit bitmask indicating which
chunks are present.

Counting the set bits below position 23 gives the chunk's physical index;
use that index to look up its `chunkOffset`, which gets you to the
`ChunkInfo` struct.

From there, do the same thing to get the node data.
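A minimal sketch of that lookup (illustrative only, not the PR's actual code; it assumes each group keeps a 256-bit chunk bitmask plus a dense array of offsets for the chunks that exist):

```cpp
#include <algorithm>
#include <bitset>
#include <cstdint>

// Rank: number of set bits strictly below position `upTo` in a 256-bit mask.
// libpopcnt (or a hardware popcnt) would make this faster; this version is portable.
inline uint32_t rank256(const uint64_t mask[4], uint32_t upTo) {
    uint32_t count = 0;
    for (uint32_t word = 0; word * 64 < upTo; word++) {
        uint32_t bits = std::min<uint32_t>(64, upTo - word * 64);
        uint64_t keep = bits == 64 ? ~0ULL : ((1ULL << bits) - 1);
        count += std::bitset<64>(mask[word] & keep).count();
    }
    return count;
}

struct GroupInfo {
    uint64_t chunkMask[4];         // which of the 256 logical chunks are present
    const uint16_t* chunkOffsets;  // offsets of the present chunks, densely packed
};

// Decompose a node ID, e.g. 4855703 -> group 74, chunk 23, position 151.
inline void decompose(uint64_t id, uint32_t& group, uint32_t& chunk, uint32_t& pos) {
    group = id / 65536;          // 256 chunks * 256 nodes per group
    chunk = (id % 65536) / 256;  // logical chunk within the group
    pos   = id % 256;            // node within the chunk
}

// Map a logical chunk to its physical slot by counting how many earlier chunks
// are present (callers should first check that the chunk's own bit is set).
inline uint32_t physicalChunk(const GroupInfo& g, uint32_t logicalChunk) {
    return rank256(g.chunkMask, logicalChunk);
}
```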

This design should also let us do some interesting things down the road,
like efficiently compressing each chunk using something like delta
encoding, zigzag encoding and bit packing. Then, to avoid paying a
decompression cost, we'd likely give each worker a cache of uncompressed
chunks.
I don't understand why these can't be passed as a copy in the Windows
and Mac builds. Whatever, try passing a reference.
@cldellow
Contributor Author

In particular, this code doesn't need to sort any of the mmapped memory, so I'd expect it to perform much better than the old NodeStore for datasets that don't fit in memory.

@systemed
Owner

Wow - super interesting. I'll try and benchmark this tonight on some reasonably sized extracts.

@systemed
Owner

This works really nicely for GB and Germany: it's a 6.5% and 5% (respectively) memory saving over master.

It currently segfaults for Europe using --store:

/usr/bin/time -v tilemaker --input /media/data3/planet/europe-latest.osm.pbf --output /media/data4/europe-omt.mbtiles --store /media/ssd/tmp/
[...]
Reading .shp glacier
Sorting loaded shapes
Reading .pbf /media/data3/planet/europe-latest.osm.pbf
Command terminated by signal 11 ways 0 relations 0        

(I'll try with smaller areas using --store later. For now I'm benchmarking Europe with master!)

> 1 - Can we require that everyone only uses PBFs that support Sort.Type_then_ID? If yes, we can delete BinarySearchNodeStore in favour of SortedNodeStore. If not, I'll do some work to intelligently pick which one to use based on the first PBF we read.

Experience is that people will try and throw all sorts of wild and wacky PBFs at tilemaker ;) so we should probably support both. But you should just be able to read the option flag from the PBF - PbfReader::ReadPbfFile already does that for LocationsOnWays.

> 2 - I feel like there must be a more elegant way to do the work distribution in read_pbf. Perhaps the change for ReadPhase::Nodes should just be the default for all phases?

Worth trying?

> 3 - This adds libpopcnt.h, a library to do fast population counts of bitvectors. It's BSD-licensed. How would you like attribution handled? Looks like other libraries are just cited in Tilemaker's README?

Yes - that should be fine!

@cldellow
Contributor Author

Oops, I'm embarrassed to admit I didn't even test --store, I assumed it would work for free by virtue of the mmap allocators. :(

Thank you for noticing! Will have a look at investigating + fixing that.

I think nested containers may not be wired up quite correctly.
Instead, manage the char* buffers directly, rather than as
`std::vector<char>`

I'll fix up the other aspects (attributing libpopcnt, picking
Sorted vs BinarySearch on the fly) later.
All read phases use the same striding-over-batches-of-blocks approach.

This required changing how progress is reported, as block IDs are no
longer globally monotonically increasing.

Rather than thread the state into ReadBlock, I just adopted 2 atomic
counters for the whole class -- the progress reporter already assumes
that it's the only thing dumping to stdout, so the purity of avoiding
class-global state doesn't buy us anything.
@systemed
Owner

systemed commented Nov 21, 2023

With europe-latest.osm.pbf (today's download from Geofabrik) and --store it now says:

terminate called after throwing an instance of 'std::runtime_error'
  what():  SortedNodeStore: scaledOffset too big

Let me know if there's anything that'd be helpful for me to do to debug!

D'oh, if you get a full group where each chunk is full, you need to be
able to express a value _ever so slightly_ larger than 65,536.

North America and Europe have examples of this.

Use a scale factor of 16, not 8. This'll mean some chunks have up to 15
wasted bytes, but it's not a huge deal. (And I have some thoughts on how
to claw it back.)
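Roughly the arithmetic behind the fix (a sketch; the 8-bytes-per-node and 16-bit scaled-offset assumptions are mine, inferred from the 65,536 limit and the "up to 15 wasted bytes" note, and per-chunk headers are ignored, which only makes scale 8 worse):

```cpp
#include <cstdint>

constexpr uint32_t NODES_PER_GROUP = 256 * 256;  // a full group, every chunk full
constexpr uint32_t NODE_BYTES      = 8;          // e.g. two int32s: latp, lon
constexpr uint32_t GROUP_BYTES     = NODES_PER_GROUP * NODE_BYTES;  // 524,288

static_assert(GROUP_BYTES / 8 > UINT16_MAX,
              "scale 8: the largest offset just overflows a 16-bit field");
static_assert(GROUP_BYTES / 16 <= UINT16_MAX,
              "scale 16: offsets fit, at the cost of up to 15 bytes of padding per chunk");
```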
@cldellow
Contributor Author

Oops, yes I see that also happens with the North America extract. I've pushed a fix. (This still isn't ready to merge but should be suitable for testing.)

If the user passes `--compress-nodes`, we use [streamvbyte](https://github.com/lemire/streamvbyte)
to compress chunks of nodes in memory.

The impact on read time is small:
- GB with `--compress-nodes`: 1m42s
- without: 1m35s

But the impact on memory is worthwhile, even across very different
extracts:

North America - 5.52 bytes/node vs 8.48 bytes/node
  compressed:   169482 groups, 18364343 chunks, 1757589784 nodes, needed 9706167278 bytes
  uncompressed: 169482 groups, 18364343 chunks, 1757589784 nodes, needed 14916095182 bytes

Great Britain - 5.97 bytes/node vs 9.25 bytes/node
  compressed:   163074 groups, 4871807 chunks, 184655287 nodes, needed 1104024510 bytes
  uncompressed: 163074 groups, 4871807 chunks, 184655287 nodes, needed 1708093150 bytes

Nova Scotia - 5.81 bytes/node vs 8.7 bytes/node
  compressed:   26777 groups, 157927 chunks, 12104733 nodes, needed 70337950 bytes
  uncompressed: 26777 groups, 157927 chunks, 12104733 nodes, needed 105367598 bytes

Monaco - 10.43 bytes/node vs 13.52 bytes/node
  compressed:   1196 groups, 2449 chunks, 30477 nodes, needed 318114 bytes
  uncompressed: 1196 groups, 2449 chunks, 30477 nodes, needed 412258 bytes
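For reference, a rough sketch of what compressing one chunk with streamvbyte looks like (function names and signatures follow streamvbyte's README; the buffer handling and the idea of feeding it a chunk's 32-bit values are assumptions, not necessarily the PR's exact scheme):

```cpp
#include <cstdint>
#include <vector>
#include "streamvbyte.h"

// Compress `count` 32-bit values (e.g. one chunk's packed coordinates) into `out`.
// Returns the number of compressed bytes actually used.
size_t compressChunk(const uint32_t* values, uint32_t count, std::vector<uint8_t>& out) {
    out.resize(streamvbyte_max_compressedbytes(count));  // worst-case bound
    size_t used = streamvbyte_encode(values, count, out.data());
    out.resize(used);
    return used;
}

// Decompress back into a caller-provided buffer of `count` values.
void decompressChunk(const uint8_t* compressed, uint32_t count, uint32_t* values) {
    streamvbyte_decode(compressed, values, count);
}
```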
@cldellow
Contributor Author

Ok! I think this is probably good now. The most recent commit buys a bit more memory savings - ~6 bytes/node or so for most extracts.

@cldellow
Contributor Author

cldellow commented Dec 3, 2023

This lays more groundwork for lazy geometries. Once this and refactor_geometries are merged into master, it should be possible to change OsmMemTiles to compute geometries on the fly and get a large memory savings.

Changes since my last comment:

  • extract a WayStore interface
  • provide a SortedWayStore that can be used in the same circumstances as SortedNodeStore 
    • stores node IDs compactly
      • omits final node for closed ways, instead stores a bit to track that the way is closed
      • packs bits 32..35 into a single nibble, rather than allocating an entire 4-byte integer
      • special case of the above: if all nodes in the way have the same bits 32..35, stores them only once
      • supports compressing the lower 32-bit values via streamvbyte
    • adds minunit for unit testing
      • simplistic, but useful for faster dev cycles, available via make test
      • only available via Makefile, not via CMake, so doesn't run as part of CI
  • extract mmap_allocator.cpp so that unit tests don't have to drag in osm_store.cpp

SortedWayStore with --compress-ways stores way information at a cost of between 2.3 bytes/node (North America) and 2.9 bytes/node (GB). master achieves about 8.1 bytes/node (plus overhead from std::deque that I'm too lazy to estimate).

In general, its impact is not very meaningful yet, as we currently only store those ways that are later used in relations, but it's an important step on the way to lazy geometries.
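A simplified sketch of the node-ID packing described above (illustrative, not the PR's exact layout): each ID is split into a 32-bit low word plus its bits 32..35, the high nibbles are stored only once when they all match, and the final node of a closed way is dropped in favour of a flag:

```cpp
#include <cstdint>
#include <vector>

struct PackedWay {
    bool closed = false;            // final node omitted; it equals the first
    bool uniformHighNibble = false; // all node IDs share the same bits 32..35
    uint8_t sharedNibble = 0;       // stored once when uniformHighNibble is true
    std::vector<uint8_t> nibbles;   // otherwise one nibble per node, two per byte
    std::vector<uint32_t> lowBits;  // lower 32 bits; candidates for streamvbyte
};

PackedWay pack(const std::vector<uint64_t>& nodeIDs) {
    PackedWay way;
    if (nodeIDs.empty()) return way;

    way.closed = nodeIDs.size() > 1 && nodeIDs.front() == nodeIDs.back();
    size_t count = way.closed ? nodeIDs.size() - 1 : nodeIDs.size();

    way.sharedNibble = static_cast<uint8_t>((nodeIDs[0] >> 32) & 0xF);
    way.uniformHighNibble = true;
    for (size_t i = 0; i < count; i++) {
        uint8_t nibble = static_cast<uint8_t>((nodeIDs[i] >> 32) & 0xF);
        if (nibble != way.sharedNibble) way.uniformHighNibble = false;
        if (i % 2 == 0) way.nibbles.push_back(nibble);
        else way.nibbles.back() |= nibble << 4;
        way.lowBits.push_back(static_cast<uint32_t>(nodeIDs[i]));
    }
    if (way.uniformHighNibble) way.nibbles.clear();  // one shared nibble suffices
    return way;
}
```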

Timings and memory use for GB:

  • alternate-node-store: 13,000MB, 2m48s
  • alternate-node-store with --compress-nodes --compress-ways: 12,840MB, 2m55s
    • I'm a little skeptical of my memory measurements. They show the resident memory of the process as reported by htop...but I know separately that --compress-nodes uses 600MB less memory, so it's confusing to me that the resident memory only decreased by 160MB. Ah, well.
  • master: 13,950MB, 3m1s

@systemed
Owner

systemed commented Dec 4, 2023

Looks great!

I've just merged refactor_geometries into master so we can iterate on that from now on.

@cldellow
Contributor Author

cldellow commented Dec 4, 2023

Great, I've merged master into this branch & will update my other PRs to target master.

New measurements:

master: 10,500 MB

real 2m29.892s
user 35m52.622s
sys 0m24.567s

this branch: 9,500 MB

real 2m14.724s
user 32m8.778s
sys 0m22.840s

this branch, --compress-nodes --compress-ways: 9,400 MB

real 2m17.028s
user 32m43.232s
sys 0m22.265s

@cldellow
Contributor Author

cldellow commented Dec 5, 2023

It looks like I'm now able to process North America on my 32GB laptop, which is great. Previously, even Canada was too much, and would cause me to OOM. I say "looks like", because I aborted at 66% done, after 188 minutes.

With --compress-nodes --compress-ways, it needed ~12GB of RAM. Presumably that's mainly the output objects + attributes, which can't spill to disk yet.

During the reading phase, I mostly kept my 16 cores pegged - there was the odd stutter due to running out of I/O.

During the writing phase, my cores were always pegged. Disk reads were mostly in the 5-25MB/sec range, with very occasional spikes to 300-600MB/sec.

The new z6 progress indicator did draw my eye to something fishy... I'll write up my observations in an issue before I forget.

@systemed
Owner

systemed commented Dec 5, 2023

I'll run a few benchmarks for Europe over the next couple of days. I have actually managed to run Europe in-memory with this branch but that peaked at 128GB usage which is probably a bit niche!

@systemed
Owner

systemed commented Dec 6, 2023

As per #595 we're down below 144GB RAM usage for Europe, so whatever I specify it'll be in-memory anyway.

  • With --compress-nodes --compress-ways it's 3h11m48, 127.3GB RAM used.
  • Without, it's 3hr10m23, 136GB.

The difference in execution time is small enough, and the memory saving significant enough, that I wonder if we should just default to --compress-nodes --compress-ways.

The generated map looks good from a quick scout around and free of geometry issues.

We're 20 minutes slower compared to when #595 was merged (2hr50). That's not the end of the world but I wonder what's causing it. I tried removing the backtracking from #602 but that didn't make any difference.

@cldellow
Contributor Author

cldellow commented Dec 6, 2023

> I wonder if we should just default to --compress-nodes --compress-ways.

I can make that change and instead offer --no-compress-nodes and --no-compress-ways flags. I think the streamvbyte authors expect that the code should work on any CPU from 2010 or later, which is pretty broad.

> We're 20 minutes slower

Hmm, two ideas:

  1. is it possible that some of it is I/O on initial read?

It wouldn't explain the whole 20 minutes, but for me, if I haven't got the PBF entirely in my kernel buffers already, I see a substantial delay on initial startup while it reads through the PBF to find the blocks. For North America, it's ~1 min or so on my pretty fast SSD. You can generally rule this in/out by either forcing the PBF into buffers before running tilemaker (run `time cat pbf > /dev/null` until it's instantaneous) or forcing the PBF out of buffers by flushing them (on Linux, `echo 3 | sudo tee /proc/sys/vm/drop_caches`).

(Separately, I notice that for me, Tilemaker is slower at single-threaded reading of an unbuffered file than cat is - so there's probably some room to improve this.)

  2. I vaguely recall your reference system might be somewhat older -- does its CPU support an efficient popcnt instruction?

I assume the slow down is from the node store. I would normally expect this store to be faster than the old store, because the old store does a binary search. Extrapolating from North America, I assume Europe has ~3.8B nodes, so you need approx 32 searches to find any given node. This store should be able to find the node by doing 2 indirections, but each indirection needs a popcnt. As I'm typing this out, I'm skeptical, because even a naive, non-optimized popcnt should be relatively fast.

Maybe a cheap test: do you see a similar slow down with GB? For me, this store was a little faster on GB than master.

@cldellow
Contributor Author

cldellow commented Dec 6, 2023

Although, if your CPU is an X5650 (mentioned in #315), I think it supports popcnt. You can confirm with `grep popcnt /proc/cpuinfo` on Linux.

@systemed
Owner

systemed commented Dec 8, 2023

It is indeed an X5650 and it looks like it does support popcnt. I'm not 100% sure the slowdown is necessarily a result of this branch - it's possible I've got confused between branches and it was another recent commit (it's also possible we've since obsoleted it!). I'll do a bit more digging but it might not be anything to worry about.

Running this branch on Europe with #607 and #608 (plus 0dbbe6e which speeds up .shp processing, and 4bd28f1 which speeds up geometry output):

  • wall clock 2:18:46 (down from 3:11:48 in previous comment, 5:54:49 before this whole batch of changes) 🎉
  • memory 130GB

systemed merged commit 8300b0c into systemed:master on Dec 9, 2023
5 checks passed
@systemed
Owner

systemed commented Dec 9, 2023

Merged - thank you. Amazing gains.

We should probably move minunit.h into include/external/ and move sorted_way_store.test.cpp into a new test/ directory but that's not urgent.

cldellow added a commit to cldellow/tilemaker that referenced this pull request Jan 6, 2024
In systemed#590, I took the path of least resistance and compiled these as C++
files.

This was wrong, as they're meant to be compiled as C files. Depending on
your compiler, you would hit errors to do with `restrict` or
`__restrict__` references.

This reverts the streamvbyte code to be a more vanilla copy of upstream,
and updates the `Makefile` build to build them as C.

Still to do: update CMakeLists.txt.
cldellow added a commit to cldellow/tilemaker that referenced this pull request Jan 6, 2024
In systemed#590, I compiled these as C++ files.

This was wrong, as they're meant to be compiled as C files. Depending on
your compiler, you would hit errors to do with `restrict` or
`__restrict__` references.

This reverts the streamvbyte code to be a more vanilla copy of upstream,
and updates the `Makefile` and `CMakeLists.txt` files to build them as C.