[WIP] Add some parallelization to ahc-ld and ahc-link #622
base: master
Conversation
asterius/src/Asterius/Ld.hs (Outdated)
@@ -45,9 +46,13 @@ data LinkTask
 loadTheWorld :: LinkTask -> IO AsteriusCachedModule
 loadTheWorld LinkTask {..} = do
   ncu <- newNameCacheUpdater
-  lib <- mconcat <$> for linkLibs (loadAr ncu)
   objs <- rights <$> for linkObjs (tryGetFile ncu)
+  lib <- parallelFor 1 linkLibs (loadAr ncu) -- TODO: Parameterize
`loadAr` is already parallelized, and here we use `parallelFor` again, which creates nested parallelism. Not to say this won't run, but the overall efficiency will be reduced.

We need to either:

- Refactor `loadAr`, so that it only returns the `ArchiveEntry`s, and after we collect all the `ArchiveEntry`s, parallelize their deserialization (along with `objs`); see the sketch after this list
- Improve our parallelization framework so that it supports nested parallelism; each job can spawn new jobs to be evaluated somewhere else.
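For concreteness, here is a self-contained toy of that first option. It is not code from the PR: all names (`Module`, `loadArchiveEntries`, `deserialize`, `linkAll`) are stand-ins for the Asterius types, and `forConcurrently` from the async package plays the role of `parallelFor`. The point is only the shape: the archive loader collects raw entries, and a single flat parallel pass deserializes every buffer.

```haskell
import Control.Concurrent.Async (forConcurrently)
import Control.Exception (evaluate)
import qualified Data.ByteString as BS

newtype Module = Module Int                      -- stand-in for AsteriusModule

-- Stand-in for a refactored loadAr: split an archive into raw entries,
-- without deserializing anything yet.
loadArchiveEntries :: FilePath -> IO [BS.ByteString]
loadArchiveEntries path = BS.split 0 <$> BS.readFile path

-- Stand-in for deserializing one object file or archive entry.
deserialize :: BS.ByteString -> Module
deserialize = Module . BS.length

linkAll :: [FilePath] -> [FilePath] -> IO [Module]
linkAll libs objs = do
  entryBufs <- concat <$> mapM loadArchiveEntries libs   -- sequential, cheap
  objBufs   <- mapM BS.readFile objs
  -- One flat level of parallelism over every buffer; evaluate forces each
  -- decoded module on the worker rather than lazily in the consumer.
  forConcurrently (entryBufs ++ objBufs) (evaluate . deserialize)
```

With this shape the parallelism lives in exactly one place, so the number of simultaneously busy workers stays predictable instead of multiplying across nesting levels.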
I went with the first approach:

> Refactor `loadAr`, so that it only returns the `ArchiveEntry`s, and after we collect all the `ArchiveEntry`s, parallelize their deserialization (along with `objs`)

but I don't completely understand what you mean with the second bullet:

> Improve our parallelization framework so that it supports nested parallelism; each job can spawn new jobs to be evaluated somewhere else.

Nested parallelism is already allowed, right? I understand that nested calls would pin threads to the same capabilities (hence slower), but it'd still work. What is the alternative solution you mean here?
Ah, the second approach elaborated is:

- We have a `P` monad which you can lift `IO` actions into. It carries the context (thread pool, task queue) needed for all parallelism
- There's a `fork :: P a -> P (Future a)` method which spawns a `P a` action for parallel execution. The `P a` action itself can `fork` new jobs, thus nested parallelism here.
- There's a `join :: Future a -> P a` method which blocks until the result is available.

Going through all the trouble of creating the `P` monad abstraction has an advantage: we can enforce the number of worker threads to match the CPU core count all the time. Directly calling `parallelFor` inside `parallelFor` would be much less efficient.
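To make the shape of that interface concrete, here is a hedged, self-contained toy, not code from the PR. For simplicity it spawns a fresh thread per `fork` instead of reusing a fixed pool, so it does not yet enforce that the worker count matches the core count; it only shows how `P`, `fork`, and `join` compose, including nested forks.

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import Control.Monad.IO.Class (MonadIO)

-- P carries whatever shared context the real version needs; in this toy it is
-- just a wrapper around IO.
newtype P a = P { runP :: IO a }
  deriving (Functor, Applicative, Monad, MonadIO)

newtype Future a = Future (MVar a)

-- Spawn a P action for parallel execution; the spawned action may itself call
-- fork, which is the nested parallelism being discussed.
fork :: P a -> P (Future a)
fork (P act) = P $ do
  slot <- newEmptyMVar
  _ <- forkIO (act >>= putMVar slot)
  pure (Future slot)

-- Block until the forked job has produced its result.
join :: Future a -> P a
join (Future slot) = P (readMVar slot)

-- Example: a parallel map where each element's work may fork further jobs.
parMap :: (a -> P b) -> [a] -> P [b]
parMap f xs = mapM (fork . f) xs >>= mapM join
```

A production version would replace `forkIO` with submission to a fixed worker pool, and would have to release a worker while it is blocked in `join`; that is the non-trivial part of supporting nested parallelism efficiently.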
@TerrorJack, I think it is ready for review now. I skipped parallelization of data segments in the binaryen backend, as we discussed, so that it doesn't backfire on us. I have also left a couple of …
@@ -103,6 +103,11 @@ parseTask args = case err_msgs of
           in if i >= 0 && i <= 2
             then t {shrinkLevel = i}
             else error "Shrink level must be [0..2]",
+      str_opt "thread-pool-size" $ \s t ->
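The handler body is cut off in the hunk above. As a hedged sketch of the kind of validation it would likely perform, mirroring the shrink-level handler just before it (the helper name and the lower bound of one are assumptions, not part of the PR):

```haskell
-- Hypothetical helper: validate the --thread-pool-size argument before
-- storing it in the task record.
parseThreadPoolSize :: String -> Maybe Int
parseThreadPoolSize s = case reads s of
  [(n, "")] | n >= 1 -> Just n
  _                  -> Nothing
```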
Don't forget to pass the `--thread-pool-size` option to `ahc-ld` as well. `ahc-ld` is called as the linker executable by `ahc`, so search for `--optl` in this module and see how other `ahc-ld` options are being passed.
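As a hedged illustration of that forwarding, assuming a `threadPoolSize` task field and without claiming this is the actual Asterius.Main code; the exact `--optl` spelling should match however the other `ahc-ld` options are already passed in the module:

```haskell
-- Hypothetical sketch: wrap the linker flag so ahc's GHC invocation hands it
-- on to ahc-ld. The function name, field name, and exact --optl form are
-- assumptions for illustration only.
extraLinkerFlags :: Int -> [String]
extraLinkerFlags poolSize =
  ["--optl=--thread-pool-size=" ++ show poolSize]
```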
Good catch! Should be ok now.
We have to hold this one for a bit longer; I gave this a try on my build server, and the overall clock time of linking …
Merely reading the flamegraph doesn't seem to help too much either. I think we also need to investigate the eventlogs with tools like …
@TerrorJack I took your advice and tried out …
The implementation of …
Not sure about that, but if you could have another go at linking …
Nice finding about the missed laziness in …
@TerrorJack some updates on this one: Inspired by your fix for #663 (0085dcf), I tried giving up on parallelization in …; wall clock time seems to only be getting worse: …
@gkaracha Thanks for the update! Indeed, it seems that adding parallelization is pragmatic only after we further cut down memory usage of the single-threaded case.
This is WIP. TODOs as described in #621:

- In `ahc-ld`, we can parallelize the deserialization of each object file. All object files are converted to `ByteString`s first, either via direct reading or `ArchiveEntry`, then the deserialization can be performed in parallel. The resulting `AsteriusModule` should be fully evaluated, and this can be done in parallel as well.
- In the `binaryen` backend, we can parallelize the marshaling of different data segments and functions. `binaryen` will transparently switch to a new allocator when it notices it's allocating an IR node on a different thread, so we should ensure each Haskell worker thread is pinned using `forkOn` (a sketch of such a pinned worker pool follows below).
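A minimal, self-contained sketch of such a pinned worker pool, assuming nothing about the real Asterius `parallelFor` beyond the `Int -> [a] -> (a -> IO b) -> IO [b]` shape visible in the diff above. `forkOn` keeps each worker on one capability, which is the pinning the TODO refers to.

```haskell
import Control.Concurrent (forkOn, getNumCapabilities)
import Control.Concurrent.MVar
import Control.Monad (forM, forM_)
import Data.IORef

-- Run f over xs using at most nWorkers pinned worker threads.
parallelFor :: Int -> [a] -> (a -> IO b) -> IO [b]
parallelFor nWorkers xs f = do
  caps <- getNumCapabilities
  let n = max 1 (min nWorkers caps)
  slots <- forM xs (const newEmptyMVar)        -- one result slot per input
  queue <- newIORef (zip slots xs)             -- shared work queue
  done  <- forM [0 .. n - 1] $ \i -> do
    d <- newEmptyMVar
    -- forkOn pins the worker to capability i, so it never migrates between
    -- capabilities while marshaling.
    _ <- forkOn i (worker queue f >> putMVar d ())
    pure d
  forM_ done takeMVar                          -- wait for all workers
  forM slots takeMVar                          -- collect results in input order

-- Each worker repeatedly pops one job and fills its result slot.
worker :: IORef [(MVar b, a)] -> (a -> IO b) -> IO ()
worker queue f = do
  job <- atomicModifyIORef' queue $ \q -> case q of
    []       -> ([], Nothing)
    (j : js) -> (js, Just j)
  case job of
    Nothing        -> pure ()
    Just (slot, x) -> f x >>= putMVar slot >> worker queue f
```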