Performance analysis #176
I'm going to list things I have attempted that have not quite succeeded.

1. Monomorphizing the `Parse` monad (8555e20)

   This actually seems to make things slower, going from 130s to 170s over my running example. It seems we spend more time in

   The version on master boils down to:

   ```
   $c>>_rgvi
     = \ @ a_a9mB @ b_a9mC eta_XfM eta1_Xvx ->
       \ r1_ab2X ->
       let { m1_sbv8 = (eta_XfM `cast` <Co:10>) r1_ab2X } in
       (\ s1_ab44 ->
          case ((m1_sbv8 `cast` <Co:6>) s1_ab44) `cast` <Co:13> of {
            Left e1_ab3d -> (Left e1_ab3d) `cast` <Co:7>;
            Right x_ab3f ->
              case x_ab3f of { (a1_ab47, s'_ab48) ->
                ((((eta1_Xvx `cast` <Co:10>) r1_ab2X) `cast` <Co:6>) s'_ab48)
                `cast` <Co:6>
              }
          })
       `cast` <Co:17>
   ```

   While the monomorphized version looks like:

   ```
   $c>>_rgFk
     = \ @ a_a9dj @ b_a9dk eta_XfL eta1_Xvv ->
       \ env_a7ds st_a7dt ->
       case (eta_XfL `cast` <Co:2>) env_a7ds st_a7dt of {
         Left e_a7du -> Left e_a7du;
         Right ds_daq9 ->
           case ds_daq9 of { (st'_a7dv, a1_a7dw) ->
             (eta1_Xvv `cast` <Co:2>) env_a7ds st'_a7dv
           }
       }
   ```

   I honestly don't get why it's slower.

2. Avoiding a redundant comparison? (a50bd8a)

   We spend about 4% of the time in

   This barely makes any difference in my tests, unfortunately.

3. Looking up
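Back on point 1: the shape that the monomorphized Core dump corresponds to can be sketched as below. This is a minimal, hypothetical reconstruction, not the parser's actual code; `Env`, `PState`, and `Err` are illustrative stand-ins for the real environment, state, and error types.

```haskell
-- Hypothetical stand-ins for the parser's real types.
type Env    = [String]
type PState = Int
type Err    = String

-- Monomorphic parser: no transformer stack, just a function.
newtype Parse a = Parse { runParse :: Env -> PState -> Either Err (PState, a) }

instance Functor Parse where
  fmap f (Parse m) = Parse $ \env st ->
    case m env st of
      Left e         -> Left e
      Right (st', a) -> Right (st', f a)

instance Applicative Parse where
  pure a = Parse $ \_ st -> Right (st, a)
  Parse mf <*> Parse ma = Parse $ \env st ->
    case mf env st of
      Left e         -> Left e
      Right (st', f) -> case ma env st' of
        Left e          -> Left e
        Right (st'', a) -> Right (st'', f a)

instance Monad Parse where
  -- This is the shape of $c>>_rgFk above: scrutinise the Either,
  -- thread the state, pass the environment along unchanged.
  Parse m >>= k = Parse $ \env st ->
    case m env st of
      Left e         -> Left e
      Right (st', a) -> runParse (k a) env st'
```

One would expect the second Core dump (no coercions through a transformer stack, direct case analysis) to be at least as fast, which is what makes the measured slowdown surprising.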
```haskell
dedupMetadata :: Seq PartialUnnamedMd -> Seq PartialUnnamedMd
dedupMetadata pumd = helper (mkPartialUnnamedMdMap pumd) <$> pumd
  where helper pumdMap pum =
          let pumdMap' = Map.delete (pumValues pum) pumdMap -- don't self-reference
          in pum { pumValues = maybeTransform pumdMap' (pumValues pum) }

        -- | We avoid erroneously recursing into ValMdValues and exit early on
        -- a few other constructors de-duplication wouldn't affect.
        maybeTransform :: Map PValMd Int -> PValMd -> PValMd
        maybeTransform pumdMap v@(ValMdNode _)      = transform (trans pumdMap) v
        maybeTransform pumdMap v@(ValMdLoc _)       = transform (trans pumdMap) v
        maybeTransform pumdMap v@(ValMdDebugInfo _) = transform (trans pumdMap) v
        maybeTransform _ v = v

        trans :: Map PValMd Int -> PValMd -> PValMd
        trans pumdMap v = case Map.lookup v pumdMap of
          Just idex -> ValMdRef idex
          Nothing   -> v

mkPartialUnnamedMdMap :: Seq PartialUnnamedMd -> Map PValMd Int
mkPartialUnnamedMdMap =
  foldl' (\mp part -> Map.insert (pumValues part) (pumIndex part) mp) Map.empty
```
The `dedupMetadata` function performs many lookups in a map keyed by values of type `PValMd`:

```haskell
type PValMd = ValMd' Int
```

Instead, I tried to implement this as a `HashMap` keyed by the hashes of those same values, with bins as values to handle collisions correctly. This is mildly faster, but allocates slightly more.
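A minimal sketch of that idea: an `IntMap` keyed by a hash, with association-list bins so that colliding keys are still distinguished by an exact comparison. The key type and the hash function here are illustrative stand-ins, not the real `PValMd` hashing.

```haskell
import qualified Data.IntMap.Strict as IntMap

-- Illustrative key type and hash; the real code would hash PValMd values.
hashKey :: String -> Int
hashKey = foldl (\h c -> 33 * h + fromEnum c) 5381

-- Bins of (key, value) pairs resolve hash collisions by exact comparison.
type HashBins v = IntMap.IntMap [(String, v)]

insertHB :: String -> v -> HashBins v -> HashBins v
insertHB k v = IntMap.insertWith (++) (hashKey k) [(k, v)]

lookupHB :: String -> HashBins v -> Maybe v
lookupHB k m = IntMap.lookup (hashKey k) m >>= lookup k
```

The extra allocation presumably comes from building the singleton bins on every insert, which a plain ordered `Map` avoids.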
I'm still not quite sure how to dissect the main cost centres even more. One of them is the pretty-printing and the output IO, but the other two are just reported as "the continuation of `>>=`" (one for the `Parse` monad, one for the `GetBits` monad), and I'm not sure how to get more details on this.
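One way to split such an opaque `>>=` cost centre into finer pieces is to add expression-level `SCC` pragmas at the call sites of interest (they only take effect when built with `-prof`; the function and cost-centre names below are hypothetical):

```haskell
-- SCC pragmas are no-ops in a normal build, but under -prof each
-- annotated expression gets its own cost centre in the .prof output,
-- instead of everything being attributed to the monad's >>=.
parseTwo :: Monad m => m Int -> m Int -> m Int
parseTwo parseHeader parseBody = do
  hdr  <- {-# SCC "parseTwo.header" #-} parseHeader
  body <- {-# SCC "parseTwo.body"   #-} parseBody
  pure (hdr + body)
```

Compiling with `-fprof-auto` (or `-fprof-auto-top`) gets a coarser version of this automatically, but manual pragmas let you carve up exactly the hot `do` blocks.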
Thanks for digging into this! It sounds like there might not be much low-hanging fruit, and that progress might require us to just rethink how we parse (i.e., avoid eagerly parsing everything in the bitcode file).
Agreed: knowing what didn't work is helpful! One item to think about is that the bitcode documentation (https://www.llvm.org/docs/BitCodeFormat.html#blocks) makes the assertion that

I think our current nested-monad scheme doesn't really use (and perhaps cannot use) this technique. I don't know if it would represent savings in the event that full details were known, but if parsing of some blocks could be lazily deferred, that might be an area for a reasonable win.
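The skip the documentation describes relies on each block recording its own length, so a reader that doesn't care about a block can jump past it wholesale. A rough sketch with `Data.Binary.Get`, purely illustrative: the real parser uses its own `GetBits` monad, and the actual ENTER_SUBBLOCK encoding involves abbrev IDs, a blockid/abbrev-width header, and 32-bit alignment not shown here.

```haskell
import qualified Data.Binary.Get as G

-- Sketch: given that a subblock header ends with a 32-bit word
-- holding the block length in 32-bit words, skip the whole payload
-- without interpreting any of it.
skipSubBlock :: G.Get ()
skipSubBlock = do
  blockLen <- G.getWord32le            -- length in 32-bit words
  G.skip (fromIntegral blockLen * 4)   -- jump past the payload
```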
Yeah, I think we can try to implement those skips, though for function bodies I believe we are reading them all anyway, since we output them?
I found the

Here is an output, expanded a lot, and annotated:

The regions correspond to:

Region 8 should be able to drastically reduce:
This list is interesting. @RyanGlScott do you think any of these might be interesting candidates for unlifted data tricks? Particularly those parts around

(Of course, all of this would be complemented well by doing less work by disassembling on-demand, but optimizing the individual bits will still help that.)
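For context, one concrete flavour of such tricks is strict, unpacked fields (and on GHC 9.2+, full `UnliftedDatatypes`): hot fields get stored as raw machine words rather than pointers to possibly-unevaluated thunks, so reading them never enters a thunk. The record below is a hypothetical example, not a type from this codebase.

```haskell
-- Hypothetical hot record: each field is forced on construction and
-- stored unboxed, so field reads are plain word loads.
data Pos = Pos
  { posWord :: {-# UNPACK #-} !Int  -- word offset into the stream
  , posBit  :: {-# UNPACK #-} !Int  -- bit offset within that word
  }
```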
At a glance, I don't know how much

That's pretty svelte to begin with, and the
(see `llvm-pretty-bc-parser/src/Data/LLVM/BitCode/GetBits.hs`, lines 90 to 91 at `4b2e006`)

I'd be inclined to try optimizing
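On optimizing the bit-level input path: one baseline to compare `GetBits` against is a buffered bit cursor that keeps up to 64 bits in a `Word64` and touches the `ByteString` only on refill, instead of re-slicing it on every read. A rough sketch, not the actual `GetBits` implementation, and only valid for reads of up to 56 bits at a time:

```haskell
import qualified Data.ByteString as BS
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word64)

-- A cursor over the input: unread bytes, plus a small LSB-first
-- bit buffer refilled a byte at a time.
data BitCursor = BitCursor
  { bcInput :: !BS.ByteString  -- unread bytes
  , bcBuf   :: !Word64         -- buffered bits, LSB-first
  , bcCount :: !Int            -- number of valid bits in bcBuf
  }

-- Read n bits (n <= 56, so the refill shift below cannot overflow).
takeBits :: Int -> BitCursor -> Maybe (Word64, BitCursor)
takeBits n c@(BitCursor inp buf cnt)
  | n <= cnt  = Just ( buf .&. ((1 `shiftL` n) - 1)
                     , c { bcBuf = buf `shiftR` n, bcCount = cnt - n } )
  | otherwise = case BS.uncons inp of
      Nothing        -> Nothing  -- ran out of input
      Just (b, inp') -> takeBits n (BitCursor inp'
                          (buf .|. (fromIntegral b `shiftL` cnt))
                          (cnt + 8))
```

The point of the comparison would be to see whether the existing abstraction is paying for closure allocation per read that a tight loop like this avoids.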
One important element here is avoiding GHC 9.0.2:

gave me:

where
Wow, that's incredible. I always had an odd feeling about GHC 9.0.2's performance, and this is a strong piece of evidence to justify that feeling.
`llvm-disasm` is noticeably slower than its LLVM counterpart. This issue will attempt to keep track of data regarding this performance discrepancy and of attempts at improving the situation.

This is the current flame graph of running `llvm-disasm` on a ~30MB PX4 bitcode file generated by Clang 12:

There is an almost even divide between the time spent parsing and the time spent pretty-printing.
Parsing

It seems like one third of the time is spent gobbling bytes from the input stream. Perhaps one could check whether there is something inherently inefficient in the way we consume input.

The other two thirds seem dominated by `parseModuleBlock`:

From left to right, the columns correspond to `Data.LLVM.BitCode.IR.Function.finalizeStmt`, `Data.Generics.Uniplate.Internal.Data.descendBiData`, and `Data.LLVM.BitCode.IR.Function.parseFunctionBlockEntry`.
Pretty-printing

(NOTE: this graph has been obtained with a branch where I replaced the `pretty` package with `prettyprinter`, to see if it made a difference. It does not do much on its own.)

On the pretty-printing side, we only see details of two cost centres: `Prettyprinter.Render.Text.renderIO` and `Text.LLVM.PP.ppModule`.

The former seems unavoidable; the latter is decomposed as such:
I haven't looked into this yet, but those `ppDebugLoc'` and `ppDebugInfo'` sure make things slow; I wonder if there's something to look into there.

But, going back up a step, the pretty-printing flame graph also had this huge, unlabeled portion. My guess is that it likely corresponds to `Prettyprinter.layoutPretty`, since this is definitely called, doesn't appear elsewhere, and is likely heavy in computation. I'm not sure why it is not reported, though.

As mentioned, I tried to replace `pretty` with `prettyprinter`, and it's not much faster, even with an unbounded-width layout. It definitely goes faster when using `renderCompact`, but that one has very ugly output. I wonder if there's something we can do in between, given there's barely any interesting pretty-printing going on in our output.
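For the record, the two layout strategies being compared can be exercised side by side with `prettyprinter` >= 1.7, where the compact renderer is spelled `layoutCompact`; the `doc` below is a made-up stand-in for a rendered module:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Prettyprinter
import Prettyprinter.Render.Text (renderStrict)
import Data.Text (Text)

-- A small made-up document with some alignment.
doc :: Doc ann
doc = "define" <+> align (vsep ["%a = add i32 1, 2", "%b = mul i32 %a, 3"])

-- Unbounded width: never wraps, but still does indentation bookkeeping.
wide :: Text
wide = renderStrict (layoutPretty (LayoutOptions Unbounded) doc)

-- layoutCompact: a fast single pass that discards all alignment,
-- which is what makes its output ugly.
compact :: Text
compact = renderStrict (layoutCompact doc)
```

An "in between" renderer would presumably be a custom `SimpleDocStream` traversal that keeps line breaks but skips the width calculations, since the output here never needs reflowing.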