
Performance analysis #176 (Open)

Ptival opened this issue Oct 27, 2021 · 9 comments

Ptival (Contributor) commented Oct 27, 2021

llvm-disasm is noticeably slower than its LLVM counterpart. This issue tracks data about that performance discrepancy and attempts at improving the situation.

This is the current flame graph of running llvm-disasm on a ~30MB PX4 bitcode file generated by Clang 12:

[image: flame graph of the full llvm-disasm run]

There is an almost even split between the time spent parsing and the time spent pretty-printing.

Parsing

[image: flame graph of the parsing half]

It seems like one third of the time is spent gobbling bytes from the input stream. Perhaps one could check whether there is something inherently inefficient in the way we consume input.

The other two thirds seem dominated by parseModuleBlock:

[image: flame graph under parseModuleBlock]

From left to right, the columns correspond to Data.LLVM.BitCode.IR.Function.finalizeStmt, Data.Generics.Uniplate.Internal.Data.descendBiData, and Data.LLVM.BitCode.IR.Function.parseFunctionBlockEntry.

Pretty-printing

(NOTE: this graph was obtained on a branch where I replaced the pretty package with prettyprinter, to see if it made a difference. On its own, it does not change much.)

[image: flame graph of the pretty-printing half]

On the pretty-printing side, we only see details for two cost centres: Prettyprinter.Render.Text.renderIO and Text.LLVM.PP.ppModule.

The former seems unavoidable; the latter decomposes as follows:

[image: breakdown of Text.LLVM.PP.ppModule]

I haven't looked into this yet, but those ppDebugLoc' and ppDebugInfo' calls sure make things slow; I wonder if there's something to look into there.

Going back up a step, the pretty-printing flame graph also has a huge, unlabeled portion. My guess is that it corresponds to Prettyprinter.layoutPretty: it is definitely called, does not appear elsewhere, and is likely computation-heavy. I'm not sure why it is not reported, though.

As mentioned, I tried replacing pretty with prettyprinter, and it's not much faster, even with an unbounded-width layout. It does go faster with renderCompact, but that output is very ugly. I wonder if there's something we can do in between, given that there's barely any interesting pretty-printing going on in our output.
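For reference, here is a minimal sketch of the two endpoints being compared, assuming the prettyprinter package (renderWide and renderUgly are made-up names for this illustration):

import Prettyprinter (Doc, LayoutOptions (..), PageWidth (Unbounded),
                      layoutCompact, layoutPretty)
import Prettyprinter.Render.Text (renderIO)
import System.IO (stdout)

-- Full layout machinery with no width limit: still walks the Doc with the
-- general-purpose layouter, even though nothing ever needs to be re-fit.
renderWide :: Doc ann -> IO ()
renderWide = renderIO stdout . layoutPretty (LayoutOptions Unbounded)

-- Single-pass layout: fast, but discards all indentation (the "very ugly" one).
renderUgly :: Doc ann -> IO ()
renderUgly = renderIO stdout . layoutCompact

An in-between layouter would presumably keep layoutCompact's single pass while still tracking the current indentation.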

Ptival (Contributor) commented Nov 22, 2021

I'm going to list things I have attempted that have not quite succeeded.

1. Monomorphizing the Parse monad 8555e20

This actually seems to make things slower, going from 130s to 170s on my running example.

It seems we spend more time in >>=.\, which I think means the lambda within >>=.

The version on master boils down to:

$c>>_rgvi
  = \ @ a_a9mB @ b_a9mC eta_XfM eta1_Xvx ->
      \ r1_ab2X ->
        let { m1_sbv8 = (eta_XfM `cast` <Co:10>) r1_ab2X } in
        (\ s1_ab44 ->
           case ((m1_sbv8 `cast` <Co:6>) s1_ab44) `cast` <Co:13> of {
             Left e1_ab3d -> (Left e1_ab3d) `cast` <Co:7>;
             Right x_ab3f ->
               case x_ab3f of { (a1_ab47, s'_ab48) ->
               ((((eta1_Xvx `cast` <Co:10>) r1_ab2X) `cast` <Co:6>) s'_ab48)
               `cast` <Co:6>
               }
           })
        `cast` <Co:17>

While the monomorphized version looks like:

$c>>_rgFk
  = \ @ a_a9dj @ b_a9dk eta_XfL eta1_Xvv ->
      \ env_a7ds st_a7dt ->
        case (eta_XfL `cast` <Co:2>) env_a7ds st_a7dt of {
          Left e_a7du -> Left e_a7du;
          Right ds_daq9 ->
            case ds_daq9 of { (st'_a7dv, a1_a7dw) ->
            (eta1_Xvv `cast` <Co:2>) env_a7ds st'_a7dv
            }
        }

I don't get why it's slower honestly.
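For orientation, here is my reading of the two shapes in those dumps; this is a reconstruction from the Core, not the actual source, and Env, ParseState, and Error stand in for whatever the real types are:

-- master: a polymorphic transformer stack whose >>= threads the
-- environment and state through the casts seen in the first dump,
-- morally ReaderT Env (StateT ParseState (Either Error)) a.
-- monomorphized: the same thing flattened into a single function type:
newtype Parse a =
  Parse { unParse :: Env -> ParseState -> Either Error (ParseState, a) }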

2. Avoiding a redundant comparison? a50bd8a

We spend about 4% of the time in getBitStringPartial. I tried to cut that down by avoiding a redundant check and manually inlining splitWord in the process, based on information we already have.

This barely makes any difference in my tests unfortunately.

3. Looking up PValMd values by hash rather than by value c8435b4

dedupMetadata :: Seq PartialUnnamedMd -> Seq PartialUnnamedMd
dedupMetadata pumd = helper (mkPartialUnnamedMdMap pumd) <$> pumd
  where
    helper pumdMap pum =
      let pumdMap' = Map.delete (pumValues pum) pumdMap -- don't self-reference
      in pum { pumValues = maybeTransform pumdMap' (pumValues pum) }

    -- We avoid erroneously recursing into ValMdValues and exit early on
    -- a few other constructors de-duplication wouldn't affect.
    maybeTransform :: Map PValMd Int -> PValMd -> PValMd
    maybeTransform pumdMap v@(ValMdNode _)      = transform (trans pumdMap) v
    maybeTransform pumdMap v@(ValMdLoc _)       = transform (trans pumdMap) v
    maybeTransform pumdMap v@(ValMdDebugInfo _) = transform (trans pumdMap) v
    maybeTransform _ v                          = v

    trans :: Map PValMd Int -> PValMd -> PValMd
    trans pumdMap v = case Map.lookup v pumdMap of
      Just idx -> ValMdRef idx
      Nothing  -> v

mkPartialUnnamedMdMap :: Seq PartialUnnamedMd -> Map PValMd Int
mkPartialUnnamedMdMap =
  foldl' (\mp part -> Map.insert (pumValues part) (pumIndex part) mp) Map.empty

The dedupMetadata function performs many lookups in a map keyed by values of type PValMd.

type PValMd = ValMd' Int

https://github.com/elliottt/llvm-pretty/blob/3cf1539bd6f70ccd319a4aa9bc0665951c26c9fa/src/Text/LLVM/AST.hs#L1004-L1011

Instead, I tried to implement this as a HashMap keyed by the hashes of those same values, with bins as values to handle collisions correctly.

This is mildly faster, but allocates slightly more.
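For the record, a minimal sketch of that attempt, assuming Eq and Hashable instances for PValMd and the unordered-containers package (insertMd and lookupMd are made-up names). Keys are hashes; each bin keeps the original values so collisions are resolved exactly:

import Data.Hashable (hash)
import qualified Data.HashMap.Strict as HashMap

-- Bins of (value, index) pairs, keyed by the value's hash.
type MdMap = HashMap.HashMap Int [(PValMd, Int)]

insertMd :: PValMd -> Int -> MdMap -> MdMap
insertMd v ix = HashMap.insertWith (++) (hash v) [(v, ix)]

lookupMd :: PValMd -> MdMap -> Maybe Int
lookupMd v m = lookup v =<< HashMap.lookup (hash v) m

The extra allocation presumably comes from building the bins themselves.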


I'm still not quite sure how to dissect the main cost centres further. One of them is pretty-printing plus output IO, but the other two are just reported as "the continuation of >>=" (one for the Parse monad, one for the GetBits monad), and I'm not sure how to get more detail on those.
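One thing that might help here: explicit SCC pragmas on the two halves of the bind, so the profiler reports them separately instead of lumping everything into >>= (a sketch against the GetBits bind, assuming a profiling build; the SCC names are made up):

m >>= f = GetBits $ \ off0 -> do
  (x, off1) <- {-# SCC "getbits_bind_head" #-} unGetBits m off0
  {-# SCC "getbits_bind_cont" #-} unGetBits (f x) off1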

travitch commented

Thanks for digging into this! It sounds like there might not be much low-hanging fruit, and that progress might require us to rethink how we parse (i.e., avoid eagerly parsing everything in the bitcode file).

kquick (Member) commented Nov 22, 2021

Agreed: knowing what didn't work is helpful!

One item to think about is that the bitcode documentation (https://www.llvm.org/docs/BitCodeFormat.html#blocks) asserts that:

> Block definitions allow the reader to efficiently skip blocks in constant time if the reader wants a summary of blocks, or if it wants to efficiently skip data it does not understand. The LLVM IR reader uses this mechanism to skip function bodies, lazily reading them on demand.

I think our current nested-monad scheme doesn't really use (and perhaps cannot use) this technique. I don't know whether it would represent savings when full details are needed anyway, but if parsing of some blocks could be lazily deferred, that might be an area for a reasonable win.
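To make that concrete, the skip the spec describes would look roughly like this in GetBits terms; this is only a sketch, and getVBR, align32bits, getWord32, and skipWords are stand-ins for whatever primitives the parser actually exposes:

-- After ENTER_SUBBLOCK, the stream carries blockid (vbr8), newabbrevlen
-- (vbr4), 32-bit alignment, and then blocklen: the body length in 32-bit
-- words, which lets a reader jump over the whole block in constant time.
skipBlock :: GetBits ()
skipBlock = do
  _blockId   <- getVBR 8
  _abbrevLen <- getVBR 4
  align32bits
  blockLen <- getWord32
  skipWords (fromIntegral blockLen)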

Ptival (Contributor) commented Nov 22, 2021

Yeah, I think we can try to implement those skips, though for function bodies I believe we read them all anyway, since we output them?

Ptival (Contributor) commented Nov 22, 2021

I found the profiteur tool output to actually be much more interesting!

Here is an output, expanded a lot, and annotated:

[image: annotated profiteur treemap of the run]

The regions correspond to:

  1. Parse m >>= f = Parse (m >>= unParse . f)
  2. https://github.com/ndmitchell/uniplate/blob/b90c58ef7019fc117e3b2f0aa9fd13b6cbd57501/Data/Generics/Uniplate/Internal/Data.hs#L263-L267
    (part of dedupMetadata)
  3. pure x = GetBits (\ off -> return (x,off))
  4. fmap f m = GetBits (\ off -> first f <$> unGetBits m off)
  5. getBitStringPartial :: Int -> SubWord -> C.Get (BitString, SubWord)
     getBitStringPartial n _ | n > 32 =
       fail $ "getBitStringPartial: refusing to read " ++ show n ++ " (> 32) bits."
     getBitStringPartial n sw = case splitWord n sw of
       (bs, Right off) -> return (bs, off)
       (bs, Left n') -> do
         (rest, off) <- getBitString n'
         return (bs `mappend` rest, off)
  6. m >>= f = GetBits $ \ off0 -> do
    (x,off1) <- unGetBits m off0
    unGetBits (f x) off1

    (specifically the lambda)
  7. m >>= f = GetBits $ \ off0 -> do
    (x,off1) <- unGetBits m off0
    unGetBits (f x) off1

    (specifically just the constructor?)
  8. let putRender = putStrLn . renderStyle s
  9. https://github.com/elliottt/llvm-pretty/blob/3cf1539bd6f70ccd319a4aa9bc0665951c26c9fa/src/Text/LLVM/PP.hs#L1173-L1174
    (as part of mcommas as part of ppCall)
  10. https://github.com/elliottt/llvm-pretty/blob/3cf1539bd6f70ccd319a4aa9bc0665951c26c9fa/src/Text/LLVM/PP.hs#L1173-L1174
    (just as part of ppCall)
  11. https://github.com/elliottt/llvm-pretty/blob/3cf1539bd6f70ccd319a4aa9bc0665951c26c9fa/src/Text/LLVM/PP.hs#L370-L372

Region 8 should be drastically reducible:
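For instance, assuming the Text.PrettyPrint.HughesPJ API that putRender suggests (hPutDoc is a made-up helper), fullRender can stream TextDetails chunks straight to the handle instead of materializing the whole module as a String first:

import System.IO (Handle, hPutChar, hPutStr)
import Text.PrettyPrint.HughesPJ (Doc, TextDetails (..), fullRender,
                                  lineLength, mode, ribbonsPerLine, style)

-- Stream the document to the handle chunk by chunk, avoiding the
-- intermediate String built by renderStyle.
hPutDoc :: Handle -> Doc -> IO ()
hPutDoc h =
  fullRender (mode style) (lineLength style) (ribbonsPerLine style)
             put (hPutChar h '\n')
  where
    put (Chr c)  io = hPutChar h c >> io
    put (Str s)  io = hPutStr h s >> io
    put (PStr s) io = hPutStr h s >> io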

travitch commented

This list is interesting. @RyanGlScott, do you think any of these might be interesting candidates for unlifted data tricks? Particularly the parts around GetBits, which are called a lot; allocating anything there is going to be really bad.

(Of course, all of this would be complemented well by doing less work, i.e. disassembling on demand, but optimizing the individual bits will still help that.)

RyanGlScott (Contributor) commented

> @RyanGlScott, do you think any of these might be interesting candidates for unlifted data tricks? Particularly the parts around GetBits, which are called a lot; allocating anything there is going to be really bad.

At a glance, I don't know how much GetBits would benefit from UnliftedDatatypes. The definition of GetBits is:

newtype GetBits a = GetBits { unGetBits :: SubWord -> C.Get (a,SubWord) }

That's pretty svelte to begin with, and the SubWord data type is itself pretty straightforward:

data SubWord = SubWord !Int !Word32

I'd be inclined to try optimizing Get before the surrounding parts of GetBits. (Of course, this is all speculation on my part.)
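For the curious, trying it on SubWord would look roughly like this (a sketch; it requires GHC 9.2 or later and is not in the codebase):

{-# LANGUAGE StandaloneKindSignatures #-}
{-# LANGUAGE UnliftedDatatypes #-}

import Data.Word (Word32)
import GHC.Exts (UnliftedType)

-- An unlifted SubWord can never be a thunk, so consumers skip the
-- evaluatedness checks; the strict fields above already buy most of
-- that, which is why the benefit here is unclear.
type SubWordU :: UnliftedType
data SubWordU = SubWordU !Int !Word32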

kquick (Member) commented Aug 22, 2022

One important element here is avoiding GHC 9.0.2:

$ \time -f "%e real secs, %M KB  $(ghc --version | sed -e 's/.*version //')" cabal run llvm-disasm -- input-700KB-file.bc | wc

gave me:

real secs   Mem (KB)   GHC ver
1.81        160464     8.8.4
1.80        159120     8.10.7
45.81       2958812    9.0.2
1.66        195076     9.2.4

where wc verified that every output was 31588 lines long.

RyanGlScott (Contributor) commented

Wow, that's incredible. I always had an odd feeling about GHC 9.0.2's performance, and this is a strong piece of evidence to justify that feeling.
