Performance analysis #176
I'm going to list things I have attempted that have not quite succeeded.

1. Monomorphizing the `Parse` monad (8555e20)

   This actually seems to make things slower, going from 130s to 170s over my running example. It seems we spend more time in

   The version on master boils down to:

   ```
   $c>>_rgvi
     = \ @ a_a9mB @ b_a9mC eta_XfM eta1_Xvx ->
       \ r1_ab2X ->
       let { m1_sbv8 = (eta_XfM `cast` <Co:10>) r1_ab2X } in
       (\ s1_ab44 ->
          case ((m1_sbv8 `cast` <Co:6>) s1_ab44) `cast` <Co:13> of {
            Left e1_ab3d -> (Left e1_ab3d) `cast` <Co:7>;
            Right x_ab3f ->
              case x_ab3f of { (a1_ab47, s'_ab48) ->
                ((((eta1_Xvx `cast` <Co:10>) r1_ab2X) `cast` <Co:6>) s'_ab48)
                `cast` <Co:6>
              }
          })
       `cast` <Co:17>
   ```

   While the monomorphized version looks like:

   ```
   $c>>_rgFk
     = \ @ a_a9dj @ b_a9dk eta_XfL eta1_Xvv ->
       \ env_a7ds st_a7dt ->
       case (eta_XfL `cast` <Co:2>) env_a7ds st_a7dt of {
         Left e_a7du -> Left e_a7du;
         Right ds_daq9 ->
           case ds_daq9 of { (st'_a7dv, a1_a7dw) ->
             (eta1_Xvv `cast` <Co:2>) env_a7ds st'_a7dv
           }
       }
   ```

   I honestly don't get why it's slower.

2. Avoiding a redundant comparison? (a50bd8a)

   We spend about 4% of the time in

   This barely makes any difference in my tests, unfortunately.

3. Looking up
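Back on point 1: the shape that the monomorphized Core dump corresponds to can be sketched as below. This is a minimal, hypothetical reconstruction, not the parser's actual code; `Env`, `PState`, and `Err` are illustrative stand-ins for the real environment, state, and error types.

```haskell
-- Hypothetical stand-ins for the parser's real types.
type Env    = [String]
type PState = Int
type Err    = String

-- Monomorphic parser: no transformer stack, just a function.
newtype Parse a = Parse { runParse :: Env -> PState -> Either Err (PState, a) }

instance Functor Parse where
  fmap f (Parse m) = Parse $ \env st ->
    case m env st of
      Left e         -> Left e
      Right (st', a) -> Right (st', f a)

instance Applicative Parse where
  pure a = Parse $ \_ st -> Right (st, a)
  Parse mf <*> Parse ma = Parse $ \env st ->
    case mf env st of
      Left e         -> Left e
      Right (st', f) -> case ma env st' of
        Left e          -> Left e
        Right (st'', a) -> Right (st'', f a)

instance Monad Parse where
  -- This is the shape of $c>>_rgFk above: scrutinise the Either,
  -- thread the state, pass the environment along unchanged.
  Parse m >>= k = Parse $ \env st ->
    case m env st of
      Left e         -> Left e
      Right (st', a) -> runParse (k a) env st'
```

One would expect the second Core dump (no coercions through a transformer stack, direct case analysis) to be at least as fast, which is what makes the measured slowdown surprising.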
```haskell
dedupMetadata :: Seq PartialUnnamedMd -> Seq PartialUnnamedMd
dedupMetadata pumd = helper (mkPartialUnnamedMdMap pumd) <$> pumd
  where helper pumdMap pum =
          let pumdMap' = Map.delete (pumValues pum) pumdMap -- don't self-reference
          in pum { pumValues = maybeTransform pumdMap' (pumValues pum) }

        -- | We avoid erroneously recursing into ValMdValues and exit early on
        -- a few other constructors de-duplication wouldn't affect.
        maybeTransform :: Map PValMd Int -> PValMd -> PValMd
        maybeTransform pumdMap v@(ValMdNode _)      = transform (trans pumdMap) v
        maybeTransform pumdMap v@(ValMdLoc _)       = transform (trans pumdMap) v
        maybeTransform pumdMap v@(ValMdDebugInfo _) = transform (trans pumdMap) v
        maybeTransform _ v = v

        trans :: Map PValMd Int -> PValMd -> PValMd
        trans pumdMap v = case Map.lookup v pumdMap of
          Just idex -> ValMdRef idex
          Nothing   -> v

mkPartialUnnamedMdMap :: Seq PartialUnnamedMd -> Map PValMd Int
mkPartialUnnamedMdMap =
  foldl' (\mp part -> Map.insert (pumValues part) (pumIndex part) mp) Map.empty
```
The `dedupMetadata` function performs many lookups in a map keyed by values of type `PValMd`:

```haskell
type PValMd = ValMd' Int
```

Instead, I tried to implement this as a `HashMap` keyed by the hashes of those same values, with bins as values to handle collisions correctly. This is mildly faster, but allocates slightly more.
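A minimal sketch of that idea: an `IntMap` keyed by a hash, with association-list bins so that colliding keys are still distinguished by an exact comparison. The key type and the hash function here are illustrative stand-ins, not the real `PValMd` hashing.

```haskell
import qualified Data.IntMap.Strict as IntMap

-- Illustrative key type and hash; the real code would hash PValMd values.
hashKey :: String -> Int
hashKey = foldl (\h c -> 33 * h + fromEnum c) 5381

-- Bins of (key, value) pairs resolve hash collisions by exact comparison.
type HashBins v = IntMap.IntMap [(String, v)]

insertHB :: String -> v -> HashBins v -> HashBins v
insertHB k v = IntMap.insertWith (++) (hashKey k) [(k, v)]

lookupHB :: String -> HashBins v -> Maybe v
lookupHB k m = IntMap.lookup (hashKey k) m >>= lookup k
```

The extra allocation presumably comes from building the singleton bins on every insert, which a plain ordered `Map` avoids.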
I'm still not quite sure how to dissect the main cost centres even more. One of them is the pretty-printing and the output IO, but the other two are just reported as "the continuation of `>>=`" (one for the `Parse` monad, one for the `GetBits` monad), and I'm not sure how to get more details on this.
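One way to split such an opaque `>>=` cost centre into finer pieces is to add expression-level `SCC` pragmas at the call sites of interest (they only take effect when built with `-prof`; the function and cost-centre names below are hypothetical):

```haskell
-- SCC pragmas are no-ops in a normal build, but under -prof each
-- annotated expression gets its own cost centre in the .prof output,
-- instead of everything being attributed to the monad's >>=.
parseTwo :: Monad m => m Int -> m Int -> m Int
parseTwo parseHeader parseBody = do
  hdr  <- {-# SCC "parseTwo.header" #-} parseHeader
  body <- {-# SCC "parseTwo.body"   #-} parseBody
  pure (hdr + body)
```

Compiling with `-fprof-auto` (or `-fprof-auto-top`) gets a coarser version of this automatically, but manual pragmas let you carve up exactly the hot `do` blocks.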
Thanks for digging into this! It sounds like there might not be much low-hanging fruit, and that progress might require us to just rethink how we parse (i.e., avoid eagerly parsing everything in the bitcode file).
Agreed: knowing what didn't work is helpful! One item to think about is that the bitcode documentation (https://www.llvm.org/docs/BitCodeFormat.html#blocks) makes the assertion that

I think our current nested-monad scheme doesn't really use (and perhaps cannot use) this technique. I don't know if it would represent savings in the event that full details were known, but if parsing of some blocks could be lazily deferred, that might be an area for a reasonable win.
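The skip the documentation describes relies on each block recording its own length, so a reader that doesn't care about a block can jump past it wholesale. A rough sketch with `Data.Binary.Get`, purely illustrative: the real parser uses its own `GetBits` monad, and the actual ENTER_SUBBLOCK encoding involves abbrev IDs, a blockid/abbrev-width header, and 32-bit alignment not shown here.

```haskell
import qualified Data.Binary.Get as G

-- Sketch: given that a subblock header ends with a 32-bit word
-- holding the block length in 32-bit words, skip the whole payload
-- without interpreting any of it.
skipSubBlock :: G.Get ()
skipSubBlock = do
  blockLen <- G.getWord32le            -- length in 32-bit words
  G.skip (fromIntegral blockLen * 4)   -- jump past the payload
```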
Yeah, I think we can try to implement those skips, though for function bodies I believe we are reading them all anyway, since we output them?
I found the

Here is an output, expanded a lot, and annotated:

The regions correspond to:

Region 8 should be able to drastically reduce:
This list is interesting. @RyanGlScott do you think any of these might be interesting candidates for unlifted data tricks? Particularly those parts around

(Of course, all of this would be complemented well by doing less work by disassembling on-demand, but optimizing the individual bits will still help that.)
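For context, one concrete flavour of such tricks is strict, unpacked fields (and on GHC 9.2+, full `UnliftedDatatypes`): hot fields get stored as raw machine words rather than pointers to possibly-unevaluated thunks, so reading them never enters a thunk. The record below is a hypothetical example, not a type from this codebase.

```haskell
-- Hypothetical hot record: each field is forced on construction and
-- stored unboxed, so field reads are plain word loads.
data Pos = Pos
  { posWord :: {-# UNPACK #-} !Int  -- word offset into the stream
  , posBit  :: {-# UNPACK #-} !Int  -- bit offset within that word
  }
```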
At a glance, I don't know how much

That's pretty svelte to begin with, and the
(see `llvm-pretty-bc-parser/src/Data/LLVM/BitCode/GetBits.hs`, lines 90 to 91 at `4b2e006`)

I'd be inclined to try optimizing
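On optimizing the bit-level input path: one baseline to compare `GetBits` against is a buffered bit cursor that keeps up to 64 bits in a `Word64` and touches the `ByteString` only on refill, instead of re-slicing it on every read. A rough sketch, not the actual `GetBits` implementation, and only valid for reads of up to 56 bits at a time:

```haskell
import qualified Data.ByteString as BS
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word64)

-- A cursor over the input: unread bytes, plus a small LSB-first
-- bit buffer refilled a byte at a time.
data BitCursor = BitCursor
  { bcInput :: !BS.ByteString  -- unread bytes
  , bcBuf   :: !Word64         -- buffered bits, LSB-first
  , bcCount :: !Int            -- number of valid bits in bcBuf
  }

-- Read n bits (n <= 56, so the refill shift below cannot overflow).
takeBits :: Int -> BitCursor -> Maybe (Word64, BitCursor)
takeBits n c@(BitCursor inp buf cnt)
  | n <= cnt  = Just ( buf .&. ((1 `shiftL` n) - 1)
                     , c { bcBuf = buf `shiftR` n, bcCount = cnt - n } )
  | otherwise = case BS.uncons inp of
      Nothing        -> Nothing  -- ran out of input
      Just (b, inp') -> takeBits n (BitCursor inp'
                          (buf .|. (fromIntegral b `shiftL` cnt))
                          (cnt + 8))
```

The point of the comparison would be to see whether the existing abstraction is paying for closure allocation per read that a tight loop like this avoids.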
One important element here is avoiding GHC 9.0.2:

gave me:

where
Wow, that's incredible. I always had an odd feeling about GHC 9.0.2's performance, and this is a strong piece of evidence to justify that feeling.
`llvm-disasm` is noticeably slower than its LLVM counterpart. This issue will attempt to keep track of data regarding this performance discrepancy and of attempts at improving the situation.

This is the current flame graph of running `llvm-disasm` on a ~30MB PX4 bitcode file generated by Clang 12:

There is an almost even divide between the time spent parsing and the time spent pretty-printing.
Parsing

It seems like one third of the time is spent gobbling bytes from the input stream. Perhaps one could check whether there is something inherently inefficient in the way we consume input.

The other two thirds seem dominated by `parseModuleBlock`:

From left to right, the columns correspond to `Data.LLVM.BitCode.IR.Function.finalizeStmt`, `Data.Generics.Uniplate.Internal.Data.descendBiData`, and `Data.LLVM.BitCode.IR.Function.parseFunctionBlockEntry`.
Pretty-printing

(NOTE: this graph has been obtained with a branch where I replaced the `pretty` package with `prettyprinter`, to see if it made a difference. It does not do much on its own.)

On the pretty-printing side, we only see details of two cost centres: `Prettyprinter.Render.Text.renderIO` and `Text.LLVM.PP.ppModule`.

The former seems unavoidable; the latter is decomposed as such:
I haven't looked into this yet, but those `ppDebugLoc'` and `ppDebugInfo'` sure make things slow; I wonder if there's something to look into there.

But, going back up a step, the pretty-printing flame graph also had this huge, unlabeled portion. My guess is that it likely corresponds to `Prettyprinter.layoutPretty`, since this is definitely called, doesn't appear elsewhere, and is likely heavy in computation. I'm not sure why it is not reported, though.

As mentioned, I tried to replace `pretty` with `prettyprinter`, and it's not much faster, even with an unbounded-width layout. It definitely goes faster when using `renderCompact`, but that one has very ugly output. I wonder if there's something we can do in between, given there's barely any interesting pretty-printing going on in our output.
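For the record, the two layout strategies being compared can be exercised side by side with `prettyprinter` >= 1.7, where the compact renderer is spelled `layoutCompact`; the `doc` below is a made-up stand-in for a rendered module:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Prettyprinter
import Prettyprinter.Render.Text (renderStrict)
import Data.Text (Text)

-- A small made-up document with some alignment.
doc :: Doc ann
doc = "define" <+> align (vsep ["%a = add i32 1, 2", "%b = mul i32 %a, 3"])

-- Unbounded width: never wraps, but still does indentation bookkeeping.
wide :: Text
wide = renderStrict (layoutPretty (LayoutOptions Unbounded) doc)

-- layoutCompact: a fast single pass that discards all alignment,
-- which is what makes its output ugly.
compact :: Text
compact = renderStrict (layoutCompact doc)
```

An "in between" renderer would presumably be a custom `SimpleDocStream` traversal that keeps line breaks but skips the width calculations, since the output here never needs reflowing.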