-
Notifications
You must be signed in to change notification settings - Fork 836
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
doc: propose design for indicating which tensors to compress #2700
doc: propose design for indicating which tensors to compress #2700
Conversation
This is the start of some documentation on the compression process and tools, which are in prototype and are being slowly merged (#2636). Currently, some of the design it discusses is proposed—comments welcome. Eventually that will be rewritten as descriptive once the design settles. |
1. Allow tensors to be excluded from consideration by a command-line option. | ||
|
||
1. Disable automatic discovery and take an explicit list of tensors to compress | ||
by command-line option. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach generally seems to be: we'll try to do it all for you, but give you an escape hatch if you need to override. While that might just work in most cases, I think I'd generally prefer to go the route of "compress the tensors we tell you to."
Every compressed tensor comes with a trade off of performance for size. We need the user to opt into this for every single time it is used, because only the user knows what the best choice is. While one could argue that they've already done that during the binning stage, there's also the possibility that the heuristic could pick up additional tensors that could be compressed just based on the number of unique values. Then we could compress tensors that weren't intended to be, and we've cost the user some additional performance.
I think a list of tensors to compress would generally be sufficient. I would be okay with defaulting the bit_width to the lowest value possible for the given number of unique values, but an override for that also seems like a nice to have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'll head in that direction: Require an explicit list of tensors via a configuration file (per comment below)
Since there's already a configuration file and no need to keep things simple enough for a command line argument, perhaps requiring explicit bit-width specifications likewise helps the user verify the result, and absolves the compression tool of knowing operator capabilities or imposing arbitrary limits in order to provide sanity checks. With the input model and the configuration are in separate files, there's the possibility of a mismatch. The compressor could give feedback like:
- error: tensor 1,33 has too many values to be compressed with 4-bit indices
- warning: tensor 0,55 was compressed with 8-bit indices, but could have been compressed with 2-bit indices
A use case for specifying a bit-widths larger than necessary is to measure the latency and size of larger widths without re-binning the model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. It's possible as models are updated that the tensor numbers may change. Perhaps we should consider using the tensor names as the identifiers?
## Alternative Designs | ||
|
||
1. The list of tensors to compress could be communicated via metadata added to | ||
the model by the binning stage, rather than via command-line options. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd generally recommend the tool be capable of being imported as a python module or called independently on a command line. In that sense, any configuration should be representable as python objects. A simple json or yaml config would likely be sufficient to populate those.
|
||
## Alternative Designs | ||
|
||
1. The list of tensors to compress could be communicated via metadata added to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing to watch for: multiple tensors could point at the same buffer. If that is the case, we need both tensors to be added to the list for compressing. I'd suggest throwing an error if we detect that not all tensors with the same buffer are being compressed.
Alternatively, we could compress by buffer index, but that's harder information to get.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rkuester A general comment: Alternate axis tensors also need to be handled. This could be done automatically by scanning the operator inputs for the specified tensor, to see if the operation (i.e. DEPTHWISE_CONV) requires special handling in constructing the value table. Or done automatically without consulting the operators. Instead just look at the tensor scale->size and quantized_dimension, and if the quantized_dimension is other than zero, then do alt-axis construction of the value table. The quantized_dimension == 0 being the norm for per-channel quantized tensors is an assumption though. Or it could just be done manually by an additional command line option (as is currently done). |
No description provided.