doc: propose design for indicating which tensors to compress #2700

tensorflow/lite/micro/compression/README.md (+91 lines)

# Design: Selecting Tensors to Compress

## Background

Some TFLM operators can read tensors which have been reduced in size by
*Lookup-Table (LUT) Compression*. To create such a compressed tensor, its
elements' values, which are ordinarily free to take any value encodable in the
data type of the tensor, are *binned*[^1] to a smaller set of carefully chosen
values. Those values are still encoded in the data type of the tensor. The
tensor is then *compressed* by replacing each element with an index into a
lookup table containing the smaller set of values. The indices are encoded in a
smaller data type than that of the original tensor; the resulting tensor is
therefore smaller---i.e., compressed.
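
As a rough illustration of the idea (not part of the design; the values and the
use of NumPy here are purely for demonstration), LUT compression of an
already-binned tensor amounts to:

```python
import numpy as np

# A binned int8 tensor: every element is one of four carefully chosen values.
binned = np.array([12, -5, 12, 30, -5, -77, 30, 12], dtype=np.int8)

# The lookup table holds the small set of values; each element is replaced by
# an index into that table.
lookup_table, indices = np.unique(binned, return_inverse=True)

# Four distinct values can be indexed with 2-bit integers, versus the 8 bits
# needed for each original element.
bit_width = int(np.ceil(np.log2(len(lookup_table))))

print(lookup_table)  # [-77  -5  12  30]
print(indices)       # [2 1 2 3 1 0 3 2]
print(bit_width)     # 2
```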

A model in which some tensors have been compressed, and with metadata added to
describe the compression, is a *compressed model*. A compressed model is no
longer readable by software which expects standard models in the TFLite
flatbuffer format.

It is useful to separate the creation of compressed models into the *binning*
stage and the *compression* stage. During the binning stage, the element values
are transformed in-place and retain their original data types, albeit
restricted to a smaller set of values of those data types. The output of this
stage, a *binned model*, remains in a standard format and therefore is
compatible with existing software. This is an advantage while developing a
model---choosing which tensors to compress, choosing the restricted set of
values for the bins of each compressed tensor, and testing the resulting model.
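
To make the distinction concrete, here is a minimal sketch of what a binning
stage might do to a single tensor. The bin values and nearest-value assignment
are only one possible binning scheme, assumed for illustration; the design does
not prescribe one:

```python
import numpy as np

def bin_tensor(values: np.ndarray, bins: np.ndarray) -> np.ndarray:
    """Replace each element with the nearest bin value, keeping the original dtype.

    1-D tensors only, for brevity.
    """
    # Distance from every element to every candidate bin value.
    distances = np.abs(values.astype(np.int32)[:, None] - bins.astype(np.int32)[None, :])
    nearest = bins[np.argmin(distances, axis=1)]
    return nearest.astype(values.dtype)

weights = np.array([11, -4, 13, 29, -6, -80, 31, 12], dtype=np.int8)
bins = np.array([-77, -5, 12, 30], dtype=np.int8)

binned = bin_tensor(weights, bins)
print(binned.dtype)  # int8 -- still a standard tensor, readable by existing tools
print(binned)        # [ 12  -5  12  30  -5 -77  30  12]
```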

A separate *compression* stage rewrites the binned model: it creates the lookup
tables, rewrites the binned tensors as indices into those tables, and adds
metadata to the model describing the compression. The result is a compressed
model, ready for use by the TFLM interpreter and other software that can
decompress compressed tensors as necessary.

In TFLM, only certain operators are capable of decompressing compressed tensors.
It is invalid to compress an input tensor of an operator which is not capable of
decompression.

## Problem Statement

The compression stage must discover or be told which tensors to compress.

## Proposed Design

The compression stage will:

1. By default, automatically compress any tensor that can be compressed.

1. Allow tensors to be excluded from consideration by a command-line option.

1. Allow automatic discovery to be disabled and an explicit list of tensors to
compress to be given by a command-line option.
**Review comment (Contributor):**

This approach generally seems to be: we'll try to do it all for you, but give you an escape hatch if you need to override. While that might just work in most cases, I think I'd generally prefer to go the route of "compress the tensors we tell you to."

Every compressed tensor comes with a trade-off of performance for size. We need the user to opt into this every single time it is used, because only the user knows what the best choice is. While one could argue that they've already done that during the binning stage, there's also the possibility that the heuristic could pick up additional tensors that could be compressed just based on the number of unique values. Then we could compress tensors that weren't intended to be compressed, and we've cost the user some additional performance.

I think a list of tensors to compress would generally be sufficient. I would be okay with defaulting the bit_width to the lowest value possible for the given number of unique values, but an override for that also seems like a nice-to-have.

**Reply (PR author):**

Okay, I'll head in that direction: require an explicit list of tensors via a configuration file (per the comment below).

Since there's already a configuration file, and no need to keep things simple enough for a command-line argument, perhaps requiring explicit bit-width specifications likewise helps the user verify the result, and absolves the compression tool of knowing operator capabilities or imposing arbitrary limits in order to provide sanity checks. With the input model and the configuration in separate files, there's the possibility of a mismatch. The compressor could give feedback like:

  • error: tensor 1,33 has too many values to be compressed with 4-bit indices
  • warning: tensor 0,55 was compressed with 8-bit indices, but could have been compressed with 2-bit indices

A use case for specifying a bit-width larger than necessary is to measure the latency and size of larger widths without re-binning the model.
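
A minimal sketch of the kind of check behind those messages, assuming a
configuration that requests a bit width per tensor (all names here are
illustrative, not the tool's actual interface):

```python
import math
import numpy as np

def check_bit_width(tensor_id: str, values: np.ndarray, requested_bits: int) -> None:
    """Compare a requested index bit width against what the tensor actually needs."""
    n_unique = len(np.unique(values))
    # Minimum bits needed to index n_unique lookup-table entries.
    needed_bits = max(1, math.ceil(math.log2(n_unique)))
    if needed_bits > requested_bits:
        print(f"error: tensor {tensor_id} has too many values to be compressed "
              f"with {requested_bits}-bit indices")
    elif needed_bits < requested_bits:
        print(f"warning: tensor {tensor_id} was compressed with {requested_bits}-bit "
              f"indices, but could have been compressed with {needed_bits}-bit indices")
```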

**Reply (Contributor):**

That's a good point. It's possible that the tensor numbers may change as models are updated. Perhaps we should consider using the tensor names as the identifiers?


The compression stage will automatically try to compress any tensor that is used
only as an input to operators which are capable of decompression. If the total
size of the lookup table plus the size of the tensor elements, rewritten as
indices into the lookup table, is smaller than the size of the original tensor,
the tensor will be compressed.
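
A sketch of that size comparison, assuming indices are packed at the chosen bit
width (illustrative only; the real accounting may differ):

```python
import math
import numpy as np

def worth_compressing(values: np.ndarray) -> bool:
    """True if lookup table plus packed indices would be smaller than the original tensor."""
    unique = np.unique(values)
    bits_per_index = max(1, math.ceil(math.log2(len(unique))))
    table_bytes = unique.size * unique.itemsize
    index_bytes = math.ceil(values.size * bits_per_index / 8)
    original_bytes = values.size * values.itemsize
    return table_bytes + index_bytes < original_bytes
```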

The binned values of a tensor are discovered heuristically, by gathering the
set of unique values into a lookup table.

The data type of the indices---i.e., the bit width of the unsigned integers used
for the indices---is determined by the number of unique values discovered in the
tensor as written by the binning stage. This width is constrained by the
implementation of the operators. If the set of unique values in a tensor cannot
be indexed by an integer with a bit width implemented by all the operators to
which the tensor is an input, the tensor will not be compressed.
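
A sketch of that constraint, assuming the set of index bit widths implemented
by the consuming operators is known (the supported widths below are
placeholders, not a statement about any particular TFLM kernel):

```python
import math
from typing import Optional
import numpy as np

def pick_index_width(values: np.ndarray,
                     supported_widths=(2, 4, 8)) -> Optional[int]:
    """Smallest supported bit width able to index the tensor's unique values,
    or None if no supported width suffices (the tensor stays uncompressed)."""
    needed = max(1, math.ceil(math.log2(len(np.unique(values)))))
    for width in sorted(supported_widths):
        if width >= needed:
            return width
    return None
```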

The compression stage will output, in addition to the compressed model, a
description of which tensors have been compressed and with what bit widths.

## Alternative Designs

1. The list of tensors to compress could be communicated via metadata added to
the model by the binning stage, rather than via command-line options.

**Review comment (Contributor):**

One thing to watch for: multiple tensors could point at the same buffer. If that is the case, we need both tensors to be added to the list for compression. I'd suggest throwing an error if we detect that not all tensors with the same buffer are being compressed.

Alternatively, we could compress by buffer index, but that's harder information to get.

**Review comment (Member):**

@rascani @rkuester I believe I already added code for handling multiple tensors pointing to the same buffer (I think I added that code; memory is hazy already).
**Review comment (Contributor):**

I'd generally recommend the tool be capable of being imported as a python module or called independently on a command line. In that sense, any configuration should be representable as python objects. A simple json or yaml config would likely be sufficient to populate those.


1. The list of tensors to compress could be communicated as string matches to
tensor names, rather than by index.

1. The binned values and/or bit-widths could be communicated instead of
automatically discovered.

1. The communication of options could be done via a configuration file rather
than via command-line option.
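
Following the reviewer's suggestion above that the tool be importable as a
Python module as well as runnable from the command line, with configuration
representable as plain Python objects populated from JSON or YAML, the
interface might be shaped roughly like this (every name here is hypothetical):

```python
import argparse
import json
from dataclasses import dataclass
from typing import List

@dataclass
class TensorSpec:
    """One tensor to compress, identified by subgraph and tensor index."""
    subgraph: int
    tensor: int
    bit_width: int

def load_specs(path: str) -> List[TensorSpec]:
    """Populate the Python-object configuration from a JSON file."""
    with open(path) as file:
        return [TensorSpec(**entry) for entry in json.load(file)]

def main() -> None:
    parser = argparse.ArgumentParser(
        description="Compress the listed tensors in a .tflite model")
    parser.add_argument("--model", required=True)
    parser.add_argument("--config", required=True,
                        help="JSON list of {subgraph, tensor, bit_width} entries")
    args = parser.parse_args()
    specs = load_specs(args.config)
    # A compress(model_path, specs) function would live in the importable module.
    print(f"would compress {len(specs)} tensors in {args.model}")

if __name__ == "__main__":
    main()
```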

---
[^1]: The word *quantization* is avoided and a different word, *binning*, is
    used, because quantization typically refers to mapping floating point
    values to the nearest point on a uniform grid of discrete values indexed
    by an integer data type; however, the general idea is similar.