Model Configuration

Is this your first time writing a config file? Check out this guide or this example!

Each model in a model repository must include a model configuration that provides required and optional information about the model. Typically, this configuration is provided in a config.pbtxt file specified as ModelConfig protobuf. In some cases, discussed in Auto-Generated Model Configuration, the model configuration can be generated automatically by Triton and so does not need to be provided explicitly.

This section describes the most important model configuration properties but the documentation in the ModelConfig protobuf should also be consulted.

Minimal Model Configuration

A minimal model configuration must specify the platform and/or backend properties, the max_batch_size property, and the input and output tensors of the model.

As an example consider a TensorRT model that has two inputs, input0 and input1, and one output, output0, all of which are 16 entry float32 tensors. The minimal configuration is:

  platform: "tensorrt_plan"
  max_batch_size: 8
  input [
    {
      name: "input0"
      data_type: TYPE_FP32
      dims: [ 16 ]
    },
    {
      name: "input1"
      data_type: TYPE_FP32
      dims: [ 16 ]
    }
  ]
  output [
    {
      name: "output0"
      data_type: TYPE_FP32
      dims: [ 16 ]
    }
  ]

Name, Platform and Backend

The model configuration name property is optional. If the name of the model is not specified in the configuration it is assumed to be the same as the model repository directory containing the model. If name is specified it must match the name of the model repository directory containing the model. The required values for platform and backend are described in the backend documentation.

Model Transaction Policy

The model_transaction_policy property describes the nature of transactions expected from the model.

Decoupled

This boolean setting indicates whether responses generated by the model are decoupled from the requests issued to it. With a decoupled model, the number of responses generated may differ from the number of requests issued, and the responses may arrive out of order relative to the requests. The default is false, which means the model will generate exactly one response for each request.
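
For example, a decoupled model would include the following setting in its configuration (a minimal sketch; the rest of the configuration is omitted):

  model_transaction_policy {
    decoupled: true
  }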

Maximum Batch Size

The max_batch_size property indicates the maximum batch size that the model supports for the types of batching that can be exploited by Triton. If the model's batch dimension is the first dimension, and all inputs and outputs to the model have this batch dimension, then Triton can use its dynamic batcher or sequence batcher to automatically use batching with the model. In this case max_batch_size should be set to a value greater-or-equal-to 1 that indicates the maximum batch size that Triton should use with the model.

For models that do not support batching, or do not support batching in the specific ways described above, max_batch_size must be set to zero.

Inputs and Outputs

Each model input and output must specify a name, datatype, and shape. The name specified for an input or output tensor must match the name expected by the model.

Special Conventions for PyTorch Backend

Naming Convention:

Due to the absence of sufficient metadata for inputs/outputs in TorchScript model files, the "name" attribute of inputs/outputs in the configuration must follow specific naming conventions. These are detailed below.

  1. [Only for Inputs] When the input is not a Dictionary of Tensors, the input names in the configuration file should mirror the names of the input arguments to the forward function in the model's definition.

For example, if the forward function for the TorchScript model was defined as forward(self, input0, input1), the first and second inputs should be named "input0" and "input1" respectively.

  2. <name>__<index>: Where <name> can be any string and <index> is an integer index that refers to the position of the corresponding input/output (see the sketch after this list).

This means that if there are two inputs and two outputs, the first and second inputs can be named "INPUT__0" and "INPUT__1" and the first and second outputs can be named "OUTPUT__0" and "OUTPUT__1" respectively.

  3. If not all inputs (or outputs) follow the same naming convention, then strict ordering is enforced from the model configuration, i.e. the order of inputs (or outputs) in the configuration is assumed to be the true ordering of those inputs.
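
For instance, a sketch of a configuration using the <name>__<index> convention for a TorchScript model with one input and one output could look like the following (the datatypes and shapes here are illustrative, not taken from a specific model):

  backend: "pytorch"
  max_batch_size: 8
  input [
    {
      name: "INPUT__0"
      data_type: TYPE_FP32
      dims: [ 16 ]
    }
  ]
  output [
    {
      name: "OUTPUT__0"
      data_type: TYPE_FP32
      dims: [ 4 ]
    }
  ]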

Dictionary of Tensors as Input:

The PyTorch backend supports passing of inputs to the model in the form of a Dictionary of Tensors. This is only supported when there is a single input to the model of type Dictionary that contains a mapping of string to tensor. As an example, if there is a model that expects the input of the form:

{'A': tensor1, 'B': tensor2}

The input names in the configuration in this case must not follow the <name>__<index> naming convention described above. Instead, each input name must match the dictionary key of the corresponding tensor. For this case, the inputs would be "A" and "B", where input "A" refers to the value corresponding to tensor1 and "B" refers to the value corresponding to tensor2.
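
A sketch of the corresponding input section for this example, assuming both tensors are 16-entry FP32 vectors, could be:

  input [
    {
      name: "A"
      data_type: TYPE_FP32
      dims: [ 16 ]
    },
    {
      name: "B"
      data_type: TYPE_FP32
      dims: [ 16 ]
    }
  ]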


The datatypes allowed for input and output tensors vary based on the type of the model. Section Datatypes describes the allowed datatypes and how they map to the datatypes of each model type.

An input shape indicates the shape of an input tensor expected by the model and by Triton in inference requests. An output shape indicates the shape of an output tensor produced by the model and returned by Triton in response to an inference request. Both input and output shape must have rank greater-or-equal-to 1, that is, the empty shape [ ] is not allowed.

Input and output shapes are specified by a combination of max_batch_size and the dimensions specified by the input or output dims property. For models with max_batch_size greater-than 0, the full shape is formed as [ -1 ] + dims. For models with max_batch_size equal to 0, the full shape is formed as dims. For example, for the following configuration the shape of "input0" is [ -1, 16 ] and the shape of "output0" is [ -1, 4 ].

  platform: "tensorrt_plan"
  max_batch_size: 8
  input [
    {
      name: "input0"
      data_type: TYPE_FP32
      dims: [ 16 ]
    }
  ]
  output [
    {
      name: "output0"
      data_type: TYPE_FP32
      dims: [ 4 ]
    }
  ]

For a configuration that is identical except that max_batch_size is equal to 0, the shape of "input0" is [ 16 ] and the shape of "output0" is [ 4 ].

  platform: "tensorrt_plan"
  max_batch_size: 0
  input [
    {
      name: "input0"
      data_type: TYPE_FP32
      dims: [ 16 ]
    }
  ]
  output [
    {
      name: "output0"
      data_type: TYPE_FP32
      dims: [ 4 ]
    }
  ]

For models that support input and output tensors with variable-size dimensions, those dimensions can be listed as -1 in the input and output configuration. For example, if a model requires a 2-dimensional input tensor where the first dimension must be size 4 but the second dimension can be any size, the model configuration for that input would include dims: [ 4, -1 ]. Triton would then accept inference requests where that input tensor's second dimension was any value greater-or-equal-to 0. The model configuration can be more restrictive than what is allowed by the underlying model. For example, even though the framework model itself allows the second dimension to be any size, the model configuration could be specified as dims: [ 4, 4 ]. In this case, Triton would only accept inference requests where the input tensor's shape was exactly [ 4, 4 ].
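
For instance, the variable-size input described above (first dimension fixed at 4, second dimension variable) could be declared as follows; the tensor name and datatype are illustrative:

  input [
    {
      name: "input0"
      data_type: TYPE_FP32
      dims: [ 4, -1 ]
    }
  ]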

The reshape property must be used if there is a mismatch between the input shape that Triton receives in an inference request and the input shape expected by the model. Similarly, the reshape property must be used if there is a mismatch between the output shape produced by the model and the shape that Triton returns in a response to an inference request.

Model inputs can specify allow_ragged_batch to indicate that the input is a ragged input. The field is used with the dynamic batcher to allow batching without enforcing that the input has the same shape in all requests.
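
For example, a ragged input could be declared as follows (a sketch; the tensor name, datatype and shape are illustrative):

  input [
    {
      name: "input0"
      data_type: TYPE_FP32
      dims: [ -1 ]
      allow_ragged_batch: true
    }
  ]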

Auto-Generated Model Configuration

The model configuration file containing the required settings must be available with each model to be deployed on Triton. In some cases the required portions of the model configuration can be generated automatically by Triton. The required portions of the model configuration are the settings shown in the Minimal Model Configuration. By default, Triton will try to complete these sections. However, by starting Triton with the --disable-auto-complete-config option, Triton can be configured not to auto-complete the model configuration on the backend side. Even with this option, Triton will still fill in missing instance_group settings with default values.

Triton can derive all the required settings automatically for most TensorRT, TensorFlow saved-model, ONNX, and OpenVINO models. For Python models, the auto_complete_config function can be implemented in the Python backend to provide the max_batch_size, input and output properties using the set_max_batch_size, add_input, and add_output functions. These properties allow Triton to load the Python model with a Minimal Model Configuration in the absence of a configuration file. All other model types must provide a model configuration file.

When developing a custom backend, you can populate required settings in the configuration and call the TRITONBACKEND_ModelSetConfig API to update the completed configuration with the Triton core. You can look at the TensorFlow and Onnxruntime backends as examples of how to achieve this. Currently, only the inputs, outputs, max_batch_size and dynamic batching settings can be populated by the backend. For custom backends, your config.pbtxt file must include a backend field or your model name must be in the form <model_name>.<backend_name>.

You can also see the model configuration generated for a model by Triton using the model configuration endpoint. The easiest way to do this is to use a utility like curl:

$ curl localhost:8000/v2/models/<model name>/config

This will return a JSON representation of the generated model configuration. From this you can take the max_batch_size, inputs, and outputs sections of the JSON and convert it to a config.pbtxt file. Triton only generates the minimal portion of the model configuration. You must still provide the optional portions of the model configuration by editing the config.pbtxt file.

Custom Model Configuration

Sometimes, when multiple devices running Triton instances share one model repository, it is necessary to configure models differently on each platform in order to achieve the best performance. Triton allows users to select a custom model configuration name by setting the --model-config-name option.

For example, when running ./tritonserver --model-repository=</path/to/model/repository> --model-config-name=h100, the server will search for the custom configuration file h100.pbtxt under the /path/to/model/repository/<model-name>/configs directory for each model that is loaded. If h100.pbtxt exists, it will be used as the configuration for this model. Otherwise, the default configuration /path/to/model/repository/<model-name>/config.pbtxt or the auto-generated model configuration will be selected based on the settings.

Custom model configuration also works with Explicit and Poll model control modes. Users may delete or add new custom configurations and the server will pick the configuration file for each loaded model dynamically.

Note: the custom model configuration name should not contain any space characters.

Example 1: --model-config-name=h100

.
└── model_repository/
    ├── model_a/
    │   ├── configs/
    │   │   ├── v100.pbtxt
    │   │   └── **h100.pbtxt**
    │   └── config.pbtxt
    ├── model_b/
    │   ├── configs/
    │   │   └── v100.pbtxt
    │   └── **config.pbtxt**
    └── model_c/
        ├── configs/
        │   └── config.pbtxt
        └── **config.pbtxt**

Example 2: --model-config-name=config

.
└── model_repository/
    ├── model_a/
    │   ├── configs/
    │   │   ├── v100.pbtxt
    │   │   └── h100.pbtxt
    │   └── **config.pbtxt**
    ├── model_b/
    │   ├── configs/
    │   │   └── v100.pbtxt
    │   └── **config.pbtxt**
    └── model_c/
        ├── configs/
        │   └── **config.pbtxt**
        └── config.pbtxt

Example 3: --model-config-name not set

.
└── model_repository/
    ├── model_a/
    │   ├── configs/
    │   │   ├── v100.pbtxt
    │   │   └── h100.pbtxt
    │   └── **config.pbtxt**
    ├── model_b/
    │   ├── configs/
    │   │   └── v100.pbtxt
    │   └── **config.pbtxt**
    └── model_c/
        ├── configs/
        │   └── config.pbtxt
        └── **config.pbtxt**

Default Max Batch Size and Dynamic Batcher

When a model is using the auto-complete feature, a default maximum batch size may be set by using the --backend-config=default-max-batch-size=<int> command line argument. This allows all models which are capable of batching and which make use of Auto-Generated Model Configuration to have a default maximum batch size. This value is set to 4 by default. Backend developers may make use of this default-max-batch-size by obtaining it from the TRITONBACKEND_BackendConfig API. Currently, the backends that utilize these default batch values and turn on dynamic batching in their generated model configurations are:

  1. TensorFlow backend
  2. Onnxruntime backend
  3. TensorRT backend
    1. TensorRT models store the maximum batch size explicitly and do not make use of the default-max-batch-size parameter. However, if max_batch_size > 1 and no scheduler is provided, the dynamic batch scheduler will be enabled.

If a value greater than 1 for the maximum batch size is set for the model, the dynamic_batching config will be set if no scheduler is provided in the configuration file.
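
As an illustrative sketch, an auto-completed configuration produced under the default settings would therefore contain values equivalent to:

  max_batch_size: 4
  dynamic_batching { }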

Datatypes

The following table shows the tensor datatypes supported by Triton. The first column shows the name of the datatype as it appears in the model configuration file. The next four columns show the corresponding datatype for supported model frameworks. If a model framework does not have an entry for a given datatype, then Triton does not support that datatype for that model. The sixth column, labeled "API", shows the corresponding datatype for the TRITONSERVER C API, TRITONBACKEND C API, HTTP/REST protocol and GRPC protocol. The last column shows the corresponding datatype for the Python numpy library.

| Model Config | TensorRT | TensorFlow | ONNX Runtime | PyTorch | API | NumPy |
|--------------|----------|------------|--------------|---------|-----|-------|
| TYPE_BOOL    | kBOOL    | DT_BOOL    | BOOL         | kBool   | BOOL   | bool |
| TYPE_UINT8   | kUINT8   | DT_UINT8   | UINT8        | kByte   | UINT8  | uint8 |
| TYPE_UINT16  |          | DT_UINT16  | UINT16       |         | UINT16 | uint16 |
| TYPE_UINT32  |          | DT_UINT32  | UINT32       |         | UINT32 | uint32 |
| TYPE_UINT64  |          | DT_UINT64  | UINT64       |         | UINT64 | uint64 |
| TYPE_INT8    | kINT8    | DT_INT8    | INT8         | kChar   | INT8   | int8 |
| TYPE_INT16   |          | DT_INT16   | INT16        | kShort  | INT16  | int16 |
| TYPE_INT32   | kINT32   | DT_INT32   | INT32        | kInt    | INT32  | int32 |
| TYPE_INT64   | kINT64   | DT_INT64   | INT64        | kLong   | INT64  | int64 |
| TYPE_FP16    | kHALF    | DT_HALF    | FLOAT16      |         | FP16   | float16 |
| TYPE_FP32    | kFLOAT   | DT_FLOAT   | FLOAT        | kFloat  | FP32   | float32 |
| TYPE_FP64    |          | DT_DOUBLE  | DOUBLE       | kDouble | FP64   | float64 |
| TYPE_STRING  |          | DT_STRING  | STRING       |         | BYTES  | dtype(object) |
| TYPE_BF16    | kBF16    |            |              |         | BF16   | |

For TensorRT each value is in the nvinfer1::DataType namespace. For example, nvinfer1::DataType::kFLOAT is the 32-bit floating-point datatype.

For TensorFlow each value is in the tensorflow namespace. For example, tensorflow::DT_FLOAT is the 32-bit floating-point value.

For ONNX Runtime each value is prepended with ONNX_TENSOR_ELEMENT_DATA_TYPE_. For example, ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT is the 32-bit floating-point datatype.

For PyTorch each value is in the torch namespace. For example, torch::kFloat is the 32-bit floating-point datatype.

For Numpy each value is in the numpy module. For example, numpy.float32 is the 32-bit floating-point datatype.

Reshape

The ModelTensorReshape property on a model configuration input or output is used to indicate that the input or output shape accepted by the inference API differs from the input or output shape expected or produced by the underlying framework model or custom backend.

For an input, reshape can be used to reshape the input tensor to a different shape expected by the framework or backend. A common use-case is where a model that supports batching expects a batched input to have shape [ batch-size ], which means that the batch dimension fully describes the shape. For the inference API the equivalent shape [ batch-size, 1 ] must be specified since each input must specify a non-empty dims. For this case the input should be specified as:

  input [
    {
      name: "in"
      dims: [ 1 ]
      reshape: { shape: [ ] }
    }
  ]

For an output, reshape can be used to reshape the output tensor produced by the framework or backend to a different shape that is returned by the inference API. A common use-case is where a model that supports batching expects a batched output to have shape [ batch-size ], which means that the batch dimension fully describes the shape. For the inference API the equivalent shape [ batch-size, 1 ] must be specified since each output must specify a non-empty dims. For this case the output should be specified as:

  output [
    {
      name: "out"
      dims: [ 1 ]
      reshape: { shape: [ ] }
    }
  ]

Shape Tensors

For models that support shape tensors, the is_shape_tensor property must be set appropriately for inputs and outputs that are acting as shape tensors. The following shows an example configuration that specifies shape tensors.

  name: "myshapetensormodel"
  platform: "tensorrt_plan"
  max_batch_size: 8
  input [
    {
      name: "input0"
      data_type: TYPE_FP32
      dims: [ 1, 3 ]
    },
    {
      name: "input1"
      data_type: TYPE_INT32
      dims: [ 2 ]
      is_shape_tensor: true
    }
  ]
  output [
    {
      name: "output0"
      data_type: TYPE_FP32
      dims: [ 1, 3 ]
    }
  ]

As discussed above, Triton assumes that batching occurs along the first dimension which is not listed in the input or output tensor dims. However, for shape tensors, batching occurs at the first shape value. For the above example, an inference request must provide inputs with the following shapes.

  "input0": [ x, 1, 3]
  "input1": [ 3 ]
  "output0": [ x, 1, 3]

Where x is the batch size of the request. Triton requires the shape tensors to be marked as shape tensors in the model when using batching. Note that "input1" has shape [ 3 ] and not [ 2 ], which is how it is described in the model configuration. Because myshapetensormodel is a batching model, the batch size should be provided as an additional value. Triton will accumulate all the shape values together for "input1" in the batch dimension before issuing the request to the model.

For example, assume the client sends the following three requests to Triton with the following inputs:

Request1:
input0: [[[1,2,3]]] <== shape of this tensor [1,1,3]
input1: [1,4,6] <== shape of this tensor [3]

Request2:
input0: [[[4,5,6]], [[7,8,9]]] <== shape of this tensor [2,1,3]
input1: [2,4,6] <== shape of this tensor [3]

Request3:
input0: [[[10,11,12]]] <== shape of this tensor [1,1,3]
input1: [1,4,6] <== shape of this tensor [3]

Assuming these requests get batched together, they would be delivered to the model as:

Batched Requests to model:
input0: [[[1,2,3]], [[4,5,6]], [[7,8,9]], [[10,11,12]]] <== shape of this tensor [4,1,3]
input1: [4, 4, 6] <== shape of this tensor [3]

Currently, only TensorRT supports shape tensors. Read Shape Tensor I/O to learn more about shape tensors.

Non-Linear I/O Formats

For models that process input or output data in non-linear formats, the is_non_linear_format_io property must be set. The following example model configuration shows how to specify that INPUT0 and INPUT1 use non-linear I/O data formats.

  name: "mytensorrtmodel"
  platform: "tensorrt_plan"
  max_batch_size: 8
  input [
    {
      name: "INPUT0"
      data_type: TYPE_FP16
      dims: [ 3,224,224 ]
      is_non_linear_format_io: true
    },
    {
      name: "INPUT1"
      data_type: TYPE_FP16
      dims: [ 3,224,224 ]
      is_non_linear_format_io: true
    }
  ]
  output [
    {
      name: "OUTPUT0"
      data_type: TYPE_FP16
      dims: [ 1, 3 ]
    }
  ]

Currently, only TensorRT supports this property. To learn more about I/O formats, refer to the I/O Formats documentation.

Version Policy

Each model can have one or more versions. The ModelVersionPolicy property of the model configuration is used to set one of the following policies.

  • All: All versions of the model that are available in the model repository are available for inferencing. version_policy: { all: {}}

  • Latest: Only the latest ‘n’ versions of the model in the repository are available for inferencing. The latest versions of the model are the numerically greatest version numbers. version_policy: { latest: { num_versions: 2}}

  • Specific: Only the specifically listed versions of the model are available for inferencing. version_policy: { specific: { versions: [1,3]}}

If no version policy is specified, then Latest (with n=1) is used as the default, indicating that only the most recent version of the model is made available by Triton. In all cases, the addition or removal of version subdirectories from the model repository can change which model version is used on subsequent inference requests.
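
In other words, the default behavior is equivalent to explicitly specifying:

  version_policy: { latest: { num_versions: 1 }}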

The following configuration specifies that all versions of the model will be available from the server.

  platform: "tensorrt_plan"
  max_batch_size: 8
  input [
    {
      name: "input0"
      data_type: TYPE_FP32
      dims: [ 16 ]
    },
    {
      name: "input1"
      data_type: TYPE_FP32
      dims: [ 16 ]
    }
  ]
  output [
    {
      name: "output0"
      data_type: TYPE_FP32
      dims: [ 16 ]
    }
  ]
  version_policy: { all { }}

Instance Groups

Triton can provide multiple instances of a model so that multiple inference requests for that model can be handled simultaneously. The model configuration ModelInstanceGroup property is used to specify the number of execution instances that should be made available and what compute resource should be used for those instances.

Multiple Model Instances

By default, a single execution instance of the model is created for each GPU available in the system. The instance-group setting can be used to place multiple execution instances of a model on every GPU or on only certain GPUs. For example, the following configuration will place two execution instances of the model on each system GPU.

  instance_group [
    {
      count: 2
      kind: KIND_GPU
    }
  ]

And the following configuration will place one execution instance on GPU 0 and two execution instances on GPUs 1 and 2.

  instance_group [
    {
      count: 1
      kind: KIND_GPU
      gpus: [ 0 ]
    },
    {
      count: 2
      kind: KIND_GPU
      gpus: [ 1, 2 ]
    }
  ]

For a more detailed example of using instance groups, see this guide.

CPU Model Instance

The instance group setting is also used to enable execution of a model on the CPU. A model can be executed on the CPU even if there is a GPU available in the system. The following places two execution instances on the CPU.

  instance_group [
    {
      count: 2
      kind: KIND_CPU
    }
  ]

If no count is specified for a KIND_CPU instance group, then the default instance count will be 2 for selected backends (Tensorflow and Onnxruntime). All other backends will default to 1.

Host Policy

The instance group setting is associated with a host policy. The following configuration will associate all instances created by the instance group setting with host policy "policy_0". By default the host policy is set according to the device kind of the instance: KIND_CPU is "cpu", KIND_MODEL is "model", and KIND_GPU is "gpu_<gpu_id>".

  instance_group [
    {
      count: 2
      kind: KIND_CPU
      host_policy: "policy_0"
    }
  ]

Rate Limiter Configuration

The instance group optionally specifies a rate limiter configuration which controls how the rate limiter operates on the instances in the group. The rate limiter configuration is ignored if rate limiting is off. If rate limiting is on and an instance_group does not provide this configuration, then the execution of the model instances belonging to this group will not be limited in any way by the rate limiter. The configuration includes the following specifications:

Resources

The set of resources required to execute a model instance. The "name" field identifies the resource and the "count" field refers to the number of copies of the resource that the model instance in the group requires to run. The "global" field specifies whether the resource is per-device or shared globally across the system. Loaded models cannot specify a resource with the same name both as global and non-global. If no resources are provided, Triton assumes that the execution of the model instance does not require any resources and will start executing as soon as a model instance is available.

Priority

Priority serves as a weighting value to be used for prioritizing across all the instances of all the models. An instance with priority 2 will be given 1/2 the number of scheduling chances as an instance with priority 1.

The following example specifies that the instances in the group require four "R1" and two "R2" resources for execution. Resource "R2" is a global resource. Additionally, the rate-limiter priority of the instance_group is 2.

  instance_group [
    {
      count: 1
      kind: KIND_GPU
      gpus: [ 0, 1, 2 ]
      rate_limiter {
        resources [
          {
            name: "R1"
            count: 4
          },
          {
            name: "R2"
            global: true
            count: 2
          }
        ]
        priority: 2
      }
    }
  ]

The above configuration creates 3 model instances, one on each device (0, 1 and 2). The three instances will not contend for "R1" among themselves because "R1" is local to their own device; however, they will contend for "R2" because it is specified as a global resource, which means "R2" is shared across the system. Though these instances don't contend for "R1" among themselves, they will contend for "R1" with other model instances that include "R1" in their resource requirements and run on the same device as them.

Ensemble Model Instance Groups

Ensemble models are an abstraction Triton uses to execute a user-defined pipeline of models. Since there is no physical instance associated with an ensemble model, the instance_group field cannot be specified for it.

However, each composing model that makes up an ensemble can specify instance_group in its config file and individually support parallel execution as described above when the ensemble receives multiple requests.

CUDA Compute Capability

Similar to the default_model_filename field, you can optionally specify the cc_model_filenames field to map the GPU's CUDA Compute Capability to a corresponding model filename at model load time. This is particularly useful for TensorRT models, since they are generally tied to a specific compute capability.

cc_model_filenames [
  {
    key: "7.5"
    value: "resnet50_T4.plan"
  },
  {
    key: "8.0"
    value: "resnet50_A100.plan"
  }
]

Optimization Policy

The model configuration ModelOptimizationPolicy property is used to specify optimization and prioritization settings for a model. These settings control if/how a model is optimized by the backend and how it is scheduled and executed by Triton. See the ModelConfig protobuf and optimization documentation for the currently available settings.

Model Warmup

When a model is loaded by Triton the corresponding backend initializes for that model. For some backends, some or all of this initialization is deferred until the model receives its first inference request (or first few inference requests). As a result, the first (few) inference requests can be significantly slower due to deferred initialization.

To avoid these initial, slow inference requests, Triton provides a configuration option that enables a model to be "warmed up" so that it is completely initialized before the first inference request is received. When the ModelWarmup property is defined in a model configuration, Triton will not show the model as being ready for inference until model warmup has completed.

The model configuration ModelWarmup is used to specify warmup settings for a model. The settings define a series of inference requests that Triton will create to warm up each model instance. A model instance will be served only if it completes the requests successfully. Note that the effect of warming up models varies depending on the framework backend, and it will cause Triton to be less responsive to model updates, so users should experiment and choose the configuration that suits their needs. See the ModelWarmup protobuf documentation for the currently available settings, and L0_warmup for examples of specifying different variants of warmup samples.
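
For example, a warmup sample that sends a single zero-filled request to each model instance could be specified as follows (a sketch, assuming a model with one FP32 input of shape [ 16 ]):

  model_warmup [
    {
      name: "zero_value_warmup"
      batch_size: 1
      inputs {
        key: "input0"
        value: {
          data_type: TYPE_FP32
          dims: [ 16 ]
          zero_data: true
        }
      }
    }
  ]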

Response Cache

The model configuration response_cache section has an enable boolean used to enable the Response Cache for this model.

response_cache {
  enable: true
}

In addition to enabling the cache in the model config, a --cache-config must be specified when starting the server to enable caching on the server-side. See the Response Cache doc for more details on enabling server-side caching.
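
For example, the local response cache could be enabled on the server side with a command line similar to the following (a sketch; the cache name and size value are illustrative):

$ tritonserver --model-repository=</path/to/model/repository> --cache-config local,size=1048576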