Pre-commit fixes
indrajit96 committed Oct 8, 2024
1 parent 66481c4 commit 2a20922
Showing 9 changed files with 60 additions and 30 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -65,7 +65,7 @@ repos:
- id: check-json
- id: check-toml
- id: check-yaml
exclude: ^Deployment/Kubernetes/[^/]+/chart/templates/.+$
exclude: ^Deployment/Kubernetes/.+$
- id: check-shebang-scripts-are-executable
- id: end-of-file-fixer
types_or: [c, c++, cuda, proto, textproto, java, python]
@@ -1,7 +1,7 @@
# Steps to set up cluster

In this guide we will set up the Kubernetes cluster for the deployment of LLMs using Triton Server and TRT-LLM.
*
In this guide we will set up the Kubernetes cluster for the deployment of LLMs using Triton Server and TRT-LLM.
*
## 1. Add node label and taint

As first step we will add node labels and taints
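
A rough sketch of what this step involves is shown below; the label and taint keys are hypothetical placeholders, so substitute the ones this guide actually defines.

```
# Hypothetical example: mark a GPU node and keep pods without a matching toleration off it.
kubectl label node <NODE_NAME> nvidia.com/gpu=present
kubectl taint node <NODE_NAME> nvidia.com/gpu=present:NoSchedule
```
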
@@ -98,7 +98,7 @@ In you local browser, you should be able to see metrics in `localhost:8080`.

## 7. Install Prometheus Adapter

This allows the Triton metrics collected by Prometheus server to be available to Kuberntes' Horizontal Pod Autoscaler service.
This allows the Triton metrics collected by Prometheus server to be available to Kubernetes' Horizontal Pod Autoscaler service.

```
helm install -n monitoring prometheus-adapter prometheus-community/prometheus-adapter \
@@ -125,7 +125,7 @@ This generates custom metrics from a formula that uses the Triton metrics collec
kubectl apply -f triton-metrics_prometheus-rule.yaml
```
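
As an optional sanity check (a minimal sketch, assuming `jq` is installed), you can list what the Prometheus Adapter publishes through the Kubernetes custom metrics API; once Triton pods are being scraped, the metrics generated by the rule applied above should appear here:

```
# List the custom metric names the adapter exposes to the Horizontal Pod Autoscaler.
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r '.resources[].name'
```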

At this point, all metrics components should have been installed. All metrics including Triton metrics, DCGM metrics, and custom metrics should be availble to Prometheus server now. You can verify by showing all metrics in Prometheus server:
At this point, all metrics components should have been installed. All metrics including Triton metrics, DCGM metrics, and custom metrics should be available to Prometheus server now. You can verify by showing all metrics in Prometheus server:

```
kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 8080:9090
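
# A minimal sketch of a follow-up check (assumes the port-forward above is running in another
# terminal and jq is installed; metric names can vary by Triton/DCGM version):
curl -s 'http://localhost:8080/api/v1/query?query=nv_inference_request_success' | jq .
curl -s 'http://localhost:8080/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL' | jq .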
@@ -87,7 +87,7 @@ trtllm-build --checkpoint_dir ./converted_checkpoint \
--use_custom_all_reduce disable \ # only disable on non-NVLink machines like g5.12xlarge
--max_input_len 2048 \
--max_output_len 2048 \
--max_batch_size 4
--max_batch_size 4
```

### c. Prepare the Triton model repository
@@ -108,7 +108,7 @@ python3 tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton
```

> [!Note]
> Be sure to substitute the correct values for `<PATH_TO_TOKENIZER>` and `<PATH_TO_ENGINES>` in the example above. Keep in mind that the tokenizer, the TRT-LLM engines, and the Triton model repository shoudl be in a shared file storage between your nodes. They're required to launch your model in Triton. For example, if using AWS EFS, the values for `<PATH_TO_TOKENIZER>` and `<PATH_TO_ENGINES>` should be respect to the actutal EFS mount path. This is determined by your persistent-volume claim and mount path in chart/templates/deployment.yaml. Make sure that your nodes are able to access these files.
> Be sure to substitute the correct values for `<PATH_TO_TOKENIZER>` and `<PATH_TO_ENGINES>` in the example above. Keep in mind that the tokenizer, the TRT-LLM engines, and the Triton model repository should be in a shared file storage between your nodes. They're required to launch your model in Triton. For example, if using AWS EFS, the values for `<PATH_TO_TOKENIZER>` and `<PATH_TO_ENGINES>` should be respect to the actutal EFS mount path. This is determined by your persistent-volume claim and mount path in chart/templates/deployment.yaml. Make sure that your nodes are able to access these files.
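
As a minimal sketch of the kind of check the note above implies, assuming a hypothetical EFS mount at `/mnt/efs` (use the mount path from your persistent-volume claim and `chart/templates/deployment.yaml`, and your real directory names):

```
# Hypothetical layout; every node must see the same files at the same shared-storage paths.
EFS_ROOT=/mnt/efs
ls "${EFS_ROOT}/tokenizer" "${EFS_ROOT}/engines" "${EFS_ROOT}/triton_model_repo"
```
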
## 3. Create `example_values.yaml` file for deployment

@@ -177,7 +177,7 @@ kubectl logs --follow leaderworkerset-sample-0
You should output something similar to below:

```
I0717 23:01:28.501008 300 server.cc:674]
I0717 23:01:28.501008 300 server.cc:674]
+----------------+---------+--------+
| Model | Version | Status |
+----------------+---------+--------+
@@ -187,7 +187,7 @@ I0717 23:01:28.501073 300 tritonserver.cc:2579]
| tensorrt_llm | 1 | READY |
+----------------+---------+--------+
I0717 23:01:28.501073 300 tritonserver.cc:2579]
I0717 23:01:28.501073 300 tritonserver.cc:2579]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
@@ -347,9 +347,9 @@ kubectl logs -f $(kubectl get pods | grep launcher | cut -d ' ' -f 1)
You should output something similar to below (example of 2 x g5.12xlarge):

```
[1,0]<stdout>:# out-of-place in-place
[1,0]<stdout>:# out-of-place in-place
[1,0]<stdout>:# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
[1,0]<stdout>:# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
[1,0]<stdout>:# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
[1,0]<stdout>: 8 2 float sum -1[1,0]<stdout>: 99.10 0.00 0.00 0[1,0]<stdout>: 100.6 0.00 0.00 0
[1,0]<stdout>: 16 4 float sum -1[1,0]<stdout>: 103.4 0.00 0.00 0[1,0]<stdout>: 102.5 0.00 0.00 0
[1,0]<stdout>: 32 8 float sum -1[1,0]<stdout>: 103.5 0.00 0.00 0[1,0]<stdout>: 102.5 0.00 0.00 0
@@ -429,7 +429,7 @@ genai-perf \
You should output something similar to below (example of Mixtral 8x7B on 2 x g5.12xlarge):

```
LLM Metrics
LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
@@ -6,7 +6,7 @@ We have 1 pod per node, so the main challenge in deploying models that require m

1. **LeaderWorkerSet for launching Triton+TRT-LLM on groups of pods:** To launch Triton and TRT-LLM across nodes you use MPI to have one node launch TRT-LLM processes on all the nodes (including itself) that will make up one instance of the model. Doing this requires knowing the hostnames of all involved nodes. Consequently we need to spawn groups of pods and know which model instance group they belong to. To achieve this we use [LeaderWorkerSet](https://github.com/kubernetes-sigs/lws/tree/main), which lets us create "megapods" that consist of a group of pods - one leader pod and a specified number of worker pods - and provides pod labels identifying group membership. We configure the LeaderWorkerSet and launch Triton+TRT-LLM via MPI in [`deployment.yaml`](multinode_helm_chart/chart/templates/deployment.yaml) and [server.py](multinode_helm_chart/containers/server.py).
2. **Gang Scheduling:** Gang scheduling simply means ensuring all pods that make up a model instance are ready before Triton+TRT-LLM is launched. We show how to use `kubessh` to achieve this in the `wait_for_workers` function of [server.py](multinode_helm_chart/containers/server.py).
3. **Autoscaling:** By default the Horizontal Pod Autoscaler (HPA) scales individual pods, but LeaderWorkerSet makes it possible to scale each "megapod". However, since these are GPU workloads we don't want to use cpu and host memory usage for autoscaling. We show how to leverage the metrics Triton Server exposes through Prometheus and set up GPU utilization recording rules in [`triton-metrics_prometheus-rule.yaml`](multinode_helm_chart/triton-metrics_prometheus-rule.yaml). We also demonstrate how to properly set up PodMonitors and an HPA in [`pod-monitor.yaml`](multinode_helm_chart/chart/templates/pod-monitor.yaml) and [`hpa.yaml`](multinode_helm_chart/chart/templates/hpa.yaml) (the key is to only scrape metrics from the leader pods). Instructions for properly setting up Prometheus and exposing GPU metrics are found in [Configure EKS Cluster and Install Dependencies](https://github.com/Wenhan-Tan/EKS_Multinode_Triton_TRTLLM/blob/main/Cluster_Setup_Steps.md). To enable deployment to dynamically add more nodes in reponse to HPA, we also setup [Cluster Autoscaler](https://github.com/Wenhan-Tan/EKS_Multinode_Triton_TRTLLM/blob/main/Cluster_Setup_Steps.md#10-install-cluster-autoscaler)
3. **Autoscaling:** By default the Horizontal Pod Autoscaler (HPA) scales individual pods, but LeaderWorkerSet makes it possible to scale each "megapod". However, since these are GPU workloads we don't want to use cpu and host memory usage for autoscaling. We show how to leverage the metrics Triton Server exposes through Prometheus and set up GPU utilization recording rules in [`triton-metrics_prometheus-rule.yaml`](multinode_helm_chart/triton-metrics_prometheus-rule.yaml). We also demonstrate how to properly set up PodMonitors and an HPA in [`pod-monitor.yaml`](multinode_helm_chart/chart/templates/pod-monitor.yaml) and [`hpa.yaml`](multinode_helm_chart/chart/templates/hpa.yaml) (the key is to only scrape metrics from the leader pods). Instructions for properly setting up Prometheus and exposing GPU metrics are found in [Configure EKS Cluster and Install Dependencies](https://github.com/Wenhan-Tan/EKS_Multinode_Triton_TRTLLM/blob/main/Cluster_Setup_Steps.md). To enable deployment to dynamically add more nodes in response to HPA, we also setup [Cluster Autoscaler](https://github.com/Wenhan-Tan/EKS_Multinode_Triton_TRTLLM/blob/main/Cluster_Setup_Steps.md#10-install-cluster-autoscaler)
4. **LoadBalancer Setup:** Although there are multiple pods in each instance of the model, only one pod within each group accepts requests. We show how to correctly set up a LoadBalancer Service to allow external clients to submit requests in [`service.yaml`](multinode_helm_chart/chart/templates/service.yaml)
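
A small sketch of how this grouping looks at runtime, assuming the LeaderWorkerSet controller is installed and the `leaderworkerset-sample` name used later in this guide:

```
# Each group is one model instance: a leader pod plus worker pods sharing a group key label.
kubectl get pods -L leaderworkerset.sigs.k8s.io/group-key
# Only the leader (e.g. leaderworkerset-sample-0) launches Triton via MPI and accepts requests.
kubectl logs --follow leaderworkerset-sample-0
```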


@@ -19,7 +19,7 @@ helm install efa ./aws-efa-k8s-device-plugin -n kube-system

# Configuration

Paramter | Description | Default
Parameter | Description | Default
--- | --- | ---
`image.repository` | EFA image repository | `602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/aws-efa-k8s-device-plugin`
`image.tag` | EFA image tag | `v0.5.3`
@@ -31,7 +31,7 @@ Paramter | Description | Default
`nodeSelector` | Node labels for pod assignment | `{}`
`tolerations` | Optional deployment tolerations | `[]`
`additionalPodAnnotations` | Pod annotations to apply in addition to the default ones | `{}`
`additionalPodLabels` | Pod labels to apply in addition to the defualt ones | `{}`
`additionalPodLabels` | Pod labels to apply in addition to the default ones | `{}`
`nameOverride` | Override the name of the chart | `""`
`fullnameOverride` | Override the full name of the chart | `""`
`imagePullSecrets` | Docker registry pull secret | `[]`
@@ -127,7 +127,7 @@
},
"required": [
"image",
"triton_model_repo_path"
"triton_model_repo_path"
],
"type": "object"
},
@@ -17,7 +17,7 @@

# Container Generation

The files in this folder are intended to be used to create the custom container image for multi-node Triton + TRT-LLM EKS deployment including installation of EFA componenets.
The files in this folder are intended to be used to create the custom container image for multi-node Triton + TRT-LLM EKS deployment including installation of EFA components.

Run the following command to create the container image.

@@ -26,6 +26,7 @@
EXIT_SUCCESS = 0
DELAY_BETWEEN_QUERIES = 2


def die(exit_code: int):
if exit_code is None:
exit_code = ERROR_CODE_FATAL
@@ -36,10 +36,17 @@ def die(exit_code: int):

exit(exit_code)


def parse_arguments():
parser = argparse.ArgumentParser()
parser.add_argument("mode", type=str, choices=["leader", "worker"])
parser.add_argument("--triton_model_repo_dir", type=str, default=None,required=True,help="Directory that contains Triton Model Repo to be served")
parser.add_argument(
"--triton_model_repo_dir",
type=str,
default=None,
required=True,
help="Directory that contains Triton Model Repo to be served",
)
parser.add_argument("--pp", type=int, default=1, help="Pipeline parallelism.")
parser.add_argument("--tp", type=int, default=1, help="Tensor parallelism.")
parser.add_argument("--iso8601", action="count", default=0)
@@ -55,11 +63,19 @@ def parse_arguments():
type=int,
help="How many gpus are in each pod/node (We launch one pod per node). Only required in leader mode.",
)
parser.add_argument("--stateful_set_group_key",type=str,default=None,help="Value of leaderworkerset.sigs.k8s.io/group-key, Leader uses this to gang schedule and its only needed in leader mode")
parser.add_argument("--enable_nsys", action="store_true", help="Enable Triton server profiling")
parser.add_argument(
"--stateful_set_group_key",
type=str,
default=None,
help="Value of leaderworkerset.sigs.k8s.io/group-key, Leader uses this to gang schedule and its only needed in leader mode",
)
parser.add_argument(
"--enable_nsys", action="store_true", help="Enable Triton server profiling"
)

return parser.parse_args()


def run_command(cmd_args: [str], omit_args: [int] = None):
command = ""

@@ -75,10 +91,12 @@ def run_command(cmd_args: [str], omit_args: [int] = None):

return subprocess.call(cmd_args, stderr=sys.stderr, stdout=sys.stdout)


def signal_handler(sig, frame):
write_output(f"Signal {sig} detected, quitting.")
exit(EXIT_SUCCESS)


def wait_for_workers(num_total_pod: int, args):
if num_total_pod is None or num_total_pod <= 0:
raise RuntimeError("Argument `world_size` must be greater than zero.")
@@ -131,14 +149,19 @@ def wait_for_workers(num_total_pod: int, args):

return workers


def write_output(message: str):
print(message, file=sys.stdout, flush=True)


def write_error(message: str):
print(message, file=sys.stderr, flush=True)


def do_leader(args):
write_output(f"Server is assuming each node has {args.gpu_per_node} GPUs. To change this, use --gpu_per_node")
write_output(
f"Server is assuming each node has {args.gpu_per_node} GPUs. To change this, use --gpu_per_node"
)

world_size = args.tp * args.pp

@@ -152,9 +175,11 @@ def do_leader(args):
workers = wait_for_workers(world_size / args.gpu_per_node, args)

if len(workers) != (world_size / args.gpu_per_node):
write_error(f"fatal: {len(workers)} found, expected {world_size / args.gpu_per_node}.")
write_error(
f"fatal: {len(workers)} found, expected {world_size / args.gpu_per_node}."
)
die(ERROR_EXIT_DELAY)

workers_with_mpi_slots = [worker + f":{args.gpu_per_node}" for worker in workers]

if args.enable_nsys:
@@ -241,17 +266,21 @@ def do_leader(args):

exit(result)


def do_worker(args):
signal.signal(signal.SIGINT, signal_handler)
signal.signal(signal.SIGTERM, signal_handler)

write_output("Worker paused awaiting SIGINT or SIGTERM.")
signal.pause()


def main():
write_output("Reporting system information.")
run_command(["whoami"])
run_command(["cgget", "-n", "--values-only", "--variable memory.limit_in_bytes", "/"])
run_command(
["cgget", "-n", "--values-only", "--variable memory.limit_in_bytes", "/"]
)
run_command(["nvidia-smi"])

args = parse_arguments()
@@ -275,5 +304,6 @@ def main():
write_error(f' Supported values are "init" or "exec".')
die(ERROR_CODE_USAGE)

if __name__ == '__main__':

if __name__ == "__main__":
main()
@@ -15,19 +15,19 @@ vpc:
public:
us-east-1a:
id: $PLACEHOLDER_SUBNET_PUBLIC_1

clusterEndpoints:
privateAccess: true
publicAccess: true

cloudwatch:
clusterLogging:
enableTypes: ["*"]
enableTypes: ["*"]

iam:
withOIDC: true


managedNodeGroups:
- name: cpu-node-group
instanceType: c5.2xlarge
@@ -45,7 +45,7 @@ managedNodeGroups:
albIngress: true
- name: gpu-compute-node-group
instanceType: p5.48xlarge
instancePrefix: trtllm-compute-node
instancePrefix: trtllm-compute-node
privateNetworking: true
efaEnabled: true
minSize: 0
