Pre-commit fixes
indrajit96 committed Oct 8, 2024
1 parent 66481c4 commit 2a20922
Showing 9 changed files with 60 additions and 30 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -65,7 +65,7 @@ repos:
- id: check-json
- id: check-toml
- id: check-yaml
exclude: ^Deployment/Kubernetes/[^/]+/chart/templates/.+$
exclude: ^Deployment/Kubernetes/.+$
- id: check-shebang-scripts-are-executable
- id: end-of-file-fixer
types_or: [c, c++, cuda, proto, textproto, java, python]
@@ -1,7 +1,7 @@
# Steps to set up cluster

In this guide we will set up the Kubernetes cluster for the deployment of LLMs using Triton Server and TRT-LLM.
*
In this guide we will set up the Kubernetes cluster for the deployment of LLMs using Triton Server and TRT-LLM.
*
## 1. Add node label and taint

As first step we will add node labels and taints
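
A rough sketch of what this step involves is shown below; the label and taint keys are hypothetical placeholders, so substitute the ones this guide actually defines.

```
# Hypothetical example: mark a GPU node and keep pods without a matching toleration off it.
kubectl label node <NODE_NAME> nvidia.com/gpu=present
kubectl taint node <NODE_NAME> nvidia.com/gpu=present:NoSchedule
```
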
@@ -98,7 +98,7 @@ In you local browser, you should be able to see metrics in `localhost:8080`.

## 7. Install Prometheus Adapter

This allows the Triton metrics collected by Prometheus server to be available to Kuberntes' Horizontal Pod Autoscaler service.
This allows the Triton metrics collected by Prometheus server to be available to Kubernetes' Horizontal Pod Autoscaler service.

```
helm install -n monitoring prometheus-adapter prometheus-community/prometheus-adapter \
@@ -125,7 +125,7 @@ This generates custom metrics from a formula that uses the Triton metrics collec
kubectl apply -f triton-metrics_prometheus-rule.yaml
```
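
As an optional sanity check (a minimal sketch, assuming `jq` is installed), you can list what the Prometheus Adapter publishes through the Kubernetes custom metrics API; once Triton pods are being scraped, the metrics generated by the rule applied above should appear here:

```
# List the custom metric names the adapter exposes to the Horizontal Pod Autoscaler.
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r '.resources[].name'
```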

At this point, all metrics components should have been installed. All metrics including Triton metrics, DCGM metrics, and custom metrics should be availble to Prometheus server now. You can verify by showing all metrics in Prometheus server:
At this point, all metrics components should have been installed. All metrics including Triton metrics, DCGM metrics, and custom metrics should be available to Prometheus server now. You can verify by showing all metrics in Prometheus server:

```
kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 8080:9090
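
# A minimal sketch of a follow-up check (assumes the port-forward above is running in another
# terminal and jq is installed; metric names can vary by Triton/DCGM version):
curl -s 'http://localhost:8080/api/v1/query?query=nv_inference_request_success' | jq .
curl -s 'http://localhost:8080/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL' | jq .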
@@ -87,7 +87,7 @@ trtllm-build --checkpoint_dir ./converted_checkpoint \
--use_custom_all_reduce disable \ # only disable on non-NVLink machines like g5.12xlarge
--max_input_len 2048 \
--max_output_len 2048 \
--max_batch_size 4
--max_batch_size 4
```

### c. Prepare the Triton model repository
@@ -108,7 +108,7 @@ python3 tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton
```

> [!Note]
> Be sure to substitute the correct values for `<PATH_TO_TOKENIZER>` and `<PATH_TO_ENGINES>` in the example above. Keep in mind that the tokenizer, the TRT-LLM engines, and the Triton model repository shoudl be in a shared file storage between your nodes. They're required to launch your model in Triton. For example, if using AWS EFS, the values for `<PATH_TO_TOKENIZER>` and `<PATH_TO_ENGINES>` should be respect to the actutal EFS mount path. This is determined by your persistent-volume claim and mount path in chart/templates/deployment.yaml. Make sure that your nodes are able to access these files.
> Be sure to substitute the correct values for `<PATH_TO_TOKENIZER>` and `<PATH_TO_ENGINES>` in the example above. Keep in mind that the tokenizer, the TRT-LLM engines, and the Triton model repository should be in a shared file storage between your nodes. They're required to launch your model in Triton. For example, if using AWS EFS, the values for `<PATH_TO_TOKENIZER>` and `<PATH_TO_ENGINES>` should be respect to the actutal EFS mount path. This is determined by your persistent-volume claim and mount path in chart/templates/deployment.yaml. Make sure that your nodes are able to access these files.
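
As a minimal sketch of the kind of check the note above implies, assuming a hypothetical EFS mount at `/mnt/efs` (use the mount path from your persistent-volume claim and `chart/templates/deployment.yaml`, and your real directory names):

```
# Hypothetical layout; every node must see the same files at the same shared-storage paths.
EFS_ROOT=/mnt/efs
ls "${EFS_ROOT}/tokenizer" "${EFS_ROOT}/engines" "${EFS_ROOT}/triton_model_repo"
```
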
## 3. Create `example_values.yaml` file for deployment

@@ -177,7 +177,7 @@ kubectl logs --follow leaderworkerset-sample-0
You should output something similar to below:

```
I0717 23:01:28.501008 300 server.cc:674]
I0717 23:01:28.501008 300 server.cc:674]
+----------------+---------+--------+
| Model | Version | Status |
+----------------+---------+--------+
@@ -187,7 +187,7 @@ I0717 23:01:28.501073 300 tritonserver.cc:2579]
| tensorrt_llm | 1 | READY |
+----------------+---------+--------+
I0717 23:01:28.501073 300 tritonserver.cc:2579]
I0717 23:01:28.501073 300 tritonserver.cc:2579]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
@@ -347,9 +347,9 @@ kubectl logs -f $(kubectl get pods | grep launcher | cut -d ' ' -f 1)
You should output something similar to below (example of 2 x g5.12xlarge):

```
[1,0]<stdout>:# out-of-place in-place
[1,0]<stdout>:# out-of-place in-place
[1,0]<stdout>:# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
[1,0]<stdout>:# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
[1,0]<stdout>:# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
[1,0]<stdout>: 8 2 float sum -1[1,0]<stdout>: 99.10 0.00 0.00 0[1,0]<stdout>: 100.6 0.00 0.00 0
[1,0]<stdout>: 16 4 float sum -1[1,0]<stdout>: 103.4 0.00 0.00 0[1,0]<stdout>: 102.5 0.00 0.00 0
[1,0]<stdout>: 32 8 float sum -1[1,0]<stdout>: 103.5 0.00 0.00 0[1,0]<stdout>: 102.5 0.00 0.00 0
@@ -429,7 +429,7 @@ genai-perf \
You should output something similar to below (example of Mixtral 8x7B on 2 x g5.12xlarge):

```
LLM Metrics
LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
@@ -6,7 +6,7 @@ We have 1 pod per node, so the main challenge in deploying models that require m

1. **LeaderWorkerSet for launching Triton+TRT-LLM on groups of pods:** To launch Triton and TRT-LLM across nodes you use MPI to have one node launch TRT-LLM processes on all the nodes (including itself) that will make up one instance of the model. Doing this requires knowing the hostnames of all involved nodes. Consequently we need to spawn groups of pods and know which model instance group they belong to. To achieve this we use [LeaderWorkerSet](https://github.com/kubernetes-sigs/lws/tree/main), which lets us create "megapods" that consist of a group of pods - one leader pod and a specified number of worker pods - and provides pod labels identifying group membership. We configure the LeaderWorkerSet and launch Triton+TRT-LLM via MPI in [`deployment.yaml`](multinode_helm_chart/chart/templates/deployment.yaml) and [server.py](multinode_helm_chart/containers/server.py).
2. **Gang Scheduling:** Gang scheduling simply means ensuring all pods that make up a model instance are ready before Triton+TRT-LLM is launched. We show how to use `kubessh` to achieve this in the `wait_for_workers` function of [server.py](multinode_helm_chart/containers/server.py).
3. **Autoscaling:** By default the Horizontal Pod Autoscaler (HPA) scales individual pods, but LeaderWorkerSet makes it possible to scale each "megapod". However, since these are GPU workloads we don't want to use cpu and host memory usage for autoscaling. We show how to leverage the metrics Triton Server exposes through Prometheus and set up GPU utilization recording rules in [`triton-metrics_prometheus-rule.yaml`](multinode_helm_chart/triton-metrics_prometheus-rule.yaml). We also demonstrate how to properly set up PodMonitors and an HPA in [`pod-monitor.yaml`](multinode_helm_chart/chart/templates/pod-monitor.yaml) and [`hpa.yaml`](multinode_helm_chart/chart/templates/hpa.yaml) (the key is to only scrape metrics from the leader pods). Instructions for properly setting up Prometheus and exposing GPU metrics are found in [Configure EKS Cluster and Install Dependencies](https://github.com/Wenhan-Tan/EKS_Multinode_Triton_TRTLLM/blob/main/Cluster_Setup_Steps.md). To enable deployment to dynamically add more nodes in reponse to HPA, we also setup [Cluster Autoscaler](https://github.com/Wenhan-Tan/EKS_Multinode_Triton_TRTLLM/blob/main/Cluster_Setup_Steps.md#10-install-cluster-autoscaler)
3. **Autoscaling:** By default the Horizontal Pod Autoscaler (HPA) scales individual pods, but LeaderWorkerSet makes it possible to scale each "megapod". However, since these are GPU workloads we don't want to use cpu and host memory usage for autoscaling. We show how to leverage the metrics Triton Server exposes through Prometheus and set up GPU utilization recording rules in [`triton-metrics_prometheus-rule.yaml`](multinode_helm_chart/triton-metrics_prometheus-rule.yaml). We also demonstrate how to properly set up PodMonitors and an HPA in [`pod-monitor.yaml`](multinode_helm_chart/chart/templates/pod-monitor.yaml) and [`hpa.yaml`](multinode_helm_chart/chart/templates/hpa.yaml) (the key is to only scrape metrics from the leader pods). Instructions for properly setting up Prometheus and exposing GPU metrics are found in [Configure EKS Cluster and Install Dependencies](https://github.com/Wenhan-Tan/EKS_Multinode_Triton_TRTLLM/blob/main/Cluster_Setup_Steps.md). To enable deployment to dynamically add more nodes in response to HPA, we also setup [Cluster Autoscaler](https://github.com/Wenhan-Tan/EKS_Multinode_Triton_TRTLLM/blob/main/Cluster_Setup_Steps.md#10-install-cluster-autoscaler)
4. **LoadBalancer Setup:** Although there are multiple pods in each instance of the model, only one pod within each group accepts requests. We show how to correctly set up a LoadBalancer Service to allow external clients to submit requests in [`service.yaml`](multinode_helm_chart/chart/templates/service.yaml)
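
A small sketch of how this grouping looks at runtime, assuming the LeaderWorkerSet controller is installed and the `leaderworkerset-sample` name used later in this guide:

```
# Each group is one model instance: a leader pod plus worker pods sharing a group key label.
kubectl get pods -L leaderworkerset.sigs.k8s.io/group-key
# Only the leader (e.g. leaderworkerset-sample-0) launches Triton via MPI and accepts requests.
kubectl logs --follow leaderworkerset-sample-0
```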


@@ -19,7 +19,7 @@ helm install efa ./aws-efa-k8s-device-plugin -n kube-system

# Configuration

Paramter | Description | Default
Parameter | Description | Default
--- | --- | ---
`image.repository` | EFA image repository | `602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/aws-efa-k8s-device-plugin`
`image.tag` | EFA image tag | `v0.5.3`
@@ -31,7 +31,7 @@ Paramter | Description | Default
`nodeSelector` | Node labels for pod assignment | `{}`
`tolerations` | Optional deployment tolerations | `[]`
`additionalPodAnnotations` | Pod annotations to apply in addition to the default ones | `{}`
`additionalPodLabels` | Pod labels to apply in addition to the defualt ones | `{}`
`additionalPodLabels` | Pod labels to apply in addition to the default ones | `{}`
`nameOverride` | Override the name of the chart | `""`
`fullnameOverride` | Override the full name of the chart | `""`
`imagePullSecrets` | Docker registry pull secret | `[]`
@@ -127,7 +127,7 @@
},
"required": [
"image",
"triton_model_repo_path"
"triton_model_repo_path"
],
"type": "object"
},
@@ -17,7 +17,7 @@

# Container Generation

The files in this folder are intended to be used to create the custom container image for multi-node Triton + TRT-LLM EKS deployment including installation of EFA componenets.
The files in this folder are intended to be used to create the custom container image for multi-node Triton + TRT-LLM EKS deployment including installation of EFA components.

Run the following command to create the container image.

@@ -26,6 +26,7 @@
EXIT_SUCCESS = 0
DELAY_BETWEEN_QUERIES = 2


def die(exit_code: int):
if exit_code is None:
exit_code = ERROR_CODE_FATAL
@@ -36,10 +36,17 @@ def die(exit_code: int):

exit(exit_code)


def parse_arguments():
parser = argparse.ArgumentParser()
parser.add_argument("mode", type=str, choices=["leader", "worker"])
parser.add_argument("--triton_model_repo_dir", type=str, default=None,required=True,help="Directory that contains Triton Model Repo to be served")
parser.add_argument(
"--triton_model_repo_dir",
type=str,
default=None,
required=True,
help="Directory that contains Triton Model Repo to be served",
)
parser.add_argument("--pp", type=int, default=1, help="Pipeline parallelism.")
parser.add_argument("--tp", type=int, default=1, help="Tensor parallelism.")
parser.add_argument("--iso8601", action="count", default=0)
@@ -55,11 +63,19 @@ def parse_arguments():
type=int,
help="How many gpus are in each pod/node (We launch one pod per node). Only required in leader mode.",
)
parser.add_argument("--stateful_set_group_key",type=str,default=None,help="Value of leaderworkerset.sigs.k8s.io/group-key, Leader uses this to gang schedule and its only needed in leader mode")
parser.add_argument("--enable_nsys", action="store_true", help="Enable Triton server profiling")
parser.add_argument(
"--stateful_set_group_key",
type=str,
default=None,
help="Value of leaderworkerset.sigs.k8s.io/group-key, Leader uses this to gang schedule and its only needed in leader mode",
)
parser.add_argument(
"--enable_nsys", action="store_true", help="Enable Triton server profiling"
)

return parser.parse_args()


def run_command(cmd_args: [str], omit_args: [int] = None):
command = ""

@@ -75,10 +91,12 @@ def run_command(cmd_args: [str], omit_args: [int] = None):

return subprocess.call(cmd_args, stderr=sys.stderr, stdout=sys.stdout)


def signal_handler(sig, frame):
write_output(f"Signal {sig} detected, quitting.")
exit(EXIT_SUCCESS)


def wait_for_workers(num_total_pod: int, args):
if num_total_pod is None or num_total_pod <= 0:
raise RuntimeError("Argument `world_size` must be greater than zero.")
@@ -131,14 +149,19 @@ def wait_for_workers(num_total_pod: int, args):

return workers


def write_output(message: str):
print(message, file=sys.stdout, flush=True)


def write_error(message: str):
print(message, file=sys.stderr, flush=True)


def do_leader(args):
write_output(f"Server is assuming each node has {args.gpu_per_node} GPUs. To change this, use --gpu_per_node")
write_output(
f"Server is assuming each node has {args.gpu_per_node} GPUs. To change this, use --gpu_per_node"
)

world_size = args.tp * args.pp

@@ -152,9 +175,11 @@ def do_leader(args):
workers = wait_for_workers(world_size / args.gpu_per_node, args)

if len(workers) != (world_size / args.gpu_per_node):
write_error(f"fatal: {len(workers)} found, expected {world_size / args.gpu_per_node}.")
write_error(
f"fatal: {len(workers)} found, expected {world_size / args.gpu_per_node}."
)
die(ERROR_EXIT_DELAY)

workers_with_mpi_slots = [worker + f":{args.gpu_per_node}" for worker in workers]

if args.enable_nsys:
@@ -241,17 +266,21 @@ def do_leader(args):

exit(result)


def do_worker(args):
signal.signal(signal.SIGINT, signal_handler)
signal.signal(signal.SIGTERM, signal_handler)

write_output("Worker paused awaiting SIGINT or SIGTERM.")
signal.pause()


def main():
write_output("Reporting system information.")
run_command(["whoami"])
run_command(["cgget", "-n", "--values-only", "--variable memory.limit_in_bytes", "/"])
run_command(
["cgget", "-n", "--values-only", "--variable memory.limit_in_bytes", "/"]
)
run_command(["nvidia-smi"])

args = parse_arguments()
@@ -275,5 +304,6 @@ def main():
write_error(f' Supported values are "init" or "exec".')
die(ERROR_CODE_USAGE)

if __name__ == '__main__':

if __name__ == "__main__":
main()
@@ -15,19 +15,19 @@ vpc:
public:
us-east-1a:
id: $PLACEHOLDER_SUBNET_PUBLIC_1

clusterEndpoints:
privateAccess: true
publicAccess: true

cloudwatch:
clusterLogging:
enableTypes: ["*"]
enableTypes: ["*"]

iam:
withOIDC: true


managedNodeGroups:
- name: cpu-node-group
instanceType: c5.2xlarge
@@ -45,7 +45,7 @@ managedNodeGroups:
albIngress: true
- name: gpu-compute-node-group
instanceType: p5.48xlarge
instancePrefix: trtllm-compute-node
instancePrefix: trtllm-compute-node
privateNetworking: true
efaEnabled: true
minSize: 0
