I would like to ask: I have two virtual machines, each with a single NVIDIA A10 GPU. I modified the source code, started a Ray cluster across the two machines, and launched the API server with the command below, but I get the following error. What is the problem? Thank you very much.
python -m distserve.api_server.distserve_api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model openai-community/gpt2 \
    --tokenizer openai-community/gpt2 \
    --context-tensor-parallel-size 1 \
    --context-pipeline-parallel-size 1 \
    --decoding-tensor-parallel-size 1 \
    --decoding-pipeline-parallel-size 1 \
    --block-size 16 \
    --max-num-blocks-per-req 128 \
    --gpu-memory-utilization 0.95 \
    --swap-space 16 \
    --context-sched-policy fcfs \
    --context-max-batch-size 128 \
    --context-max-tokens-per-batch 8192 \
    --decoding-sched-policy fcfs \
    --decoding-max-batch-size 1024 \
    --decoding-max-tokens-per-batch 65536
Currently DistServe relies on cudaIpcMemHandle for KV cache transfer. For any given layer, DistServe requires the corresponding prefill and decoding instances to reside on two GPUs on the same node, and those GPUs must be connected via NVLink (not really sure about this last part).
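For intuition, here is a minimal, self-contained sketch (not DistServe code) of handing a CUDA tensor between two processes on the same node. PyTorch's multiprocessing support serializes CUDA tensors via cudaIpcGetMemHandle / cudaIpcOpenMemHandle under the hood, which is the same mechanism mentioned above; the handle is only meaningful to processes on the same physical host, so a prefill GPU on one VM cannot pass its KV cache to a decoding GPU on another VM this way. The kv_block tensor below is a hypothetical stand-in for a KV cache block.

import torch
import torch.multiprocessing as mp


def consumer(queue):
    # Receiving the tensor makes the child process open the cudaIpcMemHandle
    # exported by the parent; this only works on the same node.
    kv_block = queue.get()
    print("consumer sees:", kv_block.sum().item())


if __name__ == "__main__":
    mp.set_start_method("spawn")
    queue = mp.Queue()
    p = mp.Process(target=consumer, args=(queue,))
    p.start()

    # Stand-in for a KV cache block produced by the prefill side.
    kv_block = torch.ones(16, 64, device="cuda")
    # The tensor is passed as an IPC handle, not copied over a network.
    queue.put(kv_block)
    p.join()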