Extremely long context causes the service to hang #2584
Comments
Received. We'll look into how to handle this problem.
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 5 days since being marked as stale.
Same problem here. Has it been resolved?
System Info / 系統信息
registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:v0.16.3
Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?
Version info / 版本信息
0.16.3
The command used to start Xinference / 用以启动 xinference 的命令
When I use the vLLM engine and the input context is too long, the error below is raised; the model then hangs, cannot be deleted, and all subsequent requests get no response.
Reproduction / 复现过程
```bash
xinference launch --model-name glm4-chat-1m \
  --model-type LLM \
  --model-uid glm4-chat \
  --model_path /models/glm-4-9b-chat \
  --model-engine 'vllm' \
  --model-format 'pytorch' \
  --quantization None \
  --n-gpu 2 \
  --gpu-idx "0,1" \
  --max_num_seqs 256 \
  --tensor_parallel_size 2 \
  --gpu_memory_utilization 0.95
```
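
A minimal request sketch that should trigger the hang described above, assuming Xinference's default OpenAI-compatible endpoint on port 9997; the host, API key placeholder, and repetition count are illustrative, not taken from the original report:

```python
# Hypothetical reproduction: send a prompt far beyond the model's context
# window to the OpenAI-compatible endpoint that Xinference exposes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-used")

resp = client.chat.completions.create(
    model="glm4-chat",  # the --model-uid used at launch
    messages=[{"role": "user", "content": "词 " * 2_000_000}],  # over-long input
)
print(resp.choices[0].message.content)
```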
Expected behavior / 期待表现
I know this is not a bug in Xinference itself, but I'd like to kill the problem at the root. Ideally, when the context is too long, the serving layer would either return an "input too long" error directly or truncate the input in order.
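
Until the serving layer handles this, one workaround is a client-side guard. The sketch below is an assumption-laden example, not Xinference's API: the tokenizer path reuses the model path from the launch command, and the context limit and function name are hypothetical.

```python
# Client-side guard (a sketch): count tokens with the model's own tokenizer
# and truncate or reject the prompt before it ever reaches the server.
from transformers import AutoTokenizer

MAX_CONTEXT = 131072  # assumed serving limit; set to the deployed model's window

tokenizer = AutoTokenizer.from_pretrained(
    "/models/glm-4-9b-chat", trust_remote_code=True
)

def guard_prompt(prompt: str, truncate: bool = True) -> str:
    ids = tokenizer.encode(prompt)
    if len(ids) <= MAX_CONTEXT:
        return prompt
    if not truncate:
        raise ValueError(f"prompt too long: {len(ids)} > {MAX_CONTEXT} tokens")
    # "truncate in order": keep the head of the prompt, drop the tail
    return tokenizer.decode(ids[:MAX_CONTEXT])
```

Alternatively, capping vLLM's `max_model_len` at launch (if the installed Xinference version forwards it as an extra engine argument, like `--max_num_seqs` above) should make vLLM reject over-long prompts up front; whether that rejection then propagates cleanly instead of hanging the worker is exactly what this issue is about.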