Prerequisites:
Register a Hugging Face account and obtain an access token (HUGGING_FACE_HUB_TOKEN); substitute it for the changeme placeholder in the command below.
Run vLLM:
docker run --name LocalvLLM_qwen1.5B_Int4 --runtime nvidia --gpus all -v D:/vLLM/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=changeme" --env "HUGGINGFACE_CO_URL_HOME=https://hf-mirror.com/" --env "_HF_DEFAULT_ENDPOINT=https://hf-mirror.com" --env "HF_ENDPOINT=https://hf-mirror.com" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4 --gpu-memory-utilization=1 --max-model-len 4096
Test:
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
{"id":"cmpl-e6c75e13fd784f08b764aee18f325f65","object":"text_completion","created":1741695843,"model":"Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4","choices":[{"index":0,"text":" city with a rich history and culture","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":4,"total_tokens":11,"completion_tokens":7,"prompt_tokens_details":null}}
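The response above follows the OpenAI-compatible text-completion schema. A minimal Python sketch for extracting the generated text and token usage from that exact response body (no server required; the JSON is copied from the curl output):

```python
import json

# Response body returned by the vLLM /v1/completions endpoint above,
# copied verbatim from the curl output.
response_body = '''{"id":"cmpl-e6c75e13fd784f08b764aee18f325f65","object":"text_completion","created":1741695843,"model":"Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4","choices":[{"index":0,"text":" city with a rich history and culture","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":4,"total_tokens":11,"completion_tokens":7,"prompt_tokens_details":null}}'''

resp = json.loads(response_body)

# The generated continuation lives in choices[0].text.
print(resp["choices"][0]["text"])          # → " city with a rich history and culture"
# finish_reason "length" means generation stopped at max_tokens (7 here).
print(resp["choices"][0]["finish_reason"]) # → "length"
print(resp["usage"]["completion_tokens"])  # → 7
```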
* If GPU memory is insufficient, reduce the maximum context length via --max-model-len and use a quantized model variant.
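For example, a lower-memory variant of the docker run command above might look like the following. This is an illustrative sketch, not a tested configuration: the 2048 context length and 0.9 utilization are assumed values, and the model here is already the GPTQ-Int4 quantized variant.

```shell
# Illustrative low-VRAM variant: smaller context window (--max-model-len 2048)
# and some headroom in GPU memory (--gpu-memory-utilization=0.9).
docker run --name LocalvLLM_qwen1.5B_Int4 --runtime nvidia --gpus all \
  -v D:/vLLM/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=changeme" \
  --env "HF_ENDPOINT=https://hf-mirror.com" \
  -p 8000:8000 --ipc=host vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4 \
  --gpu-memory-utilization=0.9 \
  --max-model-len 2048
```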
References: