Xorbits Inference（Xinference）：一款性能强大且功能全面的大模型部署与分布式推理框架

大模型部署与分布式推理框架Xinference

Xinference的基本使用
- 概述
- 安装
- 启动服务
- 模型部署
- 模型参数配置说明
API接口
- 概述
- 对话接口
- 模型列表
- 嵌入模型
- Rerank模型
- 使用Xinference SDK
- 使用OpenAI SDK
命令行工具
- 概述
- 启动模型
- 引擎参数
- 其他操作
集成LoRA
- 启动时集成LoRA
- 应用时集成LoRA
部署其他模型
- 视觉模型
- - 部署
  - 使用Web
  - 使用API
- Embedding模型
- - 部署
  - 使用API
- Rerank模型
- - 部署
  - 使用API
- 图像模型
- - 部署
  - 使用Web
  - 使用API
- 语音模型
- - 部署
  - 使用API
- 自定义模型
异常
- 异常1
- 异常2
- 异常3

Xinference的基本使用

概述

Xorbits Inference（Xinference）是一个性能强大且功能全面的分布式推理框架。可用于大语言模型（LLM），语音识别模型，多模态模型等各种模型的推理。通过Xorbits Inference，你可以轻松地一键部署你自己的模型或内置的前沿开源模型。

GitHub：https://github.com/xorbitsai/inference

官方文档：https://inference.readthedocs.io/zh-cn/latest/index.html

安装

Xinference 在 Linux, Windows, MacOS 上都可以通过 pip 来安装。如果需要使用Xinference进行模型推理，可以根据不同的模型指定不同的引擎。

目前Xinference支持以下推理引擎：

vllm
sglang
llama.cpp
transformers

创建一个xinference虚拟环境，使用Python版本3.10

conda create -n xinference python=3.10

如果希望能够推理所有支持的模型，可以用以下命令安装所有需要的依赖：

pip install "xinference[all]"

使用其他引擎

# Transformers引擎
pip install "xinference[transformers]"

# vLLM 引擎
pip install "xinference[vllm]"

# Llama.cpp 引擎
# 初始步骤：
pip install xinference
# Apple M系列
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
# 英伟达显卡：
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
# AMD 显卡：
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python

# SGLang 引擎
pip install 'xinference[sglang]'

注意：

在执行安装Xinference过程中，可能会出现异常，可参考下文异常项的异常1、异常2，进行解决

启动服务

可以在本地运行Xinference，也可以使用Docker部署Xinference，甚至在集群环境中部署Xinference。这里采用本地运行Xinference。

执行以下命令启动本地的Xinference服务

xinference-local

xinference-local --host 0.0.0.0 --port 9997

启动日志如下：

(xinference) root@master:~# xinference-local --host 0.0.0.0 --port 9997
2024-07-22 06:24:11,551 xinference.core.supervisor 312280 INFO     Xinference supervisor 0.0.0.0:50699 started
2024-07-22 06:24:11,579 xinference.model.image.core 312280 WARNING  Cannot find builtin image model spec: stable-diffusion-inpainting
2024-07-22 06:24:11,579 xinference.model.image.core 312280 WARNING  Cannot find builtin image model spec: stable-diffusion-2-inpainting
2024-07-22 06:24:11,641 xinference.core.worker 312280 INFO     Starting metrics export server at 0.0.0.0:None
2024-07-22 06:24:11,644 xinference.core.worker 312280 INFO     Checking metrics export server...
2024-07-22 06:24:13,027 xinference.core.worker 312280 INFO     Metrics server is started at: http://0.0.0.0:35249
2024-07-22 06:24:13,029 xinference.core.worker 312280 INFO     Xinference worker 0.0.0.0:50699 started
2024-07-22 06:24:13,030 xinference.core.worker 312280 INFO     Purge cache directory: /root/.xinference/cache
2024-07-22 06:24:18,087 xinference.api.restful_api 311974 INFO     Starting Xinference at endpoint: http://0.0.0.0:9997
2024-07-22 06:24:18,535 uvicorn.error 311974 INFO     Uvicorn running on http://0.0.0.0:9997 (Press CTRL+C to quit)

注意：

Xinference默认使用`<HOME>/.xinference`作为主目录存储一些必要信息，如：日志文件和模型文件

通过配置环境变量`XINFERENCE_HOME`修改主目录， 比如：XINFERENCE_HOME=/tmp/xinference xinference-local --host 0.0.0.0 --port 9997

查看存储信息

(xinference) root@master:~# ls .xinference/
cache  logs
(xinference) root@master:~# ls .xinference/cache/
chatglm3-pytorch-6b
(xinference) root@master:~# ls .xinference/logs/
local_1721628924181  local_1721629451488  local_1721697225558  local_1721698858667

通过访问http://localhost:9777地址来使用Web GUI界面
在这里插入图片描述

通过访问http://localhost:9997/docs来查看 API 文档。
在这里插入图片描述

模型部署

1.搜索选择模型

点击Launch Model菜单，选择LANGUAGE MODELS标签，输入关键词以搜索需要部署的模型。这里以搜索ChatGLM3 模型为例。

在这里插入图片描述
2.模型参数配置

模型的具体参数配置参考下文：模型参数配置说明
在这里插入图片描述

3.开始部署模型

模型参数填写完成后，点击卡片左下方的火箭图标按钮开始部署模型
在这里插入图片描述
后台根据配置参数下载量化或非量化LLM模型

注意：

当运行一个模型时，第一次运行是要从默认或指定的模型站点下载模型参数。当下载完成后，Xinference本地会有缓存的处理，以后再运行相同的模型不需要重新下载。

4.已部署模型列表

部署完成后，界面自动跳转到Running Models菜单，在LANGUAGE MODELS标签中，可以看到部署好的模型。
在这里插入图片描述
5.LLM模型对话

点击Launch Web UI图标，自动打开LLM模型的Web界面，可以直接与LLM模型进行对话
在这里插入图片描述
进行对话测试：

注意：当时在进行对话测试时出现了异常，参考下文异常中的异常3

模型参数配置说明

在部署LLM模型时，有以下参数可供选择：

1.必选配置：

Model Engine：模型推理引擎，根据模型不同，可能支持的引擎不同

Model Format: 模型格式，可以选择量化(ggml、gptq等)和非量化(pytorch)的格式

Model Size：模型的参数量大小，不同模型参数量不同，可能是： 6B、7B、13B、70B等

Quantization：量化精度，有4bit、8bit等量化精度选择

N-GPU：模型使用的GPU数量：可选择Auto、CPU、GPU数量，默认Auto

Replica：模型的副本，默认为1

点击chatglm3卡片，填写部署模型的相关信息
在这里插入图片描述

2.可选配置：

Model UID: 模型的UID，可理解为模型自定义名称，默认用原始模型名称

Request Limits: 模型的请求限制数量，默认为None。None表示此模型没有限制

Worker Ip: 指定分布式场景中模型所在的工作器ip

Gpu Idx: 指定模型所在的GPU索引

Download hub: 模型从哪里下载，可选：none、huggingface、modelscope

在这里插入图片描述

3.Lora配置：

Lora Model Config：PEFT（参数高效微调）模型和路径的列表

Lora Load Kwargs for Image Model：图像模型的 lora 加载参数字典

Lora Fuse Kwargs for Image Model：图像模型的 lora fuse 参数字典

在这里插入图片描述

4.传递给推理引擎的其他参数：
在这里插入图片描述

API接口

概述

除了使用LLM模型的Web界面进行操作外，Xinference还提供了API接口，通过调用API接口来使用LLM模型。

在API文档中，存在大量API接口，不仅有LLM模型的接口，还有其他模型(如Embedding)的接口，并且这些接口都是兼容OpenAI API的接口。

通过访问http://localhost:9997/docs来查看API文档。
在这里插入图片描述

对话接口

使用Curl工具调用对话接口

curl -X 'POST' \
  'http://localhost:9997/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "chatglm3",
    "messages": [
      {
        "role": "user",
        "content": "你好啊"
      }
    ]
  }'
  
{"id":"chat73f8c754-4898-11ef-89f6-000c2981d002","object":"chat.completion","created":1721700508,"model":"chatglm3","choices":[{"index":0,"message":{"role":"assistant","content":"你好👋！我是人工智能助手 ChatGLM3-6B，很高兴见到你，欢迎问我任何问题。"},"finish_reason":"stop"}],"usage":{"prompt_tokens":-1,"completion_tokens":-1,"total_tokens":-1}}root@master:~#

模型列表

使用Curl工具调用获取模型列表

curl -X 'GET' \
  'http://localhost:9997/v1/models' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \

{"object":"list","data":[{"id":"chatglm3","object":"model","created":0,"owned_by":"xinference","model_type":"LLM","address":"0.0.0.0:38145","accelerators":["0"],"model_name":"chatglm3","model_lang":["en","zh"],"model_ability":["chat","tools"],"model_description":"ChatGLM3 is the third generation of ChatGLM, still open-source and trained on Chinese and English data.","model_format":"pytorch","model_size_in_billions":6,"model_family":"chatglm3","quantization":"4-bit","model_hub":"modelscope","revision":"v1.0.2","context_length":8192,"replica":1}]}

嵌入模型

使用Curl工具调用嵌入模型接口

curl -X 'POST' \
  'http://localhost:9997/v1/embeddings' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "嵌入模型名称、UID",
    "input": "你好啊"
  }'

Rerank模型

使用Curl工具调用Rerank模型接口

curl -X 'POST' \
  'http://localhost:9997/v1/rerank' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
	  "model": "bge-reranker-base",
	  "query": "你是谁?",
	  "documents": [
		"你是一名乐于助人的AI助手。",
		"你的名字叫'rerank'"
	  ]
	}'

使用Xinference SDK

安装Xinference的Python SDK，使用以下命令安装最少依赖。注意: 版本必须和Xinference服务的版本保持匹配。

pip install xinference-client==${SERVER_VERSION}

from xinference.client import RESTfulClient

client = RESTfulClient("http://127.0.0.1:9997")
# 注意：my-llm是参数`--model-uid`指定的值
model = client.get_model("my-llm")
print(model.chat(
    prompt="你好啊",
    system_prompt="你是一个乐于助人的AI助手。",
    chat_history=[]
))

使用OpenAI SDK

Xinference提供了与OpenAI兼容的API，所以可以将Xinference运行的模型当成OpenAI的本地替代。

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="")

response = client.chat.completions.create(
    model="my-llm",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the largest animal?"}
    ]
)
print(response)

命令行工具

概述

Xinference提供了管理模型整个生命周期的能力。同样也可以使用命令行、cURL以及Python代码来管理

执行以下命令以安装xinference命令行工具

pip install xinferenc

查看帮助命令

(xinference) root@master:~# xinference --help
Usage: xinference [OPTIONS] COMMAND [ARGS]...

  Xinference command-line interface for serving and deploying models.

Options:
  -v, --version       Show the current version of the Xinference tool.
  --log-level TEXT    Set the logger level. Options listed from most log to
                      least log are: DEBUG > INFO > WARNING > ERROR > CRITICAL
                      (Default level is INFO)
  -H, --host TEXT     Specify the host address for the Xinference server.
  -p, --port INTEGER  Specify the port number for the Xinference server.
  --help              Show this message and exit.

Commands:
  cached         List all cached models in Xinference.
  cal-model-mem  calculate gpu mem usage with specified model size and...
  chat           Chat with a running LLM.
  engine         Query the applicable inference engine by model name.
  generate       Generate text using a running LLM.
  launch         Launch a model with the Xinference framework with the...
  list           List all running models in Xinference.
  login          Login when the cluster is authenticated.
  register       Register a new model with Xinference for deployment.
  registrations  List all registered models in Xinference.
  remove-cache   Remove selected cached models in Xinference.
  stop-cluster   Stop a cluster using the Xinference framework with the...
  terminate      Terminate a deployed model through unique identifier...
  unregister     Unregister a model from Xinference, removing it from...
  vllm-models    Query and display models compatible with vLLM.

启动模型

使用Xinference框架启动一个模型，Xinference提供了xinference launch命令帮助查询相关的参数配置。

(xinference) root@master:~# xinference launch --help
Usage: xinference launch [OPTIONS]

  Launch a model with the Xinference framework with the given parameters.

Options:
  -e, --endpoint TEXT             Xinference endpoint.
  -n, --model-name TEXT           Provide the name of the model to be
                                  launched.  [required]
  -t, --model-type TEXT           Specify type of model, LLM as default.
  -en, --model-engine TEXT        Specify the inference engine of the model
                                  when launching LLM.
  -u, --model-uid TEXT            Specify UID of model, default is None.
  -s, --size-in-billions TEXT     Specify the model size in billions of
                                  parameters.
  -f, --model-format TEXT         Specify the format of the model, e.g.
                                  pytorch, ggmlv3, etc.
  -q, --quantization TEXT         Define the quantization settings for the
                                  model.
  -r, --replica INTEGER           The replica count of the model, default is
                                  1.
  --n-gpu TEXT                    The number of GPUs used by the model,
                                  default is "auto".
  -lm, --lora-modules <TEXT TEXT>...
                                  LoRA module configurations in the format
                                  name=path. Multiple modules can be
                                  specified.
  -ld, --image-lora-load-kwargs <TEXT TEXT>...
  -fd, --image-lora-fuse-kwargs <TEXT TEXT>...
  --worker-ip TEXT                Specify which worker this model runs on by
                                  ip, for distributed situation.
  --gpu-idx TEXT                  Specify which GPUs of a worker this model
                                  can run on, separated with commas.
  --trust-remote-code BOOLEAN     Whether or not to allow for custom models
                                  defined on the Hub in their own modeling
                                  files.
  -ak, --api-key TEXT             Api-Key for access xinference api with
                                  authorization.
  --help                          Show this message and exit.
(xinference) root@master:~# xinference launch --help

启动一个模型：

xinference launch --model-engine transformers --model-uid my-llm --model-name chatglm3 --quantization 4-bit --size-in-billions 6 --model-format pytorch

参数说明：

--model-engine transformers：指定模型的推理引擎
--model-uid：指定模型的UID，如果没有指定，则随机生成一个ID
--model-name：指定模型名称
--quantization: 指定模型量化精度
--size-in-billions：指定模型参数大小,以十亿为单位
--model-format：指定模型的格式

成功启动日志如下：

(xinference) root@master:~# xinference launch --model-engine transformers --model-uid myllm --model-name chatglm3 --quantization 4-bit --size-in-billions 6 --model-format pytorch
Launch model name: chatglm3 with kwargs: {}
Model uid: myllm

访问http://localhost:9777，查看已运行的模型
在这里插入图片描述

引擎参数

当加载LLM模型时，推理引擎与模型的参数息息相关。Xinference提供了xinference engine命令帮助查询相关的参数组合。

(xinference) root@master:~# xinference engine --help
Usage: xinference engine [OPTIONS]

  Query the applicable inference engine by model name.

Options:
  -n, --model-name TEXT           The model name you want to query.
                                  [required]
  -en, --model-engine TEXT        Specify the `model_engine` to query the
                                  corresponding combination of other
                                  parameters.
  -f, --model-format TEXT         Specify the `model_format` to query the
                                  corresponding combination of other
                                  parameters.
  -s, --model-size-in-billions TEXT
                                  Specify the `model_size_in_billions` to
                                  query the corresponding combination of other
                                  parameters.
  -q, --quantization TEXT         Specify the `quantization` to query the
                                  corresponding combination of other
                                  parameters.
  -e, --endpoint TEXT             Xinference endpoint.
  -ak, --api-key TEXT             Api-Key for access xinference api with
                                  authorization.
  --help                          Show this message and exit.

1.查询与chatglm3模型相关的参数组合，以决定它能够怎样跑在各种推理引擎上。

(xinference) root@master:~# xinference engine --model-name chatglm3
Name      Engine        Format      Size (in billions)  Quantization
--------  ------------  --------  --------------------  --------------
chatglm3  Transformers  pytorch                      6  4-bit
chatglm3  Transformers  pytorch                      6  8-bit
chatglm3  Transformers  pytorch                      6  none
chatglm3  vLLM          pytorch                      6  none

2.想将chatglm3跑在vllm、transformers推理引擎上，但是不知道什么样的其他参数符合这个要求

(xinference) root@master:~# xinference engine --model-name chatglm3 --model-engine vllm
Name      Engine    Format      Size (in billions)  Quantization
--------  --------  --------  --------------------  --------------
chatglm3  vLLM      pytorch                      6  none

(xinference) root@master:~#  xinference engine --model-name chatglm3 --model-engine transformers
Name      Engine        Format      Size (in billions)  Quantization
--------  ------------  --------  --------------------  --------------
chatglm3  Transformers  pytorch                      6  4-bit
chatglm3  Transformers  pytorch                      6  8-bit
chatglm3  Transformers  pytorch                      6  none

3.加载GGUF格式的qwen-chat模型，需要知道其余的参数组合

chatglm3模型不支持参数： --model-format ggufv2

(xinference) root@master:~# xinference engine --model-name qwen-chat -f ggufv2
Name       Engine     Format      Size (in billions)  Quantization
---------  ---------  --------  --------------------  --------------
qwen-chat  llama.cpp  ggufv2                       7  Q4_K_M
qwen-chat  llama.cpp  ggufv2                      14  Q4_K_M

其他操作

列出所有 Xinference 支持的指定类型的模型：

xinference registrations -t LLM

列出所有在运行的模型：

xinference list

当不需要某个正在运行的模型，可以通过以下的方式来停止它并释放资源：

xinference terminate --model-uid "my-llm"

集成LoRA

Xinference 可以在启动 LLM 和 image 模型时连带一个 LoRA 微调模型用以辅助基础模型。

启动时集成LoRA

Xinference目前不会涉及管理 LoRA 模型。用户需要首先下载对应的 LoRA 模型，然后将模型存储路径提供给 Xinference 。

xinference launch <options>
--lora-modules <lora_name1> <lora_model_path1>
--lora-modules <lora_name2> <lora_model_path2>
--image-lora-load-kwargs <load_params1> <load_value1>
--image-lora-load-kwargs <load_params2> <load_value2>
--image-lora-fuse-kwargs <fuse_params1> <fuse_value1>
--image-lora-fuse-kwargs <fuse_params2> <fuse_value2>

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

lora_model1={'lora_name': <lora_name1>, 'local_path': <lora_model_path1>}
lora_model2={'lora_name': <lora_name2>, 'local_path': <lora_model_path2>}
lora_models=[lora_model1, lora_model2]
image_lora_load_kwargs={'<load_params1>': <load_value1>, '<load_params2>': <load_value2>},
image_lora_fuse_kwargs={'<fuse_params1>': <fuse_value1>, '<fuse_params2>': <fuse_value2>}

peft_model_config = {
"image_lora_load_kwargs": image_lora_load_params,
"image_lora_fuse_kwargs": image_lora_fuse_params,
"lora_list": lora_models
}

client.launch_model(
    <other_options>,
    peft_model_config=peft_model_config
)

注意： image_lora_load_kwargs和image_lora_fuse_kwargs 选项只应用于 image 模型。它们对应于 diffusers 库的 load_lora_weights 和 fuse_lora 接口中的额外参数。如果启动的是 LLM 模型，则无需设置这些选项。

应用时集成LoRA

对于大语言模型，使用时指定其中一个 lora 。具体地，在 generate_config 参数中配置 lora_name 参数。lora_name 对应 launch 过程中你的配置。

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<model_uid>")
model.chat(
    "<prompt>",
    <other_options>,
    generate_config={"lora_name": "<your_lora_name>"}
)

部署其他模型

注意：可能由于Xinference版本或者与模型不完全适配会出现一些问题，可选择降低Xinference版本或更换类似模型。相信Xinference会越来越完善。

视觉模型

部署

视觉模型是指用于处理和分析视觉数据（如图像和视频）的机器学习或深度学习模型。这些模型的主要目标是理解和解释视觉信息，执行多种任务，包括图像分类、目标检测、图像分割、图像生成等。

可以让模型接收图像并回答有关它们的问题。

视觉模型部署方式与LLM模型部署大同小异，首先点击Launch Model菜单，在LANGUAGE MODELS标签下选择多模态模型。

输入关键词以搜索需要部署的模型。这里以先过滤模型，再搜索选择glm-4v模型为例。
在这里插入图片描述
填写部署模型相关参数，执行部署操作

后台同样可以看到模型下载信息

部署完成，查看运行的模型

使用Web

使用图片和文字与视觉模型进行对话
在这里插入图片描述

使用API

模型可以通过两种主要方式获取图像：通过传递图像的链接或直接在请求中传递 base64 编码的图像。

1.使用OpenAI

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url=f"http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
response = client.chat.completions.create(
    model="<MODEL_UID>",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "http://xxxx.jpg",
                    },
                },
            ],
        }
    ],
)
print(response.choices[0])

2.上传Base64编码的图片

import openai
import base64

# Function to encode the image
def encode_image(image_path):
with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

# Path to your image
image_path = "path_to_your_image.jpg"

# Getting the base64 string
b64_img = encode_image(image_path)

client = openai.Client(
    api_key="cannot be empty",
    base_url=f"http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
response = client.chat.completions.create(
    model="<MODEL_UID>",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{b64_img}",
                    },
                },
            ],
        }
    ],
)
print(response.choices[0])

Embedding模型

部署

Embedding模型是一种用于将高维数据（如文本、图像或其他类型的数据）转换为低维向量表示的模型。这种表示方式能够捕捉数据的语义和结构信息，使得相似的对象在向量空间中距离更近。

文本嵌入用于量化不同文本之间的相关性。它们可以应用于各种应用程序，包括搜索、聚类、推荐、异常检测、多样性度量和分类。

嵌入是一组浮点数的向量。两个向量之间的接近程度可以作为它们相似性的指标。距离越小表示相关性越高，而距离越大则表示相关性降低。

首先点击Launch Model菜单，在Embedding Models标签下选择嵌入模型。输入关键词以搜索需要部署的模型，这里搜索选择bge-base-zh-v1.5模型为例。
在这里插入图片描述
对于模型参数，几乎不需要设置，直接部署模型即可。

等待部署、运行成功

使用API

使用Curl调用API接口

curl -X 'POST' \
  'http://localhost:9997/v1/embeddings' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "bge-base-zh-v1.5",
  "input": "你好啊"
}'

Embedding模型响应结果：

{
"object":"list","model":"bge-base-zh-v1.5-1-0",
"data":[{"index":0,"object":"embedding",
"embedding":[0.029834920540452003,-0.019862590357661247,.......,-0.006424838211387396,0.012447659857571125,-0.05162930488586426]}],
"usage":{"prompt_tokens":37,"total_tokens":37}
}

import openai

client = openai.Client(
  api_key="cannot be empty",
  base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
client.embeddings.create(
  model=model_uid,
  input=["What is the capital of China?"]
)

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
input = "What is the capital of China?"
model.create_embedding(input)

Rerank模型

部署

给定一个查询和一系列文档，Rerank 会根据与查询的语义相关性从最相关到最不相关对文档进行重新排序。在 Xinference 中，可以通过 Rerank 端点调用 Rerank 模型来对一系列文档进行排序。

首先点击Launch Model菜单，在Rerank Models标签下选择Rerank模型。输入关键词以搜索需要部署的模型，这里搜索选择bge-reranker-base模型为例。
在这里插入图片描述
对于模型参数，几乎不需要设置，直接部署模型即可。

等待模型部署、运行成功

使用API

可以通过cURL、OpenAI Client或Xinference的来尝试使用Rerank API：

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/rerank' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "query": "A man is eating pasta.",
    "documents": [
        "A man is eating food.",
        "A man is eating a piece of bread.",
        "The girl is carrying a baby.",
        "A man is riding a horse.",
        "A woman is playing violin."
    ]
  }'

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_HOST>")
model = client.get_model(<MODEL_UID>)

query = "A man is eating pasta."
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin."
]
print(model.rerank(corpus, query))

图像模型

部署

图像模型是指用于处理、分析和理解图像数据的机器学习或深度学习模型。这些模型可以执行多种任务，如图像分类、目标检测、图像分割、图像生成等。

首先点击Launch Model菜单，在Image Models标签下选择嵌入模型。这里搜索选择stable-diffusion-v1.5模型为例。
在这里插入图片描述
对于模型参数，几乎不需要设置，直接部署模型即可。这里指定模型下载站点。

部署完成，查看运行的模型

使用Web

在这个Web界面可以使用文生图、图生图等功能
在这里插入图片描述

使用API

通过 cURL、OpenAI Client 或 Xinference 的方式尝试使用 Text-to-image API。

Images API提供了两种与图像交互的方法：

文生图端点根据文本从零开始创建图像。

图生图端点允许您生成给定图像的变体。

API 端点	OpenAI 兼容端点
Text-to-Image API	/v1/images/generations
Image-to-image API	/v1/images/variations

使用curl

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/images/generations' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "prompt": "an apple",
  }'

使用openai

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
client.images.generate(
    model=<MODEL_UID>,
    prompt="an apple"
)

使用Xinference Client

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
input_text = "an apple"
model.text_to_image(input_text)

语音模型

部署

语音模型是指用于处理和分析语音数据的机器学习或深度学习模型。这些模型旨在理解语音信号，执行各种任务，如语音识别、语音合成、说话人识别等。

使用 Xinference 将音频转换为文本或将文本转换为音频。
在这里插入图片描述

使用API

Audio API提供了三种与音频交互的方法：

API端点	OpenAI兼容端点	描述
Transcription API	/v1/audio/transcriptions	转录终端将音频转录为输入语言
Translation API	/v1/audio/translations	翻译端点将音频转换为英文
Speech API	/v1/audio/speech	转录终端将音频转录为输入语言

可以通过 cURL、OpenAI Client 或者 Xinference 的 Python 客户端来尝试 Transcription API：

1.转录

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/transcriptions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "file": "<audio bytes>",
  }'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
with open("speech.mp3", "rb") as audio_file:
    client.audio.transcriptions.create(
        model=<MODEL_UID>,
        file=audio_file,
    )

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
with open("speech.mp3", "rb") as audio_file:
    model.transcriptions(audio=audio_file.read())

2.翻译

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/translations' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "file": "<audio bytes>",
  }'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
with open("speech.mp3", "rb") as audio_file:
    client.audio.translations.create(
        model=<MODEL_UID>,
        file=audio_file,
    )

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
with open("speech.mp3", "rb") as audio_file:
    model.translations(audio=audio_file.read())

3.语音

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/speech' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "text": "<The text to generate audio for>",
    "voice": "echo",
    "stream": True,
  }'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
client.audio.speech.create(
    model=<MODEL_UID>,
    input=<The text to generate audio for>,
    voice="echo",
)

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
model.speech(
    input=<The text to generate audio for>,
    voice="echo",
    stream: True,
)

自定义模型

Xinference 提供了一种灵活而全面的方式来集成、管理和应用自定义模型。
在这里插入图片描述

异常

异常1

在执行过程中，出现安装llama-cpp-python时，出现以下问题：

Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [76 lines of output]
      *** scikit-build-core 0.9.8 using CMake 3.22.1 (wheel)
      *** Configuring CMake...
      loading initial cache file /tmp/tmp0pe3_qsj/build/CMakeInit.txt
      -- The C compiler identification is GNU 11.4.0
      -- The CXX compiler identification is GNU 11.4.0
      -- Detecting C compiler ABI info
      -- Detecting C compiler ABI info - done
      -- Check for working C compiler: /usr/bin/gcc - skipped


*** CMake build failed
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects

访问：llama-cpp-python项目
在这里插入图片描述
目前llama-cpp-python最新版本v0.2.82-cu123，根据系统版本、python版本选择下载

wget https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.82-cu124/llama_cpp_python-0.2.82-cp310-cp310-linux_x86_64.whl

执行以下命令安装

pip install llama_cpp_python-0.2.82-cp310-cp310-linux_x86_64.whl

异常2

在执行过程中，如果出现安装chatglm.cpp相关异常，执行如下操作解决。

访问：chatglm.cpp项目
在这里插入图片描述
目前 chatglm.cpp最新版本v0.4.0，根据系统版本、python版本选择下载

wget https://github.com/li-plus/chatglm.cpp/releases/download/v0.4.0/chatglm_cpp-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

执行以下命令安装

pip install chatglm_cpp-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

异常3

在LLM模型对话界面出现异常提示：

Error
[address=0.0.0.0:34387, pid=354238] GenerationMixin._get_logits_warper() missing 1 required positional argument: 'device'

后台运行异常提示：

Exception: [address=0.0.0.0:34387, pid=354238] GenerationMixin._get_logits_warper() missing 1 required positional argument: 'device'

参阅GitHub项目的issues，对transformers进行降级

(xinference) root@master:~# pip list | grep  transformers
sentence-transformers             3.0.1
transformers                      4.42.4
transformers-stream-generator     0.0.5
(xinference) root@master:~# pip install 'transformers==4.41.2'