Commit

FEAT: support generate/chat/create_embedding/register/unregister/registrations method in cmdline (#363)

Co-authored-by: UranusSeven <[email protected]>
pangyoki and UranusSeven authored Aug 18, 2023
1 parent 34ba817 commit 7ed7a02
Showing 5 changed files with 621 additions and 109 deletions.
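Taken together, the new command-line methods cover the full lifecycle of a custom model. The sketch below assembles the CLI invocations documented in this diff into one workflow; the new `chat` and `create_embedding` subcommands named in the commit title are not shown in this diff, so their flags are not reproduced here.

```bash
# Register a custom model definition from a JSON file.
xinference register --model-type LLM --file model.json --persist

# List built-in and custom model registrations.
xinference registrations --model-type LLM

# Launch the registered model.
xinference launch --model-name custom-llama-2 --model-format pytorch

# Generate with the launched model; replace ${UID} with the real model UID.
xinference generate --model-uid ${UID}

# Remove the registration when done.
xinference unregister --model-type LLM --model-name custom-llama-2
```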
2 changes: 1 addition & 1 deletion README.md
@@ -226,5 +226,5 @@ For in-depth details on the built-in models, please refer to [built-in models](h
- Xinference will download models automatically for you, and by default the models will be saved under `${USER}/.xinference/cache`.


-## Custom models \[Experimental\]
+## Custom models
Please refer to [custom models](https://inference.readthedocs.io/en/latest/models/custom.html).
12 changes: 6 additions & 6 deletions README_zh_CN.md
@@ -134,10 +134,10 @@ model = client.get_model(model_uid)
chat_history = []
prompt = "What is the largest animal?"
model.chat(
    prompt,
    chat_history,
    generate_config={"max_tokens": 1024}
)
```

Return value:
@@ -206,5 +206,5 @@ $ xinference list --all
**Note**:
- Xinference will download models automatically for you, and by default the models will be saved under `${USER}/.xinference/cache`.

-## Custom models \[Experimental\]
-Please refer to [custom models](https://inference.readthedocs.io/en/latest/models/custom.html).
+## Custom models
+Please refer to [custom models](https://inference.readthedocs.io/en/latest/models/custom.html)
134 changes: 78 additions & 56 deletions doc/source/models/custom.rst
@@ -1,126 +1,127 @@
.. _models_custom:

-============================
-Custom Models (Experimental)
-============================
-
-Custom models are currently an experimental feature and are expected to be officially released in
-version v0.2.0.
+=============
+Custom Models
+=============
+Xinference provides a flexible and comprehensive way to integrate, manage, and utilize custom models.

Define a custom model
~~~~~~~~~~~~~~~~~~~~~

Define a custom model based on the following template:

-.. code-block:: python
+.. code-block:: json

-   custom_model = {
+   {
     "version": 1,
     # model name. must start with a letter or a
     # digit, and can only contain letters, digits,
     # underscores, or dashes.
     "model_name": "custom-llama-2",
     # supported languages
     "model_lang": [
       "en"
     ],
     # model abilities. could be "embed", "generate"
     # and "chat".
     "model_ability": [
       "generate"
     ],
     # model specifications.
     "model_specs": [
       {
         # model format.
         "model_format": "pytorch",
         "model_size_in_billions": 7,
         # quantizations.
         "quantizations": [
           "4-bit",
           "8-bit",
           "none"
         ],
         # hugging face model ID.
         "model_id": "meta-llama/Llama-2-7b",
         # when model_uri is present, xinference will load the model from the given URI.
         "model_uri": "file:///path/to/llama-2-7b"
       },
       {
         # model format.
         "model_format": "pytorch",
         "model_size_in_billions": 13,
         # quantizations.
         "quantizations": [
           "4-bit",
           "8-bit",
           "none"
         ],
         # hugging face model ID.
         "model_id": "meta-llama/Llama-2-13b"
       },
       {
         # model format.
         "model_format": "ggmlv3",
         "model_size_in_billions": 7,
         # quantizations.
         "quantizations": [
           "q4_0",
           "q8_0"
         ],
         # hugging face model ID.
         "model_id": "TheBloke/Llama-2-7B-GGML",
         # an f-string that takes a quantization.
         "model_file_name_template": "llama-2-7b.ggmlv3.{quantization}.bin"
       }
     ],
     # prompt style, required by chat models.
     # for more details, see: xinference/model/llm/tests/test_utils.py
     "prompt_style": null
   }

* model_name: A string defining the name of the model. The name must start with a letter or a digit and can only contain letters, digits, underscores, or dashes.
* model_lang: A list of strings representing the languages the model supports, e.g. ["en"] for English.
* model_ability: A list of strings defining the abilities of the model. It could include options like "embed", "generate", and "chat". In this example, the model has the "generate" ability.
* model_specs: An array of objects defining the specifications of the model. Each object includes:

  * model_format: A string that defines the model format, either "pytorch" or "ggmlv3".
  * model_size_in_billions: An integer defining the size of the model in billions of parameters.
  * quantizations: A list of strings defining the available quantizations for the model. For PyTorch models, it could be "4-bit", "8-bit", or "none". For ggmlv3 models, the quantizations should correspond to values that work with the ``model_file_name_template``.
  * model_id: A string representing the model ID, typically the identifier used on Hugging Face.
  * model_uri: A string representing the URI from which the model can be loaded, such as "file:///path/to/llama-2-7b". If the model URI is absent, Xinference will try to download the model from Hugging Face using the model ID.
  * model_file_name_template: Required by ggml models. An f-string template used to build the model file name from the quantization.

* prompt_style: An optional field, required by chat models, that defines the style of prompts. The example above sets it to null; for more details, see xinference/model/llm/tests/test_utils.py.
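
Note that the ``#`` comments in the template above are explanatory only and are not valid JSON, so strip them before saving the definition to a file. As a quick sanity check before registering, the saved file can be loaded and inspected with a short script. This is an illustrative sketch rather than an official validator; the file name ``model.json`` and the checked fields simply follow the template above.

.. code-block:: python

   import json

   # Load the model definition saved from the template above
   # (with the explanatory comments removed).
   with open("model.json") as fd:
       spec = json.load(fd)

   # Top-level fields used throughout this page.
   for field in ("version", "model_name", "model_lang",
                 "model_ability", "model_specs"):
       assert field in spec, f"missing required field: {field}"

   # ggmlv3 specs need a file name template to resolve quantizations.
   for model_spec in spec["model_specs"]:
       if model_spec["model_format"] == "ggmlv3":
           assert "model_file_name_template" in model_spec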


-Register the Custom Model
-~~~~~~~~~~~~~~~~~~~~~~~~~
+Register a Custom Model
+~~~~~~~~~~~~~~~~~~~~~~~

Register a custom model programmatically:

.. code-block:: python

   import json

   from xinference.client import Client

   with open('model.json') as fd:
       model = fd.read()

   # replace with real xinference endpoint
-   endpoint = "http://localhost:9997"
+   endpoint = 'http://localhost:9997'
   client = Client(endpoint)
-   client.register_model(model_type="LLM", model=json.dumps(custom_model), persist=False)
+   client.register_model(model_type="LLM", model=model, persist=False)

Or via CLI:

.. code-block:: bash

   xinference register --model-type LLM --file model.json --persist

-Load the Custom Model
-~~~~~~~~~~~~~~~~~~~~~
+List the Built-in and Custom Models
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

List built-in and custom models programmatically:

.. code-block:: python
-   uid = client.launch_model(model_name='custom-llama-2')
+   registrations = client.list_model_registrations(model_type="LLM")

Or via CLI:

.. code-block:: bash

   xinference registrations --model-type LLM
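
For scripted checks, the registration list can also be consumed programmatically. A minimal sketch, assuming each entry in the returned list is a dict carrying a ``model_name`` field (the exact payload shape is not shown in this diff):

.. code-block:: python

   # Confirm the custom model shows up among the registrations.
   # Assumes each entry exposes a "model_name" field.
   registrations = client.list_model_registrations(model_type="LLM")
   names = [registration["model_name"] for registration in registrations]
   assert "custom-llama-2" in names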
-Run the Custom Model
-~~~~~~~~~~~~~~~~~~~~
+Launch the Custom Model
+~~~~~~~~~~~~~~~~~~~~~~~

Launch the custom model programmatically:

.. code-block:: python
   uid = client.launch_model(model_name='custom-llama-2', model_format='pytorch')

Or via CLI:

.. code-block:: bash

   xinference launch --model-name custom-llama-2 --model-format pytorch

Interact with the Custom Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Invoke the model programmatically:

.. code-block:: python
   model = client.get_model(model_uid=uid)
-   model.generate("What is the largest animal in the world?")
+   model.generate('What is the largest animal in the world?')

Result:

@@ -145,3 +146,24 @@ Result:
      "total_tokens":33
   }
}
Or via CLI (replace ``${UID}`` with the real model UID):

.. code-block:: bash
   xinference generate --model-uid ${UID}
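
Programmatic generation can also bound the output length with a ``generate_config``, mirroring the ``max_tokens`` usage in the README's chat example. A minimal sketch, assuming ``generate`` accepts the same config keys as ``chat``:

.. code-block:: python

   # Cap the completion length; assumes generate() accepts the same
   # generate_config keys as the chat() example in the README.
   completion = model.generate(
       'What is the largest animal in the world?',
       generate_config={"max_tokens": 64},
   )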
Unregister the Custom Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Unregister the custom model programmatically:

.. code-block:: python
   client.unregister_model(model_type='LLM', model_name='custom-llama-2')

Or via CLI:

.. code-block:: bash
   xinference unregister --model-type LLM --model-name custom-llama-2