Commit

FEAT: support generate/chat/create_embedding/register/unregister/registrations method in cmdline (#363)

Co-authored-by: UranusSeven <[email protected]>
pangyoki and UranusSeven authored Aug 18, 2023
1 parent 34ba817 commit 7ed7a02
Showing 5 changed files with 621 additions and 109 deletions.
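Taken together, the new command-line methods cover the full lifecycle of a custom model. The sketch below assembles the CLI invocations documented in this diff into one workflow; the new `chat` and `create_embedding` subcommands named in the commit title are not shown in this diff, so their flags are not reproduced here.

```bash
# Register a custom model definition from a JSON file.
xinference register --model-type LLM --file model.json --persist

# List built-in and custom model registrations.
xinference registrations --model-type LLM

# Launch the registered model.
xinference launch --model-name custom-llama-2 --model-format pytorch

# Generate with the launched model; replace ${UID} with the real model UID.
xinference generate --model-uid ${UID}

# Remove the registration when done.
xinference unregister --model-type LLM --model-name custom-llama-2
```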
2 changes: 1 addition & 1 deletion README.md
@@ -226,5 +226,5 @@ For in-depth details on the built-in models, please refer to [built-in models](h
- Xinference will download models automatically for you, and by default the models will be saved under `${USER}/.xinference/cache`.


-## Custom models \[Experimental\]
+## Custom models
Please refer to [custom models](https://inference.readthedocs.io/en/latest/models/custom.html).
12 changes: 6 additions & 6 deletions README_zh_CN.md
@@ -134,10 +134,10 @@ model = client.get_model(model_uid)
chat_history = []
prompt = "What is the largest animal?"
model.chat(
    prompt,
    chat_history,
    generate_config={"max_tokens": 1024}
)
```

Return value:
@@ -206,5 +206,5 @@ $ xinference list --all
**Note**:
- Xinference will download models automatically for you, and by default the models will be saved under `${USER}/.xinference/cache`.

-## Custom models \[Experimental\]
-Please refer to [custom models](https://inference.readthedocs.io/en/latest/models/custom.html).
+## Custom models
+Please refer to [custom models](https://inference.readthedocs.io/en/latest/models/custom.html)
134 changes: 78 additions & 56 deletions doc/source/models/custom.rst
@@ -1,126 +1,127 @@
.. _models_custom:

-============================
-Custom Models (Experimental)
-============================
-
-Custom models are currently an experimental feature and are expected to be officially released in
-version v0.2.0.
+=============
+Custom Models
+=============
+Xinference provides a flexible and comprehensive way to integrate, manage, and utilize custom models.

Define a custom model
~~~~~~~~~~~~~~~~~~~~~

Define a custom model based on the following template:

-.. code-block:: python
+.. code-block:: json

-   custom_model = {
+   {
     "version": 1,
     # model name. must start with a letter or a
     # digit, and can only contain letters, digits,
     # underscores, or dashes.
     "model_name": "custom-llama-2",
     # supported languages
     "model_lang": [
       "en"
     ],
     # model abilities. could be "embed", "generate"
     # and "chat".
     "model_ability": [
       "generate"
     ],
     # model specifications.
     "model_specs": [
       {
         # model format.
         "model_format": "pytorch",
         "model_size_in_billions": 7,
         # quantizations.
         "quantizations": [
           "4-bit",
           "8-bit",
           "none"
         ],
         # hugging face model ID.
         "model_id": "meta-llama/Llama-2-7b",
         # when model_uri is present, xinference will load the model from the given URI.
         "model_uri": "file:///path/to/llama-2-7b"
       },
       {
         # model format.
         "model_format": "pytorch",
         "model_size_in_billions": 13,
         # quantizations.
         "quantizations": [
           "4-bit",
           "8-bit",
           "none"
         ],
         # hugging face model ID.
         "model_id": "meta-llama/Llama-2-13b"
       },
       {
         # model format.
         "model_format": "ggmlv3",
         "model_size_in_billions": 7,
         # quantizations.
         "quantizations": [
           "q4_0",
           "q8_0"
         ],
         # hugging face model ID.
         "model_id": "TheBloke/Llama-2-7B-GGML",
         # an f-string that takes a quantization.
         "model_file_name_template": "llama-2-7b.ggmlv3.{quantization}.bin"
       }
     ],
     # prompt style, required by chat models.
     # for more details, see: xinference/model/llm/tests/test_utils.py
     "prompt_style": null
   }

* model_name: A string defining the name of the model. The name must start with a letter or a digit and can only contain letters, digits, underscores, or dashes.
* model_lang: A list of strings representing the languages the model supports, e.g. ["en"] for English.
* model_ability: A list of strings defining the abilities of the model. It could include options like "embed", "generate", and "chat". In this example, the model has the "generate" ability.
* model_specs: An array of objects defining the specifications of the model. Each object includes:

  * model_format: A string that defines the model format, either "pytorch" or "ggmlv3".
  * model_size_in_billions: An integer defining the size of the model in billions of parameters.
  * quantizations: A list of strings defining the available quantizations for the model. For PyTorch models, it could be "4-bit", "8-bit", or "none". For ggmlv3 models, the quantizations should correspond to values that work with the ``model_file_name_template``.
  * model_id: A string representing the model ID, typically the identifier used on Hugging Face.
  * model_uri: A string representing the URI from which the model can be loaded, such as "file:///path/to/llama-2-7b". If the model URI is absent, Xinference will try to download the model from Hugging Face using the model ID.
  * model_file_name_template: Required by ggml models. An f-string template used to build the model file name from the quantization.

* prompt_style: An optional field, required by chat models, that defines the style of prompts. The example above sets it to null; for more details, see xinference/model/llm/tests/test_utils.py.
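
Note that the ``#`` comments in the template above are explanatory only and are not valid JSON, so strip them before saving the definition to a file. As a quick sanity check before registering, the saved file can be loaded and inspected with a short script. This is an illustrative sketch rather than an official validator; the file name ``model.json`` and the checked fields simply follow the template above.

.. code-block:: python

   import json

   # Load the model definition saved from the template above
   # (with the explanatory comments removed).
   with open("model.json") as fd:
       spec = json.load(fd)

   # Top-level fields used throughout this page.
   for field in ("version", "model_name", "model_lang",
                 "model_ability", "model_specs"):
       assert field in spec, f"missing required field: {field}"

   # ggmlv3 specs need a file name template to resolve quantizations.
   for model_spec in spec["model_specs"]:
       if model_spec["model_format"] == "ggmlv3":
           assert "model_file_name_template" in model_spec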


-Register the Custom Model
-~~~~~~~~~~~~~~~~~~~~~~~~~
+Register a Custom Model
+~~~~~~~~~~~~~~~~~~~~~~~

Register a custom model programmatically:

.. code-block:: python

   import json

   from xinference.client import Client

   with open('model.json') as fd:
       model = fd.read()

   # replace with real xinference endpoint
-   endpoint = "http://localhost:9997"
+   endpoint = 'http://localhost:9997'
   client = Client(endpoint)
-   client.register_model(model_type="LLM", model=json.dumps(custom_model), persist=False)
+   client.register_model(model_type="LLM", model=model, persist=False)

Or via CLI:

.. code-block:: bash

   xinference register --model-type LLM --file model.json --persist

-Load the Custom Model
-~~~~~~~~~~~~~~~~~~~~~
+List the Built-in and Custom Models
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

List built-in and custom models programmatically:

.. code-block:: python
-   uid = client.launch_model(model_name='custom-llama-2')
+   registrations = client.list_model_registrations(model_type="LLM")

Or via CLI:

.. code-block:: bash

   xinference registrations --model-type LLM
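
For scripted checks, the registration list can also be consumed programmatically. A minimal sketch, assuming each entry in the returned list is a dict carrying a ``model_name`` field (the exact payload shape is not shown in this diff):

.. code-block:: python

   # Confirm the custom model shows up among the registrations.
   # Assumes each entry exposes a "model_name" field.
   registrations = client.list_model_registrations(model_type="LLM")
   names = [registration["model_name"] for registration in registrations]
   assert "custom-llama-2" in names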
-Run the Custom Model
-~~~~~~~~~~~~~~~~~~~~
+Launch the Custom Model
+~~~~~~~~~~~~~~~~~~~~~~~

Launch the custom model programmatically:

.. code-block:: python
   uid = client.launch_model(model_name='custom-llama-2', model_format='pytorch')

Or via CLI:

.. code-block:: bash

   xinference launch --model-name custom-llama-2 --model-format pytorch

Interact with the Custom Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Invoke the model programmatically:

.. code-block:: python
   model = client.get_model(model_uid=uid)
-   model.generate("What is the largest animal in the world?")
+   model.generate('What is the largest animal in the world?')

Result:

@@ -145,3 +146,24 @@ Result:
      "total_tokens":33
   }
}
Or via CLI (replace ``${UID}`` with the real model UID):

.. code-block:: bash
   xinference generate --model-uid ${UID}
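
Programmatic generation can also bound the output length with a ``generate_config``, mirroring the ``max_tokens`` usage in the README's chat example. A minimal sketch, assuming ``generate`` accepts the same config keys as ``chat``:

.. code-block:: python

   # Cap the completion length; assumes generate() accepts the same
   # generate_config keys as the chat() example in the README.
   completion = model.generate(
       'What is the largest animal in the world?',
       generate_config={"max_tokens": 64},
   )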
Unregister the Custom Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Unregister the custom model programmatically:

.. code-block:: python
   client.unregister_model(model_type='LLM', model_name='custom-llama-2')

Or via CLI:

.. code-block:: bash
   xinference unregister --model-type LLM --model-name custom-llama-2