To prepare a model for Instill Model:
- Create a model card `README.md` to describe your model
- Write a `model.py` file that defines the model class, which will be decorated into a servable model with Instill's python-sdk
- Organise the model files into a valid Instill Model layout
# Model Card
A model card is a `README.md` file that accompanies the model, describing useful information along with additional model metadata. Under the hood, a model card is associated with a specific model.
It is crucial for reproducibility, sharing and discoverability. We highly recommend adding a model card `README.md` file when preparing your model for use in Instill Model.
In a model card, you can provide information about:
- the model itself
- its use cases and limitations
- the datasets used to train the model
- the training experiments and configuration
- benchmarking and evaluation results
- reference materials
After importing a model into Instill Model, the model card will be rendered in the Console on the Model page. For example, the Console renders the model card of a model imported from the GitHub repository model-mobilenetv2.
Try our Import GitHub models guide to import a model from GitHub.
# Model Card Metadata
You can insert Front Matter in a model card to define the model metadata. Start with three dashes (`---`) at the top, include all the metadata, and close the section with another `---`, like the example below.
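For instance, a model card for the TinyLlama text-generation model used later in this guide might start with front matter like this (the `Task` identifier and `Tags` values are illustrative; the task must match one of the supported AI tasks described in the next section):

```
---
Task: TEXT_GENERATION_CHAT
Tags:
  - Text-Generation
  - TinyLlama
---

# TinyLlama

A short description of the model, its use cases, limitations, training data and evaluation results.
```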
# Specify an AI Task
When importing the model, Instill Model detects the `Task` in the model card and verifies that the model output fulfils the AI task requirements.
If the model is verified, Instill Model will automatically convert the model output into the corresponding standardised VDP AI task format whenever the model is used.
Please check the supported AI tasks and the corresponding output format for each task.
If the `Task` is not specified, the model will be recognised with the `Unspecified` AI task, and the raw model output will be wrapped in a standard format.
❓ How do I know if the AI task metadata is correctly recognised?
If you include valid AI task metadata, it will show on the Model page of the Console.
# Model Layout
With Ray under the hood for model serving, Instill Model extends its support to any deep learning framework the user desires. To deploy a model on Instill Model, we suggest preparing the model files in a layout similar to the following:
```
.
├── README.md
├── model.py
└── <weights>
    ├── <weight_file_1>
    ├── <weight_file_2>
    ├── ...
    └── <weight_file_n>
```
The above layout displays a typical Instill Model model consisting of:
- `README.md` - the model card that embeds the metadata in front matter and the descriptions in Markdown format
- `model.py` - where you define the decorated model class that contains the custom inference logic
- `<weights>` - the directory that holds the necessary weight files
You can name the `<weights>` folder freely, provided that the folder name is clear and semantic.
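For instance, the TinyLlama model implemented later in this guide could be organised as follows. The `tinyllama/` folder name and the individual weight file names are only illustrative placeholders; the folder name is assumed to match the `model_weight_or_folder_name` passed to `InstillDeployable` in `model.py`:

```
.
├── README.md
├── model.py
└── tinyllama
    ├── config.json
    ├── model.safetensors
    └── tokenizer.json
```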
As long as the model class in your `model.py` implements the necessary methods, it can be safely imported into Instill Model and deployed online.
Check out this guide for more details.
# Prepare model.py
To implement a custom model that can be imported and served on Instill Model, you only need to implement a simple model class within the `model.py` file.
The custom model class needs to implement the following methods; a minimal skeleton is sketched after the list:
- `__init__` - define the model loading process here, allowing the weights to be stored in memory and enabling faster auto-scaling behaviour
- `ModelMetadata` - this method tells the backend service the input/output shapes the model expects; if you are using our predefined AI Tasks, you can simply import `construct_{task}_metadata_response` and return its result
- `__call__` - the inference request entrypoint, where you implement your model inference logic
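Putting these together, a bare-bones `model.py` might look roughly like the sketch below. The class name `MyModel` and the placeholder arguments are hypothetical; the sketch only illustrates the required structure, using the same SDK decorators as the full example that follows:

```python
from instill.helpers.ray_config import instill_deployment, InstillDeployable


# decorate the class so it can be served by Instill Model
@instill_deployment
class MyModel:
    def __init__(self):
        # load the model weights into memory here for faster auto-scaling
        ...

    def ModelMetadata(self, req):
        # return the expected input/output shapes, e.g. via a
        # construct_{task}_metadata_response helper for a predefined AI Task
        ...

    async def __call__(self, request):
        # parse the request, run inference and build the response here
        ...


# expose a global deployable instance so Instill Model can serve this class
deployable = InstillDeployable(
    MyModel, model_weight_or_folder_name="<weights>", use_gpu=False
)
```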
The following is a simple implementation of the TinyLlama model with explanations.
```python
# import necessary packages
import torch
from transformers import pipeline

# import SDK helper functions
# const package hosts the standard Datatypes and Input class for each standard Instill AI Task
from instill.helpers.const import TextGenerationChatInput

# ray_io package hosts the parsers to easily convert request payloads into input parameters, and model outputs into responses
from instill.helpers.ray_io import StandardTaskIO

# ray_config package hosts the decorators and deployment object for the model class
from instill.helpers.ray_config import instill_deployment, InstillDeployable
from instill.helpers import (
    construct_text_generation_chat_infer_response,
    construct_text_generation_chat_metadata_response,
)


# use the instill_deployment decorator to convert the model class into a servable model
@instill_deployment
class TinyLlama:
    # within the __init__ function, set up the model instance with the desired framework,
    # in this case the pipeline from transformers
    def __init__(self):
        self.pipeline = pipeline(
            "text-generation",
            model="tinyllama",
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )

    # ModelMetadata tells the server what inputs and outputs the model is expecting
    def ModelMetadata(self, req):
        return construct_text_generation_chat_metadata_response(req=req)

    # __call__ is the method handling the trigger request from Instill Model
    async def __call__(self, request):
        # use the StandardTaskIO package to parse the request and get the corresponding input
        # for the text-generation-chat task
        task_text_generation_chat_input: TextGenerationChatInput = (
            StandardTaskIO.parse_task_text_generation_chat_input(request=request)
        )

        # prepare the prompt with the chat template
        prompt = self.pipeline.tokenizer.apply_chat_template(
            task_text_generation_chat_input.chat_history,
            tokenize=False,
            add_generation_prompt=True,
        )

        # inference
        sequences = self.pipeline(
            prompt,
            max_new_tokens=task_text_generation_chat_input.max_new_tokens,
            do_sample=True,
            temperature=task_text_generation_chat_input.temperature,
            top_k=task_text_generation_chat_input.top_k,
            top_p=0.95,
        )

        # convert the output into the response output, again with StandardTaskIO
        task_text_generation_chat_output = (
            StandardTaskIO.parse_task_text_generation_chat_output(sequences=sequences)
        )

        return construct_text_generation_chat_infer_response(
            req=request,
            # specify the output dimension
            shape=[1, len(sequences)],
            raw_outputs=[task_text_generation_chat_output],
        )


# now simply declare a global deployable instance with the model weight or folder name
# and specify whether this model is going to use a GPU or not
deployable = InstillDeployable(
    TinyLlama, model_weight_or_folder_name="tinyllama", use_gpu=True
)

# you can also have fine-grained control of the min/max replica numbers
deployable.update_max_replicas(2)
deployable.update_min_replicas(0)
# we plan to open up more detailed resource control in the future
```