To prepare a model for Instill Model:
- Create a model card `README.md` to describe your model
- Write a `model.py` file that defines the model class, which will be decorated into a servable model with Instill's python-sdk
- Organise the model files into a valid Instill Model layout
# Model Card
A model card is a `README.md` file that accompanies the model, describing useful information along with additional model metadata. Under the hood, a model card is associated with a specific model.
It is crucial for reproducibility, sharing and discoverability. We highly recommend adding a model card `README.md` file when preparing your model for use in Instill Model.
In a model card, you can provide information about:
- the model itself
- its use cases and limitations
- the datasets used to train the model
- the training experiments and configuration
- benchmarking and evaluation results
- reference materials
After importing a model into Instill Model, the model card will be rendered in the Console on the Model page. For example, the Console renders the model card of a model imported from the GitHub repository model-mobilenetv2.
Try our Import GitHub models guide to import a model from GitHub.
# Model Card Metadata
You can insert Front Matter in a model card to define the model metadata. Start with three dashes (`---`) at the top, include all the metadata, and close the section with another `---`, like the example below.
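For instance, a model card for the TinyLlama text-generation model used later in this guide might start with front matter like this (the `Task` identifier and `Tags` values are illustrative; the task must match one of the supported AI tasks described in the next section):

```
---
Task: TEXT_GENERATION_CHAT
Tags:
  - Text-Generation
  - TinyLlama
---

# TinyLlama

A short description of the model, its use cases, limitations, training data and evaluation results.
```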
# Specify an AI Task
When importing the model, Instill Model detects the `Task` in the model card and verifies that the model output fulfils the AI task requirements.
If the model is verified, Instill Model will automatically convert the model output into the corresponding standardised VDP AI task format whenever the model is used.
Please check the supported AI tasks and the corresponding output format for each task.
If the `Task` is not specified, the model will be recognised with the `Unspecified` AI task, and the raw model output will be wrapped in a standard format.
❓ How do I know if the AI task metadata is correctly recognised?
If you include valid AI task metadata, it will show on the Model page of the Console.
# Model Layout
With Ray under the hood for model serving, Instill Model extends its support to any deep learning framework the user desires. To deploy a model on Instill Model, we suggest preparing the model files in a layout similar to the following:
```
.
├── README.md
├── model.py
└── <weights>
    ├── <weight_file_1>
    ├── <weight_file_2>
    ├── ...
    └── <weight_file_n>
```
The above layout displays a typical Instill Model model consisting of:
- `README.md` - the model card that embeds the metadata in front matter and the descriptions in Markdown format
- `model.py` - where you define the decorated model class that contains the custom inference logic
- `<weights>` - the directory that holds the necessary weight files
You can name the `<weights>` folder freely, provided that the folder name is clear and semantic.
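For instance, the TinyLlama model implemented later in this guide could be organised as follows. The `tinyllama/` folder name and the individual weight file names are only illustrative placeholders; the folder name is assumed to match the `model_weight_or_folder_name` passed to `InstillDeployable` in `model.py`:

```
.
├── README.md
├── model.py
└── tinyllama
    ├── config.json
    ├── model.safetensors
    └── tokenizer.json
```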
As long as the model class in your `model.py` implements the necessary methods, it can be safely imported into Instill Model and deployed online.
Check out this guide for more details.
# Prepare model.py
To implement a custom model that can be imported and served on Instill Model, you only need to implement a simple model class within the `model.py` file.
The custom model class needs to implement the following methods; a minimal skeleton is sketched after the list:
- `__init__` - define the model loading process here, allowing the weights to be stored in memory and enabling faster auto-scaling behaviour
- `ModelMetadata` - this method tells the backend service the input/output shapes the model expects; if you are using our predefined AI Tasks, you can simply import `construct_{task}_metadata_response` and return its result
- `__call__` - the inference request entrypoint, where you implement your model inference logic
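Putting these together, a bare-bones `model.py` might look roughly like the sketch below. The class name `MyModel` and the placeholder arguments are hypothetical; the sketch only illustrates the required structure, using the same SDK decorators as the full example that follows:

```python
from instill.helpers.ray_config import instill_deployment, InstillDeployable


# decorate the class so it can be served by Instill Model
@instill_deployment
class MyModel:
    def __init__(self):
        # load the model weights into memory here for faster auto-scaling
        ...

    def ModelMetadata(self, req):
        # return the expected input/output shapes, e.g. via a
        # construct_{task}_metadata_response helper for a predefined AI Task
        ...

    async def __call__(self, request):
        # parse the request, run inference and build the response here
        ...


# expose a global deployable instance so Instill Model can serve this class
deployable = InstillDeployable(
    MyModel, model_weight_or_folder_name="<weights>", use_gpu=False
)
```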
The following is a simple implementation of the TinyLlama model with explanations.
```python
# import necessary packages
import torch
from transformers import pipeline

# import SDK helper functions
# const package hosts the standard Datatypes and Input class for each standard Instill AI Task
from instill.helpers.const import TextGenerationChatInput

# ray_io package hosts the parsers to easily convert request payloads into input parameters, and model outputs into responses
from instill.helpers.ray_io import StandardTaskIO

# ray_config package hosts the decorators and deployment object for the model class
from instill.helpers.ray_config import instill_deployment, InstillDeployable
from instill.helpers import (
    construct_text_generation_chat_infer_response,
    construct_text_generation_chat_metadata_response,
)


# use the instill_deployment decorator to convert the model class into a servable model
@instill_deployment
class TinyLlama:
    # within the __init__ function, set up the model instance with the desired framework,
    # in this case the pipeline from transformers
    def __init__(self):
        self.pipeline = pipeline(
            "text-generation",
            model="tinyllama",
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )

    # ModelMetadata tells the server what inputs and outputs the model is expecting
    def ModelMetadata(self, req):
        return construct_text_generation_chat_metadata_response(req=req)

    # __call__ is the method handling the trigger request from Instill Model
    async def __call__(self, request):
        # use the StandardTaskIO package to parse the request and get the corresponding input
        # for the text-generation-chat task
        task_text_generation_chat_input: TextGenerationChatInput = (
            StandardTaskIO.parse_task_text_generation_chat_input(request=request)
        )

        # prepare the prompt with the chat template
        prompt = self.pipeline.tokenizer.apply_chat_template(
            task_text_generation_chat_input.chat_history,
            tokenize=False,
            add_generation_prompt=True,
        )

        # inference
        sequences = self.pipeline(
            prompt,
            max_new_tokens=task_text_generation_chat_input.max_new_tokens,
            do_sample=True,
            temperature=task_text_generation_chat_input.temperature,
            top_k=task_text_generation_chat_input.top_k,
            top_p=0.95,
        )

        # convert the output into the response output, again with StandardTaskIO
        task_text_generation_chat_output = (
            StandardTaskIO.parse_task_text_generation_chat_output(sequences=sequences)
        )

        return construct_text_generation_chat_infer_response(
            req=request,
            # specify the output dimension
            shape=[1, len(sequences)],
            raw_outputs=[task_text_generation_chat_output],
        )


# now simply declare a global deployable instance with the model weight or folder name
# and specify whether this model is going to use a GPU or not
deployable = InstillDeployable(
    TinyLlama, model_weight_or_folder_name="tinyllama", use_gpu=True
)

# you can also have fine-grained control of the min/max replica numbers
deployable.update_max_replicas(2)
deployable.update_min_replicas(0)
# we plan to open up more detailed resource control in the future
```