Supported AI Tasks | Documentation

When you import a model into Instill Model, it is automatically categorised into one of the standardised AI tasks by identifying the relevant metadata in the model card.

In a data pipeline, a model is the core component designed to solve a specific AI task. By standardising the data format of model outputs into AI tasks:

models in a pipeline are modularised: you can freely swap models in a pipeline as long as they are designed for the same task;
VDP produces a stream of data from models in a standard format for use in a data integration or ETL pipeline.

At the moment, Instill Model defines the data interface for popular tasks:

Image Classification - classify images into predefined categories
Object Detection - detect and localise multiple objects in images
Keypoint Detection - detect and localise multiple keypoints of objects in images
OCR (Optical Character Recognition) - detect and recognise text in images
Instance Segmentation - detect, localise and delineate multiple objects in images
Semantic Segmentation - classify image pixels into predefined categories
Text to Image - generate images from input text prompts
Image to Image - generate images from input image prompts
Text Generation - generate texts from input text prompts
Text Generation Chat - generate chat style texts from input text prompts
Visual Question Answering - generate chat style texts from input text and image prompts
The list is growing ... 🌱

The above tasks focus on analysing and understanding the content of data in the same way as humans do. The goal is to have a computer or device describe the data as completely and accurately as possible. These primitive tasks are the foundation for building many real-world industrial AI applications. Each task is described in depth in the respective section below.
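
Because the standardised responses shown in the sections below share one envelope (a `task` string plus a `task_outputs` list whose items are keyed by the task name), downstream code can unwrap results generically. The following is a minimal Python sketch under that assumption. The `TASK_OUTPUT_KEYS` table and the `unwrap_task_outputs` helper are illustrative names, not part of Instill Model or VDP, and the table only covers tasks that have an example on this page.

```python
from typing import Any, Dict, List

# Mapping from the "task" value to the key used inside each "task_outputs" item.
# Built from the examples on this page; it is an illustration, not an official list.
TASK_OUTPUT_KEYS: Dict[str, str] = {
    "TASK_CLASSIFICATION": "classification",
    "TASK_DETECTION": "detection",
    "TASK_KEYPOINT": "keypoint",
    "TASK_OCR": "ocr",
    "TASK_INSTANCE_SEGMENTATION": "instance_segmentation",
    "TASK_SEMANTIC_SEGMENTATION": "semantic_segmentation",
    "TASK_TEXT_TO_IMAGE": "text_to_image",
    "TASK_TEXT_GENERATION": "text_generation",
    "TASK_TEXT_GENERATION_CHAT": "text_generation_chat",
    "TASK_VISUAL_QUESTION_ANSWERING": "visual_question_answering",
}


def unwrap_task_outputs(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the task-specific payload of every item in "task_outputs"."""
    key = TASK_OUTPUT_KEYS[response["task"]]  # raises KeyError for unknown tasks
    return [item[key] for item in response["task_outputs"]]
```

Since the payload shape depends only on the task, code written this way keeps working when one model is swapped for another model of the same task.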

#Image Classification

Image Classification is a Vision task to assign a single pre-defined category label to an entire input image. Generally, an Image Classification model takes an image as the input, and outputs a prediction about what category this image belongs to and a confidence score (usually between 0 and 1) representing the likelihood that the prediction is correct.

Image Classification task

{
  "task": "TASK_CLASSIFICATION",
  "task_outputs": [
    {
      "classification": {
        "category": "golden retriever",
        "score": 0.98
      }
    }
  ]
}
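
As a small usage sketch, the snippet below reads the label from a response shaped like the example above and discards it when the confidence score is low. The helper name `confident_label` and the 0.5 default threshold are assumptions chosen for illustration.

```python
from typing import Any, Dict, Optional


def confident_label(response: Dict[str, Any], min_score: float = 0.5) -> Optional[str]:
    """Return the predicted category, or None if its score is below min_score."""
    result = response["task_outputs"][0]["classification"]
    return result["category"] if result["score"] >= min_score else None


response = {
    "task": "TASK_CLASSIFICATION",
    "task_outputs": [{"classification": {"category": "golden retriever", "score": 0.98}}],
}
label = confident_label(response)
```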

Available models

🔮 Model	Sources	Framework	CPU	GPU
MobileNet v2	GitHub, GitHub-DVC	ONNX	✅	✅
Vision Transformer (ViT)	Hugging Face	ONNX	✅	❌

#Object Detection

Object Detection is a Vision task to localise multiple objects of pre-defined categories in an input image. Generally, an Object Detection model receives an image as the input, and outputs bounding boxes with category labels and confidence scores on detected objects.

Object Detection task

{
  "task": "TASK_DETECTION",
  "task_outputs": [
    {
      "detection": {
        "objects": [
          {
            "category": "dog",
            "score": 0.97,
            "bounding_box": {
              "top": 102,
              "left": 324,
              "width": 208,
              "height": 405
            }
          },
          ...
        ]
      }
    }
  ]
}
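
Many drawing and cropping utilities expect corner coordinates rather than `top`/`left`/`width`/`height`. The sketch below converts a bounding box and filters detections by score. It assumes the box is expressed in pixels with the origin at the top-left corner of the image; `box_corners`, `confident_objects` and the 0.5 threshold are hypothetical names and values used only for this example.

```python
from typing import Any, Dict, List, Tuple


def box_corners(bounding_box: Dict[str, float]) -> Tuple[float, float, float, float]:
    """Convert a top/left/width/height box to (x_min, y_min, x_max, y_max)."""
    x_min = bounding_box["left"]
    y_min = bounding_box["top"]
    return (x_min, y_min, x_min + bounding_box["width"], y_min + bounding_box["height"])


def confident_objects(detection: Dict[str, Any], min_score: float = 0.5) -> List[Dict[str, Any]]:
    """Keep only the detected objects whose score reaches min_score."""
    return [obj for obj in detection["objects"] if obj["score"] >= min_score]


detection = {
    "objects": [
        {
            "category": "dog",
            "score": 0.97,
            "bounding_box": {"top": 102, "left": 324, "width": 208, "height": 405},
        }
    ]
}
corners = [box_corners(obj["bounding_box"]) for obj in confident_objects(detection)]
# corners == [(324, 102, 532, 507)]
```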

Available models

🔮 Model	Sources	Framework	CPU	GPU
YOLOv4	GitHub-DVC	ONNX	✅	✅
YOLOv7	GitHub-DVC	ONNX	✅	✅

#Keypoint Detection

Keypoint Detection is a Vision task to localise multiple objects by identifying their pre-defined keypoints, for example, the keypoints of a human body: nose, eyes, ears, shoulders, elbows, wrists, hips, knees and ankles. Normally, a Keypoint Detection task takes an image as the input, and outputs the coordinates and visibility of keypoints with bounding boxes and confidence scores on detected objects.

Keypoint Detection task

{
  "task": "TASK_KEYPOINT",
  "task_outputs": [
    {
      "keypoint": {
        "objects": [
          {
            "keypoints": [
              {
                "v": 0.53722847,
                "x": 542.82764,
                "y": 86.63817
              },
              {
                "v": 0.634061,
                "x": 553.0073,
                "y": 79.440636
              },
              ...
            ],
            "score": 0.94,
            "bounding_box": {
              "top": 86,
              "left": 185,
              "width": 571,
              "height": 203
            }
          },
          ...
        ]
      }
    }
  ]
}
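
When drawing a skeleton, it is common to skip keypoints that are unlikely to be visible. The sketch below assumes that `v` is a visibility score where higher means more visible, as the example values suggest; the helper name `visible_keypoints` and the threshold values are illustrative choices.

```python
from typing import Any, Dict, List, Tuple


def visible_keypoints(obj: Dict[str, Any], min_visibility: float = 0.5) -> List[Tuple[float, float]]:
    """Return the (x, y) coordinates of keypoints whose visibility reaches the threshold."""
    return [(kp["x"], kp["y"]) for kp in obj["keypoints"] if kp["v"] >= min_visibility]


# One detected object, using the two keypoints from the example above.
detected = {
    "keypoints": [
        {"v": 0.53722847, "x": 542.82764, "y": 86.63817},
        {"v": 0.634061, "x": 553.0073, "y": 79.440636},
    ],
    "score": 0.94,
}
points = visible_keypoints(detected, min_visibility=0.6)
# points == [(553.0073, 79.440636)]
```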

Available models

🔮 Model	Sources	Framework	CPU	GPU
YOLOv7 W6 Pose	GitHub-DVC	ONNX	✅	✅

#Optical Character Recognition (OCR)

OCR is a Vision task to localise and recognise text in an input image. The task can be done in two steps by multiple models: a text detection model to detect bounding boxes containing text and a text recognition model to process typed or handwritten text within each bounding box into machine readable text. Alternatively, there are deep learning models that can accomplish the task in one single step.

OCR task

{
  "task": "TASK_OCR",
  "task_outputs": [
    {
      "ocr": {
        "objects": [
          {
            "text": "ENDS",
            "score": 0.99,
            "bounding_box": {
              "top": 298,
              "left": 279,
              "width": 134,
              "height": 59
            }
          },
          {
            "text": "PAVEMENT",
            "score": 0.99,
            "bounding_box": {
              "top": 228,
              "left": 216,
              "width": 255,
              "height": 65
            }
          }
        ]
      }
    }
  ]
}
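
The detected text objects are not necessarily listed in reading order: in the example above, "ENDS" appears before "PAVEMENT" although it sits lower in the image. The sketch below sorts objects by the `top` and then `left` of their bounding boxes. This is a simple heuristic that assumes one text object per line and left-to-right, top-to-bottom reading; the helper name `reading_order_text` is hypothetical.

```python
from typing import Any, Dict


def reading_order_text(ocr: Dict[str, Any]) -> str:
    """Join recognised texts sorted top-to-bottom, then left-to-right."""
    ordered = sorted(
        ocr["objects"],
        key=lambda obj: (obj["bounding_box"]["top"], obj["bounding_box"]["left"]),
    )
    return " ".join(obj["text"] for obj in ordered)


ocr = {
    "objects": [
        {"text": "ENDS", "score": 0.99,
         "bounding_box": {"top": 298, "left": 279, "width": 134, "height": 59}},
        {"text": "PAVEMENT", "score": 0.99,
         "bounding_box": {"top": 228, "left": 216, "width": 255, "height": 65}},
    ]
}
sentence = reading_order_text(ocr)
# sentence == "PAVEMENT ENDS"
```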

Available models

🔮 Model	Sources	Framework	CPU	GPU
PSNet + EasyOCR	GitHub-DVC	ONNX	✅	✅

#Instance Segmentation

Instance Segmentation is a Vision task to detect and delineate multiple objects of pre-defined categories in an input image. Normally, the task takes an image as the input, and outputs uncompressed run-length encoding (RLE) representations (a variable-length comma-delimited string), with bounding boxes, category labels and confidence scores on detected objects.

Instance Segmentation task

Run-length encoding (RLE) is an efficient form to store binary masks. It is commonly used to encode the location of foreground objects in segmentation. We adopt the uncompressed RLE definition used in the COCO dataset. It divides a binary mask (which must be in column-major order) into a series of piecewise constant regions and for each piece simply stores the length of that piece.

Examples of encoding masks into RLEs and decoding masks encoded via RLEs

The above image shows examples of encoding masks into RLEs and decoding masks encoded via RLEs. Note that the counts at odd positions in an RLE (the 1st, 3rd, 5th, ...) are always the numbers of zeros.

INFO

Check out functions to encode masks into RLEs and decode masks encoded via RLEs.
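
To illustrate the uncompressed RLE scheme described above, here is a minimal pure-Python sketch of an encoder and a decoder. It is not the Instill implementation referred to above: the names `rle_encode` and `rle_decode` are hypothetical, masks are assumed to be lists of rows holding 0/1 values, and the decoder assumes the caller already knows the mask height and width.

```python
from typing import List


def rle_encode(mask: List[List[int]]) -> str:
    """Encode a binary mask (list of rows) as an uncompressed RLE string."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    # Flatten in column-major order: walk down each column, left to right.
    flat = [mask[row][col] for col in range(width) for row in range(height)]
    counts: List[int] = []
    current, run = 0, 0  # the first count always refers to zeros
    for pixel in flat:
        if pixel == current:
            run += 1
        else:
            counts.append(run)  # may be 0 when the mask starts with a one
            current, run = pixel, 1
    counts.append(run)
    return ",".join(str(count) for count in counts)


def rle_decode(rle: str, height: int, width: int) -> List[List[int]]:
    """Decode an uncompressed RLE string into a binary mask (list of rows)."""
    flat: List[int] = []
    value = 0  # counts alternate between zeros and ones, starting with zeros
    for count in (int(part) for part in rle.split(",")):
        flat.extend([value] * count)
        value = 1 - value
    if len(flat) != height * width:
        raise ValueError("RLE length does not match the given mask size")
    # Undo the column-major flattening.
    return [[flat[col * height + row] for col in range(width)] for row in range(height)]


mask = [
    [0, 1, 1],
    [0, 1, 0],
]
encoded = rle_encode(mask)           # "2,3,1"
decoded = rle_decode(encoded, 2, 3)  # equal to mask
```

In this toy mask the column-major pixel sequence is 0, 0, 1, 1, 1, 0, which gives two zeros, three ones and one zero.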

{
  "task": "TASK_INSTANCE_SEGMENTATION",
  "task_outputs": [
    {
      "instance_segmentation": {
        "objects": [
          {
            "rle": "2918,12,382,33,...",
            "score": 0.99,
            "bounding_box": {
              "top": 95,
              "left": 320,
              "width": 215,
              "height": 406
            },
            "category": "dog"
          },
          {
            "rle": "34,18,230,18,...",
            "score": 0.97,
            "bounding_box": {
              "top": 194,
              "left": 130,
              "width": 197,
              "height": 248
            },
            "category": "dog"
          }
        ]
      }
    }
  ]
}

Available models

🔮 Model	Sources	Framework	CPU	GPU
Mask RCNN	GitHub-DVC	PyTorch	✅	✅

#Semantic Segmentation

Semantic Segmentation is a Vision task of assigning a class label to every pixel in an image. Normally, the task takes an image as the input, and outputs, for each group of pixels that share a category, the category label and a segmentation mask in RLE representation (a variable-length comma-delimited string).

Semantic Segmentation task

{
  "task": "TASK_SEMANTIC_SEGMENTATION",
  "task_outputs": [
    {
      "semantic_segmentation": {
        "stuffs": [
          {
            "rle": "2918,12,382,33,...",
            "category": "person"
          },
          {
            "rle": "34,18,230,18,...",
            "category": "sky"
          },
          ...
        ]
      }
    }
  ]
}
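
One quantity that can be read directly from such an output is how many pixels each category covers. Assuming the `rle` strings follow the uncompressed RLE convention described under Instance Segmentation (counts alternate between zeros and ones, starting with zeros), the foreground area is the sum of every second count. The sketch below uses short, made-up RLE strings for a 6-pixel image because the strings in the example above are elided; `foreground_pixels` and `area_by_category` are hypothetical helper names.

```python
from typing import Any, Dict


def foreground_pixels(rle: str) -> int:
    """Count the ones in an uncompressed RLE string (every second count)."""
    counts = [int(part) for part in rle.split(",")]
    return sum(counts[1::2])


def area_by_category(semantic_segmentation: Dict[str, Any]) -> Dict[str, int]:
    """Sum the foreground pixels per category over all "stuffs" entries."""
    areas: Dict[str, int] = {}
    for stuff in semantic_segmentation["stuffs"]:
        category = stuff["category"]
        areas[category] = areas.get(category, 0) + foreground_pixels(stuff["rle"])
    return areas


# Toy data for a 6-pixel image; not taken from a real model output.
toy_output = {
    "stuffs": [
        {"rle": "2,3,1", "category": "person"},
        {"rle": "0,2,3,1", "category": "sky"},
    ]
}
areas = area_by_category(toy_output)
# areas == {"person": 3, "sky": 3}
```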

Available models

🔮 Model	Sources	Framework	CPU	GPU
Lite R-ASPP based on MobileNetV3	GitHub-DVC	ONNX	✅	✅

#Text to Image

Text to Image is a Generative AI task to generate images from text inputs. Generally, the task takes descriptive text prompts as the input, and outputs generated images in Base64 format based on the text prompts.

Text to Image task

{
  "task": "TASK_TEXT_TO_IMAGE",
  "task_outputs": [
    {
      "text_to_image": {
        "images": ["/9j/4AAQSkZJRgABAQAAAQABAAD/..."]
      }
    }
  ]
}

DECODE BASE64 IMAGES

In the above example, the generated images are returned as a list of Base64-encoded images. To obtain an image, decode the Base64 string as in the snippet below.

import base64

# `out` is assumed to be one element of the "task_outputs" list shown above
# Decode the first image result
base64_image = out['text_to_image']['images'][0]
image = base64.b64decode(base64_image)
# Save the decoded image
filename = 'text_to_image.jpg'
with open(filename, 'wb') as f:
    f.write(image)

Available models

🔮 Model	Sources	Framework	CPU	GPU
Stable Diffusion	GitHub-DVC, Local-CPU, Local-GPU	ONNX	✅	✅
Stable Diffusion XL	GitHub-DVC	PyTorch	❌	✅

TIP

Importing Stable Diffusion from GitHub will take a while. Alternatively, you can download the model locally as a one-time effort.

Step 1: Download Stable Diffusion v1.5 CPU sample model.

curl https://artifacts.instill.tech/vdp/sample-models/stable-diffusion-1-5-cpu.zip --output stable-diffusion-1-5-cpu.zip

Step 2: Refer to the guideline on importing local models via no-code or low-code.

#Text Generation

Text Generation is a Generative AI task to generate new text from text inputs. Generally, the task takes incomplete text prompts as the input, and produces new text based on the prompts. The task can fill in incomplete sentences or even generate full stories given the first words.

Text Generation task

{
  "task": "TASK_TEXT_GENERATION",
  "task_outputs": [
    {
      "text_generation": {
        "text": "The winds of change are blowing strong, bring new beginnings, righting wrongs. The world around us is constantly turning, and with each sunrise, our spirits are yearning."
      }
    }
  ]
}

Available models

🔮 Model	Sources	Framework	CPU	GPU
Llama2	GitHub-DVC	Transformer	❌	✅
Code Llama	GitHub-DVC	Transformer	❌	✅
Llama3-instruct	GitHub-DVC	Transformer	❌	✅

TIP

Depending on your internet speed, importing LLM models will take a while.

Some models only support GPU deployment. By default, VDP can access all your GPUs.

#Text Generation Chat

Text Generation Chat is a Generative AI task to generate new text from text inputs in chat style. Generally, the task takes a series of conversation messages as the input, and produces a new response to the prompts. The task can carry on a conversation and even answer questions based on previous context.

Text Generation Chat task

{
  "task": "TASK_TEXT_GENERATION_CHAT",
  "task_outputs": [
    {
      "text_generation_chat": {
        "text": "What a delicate situation!\n\nI must advise that it's generally not a good idea to..."
      }
    }
  ]
}

Available models

🔮 Model	Sources	Framework	CPU	GPU
Llama2 Chat	GitHub-DVC	Transformer	❌	✅
MosaicML MPT	GitHub-DVC	Transformer	❌	✅
Mistral	GitHub-DVC	Transformer	❌	✅
Zephyr-7b	GitHub-DVC	Transformer	✅	✅

TIP

Depending on your internet speed, importing LLM models will take a while.

Some models only support GPU deployment. By default, VDP can access all your GPUs.

#Visual Question Answering

Visual Question Answering (VQA) is a task in computer vision that involves answering questions about an image. The goal of VQA is to teach machines to understand the content of an image and answer questions about it in natural language.

Visual Question Answering

{
  "task": "TASK_VISUAL_QUESTION_ANSWERING",
  "task_outputs": [
    {
      "visual_question_answering": {
        "text": "The image appears to show a close-up view of a plant's leaf or a similar plant part."
      }
    }
  ]
}

Available models

🔮 Model	Sources	Framework	CPU	GPU
Llava-1-6	GitHub-DVC	Transformer	❌	✅

TIP

Depending on your internet speed, importing LLM models will take a while.

Some models only support GPU deployment. By default, VDP can access all your GPUs.

#Unspecified Task

Instill Model is very flexible and allows you to import models even if your task is not standardised yet or the output of the model can't be converted to the format of supported AI tasks. The model will be classified as an Unspecified task. When you send an image to the model as the input, VDP will

check the config.pbtxt model configuration file to extract the names, datatypes and shapes of the model outputs,
and wrap this information along with the raw model output in a standard format.

Unspecified task

{
  "unspecified": {
    "raw_outputs": [
      {
        "data": [0.85, 0.1, 0.05],
        "data_type": "FP32",
        "name": "output_scores",
        "shape": [3]
      },
      {
        "data": ["dog", "cat", "rabbit"],
        "data_type": "BYTES",
        "name": "output_labels",
        "shape": [3]
      }
    ]
  }
}
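
Because the raw outputs carry their own names, datatypes and shapes, client code has to interpret them itself. The sketch below indexes the raw outputs by name and pairs labels with scores. The output names `output_scores` and `output_labels` come from the example above and are specific to that model; the helper name `raw_outputs_by_name` is hypothetical.

```python
from typing import Any, Dict


def raw_outputs_by_name(response: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Index the raw outputs of an Unspecified task response by output name."""
    return {output["name"]: output for output in response["unspecified"]["raw_outputs"]}


response = {
    "unspecified": {
        "raw_outputs": [
            {"data": [0.85, 0.1, 0.05], "data_type": "FP32", "name": "output_scores", "shape": [3]},
            {"data": ["dog", "cat", "rabbit"], "data_type": "BYTES", "name": "output_labels", "shape": [3]},
        ]
    }
}

outputs = raw_outputs_by_name(response)
# Pair each label with its score; valid only because both outputs have shape [3].
pairs = list(zip(outputs["output_labels"]["data"], outputs["output_scores"]["data"]))
best_label, best_score = max(pairs, key=lambda pair: pair[1])
# best_label == "dog", best_score == 0.85
```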

#Suggest a New Task

Currently, the model output is converted to standard format based on the AI task outputs maintained in Protobuf.

If you'd like support for a new task, you can create an issue or request it in the #give-feedback channel on Discord.