The Text component is an operator that lets users extract and manipulate text from different sources. It supports the following tasks: Convert To Text and Split By Token.
#Release Stage
Alpha
#Configuration
The component configuration is defined and maintained here.
#Supported Tasks
#Convert To Text
Convert a document to plain text.
| Input | ID | Type | Description | 
|---|---|---|---|
| Task ID (required) | task | string | TASK_CONVERT_TO_TEXT | 
| Document (required) | doc | string | Base64-encoded document (PDF, DOC, DOCX, XML, HTML, RTF, etc.) to be converted to plain text | 

| Output | ID | Type | Description | 
|---|---|---|---|
| Body | body | string | Plain text converted from the document | 
| Meta | meta | object | Metadata extracted from the document | 
| MSecs | msecs | number | Time taken to convert the document, in milliseconds | 
| Error | error | string | Error message, if any, from the conversion process | 
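
The tables above can be illustrated with a minimal Python sketch that builds a `TASK_CONVERT_TO_TEXT` input and reads an output object. The helper names, the sample output, and the assumption that an empty `error` string means success are all hypothetical and not taken from the component itself. How the input is submitted to the component is outside the scope of this sketch.

```python
import base64


def build_convert_to_text_input(document_bytes: bytes) -> dict:
    """Build an input object for TASK_CONVERT_TO_TEXT (hypothetical helper)."""
    return {
        "task": "TASK_CONVERT_TO_TEXT",
        # `doc` expects the document as a Base64-encoded string.
        "doc": base64.b64encode(document_bytes).decode("ascii"),
    }


def read_convert_to_text_output(output: dict) -> str:
    """Return `body`, or raise if `error` is non-empty.

    Assumption: an empty or missing `error` field means the conversion succeeded.
    """
    if output.get("error"):
        raise RuntimeError(f"conversion failed: {output['error']}")
    return output["body"]


sample_input = build_convert_to_text_input(b"<html><body>Hello</body></html>")

# Hand-written output for illustration only; not produced by the component.
sample_output = {"body": "Hello", "meta": {}, "msecs": 3, "error": ""}
text = read_convert_to_text_output(sample_output)
```

Checking `error` before reading `body` keeps a failed conversion from being mistaken for an empty document.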
#Split By Token
Split text into chunks by token count.
| Input | ID | Type | Description | 
|---|---|---|---|
| Task ID (required) | task | string | TASK_SPLIT_BY_TOKEN | 
| Text (required) | text | string | Text to be split | 
| Model (required) | model | string | ID of the model to use for tokenization | 
| Chunk Token Size | chunk_token_size | integer | Number of tokens per text chunk | 

| Output | ID | Type | Description | 
|---|---|---|---|
| Token Count | token_count | integer | Total count of tokens in the input text | 
| Text Chunks | text_chunks | array[string] | Text chunks after splitting | 
| Number of Text Chunks | chunk_num | integer | Total number of output text chunks |
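
To show how the three output fields relate to each other, here is a rough Python sketch of the splitting idea. It uses whitespace splitting as a stand-in tokenizer, which is an assumption made only for illustration: the component tokenizes with the model named in `model`, so real token counts and chunk boundaries will differ. The function name `split_by_token` is hypothetical as well.

```python
def split_by_token(text: str, chunk_token_size: int) -> dict:
    """Split text into chunks of at most `chunk_token_size` tokens.

    Assumption: tokens are whitespace-separated words. The real component
    uses the tokenizer of the model given in the `model` input instead.
    """
    if chunk_token_size <= 0:
        raise ValueError("chunk_token_size must be a positive integer")

    tokens = text.split()
    # Group consecutive tokens into fixed-size windows; the last may be shorter.
    chunks = [
        " ".join(tokens[start:start + chunk_token_size])
        for start in range(0, len(tokens), chunk_token_size)
    ]
    return {
        "token_count": len(tokens),
        "text_chunks": chunks,
        "chunk_num": len(chunks),
    }


result = split_by_token("the quick brown fox jumps over the lazy dog", 4)
```

In this sketch `chunk_num` always equals the length of `text_chunks`, and `token_count` counts the tokens of the whole input rather than of any single chunk.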