The Text component is an operator that lets users extract and manipulate text from different sources. It can carry out the following tasks:

- Convert To Text
- Split By Token
# Release Stage
Alpha
# Configuration
The component configuration is defined and maintained here.
# Supported Tasks
# Convert To Text
Converts a document to plain text.
| Input | ID | Type | Description |
|---|---|---|---|
| Task ID (required) | task | string | TASK_CONVERT_TO_TEXT |
| Document (required) | doc | string | Base64-encoded document (PDF, DOC, DOCX, XML, HTML, RTF, etc.) to be converted to plain text |
| Output | ID | Type | Description |
|---|---|---|---|
| Body | body | string | Plain text converted from the document |
| Meta | meta | object | Metadata extracted from the document |
| MSecs | msecs | number | Time taken to convert the document, in milliseconds |
| Error | error | string | Error message, if any, raised during the conversion process |
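
As an illustration of the input table above, the following sketch builds a `TASK_CONVERT_TO_TEXT` payload in Python. The field names `task` and `doc` come from the table; the helper name `build_convert_to_text_input` and the sample HTML bytes are hypothetical, and how the payload is actually submitted to the component is not covered here.

```python
import base64
import json


def build_convert_to_text_input(document_bytes: bytes) -> dict:
    """Build an input payload for TASK_CONVERT_TO_TEXT (hypothetical helper)."""
    return {
        "task": "TASK_CONVERT_TO_TEXT",
        # `doc` carries the raw document bytes, Base64-encoded as a string.
        "doc": base64.b64encode(document_bytes).decode("ascii"),
    }


# Stand-in for the bytes of a real file, e.g. open(path, "rb").read().
sample_document = b"<html><body><p>Hello, world.</p></body></html>"

payload = build_convert_to_text_input(sample_document)
print(json.dumps(payload, indent=2))
```

On the output side, a caller would presumably check `error` before reading `body` and `meta`.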
# Split By Token
Splits text into chunks based on token count.
| Input | ID | Type | Description |
|---|---|---|---|
| Task ID (required) | task | string | TASK_SPLIT_BY_TOKEN |
| Text (required) | text | string | Text to be split |
| Model (required) | model | string | ID of the model to use for tokenization |
| Chunk Token Size | chunk_token_size | integer | Number of tokens per text chunk |
| Output | ID | Type | Description |
|---|---|---|---|
| Token Count | token_count | integer | Total count of tokens in the input text |
| Text Chunks | text_chunks | array[string] | Text chunks after splitting |
| Number of Text Chunks | chunk_num | integer | Total number of output text chunks |
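
To make the relationship between the inputs and outputs concrete, here is a minimal Python sketch of token-based chunking. It is not the component's implementation: whitespace-separated words stand in for model tokens (the real task tokenizes with the model given in `model`), the function name `split_by_token` is hypothetical, and chunks are assumed not to overlap.

```python
def split_by_token(text: str, chunk_token_size: int) -> dict:
    """Illustrative splitter that mirrors the task's output fields."""
    if chunk_token_size <= 0:
        raise ValueError("chunk_token_size must be a positive integer")

    # ASSUMPTION: whitespace tokens stand in for the model's tokenizer.
    tokens = text.split()

    # Group consecutive tokens into chunks of at most chunk_token_size.
    text_chunks = [
        " ".join(tokens[i:i + chunk_token_size])
        for i in range(0, len(tokens), chunk_token_size)
    ]

    return {
        "token_count": len(tokens),
        "text_chunks": text_chunks,
        "chunk_num": len(text_chunks),
    }


result = split_by_token(
    "the quick brown fox jumps over the lazy dog", chunk_token_size=4
)
print(result)
```

With nine stand-in tokens and a chunk size of four, the sketch yields three chunks, the last holding the single leftover token. Counts from a real model tokenizer will generally differ from word counts.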