docs(style): clean up style for data/spec field types by badmonster0 · Pull Request #669 · cocoindex-io/cocoindex · GitHub

docs(style): clean up style for data/spec field types #669


Merged 1 commit on Jun 28, 2025
40 changes: 20 additions & 20 deletions docs/docs/ops/functions.md
@@ -11,10 +11,10 @@ description: CocoIndex Built-in Functions

The spec takes the following fields:

* `text` (type: `str`, required): The source text to parse.
* `language` (type: `str`, optional): The language of the source text. Only `json` is supported now. Default to `json`.
* `text` (`str`): The source text to parse.
* `language` (`str`, optional): The language of the source text. Only `json` is supported now. Default to `json`.

Return type: `Json`
Return: *Json*

## SplitRecursively

@@ -64,7 +64,7 @@ Input data:

:::

Return type: [*KTable*](/docs/core/data_types#ktable), each row represents a chunk, with the following sub fields:
Return: [*KTable*](/docs/core/data_types#ktable), each row represents a chunk, with the following sub fields:

* `location` (*Range*): The location of the chunk.
* `text` (*Str*): The text of the chunk.
@@ -79,22 +79,22 @@ Return type: [*KTable*](/docs/core/data_types#ktable), each row represents a chu

The spec takes the following fields:

* `model` (type: `str`, required): The name of the SentenceTransformer model to use.
* `args` (type: `dict[str, Any]`, optional): Additional arguments to pass to the SentenceTransformer constructor. e.g. `{"trust_remote_code": True}`
* `model` (`str`): The name of the SentenceTransformer model to use.
* `args` (`dict[str, Any]`, optional): Additional arguments to pass to the SentenceTransformer constructor. e.g. `{"trust_remote_code": True}`

Input data:

* `text` (type: `str`, required): The text to embed.
* `text` (*Str*): The text to embed.

Return type: `vector[float32; N]`, where `N` is determined by the model
Return: *Vector[Float32, N]*, where *N* is determined by the model

## ExtractByLlm

`ExtractByLlm` extracts structured information from a text using specified LLM. The spec takes the following fields:

* `llm_spec` (type: `cocoindex.LlmSpec`, required): The specification of the LLM to use. See [LLM Spec](/docs/ai/llm#llm-spec) for more details.
* `output_type` (type: `type`, required): The type of the output. e.g. a dataclass type name. See [Data Types](/docs/core/data_types) for all supported data types. The LLM will output values that match the schema of the type.
* `instruction` (type: `str`, optional): Additional instruction for the LLM.
* `llm_spec` (`cocoindex.LlmSpec`): The specification of the LLM to use. See [LLM Spec](/docs/ai/llm#llm-spec) for more details.
* `output_type` (`type`): The type of the output. e.g. a dataclass type name. See [Data Types](/docs/core/data_types) for all supported data types. The LLM will output values that match the schema of the type.
* `instruction` (`str`, optional): Additional instruction for the LLM.

:::tip Clear type definitions

@@ -109,25 +109,25 @@ To improve the quality of the extracted information, giving clear definitions fo

Input data:

* `text` (type: `str`, required): The text to extract information from.
* `text` (*Str*): The text to extract information from.

Return type: As specified by the `output_type` field in the spec. The extracted information from the input text.
Return: As specified by the `output_type` field in the spec. The extracted information from the input text.

## EmbedText

`EmbedText` embeds a text into a vector space using various LLM APIs that support text embedding.

The spec takes the following fields:

* `api_type` (type: [`cocoindex.LlmApiType`](/docs/ai/llm#llm-api-types), required): The type of LLM API to use for embedding.
* `model` (type: `str`, required): The name of the embedding model to use.
* `address` (type: `str`, optional): The address of the LLM API. If not specified, uses the default address for the API type.
* `output_dimension` (type: `int`, optional): The expected dimension of the output embedding vector. If not specified, use the default dimension of the model.
* `api_type` ([`cocoindex.LlmApiType`](/docs/ai/llm#llm-api-types)): The type of LLM API to use for embedding.
* `model` (`str`): The name of the embedding model to use.
* `address` (`str`, optional): The address of the LLM API. If not specified, uses the default address for the API type.
* `output_dimension` (`int`, optional): The expected dimension of the output embedding vector. If not specified, use the default dimension of the model.

For most API types, the function internally keeps a registry for the default output dimension of known model.
You need to explicitly specify the `output_dimension` if you want to use a new model that is not in the registry yet.

* `task_type` (type: `str`, optional): The task type for embedding, used by some embedding models to optimize the embedding for specific use cases.
* `task_type` (`str`, optional): The task type for embedding, used by some embedding models to optimize the embedding for specific use cases.
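
The fallback described for `output_dimension` can be sketched as below. This is only an illustration of the documented behavior, not CocoIndex's implementation: the registry contents, the model names, and the `resolve_output_dimension` helper are all hypothetical.

```python
from typing import Mapping, Optional

# Hypothetical registry of default dimensions for known models.
# The real registry is internal to CocoIndex and is not shown in this diff.
KNOWN_MODEL_DIMENSIONS = {
    "example-model-small": 384,
    "example-model-large": 1024,
}


def resolve_output_dimension(
    model: str,
    output_dimension: Optional[int] = None,
    registry: Mapping[str, int] = KNOWN_MODEL_DIMENSIONS,
) -> int:
    """Pick the embedding dimension: explicit value first, then the registry."""
    if output_dimension is not None:
        # An explicitly specified dimension always wins.
        return output_dimension
    if model in registry:
        # Known model: fall back to its registered default.
        return registry[model]
    # Unknown model and no explicit value: the caller must specify one.
    raise ValueError(
        f"Model {model!r} is not in the registry; specify output_dimension."
    )
```

Under these assumptions, a model missing from the registry works only when `output_dimension` is passed explicitly.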

:::note Supported APIs for Text Embedding

@@ -137,6 +137,6 @@ Not all LLM APIs support text embedding. See the [LLM API Types table](/docs/ai/

Input data:

* `text` (type: `str`, required): The text to embed.
* `text` (*Str*, required): The text to embed.

Return type: `vector[float32; N]`, where `N` is the dimension of the embedding vector determined by the model.
Return: *Vector[Float32, N]*, where *N* is the dimension of the embedding vector determined by the model.
50 changes: 25 additions & 25 deletions docs/docs/ops/sources.md
@@ -13,11 +13,11 @@ The `LocalFile` source imports files from a local file system.
### Spec

The spec takes the following fields:
* `path` (type: `str`, required): full path of the root directory to import files from
* `binary` (type: `bool`, optional): whether reading files as binary (instead of text)
* `included_patterns` (type: `list[str]`, optional): a list of glob patterns to include files, e.g. `["*.txt", "docs/**/*.md"]`.
* `path` (`str`): full path of the root directory to import files from
* `binary` (`bool`, optional): whether reading files as binary (instead of text)
* `included_patterns` (`list[str]`, optional): a list of glob patterns to include files, e.g. `["*.txt", "docs/**/*.md"]`.
If not specified, all files will be included.
* `excluded_patterns` (type: `list[str]`, optional): a list of glob patterns to exclude files, e.g. `["tmp", "**/node_modules"]`.
* `excluded_patterns` (`list[str]`, optional): a list of glob patterns to exclude files, e.g. `["tmp", "**/node_modules"]`.
Any file or directory matching these patterns will be excluded even if they match `included_patterns`.
If not specified, no files will be excluded.
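
The precedence between `included_patterns` and `excluded_patterns` described above can be sketched as follows. This is a rough approximation under stated assumptions: `is_file_included` is a hypothetical helper, and Python's `fnmatch` stands in for whatever glob matcher the source actually uses, so edge cases (for example how `**` is treated) may differ.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Optional, Sequence


def _matches_any(path: str, patterns: Sequence[str]) -> bool:
    """Return True if the path matches at least one glob pattern."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def is_file_included(
    path: str,
    included_patterns: Optional[Sequence[str]] = None,
    excluded_patterns: Optional[Sequence[str]] = None,
) -> bool:
    """Hypothetical helper mirroring the documented include/exclude rules."""
    if excluded_patterns:
        # Exclusion wins: check the file itself and every parent directory.
        parents = [str(p) for p in PurePosixPath(path).parents if str(p) != "."]
        if any(_matches_any(c, excluded_patterns) for c in [path, *parents]):
            return False
    if included_patterns is None:
        # No include list means every non-excluded file is included.
        return True
    return _matches_any(path, included_patterns)
```

For instance, with `excluded_patterns=["tmp"]`, a file `tmp/readme.txt` is dropped even though it matches `*.txt`, because its parent directory matches an excluded pattern.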

@@ -29,9 +29,9 @@ The spec takes the following fields:

### Schema

The output is a [KTable](/docs/core/data_types#ktable) with the following sub fields:
* `filename` (key, type: `str`): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`
* `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file
The output is a [*KTable*](/docs/core/data_types#ktable) with the following sub fields:
* `filename` (*Str*, key): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`
* `content` (*Str* if `binary` is `False`, *Bytes* otherwise): the content of the file

## AmazonS3

@@ -121,12 +121,12 @@ AWS's [Guide of Configuring a Bucket for Notifications](https://docs.aws.amazon.
### Spec

The spec takes the following fields:
* `bucket_name` (type: `str`, required): Amazon S3 bucket name.
* `prefix` (type: `str`, optional): if provided, only files with path starting with this prefix will be imported.
* `binary` (type: `bool`, optional): whether reading files as binary (instead of text).
* `included_patterns` (type: `list[str]`, optional): a list of glob patterns to include files, e.g. `["*.txt", "docs/**/*.md"]`.
* `bucket_name` (`str`): Amazon S3 bucket name.
* `prefix` (`str`, optional): if provided, only files with path starting with this prefix will be imported.
* `binary` (`bool`, optional): whether reading files as binary (instead of text).
* `included_patterns` (`list[str]`, optional): a list of glob patterns to include files, e.g. `["*.txt", "docs/**/*.md"]`.
If not specified, all files will be included.
* `excluded_patterns` (type: `list[str]`, optional): a list of glob patterns to exclude files, e.g. `["*.tmp", "**/*.log"]`.
* `excluded_patterns` (`list[str]`, optional): a list of glob patterns to exclude files, e.g. `["*.tmp", "**/*.log"]`.
Any file or directory matching these patterns will be excluded even if they match `included_patterns`.
If not specified, no files will be excluded.

@@ -136,7 +136,7 @@ The spec takes the following fields:

:::

* `sqs_queue_url` (type: `str`, optional): if provided, the source will receive change event notifications from Amazon S3 via this SQS queue.
* `sqs_queue_url` (`str`, optional): if provided, the source will receive change event notifications from Amazon S3 via this SQS queue.

:::info

@@ -147,9 +147,9 @@ The spec takes the following fields:

### Schema

The output is a [KTable](/docs/core/data_types#ktable) with the following sub fields:
* `filename` (key, type: `str`): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`.
* `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file.
The output is a [*KTable*](/docs/core/data_types#ktable) with the following sub fields:
* `filename` (*Str*, key): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`.
* `content` (*Str* if `binary` is `False`, otherwise *Bytes*): the content of the file.


## GoogleDrive
@@ -176,10 +176,10 @@ To access files in Google Drive, the `GoogleDrive` source will need to authentic

The spec takes the following fields:

* `service_account_credential_path` (type: `str`, required): full path to the service account credential file in JSON format.
* `root_folder_ids` (type: `list[str]`, required): a list of Google Drive folder IDs to import files from.
* `binary` (type: `bool`, optional): whether reading files as binary (instead of text).
* `recent_changes_poll_interval` (type: `datetime.timedelta`, optional): when set, this source provides a change capture mechanism by polling Google Drive for recent modified files periodically.
* `service_account_credential_path` (`str`): full path to the service account credential file in JSON format.
* `root_folder_ids` (`list[str]`): a list of Google Drive folder IDs to import files from.
* `binary` (`bool`, optional): whether reading files as binary (instead of text).
* `recent_changes_poll_interval` (`datetime.timedelta`, optional): when set, this source provides a change capture mechanism by polling Google Drive for recent modified files periodically.

:::info

@@ -198,9 +198,9 @@ The spec takes the following fields:

### Schema

The output is a [KTable](/docs/core/data_types#ktable) with the following sub fields:
The output is a [*KTable*](/docs/core/data_types#ktable) with the following sub fields:

* `file_id` (key, type: `str`): the ID of the file in Google Drive.
* `filename` (type: `str`): the filename of the file, without the path, e.g. `"file1.md"`
* `mime_type` (type: `str`): the MIME type of the file.
* `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file.
* `file_id` (*Str*, key): the ID of the file in Google Drive.
* `filename` (*Str*): the filename of the file, without the path, e.g. `"file1.md"`
* `mime_type` (*Str*): the MIME type of the file.
* `content` (*Str* if `binary` is `False`, otherwise *Bytes*): the content of the file.