
Embeddings

GPT4All supports generating high-quality embeddings of arbitrary-length text using any embedding model supported by llama.cpp.

An embedding is a vector representation of a piece of text. Embeddings are useful for tasks such as retrieval for question answering (including retrieval augmented generation or RAG), semantic similarity search, classification, and topic clustering.
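Semantic similarity search is usually done by comparing embedding vectors with cosine similarity. The minimal sketch below ranks documents against a query in plain Python. The three-dimensional vectors are made-up placeholders, not real model output (real vectors would come from Embed4All.embed), and `cosine_similarity` and `rank_by_similarity` are hypothetical helper names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: list[float], docs: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Return (doc_id, score) pairs sorted from most to least similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings.
docs = {
    "fox": [0.9, 0.1, 0.0],
    "finance": [0.0, 0.2, 0.95],
}
query = [0.8, 0.2, 0.1]
print(rank_by_similarity(query, docs))
```

The same ranking logic applies unchanged to real 384- or 768-dimensional embeddings.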

Supported Embedding Models

The following models have built-in support in Embed4All:

| Name | Embed4All model_name | Context Length (tokens) | Embedding Length | File Size |
|---|---|---|---|---|
| SBert | all-MiniLM-L6-v2.gguf2.f16.gguf | 512 | 384 | 44 MiB |
| Nomic Embed v1 | nomic-embed-text-v1.f16.gguf | 2048 | 768 | 262 MiB |
| Nomic Embed v1.5 | nomic-embed-text-v1.5.f16.gguf | 2048 | 64-768 | 262 MiB |

The context length is the maximum number of word pieces, or tokens, that a model can embed at once. Texts longer than a model's context length must be split or truncated before they can be embedded; see Embedding Longer Texts for more information.

The embedding length is the size of the vector returned by Embed4All.embed.
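The table values can also be captured in code, for example to check that a returned vector has the length you expect before storing it. This is a sketch with hypothetical names (`EmbeddingModelSpec`, `MODEL_SPECS`, `expected_length`); the numbers are copied from the Supported Embedding Models table. Its range check is deliberately stricter than Embed4All, which only warns for dimensionality values below 64.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingModelSpec:
    context_length: int  # maximum tokens embedded in one pass
    min_dim: int         # smallest embedding length listed in the table
    max_dim: int         # full embedding length


# Values copied from the "Supported Embedding Models" table.
MODEL_SPECS = {
    "all-MiniLM-L6-v2.gguf2.f16.gguf": EmbeddingModelSpec(512, 384, 384),
    "nomic-embed-text-v1.f16.gguf": EmbeddingModelSpec(2048, 768, 768),
    "nomic-embed-text-v1.5.f16.gguf": EmbeddingModelSpec(2048, 64, 768),
}


def expected_length(model_name: str, dimensionality: int | None = None) -> int:
    """Length of the vector embed() should return for this model and dimensionality."""
    spec = MODEL_SPECS[model_name]
    if dimensionality is None:
        return spec.max_dim  # full-size by default
    # Stricter than the library: reject values outside the table's range.
    if not spec.min_dim <= dimensionality <= spec.max_dim:
        raise ValueError(f"{model_name} supports {spec.min_dim}-{spec.max_dim} dimensions")
    return dimensionality


print(expected_length("nomic-embed-text-v1.5.f16.gguf", 256))
```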

Quickstart

```shell
pip install gpt4all
```

Generating Embeddings

By default, embeddings will be generated on the CPU using all-MiniLM-L6-v2.

```python
from gpt4all import Embed4All
text = 'The quick brown fox jumps over the lazy dog'
embedder = Embed4All()
output = embedder.embed(text)
print(output)
```

Output:

```
[0.034696947783231735, -0.07192722707986832, 0.06923297047615051, ...]
```

You can also use the GPU to accelerate the embedding model by specifying the device parameter. See the GPT4All constructor for more information.

```python
from gpt4all import Embed4All
text = 'The quick brown fox jumps over the lazy dog'
embedder = Embed4All(device='gpu')
output = embedder.embed(text)
print(output)
```

Output:

```
[0.034696947783231735, -0.07192722707986832, 0.06923297047615051, ...]
```

Nomic Embed

Embed4All has built-in support for Nomic's open-source embedding model, Nomic Embed. When using this model, specify the task type with the prefix argument: one of search_query, search_document, classification, or clustering. If you omit it, the prefix defaults to search_document. For retrieval applications, use prefix='search_document' for all of your documents and prefix='search_query' for your queries. See the Nomic Embedding Guide for more info.

```python
from gpt4all import Embed4All
text = 'Who is Laurens van der Maaten?'
embedder = Embed4All('nomic-embed-text-v1.f16.gguf')
output = embedder.embed(text, prefix='search_query')
print(output)
```

Output:

```
[-0.013357644900679588, 0.027070969343185425, -0.0232995692640543, ...]
```
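A retrieval pipeline with Nomic Embed therefore uses two different prefixes: search_document when indexing and search_query when searching. The sketch below shows that flow with the embedding function passed in as a parameter, so it runs without downloading a model. With a real model you would pass something like `lambda texts, prefix: embedder.embed(texts, prefix=prefix)`. The letter-count `toy_embed` function is purely a stand-in, and `build_index` and `search` are hypothetical helper names.

```python
from __future__ import annotations

import math
from typing import Callable

# Takes a batch of texts plus a task prefix; returns one vector per text.
EmbedFn = Callable[[list[str], str], list[list[float]]]


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_index(docs: list[str], embed_fn: EmbedFn) -> list[list[float]]:
    # Documents are embedded with the search_document prefix.
    return embed_fn(docs, "search_document")


def search(query: str, docs: list[str], index: list[list[float]], embed_fn: EmbedFn, k: int = 1) -> list[str]:
    # Queries are embedded with the search_query prefix.
    (query_vec,) = embed_fn([query], "search_query")
    ranked = sorted(range(len(docs)), key=lambda i: _cosine(query_vec, index[i]), reverse=True)
    return [docs[i] for i in ranked[:k]]


def toy_embed(texts: list[str], prefix: str) -> list[list[float]]:
    # Stand-in model: counts of each letter a-z. Unlike Nomic Embed, it ignores the prefix.
    return [[float(t.lower().count(chr(c))) for c in range(ord("a"), ord("z") + 1)] for t in texts]


docs = ["The quick brown fox", "Quarterly revenue report"]
index = build_index(docs, toy_embed)
print(search("a brown fox", docs, index, toy_embed))
```

Keeping the prefix choice inside `build_index` and `search` makes it hard to accidentally index documents with the query prefix.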

Embedding Longer Texts

Embed4All.embed accepts a parameter called long_text_mode, which controls how texts longer than the embedding model's context length are handled.

In the default mode, "mean", Embed4All breaks long inputs into chunks and averages their embeddings to compute the final result.

Setting long_text_mode to "truncate" instead cuts the input off at the model's context length and generates a single embedding from what remains. Any other value raises a ValueError.

```python
from gpt4all import Embed4All
text = 'The ' * 512 + 'The quick brown fox jumps over the lazy dog'
embedder = Embed4All()
output = embedder.embed(text, long_text_mode="mean")
print(output)
print()
output = embedder.embed(text, long_text_mode="truncate")
print(output)
```

Output:

```
[0.0039850445464253426, 0.04558328539133072, 0.0035536508075892925, ...]

[-0.009771130047738552, 0.034792833030223846, -0.013273917138576508, ...]
```
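Conceptually, the two long_text_mode settings behave roughly as in the sketch below, which operates on a list of token IDs. It is an illustration only, not Embed4All's implementation (which runs inside the llama.cpp-based backend): non-overlapping chunks and an unweighted average are assumptions here. `chunks`, `embed_long`, and the `embed_chunk` callback are hypothetical names, and `toy_chunk_embedding` stands in for the model.

```python
from __future__ import annotations

from typing import Callable, Sequence

EmbedChunk = Callable[[Sequence[int]], list[float]]


def chunks(tokens: Sequence[int], n_ctx: int) -> list[Sequence[int]]:
    """Split tokens into consecutive, non-overlapping windows of at most n_ctx."""
    return [tokens[i:i + n_ctx] for i in range(0, len(tokens), n_ctx)]


def embed_long(tokens: Sequence[int], n_ctx: int, embed_chunk: EmbedChunk, mode: str = "mean") -> list[float]:
    if mode == "truncate":
        # Keep only the first n_ctx tokens and embed them once.
        return embed_chunk(tokens[:n_ctx])
    if mode == "mean":
        # Embed every window, then average the vectors component-wise.
        vectors = [embed_chunk(c) for c in chunks(tokens, n_ctx)]
        return [sum(col) / len(vectors) for col in zip(*vectors)]
    raise ValueError(f"Long text mode must be one of 'mean' or 'truncate', got {mode!r}")


def toy_chunk_embedding(chunk: Sequence[int]) -> list[float]:
    # Stand-in model: [number of tokens, mean token id].
    return [float(len(chunk)), sum(chunk) / len(chunk)]


tokens = list(range(10))  # pretend the text tokenized to 10 tokens
print(embed_long(tokens, 4, toy_chunk_embedding, "truncate"))
print(embed_long(tokens, 4, toy_chunk_embedding, "mean"))
```

The trade-off is visible even here: "truncate" ignores everything after the first window, while "mean" lets every chunk contribute but blurs them into one vector.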

Batching

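You can send multiple texts to Embed4All in a single call by passing a list; the result is a list with one embedding per input text. Batching can be faster when individual texts are much shorter than n_ctx tokens (n_ctx defaults to 2048), because several texts can then share one pass of the model.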

```python
from gpt4all import Embed4All
texts = ['The quick brown fox jumps over the lazy dog', 'Foo bar baz']
embedder = Embed4All()
output = embedder.embed(texts)
print(output[0])
print()
print(output[1])
```

Output:

```
[0.03551332652568817, 0.06137588247656822, 0.05281158909201622, ...]

[-0.03879690542817116, 0.00013223080895841122, 0.023148687556385994, ...]
```

The number of texts that can be embedded in one pass of the model is proportional to the n_ctx parameter of Embed4All. Increasing it may increase batched embedding throughput if you have a fast GPU, at the cost of VRAM.

```python
from gpt4all import Embed4All
embedder = Embed4All(n_ctx=4096, device='gpu')
```
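One way to picture why n_ctx bounds the batch size: if a single pass can hold at most n_ctx tokens in total, short texts can be packed greedily into passes. The sketch below models that idea under this simplifying assumption; it is not how Embed4All schedules batches internally, the token counts are supplied by hand rather than by a tokenizer, and `pack_batches` is a hypothetical helper.

```python
def pack_batches(token_counts: list[int], n_ctx: int) -> list[list[int]]:
    """Greedily group text indices so that each group's total tokens fit in n_ctx.

    A text longer than n_ctx gets a group of its own; it would still need
    chunking or truncation, as described under Embedding Longer Texts.
    """
    batches: list[list[int]] = []
    current: list[int] = []
    used = 0
    for i, count in enumerate(token_counts):
        # Start a new pass when this text would overflow the current one.
        if current and used + count > n_ctx:
            batches.append(current)
            current, used = [], 0
        current.append(i)
        used += count
    if current:
        batches.append(current)
    return batches


counts = [12] * 500  # 500 short texts of about 12 tokens each
print(len(pack_batches(counts, 2048)))
print(len(pack_batches(counts, 4096)))
```

Under this model, doubling n_ctx roughly halves the number of passes, which is where the throughput gain (and the extra VRAM) comes from.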

Resizable Dimensionality

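The embedding dimension of Nomic Embed v1.5 can be resized using the dimensionality parameter of embed. Values from 64 to 768 are supported; the Python wrapper warns that quality may be degraded for values below 64 and raises a ValueError for values that are not positive.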
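Matryoshka-style models are trained so that a prefix of the full vector is itself a usable embedding. If you have already stored full 768-dimensional vectors, a simplified way to shrink them afterwards is to keep the first d components and re-normalize to unit length, as sketched below. This is an approximation and an assumption, not necessarily what the dimensionality parameter does internally (Nomic's published usage example also applies layer normalization before slicing), so prefer the dimensionality parameter when embedding new text. `shrink_embedding` is a hypothetical helper.

```python
import math


def shrink_embedding(vec: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    if not 0 < dim <= len(vec):
        raise ValueError(f"dim must be between 1 and {len(vec)}, got {dim}")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head  # nothing to normalize
    return [x / norm for x in head]


full = [3.0, 4.0, 1.0, 2.0]  # toy vector, not real model output
print(shrink_embedding(full, 2))
```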

Shorter embeddings use less storage, memory, and bandwidth at a small cost in embedding quality. See the blog post for more info.

```python
from gpt4all import Embed4All
text = 'The quick brown fox jumps over the lazy dog'
embedder = Embed4All('nomic-embed-text-v1.5.f16.gguf')
output = embedder.embed(text, dimensionality=64)
print(len(output))
print(output)
```

Output:

```
64
[-0.03567073494195938, 0.1301717758178711, -0.4333043396472931, ...]
```
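To put the storage savings of shorter embeddings into numbers, the sketch below computes the raw size of a vector collection, assuming each value is stored as a 32-bit float (4 bytes). That storage type is an assumption: Embed4All returns Python lists of floats, and your vector store may use another format. `index_size_mib` is a hypothetical helper and ignores any index overhead.

```python
BYTES_PER_FLOAT32 = 4


def index_size_mib(num_vectors: int, dimensionality: int, bytes_per_value: int = BYTES_PER_FLOAT32) -> float:
    """Raw size in MiB of the vectors alone, ignoring any index overhead."""
    return num_vectors * dimensionality * bytes_per_value / 2**20


for dim in (768, 256, 64):
    print(dim, round(index_size_mib(1_000_000, dim), 1))
```

Because size scales linearly with dimensionality, full 768-dimensional vectors take 12 times the space of 64-dimensional ones.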

API documentation

Embed4All

Python class that handles embeddings for GPT4All.

Source code in gpt4all/gpt4all.py
class Embed4All:
    """
    Python class that handles embeddings for GPT4All.
    """

    MIN_DIMENSIONALITY = 64

    def __init__(self, model_name: str | None = None, *, n_threads: int | None = None, device: str | None = "cpu", **kwargs: Any):
        """
        Constructor

        Args:
            n_threads: number of CPU threads used by GPT4All. Default is None, then the number of threads are determined automatically.
            device: The processing unit on which the embedding model will run. See the `GPT4All` constructor for more info.
            kwargs: Remaining keyword arguments are passed to the `GPT4All` constructor.
        """
        if model_name is None:
            model_name = 'all-MiniLM-L6-v2.gguf2.f16.gguf'
        self.gpt4all = GPT4All(model_name, n_threads=n_threads, device=device, **kwargs)

    def __enter__(self) -> Self:
        return self

    def __exit__(
        self, typ: type[BaseException] | None, value: BaseException | None, tb: TracebackType | None,
    ) -> None:
        self.close()

    def close(self) -> None:
        """Delete the model instance and free associated system resources."""
        self.gpt4all.close()

    # return_dict=False
    @overload
    def embed(
        self, text: str, *, prefix: str | None = ..., dimensionality: int | None = ..., long_text_mode: str = ...,
        return_dict: Literal[False] = ..., atlas: bool = ...,
    ) -> list[float]: ...
    @overload
    def embed(
        self, text: list[str], *, prefix: str | None = ..., dimensionality: int | None = ..., long_text_mode: str = ...,
        return_dict: Literal[False] = ..., atlas: bool = ...,
    ) -> list[list[float]]: ...
    @overload
    def embed(
        self, text: str | list[str], *, prefix: str | None = ..., dimensionality: int | None = ...,
        long_text_mode: str = ..., return_dict: Literal[False] = ..., atlas: bool = ...,
    ) -> list[Any]: ...

    # return_dict=True
    @overload
    def embed(
        self, text: str, *, prefix: str | None = ..., dimensionality: int | None = ..., long_text_mode: str = ...,
        return_dict: Literal[True], atlas: bool = ...,
    ) -> EmbedResult[list[float]]: ...
    @overload
    def embed(
        self, text: list[str], *, prefix: str | None = ..., dimensionality: int | None = ..., long_text_mode: str = ...,
        return_dict: Literal[True], atlas: bool = ...,
    ) -> EmbedResult[list[list[float]]]: ...
    @overload
    def embed(
        self, text: str | list[str], *, prefix: str | None = ..., dimensionality: int | None = ...,
        long_text_mode: str = ..., return_dict: Literal[True], atlas: bool = ...,
    ) -> EmbedResult[list[Any]]: ...

    # return type unknown
    @overload
    def embed(
        self, text: str | list[str], *, prefix: str | None = ..., dimensionality: int | None = ...,
        long_text_mode: str = ..., return_dict: bool = ..., atlas: bool = ...,
    ) -> Any: ...

    def embed(
        self, text: str | list[str], *, prefix: str | None = None, dimensionality: int | None = None,
        long_text_mode: str = "mean", return_dict: bool = False, atlas: bool = False,
    ) -> Any:
        """
        Generate one or more embeddings.

        Args:
            text: A text or list of texts to generate embeddings for.
            prefix: The model-specific prefix representing the embedding task, without the trailing colon. For Nomic
                Embed, this can be `search_query`, `search_document`, `classification`, or `clustering`. Defaults to
                `search_document` or equivalent if known; otherwise, you must explicitly pass a prefix or an empty
                string if none applies.
            dimensionality: The embedding dimension, for use with Matryoshka-capable models. Defaults to full-size.
            long_text_mode: How to handle texts longer than the model can accept. One of `mean` or `truncate`.
            return_dict: Return the result as a dict that includes the number of prompt tokens processed.
            atlas: Try to be fully compatible with the Atlas API. Currently, this means texts longer than 8192 tokens
                with long_text_mode="mean" will raise an error. Disabled by default.

        Returns:
            With return_dict=False, an embedding or list of embeddings of your text(s).
            With return_dict=True, a dict with keys 'embeddings' and 'n_prompt_tokens'.
        """
        if dimensionality is None:
            dimensionality = -1
        else:
            if dimensionality <= 0:
                raise ValueError(f'Dimensionality must be None or a positive integer, got {dimensionality}')
            if dimensionality < self.MIN_DIMENSIONALITY:
                warnings.warn(
                    f'Dimensionality {dimensionality} is less than the suggested minimum of {self.MIN_DIMENSIONALITY}.'
                    ' Performance may be degraded.'
                )
        try:
            do_mean = {"mean": True, "truncate": False}[long_text_mode]
        except KeyError:
            raise ValueError(f"Long text mode must be one of 'mean' or 'truncate', got {long_text_mode!r}")
        result = self.gpt4all.model.generate_embeddings(text, prefix, dimensionality, do_mean, atlas)
        return result if return_dict else result['embeddings']
__init__(model_name=None, *, n_threads=None, device='cpu', **kwargs)

Constructor

Parameters:

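  • model_name (str | None, default: None ) –

    Name of the embedding model to use. If None, all-MiniLM-L6-v2.gguf2.f16.gguf is used.

  • n_threads (int | None, default: None ) –

    Number of CPU threads used by GPT4All. If None (the default), the number of threads is determined automatically.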

  • device (str | None, default: 'cpu' ) –

    The processing unit on which the embedding model will run. See the GPT4All constructor for more info.

  • kwargs (Any, default: {} ) –

    Remaining keyword arguments are passed to the GPT4All constructor.

close()

Delete the model instance and free associated system resources.

embed(text, *, prefix=None, dimensionality=None, long_text_mode='mean', return_dict=False, atlas=False)

Generate one or more embeddings.

Parameters:

  • text (str | list[str]) –

    A text or list of texts to generate embeddings for.

  • prefix (str | None, default: None ) –

    The model-specific prefix representing the embedding task, without the trailing colon. For Nomic Embed, this can be search_query, search_document, classification, or clustering. Defaults to search_document or equivalent if known; otherwise, you must explicitly pass a prefix or an empty string if none applies.

  • dimensionality (int | None, default: None ) –

    The embedding dimension, for use with Matryoshka-capable models. Defaults to full-size.

  • long_text_mode (str, default: 'mean' ) –

    How to handle texts longer than the model can accept. One of mean or truncate.

  • return_dict (bool, default: False ) –

    Return the result as a dict that includes the number of prompt tokens processed.

  • atlas (bool, default: False ) –

    Try to be fully compatible with the Atlas API. Currently, this means texts longer than 8192 tokens with long_text_mode="mean" will raise an error. Disabled by default.

Returns:

  • Any

    With return_dict=False, an embedding or list of embeddings of your text(s).

  • Any

    With return_dict=True, a dict with keys 'embeddings' and 'n_prompt_tokens'.
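With return_dict=True, the result carries the embeddings together with the number of prompt tokens processed, which is useful for tracking throughput or cost. The sketch below shows one way to consume that dict for both single-text and list input. It relies only on the two keys named above, 'embeddings' and 'n_prompt_tokens'; `EmbedResultDict` is a hypothetical local stand-in for the library's own EmbedResult type, `summarize` is a hypothetical helper, and the sample values are made up rather than real model output.

```python
from __future__ import annotations

from typing import Any, TypedDict


class EmbedResultDict(TypedDict):
    embeddings: Any       # one vector for str input, a list of vectors for list input
    n_prompt_tokens: int  # number of prompt tokens processed


def summarize(result: EmbedResultDict) -> str:
    embeddings = result["embeddings"]
    # A single text yields a flat list of floats; a list of texts yields a list of lists.
    batched = bool(embeddings) and isinstance(embeddings[0], list)
    count = len(embeddings) if batched else 1
    return f"{count} embedding(s) from {result['n_prompt_tokens']} prompt tokens"


# Made-up values shaped like the result of embed([...two texts...], return_dict=True).
sample: EmbedResultDict = {"embeddings": [[0.1, 0.2], [0.3, 0.4]], "n_prompt_tokens": 6}
print(summarize(sample))
```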
