📚 References
This module provides functions and classes for embedding queries, files, and directories using different embedding models.
The module includes the following functions:

- `embed_query`: Embeds the given query and returns a list of EmbedData objects.
- `embed_file`: Embeds the file at the given path and returns a list of EmbedData objects.
- `embed_directory`: Embeds all the files in the given directory and returns a list of EmbedData objects.

The module also includes the `EmbedData` class, which represents the data of an embedded file.
Usage:

```python
import numpy as np

import embed_anything
from embed_anything import EmbedData, EmbeddingModel, WhichModel

# For text files
model = EmbeddingModel.from_pretrained_local(
    WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)

# For images
model = embed_anything.EmbeddingModel.from_pretrained_local(
    embed_anything.WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
embeddings = np.array([d.embedding for d in data])

query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embeder=model)[0].embedding
)
```
```python
# For audio files
import embed_anything
from embed_anything import (
    AudioDecoderModel,
    EmbeddingModel,
    embed_audio_file,
    TextEmbedConfig,
)

# Choose any whisper or distil-whisper model from
# https://huggingface.co/distil-whisper or
# https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embeder=embeder,
    text_embed_config=config,
)
```
You can also store the embeddings in a vector database instead of keeping them in memory. Here is an example of how to use the PineconeAdapter class:
```python
import os

import embed_anything
from embed_anything import EmbeddingModel, TextEmbedConfig, WhichModel
from embed_anything.vectordb import PineconeAdapter

# Initialize the PineconeAdapter class
api_key = os.environ.get("PINECONE_API_KEY")
index_name = "anything"
pinecone_adapter = PineconeAdapter(api_key)

# Drop the index if it already exists, then create a fresh one
try:
    pinecone_adapter.delete_index(index_name)
except Exception:
    pass
pinecone_adapter.create_index(dimension=512, metric="cosine")

# bert_model = EmbeddingModel.from_pretrained_hf(
#     WhichModel.Bert, "sentence-transformers/all-MiniLM-L12-v2", revision="main"
# )
clip_model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Clip, "openai/clip-vit-base-patch16", revision="main"
)
embed_config = TextEmbedConfig(chunk_size=512, batch_size=32)

data = embed_anything.embed_image_directory(
    "test_files",
    embeder=clip_model,
    adapter=pinecone_adapter,
    # config=embed_config,
)
```
Supported Embedding Models:

- Text Embedding Models:
    - "OpenAI"
    - "Bert"
    - "Jina"
- Image Embedding Models:
    - "Clip"
    - "SigLip" (Coming Soon)
- Audio Embedding Models:
    - "Whisper"
AudioDecoderModel
Represents an audio decoder model.
Attributes:

| Name | Type | Description |
|---|---|---|
| `model_id` | `str` | The ID of the audio decoder model. |
| `revision` | `str` | The revision of the audio decoder model. |
| `model_type` | `str` | The type of the audio decoder model. |
| `quantized` | `bool` | A flag indicating whether the audio decoder model is quantized or not. |
Example:

```python
model = embed_anything.AudioDecoderModel.from_pretrained_hf(
    model_id="openai/whisper-tiny.en",
    revision="main",
    model_type="tiny-en",
    quantized=False,
)
```
ColpaliModel
Represents the Colpali model.
__init__(model_id, revision=None)
Initializes the ColpaliModel object.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | `str` | The ID of the model from Hugging Face. | *required* |
| `revision` | `str \| None` | The revision of the model. | `None` |
embed_file(file_path, batch_size=1)
Embeds the given PDF file and returns a list of EmbedData objects, one for each page in the file. The PDF is first converted into images, and each image is then embedded.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | The path to the PDF file to embed. | *required* |
| `batch_size` | `int \| None` | The batch size for processing the embeddings. Default is 1. | `1` |

Returns:

| Type | Description |
|---|---|
| `list[EmbedData]` | A list of EmbedData objects for each page in the file. |
embed_query(query)
Embeds the given query and returns a list of EmbedData objects.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | The query to embed. | *required* |

Returns:

| Type | Description |
|---|---|
| `list[EmbedData]` | A list of EmbedData objects. |
from_pretrained(model_id, revision=None)
Loads a pre-trained Colpali model from the Hugging Face model hub.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | `str` | The ID of the model from Hugging Face. | *required* |
| `revision` | `str \| None` | The revision of the model. | `None` |

Returns:

| Type | Description |
|---|---|
| `ColpaliModel` | A ColpaliModel object. |
from_pretrained_onnx(model_id, revision=None)
Loads a pre-trained Colpali model in ONNX format from the Hugging Face model hub.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | `str` | The ID of the model from Hugging Face. | *required* |
| `revision` | `str \| None` | The revision of the model. | `None` |

Returns:

| Type | Description |
|---|---|
| `ColpaliModel` | A ColpaliModel object. |
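A minimal end-to-end sketch of ColpaliModel, assuming it is importable from the top-level package; the model ID below is an illustrative placeholder.

```python
from embed_anything import ColpaliModel

# Load a Colpali-style model (model ID is an assumed example)
model = ColpaliModel.from_pretrained("vidore/colpali-v1.2-merged")

# Embed every page of a PDF, then embed a query against it
pages = model.embed_file("test_files/test.pdf", batch_size=1)
query_embeddings = model.embed_query("What is the document about?")
```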
EmbedData
Represents the data of an embedded file.
Attributes:

| Name | Type | Description |
|---|---|---|
| `embedding` | `list[float]` | The embedding of the file. |
| `text` | `str` | The text for which the embedding is generated. |
| `metadata` | `dict[str, str]` | Additional metadata associated with the embedding. |
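A brief sketch of how the fields of the returned EmbedData objects might be consumed, following the usage examples above:

```python
import numpy as np

import embed_anything
from embed_anything import EmbeddingModel, WhichModel

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L6-v2", revision="main"
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)

for item in data:
    vector = np.array(item.embedding)  # list[float] -> NumPy vector
    print(item.text[:80], item.metadata, vector.shape)
```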
EmbeddingModel
Represents an embedding model.
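As a quick sketch, the two constructors used throughout this page create an EmbeddingModel either from the Hugging Face hub or from a local/pretrained checkpoint reference:

```python
from embed_anything import EmbeddingModel, WhichModel

# From the Hugging Face hub
bert = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L6-v2", revision="main"
)

# From a local / pretrained checkpoint reference
clip = EmbeddingModel.from_pretrained_local(
    WhichModel.Clip, model_id="openai/clip-vit-base-patch16"
)
```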
ImageEmbedConfig
Represents the configuration for the Image Embedding model.
Attributes:

| Name | Type | Description |
|---|---|---|
| `buffer_size` | `int \| None` | The buffer size for the Image Embedding model. Default is 100. |
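A minimal sketch of constructing this config and passing it to embed_image_directory; it assumes the constructor accepts buffer_size as a keyword argument, mirroring the attribute above.

```python
import embed_anything
from embed_anything import EmbeddingModel, ImageEmbedConfig, WhichModel

clip_model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Clip, "openai/clip-vit-base-patch16", revision="main"
)
config = ImageEmbedConfig(buffer_size=100)  # kwarg assumed from the documented attribute
data = embed_anything.embed_image_directory(
    "test_files", embeder=clip_model, config=config
)
```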
ONNXModel
Bases: Enum
Enum representing various ONNX models.
| Enum Variant | Description |
|----------------------------------|--------------------------------------------------|
| `AllMiniLML6V2` | sentence-transformers/all-MiniLM-L6-v2 |
| `AllMiniLML6V2Q` | Quantized sentence-transformers/all-MiniLM-L6-v2 |
| `AllMiniLML12V2` | sentence-transformers/all-MiniLM-L12-v2 |
| `AllMiniLML12V2Q` | Quantized sentence-transformers/all-MiniLM-L12-v2|
| `BGEBaseENV15` | BAAI/bge-base-en-v1.5 |
| `BGEBaseENV15Q` | Quantized BAAI/bge-base-en-v1.5 |
| `BGELargeENV15` | BAAI/bge-large-en-v1.5 |
| `BGELargeENV15Q` | Quantized BAAI/bge-large-en-v1.5 |
| `BGESmallENV15` | BAAI/bge-small-en-v1.5 - Default |
| `BGESmallENV15Q` | Quantized BAAI/bge-small-en-v1.5 |
| `NomicEmbedTextV1` | nomic-ai/nomic-embed-text-v1 |
| `NomicEmbedTextV15` | nomic-ai/nomic-embed-text-v1.5 |
| `NomicEmbedTextV15Q` | Quantized nomic-ai/nomic-embed-text-v1.5 |
| `ParaphraseMLMiniLML12V2` | sentence-transformers/paraphrase-MiniLM-L6-v2 |
| `ParaphraseMLMiniLML12V2Q` | Quantized sentence-transformers/paraphrase-MiniLM-L6-v2 |
| `ParaphraseMLMpnetBaseV2` | sentence-transformers/paraphrase-mpnet-base-v2 |
| `BGESmallZHV15` | BAAI/bge-small-zh-v1.5 |
| `MultilingualE5Small` | intfloat/multilingual-e5-small |
| `MultilingualE5Base` | intfloat/multilingual-e5-base |
| `MultilingualE5Large` | intfloat/multilingual-e5-large |
| `MxbaiEmbedLargeV1` | mixedbread-ai/mxbai-embed-large-v1 |
| `MxbaiEmbedLargeV1Q` | Quantized mixedbread-ai/mxbai-embed-large-v1 |
| `GTEBaseENV15` | Alibaba-NLP/gte-base-en-v1.5 |
| `GTEBaseENV15Q` | Quantized Alibaba-NLP/gte-base-en-v1.5 |
| `GTELargeENV15` | Alibaba-NLP/gte-large-en-v1.5 |
| `GTELargeENV15Q` | Quantized Alibaba-NLP/gte-large-en-v1.5 |
| `JINAV2SMALLEN` | jinaai/jina-embeddings-v2-small-en |
| `JINAV2BASEEN` | jinaai/jina-embeddings-v2-base-en |
| `JINAV2LARGEEN` | jinaai/jina-embeddings-v2-large-en |
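A hypothetical sketch of selecting one of these variants; it assumes an ONNX-backed constructor in the from_pretrained_* family, whose exact name and signature may differ in your installed version.

```python
import embed_anything
from embed_anything import EmbeddingModel, ONNXModel, WhichModel

# Assumed constructor and signature; check your installed version of embed_anything.
model = EmbeddingModel.from_pretrained_onnx(WhichModel.Bert, ONNXModel.AllMiniLML6V2)
data = embed_anything.embed_query(["Hello world"], embeder=model)
```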
TextEmbedConfig
Represents the configuration for the Text Embedding model.
Attributes:

| Name | Type | Description |
|---|---|---|
| `chunk_size` | `int \| None` | The chunk size for the Text Embedding model. |
| `batch_size` | `int \| None` | The batch size for processing the embeddings. Default is 32. Depending on the available memory, you can increase or decrease the batch size. |
| `splitting_strategy` | | The strategy to use for splitting the text into chunks. Default is "sentence". |
| `semantic_encoder` | `EmbeddingModel \| None` | The semantic encoder for the Text Embedding model. Default is None. |
| `use_ocr` | `bool \| None` | A flag indicating whether to use OCR for the Text Embedding model. Default is False. |
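A short sketch of building a config from these fields and using it with embed_file; it assumes the attributes are accepted as keyword arguments by the constructor.

```python
import embed_anything
from embed_anything import EmbeddingModel, TextEmbedConfig, WhichModel

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L6-v2", revision="main"
)
config = TextEmbedConfig(
    chunk_size=512,
    batch_size=32,
    splitting_strategy="sentence",
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model, config=config)
```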
embed_audio_file(file_path, audio_decoder, embeder, text_embed_config=TextEmbedConfig(chunk_size=200, batch_size=32))
Embeds the given audio file and returns a list of EmbedData objects.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | The path to the audio file to embed. | *required* |
| `audio_decoder` | `AudioDecoderModel` | The audio decoder model to use. | *required* |
| `embeder` | `EmbeddingModel` | The embedding model to use. | *required* |
| `text_embed_config` | `TextEmbedConfig \| None` | The configuration for the embedding model. | `TextEmbedConfig(chunk_size=200, batch_size=32)` |

Returns:

| Type | Description |
|---|---|
| `list[EmbedData]` | A list of EmbedData objects. |
Example:

```python
import embed_anything

audio_decoder = embed_anything.AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = embed_anything.EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = embed_anything.TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embeder=embeder,
    text_embed_config=config,
)
```
embed_directory(file_path, embeder, extensions, config=None, adapter=None)
Embeds the files in the given directory and returns a list of EmbedData objects.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | The path to the directory containing the files to embed. | *required* |
| `embeder` | `EmbeddingModel` | The embedding model to use. | *required* |
| `extensions` | `list[str]` | The list of file extensions to consider for embedding. | *required* |
| `config` | `TextEmbedConfig \| None` | The configuration for the embedding model. | `None` |
| `adapter` | `Adapter \| None` | The adapter to use for storing the embeddings in a vector database. | `None` |

Returns:

| Type | Description |
|---|---|
| `list[EmbedData]` | A list of EmbedData objects. |
Example:

```python
import embed_anything

model = embed_anything.EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
data = embed_anything.embed_directory("test_files", embeder=model, extensions=[".pdf"])
```
embed_file(file_path, embeder, config=None, adapter=None)
Embeds the given file and returns a list of EmbedData objects.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | The path to the file to embed. | *required* |
| `embeder` | `EmbeddingModel` | The embedding model to use. | *required* |
| `config` | `TextEmbedConfig \| None` | The configuration for the embedding model. | `None` |
| `adapter` | `Adapter \| None` | The adapter to use for storing the embeddings in a vector database. | `None` |

Returns:

| Type | Description |
|---|---|
| `list[EmbedData]` | A list of EmbedData objects. |
Example:

```python
import embed_anything

model = embed_anything.EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)
```
embed_image_directory(file_path, embeder, config=None, adapter=None)
Embeds the images in the given directory and returns a list of EmbedData objects.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | The path to the directory containing the images to embed. | *required* |
| `embeder` | `EmbeddingModel` | The embedding model to use. | *required* |
| `config` | `ImageEmbedConfig \| None` | The configuration for the embedding model. | `None` |
| `adapter` | `Adapter \| None` | The adapter to use for storing the embeddings in a vector database. | `None` |

Returns:

| Type | Description |
|---|---|
| `list[EmbedData]` | A list of EmbedData objects. |
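No example is given above for this function; the following sketch mirrors the image usage shown earlier on this page.

```python
import embed_anything
from embed_anything import EmbedData, EmbeddingModel, WhichModel

clip_model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Clip, "openai/clip-vit-base-patch16", revision="main"
)
data: list[EmbedData] = embed_anything.embed_image_directory(
    "test_files", embeder=clip_model
)
```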
embed_query(query, embeder, config=None)
Embeds the given query and returns a list of EmbedData objects.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `list[str]` | The query to embed. | *required* |
| `embeder` | `EmbeddingModel` | The embedding model to use. | *required* |
| `config` | `TextEmbedConfig \| None` | The configuration for the embedding model. | `None` |

Returns:

| Type | Description |
|---|---|
| `list[EmbedData]` | A list of EmbedData objects. |
Example:

```python
import embed_anything

model = embed_anything.EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
data = embed_anything.embed_query(["What is the capital of India?"], embeder=model)
```
embed_webpage(url, embeder, config, adapter)
Embeds the webpage at the given URL and returns a list of EmbedData objects.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `url` | `str` | The URL of the webpage to embed. | *required* |
| `embeder` | `EmbeddingModel` | The name of the embedding model to use. Choose between "OpenAI", "Jina", "Bert". | *required* |
| `config` | `TextEmbedConfig \| None` | The configuration for the embedding model. | *required* |
| `adapter` | `Adapter \| None` | The adapter to use for storing the embeddings. | *required* |

Returns:

| Type | Description |
|---|---|
| `list[EmbedData] \| None` | A list of EmbedData objects. |
Example:

```python
import embed_anything

config = embed_anything.EmbedConfig(
    openai_config=embed_anything.OpenAIConfig(model="text-embedding-3-small")
)
data = embed_anything.embed_webpage(
    "https://www.akshaymakes.com/", embeder="OpenAI", config=config
)
```