References
This module provides functions and classes for embedding queries, files, and directories using different embedding models.
The module includes the following functions:
- embed_query: Embeds the given query and returns a list of EmbedData objects.
- embed_file: Embeds the file at the given path and returns a list of EmbedData objects.
- embed_directory: Embeds all the files in the given directory and returns a list of EmbedData objects.
The module also includes the EmbedData class, which represents the data of an embedded file.
Usage:
import numpy as np
import embed_anything
from embed_anything import EmbedData, EmbeddingModel, WhichModel

# For text files
model = EmbeddingModel.from_pretrained_local(
    WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)

# For images
model = EmbeddingModel.from_pretrained_local(
    WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
# Stack the per-file embeddings into a 2-D array
embeddings = np.array([item.embedding for item in data])
query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embeder=model)[0].embedding
)
# For audio files
from embed_anything import (
    AudioDecoderModel,
    EmbeddingModel,
    embed_audio_file,
    TextEmbedConfig,
)

# Choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embeder=embeder,
    text_embed_config=config,
)
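The image example above builds an `embeddings` array and a `query_embedding`, but stops before comparing them. One common way to rank results is cosine similarity. The following is a minimal sketch in plain Python under that assumption; `cosine_similarity` and `rank_by_similarity` are hypothetical helpers written for this illustration, not part of embed_anything, and the toy vectors stand in for real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_embedding, embeddings):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [
        (i, cosine_similarity(query_embedding, e))
        for i, e in enumerate(embeddings)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings.
toy_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
toy_query = [1.0, 0.1, 0.0]

ranking = rank_by_similarity(toy_query, toy_embeddings)
best_index = ranking[0][0]  # index of the most similar embedding
```

With real data, `best_index` would point back into the list returned by embed_directory, so the matching entry's text and metadata can be looked up.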
You can also store the embeddings in a vector database instead of keeping them in memory. Here is an example that uses the PineconeAdapter class:
import os
import embed_anything
from embed_anything import EmbeddingModel, TextEmbedConfig, WhichModel
from embed_anything.vectordb import PineconeAdapter

# Initialize the PineconeAdapter class
api_key = os.environ.get("PINECONE_API_KEY")
index_name = "anything"
pinecone_adapter = PineconeAdapter(api_key)

# Delete the index if it already exists
try:
    pinecone_adapter.delete_index(index_name)
except Exception:
    pass

# Create the index (dimension 512 matches openai/clip-vit-base-patch16)
pinecone_adapter.create_index(dimension=512, metric="cosine")

# bert_model = EmbeddingModel.from_pretrained_hf(
#     WhichModel.Bert, "sentence-transformers/all-MiniLM-L12-v2", revision="main"
# )
clip_model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Clip, "openai/clip-vit-base-patch16", revision="main"
)
embed_config = TextEmbedConfig(chunk_size=512, batch_size=32)
data = embed_anything.embed_image_directory(
    "test_files",
    embeder=clip_model,
    adapter=pinecone_adapter,
    # config=embed_config,
)
Supported Embedding Models:
- Text Embedding Models:
  - "OpenAI"
  - "Bert"
  - "Jina"
- Image Embedding Models:
  - "Clip"
  - "SigLip" (Coming Soon)
- Audio Embedding Models:
  - "Whisper"
AudioDecoderModel
Represents an audio decoder model.
Attributes:
Name | Type | Description |
---|---|---|
model_id | str | The ID of the audio decoder model. |
revision | str | The revision of the audio decoder model. |
model_type | str | The type of the audio decoder model. |
quantized | bool | A flag indicating whether the audio decoder model is quantized or not. |
Example:
model = embed_anything.AudioDecoderModel.from_pretrained_hf(
    model_id="openai/whisper-tiny.en",
    revision="main",
    model_type="tiny-en",
    quantized=False
)
Source code in python/python/embed_anything/_embed_anything.pyi
EmbedData
Represents the data of an embedded file.
Attributes:
Name | Type | Description |
---|---|---|
embedding | list[float] | The embedding of the file. |
text | str | The text for which the embedding was generated. |
metadata | dict[str, str] | Additional metadata associated with the embedding. |
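Because EmbedData exposes only three attributes, results are easy to flatten into plain dictionaries for logging or serialization. The sketch below illustrates this with `EmbedRecord`, a hypothetical stand-in dataclass that mirrors the attributes documented above (it is not the library's class), and a hypothetical `file_name` metadata key chosen only for the example.

```python
import json
from dataclasses import dataclass, field


@dataclass
class EmbedRecord:
    """Hypothetical stand-in mirroring the documented EmbedData attributes."""
    embedding: list
    text: str
    metadata: dict = field(default_factory=dict)


def to_rows(records):
    """Flatten records into JSON-serializable dictionaries."""
    return [
        {
            "text": r.text,
            "dimension": len(r.embedding),
            "embedding": list(r.embedding),
            "metadata": dict(r.metadata),
        }
        for r in records
    ]


# Toy records; real ones would come from embed_file or embed_directory.
records = [
    EmbedRecord([0.1, 0.2, 0.3], "first chunk", {"file_name": "a.pdf"}),
    EmbedRecord([0.4, 0.5, 0.6], "second chunk", {"file_name": "a.pdf"}),
]
rows = to_rows(records)
serialized = json.dumps(rows)
```

The same `to_rows` function should work on real EmbedData objects as long as they expose `embedding`, `text`, and `metadata` as described in the table.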
EmbeddingModel
Represents an embedding model.
ImageEmbedConfig
Represents the configuration for the Image Embedding model.
Attributes:
Name | Type | Description |
---|---|---|
buffer_size | int \| None | The buffer size for the Image Embedding model. Default is 100. |
TextEmbedConfig
Represents the configuration for the Text Embedding model.
Attributes:
Name | Type | Description |
---|---|---|
chunk_size | int \| None | The chunk size for the Text Embedding model. |
batch_size | int \| None | The batch size for processing the embeddings. Default is 32. Increase or decrease it depending on the available memory. |
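To make the interaction between chunk_size and batch_size more concrete, here is a rough sketch of splitting a text into chunks and grouping the chunks into batches. It assumes, purely for illustration, that chunk_size counts characters; this reference does not state the unit or the splitting strategy the library actually uses, and `chunk_text` and `batched` are hypothetical helpers, not library functions.

```python
def chunk_text(text, chunk_size):
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def batched(items, batch_size):
    """Group items into consecutive lists of at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


text = "a" * 1000
chunks = chunk_text(text, chunk_size=200)   # 1000 / 200 = 5 chunks
batches = batched(chunks, batch_size=2)     # batches of 2, 2, and 1 chunks
```

Under this reading, a smaller chunk_size yields more chunks per file, while batch_size controls how many of those chunks are processed together, which is why it trades speed against memory.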
embed_audio_file(file_path, audio_decoder, embeder, text_embed_config=TextEmbedConfig(chunk_size=200, batch_size=32))
Embeds the given audio file and returns a list of EmbedData objects.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | str | The path to the audio file to embed. | required |
audio_decoder | AudioDecoderModel | The audio decoder model to use. | required |
embeder | EmbeddingModel | The embedding model to use. | required |
text_embed_config | TextEmbedConfig \| None | The configuration for the embedding model. | TextEmbedConfig(chunk_size=200, batch_size=32) |
Returns:
Type | Description |
---|---|
list[EmbedData] | A list of EmbedData objects. |
Example:
import embed_anything

audio_decoder = embed_anything.AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = embed_anything.EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = embed_anything.TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embeder=embeder,
    text_embed_config=config,
)
embed_directory(file_path, embeder, extensions, config=None, adapter=None)
Embeds the files in the given directory and returns a list of EmbedData objects.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | str | The path to the directory containing the files to embed. | required |
embeder | EmbeddingModel | The embedding model to use. | required |
extensions | list[str] | The list of file extensions to consider for embedding. | required |
config | TextEmbedConfig \| None | The configuration for the embedding model. | None |
adapter | Adapter \| None | The adapter to use for storing the embeddings in a vector database. | None |
Returns:
Type | Description |
---|---|
list[EmbedData] | A list of EmbedData objects. |
Example:
import embed_anything

model = embed_anything.EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
data = embed_anything.embed_directory("test_files", embeder=model, extensions=[".pdf"])
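Before embedding a large directory, it can help to preview which files an extensions list would select. The sketch below does this with pathlib. It is an approximation only: whether the library recurses into subdirectories or matches extensions case-insensitively is not stated in this reference, so both behaviors are assumptions here, and `list_files_with_extensions` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def list_files_with_extensions(directory, extensions):
    """Return sorted file paths under directory whose suffix is in extensions."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        p
        for p in Path(directory).rglob("*")  # assumption: recursive search
        if p.is_file() and p.suffix.lower() in wanted
    )


# Build a throwaway directory with a few placeholder files to filter.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.txt", "c.PDF"):
        (Path(tmp) / name).write_text("placeholder")
    matched_names = [p.name for p in list_files_with_extensions(tmp, [".pdf"])]
```

Here `matched_names` contains the two PDF files and skips `b.txt`; running the helper on the real directory gives a quick estimate of how much work embed_directory will have to do.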
embed_file(file_path, embeder, config=None, adapter=None)
Embeds the given file and returns a list of EmbedData objects.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | str | The path to the file to embed. | required |
embeder | EmbeddingModel | The embedding model to use. | required |
config | TextEmbedConfig \| None | The configuration for the embedding model. | None |
adapter | Adapter \| None | The adapter to use for storing the embeddings in a vector database. | None |
Returns:
Type | Description |
---|---|
list[EmbedData] | A list of EmbedData objects. |
Example:
import embed_anything

model = embed_anything.EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)
embed_image_directory(file_path, embeder, config=None, adapter=None)
Embeds the images in the given directory and returns a list of EmbedData objects.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | str | The path to the directory containing the images to embed. | required |
embeder | EmbeddingModel | The embedding model to use. | required |
config | ImageEmbedConfig \| None | The configuration for the embedding model. | None |
adapter | Adapter \| None | The adapter to use for storing the embeddings in a vector database. | None |
Returns:
Type | Description |
---|---|
list[EmbedData] | A list of EmbedData objects. |
embed_query(query, embeder, config=None)
Embeds the given query and returns a list of EmbedData objects.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | list[str] | The query to embed. | required |
embeder | EmbeddingModel | The embedding model to use. | required |
config | TextEmbedConfig \| None | The configuration for the embedding model. | None |
Returns:
Type | Description |
---|---|
list[EmbedData] | A list of EmbedData objects. |
Example:
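import embed_anything

model = embed_anything.EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
# The query is passed as a list of strings, matching the signature above
data = embed_anything.embed_query(["What is an embedding?"], embeder=model)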
embed_webpage(url, embeder, config, adapter)
Embeds the webpage at the given URL and returns a list of EmbedData objects.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the webpage to embed. | required |
embeder | EmbeddingModel | The embedding model to use. | required |
config | TextEmbedConfig \| None | The configuration for the embedding model. | required |
adapter | Adapter \| None | The adapter to use for storing the embeddings. | required |
Returns:
Type | Description |
---|---|
list[EmbedData] \| None | A list of EmbedData objects. |
Example:
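import embed_anything

# Build the model and config as in the other examples, to match the signature above
model = embed_anything.EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = embed_anything.TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_webpage(
    "https://www.akshaymakes.com/", embeder=model, config=config, adapter=None
)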