Use PDFs that need OCR
Embed Anything can embed scanned documents by running OCR on them, which is useful for tasks such as document search and retrieval. To enable OCR, set use_ocr=True in the TextEmbedConfig. This requires tesseract and poppler to be installed.
You can install tesseract and poppler using the following commands:
Install Tesseract and Poppler
Windows
For Tesseract, download the installer from here and install it.
For Poppler, download the installer from here and install it.
MacOS
Install Tesseract using Homebrew:
brew install tesseract
Install Poppler using Homebrew:
brew install poppler
Linux
Install Tesseract using the package manager for your Linux distribution. For example, on Ubuntu:
sudo apt install tesseract-ocr
Install Poppler using the package manager for your Linux distribution. For example, on Ubuntu:
sudo apt install poppler-utils
For more information, refer to the Tesseract installation guide.
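Before enabling OCR, it can help to confirm that both tools are reachable from Python. The following is a minimal sketch, not part of Embed Anything: the helper names are hypothetical, and it assumes that Tesseract exposes a `tesseract` executable and that Poppler exposes `pdftoppm` on your PATH.

```python
import shutil

# Assumption: these are the executable names each dependency puts on PATH.
REQUIRED_BINARIES = {"tesseract": "tesseract", "poppler": "pdftoppm"}


def find_ocr_dependencies(binaries=REQUIRED_BINARIES):
    """Map each dependency name to its executable path, or None if not found."""
    return {name: shutil.which(exe) for name, exe in binaries.items()}


def missing_ocr_dependencies(binaries=REQUIRED_BINARIES):
    """Return the names of dependencies whose executable is not on PATH."""
    found = find_ocr_dependencies(binaries)
    return [name for name, path in found.items() if path is None]


if __name__ == "__main__":
    missing = missing_ocr_dependencies()
    if missing:
        print("Missing OCR dependencies:", ", ".join(missing))
    else:
        print("tesseract and poppler were both found on PATH.")
```

If either name is reported as missing, install it with the commands above, or check that its install directory is on PATH (a common issue on Windows).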
Example Usage
# OCR requires `tesseract` and `poppler` to be installed.
from time import time

import embed_anything
from embed_anything import EmbedData, EmbeddingModel, TextEmbedConfig, WhichModel

# Load the embedding model from Hugging Face.
model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Jina, model_id="jinaai/jina-embeddings-v2-small-en"
)

# use_ocr=True enables OCR for scanned PDFs.
config = TextEmbedConfig(
    chunk_size=256,
    batch_size=32,
    buffer_size=64,
    splitting_strategy="sentence",
    use_ocr=True,
)

start = time()
data: list[EmbedData] = embed_anything.embed_file(
    "/home/akshay/projects/starlaw/src-server/test_files/court.pdf",  # Replace with your file path
    embedder=model,
    config=config,
)
end = time()

# Print the text of each embedded chunk.
for d in data:
    print(d.text)
    print("---" * 20)

print(f"Time taken: {end - start} seconds")
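To time OCR embedding over several scanned PDFs rather than one, the same call can be wrapped in a small loop. This is a sketch under assumptions: `embed_pdfs` and `embed_one` are hypothetical names, not part of Embed Anything, and the embedding call is passed in as a function so that the helper does not depend on any particular model or config.

```python
from pathlib import Path
from time import perf_counter
from typing import Callable, Iterable


def embed_pdfs(
    directory: str, embed_one: Callable[[str], Iterable]
) -> dict[str, tuple[list, float]]:
    """Run `embed_one` on each PDF in `directory`.

    Returns a dict mapping file name to (chunks, elapsed_seconds).
    """
    results = {}
    for pdf in sorted(Path(directory).glob("*.pdf")):
        start = perf_counter()
        chunks = list(embed_one(str(pdf)))
        results[pdf.name] = (chunks, perf_counter() - start)
    return results
```

With the model and config from the example above, a call might look like `embed_pdfs("scans/", lambda p: embed_anything.embed_file(p, embedder=model, config=config))`, where `scans/` is a placeholder directory.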