Use PDFs that need OCR

Embed Anything can embed scanned documents by running OCR on them, which is useful for tasks such as document search and retrieval. To enable OCR, set use_ocr=True in the TextEmbedConfig. OCR requires Tesseract and Poppler to be installed.

The sections below show how to install both on Windows, macOS, and Linux.

Install Tesseract and Poppler

Windows

For Tesseract, download the installer from here and install it.

For Poppler, download the installer from here and install it.

macOS

Install Tesseract using Homebrew:

brew install tesseract

Install Poppler using Homebrew:

brew install poppler

Linux

Install Tesseract using the package manager for your Linux distribution. For example, on Ubuntu:

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev

Install Poppler using the package manager for your Linux distribution. For example, on Ubuntu:

sudo apt install poppler-utils

For more information, refer to the Tesseract installation guide.
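
Before enabling OCR, it can help to confirm that both tools are reachable from Python. The following is a minimal sketch using only the standard library, not part of Embed Anything. The helper name missing_ocr_tools is hypothetical, and using pdftoppm (one of the command-line tools shipped with Poppler) as the indicator that Poppler is installed is an assumption.

```python
import shutil

# Executables to look up on PATH, mapped to the package that provides them.
# Assumption: `pdftoppm` (shipped with Poppler) is a reasonable indicator
# that Poppler is installed.
REQUIRED_TOOLS = {"tesseract": "Tesseract", "pdftoppm": "Poppler"}


def missing_ocr_tools(tools=REQUIRED_TOOLS):
    """Return the package names whose executable was not found on PATH."""
    return [name for exe, name in tools.items() if shutil.which(exe) is None]


missing = missing_ocr_tools()
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("Tesseract and Poppler were found on PATH.")
```

If a tool is reported as missing right after installation, opening a new terminal so that PATH is reloaded is often enough.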

Example Usage

# OCR requires `tesseract` and `poppler` to be installed.

from time import time

import embed_anything
from embed_anything import EmbedData, EmbeddingModel, TextEmbedConfig, WhichModel


model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Jina, model_id="jinaai/jina-embeddings-v2-small-en"
)

config = TextEmbedConfig(
    chunk_size=256,
    batch_size=32,
    buffer_size=64,
    splitting_strategy="sentence",
    use_ocr=True,  # Enable OCR for scanned PDFs
)

start = time()

data: list[EmbedData] = embed_anything.embed_file(
    "/home/akshay/projects/starlaw/src-server/test_files/court.pdf",  # Replace with your file path
    embedder=model,
    config=config,
)
end = time()

for d in data:
    print(d.text)
    print("---" * 20)

print(f"Time taken: {end - start} seconds")
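
The example above measures elapsed time with time(). As an optional variation, the sketch below wraps the measurement in a small context manager built on perf_counter, a monotonic clock intended for timing intervals. The timed helper is hypothetical and not part of Embed Anything, and the sum(...) call is only a stand-in for the embed_file call.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label):
    """Print how long the enclosed block took, even if it raises."""
    start = perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {perf_counter() - start:.2f} seconds")


with timed("Time taken"):
    # Stand-in workload; replace with the embed_anything.embed_file(...) call.
    sum(range(1_000_000))
```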