Vespa 🤝 ColPali: Efficient Document Retrieval with Vision Language Models
For a simpler example of using ColPali, where we use one Vespa document = one PDF page, see simplified-retrieval-with-colpali.
This notebook demonstrates how to represent ColPali in Vespa. ColPali is a powerful visual language model that can generate embeddings for images and text. In this notebook, we use ColPali to generate embeddings for images of PDF pages and store them in Vespa. We also store the base64-encoded image of each PDF page and some metadata like title and url. We then demonstrate how to retrieve the PDF pages using the embeddings generated by ColPali.
ColPali is a combination of ColBERT and PaliGemma:
ColPali is enabled by the latest advances in Vision Language Models, notably the PaliGemma model from the Google ZΓΌrich team, and leverages multi-vector retrieval through late interaction mechanisms as proposed in ColBERT by Omar Khattab.
Quote from ColPali: Efficient Document Retrieval with Vision Language Models.
The ColPali model achieves remarkable retrieval performance on the ViDoRe (Visual Document Retrieval) Benchmark, beating complex pipelines with a single model.
The TLDR of this notebook:
Generate an image per PDF page using pdf2image and also extract the text using pypdf.
For each page image, use ColPali to obtain the visual multi-vector embeddings.
We then store the ColPali embeddings in Vespa using the long-context variant, representing all the page embeddings of a PDF in a single tensor of type tensor(page{}, patch{}, v[128]). This enables us to use the PDF as the document (retrievable unit), storing the page embeddings in the same document.
The upside is that we do not need to duplicate document-level metadata like title and url. The downside is that we cannot retrieve using the ColPali embeddings directly; we need to use the extracted text for retrieval, and the ColPali embeddings are only used for reranking the results.
For a simpler example where we use one Vespa document = one PDF page, see simplified-retrieval-with-colpali.
Consider following the ColQwen2 notebook instead, as it uses a better model with improved performance (both accuracy and speed).
We also store the base64-encoded image and document metadata like title and url, so that we can display them in the result page and also use them for RAG with powerful vision-capable LLMs.
At query time, we retrieve using BM25 over all the text from all pages, then use the ColPali embeddings to rerank the results using the max page score.
Let us get started.
Install dependencies:
Note that the Python pdf2image package requires poppler-utils; see other installation options here.
[ ]:
!sudo apt-get install poppler-utils -y
Install python packages
[ ]:
!pip3 install colpali-engine==0.2.2 pdf2image pypdf pyvespa vespacli requests
[3]:
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoProcessor
from PIL import Image
from io import BytesIO
from colpali_engine.models.paligemma_colbert_architecture import ColPali
from colpali_engine.utils.colpali_processing_utils import (
process_images,
process_queries,
)
from colpali_engine.utils.image_utils import scale_image, get_base64_image
Load the model
This requires that the HF_TOKEN environment variable is set, as the underlying PaliGemma model is hosted on Hugging Face and has a restrictive license that requires authentication.
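One way to provide the token when running interactively is to set the environment variable before loading the model. This is a minimal sketch; the token value shown is a hypothetical placeholder, and you can instead log in via huggingface-cli or Colab secrets.
[ ]:
import os

# Hypothetical placeholder - replace with your own Hugging Face access token
os.environ["HF_TOKEN"] = "hf_your_token_here"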
Choose the right device to run the model.
[4]:
if torch.cuda.is_available():
device = torch.device("cuda")
type = torch.bfloat16
elif torch.backends.mps.is_available():
device = torch.device("mps")
type = torch.float32
else:
device = torch.device("cpu")
type = torch.float32
[ ]:
model_name = "vidore/colpali-v1.2"
model = ColPali.from_pretrained("vidore/colpaligemma-3b-pt-448-base", torch_dtype=type).eval()
model.load_adapter(model_name)
model = model.eval()
model.to(device)
processor = AutoProcessor.from_pretrained(model_name)
Working with PDFs
We need to convert a PDF to an array of images, one image per page. We use pdf2image for this. Additionally, we extract the text content of the PDF using pypdf.
NOTE: This step requires that you have poppler installed on your system. Read more in the pdf2image docs.
[6]:
import requests
from pdf2image import convert_from_path
from pypdf import PdfReader
def download_pdf(url):
response = requests.get(url)
if response.status_code == 200:
return BytesIO(response.content)
else:
raise Exception(f"Failed to download PDF: Status code {response.status_code}")
def get_pdf_images(pdf_url):
# Download the PDF
pdf_file = download_pdf(pdf_url)
# Save the PDF temporarily to disk (pdf2image requires a file path)
with open("temp.pdf", "wb") as f:
f.write(pdf_file.read())
reader = PdfReader("temp.pdf")
page_texts = []
for page_number in range(len(reader.pages)):
page = reader.pages[page_number]
text = page.extract_text()
page_texts.append(text)
images = convert_from_path("temp.pdf")
assert len(images) == len(page_texts)
return (images, page_texts)
We define a few sample PDFs to work with.
[7]:
sample_pdfs = [
{
"title": "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction",
"url": "https://arxiv.org/pdf/2112.01488.pdf",
"authors": "Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, Matei Zaharia",
},
{
"title": "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT",
"url": "https://arxiv.org/pdf/2004.12832.pdf",
"authors": "Omar Khattab, Matei Zaharia",
},
]
Now we can convert the PDFs to images and also extract the text content.
[8]:
for pdf in sample_pdfs:
page_images, page_texts = get_pdf_images(pdf["url"])
pdf["images"] = page_images
pdf["texts"] = page_texts
Let us look at the extracted image of the first PDF page. This is the input to ColPali.
[9]:
from IPython.display import display
display(scale_image(sample_pdfs[0]["images"][0], 720))
Now we use the ColPali model to generate embeddings for the images.
[10]:
for pdf in sample_pdfs:
page_embeddings = []
dataloader = DataLoader(
pdf["images"],
batch_size=2,
shuffle=False,
collate_fn=lambda x: process_images(processor, x),
)
for batch_doc in tqdm(dataloader):
with torch.no_grad():
batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
embeddings_doc = model(**batch_doc)
page_embeddings.extend(list(torch.unbind(embeddings_doc.to("cpu"))))
pdf["embeddings"] = page_embeddings
100%|██████████| 10/10 [01:34<00:00,  9.47s/it]
100%|██████████| 5/5 [00:48<00:00,  9.64s/it]
Now that we are done with the document-side embeddings, we convert the custom dict to the Vespa JSON feed format.
We binarize the vector embeddings to reduce their size. Read more about binarization of multi-vector representations in the colbert blog post. This maps the 128-dimensional float vectors to 128 bits, or 16 bytes per vector, reducing the size by 32x.
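To make the packing concrete, here is a minimal sketch (assuming only numpy and Python's standard binascii module) of how a single 128-dimensional float vector is binarized and hex-encoded into 16 bytes; the full routine over all pages of a document follows below.
[ ]:
import numpy as np
from binascii import hexlify

# A stand-in 128-dimensional float vector, representing one patch embedding
vector = np.random.randn(128).astype(np.float32)

# Keep only the sign of each dimension: positive -> 1, otherwise 0
bits = np.where(vector > 0, 1, 0).astype(np.uint8)

# Pack 128 bits into 16 bytes and hex-encode them for the Vespa feed format
packed = np.packbits(bits).astype(np.int8)
hex_value = str(hexlify(packed.tobytes()), "utf-8")

print(len(packed), len(hex_value))  # 16 int8 values, 32 hex characters
With float32 storage, 512 bytes per vector become 16 bytes, which is the 32x reduction mentioned above.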
[37]:
import numpy as np
from typing import Dict, List
from binascii import hexlify
def binarize_token_vectors_hex(vectors: List[torch.Tensor]) -> Dict[str, str]:
vespa_tensor = list()
for page_id in range(0, len(vectors)):
page_vector = vectors[page_id]
binarized_token_vectors = np.packbits(
np.where(page_vector > 0, 1, 0), axis=1
).astype(np.int8)
for patch_index in range(0, len(page_vector)):
values = str(
hexlify(binarized_token_vectors[patch_index].tobytes()), "utf-8"
)
if (
values == "00000000000000000000000000000000"
): # skip empty vectors due to padding of batch
continue
vespa_tensor_cell = {
"address": {"page": page_id, "patch": patch_index},
"values": values,
}
vespa_tensor.append(vespa_tensor_cell)
return vespa_tensor
Iterate over the sample and create the Vespa JSON feed format, including the base64 encoded page images.
[38]:
vespa_feed = []
for idx, pdf in enumerate(sample_pdfs):
images_base_64 = []
for image in pdf["images"]:
images_base_64.append(get_base64_image(image, add_url_prefix=False))
pdf["images_base_64"] = images_base_64
doc = {
"fields": {
"url": pdf["url"],
"title": pdf["title"],
"images": pdf["images_base_64"],
"texts": pdf["texts"], # Array of text per page
"colbert": { # Colbert embeddings per page
"blocks": binarize_token_vectors_hex(pdf["embeddings"])
},
}
}
vespa_feed.append(doc)
[64]:
vespa_feed[0]["fields"]["colbert"]["blocks"][0:5]
[64]:
[{'address': {'page': 0, 'patch': 0},
'values': '93d23b85a3bb52c1b2ae05827ba19ad9'},
{'address': {'page': 0, 'patch': 1},
'values': '91c49b6deb226480f3dc05837bb08b09'},
{'address': {'page': 0, 'patch': 2},
'values': 'a3cd5b3d653ad2a87b5c0d2157b08b0b'},
{'address': {'page': 0, 'patch': 3},
'values': '91c51b3de3aa4480f39c05017bb08b09'},
{'address': {'page': 0, 'patch': 4},
'values': 'a0cd5b3de5b2f4a07b5a0d005b288b09'}]
Above is the feed format for mixed tensors with more than one mapped dimension, see details. We have the page and patch dimensions, and for each combination we have a binary representation of the 128-dimensional embedding, packed into 16 bytes. For each page image, we have 1030 patches, each with a 128-dimensional embedding.
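As a quick sanity check (a sketch, assuming the sample_pdfs and vespa_feed structures built above), you can inspect the shape of one page embedding and count the tensor cells produced for that page:
[ ]:
# One page embedding: (number of patches/tokens, embedding dimension)
page_embedding = sample_pdfs[0]["embeddings"][0]
print(page_embedding.shape)  # expected to be on the order of (1030, 128)

# Number of binarized cells fed to Vespa for page 0 of the first document
cells_page_0 = [
    cell
    for cell in vespa_feed[0]["fields"]["colbert"]["blocks"]
    if cell["address"]["page"] == 0
]
print(len(cells_page_0))  # at most 1030; padding vectors were skipped during binarization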
Configure Vespa
PyVespa helps us build the Vespa application package. A Vespa application package consists of configuration files, schemas, models, and code (plugins).
First, we define a Vespa schema with the fields we want to store and their type.
[39]:
from vespa.package import Schema, Document, Field, FieldSet
colbert_schema = Schema(
name="doc",
document=Document(
fields=[
Field(name="url", type="string", indexing=["summary"]),
Field(
name="title",
type="string",
indexing=["summary", "index"],
index="enable-bm25",
),
Field(
name="texts",
type="array<string>",
indexing=["index"],
index="enable-bm25",
),
Field(
name="images",
type="array<string>",
indexing=["summary"],
),
Field(
name="colbert",
type="tensor<int8>(page{}, patch{}, v[16])",
indexing=["attribute"],
),
]
),
fieldsets=[FieldSet(name="default", fields=["title", "texts"])],
)
Notice the colbert field is a tensor field with the type tensor<int8>(page{}, patch{}, v[16]). This is the field that stores the binarized embeddings generated by ColPali. It is an example of a mixed tensor, where we combine two mapped (sparse) dimensions with one dense dimension. Read more in the Tensor guide. We also enable BM25 for the title and texts fields.
Create the Vespa application package:
[40]:
from vespa.package import ApplicationPackage
vespa_app_name = "visionrag"
vespa_application_package = ApplicationPackage(
name=vespa_app_name, schema=[colbert_schema]
)
Now we define how we want to rank the pages. We use BM25 for the text and late interaction (MaxSim) for the image embeddings. This means that we retrieve using the text representations to find relevant PDF documents, then use the ColPali embeddings to rerank the documents using the maximum of the page scores.
We also return all the page-level scores using match-features, so that we can render several high-scoring pages in the search result. As LLMs get longer context windows, we can input more than a single page per PDF.
[41]:
from vespa.package import RankProfile, Function, FirstPhaseRanking, SecondPhaseRanking
colbert_profile = RankProfile(
name="default",
inputs=[("query(qt)", "tensor<float>(querytoken{}, v[128])")],
functions=[
Function(
name="max_sim_per_page",
expression="""
sum(
reduce(
sum(
query(qt) * unpack_bits(attribute(colbert)) , v
),
max, patch
),
querytoken
)
""",
),
Function(name="max_sim", expression="reduce(max_sim_per_page, max, page)"),
Function(name="bm25_score", expression="bm25(title) + bm25(texts)"),
],
first_phase=FirstPhaseRanking(expression="bm25_score"),
second_phase=SecondPhaseRanking(expression="max_sim", rerank_count=10),
match_features=["max_sim_per_page", "bm25_score"],
)
colbert_schema.add_rank_profile(colbert_profile)
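The max_sim_per_page function above is the standard ColBERT-style late-interaction score. As a rough illustration (plain PyTorch executed client-side, not what Vespa runs, and using float embeddings rather than the unpacked binary ones), the per-page score corresponds to:
[ ]:
import torch

def max_sim(query_embeddings: torch.Tensor, page_embeddings: torch.Tensor) -> float:
    # query_embeddings: (num_query_tokens, 128), page_embeddings: (num_patches, 128)
    # Dot product of every query token against every patch: (num_query_tokens, num_patches)
    similarities = torch.einsum("qd,pd->qp", query_embeddings, page_embeddings)
    # For each query token, keep its best-matching patch, then sum over query tokens
    return similarities.max(dim=1).values.sum().item()

# Illustration with random tensors shaped like the real embeddings
score = max_sim(torch.randn(20, 128), torch.randn(1030, 128))
The document-level max_sim then takes the maximum of these per-page scores, which is what the second-phase expression reranks on.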
Validate that certificates are ok and deploy the application to Vespa Cloud.
Deploy to Vespa Cloud
With the configured application, we can deploy it to Vespa Cloud.
PyVespa supports deploying apps to the development zone.
Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.
[45]:
from vespa.deployment import VespaCloud
import os
os.environ['TOKENIZERS_PARALLELISM'] = "false"
# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
key = os.getenv("VESPA_TEAM_API_KEY", None)
if key is not None:
key = key.replace(r"\n", "\n") # To parse key correctly
vespa_cloud = VespaCloud(
tenant=tenant_name,
application=vespa_app_name,
key_content=key, # Key is only used for CI/CD testing of this notebook. Can be removed if logging in interactively
application_package=vespa_application_package,
)
Now deploy the app to Vespa Cloud dev zone.
The first deployment typically takes 2 minutes until the endpoint is up.
[ ]:
from vespa.application import Vespa
app: Vespa = vespa_cloud.deploy()
This example uses the synchronous feed method and feeds one document at a time. For larger datasets, consider using the asynchronous feed method.
[52]:
from vespa.io import VespaResponse
with app.syncio() as sync:
for operation in vespa_feed:
fields = operation["fields"]
response: VespaResponse = sync.feed_data_point(
data_id=fields["url"], fields=fields, schema="doc"
)
if not response.is_successful():
print(response.json())
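For larger datasets, one option is pyvespa's feed_iterable, which feeds documents concurrently. The following is a rough sketch reusing the vespa_feed list built earlier; verify the parameters against the pyvespa version you have installed.
[ ]:
from vespa.io import VespaResponse

def feed_callback(response: VespaResponse, doc_id: str):
    # Called once per document; log failures instead of stopping the feed
    if not response.is_successful():
        print(f"Failed to feed {doc_id}: {response.json()}")

# feed_iterable expects an iterable of dicts with "id" and "fields" keys
app.feed_iterable(
    iter=[{"id": doc["fields"]["url"], "fields": doc["fields"]} for doc in vespa_feed],
    schema="doc",
    callback=feed_callback,
)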
Querying Vespa
Now we have indexed the PDF pages in Vespa. Let us obtain ColPali embeddings for a text query and use them to match against the indexed PDF pages.
The ColPali model text encoder needs a "dummy" image.
[24]:
dummy_image = Image.new("RGB", (448, 448), (255, 255, 255))
Our demo query:
Composition of the LoTTE benchmark
[25]:
queries = ["Composition of the LoTTE benchmark"]
Obtain the query embeddings using the ColPali model
[26]:
dataloader = DataLoader(
queries,
batch_size=1,
shuffle=False,
collate_fn=lambda x: process_queries(processor, x, dummy_image),
)
qs = []
for batch_query in dataloader:
with torch.no_grad():
batch_query = {k: v.to(model.device) for k, v in batch_query.items()}
embeddings_query = model(**batch_query)
qs.extend(list(torch.unbind(embeddings_query.to("cpu"))))
A simple routine to format the ColPali multi-vector embeddings into a format that can be used in Vespa. See querying with tensors for more details.
[27]:
def float_query_token_vectors(vectors: torch.Tensor) -> Dict[str, List[float]]:
vespa_token_dict = dict()
for index in range(0, len(vectors)):
vespa_token_dict[index] = vectors[index].tolist()
return vespa_token_dict
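To see the structure this produces (a quick sketch, assuming the qs list computed above): each query-token index maps to a 128-dimensional float list, matching the query(qt) input type tensor<float>(querytoken{}, v[128]) declared in the rank profile.
[ ]:
example_query_tensor = float_query_token_vectors(qs[0])
print(len(example_query_tensor))     # number of query tokens
print(len(example_query_tensor[0]))  # 128 values per token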
We create a simple routine to display the results.
Notice that each hit is a PDF document; within a PDF document we have multiple pages, and we have the MaxSim score for each page.
The PDF documents are ranked by the maximum page score, but we have access to all the page-level scores, and below we display the top two pages for each PDF document. We convert the base64-encoded image to a PIL image for rendering. We could also render the extracted text, but we skip that for now.
[28]:
from IPython.display import display, HTML
import base64
def display_query_results(query, response):
"""
Displays the query result, including the two best matching pages per matched pdf.
"""
html_content = f"<h3>Query text: {query}</h3>"
for i, hit in enumerate(response.hits[:2]): # Adjust to show more hits if needed
title = hit["fields"]["title"]
url = hit["fields"]["url"]
match_scores = hit["fields"]["matchfeatures"]["max_sim_per_page"]
images = hit["fields"]["images"]
html_content += f"<h3>PDF Result {i + 1}</h3>"
html_content += f'<p><strong>Title:</strong> <a href="{url}">{title}</a></p>'
# Find the two best matching pages
sorted_pages = sorted(match_scores.items(), key=lambda x: x[1], reverse=True)
best_pages = sorted_pages[:2]
for page, score in best_pages:
page = int(page)
image_data = base64.b64decode(images[page])
image = Image.open(BytesIO(image_data))
scaled_image = scale_image(image, 648)
buffered = BytesIO()
scaled_image.save(buffered, format="PNG")
img_str = base64.b64encode(buffered.getvalue()).decode()
html_content += f"<p><strong>Best Matching Page {page+1} for PDF document:</strong> with MaxSim score {score:.2f}</p>"
html_content += (
f'<img src="data:image/png;base64,{img_str}" style="max-width:100%;">'
)
display(HTML(html_content))
Query Vespa with a text query and display the results.
[53]:
from vespa.io import VespaQueryResponse
for idx, query in enumerate(queries):
response: VespaQueryResponse = app.query(
yql="select title,url,images from doc where userInput(@userQuery)",
ranking="default",
userQuery=query,
timeout=2,
hits=3,
body={
"presentation.format.tensors": "short-value",
"input.query(qt)": float_query_token_vectors(qs[idx]),
},
)
assert response.is_successful()
display_query_results(query, response)
Query text: Composition of the LoTTE benchmark
PDF Result 1
Title: ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction
Best Matching Page 6 for PDF document: with MaxSim score 46.84
Best Matching Page 10 for PDF document: with MaxSim score 45.62
PDF Result 2
Title: ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
Best Matching Page 1 for PDF document: with MaxSim score 40.29