Evaluating a Vespa Application
We are often asked by users and customers what the best retrieval and ranking strategy is for a given use case. Even though we might have an intuition, we always recommend setting up experiments and doing a proper quantitative evaluation.
Models are temporary; Evals are forever.
-Eugene Yan
Without a proper evaluation setup, you run the risk of settling for lgtm@10 (looks good to me @ 10).
Once you deploy your application to users, you can be sure you will get feedback about queries that do not produce relevant results. If you try to optimize for those without knowing whether your tweaks actually improve the overall quality of your search, you might end up with a system that is worse than the one you started with.
So, what can you do?
You can set up a proper evaluation pipeline, where you can test different ranking strategies, and see how they perform on a set of evaluation queries that act as a proxy for your real users’ queries. This way, you can make informed decisions about what works best for your use case. If you collect real user interactions, it could be even better, but it is important to also keep the evaluation pipeline light enough so that you can run it both during development and in CI pipelines (possibly at different scales).
This guide will show how you can easily evaluate a Vespa application using pyvespa’s VespaEvaluator class.
We will define and compare 4 different ranking strategies in this guide:
bm25 - Keyword-based retrieval and ranking. The solid baseline.
semantic - Vector search using cosine similarity (using https://huggingface.co/intfloat/e5-small-v2 for embeddings).
fusion - Hybrid search (semantic + keyword), combining BM25 and semantic scores with reciprocal rank fusion.
atan_norm - Hybrid search, combining BM25 and semantic scores with atan normalization as described in Aapo Tanskanen’s Guidebook to the State-of-the-Art Embeddings and Information Retrieval (originally proposed by Seo et al. (2022)). A small sketch of the score mapping follows this list.
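As a quick illustration of that score mapping, here is a minimal Python sketch (the same expression, 2*atan(val)/pi, is used in the atan_norm rank profile defined later in this guide):

import math

def atan_scale(score: float) -> float:
    # Maps an unbounded, non-negative score (such as BM25) into [0, 1),
    # mirroring the scale() function in the atan_norm rank profile below.
    return 2 * math.atan(score) / math.pi

print(atan_scale(0.0), atan_scale(5.0), atan_scale(50.0))  # 0.0, ~0.87, ~0.99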
Refer to troubleshooting if you run into any problems when running this guide.
Prerequisite: Create a tenant at cloud.vespa.ai and save the tenant name.
Install
Install pyvespa >= 0.53.0 and the Vespa CLI. The Vespa CLI is used for data and control plane key management (Vespa Cloud Security Guide).
[1]:
#!pip3 install pyvespa vespacli datasets pandas
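To verify that the installed pyvespa meets the minimum version stated above, you can check it like this (a small sanity check; assumes pyvespa was installed with pip):

from importlib.metadata import version

# Should print 0.53.0 or newer
print(version("pyvespa"))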
Configure application
[2]:
# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
# Replace with your application name (does not need to exist yet)
application = "evaluation"
schema_name = "doc"
Create an application package
The application package has all the Vespa configuration files - create one from scratch:
[3]:
from vespa.package import (
ApplicationPackage,
Field,
Schema,
Document,
HNSW,
RankProfile,
Component,
Parameter,
FieldSet,
GlobalPhaseRanking,
Function,
)
package = ApplicationPackage(
name=application,
schema=[
Schema(
name=schema_name,
document=Document(
fields=[
Field(name="id", type="string", indexing=["summary"]),
Field(
name="text",
type="string",
indexing=["index", "summary"],
index="enable-bm25",
bolding=True,
),
Field(
name="embedding",
type="tensor<float>(x[384])",
indexing=[
"input text",
"embed", # uses default model
"index",
"attribute",
],
ann=HNSW(distance_metric="angular"),
is_document_field=False,
),
]
),
fieldsets=[FieldSet(name="default", fields=["text"])],
rank_profiles=[
RankProfile(
name="bm25",
inputs=[("query(q)", "tensor<float>(x[384])")],
functions=[Function(name="bm25text", expression="bm25(text)")],
first_phase="bm25text",
match_features=["bm25text"],
),
RankProfile(
name="semantic",
inputs=[("query(q)", "tensor<float>(x[384])")],
functions=[
Function(
name="cos_sim", expression="closeness(field, embedding)"
)
],
first_phase="cos_sim",
match_features=["cos_sim"],
),
RankProfile(
name="fusion",
inherits="bm25",
functions=[
Function(
name="cos_sim", expression="closeness(field, embedding)"
)
],
inputs=[("query(q)", "tensor<float>(x[384])")],
first_phase="cos_sim",
global_phase=GlobalPhaseRanking(
expression="reciprocal_rank_fusion(bm25text, closeness(field, embedding))",
rerank_count=1000,
),
match_features=["cos_sim", "bm25text"],
),
RankProfile(
name="atan_norm",
inherits="bm25",
inputs=[("query(q)", "tensor<float>(x[384])")],
functions=[
Function(
name="scale",
args=["val"],
expression="2*atan(val)/(3.14159)",
),
Function(
name="normalized_bm25", expression="scale(bm25(text))"
),
Function(
name="cos_sim", expression="closeness(field, embedding)"
),
],
first_phase="normalized_bm25",
global_phase=GlobalPhaseRanking(
expression="normalize_linear(normalized_bm25) + normalize_linear(cos_sim)",
rerank_count=1000,
),
match_features=["cos_sim", "normalized_bm25"],
),
],
)
],
components=[
Component(
id="e5",
type="hugging-face-embedder",
parameters=[
Parameter(
"transformer-model",
{
"model-id": "e5-small-v2"
}, # in vespa cloud, we can use the model-id for selected models, see https://cloud.vespa.ai/en/model-hub
),
Parameter(
"tokenizer-model",
{"model-id": "e5-base-v2-vocab"},
),
],
)
],
)
Note that the name cannot contain - or _.
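If you want to inspect the generated configuration files before deploying, you can write the application package to disk (a small sketch; to_files is available in recent pyvespa versions):

import pathlib
import tempfile

# Write the generated application package (services.xml, schemas/doc.sd, ...) to a
# temporary directory for inspection.
out_dir = pathlib.Path(tempfile.mkdtemp())
package.to_files(out_dir)
print(sorted(str(p.relative_to(out_dir)) for p in out_dir.rglob("*") if p.is_file()))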
Deploy to Vespa Cloud
The app is now defined and ready to deploy to Vespa Cloud.
Deploy the package to Vespa Cloud by creating an instance of VespaCloud:
[4]:
from vespa.deployment import VespaCloud
import os
# Key is only used for CI/CD. Can be removed if logging in interactively
vespa_cloud = VespaCloud(
tenant=tenant_name,
application=application,
key_content=os.getenv(
"VESPA_TEAM_API_KEY", None
), # Key is only used for CI/CD. Can be removed if logging in interactively
application_package=package,
)
Setting application...
Running: vespa config set application vespa-team.evaluation
Setting target cloud...
Running: vespa config set target cloud
Api-key found for control plane access. Using api-key.
For more details on different authentication options and methods, see authenticating-to-vespa-cloud.
The following will upload the application package to the Vespa Cloud dev zone (aws-us-east-1c); read more about Vespa Zones. The dev zone is a sandbox environment where resources are down-scaled and idle deployments are expired automatically. For information about production deployments, see the following method.
Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.
Now deploy the app to the Vespa Cloud dev zone.
The first deployment typically takes about 2 minutes until the endpoint is up. (Applications that, for example, refer to large ONNX models may take a bit longer.)
[5]:
from vespa.application import Vespa
app: Vespa = vespa_cloud.deploy()
Deployment started in run 4 of dev-aws-us-east-1c for vespa-team.evaluation. This may take a few minutes the first time.
INFO [14:00:58] Deploying platform version 8.478.26 and application dev build 4 for dev-aws-us-east-1c of default ...
INFO [14:00:58] Using CA signed certificate version 1
INFO [14:00:58] Using 1 nodes in container cluster 'evaluation_container'
INFO [14:01:01] Session 338115 for tenant 'vespa-team' prepared and activated.
INFO [14:01:01] ######## Details for all nodes ########
INFO [14:01:01] h113421f.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [14:01:01] --- platform vespa/cloud-tenant-rhel8:8.478.26
INFO [14:01:01] --- container-clustercontroller on port 19050 has config generation 338110, wanted is 338115
INFO [14:01:01] --- metricsproxy-container on port 19092 has config generation 338110, wanted is 338115
INFO [14:01:01] h113421e.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [14:01:01] --- platform vespa/cloud-tenant-rhel8:8.478.26
INFO [14:01:01] --- logserver-container on port 4080 has config generation 338110, wanted is 338115
INFO [14:01:01] --- metricsproxy-container on port 19092 has config generation 338110, wanted is 338115
INFO [14:01:01] h113501a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [14:01:01] --- platform vespa/cloud-tenant-rhel8:8.478.26
INFO [14:01:01] --- container on port 4080 has config generation 338110, wanted is 338115
INFO [14:01:01] --- metricsproxy-container on port 19092 has config generation 338115, wanted is 338115
INFO [14:01:01] h112930a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO [14:01:01] --- platform vespa/cloud-tenant-rhel8:8.478.26
INFO [14:01:01] --- storagenode on port 19102 has config generation 338110, wanted is 338115
INFO [14:01:01] --- searchnode on port 19107 has config generation 338115, wanted is 338115
INFO [14:01:01] --- distributor on port 19111 has config generation 338110, wanted is 338115
INFO [14:01:01] --- metricsproxy-container on port 19092 has config generation 338110, wanted is 338115
INFO [14:01:05] Found endpoints:
INFO [14:01:05] - dev.aws-us-east-1c
INFO [14:01:05] |-- https://e8c40d27.ccc9bd09.z.vespa-app.cloud/ (cluster 'evaluation_container')
INFO [14:01:06] Deployment of new application complete!
Only region: aws-us-east-1c available in dev environment.
Found mtls endpoint for evaluation_container
URL: https://e8c40d27.ccc9bd09.z.vespa-app.cloud/
Application is up!
If the deployment failed, it is possible you forgot to add the key in the Vespa Cloud Console in the vespa auth api-key step above.
If you can authenticate, you should see lines like the following:
Deployment started in run 1 of dev-aws-us-east-1c for mytenant.hybridsearch.
The deployment takes a few minutes the first time while Vespa Cloud sets up the resources for your Vespa application
app now holds a reference to a Vespa instance. We can access the mTLS-protected endpoint name using the control plane (vespa_cloud) instance, and we can query and feed to this endpoint (data plane access) using the mTLS certificate generated in the previous steps.
See Authenticating to Vespa Cloud for details on using token authentication instead of mTLS.
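Before moving on, you can optionally verify data plane access with a trivial query (a minimal sketch; the index is still empty at this point, so we only check that the request succeeds):

from vespa.io import VespaQueryResponse

# Quick data-plane connectivity check against the mTLS endpoint.
response: VespaQueryResponse = app.query(
    body={"yql": "select * from sources * where true", "hits": 1}
)
print(response.is_successful())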
Getting your evaluation data
For evaluating information retrieval methods, in addition to the document corpus, we also need a set of queries and a mapping from queries to relevant documents.
For this guide, we will use the NanoMSMARCO dataset, made available on huggingface by Zeta Alpha.
This dataset is a subset of their 🍺NanoBEIR-collection, with 50 queries and up to 10K documents each.
This is a great dataset for testing and evaluating information retrieval methods quickly, as it is small and easy to work with.
Note that for almost any real-world use case, we recommend creating your own evaluation dataset. See the Vespa blog post on how an LLM can help you with this.
Creating 20-50 queries and annotating the relevant documents for each query is a good start and well worth the effort.
[6]:
from datasets import load_dataset
dataset_id = "zeta-alpha-ai/NanoMSMARCO"
dataset = load_dataset(dataset_id, "corpus", split="train", streaming=True)
vespa_feed = dataset.map(
lambda x: {
"id": x["_id"],
"fields": {"text": x["text"], "id": x["_id"]},
}
)
Note that since we are only evaluating ranking strategies here, we consider it OK to use the train split for evaluation. If we were to make changes to our ranking strategies, such as adding weighting terms or training ML models for ranking, we would suggest adopting a train/validation/test split approach to avoid overfitting.
[7]:
query_ds = load_dataset(dataset_id, "queries", split="train")
qrels = load_dataset(dataset_id, "qrels", split="train")
[8]:
ids_to_query = dict(zip(query_ds["_id"], query_ds["text"]))
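If you later want the train/validation/test approach mentioned above, a minimal sketch of splitting the query set could look like this (illustrative only; we do not use the split in the rest of this guide):

import random

# Split query ids into a validation set (for tuning) and a test set (for final numbers).
query_ids = list(ids_to_query.keys())
random.seed(42)
random.shuffle(query_ids)

split = len(query_ids) // 2
validation_ids, test_ids = query_ids[:split], query_ids[split:]
validation_queries = {qid: ids_to_query[qid] for qid in validation_ids}
test_queries = {qid: ids_to_query[qid] for qid in test_ids}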
Let us print a few of the queries:
[9]:
for idx, (qid, q) in enumerate(ids_to_query.items()):
print(f"qid: {qid}, query: {q}")
if idx == 5:
break
qid: 994479, query: which health care system provides all citizens or residents with equal access to health care services
qid: 1009388, query: what's right in health care
qid: 1088332, query: weather in oran
qid: 265729, query: how long keep financial records
qid: 1099433, query: how do hoa fees work
qid: 200600, query: heels or heal
[10]:
relevant_docs = dict(zip(qrels["query-id"], qrels["corpus-id"]))
Let us print a few of the query ids and their relevant documents:
[11]:
for idx, (qid, doc_id) in enumerate(relevant_docs.items()):
print(f"qid: {qid}, doc_id: {doc_id}")
if idx == 5:
break
qid: 994479, doc_id: 7275120
qid: 1009388, doc_id: 7248824
qid: 1088332, doc_id: 7094398
qid: 265729, doc_id: 7369987
qid: 1099433, doc_id: 7255675
qid: 200600, doc_id: 7929603
We can see that this dataset only has one relevant document per query. The VespaEvaluator
class handles this just fine, but you could also provide a set of relevant documents per query if there are multiple relevant docs.
# multiple relevant docs per query
qrels = {
"q1": {"doc1", "doc2"},
"q2": {"doc3", "doc4"},
# etc.
}
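For this dataset the single-doc mapping built above is all we need, but if you prefer the set form, it is a one-line conversion (equivalent here, since every query has exactly one relevant document):

relevant_docs_sets = {qid: {doc_id} for qid, doc_id in relevant_docs.items()}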
Now we can feed to Vespa using feed_iterable, which accepts any Iterable and an optional callback function where we can check the outcome of each operation. The application is configured with embedding functionality that produces a vector embedding from the text input field. This step may be resource-intensive, depending on the model size.
Read more about embedding inference in Vespa in the Accelerating Transformer-based Embedding Retrieval with Vespa blog post.
Default node resources in the Vespa Cloud dev zone are 2 vCPUs.
[ ]:
from vespa.io import VespaResponse
def callback(response: VespaResponse, id: str):
if not response.is_successful():
print(f"Error when feeding document {id}: {response.get_json()}")
app.feed_iterable(vespa_feed, schema="doc", namespace="tutorial", callback=callback)
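After feeding, you can do a quick sanity check on the document count with a match-all query (a minimal sketch; note that totalCount can be an estimate for large result sets):

# Count documents in the index after feeding.
count_response = app.query(
    body={"yql": f"select * from {schema_name} where true", "hits": 0}
)
print(count_response.get_json()["root"]["fields"]["totalCount"])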
VespaEvaluator
The VespaEvaluator class is a high-level API that allows you to evaluate a Vespa application using a set of queries and a mapping from queries to relevant documents. It is inspired by the SentenceTransformers InformationRetrievalEvaluator class (https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#informationretrievalevaluator).
The difference is that VespaEvaluator works on a retrieval and ranking system (a Vespa application) instead of a single model. Your application should be fed the document corpus in advance; the evaluator does not take the corpus as input.
Let us take a look at its API and documentation:
[14]:
from vespa.evaluation import VespaEvaluator
?VespaEvaluator
Init signature:
VespaEvaluator(
queries: 'Dict[str, str]',
relevant_docs: 'Union[Dict[str, Set[str]], Dict[str, str]]',
vespa_query_fn: 'Callable[[str, int], dict]',
app: 'Vespa',
name: 'str' = '',
accuracy_at_k: 'List[int]' = [1, 3, 5, 10],
precision_recall_at_k: 'List[int]' = [1, 3, 5, 10],
mrr_at_k: 'List[int]' = [10],
ndcg_at_k: 'List[int]' = [10],
map_at_k: 'List[int]' = [100],
write_csv: 'bool' = False,
csv_dir: 'Optional[str]' = None,
)
Docstring:
Evaluate retrieval performance on a Vespa application.
This class:
- Iterates over queries and issues them against your Vespa application.
- Retrieves top-k documents per query (with k = max of your IR metrics).
- Compares the retrieved documents with a set of relevant document ids.
- Computes IR metrics: Accuracy@k, Precision@k, Recall@k, MRR@k, NDCG@k, MAP@k.
- Logs vespa search times for each query.
- Logs/returns these metrics.
- Optionally writes out to CSV.
Example usage::
from vespa.application import Vespa
from vespa.evaluation import VespaEvaluator
queries = {
"q1": "What is the best GPU for gaming?",
"q2": "How to bake sourdough bread?",
# ...
}
relevant_docs = {
"q1": {"d12", "d99"},
"q2": {"d101"},
# ...
}
# relevant_docs can also be a dict of query_id => single relevant doc_id
# relevant_docs = {
# "q1": "d12",
# "q2": "d101",
# # ...
# }
def my_vespa_query_fn(query_text: str, top_k: int) -> dict:
return {
"yql": 'select * from sources * where userInput("' + query_text + '");',
"hits": top_k,
"ranking": "your_ranking_profile",
}
app = Vespa(url="http://localhost", port=8080)
evaluator = VespaEvaluator(
queries=queries,
relevant_docs=relevant_docs,
vespa_query_fn=my_vespa_query_fn,
app=app,
name="test-run",
accuracy_at_k=[1, 3, 5],
precision_recall_at_k=[1, 3, 5],
mrr_at_k=[10],
ndcg_at_k=[10],
map_at_k=[100],
write_csv=True
)
results = evaluator()
print("Primary metric:", evaluator.primary_metric)
print("All results:", results)
Init docstring:
:param queries: Dict of query_id => query text
:param relevant_docs: Dict of query_id => set of relevant doc_ids (the user-specified part of `id:<namespace>:<document-type>:<key/value-pair>:<user-specified>` in Vespa, see https://docs.vespa.ai/en/documents.html#document-ids)
:param vespa_query_fn: Callable, with signature: my_func(query:str, top_k: int)-> dict: Given a query string and top_k, returns a Vespa query body (dict).
:param app: A `vespa.application.Vespa` instance.
:param name: A name or tag for this evaluation run.
:param accuracy_at_k: list of k-values for Accuracy@k
:param precision_recall_at_k: list of k-values for Precision@k and Recall@k
:param mrr_at_k: list of k-values for MRR@k
:param ndcg_at_k: list of k-values for NDCG@k
:param map_at_k: list of k-values for MAP@k
:param write_csv: If True, writes results to CSV
:param csv_dir: Path in which to write the CSV file (default: current working dir).
File: ~/Repos/pyvespa/vespa/evaluation.py
Type: type
Subclasses:
We have now created the app, the queries, and the relevant documents. The only thing missing before we can initialize the VespaEvaluator is the set of ranking strategies we want to evaluate. Each strategy is passed as a vespa_query_fn.
We will use the vespa.querybuilder
module to create the queries. See reference doc and example notebook for more details on usage.
This module is a Python wrapper around the Vespa Query Language (YQL), which is an alternative to providing the YQL query as a string directly.
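As a quick illustration of the builder (a small sketch; the exact rendered string may differ slightly between pyvespa versions):

import vespa.querybuilder as qb

# The builder renders to a plain YQL string via str().
example = qb.select("*").from_(schema_name).where(qb.userQuery("how to bake a cake"))
print(str(example))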
[15]:
import vespa.querybuilder as qb
def semantic_query_fn(query_text: str, top_k: int) -> dict:
return {
"yql": str(
qb.select("*")
.from_(schema_name)
.where(
qb.nearestNeighbor(
field="embedding",
query_vector="q",
annotations={"targetHits": 1000},
)
)
),
"query": query_text,
"ranking": "semantic",
"input.query(q)": f"embed({query_text})",
"hits": top_k,
}
def bm25_query_fn(query_text: str, top_k: int) -> dict:
return {
"yql": "select * from sources * where userQuery();", # provide the yql directly as a string
"query": query_text,
"ranking": "bm25",
"hits": top_k,
}
def fusion_query_fn(query_text: str, top_k: int) -> dict:
return {
"yql": str(
qb.select("*")
.from_(schema_name)
.where(
qb.nearestNeighbor(
field="embedding",
query_vector="q",
annotations={"targetHits": 1000},
)
| qb.userQuery(query_text)
)
),
"query": query_text,
"ranking": "fusion",
"input.query(q)": f"embed({query_text})",
"hits": top_k,
}
def atan_norm_query_fn(query_text: str, top_k: int) -> dict:
return {
"yql": str(
qb.select("*")
.from_(schema_name)
.where(
qb.nearestNeighbor(
field="embedding",
query_vector="q",
annotations={"targetHits": 1000},
)
| qb.userQuery(query_text)
)
),
"query": query_text,
"ranking": "atan_norm",
"input.query(q)": f"embed({query_text})",
"hits": top_k,
}
Run a test query
Great, now we have deployed the application and fed the data. Let us run a test query to see if everything is working as expected.
[16]:
from vespa.io import VespaQueryResponse
response: VespaQueryResponse = app.query(
body=atan_norm_query_fn("how to bake a cake", 3)
)
[17]:
response.get_json()
[17]:
{'root': {'id': 'toplevel',
'relevance': 1.0,
'fields': {'totalCount': 1911},
'coverage': {'coverage': 100,
'documents': 5043,
'full': True,
'nodes': 1,
'results': 1,
'resultsFull': 1},
'children': [{'id': 'id:tutorial:doc::1365411',
'relevance': 1.9456371356183069,
'source': 'evaluation_content',
'fields': {'matchfeatures': {'cos_sim': 0.6327433624922647,
'normalized_bm25': 0.9258703382800164},
'sddocname': 'doc',
'text': 'Cooked. 1 Beef: 2 <hi>to</hi> 3 months. 2 Breads and <hi>cakes</hi>: 3 months. 3 Casseroles: 3 months. Chicken pieces: 4 1 months. Hard sausage (pepperoni): 1 <hi>to</hi> 2 months. Vegetable or meat soups and stews: 2 <hi>to</hi> 3 months.',
'documentid': 'id:tutorial:doc::1365411',
'id': '1365411'}},
{'id': 'id:tutorial:doc::5144958',
'relevance': 1.9291597227244408,
'source': 'evaluation_content',
'fields': {'matchfeatures': {'cos_sim': 0.6306017579465462,
'normalized_bm25': 0.9296623243800929},
'sddocname': 'doc',
'text': 'Arrange 2 pounds of Italian sausages over the mixture, browned first if desired. <hi>To</hi> brown, use <hi>a</hi> heavy skillet over high heat and cook the sausages, turning often, for about eight minutes. After arranging the sausages over the vegetable mixture, put the <hi>baking</hi> dish in an oven at 350 F and allow <hi>to</hi> <hi>bake</hi> for 45 minutes.',
'documentid': 'id:tutorial:doc::5144958',
'id': '5144958'}},
{'id': 'id:tutorial:doc::8204644',
'relevance': 1.8018643812627477,
'source': 'evaluation_content',
'fields': {'matchfeatures': {'cos_sim': 0.615268931172449,
'normalized_bm25': 0.9522807125125186},
'sddocname': 'doc',
'text': 'pl. bis·cuits. 1 <hi>A</hi> small <hi>cake</hi> of shortened bread leavened with <hi>baking</hi> powder or soda. 2 Chiefly British <hi>a</hi>. <hi>A</hi> thin, crisp cracker. b. 3 <hi>A</hi> hard, dry cracker given <hi>to</hi> dogs as <hi>a</hi> treat or dietary supplement. 4 <hi>A</hi> thin, often oblong, waferlike piece of wood, glued into slots <hi>to</hi> connect larger pieces of wood in <hi>a</hi> joint. 5 <hi>A</hi> pale brown.',
'documentid': 'id:tutorial:doc::8204644',
'id': '8204644'}}]}}
[20]:
all_results = {}
for evaluator_name, query_fn in [
("semantic", semantic_query_fn),
("bm25", bm25_query_fn),
("fusion", fusion_query_fn),
("atan_norm", atan_norm_query_fn),
]:
print(f"Evaluating {evaluator_name}...")
evaluator = VespaEvaluator(
queries=ids_to_query,
relevant_docs=relevant_docs,
vespa_query_fn=query_fn,
app=app,
name=evaluator_name,
write_csv=True, # optionally write metrics to CSV
)
results = evaluator.run()
all_results[evaluator_name] = results
Evaluating semantic...
Evaluating bm25...
/Users/thomas/.local/share/uv/python/cpython-3.10.14-macos-aarch64-none/lib/python3.10/json/decoder.py:353: RuntimeWarning: coroutine 'Vespa.feed_async_iterable.<locals>.run' was never awaited
obj, end = self.scan_once(s, idx)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Evaluating fusion...
Evaluating atan_norm...
Looking at the results
[22]:
import pandas as pd
results = pd.DataFrame(all_results)
[23]:
# take out all rows with "searchtime" to a separate dataframe
searchtime = results[results.index.str.contains("searchtime")]
results = results[~results.index.str.contains("searchtime")]
# Highlight the maximum value in each row
def highlight_max(s):
is_max = s == s.max()
return ["background-color: lightgreen; color: black;" if v else "" for v in is_max]
# Style the DataFrame: Highlight max values and format numbers to 4 decimals
styled_df = results.style.apply(highlight_max, axis=1).format("{:.4f}")
styled_df
[23]:
|              | semantic | bm25   | fusion | atan_norm |
|--------------|----------|--------|--------|-----------|
| accuracy@1   | 0.3800   | 0.3000 | 0.4400 | 0.4400    |
| accuracy@3   | 0.6400   | 0.6000 | 0.6800 | 0.7000    |
| accuracy@5   | 0.7200   | 0.6600 | 0.7200 | 0.7400    |
| accuracy@10  | 0.8200   | 0.7400 | 0.8000 | 0.8400    |
| precision@1  | 0.3800   | 0.3000 | 0.4400 | 0.4400    |
| recall@1     | 0.3800   | 0.3000 | 0.4400 | 0.4400    |
| precision@3  | 0.2133   | 0.2000 | 0.2267 | 0.2333    |
| recall@3     | 0.6400   | 0.6000 | 0.6800 | 0.7000    |
| precision@5  | 0.1440   | 0.1320 | 0.1440 | 0.1480    |
| recall@5     | 0.7200   | 0.6600 | 0.7200 | 0.7400    |
| precision@10 | 0.0820   | 0.0740 | 0.0800 | 0.0840    |
| recall@10    | 0.8200   | 0.7400 | 0.8000 | 0.8400    |
| mrr@10       | 0.5309   | 0.4499 | 0.5532 | 0.5776    |
| ndcg@10      | 0.6007   | 0.5204 | 0.6129 | 0.6409    |
| map@100      | 0.5393   | 0.4592 | 0.5634 | 0.5853    |
We can see that for this particular dataset, the hybrid strategy atan_norm
is the best across all metrics.
[24]:
results.plot(kind="bar", figsize=(12, 6))
[24]:
<Axes: >

Looking at searchtimes
Ranking quality is not the only thing that matters. For many applications, search time is equally important.
[25]:
# plot search time, add (ms) to the y-axis
# convert to ms
searchtime = searchtime * 1000
searchtime.plot(kind="bar", figsize=(12, 6)).set(ylabel="time (ms)")
[25]:
[Text(0, 0.5, 'time (ms)')]

We can see that both hybrid strategies, fusion and atan_norm, are a bit slower on average than pure bm25 or semantic, as expected.
Depending on the latency budget of your application, this is likely still an attractive trade-off.
Conclusion and next steps
We have shown how you can evaluate a Vespa application using pyvespa’s VespaEvaluator class. We have defined and compared 4 different ranking strategies in terms of both ranking quality and search time.
We hope this can provide you with a good starting point for evaluating your own Vespa application.
If you are ready to go further, you can try to optimize the ranking strategies, for example by weighting the terms in the atan_norm strategy differently (a * normalize_linear(normalized_bm25) + (1-a) * normalize_linear(cos_sim)), or by adding a cross-encoder for re-ranking the top-k results.
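As a starting point for the weighted variant, here is a minimal sketch of how such a rank profile could look (illustrative only; the weighted_norm profile name and the query(alpha) input are our own, and you would add it to the schema above and redeploy before evaluating it):

from vespa.package import Function, GlobalPhaseRanking, RankProfile

weighted_norm = RankProfile(
    name="weighted_norm",  # hypothetical profile name
    inherits="bm25",
    inputs=[
        ("query(q)", "tensor<float>(x[384])"),
        ("query(alpha)", "double"),  # weight between BM25 and semantic, set at query time
    ],
    functions=[
        Function(name="scale", args=["val"], expression="2*atan(val)/(3.14159)"),
        Function(name="normalized_bm25", expression="scale(bm25(text))"),
        Function(name="cos_sim", expression="closeness(field, embedding)"),
    ],
    first_phase="normalized_bm25",
    global_phase=GlobalPhaseRanking(
        expression="query(alpha) * normalize_linear(normalized_bm25) + (1 - query(alpha)) * normalize_linear(cos_sim)",
        rerank_count=1000,
    ),
    match_features=["cos_sim", "normalized_bm25"],
)

At query time you would then pass "ranking": "weighted_norm" and, for example, "input.query(alpha)": 0.7 in the query body, and rerun the evaluation loop above for a few values of alpha, ideally on a validation split.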
Cleanup
[ ]:
vespa_cloud.delete()
Deactivated vespa-team.evaluation in dev.aws-us-east-1c
Deleted instance vespa-team.evaluation.default