Advanced Configuration
Vespa supports a wide range of configuration options to customize the behavior of the system through the services.xml file. Until pyvespa version 0.50.0, only a limited subset of these configurations was available in pyvespa.
Now, we have added support for passing a ServicesConfiguration object to your ApplicationPackage that allows you to define any configuration you want. This notebook demonstrates how to use this new feature if you need more advanced configurations.
Note that providing a ServicesConfiguration object is optional; if none is passed, the default configuration will still be created for you.
There are some slight differences in which configuration options are available when running self-hosted (Docker) and when running on the cloud (Vespa Cloud). For details, see the Vespa Cloud services.xml reference. This notebook demonstrates how to use the ServicesConfiguration object to configure a Vespa application for some common use cases, with options that are available in both environments.
Refer to troubleshooting for any problem when running this guide.
Install pyvespa and start Docker Daemon, validate minimum 6G available:
[1]:
!pip3 install pyvespa
!docker info | grep "Total Memory"
Example 1 - Configure document-expiry
As an example of a common use case for advanced configuration, we will configure document-expiry. This feature allows you to set a time-to-live for documents in your Vespa application. This is useful when you have documents that are only relevant for a certain period of time, and you want to avoid serving stale data.
For reference, see the docs on document-expiry.
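To make the time-to-live idea concrete, here is a minimal plain-Python sketch of the comparison that a selection of the form timestamp > now() - ttl performs (this is the form used later in this notebook, with a TTL of 86400 seconds). The is_retained helper is hypothetical and only mimics the comparison; the actual garbage collection runs inside the Vespa content nodes.

```python
import time

ONE_DAY_SECONDS = 86400  # time-to-live used in this notebook


def is_retained(timestamp: int, now: int, ttl: int = ONE_DAY_SECONDS) -> bool:
    """Mimic the selection `timestamp > now() - ttl` (hypothetical helper)."""
    return timestamp > now - ttl


now = int(time.time())
print(is_retained(now, now))          # a fresh document matches the selection
print(is_retained(now - 86401, now))  # just past the TTL, no longer matches
```

Documents that no longer match the selection are the ones eligible for removal when garbage collection is enabled.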
Define a schema
We define a simple schema, with a timestamp field that we will use in the document selection expression to set the document-expiry.
Note that the fields that are referenced in the selection expression should be attributes (in-memory).
Also, either the fields should be set with fast-access, or the number of searchable copies in the content cluster should be the same as the redundancy. Otherwise, the document selection maintenance will be slow and have a major performance impact on the system.
[2]:
from vespa.package import Document, Field, Schema, ApplicationPackage

application_name = "music"
music_schema = Schema(
    name=application_name,
    document=Document(
        fields=[
            Field(
                name="artist",
                type="string",
                indexing=["attribute", "summary"],
            ),
            Field(
                name="title",
                type="string",
                indexing=["attribute", "summary"],
            ),
            Field(
                name="timestamp",
                type="long",
                indexing=["attribute", "summary"],
                attribute=["fast-access"],
            ),
        ]
    ),
)
The ServicesConfiguration object
The ServicesConfiguration object allows you to define any configuration you want in the services.xml file.
The syntax is as follows:
[3]:
from vespa.package import ServicesConfiguration
from vespa.configuration.services import (
    services,
    container,
    search,
    document_api,
    document_processing,
    content,
    redundancy,
    documents,
    document,
    node,
    nodes,
)

# Create a ServicesConfiguration with document-expiry set to 1 day (timestamp > now() - 86400)
services_config = ServicesConfiguration(
    application_name=application_name,
    services_config=services(
        container(
            search(),
            document_api(),
            document_processing(),
            id=f"{application_name}_container",
            version="1.0",
        ),
        content(
            redundancy("1"),
            documents(
                document(
                    type=application_name,
                    mode="index",
                    # Note that the selection-expression does not need to be escaped, as it will be automatically escaped during xml-serialization
                    selection="music.timestamp > now() - 86400",
                ),
                garbage_collection="true",
            ),
            nodes(node(distribution_key="0", hostalias="node1")),
            id=f"{application_name}_content",
            version="1.0",
        ),
    ),
)
application_package = ApplicationPackage(
    name=application_name,
    schema=[music_schema],
    services_config=services_config,
)
There are some useful gotchas to keep in mind when constructing the ServicesConfiguration object.
First, let’s establish a common vocabulary through an example. Consider the following services.xml file, which is what we are actually representing with the ServicesConfiguration object from the previous cell:
<?xml version="1.0" encoding="UTF-8" ?>
<services>
    <container id="music_container" version="1.0">
        <search></search>
        <document-api></document-api>
        <document-processing></document-processing>
    </container>
    <content id="music_content" version="1.0">
        <redundancy>1</redundancy>
        <documents garbage-collection="true">
            <document type="music" mode="index" selection="music.timestamp &gt; now() - 86400"></document>
        </documents>
        <nodes>
            <node distribution-key="0" hostalias="node1"></node>
        </nodes>
    </content>
</services>
In this example, services, container, search, document-api, document-processing, content, redundancy, documents, document, nodes, and node are tags. The id, version, type, mode, selection, distribution-key, hostalias, and garbage-collection are attributes, each with a corresponding value.
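To see this vocabulary in action, the sketch below parses the same services.xml content with Python's standard-library xml.etree.ElementTree and collects the tags and attributes. This is only an illustration of the XML structure; it does not use pyvespa and is not how pyvespa builds the file.

```python
import xml.etree.ElementTree as ET

SERVICES_XML = """
<services>
  <container id="music_container" version="1.0">
    <search></search>
    <document-api></document-api>
    <document-processing></document-processing>
  </container>
  <content id="music_content" version="1.0">
    <redundancy>1</redundancy>
    <documents garbage-collection="true">
      <document type="music" mode="index" selection="music.timestamp &gt; now() - 86400"></document>
    </documents>
    <nodes>
      <node distribution-key="0" hostalias="node1"></node>
    </nodes>
  </content>
</services>
"""

root = ET.fromstring(SERVICES_XML)
tags = []        # unique tag names, in document order
attributes = {}  # (tag, attribute name) -> value
for element in root.iter():
    if element.tag not in tags:
        tags.append(element.tag)
    for name, value in element.attrib.items():
        attributes[(element.tag, name)] = value

print(tags)
# The parser unescapes &gt; back to > when reading the attribute value.
print(attributes[("document", "selection")])
```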
Tag names
All tags as referenced in the Vespa documentation are available in the vespa.configuration.services module, with the following modifications:
- All - in the tag names are replaced by _ to avoid conflicts with Python syntax.
- Some tags that are Python reserved words (or commonly used objects) are constructed by adding a _ at the end of the tag name. These are: type_, class_, for_, time_, io_.
Only valid tags are exported by the vespa.configuration.services module.
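The naming rules can be summarized in a few lines of Python. The to_python_name helper below is hypothetical (it is not part of pyvespa) and simply applies the two rules stated in this guide: replace - with _, and append _ to the names listed as exceptions.

```python
# Names that this guide lists as getting a trailing underscore.
SUFFIXED_NAMES = {"type", "class", "for", "time", "io"}


def to_python_name(xml_name: str) -> str:
    """Hypothetical helper: map a services.xml name to its Python spelling."""
    name = xml_name.replace("-", "_")
    if name in SUFFIXED_NAMES:
        name += "_"
    return name


for xml_name in ["document-api", "document-processing", "type", "redundancy"]:
    print(f"{xml_name} -> {to_python_name(xml_name)}")
```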
Attributes
- Any attribute can be passed to the tag constructor (no validation at construction time).
- The attribute name should be the same as in the Vespa documentation, but with - replaced by _. For example, the garbage-collection attribute in the documents tag should be passed as garbage_collection.
- In case the attribute name is a Python reserved word, the same rule as for the tag names applies (add _ at the end). An example of this is the global attribute, which should be passed as global_.
- Some attributes, such as id in the container tag, are mandatory in services.xml and must always be supplied. The examples in this notebook pass them as keyword arguments, e.g. container(id="music_container", version="1.0").
Values
- The value of an attribute can be a string, an integer, or a boolean. For types bool and int, the value is converted to a string (lowercased for bool). If you need to pass a float, you should convert it to a string before passing it to the tag constructor, e.g. container(version="1.0").
- Note that you should not escape the values yourself; they are escaped automatically during XML serialization. In the xml file, the value of the selection attribute in the document tag is music.timestamp &gt; now() - 86400. (&gt; is the escaped form of >.) When passing this value to the document tag constructor in Python, we should not escape the > character, i.e. document(selection="music.timestamp > now() - 86400").
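The value rules above can be sketched with the standard library. The to_attribute_value helper is hypothetical and only mirrors the conversions described in this guide (raising on other types is a choice made for this sketch, not documented pyvespa behavior). The escape function from xml.sax.saxutils shows what the escaped form of the selection looks like.

```python
from xml.sax.saxutils import escape


def to_attribute_value(value) -> str:
    """Hypothetical helper mirroring the conversion rules described above."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return str(value).lower()
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        return value
    raise TypeError(f"Convert {type(value).__name__} to str before passing it")


print(to_attribute_value(True))  # bool is lowercased
print(to_attribute_value(4))     # int becomes a string
print(escape("music.timestamp > now() - 86400"))  # > becomes &gt;
```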
Deploy the Vespa application
Deploy the application package on the local machine using Docker, without leaving the notebook, by creating an instance of VespaDocker. VespaDocker connects to the local Docker daemon socket and starts the Vespa docker image.
If this step fails, please check that the Docker daemon is running, and that the Docker daemon socket can be used by clients (Configurable under advanced settings in Docker Desktop).
[4]:
from vespa.deployment import VespaDocker
vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=application_package)
Waiting for configuration server, 0/60 seconds...
Waiting for application to come up, 0/300 seconds.
Application is up!
Finished deployment.
app now holds a reference to a Vespa instance. See this notebook for details on authenticating to Vespa Cloud.
Feeding documents to Vespa
Now, let us feed some documents to Vespa. We will feed one document with a timestamp of 24 hours and 1 second (86401 seconds) ago, and another document with a timestamp of the current time. We will then visit the documents to verify that the document-expiry is working as expected.
[5]:
import time
docs_to_feed = [
{
"id": "1",
"fields": {
"artist": "Snoop Dogg",
"title": "Gin and Juice",
"timestamp": int(time.time()) - 86401,
},
},
{
"id": "2",
"fields": {
"artist": "Dr.Dre",
"title": "Still D.R.E",
"timestamp": int(time.time()),
},
},
]
[6]:
from vespa.io import VespaResponse

def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(f"Error when feeding document {id}: {response.get_json()}")


app.feed_iterable(docs_to_feed, schema=application_name, callback=callback)
Verify document expiry through visiting
Visiting is a feature to efficiently get or process a set of documents, identified by a document selection expression. Here is how you can use visiting in pyvespa:
[7]:
visit_results = []
for slice_ in app.visit(
    schema=application_name,
    content_cluster_name=f"{application_name}_content",
    timeout="5s",
):
    for response in slice_:
        visit_results.append(response.json)
visit_results
[7]:
[{'pathId': '/document/v1/music/music/docid/',
'documents': [{'id': 'id:music:music::2',
'fields': {'artist': 'Dr.Dre',
'title': 'Still D.R.E',
'timestamp': 1727428957}}],
'documentCount': 1}]
We can see that the document with the timestamp of just over 24 hours ago is not returned by the visit, while the document with the current timestamp is returned.
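If you prefer to check this programmatically rather than by eye, a small sketch like the following collects the remaining document IDs. It uses a hard-coded sample shaped like the visit output above (the name sample_visit_results is introduced here for illustration), so it runs without a Vespa instance.

```python
# Sample shaped like the visit output shown above.
sample_visit_results = [
    {
        "pathId": "/document/v1/music/music/docid/",
        "documents": [
            {
                "id": "id:music:music::2",
                "fields": {
                    "artist": "Dr.Dre",
                    "title": "Still D.R.E",
                    "timestamp": 1727428957,
                },
            }
        ],
        "documentCount": 1,
    }
]

# Flatten all visited slices into a list of document IDs.
remaining_ids = [
    doc["id"]
    for result in sample_visit_results
    for doc in result.get("documents", [])
]
print(remaining_ids)
```

In the notebook, the same comprehension can be applied to visit_results directly.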
Clean up
[8]:
vespa_docker.container.stop()
vespa_docker.container.remove()
Example 2 - Configuring requestthreads persearch
In Vespa, there are several configuration options that might be tuned to optimize the serving latency of your application. For an overview, see the Vespa documentation - Vespa Serving Scaling Guide. An example of a configuration that one might want to tune is the requestthreads persearch parameter. This parameter controls the number of search threads that are used to handle each search on the content nodes. The default value is 1.
For some applications, where a significant portion of the work per query is linear with the number of documents, increasing the number of requestthreads persearch can improve the serving latency, as it allows more parallelism in the search phase.
Examples of potentially expensive work that scales linearly with the number of documents, and thus is likely to benefit from increasing requestthreads persearch, are:
- XGBoost inference with a large GBDT model.
- ONNX inference, e.g. with a cross-encoder.
- MaxSim operations for late interaction scoring, as in ColBERT and ColPali.
- Exact nearest neighbor search.
Examples of query operators that are less likely to benefit from increasing requestthreads persearch are:
- wand/weakAnd, see Using wand with Vespa.
- Approximate nearest neighbor search with HNSW.
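A rough way to reason about this is an idealized model in which the query latency is a fixed overhead plus per-document work, and only the per-document work is divided across threads. The function and all numbers below are made up for illustration; real systems also pay for thread coordination and result merging, so treat this as a sketch rather than a prediction.

```python
def estimated_latency_ms(
    n_docs: int,
    per_doc_cost_ms: float,
    threads: int,
    fixed_overhead_ms: float = 0.0,
) -> float:
    """Idealized model: only the per-document work is split across threads."""
    return fixed_overhead_ms + (n_docs * per_doc_cost_ms) / threads


for threads in (1, 2, 4):
    latency = estimated_latency_ms(
        n_docs=1000, per_doc_cost_ms=1.0, threads=threads, fixed_overhead_ms=50.0
    )
    print(f"{threads} thread(s): ~{latency:.0f} ms")
```

The model also suggests why operators that touch few documents gain little: when the linear term is small, the fixed overhead dominates regardless of the thread count.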
In this example, we will configure requestthreads persearch to 4 for an application where a cross-encoder is used in first-phase ranking. The demo is based on the Cross-encoders for global reranking guide, but here we will use a cross-encoder in first-phase instead of global-phase. First-phase and second-phase ranking are executed on the content nodes, while global-phase ranking is executed on the container node. See Phased ranking for more details.
Download the crossencoder-model
[9]:
from pathlib import Path
import requests
from vespa.deployment import VespaDocker
# Download the model if it doesn't exist
url = "https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1/resolve/main/onnx/model.onnx"
local_model_path = "model/model.onnx"
if not Path(local_model_path).exists():
    print("Downloading the mxbai-rerank model...")
    r = requests.get(url)
    r.raise_for_status()  # Fail early instead of saving an error page as the model
    Path(local_model_path).parent.mkdir(parents=True, exist_ok=True)
    with open(local_model_path, "wb") as f:
        f.write(r.content)
    print(f"Downloaded model to {local_model_path}")
else:
    print("Model already exists, skipping download.")
Model already exists, skipping download.
Define a schema
[10]:
from vespa.package import (
    OnnxModel,
    RankProfile,
    Schema,
    ApplicationPackage,
    Field,
    FieldSet,
    Function,
    FirstPhaseRanking,
    Document,
)

application_name = "requestthreads"
# Define the reranking, as we will use it for two different rank profiles
reranking = FirstPhaseRanking(
    keep_rank_count=8,
    expression="sigmoid(onnx(crossencoder).logits{d0:0,d1:0})",
)
# Define the schema
schema = Schema(
    name="doc",
    document=Document(
        fields=[
            Field(name="id", type="string", indexing=["summary", "attribute"]),
            Field(
                name="text",
                type="string",
                indexing=["index", "summary"],
                index="enable-bm25",
            ),
            Field(
                name="body_tokens",
                type="tensor<float>(d0[512])",
                indexing=[
                    "input text",
                    "embed tokenizer",
                    "attribute",
                    "summary",
                ],
                is_document_field=False,  # Indicates a synthetic field
            ),
        ],
    ),
    fieldsets=[FieldSet(name="default", fields=["text"])],
    models=[
        OnnxModel(
            model_name="crossencoder",
            model_file_path=f"{local_model_path}",
            inputs={
                "input_ids": "input_ids",
                "attention_mask": "attention_mask",
            },
            outputs={"logits": "logits"},
        )
    ],
    rank_profiles=[
        RankProfile(name="bm25", first_phase="bm25(text)"),
        RankProfile(
            name="reranking",
            inherits="default",
            inputs=[("query(q)", "tensor<float>(d0[64])")],
            functions=[
                Function(
                    name="input_ids",
                    expression="customTokenInputIds(1, 2, 512, query(q), attribute(body_tokens))",
                ),
                Function(
                    name="attention_mask",
                    expression="tokenAttentionMask(512, query(q), attribute(body_tokens))",
                ),
            ],
            first_phase=reranking,
            summary_features=[
                "query(q)",
                "input_ids",
                "attention_mask",
                "onnx(crossencoder).logits",
            ],
        ),
        RankProfile(
            name="one-thread-profile",
            first_phase=reranking,
            inherits="reranking",
            num_threads_per_search=1,
        ),
    ],
)
Define the ServicesConfiguration
Note that the ServicesConfiguration may be used to define any configuration in the services.xml file. In this example, we are only configuring the requestthreads persearch parameter, but you can use the same approach to configure any other parameter.
For a full reference of the available configuration options, see the Vespa documentation - services.xml.
[11]:
from vespa.configuration.services import *
from vespa.package import ServicesConfiguration

# Define services configuration with persearch threads set to 4
services_config = ServicesConfiguration(
    application_name=f"{application_name}",
    services_config=services(
        container(id=f"{application_name}_default", version="1.0")(
            component(
                model(
                    url="https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1/raw/main/tokenizer.json"
                ),
                id="tokenizer",
                type="hugging-face-tokenizer",
            ),
            document_api(),
            search(),
        ),
        content(id=f"{application_name}", version="1.0")(
            min_redundancy("1"),
            documents(document(type="doc", mode="index")),
            engine(
                proton(
                    tuning(
                        searchnode(requestthreads(persearch("4"))),
                    ),
                ),
            ),
        ),
        version="1.0",
        minimum_required_vespa_version="8.311.28",
    ),
)
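For orientation, the nested engine(...) call above corresponds to a chain of nested XML elements. The sketch below rebuilds just that fragment with the standard-library xml.etree.ElementTree so you can see its shape; it is an illustration only and not how pyvespa serializes the configuration.

```python
import xml.etree.ElementTree as ET

# Build the nested tuning element by hand to show the XML it corresponds to.
engine = ET.Element("engine")
parent = engine
for tag in ("proton", "tuning", "searchnode", "requestthreads"):
    parent = ET.SubElement(parent, tag)
persearch = ET.SubElement(parent, "persearch")
persearch.text = "4"

ET.indent(engine)  # pretty-print; requires Python 3.9+
print(ET.tostring(engine, encoding="unicode"))
```

The innermost element, persearch, carries the thread count as its text content, which is why the Python call passes "4" as a positional child rather than as an attribute.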
Now, we are ready to deploy our application package with the defined ServicesConfiguration.
Deploy the application package
[12]:
app_package = ApplicationPackage(
name=f"{application_name}",
schema=[schema],
services_config=services_config,
)
[13]:
app_package.to_files("deleteme")
[14]:
vespa_docker = VespaDocker(port=8089)
app = vespa_docker.deploy(application_package=app_package)
Waiting for configuration server, 0/60 seconds...
Waiting for application to come up, 0/300 seconds.
Waiting for application to come up, 5/300 seconds.
Waiting for application to come up, 10/300 seconds.
Waiting for application to come up, 15/300 seconds.
Waiting for application to come up, 20/300 seconds.
Waiting for application to come up, 25/300 seconds.
Application is up!
Finished deployment.
Feed some sample documents
[15]:
sample_docs = [
{"id": i, "fields": {"text": text}}
for i, text in enumerate(
[
"'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature. The novel 'Moby-Dick' was written by Herman Melville and first published in 1851. Harper Lee, an American novelist widely known for her novel 'To Kill a Mockingbird'. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
"was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961. Jane Austen was an English novelist known primarily for her six major novels, ",
"which interpret, critique and comment upon the British landed gentry at the end of the 18th century. The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, ",
"is among the most popular and critically acclaimed books of the modern era. 'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan.",
]
)
]
app.feed_iterable(sample_docs, schema="doc")
# Define the query body
query_body = {
"yql": "select * from sources * where userQuery();",
"query": "who wrote to kill a mockingbird?",
"timeout": "10s",
"input.query(q)": "embed(tokenizer, @query)",
"presentation.timing": "true",
}
# Warm-up query
app.query(body=query_body)
query_body_reranking = {
**query_body,
"ranking.profile": "reranking",
}
# Query with default persearch threads (set to 4)
with app.syncio() as sess:
    response_default = sess.query(body=query_body_reranking)
# Query with num-threads-per-search overridden to 1
query_body_one_thread = {
    **query_body,
    "ranking.profile": "one-thread-profile",
    # "ranking.matching.numThreadsPerSearch": 1,  # Could potentially also set numThreadsPerSearch in query parameters.
}
with app.syncio() as sess:
    response_one_thread = sess.query(body=query_body_one_thread)
# Extract query times
timing_default = response_default.json["timing"]["querytime"]
timing_one_thread = response_one_thread.json["timing"]["querytime"]
# Print the query times for both settings and their ratio
print(f"Query time with 4 threads: {timing_default:.2f}s")
print(f"Query time with 1 thread: {timing_one_thread:.2f}s")
ratio = timing_one_thread / timing_default
print(f"4 threads is approximately {ratio:.2f}x faster than 1 thread")
Query time with 4 threads: 0.72s
Query time with 1 thread: 1.24s
4 threads is approximately 1.72x faster than 1 thread
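To put the measured numbers in perspective, the sketch below computes the speedup and the parallel efficiency (speedup divided by the number of threads) from the rounded timings printed above. The variable names are introduced here for illustration. An efficiency well below 100% is expected when part of the query time is fixed overhead that does not parallelize; with a corpus of only four documents, that is one plausible explanation here.

```python
# Rounded timings printed above, in seconds.
time_four_threads = 0.72
time_one_thread = 1.24
threads = 4

speedup = time_one_thread / time_four_threads
efficiency = speedup / threads  # 1.0 would mean perfectly linear scaling

print(f"Speedup: {speedup:.2f}x")
print(f"Parallel efficiency: {efficiency:.0%}")
```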
Cleanup
[16]:
vespa_docker.container.stop()
vespa_docker.container.remove()
Next steps
This is just an introduction to the advanced configuration options available in Vespa. For more details, see the Vespa documentation.