
Advanced Configuration

Vespa supports a wide range of configuration options for customizing the behavior of the system through the services.xml file. Until pyvespa version 0.50.0, only a limited subset of these options was available in pyvespa.

Now, we have added support for passing a ServicesConfiguration object to your ApplicationPackage, which allows you to define any configuration you want. This notebook demonstrates how to use this feature when you need more advanced configuration.

Note that providing a ServicesConfiguration object is not required. If none is passed, the default configuration will still be created for you.

There are some slight differences in which configuration options are available when running self-hosted (Docker) and when running on the cloud (Vespa Cloud). For details, see the Vespa Cloud services.xml reference. This notebook demonstrates how to use the ServicesConfiguration object to configure a Vespa application for some common use cases, with options that are available in both environments.

Refer to troubleshooting for any problem when running this guide.

Install pyvespa and start the Docker daemon. Validate that at least 6 GB of memory is available:

[1]:
!pip3 install pyvespa
!docker info | grep "Total Memory"

Example 1 - Configure document-expiry

As an example of a common use case for advanced configuration, we will configure document-expiry. This feature allows you to set a time-to-live for documents in your Vespa application. This is useful when you have documents that are only relevant for a certain period of time, and you want to avoid serving stale data.

For reference, see the docs on document-expiry.

Define a schema

We define a simple schema, with a timestamp field that we will use in the document selection expression to set the document-expiry.

Note that the fields referenced in the selection expression should be attributes (in-memory).

Also, either the fields should be set with fast-access or the number of searchable copies in the content cluster should be the same as the redundancy. Otherwise, the document selection maintenance will be slow and have a major performance impact on the system.

[2]:
from vespa.package import Document, Field, Schema, ApplicationPackage

application_name = "music"
music_schema = Schema(
    name=application_name,
    document=Document(
        fields=[
            Field(
                name="artist",
                type="string",
                indexing=["attribute", "summary"],
            ),
            Field(
                name="title",
                type="string",
                indexing=["attribute", "summary"],
            ),
            Field(
                name="timestamp",
                type="long",
                indexing=["attribute", "summary"],
                attribute=["fast-access"],
            ),
        ]
    ),
)

The ServicesConfiguration object

The ServicesConfiguration object allows you to define any configuration you want in the services.xml file.

The syntax is as follows:

[3]:
from vespa.package import ServicesConfiguration
from vespa.configuration.services import (
    services,
    container,
    search,
    document_api,
    document_processing,
    content,
    redundancy,
    documents,
    document,
    node,
    nodes,
)

# Create a ServicesConfiguration with document-expiry set to 1 day (timestamp > now() - 86400)
services_config = ServicesConfiguration(
    application_name=application_name,
    services_config=services(
        container(
            search(),
            document_api(),
            document_processing(),
            id=f"{application_name}_container",
            version="1.0",
        ),
        content(
            redundancy("1"),
            documents(
                document(
                    type=application_name,
                    mode="index",
                    # Note that the selection-expression does not need to be escaped, as it will be automatically escaped during xml-serialization
                    selection="music.timestamp > now() - 86400",
                ),
                garbage_collection="true",
            ),
            nodes(node(distribution_key="0", hostalias="node1")),
            id=f"{application_name}_content",
            version="1.0",
        ),
    ),
)
application_package = ApplicationPackage(
    name=application_name,
    schema=[music_schema],
    services_config=services_config,
)

There are some gotchas to keep in mind when constructing the ServicesConfiguration object.

First, let's establish a common vocabulary through an example. Consider the following services.xml file, which is what the ServicesConfiguration object from the previous cell actually represents:

<?xml version="1.0" encoding="UTF-8" ?>
<services>
  <container id="music_container" version="1.0">
    <search></search>
    <document-api></document-api>
    <document-processing></document-processing>
  </container>
  <content id="music_content" version="1.0">
    <redundancy>1</redundancy>
    <documents garbage-collection="true">
      <document type="music" mode="index" selection="music.timestamp &gt; now() - 86400"></document>
    </documents>
    <nodes>
      <node distribution-key="0" hostalias="node1"></node>
    </nodes>
  </content>
</services>

In this example, services, container, search, document-api, document-processing, content, redundancy, documents, document, nodes, and node are tags. The id, version, type, mode, selection, distribution-key, hostalias, and garbage-collection are attributes, each with a corresponding value.
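
To make the vocabulary concrete, here is a small sketch that parses a copy of the services.xml above with Python's standard library and lists its tags and attributes. It does not use pyvespa; the XML declaration is omitted and the variable names are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# Copy of the services.xml shown above (XML declaration omitted).
SERVICES_XML = """
<services>
  <container id="music_container" version="1.0">
    <search></search>
    <document-api></document-api>
    <document-processing></document-processing>
  </container>
  <content id="music_content" version="1.0">
    <redundancy>1</redundancy>
    <documents garbage-collection="true">
      <document type="music" mode="index" selection="music.timestamp &gt; now() - 86400"></document>
    </documents>
    <nodes>
      <node distribution-key="0" hostalias="node1"></node>
    </nodes>
  </content>
</services>
"""

root = ET.fromstring(SERVICES_XML)

# Walk every element and collect the distinct tag and attribute names.
tags = sorted({element.tag for element in root.iter()})
attributes = sorted({name for element in root.iter() for name in element.attrib})

print(tags)
print(attributes)

# The parser turns the escaped &gt; back into > when reading the value.
selection = root.find("./content/documents/document").get("selection")
print(selection)  # music.timestamp > now() - 86400
```

The last line also shows why the escaped form only matters in the serialized file: once parsed, the value is the plain selection expression.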

Tag names

All tags referenced in the Vespa documentation are available in the vespa.configuration.services module, with the following modifications:

  • All - in the tag names are replaced by _ to avoid conflicts with Python syntax.

  • Some tags that are Python reserved words (or commonly used objects) are constructed by adding a _ at the end of the tag name. These are:

    • type_

    • class_

    • for_

    • time_

    • io_

Only valid tags are exported by the vespa.configuration.services module.
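
The naming rules for tags can be sketched as a small helper. This is an illustration only: `tag_to_python` is a hypothetical function, not part of pyvespa, and the use of `keyword.iskeyword` as a catch-all for reserved words is an assumption based on the rule stated above.

```python
import keyword

# Tag names that the text lists as getting a trailing underscore.
# "type", "time" and "io" are not Python keywords, so they are listed explicitly.
SUFFIXED_TAGS = {"type", "class", "for", "time", "io"}


def tag_to_python(xml_tag: str) -> str:
    """Hypothetical helper: map an XML tag name to the Python name described above."""
    name = xml_tag.replace("-", "_")  # hyphens are not valid in Python identifiers
    if name in SUFFIXED_TAGS or keyword.iskeyword(name):
        name += "_"  # avoid clashing with reserved words or common objects
    return name


print(tag_to_python("document-api"))  # document_api
print(tag_to_python("class"))  # class_
print(tag_to_python("content"))  # content
```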

Attributes

  • Any attribute can be passed to the tag constructor (no validation at construction time).

  • The attribute name should be the same as in the Vespa documentation, but with - replaced by _. For example, the garbage-collection attribute in the documents tag should be passed as garbage_collection.

  • In case the attribute name is a Python reserved word, the same rule as for the tag names applies (add _ at the end). An example of this is the global attribute, which should be passed as global_.

  • Some attributes, such as id in the container tag, are mandatory. As in the example above, child tags are passed as positional arguments to the tag constructor, and attributes are passed as keyword arguments after them.

Values

  • The value of an attribute can be a string, an integer, or a boolean. For types bool and int, the value is converted to a string (lowercased for bool). If you need to pass a float, you should convert it to a string before passing it to the tag constructor, e.g. container(version="1.0").

  • Note that you should not escape the values yourself. In the XML file, the value of the selection attribute in the document tag is music.timestamp &gt; now() - 86400 (&gt; is the escaped form of >). When passing this value to the document tag constructor in Python, leave the > character unescaped, i.e. document(selection="music.timestamp > now() - 86400"). The value is escaped automatically during XML serialization.
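
The value rules can be illustrated with the standard library alone. The sketch below is not pyvespa's implementation: `attribute_value_to_text` is a hypothetical helper that mirrors the conversion described above, and `xml.sax.saxutils.quoteattr` stands in for whatever escaping the serializer performs.

```python
from xml.sax.saxutils import quoteattr


def attribute_value_to_text(value) -> str:
    """Hypothetical helper mirroring the bool/int/str conversion described above."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return str(value).lower()
    return str(value)


print(attribute_value_to_text(True))  # true
print(attribute_value_to_text(1))  # 1
print(attribute_value_to_text("1.0"))  # 1.0

# Escaping happens when the value is written as an XML attribute,
# so the Python string itself keeps the plain > character.
selection = "music.timestamp > now() - 86400"
print(quoteattr(selection))  # "music.timestamp &gt; now() - 86400"
```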

Deploy the Vespa application

Deploy the package on the local machine using Docker, without leaving the notebook, by creating an instance of VespaDocker. VespaDocker connects to the local Docker daemon socket and starts the Vespa Docker image.

If this step fails, please check that the Docker daemon is running and that the Docker daemon socket can be used by clients (configurable under advanced settings in Docker Desktop).

[4]:
from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=application_package)
Waiting for configuration server, 0/60 seconds...
Waiting for application to come up, 0/300 seconds.
Application is up!
Finished deployment.

app now holds a reference to a Vespa instance. See this notebook for details on authenticating to Vespa Cloud.

Feeding documents to Vespa

Now, let us feed some documents to Vespa. We will feed one document with a timestamp of 24 hours and 1 second ago (86401 seconds) and another document with the current time as its timestamp. We will then visit the documents to verify that document-expiry works as expected.

[5]:
import time

docs_to_feed = [
    {
        "id": "1",
        "fields": {
            "artist": "Snoop Dogg",
            "title": "Gin and Juice",
            "timestamp": int(time.time()) - 86401,
        },
    },
    {
        "id": "2",
        "fields": {
            "artist": "Dr.Dre",
            "title": "Still D.R.E",
            "timestamp": int(time.time()),
        },
    },
]
[6]:
from vespa.io import VespaResponse


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(f"Error when feeding document {id}: {response.get_json()}")


app.feed_iterable(docs_to_feed, schema=application_name, callback=callback)

Verify document expiry through visiting

Visiting is a feature to efficiently get or process a set of documents, identified by a document selection expression. Here is how you can use visiting in pyvespa:

[7]:
visit_results = []
for slice_ in app.visit(
    schema=application_name,
    content_cluster_name=f"{application_name}_content",
    timeout="5s",
):
    for response in slice_:
        visit_results.append(response.json)
visit_results
[7]:
[{'pathId': '/document/v1/music/music/docid/',
  'documents': [{'id': 'id:music:music::2',
    'fields': {'artist': 'Dr.Dre',
     'title': 'Still D.R.E',
     'timestamp': 1727428957}}],
  'documentCount': 1}]

We can see that the document with the timestamp from just over 24 hours ago is not returned by the visit, while the document with the current timestamp is returned.
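
The comparison behind this result can be mimicked in plain Python. This is a minimal sketch of the selection expression timestamp > now() - 86400 only. `is_retained` is a hypothetical helper, and it says nothing about when Vespa actually evaluates the expression or removes documents.

```python
import time

TTL_SECONDS = 86400  # one day, as in the selection expression


def is_retained(timestamp: int, now: int, ttl: int = TTL_SECONDS) -> bool:
    """Return True if a document with this timestamp satisfies timestamp > now - ttl."""
    return timestamp > now - ttl


now = int(time.time())
print(is_retained(now - 86401, now))  # False: document 1 is older than the TTL
print(is_retained(now, now))  # True: document 2 is fresh
```

Because the comparison is strict, a document whose timestamp is exactly 86400 seconds old does not match the expression either.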

Clean up

[8]:
vespa_docker.container.stop()
vespa_docker.container.remove()

Next steps

This is just an introduction to the advanced configuration options available in Vespa. For more details, see the Vespa documentation.