Pyvespa examples

This is a notebook with short examples one can build applications from.

Refer to troubleshooting for any problem when running this guide; it also has utilities for debugging.

[ ]:

!pip3 install pyvespa

Neighbors

Explore distance between points in 3D vector space.

These are simple examples, feeding documents with a tensor representing a point in space, and a rank profile calculating the distance between a point in the query and the point in the documents.

The examples start with simple ranking expressions like euclidean-distance, then move on to rank features like closeness() and to setting different distance metrics.

Distant neighbor

First, find the point that is most distant from the point in the query. Deploy the application package:

[14]:

from vespa.package import ApplicationPackage, Field, RankProfile
from vespa.deployment import VespaDocker
from vespa.io import VespaResponse

app_package = ApplicationPackage(name="neighbors")

app_package.schema.add_fields(
    Field(name = "point", type = "tensor<float>(d[3])", indexing = ["attribute", "summary"])
)

app_package.schema.add_rank_profile(
    RankProfile(
        name = "max_distance",
        inputs = [("query(qpoint)", "tensor<float>(d[3])")],
        first_phase = "euclidean_distance(attribute(point), query(qpoint), d)"
    )
)

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=app_package)

Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 10/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 15/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 20/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 25/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Application is up!
Finished deployment.

Feed points in 3D space using an indexed tensor with one dimension of size 3. Pyvespa feeds using the /document/v1/ API; refer to the document format:

[15]:

def get_feed(field_name):
    return [
        {
            'id': 0,
            'fields': {
                field_name: [0.0, 1.0, 2.0]
            }
        },
        {
            'id': 1,
            'fields': {
                field_name: [1.0, 2.0, 3.0]
            }
        },
        {
            'id': 2,
            'fields': {
                field_name: [2.0, 3.0, 4.0]
            }
        }
    ]
with app.syncio(connections=1) as session:
    for u in get_feed("point"):
        response:VespaResponse = session.update_data(data_id=u["id"], schema="neighbors", fields=u["fields"], create=True)
        if not response.is_successful():
            print("Update failed for document {}".format(u["id"]) + " with status code {}".format(response.status_code) +
            " with response {}".format(response.get_json()))

Note: The feed above uses create-if-nonexistent, i.e. update a document, and create it if it does not exist. Later in this notebook we will add a field and update it, so using an update to feed data makes that step easier.
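
As a rough sketch of what such an update might look like as an HTTP request, the plain-Python snippet below builds a path and a JSON body. It assumes the /document/v1/ partial-update format with an "assign" operation per field, and that the namespace equals the schema name. The helper update_request is hypothetical and not part of pyvespa.

```python
import json

def update_request(schema, data_id, fields, create=True, namespace=None):
    """Sketch of an update request; hypothetical helper, not a pyvespa API."""
    # Assumption: the namespace defaults to the schema name
    namespace = namespace or schema
    path = "/document/v1/{}/{}/docid/{}".format(namespace, schema, data_id)
    if create:
        # create-if-nonexistent: create the document if it is missing
        path += "?create=true"
    # Assumption: each field value is wrapped in an "assign" operation
    body = {"fields": {name: {"assign": value} for name, value in fields.items()}}
    return path, body

path, body = update_request("neighbors", 0, {"point": [0.0, 1.0, 2.0]})
print("PUT", path)
print(json.dumps(body))
```

Pyvespa builds and sends the actual request itself; the sketch only illustrates the shape of the data.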

Query from the origin using YQL. The rank profile will rank the most distant points highest, here sqrt(2*2 + 3*3 + 4*4) = 5.385:

[16]:

import json
from vespa.io import VespaQueryResponse

result:VespaQueryResponse = app.query(body={
  'yql': 'select point from neighbors where true',
  'input.query(qpoint)': '[0.0, 0.0, 0.0]',
  'ranking.profile': 'max_distance',
  'presentation.format.tensors': 'short-value'
})

# Check the query result (not the response from the earlier feed loop)
if not result.is_successful():
    print("Query failed with status code {}".format(result.status_code) +
    " with response {}".format(result.get_json()))
    raise Exception("Query failed")
if len(result.hits) != 3:
    raise Exception("Expected 3 hits, got {}".format(len(result.hits)))
print(json.dumps(result.hits, indent=4))

[
    {
        "id": "index:neighbors_content/0/c81e728dfde15fa4e8dfb3d3",
        "relevance": 5.385164807134504,
        "source": "neighbors_content",
        "fields": {
            "point": [
                2.0,
                3.0,
                4.0
            ]
        }
    },
    {
        "id": "index:neighbors_content/0/c4ca4238db266f395150e961",
        "relevance": 3.7416573867739413,
        "source": "neighbors_content",
        "fields": {
            "point": [
                1.0,
                2.0,
                3.0
            ]
        }
    },
    {
        "id": "index:neighbors_content/0/cfcd20845b10b1420c6cdeca",
        "relevance": 2.23606797749979,
        "source": "neighbors_content",
        "fields": {
            "point": [
                0.0,
                1.0,
                2.0
            ]
        }
    }
]
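
The relevance values above can be checked by hand. The following is a minimal sketch in plain Python, independent of Vespa, that computes the euclidean distance from the origin to each fed point and sorts the most distant first. The function euclidean_distance here is a local illustration, not the Vespa ranking function.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

points = [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
origin = [0.0, 0.0, 0.0]

# Sort most distant first, mirroring the order produced by max_distance
ranked = sorted(points, key=lambda p: euclidean_distance(p, origin), reverse=True)
for p in ranked:
    print(p, round(euclidean_distance(p, origin), 3))
# [2.0, 3.0, 4.0] 5.385
# [1.0, 2.0, 3.0] 3.742
# [0.0, 1.0, 2.0] 2.236
```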

Query from [1.0, 2.0, 2.9] - find that [2.0, 3.0, 4.0] is most distant:

[17]:

result = app.query(body={
  'yql': 'select point from neighbors where true',
  'input.query(qpoint)': '[1.0, 2.0, 2.9]',
  'ranking.profile': 'max_distance',
  'presentation.format.tensors': 'short-value',
})
print(json.dumps(result.hits, indent=4))

[
    {
        "id": "index:neighbors_content/0/c81e728dfde15fa4e8dfb3d3",
        "relevance": 1.7916472308265357,
        "source": "neighbors_content",
        "fields": {
            "point": [
                2.0,
                3.0,
                4.0
            ]
        }
    },
    {
        "id": "index:neighbors_content/0/cfcd20845b10b1420c6cdeca",
        "relevance": 1.6763055154708881,
        "source": "neighbors_content",
        "fields": {
            "point": [
                0.0,
                1.0,
                2.0
            ]
        }
    },
    {
        "id": "index:neighbors_content/0/c4ca4238db266f395150e961",
        "relevance": 0.09999990575011103,
        "source": "neighbors_content",
        "fields": {
            "point": [
                1.0,
                2.0,
                3.0
            ]
        }
    }
]

Nearest neighbor

The nearestNeighbor query operator calculates distances between points in vector space. Here, we are using the default distance metric (euclidean), as it is not specified. The closeness() rank feature can be used to rank results - add a new rank profile:

[18]:

app_package.schema.add_rank_profile(
    RankProfile(
        name = "nearest_neighbor",
        inputs = [("query(qpoint)", "tensor<float>(d[3])")],
        first_phase = "closeness(field, point)"
    )
)

app = vespa_docker.deploy(application_package=app_package)

Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Application is up!
Finished deployment.
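
The new profile can be used together with the nearestNeighbor query operator. As a minimal sketch, the hypothetical helper nearest_neighbor_body below (not part of pyvespa) builds a query body in the same shape as the other queries in this notebook. The targetHits value of 3 is an assumption that matches the three documents fed.

```python
def nearest_neighbor_body(field, profile, qpoint, target_hits=3):
    """Build a query body for app.query(); hypothetical helper, not a pyvespa API."""
    yql = "select {f} from neighbors where {{targetHits: {t}}}nearestNeighbor({f}, qpoint)".format(
        f=field, t=target_hits
    )
    return {
        "yql": yql,
        "input.query(qpoint)": str(qpoint),
        "ranking.profile": profile,
        "presentation.format.tensors": "short-value",
    }

body = nearest_neighbor_body("point", "nearest_neighbor", [1.0, 2.0, 2.9])
print(body["yql"])
# With the application deployed, the query could be run as:
#   result = app.query(body=body)
```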

Nearest neighbor - angular

So far, we have used the default distance-metric, which is euclidean. Now try another one: add a new field with the “angular” distance metric:

[20]:

app_package.schema.add_fields(
    Field(name = "point_angular",
          type = "tensor<float>(d[3])",
          indexing = ["attribute", "summary"],
          attribute=["distance-metric: angular"])
)
app_package.schema.add_rank_profile(
    RankProfile(
        name = "nearest_neighbor_angular",
        inputs = [("query(qpoint)", "tensor<float>(d[3])")],
        first_phase = "closeness(field, point_angular)"
    )
)

app = vespa_docker.deploy(application_package=app_package)

Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 10/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Application is up!
Finished deployment.

Feed the same data to the point_angular field:

[21]:

# Open a new session; the session from the earlier with-block has been closed
with app.syncio(connections=1) as session:
    for u in get_feed("point_angular"):
        response:VespaResponse = session.update_data(data_id=u["id"], schema="neighbors", fields=u["fields"])
        if not response.is_successful():
            print("Update failed for document {}".format(u["id"]) + " with status code {}".format(response.status_code) +
            " with response {}".format(response.get_json()))

Observe that the documents now have two vectors.

Notice that we pass native Vespa /document/v1/ API parameters to reduce the tensor verbosity:

[24]:

from vespa.io import VespaResponse
response:VespaResponse = app.get_data(schema="neighbors", data_id=0, **{"format.tensors": "short-value"})
print(json.dumps(response.get_json(), indent=4))

{
    "pathId": "/document/v1/neighbors/neighbors/docid/0",
    "id": "id:neighbors:neighbors::0",
    "fields": {
        "point": [
            0.0,
            1.0,
            2.0
        ],
        "point_angular": [
            0.0,
            1.0,
            2.0
        ]
    }
}

[25]:

result = app.query(body={
  'yql': 'select point_angular from neighbors where {targetHits: 3}nearestNeighbor(point_angular, qpoint)',
  'input.query(qpoint)': '[1.0, 2.0, 2.9]',
  'ranking.profile': 'nearest_neighbor_angular',
  'presentation.format.tensors': 'short-value'
})
print(json.dumps(result.hits, indent=4))

[
    {
        "id": "index:neighbors_content/0/c4ca4238db266f395150e961",
        "relevance": 0.983943389010042,
        "source": "neighbors_content",
        "fields": {
            "point_angular": [
                1.0,
                2.0,
                3.0
            ]
        }
    },
    {
        "id": "index:neighbors_content/0/c81e728dfde15fa4e8dfb3d3",
        "relevance": 0.9004871017951954,
        "source": "neighbors_content",
        "fields": {
            "point_angular": [
                2.0,
                3.0,
                4.0
            ]
        }
    },
    {
        "id": "index:neighbors_content/0/cfcd20845b10b1420c6cdeca",
        "relevance": 0.7638041096953281,
        "source": "neighbors_content",
        "fields": {
            "point_angular": [
                0.0,
                1.0,
                2.0
            ]
        }
    }
]

In the output above, observe the difference in “relevance” compared to a query using 'ranking.profile': 'nearest_neighbor'. This is the difference in closeness() when using different distance metrics.
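
To see where these numbers come from, the sketch below recomputes closeness in plain Python. It assumes that closeness is 1 / (1 + distance) and that the angular metric is the angle between the two vectors in radians. Under these assumptions the results agree with the angular relevance values above to about three decimals; small differences are expected because the tensors are stored as 32-bit floats. The function names are illustrative only.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def angular(a, b):
    """Angle in radians between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] to guard against rounding before acos
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def closeness(distance):
    """Assumed closeness formula: 1 / (1 + distance)."""
    return 1.0 / (1.0 + distance)

query = [1.0, 2.0, 2.9]
points = [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]

for p in points:
    print(p,
          "euclidean:", round(closeness(euclidean(p, query)), 3),
          "angular:", round(closeness(angular(p, query)), 3))
```

Note how the ordering can differ between metrics: by angle, [2.0, 3.0, 4.0] is closer to the query than [0.0, 1.0, 2.0], while by euclidean distance it is the other way around.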

Next steps

Try the multi-vector-indexing notebook to explore using an HNSW-index for approximate nearest neighbor search.
Explore using the distance() rank feature - this should give the same results as the ranking expressions using euclidean-distance above.
The label annotation is useful when a schema has several vector fields - read more about the nearestNeighbor query operator.

Cleanup

[ ]:

vespa_docker.container.stop()
vespa_docker.container.remove()