Build end-to-end Vespa apps and deploy to Vespa Cloud

Python API to create, modify, deploy and interact with Vespa applications


This self-contained tutorial will create a simplified text search application from scratch based on the MS MARCO dataset, similar to our text search tutorials. We will then deploy the app to Vespa Cloud and interact with it by feeding data, querying and evaluating different query models.

Install

The library is available on PyPI and can be installed with pip.

pip install pyvespa

Application package API

We first create a Document instance containing the Fields that we want to store in the app. In this case we will keep the application simple and only feed a unique id, title and body of the MS MARCO documents.

[1]:
from vespa.package import Document, Field

document = Document(
    fields=[
        Field(name = "id", type = "string", indexing = ["attribute", "summary"]),
        Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
        Field(name = "body", type = "string", indexing = ["index", "summary"], index = "enable-bm25")
    ]
)

The complete Schema of our application will be named msmarco and contains the Document instance we defined above. The default FieldSet indicates that queries will look for matches in both the titles and the bodies of the documents. The default RankProfile indicates that all matched documents will be ranked by the nativeRank expression over the title and the body of the matched documents.

[2]:
from vespa.package import Schema, FieldSet, RankProfile

msmarco_schema = Schema(
    name = "msmarco",
    document = document,
    fieldsets = [FieldSet(name = "default", fields = ["title", "body"])],
    rank_profiles = [RankProfile(name = "default", first_phase = "nativeRank(title, body)")]
)
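For reference, the Schema object above corresponds roughly to the following Vespa schema definition (.sd) file. This is a sketch of what pyvespa generates for us; the exact output may differ between pyvespa versions:

```
schema msmarco {
    document msmarco {
        field id type string {
            indexing: attribute | summary
        }
        field title type string {
            indexing: index | summary
            index: enable-bm25
        }
        field body type string {
            indexing: index | summary
            index: enable-bm25
        }
    }
    fieldset default {
        fields: title, body
    }
    rank-profile default {
        first-phase {
            expression: nativeRank(title, body)
        }
    }
}
```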

Once the Schema is defined, all we have to do is create our msmarco ApplicationPackage:

[3]:
from vespa.package import ApplicationPackage

app_package = ApplicationPackage(name = "msmarco", schema=[msmarco_schema])

At this point, app_package contains all the relevant information required to create our MS MARCO text search app. We now need to deploy it.

Deploy to Vespa Cloud

To be able to deploy to Vespa Cloud, you need to sign up, register an application name in the Vespa Cloud console and generate your user API key.

We first create a VespaCloud instance that will handle the secure communication with the Vespa Cloud servers. All we need is the Vespa Cloud tenant name, the application name you registered, the user API key you generated in the Vespa Cloud console, and the application package we created above.

[6]:
import os
from vespa.package import VespaCloud

vespa_cloud = VespaCloud(
    tenant="vespa-team",
    application="pyvespa-integration",
    key_content=os.getenv("VESPA_CLOUD_USER_KEY").replace(r"\n", "\n"),
    application_package=app_package,
)

We then deploy the application to a particular instance (named test in this case) and specify a folder location for required files, such as the certificates that allow for secure data exchange between the client and the Vespa Cloud servers.

Note: The first call to vespa_cloud.deploy takes around 15 minutes, as Vespa Cloud has to set up the environment. Subsequent calls will be much faster, usually taking less than 10 seconds.

[ ]:
app = vespa_cloud.deploy(
    instance='test',
    disk_folder=os.path.join(os.getenv("WORK_DIR"), "sample_application")
)

The app variable above will hold a Vespa instance that will be used to connect and interact with our text search application throughout this tutorial.

Feed data to the app

We now have our text search app up and running. We can start to feed data to it. We have pre-processed and sampled some MS MARCO data to use in this tutorial. We can load 100 documents that we want to feed and check the first two documents in this sample.

[8]:
from pandas import read_csv

docs = read_csv("https://thigm85.github.io/data/msmarco/docs_100.tsv", sep = "\t")
docs.shape
[8]:
(100, 3)
[9]:
docs.head(2)
[9]:
  id        title                                    body
0 D2185715  What Is an Appropriate Gift for a Bris   Hub Pages Religion and Philosophy Judaism...
1 D2819479  lunge                                    1lungenoun ˈlənj Popularity Bottom 40 of...

To feed the data we need to specify the schema that we are sending data to; we named ours msmarco in a previous section. Each data point needs a unique data_id associated with it, regardless of whether the schema has an id field. The fields should be a dict containing all the fields in the schema, which are id, title and body in our case.

[10]:
for idx, row in docs.iterrows():
    response = app.feed_data_point(
        schema = "msmarco",
        data_id = str(row["id"]),
        fields = {
            "id": str(row["id"]),
            "title": str(row["title"]),
            "body": str(row["body"])
        }
    )
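feed_data_point returns a response object for each document, and a more robust loop would check it so failed feeds can be retried. A minimal sketch, assuming only that the response exposes an HTTP status_code; here the responses are mocked with a stand-in class rather than produced by a live app:

```python
# Stand-in for the response objects returned by app.feed_data_point;
# we only assume they carry an HTTP status code.
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code

# Pretend these came back from feeding two documents.
responses = {"D2185715": FakeResponse(200), "D2819479": FakeResponse(500)}

# Collect the ids whose feed did not succeed, e.g. to retry them later.
failed = [doc_id for doc_id, r in responses.items() if r.status_code != 200]
```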

Make a simple query

Once our application is fed we can start sending queries to it. The MS MARCO app expects questions as queries, and the goal of the application is to return documents that are relevant to those questions.

In the example below, we will send a question via the query parameter. In addition, we need to specify how we want the documents to be matched and ranked. We do this by specifying a QueryModel. The query model below will have the OR operator in the match phase, indicating that the application will match all the documents which have at least one query term within the title or the body (due to the default FieldSet we defined earlier) of the document. And we will rank all the matched documents by the default RankProfile that we defined earlier.

[11]:
from vespa.query import QueryModel, OR, RankProfile as Ranking

results = app.query(
    query="Where is my text?",
    query_model = QueryModel(
        match_phase=OR(),
        rank_profile=Ranking(name="default")
    ),
    hits = 2
)

In addition to the query and query_model parameters, we can specify a multitude of relevant Vespa parameters such as the number of hits that we want Vespa to return. We chose hits=2 for simplicity in this tutorial.

[12]:
len(results.hits)
[12]:
2
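Each entry in results.hits follows Vespa's default JSON result format: a hit carries a document id, a relevance score and the requested summary fields. A sketch with a hand-written hit to show the shape; the concrete values are illustrative, not real output:

```python
# Shape of a single hit in Vespa's default JSON result format;
# the values below are made up for illustration.
hit = {
    "id": "id:msmarco:msmarco::D2185715",
    "relevance": 0.25,
    "fields": {
        "id": "D2185715",
        "title": "What Is an Appropriate Gift for a Bris",
    },
}

title = hit["fields"]["title"]
score = hit["relevance"]
```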

Change the application package and redeploy

We can also make specific changes to our application by changing the application package and redeploying. Let's add a new rank profile based on BM25 to our Schema.

[13]:
app_package.schema.add_rank_profile(
    RankProfile(name = "bm25", inherits = "default", first_phase = "bm25(title) + bm25(body)")
)

After that we can redeploy our application, similar to what we did earlier:

[ ]:
app = vespa_cloud.deploy(
    instance='test',
    disk_folder=os.path.join(os.getenv("WORK_DIR"), "sample_application")
)

We can then use the newly created bm25 rank profile to make queries:

[15]:
results = app.query(
    query="Where is my text?",
    query_model = QueryModel(
        match_phase=OR(),
        rank_profile=Ranking(name="bm25")
    ),
    hits = 2
)
len(results.hits)
[15]:
2

Compare query models

When we are building a search application, we often want to experiment and compare different query models. In this section we want to show how easy it is to compare different query models in Vespa.

Let's load some labeled data where each data point contains a query_id, a query and a list of relevant_docs associated with the query. In this case, there is only one relevant document per query.

[16]:
import requests, json

labeled_data = json.loads(
    requests.get("https://thigm85.github.io/data/msmarco/query-labels.json").text
)

Below are two examples from the labeled data:

[17]:
labeled_data[0:2]
[17]:
[{'query_id': '1',
  'query': 'what county is aspen co',
  'relevant_docs': [{'id': 'D1098819'}]},
 {'query_id': '2',
  'query': 'where is aeropostale located',
  'relevant_docs': [{'id': 'D2268823'}]}]
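Since each data point carries a list of relevant_docs, it is straightforward to pull out, say, the set of relevant document ids per query, which is essentially what the evaluation step later needs. A small sketch using the two data points shown above:

```python
labeled_data_sample = [
    {"query_id": "1", "query": "what county is aspen co",
     "relevant_docs": [{"id": "D1098819"}]},
    {"query_id": "2", "query": "where is aeropostale located",
     "relevant_docs": [{"id": "D2268823"}]},
]

# Map each query_id to the set of document ids judged relevant for it.
relevant_by_query = {
    point["query_id"]: {doc["id"] for doc in point["relevant_docs"]}
    for point in labeled_data_sample
}
```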

Let's define two QueryModels to compare. Both use the same OR operator in the match phase; they differ in using the default versus the bm25 rank profile.

[22]:
default_ranking = QueryModel(
    name="default",
    match_phase=OR(),
    rank_profile=Ranking(name="default")
)
[23]:
bm25_ranking = QueryModel(
    name="bm25",
    match_phase=OR(),
    rank_profile=Ranking(name="bm25")
)

Now we choose which evaluation metrics to look at. In this case we choose MatchRatio, to check how many documents were matched by each query, along with Recall at 10 and ReciprocalRank at 10.

[24]:
from vespa.evaluation import MatchRatio, Recall, ReciprocalRank

eval_metrics = [MatchRatio(), Recall(at = 10), ReciprocalRank(at = 10)]
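pyvespa computes these metrics for us, but as a sanity check they are easy to compute by hand. A sketch of reciprocal rank and recall at k over a ranked list of document ids; this is an illustration of the definitions, not pyvespa's implementation:

```python
def reciprocal_rank_at(ranked_ids, relevant_ids, k=10):
    # 1/rank of the first relevant document within the top k, else 0.
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at(ranked_ids, relevant_ids, k=10):
    # Fraction of the relevant documents retrieved within the top k.
    found = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return found / len(relevant_ids)

ranked = ["D3", "D1", "D7"]
relevant = {"D1"}
reciprocal_rank_at(ranked, relevant)  # 0.5: first relevant doc at rank 2
recall_at(ranked, relevant)           # 1.0: the single relevant doc is retrieved
```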

We can now run the evaluate method for each QueryModel. This sends queries to the application and processes the results to compute the eval_metrics defined above.

[26]:
evaluation = app.evaluate(
    labeled_data=labeled_data,
    eval_metrics=eval_metrics,
    query_model=[default_ranking, bm25_ranking],
    id_field="id",
    timeout=5,
    hits=10
)
evaluation
[26]:
model                          bm25   default
match_ratio         mean   0.867000  0.867000
                    median 0.940000  0.940000
                    std    0.187161  0.187161
recall_10           mean   0.110000  0.100000
                    median 0.000000  0.000000
                    std    0.314466  0.301511
reciprocal_rank_10  mean   0.110000  0.093333
                    median 0.000000  0.000000
                    std    0.314466  0.288500