Build end-to-end Vespa apps and deploy to Vespa Cloud¶
Python API to create, modify, deploy and interact with Vespa applications
This self-contained tutorial will create a simplified text search application from scratch based on the MS MARCO dataset, similar to our text search tutorials. We will then deploy the app to Vespa Cloud and interact with it by feeding data, querying and evaluating different query models.
Install¶
The library is available on PyPI and can therefore be installed with pip:
pip install pyvespa
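If you want to confirm which version was installed, a quick check with the Python standard library works (this is just a convenience check, not part of pyvespa itself):
[ ]:
from importlib.metadata import version  # available on Python 3.8+

# Print the installed pyvespa version; raises PackageNotFoundError if the
# install did not succeed.
print(version("pyvespa"))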
Application package API¶
We first create a Document instance containing the Fields that we want to store in the app. In this case we will keep the application simple and only feed a unique id, the title and the body of the MS MARCO documents.
[1]:
from vespa.package import Document, Field

document = Document(
    fields=[
        Field(name = "id", type = "string", indexing = ["attribute", "summary"]),
        Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
        Field(name = "body", type = "string", indexing = ["index", "summary"], index = "enable-bm25")
    ]
)
The complete Schema of our application will be named msmarco and contains the Document instance that we defined above. The default FieldSet indicates that queries will look for matches by searching both the titles and the bodies of the documents. The default RankProfile indicates that all matched documents will be ranked by the nativeRank expression involving the title and the body of the matched documents.
[2]:
from vespa.package import Schema, FieldSet, RankProfile

msmarco_schema = Schema(
    name = "msmarco",
    document = document,
    fieldsets = [FieldSet(name = "default", fields = ["title", "body"])],
    rank_profiles = [RankProfile(name = "default", first_phase = "nativeRank(title, body)")]
)
Once the Schema is defined, all we have to do is create our msmarco ApplicationPackage:
[3]:
from vespa.package import ApplicationPackage
app_package = ApplicationPackage(name = "msmarco", schema=[msmarco_schema])
At this point, app_package contains all the relevant information required to create our MS MARCO text search app. We now need to deploy it.
Deploy to Vespa Cloud¶
To be able to deploy to Vespa Cloud, you need to sign up, register an application name in the Vespa Cloud console and generate your user API key.
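The code below reads the user API key from the VESPA_CLOUD_USER_KEY environment variable. One way to set it, assuming you saved the key generated in the console to a local PEM file (the path below is only a placeholder), is:
[ ]:
import os
from pathlib import Path

# Placeholder path: point this at the user API key file you downloaded from
# the Vespa Cloud console.
os.environ["VESPA_CLOUD_USER_KEY"] = Path("~/.vespa/user-api-key.pem").expanduser().read_text()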
We first create a VespaCloud instance that will handle the secure communication with the Vespa Cloud servers. For that, all we need is the Vespa Cloud tenant name, the application name that you registered, the user key you generated in the Vespa Cloud console and the application package that we created above.
[5]:
import os
from vespa.package import VespaCloud

vespa_cloud = VespaCloud(
    tenant="vespa-team",
    application="pyvespa-integration",
    key_content=os.getenv("VESPA_CLOUD_USER_KEY").replace(r"\n", "\n"),
    application_package=app_package,
)
We then deploy the application to a particular instance (named test in this case) and specify a folder location needed to store required files, such as certificates, that allow for secure data exchange between the client and the Vespa Cloud servers.
Note: The first call to vespa_cloud.deploy takes around 15 minutes, as Vespa Cloud needs to set up the environment. Subsequent calls are much faster, usually taking less than 10 seconds.
[ ]:
app = vespa_cloud.deploy(
    instance='test',
    disk_folder=os.path.join(os.getenv("WORK_DIR"), "sample_application")
)
The app variable above holds a Vespa instance that will be used to connect to and interact with our text search application throughout this tutorial.
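Before feeding data, you can optionally verify that the endpoint is reachable. Depending on your pyvespa version, the Vespa instance exposes a get_application_status helper that queries the application's status handler; treat this as an optional sanity check:
[ ]:
# Returns the HTTP response from the application's status handler if the
# deployment succeeded and the certificates stored in disk_folder are valid.
status = app.get_application_status()
print(status)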
Feed data to the app¶
We now have our text search app up and running. We can start to feed data to it. We have pre-processed and sampled some MS MARCO data to use in this tutorial. We can load 100 documents that we want to feed and check the first two documents in this sample.
[13]:
from pandas import read_csv
docs = read_csv("https://thigm85.github.io/data/msmarco/docs_100.tsv", sep = "\t")
docs.shape
[13]:
(100, 3)
[14]:
docs.head(2)
[14]:
|   | id       | title                                  | body                                         |
|---|----------|----------------------------------------|----------------------------------------------|
| 0 | D2185715 | What Is an Appropriate Gift for a Bris | Hub Pages Religion and Philosophy Judaism... |
| 1 | D2819479 | lunge                                  | 1lungenoun ˈlənj Popularity Bottom 40 of...  |
To feed the data we need to specify the schema that we are sending data to. We named our schema msmarco in a previous section. Each data point needs a unique data_id associated with it, regardless of whether the schema has an id field. The fields argument should be a dict containing all the fields in the schema, which in our case are id, title and body.
[ ]:
for idx, row in docs.iterrows():
    response = app.feed_data_point(
        schema = "msmarco",
        data_id = str(row["id"]),
        fields = {
            "id": str(row["id"]),
            "title": str(row["title"]),
            "body": str(row["body"])
        }
    )
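As a quick sanity check that the feed went through, we can fetch one of the documents back by its data_id with get_data (the exact response object varies a bit between pyvespa versions, but it carries the HTTP status code):
[ ]:
# Retrieve the first document we fed; a 200 status code means it was found.
response = app.get_data(schema="msmarco", data_id=str(docs.loc[0, "id"]))
print(response.status_code)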
Make a simple query¶
Once our application is fed we can start sending queries to it. The MS MARCO app expects to receive questions as queries, and the goal of the application is to return documents that are relevant to those questions.
In the example below, we send a question via the query parameter. In addition, we need to specify how we want the documents to be matched and ranked. We do this by specifying a QueryModel. The query model below uses the OR operator in the match phase, meaning the application will match all documents that contain at least one query term in the title or the body (due to the default FieldSet we defined earlier). All the matched documents are then ranked by the default RankProfile that we defined earlier.
[9]:
from vespa.query import QueryModel, OR, RankProfile as Ranking

results = app.query(
    query="Where is my text?",
    query_model = QueryModel(
        match_phase=OR(),
        rank_profile=Ranking(name="default")
    ),
    hits = 2
)
In addition to the query and query_model parameters, we can specify a multitude of relevant Vespa parameters, such as the number of hits that we want Vespa to return. We chose hits=2 for simplicity in this tutorial.
[10]:
len(results.hits)
[10]:
2
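Each entry in results.hits follows Vespa's standard JSON result format, so we can inspect the ranked documents directly; the field names below are the ones from our msmarco schema:
[ ]:
# Print the relevance score, id and title of each returned hit.
for hit in results.hits:
    print(hit["relevance"], hit["fields"]["id"], hit["fields"]["title"])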
Change the application package and redeploy¶
We can also make specific changes to our application by changing the application package and redeploying. Let's add a new rank profile based on BM25 to our Schema.
[11]:
app_package.schema.add_rank_profile(
    RankProfile(name = "bm25", inherits = "default", first_phase = "bm25(title) + bm25(body)")
)
After that we can redeploy our application, similar to what we did earlier:
[ ]:
app = vespa_cloud.deploy(
    instance='test',
    disk_folder=os.path.join(os.getenv("WORK_DIR"), "sample_application")
)
We can then use the newly created bm25 rank profile to make queries:
[15]:
results = app.query(
    query="Where is my text?",
    query_model = QueryModel(
        match_phase=OR(),
        rank_profile=Ranking(name="bm25")
    ),
    hits = 2
)
len(results.hits)
[15]:
2
Compare query models¶
When building a search application, we often want to experiment with and compare different query models. In this section we show how easy it is to compare different query models in Vespa.
Let's load some labeled data where each data point contains a query_id, a query and a list of relevant_docs associated with the query. In this case, we have only one relevant document per query.
[16]:
import requests, json

labeled_data = json.loads(
    requests.get("https://thigm85.github.io/data/msmarco/query-labels.json").text
)
Below we can see two examples from the labeled data:
[17]:
labeled_data[0:2]
[17]:
[{'query_id': '1',
'query': 'what county is aspen co',
'relevant_docs': [{'id': 'D1098819'}]},
{'query_id': '2',
'query': 'where is aeropostale located',
'relevant_docs': [{'id': 'D2268823'}]}]
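The evaluation methods only rely on this structure, so you can just as well build labeled_data by hand for your own queries, for example:
[ ]:
# A minimal hand-built example that follows the same structure as the file above.
my_labeled_data = [
    {
        "query_id": "1",
        "query": "what county is aspen co",
        "relevant_docs": [{"id": "D1098819"}],
    },
]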
Let's define two QueryModels to be compared. We are going to use the same OR operator in the match phase and compare the default and bm25 rank profiles.
[18]:
default_ranking = QueryModel(
    match_phase=OR(),
    rank_profile=Ranking(name="default")
)
[19]:
bm25_ranking = QueryModel(
    match_phase=OR(),
    rank_profile=Ranking(name="bm25")
)
Now we will choose which evaluation metrics to look at. In this case we will use the MatchRatio, to check how many documents were matched by the query, the Recall at 10 and the ReciprocalRank at 10.
[20]:
from vespa.evaluation import MatchRatio, Recall, ReciprocalRank
eval_metrics = [MatchRatio(), Recall(at = 10), ReciprocalRank(at = 10)]
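As a reference for how the last two metrics behave with one relevant document per query: recall at 10 is 1 if the relevant document appears among the top 10 hits and 0 otherwise, and reciprocal rank at 10 is 1 divided by the position of the first relevant hit (0 if it is not in the top 10). The sketch below is purely illustrative and not pyvespa's internal implementation:
[ ]:
def recall_at_k(ranked_ids, relevant_ids, k=10):
    # 1.0 if any relevant document shows up among the top-k results, else 0.0.
    return float(any(doc_id in relevant_ids for doc_id in ranked_ids[:k]))

def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    # 1 / position of the first relevant document within the top-k, else 0.0.
    for position, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0

# A relevant document ranked second gives recall 1.0 and reciprocal rank 0.5.
print(recall_at_k(["D1", "D2", "D3"], {"D2"}), reciprocal_rank_at_k(["D1", "D2", "D3"], {"D2"}))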
We can now run the evaluate method for each QueryModel. This will send queries to the application and process the results to compute the eval_metrics defined above.
[21]:
default_evaluation = app.evaluate(
    labeled_data=labeled_data,
    eval_metrics=eval_metrics,
    query_model=default_ranking,
    id_field="id",
    timeout=5,
    hits=10
)
[22]:
bm25_evaluation = app.evaluate(
    labeled_data=labeled_data,
    eval_metrics=eval_metrics,
    query_model=bm25_ranking,
    id_field="id",
    timeout=5,
    hits=10
)
We can then merge the DataFrames returned by the evaluate method and start to analyse the results.
[23]:
from pandas import merge

eval_comparison = merge(
    left=default_evaluation,
    right=bm25_evaluation,
    on="query_id",
    suffixes=('_default', '_bm25')
)
eval_comparison.head()
[23]:
|   | query_id | match_ratio_retrieved_docs_default | match_ratio_docs_available_default | match_ratio_value_default | recall_10_value_default | reciprocal_rank_10_value_default | match_ratio_retrieved_docs_bm25 | match_ratio_docs_available_bm25 | match_ratio_value_bm25 | recall_10_value_bm25 | reciprocal_rank_10_value_bm25 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 914 | 997 | 0.916750 | 1.0 | 1.000 | 914 | 997 | 0.916750 | 1.0 | 1.000000 |
| 1 | 2 | 896 | 997 | 0.898696 | 1.0 | 0.125 | 896 | 997 | 0.898696 | 1.0 | 1.000000 |
| 2 | 3 | 970 | 997 | 0.972919 | 1.0 | 1.000 | 970 | 997 | 0.972919 | 1.0 | 1.000000 |
| 3 | 4 | 981 | 997 | 0.983952 | 1.0 | 1.000 | 981 | 997 | 0.983952 | 1.0 | 1.000000 |
| 4 | 5 | 748 | 997 | 0.750251 | 1.0 | 0.500 | 748 | 997 | 0.750251 | 1.0 | 0.333333 |
Notice that we expect to observe the same match ratio for both query models, since they use the same OR operator.
[24]:
eval_comparison[["match_ratio_value_default", "match_ratio_value_bm25"]].describe().loc[["mean", "std"]]
[24]:
|      | match_ratio_value_default | match_ratio_value_bm25 |
|------|---------------------------|------------------------|
| mean | 0.866650                  | 0.866650               |
| std  | 0.181307                  | 0.181307               |
The bm25 rank profile obtained a significantly higher recall than the default profile.
[25]:
eval_comparison[["recall_10_value_default", "recall_10_value_bm25"]].describe().loc[["mean", "std"]]
[25]:
|      | recall_10_value_default | recall_10_value_bm25 |
|------|-------------------------|----------------------|
| mean | 0.840000                | 0.960000             |
| std  | 0.368453                | 0.196946             |
Similarly, bm25 also obtained a significantly higher reciprocal rank value when compared to the default rank profile.
[26]:
eval_comparison[["reciprocal_rank_10_value_default", "reciprocal_rank_10_value_bm25"]].describe().loc[["mean", "std"]]
[26]:
|      | reciprocal_rank_10_value_default | reciprocal_rank_10_value_bm25 |
|------|----------------------------------|-------------------------------|
| mean | 0.724750                         | 0.943333                      |
| std  | 0.399118                         | 0.216103                      |
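Since eval_comparison has one row per query, we can also count how often each rank profile wins on reciprocal rank, which gives a more direct per-query picture than the aggregate means; this is just exploratory analysis on the merged DataFrame:
[ ]:
# Count queries where bm25 beats, ties with, or loses to the default profile
# on reciprocal rank at 10.
diff = (
    eval_comparison["reciprocal_rank_10_value_bm25"]
    - eval_comparison["reciprocal_rank_10_value_default"]
)
print("bm25 better:   ", (diff > 0).sum())
print("tie:           ", (diff == 0).sum())
print("default better:", (diff < 0).sum())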