LightGBM: Training the model with Vespa features
The main goal of this tutorial is to deploy and use a LightGBM model in a Vespa application. The following tasks will be accomplished throughout the tutorial:
Train a LightGBM classification model with variable names supported by Vespa.
Create Vespa application package files and export them to an application folder.
Export the trained LightGBM model to the Vespa application folder.
Deploy the Vespa application using the application folder.
Feed data to the Vespa application.
Assert that the LightGBM predictions from the deployed model are correct.
Refer to troubleshooting for any problems when running this guide.
Setup
Install and load required packages.
[ ]:
!pip3 install numpy pandas pyvespa lightgbm
[3]:
import json
import lightgbm as lgb
import numpy as np
import pandas as pd
Create data
Generate a toy dataset to follow along. Note that we set the column names in a format that Vespa understands: query(value) means that the user will send a parameter named value along with the query, and attribute(field) means that field is a document attribute defined in the schema. In the example below we have a query parameter named value and two document attributes, numeric and categorical. If we want lightgbm to handle categorical variables we should use dtype="category" when creating the dataframe, as shown below.
[4]:
# Create random training set
features = pd.DataFrame({
"query(value)": np.random.random(100),
"attribute(numeric)": np.random.random(100),
"attribute(categorical)": pd.Series(np.random.choice(["a", "b", "c"], size=100), dtype="category")
})
features.head()
[4]:
|   | query(value) | attribute(numeric) | attribute(categorical) |
|---|---|---|---|
| 0 | 0.437748 | 0.442222 | c |
| 1 | 0.957135 | 0.323047 | b |
| 2 | 0.514168 | 0.426117 | a |
| 3 | 0.713511 | 0.886630 | b |
| 4 | 0.626918 | 0.663179 | c |
We generate the target variable as a function of the three features defined above:
[5]:
numeric_features = pd.get_dummies(features)
targets = (
(numeric_features["query(value)"] +
numeric_features["attribute(numeric)"] -
0.5 * numeric_features["attribute(categorical)_a"] +
0.5 * numeric_features["attribute(categorical)_c"]) > 1.0
) * 1.0
targets
[5]:
0 1.0
1 1.0
2 0.0
3 1.0
4 1.0
...
95 0.0
96 1.0
97 0.0
98 0.0
99 1.0
Length: 100, dtype: float64
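The labeling rule can also be sanity-checked by hand. A minimal sketch of the same rule as a plain Python function (the input values are hypothetical, not rows from the random dataset above):

```python
# Hand-check of the labeling rule used to generate the targets:
# score = query(value) + attribute(numeric) - 0.5 if categorical == "a"
#                                           + 0.5 if categorical == "c"
# and the label is 1.0 when the score exceeds 1.0, else 0.0.
def target(query_value, numeric, categorical):
    score = query_value + numeric
    if categorical == "a":
        score -= 0.5
    elif categorical == "c":
        score += 0.5
    return 1.0 if score > 1.0 else 0.0

print(target(0.7, 0.5, "c"))  # 0.7 + 0.5 + 0.5 = 1.7 > 1.0 -> 1.0
print(target(0.7, 0.5, "a"))  # 0.7 + 0.5 - 0.5 = 0.7 <= 1.0 -> 0.0
```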
Fit a LightGBM model
Train a LightGBM model with a binary loss function:
[6]:
training_set = lgb.Dataset(features, targets)
# Train the model
params = {
'objective': 'binary',
'metric': 'binary_logloss',
'num_leaves': 3,
}
model = lgb.train(params, training_set, num_boost_round=5)
[LightGBM] [Info] Number of positive: 48, number of negative: 52
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000484 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 74
[LightGBM] [Info] Number of data points in the train set: 100, number of used features: 3
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.480000 -> initscore=-0.080043
[LightGBM] [Info] Start training from score -0.080043
Vespa application package
Create a Vespa application package. The model expects two document attributes, numeric and categorical. We can use the model in first-phase ranking through the lightgbm rank feature.
[7]:
from vespa.package import ApplicationPackage, Field, RankProfile, Function
app_package = ApplicationPackage(name="lightgbm")
app_package.schema.add_fields(
Field(name="id", type="string", indexing=["summary", "attribute"]),
Field(name="numeric", type="double", indexing=["summary", "attribute"]),
Field(name="categorical", type="string", indexing=["summary", "attribute"])
)
app_package.schema.add_rank_profile(
RankProfile(
name="classify",
first_phase="lightgbm('lightgbm_model.json')"
)
)
We can check what the generated Vespa schema definition file looks like:
[8]:
print(app_package.schema.schema_to_text)
schema lightgbm {
document lightgbm {
field id type string {
indexing: summary | attribute
}
field numeric type double {
indexing: summary | attribute
}
field categorical type string {
indexing: summary | attribute
}
}
rank-profile classify {
first-phase {
expression {
lightgbm('lightgbm_model.json')
}
}
}
}
We can export the application package files to disk:
[9]:
from pathlib import Path
Path("lightgbm").mkdir(parents=True, exist_ok=True)
app_package.to_files("lightgbm")
Note that we don't have any models under the models folder yet. We need to export the LightGBM model that we trained earlier to models/lightgbm_model.json.
[10]:
!tree lightgbm
lightgbm
├── files
├── models
├── schemas
│   └── lightgbm.sd
├── search
│   └── query-profiles
│       ├── default.xml
│       └── types
│           └── root.xml
└── services.xml
7 directories, 4 files
Export the model
[11]:
with open("lightgbm/models/lightgbm_model.json", "w") as f:
json.dump(model.dump_model(), f, indent=2)
Now we can see that the model is where Vespa expects it to be:
[12]:
!tree lightgbm
lightgbm
├── files
├── models
│   └── lightgbm_model.json
├── schemas
│   └── lightgbm.sd
├── search
│   └── query-profiles
│       ├── default.xml
│       └── types
│           └── root.xml
└── services.xml
7 directories, 5 files
Deploy the application
Deploy the application package from disk with Docker:
[13]:
from vespa.deployment import VespaDocker
vespa_docker = VespaDocker()
app = vespa_docker.deploy_from_disk(application_name="lightgbm", application_root="lightgbm")
Waiting for configuration server, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 10/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Application is up!
Finished deployment.
Feed the data
Feed the simulated data. To feed data in batches, create a list of dictionaries, each containing the id and fields keys:
[14]:
feed_batch = [
{
"id": idx,
"fields": {
"id": idx,
"numeric": row["attribute(numeric)"],
"categorical": row["attribute(categorical)"]
}
} for idx, row in features.iterrows()
]
Feed the batch of data:
[15]:
from vespa.io import VespaResponse
def callback(response:VespaResponse, id:str):
if not response.is_successful():
print(f"Document {id} was not fed to Vespa due to error: {response.get_json()}")
app.feed_iterable(feed_batch, callback=callback)
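The callback logic can be exercised without a running Vespa instance by standing in a stub for VespaResponse; a minimal sketch (StubResponse and the sample ids below are hypothetical, for illustration only):

```python
# Stub standing in for vespa.io.VespaResponse, so the feed callback
# can be exercised without a running Vespa instance.
class StubResponse:
    def __init__(self, ok):
        self._ok = ok

    def is_successful(self):
        return self._ok

    def get_json(self):
        return {"error": "stubbed failure"}


failed_ids = []


def callback(response, id):
    # Record the ids of documents that failed to feed.
    if not response.is_successful():
        failed_ids.append(id)


# Simulate feeding three documents, one of which fails.
for doc_id, ok in [("0", True), ("1", False), ("2", True)]:
    callback(StubResponse(ok), doc_id)

print(failed_ids)  # ['1']
```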
Model predictions
Predict with the trained LightGBM model so that we can later compare with the predictions returned by Vespa.
[16]:
features["model_prediction"] = model.predict(features)
[17]:
features
[17]:
|   | query(value) | attribute(numeric) | attribute(categorical) | model_prediction |
|---|---|---|---|---|
| 0 | 0.437748 | 0.442222 | c | 0.645663 |
| 1 | 0.957135 | 0.323047 | b | 0.645663 |
| 2 | 0.514168 | 0.426117 | a | 0.354024 |
| 3 | 0.713511 | 0.886630 | b | 0.645663 |
| 4 | 0.626918 | 0.663179 | c | 0.645663 |
| ... | ... | ... | ... | ... |
| 95 | 0.208583 | 0.103319 | c | 0.352136 |
| 96 | 0.882902 | 0.224213 | c | 0.645663 |
| 97 | 0.604831 | 0.675583 | a | 0.354024 |
| 98 | 0.278674 | 0.008019 | b | 0.352136 |
| 99 | 0.417318 | 0.616241 | b | 0.645663 |
100 rows × 4 columns
Query
Create a compute_vespa_relevance function that takes a document id and a query value and returns the relevance score computed by the deployed LightGBM model.
[18]:
def compute_vespa_relevance(id_value:int):
hits = app.query(
body={
"yql": "select * from sources * where id = {}".format(str(id_value)),
"ranking": "classify",
"ranking.features.query(value)": features.loc[id_value, "query(value)"],
"hits": 1
}
).hits
return hits[0]["relevance"]
compute_vespa_relevance(id_value=0)
[18]:
0.645662636917761
Loop through features to compute a Vespa prediction for each data point, so that we can compare them to the predictions made by the model outside Vespa.
[19]:
vespa_relevance = []
for idx, row in features.iterrows():
vespa_relevance.append(compute_vespa_relevance(id_value=idx))
features["vespa_relevance"] = vespa_relevance
[20]:
features
[20]:
|   | query(value) | attribute(numeric) | attribute(categorical) | model_prediction | vespa_relevance |
|---|---|---|---|---|---|
| 0 | 0.437748 | 0.442222 | c | 0.645663 | 0.645663 |
| 1 | 0.957135 | 0.323047 | b | 0.645663 | 0.645663 |
| 2 | 0.514168 | 0.426117 | a | 0.354024 | 0.354024 |
| 3 | 0.713511 | 0.886630 | b | 0.645663 | 0.645663 |
| 4 | 0.626918 | 0.663179 | c | 0.645663 | 0.645663 |
| ... | ... | ... | ... | ... | ... |
| 95 | 0.208583 | 0.103319 | c | 0.352136 | 0.352136 |
| 96 | 0.882902 | 0.224213 | c | 0.645663 | 0.645663 |
| 97 | 0.604831 | 0.675583 | a | 0.354024 | 0.354024 |
| 98 | 0.278674 | 0.008019 | b | 0.352136 | 0.352136 |
| 99 | 0.417318 | 0.616241 | b | 0.645663 | 0.645663 |
100 rows × 5 columns
Compare model and Vespa predictions
Predictions from the model should be equal to predictions from Vespa, showing the model was correctly deployed to Vespa.
[21]:
assert features["model_prediction"].tolist() == features["vespa_relevance"].tolist()
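Exact list equality works here because Vespa evaluates the very same model JSON, but floating-point comparisons are generally safer with a tolerance. A minimal sketch using math.isclose (the prediction lists below are hypothetical stand-ins for the two dataframe columns):

```python
import math

# Hypothetical prediction lists standing in for the model_prediction
# and vespa_relevance columns of the dataframe above.
model_prediction = [0.645663, 0.354024, 0.352136]
vespa_relevance = [0.645663, 0.354024, 0.352136]

# Compare element-wise within a small absolute tolerance instead of
# requiring bit-for-bit equality.
all_close = all(
    math.isclose(m, v, abs_tol=1e-9)
    for m, v in zip(model_prediction, vespa_relevance)
)
print(all_close)  # True
```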
Clean environment
[22]:
!rm -fr lightgbm
vespa_docker.container.stop()
vespa_docker.container.remove()