# Vespa

LightGBM: Training the model with Vespa features

The main goal of this tutorial is to deploy and use a LightGBM model in a Vespa application. The following tasks will be accomplished throughout the tutorial:

  1. Train a LightGBM classification model with variable names supported by Vespa.

  2. Create Vespa application package files and export them to an application folder.

  3. Export the trained LightGBM model to the Vespa application folder.

  4. Deploy the Vespa application using the application folder.

  5. Feed data to the Vespa application.

  6. Assert that the LightGBM predictions from the deployed model are correct.

Refer to the troubleshooting guide if you encounter problems while running this tutorial.

Setup

Install and load required packages.

[ ]:
!pip3 install numpy pandas pyvespa lightgbm
[3]:
import json
import lightgbm as lgb
import numpy as np
import pandas as pd

Create data

Generate a toy dataset to follow along. Note that we set the column names in a format that Vespa understands: query(value) means that the user will send a parameter named value along with the query, and attribute(field) means that field is a document attribute defined in a schema. In the example below we have one query parameter named value and two document attributes, numeric and categorical. If we want LightGBM to handle categorical variables, we must use dtype="category" when creating the DataFrame, as shown below.

[4]:
# Create random training set
features = pd.DataFrame({
            "query(value)": np.random.random(100),
            "attribute(numeric)": np.random.random(100),
            "attribute(categorical)": pd.Series(np.random.choice(["a", "b", "c"], size=100), dtype="category")
        })
features.head()
[4]:
query(value) attribute(numeric) attribute(categorical)
0 0.437748 0.442222 c
1 0.957135 0.323047 b
2 0.514168 0.426117 a
3 0.713511 0.886630 b
4 0.626918 0.663179 c

We generate the target variable as a function of the three features defined above:

[5]:
numeric_features = pd.get_dummies(features)
targets = (
    (numeric_features["query(value)"] +
     numeric_features["attribute(numeric)"]  -
     0.5 * numeric_features["attribute(categorical)_a"] +
     0.5 * numeric_features["attribute(categorical)_c"]) > 1.0
) * 1.0
targets
[5]:
0     1.0
1     1.0
2     0.0
3     1.0
4     1.0
     ...
95    0.0
96    1.0
97    0.0
98    0.0
99    1.0
Length: 100, dtype: float64
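
As a sanity check on the rule above, we can recompute the target for the first row by hand. A small verification sketch, using the numeric_features frame defined above:

[ ]:
# Manual check of the target rule for row 0:
# 0.437748 (query value) + 0.442222 (numeric) + 0.5 (category "c") > 1.0, so the target is 1.0
row0 = numeric_features.iloc[0]
score = (row0["query(value)"] + row0["attribute(numeric)"]
         - 0.5 * row0["attribute(categorical)_a"]
         + 0.5 * row0["attribute(categorical)_c"])
print(score, float(score > 1.0))  # ~1.38, 1.0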

Fit LightGBM model

Train a LightGBM model with a binary loss function:

[6]:
training_set = lgb.Dataset(features, targets)

# Train the model
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 3,
}
model = lgb.train(params, training_set, num_boost_round=5)
[LightGBM] [Info] Number of positive: 48, number of negative: 52
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000484 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 74
[LightGBM] [Info] Number of data points in the train set: 100, number of used features: 3
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.480000 -> initscore=-0.080043
[LightGBM] [Info] Start training from score -0.080043
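
Before exporting, it is worth checking that the Vespa-style feature names survived training. A quick inspection of the model dump, relying on the standard feature_names and tree_info keys of LightGBM's dump_model() output:

[ ]:
# Inspect the dump that will later be exported to Vespa.
model_dump = model.dump_model()
print(model_dump["feature_names"])   # expect query(value), attribute(numeric), attribute(categorical)
print(len(model_dump["tree_info"]))  # one tree per boosting round (5 here)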

Vespa application package

Create a Vespa application package. The model expects two document attributes, numeric and categorical. We use the model in first-phase ranking via the lightgbm rank feature.

[7]:
from vespa.package import ApplicationPackage, Field, RankProfile, Function

app_package = ApplicationPackage(name="lightgbm")
app_package.schema.add_fields(
    Field(name="id", type="string", indexing=["summary", "attribute"]),
    Field(name="numeric", type="double", indexing=["summary", "attribute"]),
    Field(name="categorical", type="string", indexing=["summary", "attribute"])
)
app_package.schema.add_rank_profile(
    RankProfile(
        name="classify",
        first_phase="lightgbm('lightgbm_model.json')"
    )
)

We can inspect what the resulting Vespa schema definition file will look like:

[8]:
print(app_package.schema.schema_to_text)
schema lightgbm {
    document lightgbm {
        field id type string {
            indexing: summary | attribute
        }
        field numeric type double {
            indexing: summary | attribute
        }
        field categorical type string {
            indexing: summary | attribute
        }
    }
    rank-profile classify {
        first-phase {
            expression {
                lightgbm('lightgbm_model.json')
            }
        }
    }
}

We can export the application package files to disk:

[9]:
from pathlib import Path
Path("lightgbm").mkdir(parents=True, exist_ok=True)
app_package.to_files("lightgbm")

Note that there are no models under the models folder yet. We need to export the LightGBM model that we trained earlier to models/lightgbm_model.json.

[10]:
!tree lightgbm
lightgbm
β”œβ”€β”€ files
β”œβ”€β”€ models
β”œβ”€β”€ schemas
β”‚Β Β  └── lightgbm.sd
β”œβ”€β”€ search
β”‚Β Β  └── query-profiles
β”‚Β Β      β”œβ”€β”€ default.xml
β”‚Β Β      └── types
β”‚Β Β          └── root.xml
└── services.xml

7 directories, 4 files

Export the model

[11]:
with open("lightgbm/models/lightgbm_model.json", "w") as f:
    json.dump(model.dump_model(), f, indent=2)

Now we can see that the model is where Vespa expects it to be:

[12]:
!tree lightgbm
lightgbm
β”œβ”€β”€ files
β”œβ”€β”€ models
β”‚Β Β  └── lightgbm_model.json
β”œβ”€β”€ schemas
β”‚Β Β  └── lightgbm.sd
β”œβ”€β”€ search
β”‚Β Β  └── query-profiles
β”‚Β Β      β”œβ”€β”€ default.xml
β”‚Β Β      └── types
β”‚Β Β          └── root.xml
└── services.xml

7 directories, 5 files

Deploy the application

Deploy the application package from disk with Docker:

[13]:
from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy_from_disk(application_name="lightgbm", application_root="lightgbm")
Waiting for configuration server, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 10/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Application is up!
Finished deployment.

Feed the data

Feed the simulated data. To feed data in batch, we create a list of dictionaries containing id and fields keys:

[14]:
feed_batch = [
    {
        "id": idx,
        "fields": {
            "id": idx,
            "numeric": row["attribute(numeric)"],
            "categorical": row["attribute(categorical)"]
        }
    } for idx, row in features.iterrows()
]

Feed the batch of data:

[15]:
from vespa.io import VespaResponse

def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(f"Document {id} was not fed to Vespa due to error: {response.get_json()}")

app.feed_iterable(feed_batch, callback=callback)
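
To spot-check that a document actually landed in Vespa, we can fetch it back by id. A minimal sketch, assuming pyvespa's get_data accepts the schema name and document id as keyword arguments:

[ ]:
# Fetch document 0 back from Vespa and print its stored fields.
response = app.get_data(schema="lightgbm", data_id="0")
print(response.get_json())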

Model predictions

Predict with the trained LightGBM model so that we can later compare with the predictions returned by Vespa.

[16]:
features["model_prediction"] = model.predict(features)
[17]:
features
[17]:
query(value) attribute(numeric) attribute(categorical) model_prediction
0 0.437748 0.442222 c 0.645663
1 0.957135 0.323047 b 0.645663
2 0.514168 0.426117 a 0.354024
3 0.713511 0.886630 b 0.645663
4 0.626918 0.663179 c 0.645663
... ... ... ... ...
95 0.208583 0.103319 c 0.352136
96 0.882902 0.224213 c 0.645663
97 0.604831 0.675583 a 0.354024
98 0.278674 0.008019 b 0.352136
99 0.417318 0.616241 b 0.645663

100 rows Γ— 4 columns

Query

Create a compute_vespa_relevance function that takes a document id, queries Vespa with the matching query(value), and returns the relevance score computed by the deployed LightGBM model.

[18]:
def compute_vespa_relevance(id_value:int):
    hits = app.query(
        body={
            "yql": "select * from sources * where id = {}".format(str(id_value)),
            "ranking": "classify",
            "ranking.features.query(value)": features.loc[id_value, "query(value)"],
            "hits": 1
        }
    ).hits
    return hits[0]["relevance"]

compute_vespa_relevance(id_value=0)
[18]:
0.645662636917761

Loop through the features to compute a Vespa prediction for every data point, so that we can compare them to the predictions made by the model outside Vespa.

[19]:
vespa_relevance = []
for idx, row in features.iterrows():
    vespa_relevance.append(compute_vespa_relevance(id_value=idx))
features["vespa_relevance"] = vespa_relevance
[20]:
features
[20]:
query(value) attribute(numeric) attribute(categorical) model_prediction vespa_relevance
0 0.437748 0.442222 c 0.645663 0.645663
1 0.957135 0.323047 b 0.645663 0.645663
2 0.514168 0.426117 a 0.354024 0.354024
3 0.713511 0.886630 b 0.645663 0.645663
4 0.626918 0.663179 c 0.645663 0.645663
... ... ... ... ... ...
95 0.208583 0.103319 c 0.352136 0.352136
96 0.882902 0.224213 c 0.645663 0.645663
97 0.604831 0.675583 a 0.354024 0.354024
98 0.278674 0.008019 b 0.352136 0.352136
99 0.417318 0.616241 b 0.645663 0.645663

100 rows Γ— 5 columns

Compare model and Vespa predictions

Predictions from the model should be identical to the predictions returned by Vespa, showing that the model was deployed correctly.

[21]:
assert features["model_prediction"].tolist() == features["vespa_relevance"].tolist()
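
Exact floating-point equality worked here, but it can be brittle across library versions and platforms. If the assert above ever fails on tiny rounding differences, a tolerance-based comparison is a safer alternative. A minimal sketch using NumPy:

[ ]:
# Compare predictions within a small numerical tolerance instead of exact equality.
np.testing.assert_allclose(
    features["model_prediction"].to_numpy(),
    features["vespa_relevance"].to_numpy(),
    rtol=1e-9,
)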

Clean environment

[22]:
!rm -fr lightgbm
vespa_docker.container.stop()
vespa_docker.container.remove()