BGE-M3 - The Mother of all embedding models
BAAI released BGE-M3 on January 30th, a new member of the BGE model series.
M3 stands for Multi-linguality (100+ languages), Multi-granularities (input length up to 8192), Multi-Functionality (unification of dense, lexical, multi-vec (colbert) retrieval).
This notebook demonstrates how to use the BGE-M3 embeddings and represent all three embedding representations in Vespa! Vespa is the only scalable serving engine that can handle all M3 representations.
This code is inspired by the README from the model hub BAAI/bge-m3.
Let's get started! First, install dependencies:
[ ]:
!pip3 install -U pyvespa FlagEmbedding
Explore the multiple representations of M3
When encoding text, we can ask for the representations we want:
Sparse vectors with weights for the token IDs (from the multilingual tokenization process)
Dense (DPR) regular text embeddings
Multi-Dense (ColBERT) - contextualized multi-token vectors
Let us dive into it. To run this model on the CPU, we set use_fp16 to False. For GPU inference, use_fp16=True is recommended because it speeds up inference.
[ ]:
from FlagEmbedding import BGEM3FlagModel
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=False)
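The use_fp16 choice described above can also be derived from the hardware at runtime. The following is a minimal sketch, assuming PyTorch is the backend in use. The `pick_use_fp16` helper is a hypothetical name introduced here and is not part of FlagEmbedding.

```python
def pick_use_fp16() -> bool:
    """Return True only when a CUDA GPU is visible to PyTorch."""
    try:
        import torch  # assumption: PyTorch is installed alongside FlagEmbedding
    except ImportError:
        # Without PyTorch there is no GPU path, so fall back to full precision.
        return False
    return bool(torch.cuda.is_available())


use_fp16 = pick_use_fp16()
print(f"use_fp16={use_fp16}")
# The flag can then be passed on, for example:
# model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=use_fp16)
```

On a CPU-only machine this prints `use_fp16=False`, which matches the setting used in this notebook.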
A demo passage
Let us encode a simple passage:
[3]:
passage = [
"BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction."
]
[ ]:
passage_embeddings = model.encode(
passage, return_dense=True, return_sparse=True, return_colbert_vecs=True
)
[5]:
passage_embeddings.keys()
[5]:
dict_keys(['dense_vecs', 'lexical_weights', 'colbert_vecs'])
Defining the Vespa application
PyVespa helps us build the Vespa application package. A Vespa application package consists of configuration files, schemas, models, and code (plugins).
First, we define a Vespa schema with the fields we want to store and their type. We use Vespa tensors to represent the three different M3 representations.
We use a mapped tensor, denoted by t{}, to represent the sparse lexical representation.
We use an indexed tensor, denoted by x[1024], to represent the dense single-vector representation of 1024 dimensions.
For the colbert_rep (multi-vector), we use a mixed tensor that combines a mapped and an indexed dimension. This mixed tensor allows us to represent variable lengths.
We use the bfloat16 tensor cell type, saving 50% storage compared to float.
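To make the bfloat16 trade-off concrete, here is a small sketch that counts raw cell bytes for a 1024-dimensional vector and emulates the precision loss in pure Python. It assumes simple truncation of a 32-bit float to its top 16 bits; real converters may round to nearest instead, and the byte counts cover tensor cells only, not any storage overhead. `to_bfloat16` is a hypothetical helper name.

```python
import struct


def to_bfloat16(value: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of a float32.

    Simplification: this truncates; real converters may round to nearest.
    """
    top_two_bytes = struct.pack(">f", value)[:2]
    return struct.unpack(">f", top_two_bytes + b"\x00\x00")[0]


DIMENSIONS = 1024
float_bytes = DIMENSIONS * 4     # float: 4 bytes per cell
bfloat16_bytes = DIMENSIONS * 2  # bfloat16: 2 bytes per cell

print(float_bytes, bfloat16_bytes)  # 4096 2048
print(to_bfloat16(1.0))             # 1.0 is exactly representable
print(to_bfloat16(0.1))             # slightly below 0.1 after truncation
```

The halved cell size is where the 50% saving comes from. The price is a small loss of precision in the stored values.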
[6]:
from vespa.package import Schema, Document, Field, FieldSet
m_schema = Schema(
name="m",
document=Document(
fields=[
Field(name="id", type="string", indexing=["summary"]),
Field(
name="text",
type="string",
indexing=["summary", "index"],
index="enable-bm25",
),
Field(
name="lexical_rep",
type="tensor<bfloat16>(t{})",
indexing=["summary", "attribute"],
),
Field(
name="dense_rep",
type="tensor<bfloat16>(x[1024])",
indexing=["summary", "attribute"],
attribute=["distance-metric: angular"],
),
Field(
name="colbert_rep",
type="tensor<bfloat16>(t{}, x[1024])",
indexing=["summary", "attribute"],
),
],
),
fieldsets=[FieldSet(name="default", fields=["text"])],
)
The above defines our m schema with the original text and the three different representations.
[7]:
from vespa.package import ApplicationPackage
vespa_app_name = "m"
vespa_application_package = ApplicationPackage(name=vespa_app_name, schema=[m_schema])
In the last step, we configure ranking by adding rank-profiles to the schema.
We define three functions that implement the scoring for the different representations:
dense (dense cosine similarity)
lexical (sparse dot product)
max_sim (the colbert max sim operation)
Then, we combine these three scoring functions using a linear combination with weights, as suggested by the authors here.
[8]:
from vespa.package import RankProfile, Function, FirstPhaseRanking
semantic = RankProfile(
name="m3hybrid",
inputs=[
("query(q_dense)", "tensor<bfloat16>(x[1024])"),
("query(q_lexical)", "tensor<bfloat16>(t{})"),
("query(q_colbert)", "tensor<bfloat16>(qt{}, x[1024])"),
("query(q_len_colbert)", "float"),
],
functions=[
Function(
name="dense",
expression="cosine_similarity(query(q_dense), attribute(dense_rep),x)",
),
Function(
name="lexical", expression="sum(query(q_lexical) * attribute(lexical_rep))"
),
Function(
name="max_sim",
expression="sum(reduce(sum(query(q_colbert) * attribute(colbert_rep) , x),max, t),qt)/query(q_len_colbert)",
),
],
first_phase=FirstPhaseRanking(
expression="0.4*dense + 0.2*lexical + 0.4*max_sim", rank_score_drop_limit=0.0
),
match_features=["dense", "lexical", "max_sim", "bm25(text)"],
)
m_schema.add_rank_profile(semantic)
The m3hybrid rank-profile above defines the query input tensor types and three functions, written as Vespa tensor expressions, that calculate the M3 similarities: dense, lexical, and max_sim for the colbert representations.
The profile defines a single ranking phase that combines these three features linearly, using the suggested weights.
Using match-features, Vespa returns selected features along with the hit in the SERP (result page). We also include BM25. We can view BM25 as the fourth dimension. Especially for long-context retrieval, it can be helpful compared to the neural representations.
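The three scoring functions and their weighted combination can be sketched in plain Python to show what the tensor expressions compute. This is an illustrative reference under stated assumptions, not the Vespa implementation: the helper names mirror the rank-profile functions, and the 2-dimensional toy vectors are invented for the example.

```python
import math


def dense(q, d):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm


def lexical(q_weights, d_weights):
    """Sparse dot product over the token ids present in both maps."""
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)


def max_sim(q_vecs, d_vecs):
    """For each query token vector, take the best dot product against all
    document token vectors, then average over the query tokens."""
    best = [max(sum(a * b for a, b in zip(qv, dv)) for dv in d_vecs) for qv in q_vecs]
    return sum(best) / len(q_vecs)


def m3hybrid(q, d):
    """Linear combination with the weights used in the first-phase expression."""
    return (
        0.4 * dense(q["dense"], d["dense"])
        + 0.2 * lexical(q["lexical"], d["lexical"])
        + 0.4 * max_sim(q["colbert"], d["colbert"])
    )


# Toy data, invented for illustration only.
toy_query = {"dense": [1.0, 0.0], "lexical": {"101": 0.5, "7": 0.2}, "colbert": [[1.0, 0.0], [0.0, 1.0]]}
toy_doc = {"dense": [1.0, 0.0], "lexical": {"101": 0.4}, "colbert": [[1.0, 0.0], [0.6, 0.8]]}
print(round(m3hybrid(toy_query, toy_doc), 6))  # 0.8
```

Here dense is 1.0, lexical is 0.2, and max_sim is 0.9, so the combined score is 0.4 + 0.04 + 0.36 = 0.8. Dividing by the number of query token vectors plays the role of query(q_len_colbert) in the rank-profile.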
Deploy the application to Vespa Cloud
With the configured application, we can deploy it to Vespa Cloud. It is also possible to deploy the app using docker; see the Hybrid Search - Quickstart guide for an example of deploying it to a local docker container.
Install the Vespa CLI using homebrew - or download a binary from GitHub as demonstrated below.
[ ]:
!brew install vespa-cli
Alternatively, if running in Colab, download the Vespa CLI:
[ ]:
import os
import requests
res = requests.get(
url="https://api.github.com/repos/vespa-engine/vespa/releases/latest"
).json()
os.environ["VERSION"] = res["tag_name"].replace("v", "")
!curl -fsSL https://github.com/vespa-engine/vespa/releases/download/v${VERSION}/vespa-cli_${VERSION}_linux_amd64.tar.gz | tar -zxf -
!ln -sf /content/vespa-cli_${VERSION}_linux_amd64/bin/vespa /bin/vespa
To deploy the application to Vespa Cloud we need to create a tenant in the Vespa Cloud:
Create a tenant at console.vespa-cloud.com (unless you already have one). This step requires a Google or GitHub account, and will start your free trial. Make note of the tenant name, it is used in the next steps.
Configure Vespa Cloud data-plane security
Create Vespa Cloud data-plane mTLS cert/key-pair. The mutual certificate pair is used to talk to your Vespa cloud endpoints. See Vespa Cloud Security Guide for details.
We save the paths to the credentials for later data-plane access without using pyvespa APIs.
[ ]:
import os
os.environ["TENANT_NAME"] = "vespa-team" # Replace with your tenant name
vespa_cli_command = (
f'vespa config set application {os.environ["TENANT_NAME"]}.{vespa_app_name}'
)
!vespa config set target cloud
!{vespa_cli_command}
!vespa auth cert -N
Validate that we have the expected data-plane credential files:
[10]:
from os.path import exists
from pathlib import Path
cert_path = (
Path.home()
/ ".vespa"
/ f"{os.environ['TENANT_NAME']}.{vespa_app_name}.default/data-plane-public-cert.pem"
)
key_path = (
Path.home()
/ ".vespa"
/ f"{os.environ['TENANT_NAME']}.{vespa_app_name}.default/data-plane-private-key.pem"
)
if not exists(cert_path) or not exists(key_path):
    print(
        "ERROR: set the correct paths to security credentials. Correct paths above and rerun until you do not see this error"
    )
Note that the subsequent Vespa Cloud deploy call below will add data-plane-public-cert.pem to the application before deploying it to Vespa Cloud, so that you have access to both the private key and the public certificate. At the same time, Vespa Cloud only knows the public certificate.
Configure Vespa Cloud control-plane security
Authenticate to generate a tenant-level control-plane API key for deploying applications to Vespa Cloud, and save the path to it.
The generated tenant API key must be added in the Vespa Console before attempting to deploy the application.
To use this key in Vespa Cloud click 'Add custom key' at
https://console.vespa-cloud.com/tenant/TENANT_NAME/account/keys
and paste the entire public key including the BEGIN and END lines.
[ ]:
!vespa auth api-key
from pathlib import Path
api_key_path = Path.home() / ".vespa" / f"{os.environ['TENANT_NAME']}.api-key.pem"
Deploy to Vespa Cloud
Now that we have data-plane and control-plane credentials ready, we can deploy our application to Vespa Cloud!
PyVespa supports deploying apps to the development zone.
Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.
[13]:
from vespa.deployment import VespaCloud
def read_secret():
    """Read the API key from the environment variable. This is
    only used for CI/CD purposes."""
    t = os.getenv("VESPA_TEAM_API_KEY")
    if t:
        # Restore real line breaks that were stored as literal "\n"
        return t.replace(r"\n", "\n")
    else:
        return t
vespa_cloud = VespaCloud(
tenant=os.environ["TENANT_NAME"],
application=vespa_app_name,
key_content=read_secret() if read_secret() else None,
key_location=api_key_path,
application_package=vespa_application_package,
)
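The read_secret helper above replaces literal backslash-n sequences with real line breaks, which matters when a multi-line PEM key is stored in a single-line environment variable. A minimal sketch of that step, using placeholder content rather than a real key (`unescape_newlines` is a hypothetical name):

```python
def unescape_newlines(secret: str) -> str:
    """Turn literal backslash-n sequences back into real line breaks."""
    return secret.replace(r"\n", "\n")


# Placeholder content, not a real key.
stored = r"-----BEGIN PRIVATE KEY-----\nPLACEHOLDER\n-----END PRIVATE KEY-----"
restored = unescape_newlines(stored)
print(restored.splitlines())
```

After the replacement, the value again has the BEGIN line, the body, and the END line on separate lines.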
Now deploy the app to Vespa Cloud dev zone.
The first deployment typically takes 2 minutes until the endpoint is up.
[14]:
from vespa.application import Vespa
app: Vespa = vespa_cloud.deploy()
Deployment started in run 1 of dev-aws-us-east-1c for samples.m. This may take a few minutes the first time.
INFO [22:13:09] Deploying platform version 8.299.14 and application dev build 1 for dev-aws-us-east-1c of default ...
INFO [22:13:10] Using CA signed certificate version 0
INFO [22:13:10] Using 1 nodes in container cluster 'm_container'
INFO [22:13:14] Session 939 for tenant 'samples' prepared and activated.
INFO [22:13:17] ######## Details for all nodes ########
INFO [22:13:31] h88976d.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO [22:13:31] --- platform vespa/cloud-tenant-rhel8:8.299.14 <-- :
INFO [22:13:31] --- container-clustercontroller on port 19050 has not started
INFO [22:13:31] --- metricsproxy-container on port 19092 has not started
INFO [22:13:31] h89388b.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO [22:13:31] --- platform vespa/cloud-tenant-rhel8:8.299.14 <-- :
INFO [22:13:31] --- storagenode on port 19102 has not started
INFO [22:13:31] --- searchnode on port 19107 has not started
INFO [22:13:31] --- distributor on port 19111 has not started
INFO [22:13:31] --- metricsproxy-container on port 19092 has not started
INFO [22:13:31] h90001a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO [22:13:31] --- platform vespa/cloud-tenant-rhel8:8.299.14 <-- :
INFO [22:13:31] --- logserver-container on port 4080 has not started
INFO [22:13:31] --- metricsproxy-container on port 19092 has not started
INFO [22:13:31] h90550a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO [22:13:31] --- platform vespa/cloud-tenant-rhel8:8.299.14 <-- :
INFO [22:13:31] --- container on port 4080 has not started
INFO [22:13:31] --- metricsproxy-container on port 19092 has not started
INFO [22:14:31] Found endpoints:
INFO [22:14:31] - dev.aws-us-east-1c
INFO [22:14:31] |-- https://d29bf3e7.f064e220.z.vespa-app.cloud/ (cluster 'm_container')
INFO [22:14:32] Installation succeeded!
Using mTLS (key,cert) Authentication against endpoint https://d29bf3e7.f064e220.z.vespa-app.cloud//ApplicationStatus
Application is up!
Finished deployment.
Feed the M3 representations
We convert the three different representations to Vespa feed format:
[15]:
vespa_fields = {
"text": passage[0],
"lexical_rep": {
key: float(value)
for key, value in passage_embeddings["lexical_weights"][0].items()
},
"dense_rep": passage_embeddings["dense_vecs"][0].tolist(),
"colbert_rep": {
index: passage_embeddings["colbert_vecs"][0][index].tolist()
for index in range(passage_embeddings["colbert_vecs"][0].shape[0])
},
}
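The same conversion can be wrapped in a small helper and tried on toy data to see the shape of each field. This is a sketch, assuming plain Python containers as input and 2-dimensional vectors instead of the 1024 dimensions in the schema. `to_vespa_fields` is a hypothetical name, not a pyvespa function.

```python
import json


def to_vespa_fields(text, lexical_weights, dense_vec, colbert_vecs):
    """Build the document fields from plain Python containers."""
    return {
        "text": text,
        # mapped tensor t{}: token id -> weight
        "lexical_rep": {str(token): float(weight) for token, weight in lexical_weights.items()},
        # indexed tensor x[N]: a flat list of floats
        "dense_rep": [float(v) for v in dense_vec],
        # mixed tensor t{}, x[N]: token position -> vector
        "colbert_rep": {str(i): [float(v) for v in vec] for i, vec in enumerate(colbert_vecs)},
    }


# Toy values, invented for illustration only.
fields = to_vespa_fields(
    text="a toy passage",
    lexical_weights={"101": 0.25, "2057": 0.5},
    dense_vec=[0.6, 0.8],
    colbert_vecs=[[1.0, 0.0], [0.0, 1.0]],
)
print(json.dumps(fields, indent=2))
```

Each token position becomes a label in the mapped dimension, which is how a variable number of vectors fits into one field.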
[17]:
from vespa.io import VespaResponse
response: VespaResponse = app.feed_data_point(
schema="m", data_id=0, fields=vespa_fields
)
assert response.is_successful()
Querying data
Now, we can also query our data.
Read more about querying Vespa in:
[ ]:
query = ["What is BGE M3?"]
query_embeddings = model.encode(
query, return_dense=True, return_sparse=True, return_colbert_vecs=True
)
The M3 colbert scoring function needs the query length to normalize the score to the range 0 to 1. This helps when combining the score with the other scoring functions.
[19]:
query_length = query_embeddings["colbert_vecs"][0].shape[0]
[20]:
query_fields = {
"input.query(q_lexical)": {
key: float(value)
for key, value in query_embeddings["lexical_weights"][0].items()
},
"input.query(q_dense)": query_embeddings["dense_vecs"][0].tolist(),
"input.query(q_colbert)": str(
{
index: query_embeddings["colbert_vecs"][0][index].tolist()
for index in range(query_embeddings["colbert_vecs"][0].shape[0])
}
),
"input.query(q_len_colbert)": query_length,
}
[21]:
from vespa.io import VespaQueryResponse
import json
response: VespaQueryResponse = app.query(
yql="select id, text from m where userQuery() or ({targetHits:10}nearestNeighbor(dense_rep,q_dense))",
ranking="m3hybrid",
query=query[0],
body={**query_fields},
)
assert response.is_successful()
print(json.dumps(response.hits[0], indent=2))
{
"id": "index:m_content/0/cfcd2084234135f700f08abf",
"relevance": 0.5993361056332731,
"source": "m_content",
"fields": {
"matchfeatures": {
"bm25(text)": 0.8630462173553426,
"dense": 0.6258970723760484,
"lexical": 0.1941967010498047,
"max_sim": 0.7753448411822319
},
"text": "BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction."
}
}
Notice the matchfeatures field, which returns the configured match-features from the rank-profile. We can use these to compare the torch model scoring with the computations specified in Vespa.
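As a quick sanity check, the relevance of the hit should equal the first-phase expression applied to the returned match-features. The sketch below copies the values from the response printed above and recomputes the weighted sum:

```python
# Values copied from the query response shown above.
match_features = {
    "dense": 0.6258970723760484,
    "lexical": 0.1941967010498047,
    "max_sim": 0.7753448411822319,
}
relevance = 0.5993361056332731

# First-phase expression: 0.4*dense + 0.2*lexical + 0.4*max_sim
recomputed = (
    0.4 * match_features["dense"]
    + 0.2 * match_features["lexical"]
    + 0.4 * match_features["max_sim"]
)
print(recomputed)
print(abs(recomputed - relevance) < 1e-9)  # True
```

bm25(text) is returned as a match-feature but is not part of the first-phase expression, so it does not contribute to the relevance score.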
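Now, we can compare the Vespa computed scores with the scores from the model's torch code. They agree closely: dense matches to about five decimal places, while lexical and max_sim differ in the third decimal place, which is likely due to the reduced precision of the bfloat16 cell type.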
[22]:
model.compute_lexical_matching_score(
passage_embeddings["lexical_weights"][0], query_embeddings["lexical_weights"][0]
)
[22]:
0.19554455392062664
[23]:
query_embeddings["dense_vecs"][0] @ passage_embeddings["dense_vecs"][0].T
[23]:
0.6259037
[24]:
model.colbert_score(
query_embeddings["colbert_vecs"][0], passage_embeddings["colbert_vecs"][0]
)
[24]:
tensor(0.7797)
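The size of the gap between the two sets of scores can be made explicit. This sketch uses only the numbers printed in this notebook. The max_sim value from torch is taken as displayed, that is, rounded to four decimals.

```python
# Scores returned by Vespa in matchfeatures.
vespa_scores = {
    "dense": 0.6258970723760484,
    "lexical": 0.1941967010498047,
    "max_sim": 0.7753448411822319,
}
# Scores computed with the model code above (max_sim as displayed).
torch_scores = {
    "dense": 0.6259037,
    "lexical": 0.19554455392062664,
    "max_sim": 0.7797,
}

diffs = {name: abs(vespa_scores[name] - torch_scores[name]) for name in vespa_scores}
for name, diff in diffs.items():
    print(f"{name}: absolute difference {diff:.4f}")
```

All three differences stay below 0.005.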
That is it!
That is how easy it is to represent the brand new M3 FlagEmbedding representations in Vespa! Read more in the M3 technical report.
We can now delete the Vespa Cloud instance we deployed:
[ ]:
vespa_cloud.delete()