
BGE-M3 - The Mother of all embedding models

BAAI released BGE-M3 on January 30th, a new member of the BGE model series.

M3 stands for Multi-linguality (100+ languages), Multi-granularities (input length up to 8192), Multi-Functionality (unification of dense, lexical, multi-vec (colbert) retrieval).

This notebook demonstrates how to use the BGE-M3 embeddings and represent all three embedding representations in Vespa! Vespa is the only scalable serving engine that can handle all M3 representations.

This code is inspired by the README from the model hub BAAI/bge-m3.

Let’s get started! First, install dependencies:

!pip3 install -U pyvespa FlagEmbedding

Explore the multiple representations of M3

When encoding text, we can ask for the representations we want:

  • Sparse vectors with weights for the token IDs (from the multilingual tokenization process)

  • Dense (DPR) regular text embeddings

  • Multi-Dense (ColBERT) - contextualized multi-token vectors

Let us dive into it. To use this model on the CPU, we set use_fp16 to False. For GPU inference, use_fp16=True is recommended for faster inference.

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=False)

A demo passage

Let us encode a simple passage


passage = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction."]
passage_embeddings = model.encode(passage, return_dense=True, return_sparse=True, return_colbert_vecs=True)
passage_embeddings.keys()
dict_keys(['dense_vecs', 'lexical_weights', 'colbert_vecs'])

Defining the Vespa application

PyVespa helps us build the Vespa application package. A Vespa application package consists of configuration files, schemas, models, and code (plugins).

First, we define a Vespa schema with the fields we want to store and their type. We use Vespa tensors to represent the three different M3 representations.

  • We use a mapped tensor denoted by t{} to represent the sparse lexical representation

  • We use an indexed tensor denoted by x[1024] to represent the dense single vector representation of 1024 dimensions

  • For the colbert_rep (multi-vector), we use a mixed tensor that combines a mapped and an indexed dimension. This mixed tensor allows us to represent variable lengths.

We use the bfloat16 tensor cell type, which halves storage compared to float.
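The storage saving can be checked with simple arithmetic. The sketch below is illustrative only: it assumes 4 bytes per float cell and 2 bytes per bfloat16 cell, and the to_bfloat16 helper is a hypothetical name that emulates bfloat16 by truncating the float32 bit pattern. How Vespa rounds values on conversion is not stated here, so treat the truncation as a simplifying assumption.

```python
import struct

FLOAT32_BYTES = 4   # assumed size of a Vespa float tensor cell
BFLOAT16_BYTES = 2  # assumed size of a bfloat16 tensor cell
DIMS = 1024         # dimensionality of dense_rep and of each colbert_rep token vector


def vector_bytes(dims, bytes_per_cell):
    """Raw cell storage for one vector, ignoring any per-tensor overhead."""
    return dims * bytes_per_cell


def to_bfloat16(value):
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding.

    Truncation is an assumption made for illustration; real converters
    may round to nearest instead.
    """
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


float_size = vector_bytes(DIMS, FLOAT32_BYTES)
bf16_size = vector_bytes(DIMS, BFLOAT16_BYTES)
print(f"float: {float_size} bytes, bfloat16: {bf16_size} bytes per vector")

# The price of the saving is precision: only 7 explicit mantissa bits remain.
original = 0.6258970723760484
print(original, "->", to_bfloat16(original))
```

For the multi-vector field the saving applies per token, so a passage with many tokens benefits proportionally more.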

from vespa.package import Schema, Document, Field, FieldSet

m_schema = Schema(
    name="m",
    document=Document(
        fields=[
            Field(name="id", type="string", indexing=["summary"]),
            Field(name="text", type="string", indexing=["summary", "index"], index="enable-bm25"),
            Field(name="lexical_rep", type="tensor<bfloat16>(t{})", indexing=["summary", "attribute"]),
            Field(name="dense_rep", type="tensor<bfloat16>(x[1024])", indexing=["summary", "attribute"], attribute=["distance-metric: angular"]),
            Field(name="colbert_rep", type="tensor<bfloat16>(t{}, x[1024])", indexing=["summary", "attribute"]),
        ]
    ),
    fieldsets=[FieldSet(name="default", fields=["text"])],
)

The above defines our m schema with the original text and the three different representations.

from vespa.package import ApplicationPackage

vespa_app_name = "m"
vespa_application_package = ApplicationPackage(
    name=vespa_app_name,
    schema=[m_schema],
)

In the last step, we configure ranking by adding rank profiles to the schema.

We define three functions that implement the scoring for the three representations:

  • dense (dense cosine similarity)

  • lexical (sparse dot product)

  • max_sim (the ColBERT max sim operation)

Then, we combine these three scoring functions using a weighted linear combination, as suggested by the model authors.

from vespa.package import RankProfile, Function, FirstPhaseRanking

semantic = RankProfile(
    name="m3hybrid",
    inputs=[
        ("query(q_dense)", "tensor<bfloat16>(x[1024])"),
        ("query(q_lexical)", "tensor<bfloat16>(t{})"),
        ("query(q_colbert)", "tensor<bfloat16>(qt{}, x[1024])"),
        ("query(q_len_colbert)", "float"),
    ],
    functions=[
        Function(name="dense", expression="cosine_similarity(query(q_dense), attribute(dense_rep),x)"),
        Function(name="lexical", expression="sum(query(q_lexical) * attribute(lexical_rep))"),
        Function(name="max_sim", expression="sum(reduce(sum(query(q_colbert) * attribute(colbert_rep) , x),max, t),qt)/query(q_len_colbert)"),
    ],
    first_phase=FirstPhaseRanking(expression="0.4*dense + 0.2*lexical +  0.4*max_sim"),
    match_features=["dense", "lexical", "max_sim", "bm25(text)"],
)
# Attach the rank profile to the schema defined earlier
m_schema.add_rank_profile(semantic)

The m3hybrid rank profile above declares the query input tensor types and three functions, written as Vespa tensor expressions, that compute the M3 similarities: dense, lexical, and max_sim for the ColBERT representation.

The profile defines a single ranking phase that scores each hit with a linear combination of the three features, using the suggested weights.
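To make the three tensor expressions concrete, here is a minimal plain-Python sketch of what each one computes. The function names and the toy vectors are illustrative assumptions (two dimensions instead of 1024, made-up token IDs); this is not the Vespa implementation, only a mirror of the math in the rank profile.

```python
import math


def dense_score(q, d):
    """Cosine similarity between two dense vectors, like cosine_similarity(..., x)."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    return dot / (norm_q * norm_d)


def lexical_score(q_weights, d_weights):
    """Sparse dot product: only token IDs present on both sides contribute."""
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)


def max_sim_score(q_vecs, d_vecs):
    """For each query token vector, take the best dot product against all
    document token vectors, sum those maxima, and divide by the query length."""
    total = 0.0
    for q in q_vecs:
        total += max(sum(a * b for a, b in zip(q, d)) for d in d_vecs)
    return total / len(q_vecs)


# Toy inputs (hypothetical values, chosen so the results are easy to check by hand)
q_dense, d_dense = [1.0, 0.0], [1.0, 1.0]
q_lex, d_lex = {"101": 0.5, "202": 0.2}, {"202": 0.4, "303": 0.9}
q_col = [[1.0, 0.0], [0.0, 1.0]]
d_col = [[1.0, 0.0], [0.6, 0.8], [0.0, -1.0]]

print("dense  :", dense_score(q_dense, d_dense))
print("lexical:", lexical_score(q_lex, d_lex))
print("max_sim:", max_sim_score(q_col, d_col))
```

In the toy ColBERT case the first query token matches a document token exactly (1.0) and the second finds 0.8 as its best match, so the normalized score is their average.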

Using match-features, Vespa returns selected features along with the hit in the SERP (result page). We also include BM25. We can view BM25 as the fourth dimension. Especially for long-context retrieval, it can be helpful compared to the neural representations.

Deploy the application to Vespa Cloud

With the configured application, we can deploy it to Vespa Cloud. It is also possible to deploy the app using Docker; see the Hybrid Search - Quickstart guide for an example of deploying it to a local Docker container.

Install the Vespa CLI using homebrew - or download a binary from GitHub as demonstrated below.

!brew install vespa-cli

Alternatively, if running in Colab, download the Vespa CLI:

import os
import requests
res = requests.get(url="https://api.github.com/repos/vespa-engine/vespa/releases/latest").json()
os.environ["VERSION"] = res["tag_name"].replace("v", "")
!curl -fsSL https://github.com/vespa-engine/vespa/releases/download/v${VERSION}/vespa-cli_${VERSION}_linux_amd64.tar.gz | tar -zxf -
!ln -sf /content/vespa-cli_${VERSION}_linux_amd64/bin/vespa /bin/vespa

To deploy the application to Vespa Cloud we need to create a tenant in the Vespa Cloud:

Create a tenant at console.vespa-cloud.com (unless you already have one). This step requires a Google or GitHub account, and will start your free trial. Make note of the tenant name; it is used in the next steps.

Configure Vespa Cloud data-plane security

Create a Vespa Cloud data-plane mTLS certificate/key pair. The pair is used to talk to your Vespa Cloud endpoints. See the Vespa Cloud Security Guide for details.

We save the paths to the credentials for later data-plane access without using pyvespa APIs.

import os

os.environ["TENANT_NAME"] = "vespa-team" # Replace with your tenant name

vespa_cli_command = f'vespa config set application {os.environ["TENANT_NAME"]}.{vespa_app_name}'

!vespa config set target cloud
!{vespa_cli_command}
!vespa auth cert -N

Validate that we have the expected data-plane credential files:

from os.path import exists
from pathlib import Path

cert_path = Path.home() / ".vespa" / f"{os.environ['TENANT_NAME']}.{vespa_app_name}.default/data-plane-public-cert.pem"
key_path = Path.home() / ".vespa" / f"{os.environ['TENANT_NAME']}.{vespa_app_name}.default/data-plane-private-key.pem"

if not exists(cert_path) or not exists(key_path):
    print("ERROR: set the correct paths to security credentials. Correct paths above and rerun until you do not see this error")

Note that the subsequent Vespa Cloud deploy call below will add data-plane-public-cert.pem to the application before deploying it to Vespa Cloud, so that you have access to both the private key and the public certificate. At the same time, Vespa Cloud only knows the public certificate.

Configure Vespa Cloud control-plane security

Authenticate to generate a tenant level control plane API key for deploying the applications to Vespa Cloud, and save the path to it.

The generated tenant API key must be added in the Vespa Console before attempting to deploy the application.

To use this key in Vespa Cloud, click 'Add custom key' in the Vespa Console and paste the entire public key, including the BEGIN and END lines.
!vespa auth api-key

from pathlib import Path
api_key_path = Path.home() / ".vespa" / f"{os.environ['TENANT_NAME']}.api-key.pem"

Deploy to Vespa Cloud

Now that we have data-plane and control-plane credentials ready, we can deploy our application to Vespa Cloud!

PyVespa supports deploying apps to the development zone.

Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.

from vespa.deployment import VespaCloud

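def read_secret():
    """Read the API key from the environment variable. This is
    only used for CI/CD purposes."""
    t = os.getenv("VESPA_TEAM_API_KEY")
    if t:
        # The key is stored with escaped newlines; restore real ones
        return t.replace(r"\n", "\n")
    return t

vespa_cloud = VespaCloud(
    tenant=os.environ["TENANT_NAME"],
    application=vespa_app_name,
    key_content=read_secret() if read_secret() else None,
    key_location=api_key_path,
    application_package=vespa_application_package,
)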

Now deploy the app to Vespa Cloud dev zone.

The first deployment typically takes 2 minutes until the endpoint is up.

from vespa.application import Vespa
app:Vespa = vespa_cloud.deploy()

Feed the M3 representations

We convert the three different representations to Vespa feed format:

vespa_fields = {
    "text": passage[0],
    "lexical_rep": {key: float(value) for key, value in passage_embeddings['lexical_weights'][0].items()},
    "dense_rep": passage_embeddings['dense_vecs'][0].tolist(),
    "colbert_rep":  {index: passage_embeddings['colbert_vecs'][0][index].tolist() for index in range(passage_embeddings['colbert_vecs'][0].shape[0])}
}

from vespa.io import VespaResponse
response: VespaResponse = app.feed_data_point(schema='m', data_id=0, fields=vespa_fields)
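The shape of the feed fields is easier to see on toy data. The sketch below wraps the conversion in a helper; to_vespa_fields is a hypothetical name, the vectors have 2 dimensions rather than 1024, and the token IDs and weights are made up. It works on plain Python sequences, so it runs without the model.

```python
import json


def to_vespa_fields(text, lexical_weights, dense_vec, colbert_vecs):
    """Build the feed fields for one passage (hypothetical helper).

    lexical_weights: mapping of token ID -> weight (mapped tensor t{})
    dense_vec:       sequence of floats (indexed tensor x[N])
    colbert_vecs:    sequence of per-token vectors (mixed tensor t{}, x[N])
    """
    return {
        "text": text,
        # One cell per token ID
        "lexical_rep": {str(token): float(w) for token, w in lexical_weights.items()},
        # Plain list of cell values
        "dense_rep": [float(v) for v in dense_vec],
        # One dense block per token position, keyed by the position
        "colbert_rep": {
            str(i): [float(v) for v in vec] for i, vec in enumerate(colbert_vecs)
        },
    }


fields = to_vespa_fields(
    text="a toy passage",
    lexical_weights={"335": 0.14, "11679": 0.25},
    dense_vec=[0.1, -0.2],
    colbert_vecs=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
)
print(json.dumps(fields, indent=2))
```

The mapped dimension of colbert_rep is what lets each passage carry a different number of token vectors.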

Querying data

Now, we can also query our data.

query  = ["What is BGE M3?"]
query_embeddings = model.encode(query, return_dense=True, return_sparse=True, return_colbert_vecs=True)

The M3 colbert scoring function needs the query length to normalize the score to the range 0 to 1. This helps when combining the score with the other scoring functions.

query_length = query_embeddings['colbert_vecs'][0].shape[0]
query_fields = {
    "input.query(q_lexical)": {key: float(value) for key, value in query_embeddings['lexical_weights'][0].items()},
    "input.query(q_dense)": query_embeddings['dense_vecs'][0].tolist(),
    "input.query(q_colbert)":  str({index: query_embeddings['colbert_vecs'][0][index].tolist() for index in range(query_embeddings['colbert_vecs'][0].shape[0])}),
    "input.query(q_len_colbert)": query_length
}

from vespa.io import VespaQueryResponse
import json

response:VespaQueryResponse = app.query(
    yql="select id, text from m where userQuery() or ({targetHits:10}nearestNeighbor(dense_rep,q_dense))",
    ranking="m3hybrid",  # the rank profile defined above
    query=query[0],      # text matched by userQuery()
    body=query_fields,
)
print(json.dumps(response.hits[0], indent=2))
{
  "id": "index:m_content/0/cfcd2084234135f700f08abf",
  "relevance": 0.5993361056332731,
  "source": "m_content",
  "fields": {
    "matchfeatures": {
      "bm25(text)": 0.8630462173553426,
      "dense": 0.6258970723760484,
      "lexical": 0.1941967010498047,
      "max_sim": 0.7753448411822319
    },
    "text": "BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction."
  }
}

Notice the matchfeatures field, which returns the match-features configured in the rank profile. We can use these to compare the torch model scoring with the computations specified in Vespa.
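As a quick sanity check, the relevance of the hit can be recomputed from the returned match features. The sketch below copies the values from the response above and applies the weights from the first-phase expression; the variable names are illustrative.

```python
import math

# Values copied from the hit shown above
match_features = {
    "dense": 0.6258970723760484,
    "lexical": 0.1941967010498047,
    "max_sim": 0.7753448411822319,
}
relevance = 0.5993361056332731

# Weights from the first-phase expression: 0.4*dense + 0.2*lexical + 0.4*max_sim
weights = {"dense": 0.4, "lexical": 0.2, "max_sim": 0.4}

recomputed = sum(weights[name] * match_features[name] for name in weights)
print("recomputed:", recomputed)
print("reported  :", relevance)
print("match     :", math.isclose(recomputed, relevance, abs_tol=1e-9))
```

Note that bm25(text) is returned as a match feature but is not part of the first-phase expression, so it does not contribute to the relevance.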

Now we can compare the Vespa-computed scores with the scores from the model's torch code. They line up closely; small differences can arise because the Vespa fields store bfloat16 cells.

model.compute_lexical_matching_score(passage_embeddings['lexical_weights'][0], query_embeddings['lexical_weights'][0])
query_embeddings['dense_vecs'][0] @ passage_embeddings['dense_vecs'][0].T

That is it!

That is how easy it is to represent the brand new M3 FlagEmbedding representations in Vespa! Read more in the M3 technical report.

We can go ahead and delete the Vespa Cloud instance we deployed by calling vespa_cloud.delete().

