Multi-vector indexing with HNSW

These are the pyvespa steps of the multi-vector-indexing sample application. See the source for a full description and prerequisites, and read the blog post. Highlighted features:

Nearest neighbor search - approximate using HNSW, or exact.
Use a Component to configure the Huggingface embedder.
Using synthetic fields with auto-generated embeddings in data and query flow.
Application package file export, model files in the application package, deployment from files.
Multiphased ranking.
How to control text search result highlighting.

For simpler examples, see text search and pyvespa examples.

Pyvespa is an add-on to Vespa, and this guide will export the application package containing services.xml and wiki.sd. The latter is the schema file for this application - knowing services.xml and schema files is useful when reading Vespa documentation.

Refer to troubleshooting for any problem when running this guide.

This notebook requires pyvespa >= 0.37.1, ZSTD, and the Vespa CLI.

[ ]:

!pip3 install pyvespa

Create the application

Configure the Vespa instance with a component loading the E5-small model. Components are used to plug in code and models to a Vespa application - read more:

[1]:

from vespa.package import (
    ApplicationPackage,
    Component,
    Parameter,
    Field,
    HNSW,
    RankProfile,
    Function,
    FirstPhaseRanking,
    SecondPhaseRanking,
    FieldSet,
    DocumentSummary,
    Summary,
)
from pathlib import Path
import json

app_package = ApplicationPackage(
    name="wiki",
    components=[
        Component(
            id="e5-small-q",
            type="hugging-face-embedder",
            parameters=[
                Parameter("transformer-model", {"path": "model/e5-small-v2-int8.onnx"}),
                Parameter("tokenizer-model", {"path": "model/tokenizer.json"}),
            ],
        )
    ],
)

Configure fields

Vespa has a variety of basic and complex field types. This application uses a combination of integer, text and tensor fields, making it easy to implement hybrid ranking use cases:

[2]:

app_package.schema.add_fields(
    Field(name="id", type="int", indexing=["attribute", "summary"]),
    Field(
        name="title", type="string", indexing=["index", "summary"], index="enable-bm25"
    ),
    Field(
        name="url", type="string", indexing=["index", "summary"], index="enable-bm25"
    ),
    Field(
        name="paragraphs",
        type="array<string>",
        indexing=["index", "summary"],
        index="enable-bm25",
        bolding=True,
    ),
    Field(
        name="paragraph_embeddings",
        type="tensor<float>(p{},x[384])",
        indexing=["input paragraphs", "embed", "index", "attribute"],
        ann=HNSW(distance_metric="angular"),
        is_document_field=False,
    ),
    #
    # Alternatively, for exact distance calculation not using HNSW:
    #
    # Field(name="paragraph_embeddings", type="tensor<float>(p{},x[384])",
    #       indexing=["input paragraphs", "embed", "attribute"],
    #       attribute=["distance-metric: angular"],
    #       is_document_field=False)
)

One field of particular interest is paragraph_embeddings. Note that we are not feeding embeddings to this instance. Instead, the embeddings are generated by using the embed feature, using the model configured at start. Read more in Text embedding made simple.

Looking closely at the code, paragraph_embeddings uses is_document_field=False, meaning it is not fed directly. Instead, it reads another field as input (here paragraphs) and runs embed on it.

As only one model is configured, embed will use that one - it is possible to configure more models and use embed model-id as well.

As the code comment illustrates, different distance metrics can be used, and the nearest neighbor search can be either exact or approximate.
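To make the angular metric and the multi-vector setup more concrete, the following is a plain-Python sketch, not Vespa's implementation. It assumes that the distance for a document is the distance to its closest paragraph vector, which is what the closest(paragraph_embeddings) match feature used later reports. The function names and the toy 3-dimensional vectors are made up for illustration:

```python
import math


def angular_distance(a, b):
    """Angle in radians between two vectors (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to guard against floating-point drift outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


def closest_paragraph(query, paragraph_vectors):
    """Return (label, distance) of the paragraph vector nearest to the query."""
    return min(
        ((label, angular_distance(query, vec)) for label, vec in paragraph_vectors.items()),
        key=lambda pair: pair[1],
    )


# Toy vectors standing in for the 384-dimensional embeddings,
# keyed by paragraph label like the mapped dimension p{}.
doc = {"0": [1.0, 0.0, 0.0], "1": [0.0, 1.0, 0.0], "2": [1.0, 1.0, 0.0]}
label, distance = closest_paragraph([1.0, 0.9, 0.0], doc)
print(label, round(distance, 4))
```

An exact search performs this comparison against every paragraph vector of every document, while HNSW trades a little accuracy for far fewer comparisons. Taking the cosine of the angular distance gives the cosine similarity.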

Configure rank profiles

A rank profile defines the computation for the ranking, with a wide range of possible features as input. Below you will find first-phase ranking using text ranking (BM25), semantic ranking using vector distance (consider a tensor a vector here), and combinations of the two:

[3]:

app_package.schema.add_rank_profile(
    RankProfile(
        name="semantic",
        inputs=[("query(q)", "tensor<float>(x[384])")],
        inherits="default",
        first_phase="cos(distance(field,paragraph_embeddings))",
        match_features=["closest(paragraph_embeddings)"],
    )
)

app_package.schema.add_rank_profile(
    RankProfile(name="bm25", first_phase="2*bm25(title) + bm25(paragraphs)")
)

app_package.schema.add_rank_profile(
    RankProfile(
        name="hybrid",
        inherits="semantic",
        functions=[
            Function(
                name="avg_paragraph_similarity",
                expression="""reduce(
                              sum(l2_normalize(query(q),x) * l2_normalize(attribute(paragraph_embeddings),x),x),
                              avg,
                              p
                          )""",
            ),
            Function(
                name="max_paragraph_similarity",
                expression="""reduce(
                              sum(l2_normalize(query(q),x) * l2_normalize(attribute(paragraph_embeddings),x),x),
                              max,
                              p
                          )""",
            ),
            Function(
                name="all_paragraph_similarities",
                expression="sum(l2_normalize(query(q),x) * l2_normalize(attribute(paragraph_embeddings),x),x)",
            ),
        ],
        first_phase=FirstPhaseRanking(
            expression="cos(distance(field,paragraph_embeddings))"
        ),
        second_phase=SecondPhaseRanking(
            expression="firstPhase + avg_paragraph_similarity() + log( bm25(title) + bm25(paragraphs) + bm25(url))"
        ),
        match_features=[
            "closest(paragraph_embeddings)",
            "firstPhase",
            "bm25(title)",
            "bm25(paragraphs)",
            "avg_paragraph_similarity",
            "max_paragraph_similarity",
            "all_paragraph_similarities",
        ],
    )
)
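The tensor expressions in the hybrid profile can be read as plain cosine similarities per paragraph, reduced over the paragraph dimension p. Below is a minimal pure-Python sketch of that reading, using toy 2-dimensional vectors; the helper names mirror the function names in the profile, but the code is an illustration only and is not what Vespa executes:

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def all_paragraph_similarities(query, paragraphs):
    """Cosine similarity per paragraph label: sum over x of the normalized products."""
    q = l2_normalize(query)
    return {
        label: sum(a * b for a, b in zip(q, l2_normalize(vec)))
        for label, vec in paragraphs.items()
    }


def avg_paragraph_similarity(query, paragraphs):
    """reduce(..., avg, p): mean similarity over all paragraphs."""
    sims = all_paragraph_similarities(query, paragraphs)
    return sum(sims.values()) / len(sims)


def max_paragraph_similarity(query, paragraphs):
    """reduce(..., max, p): similarity of the best-matching paragraph."""
    return max(all_paragraph_similarities(query, paragraphs).values())


paragraphs = {"0": [1.0, 0.0], "1": [0.0, 1.0]}
query = [2.0, 0.0]
print(all_paragraph_similarities(query, paragraphs))
print(avg_paragraph_similarity(query, paragraphs))
print(max_paragraph_similarity(query, paragraphs))
```

The average rewards documents where many paragraphs are close to the query, while the maximum only looks at the single best paragraph.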

Configure fieldset

A fieldset is a way to configure search in multiple fields:

[4]:

app_package.schema.add_field_set(
    FieldSet(name="default", fields=["title", "url", "paragraphs"])
)

Configure document summary

A document summary is the collection of fields to return in query results - the default summary is used unless otherwise specified in the query. Here we configure a minimal document summary without the larger paragraph text/embedding fields:

[5]:

app_package.schema.add_document_summary(
    DocumentSummary(
        name="minimal",
        summary_fields=[Summary("id", "int"), Summary("title", "string")],
    )
)
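A summary is selected per query. As a small sketch, the helper below (a hypothetical name, not part of pyvespa) builds a query body that asks for the minimal summary through the presentation.summary query parameter; the resulting dict could be passed as body to app.query once the application is deployed:

```python
def build_query_body(yql, profile, hits=10, summary=None):
    """Build a Vespa query body, optionally selecting a document summary."""
    body = {"yql": yql, "ranking.profile": profile, "hits": hits}
    if summary is not None:
        # Without this parameter, the default summary is returned.
        body["presentation.summary"] = summary
    return body


body = build_query_body(
    "select * from wiki where true", "unranked", hits=2, summary="minimal"
)
print(body)
```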

Export the configuration

At this point, the application is well defined. Remember that the Component configuration at start configures model files to be found in a model directory. We must therefore export the configuration and add the models, before we can deploy to the Vespa instance. Export the application package:

[6]:

Path("pkg").mkdir(parents=True, exist_ok=True)
app_package.to_files("pkg")

It is a good idea to inspect the files exported into pkg - these are files referred to in the Vespa Documentation.
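One way to see what was exported is to list the files under pkg. This is a small sketch using only the standard library; list_package_files is a made-up helper name, and the listing is skipped if the pkg directory does not exist:

```python
from pathlib import Path


def list_package_files(root):
    """Return the relative paths of all files under root, sorted."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


# Only list when the export step above has been run.
if Path("pkg").is_dir():
    for name in list_package_files("pkg"):
        print(name)
```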

Download model files

At this point, we can save the model files into the application package:

[7]:

! mkdir -p pkg/model
! curl -L -o pkg/model/tokenizer.json \
  https://raw.githubusercontent.com/vespa-engine/sample-apps/master/simple-semantic-search/model/tokenizer.json

! curl -L -o pkg/model/e5-small-v2-int8.onnx \
  https://github.com/vespa-engine/sample-apps/raw/master/simple-semantic-search/model/e5-small-v2-int8.onnx

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  694k  100  694k    0     0  2473k      0 --:--:-- --:--:-- --:--:-- 2508k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 32.3M  100 32.3M    0     0  27.1M      0  0:00:01  0:00:01 --:--:-- 53.0M

Deploy the application

As all the files in the app package are ready, we can start a Vespa instance - here using Docker. Deploy the app package:

[8]:

from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy_from_disk(application_name="wiki", application_root="pkg")

Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Application is up!
Finished deployment.

Feed documents

Download the Wikipedia articles:

[9]:

! curl -s -H "Accept:application/vnd.github.v3.raw" \
  https://api.github.com/repos/vespa-engine/sample-apps/contents/multi-vector-indexing/ext/articles.jsonl.zst | \
  zstdcat - > articles.jsonl

If you do not have ZSTD installed, get articles.jsonl.zip and unzip it instead.
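If neither zstdcat nor an unzip tool is at hand, the zip archive can also be unpacked from Python. This is a sketch that assumes articles.jsonl.zip has already been downloaded to the working directory; unzip_articles is a hypothetical helper name:

```python
import zipfile
from pathlib import Path


def unzip_articles(zip_path, out_dir="."):
    """Extract every member of the archive into out_dir and return their names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()


# Only extract when the archive is present.
if Path("articles.jsonl.zip").is_file():
    print(unzip_articles("articles.jsonl.zip"))
```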

Feed and index the Wikipedia articles using the Vespa CLI. As part of feeding, embed is called on each article, and the output of this is stored in the paragraph_embeddings field:

[10]:

! vespa config set target local
! vespa feed articles.jsonl

{
  "feeder.seconds": 1.448,
  "feeder.ok.count": 8,
  "feeder.ok.rate": 5.524,
  "feeder.error.count": 0,
  "feeder.inflight.count": 0,
  "http.request.count": 8,
  "http.request.bytes": 12958,
  "http.request.MBps": 0.009,
  "http.exception.count": 0,
  "http.response.count": 8,
  "http.response.bytes": 674,
  "http.response.MBps": 0.000,
  "http.response.error.count": 0,
  "http.response.latency.millis.min": 728,
  "http.response.latency.millis.avg": 834,
  "http.response.latency.millis.max": 1446,
  "http.response.code.counts": {
    "200": 8
  }
}

Note that creating embeddings is computationally expensive, but this is a small dataset with only 8 articles, so feeding completes in a few seconds.
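When feeding from a script, the summary printed by the feed command can be checked programmatically. The sketch below assumes the JSON has been captured into a string, here abbreviated to the three counters it inspects; feed_succeeded is a hypothetical helper name:

```python
import json


def feed_succeeded(stats, expected_docs):
    """True when all expected documents were fed and no errors were reported."""
    return (
        stats["feeder.error.count"] == 0
        and stats["http.response.error.count"] == 0
        and stats["feeder.ok.count"] == expected_docs
    )


# Abbreviated copy of the feed summary shown above.
summary = json.loads(
    '{"feeder.ok.count": 8, "feeder.error.count": 0, "http.response.error.count": 0}'
)
print(feed_succeeded(summary, expected_docs=8))
```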

The Vespa instance is now populated with the Wikipedia articles, with generated embeddings, and ready for queries. The next sections have examples of various kinds of queries to run on the dataset.

Simple retrieval of all articles with undefined ranking

Run a query selecting all documents, returning two of them. The rank profile is the built-in unranked, which means no ranking calculations are done and the results are returned in arbitrary order:

[11]:

from vespa.io import VespaQueryResponse

result: VespaQueryResponse = app.query(
    body={
        "yql": "select * from wiki where true",
        "ranking.profile": "unranked",
        "hits": 2,
    }
)
if not result.is_successful():
    raise ValueError(result.get_json())
if len(result.hits) != 2:
    raise ValueError("Expected 2 hits, got {}".format(len(result.hits)))
print(json.dumps(result.hits, indent=4))

[
    {
        "id": "id:wikipedia:wiki::797944",
        "relevance": 0.0,
        "source": "wiki_content",
        "fields": {
            "sddocname": "wiki",
            "paragraphs": [
                "Abella Danger made her pornography debut in July 2014 for Bang Bros. She has appeared in about 1010 credited scenes. She has appeared in mainstream news media other than adult news media, including the websites \"Elite Daily\" and \"International Business Times\".",
                "In 2018, \"Fortune\" said she was one of the most popular and in-demand performers in the pornographic business.",
                "Amongst many awards, she won Best Pornographic actor in 2021 (Pornovizija '21), nominated by the competent jury committee.",
                "Abella belongs to a Jewish-Ukrainian family. She started as a ballet dancer when she was only three years old."
            ],
            "documentid": "id:wikipedia:wiki::797944",
            "title": "Abella Danger",
            "url": "https://simple.wikipedia.org/wiki?curid=797944"
        }
    },
    {
        "id": "id:wikipedia:wiki::8496",
        "relevance": 0.0,
        "source": "wiki_content",
        "fields": {
            "sddocname": "wiki",
            "paragraphs": [
                "The way the word is used varies. Most societies have rites of passage to mark the change from childhood to adulthood. These ceremonies may be quite elaborate. During puberty, rapid mental and physical development occurs. Adolescence is the name for this transition period from childhood to adulthood.",
                "\"Teenager\" is mainly an English word, as many foreign languages do not include a suffix in their translations of the numbers 13 to 19. In non-English speaking countries, people between these ages may be called adolescents, youths, young adults, or just children, depending on the culture.",
                "The life of a teenager seems to change daily. Constantly exposed to new ideas, social situations and people, teenagers work to develop their personalities and interests during this time of great change. Before their teenage years, these adolescents focused on school, play, and gaining approval from their parents."
            ],
            "documentid": "id:wikipedia:wiki::8496",
            "title": "Teenager",
            "url": "https://simple.wikipedia.org/wiki?curid=8496"
        }
    }
]
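The full hits are verbose because the default summary includes all paragraphs. For a quick overview, each hit can be reduced to a few values. This is a small sketch with a hypothetical helper, summarize_hits, shown on a hand-written hit that copies the id, relevance and title from the output above; with a live instance, result.hits could be passed instead:

```python
def summarize_hits(hits):
    """Reduce each hit to (document id, relevance, title)."""
    return [
        (hit["id"], hit["relevance"], hit.get("fields", {}).get("title"))
        for hit in hits
    ]


# Hand-written sample mirroring one hit from the output above.
sample_hits = [
    {
        "id": "id:wikipedia:wiki::8496",
        "relevance": 0.0,
        "fields": {"title": "Teenager"},
    },
]
for row in summarize_hits(sample_hits):
    print(row)
```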

Traditional keyword search with BM25 ranking on the article level

Run a text-search query and use the bm25 ranking profile configured at the start of this guide: 2*bm25(title) + bm25(paragraphs). Here, we use BM25 on the title and paragraph text fields, giving more weight to matches in title:

[12]:

result = app.query(
    body={
        "yql": "select * from wiki where userQuery()",
        "query": 24,
        "ranking.profile": "bm25",
        "hits": 2,
    }
)
if len(result.hits) != 2:
    raise ValueError("Expected 2 hits, got {}".format(len(result.hits)))
print(json.dumps(result.hits, indent=4))

[
{
"id": "id:wikipedia:wiki::9985",
"relevance": 4.8876824345024605,
"source": "wiki_content",
"fields": {
"sddocname": "wiki",
"paragraphs": [
"The <hi>24</hi>-hour clock is a way of telling the time in which the day runs from midnight to midnight and is divided into <hi>24</hi> hours, numbered from 0 to 23. It does not use a.m. or p.m. This system is also referred to (only in the US and the English speaking parts of Canada) as military time or (only in the United Kingdom and now very rarely) as continental time. In some parts of the world, it is called railway time. Also, the international standard notation of time (ISO 8601) is based on this format.",
"A time in the <hi>24</hi>-hour clock is written in the form hours:minutes (for example, 01:23), or hours:minutes:seconds (01:23:45). Numbers under 10 have a zero in front (called a leading zero); e.g. 09:07. Under the <hi>24</hi>-hour clock system, the day begins at midnight, 00:00, and the last minute of the day begins at 23:59 and ends at <hi>24</hi>:00, which is identical to 00:00 of the following day. 12:00 can only be mid-day. Midnight is called <hi>24</hi>:00 and is used to mean the end of the day and 00:00 is used to mean the beginning of the day. For example, you would say \"Tuesday at <hi>24</hi>:00\" and \"Wednesday at 00:00\" to mean exactly the same time.",
"However, the US military prefers not to say <hi>24</hi>:00 - they do not like to have two names for the same thing, so they always say \"23:59\", which is one minute before midnight.",
"<hi>24</hi>-hour clock time is used in computers, military, public safety, and transport. In many Asian, European and Latin American countries people use it to write the time. Many European people use it in speaking.",
"In railway timetables <hi>24</hi>:00 means the \"end\" of the day. For example, a train due to arrive at a station during the last minute of a day arrives at <hi>24</hi>:00; but trains which depart during the first minute of the day go at 00:00."
],
"documentid": "id:wikipedia:wiki::9985",
"title": "24-hour clock",
"url": "https://simple.wikipedia.org/wiki?curid=9985"
}
},
{
"id": "id:wikipedia:wiki::65655",
"relevance": 0.8262847635070321,
"source": "wiki_content",
"fields": {
"sddocname": "wiki",
"paragraphs": [
"Ronaldo began his professional career with Sporting CP at age 17 in 2002, and signed for Manchester United a year later. He won the FA Cup in his first season, and then won three back-to-back Premier League titles: in 2006-07, 2007-08, and 2008-09. In 2007-08, Ronaldo, helped United win the UEFA Champions League. In 2008-09, he won his first FIFA Club World Cup in December 2008, and he also won his first Ballon d'Or. At one point Ronaldo was the most expensive professional footballer of all time, after moving from Manchester United to Real Madrid for approximately \u00a380 m in July 2009.",
"He won his first trophy with Real Madrid in 2011, the 2010-11 Copa del Rey. In the next season, he won his first La Liga title with the club, the 2011-12 La Liga. In the 2012-13 season he won the Supercopa de Espa\u00f1a. In the next season, the 2013-14 season, he won his second Ballon d'Or. Then he won the Copa del Rey, and he also won his second Champions League with a record 17 goals. The following year, Ronaldo won the Ballon d'Or again, along with his second FIFA Club World Cup in December 2014. In 2016, Ronaldo won his third Champions League, and scored the winning penalty in the final against Atl\u00e9tico Madrid. He won his fourth Ballon d'or the next season, his second La Liga title for the first time in five years, another Champions League, and his second Club World Cup. Ronaldo's last season with Real Madrid was the 2017-18 season, where he won his fifth Ballon d'Or in 2017, and also won his fifth Champions League and scored two goals in the final against Juventus. With his third consecutive Champions League, he became the first player to win the UEFA Champions League five times. He would later go on to transfer to Juventus in July 2018. Ronaldo left the club by holding the record for being the top goal scorer in Real Madrid's history, and remaining as the only player in La Liga's history to score 30 or more goals in six consecutive seasons.",
"Ronaldo began his career with Portugal at age 18. He scored his first goal at UEFA Euro 2004 and helped Portugal reach the final, although they lost to Greece 1-0. The first World Cup he played at was the 2006 FIFA World Cup. He scored a goal and helped Portugal earn fourth place. Two years later, he became Portugal's full captain. Since then, he has appeared at three Euro's: 2008, 2012, and 2016, and two World Cups: 2014 and 2018. He helped Portugal win their first major international trophy at Euro 2016, and their second three years later with the UEFA Nations League.",
"He currently Plays for Manchester United and his shirt number is 7. He is now considered the greatest of all time with the most goals of 800+ in his all time of carrer.",
"Cristiano Ronaldo dos Santos Aveiro was born in Funchal, Madeira Islands to Maria Dolores dos Santos and Jos\u00e9 Dinis Aveiro. He has one brother named Hugo, and two sisters named Katia and Elma. Ronaldo was diagnosed with a racing heart (Tachycardia) at age 15 In 1997, 12-year-old Ronaldo went on a trial with Sporting CP. He impressed the club enough to be signed for \u00a31,500. He then moved from Funchal to Lisbon, to join the Sporting youth academy.",
"Ronaldo played his first professional match against Inter Milan on 14 August 2002. He came in as a sub during the second half. On 7 October 2002, Ronaldo played a role in his first game in the Portuguese Primeira Liga, against Moreirense. He scored two goals and Sporting won 3\u20130. Ronaldo came to the attention of Manchester United manager Alex Ferguson in August 2003, when Sporting defeated United 3\u20131 in the first game ever played at the Est\u00e1dio Jos\u00e9 Alvalade in Lisbon. His performance impressed the United players, who told Ferguson to sign him.",
"On 12 August 2003, Ronaldo joined Manchester United from Sporting CP for a fee of \u00a312.<hi>24</hi> million. He was Manchester United's first Portuguese player. He wanted the number 28, the number he wore at Sporting, but was eventually given the number 7. This number had been worn by George Best, Eric Cantona and David Beckham before him. He played his first game for the club on 16 August 2003 in a 4-0 win against Bolton Wanderers. Many people were impressed with his debut, including legendary United player George Best. Ronaldo's first goal with Manchester United was a free kick. He scored it in a 3-0 win against Portsmouth on 1 November 2003.Ronaldo won his first trophy in England, the 2003-04 FA Cup, in May 2004, when Manchester United beat Millwall 3-0. Ronaldo scored the first goal of the match.",
"Ronaldo scored his first UEFA Champions League goals in a 7-1 victory against A.S. Roma in April 2007. He scored two goals that day in what was his 30th Champions League match. He played the FA Cup Final that year, but United lost 1-0 to Chelsea. Even though he lost the FA Cup final, he didn't end the season without a title, because he won his first Premier League title.",
"In the 2007-08 season, Ronaldo scored his first and only hat-trick for Manchester United in a 6\u20130 win against Newcastle United on 12 January 2008. In the 2008 Champions League final against Chelsea, he scored a header as the match ended 1-1 after extra time. Although he missed his penalty, Manchester United won the shootout 6-5 and Ronaldo won his first UEFA Champions League. On 15 November 2008, Ronaldo scored his 100th goal for United in a 5-0 win against Stoke City. He also scored two free-kicks: the first one was his 100th goal. He scored a total of 42 goals and won the European Golden Boot, an award given to the top scorer of every European national league.",
"He won the first FIFA Puskas Award in 2009. The Puskas Award is given to whoever scores the best goal of that year. The goal was a 40-yard strike into the top-left corner against FC Porto on 15 April 2009 in the Champions League quarter finals. That goal was the only goal of the game. It was also an important goal because it sent United to the semi-finals. In the semi-final against Arsenal, Ronaldo scored two goals. One of them was a free kick from 40 yards out. His goals helped Man United qualify to the final, where they lost to Barcelona 2-0.",
"Ronaldo joined Real Madrid on 1 July 2009 for a fee of \u20ac94 million. That was a world record transfer fee at the time. He also signed a six-year contract with the club. At his presentation as a Real Madrid player, 80,000 people greeted him at the Santiago Bernab\u00e9u Stadium. This is the world record, breaking the 25-year record of 75,000 people at Diego Maradona's presentation for Napoli. He wore the number 9 in his first season because number 7 was taken by Ra\u00fal Gonz\u00e1lez. Ronaldo had to wait until Raul left the club in the summer of 2010 to wear number 7. He made his debut (first appearance) on 29 August 2009, in a La Liga game against Deportivo de La Coru\u00f1a. He scored a goal and Real Madrid won 3-2. On 23 October 2010, Ronaldo scored 4 goals in a match for the first time in his career during a 6-1 win against Racing de Santander. Ronaldo scored a header in extra time in the 2011 Copa del Rey final against rivals Barcelona. His goal was the match winner, so that was Ronaldo's first trophy in Spain. At the end of the 2010-11 season, he became the first player to score 40 goals in La Liga.",
"On 2 November 2011, Ronaldo scored both goals in a 2-0 Champions League group stage win against Olympique Lyon. The second goal was his 100th goal for Real Madrid. He achieved this in just 105 matches. He scored his 100th La Liga goal for his club in just 92 appearances in a 5-1 win against Real Sociedad on <hi>24</hi> March 2012.",
"Ronaldo began the 2012-13 season by winning the 2012 Supercopa de Espa\u00f1a. He scored in both legs, as Real Madrid won against Barca. In October 2012, he scored his first hat-trick in the Champions League in a 4-1 win against Ajax. On 6 January 2013, Ronaldo captained his club for the first time. In May 2013, he scored his 200th goal for Real Madrid in a 6-2 win against M\u00e1laga. He finished as the Champions League top scorer that season. Ronaldo also won his second Ballon d'Or in 2013.",
"In the 2013-14 season, Ronaldo broke the record for most goals in one Champions League season by scoring his 17th goal with a penalty in extra time in the final against Atl\u00e9tico Madrid that Real Madrid won 4\u20131. The previous record was 14 goals, set by Messi in the 2011-12 season.",
"In the 2014-15 season, Ronaldo set a new personal record by scoring 61 goals in all competitions. This achievement helped him win his second Ballon d'Or. He scored five goals in one match for the first time in his career in a 9-1 win vs. Granada on 5 April 2015. He became Real Madrid's all-time top scorer when he scored 5 goals against RCD Espanyol in a 6-0 away win on 12 September 2015. This brought his total goal tally to 230 goals in 203 games. The previous record holder was Raul.",
"On 18 April 2017, he became the first player to reach 100 goals in the UEFA Champions League, after he scored a hat-trick in a 4-2 extra-time win against Bayern Munich. On 18 March 2018, Ronaldo reached his 50th career hat-trick in a 6-3 win against Girona. Ronaldo scored an amazing bicycle-kick in a UEFA Champions League match against Juventus on 3 April 2018. He got a standing ovation, or round of applause, from the Juventus fans after scoring that goal. Real Madrid went on to play the final against Liverpool F.C.. Real Madrid became champions, so that was Ronaldo's 5th Champions League.",
"On 10 July 2018, He joined Juventus of Italy and signed a 4 year contract worth 112 Million Euros. The transfer was the highest paid for a player over 30 years old. People called it \"the deal of the century\".",
"He scored his first goals for the club on 16 September against US Sassuolo. Juventus won 2-1 at home. Three days later, on 19 September, he was controversially sent off against Valencia C.F. for \"violent behavior\". He was crying as he received the red card and said he \"did nothing\". Ronaldo won his first trophy with the club, the 2018 Supercoppa Italiana, in January 2019. In the match, he scored the only goal from a header against AC Milan.",
"Ronaldo began his international career with the Portugal under-15's in 2001. In 2002, he played the U-17 Euro with Portugal's U-17. In June 2003, Ronaldo won the Toulon Tournament with the Portugal U-20. He played with the U21's in 2003 as well during the 2004 U-21 Euro qualification stages. He also played with the Portugal under-23 team at the 2004 Summer Olympics, where he scored one goal against Morocco.",
"He kept progressing through the youth national teams until he played his first senior game for Portugal when he was 18 on 20 August 2003 against Kazakhstan. He scored his first goal for Portugal in a game against Greece at the UEFA Euro 2004.",
"Ronaldo was selected to play at the 2006 FIFA World Cup, which was also his first World Cup. During the tournament, he scored a goal against Iran. That goal was also his first World Cup goal. In the match against England, Ronaldo's teammate at Manchester United, Wayne Rooney, was sent off. Ronaldo went up to the referee and appeared to ask the ref to give a red card to Rooney. Rooney pushed Ronaldo as he was talking to the ref. After Rooney left the pitch, Ronaldo was caught on camera winking at the Portugal bench. Even though the referee said Rooney would've gotten the red card if Ronaldo complained or not, these actions caused a lot of fans to hate Ronaldo because they believed he influenced the referee's decision.",
"Ronaldo became captain of Portugal for the first time in a friendly game against Brazil on 6 February 2007. At Euro 2008, he scored one goal against Czech Republic. Portugal were eliminated in the quarter-finals by barely losing to Germany 3-2. In the 2010 World Cup, Ronaldo only scored one goal in a 7-0 win against North Korea, but was the man of the match in all of Portugal's 3 group stage games. Portugal were eliminated in the round of 16 by Spain 1-0, and Spain went on to win the World Cup.",
"His first international hat-trick came in a 4-2 win against Northern Ireland on 6 September 2013. In that match, he scored 3 goals in 15 minutes. He became Portugal's all time top scorer when he scored twice against Cameroon in March 2014, with Portugal winning 5-1. At the 2014 World Cup, Ronaldo scored a goal against Ghana, and assisted a last minute equalizer in a 2-2 draw against the United States. In a 3-3 draw against Hungary in Euro 2016, Ronaldo scored a back heel goal. This goal made him the first player to score in four Euros. Although he had to leave the game early because he got injured in the UEFA Euro 2016 Final, Portugal still won 1-0 in extra time because of a goal from Eder.",
"In the 2018 World Cup held in Russia, Ronaldo scored 4 goals. In Portugal's first group game, he scored a hat trick against Spain in a 3-3 draw. After that he scored the winning goal against Morocco. In the last group game, he missed a penalty in a 1-1 draw against Iran. Portugal qualified to the knockout stages by finishing second in the group. They were eliminated by Uruguay 2-1. Ronaldo became one of only four players to score at 4 World Cups. He has played at 4 of them: 2006, 2010, 2014, and 2018.",
"In June 2019, Ronaldo won the UEFA Nations League with Portugal to give him his second international title. In the final, Portugal beat The Netherlands 1-0.",
"Ronaldo is able to play on both wings and also as a striker since he is very strong with both feet, even though he is naturally right footed. He is also one of the world's fastest players. He has good heading ability because he is over 6 feet tall and jumps high. He is also known for his powerful \"knuckleball\" free kicks. The \"knuckleball\" technique is when the ball spins very little and creates an unpredictable motion. He combines this with his powerful shot, making it hard for goalkeepers to stop his shots. Ronaldo has also been known for his dribbling, as he likes to do many tricks and with the ball to pass defenders, such as the . When he was at Manchester United, he would play as a winger and try to send into the middle. At Real Madrid, he changed his playing style by moving more towards the middle and becoming more of a striker. He also focused more on scoring goals. When he arrived to Juventus, he stayed with this playing style of being a goal scorer and target man, although he dribbled with the ball more because he sometimes liked to play his traditional winger position and go one-on-one with defenders. He also sent crosses more frequently than he did in his last few seasons at Real Madrid.",
"Ronaldo has been criticized for \"diving\" by many people, including his Manchester United manager Alex Ferguson. He has also been criticized for being arrogant, such as complaining for not receiving set-pieces (free-kicks and penalties) when he gets fouled, having too much self-confidence, not celebrating with teammates after scoring goals, and getting excessively angry with others after losing. Examples of this are when he threw a reporter's microphone into a lake before a UEFA Euro 2016 match, and negative comments made at the Iceland national team after playing against them.",
"Ronaldo's father, Jos\u00e9 Aveiro, died of liver disease at age 52 in September 2005. Ronaldo was 20 years old at the time. Ronaldo said that he does not like to drink alcohol, mostly because of his dad's death, but has on some very few occasions.",
"In 2006, Ronaldo opened his first fashion boutique under the name \"CR7\" (his initials and shirt number) on the\u00a0island he was born in, Madeira. He opened a second boutique in Lisbon in 2008, and a third in 2009, located in Madrid. In December 2013, Ronaldo opened his own museum called \"Museu CR7\", which has all of his trophies and awards from his career.",
"Ronaldo became a father on 17 June 2010. He had a son, named Cristiano Jr.\u00a0He was born in the United States through an American surrogate he met in a restaurant, and Ronaldo announced that he had full custody.\u00a0Ronaldo has never publicly revealed information about his son's mother, but he says he will reveal it to Cristiano Jr. when he gets older. On 8 June 2017, Ronaldo confirmed on social media that he had become the father to twins, Mateo and Eva. They were born in the United States to a mother. In November 2017, his girlfriend Georgina Rodriguez gave birth to their first daughter, Alana.",
"Ronaldo has had many relationships. He was in a relationship with Russian model Irina Shayk from 2010\u20132015. In 2016, he began to date Spanish model Georgina Rodriguez. They were publicly seen for the first time at Disneyland Paris in November 2016.",
"Ronaldo is a\u00a0Roman Catholic. He does not have\u00a0tattoos\u00a0because it would prevent him from donating blood. On 9 November 2015, a movie about his lifestyle and his career was released. The title of the movie is \"Ronaldo\".",
"On 29 March 2017, Madeira Airport was renamed to Cristiano Ronaldo International Airport. A bust of Ronaldo was also revealed as part of the official renaming ceremony.",
"He is currently the most followed Instagram user, with over 200 million followers as of February 2020. He passed Selena Gomez as the most followed person in October 2018, with 144 million followers.",
"In June 2018, Ronaldo was given a suspended jail sentence of 2 years and a fine of \u20ac18.8 million for tax evasion.",
"In 2017, a woman claimed she was raped by Ronaldo at a hotel in Las Vegas in June 2009. Many articles state that Ronaldo and the woman both met each other at the nightclub of the hotel. Ronaldo later invited her to his suite, and that was where the rape occurred. Ronaldo paid that woman $375,000 to stay quiet, and an agreement was made between Ronaldo's lawyers and the woman's lawyers that if she publicly shared information of what happened, she had to pay those $375,000 back. Ronaldo himself denies raping the woman and calls it \"fake news\". In July 2019, prosecutors said they would not charge Ronaldo because there was not enough evidence."
],
"documentid": "id:wikipedia:wiki::65655",
"title": "Cristiano Ronaldo",
"url": "https://simple.wikipedia.org/wiki?curid=65655"
}
}
]

Semantic vector search on the paragraph level

This query embeds the query text “what does 24 mean in the context of railways” and selects the semantic rank profile, which scores documents by cos(distance(field,paragraph_embeddings)). Vespa thus computes the distance between the query vector and the paragraph vectors that were computed at indexing time by the field's indexing expression: "input paragraphs", "embed", "index", "attribute":

[14]:

result = app.query(
    body={
        "yql": "select * from wiki where {targetHits:2}nearestNeighbor(paragraph_embeddings,q)",
        "input.query(q)": "embed(what does 24 mean in the context of railways)",
        "ranking.profile": "semantic",
        "presentation.format.tensors": "short-value",
        "hits": 2,
    }
)
if len(result.hits) != 2:
    raise ValueError("Expected 2 hits, got {}".format(len(result.hits)))
print(json.dumps(result.hits, indent=4))

[
{
"id": "id:wikipedia:wiki::9985",
"relevance": 0.8807156260391702,
"source": "wiki_content",
"fields": {
"matchfeatures": {
"closest(paragraph_embeddings)": {
"4": 1.0
}
},
"sddocname": "wiki",
"paragraphs": [
"The 24-hour clock is a way of telling the time in which the day runs from midnight to midnight and is divided into 24 hours, numbered from 0 to 23. It does not use a.m. or p.m. This system is also referred to (only in the US and the English speaking parts of Canada) as military time or (only in the United Kingdom and now very rarely) as continental time. In some parts of the world, it is called railway time. Also, the international standard notation of time (ISO 8601) is based on this format.",
"A time in the 24-hour clock is written in the form hours:minutes (for example, 01:23), or hours:minutes:seconds (01:23:45). Numbers under 10 have a zero in front (called a leading zero); e.g. 09:07. Under the 24-hour clock system, the day begins at midnight, 00:00, and the last minute of the day begins at 23:59 and ends at 24:00, which is identical to 00:00 of the following day. 12:00 can only be mid-day. Midnight is called 24:00 and is used to mean the end of the day and 00:00 is used to mean the beginning of the day. For example, you would say \"Tuesday at 24:00\" and \"Wednesday at 00:00\" to mean exactly the same time.",
"However, the US military prefers not to say 24:00 - they do not like to have two names for the same thing, so they always say \"23:59\", which is one minute before midnight.",
"24-hour clock time is used in computers, military, public safety, and transport. In many Asian, European and Latin American countries people use it to write the time. Many European people use it in speaking.",
"In railway timetables 24:00 means the \"end\" of the day. For example, a train due to arrive at a station during the last minute of a day arrives at 24:00; but trains which depart during the first minute of the day go at 00:00."
],
"documentid": "id:wikipedia:wiki::9985",
"title": "24-hour clock",
"url": "https://simple.wikipedia.org/wiki?curid=9985"
}
},
{
"id": "id:wikipedia:wiki::59079",
"relevance": 0.7972394509946005,
"source": "wiki_content",
"fields": {
"matchfeatures": {
"closest(paragraph_embeddings)": {
"4": 1.0
}
},
"sddocname": "wiki",
"paragraphs": [
"Logic gates are digital components. They normally work at only two levels of voltage, a positive level and zero level. Commonly they work based on two states: \"On\" and \"Off\". In the On state, voltage is positive. In the Off state, the voltage is at zero. The On state usually uses a voltage in the range of 3.5 to 5 volts. This range can be lower for some uses.",
"Logic gates compare the state at their inputs to decide what the state at their output should be. A logic gate is \"on\" or active when its rules are correctly met. At this time, electricity is flowing through the gate and the voltage at its output is at the level of its On state.",
"Logic gates are electronic versions of Boolean logic. Truth tables will tell you what the output will be, depending on the inputs.",
"AND gates have two inputs. The output of an AND gate is on only if both inputs are on. If at least one of the inputs is off, the output will be off.",
"Using the image at the right, if \"A\" and \"B\" are both in an On state, the output (out) will be an On state. If either \"A\" or \"B\" is in an Off state, the output will also be in an Off state. \"A\" and \"B\" must be On for the output to be On.",
"OR gates have two inputs. The output of an OR gate will be on if at least one of the inputs are on. If both inputs are off, the output will be off.",
"Using the image at the right, if either \"A\" or \"B\" is On, the output (\"out\") will also be On. If both \"A\" and \"B\" are Off, the output will be Off.",
"The NOT logic gate has only one input. If the input is On then the output will be Off. In other words, the NOT logic gate changes the signal from On to Off or from Off to On. It is sometimes called an inverter.",
"XOR (\"exclusive or\") gates have two inputs. The output of a XOR gate will be true only if the two inputs are different from each other. If both inputs are the same, the output will be off.",
"NAND means not both. It is called NAND because it means \"not and.\" This means that it will always output true unless both inputs are on.",
"XNOR means \"not exclusive or.\" This means that it will only output true if both inputs are the same. It is the opposite of a XOR logic gate."
],
"documentid": "id:wikipedia:wiki::59079",
"title": "Logic gate",
"url": "https://simple.wikipedia.org/wiki?curid=59079"
}
}
]
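To make the semantic score concrete, here is a minimal pure-Python sketch of what a profile like cos(distance(field,paragraph_embeddings)) computes when a document holds several paragraph vectors. It assumes the distance is the angle between two vectors (the angular distance metric), so that its cosine is the cosine similarity. The toy three-dimensional vectors and the helper names are hypothetical and are not produced by the E5 model.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_paragraph(query_vec: list, paragraph_vecs: dict) -> tuple:
    """Return (index, similarity) of the paragraph vector closest to the query."""
    scores = {i: cosine_similarity(query_vec, v) for i, v in paragraph_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical toy data: one query vector and three paragraph vectors
toy_query = [1.0, 0.0, 0.0]
toy_paragraphs = {
    0: [0.0, 1.0, 0.0],
    1: [1.0, 1.0, 0.0],
    2: [2.0, 0.0, 0.0],
}
best_index, best_score = closest_paragraph(toy_query, toy_paragraphs)
print(best_index, round(best_score, 4))
```

Under this assumption, a document is scored by its best paragraph, so a single strongly matching paragraph is enough to surface a long article.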

An interesting question is then which of the document's paragraphs was the closest. When analysing ranking, match-features let you export the scores used in the ranking calculations, see closest. From the result above:

"matchfeatures": {
    "closest(paragraph_embeddings)": {
        "4": 1.0
    }
}

This means that the embedding with index 4, i.e. the fifth paragraph, is the closest match. With this, it is straightforward to feed articles as an array of paragraphs and highlight the best matching paragraph in each document!

[17]:

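def find_best_paragraph(hit: dict) -> str:
    """Return the paragraph that closest(paragraph_embeddings) points to."""
    paragraphs = hit["fields"]["paragraphs"]
    match_features = hit["fields"]["matchfeatures"]
    # With short-value tensors, the single key is the index of the closest paragraph
    index = int(list(match_features["closest(paragraph_embeddings)"].keys())[0])
    return paragraphs[index]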

[18]:

find_best_paragraph(result.hits[0])

[18]:

'In railway timetables 24:00 means the "end" of the day. For example, a train due to arrive at a station during the last minute of a day arrives at 24:00; but trains which depart during the first minute of the day go at 00:00.'
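The helper above assumes the short tensor format, where the cells are the top-level keys. Some responses in this guide instead render the same feature with "type" and "cells" keys. A hedged sketch of a variant that accepts both layouts follows; the function name closest_paragraph_index and the sample dicts are illustrative only.

```python
def closest_paragraph_index(match_features: dict) -> int:
    """Extract the paragraph index from closest(paragraph_embeddings)."""
    closest = match_features["closest(paragraph_embeddings)"]
    # Verbose layout: {"type": "...", "cells": {"4": 1.0}}
    cells = closest["cells"] if "cells" in closest else closest
    # Exactly one cell is expected: the label of the closest paragraph
    (label,) = cells.keys()
    return int(label)


# Illustrative inputs mirroring the two layouts seen in this guide
short_form = {"closest(paragraph_embeddings)": {"4": 1.0}}
verbose_form = {
    "closest(paragraph_embeddings)": {
        "type": "tensor<float>(p{})",
        "cells": {"4": 1.0},
    }
}
print(closest_paragraph_index(short_form), closest_paragraph_index(verbose_form))
```

The tuple unpacking raises a ValueError if the feature ever holds more than one cell, which makes an unexpected response easy to spot.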

Hybrid search and ranking

Hybrid search combines keyword search on the article level with vector search in the paragraph index:

[20]:

result = app.query(
    body={
        "yql": "select * from wiki where userQuery() or ({targetHits:1}nearestNeighbor(paragraph_embeddings,q))",
        "input.query(q)": "embed(what does 24 mean in the context of railways)",
        "query": "what does 24 mean in the context of railways",
        "ranking.profile": "hybrid",
        "presentation.format.tensors": "short-value",
        "hits": 1,
    }
)
if len(result.hits) != 1:
    raise ValueError("Expected 1 hit, got {}".format(len(result.hits)))
print(json.dumps(result.hits, indent=4))

[
    {
        "id": "id:wikipedia:wiki::9985",
        "relevance": 4.163399168193791,
        "source": "wiki_content",
        "fields": {
            "matchfeatures": {
                "bm25(paragraphs)": 10.468827250036052,
                "bm25(title)": 1.1272217840066168,
                "closest(paragraph_embeddings)": {
                    "4": 1.0
                },
                "firstPhase": 0.8807156260391702,
                "all_paragraph_similarities": {
                    "1": 0.8030083179473877,
                    "2": 0.7992785573005676,
                    "3": 0.8273358345031738,
                    "4": 0.8807156085968018,
                    "0": 0.849757194519043
                },
                "avg_paragraph_similarity": 0.8320191025733947,
                "max_paragraph_similarity": 0.8807156085968018
            },
            "sddocname": "wiki",
            "paragraphs": [
                "<hi>The</hi> <hi>24</hi>-hour clock is a way <hi>of</hi> telling <hi>the</hi> time <hi>in</hi> which <hi>the</hi> day runs from midnight to midnight and is divided into <hi>24</hi> hours, numbered from 0 to 23. It <hi>does</hi> not use a.m. or p.m. This system is also referred to (only <hi>in</hi> <hi>the</hi> US and <hi>the</hi> English speaking parts <hi>of</hi> Canada) as military time or (only <hi>in</hi> <hi>the</hi> United Kingdom and now very rarely) as continental time. <hi>In</hi> some parts <hi>of</hi> <hi>the</hi> world, it is called <hi>railway</hi> time. Also, <hi>the</hi> international standard notation <hi>of</hi> time (ISO 8601) is based on this format.",
                "A time <hi>in</hi> <hi>the</hi> <hi>24</hi>-hour clock is written <hi>in</hi> <hi>the</hi> form hours:minutes (for example, 01:23), or hours:minutes:seconds (01:23:45). Numbers under 10 have a zero <hi>in</hi> front (called a leading zero); e.g. 09:07. Under <hi>the</hi> <hi>24</hi>-hour clock system, <hi>the</hi> day begins at midnight, 00:00, and <hi>the</hi> last minute <hi>of</hi> <hi>the</hi> day begins at 23:59 and ends at <hi>24</hi>:00, which is identical to 00:00 <hi>of</hi> <hi>the</hi> following day. 12:00 can only be mid-day. Midnight is called <hi>24</hi>:00 and is used to <hi>mean</hi> <hi>the</hi> end <hi>of</hi> <hi>the</hi> day and 00:00 is used to <hi>mean</hi> <hi>the</hi> beginning <hi>of</hi> <hi>the</hi> day. For example, you would say \"Tuesday at <hi>24</hi>:00\" and \"Wednesday at 00:00\" to <hi>mean</hi> exactly <hi>the</hi> same time.",
                "However, <hi>the</hi> US military prefers not to say <hi>24</hi>:00 - they <hi>do</hi> not like to have two names for <hi>the</hi> same thing, so they always say \"23:59\", which is one minute before midnight.",
                "<hi>24</hi>-hour clock time is used <hi>in</hi> computers, military, public safety, and transport. <hi>In</hi> many Asian, European and Latin American countries people use it to write <hi>the</hi> time. Many European people use it <hi>in</hi> speaking.",
                "<hi>In</hi> <hi>railway</hi> timetables <hi>24</hi>:00 means <hi>the</hi> \"end\" <hi>of</hi> <hi>the</hi> day. For example, a train due to arrive at a station during <hi>the</hi> last minute <hi>of</hi> a day arrives at <hi>24</hi>:00; but trains which depart during <hi>the</hi> first minute <hi>of</hi> <hi>the</hi> day go at 00:00."
            ],
            "documentid": "id:wikipedia:wiki::9985",
            "title": "24-hour clock",
            "url": "https://simple.wikipedia.org/wiki?curid=9985"
        }
    }
]

This case combines exact search with nearestNeighbor search. The hybrid rank-profile above also calculates several additional features using tensor expressions:

firstPhase is the score of the first ranking phase, configured in the hybrid profile as cos(distance(field, paragraph_embeddings)).
all_paragraph_similarities returns all the similarity scores for all paragraphs.
avg_paragraph_similarity is the average similarity score across all the paragraphs.
max_paragraph_similarity is the same as firstPhase, but computed using a tensor expression.

These additional features are calculated during second-phase ranking to limit the number of vector computations.
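As a sanity check on how these features relate, the average and maximum can be recomputed client-side from all_paragraph_similarities. The sketch below uses the values printed in the result above; the variable names are illustrative, and the small tolerance allows for rounding differences between the features.

```python
import math

# Cell values copied from the hybrid result above
all_paragraph_similarities = {
    "1": 0.8030083179473877,
    "2": 0.7992785573005676,
    "3": 0.8273358345031738,
    "4": 0.8807156085968018,
    "0": 0.849757194519043,
}
reported_avg = 0.8320191025733947
reported_max = 0.8807156085968018
reported_first_phase = 0.8807156260391702

values = list(all_paragraph_similarities.values())
avg_similarity = sum(values) / len(values)
max_similarity = max(values)

# The recomputed values agree with the match-features up to rounding
print(math.isclose(avg_similarity, reported_avg, rel_tol=1e-6))
print(max_similarity == reported_max)
print(math.isclose(max_similarity, reported_first_phase, rel_tol=1e-6))
```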

The Tensor Playground is useful to play with tensor expressions.

The Hybrid Search blog post series is a good read to learn more about hybrid ranking!

[23]:

def find_paragraph_scores(hit: dict) -> list:
    """Pair each paragraph with its score from all_paragraph_similarities."""
    paragraphs = hit["fields"]["paragraphs"]
    match_features = hit["fields"]["matchfeatures"]
    # Keys are paragraph indexes (as strings), values are similarity scores
    indexes = [int(v) for v in match_features["all_paragraph_similarities"]]
    scores = list(match_features["all_paragraph_similarities"].values())
    return list(zip([paragraphs[i] for i in indexes], scores))

[24]:

find_paragraph_scores(result.hits[0])

[24]:

[('A time <hi>in</hi> <hi>the</hi> <hi>24</hi>-hour clock is written <hi>in</hi> <hi>the</hi> form hours:minutes (for example, 01:23), or hours:minutes:seconds (01:23:45). Numbers under 10 have a zero <hi>in</hi> front (called a leading zero); e.g. 09:07. Under <hi>the</hi> <hi>24</hi>-hour clock system, <hi>the</hi> day begins at midnight, 00:00, and <hi>the</hi> last minute <hi>of</hi> <hi>the</hi> day begins at 23:59 and ends at <hi>24</hi>:00, which is identical to 00:00 <hi>of</hi> <hi>the</hi> following day. 12:00 can only be mid-day. Midnight is called <hi>24</hi>:00 and is used to <hi>mean</hi> <hi>the</hi> end <hi>of</hi> <hi>the</hi> day and 00:00 is used to <hi>mean</hi> <hi>the</hi> beginning <hi>of</hi> <hi>the</hi> day. For example, you would say "Tuesday at <hi>24</hi>:00" and "Wednesday at 00:00" to <hi>mean</hi> exactly <hi>the</hi> same time.',
  0.8030083179473877),
 ('However, <hi>the</hi> US military prefers not to say <hi>24</hi>:00 - they <hi>do</hi> not like to have two names for <hi>the</hi> same thing, so they always say "23:59", which is one minute before midnight.',
  0.7992785573005676),
 ('<hi>24</hi>-hour clock time is used <hi>in</hi> computers, military, public safety, and transport. <hi>In</hi> many Asian, European and Latin American countries people use it to write <hi>the</hi> time. Many European people use it <hi>in</hi> speaking.',
  0.8273358345031738),
 ('<hi>In</hi> <hi>railway</hi> timetables <hi>24</hi>:00 means <hi>the</hi> "end" <hi>of</hi> <hi>the</hi> day. For example, a train due to arrive at a station during <hi>the</hi> last minute <hi>of</hi> a day arrives at <hi>24</hi>:00; but trains which depart during <hi>the</hi> first minute <hi>of</hi> <hi>the</hi> day go at 00:00.',
  0.8807156085968018),
 ('<hi>The</hi> <hi>24</hi>-hour clock is a way <hi>of</hi> telling <hi>the</hi> time <hi>in</hi> which <hi>the</hi> day runs from midnight to midnight and is divided into <hi>24</hi> hours, numbered from 0 to 23. It <hi>does</hi> not use a.m. or p.m. This system is also referred to (only <hi>in</hi> <hi>the</hi> US and <hi>the</hi> English speaking parts <hi>of</hi> Canada) as military time or (only <hi>in</hi> <hi>the</hi> United Kingdom and now very rarely) as continental time. <hi>In</hi> some parts <hi>of</hi> <hi>the</hi> world, it is called <hi>railway</hi> time. Also, <hi>the</hi> international standard notation <hi>of</hi> time (ISO 8601) is based on this format.',
  0.849757194519043)]
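The pairs come back in the order of the tensor cells, not by score. A minimal sketch for ordering them best-first and stripping the highlight tags for display follows. rank_paragraphs and the shortened sample pairs are hypothetical names and data, and the regular expression assumes the only markup present is the hi element.

```python
import re


def rank_paragraphs(pairs: list) -> list:
    """Sort (paragraph, score) pairs by descending score and drop <hi> tags."""
    cleaned = [(re.sub(r"</?hi>", "", text), score) for text, score in pairs]
    return sorted(cleaned, key=lambda pair: pair[1], reverse=True)


# Shortened, hypothetical stand-ins for the pairs returned above
sample_pairs = [
    ("A time <hi>in</hi> <hi>the</hi> <hi>24</hi>-hour clock", 0.8030),
    ("<hi>In</hi> <hi>railway</hi> timetables <hi>24</hi>:00", 0.8807),
    ("<hi>The</hi> <hi>24</hi>-hour clock is a way", 0.8498),
]
for text, score in rank_paragraphs(sample_pairs):
    print(f"{score:.4f}  {text}")
```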

Hybrid search and filter

YQL is a structured query language. In the query examples, the user input is fed as-is using the userQuery() operator.

Filters are normally kept separate from the user input. Below is an example of adding the filter url contains "9985" to the YQL string.

Finally, use the Query API for other options, like highlighting. Here, bolding is disabled:

[25]:

result = app.query(
    body={
        "yql": 'select * from wiki where url contains "9985" and (userQuery() or ({targetHits:1}nearestNeighbor(paragraph_embeddings,q)))',
        "input.query(q)": "embed(what does 24 mean in the context of railways)",
        "query": "what does 24 mean in the context of railways",
        "ranking.profile": "hybrid",
        "bolding": False,
        "presentation.format.tensors": "short-value",
    }
)
if len(result.hits) != 1:
    raise ValueError("Expected one hit, got {}".format(len(result.hits)))
print(json.dumps(result.hits, indent=4))

[
    {
        "id": "id:wikipedia:wiki::9985",
        "relevance": 4.307079208249452,
        "source": "wiki_content",
        "fields": {
            "matchfeatures": {
                "bm25(paragraphs)": 10.468827250036052,
                "bm25(title)": 1.1272217840066168,
                "closest(paragraph_embeddings)": {
                    "type": "tensor<float>(p{})",
                    "cells": {
                        "4": 1.0
                    }
                },
                "firstPhase": 0.8807156260391702,
                "all_paragraph_similarities": {
                    "type": "tensor<float>(p{})",
                    "cells": {
                        "1": 0.8030083179473877,
                        "2": 0.7992785573005676,
                        "3": 0.8273358345031738,
                        "4": 0.8807156085968018,
                        "0": 0.849757194519043
                    }
                },
                "avg_paragraph_similarity": 0.8320191025733947,
                "max_paragraph_similarity": 0.8807156085968018
            },
            "sddocname": "wiki",
            "paragraphs": [
                "The 24-hour clock is a way of telling the time in which the day runs from midnight to midnight and is divided into 24 hours, numbered from 0 to 23. It does not use a.m. or p.m. This system is also referred to (only in the US and the English speaking parts of Canada) as military time or (only in the United Kingdom and now very rarely) as continental time. In some parts of the world, it is called railway time. Also, the international standard notation of time (ISO 8601) is based on this format.",
                "A time in the 24-hour clock is written in the form hours:minutes (for example, 01:23), or hours:minutes:seconds (01:23:45). Numbers under 10 have a zero in front (called a leading zero); e.g. 09:07. Under the 24-hour clock system, the day begins at midnight, 00:00, and the last minute of the day begins at 23:59 and ends at 24:00, which is identical to 00:00 of the following day. 12:00 can only be mid-day. Midnight is called 24:00 and is used to mean the end of the day and 00:00 is used to mean the beginning of the day. For example, you would say \"Tuesday at 24:00\" and \"Wednesday at 00:00\" to mean exactly the same time.",
                "However, the US military prefers not to say 24:00 - they do not like to have two names for the same thing, so they always say \"23:59\", which is one minute before midnight.",
                "24-hour clock time is used in computers, military, public safety, and transport. In many Asian, European and Latin American countries people use it to write the time. Many European people use it in speaking.",
                "In railway timetables 24:00 means the \"end\" of the day. For example, a train due to arrive at a station during the last minute of a day arrives at 24:00; but trains which depart during the first minute of the day go at 00:00."
            ],
            "documentid": "id:wikipedia:wiki::9985",
            "title": "24-hour clock",
            "url": "https://simple.wikipedia.org/wiki?curid=9985"
        }
    }
]

In short, the above query demonstrates how easy it is to combine various ranking strategies, and to combine them with filters.
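When filters come from application logic rather than from the user, it can help to assemble the YQL string in one place. The sketch below is only string composition under stated assumptions: build_hybrid_yql is a hypothetical helper, not part of pyvespa, and it parenthesizes the retrieval clause so that the filter applies to both the keyword branch and the nearestNeighbor branch.

```python
from typing import Optional


def build_hybrid_yql(filter_expr: Optional[str] = None, target_hits: int = 1) -> str:
    """Compose a hybrid YQL query, optionally restricted by a filter expression."""
    retrieval = (
        "userQuery() or "
        f"({{targetHits:{target_hits}}}nearestNeighbor(paragraph_embeddings,q))"
    )
    if filter_expr:
        # Group the retrieval clause so the filter restricts both branches
        return f"select * from wiki where {filter_expr} and ({retrieval})"
    return f"select * from wiki where {retrieval}"


print(build_hybrid_yql('url contains "9985"'))
print(build_hybrid_yql())
```

The resulting string can be passed as the "yql" value in the app.query body used throughout this guide.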

To learn more about pre-filtering vs post-filtering, read Filtering strategies and serving performance. Semantic search with multi-vector indexing is a great read overall for this domain.

Cleanup

[26]:

vespa_docker.container.stop()
vespa_docker.container.remove()