VectorSearchWidget
This Not That (TNT) provides a vector search widget that can search over a simple vector database. This enables things like semantic search, or reverse image search, on embeddings of text or images. We will outline the core functionality of the VectorSearchWidget and why and when it may be useful. Note that the VectorSearchWidget requires an “embedder” that can embed queries into the same vector space as the vector representation of the data. If you want a more featureful or general-purpose search you should use the SearchWidget.
The first step is to load thisnotthat and panel.
[1]:
import thisnotthat as tnt
import panel as pn
2023-01-26 16:52:17.573902: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
To make Panel-based objects interactive within a notebook we need to load the panel extension.
[2]:
pn.extension()
Now we need some data to use as an example. In this case we’ll use the 20-Newsgroups dataset, which we can access easily via scikit-learn. The data itself consists of posts from twenty different newsgroups from the 1990s. Most posts are relatively short (a paragraph or two), but others can be considerably longer. We will clean up the data by removing overly short and excessively long posts. We also need to make a data map of the newsgroup posts. For that we’ll use the Universal Sentence Encoder (USE) to create vector representations of the posts, and UMAP to build the map from the vectors. We can then re-use the Universal Sentence Encoder as our “embedder” to convert query text into vectors.
[3]:
import sklearn.datasets
import tensorflow_hub as hub
import umap
import numpy as np
[4]:
dataset = sklearn.datasets.fetch_20newsgroups(
    subset="all", remove=("headers", "footers", "quotes")
)
good_length = [(len(t) > 128) and (len(t) < 16384) for t in dataset["data"]]
targets = np.array(dataset.target)[good_length]
news_data = [t for t in dataset["data"] if (len(t) > 128) and (len(t) < 16384)]
news_labels = [dataset.target_names[x] for x in targets]
Having extracted and cleaned the data we can load USE and embed the posts with it. The resulting vectors can then be passed to UMAP to create a map representation. The vectors themselves will be kept to be passed into the VectorSearchWidget, along with the USE instance as the embedder.
[5]:
use_embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
use_vectors = use_embed(news_data).numpy()
use_map = umap.UMAP(metric="cosine", random_state=42).fit_transform(use_vectors)
2023-01-26 16:52:20.180180: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
We can use a BokehPlotPane to visualize the result. We will use the newsgroups as labels, and the first 256 characters of each post as hover text to make it easier to navigate the data.
[6]:
plot = tnt.BokehPlotPane(
use_map,
labels=news_labels,
hover_text=["▶ " + x[:256] + "..." for x in news_data],
marker_size=0.05,
width=900,
height=700,
title="Newsgroup data map",
)
A quick visual check shows that our PlotPane data map looks like the sort of thing we want.
[7]:
plot.pane
[7]:
To create a VectorSearchWidget you need to pass it the vector representation of the data to be searched over, and a function or callable that can convert the search value into a vector in the same space. In our case that is the USE vector representations of the posts, and the use_embed object, which acts as a callable on text strings. Internally the VectorSearchWidget will construct a search index over the vectors, allowing for fast searching – note that the index construction may take a little time, particularly for large vector datasets.
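To make the idea concrete, the search amounts to a nearest-neighbor lookup in the embedding space. The following is a minimal sketch of that lookup using brute-force cosine similarity over random stand-in vectors; the function name and data here are illustrative, not TNT’s actual internals (which use a proper search index rather than brute force):

```python
import numpy as np

def cosine_top_k(vectors, query_vector, k=5):
    """Return indices of the k vectors most similar to the query (cosine)."""
    # Normalise rows so that dot products equal cosine similarities
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query_vector / np.linalg.norm(query_vector)
    sims = unit @ q
    # argsort is ascending: take the last k and reverse for best-first order
    return np.argsort(sims)[-k:][::-1]

rng = np.random.default_rng(42)
data_vectors = rng.normal(size=(100, 16))                  # stand-in for use_vectors
query = data_vectors[3] + rng.normal(scale=0.01, size=16)  # a query near item 3
print(cosine_top_k(data_vectors, query, k=5))              # item 3 ranks first
```

In the widget, the query vector comes from calling the embedder on the search text, and the resulting indices are the rows selected in the linked plot.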
Once the VectorSearchWidget has been constructed we can link it to the plot to enable the search. The VectorSearchWidget comes with a slider to determine the number of search results to return. You can set the value of the slider, and the maximum value allowed, via keyword arguments. Choosing particularly large values will make the search quite slow.
[8]:
search = tnt.VectorSearchWidget(
    use_vectors, use_embed, n_query_results=50, max_query_results=200
)
search.link_to_plot(plot)
search
[8]:
With this done we can create a simple Column layout of the BokehPlotPane and our VectorSearchWidget. The search will be a semantic search, finding posts most similar in meaning to the search term or phrase entered. Note that this search will only work live in a running notebook.
[9]:
pn.Column(search, plot)
[9]:
If you wish to go beyond text-based search (for example, using CLIP embeddings of text to search CLIP embeddings of images can be very effective), the VectorSearchWidget supports a file-based input mode:
[10]:
file_input_search = tnt.VectorSearchWidget(use_vectors, use_embed, input_type="file")
file_input_search
[10]:
The “Choose file” button will open a file-chooser. The selected file will then be read in as a bytestring and passed along to the “embedder”. In the special case of text files the bytestring will be converted into text; in the case of image files the image will be parsed and converted into a numpy array representation; in all other cases the raw bytestring will be passed to the embedder. It is therefore likely necessary to provide a function wrapper around any neural network embedder to handle preprocessing of the text, numpy array, or bytestring.
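As a sketch of what such a wrapper might look like, the function below normalises the different payload types the widget could hand over before calling the embedding model. All names here (`embed_any`, `dummy_embedder`) are hypothetical, and `dummy_embedder` is a toy stand-in for a real text encoder such as USE:

```python
import numpy as np

def dummy_embedder(texts):
    """Toy stand-in for a text embedding model: maps each string to a vector."""
    return np.array([[len(t), sum(map(ord, t)) % 97] for t in texts], dtype=float)

def embed_any(payload):
    """Normalise widget input (str, bytes, or ndarray) before embedding."""
    if isinstance(payload, bytes):
        # Text-file case: the widget hands over a raw bytestring
        payload = payload.decode("utf-8", errors="replace")
    if isinstance(payload, str):
        return dummy_embedder([payload])[0]
    if isinstance(payload, np.ndarray):
        # Image-file case: a real wrapper would run an image encoder here
        raise NotImplementedError("attach an image encoder for ndarray input")
    raise TypeError(f"unsupported input type: {type(payload)!r}")

print(embed_any(b"hello world"))  # bytes are decoded and embedded like text
```

A wrapper like this can then be passed as the embedder argument in place of the raw model, so that text entered in the search box and files chosen via the button are handled uniformly.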