KeywordSearchWidget

This Not That (TNT) provides a keyword-based search bar widget that efficiently searches long-form text for keywords. This page outlines the core functionality of the KeywordSearchWidget, and why and when it may be useful. Note that the KeywordSearchWidget is designed specifically for longer text that can be usefully broken up into words for faster keyword searching. If you want a more featureful or general-purpose search, use the SearchWidget instead.

The first step is to load thisnotthat and panel.

[1]:
import thisnotthat as tnt
import panel as pn

To make Panel based objects interactive within a notebook we need to load the panel extension.

[2]:
pn.extension()

Now we need some data to use as an example. In this case we’ll use the 20-Newsgroups dataset, which is easily accessible via scikit-learn. The data consists of posts from twenty different newsgroups from the 1990s. Most posts are relatively short (a paragraph or two), but others can be considerably longer. We will clean up the data by removing overly short and excessively long posts. We also need to make a data map of the newsgroup posts. For that we’ll use the Universal Sentence Encoder to create vector representations of the posts, and UMAP to build the map from the vectors.

[3]:
import sklearn.datasets
import tensorflow_hub as hub
import umap
import numpy as np
[4]:
dataset = sklearn.datasets.fetch_20newsgroups(
    subset="all", remove=("headers", "footers", "quotes")
)

# Keep only posts longer than 128 and shorter than 16384 characters.
good_length = [128 < len(t) < 16384 for t in dataset["data"]]
# Apply the same mask to targets and texts so they stay aligned.
targets = np.array(dataset.target)[good_length]
news_data = [t for t, keep in zip(dataset["data"], good_length) if keep]
news_labels = [dataset.target_names[x] for x in targets]

Having extracted and cleaned the data, we can load the Universal Sentence Encoder (USE) and embed the posts with it. The resulting vectors can then be passed to UMAP to create a map representation.

[5]:
use_embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
use_vectors = use_embed(news_data).numpy()
use_map = umap.UMAP(metric="cosine", random_state=42).fit_transform(use_vectors)

We can use a BokehPlotPane to visualize the result. We will use the newsgroup names as labels, and the first 256 characters of each post as hover text to make it easier to navigate the data.

[6]:
plot = tnt.BokehPlotPane(
    use_map,
    labels=news_labels,
    hover_text=["▶ " + x[:256] + "..." for x in news_data],
    marker_size=0.05,
    width=900,
    height=700,
    title="Newsgroup data map",
)

A quick visual check shows that our PlotPane data map looks like the sort of thing we want.

[7]:
plot.pane
[7]:

To create a KeywordSearchWidget you need to pass it the text data to be searched over. In our case that is the raw text of the posts (as a list). Internally the KeywordSearchWidget parses the texts into words and searches only over those words, quickly finding every text that has a word containing the search query. We will also link the search to the plot.

[8]:
search = tnt.KeywordSearchWidget(news_data)
search.link_to_plot(plot)
search
[8]:
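
The word-level search described above can be pictured with a small standalone sketch. This is not the KeywordSearchWidget's actual implementation; it is a minimal illustration that assumes simple lowercased tokenization on word characters, and the names build_word_index, search_word_index and toy_posts are hypothetical. The idea is that a query is compared against each distinct word once, rather than against the full text of every post.

```python
import re
from collections import defaultdict


def build_word_index(texts):
    """Map each lowercased word to the set of indices of texts containing it."""
    index = defaultdict(set)
    for i, text in enumerate(texts):
        # Assumed tokenization: runs of word characters, lowercased.
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(i)
    return index


def search_word_index(index, query):
    """Return sorted indices of texts that have a word containing the query."""
    query = query.lower()
    hits = set()
    # Scan the vocabulary (one entry per distinct word), not the raw texts.
    for word, text_ids in index.items():
        if query in word:
            hits |= text_ids
    return sorted(hits)


# Hypothetical toy data standing in for the newsgroup posts.
toy_posts = [
    "The shuttle launch was delayed again.",
    "Launching a new graphics card driver.",
    "Hockey playoffs start next week.",
]
toy_index = build_word_index(toy_posts)
print(search_word_index(toy_index, "launch"))  # [0, 1]
```

Because the vocabulary of a corpus is usually much smaller than its total text, scanning words instead of documents is what makes this style of search fast on long-form text.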

With this done we can create a simple Column layout of the BokehPlotPane and our KeywordSearchWidget. You can enter space-separated keywords in the search box; hitting the search button selects every post that has a word containing any of the search keywords. Note that this search will only work live in a running notebook.

[9]:
pn.Column(search, plot)
[9]:
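
To make the "any of the keywords" behaviour concrete, here is a minimal sketch of the selection rule described above. It is an illustration under stated assumptions, not the widget's code: the function select_posts and the prebuilt toy_word_index are hypothetical, and case-insensitive matching is assumed.

```python
def select_posts(word_index, query):
    """Indices of posts with a word containing any whitespace-separated keyword."""
    keywords = query.lower().split()
    selected = set()
    for keyword in keywords:
        for word, post_ids in word_index.items():
            # Substring match on words: "orbit" also matches "orbital".
            if keyword in word:
                selected.update(post_ids)
    return sorted(selected)


# Hypothetical prebuilt index: word -> indices of the posts containing it.
toy_word_index = {
    "shuttle": {0},
    "orbit": {0, 3},
    "hockey": {1},
    "playoffs": {1},
    "encryption": {2},
    "orbital": {3},
}

print(select_posts(toy_word_index, "orbit hockey"))  # [0, 1, 3]
print(select_posts(toy_word_index, ""))  # []
```

Results from the individual keywords are combined as a union, so adding more keywords to a query can only widen the selection, never narrow it.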