Annotating a Data Map Using Joint Vector Spaces

A data map all by itself contains useful information, but it can be hard to quickly orient yourself and understand what different clusters and relationships mean without looking in detail at the original source data for points in different regions of the map – an often tedious process. Layering textual annotation labels on clusters or regions can go a long way toward making a map easier to understand at a glance, and faster to navigate when seeking specific regions of interest guided by the cluster labels. In this tutorial we will look at an approach that embeds data useful for cluster labelling into the same vector space as the source data of the map, and uses that joint space to label clusters.

First we’ll need some libraries. To get the data and produce a map of it we’ll use a combination of sklearn’s dataset fetcher, Universal Sentence Encoder from tensorflow_hub, and UMAP for data mapping.

[1]:
import sklearn.datasets
import tensorflow_hub as hub
import umap
import numpy as np

As a simple example we’ll be working with the 20-Newsgroups data. These are posts from twenty different newsgroups from the 1990s – relatively short texts on a variety of topics. This can be fetched via sklearn. Since the actual content of newsgroup posts is pretty diverse (from short single-word replies to enormous encoded binaries) we’ll tidy it up a little, pruning out the really short entries and the really long ones.

[2]:
dataset = sklearn.datasets.fetch_20newsgroups(
    subset="all", remove=("headers", "footers", "quotes")
)

good_length = [(len(t) > 128) and (len(t) < 16384) for t in dataset["data"]]
targets = np.array(dataset.target)[good_length]
news_data = [t for t in dataset["data"] if (len(t) > 128) and (len(t) < 16384)]

Text documents are, of course, not vectors suitable for mapping. We’ll solve that by using the Universal Sentence Encoder (USE) to convert the text into vector representations. We picked USE because it is readily accessible and easy to use. To embed text with it we first load it from tensorflow_hub, and then simply apply it to our list of text items, converting the resulting vectors to numpy format when we are done.

[3]:
use_embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
2022-08-22 14:44:50.131795: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-08-22 14:44:50.132012: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-22 14:44:50.133959: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2022-08-22 14:44:52.453417: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-08-22 14:44:52.750309: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2095190000 Hz
[4]:
%%time
use_vectors = use_embed(news_data).numpy()
CPU times: user 47.1 s, sys: 4.35 s, total: 51.5 s
Wall time: 38.2 s

Next we need a data map to explore. We can get that by passing the USE vectors to UMAP to reduce them to 2D. Note that USE vectors measure distance via cosine distance, so we need to pass that information to UMAP.

[5]:
use_map = umap.UMAP(metric="cosine", random_state=42).fit_transform(use_vectors)

Now we have a data map of the 20 newsgroup data. This provides a starting point for using TNT.

A Basic Interactive Plot

First we will get an interactive TNT plot of our data map working. For this we will need to import the TNT library, as well as the panel library which provides the infrastructure for building and composing TNT elements.

[6]:
import thisnotthat as tnt
import panel as pn

The next important step, to make use of TNT inside a notebook, is to enable panel’s extensions – we do this by calling the extension function. This will enable panel to render straight to a notebook, even if that involves communication back to the server for interactions.

[7]:
pn.extension()

To get started we’ll make a simple plot using the BokehPlotPane. There are other plot pane options, but the Bokeh pane is the richest in features, and supports the addition of the text annotation layers we’ll be using later. In the most basic usage you simply pass a data map in. In this case, since we have no class label information for the legend to show, we’ll turn the legend off.

[8]:
basic_plot = tnt.BokehPlotPane(
    use_map,
    show_legend=False,
)

To display the plot we make a panel Row and put the plot in it. If we had other panel or TNT elements we wanted to add we could simply add them as extra arguments to the row (or use a more advanced layout if we liked).

[9]:
pn.Row(basic_plot)
[9]:

Immediately we have an interactive plot that we can zoom and pan around in. We also have hover tooltips. This plot, however, doesn’t really tell us much about our data. To fix that let’s extract some more information we can use to enrich the plot.

We do have class labels for all the data points – which newsgroup they came from. And we can use the length of the post to define a size for the points in the plot. Finally we have the text of the newsgroup post itself. We can add a truncated version of that as hover_text to appear in the tooltip when we hover over points.

[10]:
labels = [dataset.target_names[x] for x in targets]
sizes = [np.sqrt(len(x)) / 1024 for x in news_data]
hover_text = [x[:384] + " ... trimmed" if len(x) > 384 else x for x in news_data]

One last thing: the twenty different newsgroups actually group together into five or six overall themes. To make things a little easier to see let’s create a custom colour palette so that related newsgroups can have similar colours.

[11]:
import seaborn as sns

religion = ("alt.atheism", "talk.religion.misc", "soc.religion.christian")
politics = ("talk.politics.misc", "talk.politics.mideast", "talk.politics.guns")
sport = ("rec.sport.baseball", "rec.sport.hockey")
comp = (
    "comp.graphics",
    "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware",
    "comp.sys.mac.hardware",
    "comp.windows.x",
)
sci = (
    "sci.crypt",
    "sci.electronics",
    "sci.med",
    "sci.space",
)
misc = (
    "misc.forsale",
    "rec.autos",
    "rec.motorcycles",
)

COLOR_KEY = {}
COLOR_KEY.update(zip(religion, sns.color_palette("Blues", 4).as_hex()[1:]))
COLOR_KEY.update(zip(politics, sns.color_palette("Purples", 4).as_hex()[1:]))
COLOR_KEY.update(zip(comp, sns.color_palette("YlOrRd", 5).as_hex()))
COLOR_KEY.update(zip(sci, sns.color_palette("light:teal", 5).as_hex()[1:]))
COLOR_KEY.update(zip(sport, sns.color_palette("light:#660033", 4).as_hex()[1:3]))
COLOR_KEY.update(zip(misc, sns.color_palette("YlGn", 4).as_hex()[1:]))

Now let’s pass all of that to our PlotPane. We will also set a min_point_size and max_point_size so that if we zoom out a lot, we’ll still see some points, and if we zoom in the points will get smaller in apparent size, allowing us to avoid overlap in dense areas if we zoom in enough.

[12]:
enriched_plot = tnt.BokehPlotPane(
    use_map,
    labels=labels,
    hover_text=hover_text,
    marker_size=sizes,
    label_color_mapping=COLOR_KEY,
    show_legend=False,
    min_point_size=0.001,
    max_point_size=0.05,
    title="20-Newsgroups Data Map",
)

Now we display the plot as before, using a panel Row.

[13]:
pn.Row(enriched_plot)
[13]:

The result is much richer – we have a sense of how the different newsgroups are distributed in the map, we can see the long and short documents clearly, and the mouseover tooltips allow us to quickly get an idea of the content of posts.

It would be nice, however, to go another step and have textual annotations labelling the different clusters to give some idea of their content. Ideally we could also zoom in and see finer-grained cluster labels as well. Let’s get started building that.

Generating Annotation Vectors

To label clusters using the joint vector space approach we need some data for which we have short textual representations that we can embed into the same vector space as the 20-newsgroups documents. The obvious choice here is words, so the trick is to get a good word list, and then somehow embed the words into the same space as the 20-newsgroup documents. This is actually relatively easy – we can get a list of useful words for labelling from the documents themselves, and USE will work just fine as a word embedding tool, putting the words into the same USE vector space as the documents.

First we’ll load some libraries to do the parsing of the documents to extract a word list. There are much fancier NLP approaches we could use, but to keep everything in this tutorial simple and accessible we’ll simply use the very basic tokenizer from sklearn’s count vectorizer, and a Counter to extract frequently used words.

[14]:
import sklearn.feature_extraction
import itertools
from collections import Counter

A more advanced approach would use NLTK, SpaCy, tokenizers, or sentencepiece, and do all of this more carefully, but we’ll stick with libraries people will already have installed to make the tutorial relatively self contained. Sklearn’s CountVectorizer class does some basic tokenization via regex, and some basic case folding, so we’ll simply make use of that, co-opting the relevant methods from an instantiation of CountVectorizer.

To prune down the vocabulary we can use the Counter class to simply pick out the twenty-thousand most commonly used words, and then use sklearn’s ENGLISH_STOP_WORDS to remove the stop words as well (we could probably leave them in, but it doesn’t hurt to take them out).

[15]:
%%time
cv = sklearn.feature_extraction.text.CountVectorizer(lowercase=True)
sk_word_tokenize = cv.build_tokenizer()
sk_preprocesser = cv.build_preprocessor()
tokenize = lambda doc: sk_word_tokenize(sk_preprocesser(doc))
tokenized_news = [tokenize(doc) for doc in news_data]
token_counts = Counter(itertools.chain.from_iterable(tokenized_news))
vocabulary = [word for word, count in token_counts.most_common(n=20000)
              if word not in sklearn.feature_extraction.text.ENGLISH_STOP_WORDS]
CPU times: user 1.79 s, sys: 91.8 ms, total: 1.88 s
Wall time: 1.88 s

That leaves us with a vocabulary of just under 20000 words:

[16]:
len(vocabulary)
[16]:
19694

To convert that to vectors in the same space as our 20-newsgroup vectors we simply apply use_embed on the list of words and convert to numpy. That gives us a set of vectors for labelling!

[17]:
%%time
use_word_vectors = use_embed(vocabulary).numpy()
CPU times: user 1.15 s, sys: 449 ms, total: 1.6 s
Wall time: 374 ms

Adding Annotation Layers

Our goal is to tag regions of the map with word-based annotations giving some indication of the content of documents in that region. We can do this by clustering the documents, and finding words related to each cluster. To find words associated with a cluster we can use the fact that we have the words and the documents in the same joint space, and look for the words closest to the centroid of the cluster. This is, in fact, exactly the approach that Top2Vec uses to label its topics.

So, in summary, much like Top2Vec, we want to use UMAP and HDBSCAN to cluster documents, and then use the joint space word-document representation to label clusters with a set of words – and then do this in a hierarchical fashion so we can have broad high-level labels for large clusters, and fine-grained labels for all the smaller low-level clusters. Ideally we would even do some outlier detection to prune away small outlying clusters from our high-level clusterings. A rough sketch of the core centroid-labelling step is shown below.
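
This sketch is purely illustrative and is not TNT’s actual implementation: it clusters the documents with plain KMeans for brevity (rather than UMAP plus HDBSCAN, and with no hierarchy or outlier pruning), computes each cluster’s centroid in USE space, and returns the vocabulary words whose USE vectors are most cosine-similar to that centroid. The helper name label_words_for_cluster is made up for illustration; use_vectors, use_word_vectors, and vocabulary are the variables defined earlier in this tutorial.

import numpy as np
import sklearn.cluster

# Cluster the full-dimensional USE document vectors; KMeans is used here
# purely for brevity -- the approach described above uses UMAP and HDBSCAN
cluster_ids = sklearn.cluster.KMeans(n_clusters=20, random_state=0).fit_predict(use_vectors)

def label_words_for_cluster(cluster_id, n_words=3):
    # Centroid of the documents in the cluster, normalised for cosine similarity
    centroid = use_vectors[cluster_ids == cluster_id].mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    # Normalise the word vectors and rank them by cosine similarity to the centroid
    word_vecs = use_word_vectors / np.linalg.norm(use_word_vectors, axis=1, keepdims=True)
    similarities = word_vecs @ centroid
    return [vocabulary[i] for i in np.argsort(similarities)[::-1][:n_words]]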

Fortunately TNT wraps all of that work up in an easy to use JointVectorLabelLayers class. The class has four required arguments: the source data (the high dimensional vectors for our documents in this case), the map representation (as produced by UMAP earlier), the labelling vectors that live in the same vector space as the source data (our word vectors), and a dictionary mapping indices of the labelling vectors to textual labels.

The JointVectorLabelLayers class also supports a wide variety of optional keyword arguments to help control the clustering and outlier detection. In this case we won’t worry too much about them, but to save time we’ll set cluster_map_representation to True so it will simply use the UMAP map representation we already have (instead of UMAPing to five dimensions for clustering purposes), and require the number of clusters in the highest level labelling layer to be at least five. You can see the JointVectorLabelLayers API docs for more details on the various keyword options.

[18]:
label_layers = tnt.JointVectorLabelLayers(
    use_vectors,
    use_map,
    use_word_vectors,
    {index:word for index, word in enumerate(vocabulary)},
    cluster_map_representation=True,
    min_clusters_in_layer=5,
    random_state=0,
)

Let’s now create a new plot, and add the textual annotations to it. The plot creation works essentially as before, and adding the textual annotations is as simple as using the add_cluster_labels method. This method includes a number of optional keyword arguments to tweak the aesthetics of how the textual labels are rendered.

[19]:
annotated_plot = tnt.BokehPlotPane(
    use_map,
    labels=labels,
    hover_text=hover_text,
    marker_size=sizes,
    label_color_mapping=COLOR_KEY,
    width=700,
    height=600,
    min_point_size=0.001,
    max_point_size=0.05,
    title="20-Newsgroups Data Map",
)
annotated_plot.add_cluster_labels(label_layers, max_text_size=24)

Now we can display the plot, much as before.

[20]:
pn.Row(annotated_plot)
[20]:

Now we have textual labels to help guide our exploration of the map. The top level cluster labels struggle to explain the over-arching concepts using specific words, but they do give a good indication of what sort of content one might expect. Zooming in, however, reveals further levels of labelling, which apply to more specific areas of the map and are much better targeted to the content they are summarising. Zooming in on the purple region roughly in the center of the plot quickly highlights the different content of talk.politics.misc and talk.politics.guns, for example. It is worth taking a little time to zoom and pan around the map, and see for yourself how well the labelling actually works – given the very simple approach used to generate it. More careful NLP-based work to generate joint word and document vectors will only improve the labelling quality.