Annotating a Data Map Using Sparse Metadata

A data map all by itself contains useful information, but it can be hard to quickly orient yourself and understand what different clusters and relationships mean without looking in detail at the original source data for points in different regions of the map – an often tedious process. Layering textual annotation labels on clusters and regions can go a long way toward making a map easier to understand at a glance, and faster to navigate when you are guided to specific regions of interest by the cluster labels. In this tutorial we will look at an approach that uses metadata associated with the vector data that has a very large number of features (and is thus less suitable for the MetadataLabelLayer approach), but where each sample has only a small number of non-zero values across those features.

First we’ll need some libraries. To get the data and produce a map of it we’ll use a combination of sklearn’s dataset fetcher, Universal Sentence Encoder from tensorflow_hub, and UMAP for data mapping.

[1]:
import sklearn.datasets
import tensorflow_hub as hub
import umap
import numpy as np
2022-12-12 15:44:55.281863: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

As a simple example we’ll be working with the 20-Newsgroups data. These are posts from twenty different newsgroups from the 1990s – relatively short texts on a variety of topics. This can be fetched via sklearn. Since the actual content of newsgroup posts is pretty diverse (from short single word replies, to enormous encoded binaries) we’ll tidy it up a little, pruning out the really short entries and the really long ones.

[2]:
dataset = sklearn.datasets.fetch_20newsgroups(
    subset="all", remove=("headers", "footers", "quotes")
)

good_length = [(len(t) > 128) and (len(t) < 16384) for t in dataset["data"]]
targets = np.array(dataset.target)[good_length]
news_data = [t for t in dataset["data"] if (len(t) > 128) and (len(t) < 16384)]

Text documents are, of course, not vectors suitable for mapping. We’ll solve that by using the Universal Sentence Encoder (USE) to convert the text into vector representations. We picked USE because it is readily accessible and easy to use. To embed text with it we first load it from tensorflow_hub, and then simply apply it to our list of text items, converting the resulting vectors to numpy format when we are done.

[3]:
use_embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
2022-12-12 15:44:58.784956: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-12 15:44:58.786619: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
[4]:
%%time
use_vectors = use_embed(news_data).numpy()
CPU times: user 27.8 s, sys: 3.08 s, total: 30.8 s
Wall time: 23.4 s
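
As a quick sanity check – an extra step, not part of the original notebook – the version 4 USE model produces 512-dimensional embeddings, so the resulting array should have one row per document and 512 columns:

# Quick sanity check: one 512-dimensional vector per document
print(use_vectors.shape)  # expected (len(news_data), 512)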

Next we need a data map to explore. We can get that by passing the USE vectors to UMAP to reduce them to 2D. Note that USE vectors are compared using cosine distance, so we need to pass that information to UMAP.

[5]:
use_map = umap.UMAP(metric="cosine", random_state=42).fit_transform(use_vectors)

Now we have a data map of the 20-Newsgroups posts. This provides a starting point for using TNT.

A Basic Interactive Plot

First we will get an interactive TNT plot of our data map working. For this we will need to import the TNT library, as well as the panel library which provides the infrastructure for building and composing TNT elements.

[6]:
import thisnotthat as tnt
import panel as pn

The next important step to make use of TNT inside a notebook is to enable panel’s extensions – we do this by calling the extension function. This enables panel to render straight into the notebook, even when that involves communication back to the server for interactions.

[7]:
pn.extension()

To get started we’ll make a simple plot using the BokehPlotPane. There are other plot pane options, but the Bokeh pane is the richest in features, and supports the addition of the text annotation layers we’ll be using later. In the most basic usage you simply pass a data map in. In this case, since we have no class label information for the legend to show, we’ll turn the legend off.

[8]:
basic_plot = tnt.BokehPlotPane(
    use_map,
    show_legend=False,
)

To display the plot we make a panel Row and put the plot in it. If we had other panel or TNT elements we wanted to add, we could simply add them as extra arguments to the Row (or use a more advanced layout if we liked).

[9]:
pn.Row(basic_plot)
[9]:

Immediately we have an interactive plot that we can zoom and pan around in. We also have hover tooltips. This plot, however, doesn’t really tell us much about our data. To fix that let’s extract some more information we can use to enrich the plot.

We do have class labels for all the data points – which newsgroup they came from. And we can use the length of the post to define a size for the points in the plot. Finally we have the text of the newsgroup post itself. We can add a truncated version of that as hover_text to appear in the tooltip when we hover over points.

[10]:
labels = [dataset.target_names[x] for x in targets]
sizes = [np.sqrt(len(x)) / 1024 for x in news_data]
hover_text = [x[:384] + " ... trimmed" if len(x) > 384 else x for x in news_data]

One last thing: the twenty different newsgroups actually group together into five or six overall themes. To make things a little easier to see let’s create a custom colour palette so that related newsgroups can have similar colours.

[11]:
import seaborn as sns

religion = ("alt.atheism", "talk.religion.misc", "soc.religion.christian")
politics = ("talk.politics.misc", "talk.politics.mideast", "talk.politics.guns")
sport = ("rec.sport.baseball", "rec.sport.hockey")
comp = (
    "comp.graphics",
    "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware",
    "comp.sys.mac.hardware",
    "comp.windows.x",
)
sci = (
    "sci.crypt",
    "sci.electronics",
    "sci.med",
    "sci.space",
)
misc = (
    "misc.forsale",
    "rec.autos",
    "rec.motorcycles",
)

COLOR_KEY = {}
COLOR_KEY.update(zip(religion, sns.color_palette("Blues", 4).as_hex()[1:]))
COLOR_KEY.update(zip(politics, sns.color_palette("Purples", 4).as_hex()[1:]))
COLOR_KEY.update(zip(comp, sns.color_palette("YlOrRd", 5).as_hex()))
COLOR_KEY.update(zip(sci, sns.color_palette("light:teal", 5).as_hex()[1:]))
COLOR_KEY.update(zip(sport, sns.color_palette("light:#660033", 4).as_hex()[1:3]))
COLOR_KEY.update(zip(misc, sns.color_palette("YlGn", 4).as_hex()[1:]))
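
Before moving on it can be worth a quick check – an extra cell, not from the original tutorial – that the palette covers every one of the twenty newsgroup names:

# Every newsgroup should have a colour assigned in COLOR_KEY
assert set(COLOR_KEY) == set(dataset.target_names), "some newsgroups are missing a colour"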

Now let’s pass all of that to our PlotPane. We will also set a min_point_size and max_point_size so that the points remain visible if we zoom out a long way, and shrink in apparent size as we zoom in, which helps avoid overlap in dense areas.

[12]:
enriched_plot = tnt.BokehPlotPane(
    use_map,
    labels=labels,
    hover_text=hover_text,
    marker_size=sizes,
    label_color_mapping=COLOR_KEY,
    show_legend=False,
    min_point_size=0.001,
    max_point_size=0.05,
    title="20-Newsgroups Data Map",
)

Now we display the plot as before, using a panel Row.

[13]:
pn.Row(enriched_plot)
[13]:

The result is much richer – we have a sense of how the different newsgroups are distributed in the map, we can see the long and short documents clearly, and the mouseover tooltips allow us to quickly get an idea of the content of posts.

It would be nice, however, to go another step and have textual annotations labelling the different clusters to give some idea of their content. Ideally we could also zoom in and see finer grained cluster labels as well. Let’s get started building that.

Generating Annotation Vectors

To label clusters using the sparse metadata approach we need metadata associated with our USE embedding vectors. We chose documents for this example precisely because there is a natural sparse metadata representation for documents – the term-frequency matrix for the corpus. This is simply the bag-of-words representation, and can be generated with sklearn’s CountVectorizer.

[14]:
import sklearn.feature_extraction

We can simply run CountVectorizer on the full corpus of documents and have it return a sparse matrix. We will keep the fitted model around as well, since we will also need to be able to map column indices of the sparse data to feature names (for feature names we’ll be using the word/term associated with the given column).

[15]:
%%time
cv = sklearn.feature_extraction.text.CountVectorizer(lowercase=True, min_df=10)
sparse_metadata = cv.fit_transform(news_data)
CPU times: user 1.46 s, sys: 3.83 ms, total: 1.46 s
Wall time: 1.46 s

This provides us with associated metadata that has a very high number of features (over twelve thousand!) but which is quite sparse:

[16]:
sparse_metadata
[16]:
<16384x12746 sparse matrix of type '<class 'numpy.int64'>'
        with 1421224 stored elements in Compressed Sparse Row format>
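
To make “quite sparse” concrete, we can compute the density of the matrix directly. A quick sketch – not a cell from the original notebook – might look like this:

# Fraction of entries in the term-frequency matrix that are non-zero
n_rows, n_cols = sparse_metadata.shape
density = sparse_metadata.nnz / (n_rows * n_cols)
print(f"{n_cols} features, {density:.2%} of entries non-zero")  # well under 1% here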

There are, of course, many other ways one can end up with sparse data that has a very high feature count; consider, for example, gene-expression data in biology, or role-based permission vectors in cybersecurity. However you end up with the data, as long as you have a textual label or name you can associate with each feature column, this approach should work. In our case, to get the mapping from columns to feature names, we simply need to reverse the vocabulary_ dictionary that the CountVectorizer generates (which is a mapping from words to column indices):

[17]:
feature_name_dict = {idx: word for word, idx in cv.vocabulary_.items()}

Adding Annotation Layers

Our goal is to tag regions of the map with word-based annotations giving some indication of the content of documents in that region. We can do this by clustering the documents and finding the features in the metadata that are most distinctive of each cluster. To do this we can use the InformationWeightTransformer from vectorizers, which can generate feature weights in a semi-supervised manner – highlighting features that distinguish classes or clusters. Given feature weights, we can find the features in each cluster that score highest when combined with the prevalence of the feature in that cluster. For those who use BERTopic, you can view this as an approach similar to the c-TF-IDF technique used there, but with more advanced machinery doing the heavy lifting.

So we will use UMAP and HDBSCAN to cluster documents, and then use the distinctive features of each cluster to label it with a set of feature names (in our case words) – and do this in a hierarchical fashion, so we have broad high level labels for large clusters and fine-grained labels for all the smaller low-level clusters. Ideally we would also do some outlier detection to prune away small outlying clusters from our high level clusterings.
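
To make the idea concrete before handing the work off to TNT, here is a rough sketch of that labelling loop using HDBSCAN and a crude count-ratio score in place of the information-weighted machinery TNT actually uses – an illustration of the general approach only, not TNT’s implementation:

import hdbscan
import numpy as np

# Cluster the 2D map representation (TNT can also cluster a higher-dimensional UMAP embedding)
clusterer = hdbscan.HDBSCAN(min_cluster_size=100).fit(use_map)

# Score each feature by how over-represented it is inside a cluster relative to
# the corpus overall, and keep the top few feature names as that cluster's label.
overall_counts = np.asarray(sparse_metadata.sum(axis=0)).ravel() + 1.0
for cluster_id in np.unique(clusterer.labels_):
    if cluster_id == -1:
        continue  # skip points HDBSCAN marks as noise
    in_cluster = clusterer.labels_ == cluster_id
    cluster_counts = np.asarray(sparse_metadata[in_cluster].sum(axis=0)).ravel()
    scores = cluster_counts / overall_counts  # crude "distinctiveness" score
    top_columns = np.argsort(scores)[-3:][::-1]
    print(cluster_id, [feature_name_dict[col] for col in top_columns])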

Fortunately TNT wraps all of that work up in an easy-to-use SparseMetadataLabelLayers class. The class has four required arguments: the source vectors (our USE vectors); the map representation (as produced by UMAP earlier); the sparse metadata associated with the vector data (our sparse matrix of word-occurrence counts); and a dictionary mapping column indices of the sparse metadata to textual labels.

The SparseMetadataLabelLayers class also supports a wide variety of optional keyword arguments to help control the clustering and outlier detection. In this case we won’t worry too much about them, but to save time we’ll set cluster_map_representation to True so it simply uses the UMAP map representation we already have (instead of UMAPing to five dimensions for clustering purposes), and we’ll require the highest level labelling layer to contain at least five clusters. See the SparseMetadataLabelLayers API docs for more details on the various keyword options.

[18]:
label_layers =  tnt.SparseMetadataLabelLayers(
    use_vectors,
    use_map,
    sparse_metadata,
    feature_name_dict,
    cluster_map_representation=True,
    min_clusters_in_layer=5,
    random_state=0,
)
/home/azureuser/PycharmProjects/thisnotthat/thisnotthat/map_cluster_labelling.py:1545: UserWarning: NetworkX is required for label adjustments; try pip install networkx
  warn("NetworkX is required for label adjustments; try pip install networkx")

Let’s now create a new plot, and add the textual annotations to it. The plot creation works essentially as before, and adding the textual annotations is as simple as using the add_cluster_labels method. This method includes a number of optional keyword arguments to tweak the aesthetics of how the textual labels are rendered.

[19]:
annotated_plot = tnt.BokehPlotPane(
    use_map,
    labels=labels,
    hover_text=hover_text,
    marker_size=sizes,
    label_color_mapping=COLOR_KEY,
    width=700,
    height=600,
    min_point_size=0.001,
    max_point_size=0.05,
    title="20-Newsgroups Data Map",
)
annotated_plot.add_cluster_labels(label_layers, max_text_size=24)

Now we can display the plot, much as before.

[20]:
pn.Row(annotated_plot)
[20]:

Now we have textual labels to help guide our exploration of the map. The top level cluster labels struggle to capture the over-arching concepts with specific words, but they do give a good indication of what sort of content one might expect. Zooming in, however, reveals further levels of labelling, which cover more specific areas of the map and are much better targeted to the content they are summarising. Zooming in on the purple region roughly in the center of the plot, for example, quickly highlights the different content of talk.politics.misc and talk.politics.guns. It is worth taking a little time to zoom and pan around the map, and see for yourself how well the labelling actually works – given the very simple approach used to generate it.