Annotating a Data Map Using Per-Sample Labels

A data map by itself contains useful information, but it can be hard to quickly orient yourself and understand what the different clusters and relationships mean without looking in detail at the original source data for points in different regions of the map – an often tedious process. Layering textual annotation labels on clusters or regions can go a long way toward making a map easier to understand at a glance, and faster to navigate to specific regions of interest guided by the cluster labels. In this tutorial we will look at what can be done when each sample has a short textual label, and how we can use sampling to create hierarchical layers of cluster labelling.

First we’ll need an example dataset where each sample has some short textual label. A perfect example of this is word-vectors. We can get word-vectors via gensim and its convenient downloader utility. We will also need UMAP to make a data map of the word-vectors.

[1]:
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api
import umap

Word-vectors are dense vector representations of words, learned from processing a large corpus of text, such that vectors for words that are relatively interchangeable in language are similar to each other. In other words, we want vectors such that words with similar meanings are close in the learned space. We will use the pretrained GloVe word-vectors available via the downloader – this is a large file, so it may take a little while to download.

[2]:
word_vector_model = api.load('glove-wiki-gigaword-100')
word_vector_model.vectors.shape
[2]:
(400000, 100)
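As a quick aside (a sketch, not part of the original notebook), we can check the "similar words have similar vectors" property directly: the KeyedVectors interface that api.load returns here can list nearest neighbours by cosine similarity.

# Nearest neighbours of "king" in the GloVe space, by cosine similarity.
# The exact results depend on the model, but expect royalty-related words.
word_vector_model.most_similar("king", topn=5)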

Four hundred thousand word-vectors is somewhat overkill. The GloVe pre-trained model is trained over a very large corpus of Wikipedia and other text, and includes a lot of obscure words, mis-spellings, typos, etc. Fortunately the pretrained word-vectors are stored in order of frequency of use in the training corpus – so we can easily pull off the thirty thousand most frequently used words and work with that subset instead. We’ll also keep track of the associated word representations of those first thirty thousand words as word_text to help label our plots.

[3]:
word_vectors = word_vector_model.vectors[:30000]
word_text = word_vector_model.index_to_key[:30000]
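Since the frequency ordering of the vectors is doing real work here, a quick sanity check (just a sketch, not in the original notebook) is to peek at the first few entries of word_text – they should be extremely common words and punctuation.

# The first few entries should be very high-frequency tokens.
print(word_text[:10])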

Next we need a data map to explore. We can get that by passing the word-vectors on to UMAP to reduce to 2D. Note that standard practice for word-vectors is to use cosine distance as the measure of dissimilarity, so we need to pass that information to UMAP as well.

[4]:
word_map = umap.UMAP(metric="cosine", random_state=42).fit_transform(word_vectors)

Now we have a 2D data map of the thirty thousand most frequently used words in the GloVe training corpus. This provides the starting point for TNT.

A Basic Interactive Plot

First we will get an interactive TNT plot of our data map working. For this we will need to import the TNT library, as well as the panel library, which provides the infrastructure for building and composing TNT elements.

[5]:
import thisnotthat as tnt
import panel as pn

The next important step, to make use of TNT inside a notebook, is to enable panel’s extensions – we do this by calling the extension function. This will enable panel to render straight to a notebook, even if that involves communication back to the server for interactions.

[6]:
pn.extension()

To get started we’ll make a simple plot using the BokehPlotPane. There are other plot pane options, but the Bokeh pane is the richest in features, and supports the addition of the text annotation layers we’ll be using later. In the most basic usage you simply pass a data map in. In this case, since we have no class label information for the legend to show, we’ll turn the legend off. We will also set the hover_text to be the word associated with each vector, allowing us to hover over points in the map to see which word each point represents. This is as simple as passing in word_text, which is in the same order as the word-vectors, as hover_text.

[7]:
basic_plot = tnt.BokehPlotPane(
    word_map,
    hover_text=word_text,
    show_legend=False,
)

To display the plot we make a panel Row and put the plot in it. If we had other panel or TNT elements we wanted to add we could simply pass them as extra arguments to the Row (or use a more advanced layout if we liked – see the sketch a little further below).

[8]:
pn.Row(basic_plot)
[8]:

Immediately we have an interactive plot that we can zoom and pan around in. We can mouse over the points to see what the associated words are, and we can get a sense of how similar or related words end up appearing in clustered regions of the map. It is a little hard to guess what the different regions of the map relate to just by looking at it, however, and mousing over everything quickly becomes tedious.
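As for the more advanced layouts mentioned above, composing the plot with other panel elements is just a matter of building a richer layout. A minimal sketch (using panel's standard Markdown pane; not part of the original tutorial) might look like this:

# A slightly richer layout: a markdown title above the plot.
pn.Column(
    pn.pane.Markdown("## GloVe word map"),
    pn.Row(basic_plot),
)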

Let’s get some extra information that we can use to enrich the plot a little. One useful thing would be to try and tag the words according to their part-of-speech usage: nouns, verbs, adjectives, adverbs and so on. We don’t have that information in the GloVe vectors, but we can generate some reasonable guesses with just a little work. We’ll need NLTK (the Natural Language ToolKit) for this.

[9]:
import nltk
import pandas as pd
import numpy as np
import string
import re

We need some part-of-speech tagged text, and given that people’s names are prevalent in the word vectors (since much of the training corpus is Wikipedia), it will also be useful to tag those. We could invest a lot of effort in this, but for a quick demonstration we’ll just grab a basic corpus and a list of names and work from there.

[10]:
nltk.download('brown')
nltk.download('names')
nltk.download('universal_tagset')
tagged_corpus = nltk.corpus.brown.tagged_words(tagset='universal')
names = nltk.corpus.names.words()
[nltk_data] Downloading package brown to /home/lmmcinn/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package names to /home/lmmcinn/nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     /home/lmmcinn/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!

Now we just associate a word with its part-of-speech tag. It is also worth noting that many words’ part-of-speech depends upon their usage; our word-vectors don’t have that information. To keep everything simple we’ll just use the most common tag for each word.

[11]:
pos_dict = (
    pd.DataFrame(tagged_corpus, columns=("word", "POS"))
    .assign(word=lambda x: x.word.str.lower())
    .assign(POS=lambda x: x.POS.replace({"X": "OTHER", ".": "PUNCT"}).str.lower())
    .groupby("word")
    .agg(lambda x: pd.Series.mode(x)[0])
    .to_dict()['POS']
)

This gives us a dictionary mapping words in our pre-tagged corpus to their most commonly assigned tag. We also want to tag people’s names, and while we are at it we can enrich our dictionary to handle other relatively obvious number and punctuation patterns that show up in the word-vectors but not in our pre-tagged corpus.

[12]:
pos_dict.update({name.lower():"name" for name in names})
for word in word_text:
    if (word.isnumeric() or word.isdecimal() or re.match(r'\d+\.\d+', word)) and word not in pos_dict:
        pos_dict[word] = "num"
    elif all(char in string.punctuation for char in word) and word not in pos_dict:
        pos_dict[word] = "punct"
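Before moving on it can be worth spot-checking the dictionary on a few examples (a sketch only – the example words are arbitrary, and the exact tags depend on the Brown corpus counts):

# Spot-check a handful of words and patterns; fall back to "other" if missing.
for example in ["house", "quickly", "mary", "3.14", "?!"]:
    print(example, "->", pos_dict.get(example, "other"))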

Now we can create a part-of-speech label for each word in our word-vectors, tagging everything that doesn’t have a match in our hastily cobbled-together dictionary with “other”. We’ll also make a convenient colour mapping to discern the different parts-of-speech in the plot.

[13]:
pos_labels = [pos_dict[word] if word in pos_dict else "other" for word in word_text]
pos_color_mapping = {
    "noun":'#fed977',
    "name":'#37a055',
    "verb":'#fd8e3c',
    "adj":'#225ea8',
    "adv":'#42b6c4',
    "adp":'#c8e9b4',
    "det":'#88419d',
    "prt":'#8c97c6',
    "conj":'#c0d4e6',
    "num":'#e31a1c',
    "punct":'#74b9b9',
    "other":'#aaaaaa',
}

Now we can enrich our plot – we have effective class labels in the part-of-speech tags applied to each word. We can also make use of the relative frequency of words. If we assume a Zipf distribution of word frequency (a reasonable assumption) we can size the markers (on a log scale) by the frequency of use of each word. We can also use tooltip templating (see the Bokeh docs for more details on syntax) to enrich the tooltips with the part-of-speech.
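To get a feel for that sizing scheme, here is a quick sketch (not in the original notebook) of the same expression used in the plot call below, evaluated at the most and least frequent ranks; note that min_point_size and max_point_size will clip the extremes anyway.

# Marker size as a function of frequency rank: 0.01 * (log(1/rank) + 11).
ranks = np.arange(1, len(pos_labels) + 1)
sizes = 0.01 * (np.log(1.0 / ranks) + 11)
print(sizes[0], sizes[-1])  # size for the most frequent vs. least frequent word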

[14]:
word_plot = tnt.BokehPlotPane(
    word_map,
    labels=pos_labels,
    label_color_mapping=pos_color_mapping,
    hover_text=word_text,
    tooltip_template="""@hover_text [@label]""",
    marker_size=0.01 * (np.log(1.0 / np.arange(1, len(pos_labels) + 1)) + 11),
    min_point_size=0.001,
    max_point_size=0.12,
    width=768,
    height=512,
    title="GloVe word vectors"
)

We then display the plot as before. It is well worth zooming in and panning around to explore the way different parts of speech have clustered, and the fact that particularly frequent words (often determiners and conjunctions like “the”, “and”, etc.) all cluster in the rough center of the map.

[15]:
pn.Row(word_plot)
[15]:

The part-of-speech tagging and commonness of usage do help to make the plot more quickly navigable to regions of interest, and the hover text makes it relatively easy to quickly skim over the words in a region. Still, it would be nice to do better. It would be great to have textual annotations labelling the different clusters to give some idea of their content. Ideally we could also zoom in and see finer-grained cluster labels as well. Let’s get started building that.

Generating Annotation Vectors

Our goal is to tag regions of the map with word-based annotations giving some indication of the kinds of words in that region. Other approaches, such as the JointVectorLabelLayers and MetadataLabelLayers, make use of significant extra information to provide rich labelling of clusters. Here we have very limited extra information, but we at least have a unique text label (the word) for each sample (the word-vector). In such a situation the most natural approach is to create a label for each cluster by selecting a decent representative sample of word-vectors from the cluster, and using the associated words as labels.

In selecting samples that provide a good representation of a cluster (or dataset), a good approach is to use a technique like submodular selection to ensure reasonable diversity and coverage of the sample from the cluster. Fortunately the apricot-select library provides this for us. The SampleLabelLayers class uses apricot-select (and falls back to pure random sampling if apricot-select is not installed) to sample from hierarchical layers of clustering, and includes facilities for pruning away outlying clusters from higher-level clusterings where necessary.

The SampleLabelLayers class has three required arguments: the source data (our original high-dimensional word-vectors), the map representation (as produced by UMAP earlier), and a vector of text representations for each sample (in our case word_text). The SampleLabelLayers class also supports a wide variety of optional keyword arguments to help control the clustering and outlier detection. Perhaps one of the more important of these is sample_selection_method, which lets you choose the submodular selection approach used by apricot-select. Here we have chosen "saturated_coverage" as it is very fast. Other options include "sum_redundancy" and "graph_cut", which are a little slower and similarity-graph based, and "facility_location", which does the best job of choosing examples that represent the cluster well but can be more computationally expensive.

We also set the vector_metric to "cosine" since that is how apricot-select should be measuring distances when optimizing. The cluster_map_representation option saves time by running hdbscan on the map representation rather than re-running UMAP down to umap_n_components many dimensions (which defaults to 5). By default a spring-based layout system is used to try to avoid label overlaps within and among layers; since we want to keep the labels well centered we will turn that off here. Finally we’ll tweak the clustering and outlier detection options to help get a good looking result.

[16]:
%%time
label_layers = tnt.map_cluster_labelling.SampleLabelLayers(
    word_vectors,
    word_map,
    word_text,
    sample_selection_method="saturated_coverage",
    vector_metric="cosine",
    cluster_map_representation=True,
    adjust_label_locations=False,
    hdbscan_min_cluster_size=10,
    min_clusters_in_layer=16,
    contamination_multiplier=1.0,
    random_state=42,
)
CPU times: user 22.7 s, sys: 1.01 s, total: 23.7 s
Wall time: 7.37 s

Now we can simply add the resulting multi-layer text labelling structure to our existing plot using the add_cluster_labels method, which takes the output of any of the label-layers approaches. This method also supports a number of optional keyword arguments for tweaking the aesthetics of how the text labels are rendered. Here we’ll just compress the line-height a little, and not let text labels get too large in apparent size before transitioning to lower layers.

[17]:
word_plot.add_cluster_labels(
    label_layers,
    text_line_height=0.75,
    max_text_size=24.0,
)

We can display the plot, much as before.

[18]:
pn.Row(word_plot)
[18]:

Now we have textual labels to help further guide our exploration of the map. The top-level cluster labels give, at best, a general idea of the content of a given region – selecting three items out of hundreds is always going to be limited. However, they provide a starting point, and zooming in will reveal lower-level labellings of finer regions of the map, giving more specific details on the content of that region (as can be quickly verified by hovering over the points). The finest-grained labels are often quite specific, and taken together they provide quite a bit of insight into how the map has compressed the complexity of language into only two dimensions (in various cases trade-offs have to be made). Overall this cluster labelling makes the map much easier to explore and to navigate to regions of interest.