Annotating a Data Map from Metadata

A data map by itself contains useful information, but it can be hard to orient yourself quickly and understand what different clusters and relationships mean without looking in detail at the original source data for points in different regions of the map, which is often a tedious process. Layered textual annotation labels on clusters or regions can go a long way towards making a map easier to understand at a glance, and faster to navigate to specific regions of interest guided by the cluster labels. Here we will look at how to generate and apply such a textual annotation using metadata associated with the points in a map.

First we’ll need some libraries. To get the data and to produce a map from that data we’ll use a preprocessor from sklearn, seaborn for its load_dataset function, and, of course, UMAP.

[1]:

from sklearn.preprocessing import RobustScaler
import seaborn as sns
import umap

As a simple example dataset we’ll be working with the Palmer penguins data. This can be loaded via seaborn. We will also do a little tidying up: we’ll drop rows with missing data, and we’ll rename the columns to have slightly more print-friendly names.

[2]:

penguins = (
    sns.load_dataset('penguins')
    .dropna()
    .rename(
        columns={
            "bill_length_mm": "bill-length",
            "bill_depth_mm": "bill-depth",
            "flipper_length_mm": "flipper-length",
            "body_mass_g": "body-mass"
        }
    )
)

The dataset consists of measurements for three different species of penguins: the length of their bills, the depth of their bills, the length of their flippers, and how much they weigh. The data was collected from three different islands, and the sex of each penguin was also recorded.

[3]:

penguins.head()

[3]:

	species	island	bill-length	bill-depth	flipper-length	body-mass	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	Female
5	Adelie	Torgersen	39.3	20.6	190.0	3650.0	Male

If we want to create a data map from this data we need to extract some vector data with a reasonable distance metric on it. Obviously the purely numeric data (excluding species, island and sex) will be useful, but the different measurements (in different units) are on very different scales. While there are better and more complex approaches to dealing with this, to keep this tutorial simple we’ll just use sklearn’s RobustScaler to rescale the features to be on the same general scale.

[4]:

data_for_umap = RobustScaler().fit_transform(penguins.select_dtypes(include="number"))
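
To make the rescaling step concrete, here is a minimal pure-Python sketch of what RobustScaler does with its default settings: subtract the median of each column and divide by its interquartile range. The helper name robust_scale is hypothetical and only for illustration, and the values are the five rows shown by penguins.head() above rather than the full dataset.

```python
from statistics import median, quantiles


def robust_scale(values):
    """Centre on the median and divide by the interquartile range (IQR)."""
    # method="inclusive" uses linear interpolation between data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    centre = median(values)
    return [(v - centre) / (q3 - q1) for v in values]


# Body mass (g) and bill depth (mm) from the five rows of penguins.head()
body_mass = [3750.0, 3800.0, 3250.0, 3450.0, 3650.0]
bill_depth = [18.7, 17.4, 18.0, 19.3, 20.6]

scaled_mass = robust_scale(body_mass)
scaled_depth = robust_scale(bill_depth)

print([round(v, 2) for v in scaled_mass])
print([round(v, 2) for v in scaled_depth])
```

Although the raw columns differ by a factor of a few hundred, both scaled columns end up within a couple of units of zero, which is what lets a single distance metric treat them comparably.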

The last step is to run UMAP on the data to get our data map. This will provide the starting point for using TNT (the thisnotthat library) for data map exploration.

[5]:

penguin_datamap = umap.UMAP(random_state=42).fit_transform(data_for_umap)

A Basic Interactive Plot

First we will get an interactive TNT plot of our data map working. For this we will need to import the TNT library, as well as the panel library which provides the infrastructure for building and composing TNT elements.

[6]:

import thisnotthat as tnt
import panel as pn

The next important step is to enable panel to render into the notebook. We do that by calling the extension function from panel. In this case we will call it with the 'tabulator' argument, as we will later be using the tabulator-based interactive data table and need to enable this extension.

[7]:

pn.extension('tabulator')

For this plot we’ll make use of the BokehPlotPane. Other plot pane types exist, but the Bokeh pane is the richest in features and will work well for this data. The BokehPlotPane has a single required argument, the data map to be plotted, and a large range of optional keyword arguments for styling and enriching the plot. In this case we’ll add class labels to the plot using the species of penguin (TNT will use class labels for colouring points), and some text to show when hovering over points, generated by gluing together the species, island and sex for each penguin.

[8]:

basic_plot = tnt.BokehPlotPane(
    penguin_datamap,
    labels=penguins.species,
    hover_text=penguins.select_dtypes(include="object").apply(" ".join, axis=1),
    width=700,
)
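
The hover_text expression can look a little opaque, so here is a rough pure-Python equivalent of what it computes. On the penguin dataframe, select_dtypes(include="object") keeps the string columns (species, island and sex), and apply(" ".join, axis=1) joins them row by row. The helper hover_text_for is a hypothetical name used only for this sketch, and the rows are copied from the penguins.head() output above.

```python
# Three rows of penguin metadata copied from the penguins.head() output
rows = [
    {"species": "Adelie", "island": "Torgersen", "bill-length": 39.1, "sex": "Male"},
    {"species": "Adelie", "island": "Torgersen", "bill-length": 39.5, "sex": "Female"},
    {"species": "Adelie", "island": "Torgersen", "bill-length": 40.3, "sex": "Female"},
]


def hover_text_for(row):
    """Join every string-valued field of a row with single spaces."""
    return " ".join(value for value in row.values() if isinstance(value, str))


hover_text = [hover_text_for(row) for row in rows]
print(hover_text[0])  # Adelie Torgersen Male
```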

To show the plot we assemble the components we want to display using panel. In this case it is just the plot. We can use the Row object to hold our plot, and since we enabled the panel extension it will display inline in the notebook.

[9]:

pn.Row(basic_plot)

Note that the plot is interactive – you can zoom and pan, and hovering the cursor over points will provide a tooltip with the information we supplied as hover_text. We can also see the penguins split fairly well by species in the map thanks to the points being coloured by the species label. It is also possible to use the lasso select tool to select data – right now that doesn’t do anything, but we’ll see more about it later.

Adding Annotation Layers

Colouring by species was useful in helping to understand what the different clusters in the map were: we see right away that the Gentoo penguins are quite distinct from the other two species. But we can only colour by a single variable at a time, and it can be hard to get a sense of how different factors interact to result in clusters and shapes within the map. Wouldn’t it be great if we could tag regions of the map with descriptions of what makes that region distinct from others across all the different features at once? Are some regions more associated with female penguins? Or lower body mass? Or some combination thereof? How can we relate the full metadata table, including categorical variables like species, island and sex, to the map generated solely from re-scaled numeric data?

We can achieve something like this by clustering the data and then generating textual labels for the clusters. Each cluster is a region of the map, and we can then label the regions with the associated text. Better still, we can do this hierarchically, generating multiple layers of textual labels for increasingly large, higher-level regions, much as we might label a map with country, state, and city labels. The question is how do we go about labelling a single cluster given a dataframe of metadata about all the points?

One approach is to use the dataframe as training data for a binary classifier that tries to learn to classify the data as in the cluster or not. We can then use the feature importances of the classifier to label a cluster with its most discriminating features. Of course we also have to handle differences between categorical and numeric features, and apply all of this repeatedly and hierarchically. Fortunately TNT wraps all of this complication up in the MetadataLabelLayers class.
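
To give a feel for the classifier idea, here is a small pure-Python sketch. It is not TNT’s implementation: in place of a full classifier with feature importances it swaps in a much simpler stand-in, scoring each feature by the accuracy of the best single-threshold rule (a decision stump) at separating "in the cluster" from "not in the cluster". The function names, the toy rows and the cluster membership are all assumptions made up for illustration.

```python
def best_stump(values, in_cluster):
    """Return (accuracy, direction) for the best one-threshold rule on a feature.

    'high' means "value >= threshold predicts membership"; 'low' is the reverse.
    """
    n = len(values)
    best = (0.0, None)
    # Skip the smallest value: "value >= min" is true for every point.
    for threshold in sorted(set(values))[1:]:
        hits = sum((v >= threshold) == member for v, member in zip(values, in_cluster))
        for accuracy, direction in ((hits / n, "high"), (1 - hits / n, "low")):
            if accuracy > best[0]:
                best = (accuracy, direction)
    return best


def describe_cluster(rows, in_cluster):
    """Rank candidate label fragments by how well each separates the cluster."""
    scored = []
    for name in rows[0]:
        column = [row[name] for row in rows]
        if isinstance(column[0], str):
            # Categorical feature: one 0/1 indicator per category value.
            for category in sorted(set(column)):
                indicator = [float(v == category) for v in column]
                accuracy, direction = best_stump(indicator, in_cluster)
                if direction == "high":
                    scored.append((accuracy, f"{name}={category}"))
        else:
            accuracy, direction = best_stump(column, in_cluster)
            if direction is not None:
                scored.append((accuracy, f"{direction} {name}"))
    return sorted(scored, key=lambda item: -item[0])


# Toy metadata (illustrative values only, not the real dataset)
columns = ("species", "sex", "bill-depth", "body-mass")
table = [
    ("Gentoo", "Male", 15.7, 5700.0),
    ("Gentoo", "Female", 14.2, 4700.0),
    ("Gentoo", "Male", 15.1, 5400.0),
    ("Adelie", "Male", 18.7, 3750.0),
    ("Adelie", "Female", 17.4, 3800.0),
    ("Chinstrap", "Female", 17.9, 3500.0),
    ("Chinstrap", "Male", 19.5, 3900.0),
    ("Adelie", "Female", 18.0, 3250.0),
]
rows = [dict(zip(columns, record)) for record in table]
# Pretend a clustering step grouped the first three points together.
in_cluster = [True, True, True, False, False, False, False, False]

ranked = describe_cluster(rows, in_cluster)
label = ", ".join(text for _, text in ranked[:3])
print(label)
```

On this toy data the label comes out as "species=Gentoo, low bill-depth, high body-mass", while sex scores poorly because it barely separates the cluster from the rest. A real classifier can also capture interactions between features, which a per-feature stump cannot.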

The MetadataLabelLayers class has three required arguments: the source data (which we will cluster using a combination of UMAP and HDBSCAN), the map representation of the data (as produced by UMAP earlier), and the dataframe of associated metadata we will use to train the classifier (in our case the original dataframe of penguin data).

The MetadataLabelLayers class also has a range of optional keyword arguments to help control the clustering, the pruning of outlying clusters from higher level labelling, the metric on the source data, and so on. Essentially these give you the ability to fine-tune the kind of labelling results you get. See the MetadataLabelLayers API docs for more details.

[10]:

label_layers = tnt.MetadataLabelLayers(
    data_for_umap,
    penguin_datamap,
    penguins,
    hdbscan_min_cluster_size=5,
    hdbscan_min_samples=5,
    contamination=1e-6,
    min_clusters_in_layer=3,
    vector_metric="euclidean",
    cluster_distance_threshold=0.0,
    random_state=0,
)
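
As a loose illustration of what "layers" of clusters means, the following pure-Python sketch builds coarser layers by repeatedly merging the two clusters whose centroids are closest. This is an assumption-laden toy, not how MetadataLabelLayers builds its layers; the function merge_closest and the 2D points are invented for the example.

```python
from math import dist


def centroid(points):
    """Mean position of a list of 2D points."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def merge_closest(clusters):
    """Return a new cluster list with the two closest clusters merged."""
    pairs = [
        (dist(centroid(a), centroid(b)), i, j)
        for i, a in enumerate(clusters)
        for j, b in enumerate(clusters)
        if i < j
    ]
    _, i, j = min(pairs)
    merged = clusters[i] + clusters[j]
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]


# Finest layer: four small clusters of map coordinates (made-up values)
fine = [
    [(0.0, 0.0), (0.2, 0.1)],
    [(1.0, 0.1), (1.2, 0.0)],
    [(8.0, 8.0), (8.1, 8.3)],
    [(9.5, 8.2), (9.7, 8.0)],
]

# Each later layer has one cluster fewer, covering a larger region.
layers = [fine]
while len(layers[-1]) > 2:
    layers.append(merge_closest(layers[-1]))

print([len(layer) for layer in layers])  # [4, 3, 2]
```

Each layer could then be labelled separately, so that the coarsest layer supplies the "country" level labels and the finest layer the "city" level ones.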

Let’s now create a new plot, and add the textual annotations to it. The plot creation works essentially as before, and adding the textual annotations is as simple as using the add_cluster_labels method. This method includes a number of optional keyword arguments to tweak the aesthetics of how the textual labels are rendered.

[11]:

annotated_plot = tnt.BokehPlotPane(
    penguin_datamap,
    labels=penguins.species,
    hover_text=penguins.select_dtypes(include="object").apply(" ".join, axis=1),
    legend_location="top_right",
    width=600,
    height=600,
)
annotated_plot.add_cluster_labels(label_layers, text_size_scale=64, text_layer_scale_factor=3.0)

Now we can display the plot, much as before.

[12]:

pn.Row(annotated_plot)

Now we have textual labels highlighting the features that are important to different regions of the map. Clearly species matters (we knew that), but we can also quickly gather that low bill-depth and high body mass are distinguishing features of Gentoo penguins, for example, and that the Adelie cluster is split between smaller and larger penguins. You can zoom in to reveal finer grained labels for smaller regions, helping to further guide any exploration of the data.

Adding Plot Interactions

To help give some confidence that the textual cluster labels are indeed telling us what we want, let’s add a data table that we can link to selections in the plot. To do this we create a DataPane and then link it to the annotated plot via the selected attribute.

[13]:

data_view = tnt.DataPane(penguins, width=700, page_size=150)
data_view.link(
    annotated_plot,
    selected="selected",
    bidirectional=True,
);
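
Conceptually, a bidirectional link on the selected attribute means that a change on either side is pushed to the other. The sketch below shows that idea in plain Python; it does not use panel or TNT, and the Linkable class is a hypothetical stand-in rather than either library’s implementation.

```python
class Linkable:
    """Minimal object whose 'selected' attribute notifies watchers on change."""

    def __init__(self, name):
        self.name = name
        self._selected = []
        self._watchers = []

    @property
    def selected(self):
        return self._selected

    @selected.setter
    def selected(self, indices):
        indices = list(indices)
        if indices == self._selected:
            return  # no change: stops a bidirectional link from looping forever
        self._selected = indices
        for callback in self._watchers:
            callback(indices)

    def link(self, other, bidirectional=False):
        """Push changes of our 'selected' to other, and optionally back again."""
        self._watchers.append(lambda indices: setattr(other, "selected", indices))
        if bidirectional:
            other._watchers.append(lambda indices: setattr(self, "selected", indices))


plot = Linkable("plot")
table = Linkable("table")
table.link(plot, bidirectional=True)

# A lasso selection in the plot propagates to the table ...
plot.selected = [0, 3, 4]
rows = ["Adelie Male", "Adelie Female", "Adelie Female", "Gentoo Male", "Gentoo Female"]
visible_rows = [rows[i] for i in table.selected]
print(visible_rows)  # ['Adelie Male', 'Gentoo Male', 'Gentoo Female']
```

Because the link runs both ways, setting table.selected would update plot.selected in the same manner.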

We can now display both the plot and the table. Since the table is large it is beneficial to use the Tabs layout from panel, so we can tab back and forth between plot and table. The table will display only those rows associated with points selected in the plot. Note that you will need to be running this in a notebook to enable this level of interactivity.

[14]:

pn.Tabs(annotated_plot, data_view)

Another approach to tying the plot back to the source data is via plot controls that allow us to select which data columns we are colouring by (using continuous colour maps for numeric data if required), control the marker size based on numeric data columns, or choose which source data columns to use for the hover tooltip. We can do this easily via the PlotControlWidget, which we link to the plot by matching the relevant attributes of each. We can then recolour the plot in various ways, and alter marker size and hover text, to verify that the textual labels are providing quite effective summaries of the data map. Note that you will need to be running this in a notebook to enable this level of interactivity.

[15]:

plot_control = tnt.PlotControlWidget(penguins, width=120)
plot_control.link(
    annotated_plot,
    hover_text="hover_text",
    marker_size="marker_size",
    color_by_vector="color_by_vector",
    color_by_palette="color_by_palette",
);

[16]:

pn.Row(annotated_plot, plot_control)
