[1]:
%load_ext autoreload
%autoreload 2

JointVectorSummarizer

Feature importance is quite difficult to compute on very sparse, very high-dimensional data such as text documents. The JointVectorSummarizer aims to bridge this gap: it leverages a joint embedding of your points of interest and some label space to find a set of interpretable labels for a selection of your data.

It does this in a manner similar to the cluster labelling in Dimo Angelov’s Top2Vec. Please see his paper for the details, but the basic idea is that, as long as we are working in a high-dimensional space, we can compute the centroid of a large number of points and still have that centroid represent our points surprisingly well. We can then describe that centroid by finding its nearest neighbours in a more interpretable label space.
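
To make that concrete, here is a minimal sketch of the process using plain numpy on toy data. The array names, shapes, and the cosine-similarity choice are purely illustrative, not part of any library API:

import numpy as np

# Toy joint space: 1000 "point" vectors and 50 "label" vectors that
# share the same 512-dimensional embedding space.
rng = np.random.default_rng(0)
point_vectors = rng.normal(size=(1000, 512))
label_vectors = rng.normal(size=(50, 512))
label_names = [f"label_{i}" for i in range(50)]

# Suppose we have selected a subset of the points.
selection = np.arange(100, 200)

# Step 1: represent the selection by the centroid of its vectors.
centroid = point_vectors[selection].mean(axis=0)
centroid /= np.linalg.norm(centroid)

# Step 2: report the labels whose vectors lie nearest the centroid
# (cosine similarity, since embedding directions carry the meaning).
normed_labels = label_vectors / np.linalg.norm(label_vectors, axis=1, keepdims=True)
similarity = normed_labels @ centroid
print([label_names[i] for i in np.argsort(-similarity)[:5]])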

Let’s run through an example using a joint document and word embedding. Here the documents are our points of interest and the words are our interpretable labels.

The first step is to load thisnotthat and panel.

[2]:
import thisnotthat as tnt
import panel as pn
2023-03-03 16:53:29.354478: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.

To make Panel-based objects interactive within a notebook we need to load the Panel extension.

[3]:
pn.extension()

Now we need some data to use as an example. In this case we’ll use the 20-Newsgroups dataset, which we can easily access via scikit-learn. The data itself consists of posts from twenty different newsgroups from the 1990s. Most posts are relatively short (a paragraph or two), but others can be considerably longer. We will clean up the data by removing overly short and excessively long posts.

[4]:
import sklearn.datasets
import umap
import numpy as np
[5]:
dataset = sklearn.datasets.fetch_20newsgroups(
    subset="all", remove=("headers", "footers", "quotes")
)

good_length = [(len(t) > 128) and (len(t) < 16384) for t in dataset["data"]]
targets = np.array(dataset.target)[good_length]
news_data = [t for t in dataset["data"] if (len(t) > 128) and (len(t) < 16384)]
news_labels = [dataset.target_names[x] for x in targets]

Having extracted and cleaned the data, we are going to create a joint embedding of these posts and their words. For this we will make use of Dimo Angelov’s Top2Vec, but any form of joint vector space should do the trick. Top2Vec can be built over a number of different text embedding models; for this example we will make use of the Universal Sentence Encoder. Doc2Vec and BERT sentence transformers are also easily wrapped by this library, but feel free to use your own methods or branch out into joint image and word embeddings via something like CLIP. All that is required for this summarization is a joint embedding of your objects and some interpretable labels.

For this notebook you will need to pip install Top2Vec with the sentence_encoders extra to ensure you have the appropriate embedding model:

pip install top2vec[sentence_encoders]

Then import Top2Vec and train a model to build the joint document and word embedding.

[6]:
from top2vec import Top2Vec
model = Top2Vec(news_data, embedding_model='universal-sentence-encoder')
2023-03-03 16:53:39,966 - top2vec - INFO - Pre-processing documents for training
/Users/jchealy/opt/anaconda3/envs/tnt/lib/python3.10/site-packages/sklearn/feature_extraction/text.py:528: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
  warnings.warn(
2023-03-03 16:53:47,141 - top2vec - INFO - Downloading universal-sentence-encoder model
2023-03-03 16:53:48.337467: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-03 16:53:59,775 - top2vec - INFO - Creating joint document/word embedding
INFO:top2vec:Creating joint document/word embedding
2023-03-03 16:54:22,357 - top2vec - INFO - Creating lower dimension embedding of documents
INFO:top2vec:Creating lower dimension embedding of documents
2023-03-03 16:54:48,039 - top2vec - INFO - Finding dense areas of documents
INFO:top2vec:Finding dense areas of documents
2023-03-03 16:54:48,584 - top2vec - INFO - Finding topics
INFO:top2vec:Finding topics

This will have created a joint embedding of our documents and words, as well as doing some extra work to find useful topics for this data set.

The 512-dimensional document vectors are stored in the document_vectors property.

[7]:
model.document_vectors.shape
[7]:
(16384, 512)

We also have a long list of 4580 words to act as our interpretable labels (model.vocab) and their corresponding 512-dimensional vectors (model.word_vectors), which are jointly embedded with our document_vectors so that distance comparisons between the two are meaningful.

[8]:
print(model.vocab[:5])
print(model.word_vectors.shape)
['aa', 'aaron', 'ab', 'abc', 'abiding']
(4580, 512)
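
Because documents and words share one space, we can sanity-check the embedding by asking which vocabulary words sit closest to an individual document vector. This small check is not part of the library; it only assumes the model attributes shown above:

# Five vocabulary words nearest to the first document vector (cosine similarity).
doc_vec = model.document_vectors[0]
doc_vec = doc_vec / np.linalg.norm(doc_vec)
word_vecs = model.word_vectors / np.linalg.norm(model.word_vectors, axis=1, keepdims=True)
print([model.vocab[i] for i in np.argsort(-(word_vecs @ doc_vec))[:5]])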

Let’s use UMAP to find a good two-dimensional representation of our documents by reducing the dimension of their 512-dimensional USE vectors. We could reduce to any dimension we’d like, but 2D is particularly conducive to visualization.

[9]:
use_map = umap.UMAP(metric="cosine", random_state=42).fit_transform(model.document_vectors)

We can use a BokehPlotPane to visualize the result. We will use the newsgroup of each post as its label, and the first 256 characters of each post as hover text, to make it easier to navigate the data.

[10]:
plot = tnt.BokehPlotPane(
    use_map,
    labels=news_labels,
    hover_text=["▶ " + x[:256] + "..." for x in news_data],
    marker_size=0.05,
    width=500,
    height=500,
    title="Newsgroup data map",
)

We would like a summary of our selected data, so we need a summarizer that understands this joint embedding. The DataFrame-producing variant, JointLabelSummarizer, lives in the summary.dataframe namespace. It takes an np.array of vectors associated with our points (documents in this case), a list of labels (words), and finally an np.array of vectors associated with those labels (word vectors). The key here is that these vectors all need to be in the same space.

The JointLabelSummarizer computes the centroid of your selected points in the high-dimensional space, and then finds the n_neighbours labels (or words) nearest to that centroid.

This is the basic process that Top2Vec uses to summarize a cluster. The difference here is that we are performing this summarization not on a detected cluster but instead on a selection of documents.
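
For reference, the DataFrame-based workflow described above would look roughly like the following sketch. It assumes the JointLabelSummarizer in the summary.dataframe namespace accepts the point vectors, labels, and label vectors positionally, as described above, so treat it as a sketch rather than a definitive recipe:

from thisnotthat.summary.dataframe import JointLabelSummarizer

# Summarize a selection by the words nearest its centroid, as a DataFrame.
label_summarizer = JointLabelSummarizer(
    model.document_vectors,  # vectors for our points (the documents)
    model.vocab,             # interpretable labels (the words)
    model.word_vectors,      # label vectors, living in the same space
)
label_summary_pane = tnt.DataSummaryPane(label_summarizer)
label_summary_pane.link_to_plot(plot)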

As with all summarizers, we construct the summarizer and pass it into a SummaryPane. In the dashboard below we use the closely related JointWordCloudSummarizer from the summary.plot namespace, which renders the nearest words as a word cloud and is therefore displayed with a PlotSummaryPane, and we add a CountSelectedSummarizer in a DataSummaryPane to report how many points are selected. As always, we link the panes to the plot with link_to_plot and begin to explore our data.

[11]:
from thisnotthat.summary.plot import JointWordCloudSummarizer
from thisnotthat.summary.dataframe import CountSelectedSummarizer

# Word cloud of the words nearest the centroid of the current selection
word_summarizer = JointWordCloudSummarizer(
    model.document_vectors, model.vocab, model.word_vectors, background_color="black"
)
word_summary_pane = tnt.PlotSummaryPane(word_summarizer)
word_summary_pane.link_to_plot(plot)

# Simple count of how many points are currently selected
count_summary = tnt.DataSummaryPane(CountSelectedSummarizer(), sizing_mode="stretch_width")
count_summary.link_to_plot(plot)

pn.Row(plot, pn.Column(count_summary, word_summary_pane))
[11]:

Now, if you’re running this in a notebook, you can select points in the left pane and see which words best represent the selected points.
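
You can also drive the selection from code rather than the GUI. The line below is hypothetical: it assumes the plot pane exposes its current selection as a selected list of point indices (the mechanism the linked panes react to), so treat the attribute name as an assumption rather than documented API:

# Hypothetical: select the first 50 points programmatically so the linked
# summary panes update without any mouse interaction.
plot.selected = list(range(50))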

[12]:
from wordcloud import WordCloud