FeatureImportanceSummarizer

This Not That (TNT) provides a plot viewer to help a user gain a better understanding of their summarized data. We will outline the basic functionality of this PlotSummaryPane by demonstrating it with one of TNT’s built-in summary functions: FeatureImportanceSummarizer.

The first step is to load thisnotthat and panel.

[1]:
import thisnotthat as tnt
import panel as pn

To make Panel-based objects interactive within a notebook we need to load the Panel extension.

[2]:
pn.extension()

Now we need some data to use as an example. In this case we’ll use the Palmer Penguins dataset, which we can access easily via seaborn.

[3]:
import seaborn as sns
penguins = sns.load_dataset("penguins").dropna(how="any", axis=0)
penguins.head()
[3]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 Male

Now we will need a summarizer object. This is simply an object of a class which has a summarize function. That summarize function needs to take a sequence of selected points (the indices of the points you’ve selected in another plot) and return a figure to be displayed in our PlotSummaryPane.

def summarize(self, selected: Sequence[int]):
    # compute a summary of the selected points
    return figure  # a figure for the PlotSummaryPane to display

There are a number of useful pre-defined summarizer functions already included in TNT. Summarizer functions which return a plot and are thus appropriate for initializing a PlotSummaryPane are included within the summary.plot namespace. For this example we will demonstrate our FeatureImportanceSummarizer.

FeatureImportanceSummarizer by default constructs a class-balanced, L1 penalized logistic regression separating the selected points from the unselected data points. By default the categorical variables are one-hot encoded and the numeric variables are scaled and centered with a RobustScaler. This at least attempts to put the variables on the same scale so that the coefficients are somewhat comparable.
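
For intuition, a roughly equivalent scikit-learn pipeline to the defaults described above might look like the following sketch. This is only an illustration of the idea, not TNT’s actual implementation.

from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Sketch only: one-hot encode categorical columns, robustly scale numeric ones,
# then fit a class-balanced, L1 penalized logistic regression.
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), make_column_selector(dtype_include=object)),
    (RobustScaler(), make_column_selector(dtype_include="number")),
)
model = make_pipeline(
    preprocess,
    LogisticRegression(penalty="l1", solver="liblinear", class_weight="balanced"),
)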

There are a number of problems with using the coefficients of any linear model for feature importance. One of the biggest (after proper normalization) is that the current summarizer doesn’t account for correlation amongst our features. As such, any feature importances should be taken with a healthy dose of skepticism and should only be used to get a rough idea of what might distinguish a particular cluster. That said, this function is provided as a very fast and scalable first-look summarization tool.
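
To see why correlation matters, here is a tiny self-contained illustration (unrelated to the penguins data): with two perfectly correlated copies of the same feature, an L1 penalized model is free to put the weight on either copy, so neither coefficient by itself reflects the feature’s real influence.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x])  # two perfectly correlated columns
y = (x[:, 0] > 0).astype(int)
model = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
print(model.coef_)  # the weight can land on either copy, or be split between them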

More expensive summarizers making use of cross-validation, bootstrapping, or feature permutation should be used in a follow-on analysis before any definitive conclusions are reached. Some of these could easily be wrapped up in their own summarizer functions, as sketched below.
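
As a sketch of what such a follow-on summarizer could look like, the hypothetical class below (not part of TNT) uses scikit-learn’s permutation_importance instead of raw coefficients. Because it provides a summarize function that returns a figure, it could be passed to a PlotSummaryPane in the same way as the built-in summarizers.

from typing import Sequence

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

class PermutationImportanceSummarizer:
    """Hypothetical example; restricted to numeric columns for simplicity."""

    def __init__(self, data: pd.DataFrame):
        self.data = data.select_dtypes(include="number")

    def summarize(self, selected: Sequence[int]):
        # Label the selected rows 1 and everything else 0
        classes = np.zeros(len(self.data), dtype=int)
        classes[list(selected)] = 1
        model = LogisticRegression(class_weight="balanced", max_iter=1000)
        model.fit(self.data, classes)
        # Permutation importance is slower than reading coefficients,
        # but less sensitive to feature scaling
        result = permutation_importance(
            model, self.data, classes, n_repeats=10, random_state=0
        )
        fig, ax = plt.subplots()
        ax.barh(self.data.columns, result.importances_mean)
        ax.set_xlabel("Mean decrease in score")
        return fig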

The basic usage is that we construct a summarizer object with the data it needs to compute its summary, along with any desired parameters. In this case that is the original penguins DataFrame; here we also pass one_hot_categorical_features=False so that feature importance is only calculated on the numeric penguin features.

This summarizer is then passed into the constructor of a PlotSummaryPane; this pane will handle all the necessary display parameters.

[4]:
summarizer = tnt.summary.plot.FeatureImportanceSummarizer(penguins, one_hot_categorical_features=False)
summary_plot = tnt.PlotSummaryPane(summarizer)
summary_plot
[4]:

We see that initially the plot shows “Nothing to summarize”. That is because we haven’t selected any data points yet.

The selected points are handled via a .selected property, which is a zero-based index linking the rows of the DataFrame we passed in with the zero-based indices of any other Panes that we construct. If we are running this in a notebook we can run the following cell to update this property with the indices of all the penguins of species Gentoo. That should update the plot summary pane above with horizontal bars depicting the top importance coefficients differentiating these points from the remaining data.

[5]:
import numpy as np
summary_plot.selected = list(np.where(penguins.species=="Gentoo")[0]);
_images/plotsummarypane_feature_importance_9_0.png

Tying the plots together

To see how this works we’ll need a data map. For that we’ll do some preprocessing of the numeric columns of the penguins data and then apply UMAP.

[6]:
from sklearn.preprocessing import RobustScaler
import umap

We can now build a data map out of the rescaled numeric penguins data, and create a PlotPane for it.

[7]:
data_for_umap = RobustScaler().fit_transform(penguins.select_dtypes(include="number"))
penguin_datamap = umap.UMAP(random_state=37).fit_transform(data_for_umap)
plot = tnt.BokehPlotPane(
    penguin_datamap,
    labels=penguins.species,
    hover_text=penguins.island,
    legend_location="top_right",
    title="Penguins data map",
)

Finally we can link our previously constructed summary_plot PlotSummaryPane with our newly constructed BokehPlotPane. This is done via the link_to_plot method, which ties together the .selected properties of both panes.

[8]:
summary_plot.link_to_plot(plot)
pn.Row(plot, summary_plot)
[8]:

If you are running this in a notebook you can now choose the lasso tool on the leftmost plot and select a set of points. You should see the coefficient values associated with the various features appear on the right.
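
Alternatively, since the panes are now linked, you should be able to drive the selection from code just as we did earlier, for example (assuming .selected is settable on the plot pane, as the linking above suggests):

plot.selected = list(np.where(penguins.species == "Chinstrap")[0])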

Explore the model

If you’d like to get a few more details on how your particular model is performing, the model and the transformed data can be found in the summarizer object.

- summarizer._classes contains a binary vector marking which rows were selected.
- summarizer._classifier contains the model used for prediction.
- summarizer.data contains the transformed data used for prediction.
- summarizer._preprocessor contains the pipeline used for transforming other data into the format expected by the model.

[9]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

predictions = summarizer._classifier.predict(summarizer.data)
truth = summarizer._classes

labels = ['not selected', 'selected']
cm = confusion_matrix(y_true=truth, y_pred=predictions)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot();
_images/plotsummarypane_feature_importance_17_0.png
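
As a quick follow-up sketch, you could also print a text classification report for the same in-sample predictions used in the confusion matrix above.

from sklearn.metrics import classification_report

print(classification_report(truth, predictions, target_names=labels))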