ValueCountsSummarizer

This Not That (TNT) provides a DataFrame viewer, the DataSummaryPane, to help users better understand summaries of their data. We will outline its basic functionality by demonstrating it with one of TNT’s built-in summary functions.

The first step is to load thisnotthat and panel.

[1]:

import thisnotthat as tnt
import panel as pn


To make Panel based objects interactive within a notebook we need to load the panel extension.

[2]:

pn.extension()

Now we need some data to use as an example. In this case we’ll use the Palmer Penguins dataset, which is easily accessible via seaborn.

[3]:

import seaborn as sns
penguins = sns.load_dataset("penguins").dropna(how="any", axis=0)
penguins.head()

[3]:

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	Female
5	Adelie	Torgersen	39.3	20.6	190.0	3650.0	Male

Now we need a summarizer object. This is simply an instance of a class that has a summarize method. For the DataSummaryPane, that summarize method needs to take a selected sequence (the indices of the points you’ve selected in another plot) and return a DataFrame to be displayed in our DataSummaryPane.

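from typing import Sequence
import pandas as pd

def summarize(self, selected: Sequence[int]) -> pd.DataFrame:
    summary = pd.DataFrame()  # build the summary of the rows in `selected` here
    return summary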
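As a concrete illustration of that interface, here is a minimal sketch of a custom summarizer. The class name ColumnMeansSummarizer and the toy data are hypothetical and not part of TNT, and the sketch assumes that a summarize method with this signature is all a DataSummaryPane needs, which is not verified here.

```python
from typing import Sequence

import pandas as pd


class ColumnMeansSummarizer:
    """Hypothetical summarizer: mean of each numeric column over the selection."""

    def __init__(self, data: pd.DataFrame):
        # Keep only the numeric columns; the values in `selected` are row positions.
        self.data = data.select_dtypes(include="number")

    def summarize(self, selected: Sequence[int]) -> pd.DataFrame:
        # `selected` holds zero-based row positions, so use iloc rather than loc.
        means = self.data.iloc[list(selected)].mean()
        return means.rename("mean").to_frame()


# Toy stand-in for the penguins data (made-up column values).
toy = pd.DataFrame(
    {"mass_g": [3750.0, 3800.0, 3250.0, 3450.0], "island": ["A", "A", "B", "B"]}
)
summary = ColumnMeansSummarizer(toy).summarize([0, 1])
print(summary)
```

The data is stored in the constructor and only the selection is passed to summarize, mirroring the pattern described for the built-in summarizers.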

There are a number of useful pre-defined summarizers already included in TNT. Summarizers that return a DataFrame, and are thus appropriate for initializing a DataSummaryPane, are included within the summary.dataframe namespace. For this example we will demonstrate ValueCountsSummarizer.

ValueCountsSummarizer takes a pandas Series in its constructor. It then calls value_counts on this Series to show which of its categorical values have been selected in a linked plot.
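To make the idea concrete, the following pandas-only sketch shows the kind of computation described above: restrict a Series to the selected positions and count the values. It is an illustration under that assumption, not TNT’s actual implementation, which may differ in column names, sorting, or how it handles an empty selection. The toy Series and the names counts and counts_df are made up for this example.

```python
import pandas as pd

# Toy categorical Series standing in for penguins.island.
island = pd.Series(["Torgersen", "Biscoe", "Biscoe", "Dream", "Biscoe"])
selected = [1, 2, 3]  # zero-based positions of the selected points

# Restrict the Series to the selected positions, then count each category.
counts = island.iloc[selected].value_counts()

# Present the counts as a one-column DataFrame, as a summary pane would need.
counts_df = counts.rename("count").to_frame()
print(counts_df)
```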

The basic usage is that we construct a summarizer object with the data it needs to compute its summary and any desired parameters. In this case that is a categorical Series from our penguins DataFrame indicating which island each penguin can be found on.

This summarizer is then passed into the constructor for a DataSummaryPane; the pane handles all the necessary display parameters.

[4]:

summarizer = tnt.summary.dataframe.ValueCountsSummarizer(penguins.island)
summary_df = tnt.DataSummaryPane(summarizer)
summary_df

[4]:

We see that initially the pane shows “Nothing to summarize”. That is because we haven’t selected any data points yet.

The selected points are handled via a .selected property, a zero-based positional index that links the rows of the data we passed into our summarizer with the zero-based index of any other panes that we construct. If we are running this in a notebook we can run the following cell to set this property to the positions of all the penguins of species Gentoo. That should update the above DataSummaryPane with a pandas DataFrame showing which islands this species was recorded on; in this data set, Gentoo penguins appear only on Biscoe.

[5]:

import numpy as np
summary_df.selected = list(np.where(penguins.species == "Gentoo")[0])
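Because .selected is described as a zero-based index, the values assigned to it should be row positions rather than DataFrame index labels. The two differ once rows have been dropped, as happened when we called dropna on the penguins data. The sketch below uses a small made-up DataFrame to show the difference; the names mask, positions and labels are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing row, mimicking the dropna step on the penguins data.
df = pd.DataFrame({"species": ["Adelie", "Gentoo", None, "Gentoo"]}).dropna()

# After dropna the index labels are 0, 1, 3 while the row positions are 0, 1, 2.
mask = (df.species == "Gentoo").to_numpy()

positions = np.where(mask)[0].tolist()  # zero-based row positions
labels = df.index[mask].tolist()        # original index labels, with a gap

print(positions)
print(labels)
```

Applying np.where to a one-dimensional boolean mask returns the positions of the True entries, which is the form a positional selection needs.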

Tying the plots together

To see how this works we’ll need a data map. For that we’ll need some preprocessing for the numeric columns of the penguins data, and UMAP.

[6]:

from sklearn.preprocessing import RobustScaler
import umap

We can now build a data map out of the rescaled numeric penguins data, and create a BokehPlotPane for it.

[7]:

data_for_umap = RobustScaler().fit_transform(penguins.select_dtypes(include="number"))
penguin_datamap = umap.UMAP(random_state=37).fit_transform(data_for_umap)
plot = tnt.BokehPlotPane(
    penguin_datamap,
    labels=penguins.species,
    hover_text=penguins.island,
    width=500,
    height=500,
    legend_location="top_right",
    title="Penguins data map",
)
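For readers unfamiliar with the rescaling step, scikit-learn’s RobustScaler with default settings centers each column on its median and divides by its interquartile range, which keeps outliers from dominating the scale. A minimal NumPy sketch of the same arithmetic on a made-up single column:

```python
import numpy as np

# One toy column with an outlier (values are made up for illustration).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Center on the median and scale by the interquartile range, column by column.
median = np.median(x, axis=0)
q25, q75 = np.percentile(x, [25, 75], axis=0)
scaled = (x - median) / (q75 - q25)

print(scaled.ravel())
```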

Finally we can link our previously constructed summary_df DataSummaryPane with our newly constructed BokehPlotPane. This is done via the link_to_plot method, which ties together the .selected properties of both panes.

[8]:

summary_df.link_to_plot(plot)
pn.Row(plot, summary_df)

[8]:

If you are running this in a notebook you can now choose the lasso tool from the toolbar of the leftmost plot and select a set of points. You should see the distribution of the islands that the selected penguins can be found on.

Multiple summaries

Remember that we can have multiple summaries and panes associated with any selected data. Below we’ll construct a pair of DataSummaryPanes, each with its own ValueCountsSummarizer, to allow us to explore the species and island of our selected data at the same time.

To save some space we’ll import the ValueCountsSummarizer directly and nest our constructors.

[9]:

from thisnotthat.summary.dataframe import ValueCountsSummarizer

summary_island = tnt.DataSummaryPane(ValueCountsSummarizer(penguins.island))
summary_island.link_to_plot(plot)

summary_species = tnt.DataSummaryPane(ValueCountsSummarizer(penguins.species))
summary_species.link_to_plot(plot)

pn.Row(plot, pn.Column(summary_island, summary_species))

[9]:

Once again, in a notebook, choose the lasso tool from the toolbar of the leftmost pane and select various groups of points to see their species and island counts displayed on the right.