ValueCountsSummarizer
This Not That (TNT) provides a DataFrame viewer to help users better understand their summarized data. We will outline the basic functionality of this DataSummaryPane by demonstrating it with one of TNT’s built-in summary functions.
The first step is to load thisnotthat and panel.
[1]:
import thisnotthat as tnt
import panel as pn
To make Panel-based objects interactive within a notebook we need to load the panel extension.
[2]:
pn.extension()
Now we need some data to use as an example. In this case we’ll use the Palmer Penguins dataset, which is easily accessible via seaborn.
[3]:
import seaborn as sns
penguins = sns.load_dataset("penguins").dropna(how="any", axis=0)
penguins.head()
[3]:
 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
Now we need a summarizer object. This is simply an instance of a class that has a summarize method. For the DataSummaryPane, that summarize method needs to take a selected sequence (the indices of the points you’ve selected in another plot) and return a DataFrame to be displayed in our DataSummaryPane.
def summarize(self, selected: Sequence[int]) -> pd.DataFrame:
    summary = ...  # build a DataFrame describing the selected rows
    return summary
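As a concrete illustration of the interface described above, here is a minimal sketch of a custom summarizer. The class name NumericMeanSummarizer and its behaviour are hypothetical and not part of TNT; the sketch assumes only what the text states, namely that a summarizer exposes a summarize method taking a sequence of selected row positions and returning a DataFrame. Whether DataSummaryPane expects anything beyond that is not covered here.

```python
from typing import Sequence

import pandas as pd


class NumericMeanSummarizer:
    """Hypothetical summarizer: mean of each numeric column over the selection."""

    def __init__(self, data: pd.DataFrame):
        # Keep only the numeric columns; selections are treated as row positions.
        self.data = data.select_dtypes(include="number")

    def summarize(self, selected: Sequence[int]) -> pd.DataFrame:
        # Look the rows up by position, then return one row per column with its mean.
        subset = self.data.iloc[list(selected)]
        return subset.mean().rename("mean").to_frame()


# Toy data standing in for a real data set.
toy = pd.DataFrame({"mass": [10.0, 20.0, 30.0], "name": ["a", "b", "c"]})
summary = NumericMeanSummarizer(toy).summarize([0, 2])
print(summary)
```

Storing the data in the constructor and receiving only the selection in summarize mirrors the usage pattern described in this tutorial, where the summarizer is built once and queried each time the selection changes.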
There are a number of useful predefined summarizers already included in TNT. Summarizers that return a DataFrame, and are thus appropriate for initializing a DataSummaryPane, are included within the summary.dataframe namespace. For this example we will demonstrate our ValueCountsSummarizer.
ValueCountsSummarizer
takes a pandas Series in it’s constructor. It then calls a simple value_counts
on this Series in order to get an idea of what categorical values from the Series in question have been selected in a linked plot.
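To make the idea concrete, the following sketch shows the kind of computation described above using plain pandas: restrict a Series to the selected row positions and count the values. This is an illustration under that assumption, not TNT's actual implementation, and the toy Series and the column names "value" and "count" are made up for the example.

```python
import pandas as pd

# Toy categorical Series standing in for penguins.island.
islands = pd.Series(["Biscoe", "Dream", "Biscoe", "Torgersen", "Biscoe"])

# Row positions that a linked plot might report as selected.
selected = [0, 2, 3]

# Count how often each category occurs among the selected rows only.
counts = islands.iloc[selected].value_counts()

# Turn the counts into a two-column DataFrame suitable for display.
summary = counts.rename_axis("value").reset_index(name="count")
print(summary)
```

Categories that do not occur in the selection (here "Dream") simply do not appear in the result.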
The basic usage is that we construct a summarizer object with the data it needs to compute its summary and any desired parameters. In this case that is a categorical series from our penguins DataFrame indicating which island each penguin can be found on.
This summarizer is then passed into the constructor for a DataSummaryPane. This pane will handle all the necessary display parameters.
[4]:
summarizer = tnt.summary.dataframe.ValueCountsSummarizer(penguins.island)
summary_df = tnt.DataSummaryPane(summarizer)
summary_df
[4]:
We see that initially the pane shows “Nothing to summarize”. That is because we haven’t selected any data points yet.
The selected points are handled via a .selected property, which is a zero-based index linking the rows of the data frame we passed into our summarizer with the zero-based index of any other panes that we construct. If we are running this in a notebook we can run the following cell to update this property with the indices of all the penguins of species Gentoo. That should update the above DataSummaryPane with a pandas DataFrame showing that, in this data set, this species of penguin is found only on Biscoe island.
[5]:
import numpy as np
# np.where on the boolean mask gives the row positions of the Gentoo penguins
summary_df.selected = list(np.where(penguins.species == "Gentoo")[0]);
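Because rows with missing values were dropped when the penguins data was loaded, the DataFrame's index labels no longer match its row positions (note the jump from 2 to 4 in the table earlier). Since .selected is described as a zero-based index, positions are what should be passed. The small sketch below, on made-up data, shows the difference between the two.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing row, mimicking the dropna step used on the penguins data.
df = pd.DataFrame(
    {"species": ["Adelie", "Gentoo", None, "Gentoo"]},
    index=[0, 1, 2, 3],
).dropna()

# Zero-based row positions of the matching rows (what a positional selection needs).
positions = np.where(df.species == "Gentoo")[0]

# Index labels of the same rows; these differ once rows have been dropped.
labels = df.index[df.species == "Gentoo"]

print(positions.tolist(), labels.tolist())
```

An alternative would be to call reset_index(drop=True) after dropna so that labels and positions coincide.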
Tying the plots together
To see how this works we’ll need a data map. For that we’ll need some preprocessing for the numeric columns of the penguins data, and UMAP.
[6]:
from sklearn.preprocessing import RobustScaler
import umap
We can now build a data map out of the rescaled numeric penguins data, and create a BokehPlotPane for it.
[7]:
data_for_umap = RobustScaler().fit_transform(penguins.select_dtypes(include="number"))
penguin_datamap = umap.UMAP(random_state=37).fit_transform(data_for_umap)
plot = tnt.BokehPlotPane(
penguin_datamap,
labels=penguins.species,
hover_text=penguins.island,
width=500,
height=500,
legend_location="top_right",
title="Penguins data map",
)
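For readers unfamiliar with the rescaling step used before UMAP: with its default settings, scikit-learn's RobustScaler centres each column on its median and divides by its interquartile range, which makes the result less sensitive to outliers than mean and standard deviation scaling. The following NumPy sketch reproduces that idea for a single column; the helper name robust_scale is hypothetical and the data is made up.

```python
import numpy as np


def robust_scale(column):
    """Centre on the median and divide by the interquartile range (IQR)."""
    column = np.asarray(column, dtype=float)
    q1, median, q3 = np.percentile(column, [25, 50, 75])
    return (column - median) / (q3 - q1)


# One extreme value barely affects how the other values are scaled.
scaled = robust_scale([1, 2, 3, 4, 100])
print(scaled)
```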
Finally we can link our previously constructed summary_df DataSummaryPane with our newly constructed BokehPlotPane. This is done via the link_to_plot method, which ties together the .selected properties of both panes.
[8]:
summary_df.link_to_plot(plot)
pn.Row(plot, summary_df)
[8]:
If you are running this in a notebook you can now select the lasso tool on the leftmost plot and select a set of points. You should see the distribution of the islands on which the selected penguins can be found.
Multiple summaries
Remember that we can have multiple summaries and panes associated with any selected data. Below we’ll construct a pair of ValueCountsSummarizer DataSummaryPanes to allow us to explore the species and island of our selected data at the same time.
To save some space we’ll import the ValueCountsSummarizer directly and nest our constructors.
[9]:
from thisnotthat.summary.dataframe import ValueCountsSummarizer
summary_island = tnt.DataSummaryPane(ValueCountsSummarizer(penguins.island))
summary_island.link_to_plot(plot)
summary_species = tnt.DataSummaryPane(ValueCountsSummarizer(penguins.species))
summary_species.link_to_plot(plot)
pn.Row(plot, pn.Column(summary_island, summary_species))
[9]:
Once again, in a notebook select the lasso tool from the top bar of the leftmost pane and select various groups of points to see their species and island displayed on the right.