TNT API Guide

TNT Provides a number of different Pane and Widget classes that can be combined and linked in various ways. The core Panes are the Plot Panes, with various associated other panes and widgets.

Plot Panes

Data Panes

Summary Panes

Search and Edit Widgets

Finally TNT provides tools for annotating plots with cluster labels. There are various methods for achieving this.

Cluster Labelling Methods

Generate a fine grained clustering either from a UMAP projection of the data, or the map representation itself. Return the resulting cluster centroids in the high space, their equivalent centroids in the map representation, and the condensed tree of the clusterig itself (useful for cases where centroids are not suitable cluster representations).

Parameters:

source_vectors: ArrayLike of shape (n_samples, n_features): The original high dimensional vector representation of the data
map_representation: ArrayLike of shape (n_samples, n_map_features): The map representation of the data
cluster_map_representation: bool (optional, default = False): Whether to directly cluster the map representation, or use UMAP to generate a representation for clustering using umap_n_components many dimensions.
umap_n_components: int (optional, default = 5): The number of dimensions to use UMAP to reduce to if cluster_map_representation is False.
umap_metric: str (optional, default = “cosine”): The metric to pass to UMAP for dimension reduction if cluster_map_representation is False.
umap_n_neighbors: int (optional, default = 15): The number of neighbors to use for UMAP if cluster_map_representation is False.
hdbscan_min_samples: int (optional, default = 10): The min_samples value to use with HDBSCAN for clustering.
hdbscan_min_cluster_size: int (optional, default = 20): The min_cluster_size value to use with HDBSCAN for clustering.
random_state: int or None (optional, default = None): A random state seed that can be fixed to ensure reproducibility.

Returns:

cluster_vectors: ArrayLike of shape (n_clusters, n_features): Centroid representations of each of the fine grained clusters found
map_cluster_locations: ArrayLike of shape (n_clusters, n_map_features): Centroid bsed map locations of each of the fine grained clusters found
condensed_tree: CondensedTree object: The condensed tree representation of the clustering.

thisnotthat.map_cluster_labelling.hdbscan_tree_based_cluster_merger(tree: CondensedTree, clusters_to_merge: List[int]) → List[int]

Given a lost of leaf nodes, find the clusters in the tree that cover all the leaf nodes in the list, and no leaf nodes outside of the list, using higher nodes in the tree to merge clusters whenever possible. This provides point based representations of sets of fine grained clusters.

Parameters:

tree: CondensedTree: The condensed tree containing the relevant cluster information for merging
clusters_to_merge: List of leaf node ids: The leaf nodes to attempt to cover via tree based merging

Returns:

result: List of cluster node ids: The cluster nodes that cover the input leaf nodes optimally.

thisnotthat.map_cluster_labelling.point_set_from_cluster(tree: CondensedTree, cluster_indices: List[int], topic_mask: ndarray[Any, dtype[_ScalarType_co]], leaf_mapping: Dict[int, int]) → List[int]

Given a list of cluster node ids return the source points falling under those cluster ids. We also need to keep track of any masked out clusters, and a mapping to leaf nodes ids.

Parameters:

tree: CondensedTree: The condensed tree containing the relevant cluster information for merging
cluster_indices: List of cluster node ids: The cluster node ids of which to find the underlying points
topic_mask: ArrayLike of bool of shape (n_clusters,): A mask vector determining which leaf nodes from the fine grained clustering to ignore at this time
leaf_mapping: Dict mapping int to int: A mapping from cluster label ids of the fine grained clustering to leaf node ids in the condensed tree

Returns:

points: List of point indices: The indices of the points in the input clusters

Given a fine grained clustering generate hierarchical layers of clusters such that each layer is a clustering of fine-grained clusters. For this we want compact clusters in the map representation, so we use complete linkage on the map representation as the clustering approach. We also want to be wary of duplicating clusters, or creating higher level clusters that include otherwise distinct outlying points. We resolve the first issue by checking the distances to existing clusters and not using higher level clusters that are too close to existing lower level clusters. We resolve the second issue by using outlier detection on the fine grained clusters, progressivly removing more outliers for higher level clusterings.

Depending on whether the desired result is cluster centroids or sets of points a condensed tree may be required.

Parameters:

cluster_vectors: ArrayLike of shape (n_clusters, n_features): The centroid vector representations in terms of the source vector data
cluster_locations: ArrayLike of shape (n_clusters, n_map_features): The centroid map locations of the clusters
min_clusters: int (optional, default = 4): The number of clusters to have at the highest layer; layers with fewer than this number of clusters will be discarded
contamination: float (optional, default = 0.05): The base contamination score used for outlier detection of fine grained clusters. Larger values will prune out more outliers
contamination_multiplier: float (optional, default = 1.5): The value to multiply the contamination score by as we increase the layers – thus applying higher contamination and removing more outliers from higher layers. Larger values will prune more aggressively
max_contamination: float (optional, default = 0.25): The maximum contamination value to use in outlier pruning – once the multiplier increases contamination beyond this value the contamination used will simply be capped at this value.
vector_metric: str (optional, default = “cosine”): The metric to use on the source vector space. This is used to determine if cluster centroid representatives are too close and should be ignored.
cluster_distance_threshold: float (optional, default = 0.025): Cluster centroid representatives from a higher layer that are within this distance of an already selected cluster centroid in a lower layer will be ignored (so we don’t repeat clusters)
return_pointsets: bool (optional, default = False): Whether to return point set data for clusters. This may be required for various approaches to cluster labelling.
hdbscan_tree: CondensedTree or None: If return_pointsets is True then a condensed tree must be provided to generate the relevant pointsets. If return_pointsets is False then this can be None as it will not be used.

Returns:

vector_layers: List of list Arrays: A list of layers; each layer is a list of arrays of the cluster centroids for that layer
location_layers: List of list of Arrays: A list of layers; each layer is a list of arrays of map locations for clusters in that layer
pointset_layers: List of list of lists (optional; only if return_pointsets was True): A list of layers, each layer is a list of point sets (a list of indices) for the clusters in that layer

thisnotthat.map_cluster_labelling.adjust_layer_locations(fixed_layer: ndarray[Any, dtype[_ScalarType_co]], layer_to_adjust: ndarray[Any, dtype[_ScalarType_co]], *, spring_constant: float = 0.1, edge_weight: float = 1.0) → ndarray[Any, dtype[_ScalarType_co]]

Use a spring layout style approach to adjust the locations of a layer to conflict/overlap less with a fixed layer (generally a lower layer, or combination of lower layers). Essentially each cluster in the layer to be adjusted is attached to its location by a spring with weight edge_weight and spring constant spring_constant and then repelled by all the points in the fixed layer.

Parameters:

fixed_layer: Array of shape (n_clusters, n_map_features): cluster positions to remain fixed and which will provide a repulsive force pushing away clusters to be adjusted
layer_to_adjust: Array of shape (n_clusters, n_map_features): cluster positions to be adjusted
spring_constant: float (optional, default = 0.1): The “optimal” distance from the source position; larger values will allow the adjusted cluster to move farther
edge_weight: float (optional, default = 1.0): How strong the springs pull

Returns:

adjusted_positions: Array of shape (n_clusters, n_map_features)

thisnotthat.map_cluster_labelling.text_locations(location_layers: List[ndarray[Any, dtype[_ScalarType_co]]], *, spring_constant: float = 0.1, spring_constant_multiplier: float = 1.5, edge_weight: float = 1.0) → List[ndarray[Any, dtype[_ScalarType_co]]]

Adjust locations of clusters in layers to attempt to avoid too much overlap of cluster – move higher level layer clusters to avoid overlapping with lower level layer clusters, under the assumptions that higher level layers represent more area and thus have some freedom to be moved.

Parameters:

location_layers: List of Arrays: The list of layers, where each layer is a array of positions on the map representation.
spring_constant: float (optional, default = 0.1): The “optimal” distance from the source position; larger values will allow the adjusted cluster to move farther
spring_constant_multiplier: float (optional, default = 1.5): We can increase the spring constant for higher level layers; to do this we multiply by the spring_constant_multiplier as we go up a layer. Smaller values (closer to 1.0) will ensure locations do no stray too far; this is particularly desireable in the case where there are many layers.
edge_weight: float (optional, default = 1.0): How strong the springs pull

Returns:

text_locations: List of list of Arrays: The resulting list of layers, where each layer is a list of positions on the map representation.

Generate labels (usually text) for each cluster in each layer using a joint vector space representation model. To do this we assume we have a text_representation providing a vector to each “word” such that the vectors exist in the same vector space as the vector_layers vector representations. A cluster is then labelled by the “words” closest to the cluster representation in the vector space. By default we avoid keyword re-use in layers by keeping track of which words have already been used in a layer (starting from the top layers and working downward to the finest grained layers), and exclude “words” that have already been used. This behaviour can be turned off if desired.

Parameters:

vector_layers: List of list of Arrays: A list of layers; each layer is a list of cluster centroids existing in the source vector space
text_representations: Array of shape (n_possible_labels, n_features): An array giving a vector (in the source vector space) for each potential label item
text_label_dictionary: Dict mapping indices to labels: A dictionary mapping from indices in the text_representation array to labels (usually words)
items_per_label: int (optional, default = 3): The number of items to use for each cluster label
vector_metric: str (optional, default = “cosine”): The metric to use to measure closeness in the source vector space
pynnd_n_neighbors: int (optional, default = 40): The n_neighbors parameter to use for PyNNDescent for nearest neighbour lookups
query_size: int (optional, default = 10): The number of nearest neighbors to return via PyNNDescent queries; this should be at least items_per_label and often larger if exclude_keyword_reuse is True.
exclude_keyword_reuse: bool (optional, default = True): Whether to ensure keyword/labels don’t get reused for lower level clusters.
random_state: int or None (optional, default = None): A random state parameter, passed to PyNNDescent which can be used to ensure fixed results for reproducibility.

Returns:

labels: List of list of lists of label items: The resulting layers; each layer is a list of cluster labels; each cluster label is a list of label items

thisnotthat.map_cluster_labelling.text_labels_from_source_metadata(pointset_layers: List[List[ndarray[Any, dtype[_ScalarType_co]]]], source_metadataframe: DataFrame, *, items_per_label: int = 3) → List[List[List[Any]]]

Generate text labels for layers of clusters using a dataframe of metadata associated to points. To label a cluster in a layer we train a one versus the rest classifier to discern the cluster and use feature importance to label a cluster with the most discerning features.

Parameters:

pointset_layers: List of list of Arrays: A list of layers; each layer is a list of clusters; each cluster is an array of point indices.
source_metadataframe: DataFrame: A dataframe of metadata associated to the points of data / map representation. Each row of the dataframe should correspond to a point in the dataset (assumed to be in the same order as the points). We will attempt to handle relatively diverse datatypes within the dataframe as well as possible.
items_per_label: int (optional, default = 3): The number of items (features) to label a given cluster with

Returns:

labels: List of list of lists of label items: The resulting layers; each layer is a list of cluster labels; each cluster label is a list of label items

Generate text labels for layers of clusters where each source vector has an associated label representation (usually text, usually a word). The labels are generated by sampling labels from the points in the cluster. Various sampling strategies are available. The cheapest approach is "random". More advanced approaches are available via the apricot-select library which provides submodular-selection. Here we support "saturated_coverage" which is fast; "sum_redundancy" and "graph_cut" which are more expensive, but do a better coverage job; and "facility_location" which does the best job of ensuring diversity and coverage in the selection, but can be quite expensive computationally. "facility_selection" is definitely the best option if items_per_label is very large however.

Parameters:

pointset_layers: List of list of Arrays

A list of layers; each layer is a list of clusters; each cluster is an array of point indices.

source_vectors: Array of shape (n_samples, n_features)

The source vector data from which the map representation was generated.

labels_per_sample: Array of shape (n_samples,)

An array of label items for each source vector.

sample_selection_method: str (optional, default = “facility_selection”)

The selection method to use for sampling from a cluster. Should be one of

"facility_selection"
"graph_cut"
"sum_redundancy"
"saturated_coverage"
"random"

items_per_label: int (optional, default = 3)

The number of items to use for each cluster label

vector_metric: str (optional, default = “cosine”)

The distance metric used in the source_vectors vector space.

sample_weights: Array of shape (n_samples,)

An array of weights to apply to each sample. Higher weight samples may be more likely to be selected. This is only supported for some selection methods (random selection does support it). Check the apricot-select documentation for more details.

random_state: int or None (optional, default = None)

A random state seed to use in random selection.

Returns:

labels: List of list of lists of label items: The resulting layers; each layer is a list of cluster labels; each cluster label is a list of label items

class thisnotthat.map_cluster_labelling.JointVectorLabelLayers(source_vectors: ~numpy._typing._array_like._SupportsArray[~numpy.dtype[~typing.Any]] | ~numpy._typing._nested_sequence._NestedSequence[~numpy._typing._array_like._SupportsArray[~numpy.dtype[~typing.Any]]] | bool | int | float | complex | str | bytes | ~numpy._typing._nested_sequence._NestedSequence[bool | int | float | complex | str | bytes], map_representation: ~numpy._typing._array_like._SupportsArray[~numpy.dtype[~typing.Any]] | ~numpy._typing._nested_sequence._NestedSequence[~numpy._typing._array_like._SupportsArray[~numpy.dtype[~typing.Any]]] | bool | int | float | complex | str | bytes | ~numpy._typing._nested_sequence._NestedSequence[bool | int | float | complex | str | bytes], labelling_vectors: ~numpy._typing._array_like._SupportsArray[~numpy.dtype[~typing.Any]] | ~numpy._typing._nested_sequence._NestedSequence[~numpy._typing._array_like._SupportsArray[~numpy.dtype[~typing.Any]]] | bool | int | float | complex | str | bytes | ~numpy._typing._nested_sequence._NestedSequence[bool | int | float | complex | str | bytes], labels: ~typing.Dict[int, ~typing.Any], *, vector_metric: str = 'cosine', cluster_map_representation: bool = False, umap_n_components: int = 5, umap_n_neighbors: int = 15, hdbscan_min_samples: int = 10, hdbscan_min_cluster_size: int = 20, min_clusters_in_layer: int = 4, contamination: float = 0.05, contamination_multiplier: float = 1.5, max_contamination: float = 0.25, cluster_distance_threshold: float = 0.025, adjust_label_locations: bool = True, label_adjust_spring_constant: float = 0.1, label_adjust_spring_constant_multiplier: float = 1.5, label_adjust_edge_weight: float = 1.0, items_per_label: int = 3, pynnd_n_neighbors: int = 40, pynnd_query_size: int = 10, exclude_keyword_reuse: bool = True, label_formatter: ~typing.Callable[[~typing.List[~typing.Any]], ~typing.Any] = <function string_label_formatter>, random_state: int | None = None)

Generate multiple layers of labelling for a map based on the existence of a joint vector space representation

of the source vector data for the map, and a separate set of label vectors that exist in the same vector space. To do this we assume we have a text_representation providing a vector to each “word” such that the vectors exist in the same vector space as the vector_layers vector representations.

Multiple layers of clusters are generated, with higher level layers having larger more general clusters. Each cluster is then labelled by the “words” closest to the cluster representation in the vector space. By default we avoid keyword re-use in layers by keeping track of which words have already been used in a layer (starting from the top layers and working downward to the finest grained layers), and exclude “words” that have already been used. This behaviour can be turned off if desired.

Parameters:

source_vectors: Array of shape (n_samples, n_features)

The original high dimensional vector representation of the data

map_representation: Array of shape (n_samples, n_map_features): The map representation of the data
labelling_vectors: Array of shape (n_possible_labels, n_features): An array giving a vector (in the source vector space) for each potential label item
labels: Dictionary mapping indices to label items: A dictionary mapping from indices in the labelling_vectors array to labels (usually words)
vector_metric: str (optional, default = “cosine”): The metric to use on the source vector space.
cluster_map_representation: bool (optional, default = False): Whether to directly cluster the map representation, or use UMAP to generate a representation for clustering using umap_n_components many dimensions.
umap_n_components:: The number of dimensions to use UMAP to reduce to if cluster_map_representation is False.
umap_n_neighbors: int (optional, default = 15): The number of neighbors to use for UMAP if cluster_map_representation is False.
hdbscan_min_samples: int (optional, default = 10): The min_samples value to use with HDBSCAN for clustering.
hdbscan_min_cluster_size: int (optional, default = 20): The min_cluster_size value to use with HDBSCAN for clustering.
min_clusters: int (optional, default = 4): The number of clusters to have at the highest layer; layers with fewer than this number of clusters will be discarded
contamination: float (optional, default = 0.05): The base contamination score used for outlier detection of fine grained clusters. Larger values will prune out more outliers
contamination_multiplier: float (optional, default = 1.5): The value to multiply the contamination score by as we increase the layers – thus applying higher contamination and removing more outliers from higher layers. Larger values will prune more aggressively
max_contamination: float (optional, default = 0.25): The maximum contamination value to use in outlier pruning – once the multiplier increases contamination beyond this value the contamination used will simply be capped at this value.
cluster_distance_threshold: float (optional, default = 0.025): Cluster centroid representatives from a higher layer that are within this distance of an already selected cluster centroid in a lower layer will be ignored (so we don’t repeat clusters)
adjust_label_locations: bool (optional, default = True): Whether to attempt to adjust label locations to avoid overlaps with lower layers.
label_adjust_spring_constant: float (optional, default = 0.1): The “optimal” distance from the source position; larger values will allow the adjusted cluster to move farther

label_adjust_spring_constant_multiplier: float (optional, default = 1.5)

We can increase the spring constant for higher level layers; to do this we multiply by the spring_constant_multiplier as we go up a layer. Smaller values (closer to 1.0) will ensure locations do no stray too far; this is particularly desireable in the case where there are many layers.

label_adjust_edge_weight: float (optional, default = 1.0)

How strong the springs pull

items_per_label: int (optional, default = 3)

The number of items to use for each cluster label

pynnd_n_neighbors: int (optional, default = 40)

The n_neighbors parameter to use for PyNNDescent for nearest neighbour lookups

query_size: int (optional, default = 10)

The number of nearest neighbors to return via PyNNDescent queries; this should be at least items_per_label and often larger if exclude_keyword_reuse is True.

exclude_keyword_reuse: bool (optional, default = True)

Whether to ensure keyword/labels don’t get reused for lower level clusters.

label_formatter: Function (optional, default = string_label_formatter)

A function used for format a list of label items into a usable label (usually a single string).

random_state: int or None (optional, default = None)

A random state parameter which can be used to ensure fixed results for reproducibility.

Attributes:

labels: List of list of lists of label items: A list of layers; each layer is a list of labels; each label is a list of label items_per_label many items
location_layers: List of Arrays of shape (n_cluster_in_layer, n_map_features): A list of layers; each layer is an array of locations in the map representation to place the labels of that layer
labels_for_display: List of list of labels: A list of layers; each layer is a list of labels; each label is formatted for display use by label_formatter

class thisnotthat.map_cluster_labelling.MetadataLabelLayers(source_vectors: ~numpy._typing._array_like._SupportsArray[~numpy.dtype[~typing.Any]] | ~numpy._typing._nested_sequence._NestedSequence[~numpy._typing._array_like._SupportsArray[~numpy.dtype[~typing.Any]]] | bool | int | float | complex | str | bytes | ~numpy._typing._nested_sequence._NestedSequence[bool | int | float | complex | str | bytes], map_representation: ~numpy._typing._array_like._SupportsArray[~numpy.dtype[~typing.Any]] | ~numpy._typing._nested_sequence._NestedSequence[~numpy._typing._array_like._SupportsArray[~numpy.dtype[~typing.Any]]] | bool | int | float | complex | str | bytes | ~numpy._typing._nested_sequence._NestedSequence[bool | int | float | complex | str | bytes], metadata_dataframe: ~pandas.core.frame.DataFrame, *, vector_metric: str = 'cosine', cluster_map_representation: bool = False, umap_n_components: int = 5, umap_n_neighbors: int = 15, hdbscan_min_samples: int = 10, hdbscan_min_cluster_size: int = 20, min_clusters_in_layer: int = 4, contamination: float = 0.05, contamination_multiplier: float = 1.5, max_contamination: float = 0.25, cluster_distance_threshold: float = 0.025, adjust_label_locations: bool = True, label_adjust_spring_constant: float = 0.1, label_adjust_spring_constant_multiplier: float = 1.5, label_adjust_edge_weight: float = 1.0, items_per_label: int = 3, label_formatter: ~typing.Callable[[~typing.List[~typing.Any]], ~typing.Any] = <function string_label_formatter>, random_state: int | None = None)

Generate multiple layers of labelling for a map based on a dataframe of metadata associated to points. Multiple layers of clusters are generated, with higher level layers having larger more general clusters. Each cluster is then labelled by training a one versus the rest classifier to discern the cluster in terms of the associated metadata. The feature importances can then be used to label a cluster with the most discerning features.

Parameters:

source_vectors: Array of shape (n_samples, n_features)

The original high dimensional vector representation of the data

map_representation: Array of shape (n_samples, n_map_features): The map representation of the data

metadata_dataframe: DataFrame

A dataframe of metadata associated to the points of data / map representation. Each row of the dataframe should correspond to a point in the dataset (assumed to be in the same order as the points). We will attempt to handle relatively diverse datatypes within the dataframe as well as possible.

vector_metric: str (optional, default = “cosine”)

The metric to use on the source vector space.

cluster_map_representation: bool (optional, default = False)

Whether to directly cluster the map representation, or use UMAP to generate a representation for clustering using umap_n_components many dimensions.

umap_n_components:

The number of dimensions to use UMAP to reduce to if cluster_map_representation is False.

umap_n_neighbors: int (optional, default = 15)

The number of neighbors to use for UMAP if cluster_map_representation is False.

hdbscan_min_samples: int (optional, default = 10)

The min_samples value to use with HDBSCAN for clustering.

hdbscan_min_cluster_size: int (optional, default = 20)

The min_cluster_size value to use with HDBSCAN for clustering.

min_clusters: int (optional, default = 4)

The number of clusters to have at the highest layer; layers with fewer than this number of clusters will be discarded

contamination: float (optional, default = 0.05)

The base contamination score used for outlier detection of fine grained clusters. Larger values will prune out more outliers

contamination_multiplier: float (optional, default = 1.5)

The value to multiply the contamination score by as we increase the layers – thus applying higher contamination and removing more outliers from higher layers. Larger values will prune more aggressively

max_contamination: float (optional, default = 0.25)

The maximum contamination value to use in outlier pruning – once the multiplier increases contamination beyond this value the contamination used will simply be capped at this value.

cluster_distance_threshold: float (optional, default = 0.025)

Cluster centroid representatives from a higher layer that are within this distance of an already selected cluster centroid in a lower layer will be ignored (so we don’t repeat clusters)

adjust_label_locations: bool (optional, default = True)

Whether to attempt to adjust label locations to avoid overlaps with lower layers.

label_adjust_spring_constant: float (optional, default = 0.1)

The “optimal” distance from the source position; larger values will allow the adjusted cluster to move farther

label_adjust_spring_constant_multiplier: float (optional, default = 1.5)

We can increase the spring constant for higher level layers; to do this we multiply by the spring_constant_multiplier as we go up a layer. Smaller values (closer to 1.0) will ensure locations do no stray too far; this is particularly desireable in the case where there are many layers.

label_adjust_edge_weight: float (optional, default = 1.0)

How strong the springs pull

items_per_label: int (optional, default = 3)

The number of items to use for each cluster label

label_formatter: Function (optional, default = string_label_formatter)

A function used for format a list of label items into a usable label (usually a single string).

random_state: int or None (optional, default = None)

A random state parameter which can be used to ensure fixed results for reproducibility.

Attributes:

labels: List of list of lists of label items: A list of layers; each layer is a list of labels; each label is a list of label items_per_label many items
location_layers: List of Arrays of shape (n_cluster_in_layer, n_map_features): A list of layers; each layer is an array of locations in the map representation to place the labels of that layer
labels_for_display: List of list of labels: A list of layers; each layer is a list of labels; each label is formatted for display use by label_formatter

Generate text labels for layers of clusters from data where each source vector has an associated label representation (usually text, usually a word). Multiple layers of clusters are generated, with higher level layers having larger more general clusters. Each cluster is then labelled by sampling labels from the points in the cluster. Various sampling strategies are available. The cheapest approach is "random". More advanced approaches are available via the apricot-select library which provides submodular-selection. Here we support "saturated_coverage" which is fast; "sum_redundancy" and "graph_cut" which are more expensive, but do a better coverage job; and "facility_location" which does the best job of ensuring diversity and coverage in the selection, but can be quite expensive computationally. "facility_selection" is definitely the best option if items_per_label is very large however.

Parameters:

source_vectors: Array of shape (n_samples, n_features)

The original high dimensional vector representation of the data

map_representation: Array of shape (n_samples, n_map_features): The map representation of the data

per_sample_labels: Array of shape (n_samples,)

An array of label items for each source vector.

vector_metric: str (optional, default = “cosine”)

The metric to use on the source vector space.

cluster_map_representation: bool (optional, default = False)

Whether to directly cluster the map representation, or use UMAP to generate a representation for clustering using umap_n_components many dimensions.

sample_selection_method: str (optional, default = “facility_selection”)

The selection method to use for sampling from a cluster. Should be one of

"facility_selection"
"graph_cut"
"sum_redundancy"
"saturated_coverage"
"random"

sample_weights: Array of shape (n_samples,)

An array of weights to apply to each sample. Higher weight samples may be more likely to be selected. This is only supported for some selection methods (random selection does support it). Check the apricot-select documentation for more details.

umap_n_components:

The number of dimensions to use UMAP to reduce to if cluster_map_representation is False.

umap_n_neighbors: int (optional, default = 15)

The number of neighbors to use for UMAP if cluster_map_representation is False.

hdbscan_min_samples: int (optional, default = 10)

The min_samples value to use with HDBSCAN for clustering.

hdbscan_min_cluster_size: int (optional, default = 20)

The min_cluster_size value to use with HDBSCAN for clustering.

min_clusters: int (optional, default = 4)

The number of clusters to have at the highest layer; layers with fewer than this number of clusters will be discarded

contamination: float (optional, default = 0.05)

The base contamination score used for outlier detection of fine grained clusters. Larger values will prune out more outliers

contamination_multiplier: float (optional, default = 1.5)

The value to multiply the contamination score by as we increase the layers – thus applying higher contamination and removing more outliers from higher layers. Larger values will prune more aggressively

max_contamination: float (optional, default = 0.25)

The maximum contamination value to use in outlier pruning – once the multiplier increases contamination beyond this value the contamination used will simply be capped at this value.

cluster_distance_threshold: float (optional, default = 0.025)

Cluster centroid representatives from a higher layer that are within this distance of an already selected cluster centroid in a lower layer will be ignored (so we don’t repeat clusters)

adjust_label_locations: bool (optional, default = True)

Whether to attempt to adjust label locations to avoid overlaps with lower layers.

label_adjust_spring_constant: float (optional, default = 0.1)

The “optimal” distance from the source position; larger values will allow the adjusted cluster to move farther

label_adjust_spring_constant_multiplier: float (optional, default = 1.5)

We can increase the spring constant for higher level layers; to do this we multiply by the spring_constant_multiplier as we go up a layer. Smaller values (closer to 1.0) will ensure locations do no stray too far; this is particularly desireable in the case where there are many layers.

label_adjust_edge_weight: float (optional, default = 1.0)

How strong the springs pull

items_per_label: int (optional, default = 3)

The number of items to use for each cluster label

label_formatter: Function (optional, default = string_label_formatter)

A function used for format a list of label items into a usable label (usually a single string).

random_state: int or None (optional, default = None)

A random state parameter which can be used to ensure fixed results for reproducibility.

Attributes:

labels: List of list of lists of label items: A list of layers; each layer is a list of labels; each label is a list of label items_per_label many items
location_layers: List of Arrays of shape (n_cluster_in_layer, n_map_features): A list of layers; each layer is an array of locations in the map representation to place the labels of that layer
labels_for_display: List of list of labels: A list of layers; each layer is a list of labels; each label is formatted for display use by label_formatter

class thisnotthat.map_cluster_labelling.SparseMetadataLabelLayers(source_vectors: ~numpy._typing._array_like._SupportsArray[~numpy.dtype[~typing.Any]] | ~numpy._typing._nested_sequence._NestedSequence[~numpy._typing._array_like._SupportsArray[~numpy.dtype[~typing.Any]]] | bool | int | float | complex | str | bytes | ~numpy._typing._nested_sequence._NestedSequence[bool | int | float | complex | str | bytes], map_representation: ~numpy._typing._array_like._SupportsArray[~numpy.dtype[~typing.Any]] | ~numpy._typing._nested_sequence._NestedSequence[~numpy._typing._array_like._SupportsArray[~numpy.dtype[~typing.Any]]] | bool | int | float | complex | str | bytes | ~numpy._typing._nested_sequence._NestedSequence[bool | int | float | complex | str | bytes], sparse_metadata: ~scipy.sparse._matrix.spmatrix, feature_name_dictionary: ~typing.Dict[int, str], *, vector_metric: str = 'cosine', cluster_map_representation: bool = False, umap_n_components: int = 5, umap_n_neighbors: int = 15, hdbscan_min_samples: int = 10, hdbscan_min_cluster_size: int = 20, min_clusters_in_layer: int = 4, contamination: float = 0.05, contamination_multiplier: float = 1.5, max_contamination: float = 0.25, cluster_distance_threshold: float = 0.025, adjust_label_locations: bool = True, label_adjust_spring_constant: float = 0.1, label_adjust_spring_constant_multiplier: float = 1.5, label_adjust_edge_weight: float = 1.0, items_per_label: int = 3, label_formatter: ~typing.Callable[[~typing.List[~typing.Any]], ~typing.Any] = <function string_label_formatter>, random_state: int | None = None)

Generate multiple layers of labelling for a map based on a dataframe of metadata associated to points. Multiple layers of clusters are generated, with higher level layers having larger more general clusters. Each cluster is then labelled by training a one versus the rest classifier to discern the cluster in terms of the associated metadata. The feature importances can then be used to label a cluster with the most discerning features.

Parameters:

source_vectors: Array of shape (n_samples, n_features)

The original high dimensional vector representation of the data

map_representation: Array of shape (n_samples, n_map_features): The map representation of the data

sparse_metadata: spmatrix

A sparse matrix of metadata associated to the points of data / map representation. Usually this is associated with metadata that has a high number of features, and any given sample only has non-zero values for a small number of features. A prime example is a bag-of-words representation of a corpus of documents.

feature_name_dictionary: dict

A dictionary mapping column indices of the sparse matrix to feature names. For example, if the sparse matrix were the output of sklearn’s CountVectorizer the dict would be {idx: word for word, idx in model.vocabulary_.items()}.

vector_metric: str (optional, default = “cosine”)

The metric to use on the source vector space.

cluster_map_representation: bool (optional, default = False)

Whether to directly cluster the map representation, or use UMAP to generate a representation for clustering using umap_n_components many dimensions.

umap_n_components:

The number of dimensions to use UMAP to reduce to if cluster_map_representation is False.

umap_n_neighbors: int (optional, default = 15)

The number of neighbors to use for UMAP if cluster_map_representation is False.

hdbscan_min_samples: int (optional, default = 10)

The min_samples value to use with HDBSCAN for clustering.

hdbscan_min_cluster_size: int (optional, default = 20)

The min_cluster_size value to use with HDBSCAN for clustering.

min_clusters: int (optional, default = 4)

The number of clusters to have at the highest layer; layers with fewer than this number of clusters will be discarded

contamination: float (optional, default = 0.05)

The base contamination score used for outlier detection of fine grained clusters. Larger values will prune out more outliers

contamination_multiplier: float (optional, default = 1.5)

The value to multiply the contamination score by as we increase the layers – thus applying higher contamination and removing more outliers from higher layers. Larger values will prune more aggressively

max_contamination: float (optional, default = 0.25)

The maximum contamination value to use in outlier pruning – once the multiplier increases contamination beyond this value the contamination used will simply be capped at this value.

cluster_distance_threshold: float (optional, default = 0.025)

Cluster centroid representatives from a higher layer that are within this distance of an already selected cluster centroid in a lower layer will be ignored (so we don’t repeat clusters)

adjust_label_locations: bool (optional, default = True)

Whether to attempt to adjust label locations to avoid overlaps with lower layers.

label_adjust_spring_constant: float (optional, default = 0.1)

The “optimal” distance from the source position; larger values will allow the adjusted cluster to move farther

label_adjust_spring_constant_multiplier: float (optional, default = 1.5)

We can increase the spring constant for higher level layers; to do this we multiply by the spring_constant_multiplier as we go up a layer. Smaller values (closer to 1.0) will ensure locations do no stray too far; this is particularly desireable in the case where there are many layers.

label_adjust_edge_weight: float (optional, default = 1.0)

How strong the springs pull

items_per_label: int (optional, default = 3)

The number of items to use for each cluster label

label_formatter: Function (optional, default = string_label_formatter)

A function used for format a list of label items into a usable label (usually a single string).

random_state: int or None (optional, default = None)

A random state parameter which can be used to ensure fixed results for reproducibility.

Attributes:

labels: List of list of lists of label items: A list of layers; each layer is a list of labels; each label is a list of label items_per_label many items
location_layers: List of Arrays of shape (n_cluster_in_layer, n_map_features): A list of layers; each layer is an array of locations in the map representation to place the labels of that layer
labels_for_display: List of list of labels: A list of layers; each layer is a list of labels; each label is formatted for display use by label_formatter