algorithms.clustering.hierarchical_clustering¶
Module: algorithms.clustering.hierarchical_clustering
¶
Inheritance diagram for nipy.algorithms.clustering.hierarchical_clustering
:

These routines perform some hierrachical agglomerative clustering of some input data. The following alternatives are proposed: - Distance based average-link - Similarity-based average-link - Distance based maximum-link - Ward’s algorithm under graph constraints - Ward’s algorithm without graph constraints
In this latest version, the results are returned in a ‘WeightedForest’ structure, which gives access to the clustering hierarchy, facilitates the plot of the result etc.
For back-compatibility, *_segment versions of the algorithms have been appended, with the old API (except the qmax parameter, which now represents the number of wanted clusters)
Author : Bertrand Thirion,Pamela Guevara, 2006-2009
Class¶
WeightedForest
¶
-
class
nipy.algorithms.clustering.hierarchical_clustering.
WeightedForest
(V, parents=None, height=None)¶ Bases:
nipy.algorithms.graph.forest.Forest
This is a weighted Forest structure, i.e. a tree - each node has one parent and children (hierarchical structure) - some of the nodes can be viewed as leaves, other as roots - the edges within a tree are associated with a weight: +1 from child to parent -1 from parent to child - additionally, the nodes have a value, which is called ‘height’, especially useful from dendrograms
Methods
adjacency
()returns the adjacency matrix of the graph as a sparse coo matrix all_distances
([seed])returns all the distances of the graph as a tree anti_symmeterize
()anti-symmeterize self, i.e. produces the graph cc
()Compte the different connected components of the graph. check
()Check that self is indeed a forest, i.e. check_compatible_height
()Check that height[parents[i]]>=height[i] for all nodes cliques
()Extraction of the graphe cliques compact_neighb
()returns a compact representation of self compute_children
()Define the children of each node (stored in self.children) copy
()returns a copy of self cut_redundancies
()Returns a graph with redundant edges removed: ecah edge (ab) is present ony once in the edge matrix: the correspondng weights are added. define_graph_attributes
()define the edge and weights array degrees
()Returns the degree of the graph vertices. depth_from_leaves
()compute an index for each node: 0 for the leaves, 1 for dijkstra
([seed])Returns all the [graph] geodesic distances starting from seed floyd
([seed])Compute all the geodesic distances starting from seeds from_3d_grid
(xyz[, k])Sets the graph to be the topological neighbours graph get_E
()To get the number of edges in the graph get_V
()To get the number of vertices in the graph get_children
([v])Get the children of a node/each node get_descendants
(v[, exclude_self])returns the nodes that are children of v as a list get_edges
()To get the graph’s edges get_height
()Get the height array get_vertices
()To get the graph’s vertices (as id) get_weights
()is_connected
()States whether self is connected or not isleaf
()Identification of the leaves of the forest isroot
()Returns an indicator of nodes being roots kruskal
()Creates the Minimum Spanning Tree of self using Kruskal’s algo. leaves_of_a_subtree
(ids[, custom])tests whether the given nodes are the leaves of a certain subtree left_incidence
()Return left incidence matrix list_of_neighbors
()returns the set of neighbors of self as a list of arrays list_of_subtrees
()returns the list of all non-trivial subtrees in the graph main_cc
()Returns the indexes of the vertices within the main cc merge_simple_branches
()Return a subforest, where chained branches are collapsed normalize
([c])Normalize the graph according to the index c partition
(threshold)Partition the tree according to a cut criterion plot
([ax])Plot the dendrogram associated with self plot_height
()Plot the height of the non-leaves nodes propagate_upward
(label)Propagation of a certain labelling from leves to roots propagate_upward_and
(prop)propagates from leaves to roots some binary property of the nodes remove_edges
(valid)Removes all the edges for which valid==0 remove_trivial_edges
()Removes trivial edges, i.e. reorder_from_leaves_to_roots
()reorder the tree so that the leaves come first then their right_incidence
()Return right incidence matrix set_edges
(edges)Sets the graph’s edges set_euclidian
(X)Compute the weights of the graph as the distances between the set_gaussian
(X[, sigma])Compute the weights of the graph as a gaussian function set_height
([height])Set the height array set_weights
(weights)Set edge weights show
([X, ax])Plots the current graph in 2D split
(k)idem as partition, but a number of components are supplied instead subforest
(valid)Creates a subforest with the vertices for which valid > 0 subgraph
(valid)Creates a subgraph with the vertices for which valid>0 symmeterize
()Symmeterize self, modify edges and weights so that self.adjacency becomes the symmetric part of the current self.adjacency. to_coo_matrix
()Return adjacency matrix as coo sparse tree_depth
()Returns the number of hierarchical levels in the tree voronoi_diagram
(seeds, samples)Defines the graph as the Voronoi diagram (VD) that links the seeds. voronoi_labelling
(seed)Performs a voronoi labelling of the graph -
__init__
(V, parents=None, height=None)¶ Parameters: V: the number of edges of the graph :
parents=None: array of shape (V) :
the parents of the graph by default, the parents are set to range(V), i.e. each node is its own parent, and each node is a tree
height=None: array of shape(V) :
the height of the nodes
-
adjacency
()¶ returns the adjacency matrix of the graph as a sparse coo matrix
Returns: adj: scipy.sparse matrix instance, :
that encodes the adjacency matrix of self
-
all_distances
(seed=None)¶ returns all the distances of the graph as a tree
Parameters: seed=None array of shape(nbseed) with valuesin [0..self.V-1] :
set of vertices from which tehe distances are computed
Returns: dg: array of shape(nseed, self.V), the resulting distances :
Notes
By convention infinite distances are given the distance np.inf
-
anti_symmeterize
()¶ anti-symmeterize self, i.e. produces the graph whose adjacency matrix would be the antisymmetric part of its current adjacency matrix
-
cc
()¶ Compte the different connected components of the graph.
Returns: label: array of shape(self.V), labelling of the vertices :
-
check
()¶ Check that self is indeed a forest, i.e. contains no loop
Returns: a boolean b=0 iff there are loops, 1 otherwise : Notes
Slow implementation, might be rewritten in C or cython
-
check_compatible_height
()¶ Check that height[parents[i]]>=height[i] for all nodes
-
cliques
()¶ Extraction of the graphe cliques these are defined using replicator dynamics equations
Returns: cliques: array of shape (self.V), type (np.int) :
labelling of the vertices according to the clique they belong to
-
compact_neighb
()¶ returns a compact representation of self
Returns: idx: array of of shape(self.V + 1): :
the positions where to find the neighors of each node within neighb and weights
neighb: array of shape(self.E), concatenated list of neighbors :
weights: array of shape(self.E), concatenated list of weights :
-
compute_children
()¶ Define the children of each node (stored in self.children)
-
copy
()¶ returns a copy of self
-
cut_redundancies
()¶ Returns a graph with redundant edges removed: ecah edge (ab) is present ony once in the edge matrix: the correspondng weights are added.
Returns: the resulting WeightedGraph :
-
define_graph_attributes
()¶ define the edge and weights array
-
degrees
()¶ Returns the degree of the graph vertices.
Returns: rdegree: (array, type=int, shape=(self.V,)), the right degrees :
ldegree: (array, type=int, shape=(self.V,)), the left degrees :
-
depth_from_leaves
()¶ compute an index for each node: 0 for the leaves, 1 for their parents etc. and maximal for the roots.
Returns: depth: array of shape (self.V): the depth values of the vertices :
-
dijkstra
(seed=0)¶ Returns all the [graph] geodesic distances starting from seed x
- seed (int, >-1, <self.V) or array of shape(p)
- edge(s) from which the distances are computed
Returns: dg: array of shape (self.V), :
the graph distance dg from ant vertex to the nearest seed
Notes
It is mandatory that the graph weights are non-negative
-
floyd
(seed=None)¶ Compute all the geodesic distances starting from seeds
Parameters: seed= None: array of shape (nbseed), type np.int :
vertex indexes from which the distances are computed if seed==None, then every edge is a seed point
Returns: dg array of shape (nbseed, self.V) :
the graph distance dg from each seed to any vertex
Notes
It is mandatory that the graph weights are non-negative. The algorithm proceeds by repeating Dijkstra’s algo for each seed. Floyd’s algo is not used (O(self.V)^3 complexity…)
-
from_3d_grid
(xyz, k=18)¶ Sets the graph to be the topological neighbours graph of the three-dimensional coordinates set xyz, in the k-connectivity scheme
Parameters: xyz: array of shape (self.V, 3) and type np.int, :
k = 18: the number of neighbours considered. (6, 18 or 26) :
Returns: E(int): the number of edges of self :
-
get_E
()¶ To get the number of edges in the graph
-
get_V
()¶ To get the number of vertices in the graph
-
get_children
(v=-1)¶ Get the children of a node/each node
Parameters: v: int, optional :
a node index
Returns: children: list of int the list of children of node v (if v is provided) :
a list of lists of int, the children of all nodes otherwise
-
get_descendants
(v, exclude_self=False)¶ returns the nodes that are children of v as a list
Parameters: v: int, a node index : Returns: desc: list of int, the list of all descendant of the input node :
-
get_edges
()¶ To get the graph’s edges
-
get_height
()¶ Get the height array
-
get_vertices
()¶ To get the graph’s vertices (as id)
-
get_weights
()¶
-
is_connected
()¶ States whether self is connected or not
-
isleaf
()¶ Identification of the leaves of the forest
Returns: leaves: bool array of shape(self.V), indicator of the forest’s leaves :
-
isroot
()¶ Returns an indicator of nodes being roots
Returns: roots, array of shape(self.V, bool), indicator of the forest’s roots :
-
kruskal
()¶ Creates the Minimum Spanning Tree of self using Kruskal’s algo. efficient is self is sparse
Returns: K, WeightedGraph instance: the resulting MST : Notes
If self contains several connected components, will have the same number k of connected components
-
leaves_of_a_subtree
(ids, custom=False)¶ tests whether the given nodes are the leaves of a certain subtree
Parameters: ids: array of shape (n) that takes values in [0..self.V-1] :
custom == False, boolean :
if custom==true the behavior of the function is more specific - the different connected components are considered as being in a same greater tree - when a node has more than two subbranches, any subset of these children is considered as a subtree
-
left_incidence
()¶ Return left incidence matrix
Returns: left_incid: list :
the left incidence matrix of self as a list of lists: i.e. the list[[e.0.0, .., e.0.i(0)], .., [e.V.0, E.V.i(V)]] where e.i.j is the set of edge indexes so that e.i.j[0] = i
-
list_of_neighbors
()¶ returns the set of neighbors of self as a list of arrays
-
list_of_subtrees
()¶ returns the list of all non-trivial subtrees in the graph Caveat: theis function assumes that the vertices are sorted in a way such that parent[i]>i forall i Only the leaves are listeed, not the subtrees themselves
-
main_cc
()¶ Returns the indexes of the vertices within the main cc
Returns: idx: array of shape (sizeof main cc) :
-
merge_simple_branches
()¶ Return a subforest, where chained branches are collapsed
Returns: sf, Forest instance, same as self, without any chain :
-
normalize
(c=0)¶ Normalize the graph according to the index c Normalization means that the sum of the edges values that go into or out each vertex must sum to 1
Parameters: c=0 in {0, 1, 2}, optional: index that designates the way :
according to which D is normalized c == 0 => for each vertex a, sum{edge[e, 0]=a} D[e]=1 c == 1 => for each vertex b, sum{edge[e, 1]=b} D[e]=1 c == 2 => symmetric (‘l2’) normalization
Notes
Note that when sum_{edge[e, .] == a } D[e] = 0, nothing is performed
-
partition
(threshold)¶ Partition the tree according to a cut criterion
-
plot
(ax=None)¶ Plot the dendrogram associated with self the rank of the data in the dendogram is returned
Parameters: ax: axis handle, optional : Returns: ax, the axis handle :
-
plot_height
()¶ Plot the height of the non-leaves nodes
-
propagate_upward
(label)¶ Propagation of a certain labelling from leves to roots Assuming that label is a certain positive integer field this propagates these labels to the parents whenever the children nodes have coherent properties otherwise the parent value is unchanged
Parameters: label: array of shape(self.V) : Returns: label: array of shape(self.V) :
-
propagate_upward_and
(prop)¶ propagates from leaves to roots some binary property of the nodes so that prop[parents] = logical_and(prop[children])
Parameters: prop, array of shape(self.V), the input property : Returns: prop, array of shape(self.V), the output property field :
-
remove_edges
(valid)¶ Removes all the edges for which valid==0
Parameters: valid : (self.E,) array
-
remove_trivial_edges
()¶ Removes trivial edges, i.e. edges that are (vv)-like self.weights and self.E are corrected accordingly
Returns: self.E (int): The number of edges :
-
reorder_from_leaves_to_roots
()¶ reorder the tree so that the leaves come first then their parents and so on, and the roots are last.
Returns: order: array of shape(self.V) :
the order of the old vertices in the reordered graph
-
right_incidence
()¶ Return right incidence matrix
Returns: right_incid: list :
the right incidence matrix of self as a list of lists: i.e. the list[[e.0.0, .., e.0.i(0)], .., [e.V.0, E.V.i(V)]] where e.i.j is the set of edge indexes so that e.i.j[1] = i
-
set_edges
(edges)¶ Sets the graph’s edges
-
set_euclidian
(X)¶ Compute the weights of the graph as the distances between the corresponding rows of X, which represents an embdedding of self
Parameters: X array of shape (self.V, edim), :
the coordinate matrix of the embedding
-
set_gaussian
(X, sigma=0)¶ Compute the weights of the graph as a gaussian function of the distance between the corresponding rows of X, which represents an embdedding of self
Parameters: X array of shape (self.V, dim) :
the coordinate matrix of the embedding
sigma=0, float: the parameter of the gaussian function :
Notes
When sigma == 0, the following value is used:
sigma = sqrt(mean(||X[self.edges[:, 0], :]-X[self.edges[:, 1], :]||^2))
-
set_height
(height=None)¶ Set the height array
-
set_weights
(weights)¶ Set edge weights
Parameters: weights: array :
array shape(self.V): edges weights
-
show
(X=None, ax=None)¶ Plots the current graph in 2D
Parameters: X : None or array of shape (self.V, 2)
a set of coordinates that can be used to embed the vertices in 2D. If X.shape[1]>2, a svd reduces X for display. By default, the graph is presented on a circle
ax: None or int, optional :
ax handle
Returns: ax: axis handle :
Notes
This should be used only for small graphs.
-
split
(k)¶ idem as partition, but a number of components are supplied instead
-
subforest
(valid)¶ Creates a subforest with the vertices for which valid > 0
Parameters: valid: array of shape (self.V): idicator of the selected nodes : Returns: subforest: a new forest instance, with a reduced set of nodes : Notes
The children of deleted vertices become their own parent
-
subgraph
(valid)¶ Creates a subgraph with the vertices for which valid>0 and with the correponding set of edges
Parameters: valid, array of shape (self.V): nonzero for vertices to be retained : Returns: G, WeightedGraph instance, the desired subgraph of self : Notes
The vertices are renumbered as [1..p] where p = sum(valid>0) when sum(valid==0) then None is returned
-
symmeterize
()¶ Symmeterize self, modify edges and weights so that self.adjacency becomes the symmetric part of the current self.adjacency.
-
to_coo_matrix
()¶ Return adjacency matrix as coo sparse
Returns: sp: scipy.sparse matrix instance :
that encodes the adjacency matrix of self
-
tree_depth
()¶ Returns the number of hierarchical levels in the tree
-
voronoi_diagram
(seeds, samples)¶ Defines the graph as the Voronoi diagram (VD) that links the seeds. The VD is defined using the sample points.
Parameters: seeds: array of shape (self.V, dim) :
samples: array of shape (nsamples, dim) :
Notes
By default, the weights are a Gaussian function of the distance The implementation is not optimal
-
voronoi_labelling
(seed)¶ Performs a voronoi labelling of the graph
Parameters: seed: array of shape (nseeds), type (np.int), :
vertices from which the cells are built
Returns: labels: array of shape (self.V) the labelling of the vertices :
-
Functions¶
-
nipy.algorithms.clustering.hierarchical_clustering.
average_link_graph
(G)¶ Agglomerative function based on a (hopefully sparse) similarity graph
Parameters: G the input graph : Returns: t a weightForest structure that represents the dendrogram of the data :
-
nipy.algorithms.clustering.hierarchical_clustering.
average_link_graph_segment
(G, stop=0, qmax=1, verbose=False)¶ Agglomerative function based on a (hopefully sparse) similarity graph
Parameters: G the input graph :
stop: float :
the stopping criterion
qmax: int, optional :
the number of desired clusters (in the limit of the stopping criterion)
verbose : bool, optional
If True, print diagnostic information
Returns: u: array of shape (G.V) :
a labelling of the graph vertices according to the criterion
cost: array of shape (G.V (?)) :
the cost of each merge step during the clustering procedure
-
nipy.algorithms.clustering.hierarchical_clustering.
fusion
(K, pop, i, j, k)¶ Modifies the graph K to merge nodes i and j into nodes k
The similarity values are weighted averaged, where pop[i] and pop[j] yield the relative weights. this is used in average_link_slow (deprecated)
-
nipy.algorithms.clustering.hierarchical_clustering.
ward
(G, feature, verbose=False)¶ Agglomerative function based on a topology-defining graph and a feature matrix.
Parameters: G : graph
the input graph (a topological graph essentially)
feature : array of shape (G.V,dim_feature)
vectorial information related to the graph vertices
verbose : bool, optional
If True, print diagnostic information
Returns: t :
WeightedForest
instancestructure that represents the dendrogram
Notes
When G has more than 1 connected component, t is no longer a tree. This case is handled cleanly now
-
nipy.algorithms.clustering.hierarchical_clustering.
ward_field_segment
(F, stop=-1, qmax=-1, verbose=False)¶ Agglomerative function based on a field structure
Parameters: F the input field (graph+feature) :
stop: float, optional :
the stopping crterion. if stop==-1, then no stopping criterion is used
qmax: int, optional :
the maximum number of desired clusters (in the limit of the stopping criterion)
verbose : bool, optional
If True, print diagnostic information
Returns: u: array of shape (F.V) :
labelling of the graph vertices according to the criterion
cost array of shape (F.V - 1) :
the cost of each merge step during the clustering procedure
Notes
See ward_quick_segment for more information
Caveat : only approximate
-
nipy.algorithms.clustering.hierarchical_clustering.
ward_quick
(G, feature, verbose=False)¶ Agglomerative function based on a topology-defining graph and a feature matrix.
Parameters: G : graph instance
topology-defining graph
feature: array of shape (G.V,dim_feature) :
some vectorial information related to the graph vertices
verbose : bool, optional
If True, print diagnostic information
Returns: t: weightForest instance, :
that represents the dendrogram of the data
Notes :
—- :
Hopefully a quicker version :
A euclidean distance is used in the feature space :
Caveat : only approximate
-
nipy.algorithms.clustering.hierarchical_clustering.
ward_quick_segment
(G, feature, stop=-1, qmax=1, verbose=False)¶ Agglomerative function based on a topology-defining graph and a feature matrix.
Parameters: G: labs.graph.WeightedGraph instance :
the input graph (a topological graph essentially)
feature array of shape (G.V,dim_feature) :
vectorial information related to the graph vertices
stop1 : int or float, optional
the stopping crterion if stop==-1, then no stopping criterion is used
qmax : int, optional
the maximum number of desired clusters (in the limit of the stopping criterion)
verbose : bool, optional
If True, print diagnostic information
Returns: u: array of shape (G.V) :
labelling of the graph vertices according to the criterion
cost: array of shape (G.V - 1) :
the cost of each merge step during the clustering procedure
Notes
Hopefully a quicker version
A euclidean distance is used in the feature space
Caveat : only approximate
-
nipy.algorithms.clustering.hierarchical_clustering.
ward_segment
(G, feature, stop=-1, qmax=1, verbose=False)¶ Agglomerative function based on a topology-defining graph and a feature matrix.
Parameters: G : graph object
the input graph (a topological graph essentially)
feature : array of shape (G.V,dim_feature)
some vectorial information related to the graph vertices
stop : int or float, optional
the stopping crterion. if stop==-1, then no stopping criterion is used
qmax : int, optional
the maximum number of desired clusters (in the limit of the stopping criterion)
verbose : bool, optional
If True, print diagnostic information
Returns: u: array of shape (G.V): :
a labelling of the graph vertices according to the criterion
cost: array of shape (G.V - 1) :
the cost of each merge step during the clustering procedure
Notes
A euclidean distance is used in the feature space
Caveat : when the number of cc in G (nbcc) is greter than qmax, u contains nbcc values, not qmax !