sklearn.cluster
.ward_tree#
- sklearn.cluster.ward_tree(X, *, connectivity=None, n_clusters=None, return_distance=False)[source]#
Ward clustering based on a Feature matrix.
Recursively merges the pair of clusters that minimally increases within-cluster variance.
The inertia matrix uses a Heapq-based representation.
This is the structured version, that takes into account some topological structure between samples.
Read more in the User Guide.
- Parameters:
- Xarray-like of shape (n_samples, n_features)
Feature matrix representing
n_samples
samples to be clustered.- connectivity{array-like, sparse matrix}, default=None
Connectivity matrix. Defines for each sample the neighboring samples following a given structure of the data. The matrix is assumed to be symmetric and only the upper triangular half is used. Default is None, i.e, the Ward algorithm is unstructured.
- n_clustersint, default=None
n_clusters
should be less thann_samples
. Stop early the construction of the tree atn_clusters.
This is useful to decrease computation time if the number of clusters is not small compared to the number of samples. In this case, the complete tree is not computed, thus the ‘children’ output is of limited use, and the ‘parents’ output should rather be used. This option is valid only when specifying a connectivity matrix.- return_distancebool, default=False
If
True
, return the distance between the clusters.
- Returns:
- childrenndarray of shape (n_nodes-1, 2)
The children of each non-leaf node. Values less than
n_samples
correspond to leaves of the tree which are the original samples. A nodei
greater than or equal ton_samples
is a non-leaf node and has childrenchildren_[i - n_samples]
. Alternatively at the i-th iteration, children[i][0] and children[i][1] are merged to form noden_samples + i
.- n_connected_componentsint
The number of connected components in the graph.
- n_leavesint
The number of leaves in the tree.
- parentsndarray of shape (n_nodes,) or None
The parent of each node. Only returned when a connectivity matrix is specified, elsewhere ‘None’ is returned.
- distancesndarray of shape (n_nodes-1,)
Only returned if
return_distance
is set toTrue
(for compatibility). The distances between the centers of the nodes.distances[i]
corresponds to a weighted Euclidean distance between the nodeschildren[i, 1]
andchildren[i, 2]
. If the nodes refer to leaves of the tree, thendistances[i]
is their unweighted Euclidean distance. Distances are updated in the following way (from scipy.hierarchy.linkage):The new entry \(d(u,v)\) is computed as follows,
\[d(u,v) = \sqrt{\frac{|v|+|s|} {T}d(v,s)^2 + \frac{|v|+|t|} {T}d(v,t)^2 - \frac{|v|} {T}d(s,t)^2}\]where \(u\) is the newly joined cluster consisting of clusters \(s\) and \(t\), \(v\) is an unused cluster in the forest, \(T=|v|+|s|+|t|\), and \(|*|\) is the cardinality of its argument. This is also known as the incremental algorithm.
Examples
>>> import numpy as np >>> from sklearn.cluster import ward_tree >>> X = np.array([[1, 2], [1, 4], [1, 0], ... [4, 2], [4, 4], [4, 0]]) >>> children, n_connected_components, n_leaves, parents = ward_tree(X) >>> children array([[0, 1], [3, 5], [2, 6], [4, 7], [8, 9]]) >>> n_connected_components 1 >>> n_leaves 6