hierarchical_clustering
Hierarchical clusterer. Supports continuous attributes only. It builds the full bottom-up agglomerative hierarchy and derives the requested partition by cutting the learned dendrogram at the largest remaining merge distances.
The library implements the clusterer_protocol defined in the
clustering_protocols library. It provides predicates for learning a
full agglomerative hierarchy from a dataset, deriving a
k-cluster cut for prediction, and exporting the learned clusterer as
a list of predicate clauses or to a file.
Datasets are represented as objects implementing the
clustering_dataset_protocol protocol from the
clustering_protocols library.
API documentation
Open the ../../apis/library_index.html#hierarchical_clustering link in a web browser.
Loading
To load this library, load the loader.lgt file:
| ?- logtalk_load(hierarchical_clustering(loader)).
Testing
To test this library's predicates, load the tester.lgt file:
| ?- logtalk_load(hierarchical_clustering(tester)).
To run the performance benchmark suite, load the
tester_performance.lgt file:
| ?- logtalk_load(hierarchical_clustering(tester_performance)).
Features
Full Hierarchy Learning: Builds the complete bottom-up merge hierarchy before deriving the requested cluster partition.
Deterministic Dendrogram Cutting: Produces the final k clusters by repeatedly splitting the highest remaining merge.
Continuous Datasets: Accepts datasets containing only continuous attributes.
Linkage Strategies: Supports single, complete, and average linkage.
Distance Metrics: Supports Euclidean and Manhattan distances.
Optional Feature Scaling: Continuous attributes can be standardized using z-score scaling.
Linkage-Aware Prediction: New instances are assigned to the nearest learned cluster using the selected linkage strategy and distance metric rather than a prototype shortcut.
Incremental Distance Updates: Training caches singleton pair distances once and updates merge distances incrementally for single, complete, and average linkage instead of recomputing all member-pair distances after every merge (see the sketch after this list).
Heap-Based Merge Selection: Training uses a lazy min-heap of candidate cluster pairs and skips stale entries for merged-away nodes instead of rescanning all active pairs at every agglomeration step.
Reusable Hierarchy Cuts: Learned hierarchies can be re-cut to a different k using cut/3 without retraining.
Rich Diagnostics: Learned clusterers record merge counts, dendrogram height, heap rebuilds, scan fallbacks, maximum heap size, and effective options.
Deterministic Tie-Breaking: Equal merge distances and equal split heights are resolved by preferring the smallest node-id pair.
Portable Export: Learned clusterers can be exported as clauses or files and reused later.
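For the supported linkage strategies, the distance from a newly merged cluster to any other cluster can be derived from the two previously cached distances alone, which is what makes the incremental update possible. The sketch below illustrates these standard recurrences; merged_distance/6 is a hypothetical helper name and does not correspond to an actual predicate of this library.

% merged_distance(+Linkage, +DistanceAO, +DistanceBO, +SizeA, +SizeB, -Distance)
% Distance from the cluster obtained by merging A and B to another cluster O,
% given the cached distances d(A,O) and d(B,O) and the sizes of A and B.
merged_distance(single, DistanceAO, DistanceBO, _, _, Distance) :-
    Distance is min(DistanceAO, DistanceBO).
merged_distance(complete, DistanceAO, DistanceBO, _, _, Distance) :-
    Distance is max(DistanceAO, DistanceBO).
merged_distance(average, DistanceAO, DistanceBO, SizeA, SizeB, Distance) :-
    Distance is (SizeA*DistanceAO + SizeB*DistanceBO) / (SizeA + SizeB).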
Options
The following options can be passed to the learn/3 predicate:
k(K): Number of clusters to retain after cutting the learned hierarchy. Default is 2.
linkage(Linkage): Linkage strategy to use. Options: single, complete, or average (default).
distance_metric(Metric): Distance metric to use. Options: euclidean (default) or manhattan.
feature_scaling(FeatureScaling): Whether to standardize continuous attributes before clustering. Options: on (default) or off.
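For illustration, a learning goal could look like the one below. The argument order of learn/3 (dataset, options, clusterer) and the use of hierarchical_clustering as the receiving object are assumptions, and my_dataset stands for any object implementing the clustering_dataset_protocol protocol.

| ?- hierarchical_clustering::learn(my_dataset, [k(3), linkage(complete), distance_metric(manhattan)], Clusterer).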
Clusterer representation
The learned clusterer is represented as a compound term with the functor chosen by the user when exporting the clusterer and arity 5. For example:
hierarchical_clustering_clusterer(Encoders, hierarchy(RootState, MergeRecords, Dendrogram), Clusters, Prototypes, Diagnostics)
Where:
Encoders: List of continuous attribute encoders storing attribute name, mean, and scale.
hierarchy(RootState, MergeRecords, Dendrogram): Reusable hierarchy state. RootState and MergeRecords are used internally by cut/3, and Dendrogram is the learned merge tree represented with leaf(Id) and merge(Left, Right, Distance, Size) terms.
Clusters: List of cluster(Id, Points) terms for the selected k-cluster cut.
Prototypes: List of average vectors kept for display and export metadata.
Diagnostics: Diagnostics metadata including the effective training options used to learn the clusterer.
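As a purely illustrative example (identifiers, coordinates, and distances are made up), a hierarchy learned from four two-dimensional examples and cut at k = 2 could bind the dendrogram and cluster arguments as follows:

Dendrogram = merge(merge(leaf(1), leaf(2), 0.7, 2), merge(leaf(3), leaf(4), 1.1, 2), 2.4, 4)
Clusters = [cluster(1, [[0.1, 0.2], [0.3, 0.1]]), cluster(2, [[4.2, 4.0], [4.5, 3.9]])]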
Additional API
cut(+Clusterer, +K, -RecutClusterer): Reuses the learned hierarchy to derive a new K-cluster cut without retraining.
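For example, to re-cut a previously learned clusterer into three clusters (assuming Clusterer was bound by an earlier learn/3 call and that the goal is sent to the hierarchical_clustering object):

| ?- hierarchical_clustering::cut(Clusterer, 3, RecutClusterer).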
Diagnostics
The diagnostics/2 predicate returns metadata terms including:
model(hierarchical_clustering)
cluster_count(Count)
prototype_count(Count)
training_example_count(Count)
merge_count(Count)
dendrogram_height(Height)
heap_rebuild_count(Count)
scan_fallback_count(Count)
maximum_heap_size(Size)
tie_breaking(node_id_order)
options(Options)
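For example, assuming Clusterer was bound by an earlier learn/3 call, that diagnostics/2 takes the clusterer as its first argument, and that the terms are returned as a single list (all counts below are made up for a small six-example dataset):

| ?- hierarchical_clustering::diagnostics(Clusterer, Diagnostics).
Diagnostics = [model(hierarchical_clustering), cluster_count(2), prototype_count(2), training_example_count(6), merge_count(5), dendrogram_height(2.4), heap_rebuild_count(0), scan_fallback_count(0), maximum_heap_size(15), tie_breaking(node_id_order), options([k(2), linkage(average), distance_metric(euclidean), feature_scaling(on)])]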