.. _library_gaussian_mixture_clusterer:

``gaussian_mixture_clusterer``
==============================

This library provides a Gaussian mixture model clusterer. It uses
deterministic expectation-maximization (EM) with diagonal covariance
matrices and supports continuous attributes only.

The library implements the ``clusterer_protocol`` defined in the
``clustering_protocols`` library. It provides predicates for learning a
clusterer from a dataset, assigning new instances to clusters, returning
Gaussian-mixture posterior component probabilities for new instances,
and exporting the learned clusterer as a list of predicate clauses or to
a file.

Datasets are represented as objects implementing the
``clustering_dataset_protocol`` protocol from the
``clustering_protocols`` library.

API documentation
-----------------

Open the
`../../apis/library_index.html#gaussian_mixture_clusterer <../../apis/library_index.html#gaussian_mixture_clusterer>`__
link in a web browser.

Loading
-------

To load this library, load the ``loader.lgt`` file:

::

   | ?- logtalk_load(gaussian_mixture_clusterer(loader)).

Testing
-------

To test this library's predicates, load the ``tester.lgt`` file:

::

   | ?- logtalk_load(gaussian_mixture_clusterer(tester)).

To run the performance benchmark suite, load the
``tester_performance.lgt`` file:

::

   | ?- logtalk_load(gaussian_mixture_clusterer(tester_performance)).

Features
--------

- **Expectation-Maximization**: Learns diagonal Gaussian components
  using deterministic EM updates.
- **Continuous Datasets**: Accepts datasets containing only continuous
  attributes.
- **Configurable Dead-Component Handling**: Dead components can either
  be preserved with zero weight or deterministically reseeded to the
  least-confident training row.
- **Deterministic Initialization**: Supports ``first_k`` and
  deterministic ``spread`` initialization for component means. The
  ``spread`` strategy uses a canonical first seed and canonical
  tie-breaking so equivalent row permutations produce the same
  initialization.
- **Optional Feature Scaling**: Continuous attributes can be
  standardized using z-score scaling.
- **Posterior Prediction**: New instances are assigned to the component
  with the highest posterior score, and ``cluster_probabilities/3`` can
  return the full posterior distribution over components.
- **Training Diagnostics**: Learned clusterers record convergence
  reason, iteration count, average log-likelihood, final delta, and
  effective options.
- **Portable Export**: Learned clusterers can be exported as clauses or
  files and reused later.
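
For reference, the z-score scaling and posterior prediction features
listed above correspond to the standard textbook definitions for a
diagonal-covariance Gaussian mixture. The notation below is chosen here
for illustration and is not taken from the library: :math:`m_d` and
:math:`s_d` are the per-attribute mean and scale, :math:`w_k` is the
weight of component :math:`k`, and :math:`\mu_{kd}` and
:math:`\sigma_{kd}^{2}` are its per-attribute mean and variance. How the
library computes these quantities internally (for example, whether it
works in log space or where the covariance regularization constant is
applied) is not specified in this document.

.. math::

   z_{d} = \frac{x_{d} - m_{d}}{s_{d}}, \qquad
   p(k \mid \mathbf{z}) =
   \frac{w_{k} \prod_{d} \mathcal{N}(z_{d} \mid \mu_{kd}, \sigma_{kd}^{2})}
        {\sum_{j} w_{j} \prod_{d} \mathcal{N}(z_{d} \mid \mu_{jd}, \sigma_{jd}^{2})}

Here :math:`\mathcal{N}(z \mid \mu, \sigma^{2})` denotes the univariate
normal density. An instance is assigned to the component :math:`k` that
maximizes :math:`p(k \mid \mathbf{z})`.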

Gaussian-Mixture-Specific Prediction API
----------------------------------------

In addition to the shared ``cluster/3`` predicate from the clustering
protocols library, this package provides a Gaussian-mixture-specific
predicate:

- ``cluster_probabilities(Clusterer, Instance, Probabilities)``: Returns
  posterior component probabilities for ``Instance`` as
  ``Cluster-Probability`` pairs in component-id order.
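
As a minimal sketch of how the two prediction predicates might be used
together, the query below makes several assumptions that are not
confirmed by this document: ``my_dataset`` is a hypothetical object
implementing ``clustering_dataset_protocol``; ``learn/3`` is assumed to
take the dataset, the learned clusterer, and an options list in that
order; ``cluster/3`` is assumed to mirror the argument order of
``cluster_probabilities/3``; and instances are assumed to be lists of
``Attribute-Value`` pairs. The attribute names and values are
illustrative only. Check the API documentation for the actual
signatures.

::

   | ?- gaussian_mixture_clusterer::learn(my_dataset, Clusterer, [k(2)]),
        Instance = [x-0.5, y-1.2],
        gaussian_mixture_clusterer::cluster(Clusterer, Instance, Cluster),
        gaussian_mixture_clusterer::cluster_probabilities(Clusterer, Instance, Probabilities).

If the assumptions hold, ``Cluster`` would be bound to the component
with the highest posterior score and ``Probabilities`` to the full list
of ``Cluster-Probability`` pairs.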

Options
-------

The following options can be passed to the ``learn/3`` predicate:

- ``k(K)``: Number of mixture components. Default is ``2``.
- ``initialization(Initialization)``: Mean initialization strategy.
  Options: ``spread`` (default) or ``first_k``.
- ``feature_scaling(FeatureScaling)``: Whether to standardize continuous
  attributes before clustering. Options: ``on`` (default) or ``off``.
- ``maximum_iterations(MaximumIterations)``: Maximum number of EM
  iterations. Default is ``100``.
- ``tolerance(Tolerance)``: Per-example average log-likelihood
  convergence tolerance. Default is ``0.0001``.
- ``covariance_regularization(Regularization)``: Positive diagonal
  covariance regularization constant. Default is ``0.001``.
- ``dead_component_policy(Policy)``: Handling for components whose total
  responsibility collapses below the dead-component threshold. Options:
  ``zero_weight`` (default) keeps the previous component with zero
  weight; ``reseed`` relocates the component to the least-confident
  training row and gives it one-example prior weight.
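
The following query sketches how the options above might be combined to
override every default. The option names and values come from the list
above, but ``my_dataset`` is a hypothetical dataset object and the
argument order of ``learn/3`` (dataset, clusterer, options) is an
assumption; consult the API documentation before relying on it.

::

   | ?- gaussian_mixture_clusterer::learn(my_dataset, Clusterer, [
            k(3),
            initialization(first_k),
            feature_scaling(off),
            maximum_iterations(200),
            tolerance(0.00001),
            covariance_regularization(0.01),
            dead_component_policy(reseed)
        ]).

Options omitted from the list would presumably keep their documented
defaults.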

Clusterer representation
------------------------

The learned clusterer is represented as a compound term of arity 5
whose functor is chosen by the user when exporting the clusterer. For
example:

::

   gaussian_mixture_clusterer(Encoders, Components, Weights, Options, Diagnostics)

Where:

- ``Encoders``: List of continuous attribute encoders storing attribute
  name, mean, and scale.
- ``Components``: List of ``component(Mean, Variances)`` terms in
  component-id order.
- ``Weights``: List of mixture weights in component-id order.
- ``Options``: Effective training options used to learn the clusterer.
- ``Diagnostics``: Training diagnostics including convergence reason,
  iteration count, average log-likelihood, final delta, and effective
  options.
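
Because the functor is user-chosen, code that inspects a learned
clusterer cannot match on a fixed name. The sketch below, which is not
part of the library, uses the standard ``=../2`` built-in predicate to
pair each component with its mixture weight. The predicate names
``clusterer_components/2`` and ``pair_components/3`` are hypothetical
helpers introduced here for illustration, and the sketch relies only on
the argument positions described above.

::

   % Pair each component(Mean, Variances) term with its mixture weight.
   % =../2 is used because the functor name is chosen at export time.
   clusterer_components(Clusterer, Pairs) :-
       Clusterer =.. [_Functor, _Encoders, Components, Weights, _Options, _Diagnostics],
       pair_components(Components, Weights, Pairs).

   % Walk both lists in component-id order, building Component-Weight pairs.
   pair_components([], [], []).
   pair_components([Component| Components], [Weight| Weights], [Component-Weight| Pairs]) :-
       pair_components(Components, Weights, Pairs).

Calling ``clusterer_components(Clusterer, Pairs)`` with a learned
clusterer would bind ``Pairs`` to a list of
``component(Mean, Variances)-Weight`` terms in component-id order.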
