.. _library_z_score_anomaly_detector:

``z_score_anomaly_detector``
============================

Statistical Z-score anomaly detector for continuous datasets. The
detector estimates a population mean and standard deviation for each
continuous attribute and supports two learn-time score modes:
``root_mean_square`` for dense multivariate deviation scores and
``any_feature_extreme`` for the maximum absolute Z-score, which is more
informative when anomalies manifest in a single sparse feature.

The library implements the ``anomaly_detector_protocol`` defined in the
``anomaly_detection_protocols`` library. It learns a detector from a
continuous dataset, computes anomaly scores for new instances, predicts
``normal`` or ``anomaly``, and exports learned detectors as clauses or
files.

Datasets are represented as objects implementing the
``anomaly_dataset_protocol`` protocol from the
``anomaly_detection_protocols`` library. See the
``anomaly_detection_protocols/test_datasets`` directory for examples.

API documentation
-----------------

Open the
`../../apis/library_index.html#z-score <../../apis/library_index.html#z-score>`__
link in a web browser.

Loading
-------

To load this library, load the ``loader.lgt`` file:

::

   | ?- logtalk_load(z_score_anomaly_detector(loader)).

Testing
-------

To test this library's predicates, load the ``tester.lgt`` file:

::

   | ?- logtalk_load(z_score_anomaly_detector(tester)).

Features
--------

- **Statistical method**: implements anomaly detection based on standard
  scores, using the population mean and standard deviation of each
  continuous attribute to measure how far new observations deviate from
  the training data distribution.

- **Classical per-attribute Z-score**: for each known attribute value
  ``x``, the library computes the standard score
  ``z = (x - mu) / sigma``, where ``mu`` is the learned population mean
  for that attribute and ``sigma`` is the learned population standard
  deviation.

- **Continuous features only**: accepts datasets whose declared
  attributes are all ``continuous``.

- **Population statistics**: reuses the ``statistics`` library
  ``population`` object to compute per-attribute arithmetic means and
  standard deviations.

- **Baseline training selection**: supports learn-time
  ``baseline_class_values(ClassValues)`` and
  ``baseline_selection_policy(Policy)`` options. The default baseline
  class values are ``[normal]``. The default ``reject`` policy throws an
  error if non-baseline examples are present, while ``filter`` removes
  them before fitting.

- **Missing-value tolerant**: ignores missing values when fitting
  attribute statistics. During scoring, queries must provide at least
  one known value. In the default ``score_mode(root_mean_square)``, the
  raw score is normalized by the number of known values so that scores
  remain comparable across different missing-value patterns.

- **Configurable scoring semantics**: supports both dense multivariate
  deviation scoring using ``score_mode(root_mean_square)`` and sparse
  anomaly detection using ``score_mode(any_feature_extreme)``. The
  default root-mean-square mode reuses the ``numberlist`` library
  Euclidean norm predicate as part of the computation. The
  ``score_mode/1`` option only controls how the per-attribute Z-scores
  are aggregated into a single raw anomaly score.

- **Bounded scoring**: maps the raw anomaly score to ``[0.0, 1.0)``
  using ``Score = Raw / (1 + Raw)``.

- **Default threshold**: the default ``anomaly_threshold(0.70)``
  provides a practical out-of-the-box cutoff for the built-in anomaly
  fixtures while remaining overridable in ``learn/3`` and
  ``predict/4``.

- **Learn-time score mode**: ``score_mode/1`` is recorded in the learned
  detector and reused for subsequent scoring and prediction. Passing a
  ``score_mode/1`` option to ``predict/4`` does not override the learned
  mode.

- **All-missing queries rejected**: scoring and prediction throw a
  ``domain_error(non_empty_known_values, AttributeNames)`` exception
  when every declared feature is missing in the query.

- **Featureless datasets rejected**: datasets must declare at least one
  continuous feature; otherwise ``learn/2-3`` throws a
  ``domain_error(non_empty_features, Dataset)`` exception.

- **Detector export**: learned detectors can be exported as predicate
  clauses.

- **Explicit validation and diagnostics**: supports the shared
  ``check_anomaly_detector/1``, ``valid_anomaly_detector/1``,
  ``diagnostics/2``, ``diagnostic/2``, and
  ``anomaly_detector_options/2`` predicates.
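
The missing-value rules above can be sketched in Python. This is an
illustrative model of the documented behavior, not the library's
implementation; the function name ``rms_score`` is made up for the
sketch:

```python
import math

def rms_score(values, means, scales):
    """Score one query: ignore missing values (None), reject all-missing
    queries, aggregate the known per-attribute Z-scores as a root mean
    square, and map the raw score onto [0.0, 1.0)."""
    zs = [(x - mu) / sigma
          for x, mu, sigma in zip(values, means, scales)
          if x is not None]
    if not zs:
        # Stands in for the domain_error(non_empty_known_values, _) exception
        raise ValueError("non_empty_known_values")
    raw = math.sqrt(sum(z * z for z in zs) / len(zs))  # normalized by known count
    return raw / (1 + raw)

means, scales = [10.0, 0.0, 5.0], [2.0, 1.0, 1.0]
full = rms_score([16.0, 3.0, 8.0], means, scales)       # all attributes known
partial = rms_score([16.0, None, None], means, scales)  # one known attribute
```

Both queries deviate by three scale units in every known attribute, so
both map to the same score (``0.75``) despite their different
missing-value patterns, which is the comparability property described
above.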

Options
-------

The following options are supported by the public API:

- ``anomaly_threshold(Threshold)``: Threshold for ``predict/3-4``
  (default: ``0.70``)
- ``baseline_class_values(ClassValues)``: Learn-time class labels that
  are admissible for baseline fitting (default: ``[normal]``)
- ``baseline_selection_policy(Policy)``: Learn-time handling of examples
  whose class is not listed in ``baseline_class_values/1``. Supported
  values are ``filter`` and ``reject`` (default: ``reject``)
- ``score_mode(Mode)``: Learn-time score aggregation mode for
  ``learn/3``. Supported values are ``root_mean_square`` and
  ``any_feature_extreme`` (default: ``root_mean_square``). If passed to
  ``predict/4``, it is ignored and the value stored in the learned
  detector is used.

Detector representation
-----------------------

The learned detector is represented by default as:

::

   z_score_detector(TrainingDataset, Encoders, Diagnostics)

Where:

- ``TrainingDataset``: training dataset object identifier
- ``Encoders``: list of ``zscore(Attribute, Mean, Scale)`` records
- ``Diagnostics``: learned metadata terms including ``model/1``,
  ``training_dataset/1``, ``attribute_names/1``, ``feature_count/1``,
  ``example_count/1``, and ``options/1``

When exported using ``export_to_clauses/4`` or ``export_to_file/4``,
this detector term is serialized directly as the single argument of the
generated predicate clause so that the exported model can be loaded and
reused as-is.

Notes
-----

Scoring has three stages. First, the detector computes one classical
per-attribute Z-score for each known attribute value using
``z = (x - mu) / sigma``. Second, those per-attribute Z-scores are
aggregated into a single raw anomaly score according to the learned
``score_mode/1`` option. Third, the raw score is mapped to the interval
``[0.0, 1.0)`` using ``Score = Raw / (1 + Raw)``.

With this normalization, a raw score of ``3.0`` maps to ``0.75``.
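
The normalization stage can be checked directly (a minimal sketch, not
library code):

```python
def normalize(raw):
    """Map a non-negative raw anomaly score onto [0.0, 1.0)."""
    return raw / (1 + raw)

# A raw score of 3.0 maps to 0.75; larger raw scores approach
# but never reach 1.0.
three = normalize(3.0)
huge = normalize(1e9)
```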

The ``score_mode/1`` option does not change the classical per-attribute
formula. It only changes the aggregation step. With
``score_mode(root_mean_square)``, the raw score is the root mean square
of the per-attribute Z-scores. With ``score_mode(any_feature_extreme)``,
the raw score is the maximum absolute per-attribute Z-score.
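
The two aggregation modes can be contrasted on the same per-attribute
Z-scores (illustrative Python, assuming nine near-normal attributes and
one extreme one):

```python
import math

def aggregate(zs, mode):
    """Aggregate per-attribute Z-scores into a single raw anomaly score."""
    if mode == "root_mean_square":
        return math.sqrt(sum(z * z for z in zs) / len(zs))
    if mode == "any_feature_extreme":
        return max(abs(z) for z in zs)
    raise ValueError(mode)

# One extreme feature among many near-normal ones: the sparse mode
# flags it, while the dense mode dilutes it across dimensions.
zs = [0.1] * 9 + [6.0]
rms = aggregate(zs, "root_mean_square")
extreme = aggregate(zs, "any_feature_extreme")
```

After the ``Raw / (1 + Raw)`` mapping, the root-mean-square score falls
below the default ``0.70`` threshold while the any-feature-extreme score
exceeds it, so only the sparse mode flags this query as an anomaly.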

The ``baseline_class_values/1`` option declares which dataset class
labels are admissible for fitting the baseline means and standard
deviations. The ``baseline_selection_policy/1`` option then controls
what happens when other labels are present in the training data. The
default ``reject`` policy raises a
``domain_error(baseline_only_training_data, Dataset)`` exception when
any non-baseline example is found. The ``filter`` policy removes
non-baseline examples before fitting.

Attributes with zero observed dispersion are assigned a fallback scale
of ``1.0``. This keeps the detector well-defined for singleton datasets
or constant columns while still yielding zero score for matching values
and positive scores for deviating values.
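
The fallback scale behavior can be sketched as follows (the helper
``fit_scale`` is invented for the sketch and is not part of the
library's API):

```python
import math

def fit_scale(values, fallback=1.0):
    """Population standard deviation, falling back to 1.0 for singleton
    samples or constant columns (zero observed dispersion)."""
    mean = sum(values) / len(values)
    sigma = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
    return sigma if sigma > 0.0 else fallback

column = [7.0, 7.0, 7.0]                 # constant training column
mu, sigma = 7.0, fit_scale(column)       # sigma falls back to 1.0
z_match = (7.0 - mu) / sigma             # matching value -> zero Z-score
z_deviate = (9.0 - mu) / sigma           # deviating value -> positive Z-score
```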

The root-mean-square aggregation keeps the default threshold stable as
the number of observed dimensions grows and avoids penalizing partially
observed queries solely for having fewer known attributes.
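
A quick way to see this stability, assuming every known attribute
deviates by the same two scale units (illustrative only):

```python
import math

def rms(zs):
    return math.sqrt(sum(z * z for z in zs) / len(zs))

def euclidean(zs):  # un-normalized Euclidean norm, for contrast
    return math.sqrt(sum(z * z for z in zs))

# The root mean square stays at 2.0 whether 1, 5, or 50 attributes are
# observed, while the plain Euclidean norm grows with dimensionality.
rms_scores = {n: rms([2.0] * n) for n in (1, 5, 50)}
norms = {n: euclidean([2.0] * n) for n in (1, 5, 50)}
```

With the root mean square, the normalized score stays at ``2/3``
regardless of how many attributes are known, so the default ``0.70``
threshold need not be retuned per dimensionality.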

Use ``score_mode(any_feature_extreme)`` when a single extreme feature
should be sufficient to flag an anomaly in high-dimensional data.
