iqr_anomaly_detector

Statistical interquartile-range anomaly detector for continuous datasets, based on Tukey interquartile fences. For each continuous attribute, the detector learns the first and third sample quartiles (Q1 and Q3) from the baseline training data. When scoring, it computes, for each known attribute value, the exceedance beyond [Q1, Q3] in interquartile-range units, normalizes it by the learned fence_multiplier/1, and aggregates the per-attribute normalized deviations according to score_mode/1. With fence_multiplier(K), any value at or beyond the Tukey fences [Q1 - K*IQR, Q3 + K*IQR] thus reaches the default anomaly decision boundary.

The library implements the anomaly_detector_protocol defined in the anomaly_detection_protocols library. It learns a detector from a continuous dataset, computes anomaly scores for new instances, predicts normal or anomaly, and exports learned detectors as clauses or files.

Datasets are represented as objects implementing the anomaly_dataset_protocol protocol from the anomaly_detection_protocols library. See the anomaly_detection_protocols/test_datasets directory for examples.

API documentation

Open the ../../apis/library_index.html#iqr_anomaly_detector link in a web browser.

Loading

To load this library, load the loader.lgt file:

| ?- logtalk_load(iqr_anomaly_detector(loader)).

Testing

To test this library's predicates, load the tester.lgt file:

| ?- logtalk_load(iqr_anomaly_detector(tester)).

Features

  • Statistical method: implements anomaly detection based on interquartile-range fences, using per-attribute first and third quartiles to measure how far new observations deviate from the central baseline distribution.

  • Quartile-based scoring: for each known attribute value x, the library computes the positive exceedance of x beyond the interval [Q1, Q3] in interquartile-range units, where Q1 and Q3 are the learned sample quartiles.

  • Continuous features only: accepts datasets whose declared attributes are all continuous.

  • Robust statistics: reuses the statistics library sample object quartiles/4 predicate to compute per-attribute quartiles.

  • Baseline training selection: supports learn-time baseline_class_values(ClassValues) and baseline_selection_policy(Policy) options. The default baseline class values are [normal]. The default reject policy throws an error if non-baseline examples are present, while filter removes them before fitting.

  • Missing-value tolerant: ignores missing values when fitting the per-attribute statistics; during scoring, queries must provide at least one known value. In the default score_mode(root_mean_square), the raw score is aggregated only over the attributes with a positive normalized deviation, so that missing and neutral inlier attributes do not dilute fence anomalies. The learned detector stores a precomputed attribute schema so that scoring reuses the same attribute ordering without rebuilding it on every call.

  • Configurable scoring semantics: supports both dense multivariate deviation scoring using score_mode(root_mean_square) and sparse anomaly detection using score_mode(any_feature_extreme). The default root-mean-square mode reuses the numberlist library Euclidean norm predicate as part of the computation.

  • Configurable Tukey fences: supports a learn-time fence_multiplier/1 option. The default 1.5 corresponds to the classical Tukey inner fence cutoff, and the learned multiplier is applied directly in the score path.

  • Bounded scoring: maps the raw multivariate IQR score to [0.0, 1.0) using Score = Raw / (1 + Raw).

  • Default threshold: the default anomaly_threshold(0.5) corresponds to the learned Tukey fence cutoff after score scaling, while remaining overrideable in learn/3 and predict/4.

  • Learn-time options: fence_multiplier/1 and score_mode/1 are recorded in the learned detector and reused for subsequent scoring and prediction. Passing either option to predict/4 does not override the learned value.

  • All-missing queries rejected: scoring and prediction throw a domain_error(non_empty_known_values, AttributeNames) exception when every declared feature is missing in the query.

  • Featureless datasets rejected: datasets must declare at least one continuous feature; otherwise learn/2-3 throws a domain_error(non_empty_features, Dataset) exception.

  • Detector export: learned detectors can be exported as predicate clauses or saved to files using the export_to_clauses/4 and export_to_file/4 predicates.

  • Explicit validation and diagnostics: supports the shared check_anomaly_detector/1, valid_anomaly_detector/1, diagnostics/2, diagnostic/2, and anomaly_detector_options/2 predicates.

Options

The following options are supported by the public API; a usage sketch follows the list:

  • anomaly_threshold(Threshold): Threshold for predict/3-4 (default: 0.5)

  • baseline_class_values(ClassValues): Learn-time class labels that are admissible for baseline fitting (default: [normal])

  • baseline_selection_policy(Policy): Learn-time handling of examples whose class is not listed in baseline_class_values/1. Supported values are filter and reject (default: reject)

  • fence_multiplier(Multiplier): Learn-time Tukey fence multiplier stored in the learned detector (default: 1.5)

  • score_mode(Mode): Learn-time score aggregation mode for learn/3. Supported values are root_mean_square and any_feature_extreme (default: root_mean_square). If passed to predict/4, it is ignored and the value stored in the learned detector is used.
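
As an illustration only, the following query sketches how these options might be combined at learn time. It assumes that the library defines an iqr_anomaly_detector object whose learn/3 predicate takes the dataset object identifier, the option list, and the learned detector in that order, and that server_metrics is a user-defined dataset object implementing the anomaly_dataset_protocol protocol; both are assumptions made for the example, not facts stated in this file.

| ?- iqr_anomaly_detector::learn(
         server_metrics,    % hypothetical dataset object
         [fence_multiplier(3.0), score_mode(any_feature_extreme), anomaly_threshold(0.6)],
         Detector            % assumed learn/3 argument order
     ).

Because fence_multiplier/1 and score_mode/1 are recorded in the learned detector, subsequent predict/4 calls reuse them; only the anomaly_threshold/1 option can usefully be passed again at prediction time.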

Detector representation

The learned detector is represented by default as:

iqr_detector(TrainingDataset, AttributeSchema, Encoders, Diagnostics)

Where:

  • TrainingDataset: training dataset object identifier

  • AttributeSchema: precomputed attribute ordering used for validation and scoring

  • Encoders: list of iqr_anomaly_detector(Attribute, Q1, Q3, Scale) records

  • Diagnostics: learned metadata terms including model/1, training_dataset/1, attribute_names/1, feature_count/1, example_count/1, and options/1

When exported using export_to_clauses/4 or export_to_file/4, this detector term is serialized directly as the single argument of the generated predicate clause so that the exported model can be loaded and reused as-is.
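
As an illustration of this representation, a detector learned from a hypothetical two-attribute dataset could be serialized along the following lines. All names and numbers are invented for the example, the attribute schema is shown simply as a list of attribute names, and the Scale arguments are shown as the learned interquartile ranges; these details are assumptions rather than guarantees of the actual term layout.

iqr_detector(
    server_metrics,                                        % TrainingDataset (hypothetical)
    [cpu_load, memory_use],                                % AttributeSchema (assumed shape)
    [iqr_anomaly_detector(cpu_load,   0.20, 0.60, 0.40),   % per-attribute Q1, Q3, Scale
     iqr_anomaly_detector(memory_use, 0.35, 0.55, 0.20)],
    [model(iqr_anomaly_detector), training_dataset(server_metrics),
     attribute_names([cpu_load, memory_use]), feature_count(2),
     example_count(120), options([fence_multiplier(1.5), score_mode(root_mean_square)])]
)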

Notes

Scoring has three stages. First, the detector computes one per-attribute IQR deviation score for each known attribute value using its exceedance beyond the interval [Q1, Q3] in interquartile-range units. Second, each per-attribute score is normalized by the learned fence_multiplier/1, so that it reaches 1.0 exactly at the chosen Tukey fence cutoff. Third, those normalized per-attribute scores are aggregated into a single raw deviation score according to the learned score_mode/1 option before being mapped to the interval [0.0, 1.0) using Score = Raw / (1 + Raw).
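
As a worked example of the three stages for a single known attribute, assume a learned Q1 of 10.0 and Q3 of 20.0 (so an IQR of 10.0), the default fence_multiplier(1.5), and a query value of 35.0, which lies exactly on the upper Tukey fence Q3 + 1.5*IQR. With a single known attribute both aggregation modes leave the raw score equal to the normalized deviation, so the query below only reproduces the arithmetic and is not part of the library API:

| ?- IQR is 20.0 - 10.0,                        % Q1 = 10.0, Q3 = 20.0
     Deviation is max(0.0, 35.0 - 20.0) / IQR,  % stage 1: exceedance in IQR units
     Normalized is Deviation / 1.5,             % stage 2: divide by fence_multiplier
     Score is Normalized / (1 + Normalized).    % stage 3: map into [0.0, 1.0)

The computed bindings are IQR = 10.0, Deviation = 1.5, Normalized = 1.0, and Score = 0.5: a value sitting exactly on the fence maps to the default anomaly_threshold(0.5).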

The score_mode/1 option does not change the per-attribute quartile formula. It only changes the aggregation step. With score_mode(root_mean_square), the raw score is the root mean square of the positive normalized per-attribute deviations, computed over the attributes with positive deviation so that inlier padding does not dilute fence-reaching anomalies. With score_mode(any_feature_extreme), the raw score is the maximum normalized per-attribute deviation.
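
For example, with normalized per-attribute deviations of 2.0 and 1.0 on two attributes and zero deviation on all remaining attributes (which are therefore excluded from the root-mean-square aggregation), the two modes yield different raw scores; the arithmetic below is illustrative only and is not a library call:

| ?- RMS is sqrt((2.0*2.0 + 1.0*1.0) / 2),   % root_mean_square raw score
     Max is max(2.0, 1.0).                   % any_feature_extreme raw score

The root-mean-square raw score is about 1.58 and the any_feature_extreme raw score is 2.0; after the Raw / (1 + Raw) mapping these become approximately 0.61 and 0.67, respectively.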

The fence_multiplier/1 option defines the Tukey fence cutoff used to normalize the per-attribute deviation scores; its default value of 1.5 is the classical inner fence multiplier. With any learned fence_multiplier(K), a per-attribute score of 1.0 means the query value has reached the chosen Tukey fence on that attribute, so, after the Raw / (1 + Raw) mapping, the default threshold of 0.5 corresponds to that cutoff in both supported aggregation modes.

The baseline_class_values/1 option declares which dataset class labels are admissible for fitting the baseline quartiles and interquartile ranges. The baseline_selection_policy/1 option then controls what happens when other labels are present in the training data. The default reject policy raises a domain_error(baseline_only_training_data, Dataset) exception when any non-baseline example is found. The filter policy removes non-baseline examples before fitting.
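
As an illustration only, the following learn/3 call fits the baseline on examples labeled normal and silently drops all other examples instead of raising the reject error. As in the earlier sketch, the learn/3 argument order and the network_traffic dataset object are assumptions made for the example:

| ?- iqr_anomaly_detector::learn(
         network_traffic,    % hypothetical mixed-label dataset object
         [baseline_class_values([normal]), baseline_selection_policy(filter)],
         Detector
     ).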

Attributes with zero observed interquartile range are assigned a fallback scale of 1.0. This keeps the detector well-defined for singleton datasets or constant columns while still yielding zero score for matching values and positive scores for deviating values.
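
For instance, if an attribute is constant at 5.0 in the baseline data (Q1 = Q3 = 5.0, hence a zero IQR), the fallback scale of 1.0 gives a query value of 5.0 a per-attribute deviation of 0.0, while a query value of 6.5 gets a deviation of (6.5 - 5.0) / 1.0 = 1.5, which reaches the Tukey fence under the default fence_multiplier(1.5).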

The root-mean-square aggregation keeps the default threshold stable while avoiding dilution from missing or neutral inlier attributes.

Use score_mode(any_feature_extreme) when a single extreme feature should be sufficient to flag an anomaly in high-dimensional data.