anomaly_detection_protocols

This library provides protocols used in the implementation of machine learning anomaly-detection algorithms. Datasets are represented as objects implementing the anomaly_dataset_protocol protocol. Anomaly detectors are represented as objects importing the anomaly_detector_common category which imports the anomaly_detector_protocol protocol. The category provides shared learn/2, predict/3-4, diagnostics/2, diagnostic/2, anomaly_detector_options/2, file export, baseline training-selection helpers, and dataset helper predicates. It keeps threshold-based prediction and export behavior separate from the algorithm-specific learning, scoring, clause export, pretty-printing, and diagnostics metadata code.

The shared category also provides reusable protected predicates for baseline-only training workflows. Libraries can use the baseline_class_values/1 and baseline_selection_policy/1 options via a single helper instead of reimplementing class-label validation and baseline-example filtering or rejection logic locally.

Learned detector terms can be validated explicitly using the shared check_anomaly_detector/1 and valid_anomaly_detector/1 predicates. This validation API is never called implicitly by scoring, prediction, printing, or export predicates.

This library also provides a reusable shared category, anomaly benchmark datasets, and a small family smoke-test suite.

Export header format

The shared exporter in the anomaly_detector_common category writes a header before the exported clauses in the following format:

% exported anomaly detector predicate: Functor/Arity
% training dataset: Dataset
% options: Options
% Functor(Detector)
Functor(Detector)

The exported clauses serialize the learned detector term as a single predicate argument so that loading the file gives a detector term that can be passed directly to the predict/3-4 and score_all/3 predicates.

When exporting a serialized detector term, using a noun such as detector/1 or model/1 is recommended.

API documentation

Open the ../../apis/library_index.html#anomaly-detection-protocols link in a web browser.

Loading

To load all entities in this library, load the loader.lgt file:

| ?- logtalk_load(anomaly_detection_protocols(loader)).

Testing

To run the library smoke tests, shared category tests, and dataset checks, load the tester.lgt file:

| ?- logtalk_load(anomaly_detection_protocols(tester)).

Test datasets

Several sample datasets are included in the test_datasets directory:

  • gaussian_anomalies.lgt — A synthetic 2D anomaly detection dataset with 48 examples and 2 continuous attributes (x, y). Normal points are sampled from a standard normal distribution centered at the origin. Anomalous points are placed far from the cluster center. Inspired by the canonical test case used in the Extended Isolation Forest paper by Hariri et al. (2019).

  • malformed_anomalies.lgt — A negative fixture with invalid class labels for testing family-level dataset validation.

  • mixed_anomalies.lgt — A small mixed-feature anomaly dataset with 16 examples, 2 continuous attributes (age, income), and 2 categorical attributes (student, credit_rating). Includes missing values and uncommon feature combinations to exercise anomaly-detection code on heterogeneous data.

  • mixed_distance_behaviors.lgt — A compact mixed-feature anomaly fixture with 8 examples, 2 continuous attributes (size, weight), and 2 categorical attributes (color, shape). Intended for smoke-testing continuous plus categorical distance behavior and basic mixed-data handling.

  • sensor_anomalies.lgt — A synthetic industrial sensor anomaly dataset with 40 examples and 3 continuous attributes (temperature, pressure, vibration). Contains missing values (14 examples with missing values, represented using anonymous variables). Normal readings cluster around typical operating ranges. Anomalous readings show extreme values indicating equipment malfunction.

  • shuttle_anomalies.lgt — A subset of the Statlog Shuttle dataset with 50 examples and 9 continuous attributes representing sensor readings from the NASA Space Shuttle. Class 1 (Rad Flow) is the majority class (normal), while all other classes are treated as anomalies. Originally from Catlett, J. (1991). Available from the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/148/statlog+shuttle

  • water_potability.lgt — A water potability dataset with 48 examples and 9 continuous attributes (pH, hardness, solids, chloramines, sulfate, conductivity, organic carbon, trihalomethanes, turbidity). Normal instances represent potable water samples within acceptable ranges. Anomalous instances represent water samples with hazardous contamination levels. Based on the publicly available Water Quality dataset (Kadiwal, A., 2020, Kaggle).