random_forest_regression

Random Forest regressor supporting continuous and mixed-feature datasets. The library implements the regressor_protocol defined in the regression_protocols library. It learns an ensemble of regression trees, each trained on a bootstrap sample using per-split random feature subsets, and predicts with the arithmetic mean of the individual tree predictions.
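
The bootstrap-and-average idea described above can be illustrated with a minimal, language-neutral sketch in Python. It is not the library's code: the names bootstrap_sample, fit_constant_tree, fit_forest, and predict are hypothetical, each "tree" is reduced to a single leaf that predicts the mean target of its bootstrap sample, and Python's random.Random stands in for the library's generator, so the numbers it produces will not match the library's.

```python
import random
from statistics import fmean

def bootstrap_sample(examples, rng):
    """Draw len(examples) examples uniformly at random, with replacement."""
    return [rng.choice(examples) for _ in examples]

def fit_constant_tree(sample):
    """Stand-in for a regression tree: one leaf predicting the sample's mean target."""
    leaf_value = fmean(target for _, target in sample)
    return lambda features: leaf_value

def fit_forest(examples, number_of_trees, seed):
    """Train one stand-in tree per bootstrap sample, all driven by one seeded generator."""
    rng = random.Random(seed)  # assumption: stand-in generator, not xoshiro128pp
    return [
        fit_constant_tree(bootstrap_sample(examples, rng))
        for _ in range(number_of_trees)
    ]

def predict(forest, features):
    """Ensemble prediction: arithmetic mean of the individual tree predictions."""
    return fmean(tree(features) for tree in forest)

# Each example is a (features, target) pair.
examples = [((1.0,), 10.0), ((2.0,), 12.0), ((3.0,), 20.0), ((4.0,), 22.0)]
forest = fit_forest(examples, number_of_trees=10, seed=1357911)
prediction = predict(forest, (2.5,))
```

Because every tree sees a different resampling of the training set, the averaged prediction varies less than any single tree's prediction would.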

API documentation

Open the ../../apis/library_index.html#random_forest_regression link in a web browser.

Loading

To load this library, load the loader.lgt file:

| ?- logtalk_load(random_forest_regression(loader)).

Testing

To test this library's predicates, load the tester.lgt file:

| ?- logtalk_load(random_forest_regression(tester)).

To run the performance benchmark suite, load the tester_performance.lgt file:

| ?- logtalk_load(random_forest_regression(tester_performance)).

Features

  • Bootstrap Ensembles: Trains multiple regression trees on bootstrap samples.

  • Random Feature Subsets: Samples a random subset of the available dataset attributes at each split of every tree.

  • Portable Seeded Sampling: Uses fast_random(xoshiro128pp) so bootstrap and split-level feature sampling are portable and reproducible.

  • Tree Averaging: Predicts numeric targets using the arithmetic mean of the tree predictions.

  • Tree Configuration: Exposes the underlying regression-tree split-feature, depth, minimum-leaf, variance-reduction, and scaling options.

  • Categorical Feature Encoding: Uses reference-level dummy coding derived from the declared dataset attribute values, plus a missing-value indicator. The tree learners treat the resulting encoded features as ordinary numeric split features.

  • Diagnostics Metadata: Learned regressors record model name, target, training example count, attribute count, tree count, and effective options, accessible using the shared regression diagnostics predicates.

  • Model Export: Learned regressors can be exported as predicate clauses or written to a file.

  • Reference Benchmarks: Includes a dedicated performance suite reporting training time, RMSE, and MAE for representative regression datasets.
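
As an illustration of the categorical feature encoding listed above, here is a small Python sketch of reference-level dummy coding with a missing-value indicator. It rests on assumptions not stated in this document: the first declared value is taken as the reference level, a missing value is represented as None, and the function name dummy_encode is hypothetical. The library's actual choices may differ.

```python
def dummy_encode(value, declared_values):
    """Encode one categorical value as k-1 level indicators plus a missing flag.

    Assumptions: the first declared value is the reference level (all level
    indicators zero) and None marks a missing value.
    """
    if value is not None and value not in declared_values:
        raise ValueError(f"undeclared value: {value!r}")
    non_reference_levels = declared_values[1:]
    indicators = [1.0 if value == level else 0.0 for level in non_reference_levels]
    missing = 1.0 if value is None else 0.0
    return indicators + [missing]

colors = ["red", "green", "blue"]              # declared attribute values
encoded_red = dummy_encode("red", colors)      # reference level
encoded_blue = dummy_encode("blue", colors)    # non-reference level
encoded_missing = dummy_encode(None, colors)   # missing value
```

With three declared values, each example gains two level indicators and one missing-value indicator, all of which a tree can split on like any other numeric feature.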

Regressor representation

The learned regressor is represented by default as:

  • rf_regressor(Trees, Diagnostics)

The exported predicate clauses therefore use the shape:

  • Functor(Trees, Diagnostics)

Diagnostics syntax

The diagnostics/2 predicate returns a list of metadata terms with the form:

[
    model(random_forest_regression),
    target(Target),
    training_example_count(TrainingExampleCount),
    options(Options),
    attribute_count(AttributeCount),
    tree_count(TreeCount)
]

Where:

  • model(random_forest_regression) identifies the learning algorithm that produced the regressor.

  • target(Target) stores the target attribute name declared by the training dataset.

  • training_example_count(TrainingExampleCount) stores the number of examples used during training.

  • options(Options) stores the effective learning options after merging the user options with the library defaults.

  • attribute_count(AttributeCount) stores the number of dataset attributes available to the ensemble before split-level subsampling.

  • tree_count(TreeCount) stores the number of trained regression trees in the ensemble.

Use the regression_protocols diagnostic/2 and regressor_options/2 helper predicates when you only need a single metadata term or the effective options.

Options

The learn/3 predicate accepts the following options:

  • number_of_trees/1: Number of regression trees to train in the ensemble. Increasing this value usually makes the averaged prediction more stable, at the cost of additional training and prediction time. The default is 10.

  • maximum_features_per_split/1: Number of dataset attributes randomly sampled at each split when searching for the best partition. Accepted values are a positive integer or all. When omitted, the library uses the square root of the total number of available attributes, with a minimum of one attribute. Passing all disables split-level attribute subsampling.

  • maximum_depth/1: Maximum depth allowed for each regression-tree base learner. The default is 10.

  • minimum_samples_leaf/1: Minimum number of training examples required in each leaf of a base learner tree. The default is 1.

  • minimum_variance_reduction/1: Minimum split gain required by each base learner tree before accepting a partition. The default is 0.0.

  • feature_scaling/1: Controls z-score standardization of continuous attributes inside each regression-tree base learner. Accepted values are true and false. The default is false.

  • random_seed/1: Positive integer seed used by the portable fast_random(xoshiro128pp) pseudo-random generator when drawing bootstrap samples and split-level random feature subsets. Using the same seed with the same dataset and options reproduces the same learned regressor. The default is 1357911.
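
To make the interaction between maximum_features_per_split/1 and random_seed/1 concrete, the following Python sketch computes the number of attributes considered at a split and draws a seeded feature subset. Several details are assumptions rather than documented behavior: the square root is rounded down, an explicit count larger than the attribute count is capped, the helper names and attribute names are hypothetical, and Python's random.Random stands in for the library's generator.

```python
import math
import random

def features_per_split(attribute_count, maximum_features_per_split=None):
    """Number of attributes sampled at each split for a given option value."""
    if maximum_features_per_split == "all":
        return attribute_count  # disables split-level subsampling
    if maximum_features_per_split is None:
        # Assumption: square root rounded down, with a minimum of one attribute.
        return max(1, math.isqrt(attribute_count))
    # Assumption: an explicit count cannot exceed the available attributes.
    return min(maximum_features_per_split, attribute_count)

def sample_split_features(attributes, count, rng):
    """Draw `count` distinct attributes to consider at one split."""
    return rng.sample(attributes, count)

attributes = ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9"]
count = features_per_split(len(attributes))

# Two generators seeded identically yield the same feature subset.
first = sample_split_features(attributes, count, random.Random(1357911))
second = sample_split_features(attributes, count, random.Random(1357911))
```

The same reasoning applies to bootstrap sampling: with a fixed seed, dataset, and set of options, every random draw repeats, which is what makes the learned regressor reproducible.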