.. _library_random_forest_regression:

``random_forest_regression``
============================

A Random Forest regressor supporting continuous and mixed-feature
datasets. The library implements the ``regressor_protocol`` defined in
the ``regression_protocols`` library. It learns an ensemble of
regression trees trained on bootstrap samples and per-split random
feature subsets, and predicts with the arithmetic mean of the
individual tree predictions.

API documentation
-----------------

Open the
`../../apis/library_index.html#random_forest_regression <../../apis/library_index.html#random_forest_regression>`__
link in a web browser.

Loading
-------

To load this library, load the ``loader.lgt`` file:

::

   | ?- logtalk_load(random_forest_regression(loader)).

Testing
-------

To test this library's predicates, load the ``tester.lgt`` file:

::

   | ?- logtalk_load(random_forest_regression(tester)).

To run the performance benchmark suite, load the
``tester_performance.lgt`` file:

::

   | ?- logtalk_load(random_forest_regression(tester_performance)).

Features
--------

- **Bootstrap Ensembles**: Trains multiple regression trees on bootstrap
  samples.

- **Random Feature Subsets**: Samples a random subset of the available
  dataset attributes at each split of every tree.

- **Portable Seeded Sampling**: Uses ``fast_random(xoshiro128pp)`` so
  bootstrap and split-level feature sampling are portable and
  reproducible.

- **Tree Averaging**: Predicts numeric targets using the arithmetic mean
  of the tree predictions.

- **Tree Configuration**: Exposes the underlying regression-tree
  split-feature, depth, minimum-leaf, variance-reduction, and scaling
  options.

- **Categorical Feature Encoding**: Encodes categorical attributes
  using reference-level dummy coding derived from the declared dataset
  attribute values, plus a missing-value indicator; the resulting
  encoded features are treated as ordinary numeric split features by
  the tree learners.

- **Diagnostics Metadata**: Learned regressors record model name,
  target, training example count, attribute count, tree count, and
  effective options, accessible using the shared regression diagnostics
  predicates.

- **Model Export**: Learned regressors can be exported as predicate
  clauses or written to a file.

- **Reference Benchmarks**: Includes a dedicated performance suite
  reporting training time, RMSE, and MAE for representative regression
  datasets.
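As a sketch of a typical workflow (the dataset object name ``housing``
and the exact argument orders of ``learn/3`` and ``predict/3`` shown
here are assumptions for illustration; consult the API documentation
for the actual predicate signatures):

::

   | ?- random_forest_regression::learn(housing, [number_of_trees(25)], Regressor),
        random_forest_regression::predict(Regressor, Example, Prediction).

With the default ``random_seed/1`` value, repeating the same query over
the same dataset reproduces the same learned regressor.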

Regressor representation
------------------------

The learned regressor is represented by default as:

- ``rf_regressor(Trees, Diagnostics)``

The exported predicate clauses therefore use the shape:

- ``Functor(Trees, Diagnostics)``
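For example, if the export functor were ``housing_model`` (a
hypothetical name chosen here for illustration), each exported clause
would have the form:

::

   housing_model(Trees, Diagnostics).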

Diagnostics syntax
------------------

The ``diagnostics/2`` predicate returns a list of metadata terms with
the form:

::

   [
       model(random_forest_regression),
       target(Target),
       training_example_count(TrainingExampleCount),
       options(Options),
       attribute_count(AttributeCount),
       tree_count(TreeCount)
   ]

Where:

- ``model(random_forest_regression)`` identifies the learning algorithm
  that produced the regressor.
- ``target(Target)`` stores the target attribute name declared by the
  training dataset.
- ``training_example_count(TrainingExampleCount)`` stores the number of
  examples used during training.
- ``options(Options)`` stores the effective learning options after
  merging the user options with the library defaults.
- ``attribute_count(AttributeCount)`` stores the number of dataset
  attributes available to the ensemble before split-level subsampling.
- ``tree_count(TreeCount)`` stores the number of trained regression
  trees in the ensemble.

Use the ``regression_protocols`` ``diagnostic/2`` and
``regressor_options/2`` helper predicates when you only need a single
metadata term or the effective options.
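For instance, assuming ``Regressor`` was obtained from a prior
``learn/3`` call and that the library object answers the protocol
predicates with the regressor as first argument (an assumption; check
the API documentation for the actual argument orders), the tree count
alone can be retrieved with:

::

   | ?- ...,
        random_forest_regression::diagnostic(Regressor, tree_count(TreeCount)).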

Options
-------

The ``learn/3`` predicate accepts the following options:

- ``number_of_trees/1``: Number of regression trees to train in the
  ensemble. Increasing this value usually improves stability at the cost
  of additional training and prediction time. The default is ``10``.
- ``maximum_features_per_split/1``: Number of dataset attributes
  randomly sampled at each split when searching for the best partition.
  Accepted values are a positive integer or ``all``. When omitted, the
  library uses the square root of the total number of available
  attributes, with a minimum of one attribute. Passing ``all`` disables
  split-level attribute subsampling.
- ``maximum_depth/1``: Maximum depth allowed for each regression-tree
  base learner. The default is ``10``.
- ``minimum_samples_leaf/1``: Minimum number of training examples
  required in each leaf of a base learner tree. The default is ``1``.
- ``minimum_variance_reduction/1``: Minimum split gain required by each
  base learner tree before accepting a partition. The default is
  ``0.0``.
- ``feature_scaling/1``: Controls z-score standardization of continuous
  attributes inside each regression-tree base learner. Accepted values
  are ``true`` and ``false``. The default is ``false``.
- ``random_seed/1``: Positive integer seed used by the portable
  ``fast_random(xoshiro128pp)`` pseudo-random generator when drawing
  bootstrap samples and split-level random feature subsets. Using the
  same seed with the same dataset and options reproduces the same
  learned regressor. The default is ``1357911``.
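Putting several of these options together (the dataset argument is left
as a placeholder and the ``learn/3`` argument order is an assumption
for illustration):

::

   | ?- random_forest_regression::learn(Dataset, [
            number_of_trees(50),
            maximum_depth(8),
            maximum_features_per_split(all),
            random_seed(20240101)
        ], Regressor).

Options not listed in the second argument keep their default values;
the merged, effective options are recorded in the regressor's
``options/1`` diagnostics term.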
