Package 'randomMachines'

Title: Ensemble Modeling Using Random Machines
Description: A novel ensemble method employing Support Vector Machines (SVMs) as base learners. This powerful ensemble model is designed for both classification (Ara A., et al., 2021) <doi:10.6339/21-JDS1014> and regression (Ara A., et al., 2022) <doi:10.1016/j.eswa.2022.117107> problems, offering versatility and robust performance across different datasets, and has been compared with other consolidated methods such as Random Forests (Maia M., et al., 2021) <doi:10.6339/21-JDS1025>.
Authors: Mateus Maia [aut, cre], Anderson Ara [ctb], Gabriel Ribeiro [ctb]
Maintainer: Mateus Maia <[email protected]>
License: MIT + file LICENSE
Version: 0.1.0
Built: 2025-02-13 05:43:19 UTC
Source: https://github.com/mateusmaiads/randommachines

Help Index


Bolsa Família Dataset

Description

The 'bolsafam' dataset contains information about the utilization rate of the Bolsa Família program in Brazilian municipalities. The utilization rate $y_i$ is defined as the number of people benefiting from the assistance divided by the total population of the city.

Usage

data(bolsafam)

Format

A data frame with 5564 rows and 11 columns.

Details

This dataset includes the following columns:

y

Rate of use of the social assistance program by municipality.

COD_UF

Code to identify the Brazilian state to which the city belongs.

T_DENS

Percentage of the population living in households with a density greater than 2 people per bedroom.

TRABSC

Percentage of employed persons aged 18 or over who are employed without a formal contract.

PPOB

Proportion of people vulnerable to poverty.

T_NESTUDA_NTRAB_MMEIO

Percentage of people aged 15 to 24 who do not study or work and are vulnerable to poverty.

T_FUND15A17

Percentage of the population aged 15 to 17 with complete primary education.

RAZDEP

Dependency ratio.

T_ATRASO_0_BASICO

Percentage of the population aged 6 to 17 attending basic education who do not have an age-grade delay.

T_AGUA

Percentage of the population living in households with running water.

REGIAO

Aggregation of states according to the regions defined by IBGE.

Source

The 'bolsafam' dataset is sourced from the Brazilian government's transparency website, the Transparency Portal (Portal da Transparência).

References

Mateus Maia & Anderson Ara (2023). rmachines: Random Machines: a package for a support vector ensemble based on random kernel space. R package version 0.1.0.

Examples

data(bolsafam)
head(bolsafam)
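
A hedged sketch of a regression fit on this dataset (the predictor subset is illustrative only; column names are as documented above):

library(randomMachines)
data(bolsafam)
# Regression Random Machines on a few documented predictors
rm_bolsa <- randomMachines(y ~ T_DENS + TRABSC + PPOB + T_AGUA, train = bolsafam)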

Brier Score function

Description

Calculate the Brier Score for a set of predicted probabilities and observed outcomes. The Brier Score is a measure of the accuracy of probabilistic predictions. It is commonly used in the evaluation of predictive models.

Usage

brier_score(prob, observed, levels)

Arguments

prob

predicted probabilities

observed

observed values of $y$ (it is assumed that the positive class is coded as 1 and the negative class as 0)

levels

A string vector with the original levels from the target variable

Value

Returns the Brier Score, a numeric value indicating the accuracy of the predictions.
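
Examples

A minimal sketch of the classical computation behind the score; the direct call to brier_score() is commented out because the expected level coding is an assumption (positive class coded as 1, per the Arguments above):

prob <- c(0.9, 0.1, 0.8, 0.3)   # predicted probabilities of the positive class
observed <- c(1, 0, 1, 1)       # observed 0/1 outcomes
mean((prob - observed)^2)       # classical Brier score by hand: 0.1375
# Assumed equivalent call (level order is an assumption):
# brier_score(prob, observed, levels = c("0", "1"))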


Ionosphere Dataset

Description

The 'ionosphere' dataset contains radar data for the classification of radar returns as either 'good' or 'bad'.

Usage

data(ionosphere)

Format

A data frame with 351 rows and 35 columns.

Details

This dataset includes the following columns:

X1-X34

Features extracted from radar signals.

y

Class label indicating whether the radar return is 'g' (good) or 'b' (bad).

Source

The 'ionosphere' dataset is sourced from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/ionosphere

Examples

data(ionosphere)
head(ionosphere)
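
A hedged sketch of a classification fit on this dataset (illustrative only; uses the documented target column 'y' and the default settings):

library(randomMachines)
data(ionosphere)
rm_ion <- randomMachines(y ~ ., train = ionosphere)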

Prediction function for the rm_class_model

Description

This function predicts the outcome for an RM model object using new data.

Usage

## S4 method for signature 'rm_class'
predict(object, newdata)

Arguments

object

A fitted RM model object of class rm_class.

newdata

A data frame or matrix containing the new data to be predicted.

Value

A vector of predicted outcomes: probabilities in case of 'prob_model = TRUE' and classes in case of 'prob_model = FALSE'.

Examples

# Generating a sample for the simulation
library(randomMachines)
sim_data <- sim_class(n = 75)
sim_new <- sim_class(n = 25)
rm_mod <- randomMachines(y~., train = sim_data)
y_hat <- predict(rm_mod, newdata = sim_new)

Prediction function for the rm_reg_model

Description

This function predicts the outcome for an RM model object using new data, for a continuous target $y$.

Usage

## S4 method for signature 'rm_reg'
predict(object, newdata)

Arguments

object

A fitted RM model object of class rm_reg.

newdata

A data frame or matrix containing the new data to be predicted.

Value

Predicted values for the newdata object from the Random Machines model.

Examples

# Generating a sample for the simulation
library(randomMachines)
sim_data <- sim_reg1(n = 75)
sim_new <- sim_reg1(n = 25)
rm_mod_reg <- randomMachines(y~., train = sim_data)
y_hat <- predict(rm_mod_reg, newdata = sim_new)

Random Machines

Description

Random Machines is an ensemble model that combines different kernel functions to increase the diversity of the bagging approach, improving the predictions in general. Random Machines was developed for classification and regression problems by bagging support vector models built with multiple kernel functions.

Random Machines uses SVMs (Cortes and Vapnik, 1995) as base learners in the bagging procedure with a random sample of kernel functions to build them.

Let a training sample be given by $(\boldsymbol{x}_i, y_i)$, with $i = 1, \dots, n$ observations, where $\boldsymbol{x}_i$ is the vector of independent variables and $y_i$ the dependent one. The kernel bagging method is initialized by training the $r$-th single learner, for $r = 1, \dots, R$, where $R$ is the total number of different kernel functions that could be used in support vector models. In this implementation the default value is $R = 4$ (Gaussian, polynomial, Laplacian and linear). See more details below.

Each single learner is internally validated and the weights $\lambda_r$ are calculated proportionally to the strength of its predictive performance.

Afterwards, $B$ bootstrap samples are drawn from the training set. A support vector machine model $g_b$ is trained on each bootstrap sample, $b = 1, \dots, B$, and the kernel function used for $g_b$ is chosen at random with probability $\lambda_r$. The final weight $w_b$ in the bagging procedure is calculated from the out-of-bag samples.

The final model $G(\boldsymbol{x}_i)$ for a new $\boldsymbol{x}_i$ is given below.

The weights $\lambda_r$ and $w_b$ are calculated differently for each task (classification, probabilistic classification and regression). See more details in the references.

  • For a binary classification problem, $G(\boldsymbol{x}_i) = \mathrm{sgn}\left(\sum_{b=1}^{B} w_b g_b(\boldsymbol{x}_i)\right)$, where the $g_b$ are single binary classification outputs;

  • For a probabilistic binary classification problem, $G(\boldsymbol{x}_i) = \sum_{b=1}^{B} w_b g_b(\boldsymbol{x}_i)$, where the $g_b$ are single probabilistic classification outputs;

  • For a regression problem, $G(\boldsymbol{x}_i) = \sum_{b=1}^{B} w_b g_b(\boldsymbol{x}_i)$, where the $g_b$ are single regression outputs. A toy numerical illustration of this aggregation follows.
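
As a toy illustration of the weighted aggregation above (not the package's internal code), assuming the single learners output class labels coded as -1/+1:

w <- c(0.5, 0.3, 0.2)   # bootstrap weights w_b
g <- c(+1, -1, +1)      # single-learner outputs g_b(x_i)
sign(sum(w * g))        # ensemble prediction G(x_i): here +1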

Usage

randomMachines(
     formula,
     train,
     validation,
     B = 25,
     cost = 1,
     automatic_tuning = FALSE,
     gamma_rbf = 1,
     gamma_lap = 1,
     degree = 2,
     poly_scale = 1,
     offset = 0,
     gamma_cau = 1,
     d_t = 2,
     kernels = c("rbfdot", "polydot", "laplacedot", "vanilladot"),
     prob_model = TRUE,
     loss_function = RMSE,
     epsilon = 0.1,
     beta = 2
)

Arguments

formula

an object of class formula: it should contain a symbolic description of the model to be fitted, indicating the dependent variable and all predictors that should be included.

train

the training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ used to train the model.

validation

the validation data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{V}$ used to calculate the probabilities $\lambda_r$. If validation = NULL, a 0.25 partition of the training data is selected as the validation set, and the remaining partition is used as the new training sample.

B

number of bootstrap samples. The default value is B=25.

cost

the $C$ constant of the regularization term on soft margins in support vector models. The default value is cost=1.

automatic_tuning

boolean to define whether the kernel hyperparameters will be selected using sigest() from the kernlab package (as used by the ksvm function). The default value is FALSE.

gamma_rbf

the hyperparameter $\gamma_g$ used in the RBF kernel. The default value is gamma_rbf=1.

gamma_lap

the hyperparameter $\gamma_\ell$ used in the Laplacian kernel. The default value is gamma_lap=1.

degree

the degree used in the Polynomial kernel. The default value is degree=2.

poly_scale

the scale parameter from the Polynomial kernel. The default value is poly_scale=1.

offset

the offset parameter from the Polynomial kernel. The default value is offset=0.

gamma_cau

the hyperparameter $\gamma_c$ used in the Cauchy kernel. The default value is gamma_cau=1.

d_t

the $d_t$-norm of the t-Student kernel. The default value is d_t=2.

kernels

a vector with the names of the kernel functions to be used in the Random Machines model. The default includes the kernel functions c("rbfdot", "polydot", "laplacedot", "vanilladot"). The additional kernel functions "cauchydot" and "tdot" are exclusive to the binary classification setting.

prob_model

a boolean to define whether the algorithm uses a probabilistic approach to define the predictions (default = TRUE).

loss_function

Defines which loss function is used in the regression approach. The default is the RMSE function, but others can be added.

epsilon

The $\varepsilon$ parameter of the loss function used by the SVR implementation. The default value is epsilon=0.1.

beta

The correlation parameter $\beta$, which calibrates the penalisation of each kernel's performance in regression tasks. The default value is beta=2.

Details

Random Machines is an ensemble method which combines the bagging procedure proposed by Breiman (1996) with Support Vector Machine models as base learners, jointly with a random selection of kernel functions that adds diversity to the ensemble without harming its predictive performance. The kernel functions $k(x, y)$ are described by the functions below (a short usage sketch follows the list),

  • Linear Kernel: $k(x,y) = (x \cdot y)$

  • Polynomial Kernel: $k(x,y) = \left(\mathrm{scale}\,(x \cdot y) + \mathrm{offset}\right)^{\mathrm{degree}}$

  • Gaussian Kernel: $k(x,y) = e^{-\gamma_g \|x-y\|^2}$

  • Laplacian Kernel: $k(x,y) = e^{-\gamma_\ell \|x-y\|}$

  • Cauchy Kernel: $k(x,y) = \dfrac{1}{1 + \|x-y\|^2 / \gamma_c}$

  • Student's t Kernel: $k(x,y) = \dfrac{1}{1 + \|x-y\|^{d_t}}$
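
As a usage sketch of the kernel choices above (hyperparameter values are illustrative, not tuned), the call below restricts the ensemble to the Gaussian and Laplacian kernels and sets their hyperparameters explicitly:

library(randomMachines)
sim_data <- sim_class(n = 100)
# Only the Gaussian and Laplacian kernels enter the random kernel choice
rm_two_kernels <- randomMachines(y ~ ., train = sim_data, B = 50,
                                 kernels = c("rbfdot", "laplacedot"),
                                 gamma_rbf = 0.5, gamma_lap = 0.5)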

Value

randomMachines() returns an object of class "rm_class" for classification tasks, or of class "rm_reg" if the target variable is a continuous numerical response. See predict.rm_class or predict.rm_reg for details on how to obtain predictions from each model, respectively.

Author(s)

Mateus Maia: [email protected], Gabriel Felipe Ribeiro: [email protected], Anderson Ara: [email protected]

References

Ara, Anderson, et al. "Regression random machines: An ensemble support vector regression model with free kernel choice." Expert Systems with Applications 202 (2022): 117107.

Ara, Anderson, et al. "Random machines: A bagged-weighted support vector model with free kernel choice." Journal of Data Science 19.3 (2021): 409-428.

Breiman, L. (1996). Bagging predictors. Machine learning, 24, 123-140.

Cortes, C., and Vapnik, V. (1995). Support-vector networks. Machine learning, 20, 273-297.

Maia, Mateus, Arthur R. Azevedo, and Anderson Ara. "Predictive comparison between random machines and random forests." Journal of Data Science 19.4 (2021): 593-614.

Examples

library(randomMachines)

# Simulation from a binary output context
sim_data <- sim_class(n = 75)

## Generating a new sample to be predicted
sim_new <- sim_class(n = 75)

# Modelling Random Machines (probabilistic output)
rm_mod_prob <- randomMachines(y~., train = sim_data)

## Modelling Random Machines (binary class output)
rm_mod_label <- randomMachines(y ~ ., train = sim_data, prob_model = FALSE)

## Predicting for new data
y_hat <- predict(rm_mod_label, sim_new)

S4 class for RM classification

Description

S4 class for RM classification

Details

For more details see Ara, Anderson, et al. "Random machines: A bagged-weighted support vector model with free kernel choice." Journal of Data Science 19.3 (2021): 409-428.

Slots

train

a data.frame corresponding to the training data used in the model

class_name

a string with the name of the target variable used in the model

kernel_weight

a numeric vector corresponding to the weights for each bootstrap model contribution

lambda_values

a named list with the vector of $\boldsymbol{\lambda}$ sampling probabilities associated with each kernel function

model_params

a list with all used model specifications

bootstrap_models

a list with all ksvm objects for each bootstrap sample

bootstrap_samples

a list with all bootstrap samples used to train each base model of the ensemble

prob

a boolean indicating whether a probabilistic approach was used in the classification Random Machines


S4 class for RM regression

Description

S4 class for RM regression

Details

For more details see Ara, Anderson, et al. "Regression random machines: An ensemble support vector regression model with free kernel choice." Expert Systems with Applications 202 (2022): 117107.

Slots

y_train_hat

a numeric corresponding to the predictions $\hat{y}_i$ for the training set

lambda_values

a named list with the vector of $\boldsymbol{\lambda}$ sampling probabilities associated with each kernel function

model_params

a list with all used model specifications

bootstrap_models

a list with all ksvm objects for each bootstrap sample

bootstrap_samples

a list with all bootstrap samples used to train each base model of the ensemble

kernel_weight_norm

a numeric vector corresponding to the normalised weights for each bootstrap model contribution


Root Mean Squared Error (RMSE) Function

Description

Computes the Root Mean Squared Error (RMSE), a widely used metric for evaluating the accuracy of predictions in regression tasks. The formula is given by

$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$

Usage

RMSE(predicted, observed)

Arguments

predicted

A vector of predicted values $\hat{\mathbf{y}}$.

observed

A vector of observed values $\mathbf{y}$.

Value

The Root Mean Squared Error calculated by the formula in the description.
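
Examples

A minimal sketch with small illustrative vectors; the hand computation mirrors the formula in the description:

predicted <- c(2.5, 0.0, 2.1)
observed  <- c(3.0, -0.5, 2.0)
RMSE(predicted, observed)
sqrt(mean((predicted - observed)^2))   # same value by hand, approximately 0.412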


Generate a binary classification data set from normal distribution

Description

Simulation used as an example of a classification task based on the separation of two multivariate normal distributions with different mean vectors and different covariance matrices. For label $A$ the samples $\mathbf{X}_A$ are drawn from a normal distribution $MVN(\mu_A \mathbf{1}_p, \sigma_A^2 \mathbf{I}_p)$, while for label $B$ the samples $\mathbf{X}_B$ are drawn from a normal distribution $MVN(\mu_B \mathbf{1}_p, \sigma_B^2 \mathbf{I}_p)$. For more details see Ara et al. (2021) and Breiman (1998).

Usage

sim_class(
  n,
  p = 2,
  ratio = 0.5,
  mu_a = 0,
  sigma_a = 1,
  mu_b = 1,
  sigma_b = 1
)

Arguments

n

Sample size

p

Number of predictors

ratio

Ratio between class A and class B

mu_a

Mean $\mu_A$ of the predictors for label A.

sigma_a

Standard deviation $\sigma_A$ of the predictors for label A.

mu_b

Mean $\mu_B$ of the predictors for label B.

sigma_b

Standard deviation $\sigma_B$ of the predictors for label B.

Value

A simulated data.frame with p predictors (two by default) for a binary classification problem

Author(s)

Mateus Maia: [email protected], Anderson Ara: [email protected]

References

Ara, Anderson, et al. "Random machines: A bagged-weighted support vector model with free kernel choice." Journal of Data Science 19.3 (2021): 409-428.

Breiman, L. (1998). Arcing classifier (with discussion and a rejoinder by the author). The annals of statistics, 26(3), 801-849.

Examples

library(randomMachines)
sim_data <- sim_class(n = 100)
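
A further sketch using the non-default arguments documented above to draw unbalanced classes with more separated means (values are illustrative):

sim_data_hard <- sim_class(n = 200, p = 4, ratio = 0.3,
                           mu_a = 0, sigma_a = 1,
                           mu_b = 2, sigma_b = 1)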

Simulation for a regression toy example from Random Machines Regression 1

Description

Simulation toy example initially found in Scornet (2016), and used and described by Ara et al. (2022). Inputs are 2 independent variables uniformly distributed on the interval $[-1, 1]$. Outputs are generated following the equation

$Y = X_1^2 + e^{-X_2^2} + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$

Usage

sim_reg1(n, sigma)

Arguments

n

Sample size

sigma

Standard deviation of residual noise

Value

A simulated data.frame with two predictors and the target variable.

Author(s)

Mateus Maia: [email protected], Anderson Ara: [email protected]

References

Ara, Anderson, et al. "Regression random machines: An ensemble support vector regression model with free kernel choice." Expert Systems with Applications 202 (2022): 117107.

Scornet, E. (2016). Random forests and kernel methods. IEEE Transactions on Information Theory, 62(3), 1485-1500.

Examples

library(randomMachines)
sim_data <- sim_reg1(n=100)
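
For reference, a hand-rolled version of the same generating process, written directly from the equation above (illustrative only; prefer sim_reg1() in practice):

n <- 100
sigma <- 1   # assumed noise level for this sketch
x1 <- runif(n, -1, 1)
x2 <- runif(n, -1, 1)
y  <- x1^2 + exp(-x2^2) + rnorm(n, sd = sigma)
sim_by_hand <- data.frame(x1, x2, y)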

Simulation for a regression toy example from Random Machines Regression 2

Description

Simulation toy example initially found in Scornet (2016), and used and described by Ara et al. (2022). Inputs are 8 independent variables uniformly distributed on the interval $[-1, 1]$. Outputs are generated following the equation

$Y = X_1 X_2 + X_3^2 - X_4 X_7 + X_5 X_8 - X_6^2 + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$

Usage

sim_reg2(n, sigma)

Arguments

n

Sample size

sigma

Standard deviation of residual noise

Value

A simulated data.frame with eight predictors and the target variable.

Author(s)

Mateus Maia: [email protected], Anderson Ara: [email protected]

References

Ara, Anderson, et al. "Regression random machines: An ensemble support vector regression model with free kernel choice." Expert Systems with Applications 202 (2022): 117107.

Scornet, E. (2016). Random forests and kernel methods. IEEE Transactions on Information Theory, 62(3), 1485-1500.

Examples

library(randomMachines)
sim_data <- sim_reg2(n=100)

Simulation for a regression toy example from Random Machines Regression 3

Description

Simulation toy example initially found in Scornet (2016), and used and described by Ara et al. (2022). Inputs are 4 independent variables uniformly distributed on the interval $[-1, 1]$. Outputs are generated following the equation

$Y = -\sin(X_1) + X_2^2 + X_3 - e^{-X_4^2} + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, 0.5)$

Usage

sim_reg3(n, sigma)

Arguments

n

Sample size

sigma

Standard deviation of residual noise

Value

A simulated data.frame with four predictors and the target variable.

Author(s)

Mateus Maia: [email protected], Anderson Ara: [email protected]

References

Ara, Anderson, et al. "Regression random machines: An ensemble support vector regression model with free kernel choice." Expert Systems with Applications 202 (2022): 117107.

Scornet, E. (2016). Random forests and kernel methods. IEEE Transactions on Information Theory, 62(3), 1485-1500.

Examples

library(randomMachines)
sim_data <- sim_reg3(n=100)

Simulation for a regression toy example from Random Machines Regression 4

Description

Simulation toy example initially found in Van der Laan et al. (2007), and used and described by Ara et al. (2022). Inputs are 6 independent variables uniformly distributed on the interval $[-1, 1]$. Outputs are generated following the equation

$Y = X_1^2 + X_2^2 X_3 e^{-|X_4|} + X_6 - X_5 + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$

Usage

sim_reg4(n, sigma)

Arguments

n

Sample size

sigma

Standard deviation of residual noise

Value

A simulated data.frame with six predictors and the target variable.

Author(s)

Mateus Maia: [email protected], Anderson Ara: [email protected]

References

Ara, Anderson, et al. "Regression random machines: An ensemble support vector regression model with free kernel choice." Expert Systems with Applications 202 (2022): 117107.

Van der Laan, M. J., Polley, E. C., & Hubbard, A. E. (2007). Super learner. Statistical applications in genetics and molecular biology, 6(1).

Examples

library(randomMachines)
sim_data <- sim_reg4(n=100)

Simulation for a regression toy example from Random Machines Regression 5

Description

Simulation toy example initially found in Roy and Larocque (2012), and used and described by Ara et al. (2022). Inputs are 6 independent variables sampled from $N(0, 1)$. Outputs are generated following the equation

$Y = X_1 + 0.707 X_2^2 + 2 \cdot 1_{(X_3 > 0)} + 0.873 \log(X_1)|X_3| + 0.894 X_2 X_4 + 2 \cdot 1_{(X_5 > 0)} + 0.464 e^{X_6} + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$

Usage

sim_reg5(n, sigma)

Arguments

n

Sample size

sigma

Standard deviation of residual noise

Value

A simulated data.frame with six predictors and the target variable.

Author(s)

Mateus Maia: [email protected], Anderson Ara: [email protected]

References

Ara, Anderson, et al. "Regression random machines: An ensemble support vector regression model with free kernel choice." Expert Systems with Applications 202 (2022): 117107.

Roy, M. H., & Larocque, D. (2012). Robustness of random forests for regression. Journal of Nonparametric Statistics, 24(4), 993-1006.

Examples

library(randomMachines)
sim_data <- sim_reg5(n=100)

Wholesale Dataset

Description

The 'whosale' dataset contains information about wholesale customers' annual spending on various product categories.

Usage

data(whosale)

Format

A data frame with 440 rows and 8 columns.

Details

This dataset includes the following columns:

y

Type of customer: either 'Horeca' (Hotel/Restaurant/Café), coded as 1, or 'Retail', coded as 2.

Region

Geographic region of the customer, either 'Lisbon', 'Oporto', or 'Other'. Coded as {1,2,3}, respectively.

Fresh

Annual spending (in monetary units) on fresh products.

Milk

Annual spending on milk products.

Grocery

Annual spending on grocery products.

Frozen

Annual spending on frozen products.

Detergents Paper

Annual spending on detergents and paper products.

Delicassen

Annual spending on delicatessen products.

Source

The 'whosale' dataset is sourced from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/wholesale+customers

Examples

data(whosale)
head(whosale)