Title: | An Ensemble Modeling using Random Machines |
---|---|
Description: | A novel ensemble method employing Support Vector Machines (SVMs) as base learners. This ensemble model is designed for both classification (Ara A., et al., 2021) <doi:10.6339/21-JDS1014> and regression (Ara A., et al., 2022) <doi:10.1016/j.eswa.2022.117107> problems, offering versatility and robust performance across different datasets, and has been compared with other established methods such as Random Forests (Maia M., et al., 2021) <doi:10.6339/21-JDS1025>. |
Authors: | Mateus Maia [aut, cre] |
Maintainer: | Mateus Maia <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0 |
Built: | 2025-02-13 05:43:19 UTC |
Source: | https://github.com/mateusmaiads/randommachines |
The 'bolsafam' dataset contains information about the utilization rate of the Bolsa Família program in Brazilian municipalities. The utilization rate is defined as the number of people benefiting from the assistance divided by the total population of the city.
data(bolsafam)
A data frame with 5564 rows and 11 columns.
This dataset includes the following columns:
Rate of use of the social assistance program by municipality.
Code to identify the Brazilian state to which the city belongs.
Percentage of the population living in households with a density greater than 2 people per bedroom.
Percentage of employed persons aged 18 or over who are employed without a formal contract.
Proportion of people vulnerable to poverty.
Percentage of people aged 15 to 24 who do not study or work and are vulnerable to poverty.
Percentage of the population aged 15 to 17 with complete primary education.
Dependency ratio.
Percentage of the population aged 6 to 17 years attending basic education that does not have an age-grade delay.
Percentage of the population living in households with running water.
Aggregation of states according to the regions defined by IBGE.
The 'bolsafam' dataset is sourced from the Brazilian organizational site called Transparency Portal.
Mateus Maia & Anderson Ara (2023). rmachines: Random Machines: a package for a support vector ensemble based on random kernel space. R package version 0.1.0.
data(bolsafam)
head(bolsafam)
Calculate the Brier Score for a set of predicted probabilities and observed outcomes. The Brier Score is a measure of the accuracy of probabilistic predictions. It is commonly used in the evaluation of predictive models.
brier_score(prob, observed, levels)
prob | predicted probabilities |
observed | observed outcomes |
levels | A string vector with the original levels from the target variable |
Returns the Brier Score, a numeric value indicating the accuracy of the predictions.
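As a rough illustration of what the score measures, the sketch below computes the mean squared difference between predicted probabilities and 0/1 outcomes in base R. The helper name brier_sketch is hypothetical and is not the package's brier_score(); it also assumes that prob refers to the second entry of levels, which the documentation above does not state.

```r
# Hypothetical helper for illustration only, not the package's brier_score().
# prob:     predicted probabilities of the positive class
# observed: observed labels (factor or character)
# levels:   character vector with the two class levels; ASSUMPTION: the
#           second entry is the class that `prob` refers to
brier_sketch <- function(prob, observed, levels) {
  stopifnot(length(prob) == length(observed), length(levels) == 2)
  # Encode the observed labels as 0/1, with 1 for the assumed positive class
  y <- as.numeric(as.character(observed) == levels[2])
  # Mean squared difference between probabilities and outcomes
  mean((prob - y)^2)
}

prob <- c(0.9, 0.2, 0.7, 0.4)
observed <- factor(c("g", "b", "b", "g"), levels = c("b", "g"))
bs <- brier_sketch(prob, observed, levels = c("b", "g"))
```

Lower values indicate better calibrated predictions; a score of 0 means every probability matched its outcome exactly.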
The 'ionosphere' dataset contains radar data for the classification of radar returns as either 'good' or 'bad'.
data(ionosphere)
A data frame with 351 rows and 35 columns.
This dataset includes the following columns:
Features extracted from radar signals.
Class label indicating whether the radar return is 'g' (good) or 'b' (bad).
The 'ionosphere' dataset is sourced from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/ionosphere
data(ionosphere)
head(ionosphere)
This function predicts the outcome for an RM model object using new data.
## S4 method for signature 'rm_class'
predict(object, newdata)
object | A fitted RM model object of class rm_class |
newdata | A data frame or matrix containing the new data to be predicted. |
A vector of predicted outcomes: probabilities in case of 'prob_model = TRUE' and classes in case of 'prob_model = FALSE'.
# Generating a sample for the simulation
library(randomMachines)
sim_data <- sim_class(n = 75)
sim_new <- sim_class(n = 25)
rm_mod <- randomMachines(y~., train = sim_data)
y_hat <- predict(rm_mod, newdata = sim_new)
This function predicts the outcome for an RM model object using new data for a continuous response.
## S4 method for signature 'rm_reg'
predict(object, newdata)
object | A fitted RM model object of class rm_reg |
newdata | A data frame or matrix containing the new data to be predicted. |
Predicted values for the newdata object from the Random Machines model.
# Generating a sample for the simulation
library(randomMachines)
sim_data <- sim_reg1(n = 75)
sim_new <- sim_reg1(n = 25)
rm_mod_reg <- randomMachines(y~., train = sim_data)
y_hat <- predict(rm_mod_reg, newdata = sim_new)
Random Machines is an ensemble model which uses the combination of different kernel functions to improve the diversity in the bagging approach, improving the predictions in general. Random Machines was developed for classification and regression problems by bagging multiple kernel functions in support vector models.
Random Machines uses SVMs (Cortes and Vapnik, 1995) as base learners in the bagging procedure with a random sample of kernel functions to build them.
Let the training sample be given by $(\boldsymbol{x}_i, y_i)$ with $i = 1, \dots, n$ observations, where $\boldsymbol{x}_i$ is the vector of independent variables and $y_i$ the dependent one. The kernel bagging method starts by training the $r$-th single learner, where $r = 1, \dots, R$ and $R$ is the total number of different kernel functions that could be used in support vector models. In this implementation the default value is $R = 4$ (gaussian, polynomial, laplacian and linear). See more details below.
Each single learner is internally validated and the weights $\lambda_r$ are calculated proportionally to the strength of the single predictive performance.
Afterwards, $B$ bootstrap samples are drawn from the training set. A support vector machine model $g_b$ is trained for each bootstrap sample, $b = 1, \dots, B$, and the kernel function used for $g_b$ is determined by a random choice with probability $\lambda_r$. The final weight $w_b$ in the bagging procedure is calculated from the out-of-bag samples.
The final model $G(\boldsymbol{x}_i)$ for a new $\boldsymbol{x}_i$ depends on the task, as listed below. The weights $\lambda_r$ and $w_b$ are calculated differently for each task (classification, probabilistic classification and regression). See more details in the references.
For a binary classification problem, $G(\boldsymbol{x}_i) = \text{sgn}\left(\sum_{b=1}^{B} w_b g_b(\boldsymbol{x}_i)\right)$, where $g_b$ are single binary classification outputs;
For a probabilistic binary classification problem, $G(\boldsymbol{x}_i) = \sum_{b=1}^{B} w_b g_b(\boldsymbol{x}_i)$, where $g_b$ are single probabilistic classification outputs;
For a regression problem, $G(\boldsymbol{x}_i) = \sum_{b=1}^{B} w_b g_b(\boldsymbol{x}_i)$, where $g_b$ are single regression outputs.
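The bagging steps described above can be sketched in base R without fitting any SVM. Everything numeric here is a placeholder: the sampling probabilities lambda, the base-learner outputs g and p, and the weights w are hypothetical values, and normalising w to sum to one is an assumption made for the illustration rather than a statement about the package internals.

```r
set.seed(42)

kernels <- c("rbfdot", "polydot", "laplacedot", "vanilladot")
lambda  <- c(0.4, 0.3, 0.2, 0.1)  # hypothetical kernel sampling probabilities
B <- 25                           # number of bootstrap samples
n <- 75                           # training sample size

# Step 1: draw one kernel per bootstrap model with probability lambda
kernel_b <- sample(kernels, size = B, replace = TRUE, prob = lambda)

# Step 2: bootstrap indices and the matching out-of-bag (OOB) indices
boot_idx <- lapply(seq_len(B), function(b) sample.int(n, n, replace = TRUE))
oob_idx  <- lapply(boot_idx, function(idx) setdiff(seq_len(n), idx))

# Step 3: combine the base-learner outputs for one new observation.
# g: hypothetical class outputs in {-1, +1}; p: hypothetical probabilities
g <- sample(c(-1, 1), size = B, replace = TRUE)
p <- runif(B)
w <- runif(B)
w <- w / sum(w)  # ASSUMPTION: weights normalised to sum to 1

G_class <- sign(sum(w * g))  # binary classification: sign of the weighted vote
G_prob  <- sum(w * p)        # probabilistic output: weighted average
```

In a real fit each g or p would come from a support vector model trained on boot_idx[[b]] with kernel kernel_b[b], and w would be derived from its performance on oob_idx[[b]].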
randomMachines(
  formula,
  train,
  validation,
  B = 25,
  cost = 1,
  automatic_tuning = FALSE,
  gamma_rbf = 1,
  gamma_lap = 1,
  degree = 2,
  poly_scale = 1,
  offset = 0,
  gamma_cau = 1,
  d_t = 2,
  kernels = c("rbfdot", "polydot", "laplacedot", "vanilladot"),
  prob_model = TRUE,
  loss_function = RMSE,
  epsilon = 0.1,
  beta = 2
)
formula | an object of class formula |
train | the training data |
validation | the validation data |
B | number of bootstrap samples. The default value is 25 |
cost | the cost (C) constant of the regularization term of the support vector models. The default value is 1 |
automatic_tuning | boolean defining whether the kernel hyperparameters are selected automatically. The default value is FALSE |
gamma_rbf | the hyperparameter $\gamma$ of the Gaussian (RBF) kernel. The default value is 1 |
gamma_lap | the hyperparameter $\gamma$ of the Laplacian kernel. The default value is 1 |
degree | the degree used in the Polynomial kernel. The default value is 2 |
poly_scale | the scale parameter of the Polynomial kernel. The default value is 1 |
offset | the offset parameter of the Polynomial kernel. The default value is 0 |
gamma_cau | the hyperparameter $\gamma$ of the Cauchy kernel. The default value is 1 |
d_t | the hyperparameter of the Student's t kernel. The default value is 2 |
kernels | a vector with the names of the kernel functions used in the Random Machines model. The default includes the kernel functions c("rbfdot", "polydot", "laplacedot", "vanilladot") |
prob_model | a boolean defining whether the algorithm uses a probabilistic approach to define the predictions (default = TRUE) |
loss_function | defines which loss function is used in the regression approach. The default is the RMSE function |
epsilon | the epsilon in the loss function used by the SVR implementation. The default value is 0.1 |
beta | the correlation parameter $\beta$. The default value is 2 |
Random Machines is an ensemble method that combines the bagging procedure proposed by Breiman (1996), using Support Vector Machine models as base learners, with a random selection of kernel functions that adds diversity to the ensemble without harming its predictive performance. The kernel functions are described below,
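Linear Kernel: $K(x, y) = x \cdot y$
Polynomial Kernel: $K(x, y) = (s \, (x \cdot y) + c)^{d}$, where $s$ is poly_scale, $c$ is offset and $d$ is degree
Gaussian Kernel: $K(x, y) = \exp(-\gamma \lVert x - y \rVert^{2})$, where $\gamma$ is gamma_rbf
Laplacian Kernel: $K(x, y) = \exp(-\gamma \lVert x - y \rVert)$, where $\gamma$ is gamma_lap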
Cauchy Kernel:
Student's t Kernel:
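For the four default kernels, a base-R sketch may help make the hyperparameters concrete. It follows the definitions documented for kernlab's vanilladot, polydot, rbfdot and laplacedot, which the default kernel names refer to. The function names are hypothetical, the argument names simply mirror those of randomMachines(), and the Cauchy and Student's t kernels are left out because their exact parameterisation is not given here.

```r
# Hypothetical helpers for illustration; x and y are numeric vectors of
# equal length. Argument names mirror the randomMachines() arguments.
linear_kernel <- function(x, y) {
  sum(x * y)  # inner product
}

poly_kernel <- function(x, y, degree = 2, poly_scale = 1, offset = 0) {
  (poly_scale * sum(x * y) + offset)^degree
}

gaussian_kernel <- function(x, y, gamma_rbf = 1) {
  exp(-gamma_rbf * sum((x - y)^2))  # squared Euclidean distance
}

laplacian_kernel <- function(x, y, gamma_lap = 1) {
  exp(-gamma_lap * sqrt(sum((x - y)^2)))  # Euclidean distance
}

x <- c(1, 2)
y <- c(3, 4)
k_lin  <- linear_kernel(x, y)
k_poly <- poly_kernel(x, y)
k_rbf  <- gaussian_kernel(x, y)
k_lap  <- laplacian_kernel(x, y)
```

Each function returns 1 (Gaussian, Laplacian) or the squared norm (linear) when x equals y, which is a quick sanity check for any custom kernel.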
randomMachines() returns an object of class "rm_class" for classification tasks or "rm_reg" if the target variable is a continuous numerical response. See predict.rm_class or predict.rm_reg for more details on how to obtain predictions from each model.
Mateus Maia: [email protected], Gabriel Felipe Ribeiro: [email protected], Anderson Ara: [email protected]
Ara, Anderson, et al. "Regression random machines: An ensemble support vector regression model with free kernel choice." Expert Systems with Applications 202 (2022): 117107.
Ara, Anderson, et al. "Random machines: A bagged-weighted support vector model with free kernel choice." Journal of Data Science 19.3 (2021): 409-428.
Breiman, L. (1996). Bagging predictors. Machine learning, 24, 123-140.
Cortes, C., and Vapnik, V. (1995). Support-vector networks. Machine learning, 20, 273-297.
Maia, Mateus, Arthur R. Azevedo, and Anderson Ara. "Predictive comparison between random machines and random forests." Journal of Data Science 19.4 (2021): 593-614.
library(randomMachines)
# Simulation from a binary output context
sim_data <- sim_class(n = 75)
## Setting the training and validation set
sim_new <- sim_class(n = 75)
# Modelling Random Machines (probabilistic output)
rm_mod_prob <- randomMachines(y~., train = sim_data)
## Modelling Random Machines (binary class output)
rm_mod_label <- randomMachines(y~., train = sim_data, prob_model = FALSE)
## Predicting for new data
y_hat <- predict(rm_mod_label, sim_new)
S4 class for RM classification
For more details see Ara, Anderson, et al. "Random machines: A bagged-weighted support vector model with free kernel choice." Journal of Data Science 19.3 (2021): 409-428.
train
a data.frame corresponding to the training data used in the model
class_name
a string with target variable used in the model
kernel_weight
a numeric vector corresponding to the weights for each bootstrap model contribution
lambda_values
a named list with the vector of sampling probabilities associated with each kernel function
model_params
a list with all used model specifications
bootstrap_models
a list with all ksvm objects for each bootstrap sample
bootstrap_samples
a list with all bootstrap samples used to train each base model of the ensemble
prob
a boolean indicating whether a probabilistic approach was used in the classification Random Machines
S4 class for RM regression
For more details see Ara, Anderson, et al. "Regression random machines: An ensemble support vector regression model with free kernel choice." Expert Systems with Applications 202 (2022): 117107.
y_train_hat
a numeric corresponding to the predictions for the training set
lambda_values
a named list with the vector of sampling probabilities associated with each kernel function
model_params
a list with all used model specifications
bootstrap_models
a list with all ksvm objects for each bootstrap sample
bootstrap_samples
a list with all bootstrap samples used to train each base model of the ensemble
kernel_weight_norm
a numeric vector corresponding to the normalised weights for each bootstrap model contribution
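Computes the Root Mean Squared Error (RMSE), a widely used metric for evaluating the accuracy of predictions in regression tasks. The formula is given by $RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, where $y_i$ are the observed values and $\hat{y}_i$ the predicted values.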
RMSE(predicted, observed)
predicted | A vector of predicted values |
observed | A vector of observed values |
The Root Mean Squared Error calculated by the formula in the description.
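The calculation is short enough to write out in base R. The helper below, named rmse_sketch to make clear it is a hypothetical stand-in and not the package's RMSE(), takes the same two arguments and returns the square root of the mean squared difference.

```r
# Hypothetical stand-in for illustration, not the package's RMSE().
rmse_sketch <- function(predicted, observed) {
  stopifnot(length(predicted) == length(observed))
  # Square root of the mean squared difference
  sqrt(mean((observed - predicted)^2))
}

predicted <- c(2, 4, 6)
observed  <- c(1, 4, 8)
err <- rmse_sketch(predicted, observed)  # errors are 1, 0 and 2
```

Because the errors are squared before averaging, a single large miss raises the score more than several small ones.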
Simulation used as an example of a classification task based on the separation of two multivariate normal distributions with different mean vectors and different covariance matrices. For the label $A$ the $\mathbf{X}_A$ are sampled from a normal distribution $N(\mu_a \mathbf{1}_p, \sigma_a^2 \mathbf{I}_p)$, while for label $B$ the samples $\mathbf{X}_B$ are from a normal distribution $N(\mu_b \mathbf{1}_p, \sigma_b^2 \mathbf{I}_p)$. For more details see Ara et al. (2021), and Breiman L. (1998).
sim_class(
  n,
  p = 2,
  ratio = 0.5,
  mu_a = 0,
  sigma_a = 1,
  mu_b = 1,
  sigma_b = 1
)
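n | Sample size |
p | Number of predictors |
ratio | Ratio between class A and class B |
mu_a | Mean of the predictors for class A |
sigma_a | Standard deviation of the predictors for class A |
mu_b | Mean of the predictors for class B |
sigma_b | Standard deviation of the predictors for class B |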
A simulated data.frame with two predictors for a binary classification problem
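A data set of this shape can be mimicked in base R as follows. This is a sketch under stated assumptions, not the source of sim_class(): the name sim_class_sketch is hypothetical, ratio is interpreted as the share of observations in class A, and the column names X1, X2, ... and y are illustrative.

```r
# Hypothetical sketch, not the package's sim_class().
sim_class_sketch <- function(n, p = 2, ratio = 0.5,
                             mu_a = 0, sigma_a = 1,
                             mu_b = 1, sigma_b = 1) {
  n_a <- round(n * ratio)  # ASSUMPTION: ratio is the share of class A
  n_b <- n - n_a
  # Independent normal predictors with class-specific mean and sd
  x_a <- matrix(rnorm(n_a * p, mean = mu_a, sd = sigma_a), ncol = p)
  x_b <- matrix(rnorm(n_b * p, mean = mu_b, sd = sigma_b), ncol = p)
  out <- data.frame(rbind(x_a, x_b))  # columns are named X1, ..., Xp
  out$y <- factor(rep(c("A", "B"), times = c(n_a, n_b)))
  out
}

set.seed(1)
sim <- sim_class_sketch(n = 100)
class_counts <- table(sim$y)
```

Moving mu_b closer to mu_a, or increasing the standard deviations, makes the two classes overlap more and the classification task harder.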
Mateus Maia: [email protected], Anderson Ara: [email protected]
Ara, Anderson, et al. "Random machines: A bagged-weighted support vector model with free kernel choice." Journal of Data Science 19.3 (2021): 409-428.
Breiman, L. (1998). Arcing classifier (with discussion and a rejoinder by the author). The annals of statistics, 26(3), 801-849.
library(randomMachines)
sim_data <- sim_class(n = 100)
Simulation toy example initially found in Scornet (2016), and used and described by Ara et al. (2022).
Inputs are 2 independent variables uniformly distributed on the interval . Outputs are generated following the equation
sim_reg1(n, sigma)
n | Sample size |
sigma | Standard deviation of residual noise |
A simulated data.frame with two predictors and the target variable.
Mateus Maia: [email protected], Anderson Ara: [email protected]
Ara, Anderson, et al. "Regression random machines: An ensemble support vector regression model with free kernel choice." Expert Systems with Applications 202 (2022): 117107.
Scornet, E. (2016). Random forests and kernel methods. IEEE Transactions on Information Theory, 62(3), 1485-1500.
library(randomMachines)
sim_data <- sim_reg1(n = 100)
Simulation toy example initially found in Scornet (2016), and used and described by Ara et al. (2022).
Inputs are 8 independent variables uniformly distributed on the interval . Outputs are generated following the equation
sim_reg2(n, sigma)
n | Sample size |
sigma | Standard deviation of residual noise |
A simulated data.frame with the predictors and the target variable.
Mateus Maia: [email protected], Anderson Ara: [email protected]
Ara, Anderson, et al. "Regression random machines: An ensemble support vector regression model with free kernel choice." Expert Systems with Applications 202 (2022): 117107.
Scornet, E. (2016). Random forests and kernel methods. IEEE Transactions on Information Theory, 62(3), 1485-1500.
library(randomMachines)
sim_data <- sim_reg2(n = 100)
Simulation toy example initially found in Scornet (2016), and used and described by Ara et al. (2022).
Inputs are 4 independent variables uniformly distributed on the interval . Outputs are generated following the equation
sim_reg3(n, sigma)
n | Sample size |
sigma | Standard deviation of residual noise |
A simulated data.frame with the predictors and the target variable.
Mateus Maia: [email protected], Anderson Ara: [email protected]
Ara, Anderson, et al. "Regression random machines: An ensemble support vector regression model with free kernel choice." Expert Systems with Applications 202 (2022): 117107.
Scornet, E. (2016). Random forests and kernel methods. IEEE Transactions on Information Theory, 62(3), 1485-1500.
library(randomMachines)
sim_data <- sim_reg3(n = 100)
Simulation toy example initially found in Van der Laan et al. (2007), and used and described by Ara et al. (2022).
Inputs are 6 independent variables uniformly distributed on the interval . Outputs are generated following the equation
sim_reg4(n, sigma)
n | Sample size |
sigma | Standard deviation of residual noise |
A simulated data.frame with the predictors and the target variable.
Mateus Maia: [email protected], Anderson Ara: [email protected]
Ara, Anderson, et al. "Regression random machines: An ensemble support vector regression model with free kernel choice." Expert Systems with Applications 202 (2022): 117107.
Van der Laan, M. J., Polley, E. C., & Hubbard, A. E. (2007). Super learner. Statistical applications in genetics and molecular biology, 6(1).
library(randomMachines)
sim_data <- sim_reg4(n = 100)
Simulation toy example initially found in Roy and Larocque (2012), and used and described by Ara et al. (2022).
Inputs are 6 independent variables sampled from . Outputs are generated following the equation
sim_reg5(n, sigma)
n | Sample size |
sigma | Standard deviation of residual noise |
A simulated data.frame with the predictors and the target variable.
Mateus Maia: [email protected], Anderson Ara: [email protected]
Ara, Anderson, et al. "Regression random machines: An ensemble support vector regression model with free kernel choice." Expert Systems with Applications 202 (2022): 117107.
Roy, M. H., & Larocque, D. (2012). Robustness of random forests for regression. Journal of Nonparametric Statistics, 24(4), 993-1006.
library(randomMachines)
sim_data <- sim_reg5(n = 100)
The 'whosale' dataset contains information about wholesale customers' annual spending on various product categories.
data(whosale)
A data frame with 440 rows and 8 columns.
This dataset includes the following columns:
Type of customer, either 'Horeca' (Hotel/Restaurant/Cafe), coded as 1, or 'Retail', coded as 2.
Geographic region of the customer, either 'Lisbon', 'Oporto', or 'Other', coded as {1,2,3}, respectively.
Annual spending (in monetary units) on fresh products.
Annual spending on milk products.
Annual spending on grocery products.
Annual spending on frozen products.
Annual spending on detergents and paper products.
Annual spending on delicatessen products.
The 'whosale' dataset is sourced from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/wholesale+customers
data(whosale)
head(whosale)