This document is a short introduction to the R package CoVVSURF, which combines clustering of variables and feature selection using random forests. The CoV/VSURF procedure is a statistical methodology for dimension reduction and variable selection in the context of supervised classification (it can also be applied to regression problems).
Redundancy is reduced using clustering of variables, based on the R package ClustOfVar. This clustering approach, denoted by CoV hereafter, can deal with both numerical and categorical variables. The clustering of variables groups together highly correlated variables and provides, for each group (cluster), a synthetic variable: a numerical variable summarizing the variables within the cluster. The main advantage of this approach is that it eliminates redundancy while keeping all the variables of a cluster together during the rest of the analysis. Moreover, it reduces the dimension of the data by replacing the p original variables with K synthetic variables (where K denotes the selected number of clusters). Note that this clustering of variables approach does not require the definition of a priori groups of variables, as in group lasso or sparse group lasso approaches.
In addition, the dimension reduction provides K synthetic variables, each of which uses only the variables within its own cluster, unlike the principal components in principal component analysis (PCA). Hence, in CoV, an original variable contributes to the construction of a single synthetic variable, which makes the interpretation easier.
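The idea of one synthetic variable per cluster can be illustrated with a minimal base R sketch. It is a stand-in under stated assumptions, not the ClustOfVar algorithm: the toy data, the dissimilarity 1 - r^2, the use of hclust and the use of prcomp are all choices made for this illustration, and only numerical variables are handled.

```r
# Toy data (hypothetical): two latent factors, each driving a group of
# correlated numerical variables.
set.seed(1)
n  <- 100
z1 <- rnorm(n)
z2 <- rnorm(n)
X <- cbind(
  a1 = z1 + rnorm(n, sd = 0.3),
  a2 = z1 + rnorm(n, sd = 0.3),
  a3 = z1 + rnorm(n, sd = 0.3),
  b1 = z2 + rnorm(n, sd = 0.3),
  b2 = z2 + rnorm(n, sd = 0.3)
)

# Stand-in for clustering of variables: hierarchical clustering on a
# dissimilarity based on squared correlations (not ClustOfVar's criterion).
d      <- as.dist(1 - cor(X)^2)
tree   <- hclust(d, method = "average")
groups <- cutree(tree, k = 2)   # partition of the variables into K = 2 clusters

# One synthetic variable per cluster: the first principal component computed
# from the variables of that cluster only.
synth <- sapply(sort(unique(groups)), function(k) {
  prcomp(X[, groups == k, drop = FALSE], scale. = TRUE)$x[, 1]
})

groups       # cluster membership of each original variable
dim(synth)   # n rows, K columns
```

Each column of synth depends only on the variables of one cluster, whereas a principal component computed on the whole of X would mix all five variables.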
The most important synthetic variables given by CoV are selected using a procedure based on random forests (RF), implemented in the R package VSURF. This variable selection procedure, denoted VSURF hereafter, is applied to the reduced dataset consisting of the n observations described by the K synthetic variables. A list of selected synthetic variables (i.e. a list of clusters of variables) is thus obtained, and new observations can be predicted with a predictor built on these selected synthetic variables.
Details are given in https://arxiv.org/abs/1608.06740.
devtools::install_github("chavent/PCAmixdata")
devtools::install_github("robingenuer/CoVVSURF")
library(CoVVSURF)
help(package=CoVVSURF)
The data are simulated from an underlying classification model where 120 explanatory variables (80 quantitative and 40 qualitative, organized in underlying informative and non-informative groups) explain a binary response variable. The function simu_classif generates n observations of the p=120 explanatory variables in a matrix X and of the binary response in a vector y.
train <- simu_classif(n=60,seed=10) # simulated training data set
train$X # explanatory data matrix
train$y #binary variable to predict
The function CoV builds a dendrogram of variables using the R package ClustOfVar.
treecov <- CoV(train$X) # dendrogram of the 120 variables with ClustOfVar
plot(treecov)
The functions covsurf and predict.covsurf implement the CoV/VSURF procedure. The input of the procedure is a dataset (X,y) and the goal is to select groups of informative variables in X to predict y.
In the first step, the function covsurf selects groups of informative variables. Parallel computing with several cores can be used to speed up the execution.
kval <- c(2:15, seq(from = 20, to = ncol(train$X), by = 10))
res <- covsurf(train$X, train$y, kval)
# or, to speed up the execution:
# if the tree has already been built
res <- covsurf(train$X, train$y, tree=treecov,kval)
# and if you can use several cores
res <- covsurf(train$X, train$y, tree=treecov,kval,ncores=3)
names(res)
plot(res)
res$kopt # the partition of the dendrogram into 10 clusters is selected
res$vsel # 5 synthetic variables are selected by VSURF
res$csel # the corresponding 5 selected groups of variables
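As a small aside on the grid kval used above, the following sketch shows which candidate numbers of clusters it contains. It assumes p = 120 columns, as in the simulated data, and hard-codes that value instead of calling ncol(train$X) so that it runs without the package.

```r
# Number of explanatory variables (assumption: p = 120, as in the simulated data)
p <- 120

# Same construction as in the document: every K from 2 to 15,
# then a coarser grid in steps of 10 up to p.
kval <- c(2:15, seq(from = 20, to = p, by = 10))

kval           # candidate numbers of clusters
length(kval)   # number of partitions of the dendrogram that are evaluated
```

The grid is fine for small K and coarse for large K, so fewer partitions have to be evaluated than with every value from 2 to p.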
In the second step, the function predict.covsurf predicts the response for a set of new observations x.
test <- simu_classif(n=100,seed=20) # simulated test data set
pred <- predict(res,test$X)
sum(pred==test$y)/length(test$y) # True classification rate
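The classification rate computed above is the proportion of test observations whose predicted class equals the observed class. The base R sketch below shows the same computation on hypothetical toy vectors (obs_toy and pred_toy are made up for illustration and are not outputs of the package), together with the confusion matrix it summarizes.

```r
# Hypothetical observed and predicted classes for six observations
obs_toy  <- factor(c("0", "0", "1", "1", "1", "0"), levels = c("0", "1"))
pred_toy <- factor(c("0", "1", "1", "1", "0", "0"), levels = c("0", "1"))

# Confusion matrix: rows are observed classes, columns are predicted classes
conf_mat <- table(observed = obs_toy, predicted = pred_toy)

# Proportion of correct predictions, computed in two equivalent ways
accuracy      <- mean(pred_toy == obs_toy)
accuracy_diag <- sum(diag(conf_mat)) / sum(conf_mat)

conf_mat
accuracy
```

The confusion matrix additionally shows which class is misclassified more often, something a single rate hides.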
library(CoVVSURF)
help(package=CoVVSURF)
The data are simulated from an underlying regression model where 120 explanatory variables (80 quantitative and 40 qualitative, organized in underlying informative and non-informative groups) explain a numeric response variable. The function simu_reg generates n observations of the p=120 explanatory variables in a matrix X and of the numeric response in a vector y.
train <- simu_reg(n=60,seed=10) # simulated training data set
train$X # explanatory data matrix
train$y #numeric variable to predict
The function CoV builds a dendrogram of variables using the R package ClustOfVar.
treecov <- CoV(train$X) # dendrogram of the 120 variables with ClustOfVar
plot(treecov)
The functions covsurf and predict.covsurf implement the CoV/VSURF procedure. The procedure is the same as in the classification setting presented previously, except that the OOB mean squared error (sum of squared residuals divided by n) replaces the OOB error rate.
In the first step, the function covsurf selects groups of informative variables. Parallel computing with several cores can be used to speed up the execution.
kval <- c(2:15, seq(from = 20, to = ncol(train$X), by = 10))
res <- covsurf(train$X, train$y, kval)
#or if the tree has already been built
res <- covsurf(train$X, train$y, tree=treecov,kval) #be patient...
# or if several cores are available
res <- covsurf(train$X, train$y, tree=treecov,kval,ncores=3)
names(res)
plot(res)
res$kopt # the partition of the dendrogram into 9 clusters is selected
res$vsel # 6 synthetic variables are selected by VSURF
res$csel # the corresponding 6 selected groups of variables
In the second step, the function predict.covsurf predicts the response for a set of new observations x.
test <- simu_reg(n=100,seed=20) # simulated test data set
pred <- predict(res,test$X)
sum((pred-test$y)^2)/length(test$y) # mean squared error
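The mean squared error averages the squared residuals, so each residual must be squared before summing. The toy sketch below (obs_toy and pred_toy are hypothetical values, not package outputs) shows why the order of operations matters: squaring the sum of the residuals instead lets positive and negative errors cancel.

```r
# Hypothetical observed and predicted numeric responses
obs_toy  <- c(1.0, 2.0, 3.0, 4.0)
pred_toy <- c(1.5, 1.5, 3.5, 3.5)

resid_toy <- pred_toy - obs_toy   # 0.5, -0.5, 0.5, -0.5

# Mean squared error: square each residual, then average
mse <- mean(resid_toy^2)

# Not the MSE: the residuals are summed first, so the errors cancel out
squared_sum <- sum(resid_toy)^2 / length(obs_toy)

mse           # 0.25
squared_sum   # 0
```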