Variable Importance for Regression and Classification Models
Variable importance expresses how much a given variable contributes to a particular model relative to the other predictors. In general, however, it is not a well-defined concept: there is no theoretically defined variable importance metric. Nevertheless, some approaches have become established in practice for certain regression and classification algorithms.
The present function provides an interface for calculating variable importance for some of the models produced by FitMod, comprising linear models, classification trees, random forests, C5 trees and neural networks. The intention here is to provide reasonably homogeneous output and plot routines.
Usage
VarImp(x, scale = FALSE, sort = TRUE, ...)
# S3 method for class 'FitMod'
VarImp(x, scale = FALSE, sort = TRUE, type=NULL, ...)
# Default S3 method
VarImp(x, scale = FALSE, sort = TRUE, ...)
# S3 method for class 'VarImp'
plot(x, sort = TRUE, maxrows = NULL,
main = "Variable importance", ...)
# S3 method for class 'VarImp'
print(x, digits = 3, ...)
Arguments
- x
the fitted model
- scale
logical, should the importance values be scaled to the range from 0 to 100?
- ...
parameters to pass to the specific VarImp methods
- sort
the name of the column by which the importance table should be ordered
- maxrows
the maximum number of rows to be reported
- main
the main title for the plot
- type
some models have more than one type available to produce a variable importance. Linear models accept one of "lmg", "pmvd", "first", "last", "betasq", "pratt".
- digits
the number of digits for printing the "VarImp" table
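The exact rescaling behind scale = TRUE is not spelled out here. As a minimal sketch, assuming a plain min-max rescaling (one common choice, not necessarily what the package does), the idea can be illustrated in base R. The helper scale_0_100 and the values in raw are hypothetical and exist only for this example.

```r
# Hypothetical helper, not part of the package: min-max rescaling so that
# the smallest importance becomes 0 and the largest becomes 100.
# Note: a vector of identical values would yield NaN (division by zero).
scale_0_100 <- function(imp) {
  rng <- range(imp, na.rm = TRUE)
  (imp - rng[1]) / (rng[2] - rng[1]) * 100
}

# Made-up raw importance values for three predictors
raw <- c(x1 = 0.8, x2 = 0.2, x3 = 0.5)
scaled <- scale_0_100(raw)
print(round(scaled, 1))
```

Rescaling only changes the units of the importance values; the ranking of the predictors stays the same.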
Value
A data frame with class c("VarImp.train", "data.frame") for VarImp.train or a matrix for other models.
Details
Linear Models:
For linear models, the relaimpo package, available on CRAN, offers several approaches for quantifying variable importance. See the relaimpo documentation for details.
rpart, Random Forest:
VarImp.rpart and VarImp.randomForest are wrappers around the importance functions from the rpart or randomForest packages, respectively.
C5.0:
C5.0 measures predictor importance by determining the
percentage of training set samples that fall into all the terminal
nodes after the split. For example, the predictor in the first split
automatically has an importance measurement of 100 percent since all
samples are affected by this split. Other predictors may be used
frequently in splits, but if the terminal nodes cover only a handful
of training set samples, the importance scores may be close to
zero. The same strategy is applied to rule-based models and boosted
versions of the model. The underlying function can also report how
often each predictor was involved in a split by using the
option metric="splits".
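As a rough illustration of the usage measure described above, the sketch below computes it for a hypothetical toy tree in base R. The tree layout, the sample counts and the names splits and n_train are made up for the example, and each predictor is assumed to appear in exactly one split. It does not call the C50 package.

```r
# Hypothetical toy tree fitted on 200 training samples.
# Each row is one split; n_node is the number of training samples that
# reach the node where the split is made (and hence are affected by it).
n_train <- 200
splits <- data.frame(
  predictor = c("x1", "x2", "x3"),
  n_node    = c(200, 120, 6)
)

# Usage-style importance: percentage of training samples affected by the split
usage <- setNames(splits$n_node / n_train * 100, splits$predictor)
usage <- sort(usage, decreasing = TRUE)
print(usage)
```

The root split on x1 reaches every sample and therefore scores 100 percent, while x3 splits a node holding only a handful of samples and scores close to zero.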
Neural Networks:
The method used here is "Garson weights".
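For orientation, here is a minimal sketch of a commonly cited formulation of Garson's algorithm for a network with one hidden layer and a single output unit. It is an illustration under those assumptions, not the package's own implementation, which may differ in detail. The function garson_importance and the weight values are hypothetical, and bias weights are ignored.

```r
# Hypothetical helper, not part of the package.
# w_in : p x h matrix of input-to-hidden weights (rows = inputs)
# w_out: length-h vector of hidden-to-output weights
garson_importance <- function(w_in, w_out) {
  a <- abs(w_in)
  # share of each input in the total absolute weight entering a hidden unit
  share <- sweep(a, 2, colSums(a), "/")
  # weight each hidden unit by the absolute size of its output connection
  contrib <- sweep(share, 2, abs(w_out), "*")
  s <- rowSums(contrib)
  s / sum(s)  # relative importance, sums to 1
}

# Made-up weights: 3 inputs, 2 hidden units
w_in <- matrix(c(1, -1, 2,
                 2,  0, 2),
               nrow = 3,
               dimnames = list(c("x1", "x2", "x3"), NULL))
w_out <- c(1, -3)

imp <- garson_importance(w_in, w_out)
print(imp)
```

Because only absolute weights enter the calculation, the result says how strongly an input is connected to the output, not whether its effect is positive or negative.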
SVM, GLM, Multinom:
There are no implementations for these models so far.