Variable Importance for Regression and Classification Models

Variable importance expresses the desire to know how important each variable is among a group of predictors for a particular model. In general, however, it is not a well-defined concept: there is no theoretically defined variable importance metric. Nevertheless, some approaches have become established in practice for certain regression and classification algorithms. This function provides an interface for calculating variable importance for some of the models produced by FitMod, comprising linear models, classification trees, random forests, C5.0 trees and neural networks. The intention is to provide reasonably homogeneous output and plot routines.

Usage

VarImp(x, scale = FALSE, sort = TRUE, ...)

# S3 method for class 'FitMod'
VarImp(x, scale = FALSE, sort = TRUE, type = NULL, ...)
# Default S3 method
VarImp(x, scale = FALSE, sort = TRUE, ...)


# S3 method for class 'VarImp'
plot(x, sort = TRUE, maxrows = NULL,
           main = "Variable importance", ...)

# S3 method for class 'VarImp'
print(x, digits = 3, ...)

Arguments

x: the fitted model
scale: logical, should the importance values be scaled to the range 0 to 100?
...: parameters to pass to the specific VarImp methods
sort: the name of the column by which the importance table should be ordered (the usage above shows a default of TRUE)
maxrows: the maximum number of rows to be reported
main: the main title for the plot
type: some models offer more than one type of variable importance. Linear models accept one of "lmg", "pmvd", "first", "last", "betasq", "pratt".
digits: the number of digits for printing the "VarImp" table
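
To make the effect of scale concrete, the following is a minimal sketch in base R of rescaling importance values to the range 0 to 100. It assumes min-max rescaling; the scaling actually applied by VarImp may differ. The importance values and the helper name rescale100 are hypothetical and used for illustration only.

```r
# Hypothetical raw importance scores (illustrative values, not package output)
imp <- c(x1 = 0.42, x2 = 0.10, x3 = 0.27)

# Min-max rescaling to 0..100 (an assumption; VarImp may scale differently)
rescale100 <- function(v) {
  rng <- range(v, na.rm = TRUE)
  if (diff(rng) == 0) return(v * 0 + 100)  # all values equal: avoid division by zero
  (v - rng[1]) / diff(rng) * 100
}

scaled <- rescale100(imp)
sort(scaled, decreasing = TRUE)
```

Under this assumption the most important variable is reported as 100 and the least important as 0, so scaled values are comparable across models but no longer carry the original units.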

Value

A data frame with class c("VarImp.train", "data.frame") for VarImp.train or a matrix for other models.

Details

Linear Models: For linear models, the relaimpo package on CRAN offers several approaches for quantifying variable importance. See its documentation for details.
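
As an illustration of what two of the simpler metrics measure, here is a sketch in base R under the usual definitions: "first" is taken as the R-squared of a model containing only the predictor in question, and "last" as the gain in R-squared when that predictor is added to a model already containing all the others. This is not relaimpo's code; the choice of the swiss dataset, the three predictors and the helper r2 are assumptions made for the example.

```r
# Full model on a built-in dataset (dataset and predictors chosen for illustration)
preds <- c("Agriculture", "Education", "Catholic")
full  <- lm(reformulate(preds, response = "Fertility"), data = swiss)

# Helper: R-squared of a fitted linear model
r2 <- function(fit) summary(fit)$r.squared

# "first": R-squared when the predictor is the only regressor
first <- sapply(preds, function(p) {
  r2(lm(reformulate(p, response = "Fertility"), data = swiss))
})

# "last": increase in R-squared when the predictor is added after all others
last <- sapply(preds, function(p) {
  reduced <- lm(reformulate(setdiff(preds, p), response = "Fertility"), data = swiss)
  r2(full) - r2(reduced)
})

round(cbind(first, last), 3)
```

The two columns generally disagree when predictors are correlated, which is one reason the averaging metrics such as "lmg" exist.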

rpart, Random Forest: VarImp.rpart and VarImp.randomForest are wrappers around the importance functions from the rpart or randomForest packages, respectively.
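
For orientation, the sketch below shows the kind of underlying information such a wrapper can draw on in the rpart case. It assumes the rpart package is installed and reads the variable.importance component that rpart stores in the fitted object; how VarImp.rpart post-processes these values is not shown here. For random forests, the randomForest package provides the importance() function for the same purpose.

```r
library(rpart)

# Fit a classification tree on a built-in dataset (chosen for illustration)
fit <- rpart(Species ~ ., data = iris)

# Raw importance scores stored by rpart in the fitted object
imp <- fit$variable.importance

# Expressed as percentages of the total, for easier reading
round(100 * imp / sum(imp), 1)
```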

C5.0: C5.0 measures predictor importance by determining the percentage of training set samples that fall into all the terminal nodes after the split. For example, the predictor in the first split automatically has an importance of 100 percent, since all samples are affected by this split. Other predictors may be used frequently in splits, but if the terminal nodes below them cover only a handful of training set samples, their importance scores may be close to zero. The same strategy is applied to rule-based models and boosted versions of the model. This measure corresponds to metric="usage" in the underlying function, which can alternatively report how often each predictor was involved in a split by using the option metric="splits".
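
The percentage-of-samples idea can be sketched in base R on a hand-built toy tree. The tree, the predictor names and the sample counts below are hypothetical; each terminal node is described by the predictors tested on the path leading to it and by the number of training samples it contains. This illustrates the principle only and is not the C5.0 implementation.

```r
# Toy tree: the root splits on "age", one branch splits on "income",
# and one branch below that splits on "region" (all values hypothetical)
paths <- list(
  "age",                          # leaf 1
  c("age", "income"),             # leaf 2
  c("age", "income", "region"),   # leaf 3
  c("age", "income", "region")    # leaf 4
)
n <- c(60, 30, 6, 4)              # training samples in each leaf

preds <- unique(unlist(paths))

# Usage: percentage of training samples whose path passes a split on the predictor
usage <- sapply(preds, function(p) {
  hit <- vapply(paths, function(path) p %in% path, logical(1))
  100 * sum(n[hit]) / sum(n)
})

usage
# age = 100, income = 40, region = 10
```

The root predictor reaches 100 because every sample passes through it, while "region" scores only 10 because just 10 of the 100 samples ever reach that split.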

Neural Networks: The method used here is "Garson weights".
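
As a rough illustration, the following base R sketch applies one commonly cited formulation of Garson's algorithm to a network with a single hidden layer and a single output. The weights are hypothetical, bias terms are ignored, and the function name garson is chosen for this example; the implementation used by VarImp may differ in detail.

```r
# Hypothetical input-to-hidden weights (rows: inputs, columns: hidden units)
W <- matrix(c( 0.8, -0.2,
              -0.4,  0.6,
               0.1,  0.9),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("x1", "x2", "x3"), c("h1", "h2")))

# Hypothetical hidden-to-output weights
v <- c(h1 = 1.5, h2 = -0.5)

garson <- function(W, v) {
  A <- abs(W)
  # Share of each input in the absolute incoming weights of each hidden unit
  share <- sweep(A, 2, colSums(A), "/")
  # Weight each hidden unit by the absolute size of its connection to the output
  contrib <- sweep(share, 2, abs(v), "*")
  # Sum over hidden units and normalise so the importances add up to 1
  imp <- rowSums(contrib)
  imp / sum(imp)
}

ri <- garson(W, v)
round(100 * ri, 1)
```

Because only absolute weights enter the calculation, the result indicates the strength of each input's influence but not its direction.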

SVM, GLM, Multinom: Variable importance is not yet implemented for these models.

Author

Andri Signorell <andri@signorell.net>

References

Quinlan, J. (1992). Learning with continuous classes. Proceedings of the 5th Australian Joint Conference On Artificial Intelligence, 343-348.