LogitBoost Classification Algorithm
Trains a LogitBoost classification model using decision stumps (one-node decision trees) as weak learners.
Usage
LogitBoost(x, ...)
# S3 method for class 'formula'
LogitBoost(formula, data, ..., subset, na.action)
# Default S3 method
LogitBoost(x, y, nIter=ncol(x), ...)
Arguments
- formula
a formula expression as for regression models, of the form response ~ predictors. The response should be a factor or a matrix with K columns, which will be interpreted as counts for each of K classes. See the documentation of formula() for other details.
- data
an optional data frame in which to interpret the variables occurring in formula.
- ...
additional arguments passed to the default method
- subset
expression saying which subset of the rows of the data should be used in the fit. All observations are included by default.
- na.action
a function to filter missing data.
- x
A matrix or data frame with training data. Rows contain samples and columns contain features.
- y
Class labels for the training data samples. A response vector with one label for each row of x. Can be a factor, a character vector, or a numeric vector.
- nIter
An integer specifying the number of iterations for which boosting should be run, i.e., the number of decision stumps that will be used.
Details
The function was adapted from the logitboost.R function written by Marcel
Dettling. See the References and "See Also" sections. The code was modified to
make it much faster for very large data sets. The speed-up was achieved by
implementing an internal version of the decision stump classifier instead of
calling rpart. That way, some of the most time-consuming operations are
precomputed once instead of being performed at each iteration. Another
difference is that the training and testing phases of the classification
process are split into separate functions.
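To illustrate the weak learner described above, the base-R sketch below fits a single decision stump by exhaustively searching over features, thresholds, and directions. This is a simplified, hypothetical illustration of the idea (unweighted 0/1 error, labels in {-1, +1}), not the package's actual internal implementation, which precomputes sort orders and uses boosting weights.

```r
# Illustrative decision stump, NOT the package's internal code.
# Assumes y is a vector of -1/+1 labels and x is a numeric matrix.
fit_stump <- function(x, y) {
  best <- list(err = Inf)
  for (j in seq_len(ncol(x))) {            # candidate feature
    for (thr in unique(x[, j])) {          # candidate threshold
      for (dir in c(1, -1)) {              # candidate direction
        pred <- ifelse(dir * (x[, j] - thr) > 0, 1, -1)
        err  <- mean(pred != y)
        if (err < best$err)
          best <- list(feature = j, threshold = thr, direction = dir, err = err)
      }
    }
  }
  best
}

# Apply a fitted stump to new data
predict_stump <- function(stump, x) {
  ifelse(stump$direction * (x[, stump$feature] - stump$threshold) > 0, 1, -1)
}
```

For example, fitting such a stump to separate setosa from versicolor in iris finds a perfect single-feature split, which is why even very few boosting iterations classify that pair correctly in the examples below.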
Value
An object of class "LogitBoost" including components:
- Stump
List of decision stumps (one-node decision trees) used:
column 1: feature number for each stump, i.e., which column of x the stump operates on
column 2: threshold to be used for that column
column 3: bigger/smaller info: 1 means that if a value in that column is above the threshold, the corresponding sample will be labeled as lablist[1]; the value -1 means the opposite.
If there are more than two classes, then several "Stump" matrices will be cbind'ed.
- lablist
names of each class
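To make the three columns concrete, the base-R sketch below applies a hypothetical two-class Stump matrix to a single sample. It mirrors the column description above (feature index, threshold, direction) but is not the package's predict code; the stump values here are made up for illustration.

```r
# Hypothetical 2-stump ensemble for a two-class problem:
# col 1 = feature index, col 2 = threshold, col 3 = direction (+1 / -1)
stumps <- matrix(c(3, 2.5,  1,
                   4, 0.8, -1), ncol = 3, byrow = TRUE)

# Vote of one stump for one sample:
# +1 votes for lablist[1], -1 for the other class.
stump_vote <- function(stump, sample) {
  if (sample[stump[1]] > stump[2]) stump[3] else -stump[3]
}

sample <- c(5.0, 3.4, 4.6, 1.4)
votes  <- apply(stumps, 1, stump_vote, sample = sample)
votes
#> [1]  1 -1
```

The first stump (feature 3, 4.6 > 2.5, direction +1) votes for lablist[1]; the second (feature 4, 1.4 > 0.8, direction -1) votes against it, so the two stumps disagree on this sample.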
Author
Jarek Tuszynski (SAIC) jaroslaw.w.tuszynski@saic.com
Examples
# basic interface
r.lb <- LogitBoost(Species ~ ., data=iris, nIter=20)
pred <- predict(r.lb)
prob <- predict(r.lb, type="prob")
d.res <- data.frame(pred, prob)
d.res[1:10, ]
#> pred setosa versicolor virginica
#> 1 setosa 1 0.017986210 1.522998e-08
#> 2 setosa 1 0.002472623 3.353501e-04
#> 3 setosa 1 0.017986210 8.315280e-07
#> 4 setosa 1 0.002472623 4.539787e-05
#> 5 setosa 1 0.017986210 1.522998e-08
#> 6 setosa 1 0.017986210 1.522998e-08
#> 7 setosa 1 0.017986210 8.315280e-07
#> 8 setosa 1 0.017986210 1.522998e-08
#> 9 setosa 1 0.002472623 3.353501e-04
#> 10 setosa 1 0.002472623 4.539787e-05
# accuracy increases with nIter (at least for train set)
table(predict(r.lb, iris, type="class", nIter= 2), iris$Species)
#>
#> setosa versicolor virginica
#> setosa 48 0 0
#> versicolor 0 45 1
#> virginica 0 3 45
table(predict(r.lb, iris, type="class", nIter=10), iris$Species)
#>
#> setosa versicolor virginica
#> setosa 50 0 0
#> versicolor 0 47 0
#> virginica 0 1 47
table(predict(r.lb, iris, type="class"), iris$Species)
#>
#> setosa versicolor virginica
#> setosa 50 0 0
#> versicolor 0 49 0
#> virginica 0 0 48
# example of splitting the data into train and test sets
d.set <- SplitTrainTest(iris)
r.lb <- LogitBoost(Species ~ ., data=d.set$train, nIter=10)
table(predict(r.lb, d.set$test, type="class", nIter=2), d.set$test$Species)
#>
#> setosa versicolor virginica
#> setosa 2 0 0
#> versicolor 0 1 0
#> virginica 0 1 9
table(predict(r.lb, d.set$test, type="class"), d.set$test$Species)
#>
#> setosa versicolor virginica
#> setosa 3 0 0
#> versicolor 0 3 0
#> virginica 0 0 9