LogitBoost Classification Algorithm
Trains a LogitBoost classification model using decision stumps (one-node decision trees) as weak learners.
Usage
LogitBoost(x, ...)
# S3 method for class 'formula'
LogitBoost(formula, data, ..., subset, na.action)
# Default S3 method
LogitBoost(x, y, nIter=ncol(x), ...)
Arguments
- formula
a formula expression as for regression models, of the form response ~ predictors. The response should be a factor or a matrix with K columns, which will be interpreted as counts for each of K classes. See the documentation of formula() for other details.
- data
an optional data frame in which to interpret the variables occurring in formula.
- ...
additional arguments passed to the default method
- subset
expression saying which subset of the rows of the data should be used in the fit. All observations are included by default.
- na.action
a function to filter missing data.
- x
A matrix or data frame with training data. Rows contain samples and columns contain features.
- y
Class labels for the training data samples. A response vector with one label for each row of x. Can be a factor, a character vector, or a numeric vector.
- nIter
An integer specifying the number of iterations for which boosting should be run, i.e., the number of decision stumps that will be used.
Details
The function was adapted from the logitboost.R function written by Marcel
Dettling. See the References and "See Also" sections. The code was modified to
make it much faster for very large data sets. The speed-up was achieved by
implementing an internal version of the decision stump classifier instead of
calling rpart. That way, some of the most time-consuming operations are
precomputed once instead of being performed at each iteration. Another
difference is that the training and testing phases of the classification
process are split into separate functions.
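To illustrate the weak learner described above, the base-R sketch below fits a single decision stump by exhaustively searching over features, thresholds, and directions. This is a simplified, hypothetical illustration of the idea (unweighted 0/1 error, labels in {-1, +1}), not the package's actual internal implementation, which precomputes sort orders and uses boosting weights.

```r
# Illustrative decision stump, NOT the package's internal code.
# Assumes y is a vector of -1/+1 labels and x is a numeric matrix.
fit_stump <- function(x, y) {
  best <- list(err = Inf)
  for (j in seq_len(ncol(x))) {            # candidate feature
    for (thr in unique(x[, j])) {          # candidate threshold
      for (dir in c(1, -1)) {              # candidate direction
        pred <- ifelse(dir * (x[, j] - thr) > 0, 1, -1)
        err  <- mean(pred != y)
        if (err < best$err)
          best <- list(feature = j, threshold = thr, direction = dir, err = err)
      }
    }
  }
  best
}

# Apply a fitted stump to new data
predict_stump <- function(stump, x) {
  ifelse(stump$direction * (x[, stump$feature] - stump$threshold) > 0, 1, -1)
}
```

For example, fitting such a stump to separate setosa from versicolor in iris finds a perfect single-feature split, which is why even very few boosting iterations classify that pair correctly in the examples below.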
Value
An object of class "LogitBoost" including components:
- Stump
List of decision stumps (one-node decision trees) used:
column 1: feature number for each stump, i.e., which column of x the stump operates on
column 2: threshold to be used for that column
column 3: bigger/smaller info: 1 means that if a value in that column is above the threshold, the corresponding sample will be labeled as lablist[1]; the value -1 means the opposite.
If there are more than two classes, then several "Stump" matrices will be cbind'ed.
- lablist
names of each class
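To make the three columns concrete, the base-R sketch below applies a hypothetical two-class Stump matrix to a single sample. It mirrors the column description above (feature index, threshold, direction) but is not the package's predict code; the stump values here are made up for illustration.

```r
# Hypothetical 2-stump ensemble for a two-class problem:
# col 1 = feature index, col 2 = threshold, col 3 = direction (+1 / -1)
stumps <- matrix(c(3, 2.5,  1,
                   4, 0.8, -1), ncol = 3, byrow = TRUE)

# Vote of one stump for one sample:
# +1 votes for lablist[1], -1 for the other class.
stump_vote <- function(stump, sample) {
  if (sample[stump[1]] > stump[2]) stump[3] else -stump[3]
}

sample <- c(5.0, 3.4, 4.6, 1.4)
votes  <- apply(stumps, 1, stump_vote, sample = sample)
votes
#> [1]  1 -1
```

The first stump (feature 3, 4.6 > 2.5, direction +1) votes for lablist[1]; the second (feature 4, 1.4 > 0.8, direction -1) votes against it, so the two stumps disagree on this sample.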
Author
Jarek Tuszynski (SAIC) jaroslaw.w.tuszynski@saic.com
Examples
# basic interface
r.lb <- LogitBoost(Species ~ ., data=iris, nIter=20)
pred <- predict(r.lb)
prob <- predict(r.lb, type="prob")
d.res <- data.frame(pred, prob)
d.res[1:10, ]
#> pred setosa versicolor virginica
#> 1 setosa 1 0.017986210 1.522998e-08
#> 2 setosa 1 0.002472623 3.353501e-04
#> 3 setosa 1 0.017986210 8.315280e-07
#> 4 setosa 1 0.002472623 4.539787e-05
#> 5 setosa 1 0.017986210 1.522998e-08
#> 6 setosa 1 0.017986210 1.522998e-08
#> 7 setosa 1 0.017986210 8.315280e-07
#> 8 setosa 1 0.017986210 1.522998e-08
#> 9 setosa 1 0.002472623 3.353501e-04
#> 10 setosa 1 0.002472623 4.539787e-05
# accuracy increases with nIter (at least for train set)
table(predict(r.lb, iris, type="class", nIter= 2), iris$Species)
#>
#> setosa versicolor virginica
#> setosa 48 0 0
#> versicolor 0 45 1
#> virginica 0 3 45
table(predict(r.lb, iris, type="class", nIter=10), iris$Species)
#>
#> setosa versicolor virginica
#> setosa 50 0 0
#> versicolor 0 47 0
#> virginica 0 1 47
table(predict(r.lb, iris, type="class"), iris$Species)
#>
#> setosa versicolor virginica
#> setosa 50 0 0
#> versicolor 0 49 0
#> virginica 0 0 48
# example of splitting the data into train and test sets
d.set <- SplitTrainTest(iris)
r.lb <- LogitBoost(Species ~ ., data=d.set$train, nIter=10)
table(predict(r.lb, d.set$test, type="class", nIter=2), d.set$test$Species)
#>
#> setosa versicolor virginica
#> setosa 2 0 0
#> versicolor 0 1 0
#> virginica 0 1 9
table(predict(r.lb, d.set$test, type="class"), d.set$test$Species)
#>
#> setosa versicolor virginica
#> setosa 3 0 0
#> versicolor 0 3 0
#> virginica 0 0 9