Regression and Classification Tools — ModTools-package • ModTools

There is a rich selection of R packages implementing algorithms for classification and regression tasks out there. The authors legitimately take the liberty to tailor the function interfaces according to their own taste and needs. For us other users, however, this often results in struggling with user interfaces, some of which are rather weird - to put it mildly - and almost always different in terms of arguments and result structures. ModTools pursues the goal of offering uniform handling for the most important regression and classification models in applied data analyses.
The function FitMod() is designed as a simple and consistent interface to these original functions while maintaining the flexibility to pass on all possible arguments. print, plot, summary and predict operations can so be carried out following the same logic. The results will again be reshaped to a reasonable standard.

For all the functions of this package Google styleguides are used as naming rules (in absence of convincing alternatives). The 'BigCamelCase' style has been consequently applied to functions borrowed from contributed R packages as well.

As always: Feedback, feature requests, bugreports and other suggestions are welcome!

Warning

This package is still under development. You should be aware that everything in the package might be subject to change. Backward compatibility is not yet guaranteed. Functions may be deleted or renamed and new syntax may be inconsistent with earlier versions. By release of version 1.0 the "deprecated-defunct process" will be installed.

Details

The ModTools::FitMod()) function comprises interfaces to the following models:

Regression:
`lm()`	Linear model OLS (base)
`lmrob()`	Robust linear model (robustbase)
`poisson()`	GLM model with family `poisson` (base)
`negbin()`	GLM model with family `negative.binomial` (MASS)
`gamma()`	GLM model with family `gamma` (base)
`tobit()`	Tobit model for censored responses (package AER)

	Classification:
	`lda()`
Linear discriminant analysis (MASS)	`qda()`
Quadratic discriminant analysis (MASS)	`logit()`
Logistic Regression model `glm`, family `binomial(logit)`(base)	`multinom()`
Multinomial Regression model (nnet)	`polr()`
Proportional odds model (MASS)	`rpart()`
Regression and classification trees (rpart)	`nnet()`
Neuronal networks (nnet)	`randomForest()`
Random forests (randomForest)	`C5.0()`
C5.0 tree (C50)	`svm()`
Support vector machines (e1071)	`naive_bayes()`
Naive Bayes classificator (naivebayes)	`LogitBoost()`
Logit boost (using decision stumps as weak learners) (ModTools)

Preprocess:
`SplitTrainTest()`	Splits a data frame or index vector into a training and a test sample
`OverSample()`	Get balanced datasets by sampling with replacement.

	Manipulating `rpart` objects:
	`CP()`
Extract and plot complexity table of an rpart tree.	`Node()`
Accessor to the most important properties of a node, being a split or a leaf.	`Rules()`
Extract the decision rules from top to the end node of an rpart tree.	`LeafRates()`
Returns the misclassification rates in all end nodes.

Prediction and Validation:
`Response()`	Extract the response variable of any model.
`predict()`	Consistent predict for `FitMod` models
`VarImp()`	Variable importance for most `FitMod` models
`ROC()`	ROC curves for all dichotomous classification `FitMod` models
`BestCut()`	Find the optimal cut for a classification based on the ROC curve.
`PlotLift()`	Produces a lift chart for a binary classification model
`TModC()`	Aggregated results for multiple `FitMod` classification models
`Tune()`	Tuning approaches to find optimal parameters for `FitMod` classification models.
`RobSummary()`	Robust summary for GLM models (poisson).

	Tests:
	`BreuschPaganTest()`
Breusch-Pagan test against heteroskedasticity.

Author

Andri Signorell
Helsana Versicherungen AG, Health Sciences, Zurich
HWZ University of Applied Sciences in Business Administration Zurich.

Includes R source code and/or documentation previously published by (in alphabetical order):
Bernhard Compton, Marcel Dettling, Max Kuhn, Michal Majka, Dan Putler, Jarek Tuszynski, Robin Xavier, Achim Zeileis

The good things come from all these guys, any problems are likely due to my tweaking. Thank you all!

Maintainer: Andri Signorell <andri@signorell.net>

Examples


r.swiss <- FitMod(Fertility ~ ., swiss, fitfn="lm")
r.swiss
#> 
#> Call:
#> lm(formula = Fertility ~ ., data = swiss)
#> 
#> Coefficients:
#>                   estimate  95%-lci     uci   p-val     
#> (Intercept)         66.915   45.294  88.536  <0.001  ***
#> Agriculture         -0.172   -0.314  -0.030   0.019  *  
#> Examination         -0.258   -0.771   0.255   0.315     
#> Education           -0.871   -1.241  -0.501  <0.001  ***
#> Catholic             0.104    0.033   0.175   0.005  ** 
#> Infant.Mortality     1.077    0.306   1.848   0.007  ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Obs (NAs): 47 (0)	R²/R²adj: 0.707/0.671    AIC: 326.0716
#> 
# PlotTA(r.swiss)
# PlotQQNorm(r.swiss)


## Count models

data(housing, package="MASS")

# poisson count
r.pois <- FitMod(Freq ~ Infl*Type*Cont + Sat, family=poisson, data=housing, fitfn="poisson")

# negative binomial count
r.nb <- FitMod(Freq ~ Infl*Type*Cont + Sat, data=housing, fitfn="negbin")
summary(r.nb)
#> 
#> Call:
#> glm.nb(formula = Freq ~ Infl * Type * Cont + Sat, data = housing, 
#>     init.theta = 11.58904011, link = log)
#> 
#> Coefficients:
#>                                   Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)                        3.14037    0.20801  15.097  < 2e-16 ***
#> InflMedium                         0.26615    0.28836   0.923 0.356028    
#> InflHigh                          -0.25336    0.30111  -0.841 0.400113    
#> TypeApartment                      0.39532    0.28603   1.382 0.166945    
#> TypeAtrium                        -0.77634    0.32149  -2.415 0.015742 *  
#> TypeTerrace                       -0.80653    0.32296  -2.497 0.012515 *  
#> ContHigh                          -0.02395    0.29474  -0.081 0.935244    
#> Sat.L                              0.16860    0.07525   2.240 0.025070 *  
#> Sat.Q                              0.25749    0.07821   3.292 0.000994 ***
#> InflMedium:TypeApartment          -0.12807    0.39875  -0.321 0.748082    
#> InflHigh:TypeApartment             0.15140    0.41087   0.368 0.712507    
#> InflMedium:TypeAtrium             -0.41265    0.45633  -0.904 0.365849    
#> InflHigh:TypeAtrium               -0.13006    0.47471  -0.274 0.784108    
#> InflMedium:TypeTerrace             0.01666    0.44423   0.038 0.970080    
#> InflHigh:TypeTerrace              -0.06969    0.47429  -0.147 0.883177    
#> InflMedium:ContHigh               -0.12632    0.41042  -0.308 0.758256    
#> InflHigh:ContHigh                 -0.59072    0.44424  -1.330 0.183609    
#> TypeApartment:ContHigh             0.52179    0.40019   1.304 0.192281    
#> TypeAtrium:ContHigh                0.71556    0.43760   1.635 0.102013    
#> TypeTerrace:ContHigh               1.15028    0.43270   2.658 0.007852 ** 
#> InflMedium:TypeApartment:ContHigh  0.01178    0.56001   0.021 0.983224    
#> InflHigh:TypeApartment:ContHigh    0.13235    0.59111   0.224 0.822837    
#> InflMedium:TypeAtrium:ContHigh     0.14135    0.62032   0.228 0.819752    
#> InflHigh:TypeAtrium:ContHigh       0.42364    0.65770   0.644 0.519496    
#> InflMedium:TypeTerrace:ContHigh   -0.51739    0.60483  -0.855 0.392307    
#> InflHigh:TypeTerrace:ContHigh     -0.49705    0.66522  -0.747 0.454946    
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> (Dispersion parameter for Negative Binomial(11.589) family taken to be 1)
#> 
#>     Null deviance: 275.558  on 71  degrees of freedom
#> Residual deviance:  67.887  on 46  degrees of freedom
#> AIC: 535.29
#> 
#> Number of Fisher Scoring iterations: 1
#> 
#> 
#>               Theta:  11.59 
#>           Std. Err.:  2.92 
#> 
#>  2 x log-likelihood:  -481.289 

r.log <- FitMod(log(Freq) ~ Infl*Type*Cont + Sat, data=housing, fitfn="lm")
summary(r.log)
#> 
#> Call:
#> lm(formula = log(Freq) ~ Infl * Type * Cont + Sat, data = housing)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.90539 -0.22284  0.01119  0.18857  0.96514 
#> 
#> Coefficients:
#>                                    Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                        3.140416   0.267518  11.739 1.95e-15 ***
#> InflMedium                         0.259891   0.378328   0.687   0.4956    
#> InflHigh                          -0.379083   0.378328  -1.002   0.3216    
#> TypeApartment                      0.219444   0.378328   0.580   0.5647    
#> TypeAtrium                        -0.785497   0.378328  -2.076   0.0435 *  
#> TypeTerrace                       -0.931069   0.378328  -2.461   0.0177 *  
#> ContHigh                          -0.075612   0.378328  -0.200   0.8425    
#> Sat.L                              0.199559   0.094582   2.110   0.0403 *  
#> Sat.Q                              0.199376   0.094582   2.108   0.0405 *  
#> InflMedium:TypeApartment           0.048725   0.535036   0.091   0.9278    
#> InflHigh:TypeApartment             0.398374   0.535036   0.745   0.4603    
#> InflMedium:TypeAtrium             -0.400214   0.535036  -0.748   0.4583    
#> InflHigh:TypeAtrium                0.002462   0.535036   0.005   0.9963    
#> InflMedium:TypeTerrace             0.143412   0.535036   0.268   0.7899    
#> InflHigh:TypeTerrace               0.154151   0.535036   0.288   0.7746    
#> InflMedium:ContHigh               -0.105500   0.535036  -0.197   0.8446    
#> InflHigh:ContHigh                 -0.737873   0.535036  -1.379   0.1745    
#> TypeApartment:ContHigh             0.697935   0.535036   1.304   0.1986    
#> TypeAtrium:ContHigh                0.763012   0.535036   1.426   0.1606    
#> TypeTerrace:ContHigh               1.114096   0.535036   2.082   0.0429 *  
#> InflMedium:TypeApartment:ContHigh -0.141229   0.756655  -0.187   0.8528    
#> InflHigh:TypeApartment:ContHigh    0.087753   0.756655   0.116   0.9082    
#> InflMedium:TypeAtrium:ContHigh     0.060731   0.756655   0.080   0.9364    
#> InflHigh:TypeAtrium:ContHigh       0.503181   0.756655   0.665   0.5094    
#> InflMedium:TypeTerrace:ContHigh   -0.531148   0.756655  -0.702   0.4862    
#> InflHigh:TypeTerrace:ContHigh     -0.296310   0.756655  -0.392   0.6972    
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> Residual standard error: 0.4634 on 46 degrees of freedom
#> Multiple R-squared:  0.751,	Adjusted R-squared:  0.6157 
#> F-statistic: 5.551 on 25 and 46 DF,  p-value: 2.796e-07
#> 

r.ols <- FitMod(Freq ~ Infl*Type*Cont + Sat, data=housing, fitfn="lm")
summary(r.ols)
#> 
#> Call:
#> lm(formula = Freq ~ Infl * Type * Cont + Sat, data = housing)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -22.4861  -5.2361  -0.1944   3.5347  27.0556 
#> 
#> Coefficients:
#>                                     Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)                        2.333e+01  7.076e+00   3.298  0.00189 **
#> InflMedium                         7.333e+00  1.001e+01   0.733  0.46737   
#> InflHigh                          -4.333e+00  1.001e+01  -0.433  0.66700   
#> TypeApartment                      1.033e+01  1.001e+01   1.033  0.30717   
#> TypeAtrium                        -1.267e+01  1.001e+01  -1.266  0.21195   
#> TypeTerrace                       -1.300e+01  1.001e+01  -1.299  0.20037   
#> ContHigh                          -1.680e-14  1.001e+01   0.000  1.00000   
#> Sat.L                              2.976e+00  2.502e+00   1.190  0.24034   
#> Sat.Q                              5.835e+00  2.502e+00   2.332  0.02412 * 
#> InflMedium:TypeApartment          -1.667e+00  1.415e+01  -0.118  0.90676   
#> InflHigh:TypeApartment             3.333e+00  1.415e+01   0.236  0.81483   
#> InflMedium:TypeAtrium             -8.667e+00  1.415e+01  -0.612  0.54328   
#> InflHigh:TypeAtrium                1.000e+00  1.415e+01   0.071  0.94397   
#> InflMedium:TypeTerrace            -4.000e+00  1.415e+01  -0.283  0.77871   
#> InflHigh:TypeTerrace               1.667e+00  1.415e+01   0.118  0.90676   
#> InflMedium:ContHigh               -4.000e+00  1.415e+01  -0.283  0.77871   
#> InflHigh:ContHigh                 -8.667e+00  1.415e+01  -0.612  0.54328   
#> TypeApartment:ContHigh             2.200e+01  1.415e+01   1.555  0.12689   
#> TypeAtrium:ContHigh                1.033e+01  1.415e+01   0.730  0.46897   
#> TypeTerrace:ContHigh               2.067e+01  1.415e+01   1.460  0.15098   
#> InflMedium:TypeApartment:ContHigh  2.333e+00  2.001e+01   0.117  0.90769   
#> InflHigh:TypeApartment:ContHigh   -1.200e+01  2.001e+01  -0.600  0.55171   
#> InflMedium:TypeAtrium:ContHigh     3.000e+00  2.001e+01   0.150  0.88150   
#> InflHigh:TypeAtrium:ContHigh       3.667e+00  2.001e+01   0.183  0.85544   
#> InflMedium:TypeTerrace:ContHigh   -8.667e+00  2.001e+01  -0.433  0.66700   
#> InflHigh:TypeTerrace:ContHigh     -1.167e+01  2.001e+01  -0.583  0.56278   
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> Residual standard error: 12.26 on 46 degrees of freedom
#> Multiple R-squared:  0.6882,	Adjusted R-squared:  0.5187 
#> F-statistic: 4.061 on 25 and 46 DF,  p-value: 1.939e-05
#> 

r.gam <- FitMod(Freq ~ Infl*Type*Cont + Sat, data=housing, fitfn="gamma")
summary(r.gam)
#> 
#> Call:
#> glm(formula = Freq ~ Infl * Type * Cont + Sat, family = "Gamma", 
#>     data = housing)
#> 
#> Coefficients:
#>                                     Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                        4.355e-02  1.143e-02   3.809 0.000412 ***
#> InflMedium                        -1.005e-02  1.433e-02  -0.702 0.486503    
#> InflHigh                           9.653e-03  1.811e-02   0.533 0.596679    
#> TypeApartment                     -1.288e-02  1.386e-02  -0.929 0.357851    
#> TypeAtrium                         5.053e-02  2.761e-02   1.830 0.073681 .  
#> TypeTerrace                        5.354e-02  2.835e-02   1.889 0.065241 .  
#> ContHigh                          -1.378e-17  1.615e-02   0.000 1.000000    
#> Sat.L                             -3.211e-03  3.155e-03  -1.018 0.314213    
#> Sat.Q                             -9.144e-03  4.197e-03  -2.179 0.034492 *  
#> InflMedium:TypeApartment           5.915e-03  1.766e-02   0.335 0.739147    
#> InflHigh:TypeApartment            -8.770e-03  2.134e-02  -0.411 0.683073    
#> InflMedium:TypeAtrium              2.340e-02  4.077e-02   0.574 0.568768    
#> InflHigh:TypeAtrium                3.286e-02  4.794e-02   0.685 0.496500    
#> InflMedium:TypeTerrace            -1.346e-02  3.553e-02  -0.379 0.706642    
#> InflHigh:TypeTerrace               2.393e-02  4.718e-02   0.507 0.614453    
#> InflMedium:ContHigh                4.784e-03  2.086e-02   0.229 0.819606    
#> InflHigh:ContHigh                  4.389e-02  3.364e-02   1.305 0.198497    
#> TypeApartment:ContHigh            -1.121e-02  1.856e-02  -0.604 0.548836    
#> TypeAtrium:ContHigh               -4.583e-02  3.247e-02  -1.412 0.164758    
#> TypeTerrace:ContHigh              -6.393e-02  3.174e-02  -2.014 0.049831 *  
#> InflMedium:TypeApartment:ContHigh -1.765e-03  2.412e-02  -0.073 0.941979    
#> InflHigh:TypeApartment:ContHigh   -3.385e-02  3.662e-02  -0.924 0.360209    
#> InflMedium:TypeAtrium:ContHigh    -1.225e-02  4.752e-02  -0.258 0.797727    
#> InflHigh:TypeAtrium:ContHigh      -5.532e-02  6.091e-02  -0.908 0.368570    
#> InflMedium:TypeTerrace:ContHigh    3.236e-02  4.143e-02   0.781 0.438774    
#> InflHigh:TypeTerrace:ContHigh      1.462e-02  6.501e-02   0.225 0.823077    
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> (Dispersion parameter for Gamma family taken to be 0.2163103)
#> 
#>     Null deviance: 38.140  on 71  degrees of freedom
#> Residual deviance: 10.671  on 46  degrees of freedom
#> AIC: 538.18
#> 
#> Number of Fisher Scoring iterations: 6
#> 

r.gami <- FitMod(Freq ~ Infl*Type*Cont + Sat, data=housing, fitfn="gamma", link="identity")
summary(r.gami)
#> 
#> Call:
#> glm(formula = Freq ~ Infl * Type * Cont + Sat, family = function () 
#> Gamma(link = identity), data = housing)
#> 
#> Coefficients:
#>                                   Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                        23.3635     5.7553   4.060 0.000189 ***
#> InflMedium                          7.1563     9.5281   0.751 0.456438    
#> InflHigh                           -7.2034     6.9307  -1.039 0.304079    
#> TypeApartment                      12.6892    10.6939   1.187 0.241486    
#> TypeAtrium                        -11.1942     6.3987  -1.749 0.086883 .  
#> TypeTerrace                       -10.9532     6.4273  -1.704 0.095097 .  
#> ContHigh                           -1.6216     7.8406  -0.207 0.837058    
#> Sat.L                               4.9368     1.4858   3.323 0.001755 ** 
#> Sat.Q                               3.2415     1.1421   2.838 0.006731 ** 
#> InflMedium:TypeApartment           -3.4540    16.4721  -0.210 0.834835    
#> InflHigh:TypeApartment              1.4504    13.6446   0.106 0.915810    
#> InflMedium:TypeAtrium              -9.2801    10.1778  -0.912 0.366629    
#> InflHigh:TypeAtrium                 3.5471     7.6884   0.461 0.646714    
#> InflMedium:TypeTerrace             -4.5638    10.5680  -0.432 0.667870    
#> InflHigh:TypeTerrace                2.9450     7.6893   0.383 0.703488    
#> InflMedium:ContHigh                -3.5733    12.5840  -0.284 0.777715    
#> InflHigh:ContHigh                  -7.1152     8.8705  -0.802 0.426610    
#> TypeApartment:ContHigh             22.5209    18.6701   1.206 0.233888    
#> TypeAtrium:ContHigh                11.5520     9.9375   1.162 0.251043    
#> TypeTerrace:ContHigh               22.9328    11.8581   1.934 0.059288 .  
#> InflMedium:TypeApartment:ContHigh   0.9018    27.5339   0.033 0.974015    
#> InflHigh:TypeApartment:ContHigh   -13.2180    21.9476  -0.602 0.549965    
#> InflMedium:TypeAtrium:ContHigh      2.3032    14.8708   0.155 0.877591    
#> InflHigh:TypeAtrium:ContHigh        0.3769    11.2392   0.034 0.973392    
#> InflMedium:TypeTerrace:ContHigh    -8.8401    16.8767  -0.524 0.602931    
#> InflHigh:TypeTerrace:ContHigh     -14.3587    12.7806  -1.123 0.267063    
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> (Dispersion parameter for Gamma family taken to be 0.1923107)
#> 
#>     Null deviance: 38.1400  on 71  degrees of freedom
#> Residual deviance:  9.0822  on 46  degrees of freedom
#> AIC: 526.3
#> 
#> Number of Fisher Scoring iterations: 14
#> 

old <-options(digits=3)
TMod(r.pois, r.nb, r.log, r.ols, r.gam, r.gami)
#> Waiting for profiling to be done...
#> Waiting for profiling to be done...
#> Waiting for profiling to be done...
#> Waiting for profiling to be done...
#>                                 coef   r.pois         r.nb       r.log    
#> 1                        (Intercept)    3.136 ***    3.140 ***   3.140 ***
#> 2                         InflMedium    0.273 .      0.266       0.260    
#> 3                           InflHigh   -0.205       -0.253      -0.379    
#> 4                      TypeApartment    0.367 *      0.395       0.219    
#> 5                         TypeAtrium   -0.783 ***   -0.776 *    -0.785 *  
#> 6                        TypeTerrace   -0.815 ***   -0.807 *    -0.931 *  
#> 7                           ContHigh    1.409e-15   -0.024      -0.076    
#> 8                              Sat.L    0.116 **     0.169 *     0.200 *  
#> 9                              Sat.Q    0.263 ***    0.257 ***   0.199 *  
#> 10          InflMedium:TypeApartment   -0.118       -0.128       0.049    
#> 11            InflHigh:TypeApartment    0.175        0.151       0.398    
#> 12             InflMedium:TypeAtrium   -0.407       -0.413      -0.400    
#> 13               InflHigh:TypeAtrium   -0.169       -0.130       0.002    
#> 14            InflMedium:TypeTerrace    0.006        0.017       0.143    
#> 15              InflHigh:TypeTerrace   -0.093       -0.070       0.154    
#> 16               InflMedium:ContHigh   -0.140       -0.126      -0.105    
#> 17                 InflHigh:ContHigh   -0.609 *     -0.591      -0.738    
#> 18            TypeApartment:ContHigh    0.503 *      0.522       0.698    
#> 19               TypeAtrium:ContHigh    0.677 *      0.716       0.763    
#> 20              TypeTerrace:ContHigh    1.099 ***    1.150 **    1.114 *  
#> 21 InflMedium:TypeApartment:ContHigh    0.054        0.012      -0.141    
#> 22   InflHigh:TypeApartment:ContHigh    0.146        0.132       0.088    
#> 23    InflMedium:TypeAtrium:ContHigh    0.156        0.141       0.061    
#> 24      InflHigh:TypeAtrium:ContHigh    0.478        0.424       0.503    
#> 25   InflMedium:TypeTerrace:ContHigh   -0.498       -0.517      -0.531    
#> 26     InflHigh:TypeTerrace:ContHigh   -0.447       -0.497      -0.296    
#> 27                               ---                                      
#> 28                         r.squared    -            -           0.751    
#> 29                     adj.r.squared    -            -           0.616    
#> 30                             sigma    -            -           0.463    
#> 31                            logLik -279.213     -240.645     -30.648    
#> 32                           logLik0 -587.313     -290.899       -        
#> 33                                G2  616.201      100.508       -        
#> 34                          deviance    -            -           9.876    
#> 35                               AIC  610.426      535.289     115.296    
#> 36                               BIC  669.619      596.759     176.766    
#> 37                             numdf   26           27          25        
#> 38                             dendf    -            -          46        
#> 39                                 N   72           72          72        
#> 40                            n vars    8            8           8        
#> 41                            n coef   26           26          26        
#> 42                                 F    -            -           5.551    
#> 43                                 p    -            -           0.000    
#> 44                               MAE    6.641        6.608       0.276    
#> 45                              MAPE    0.354        0.345       0.104    
#> 46                               MSE   92.754       97.134       0.137    
#> 47                              RMSE    9.631        9.856       0.370    
#> 48                          McFadden    0.525        0.173       -        
#> 49                       McFaddenAdj    0.480        0.083       -        
#> 50                        Nagelkerke    1.000        0.753       -        
#> 51                          CoxSnell    1.000        0.752       -        
#>       r.ols        r.gam       r.gami    
#> 1    23.333 **     0.044 ***   23.363 ***
#> 2     7.333       -0.010        7.156    
#> 3    -4.333        0.010       -7.203    
#> 4    10.333       -0.013       12.689    
#> 5   -12.667        0.051 .    -11.194 .  
#> 6   -13.000        0.054 .    -10.953 .  
#> 7    -1.680e-14   -0.000       -1.622    
#> 8     2.976       -0.003        4.937 ** 
#> 9     5.835 *     -0.009 *      3.241 ** 
#> 10   -1.667        0.006       -3.454    
#> 11    3.333       -0.009        1.450    
#> 12   -8.667        0.023       -9.280    
#> 13    1.000        0.033        3.547    
#> 14   -4.000       -0.013       -4.564    
#> 15    1.667        0.024        2.945    
#> 16   -4.000        0.005       -3.573    
#> 17   -8.667        0.044       -7.115    
#> 18   22.000       -0.011       22.521    
#> 19   10.333       -0.046       11.552    
#> 20   20.667       -0.064 *     22.933 .  
#> 21    2.333       -0.002        0.902    
#> 22  -12.000       -0.034      -13.218    
#> 23    3.000       -0.012        2.303    
#> 24    3.667       -0.055        0.377    
#> 25   -8.667        0.032       -8.840    
#> 26  -11.667        0.015      -14.359    
#> 27                                       
#> 28    0.688        -            -        
#> 29    0.519        -            -        
#> 30   12.256        -            -        
#> 31 -266.465     -242.088     -236.152    
#> 32    -         -290.204     -290.204    
#> 33    -           96.232      108.104    
#> 34 6909.139        -            -        
#> 35  586.930      538.176      526.304    
#> 36  648.400      599.646      587.774    
#> 37   25           27           27        
#> 38   46            -            -        
#> 39   72           72           72        
#> 40    8            8            8        
#> 41   26           26           26        
#> 42    4.061        -            -        
#> 43    0.000        -            -        
#> 44    7.050        6.883        6.823    
#> 45    0.385        0.371        0.321    
#> 46   95.960       98.993      102.471    
#> 47    9.796        9.950       10.123    
#> 48    -            0.166        0.186    
#> 49    -            0.076        0.097    
#> 50    -            0.737        0.777    
#> 51    -            0.737        0.777    
options(old)


## Ordered Regression

r.polr <- FitMod(Sat ~ Infl + Type + Cont, data=housing, fitfn="polr", weights = Freq)

# multinomial Regression
# r.mult <- FitMod(factor(Sat, ordered=FALSE) ~ Infl + Type + Cont, data=housing,
#                  weights = housing$Freq, fitfn="multinom")


# Regression tree
r.rp <- FitMod(factor(Sat, ordered=FALSE) ~ Infl + Type + Cont, data=housing,
                 weights = housing$Freq, fitfn="rpart")

# compare predictions
d.p <- expand.grid(Infl=levels(housing$Infl), Type=levels(housing$Type), Cont=levels(housing$Cont))
d.p$polr <- predict(r.polr, newdata=d.p)
# ??
# d.p$ols <- factor(round(predict(r.ols, newdata=d.p)^2), labels=levels(housing$Sat))
# d.p$mult <- predict(r.mult, newdata=d.p)
d.p$rp <- predict(r.rp, newdata=d.p, type="class")

d.p
#>      Infl      Type Cont polr   rp
#> 1     Low     Tower  Low  Low High
#> 2  Medium     Tower  Low High High
#> 3    High     Tower  Low High High
#> 4     Low Apartment  Low  Low  Low
#> 5  Medium Apartment  Low  Low High
#> 6    High Apartment  Low High High
#> 7     Low    Atrium  Low  Low High
#> 8  Medium    Atrium  Low High High
#> 9    High    Atrium  Low High High
#> 10    Low   Terrace  Low  Low  Low
#> 11 Medium   Terrace  Low  Low High
#> 12   High   Terrace  Low High High
#> 13    Low     Tower High High High
#> 14 Medium     Tower High High High
#> 15   High     Tower High High High
#> 16    Low Apartment High  Low  Low
#> 17 Medium Apartment High High High
#> 18   High Apartment High High High
#> 19    Low    Atrium High  Low High
#> 20 Medium    Atrium High High High
#> 21   High    Atrium High High High
#> 22    Low   Terrace High  Low  Low
#> 23 Medium   Terrace High  Low High
#> 24   High   Terrace High High High


# Classification with 2 classes  ***************

r.pima <- FitMod(diabetes ~ ., d.pima, fitfn="logit")
r.pima
#> 
#> Call:  glm(formula = diabetes ~ ., family = "binomial", data = d.pima)
#> 
#> Coefficients:
#> (Intercept)     pregnant      glucose     pressure      triceps      insulin  
#>   -8.404696     0.123182     0.035164    -0.013296     0.000619    -0.001192  
#>        mass     pedigree          age  
#>    0.089701     0.945180     0.014869  
#> 
#> Degrees of Freedom: 767 Total (i.e. Null);  759 Residual
#> Null Deviance:	    993.5 
#> Residual Deviance: 723.4 	AIC: 741.4
Conf(r.pima)
#> 
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction pos neg
#>        pos 156  55
#>        neg 112 445
#> 
#>                 Total n : 768
#>                Accuracy : 0.7826
#>                  95% CI : (0.7520, 0.8103)
#>     No Information Rate : 0.6510
#>     P-Value [Acc > NIR] : 1.37e-15
#> 
#>                   Kappa : 0.4966
#>  Mcnemar's Test P-Value : 1.47e-05
#> 
#>             Sensitivity : 0.5821
#>             Specificity : 0.8900
#>          Pos Pred Value : 0.7393
#>          Neg Pred Value : 0.7989
#>              Prevalence : 0.3490
#>          Detection Rate : 0.2747
#>    Detection Prevalence : 0.2031
#>       Balanced Accuracy : 0.7360
#>          F-val Accuracy : 0.6514
#>      Matthews Cor.-Coef : 0.5041
#> 
#>        'Positive' Class : pos
#> 
plot(ROC(r.pima))
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases
OddsRatio(r.pima)
#> 
#> Call:
#> glm(formula = diabetes ~ ., family = "binomial", data = d.pima)
#> 
#> Odds Ratios:
#>                or or.lci or.uci  Pr(>|z|)    
#> (Intercept) 0.000  0.000  0.001 < 2.2e-16 ***
#> pregnant    1.131  1.062  1.204  1.23e-04 ***
#> glucose     1.036  1.028  1.043 < 2.2e-16 ***
#> pressure    0.987  0.977  0.997    0.0111 *  
#> triceps     1.001  0.987  1.014    0.9285    
#> insulin     0.999  0.997  1.001    0.1861    
#> mass        1.094  1.062  1.127  2.76e-09 ***
#> pedigree    2.573  1.432  4.625    0.0016 ** 
#> age         1.015  0.997  1.034    0.1112    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
#> 
#> Brier Score: 0.153     Nagelkerke R2: 0.408
#> 


# rpart tree
rp.pima <- FitMod(diabetes ~ ., d.pima, fitfn="rpart")
rp.pima
#> n= 768 
#> 
#> node), split, n, loss, yval, (yprob)
#>       * denotes terminal node
#> 
#>   1) root 768 268 neg (0.65104167 0.34895833)  
#>     2) glucose< 127.5 485  94 neg (0.80618557 0.19381443)  
#>       4) age< 28.5 271  23 neg (0.91512915 0.08487085) *
#>       5) age>=28.5 214  71 neg (0.66822430 0.33177570)  
#>        10) mass< 26.35 41   2 neg (0.95121951 0.04878049) *
#>        11) mass>=26.35 173  69 neg (0.60115607 0.39884393)  
#>          22) glucose< 99.5 55  10 neg (0.81818182 0.18181818) *
#>          23) glucose>=99.5 118  59 neg (0.50000000 0.50000000)  
#>            46) pedigree< 0.561 84  34 neg (0.59523810 0.40476190)  
#>              92) pedigree< 0.2 21   4 neg (0.80952381 0.19047619) *
#>              93) pedigree>=0.2 63  30 neg (0.52380952 0.47619048)  
#>               186) pregnant>=1.5 52  21 neg (0.59615385 0.40384615)  
#>                 372) pressure>=67 40  12 neg (0.70000000 0.30000000) *
#>                 373) pressure< 67 12   3 pos (0.25000000 0.75000000) *
#>               187) pregnant< 1.5 11   2 pos (0.18181818 0.81818182) *
#>            47) pedigree>=0.561 34   9 pos (0.26470588 0.73529412) *
#>     3) glucose>=127.5 283 109 pos (0.38515901 0.61484099)  
#>       6) mass< 29.95 76  24 neg (0.68421053 0.31578947)  
#>        12) glucose< 145.5 41   6 neg (0.85365854 0.14634146) *
#>        13) glucose>=145.5 35  17 pos (0.48571429 0.51428571)  
#>          26) insulin< 14.5 21   8 neg (0.61904762 0.38095238) *
#>          27) insulin>=14.5 14   4 pos (0.28571429 0.71428571) *
#>       7) mass>=29.95 207  57 pos (0.27536232 0.72463768)  
#>        14) glucose< 157.5 115  45 pos (0.39130435 0.60869565)  
#>          28) age< 30.5 50  23 neg (0.54000000 0.46000000)  
#>            56) pressure>=61 40  13 neg (0.67500000 0.32500000)  
#>             112) mass< 41.8 31   7 neg (0.77419355 0.22580645) *
#>             113) mass>=41.8 9   3 pos (0.33333333 0.66666667) *
#>            57) pressure< 61 10   0 pos (0.00000000 1.00000000) *
#>          29) age>=30.5 65  18 pos (0.27692308 0.72307692) *
#>        15) glucose>=157.5 92  12 pos (0.13043478 0.86956522) *
Conf(rp.pima)
#> 
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction neg pos
#>        neg 449  72
#>        pos  51 196
#> 
#>                 Total n : 768
#>                Accuracy : 0.8398
#>                  95% CI : (0.8122, 0.8641)
#>     No Information Rate : 0.6510
#>     P-Value [Acc > NIR] : < 2.2e-16
#> 
#>                   Kappa : 0.6410
#>  Mcnemar's Test P-Value : 0.0713
#> 
#>             Sensitivity : 0.8980
#>             Specificity : 0.7313
#>          Pos Pred Value : 0.8618
#>          Neg Pred Value : 0.7935
#>              Prevalence : 0.6510
#>          Detection Rate : 0.6784
#>    Detection Prevalence : 0.5846
#>       Balanced Accuracy : 0.8147
#>          F-val Accuracy : 0.8795
#>      Matthews Cor.-Coef : 0.6422
#> 
#>        'Positive' Class : neg
#> 
lines(ROC(rp.pima), col=hblue)
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases

# to be improved
plot(rp.pima, col=SetAlpha(c("blue","red"), 0.4), cex=0.7)


# Random Forest
rf.pima <- FitMod(diabetes ~ ., d.pima, method="class", fitfn="randomForest")
rf.pima
#> 
#> Call:
#>  randomForest(formula = diabetes ~ ., data = d.pima, method = "class",      na.action = function (object, ...)      UseMethod("na.omit")) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 2
#> 
#>         OOB estimate of  error rate: 23.05%
#> Confusion matrix:
#>     neg pos class.error
#> neg 430  70   0.1400000
#> pos 107 161   0.3992537
Conf(rf.pima)
#> 
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction neg pos
#>        neg 430 107
#>        pos  70 161
#> 
#>                 Total n : 768
#>                Accuracy : 0.7695
#>                  95% CI : (0.7384, 0.7979)
#>     No Information Rate : 0.6510
#>     P-Value [Acc > NIR] : 7.16e-13
#> 
#>                   Kappa : 0.4760
#>  Mcnemar's Test P-Value : 0.0068
#> 
#>             Sensitivity : 0.8600
#>             Specificity : 0.6007
#>          Pos Pred Value : 0.8007
#>          Neg Pred Value : 0.6970
#>              Prevalence : 0.6510
#>          Detection Rate : 0.6992
#>    Detection Prevalence : 0.5599
#>       Balanced Accuracy : 0.7304
#>          F-val Accuracy : 0.8293
#>      Matthews Cor.-Coef : 0.4789
#> 
#>        'Positive' Class : neg
#> 
lines(ROC(r.pima), col=hred)
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases




# more models to compare

d.pim <- SplitTrainTest(d.pima, p = 0.2)
mdiab <- formula(diabetes ~ pregnant + glucose + pressure + triceps
                            + insulin + mass + pedigree + age)

r.glm <- FitMod(mdiab, data=d.pim$train, fitfn="logit")
r.rp <- FitMod(mdiab, data=d.pim$train, fitfn="rpart")
r.rf <- FitMod(mdiab, data=d.pim$train, fitfn="randomForest")
r.svm <- FitMod(mdiab, data=d.pim$train, fitfn="svm")
r.c5 <- FitMod(mdiab, data=d.pim$train, fitfn="C5.0")
r.nn <- FitMod(mdiab, data=d.pim$train, fitfn="nnet")
r.nb <- FitMod(mdiab, data=d.pim$train, fitfn="naive_bayes")
r.lda <- FitMod(mdiab, data=d.pim$train, fitfn="lda")
r.qda <- FitMod(mdiab, data=d.pim$train, fitfn="qda")
r.lb <- FitMod(mdiab, data=d.pim$train, fitfn="lb")

mods <- list(glm=r.glm, rp=r.rp, rf=r.rf, svm=r.svm, c5=r.c5
             , nn=r.nn, nb=r.nb, lda=r.lda, qda=r.qda, lb=r.lb)

# insight in the Regression tree
plot(r.rp, box.palette = as.list(Pal("Helsana", alpha = 0.5)))


# Insample accuracy ...
TModC(mods, ord="auc")
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases
#> Error in eval(object$call$data): object 'd.pim' not found
# ... is substantially different from the out-of-bag:
TModC(mods, newdata=d.pim$test, reference=d.pim$test$diabetes, ord="bs")
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases
#> Warning: predict.naive_bayes(): more features in the newdata are provided as there are probability tables in the object. Calculation is performed based on features to be found in the tables.
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases
#> Error in eval(x$call$formula): object 'mdiab' not found
# C5 and SVM turn out to be show-offs! They overfit quite ordinary
# whereas randomforest and logit keep their promises. ...

sapply(mods, function(z) VarImp(z))
#> Warning: no VarImp definition found for x (class FitMod, naive_bayes)
#> Warning: no VarImp definition found for x (class FitMod, lda)
#> Warning: no VarImp definition found for x (class FitMod, qda)
#> Warning: no VarImp definition found for x (class FitMod, LogitBoost)
#> $glm
#>          Overall
#> pregnant   3.210
#> glucose    8.176
#> pressure   2.437
#> triceps    0.056
#> insulin    1.642
#> mass       5.010
#> pedigree   2.970
#> age        1.687
#> 
#> $rp
#>          varimp
#> glucose  57.127
#> mass     17.811
#> pressure 13.715
#> age      13.531
#> insulin  12.599
#> pregnant  2.970
#> triceps   2.931
#> pedigree  0.866
#> 
#> $rf
#>          varimp
#> glucose  68.585
#> mass     44.600
#> age      38.308
#> pedigree 35.697
#> pressure 25.106
#> pregnant 24.142
#> triceps  20.404
#> insulin  18.435
#> 
#> $svm
#>      x
#> 1 <NA>
#> 
#> $c5
#>          Overall
#> glucose  100.000
#> mass     100.000
#> age       66.180
#> triceps   20.000
#> pregnant  18.860
#> pedigree  15.610
#> pressure   7.320
#> insulin    0.000
#> 
#> $nn
#>          Overall
#> triceps   17.802
#> pedigree  16.997
#> glucose   15.842
#> pressure  14.737
#> pregnant  12.023
#> age        8.704
#> insulin    7.800
#> mass       6.095
#> 
#> $nb
#>      x
#> 1 <NA>
#> 
#> $lda
#>      x
#> 1 <NA>
#> 
#> $qda
#>      x
#> 1 <NA>
#> 
#> $lb
#>      x
#> 1 <NA>
#> 


# Multinomial classification problem with n classes  ***************

d.gl <- SplitTrainTest(d.glass, p = 0.2)
mglass <- formula(Type ~ RI + Na + Mg + Al + Si + K + Ca + Ba + Fe)

# *** raises an unclear error in CRAN-Debian tests *** ??
# r.mult <- FitMod(mglass, data=d.gl$train, maxit=600, fitfn="multinom")
r.rp <- FitMod(mglass, data=d.gl$train, fitfn="rpart")
r.rf <- FitMod(mglass, data=d.gl$train, fitfn="randomForest")
r.svm <- FitMod(mglass, data=d.gl$train, fitfn="svm")
r.c5 <- FitMod(mglass, data=d.gl$train, fitfn="C5.0")
r.nn <- FitMod(mglass, data=d.gl$train, fitfn="nnet")
r.nbay <- FitMod(mglass, data=d.gl$train, fitfn="naive_bayes")
r.lda <- FitMod(mglass, data=d.gl$train, fitfn="lda")
# r.qda <- FitMod(mglass, data=d.glass, fitfn="qda")
r.lb <- FitMod(mglass, data=d.gl$train, fitfn="lb")

mods <- list(rp=r.rp, rf=r.rf, svm=r.svm, c5=r.c5,
             nn=r.nn, nbay=r.nbay, lda=r.lda, lb=r.lb)

# confusion matrix and other quality measures can be calculated with Conf()
Conf(r.rf)
#> 
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  1  2  3  5  6  7
#>          1 51 10  6  0  0  1
#>          2  6 52  2  3  1  3
#>          3  1  0  5  0  0  0
#>          5  0  1  0  6  0  0
#>          6  0  0  0  0  6  0
#>          7  0  1  0  1  0 16
#> 
#> Overall Statistics
#> 
#>                 Total n : 172
#>                Accuracy : 0.7907
#>                  95% CI : (0.7239, 0.8448)
#>     No Information Rate : 0.3721
#>     P-Value [Acc > NIR] : < 2.2e-16
#> 
#>                   Kappa : 0.7023
#>  Mcnemar's Test P-Value : NA
#> 
#> 
#> Statistics by Class:
#> 
#>                           1      2      3      5      6      7
#> Sensitivity          0.8793 0.8125 0.3846 0.6000 0.8571 0.8000
#> Specificity          0.8509 0.8611 0.9937 0.9938 1.0000 0.9868
#> Pos Pred Value       0.7500 0.7761 0.8333 0.8571 1.0000 0.8889
#> Neg Pred Value       0.9327 0.8857 0.9518 0.9758 0.9940 0.9740
#> Prevalence           0.3372 0.3721 0.0756 0.0581 0.0407 0.1163
#> Detection Rate       0.3953 0.3895 0.0349 0.0407 0.0349 0.1047
#> Detection Prevalence 0.2965 0.3023 0.0291 0.0349 0.0349 0.0930
#> Balanced Accuracy    0.8651 0.8368 0.6892 0.7969 0.9286 0.8934
#> F-val Accuracy       0.8095 0.7939 0.5263 0.7059 0.9231 0.8421
#> Matthews Cor.-Coef   0.7060 0.6677 0.5450 0.7033 0.9230 0.8240
#> 

# we only extract the general accuracy
sapply(lapply(mods, function(z) Conf(z)), "[[", "acc")
#> Error in eval(object$call$data): object 'd.gl' not found

# let's compare r.mult with a model without RI as predictor
# Conf(r.mult)
# Conf(update(r.mult, . ~ . -RI))