Zero-inflated Count Data Regression
zeroinfl.Rd
Fit zero-inflated regression models for count data via maximum likelihood.
Usage
zeroinfl(formula, data, subset, na.action, weights, offset,
  dist = c("poisson", "negbin", "geometric"),
  link = c("logit", "probit", "cloglog", "cauchit", "log"),
  control = zeroinfl.control(...),
  model = TRUE, y = TRUE, x = FALSE, ...)
Arguments
- formula
symbolic description of the model, see details.
- data, subset, na.action
arguments controlling formula processing via model.frame.
- weights
optional numeric vector of weights.
- offset
optional numeric vector with an a priori known component to be included in the linear predictor of the count model. See below for more information on offsets.
- dist
character specification of count model family (a log link is always used).
- link
character specification of link function in the binary zero-inflation model (a binomial family is always used).
- control
a list of control arguments specified via zeroinfl.control.
- model, y, x
logicals. If TRUE the corresponding components of the fit (model frame, response, model matrix) are returned.
- ...
arguments passed to zeroinfl.control in the default setup.
Details
Zero-inflated count models are two-component mixture models combining a point mass at zero with a proper count distribution. Thus, there are two sources of zeros: zeros may come from both the point mass and from the count component. Usually the count model is a Poisson or negative binomial regression (with log link). The geometric distribution is a special case of the negative binomial with size parameter equal to 1. For modeling the unobserved state (zero vs. count), a binary model is used that captures the probability of zero inflation, in the simplest case only with an intercept but potentially containing regressors. For this zero-inflation model, a binomial model with different links can be used, typically logit or probit.
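The mixture described above can be written out as follows. The symbols are introduced here for illustration only and are not taken from the package: \(\pi\) is the zero-inflation probability, \(f\) the count density with mean \(\mu\), \(x\) and \(z\) the regressors of the count and zero-inflation component, and \(g\) the link of the binary model.

\[
P(Y = y) = \pi \cdot I(y = 0) + (1 - \pi) \cdot f(y; \mu), \qquad
\log(\mu) = x^\top \beta, \qquad
g(\pi) = z^\top \gamma
\]

In particular, \(P(Y = 0) = \pi + (1 - \pi) \cdot f(0; \mu)\), which shows the two sources of zeros, and the mean of the mixture is \(E(Y) = (1 - \pi) \cdot \mu\).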
The formula can be used to specify both components of the model: If a formula of type y ~ x1 + x2 is supplied, then the same regressors are employed in both components. This is equivalent to y ~ x1 + x2 | x1 + x2. Of course, a different set of regressors could be specified for the count and zero-inflation component, e.g., y ~ x1 + x2 | z1 + z2 + z3 giving the count data model y ~ x1 + x2 conditional on (|) the zero-inflation model y ~ z1 + z2 + z3. A simple inflation model where all zero counts have the same probability of belonging to the zero component can be specified by the formula y ~ x1 + x2 | 1.
Offsets can be specified in both the count and the zero-inflation component of the model: y ~ x1 + offset(x2) | z1 + z2 + offset(z3), where x2 is used as an offset (i.e., with coefficient fixed to 1) in the count component and z3 analogously in the zero-inflation component. By the rule stated above y ~ x1 + offset(x2) is expanded to y ~ x1 + offset(x2) | x1 + offset(x2). Instead of using the offset() wrapper within the formula, the offset argument can also be employed, which sets an offset only for the count model. Thus, formula = y ~ x1 and offset = x2 is equivalent to formula = y ~ x1 + offset(x2) | x1.
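As an illustration of the offset rule, the formula y ~ x1 + offset(x2) | z1 + z2 + offset(z3) corresponds to the following pair of linear predictors. The notation is an assumption made here for exposition: \(\mu\) is the mean of the count component, \(\pi\) the zero-inflation probability, \(g\) the chosen link, and both components carry the default intercept.

\[
\log(\mu) = \beta_0 + \beta_1 x_1 + x_2, \qquad
g(\pi) = \gamma_0 + \gamma_1 z_1 + \gamma_2 z_2 + z_3
\]

The offsets \(x_2\) and \(z_3\) enter without an estimated coefficient, which is what "coefficient fixed to 1" means.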
All parameters are estimated by maximum likelihood using optim, with control options set in zeroinfl.control. Starting values can be supplied, estimated by the EM (expectation maximization) algorithm, or by glm.fit (the default). Standard errors are derived numerically using the Hessian matrix returned by optim. See zeroinfl.control for details.
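The estimation idea can be illustrated with a minimal base R sketch for a zero-inflated Poisson model with an intercept-only zero component. This is not the code used inside zeroinfl; the simulated data and all object names (negloglik, fit, est, se) are hypothetical and chosen only for this example.

```r
## Minimal sketch: zero-inflated Poisson fitted by direct maximum likelihood.
## Illustrative only; this is not the zeroinfl() implementation.
set.seed(1)
n   <- 500
x   <- rnorm(n)
mu  <- exp(0.5 + 0.8 * x)   # count mean (log link)
pi0 <- plogis(-1)           # constant zero-inflation probability (logit link)
y   <- ifelse(runif(n) < pi0, 0L, rpois(n, mu))

X <- cbind(1, x)            # model matrix of the count component
Z <- matrix(1, n, 1)        # model matrix of the zero component (like "| 1")

## negative log-likelihood of the two-component mixture
negloglik <- function(par) {
  beta  <- par[1:2]
  gamma <- par[3]
  mu <- exp(drop(X %*% beta))
  p  <- plogis(drop(Z %*% gamma))
  ll <- ifelse(y == 0,
               log(p + (1 - p) * exp(-mu)),            # zeros: both sources
               log1p(-p) + dpois(y, mu, log = TRUE))   # positive counts
  -sum(ll)
}

## minimize with optim() and request the numerical Hessian
fit <- optim(c(0, 0, 0), negloglik, method = "BFGS", hessian = TRUE)
est <- fit$par                            # (count intercept, count slope, zero intercept)
se  <- sqrt(diag(solve(fit$hessian)))     # standard errors from the inverse Hessian
```

The Hessian of the negative log-likelihood at the optimum is inverted to obtain an approximate covariance matrix, mirroring the description above of how standard errors are derived.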
The returned fitted model object is of class "zeroinfl" and is similar to fitted "glm" objects. For elements such as "coefficients" or "terms" a list is returned with elements for the zero and count component, respectively. For details see below.
A set of standard extractor functions for fitted model objects is available for objects of class "zeroinfl", including methods for the generic functions print, summary, coef, vcov, logLik, residuals, predict, fitted, terms, model.matrix. See predict.zeroinfl for more details on all methods.
Value
An object of class "zeroinfl", i.e., a list with components including
- coefficients
a list with elements "count" and "zero" containing the coefficients from the respective models,
- residuals
a vector of raw residuals (observed - fitted),
- fitted.values
a vector of fitted means,
- optim
a list with the output from the optim call for minimizing the negative log-likelihood,
- control
the control arguments passed to the optim call,
- start
the starting values for the parameters passed to the optim call,
- weights
the case weights used,
- offset
a list with elements "count" and "zero" containing the offset vectors (if any) from the respective models,
- n
number of observations (with weights > 0),
- df.null
residual degrees of freedom for the null model (= n - 2),
- df.residual
residual degrees of freedom for fitted model,
- terms
a list with elements "count", "zero" and "full" containing the terms objects for the respective models,
- theta
estimate of the additional \(\theta\) parameter of the negative binomial model (if a negative binomial regression is used),
- SE.logtheta
standard error for \(\log(\theta)\),
- loglik
log-likelihood of the fitted model,
- vcov
covariance matrix of all coefficients in the model (derived from the Hessian of the optim output),
- dist
character string describing the count distribution used,
- link
character string describing the link of the zero-inflation model,
- linkinv
the inverse link function corresponding to link,
- converged
logical indicating successful convergence of optim,
- call
the original function call,
- formula
the original formula,
- levels
levels of the categorical regressors,
- contrasts
a list with elements "count" and "zero" containing the contrasts corresponding to levels from the respective models,
- model
the full model frame (if model = TRUE),
- y
the response count vector (if y = TRUE),
- x
a list with elements "count" and "zero" containing the model matrices from the respective models (if x = TRUE).
References
Cameron, A. Colin and Pravin K. Trivedi. 1998. Regression Analysis of Count Data. New York: Cambridge University Press.
Cameron, A. Colin and Pravin K. Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press.
Lambert, Diane. 1992. “Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing.” Technometrics. 34(1):1-14.
Zeileis, Achim, Christian Kleiber and Simon Jackman. 2008. “Regression Models for Count Data in R.” Journal of Statistical Software. 27(8). URL https://www.jstatsoft.org/v27/i08/.
Examples
## data
data("bioChemists", package = "ModTools")
## without inflation
## ("art ~ ." is "art ~ fem + mar + kid5 + phd + ment")
fm_pois <- glm(art ~ ., data = bioChemists, family = poisson)
fm_qpois <- glm(art ~ ., data = bioChemists, family = quasipoisson)
fm_nb <- MASS::glm.nb(art ~ ., data = bioChemists)
## with simple inflation (no regressors for zero component)
fm_zip <- zeroinfl(art ~ . | 1, data = bioChemists)
fm_zinb <- zeroinfl(art ~ . | 1, data = bioChemists, dist = "negbin")
## inflation with regressors
## ("art ~ . | ." is "art ~ fem + mar + kid5 + phd + ment | fem + mar + kid5 + phd + ment")
fm_zip2 <- zeroinfl(art ~ . | ., data = bioChemists)
fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")