GTest performs G-tests (likelihood-ratio tests) of independence for contingency tables and G-based goodness-of-fit tests.

GTest(x, y = NULL, correct = c("none", "williams", "yates"),
      p = rep(1/length(x), length(x)), rescale.p = FALSE)

Arguments

x

a numeric vector or matrix. x and y can also both be factors.

y

a numeric vector; ignored if x is a matrix. If x is a factor, y should be a factor of the same length.

correct

one of "none" (default), "williams" or "yates". See Details.

p

a vector of probabilities of the same length as x. An error is given if any entry of p is negative.

rescale.p

a logical scalar; if TRUE then p is rescaled (if necessary) to sum to 1. If rescale.p is FALSE, and p does not sum to 1, an error is given.

Details

The G-test is also called the "likelihood ratio test" and is asymptotically equivalent to Pearson's chi-squared test, although it is not usually used when analyzing 2x2 tables. It is used in logistic regression and log-linear modeling, which involve contingency tables. The G-test is also reported in the standard summary of Desc for tables.
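
As a quick sketch of this equivalence (illustrative only, assuming DescTools is attached; the table is the Agresti example reused from the Examples below), the uncorrected G statistic can be put next to Pearson's statistic:

## Compare the uncorrected G statistic with Pearson's chi-squared on the
## same 2x3 table; the two values are close (roughly 30.0 vs. 30.1 here)
M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
GTest(M, correct = "none")$statistic
chisq.test(M, correct = FALSE)$statistic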

If x is a matrix with one row or column, or if x is a vector and y is not given, then a goodness-of-fit test is performed (x is treated as a one-dimensional contingency table). The entries of x must be non-negative integers. In this case, the hypothesis tested is whether the population probabilities equal those in p, or are all equal if p is not given.
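
As a sketch of what is computed in this case (assuming the usual likelihood-ratio form of the statistic; the numbers reproduce the G = 2.5267 shown in the Examples below):

## Goodness of fit by hand: G = 2 * sum(O * log(O / E)) with df = k - 1
x <- c(A = 20, B = 15, C = 25)
p <- rep(1 / length(x), length(x))                 # the default: all equal
E <- sum(x) * p                                    # expected counts
G <- 2 * sum(x * log(x / E))                       # 2.5267
pchisq(G, df = length(x) - 1, lower.tail = FALSE)  # p-value 0.2827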

If x is a matrix with at least two rows and columns, it is taken as a two-dimensional contingency table: the entries of x must be non-negative integers. Otherwise, x and y must be vectors or factors of the same length; cases with missing values are removed, the objects are coerced to factors, and the contingency table is computed from these. A G-test is then performed on the null hypothesis that the joint distribution of the cell counts in the two-dimensional contingency table is the product of the row and column marginals, i.e. that the row and column variables are independent.
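
Sketched in code (again assuming the standard likelihood-ratio form; the expected counts and G value agree with the Agresti table in the Examples below):

## Test of independence by hand: E is the product of the marginals over n,
## and df = (number of rows - 1) * (number of columns - 1)
O  <- rbind(c(762, 327, 468), c(484, 239, 477))
E  <- outer(rowSums(O), colSums(O)) / sum(O)
G  <- 2 * sum(O * log(O / E))                      # 30.017
df <- (nrow(O) - 1) * (ncol(O) - 1)                # 2
pchisq(G, df, lower.tail = FALSE)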

For the test of independence, the Yates' correction is taken from Mike Camann's 2x2 G-test function. For the goodness-of-fit test, the Yates' correction is as described in Zar (2000).
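
The corrections are requested via the correct argument; as a small illustrative sketch (the 2x2 counts below are invented):

## Same table under the three settings of 'correct'; a 2x2 table is where
## a correction is most commonly applied
tab <- as.table(rbind(c(12, 5), c(7, 18)))
GTest(tab, correct = "none")
GTest(tab, correct = "williams")
GTest(tab, correct = "yates")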

Value

A list with class "htest" containing the following components:

statistic

the value of the test statistic (G).

parameter

the degrees of freedom of the approximate chi-squared distribution of the test statistic.

p.value

the p-value for the test.

method

a character string indicating the type of test performed and which correction, if any, was applied.

data.name

a character string giving the name(s) of the data.

observed

the observed counts.

expected

the expected counts under the null hypothesis.

Author

Pete Hurd <phurd@ualberta.ca>, Andri Signorell <andri@signorell.net> (tiny tweaks)

See also

chisq.test

References

Hope, A. C. A. (1968) A simplified Monte Carlo significance test procedure. J. Roy. Statist. Soc. B 30, 582--598.

Patefield, W. M. (1981) Algorithm AS159. An efficient method of generating r x c tables with given row and column totals. Applied Statistics 30, 91--97.

Agresti, A. (2007) An Introduction to Categorical Data Analysis, 2nd ed., New York: John Wiley & Sons. Page 38.

Sokal, R. R., F. J. Rohlf (2012) Biometry: the principles and practice of statistics in biological research. 4th edition. W. H. Freeman and Co.: New York. 937 pp.

Examples


## From Agresti (2007), p. 39
M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(M) <- list(gender=c("M","F"),
                    party=c("Democrat","Independent", "Republican"))

(Xsq <- GTest(M))   # Prints test summary
#> 
#> 	Log likelihood ratio (G-test) test of independence without correction
#> 
#> data:  M
#> G = 30.017, X-squared df = 2, p-value = 0.0000003034
#> 

Xsq$observed        # observed counts (same as M)
#>       party
#> gender Democrat Independent Republican
#>      M      762         327        468
#>      F      484         239        477
Xsq$expected        # expected counts under the null
#>   Democrat Independent Republican
#> M 703.6714    319.6453   533.6834
#> F 542.3286    246.3547   411.3166


## Testing for population probabilities
## Case A. Tabulated data
x <- c(A = 20, B = 15, C = 25)
GTest(x)
#> 
#> 	Log likelihood ratio (G-test) goodness of fit test
#> 
#> data:  x
#> G = 2.5267, X-squared df = 2, p-value = 0.2827
#> 
GTest(as.table(x))             # the same
#> 
#> 	Log likelihood ratio (G-test) goodness of fit test
#> 
#> data:  as.table(x)
#> G = 2.5267, X-squared df = 2, p-value = 0.2827
#> 
x <- c(89,37,30,28,2)
p <- c(40,20,20,15,5)
try(
GTest(x, p = p)                # gives an error
)
#> Error in GTest(x, p = p) : probabilities must sum to 1.
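
# (illustrative, not from the original example set) rescale.p = TRUE
# rescales p to sum to 1, i.e. uses p/sum(p) = c(0.40, 0.20, 0.20, 0.15, 0.05)
GTest(x, p = p, rescale.p = TRUE)
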
# works
p <- c(0.40,0.20,0.20,0.19,0.01)
# Expected count in category 5
# is 1.86 < 5 ==> chi square approx.
GTest(x, p = p)                # maybe doubtful, but is ok!
#> 
#> 	Log likelihood ratio (G-test) goodness of fit test
#> 
#> data:  x
#> G = 5.8414, X-squared df = 4, p-value = 0.2113
#> 

## Case B. Raw data
x <- trunc(5 * runif(100))
GTest(table(x))                # NOT 'GTest(x)'!
#> 
#> 	Log likelihood ratio (G-test) goodness of fit test
#> 
#> data:  table(x)
#> G = 5.2361, X-squared df = 4, p-value = 0.2639
#>
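
## (illustration, not from the original examples) x and y can also be supplied
## as two vectors/factors of the same length; they are tabulated internally,
## so this is equivalent to GTest(table(g, pref))
set.seed(1)
g    <- factor(sample(c("M", "F"), 200, replace = TRUE))
pref <- factor(sample(c("A", "B", "C"), 200, replace = TRUE))
GTest(g, pref)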