Cramer's V, Pearson's Contingency Coefficient and Phi Coefficient, Yule's Q and Y, Tschuprow's T

Calculate Cramer's V, Pearson's contingency coefficient, phi, Yule's Q and Y, and Tschuprow's T for x, if x is a table. If both x and y are given, the corresponding table is built first.

Phi(x, y = NULL, ...)
ContCoef(x, y = NULL, correct = FALSE, ...)
CramerV(x, y = NULL, conf.level = NA,
        method = c("ncchisq", "ncchisqadj", "fisher", "fisheradj"), 
        correct = FALSE, ...)

YuleQ(x, y = NULL, ...)
YuleY(x, y = NULL, ...)
TschuprowT(x, y = NULL, correct = FALSE, ...)

Arguments

x: can be a numeric vector, a matrix or a table.
y: NULL (default) or a vector with dimensions compatible with x. If y is provided, table(x, y, ...) is calculated.
conf.level: confidence level of the interval. This is only implemented for Cramer's V. If set to NA (which is the default) no confidence interval will be calculated.
See examples for calculating bootstrap intervals.
method: string defining the method used to calculate confidence intervals for Cramer's V. One of "ncchisq" (using the noncentral chi-square distribution), "ncchisqadj", "fisher" (using the Fisher z transformation), "fisheradj" (using the Fisher z transformation and bias correction). Default is "ncchisq".
correct: logical. For ContCoef() this indicates whether Sakoda's adjusted Pearson's C should be returned. For CramerV() and TschuprowT() it defines whether a bias correction should be applied. Default is FALSE.
...: further arguments are passed to the function table, allowing e.g. useNA to be set.
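
As a rough illustration of what correct = TRUE refers to, the following base R sketch computes Sakoda's adjusted C (C divided by its maximum for the given table size) and a bias-corrected Cramer's V following the formula in Bergsma (2013). The helper names chisq_stat, sakoda_c and bergsma_v are hypothetical, and the code is not taken from the package source, so results may differ in detail from ContCoef() and CramerV().

```r
# Hypothetical helpers for illustration only; not part of the package.
chisq_stat <- function(tab) {
  # Pearson chi-square statistic without continuity correction
  unname(suppressWarnings(chisq.test(tab, correct = FALSE)$statistic))
}

sakoda_c <- function(tab) {
  n    <- sum(tab)
  k    <- min(dim(tab))
  chi2 <- chisq_stat(tab)
  cc   <- sqrt(chi2 / (n + chi2))   # Pearson's C
  cc / sqrt((k - 1) / k)            # rescale by the maximum attainable C
}

bergsma_v <- function(tab) {
  n  <- sum(tab)
  nr <- nrow(tab)
  nc <- ncol(tab)
  phi2     <- chisq_stat(tab) / n
  # subtract the expected value of phi^2 under independence, floor at 0
  phi2_adj <- max(0, phi2 - (nr - 1) * (nc - 1) / (n - 1))
  nr_adj   <- nr - (nr - 1)^2 / (n - 1)
  nc_adj   <- nc - (nc - 1)^2 / (n - 1)
  sqrt(phi2_adj / min(nr_adj - 1, nc_adj - 1))
}

tab <- as.table(rbind(c(26, 26, 23, 18, 9),
                      c( 6,  7,  9, 14, 23)))
sakoda_c(tab)
bergsma_v(tab)
```

For this table the bias-corrected value is smaller than the uncorrected Cramer's V, which is the expected direction of the correction.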

Details

For x either a matrix or two vectors x and y are expected. In the latter case table(x, y, ...) is calculated. The function handles NAs the same way the table function does, so by default tables are calculated with NAs omitted.
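
The effect of the NA handling can be seen with table alone. This is a small sketch on made-up vectors (x and y below are illustrative data, not from the package):

```r
x <- c("a", "b", NA,  "a", "b", "a")
y <- c("u", "v", "u", NA,  "v", "u")

tab_default <- table(x, y)                   # pairs containing an NA are dropped
tab_with_na <- table(x, y, useNA = "ifany")  # NA is kept as its own level

sum(tab_default)   # number of complete pairs
sum(tab_with_na)   # all observations
dim(tab_with_na)   # one extra row and column for NA
```

Passing useNA = "ifany" through ... therefore changes both the table dimensions and the sample size entering the association measure.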

A matrix passed as x is interpreted as a contingency table, which is the natural interpretation for frequency data (chisq.test, for example, expects the same).

Use the function PairApply (pairwise apply) if the measure should be calculated pairwise for all columns. This allows matrices of association measures to be calculated the same way cor does. NAs are by default omitted pairwise, which corresponds to use = "pairwise.complete.obs" in cor. Use complete.cases if only the complete cases of a data.frame are to be used (see examples).
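
The difference between pairwise omission and complete cases can be sketched in base R. The helpers cramer_v and pairwise_assoc and the data frame d are hypothetical stand-ins written for this illustration; they are not the package's CramerV or PairApply.

```r
# Hypothetical stand-in for an uncorrected Cramer's V on two vectors.
cramer_v <- function(x, y) {
  tab  <- table(x, y)  # NAs are dropped for this pair only
  chi2 <- unname(suppressWarnings(chisq.test(tab, correct = FALSE)$statistic))
  sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))
}

# Hypothetical stand-in for a pairwise apply over all column pairs.
pairwise_assoc <- function(df, fun) {
  p <- ncol(df)
  m <- matrix(NA_real_, p, p, dimnames = list(names(df), names(df)))
  for (i in seq_len(p)) {
    for (j in seq_len(p)) {
      m[i, j] <- fun(df[[i]], df[[j]])
    }
  }
  m
}

d <- data.frame(
  a = c("x", "y", "x", "y", "x", "y", NA,  "x"),
  b = c("p", "q", "p", "q", "q", "p", "p", NA),
  c = c("k", "k", "l", "l", "k", "l", "l", "k")
)

m_pairwise <- pairwise_assoc(d, cramer_v)                      # NAs omitted per pair
m_complete <- pairwise_assoc(d[complete.cases(d), ], cramer_v) # only complete rows
m_pairwise
m_complete
```

With pairwise omission each cell of the matrix may be based on a different number of observations, whereas the complete-case version uses the same rows throughout.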

The maximum value for Phi is \(\sqrt{\min(r, c) - 1}\). The contingency coefficient ranges from 0 to \(\sqrt{\frac{\min(r, c) - 1}{\min(r, c)}}\). For the corrected contingency coefficient and for Cramer's V the range is 0 to 1.
A Cramer's V in [0, 0.3) is considered weak, in [0.3, 0.7] medium, and above 0.7 strong. Under statistical independence the chi-square-based measures (Phi, the contingency coefficient, Cramer's V and Tschuprow's T) attain their minimum value of 0. Yule's Q and Y are also 0 under independence but can take negative values.
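
The ranges above follow from how the measures are built on the Pearson chi-square statistic. The sketch below uses the standard textbook definitions in base R; it is an illustration under that assumption rather than the package's own code, and the variable names are arbitrary.

```r
tab <- as.table(rbind(c(26, 26, 23, 18, 9),
                      c( 6,  7,  9, 14, 23)))

n  <- sum(tab)
nr <- nrow(tab)
nc <- ncol(tab)
k  <- min(nr, nc)

# Pearson chi-square statistic without continuity correction
chi2 <- unname(suppressWarnings(chisq.test(tab, correct = FALSE)$statistic))

phi <- sqrt(chi2 / n)                                # Phi
cc  <- sqrt(chi2 / (n + chi2))                       # Pearson's C
v   <- sqrt(chi2 / (n * (k - 1)))                    # Cramer's V
tt  <- sqrt(chi2 / (n * sqrt((nr - 1) * (nc - 1))))  # Tschuprow's T

phi_max <- sqrt(k - 1)        # upper bound of Phi
cc_max  <- sqrt((k - 1) / k)  # upper bound of Pearson's C

c(phi = phi, C = cc, V = v, T = tt)
c(phi_max = phi_max, C_max = cc_max)
```

For a table with two rows, as here, min(r, c) - 1 equals 1, so Phi and Cramer's V coincide, while Tschuprow's T is smaller because it divides by the square root of (r - 1)(c - 1).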

Value

A single numeric value if no confidence interval is requested,
otherwise a numeric vector with 3 elements: the estimate, the lower and the upper confidence limit.

References

Yule, G. Udny (1912) On the methods of measuring association between two attributes. Journal of the Royal Statistical Society, LXXV, 579-652

Tschuprow, A. A. (1939) Principles of the Mathematical Theory of Correlation, translated by M. Kantorowitsch. W. Hodge & Co.

Cramer, H. (1946) Mathematical Methods of Statistics. Princeton University Press

Agresti, Alan (1996) Introduction to categorical data analysis. NY: John Wiley and Sons

Sakoda, J.M. (1977) Measures of Association for Multivariate Contingency Tables, Proceedings of the Social Statistics Section of the American Statistical Association (Part III), 777-780.

Smithson, M.J. (2003) Confidence Intervals, Quantitative Applications in the Social Sciences Series, No. 140. Thousand Oaks, CA: Sage. pp. 39-41

Bergsma, W. (2013) A bias-correction for Cramer's V and Tschuprow's T. Journal of the Korean Statistical Society 42(3). DOI: 10.1016/j.jkss.2012.10.002

Author

Andri Signorell <andri@signorell.net>,
Michael Smithson <michael.smithson@anu.edu.au> (confidence intervals for Cramer's V)

Examples

tab <- table(d.pizza$driver, d.pizza$wine_delivered)
Phi(tab)
#> [1] 0.1328222
ContCoef(tab)
#> [1] 0.1316659
CramerV(tab)
#> [1] 0.1328222
TschuprowT(tab)
#> [1] 0.08486583

# just x and y
CramerV(d.pizza$driver, d.pizza$wine_delivered)
#> [1] 0.1328222

# data.frame
PairApply(d.pizza[,c("driver","operator","area")], CramerV, symmetric = TRUE)
#>             driver   operator       area
#> driver   1.0000000 0.23585686 0.65018461
#> operator 0.2358569 1.00000000 0.08670047
#> area     0.6501846 0.08670047 1.00000000


# useNA is passed to table
PairApply(d.pizza[,c("driver","operator","area")], CramerV,
          useNA="ifany", symmetric = TRUE)
#>             driver   operator       area
#> driver   1.0000000 0.20253639 0.53066544
#> operator 0.2025364 1.00000000 0.07847762
#> area     0.5306654 0.07847762 1.00000000

d.frm <- d.pizza[,c("driver","operator","area")]
PairApply(d.frm[complete.cases(d.frm),], CramerV, symmetric = TRUE)
#>             driver  operator      area
#> driver   1.0000000 0.2345141 0.6504665
#> operator 0.2345141 1.0000000 0.0869935
#> area     0.6504665 0.0869935 1.0000000


m <- as.table(matrix(c(2,4,1,7), nrow=2))
YuleQ(m)
#> [1] 0.5555556
YuleY(m)
#> [1] 0.303337


# Bootstrap confidence intervals for Cramer's V
# http://support.sas.com/documentation/cdl/en/statugfreq/63124/PDF/default/statugfreq.pdf, p. 1821

tab <- as.table(rbind(
  c(26,26,23,18, 9),
  c( 6, 7, 9,14,23)))
d.frm <- Untable(tab)

n <- 1000
idx <- matrix(sample(nrow(d.frm), size=nrow(d.frm) * n, replace=TRUE), ncol=n, byrow=FALSE)
v <- apply(idx, 2, function(x) CramerV(d.frm[x,1], d.frm[x,2]))
quantile(v, probs=c(0.025,0.975))
#>      2.5%     97.5% 
#> 0.2814951 0.5600137 

# compare this to the analytical ones
CramerV(tab, conf.level=0.95)
#>  Cramer V    lwr.ci    upr.ci 
#> 0.4064888 0.2211672 0.5410622