Cor.Rd
Cov and Cor compute the covariance or correlation of x and y if these are vectors. If x and y are matrices then the covariances (or correlations) between the columns of x and the columns of y are computed.
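
The difference between vector and matrix input can be sketched as follows. This is a minimal illustration that uses base R's stats::cor() as a stand-in, on the assumption that Cor() accepts the same inputs; the data are made up.

```r
# Assumption: stats::cor() stands in for Cor(); the data are invented.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Two vectors give a single number.
r_xy <- cor(x, y)

# Two matrices give one entry per (column of X, column of Y) pair.
X <- cbind(a = x, b = x^2)
Y <- cbind(u = y, v = rev(y), w = y - x)
r_XY <- cor(X, Y)
dim(r_XY)        # 2 rows (columns of X) by 3 columns (columns of Y)
r_XY["a", "u"]   # same value as r_xy
```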
x: a numeric vector, matrix or data frame.
y: NULL (default) or a vector, matrix or data frame with compatible dimensions to x. The default is equivalent to y = x (but more efficient).
use: an optional character string giving a method for computing covariances in the presence of missing values. This must be (an abbreviation of) one of the strings "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs".
method: a character string indicating which correlation coefficient (or covariance) is to be computed. One of "pearson" (default), "kendall", or "spearman"; can be abbreviated.
For r <- Cor(*, use = "all.obs"), it is now guaranteed that all(abs(r) <= 1).
For Cov and Cor one must either give a matrix or data frame for x or give both x and y.
The inputs must be numeric (as determined by is.numeric; logical values are also allowed for historical compatibility). The "kendall" and "spearman" methods make sense for ordered inputs, but xtfrm can be used to find a suitable prior transformation to numbers.
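
A minimal sketch of the xtfrm route for an ordered factor. It uses base R's stats::cor() in place of Cor() (an assumption), and the variables sizes and price are invented for illustration.

```r
# Assumption: stats::cor() stands in for Cor(); 'sizes' and 'price' are invented.
sizes <- factor(c("small", "large", "medium", "small", "large"),
                levels = c("small", "medium", "large"), ordered = TRUE)
price <- c(10, 30, 22, 12, 28)

# xtfrm() maps the ordered factor to integer codes that follow the level order.
s_num <- xtfrm(sizes)

# Rank-based methods only use the ordering, so the codes are sufficient.
rho <- cor(s_num, price, method = "spearman")
tau <- cor(s_num, price, method = "kendall")
```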
If use is "everything", NAs will propagate conceptually, i.e., a resulting value will be NA whenever one of its contributing observations is NA.
If use is "all.obs", then the presence of missing observations will produce an error. If use is "complete.obs" then missing values are handled by casewise deletion (and if there are no complete cases, that gives an error). "na.or.complete" is the same, except that it gives NA when there are no complete cases.
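
The behaviour of these four settings can be sketched on two short vectors. The example relies on base R's stats::cor(), assuming Cor() handles use in the same way; the data are made up.

```r
# Assumption: stats::cor() stands in for Cor(); the data are invented.
x <- c(1, 2, NA, 4, 5)
y <- c(2, 1, 4, NA, 6)

r_every    <- cor(x, y, use = "everything")                  # NA propagates
r_allobs   <- try(cor(x, y, use = "all.obs"), silent = TRUE) # error: missing observations
r_complete <- cor(x, y, use = "complete.obs")                # casewise deletion: rows 1, 2, 5

# Two vectors without a single complete case.
x0 <- c(1, NA, 3)
y0 <- c(NA, 2, NA)
r_err <- try(cor(x0, y0, use = "complete.obs"), silent = TRUE) # error
r_na  <- cor(x0, y0, use = "na.or.complete")                   # NA instead of an error
```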
Finally, if use has the value "pairwise.complete.obs" then the correlation or covariance between each pair of variables is computed using all complete pairs of observations on those variables. This can result in covariance or correlation matrices which are not positive semi-definite, as well as NA entries if there are no complete pairs for that pair of variables. For Cov and Var, "pairwise.complete.obs" only works with the "pearson" method.
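
A contrived example of how pairwise deletion can lose positive semi-definiteness: every pair of columns is observed on a different set of rows, so the three correlations are mutually inconsistent. This is a sketch using base R's stats::cor() in place of Cor() (an assumption) and an invented matrix m.

```r
# Assumption: stats::cor() stands in for Cor(); the matrix 'm' is invented.
m <- rbind(
  c(1, 1, NA), c(2, 2, NA), c(3, 3, NA),   # only x1 and x2 observed: correlation  1
  c(NA, 1, 1), c(NA, 2, 2), c(NA, 3, 3),   # only x2 and x3 observed: correlation  1
  c(1, NA, 3), c(2, NA, 2), c(3, NA, 1)    # only x1 and x3 observed: correlation -1
)
colnames(m) <- c("x1", "x2", "x3")

R_pair <- cor(m, use = "pairwise.complete.obs")

# A valid correlation matrix has no negative eigenvalues; this one does.
ev <- eigen(R_pair, symmetric = TRUE, only.values = TRUE)$values
min(ev)   # close to -1
```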
Note that (the equivalent of) Var(double(0), use = *) gives NA for use = "everything" and "na.or.complete", and gives an error in the other cases.
The denominator \(n - 1\) is used which gives an unbiased estimator
of the (co)variance for i.i.d. observations.
These functions return NA when there is only one observation (whereas S-PLUS has been returning NaN), and fail if x has length zero.
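
The \(n - 1\) denominator and the single-observation case can be checked by hand. The sketch below uses base R's stats::cov() and stats::var(), assuming Cov() and Var() agree with them; the data are made up.

```r
# Assumption: stats::cov() and stats::var() stand in for Cov() and Var().
x <- c(2, 4, 4, 5, 7)
y <- c(1, 3, 2, 5, 6)
n <- length(x)

# Sample covariance and variance with the n - 1 denominator.
manual_cov <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
manual_var <- sum((x - mean(x))^2) / (n - 1)

all.equal(manual_cov, cov(x, y))
all.equal(manual_var, var(x))

# With a single observation the denominator is zero and the result is NA.
v_one <- var(5)
c_one <- cov(1, 2)
```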
For Cor(), if method is "kendall" or "spearman", Kendall's \(\tau\) or Spearman's \(\rho\) statistic is used to estimate a rank-based measure of association. These are more robust and have been recommended if the data do not necessarily come from a bivariate normal distribution.
For Cov(), a non-Pearson method is unusual but available for the sake of completeness. Note that "spearman" basically computes Cor(R(x), R(y)) (or Cov(., .)) where R(u) := rank(u, na.last = "keep"). In the case of missing values, the ranks are calculated depending on the value of use, either based on complete observations, or based on pairwise completeness with reranking for each pair.
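
The rank identity for "spearman" can be verified directly. The sketch below uses base R's stats::cor() and stats::cov() as stand-ins for Cor() and Cov() (an assumption), with made-up data containing missing values; under use = "complete.obs" the ranks are taken over the complete cases only.

```r
# Assumption: stats::cor() and stats::cov() stand in for Cor() and Cov().
x <- c(3.1, 1.2, 5.8, NA, 4.4, 2.0)
y <- c(10, 12, 35, 20, NA, 11)

R  <- function(u) rank(u, na.last = "keep")
ok <- complete.cases(x, y)   # rows where both x and y are observed

# Built-in Spearman versus Pearson on the ranks of the complete cases.
rho_builtin <- cor(x, y, method = "spearman", use = "complete.obs")
rho_manual  <- cor(R(x[ok]), R(y[ok]))

# The same identity for the rank covariance.
cov_builtin <- cov(x, y, method = "spearman", use = "complete.obs")
cov_manual  <- cov(R(x[ok]), R(y[ok]))
```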
Scaling a covariance matrix into a correlation one can be achieved in many ways, mathematically most appealing by multiplication with a diagonal matrix from left and right, or more efficiently by using sweep(.., FUN = "/") twice.
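
Both scaling routes can be sketched on the longley data. The example uses base R's stats::cov() and stats::cor(), assuming Cov() and Cor() return the same matrices; stats::cov2cor() is shown as a further base R alternative that the text above does not mention.

```r
# Assumption: stats::cov() and stats::cor() stand in for Cov() and Cor().
V <- cov(longley)
s <- sqrt(diag(V))            # standard deviations

# Route 1: multiply by a diagonal matrix from the left and from the right.
D_inv  <- diag(1 / s)
R_diag <- D_inv %*% V %*% D_inv
dimnames(R_diag) <- dimnames(V)

# Route 2: divide rows, then columns, by the standard deviations.
R_sweep <- sweep(sweep(V, 1, s, FUN = "/"), 2, s, FUN = "/")

# Base R also ships a helper for this scaling.
R_c2c <- cov2cor(V)

all.equal(R_diag,  cor(longley))
all.equal(R_sweep, cor(longley))
all.equal(R_c2c,   cor(longley))
```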
Some people have noted that the code for Kendall's tau is slow for very large datasets (many more than 1000 cases). It rarely makes sense to do such a computation, but see function cor.fk in package pcaPP.
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
## Two simple vectors
Cor(1:10, 2:11) # == 1
#> [1] 1
## Correlation Matrix of Multivariate sample:
(Cl <- Cor(longley))
#>              GNP.deflator       GNP Unemployed Armed.Forces Population
#> GNP.deflator    1.0000000 0.9915892  0.6206334    0.4647442  0.9791634
#> GNP             0.9915892 1.0000000  0.6042609    0.4464368  0.9910901
#> Unemployed      0.6206334 0.6042609  1.0000000   -0.1774206  0.6865515
#> Armed.Forces    0.4647442 0.4464368 -0.1774206    1.0000000  0.3644163
#> Population      0.9791634 0.9910901  0.6865515    0.3644163  1.0000000
#> Year            0.9911492 0.9952735  0.6682566    0.4172451  0.9939528
#> Employed        0.9708985 0.9835516  0.5024981    0.4573074  0.9603906
#>                   Year  Employed
#> GNP.deflator 0.9911492 0.9708985
#> GNP          0.9952735 0.9835516
#> Unemployed   0.6682566 0.5024981
#> Armed.Forces 0.4172451 0.4573074
#> Population   0.9939528 0.9603906
#> Year         1.0000000 0.9713295
#> Employed     0.9713295 1.0000000
## Graphical Correlation Matrix:
symnum(Cl) # highly correlated
#> GNP. GNP U A P Y E
#> GNP.deflator 1
#> GNP B 1
#> Unemployed , , 1
#> Armed.Forces . . 1
#> Population B B , . 1
#> Year B B , . B 1
#> Employed B B . . B B 1
#> attr(,"legend")
#> [1] 0 ‘ ’ 0.3 ‘.’ 0.6 ‘,’ 0.8 ‘+’ 0.9 ‘*’ 0.95 ‘B’ 1
## Spearman's rho and Kendall's tau
symnum(clS <- Cor(longley, method = "spearman"))
#> GNP. GNP U A P Y E
#> GNP.deflator 1
#> GNP B 1
#> Unemployed , , 1
#> Armed.Forces . 1
#> Population B B , 1
#> Year B B , 1 1
#> Employed B B . B B 1
#> attr(,"legend")
#> [1] 0 ‘ ’ 0.3 ‘.’ 0.6 ‘,’ 0.8 ‘+’ 0.9 ‘*’ 0.95 ‘B’ 1
symnum(clK <- Cor(longley, method = "kendall"))
#> GNP. GNP U A P Y E
#> GNP.deflator 1
#> GNP B 1
#> Unemployed . . 1
#> Armed.Forces 1
#> Population B B . 1
#> Year B B . 1 1
#> Employed * * . + + 1
#> attr(,"legend")
#> [1] 0 ‘ ’ 0.3 ‘.’ 0.6 ‘,’ 0.8 ‘+’ 0.9 ‘*’ 0.95 ‘B’ 1
## How much do they differ?
i <- lower.tri(Cl)
Cor(cbind(P = Cl[i], S = clS[i], K = clK[i]))
#>           P         S         K
#> P 1.0000000 0.9802390 0.9572562
#> S 0.9802390 1.0000000 0.9742171
#> K 0.9572562 0.9742171 1.0000000
##--- Missing value treatment:
C1 <- Cov(swiss)
range(eigen(C1, only.values = TRUE)$values) # 6.19 1921
#> [1] 6.191249 1921.562488
## swM := "swiss" with 3 "missing"s :
swM <- swiss
colnames(swM) <- abbreviate(colnames(swiss), min=6)
swM[1,2] <- swM[7,3] <- swM[25,5] <- NA # create 3 "missing"
## Consider all 5 "use" cases :
(C. <- Cov(swM)) # use="everything" quite a few NA's in cov.matrix
#>           Frtlty Agrclt Exmntn     Eductn Cathlc    Infn.M
#> Frtlty 156.04250     NA     NA -79.729510     NA 15.156193
#> Agrclt        NA     NA     NA         NA     NA        NA
#> Exmntn        NA     NA     NA         NA     NA        NA
#> Eductn -79.72951     NA     NA  92.456059     NA -2.781684
#> Cathlc        NA     NA     NA         NA     NA        NA
#> Infn.M  15.15619     NA     NA  -2.781684     NA  8.483802
try(Cov(swM, use = "all")) # Error: missing obs...
#> Error in Cov(swM, use = "all") : missing observations in cov/cor
C2 <- Cov(swM, use = "complete")
stopifnot(identical(C2, Cov(swM, use = "na.or.complete")))
range(eigen(C2, only.values = TRUE)$values) # 6.46 1930
#> [1] 6.462385 1930.505982
C3 <- Cov(swM, use = "pairwise")
range(eigen(C3, only.values = TRUE)$values) # 6.19 1938
#> [1] 6.194469 1938.033663
## Kendall's tau doesn't change much:
symnum(Rc <- Cor(swM, method = "kendall", use = "complete"))
#> F A Ex Ed C I
#> Frtlty 1
#> Agrclt 1
#> Exmntn . . 1
#> Eductn . . . 1
#> Cathlc . 1
#> Infn.M 1
#> attr(,"legend")
#> [1] 0 ‘ ’ 0.3 ‘.’ 0.6 ‘,’ 0.8 ‘+’ 0.9 ‘*’ 0.95 ‘B’ 1
symnum(Rp <- Cor(swM, method = "kendall", use = "pairwise"))
#> F A Ex Ed C I
#> Frtlty 1
#> Agrclt 1
#> Exmntn . . 1
#> Eductn . . . 1
#> Cathlc . 1
#> Infn.M . 1
#> attr(,"legend")
#> [1] 0 ‘ ’ 0.3 ‘.’ 0.6 ‘,’ 0.8 ‘+’ 0.9 ‘*’ 0.95 ‘B’ 1
symnum(R. <- Cor(swiss, method = "kendall"))
#> F A Ex Ed C I
#> Fertility 1
#> Agriculture 1
#> Examination . . 1
#> Education . . . 1
#> Catholic . 1
#> Infant.Mortality . 1
#> attr(,"legend")
#> [1] 0 ‘ ’ 0.3 ‘.’ 0.6 ‘,’ 0.8 ‘+’ 0.9 ‘*’ 0.95 ‘B’ 1
## "pairwise" is closer componentwise,
summary(abs(c(1 - Rp/R.)))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#> 0.00000 0.00000 0.04481 0.09573 0.15214 0.53941
summary(abs(c(1 - Rc/R.)))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#> 0.00000 0.02021 0.08482 0.50675 0.16192 7.08509
## but "complete" is closer in Eigen space:
EV <- function(m) eigen(m, only.values=TRUE)$values
summary(abs(1 - EV(Rp)/EV(R.)) / abs(1 - EV(Rc)/EV(R.)))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>  0.8942  1.1464  1.2452  1.3732  1.3722  2.3265