Covariance and Correlation (Matrices)

Cov and Cor compute the covariance or correlation of x and y if these are vectors. If x and y are matrices then the covariances (or correlations) between the columns of x and the columns of y are computed.

Cov(x, y = NULL, use = "everything",
    method = c("pearson", "kendall", "spearman"))

Cor(x, y = NULL, use = "everything",
    method = c("pearson", "kendall", "spearman"))

Arguments

x: a numeric vector, matrix or data frame.
y: NULL (default) or a vector, matrix or data frame with compatible dimensions to x. The default is equivalent to y = x (but more efficient).
use: an optional character string giving a method for computing covariances in the presence of missing values. This must be (an abbreviation of) one of the strings "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs".
method: a character string indicating which correlation coefficient (or covariance) is to be computed. One of "pearson" (default), "kendall", or "spearman": can be abbreviated.
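
The role of y can be illustrated with a small sketch. It uses base R's cor() as a stand-in, on the assumption that Cor() accepts the same arguments and behaves the same way; the matrices x and y are made-up data.

```r
# Assumption: base R's cor() stands in for Cor().
set.seed(1)
x <- matrix(rnorm(30), nrow = 10, ncol = 3,
            dimnames = list(NULL, c("a", "b", "c")))
y <- matrix(rnorm(20), nrow = 10, ncol = 2,
            dimnames = list(NULL, c("u", "v")))

# Two matrices: one entry per (column of x, column of y) pair.
cxy <- cor(x, y)
dim(cxy)  # 3 2

# Each entry equals the vector-vector correlation of the two columns.
all.equal(cxy["b", "v"], cor(x[, "b"], y[, "v"]))

# Leaving y = NULL gives the same result as passing y = x.
all.equal(cor(x), cor(x, x))
```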

Value

For r <- Cor(*, use = "all.obs"), it is guaranteed that all(abs(r) <= 1).

Details

For Cov and Cor one must either give a matrix or data frame for x or give both x and y.

The inputs must be numeric (as determined by is.numeric; logical values are also allowed for historical compatibility). The "kendall" and "spearman" methods also make sense for ordered non-numeric inputs; for these, xtfrm can be used to find a suitable prior transformation to numbers.
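
A minimal sketch of the xtfrm route for an ordered factor follows. The variables size and price are hypothetical, and base R's cor() is used on the assumption that Cor() treats its inputs the same way.

```r
# Hypothetical data: an ordered factor and a numeric response.
size  <- factor(c("small", "large", "medium", "small", "large", "medium"),
                levels = c("small", "medium", "large"), ordered = TRUE)
price <- c(1.2, 3.9, 2.5, 1.0, 4.4, 2.1)

is.numeric(size)          # FALSE: a factor is not a numeric input

# xtfrm() maps the factor to integer codes that preserve its ordering.
size_num <- xtfrm(size)

# Rank-based measures only depend on the ordering of those codes.
rho <- cor(size_num, price, method = "spearman")
tau <- cor(size_num, price, method = "kendall")
```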

If use is "everything", NAs will propagate conceptually, i.e., a resulting value will be NA whenever one of its contributing observations is NA.
If use is "all.obs", then the presence of missing observations will produce an error. If use is "complete.obs" then missing values are handled by casewise deletion (and if there are no complete cases, that gives an error).
"na.or.complete" is the same as "complete.obs", except that the result is NA rather than an error when there are no complete cases. Finally, if use has the value "pairwise.complete.obs" then the correlation or covariance between each pair of variables is computed using all complete pairs of observations on those variables. This can result in covariance or correlation matrices that are not positive semi-definite, as well as NA entries if there are no complete pairs for that pair of variables. For Cov and Var, "pairwise.complete.obs" only works with the "pearson" method. Note that (the equivalent of) Var(double(0), use = *) gives NA for use = "everything" and "na.or.complete", and gives an error in the other cases.

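The five settings of use described above can be compared on a tiny data frame. This is only a sketch: d and e are made-up data, and base R's cov() is used on the assumption that Cov() handles use identically.

```r
# Assumption: base R's cov() stands in for Cov().
d <- data.frame(a = c(1, 2, 3, 4, NA),
                b = c(2, 1, 4, 3, 5),
                c = c(NA, 1, 2, 4, 3))

ev <- cov(d)   # "everything": NA wherever column a or c contributes

# "all.obs": any missing observation is an error.
res <- try(cov(d, use = "all.obs"), silent = TRUE)
inherits(res, "try-error")

# "complete.obs": casewise deletion, here rows 2 to 4 remain.
cc <- cov(d, use = "complete.obs")
all.equal(cc, cov(d[complete.cases(d), ]))

# "pairwise.complete.obs": each entry uses the rows complete for that pair.
pw <- cov(d, use = "pairwise.complete.obs")
all.equal(pw["a", "b"], cov(d$a[1:4], d$b[1:4]))

# No complete cases at all: error for "complete.obs", NA for "na.or.complete".
e <- data.frame(a = c(1, NA), b = c(NA, 2))
res2 <- try(cov(e, use = "complete.obs"), silent = TRUE)
na_res <- cov(e, use = "na.or.complete")
```
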
The denominator \(n - 1\) is used, which gives an unbiased estimator of the (co)variance for i.i.d. observations. These functions return NA when there is only one observation (whereas S-PLUS has been returning NaN), and fail if x has length zero.
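
The \(n - 1\) denominator can be checked by hand. The sketch below uses made-up vectors and base R's cov(), assumed here to match Cov().

```r
# Assumption: base R's cov() stands in for Cov().
x <- c(2, 4, 4, 7)
y <- c(1, 3, 2, 6)
n <- length(x)

# Sample covariance: sum of cross-products of deviations over n - 1.
manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
all.equal(manual, cov(x, y))

# A single observation leaves a zero denominator; the result is NA.
single <- cov(3, 5)
is.na(single)
```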

For Cor(), if method is "kendall" or "spearman", Kendall's \(\tau\) or Spearman's \(\rho\) statistic is used to estimate a rank-based measure of association. These are more robust and have been recommended if the data do not necessarily come from a bivariate normal distribution.
For Cov(), a non-Pearson method is unusual but available for the sake of completeness. Note that "spearman" basically computes Cor(R(x), R(y)) (or Cov(., .)) where R(u) := rank(u, na.last = "keep"). In the case of missing values, the ranks are calculated depending on the value of use, either based on complete observations, or based on pairwise completeness with reranking for each pair.
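
The relation between "spearman" and ranks can be verified directly. This is a sketch on simulated data, with base R's cor() assumed to behave like Cor().

```r
# Assumption: base R's cor() stands in for Cor().
set.seed(42)
x <- rexp(20)
y <- x^2 + rnorm(20, sd = 0.5)

# Spearman's rho is Pearson's correlation applied to the ranks.
all.equal(cor(x, y, method = "spearman"),
          cor(rank(x, na.last = "keep"), rank(y, na.last = "keep")))

# With a missing value and use = "complete.obs", the ranks are
# recomputed on the complete observations only.
x[3] <- NA
ok <- complete.cases(x, y)
all.equal(cor(x, y, method = "spearman", use = "complete.obs"),
          cor(rank(x[ok]), rank(y[ok])))
```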

A covariance matrix can be scaled into a correlation matrix in many ways. The mathematically most appealing is multiplication with a diagonal matrix from the left and from the right; a more efficient one is to use sweep(.., FUN = "/") twice.
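
The two scaling approaches can be sketched as follows, using the longley data from the examples. Base R's cov(), cor() and cov2cor() are used here; that Cov() and Cor() return the same matrices is an assumption.

```r
# Assumption: base R's cov() and cor() stand in for Cov() and Cor().
V <- cov(longley)
s <- sqrt(diag(V))          # standard deviations of the columns

# 1. Multiply by the diagonal matrix of 1/sd from the left and the right.
D  <- diag(1 / unname(s))
r1 <- D %*% V %*% D
dimnames(r1) <- dimnames(V)

# 2. Divide rows, then columns, by the standard deviations.
r2 <- sweep(sweep(V, 1, s, FUN = "/"), 2, s, FUN = "/")

# 3. The built-in helper in base R.
r3 <- cov2cor(V)

all.equal(r1, r2)
all.equal(r2, r3)
all.equal(r3, cor(longley))
```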

Note

Some people have noted that the code for Kendall's tau is slow for very large datasets (many more than 1000 cases). It rarely makes sense to do such a computation, but see function cor.fk in package pcaPP.
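
As a rough illustration of why the cost grows quickly with the number of cases, the textbook pair-counting definition of Kendall's tau visits all \(n(n-1)/2\) pairs of observations. The helper tau_naive below is hypothetical and is not the code used by Cor; for data without ties it should agree with the "kendall" method of base R's cor(), which is assumed to stand in for Cor().

```r
# Hypothetical helper: Kendall's tau by direct pair counting (no ties assumed).
tau_naive <- function(x, y) {
  n <- length(x)
  s <- 0
  for (i in seq_len(n - 1)) {
    j <- (i + 1):n
    # +1 for each concordant pair, -1 for each discordant pair
    s <- s + sum(sign(x[i] - x[j]) * sign(y[i] - y[j]))
  }
  s / (n * (n - 1) / 2)   # divide by the number of pairs
}

set.seed(7)
x <- rnorm(50)
y <- x + rnorm(50)
all.equal(tau_naive(x, y), cor(x, y, method = "kendall"))
```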

References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

Examples

## Two simple vectors
Cor(1:10, 2:11) # == 1
#> [1] 1

## Correlation Matrix of Multivariate sample:
(Cl <- Cor(longley))
#>              GNP.deflator       GNP Unemployed Armed.Forces Population
#> GNP.deflator    1.0000000 0.9915892  0.6206334    0.4647442  0.9791634
#> GNP             0.9915892 1.0000000  0.6042609    0.4464368  0.9910901
#> Unemployed      0.6206334 0.6042609  1.0000000   -0.1774206  0.6865515
#> Armed.Forces    0.4647442 0.4464368 -0.1774206    1.0000000  0.3644163
#> Population      0.9791634 0.9910901  0.6865515    0.3644163  1.0000000
#> Year            0.9911492 0.9952735  0.6682566    0.4172451  0.9939528
#> Employed        0.9708985 0.9835516  0.5024981    0.4573074  0.9603906
#>                   Year  Employed
#> GNP.deflator 0.9911492 0.9708985
#> GNP          0.9952735 0.9835516
#> Unemployed   0.6682566 0.5024981
#> Armed.Forces 0.4172451 0.4573074
#> Population   0.9939528 0.9603906
#> Year         1.0000000 0.9713295
#> Employed     0.9713295 1.0000000
## Graphical Correlation Matrix:
symnum(Cl) # highly correlated
#>              GNP. GNP U A P Y E
#> GNP.deflator 1                 
#> GNP          B    1            
#> Unemployed   ,    ,   1        
#> Armed.Forces .    .     1      
#> Population   B    B   , . 1    
#> Year         B    B   , . B 1  
#> Employed     B    B   . . B B 1
#> attr(,"legend")
#> [1] 0 ‘ ’ 0.3 ‘.’ 0.6 ‘,’ 0.8 ‘+’ 0.9 ‘*’ 0.95 ‘B’ 1

## Spearman's rho  and  Kendall's tau
symnum(clS <- Cor(longley, method = "spearman"))
#>              GNP. GNP U A P Y E
#> GNP.deflator 1                 
#> GNP          B    1            
#> Unemployed   ,    ,   1        
#> Armed.Forces          . 1      
#> Population   B    B   ,   1    
#> Year         B    B   ,   1 1  
#> Employed     B    B   .   B B 1
#> attr(,"legend")
#> [1] 0 ‘ ’ 0.3 ‘.’ 0.6 ‘,’ 0.8 ‘+’ 0.9 ‘*’ 0.95 ‘B’ 1
symnum(clK <- Cor(longley, method = "kendall"))
#>              GNP. GNP U A P Y E
#> GNP.deflator 1                 
#> GNP          B    1            
#> Unemployed   .    .   1        
#> Armed.Forces            1      
#> Population   B    B   .   1    
#> Year         B    B   .   1 1  
#> Employed     *    *   .   + + 1
#> attr(,"legend")
#> [1] 0 ‘ ’ 0.3 ‘.’ 0.6 ‘,’ 0.8 ‘+’ 0.9 ‘*’ 0.95 ‘B’ 1
## How much do they differ?
i <- lower.tri(Cl)
Cor(cbind(P = Cl[i], S = clS[i], K = clK[i]))
#>           P         S         K
#> P 1.0000000 0.9802390 0.9572562
#> S 0.9802390 1.0000000 0.9742171
#> K 0.9572562 0.9742171 1.0000000


##--- Missing value treatment:

C1 <- Cov(swiss)
range(eigen(C1, only.values = TRUE)$values) # 6.19        1921
#> [1]    6.191249 1921.562488

## swM := "swiss" with  3 "missing"s :
swM <- swiss
colnames(swM) <- abbreviate(colnames(swiss), min=6)
swM[1,2] <- swM[7,3] <- swM[25,5] <- NA # create 3 "missing"

## Consider all 5 "use" cases :
(C. <- Cov(swM)) # use="everything"  quite a few NA's in cov.matrix
#>           Frtlty Agrclt Exmntn     Eductn Cathlc    Infn.M
#> Frtlty 156.04250     NA     NA -79.729510     NA 15.156193
#> Agrclt        NA     NA     NA         NA     NA        NA
#> Exmntn        NA     NA     NA         NA     NA        NA
#> Eductn -79.72951     NA     NA  92.456059     NA -2.781684
#> Cathlc        NA     NA     NA         NA     NA        NA
#> Infn.M  15.15619     NA     NA  -2.781684     NA  8.483802
try(Cov(swM, use = "all")) # Error: missing obs...
#> Error in Cov(swM, use = "all") : missing observations in cov/cor
C2 <- Cov(swM, use = "complete")
stopifnot(identical(C2, Cov(swM, use = "na.or.complete")))
range(eigen(C2, only.values = TRUE)$values) # 6.46   1930
#> [1]    6.462385 1930.505982
C3 <- Cov(swM, use = "pairwise")
range(eigen(C3, only.values = TRUE)$values) # 6.19   1938
#> [1]    6.194469 1938.033663

## Kendall's tau doesn't change much:
symnum(Rc <- Cor(swM, method = "kendall", use = "complete"))
#>        F A Ex Ed C I
#> Frtlty 1            
#> Agrclt   1          
#> Exmntn . . 1        
#> Eductn . . .  1     
#> Cathlc     .     1  
#> Infn.M             1
#> attr(,"legend")
#> [1] 0 ‘ ’ 0.3 ‘.’ 0.6 ‘,’ 0.8 ‘+’ 0.9 ‘*’ 0.95 ‘B’ 1
symnum(Rp <- Cor(swM, method = "kendall", use = "pairwise"))
#>        F A Ex Ed C I
#> Frtlty 1            
#> Agrclt   1          
#> Exmntn . . 1        
#> Eductn . . .  1     
#> Cathlc     .     1  
#> Infn.M .           1
#> attr(,"legend")
#> [1] 0 ‘ ’ 0.3 ‘.’ 0.6 ‘,’ 0.8 ‘+’ 0.9 ‘*’ 0.95 ‘B’ 1
symnum(R. <- Cor(swiss, method = "kendall"))
#>                  F A Ex Ed C I
#> Fertility        1            
#> Agriculture        1          
#> Examination      . . 1        
#> Education        . . .  1     
#> Catholic             .     1  
#> Infant.Mortality .           1
#> attr(,"legend")
#> [1] 0 ‘ ’ 0.3 ‘.’ 0.6 ‘,’ 0.8 ‘+’ 0.9 ‘*’ 0.95 ‘B’ 1

## "pairwise" is closer componentwise,
summary(abs(c(1 - Rp/R.)))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> 0.00000 0.00000 0.04481 0.09573 0.15214 0.53941 
summary(abs(c(1 - Rc/R.)))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> 0.00000 0.02021 0.08482 0.50675 0.16192 7.08509 

## but "complete" is closer in Eigen space:
EV <- function(m) eigen(m, only.values=TRUE)$values
summary(abs(1 - EV(Rp)/EV(R.)) / abs(1 - EV(Rc)/EV(R.)))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.8942  1.1464  1.2452  1.3732  1.3722  2.3265