Outlier.Rd
Return outliers following Tukey's boxplot and Hampel's median/mad definition.
Outlier(x, method = c("boxplot", "hampel"), value = TRUE,na.rm = FALSE)
a (non-empty) numeric vector of data values.
the method to be used. So far Tukey's boxplot and Hampel's rule are implemented.
logical. If FALSE
, a vector containing the (integer) indices of the outliers is returned, and if TRUE
(default), a vector containing the matching elements themselves is returned.
logical. Should missing values be removed? Defaults to FALSE
.
Outlier detection is a tricky problem and should be handled with care. We implement Tukey's boxplot rule as a rough idea of spotting extreme values.
Hampel considers values outside of median +/- 3 * (median absolute deviation) to be outliers.
the values of x lying outside the whiskers in a boxplot
or the indices of them
Hampel F. R. (1974) The influence curve and its role in robust estimation, Journal of the American Statistical Association, 69, 382-393
Outlier(d.pizza$temperature, na.rm=TRUE)
#> [1] 20.00 20.45 22.20 20.35 22.10 21.30 21.00 21.60 21.70 21.80 22.20 22.50
#> [13] 20.40 21.90 19.40 20.20 19.30 20.20 22.40
# it's the same as the result from boxplot
sort(d.pizza$temperature[Outlier(d.pizza$temperature, value=FALSE, na.rm=TRUE)])
#> [1] 19.30 19.40 20.00 20.20 20.20 20.35 20.40 20.45 21.00 21.30 21.60 21.70
#> [13] 21.80 21.90 22.10 22.20 22.20 22.40 22.50
b <- boxplot(d.pizza$temperature, plot=FALSE)
sort(b$out)
#> [1] 19.30 19.40 20.00 20.20 20.20 20.35 20.40 20.45 21.00 21.30 21.60 21.70
#> [13] 21.80 21.90 22.10 22.20 22.20 22.40 22.50
# nice to find the corresponding rows
d.pizza[Outlier(d.pizza$temperature, value=FALSE, na.rm=TRUE), ]
#> index date week weekday area count rabate price operator
#> 20 20 2014-03-01 9 6 Westminster 1 FALSE 11.990 Rhonda
#> 41 41 2014-03-01 9 6 Brent 2 FALSE 24.980 Rhonda
#> 47 47 2014-03-02 9 7 Brent 4 TRUE 78.957 Allanah
#> 142 142 2014-03-05 10 3 Brent 3 FALSE 39.970 Allanah
#> 190 190 2014-03-06 10 4 <NA> 1 FALSE 12.990 Maria
#> 202 202 2014-03-06 10 4 Westminster 2 FALSE 25.980 Maria
#> 206 206 2014-03-06 10 4 Westminster 7 TRUE 134.334 Maria
#> 257 257 2014-03-08 10 6 Westminster 5 TRUE 61.155 Allanah
#> 273 273 2014-03-08 10 6 Camden 3 FALSE 42.970 Allanah
#> 298 298 2014-03-08 10 6 Westminster 3 FALSE 39.970 Allanah
#> 300 300 2014-03-08 10 6 Westminster 4 TRUE 53.964 Allanah
#> 305 305 2014-03-08 10 6 Westminster 3 FALSE 49.970 Allanah
#> 306 306 2014-03-08 10 6 Westminster 2 FALSE 29.980 Allanah
#> 309 309 2014-03-08 10 6 Camden 3 FALSE 45.970 Allanah
#> 343 343 <NA> NA NA Westminster 3 FALSE 71.700 Allanah
#> 611 611 2014-03-16 11 7 Brent 2 FALSE 29.980 Allanah
#> 1077 1077 2014-03-29 13 6 Westminster 5 TRUE 70.155 Rhonda
#> 1101 1101 2014-03-29 13 6 Westminster 3 FALSE 47.970 Maria
#> 1104 1104 2014-03-29 13 6 Westminster 1 FALSE 10.990 Rhonda
#> driver delivery_min temperature wine_ordered wine_delivered wrongpizza
#> 20 Miller 37.3 20.00 0 0 FALSE
#> 41 Taylor 39.7 20.45 0 0 FALSE
#> 47 Hunter 9.2 22.20 1 1 FALSE
#> 142 Taylor 20.1 20.35 0 0 FALSE
#> 190 Carpenter 36.7 22.10 0 0 FALSE
#> 202 Carpenter 39.8 21.30 0 0 FALSE
#> 206 Carpenter 47.7 21.00 1 1 FALSE
#> 257 Carpenter 43.6 21.60 0 0 FALSE
#> 273 Taylor 40.4 21.70 0 0 FALSE
#> 298 Carpenter 49.6 21.80 0 0 FALSE
#> 300 Butcher 31.2 22.20 0 0 FALSE
#> 305 Carpenter 63.2 22.50 0 0 FALSE
#> 306 Miller 36.7 20.40 0 0 TRUE
#> 309 Taylor 55.0 21.90 0 0 FALSE
#> 343 Carpenter 62.9 19.40 1 1 FALSE
#> 611 Taylor 36.3 20.20 0 0 FALSE
#> 1077 Carpenter 65.6 19.30 0 0 FALSE
#> 1101 Hunter 53.4 20.20 0 0 FALSE
#> 1104 Carpenter 46.5 22.40 0 0 FALSE
#> quality
#> 20 low
#> 41 low
#> 47 low
#> 142 low
#> 190 medium
#> 202 medium
#> 206 medium
#> 257 low
#> 273 low
#> 298 <NA>
#> 300 low
#> 305 low
#> 306 <NA>
#> 309 low
#> 343 <NA>
#> 611 low
#> 1077 low
#> 1101 medium
#> 1104 low
# compare to Hampel's rule
Outlier(d.pizza$temperature, method="hampel", na.rm=TRUE)
#> [1] 20.00 20.45 22.20 20.35 22.10 21.30 21.00 21.60 21.70 21.80 22.20 20.40
#> [13] 21.90 19.40 20.20 19.30 20.20 22.40
# outliers for the each driver
tapply(d.pizza$temperature, d.pizza$driver, Outlier, na.rm=TRUE)
#> $Butcher
#> [1] 26.3 22.2
#>
#> $Carpenter
#> numeric(0)
#>
#> $Carter
#> [1] 27.00 26.60 26.45 24.00 26.70 25.20
#>
#> $Farmer
#> [1] 26.35 28.80 25.85 26.95 25.55 26.70
#>
#> $Hunter
#> [1] 22.20 25.95 26.70 20.20
#>
#> $Miller
#> [1] 28.80 20.00 25.85 26.15 29.10 30.40 20.40 29.50 27.30
#>
#> $Taylor
#> numeric(0)
#>
# the same as:
boxplot(temperature ~ driver, d.pizza)$out
#> [1] 26.30 22.20 27.00 26.60 26.45 24.00 26.70 25.20 26.35 28.80 25.85 26.95
#> [13] 25.55 26.70 22.20 25.95 26.70 20.20 28.80 20.00 25.85 26.15 29.10 30.40
#> [25] 20.40 29.50 27.30