Function that fills in all NA values using the k nearest neighbours of each case with NA values. By default it uses the values of the neighbours and obtains a weighted (by the distance to the case) average of their values to fill in the unknowns. If meth='median' it uses the median/most frequent value instead.

ImputeKnn(data, k = 10, scale = TRUE, meth = "weighAvg", distData = NULL)

Arguments

data

A data frame with the data set

k

The number of nearest neighbours to use (defaults to 10)

scale

Boolean indicating whether the data should be scaled before finding the nearest neighbours (defaults to TRUE)

meth

String indicating the method used to calculate the value to fill in each NA. Available values are 'median' or 'weighAvg' (the default).

distData

Optionally, a data frame containing the data set that should be used to find the neighbours. This is useful when filling in NA values in a test set, where you should use only information from the training set. This defaults to NULL, which means that the neighbours will be searched for in data.

Details

This function uses the k-nearest neighbours to fill in the unknown (NA) values in a data set. For each case with any NA value it will search for its k most similar cases and use the values of these cases to fill in the unknowns.
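The neighbour search described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's code: the helper name nearest_complete is hypothetical, the data are all numeric, candidates are restricted to complete rows, distances are Euclidean over the columns observed in the target case, and scaling is left out for brevity.

```r
# Hypothetical helper (not part of the package): return the k complete rows
# of `pool` that are closest to the incomplete row `x`.
nearest_complete <- function(x, pool, k) {
  known <- !is.na(x)                                   # columns observed in the target case
  pool  <- pool[complete.cases(pool), , drop = FALSE]  # only complete cases are candidates
  diffs <- sweep(pool[, known, drop = FALSE], 2, x[known])
  d     <- sqrt(rowSums(diffs^2))                      # Euclidean distance on the known columns
  idx   <- order(d)[seq_len(k)]                        # positions of the k smallest distances
  list(rows = pool[idx, , drop = FALSE], dist = d[idx])
}

# Toy data: the last row has an NA in the third column.
m <- rbind(c(1,   1,   10),
           c(2,   2,   20),
           c(9,   9,   90),
           c(1.5, 1.5, NA))

nn <- nearest_complete(m[4, ], m, k = 2)
nn$rows   # the two rows nearest to the incomplete case
nn$dist   # their distances to it
```

Passing a different matrix as pool mimics the role of distData: the neighbours are then drawn from another data set than the one being filled in.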

If meth='median' the function will use either the median (in case of numeric variables) or the most frequent value (in case of factors) of the neighbours to fill in the NAs. If meth='weighAvg' the function will use a weighted average of the values of the neighbours. The weights are given by exp(-dist(k,x)), where dist(k,x) is the Euclidean distance between the case with NAs (x) and the neighbour k.
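The two fill rules can be illustrated for a single variable as follows. This is a minimal sketch, not the package's implementation: fill_value is a hypothetical helper, the weighted average is assumed to be normalised by the sum of the weights, and the weighted rule is shown for numeric values only, since its treatment of factors is not described here.

```r
# Hypothetical helper: `values` are the neighbours' values for one variable,
# `d` are the distances of those neighbours to the case being filled in.
fill_value <- function(values, d, meth = "weighAvg") {
  if (meth == "median") {
    if (is.numeric(values)) return(median(values))
    return(names(which.max(table(values))))   # most frequent level of a factor
  }
  stopifnot(is.numeric(values))               # assumption: numeric variables only
  w <- exp(-d)                                # closer neighbours get larger weights
  sum(w * values) / sum(w)                    # normalised weighted average
}

vals <- c(10, 20, 60)
d    <- c(0, 1, 2)

wavg <- fill_value(vals, d)                   # pulled towards the closest neighbour (10)
med  <- fill_value(vals, d, meth = "median")  # plain median of the neighbours
lev  <- fill_value(factor(c("a", "b", "b")), d, meth = "median")
```

Because the weights decay exponentially with distance, the weighted average here lies below the median: the neighbour at distance 0 dominates the one at distance 2.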

Value

A data frame without NA values

References

Torgo, L. (2010) Data Mining using R: learning with case studies, CRC Press (ISBN: 9781439810187).

Author

Luis Torgo ltorgo@dcc.fc.up.pt

Examples

cleanPizza <- ImputeKnn(d.pizza[, -2])   # no dates allowed
summary(cleanPizza)
#>      index           week          weekday               area    
#>  Min.   :   1   Min.   : 9.00   Min.   :1.000   Brent      :480  
#>  1st Qu.: 303   1st Qu.:10.00   1st Qu.:3.000   Camden     :346  
#>  Median : 605   Median :11.00   Median :5.000   Westminster:383  
#>  Mean   : 605   Mean   :11.41   Mean   :4.443                    
#>  3rd Qu.: 907   3rd Qu.:13.00   3rd Qu.:6.000                    
#>  Max.   :1209   Max.   :14.00   Max.   :7.000                    
#>                                                                  
#>      count          rabate              price            operator  
#>  Min.   :1.000   Length:1209        Min.   :  8.792   Allanah:370  
#>  1st Qu.:2.000   Class :character   1st Qu.: 31.176   Maria  :391  
#>  Median :3.000   Mode  :character   Median : 46.764   Rhonda :448  
#>  Mean   :3.445                      Mean   : 48.734                
#>  3rd Qu.:4.000                      3rd Qu.: 62.955                
#>  Max.   :8.000                      Max.   :134.334                
#>                                                                    
#>        driver     delivery_min    temperature     wine_ordered   
#>  Butcher  : 96   Min.   : 8.80   Min.   :19.30   Min.   :0.0000  
#>  Carpenter:275   1st Qu.:17.40   1st Qu.:42.20   1st Qu.:0.0000  
#>  Carter   :235   Median :24.40   Median :49.80   Median :0.0000  
#>  Farmer   :118   Mean   :25.65   Mean   :47.84   Mean   :0.1561  
#>  Hunter   :156   3rd Qu.:32.50   3rd Qu.:55.30   3rd Qu.:0.0000  
#>  Miller   :125   Max.   :65.60   Max.   :64.80   Max.   :1.0000  
#>  Taylor   :204                                                   
#>  wine_delivered    wrongpizza          quality   
#>  Min.   :0.0000   Length:1209        low   :229  
#>  1st Qu.:0.0000   Class :character   medium:458  
#>  Median :0.0000   Mode  :character   high  :522  
#>  Mean   :0.1361                                  
#>  3rd Qu.:0.0000                                  
#>  Max.   :1.0000                                  
#>