Fast k-Nearest Neighbor classifier built upon ANN, a highly efficient C++ library for nearest neighbor searching.

fastknn(xtr, ytr, xte, k, method = "dist", normalize = NULL)

Arguments

xtr
matrix containing the training instances. Rows are observations and columns are variables. Only numeric variables are allowed.
ytr
factor array with the training labels.
xte
matrix containing the test instances.
k
number of neighbors considered.
method
method used to infer the class membership probabilities of the test instances. Choose "dist" (default) to compute probabilities from the inverse of the nearest neighbor distances; this method acts as a shrinkage estimator and generally gives better predictive performance. Choose "vote" to compute probabilities from the frequency of the nearest neighbor labels.
normalize
variable normalization to be applied prior to searching the nearest neighbors. Default is normalize=NULL (no normalization). Normalization is recommended when the variables are not measured in the same units. It can be one of the following (a short sketch of each transformation follows the list):
  • normalize="std": standardizes variables by removing the mean and scaling to unit variance.
  • normalize="minmax": scales each variable to the [0, 1] range.
  • normalize="maxabs": scales each variable by its maximum absolute value. This is the best choice for sparse data because it does not shift/center the variables.
  • normalize="robust": scales variables using statistics that are robust to outliers: it removes the median and scales by the interquartile range (IQR).
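
For intuition, the following sketch shows the transformation behind each option, applied column-wise to a numeric matrix. The helper normalize_cols() is hypothetical and for illustration only; it is not part of the fastknn API.

# Illustrative only: normalize_cols() is a hypothetical helper,
# not a fastknn function.
normalize_cols <- function(x, method = c("std", "minmax", "maxabs", "robust")) {
  method <- match.arg(method)
  apply(x, 2, function(v) {
    switch(method,
      std    = (v - mean(v)) / sd(v),            # mean 0, unit variance
      minmax = (v - min(v)) / (max(v) - min(v)), # range [0, 1]
      maxabs = v / max(abs(v)),                  # no shifting/centering
      robust = (v - median(v)) / IQR(v)          # robust to outliers
    )
  })
}

m <- matrix(rnorm(20, mean = 5, sd = 3), ncol = 2)
range(normalize_cols(m, "minmax"))  # all values within [0, 1]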

Value

list with predictions for the test set:

  • class: factor array of predicted classes.
  • prob: matrix with predicted probabilities.

Details

There are two estimators for the class membership probabilities:

  1. method="vote": The classical estimator based on the label proportions of the nearest neighbors. This estimator can be thought of as a voting rule.
  2. method="dist": A shrinkage estimator based on the distances to the nearest neighbors, so that neighbors closer to the test observation have more influence on the predicted class label. This estimator can be thought of as a weighted voting rule, as the sketch below illustrates. In general, it reduces the log-loss.
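
To make the difference concrete, here is a minimal sketch of both rules for a single test observation, assuming plain inverse-distance weights for method="dist"; the package's exact formula may differ (e.g., it may smooth the weights to avoid division by zero).

# Labels and distances of the k = 3 nearest neighbors (toy values)
nn.labels <- factor(c("good", "good", "bad"))
nn.dist   <- c(0.5, 2.0, 0.1)

# method = "vote": class proportions among the neighbors
prop.table(table(nn.labels))

# method = "dist": weight each neighbor by 1/distance, then normalize
w <- 1 / nn.dist
tapply(w, nn.labels, sum) / sum(w)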

Examples

## Not run: ------------------------------------
library("mlbench")
library("caTools")
library("fastknn")

data("Ionosphere")

x <- data.matrix(subset(Ionosphere, select = -Class))
y <- Ionosphere$Class

set.seed(2048)
tr.idx <- which(sample.split(Y = y, SplitRatio = 0.7))
x.tr <- x[tr.idx,]
x.te <- x[-tr.idx,]
y.tr <- y[tr.idx]
y.te <- y[-tr.idx]

knn.out <- fastknn(xtr = x.tr, ytr = y.tr, xte = x.te, k = 10)

knn.out$class
knn.out$prob
## ---------------------------------------------