Do feature engineering on the original dataset and extract new features, generating a new dataset. Since KNN is a nonlinear learner, it performs a nonlinear mapping of the original dataset, making it possible to achieve good classification performance with a simple linear model on the new features, such as GLM or LDA.
Usage

knnExtract(xtr, ytr, xte, k = 1, normalize = NULL, folds = 5, nthread = 1)
Arguments

xtr: matrix containing the training instances.
ytr: factor array with the training labels.
xte: matrix containing the test instances.
k: number of nearest neighbors considered (default is 1). Large values of k may greatly increase the computing time for big datasets.
normalize: variable normalization method, as in fastknn (default is NULL).
folds: number of folds (default is 5) or an array with fold ids between 1 and n identifying what fold each observation is in. The smallest value allowed is folds = 3.
nthread: number of CPU threads to use (default is 1).

Value

A list with the new data:
new.tr: matrix with the new training instances.
new.te: matrix with the new test instances.
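As an illustration, here is a minimal sketch of a call and of accessing the returned list, using synthetic data (the object names and the 70/30 split are arbitrary):

library("fastknn")

set.seed(123)
x <- matrix(rnorm(200), ncol = 2)
y <- factor(rep(c("a", "b"), each = 50))

# 70 training rows, 30 test rows; k = 2 neighbors per class
new.data <- knnExtract(xtr = x[1:70, ], ytr = y[1:70],
                       xte = x[71:100, ], k = 2)

dim(new.data$new.tr)  # 70 rows, k * c = 2 * 2 = 4 new features
dim(new.data$new.te)  # 30 rows, 4 new features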
Details

This feature engineering procedure generates k * c new features, using the distances between each observation and its k nearest neighbors inside each class, where c is the number of class labels. The procedure can be summarized as follows:

1. The first feature contains the distances between each observation and its nearest neighbor inside the first class.
2. The second feature contains the sums of the distances between each observation and its 2 nearest neighbors inside the first class.
3. And so on, up to the k nearest neighbors inside the first class.

Repeat it for each class to generate the k * c new features. For the new training set, an n-fold CV approach is used to avoid overfitting.
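To make the per-class distance features concrete, here is a small base-R sketch of the idea. It is only an illustration under simplifying assumptions (plain Euclidean distance, no CV scheme, k not larger than the smallest class), not fastknn's actual implementation; the function name knn_features_sketch is made up:

knn_features_sketch <- function(xtr, ytr, xnew, k = 3) {
  feats <- lapply(levels(ytr), function(cl) {
    # training observations belonging to class 'cl'
    xcl <- xtr[ytr == cl, , drop = FALSE]
    # squared Euclidean distances from each row of xnew to each row of xcl
    d2 <- outer(rowSums(xnew^2), rowSums(xcl^2), "+") -
      2 * tcrossprod(xnew, xcl)
    d <- sqrt(pmax(d2, 0))
    # feature j = sum of distances to the j nearest neighbors of class 'cl'
    cs <- apply(d, 1, function(di) cumsum(sort(di)[seq_len(k)]))
    # apply() returns a k x n matrix (a plain vector when k = 1), so reshape
    t(matrix(cs, nrow = k))
  })
  # column-bind the per-class blocks: k * c features in total
  do.call(cbind, feats)
}

Applied directly to the training set, such a function would leak label information, since each observation would find itself at distance zero inside its own class; that is why the new training features are built with the n-fold CV scheme described above.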
This procedure is not trivial to implement by hand, but this method provides an easy interface for it and is very fast.
Examples

## Not run: ------------------------------------
# library("mlbench")
# library("caTools")
# library("fastknn")
# library("glmnet")
#
# data("Ionosphere")
#
# x <- data.matrix(subset(Ionosphere, select = -Class))
# y <- Ionosphere$Class
#
# # Remove near zero variance columns
# x <- x[, -c(1, 2)]
#
# set.seed(2048)
# tr.idx <- which(sample.split(Y = y, SplitRatio = 0.7))
# x.tr <- x[tr.idx, ]
# x.te <- x[-tr.idx, ]
# y.tr <- y[tr.idx]
# y.te <- y[-tr.idx]
#
# # GLM with original features
# glm <- glmnet(x = x.tr, y = y.tr, family = "binomial", lambda = 0)
# yhat <- drop(predict(glm, x.te, type = "class"))
# yhat <- factor(yhat, levels = levels(y.tr))
# classLoss(actual = y.te, predicted = yhat)
#
# set.seed(2048)
# new.data <- knnExtract(xtr = x.tr, ytr = y.tr, xte = x.te, k = 3)
#
# # GLM with KNN features
# glm <- glmnet(x = new.data$new.tr, y = y.tr, family = "binomial", lambda = 0)
# yhat <- drop(predict(glm, new.data$new.te, type = "class"))
# yhat <- factor(yhat, levels = levels(y.tr))
# classLoss(actual = y.te, predicted = yhat)
## ---------------------------------------------