21 Regression Analysis

21.1 Regression

We use simulated data, so that we know the underlying function.
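For instance, something along the following lines; the regression function, sample size, and noise level below are illustrative choices, not the ones used in the original.

```r
set.seed(123)                      # for reproducibility
n <- 100
f <- function(x) sin(4 * pi * x)   # the "true" regression function (illustrative)
x <- runif(n)
y <- f(x) + rnorm(n, sd = 0.2)     # noisy observations
plot(x, y, pch = 16, col = "gray")
curve(f, add = TRUE, lwd = 2)
```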

21.1.1 Kernel smoothing

We use the Gaussian kernel (aka heat kernel).
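Here is a minimal sketch of such a kernel estimator, applied to the simulated data above. The helper `kernreg` is ours, and the bandwidth values are chosen purely for illustration.

```r
# Nadaraya-Watson estimator with the Gaussian kernel and bandwidth h
kernreg <- function(x0, x, y, h) {
  sapply(x0, function(u) {
    w <- dnorm((x - u) / h)   # Gaussian weights
    sum(w * y) / sum(w)       # locally weighted average
  })
}

x0 <- seq(0, 1, length.out = 200)
hs <- c(0.02, 0.05, 0.1, 0.2)  # illustrative bandwidths, from small to large
for (i in seq_along(hs))
  lines(x0, kernreg(x0, x, y, hs[i]), col = i + 1)
```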

Notice how the last two estimators are grossly biased near \(x = 1\).

Choice of bandwidth by cross-validation

Based on a visual exploration, a choice of bandwidth around \(h = 0.05\) seems best. But we notice that the estimator is biased, particularly at the extremes of the range, near the boundary. We use cross-validation to choose the bandwidth so as to optimize prediction. Monte Carlo leave-\(k\)-out cross-validation is particularly easy to implement.
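A sketch of such a scheme, reusing the hypothetical `kernreg` above; the bandwidth grid, \(k\), and the number of random splits are illustrative.

```r
# Monte Carlo leave-k-out cross-validation for the bandwidth
cv_bandwidth <- function(x, y, hs, k = 10, B = 100) {
  err <- sapply(hs, function(h) {
    mean(replicate(B, {
      out <- sample(length(x), k)                   # random hold-out set of size k
      pred <- kernreg(x[out], x[-out], y[-out], h)  # fit on the rest, predict held-out
      mean((y[out] - pred)^2)                       # hold-out prediction error
    }))
  })
  hs[which.min(err)]  # bandwidth minimizing the estimated prediction error
}

cv_bandwidth(x, y, hs = seq(0.01, 0.2, by = 0.01))
```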

21.1.2 Local linear regression

(See the manual for an explanation of what the parameter “span” stands for; it essentially controls the number of nearest neighbors used in each local fit.)
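A sketch using the function loess, with degree = 1 for a local linear fit; the span value is illustrative.

```r
fit <- loess(y ~ x, span = 0.3, degree = 1)  # degree = 1: local *linear* fit
x0 <- seq(min(x), max(x), length.out = 200)  # avoid extrapolating beyond the data
lines(x0, predict(fit, data.frame(x = x0)), col = "blue", lwd = 2)
```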

21.1.3 Polynomial regression

We now fit a polynomial of varying degree. This is done using the function lm, which fits linear models (and polynomial models are linear models) according to the least squares criterion.
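For instance, on the same simulated data (the degrees below are illustrative):

```r
x0 <- seq(0, 1, length.out = 200)
for (d in c(2, 5, 10)) {     # illustrative degrees
  fit <- lm(y ~ poly(x, d))  # a polynomial model is a linear model in its coefficients
  lines(x0, predict(fit, data.frame(x = x0)), col = d)
}
```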

21.2 Classification

We consider the following synthetic dataset.
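A sketch of how such a dataset might be generated; the two-dimensional design, the quadratic class boundary, and the label-noise level are all assumptions made for illustration.

```r
set.seed(123)
n <- 100
x1 <- runif(n, -1, 1)
x2 <- runif(n, -1, 1)
lab <- as.numeric(x2 > 2 * x1^2 - 0.5)            # quadratic class boundary (assumed)
cl <- factor(ifelse(runif(n) < 0.05, 1 - lab, lab))  # flip 5% of labels (assumed)
dat <- data.frame(x1, x2, cl)
plot(x1, x2, col = as.integer(cl), pch = 16)
```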

21.2.1 Nearest neighbor classifier

Nearest neighbor majority voting is readily available. (\(k\) below denotes the number of neighbors.)
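For example, via the function knn in the package class, applied to the dataset sketched above; the grid is only there to visualize the decision regions.

```r
library(class)
train <- cbind(x1, x2)
grid <- expand.grid(x1 = seq(-1, 1, length.out = 100),
                    x2 = seq(-1, 1, length.out = 100))
pred <- knn(train, grid, cl, k = 5)            # majority vote among 5 nearest neighbors
plot(grid, col = as.integer(pred), pch = ".")  # decision regions
points(x1, x2, col = as.integer(cl), pch = 16)
```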

The classifier does not seem very sensitive to the choice of the number of nearest neighbors. Nevertheless, we can of course apply cross-validation to choose that parameter. The following implements leave-one-out cross-validation.
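A sketch using knn.cv from the same package, which returns leave-one-out predictions; the range of values for \(k\) is illustrative.

```r
library(class)
ks <- 1:20
err <- sapply(ks, function(k) mean(knn.cv(train, cl, k = k) != cl))  # LOO error rates
ks[which.min(err)]  # selected number of neighbors
```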


The selected value of the parameter (the number of neighbors) varies substantially from simulation to simulation, likely because the sample size is relatively small.

Linear classification

We first apply logistic regression based on a polynomial of degree 2 in the predictor variable(s). The fit is excellent (as expected).
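A sketch using glm with all monomials of degree at most 2 in the two predictors of the dataset sketched above:

```r
fit <- glm(cl ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2),
           family = binomial, data = dat)      # full degree-2 polynomial
prob <- predict(fit, newdata = grid, type = "response")
plot(grid, col = (prob > 0.5) + 1, pch = ".")  # estimated decision regions
points(x1, x2, col = as.integer(cl), pch = 16)
```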

We now turn to support vector machines (with a polynomial kernel of degree 2). The fit is excellent (as expected).
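A sketch using svm from the package e1071 (other implementations would do as well):

```r
library(e1071)
fit <- svm(cl ~ x1 + x2, data = dat, kernel = "polynomial", degree = 2)
pred <- predict(fit, newdata = grid)
plot(grid, col = as.integer(pred), pch = ".")  # estimated decision regions
points(x1, x2, col = as.integer(cl), pch = 16)
```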