16 One Numerical Sample

The following dataset was published in the paper “Measurement of Treadware of Commercial Tires” published in Rubber Age in 1953. Each tire was subject to tread wear measurements by two methods, the first based on weight loss (WGT) and the second based on groove wear (GRO). Importantly, these measurements are paired by tire. We take the logarithm of the ratio.

16.1 Quantiles and other summary statistics

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.41985 -0.25247 -0.17578 -0.17229 -0.06516  0.02421 
[1] -0.41985385  0.02421426
         0%         20%         40%         60%         80%        100% 
-0.41985385 -0.25593337 -0.18713311 -0.17278632 -0.06082956  0.02421426 
[1] -0.175781
[1] 0.08949068
[1] -0.1722861
[1] 0.1182363

The boxplot is based on the extremes and the quartiles. (In this particular version, no points are labeled as outliers. This is not the default behavior.)

16.4 Bootstrap confidence interval

We now look at a famous dataset collected by Pearson on the heights of adult men and their sons. This time we simply take the difference.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-8.9556 -0.8515  0.9629  0.9970  2.7484 11.2369 

This is a larger dataset, allowing for a more sophisticated analysis. In particular, it is reasonable to compute a bootstrap confidence interval for the mean. Although it appears as easy to implement this by hand, we make a point of using a package. We compute the simplest interval, the percentile bootstrap confidence interval. (Note that the level is approximate, as usual with bootstrap confidence intervals.)

bootstrap percentile 0.9 confidence interval: [ 0.857773691558442 , 1.13548268970315 ]

16.5 Goodness-of-fit testing

For a more complex task, consider testing whether the data was generated from a normal distribution. To illustrate the situation we draw a histogram with the best fitting normal distribution (i.e., with parameters fitted by the method of moments) superimposed. The fit is very good.

More formally, we apply the Lilliefors test, which confirms that the fit is good.

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  height
D = 0.021301, p-value = 0.2767

Instead of a histogram, we can visualize the sample density using a kernel density estimate.

16.6 Kaplan–Meier estimator

Censored data is commonplace in survival analysis. We use the following specialized package, and the lung cancer dataset within it.