Math 284

Survival Analysis
Spring 2020


                               ********** Announcements ************


·    To join the Zoom lecture (MWF at noon), please use the link provided in Canvas Announcements; DO NOT click on Zoom in Canvas.




Overview: Survival is often the ‘ultimate’ outcome in many critical areas of disease research, such as cancer, as well as in recently emerging medical AI.  This course discusses the concepts, theory, and applications associated with censored and truncated survival data.  Topics include likelihood for right-censored and left-truncated data, nonparametric estimation of survival distributions, comparison of survival distributions, proportional hazards regression, semiparametric theory and, as time permits, extended topics on complex survival data such as competing risks.


Important Note: You are strongly encouraged to attend lectures and take notes. You are also strongly encouraged to take advantage of the office hours to discuss any questions/problems that you have. Note that you can make appointments for office hours!


Lecture:  MWF 12:00-12:50pm



Instructor: Ronghui (Lily) Xu

Office:  APM 5856

Phone:  534-6380

Office Hours:

 By appointment.



Teaching Assistant:  Denise Rava



Reference books:


1.   Cox and Oakes, Analysis of Survival Data, Chapman & Hall, 1984

2.   Fleming and Harrington, Counting Processes and Survival Analysis, Wiley, 1991

3.   O'Quigley, Proportional Hazards Regression, Springer, 2008

4. Kalbfleisch and Prentice, The Statistical Analysis of Failure Time Data, Wiley, 1st or 2nd ed.

5. Bickel, Klaassen, Ritov and Wellner, Efficient and Adaptive Estimation for Semiparametric Models. Springer, 1998.


Not a reference, but a fun read: Gladwell, “David and Goliath”, which tells the story of the Freireich (1963) leukemia survival data that D.R. Cox used and that we also use.


Topics: (future topics are subject to update)


Week 1:  Introduction to survival analysis; right-censored and left-truncated data; Kaplan-Meier estimate of survival.

Week 2:  Log-rank test of two-sample survival; weighted log-rank tests and efficiency; counting processes.

Week 3:  Parametric survival distributions; likelihood for right-censored and left-truncated data; Cox proportional hazards regression model.

Week 4:  Partial likelihood inference; predicting survival under the Cox model; time-dependent covariates.

Week 5:  Martingale theory; stratified Cox model; goodness-of-fit methods.

Week 6:  Case study; model selection - stepwise, explained variation, information criteria, penalized log-likelihood.

Week 7:  Design of a survival study; other survival models.

Week 8:  Additive hazards model; semiparametric efficiency; competing risks.

Week 9:  Multivariate survival; causal inference.



Reference papers:


1.    [Introduction] Efron, B. and Hinkley, D.V. (1978) Assessing the accuracy of the maximum likelihood estimator: observed versus expected Fisher information. Biometrika, 65, 457-487.


2.    Tsiatis AA. A nonidentifiability aspect of the problem of competing risks. Proceedings of the National Academy of Sciences USA, 1975; 72: 20-22.

3.    Cox DR. (1969) Some sampling problems in technology. In: New Developments in Survey Sampling, Ed. Johnson and Smith. Wiley.

4.    Vardi Y. Multiplicative censoring, renewal processes, deconvolution and decreasing density: Nonparametric estimation. Biometrika, 1989; 76: 751-61.


5.    Tsai, Jewell and Wang, A note on the product-limit estimator under right censoring and left truncation. Biometrika, 1987; 74: 883-6.

6.    Wang M-C. Nonparametric estimation of cross-sectional survival data. JASA, 1991; 86: 130-143.

7.    Wang M-C. A semiparametric model for randomly truncated data. JASA, 1989; 84: 742-748.

8.    Struthers and Farewell. A mixture model for time to AIDS data with left truncation and an uncertain origin. Biometrika, 1989; 76: 814-7.

9.    Asgharian M, M’Lan CE, Wolfson DB. Length-biased sampling with right censoring: an unconditional approach. J Amer Stat Assoc (JASA) 2002, 97: 201-209.


10. Harrington DP, Fleming TR. A class of rank test procedures for censored survival data. Biometrika, 1982; 69(3): 553-566.


11. Reid N. A conversation with Sir David Cox. Statistical Science, 1994; 9: p449-450 (about the Cox model).

12. Thomsen and Keiding. A note on the calculation of expected survival.  Statistics in Medicine, 1991; vol. 10, p. 733-738.

13. Xu R and O’Quigley J. Proportional hazards estimate of the conditional survival function. Journal of the Royal Statistical Society, Series B, 2000; vol.62, p. 667-680.

14. Xu R, Luo Y, Chambers, CD. Assessing the effect of vaccine on spontaneous abortion using time-dependent covariates Cox models. Pharmacoepidemiology and Drug Safety, 2012; 21(8): 844-50; doi: 10.1002/pds.3301.

15. O’Quigley J and Pessione F. The problem of a covariate-time qualitative interaction in a survival study. Biometrics, 1991; 47: 101-115.

16. Xu R, Adak S. Survival analysis with time-varying regression effects using a tree-based approach. Biometrics, 2002; 58: 305-315.


17. Gill R. Understanding Cox’s regression model: a martingale approach. J Amer Stat Assoc (JASA). 1984; 79: 441-447.

18. Andersen PK and Gill RD. Cox’s regression model for counting processes: a large sample theory. The Annals of Statistics, 1982; 10: 1100-1120.


19. Lin et al. Checking the Cox model with cumulative sums of martingale-based residuals.  Biometrika, 1993; vol. 80, p. 557-572.

20. Xu R, O’Quigley J. Estimating average regression effect under non-proportional hazards. Biostatistics, 2000; 1: 423-439.

21. Xu R, Harrington DP. A semiparametric estimate of treatment effects with censored data. Biometrics, 2001; 57:875-885.


22. Loftus JR and Taylor JE. A significance test for forward stepwise model selection.

23. Akaike H (1973). Information theory and an extension of the maximum likelihood principle. In: Breakthroughs in Statistics, 1992, vol.1, p.610-24. Springer, New York.

24. Xu, Vaida and Harrington.  Using profile likelihood for semiparametric model selection with application to proportional hazards mixed models. Statistica Sinica, 2009; 19: 819-842.

25. Volinsky, CT and Raftery, AE. Bayesian information criterion for censored survival models. Biometrics, 2000; 56: 256-262.

26. Harezlak et al. Variable selection in regression – estimation, prediction, sparsity, inference. In Li and Xu (ed) ‘High-Dimensional Data Analysis in Cancer Research’. Springer, 2009. (available via elink)

27. Tibshirani, R. The lasso method for variable selection in the Cox model. Statistics in medicine, 1997; 16(4): 385-395.

28. Huang J and Harrington D.  Penalized partial likelihood regression for right-censored data with bootstrap selection of the penalty parameter.  Biometrics, 2002; 58: 781-791.

29. Fan J, Li R.  Variable selection for Cox’s proportional hazards model and frailty model.  Annals of Statistics, 2002; 30(1): 74-99.

30. Bradic J, Fan J, Jiang J. Regularization for Cox's Proportional Hazards Model with NP-Dimensionality. Annals of Statistics, 2011; 39(6): 3092-3120.


31. Kent J. Information gain and a general measure of correlation. Biometrika, 1983; 70: 163-173.

32. O’Quigley J, Xu R, Stare J. Explained randomness in proportional hazards models. Statistics in Medicine, 2005; 24: 479-489.


33. Xu R, Chambers C. A sample size calculation for spontaneous abortion in observational studies. Reproductive Toxicology, 2011; 32: 490-493.


34. Gray RJ. Flexible methods for analyzing survival data using splines, with application to breast cancer prognosis. JASA, 1992; 87: 942-951.

35. Chan P, Xu R, Chambers C. A study of R-squared measure under the accelerated failure time models. Communications in Statistics – Simulation and Computation, 2018, 47(2): 380-391.

36. Struthers CA, Kalbfleisch JD. Misspecified proportional hazards models. Biometrika, 1986; 73: 363-369.

37. Lagakos SW, Schoenfeld DA. Properties of proportional-hazards score tests under misspecified regression models. Biometrics, 1984; 40: 1037-1048.

38. Chastang C, Byar D, Piantadosi S. A quantitative study of the bias in estimating the treatment effect caused by omitting a balanced covariate in survival model. Statistics in Medicine, 1988; 7: 1243-1255.


39. Murphy SA, van der Vaart AW. On profile likelihood (with discussion). JASA. 2000; 95: 449-485.

40. Maples JJ, Murphy SA, Axinn WG. Two-level proportional hazards models. Biometrics, 2002; 58: 754-763.

41. Newey WK. Semiparametric efficiency bounds. J Applied Econometrics, 1990; 5(2): 99-135.


42. Li X, Xu R. Empirical and kernel estimation of covariate distribution conditional on survival time. Computational Statistics and Data Analysis. 2006; 50(12): 3629-3643.


43. Strandberg E, Lin X, Xu R. Estimation of main effect when covariates have non-proportional hazards. Communications in Statistics – Simulation and Computation, 2014, 43(7): 1760-1770.


44. Prentice RL.  On non-parametric maximum likelihood estimation of the bivariate survivor function. Statistics in Medicine, 1999; 18: 2517-2527.

45. Wei LJ, Lin DY, Weissfeld L.  Failure time data by modeling marginal distributions. JASA 1989; 84: 1065-1073.

46. Morris CN. Parametric empirical Bayes inference: theory and applications (with discussion). JASA, 1983; 78: 47-65.

47. Vaida F, Xu R. Proportional hazards model with random effects. Statistics in Medicine, 2000; 19: 3309-3324.

48. Gamst A, Donohue M, Xu R.  Asymptotic properties and empirical evaluation of the NPMLE in the proportional hazards mixed-effects model. Statistica Sinica, 2009; 19: 997-1011.

49. Louis TA.  Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society B, 1982; 44(2): 226-233.

50. Ripatti S, Palmgren J.  Estimation of multivariate frailty models using penalized partial likelihood. Biometrics, 2000; 56: 1016-1022.

51. Murphy SA.  Consistency in a proportional hazards model incorporating a random effect. Annals of Statistics, 1994; 22(2): 712-731.


52. Hou J et al. High-dimensional variable selection and prediction under competing risks with application to SEER-Medicare linked data. Statistics in Medicine, 2018; 37(24): 3486-3502.

53. Hernan MA, Brumback B, Robins JM. Marginal structural models to estimate the joint causal effect of nonrandomized treatments. JASA, 2001; 96(454): 440-448.




Homework:  You may discuss the problems with others, but please write up your solutions independently. Write your solutions, answers and results in your own words and in complete sentences, clearly laying out your setup, background, etc., in the main part, and append your program code at the back; everything needs to be turned in. Two students turning in exactly the same solutions may be considered plagiarism.



HW1 (35%, due on Gradescope by 11:59pm on Sunday 4/26):

1)   Answer the questions in the study points file, up to and including Topic 4 (10%, see Canvas);

2)   Exercise on page 18 of Topic 2 lecture notes;

3)   a) Simulate a clinical trial data set by taking sample size n=100, T from the standard Exponential (1) distribution, and C from Uniform (0, c) distribution. Choose a value for c so that about 30% of the data are censored. Plot the KM curve for the survival function S(t) of T and its pointwise 95% confidence intervals (CI). What is the estimated median follow-up time?

b) Now focus on estimating S(0.3) from the above distribution. Repeat the simulation of part a) 1000 times, summarize in a table the bias, standard error (SE), standard deviation (SD) of the estimates from the 1000 repeats, and coverage probability (CP) of the 95% confidence intervals.

4)   Exercise on page 38 of Topic 3 lecture notes (latest version);

5)   Exercise on page 18 of Topic 4 lecture notes.
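As a starting point for the simulation in problem 3a), here is a minimal sketch, not an official solution. It assumes the censoring proportion is P(C < T) = (1 - exp(-c))/c for T ~ Exponential(1) and C ~ Uniform(0, c), which follows from integrating exp(-u)/c over (0, c), and solves for c by bisection. The function names, the seed and the bare-bones Kaplan-Meier routine are illustrative assumptions; confidence intervals and the median follow-up time are left out.

```python
import math
import random


def censoring_fraction(c):
    """P(C < T) for T ~ Exp(1), C ~ Uniform(0, c): (1 - exp(-c)) / c."""
    return -math.expm1(-c) / c


def solve_c(target=0.30, lo=1e-6, hi=50.0, tol=1e-10):
    """Bisection for c; censoring_fraction is decreasing in c."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if censoring_fraction(mid) > target:
            lo = mid  # too much censoring: widen the censoring window
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate(n, c, rng):
    """Return a list of (observed time, event indicator) pairs."""
    data = []
    for _ in range(n):
        t = rng.expovariate(1.0)      # failure time T
        cens = rng.uniform(0.0, c)    # censoring time C
        data.append((min(t, cens), 1 if t <= cens else 0))
    return data


def kaplan_meier(data):
    """Return [(t, S_hat(t))] at the distinct observed event times."""
    event_times = sorted({x for x, d in data if d == 1})
    curve, s = [], 1.0
    for t in event_times:
        at_risk = sum(1 for x, _ in data if x >= t)
        deaths = sum(1 for x, d in data if x == t and d == 1)
        s *= 1.0 - deaths / at_risk
        curve.append((t, s))
    return curve


c = solve_c(0.30)
rng = random.Random(284)  # arbitrary seed, for reproducibility only
data = simulate(100, c, rng)
observed_censoring = 1 - sum(d for _, d in data) / len(data)
km = kaplan_meier(data)
print(f"c = {c:.4f}, observed censoring = {observed_censoring:.2f}")
```

With n=100 the observed censoring proportion will fluctuate around 30% from one simulated data set to the next; only the expected proportion is fixed by the choice of c.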




HW2 (35%, due on Gradescope by 11:59pm on Sunday 5/17):

1)   Answer the questions in the study points file, up to and including design of a survival study (10%, see Canvas);

2)   Consider the data set ‘lymphoma.prognosis’ at

Find the 5 binary covariates used in Xu and Adak (2002, which were identified in the original Shipp et al. paper).

a)    Carry out log-rank tests to compare the survival of the two groups formed according to each covariate (e.g. Age > 60 vs. otherwise, etc.), and plot the corresponding KM curves; also plot the log-log of the KM curves to check the PH assumption.

b)   Fit a Cox model with all 5 covariates, and continue with parts c) and d) below;

c)    Use the cumulative martingale-based residuals to check the proportional hazards assumption for each covariate in the model;

d)   Compute one of the R-squared measures that we talked about.

e)    Fit a non-PH model with piecewise constant regression effects for all 5 covariates, where the ‘pieces’ are 3 intervals with approximately equal number of events each.

3)   Do the exercise on page 6 of martingale theory notes, as well as the verification on page 10.
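For part 2e), one possible way to form 3 intervals with approximately equal numbers of events is to cut at the tertiles of the observed event times. The sketch below illustrates only that step, on made-up toy data; the function name and the data are hypothetical and are not taken from the lymphoma data set.

```python
def event_cutpoints(times, events, pieces=3):
    """Cut points that split the observed event times into `pieces` groups
    of roughly equal size; censored observations are ignored."""
    event_times = sorted(t for t, d in zip(times, events) if d == 1)
    k = len(event_times)
    if k < pieces:
        raise ValueError("fewer events than intervals")
    # The j-th cut point is the event time closing the j-th group.
    return [event_times[(j * k) // pieces - 1] for j in range(1, pieces)]


# Toy data (hypothetical): 12 subjects, 9 events, 3 censored.
times = [2, 3, 5, 7, 8, 11, 13, 14, 17, 19, 23, 29]
events = [1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
cuts = event_cutpoints(times, events)
print(cuts)
```

The resulting intervals are (0, cuts[0]], (cuts[0], cuts[1]] and (cuts[1], infinity); the regression effects would then be allowed to differ across these intervals.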




Papers for final presentation:  


 # 20[group 2], 52[group 3], 27[group 4], 34[group 5], 28[group 6], 53[group 1]


Grading: 70% Homework + 30% Final paper presentation