Upcoming Events

Biostatistics MA Presentation

The Impact of Missing Covariate Data in High Dimensional Variable Selection: Evidence from a Prostate Cancer Study

Wednesday, Apr. 26, 2017 at 11 a.m.

720 Kimball Tower

Department of Biostatistics

Chi Chen, Biostatistics MA Candidate

Prostate cancer is the most common cancer in American men. Biological research has shown dozens of specific gene types to be associated with prostate cancer. In this paper, we apply penalized logistic regression models with different penalty functions to select the gene types that contribute significantly to prostate cancer, using data from a prostate cancer study. The tuning parameter is chosen by cross-validation, after which the coefficient estimates are obtained. Some genes that biological research has classified as prostate cancer genes have missing values; to include them in the analysis, multiple imputation is used to create complete data sets. We analyze the prostate cancer data by comparing the selection results obtained from completely observed data only with those obtained from imputed data.

Key words: Prostate cancer; variable selection; missing covariate; high dimensional; multiple imputation; cross validation.
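
As a rough illustration of the variable selection step described in the abstract, the sketch below fits a logistic regression with an L1 (lasso) penalty by proximal gradient descent. The lasso is only one possible penalty and is an assumption here, since the abstract does not name the penalty functions used. The function names, step size, iteration count, and toy data are likewise hypothetical. Cross-validation for the tuning parameter and multiple imputation are not shown.

```python
import math


def soft_threshold(z, t):
    """Shrink z toward zero by t; anything within [-t, t] becomes exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def fit_lasso_logistic(X, y, lam, step=0.1, n_iter=2000):
    """L1-penalized logistic regression via proximal gradient descent.

    X is a list of rows, y a list of 0/1 labels, lam the tuning parameter.
    The intercept is not penalized. Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    intercept = 0.0
    beta = [0.0] * p
    for _ in range(n_iter):
        # Residuals (fitted probability minus label) at the current estimate.
        resid = [
            sigmoid(intercept + sum(b * x for b, x in zip(beta, row))) - yi
            for row, yi in zip(X, y)
        ]
        # Plain gradient step for the unpenalized intercept.
        intercept -= step * sum(resid) / n
        # Gradient step followed by soft-thresholding for each coefficient.
        for j in range(p):
            grad_j = sum(r * row[j] for r, row in zip(resid, X)) / n
            beta[j] = soft_threshold(beta[j] - step * grad_j, step * lam)
    return intercept, beta


if __name__ == "__main__":
    # Toy data (hypothetical): column 0 tracks the label, column 1 does not.
    X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
    y = [1, 1, 0, 0]
    intercept, beta = fit_lasso_logistic(X, y, lam=0.1)
    selected = [j for j, b in enumerate(beta) if b != 0.0]
    print("coefficients:", beta)
    print("selected columns:", selected)
```

Coefficients that the penalty shrinks to exactly zero correspond to variables left out of the model. Repeating such a fit on complete cases and on each imputed data set would give the selection results to compare.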