Prostate cancer is the most common cancer in American men. Biological research has shown that dozens of specific gene types are associated with prostate cancer. In this paper, we apply penalized logistic regression models with different penalty functions to select the gene types that contribute significantly to prostate cancer, based on data from a prostate cancer study. The tuning parameter is determined by cross validation, and the coefficient estimates are then obtained at the selected value. To take into account specific genes that biological research has classified as prostate cancer genes but that have missing values, multiple imputation is adopted to create complete data sets. We analyze the prostate cancer data by comparing the selection results based on the completely observed data only with the results based on the imputed data.
Key words: Prostate cancer; variable selection; missing covariate; high dimensional; multiple imputation; cross validation.