Selecting and Combining Biomarkers (or Tests)
This SAS macro selects and combines multiple continuous biomarker candidates to predict a binary outcome, e.g., diseased versus non-diseased. It conducts the following analyses in one package: 1) select individual markers using partial area under the ROC curve, specificity or sensitivity, or both; 2) combine the selected markers using one of three options: forward logistic regression, boosting logistic regression (Real AdaBoost), or boosting tree (Discrete AdaBoost); 3) cross-validation is built in from the first step, i.e., from the biomarker selection stage. Model building stops when the cross-validation total classification error starts to increase. There is also an option to give sensitivity and specificity equal weight in the classification error calculation. This is useful when the case:control ratio is not 1 and the analyst does not want the larger group to drive model selection.
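The difference between total classification error and the equal-weight version can be illustrated with a small Python sketch. This is not the macro's SAS code; the function name `classification_errors` and the 1 = case, 0 = control coding are assumptions made for the example, and the equal-weight error is taken here to be one minus the average of sensitivity and specificity.

```python
def classification_errors(y_true, y_pred):
    """Return (total_error, equal_weight_error) for labels coded 1 = case, 0 = control."""
    # Predictions split by true class
    cases = [p for t, p in zip(y_true, y_pred) if t == 1]
    controls = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    sensitivity = sum(1 for p in cases if p == 1) / len(cases)
    specificity = sum(1 for p in controls if p == 0) / len(controls)

    # Total error: every subject counts equally, so the larger group dominates
    total_error = sum(1 for t, p in zip(y_true, y_pred) if t != p) / len(y_true)
    # Equal-weight error: each group contributes half, whatever its size
    equal_weight_error = 1 - (sensitivity + specificity) / 2
    return total_error, equal_weight_error


# Illustrative data: 2 cases, 8 controls, and a classifier that calls everyone a control
y_true = [1, 1] + [0] * 8
y_pred = [0] * 10
total, equal_weight = classification_errors(y_true, y_pred)
print(total, equal_weight)
```

With a 2:8 case:control ratio, the classifier that misses every case has a total error of only 0.2 but an equal-weight error of 0.5, which is why the equal-weight option matters for unbalanced data.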
Boosting is included because of its reputation for resistance to overfitting, a desirable feature for high-dimensional data analysis. However, we have found that boosting can still overfit the data. That is why cross-validation is important in model selection and assessment.
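The stopping rule described above (stop when the cross-validation error starts to increase) can be sketched in Python as follows. This is a minimal illustration, not the macro's implementation: the function name `stopping_step` and the error values are hypothetical, and details such as how ties or small fluctuations are handled are assumptions, since they are not specified here.

```python
def stopping_step(cv_errors):
    """Return how many model-building steps to keep.

    cv_errors[k] is the cross-validation classification error after step k + 1
    (e.g., after adding the (k + 1)-th marker or boosting iteration).
    Building stops at the first step whose error exceeds the previous one.
    """
    if not cv_errors:
        raise ValueError("cv_errors must not be empty")
    for step in range(1, len(cv_errors)):
        # Assumption: a strict increase triggers the stop; ties continue
        if cv_errors[step] > cv_errors[step - 1]:
            return step
    return len(cv_errors)


# Hypothetical cross-validation errors for models of increasing size
cv_errors = [0.30, 0.24, 0.21, 0.23, 0.20]
kept = stopping_step(cv_errors)
print(kept)
```

In this example the error first rises at the fourth step, so three steps are kept. The lower error at the fifth step is never reached, which is a consequence of stopping at the first increase rather than searching for the overall minimum.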
A description of the procedure can be found in chapter 18 of the book Informatics in Proteomics, Srivastava S (Ed.), Marcel Dekker Inc., New York, 2005. A copy of the chapter, "Statistical design and analytical strategies for discovery of disease specific protein patterns," can be found here. The SAS code is here.