Generation and Analysis of Breast Tumor Data Using Distance Weighted Discrimination and Subgroup Discovery
Abstract
For the first time in history, mankind has the capacity to capture and analyze biological information at the genetic level. Using this genomic data for disease prognosis and diagnosis is arguably one of the most important applications of this knowledge. However, several problems currently exist that
impede the ability of researchers to analyze such data sets effectively. One issue is the difficulty of obtaining sufficient sample data in this domain. Because collecting and processing biological samples is expensive in both money and time,
most data sets comprise very few samples. Each of these samples, however, typically contains many thousands of genetic
features. This has led some researchers to explore methods for merging data sets from various studies into a single, cohesive set. Given the differing research objectives, the variations in
sampling protocols, and the lack of a universal standard for data curation, however, unifying these data sets is not trivial. Our proposal is threefold. First, we will replicate a study conducted at the University of North Carolina that used an approach called distance weighted discrimination (DWD) to fuse multiple, disparate breast cancer data sets into a single validation set. Second, we will replicate the generation of an intrinsic gene list based upon gene expression rates. Finally, we will perform subgroup discovery on this
validation set and compare the resulting subgroups with the clusters generated by that research.