Owing to the continuous nature of the hybridization dates in this data set, the assignment of samples to the five batches is somewhat subjective. The vehicle control samples are used only as references for the ratio-based batch effects removal methods; they are not used during construction of the predictive models. We assign B1, B2 and B3 as the three batches in the training set, and B4 and B5 as the two batches in the test set.

Table 2 Batch information of the Iconix data set

An additional toxicogenomic data set (Hamner) was provided by The Hamner Institutes for Health Sciences (Research Triangle Park, NC, USA). Thomas et al.12 carried out analyses using a subset of this data set hybridized in 2005 and 2006, aimed at distinguishing samples treated with chemicals that are lung carcinogens from samples treated with chemicals that are not.
In the MAQC-II study,5 the training set consists of 70 samples hybridized in two consecutive years (2005 and 2006), and the test set contains 88 samples hybridized in the following two years (2007 and 2008). Table 3 shows the sample size distribution within each batch (year). Following the convention of MAQC-II, control and non-lung tumor samples are combined as the negative class, and lung tumor samples are used as the positive class. Unlike in the Iconix data set, the control samples in the Hamner data set were used not only as references for the ratio-based batch effects removal methods, but also as part of the training and test sets. This keeps the sample sizes adequate for analysis, although it introduces minor information leakage, because the controls serve as references before the predictive models are constructed.
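The Hamner setup described above (class labels, the split by hybridization year, and the use of control samples as per-batch references) can be illustrated with a minimal sketch. It assumes log2-scale expression values, so that a "ratio" to the reference becomes a subtraction of the per-batch mean of the control samples; this is one common form of ratio-based correction and is not necessarily the exact variant used in the study. The sample records, values, and function names (`class_label`, `ratio_correct`) are hypothetical.

```python
from statistics import mean

# Hypothetical records: (sample_id, year, group, log2 expression per gene).
# All values are invented for illustration only.
samples = [
    ("s1", 2005, "control",        [8.0, 5.0]),
    ("s2", 2005, "lung_tumor",     [9.0, 6.0]),
    ("s3", 2006, "control",        [8.5, 5.5]),
    ("s4", 2006, "non_lung_tumor", [8.7, 5.4]),
    ("s5", 2007, "control",        [7.0, 4.0]),
    ("s6", 2007, "lung_tumor",     [8.2, 5.1]),
    ("s7", 2008, "control",        [7.5, 4.5]),
    ("s8", 2008, "non_lung_tumor", [7.4, 4.7]),
]

TRAIN_YEARS = {2005, 2006}  # batches used for training
TEST_YEARS = {2007, 2008}   # batches used for testing


def class_label(group):
    """Lung tumor is positive (1); control and non-lung tumor are negative (0)."""
    return 1 if group == "lung_tumor" else 0


def ratio_correct(records):
    """Per batch (year) and per gene, subtract the mean log2 value of the controls.

    Assumption: on the log2 scale, dividing by the reference mean is
    expressed as a subtraction.
    """
    corrected = []
    for year in sorted({r[1] for r in records}):
        batch = [r for r in records if r[1] == year]
        controls = [r[3] for r in batch if r[2] == "control"]
        if not controls:
            raise ValueError(f"batch {year} has no control samples")
        reference = [mean(gene_values) for gene_values in zip(*controls)]
        for sample_id, yr, group, values in batch:
            adjusted = [v - ref for v, ref in zip(values, reference)]
            corrected.append((sample_id, yr, group, adjusted))
    return corrected


corrected = ratio_correct(samples)

# Controls stay in the training and test sets, as in the Hamner analysis.
train = [(r[0], r[3], class_label(r[2])) for r in corrected if r[1] in TRAIN_YEARS]
test = [(r[0], r[3], class_label(r[2])) for r in corrected if r[1] in TEST_YEARS]
```

Because each batch is corrected using only its own controls, the correction itself does not mix training and test years. The leakage noted in the text arises because the same control samples that define the reference also appear as negative-class samples.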
Table 3 Batch information of the Hamner data set

A Necrosis data set was provided by the National Institute of Environmental Health Sciences (NIEHS) of the National Institutes of Health (Research Triangle Park, NC, USA).13 The study objective in MAQC-II was to use microarray gene expression data acquired from the liver of rats exposed to hepatotoxicants to build classifiers for prediction of liver necrosis. This data set was generated using different microarray platforms and tissues, which allowed us to compare three types of batch (group) effects removal:

Cross-platform: To study whether liver samples profiled on the Agilent platform can be used to predict liver necrosis of liver samples profiled on the Affymetrix platform, and vice versa.14

Cross-tissue: To study whether blood samples profiled on the Agilent platform can be used to predict liver necrosis of liver samples profiled on the Agilent platform, and vice versa.15

Cross-tissue-and-cross-platform: To study whether blood samples profiled on the Agilent platform can be used to predict liver necrosis of liver samples profiled on the Affymetrix platform, and vice versa.
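The three comparisons above each run in two directions ("and vice versa"), giving six train/test configurations. The following sketch enumerates them. The `Group` type and the helper names `train_test_pairs` and `differing_factors` are hypothetical and serve only to make the design explicit; they do not come from the study.

```python
from collections import namedtuple

# A sample group is identified by the tissue profiled and the array platform.
Group = namedtuple("Group", ["tissue", "platform"])

liver_agilent = Group("liver", "Agilent")
liver_affymetrix = Group("liver", "Affymetrix")
blood_agilent = Group("blood", "Agilent")

# Each comparison pairs two groups, as listed in the text.
comparisons = {
    "cross-platform": (liver_agilent, liver_affymetrix),
    "cross-tissue": (blood_agilent, liver_agilent),
    "cross-tissue-and-cross-platform": (blood_agilent, liver_affymetrix),
}


def train_test_pairs(comparisons):
    """Return (name, train_group, test_group) for both directions of each comparison."""
    pairs = []
    for name, (first, second) in comparisons.items():
        pairs.append((name, first, second))
        pairs.append((name, second, first))  # the "vice versa" direction
    return pairs


def differing_factors(a, b):
    """List which factors (tissue, platform) differ between two groups."""
    return [field for field in Group._fields if getattr(a, field) != getattr(b, field)]


for name, train_group, test_group in train_test_pairs(comparisons):
    factors = ", ".join(differing_factors(train_group, test_group))
    print(f"{name}: train on {train_group}, test on {test_group} (differs in {factors})")
```

Listing the differing factors makes it easy to check that each configuration isolates the intended batch (group) effect: platform only, tissue only, or both.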