Combining pathway identification and survival prediction via screening-network analysis

Motivation: Gene expression data from high-throughput assays, such as microarrays, are often used to predict cancer survival. However, available datasets consist of a small number of samples (n patients) and a large number of gene expression measurements (p predictors). The main challenge is therefore to cope with high dimensionality, i.e. p >> n, and an appealing recent approach is to use screening procedures to reduce the feature space to a moderate size (Wu & Yin 2015, Song et al. 2014, He et al. 2013). In addition, genes are often co-regulated, so their expression levels are expected to be highly correlated. Genes involved in the same biological process are grouped into pathways, and network-based methods have been applied to incorporate this pathway information (Zhang et al. 2013, Sun et al. 2013). Motivated by recent models based on variable screening and on the integration of pathway information into penalized Cox methods, we propose a new procedure to obtain more accurate predictions. First, we identify the high-risk genes using variable screening techniques; then we perform Cox regression analysis that integrates the network information associated with the selected high-risk genes. By combining these two approaches, we present a new method to select important core pathways and genes related to the survival outcome, and we show the benefits of our proposal in both simulation and real-data studies.

Methods: In our study, we combine variable screening techniques and network methods to identify genes and pathways highly associated with the disease and to better predict patient risk. The proposed method for survival analysis consists of two steps: (i) we perform variable screening, such as sure independence screening (Fan et al. 2008) and its extensions (Gorst-Rasmussen & Scheike 2013, Zhao & Li 2012, Fan et al. 2010), to select the active set of variables strongly correlated with the survival response; (ii) we apply network-based Cox regression models, such as Net-Cox and AdaLnet, which use a network defined over the selected signature genes to predict survival probability. To build our a priori network information, we use the human gene functional linkage approach (Huttenhower et al. 2009). This network contains maps of functional activity and interaction networks in over 200 areas of human cellular biology, with information from 30,000 genome-scale experiments. The functional linkage network summarizes information from a variety of biologically informative perspectives: prediction of protein function and functional modules, cross-talk among biological processes, and association of novel genes and pathways with known genetic disorders. In particular, our gene network is built using the HEFalMp tool to determine the weight w of the edge between two nodes (i.e. genes). The resulting network consists of a fixed number of unique genes (about 2000), where w takes values in [0,1] and describes how strong the relation between two genes is. Hence, while the screening methods recruit the features with the best marginal utility to reduce the dimensionality of the data, the network brings pathway information into the survival analysis as prior knowledge.

Results: We combine variable screening procedures and network-penalized Cox models for high-dimensional survival data in order to determine pathway structures and biomarkers involved in cancer progression. This approach gives deeper insight into gene-regulatory networks and allows us to investigate the gene signatures related to cancer survival time, in order to understand how patient features (molecular and clinical information) can influence cancer treatment and detection.
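As an illustration of the two steps described under Methods, the following Python sketch first ranks genes by a marginal survival statistic and then builds the graph-Laplacian quadratic form that a network-penalized Cox model would add to its objective. It is a minimal sketch under stated assumptions, not the procedure used in the study: the screening statistic is the standardized Cox score at beta = 0 (a simple stand-in for the marginal utilities of the cited screening methods), event times are assumed to have no ties, and the function names cox_score_screen, restrict_network and laplacian_penalty are hypothetical. The exact normalisation of the penalty in Net-Cox or AdaLnet may differ.

```python
import numpy as np


def cox_score_screen(X, time, event, k):
    """Return indices of the k features with the largest absolute
    standardized Cox score statistic at beta = 0 (assumes no tied times)."""
    X = np.asarray(X, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)

    # Sort by decreasing time so the risk set of row i is rows 0..i.
    order = np.argsort(-time)
    X, event = X[order], event[order]

    size = np.arange(1, X.shape[0] + 1)[:, None]         # risk-set sizes
    mean = np.cumsum(X, axis=0) / size                   # risk-set means
    var = np.cumsum(X ** 2, axis=0) / size - mean ** 2   # risk-set variances

    # Score and information are sums over observed events only.
    score = (X[event] - mean[event]).sum(axis=0)
    info = np.maximum(var[event].sum(axis=0), 1e-12)
    z = score / np.sqrt(info)
    return np.argsort(-np.abs(z))[:k]


def restrict_network(W, selected):
    """Keep only the edges of W between the selected genes."""
    W = np.asarray(W, dtype=float)
    return W[np.ix_(selected, selected)]


def laplacian_penalty(beta, W):
    """Quadratic form beta' (D - W) beta for a symmetric weight matrix W,
    equal to 0.5 * sum_ij w_ij * (beta_i - beta_j)**2."""
    W = np.asarray(W, dtype=float)
    beta = np.asarray(beta, dtype=float)
    L = np.diag(W.sum(axis=1)) - W  # unnormalized graph Laplacian
    return float(beta @ L @ beta)
```

In this sketch, the selected indices would be passed to restrict_network together with a symmetric matrix of edge weights w in [0,1], and the resulting penalty would be added, with a tuning parameter, to the negative Cox partial log-likelihood. The penalty is small when strongly linked genes receive similar coefficients, which is how the prior network encourages pathway-level selection.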
In particular, we show the results obtained in simulation and real cancer studies, along with screening rules. The simulated data illustrate two different biological scenarios. In the first setting, we examine the situation where all genes within the same module belong to different groups or pathways. In the second, the pathways are not mutually independent (as in genomic studies): the activation of some groups is conditional on other pathways. We use specificity, sensitivity and the Matthews correlation coefficient to compare performance. We also predict patient survival using molecular data from different cancer types, such as ovarian and breast cancer. We investigate the set of active signature genes and the corresponding pathways involved in the disease process. Then, using the biological network as prior information, we fit network-based Cox models and assess them with Kaplan-Meier curves and the log-rank test. Overall, this study shows that the new screening-network analysis is useful for improving
Other authors
Iuliano A, Occhipinti A, Angelini C, De Feis I, Liò P