Abstract
Motivation
Gene expression data from high-throughput assays, such as microarrays, are often used to
predict cancer survival. However, available datasets consist of a small number of samples (n patients)
and a large number of gene expression measurements (p predictors). The main challenge is therefore
to cope with the high dimensionality, i.e. p >> n, and an appealing recent approach is to use
screening procedures to reduce the size of the feature space to a moderate scale (Wu & Yin 2015,
Song et al. 2014, He et al. 2013). In addition, genes are often co-regulated and their expression
levels are expected to be highly correlated. Genes that are involved in the same biological process
are grouped in pathway structures. In order to incorporate the pathway information of genes,
network-based methods have been applied (Zhang et al. 2013, Sun et al. 2013). Motivated
by the most recent models based on variable screening techniques and integration of pathway
information into penalized Cox methods, we propose a new procedure to obtain more accurate
predictions. First, we identify the high-risk genes by using variable screening techniques; then
we perform Cox regression analysis integrating network information associated with the
selected high-risk genes. By combining these two approaches, we present a new method to select
important core pathways and genes that are related to the survival outcome, and we show the
benefits of our proposal in both simulation and real-data studies.
Methods
In our study, we combine variable screening techniques and network methods to identify
genes and pathways highly associated with the disease and to better predict patient risk. We
propose a new method for survival analysis based on the following steps. First, (i) we perform
variable screening, such as sure independence screening (Fan et al. 2008) and its extensions
(Gorst-Rasmussen & Scheike 2013, Zhao & Li 2012, Fan et al. 2010), to select the active set of
variables strongly correlated with the survival response, and then (ii) we apply network-based
Cox regression models, such as Net-Cox and AdaLnet, which use a network defined over the
selected signature genes to predict survival probability. To build our a priori network
information, we use the human gene functional linkage approach (Huttenhower et al. 2009).
This network contains maps of functional activity and interaction networks in over 200 areas of
human cellular biology, with information from 30,000 genome-scale experiments. The functional
linkage network summarizes information from a variety of biologically informative perspectives:
prediction of protein function and functional modules, cross-talk among biological processes, and
association of novel genes and pathways with known genetic disorders. In particular, our gene
network is built by using the HEFalMp tool to determine the edge weight w between two nodes
(i.e. genes). The resulting network consists of a fixed number of unique genes (about 2000
genes), where w describes how strong the relation between two genes is and takes values in
[0,1]. Hence, while the screening methods recruit the features with the best marginal utility to
reduce the dimensionality of the data, the network incorporates pathway information
as prior knowledge into the survival analysis.
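
The two steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the screening statistic used here is the marginal Cox score test at beta = 0, chosen as a simple stand-in for the cited screening rules, and the penalty is a generic normalized-Laplacian quadratic form of the kind used by network-penalized Cox models. The function names (cox_score_screen, normalized_laplacian, network_penalty) and the argument d (number of genes to keep) are hypothetical, and W stands for a symmetric matrix of edge weights w in [0,1].

```python
import numpy as np


def cox_score_screen(X, time, event, d):
    """Keep the d features with the largest |marginal Cox score statistic|.

    X: (n, p) expression matrix; time: (n,) follow-up times;
    event: (n,) 1 = death observed, 0 = censored.
    Assumption: no tied times (ties are handled only approximately).
    """
    order = np.argsort(-time)                 # longest follow-up first
    Xs = X[order]
    ev = event[order].astype(bool)
    n = Xs.shape[0]
    # After sorting, the risk set of row k is rows 0..k, so cumulative
    # sums give each risk set's mean and variance for every feature.
    size = np.arange(1, n + 1)[:, None]
    risk_mean = np.cumsum(Xs, axis=0) / size
    risk_var = np.cumsum(Xs ** 2, axis=0) / size - risk_mean ** 2
    score = (Xs[ev] - risk_mean[ev]).sum(axis=0)   # score at beta = 0
    info = risk_var[ev].sum(axis=0)                # variance of the score
    z = score / np.sqrt(info + 1e-12)
    return np.argsort(-np.abs(z))[:d]


def normalized_laplacian(W):
    """L = I - D^{-1/2} W D^{-1/2} for a symmetric weight matrix W."""
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = 1.0 / np.sqrt(deg[deg > 0])
    # Isolated genes keep a diagonal of 1, i.e. a plain ridge penalty.
    return np.eye(W.shape[0]) - inv_sqrt[:, None] * W * inv_sqrt[None, :]


def network_penalty(beta, L):
    """Quadratic form beta' L beta, small when linked genes have
    similar degree-scaled coefficients."""
    return float(beta @ L @ beta)
```

In such a sketch, the indices returned by cox_score_screen would be used to subset both the expression matrix and the rows and columns of W before the Laplacian is formed, so that the penalty acts only on the screened genes.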
Results
We combine variable screening procedures and network-penalized Cox models for high-dimensional
survival data, with the aim of determining pathway structures and biomarkers involved in cancer progression.
By using this approach, it is possible to gain deeper insight into the gene-regulatory
networks and investigate the gene signatures related to cancer survival time in order to understand
how patient features (molecular and clinical information) can influence cancer treatment
and detection. In particular, we show the results obtained in simulation and real cancer studies,
along with screening rules. The simulated data are designed to illustrate two different biological
scenarios. In the first setting, we examine the situation where all genes within the same module
belong to different groups or pathways. In the second one, the pathways are not mutually
independent (as in genomic studies); instead, the activation of some groups is conditional on other
pathways. We use specificity, sensitivity and the Matthews correlation coefficient to compare the
prediction performance. We also predict patient survival using molecular data from different cancer
types, such as ovarian and breast cancer. We investigate the set of active signature genes and
the corresponding pathways involved in the disease process. Then, using the biological
network as prior information, we fit network-based Cox models and evaluate them with
Kaplan-Meier curves and log-rank tests. Overall, this study shows that the new screening-network analysis
is useful for improving
Year
2015
IAC authors
Publication type
Other authors
Iuliano A, Occhipinti A, Angelini C, De Feis I, Liò P