Graph Random Forest: a Graph Embedded Algorithm for Identifying Highly Connected Important Features

EasyChair Preprint 8913
12 pages • Date: October 3, 2022

Abstract

Random Forest (RF) is a widely used machine learning method that performs well on classification and regression tasks. It can be trained on overparameterized datasets, which benefits applications in biology. For example, gene expression data typically has far more features $(p)$ than samples $(n)$. Although RF achieves high predictive accuracy, selecting important genes from a large number of features remains problematic. The important genes selected by RF are usually scattered across the gene network, which conflicts with the biological assumption that effective features are connected. To better apply random forests in biology when external topological information between features is available, we propose the Graph Random Forest (GRF), which identifies highly connected important features by incorporating an interaction network when constructing the forest. The algorithm can identify effective features that form a highly connected sub-graph while achieving classification accuracy equivalent to that of RF. To evaluate the proposed method, we conducted simulation experiments and applied the method to two real datasets: non-small cell lung cancer RNA-seq data from The Cancer Genome Atlas and a human embryonic stem cell RNA-seq dataset (GSE93593). The resulting high classification accuracy, the connectivity of the selected sub-graph, and the interpretable feature selection results suggest that the method is a helpful addition to graph-based classification models and feature selection procedures.

Keyphrases: Random Forest, feature selection, gene network