Predicting gene essentiality in Caenorhabditis elegans by feature engineering and machine-learning

Comput Struct Biotechnol J. 2020 May 15;18:1093-1102. doi: 10.1016/j.csbj.2020.05.008. eCollection 2020.

Abstract

Defining genes that are essential for life has major implications for understanding critical biological processes and mechanisms. Although essential genes have been identified and characterised experimentally using functional genomic tools, it is challenging to predict with confidence such genes from molecular and phenomic data sets using computational methods. Using extensive data sets available for the model organism Caenorhabditis elegans, we constructed here a machine-learning (ML)-based workflow for the prediction of essential genes on a genome-wide scale. We identified strong predictors for such genes and showed that trained ML models consistently achieve highly accurate classifications. Complementary analyses revealed an association between essential genes and chromosomal location. Our findings reveal that essential genes in C. elegans tend to be located in or near the centre of autosomal chromosomes; are positively correlated with low single nucleotide polymorphism (SNP) densities and epigenetic markers in promoter regions; are involved in protein and nucleotide processing; are transcribed in most cells; are enriched in reproductive tissues or are targets for small RNAs bound to the Argonaute CSR-1. Based on these results, we hypothesise an interplay between epigenetic markers and small RNA pathways in the germline, with transcription-based memory; this hypothesis warrants testing. From a technical perspective, further work is needed to evaluate whether the present ML-based approach will be applicable to other metazoans (including Drosophila melanogaster) for which comprehensive data sets (i.e. genomic, transcriptomic, proteomic, variomic, epigenetic and phenomic) are available.

Keywords: CDS, coding sequence; CRISPR, Clustered Regularly Interspaced Short Palindromic Repeats; Caenorhabditis elegans; ES, Essentiality Score; EST, expressed sequence tag; Essential genes; Essentiality predictions; GBM, Gradient Boosting Method; GFF, general feature format; GLM, Generalised Linear Model; GO, gene ontology; ML, machine-learning; Machine-learning; NN, Artificial Neural Network; PPI, protein-protein interaction; PR-AUC, Area Under the Precision-Recall Curve; RF, Random Forest; RNAi, RNA interference; ROC-AUC, Area Under the Receiver Operating Characteristic Curve; SNP, single nucleotide polymorphism; SPLS, Sparse Partial Least Squares; SVM, Support-Vector Machine; TEA, Tissue Enrichment Analysis tool (WormBase); TSS, transcription start site; VCF, variant call file.

Associated data

  • figshare/10.6084/m9.figshare.11533101