FexSplice: A LightGBM-Based Model for Predicting the Splicing Effect of a Single Nucleotide Variant Affecting the First Nucleotide G of an Exon

Genes (Basel). 2023 Sep 6;14(9):1765. doi: 10.3390/genes14091765.

Abstract

Single nucleotide variants (SNVs) affecting the first nucleotide G of an exon (Fex-SNVs) identified in various diseases are mostly recognized as missense or nonsense variants. Their effect on pre-mRNA splicing has seldom been analyzed, and no curated database is available. We previously reported that Fex-SNVs affect splicing when the polypyrimidine tract is short or degenerate. However, we cannot readily predict the splicing effects of Fex-SNVs. Here, we scrutinized the available literature and identified 106 splicing-affecting Fex-SNVs based on experimental evidence. We similarly identified 106 neutral Fex-SNVs in the dbSNP database with a global minor allele frequency (MAF) of more than 0.01 and less than 0.50. We extracted 115 features representing the strength of splicing cis-elements and developed machine-learning models with support vector machine, random forest, and gradient boosting to discriminate splicing-affecting and neutral Fex-SNVs. Gradient boosting-based LightGBM outperformed the other two models, and the length and nucleotide composition of the polypyrimidine tract played critical roles in the discrimination. Recursive feature elimination showed that the LightGBM model using 15 features achieved the best performance in 10-fold cross-validation, with an accuracy of 0.80 ± 0.12 (mean and SD), a Matthews correlation coefficient (MCC) of 0.57 ± 0.15, an area under the receiver operating characteristic curve (AUROC) of 0.86 ± 0.08, and an area under the precision-recall curve (AUPRC) of 0.87 ± 0.09. We developed a web service program, named FexSplice, that accepts a genomic coordinate on either GRCh37/hg19 or GRCh38/hg38 and returns the predicted probability of aberrant splicing for the A, C, and T variants.
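As an illustration of two of the metrics reported above, the following minimal sketch computes accuracy and MCC from binary predictions. It is not the authors' evaluation code: the function names and the toy labels are hypothetical, and the label coding (1 = splicing-affecting, 0 = neutral) is an assumption made for this example.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels.

    Assumed coding for this sketch: 1 = splicing-affecting, 0 = neutral.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of variants classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels, invented for illustration only (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)  # (3, 3, 1, 1)
toy_accuracy = accuracy(*counts)           # 0.75
toy_mcc = mcc(*counts)                     # 0.5
```

On a class-balanced set such as the 106 versus 106 variants described here, accuracy and MCC tend to agree in direction. MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction, so it is not on the same scale as accuracy.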

Keywords: FexSplice web service program; LightGBM model; first nucleotide of an exon; splicing-affecting variants.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Databases, Factual
  • Exons / genetics
  • Gene Frequency
  • Nucleotides* / genetics
  • RNA Splicing*

Substances

  • Nucleotides

Grants and funding

This study was supported by Grants-in-Aid from the Japan Agency for Medical Research and Development (JP22ek0109488 to K.O.), the Japan Society for the Promotion of Science (JP23K18273 to K.O., JP23H02794 to K.O., JP21H02476 to A.M., and JP22K19269 to A.M.), the Ministry of Health, Labour and Welfare of Japan (23FC1014 to K.O.), and the National Center of Neurology and Psychiatry (5–6 to K.O.). A.J. receives a scholarship from the THERS Interdisciplinary Frontier Next Generation Researcher Project (JST SPRING, Grant Number JPMJSP2125).