Alignment-free approaches for predicting novel Nuclear Mitochondrial Segments (NUMTs) in the human genome

Gene. 2019 Apr 5:691:141-152. doi: 10.1016/j.gene.2018.12.040. Epub 2019 Jan 8.

Abstract

The human nuclear genome harbors sequences of mitochondrial origin, indicating an ancestral transfer of DNA from the mitogenome. Several Nuclear Mitochondrial Segments (NUMTs) have been detected by alignment-based sequence similarity search, as implemented in the Basic Local Alignment Search Tool (BLAST). Identifying NUMTs is important for the comprehensive annotation and understanding of the human genome. Here we explore the possibility of detecting NUMTs in the human genome by alignment-free sequence similarity search based on k-mer (k-tuple, k-gram, oligo of length k) distributions. We find that when k=6 or larger, the k-mer approach and BLAST search produce almost identical results; for example, both detect the same set of NUMTs longer than 3 kb. However, when k=5 or k=4, certain signals are detected only by the alignment-free approach, and these may indicate as yet unrecognized, and potentially more ancestral, NUMTs. We introduce a "Manhattan plot" style representation of NUMT predictions across the genome, which are calculated from the reciprocal of the Jensen-Shannon divergence between the nuclear and mitochondrial k-mer frequencies. Further inspection of the k-mer-based NUMT predictions, however, shows that most of them contain long-terminal-repeat (LTR) annotations, whereas BLAST-based NUMT predictions do not. Thus, a similarity of the mitogenome to LTR sequences is recognized, which we validate by finding that the mitochondrial k-mer distribution is closer to those of transposable sequences and, specifically, close to some types of LTR.
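The scoring idea described above (the reciprocal of the Jensen-Shannon divergence between k-mer frequencies) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of overlapping k-mers on a single strand, the base-2 logarithm, the skipping of k-mers that contain non-ACGT characters, and the handling of a zero divergence are all assumptions made here for illustration. The abstract does not specify window size, strand handling, or normalization.

```python
from collections import Counter
from itertools import product
from math import log2


def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k k-mers over A, C, G, T.

    Assumption: overlapping k-mers on one strand; k-mers containing
    any other character (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[m] / total for m in kmers]


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so the value lies in [0, 1])."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the sum.
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def numt_score(window, mito, k=5):
    """Hypothetical score: reciprocal of the JSD between the k-mer
    frequencies of a nuclear window and of the mitogenome."""
    jsd = jensen_shannon_divergence(
        kmer_frequencies(window, k), kmer_frequencies(mito, k)
    )
    # Assumption: identical distributions are reported as infinity.
    return float("inf") if jsd == 0 else 1.0 / jsd
```

Under these assumptions, a Manhattan-plot style display would apply numt_score to successive windows along each chromosome and plot the score against genomic position, so that windows whose k-mer composition resembles the mitogenome stand out as peaks.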

Keywords: Alignment-free; Jensen-Shannon divergence; Manhattan plot; Mitochondria; NUMT; k-mer.

MeSH terms

  • Algorithms
  • Cell Nucleus / genetics*
  • Evolution, Molecular
  • Genome, Human
  • Humans
  • Mitochondria / genetics*
  • Phylogeny
  • Sequence Alignment
  • Sequence Analysis, DNA / methods*