SPDI: data model for variants and applications at NCBI

Bioinformatics. 2020 Mar 1;36(6):1902-1907. doi: 10.1093/bioinformatics/btz856.

Abstract

Motivation: Normalizing sequence variants on a reference, projecting them across congruent sequences and aggregating their diverse representations are critical to the elucidation of the genetic basis of disease and biological function. Inconsistent representation of variants among variant callers, local databases and tools results in discrepancies that complicate analysis. NCBI's genetic variation resources, dbSNP and ClinVar, require a robust, scalable set of principles to manage asserted sequence variants.

Results: The SPDI data model defines variants as a sequence of four attributes: sequence, position, deletion and insertion, and can be applied to nucleotide and protein variants. NCBI web services convert representations among HGVS, VCF and SPDI and provide two functions to aggregate variants. One, based on the NCBI Variant Overprecision Correction Algorithm, returns a unique, normalized representation termed the 'Contextual Allele'. The SPDI data model, with its four operations, defines exactly the reference subsequence affected by the variant, even in repeat regions, such as homopolymer and other sequence repeats. The second function projects variants across congruent sequences and depends on an alignment dataset of non-assembly NCBI RefSeq sequences (prefixed NM, NR and NG), as well as inter- and intra-assembly-associated genomic sequences (NCs, NTs and NWs), supporting robust projection of variants across congruent sequences and assembly versions. The variant is projected to all congruent Contextual Alleles. One of these Contextual Alleles, typically the allele based on the latest assembly version, represents the entire set, is designated the unique 'Canonical Allele' and is used directly to aggregate variants across congruent sequences.
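The four SPDI attributes, and the idea of widening an overprecise variant to the whole repeat region it could occupy, can be illustrated with a minimal Python sketch. This is not NCBI's implementation of the Variant Overprecision Correction Algorithm. It is a simplified illustration that assumes a colon-delimited text form with a 0-based position, a deletion given as literal bases rather than a length, and an in-memory reference string. The names Spdi, parse_spdi, apply_spdi and contextualize, and the sequence identifier "demo", are hypothetical. Identity variants and other edge cases are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spdi:
    sequence: str   # sequence identifier (hypothetical "demo" below)
    position: int   # number of reference bases to skip (0-based)
    deletion: str   # reference bases removed at that position
    insertion: str  # bases inserted in their place

    def __str__(self) -> str:
        return f"{self.sequence}:{self.position}:{self.deletion}:{self.insertion}"


def parse_spdi(text: str) -> Spdi:
    """Parse 'sequence:position:deletion:insertion' (assumed text form)."""
    seq, pos, deleted, inserted = text.split(":")
    return Spdi(seq, int(pos), deleted, inserted)


def apply_spdi(reference: str, v: Spdi) -> str:
    """Advance to the position, delete, then insert; returns the variant sequence."""
    end = v.position + len(v.deletion)
    if reference[v.position:end] != v.deletion:
        raise ValueError("deleted bases do not match the reference")
    return reference[:v.position] + v.insertion + reference[end:]


def contextualize(reference: str, v: Spdi) -> Spdi:
    """Simplified sketch: trim shared bases, then widen a pure insertion or
    deletion over every position to which it could be shifted."""
    start, deleted, inserted = v.position, v.deletion, v.insertion
    # Trim bases shared by the deleted and inserted strings (suffix, then prefix).
    while deleted and inserted and deleted[-1] == inserted[-1]:
        deleted, inserted = deleted[:-1], inserted[:-1]
    while deleted and inserted and deleted[0] == inserted[0]:
        deleted, inserted = deleted[1:], inserted[1:]
        start += 1
    end = start + len(deleted)
    left = right = 0
    if bool(deleted) != bool(inserted):  # pure insertion or pure deletion
        unit = deleted or inserted
        n = len(unit)
        # Count how far the change can be shifted right, then left.
        while end + right < len(reference) and reference[end + right] == unit[right % n]:
            right += 1
        while start - 1 - left >= 0 and reference[start - 1 - left] == unit[-1 - (left % n)]:
            left += 1
    new_start, new_end = start - left, end + right
    return Spdi(
        v.sequence,
        new_start,
        reference[new_start:new_end],
        reference[new_start:start] + inserted + reference[end:new_end],
    )


reference = "TCAAAG"  # toy reference containing a homopolymer run
# Two descriptions of the same single-base deletion within the A run.
a = contextualize(reference, parse_spdi("demo:3:A:"))
b = contextualize(reference, parse_spdi("demo:2:A:"))
print(a, b)
```

In this toy case both inputs widen to the same representation covering the full run of A bases, which is the property that lets differently reported variants be aggregated. Applying either the original or the widened variant to the reference yields the same variant sequence.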

Availability and implementation: The SPDI services are available for open access at: https://api.ncbi.nlm.nih.gov/variation/v0.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, N.I.H., Intramural

MeSH terms

  • Algorithms
  • Databases, Genetic*
  • Genome
  • Genomics*
  • Vocabulary, Controlled