Structural Variation Data Hub

Here are the datasets most commonly requested by our users. For a complete listing of all dbVar data, please see our Study Browser or our Variant Summary page.

  1. Clinical Structural Variants
  2. Common Structural Variants
  3. Long Read Technology
  4. Genome-wide surveys of structural variation
  5. Datasets most accessed by users

Last updated: Thursday, March 25, 2021

Clinical Structural Variants

All structural variants with clinical interpretations curated at ClinVar are included in a single dbVar study: Clinical Structural Variants (nstd102). Many of these variants were previously accessioned in separate studies at dbVar, such as ClinGen Laboratory-Submitted (nstd37) and ClinGen Kaminsky et al. 2011 (nstd101). We recommend using the nstd102 accessions because they will be updated with the latest information in ClinVar. The old accessions will not be updated and may eventually be retired. A file linking old accessions to new ones is available here. The easiest way to browse all dbVar clinical variants is to visit the Clinical Structural Variants (nstd102) data track in NCBI's Variation Viewer or connect to the Public dbVar Hub at the UCSC Genome Browser.

Study | Regions; Calls | Search Variants | Description
Clinical Structural Variants (nstd102) | 65,141; 67,589 | nstd102 variants | Structural Variants with clinical assertions, submitted to ClinVar by external labs. dbVar now imports all placements from ClinVar as "submitted" and only remaps what is missing in order to place all variants on both GRCh37 and GRCh38. See Variant Summary counts for nstd102 in dbVar Variant Summary. See the latest statistics for nstd102 in Summary of nstd102 (Clinical Structural Variants).
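
The "only remaps what is missing" rule in the nstd102 description can be pictured with a small sketch. This is an illustration only: how dbVar implements remapping is not described on this page, and the function name `assemblies_to_remap` and its input format are hypothetical.

```python
# Illustrative sketch only; not dbVar's actual pipeline.
# The two target assemblies are taken from the nstd102 description.
TARGET_ASSEMBLIES = ("GRCh37", "GRCh38")


def assemblies_to_remap(submitted_assemblies):
    """Return the target assemblies that have no submitted placement.

    `submitted_assemblies` is a hypothetical input: the assembly names on
    which a variant already has a placement imported from ClinVar.
    """
    submitted = set(submitted_assemblies)
    # Keep submitted placements as they are; list only the missing assemblies.
    return [name for name in TARGET_ASSEMBLIES if name not in submitted]


# A variant submitted on GRCh37 only needs a remapped GRCh38 placement.
only_grch37 = assemblies_to_remap(["GRCh37"])
# A variant submitted on both assemblies needs no remapping.
both = assemblies_to_remap(["GRCh37", "GRCh38"])
```

Under these assumptions, a placement that ClinVar already provides is never recomputed; remapping is requested only for an assembly that lacks one.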

Common Structural Variants

All common structural variants are included in a single dbVar study: NCBI Curated Common Structural Variants (nstd186). These variants are also accessioned in separate studies at dbVar (gnomAD Structural Variants (nstd166), 1000 Genomes Consortium Phase 3 Integrated SV (estd219), DECIPHER Common CNVs (nstd183), Lee et al. 2020 (nstd194)). A file linking accessions between the studies is available here. The easiest way to browse all dbVar common variants is to visit the NCBI Curated Common Structural Variants (nstd186) data track in NCBI's Variation Viewer or connect to the Public dbVar Hub at the UCSC Genome Browser.

Study | Regions; Calls | Search Variants | Description
NCBI Curated Common Structural Variants (nstd186) | 55,981; 61,086 | nstd186 variants | A curated dataset of all structural variants in dbVar that meet the following criteria: were part of a study with at least 100 samples; included allele frequency data; had an allele frequency of >=0.01 in at least one population. Data content of this study is subject to change as new data become available. See Variant Summary counts for nstd186 in dbVar Variant Summary. See the latest statistics for nstd186 in Summary of nstd186 (NCBI Curated Common Structural Variants).
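
The three inclusion criteria for nstd186 can be written as a simple filter. The sketch below is a minimal illustration, not NCBI's curation code: the `VariantRecord` class, its field names, and the example accessions are hypothetical, and only the two thresholds (100 samples, allele frequency >=0.01) come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict

MIN_SAMPLES = 100        # study must include at least 100 samples
MIN_ALLELE_FREQ = 0.01   # allele frequency >= 0.01 in at least one population


@dataclass
class VariantRecord:
    """Hypothetical record; the field names are not dbVar's schema."""
    accession: str
    study_sample_count: int
    # Allele frequency per population; an empty dict means no frequency data.
    population_af: Dict[str, float] = field(default_factory=dict)


def is_common(variant: VariantRecord) -> bool:
    """Apply the three criteria quoted in the nstd186 description."""
    if variant.study_sample_count < MIN_SAMPLES:
        return False
    if not variant.population_af:
        return False
    # One population at or above the threshold is enough.
    return max(variant.population_af.values()) >= MIN_ALLELE_FREQ


candidates = [
    VariantRecord("example_1", 2504, {"AFR": 0.03, "EUR": 0.002}),
    VariantRecord("example_2", 50, {"AFR": 0.20}),   # too few samples
    VariantRecord("example_3", 300, {}),              # no frequency data
    VariantRecord("example_4", 300, {"EAS": 0.004}),  # below the threshold
]
common = [v.accession for v in candidates if is_common(v)]
print(common)  # ['example_1']
```

The frequency test uses the maximum across populations because the criterion asks for the threshold to be met in at least one population, not in all of them.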

Long Read Technology

Long-read sequencing is better than short-read technologies at capturing large structural variation events. The following studies used long-read sequencing methods to detect structural variants (SVs).

Study | Regions; Calls | Search Variants | Description
Genome in a Bottle Structural Variants - Tier I, v0.6 (nstd175) | 12,745; 12,745 | nstd175 variants | The v0.6 Genome in a Bottle Consortium [www.genomeinabottle.org] structural variant (SV) benchmark set includes ~10,000 sequence-resolved insertions and deletions >49bp from the broadly-consented GIAB/Personal Genome Project Ashkenazi son (HG002/GM24385). These SVs, along with an accompanying benchmark BED file, are discovered and evaluated by multiple short, linked, and long read sequencing technologies and are intended as a benchmark for identifying false positive and false negative SV calls in any method. Original VCF files and the benchmark BED file can be found here. See Variant Summary counts for nstd175 in dbVar Variant Summary. PubMed:Genome in a Bottle.
PacBio Circular Consensus Sequencing of human male (nstd167) | 30,218; 30,634 | nstd167 variants | PacBio Circular Consensus Sequencing (CCS) of the human male HG002/NA24385 to evaluate the ability of highly-accurate long-read sequencing to identify small and large variants, to phase variants into haplotypes, and to assemble a genome de novo. See Variant Summary counts for nstd167 in dbVar Variant Summary. PubMed:Wenger et al. 2019.
Intermediate-sized deletions examined with Nanopore long-read sequencing (nstd171) | 4,378; 4,378 | nstd171 variants | Intermediate-sized deletions (30bp-5kbp) were identified from whole-genome sequencing data of a Japanese population using a two-stage identification process. Detected intermediate-sized deletions underwent stringent filtering, and the accuracy of the deletion calls was checked using data from Oxford Nanopore long-read sequencers. See Variant Summary counts for nstd171 in dbVar Variant Summary. PubMed:Wong et al. 2019.
Multi-platform discovery of haplotype-resolved structural variation in human genomes (nstd152) | 103,985; 214,917 | nstd152 variants | This is an integrated callset from three individuals (HG00514, HG00733, and NA19240) sequenced using Illumina, Illumina 3.5 kbp jumping libraries, Illumina 6kbp jumping libraries, PacBio, BioNano Genomics, 10x, Hi-C, and Strand-seq. See Variant Summary counts for nstd152 in dbVar Variant Summary. PubMed:Chaisson et al. 2019.
Major Structural Variant Alleles of the Human Genome (nstd162) | 99,810; 342,842 | nstd162 variants | Although the accuracy of the human reference genome is critical for basic and clinical research, structural variants (SVs) have been difficult to assess because data capable of resolving them have been limited. To address potential bias, we generated long-read sequence data on thirteen genomes. Systematically merging SVs yielded 95,827 sequence-resolved insertions, deletions, and inversions. Among these, we identified more than 1 Mbp of SVs shared among all genomes and more than 6.5 Mbp of SVs in the majority of genomes indicating errors or extreme minor alleles captured in the reference. See Variant Summary counts for nstd162 in dbVar Variant Summary. PubMed:Audano et al. 2019.
Discovery and genotyping of structural variation from long-read haploid genome sequence data (nstd137) | 32,954; 35,154 | nstd137 variants | In an effort to more fully understand the full spectrum of human genetic variation, we generated deep single-molecule, real-time (SMRT) sequencing data from two haploid human genomes. Using an assembly-based approach (SMRT-SV), we systematically assessed each genome independently for structural variants (SVs) and indels resolving the sequence structure of 461,553 genetic variants from 2 bp to 28 kbp in length. We find that 82% of these variants have been missed as part of analysis of the 1000 Genomes Project. We estimate that this theoretical human diploid differs by as much as ~16 Mbp with respect to the human reference, with long-read sequencing data providing a fivefold increase in sensitivity for genetic variants ranging in size from 7 bp to 1 kbp when compared to short-read sequence data. See Variant Summary counts for nstd137 in dbVar Variant Summary. PubMed:Huddleston et al. 2016.

Genome-wide surveys of structural variation

The following high-quality datasets contain the results of genome-wide discovery surveys of copy number variants (CNVs) and other structural variants from a wide variety of global populations.

Study | Regions; Calls | Search Variants | Description
Structural variants in gnomAD (nstd166) | 304,733; 313,581 | nstd166 variants | The v2.1 release of gnomAD-SV represents a catalogue of structural variants (SVs) discovered from whole-genome sequencing of 14,891 individuals at 32X mean coverage with 2x150bp Illumina reads. From this dataset, site-level SV data was able to be released for 10,847 unrelated individuals with appropriate consent for broad data sharing. For more information, please refer to Collins*, Brand*, et al., bioRxiv (2019), or the gnomAD-SV explainer. Original VCF files can be found here and, with dbVar accessions included, here. See Variant Summary counts for nstd166 in dbVar Variant Summary. PubMed:gnomAD_Structural_Variants.
Major Structural Variant Alleles of the Human Genome (nstd162) | 99,810; 342,842 | nstd162 variants | Although the accuracy of the human reference genome is critical for basic and clinical research, structural variants (SVs) have been difficult to assess because data capable of resolving them have been limited. To address potential bias, we generated long-read sequence data on thirteen genomes. Systematically merging SVs yielded 95,827 sequence-resolved insertions, deletions, and inversions. Among these, we identified more than 1 Mbp of SVs shared among all genomes and more than 6.5 Mbp of SVs in the majority of genomes indicating errors or extreme minor alleles captured in the reference. See Variant Summary counts for nstd162 in dbVar Variant Summary. PubMed:Audano et al. 2019.
Multi-platform discovery of haplotype-resolved structural variation in human genomes (nstd152) | 103,985; 214,917 | nstd152 variants | This is an integrated callset from three individuals (HG00514, HG00733, and NA19240) sequenced using Illumina, Illumina 3.5 kbp jumping libraries, Illumina 6kbp jumping libraries, PacBio, BioNano Genomics, 10x, Hi-C, and Strand-seq. See Variant Summary counts for nstd152 in dbVar Variant Summary. PubMed:Chaisson et al. 2019.
Genome in a Bottle Structural Variants - Tier I, v0.6 (nstd175) | 12,745; 12,745 | nstd175 variants | The v0.6 Genome in a Bottle Consortium [www.genomeinabottle.org] structural variant (SV) benchmark set includes ~10,000 sequence-resolved insertions and deletions >49bp from the broadly-consented GIAB/Personal Genome Project Ashkenazi son (HG002/GM24385). These SVs, along with an accompanying benchmark BED file, are discovered and evaluated by multiple short, linked, and long read sequencing technologies and are intended as a benchmark for identifying false positive and false negative SV calls in any method. Original VCF files and the benchmark BED file can be found here. See Variant Summary counts for nstd175 in dbVar Variant Summary. PubMed:Genome in a Bottle.
1000 Genomes Project (Phase 3 SV analysis) (estd219) | 68,825; 8,812,557 | estd219 variants | 1000 Genomes Phase 3 structural variants as reported in a companion paper specifically dedicated to SV analysis. Much of these data are identical to those reported in the main paper as study estd214. See Variant Summary counts for estd219 in dbVar Variant Summary. PubMed:1000 Genomes Consortium Phase 3 Integrated SV.
Short Tandem Repeat (STR) Population Survey (nstd128) | 1,328,521; 4,394,628 | nstd128 variants | We report high quality genomes from 300 individuals from 142 diverse populations. As part of this study, we generated a comprehensive catalog of short tandem repeat (STR) genotypes. We used this call set to characterize allele frequency spectra, analyze sequence determinants of STR variation, and to identify common loss of function alleles. See Variant Summary counts for nstd128 in dbVar Variant Summary. PubMed:Mallick et al. 2016.
CNV Global Population Survey (nstd112) | 15,012; 3,303,297 | nstd112 variants | To explore the diversity and selective signatures of duplications and deletions in human copy number variation (CNV), we sequenced 236 individuals from 125 distinct human populations. We observed that duplications exhibit fundamentally different population genetic and selective signatures than deletions and are more likely to be stratified between human populations. We find that the proportion of CNV to SNV base pairs is greater among non-Africans than it is among African populations but we conclude that this difference is likely due to unique aspects of non-African population history as opposed to differences in CNV load. See Variant Summary counts for nstd112 in dbVar Variant Summary. PubMed:Sudmant et al. 2015.

Datasets most accessed by users

The following are the top 10 most accessed datasets in the last 12 months.

Study | Regions; Calls | Search Variants | Description
Clinical Structural Variants (nstd102) | 65,141; 67,589 | nstd102 variants | Structural Variants with clinical assertions, submitted to ClinVar by external labs. dbVar now imports all placements from ClinVar as "submitted" and only remaps what is missing in order to place all variants on both GRCh37 and GRCh38. See Variant Summary counts for nstd102 in dbVar Variant Summary. See the latest statistics for nstd102 in Summary of nstd102 (Clinical Structural Variants).
ClinGen Laboratory Submitted Variants (nstd37) | 20,391; 33,378 | nstd37 variants | Copy number variation identified through the course of routine clinical cytogenomic testing in postnatal populations, with clinical assertions as classified by the original submitter. For data from the original published study, Kaminsky, et al. 2011, please see nstd101. See Variant Summary counts for nstd37 in dbVar Variant Summary. PubMed:Miller et al. 2010.
Structural variants in gnomAD (nstd166) | 304,733; 313,581 | nstd166 variants | The v2.1 release of gnomAD-SV represents a catalogue of structural variants (SVs) discovered from whole-genome sequencing of 14,891 individuals at 32X mean coverage with 2x150bp Illumina reads. From this dataset, site-level SV data was able to be released for 10,847 unrelated individuals with appropriate consent for broad data sharing. For more information, please refer to Collins*, Brand*, et al., bioRxiv (2019), or the gnomAD-SV explainer. Original VCF files can be found here and, with dbVar accessions included, here. See Variant Summary counts for nstd166 in dbVar Variant Summary. PubMed:gnomAD_Structural_Variants.
Original ISCA (ClinGen) publication (nstd101) | 3,105; 3,833 | nstd101 variants | Copy number variation identified through the course of routine clinical cytogenomic testing in postnatal populations. Clinical assertions have been curated as described in Kaminsky, et al. 2011. For additional ClinGen data, please see nstd37. See Variant Summary counts for nstd101 in dbVar Variant Summary. PubMed:Kaminsky et al. 2011.
NCBI Curated Common Structural Variants (nstd186) | 55,981; 61,086 | nstd186 variants | A curated dataset of all structural variants in dbVar that meet the following criteria: were part of a study with at least 100 samples; included allele frequency data; had an allele frequency of >=0.01 in at least one population. Data content of this study is subject to change as new data become available. See Variant Summary counts for nstd186 in dbVar Variant Summary. See the latest statistics for nstd186 in Summary of nstd186 (NCBI Curated Common Structural Variants).
ClinGen Curated Dosage Sensitivity Map - obsoleted (nstd45) | 364; 397 | nstd45 variants | Genes/genomic regions with sufficient evidence supporting (pathogenic) or refuting (benign) dosage sensitivity as a mechanism for disease. Evidence is evaluated on a continual basis by the ClinGen Structural Variation Working Group as described in Riggs et al. 2012. See Variant Summary counts for nstd45 in dbVar Variant Summary. PubMed:Riggs et al. 2011.
1000 Genomes Project (Phase 3, landmark paper) (estd214) | 61,678; 6,943,353 | estd214 variants | This study contains the structural variants from the combined release set, which contains more than 79 million variant sites and includes not just biallelic SNPs but also indels, deletions, complex short substitutions and other structural variant classes. It is based on data from 2504 unrelated individuals from 26 populations around the world. Most of the structural variants reported here can also be found in estd219, where they were reported in a companion paper specifically dedicated to SV analysis. See Variant Summary counts for estd214 in dbVar Variant Summary. PubMed:1000 Genomes Project Consortium et al. 2015.
COSMIC v71 (estd192) | 61,187; 68,202 | estd192 variants | Catalogue of Somatic Mutations in Cancer (COSMIC) version 71 - All cancers arise as a result of the acquisition of a series of fixed DNA sequence abnormalities, mutations, many of which ultimately confer a growth advantage upon the cells in which they have occurred. There is a vast amount of information available in the published scientific literature about these changes. COSMIC is designed to store and display somatic mutation information and related details and contains information relating to human cancers. See Variant Summary counts for estd192 in dbVar Variant Summary. PubMed:Forbes et al. 2008.
Refining analyses of CNV and developmental delay (nstd100) | 70,319; 318,775 | nstd100 variants | Copy Number Variants from 29,083 cases of Developmental Delay and Intellectual Disability from Signature Genomics, and 11,256 Control Samples. This study contains samples in common with Cooper et al. 2011. Due to analysis differences (see manuscripts) please use the case samples (Sampleset 1) from only one of these submissions. Control sample sets do not overlap and may be combined. See Variant Summary counts for nstd100 in dbVar Variant Summary. PubMed:Coe et al. 2014.
Multi-platform discovery of haplotype-resolved structural variation in human genomes (nstd152) | 103,985; 214,917 | nstd152 variants | This is an integrated callset from three individuals (HG00514, HG00733, and NA19240) sequenced using Illumina, Illumina 3.5 kbp jumping libraries, Illumina 6kbp jumping libraries, PacBio, BioNano Genomics, 10x, Hi-C, and Strand-seq. See Variant Summary counts for nstd152 in dbVar Variant Summary. PubMed:Chaisson et al. 2019.
