Guide to using files from the FTP site or accessed via E-utilities

Background
Definition of variants represented in ClinVar
Finding data for a variant
Finding data for a variant and a specific condition
Finding evidence based on individuals assessed or functional data
Summary of cross-references

Background

Definition of variants represented in ClinVar

Variants and sets of variants

The vast majority of classifications in ClinVar are for single variants. There are some cases where the classification is for a set of variants, as a haplotype or genotype. For this reason, all variants in ClinVar are treated as sets of variants, even though most sets have a single member.

The ClinVar Variation ID represents the set of variants; the ClinVar Allele ID represents each individual variant within a set. Read more about identifiers in ClinVar.

Genomic coordinates for variants

ClinVar reports the location of a variant in 1-based coordinates (offset 1). Note that the location reported in the XML (SequenceLocation element) and in the tab-delimited files follows HGVS nomenclature (i.e. right-shifted), while the location in the VCF files follows the VCF standard (i.e. left-shifted). In other words:

for single nucleotide variants not in repeat regions, the VCF and HGVS locations give the same nucleotide position
for insertions, duplications, and deletions in repeat regions, the different conventions of left justification (VCF) and right justification (HGVS) will make it appear that the definitions of the locations of the alleles are inconsistent among the files
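
The effect of the two conventions can be illustrated with a small sketch. The reference sequence, the function names, and the deletion below are made up for illustration; this is not ClinVar's own normalization code. The sketch also ignores the anchor base that VCF places before an indel, and only shows how the same deletion in a repeat can be given different start positions.

```python
def apply_deletion(ref, start, length):
    """Delete `length` bases beginning at 1-based position `start`."""
    i = start - 1
    return ref[:i] + ref[i + length:]


def shift_deletion(ref, start, length, direction):
    """Return the 1-based start of the deletion after moving it as far as
    possible in `direction` without changing the resulting sequence."""
    i = start - 1  # work with a 0-based index internally
    if direction == "left":
        # Moving one base left is safe when the base entering the deleted
        # window equals the base leaving it.
        while i > 0 and ref[i - 1] == ref[i + length - 1]:
            i -= 1
    elif direction == "right":
        while i + length < len(ref) and ref[i] == ref[i + length]:
            i += 1
    else:
        raise ValueError("direction must be 'left' or 'right'")
    return i + 1


ref = "GCACACAT"  # hypothetical reference with a CA repeat
left = shift_deletion(ref, 4, 2, "left")    # left-justified start
right = shift_deletion(ref, 4, 2, "right")  # right-justified start
print(left, right)  # prints: 2 6
```

Both positions describe the same allele: deleting two bases at position 2 or at position 6 yields the same sequence, which is why locations can look inconsistent between the VCF and the other files.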

Variants without genomic coordinates

A small number of records in ClinVar report variants that have not been mapped to genomic coordinates. There are several causes of this, ranked below in decreasing order of frequency.

ClinVar processes allelic variant records on behalf of OMIM, and until recently, most allelic variant records did not contain a computable sequence definition of the variant. ClinVar does not have the resources to research all of these to define the variant, but our staff try to reduce the number of gaps.
Variants originally defined based on analyses of cDNAs, without evidence of the genomic basis of the sequence change. An example is an exon loss, without evidence of either genomic deletion or single nucleotide changes affecting splice junctions.
Submissions accepted from non-OMIM sources without having validated the sequence definition. These records are old; ClinVar no longer accepts variants that cannot be mapped to the genome.
Database maintenance errors

XML files

VCV XML

In each VariationArchive/ClassifiedRecord, the variant or set of variants that was classified is represented by one of the following elements:

SimpleAllele – for classification of a single variant; most records in ClinVar are for a single variant
Haplotype – for classification of a haplotype; this is not common
Genotype – for classification of a genotype, either a compound heterozygote or a diplotype; this is not common

Within Haplotype and Genotype elements, each single allele within the set is represented by a SimpleAllele element.
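
A minimal sketch of reading this structure with Python's standard library follows. The element names VariationArchive, ClassifiedRecord, SimpleAllele, Haplotype and Genotype come from the description above; the root element, the attribute names (VariationID, AlleleID) and all values in the sample are assumptions made for illustration, so check them against the actual VCV XML before relying on them.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment; real VCV records carry far more detail.
SAMPLE_VCV = """
<Release>
  <VariationArchive VariationID="1">
    <ClassifiedRecord>
      <SimpleAllele AlleleID="11" VariationID="1"/>
    </ClassifiedRecord>
  </VariationArchive>
  <VariationArchive VariationID="2">
    <ClassifiedRecord>
      <Haplotype VariationID="2">
        <SimpleAllele AlleleID="21" VariationID="201"/>
        <SimpleAllele AlleleID="22" VariationID="202"/>
      </Haplotype>
    </ClassifiedRecord>
  </VariationArchive>
</Release>
"""

SET_TAGS = ("SimpleAllele", "Haplotype", "Genotype")


def classified_sets(xml_text):
    """Return (VariationID, set type, [AlleleIDs]) for each classified record."""
    root = ET.fromstring(xml_text.strip())
    results = []
    for archive in root.iter("VariationArchive"):
        record = archive.find("ClassifiedRecord")
        if record is None:
            continue
        for tag in SET_TAGS:
            top = record.find(tag)  # direct child of ClassifiedRecord only
            if top is not None:
                # iter() also yields the element itself, so a top-level
                # SimpleAllele is reported as a set with one member.
                alleles = [a.get("AlleleID") for a in top.iter("SimpleAllele")]
                results.append((archive.get("VariationID"), tag, alleles))
                break
    return results


for row in classified_sets(SAMPLE_VCV):
    print(row)
```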

RCV XML

The path //ReferenceClinVarAssertion/MeasureSet contains all the data ClinVar has accumulated about any variant or set of variants that was classified. Most sets contain a single variant.

Each single allele in the set is described in the path //ReferenceClinVarAssertion/MeasureSet/Measure.

Because an RCV record represents a variant/condition pair, the same /MeasureSet may be reported in more than one RCV. Each instance will have the same ID value. To find all data for a variant, we recommend using the VCV XML instead.
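
If the RCV XML is used anyway, records for the same variant can be grouped by the MeasureSet ID. The sketch below assumes a simplified fragment: the paths ReferenceClinVarAssertion/MeasureSet/Measure and the ID attribute come from the text above, while the wrapper elements, the ClinVarAccession element with its Acc attribute, and all values are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Simplified, hypothetical fragment of an RCV release.
SAMPLE_RCV = """
<ReleaseSet>
  <ClinVarSet>
    <ReferenceClinVarAssertion>
      <ClinVarAccession Acc="RCV000000001"/>
      <MeasureSet ID="9"><Measure ID="15"/></MeasureSet>
    </ReferenceClinVarAssertion>
  </ClinVarSet>
  <ClinVarSet>
    <ReferenceClinVarAssertion>
      <ClinVarAccession Acc="RCV000000002"/>
      <MeasureSet ID="9"><Measure ID="15"/></MeasureSet>
    </ReferenceClinVarAssertion>
  </ClinVarSet>
  <ClinVarSet>
    <ReferenceClinVarAssertion>
      <ClinVarAccession Acc="RCV000000003"/>
      <MeasureSet ID="10"><Measure ID="16"/></MeasureSet>
    </ReferenceClinVarAssertion>
  </ClinVarSet>
</ReleaseSet>
"""


def rcvs_by_measure_set(xml_text):
    """Map each MeasureSet ID to the RCV accessions that report it."""
    root = ET.fromstring(xml_text.strip())
    grouped = defaultdict(list)
    for assertion in root.iter("ReferenceClinVarAssertion"):
        measure_set = assertion.find("MeasureSet")
        accession = assertion.find("ClinVarAccession")
        if measure_set is None or accession is None:
            continue  # skip records that lack either piece
        grouped[measure_set.get("ID")].append(accession.get("Acc"))
    return dict(grouped)


print(rcvs_by_measure_set(SAMPLE_RCV))
```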

variant_summary

The tab-delimited file variant_summary.txt represents all variants in ClinVar that have a location on the genome. However, the file includes only selected metadata about each variant. Locations in this file are consistent with HGVS expressions. This file is released on the first Thursday of the month.
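
As a rough sketch, the file can be read with Python's csv module. The column names used here (#AlleleID, Name, Assembly, Chromosome, Start, Stop, VariationID) and the sample rows are assumptions for illustration; confirm them against the header line of the downloaded file.

```python
import csv
import io

# Hypothetical rows standing in for a downloaded variant_summary.txt.
SAMPLE_SUMMARY = (
    "#AlleleID\tName\tAssembly\tChromosome\tStart\tStop\tVariationID\n"
    "15\tvariant A\tGRCh37\t1\t100\t100\t9\n"
    "15\tvariant A\tGRCh38\t1\t120\t120\t9\n"
    "20\tvariant B\tGRCh38\t2\t500\t503\t12\n"
)


def locations_for_variation(handle, variation_id, assembly="GRCh38"):
    """Return (chromosome, start, stop) tuples for one Variation ID."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["Chromosome"], int(row["Start"]), int(row["Stop"]))
        for row in reader
        if row["VariationID"] == str(variation_id)
        and row["Assembly"] == assembly
    ]


print(locations_for_variation(io.StringIO(SAMPLE_SUMMARY), 9))
```

For the real file, pass an open file handle instead of the in-memory sample; a variant may appear once per assembly, which is why the sketch filters on the assembly column.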

The Entrez document summary

ClinVar data can be accessed by E-utilities, Entrez's programming utilities. ClinVar's document summary, accessed by the esummary command, is structured around the Variation ID.

For annotated examples of the xml that is generated from an esummary request, please refer to

Locations in these reports are consistent with HGVS expressions.

Current views in XML and JSON formats:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&id=9&retmode=json

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&id=9&retmode=XML

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&id=1904&retmode=json

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&id=1904&retmode=xml
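
The URLs above follow a single pattern, which might be built programmatically as sketched below. The function name esummary_url is made up for this example; the base URL and the db, id and retmode parameters are taken from the URLs listed above. The sketch only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

# Base URL as it appears in the examples above.
ESUMMARY_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def esummary_url(variation_id, retmode="json"):
    """Build an esummary URL for one ClinVar Variation ID."""
    if retmode.lower() not in ("json", "xml"):
        raise ValueError("retmode must be 'json' or 'xml'")
    query = urlencode({"db": "clinvar", "id": variation_id, "retmode": retmode})
    return f"{ESUMMARY_BASE}?{query}"


print(esummary_url(9))
print(esummary_url(1904, retmode="xml"))
```

The resulting string can be passed to any HTTP client, for example urllib.request.urlopen from the standard library.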

VCF

ClinVar’s VCF file includes variants that are simple alleles (not a haplotype or genotype) < 10 kb in length with precise endpoints mapped to the GRCh37 or GRCh38 human genome assemblies. Other variants are not in scope, including cytogenetic variants, copy number variants with inner and/or outer start and stop coordinates, and variants >10 kb.
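
The scope rules above can be written as a simple predicate. This is only a sketch of the stated criteria, not ClinVar's actual export logic: the VariantRecord class and its fields are hypothetical, length is approximated as the span between start and stop, and a variant of exactly 10 kb is treated as out of scope because the text does not say which side it falls on.

```python
from dataclasses import dataclass
from typing import Optional

MAX_LENGTH_BP = 10_000  # "10 kb" taken as 10,000 base pairs


@dataclass
class VariantRecord:
    """Hypothetical, simplified view of one variant on one assembly."""
    record_type: str      # "SimpleAllele", "Haplotype" or "Genotype"
    assembly: str         # e.g. "GRCh37" or "GRCh38"
    start: Optional[int]  # None when the endpoint is not precise
    stop: Optional[int]


def in_vcf_scope(variant):
    """Return True if the variant meets the VCF inclusion criteria above."""
    if variant.record_type != "SimpleAllele":
        return False  # haplotypes and genotypes are excluded
    if variant.assembly not in ("GRCh37", "GRCh38"):
        return False
    if variant.start is None or variant.stop is None:
        return False  # imprecise endpoints (e.g. inner/outer coordinates)
    return (variant.stop - variant.start + 1) < MAX_LENGTH_BP


print(in_vcf_scope(VariantRecord("SimpleAllele", "GRCh38", 100, 100)))
print(in_vcf_scope(VariantRecord("Haplotype", "GRCh38", 100, 100)))
```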

The directories

GRCh37/hg19: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/

GRCh38/hg38: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/

Finding data for a variant

Data aggregated by the variant, regardless of the condition, can be accessed in:

the VCV XML files (see above)
the VCV web page
the VCF files (see above)
the Entrez document summary accessed with esummary (see above)
the tab-delimited file variant_summary.txt (see above)

Finding data for a variant and a specific condition

Data aggregated by the variant-condition pair can be accessed only in the RCV XML (see above).

Finding evidence based on individuals assessed or functional data

Some submissions include detailed descriptions of the individuals tested or functional data, as observations of the variant. This data is represented only in the ClinVar XML files under the ObservedIn element.

For example, phenotypes observed in an individual are represented as /ObservedIn/TraitSet.
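
A minimal sketch of collecting those phenotypes is shown below. Only the ObservedIn and TraitSet element names come from the text above; the wrapper element, the Trait/Name/ElementValue structure beneath TraitSet, and the phenotype names are assumptions for illustration and should be checked against the XML schema.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment of a submitted record.
SAMPLE_OBSERVATIONS = """
<ClinVarAssertion>
  <ObservedIn>
    <TraitSet Type="Finding">
      <Trait><Name><ElementValue>Seizure</ElementValue></Name></Trait>
      <Trait><Name><ElementValue>Short stature</ElementValue></Name></Trait>
    </TraitSet>
  </ObservedIn>
  <ObservedIn/>
</ClinVarAssertion>
"""


def observed_phenotypes(xml_text):
    """Return one list of phenotype names per ObservedIn element."""
    root = ET.fromstring(xml_text.strip())
    observations = []
    for observed in root.iter("ObservedIn"):
        names = [
            value.text
            for value in observed.findall("TraitSet/Trait/Name/ElementValue")
        ]
        observations.append(names)  # empty list when no phenotype is given
    return observations


print(observed_phenotypes(SAMPLE_OBSERVATIONS))
```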

Summary of cross-references

Two files in the tab-delimited directory report cross-references between ClinVar's AlleleID and VariationID and other databases.

var_citations.txt (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/var_citations.txt) reports PubMed identifiers, along with ids from dbSNP and dbVar.
cross_references.txt provides identifiers for dbSNP and dbVar, and when those were last modified.
