Identifiers in ClinVar
ClinVar archives submissions that interpret the effect of a single variant or set of variants on phenotype. It also archives data aggregated from those submissions. These archives are assigned accession numbers.
To support the function of aggregating data, key concepts in the submission are extracted, assigned identifiers, and enriched with content from other databases.
This document
- provides detailed information about ClinVar's accessions, identifiers, and relationships established with records in other public databases
- enumerates where each type of identifier is reported in ClinVar's information products
- gives examples of using some of these identifiers
Table of contents
- Accession numbers
- Identifiers specific to ClinVar
- Identifiers specific to other NCBI resources
- Identifiers specific to resources outside of NCBI
Accession numbers
ClinVar assigns accession numbers to its records. Each accession number consists of 3 letters followed by 9 numerals. The letters are SCV (think of it as Submitted record in ClinVar), RCV (Reference ClinVar record), or VCV (Variation ClinVar record). Each accession number is also assigned a version number. The version is incremented when a submitter updates a record, or when the contents of a reference or variation record change because the SCV accessions on which it is based are added, updated, or deleted.
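As an illustration of the pattern described above, the following minimal Python sketch splits an accession into its prefix, number, and version. The function name parse_accession is hypothetical, and the sketch assumes the version, when present, is appended after a dot (for example RCV000009910.2); the version values shown are made up.

```python
import re
from typing import NamedTuple, Optional

# Three-letter prefix (SCV, RCV, or VCV) followed by nine digits.
# Assumption: an optional version is written as a ".<integer>" suffix.
ACCESSION_RE = re.compile(r"^(SCV|RCV|VCV)(\d{9})(?:\.(\d+))?$")


class Accession(NamedTuple):
    prefix: str
    number: int
    version: Optional[int]


def parse_accession(text: str) -> Accession:
    """Split a ClinVar accession into prefix, number, and optional version."""
    match = ACCESSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a ClinVar accession: {text!r}")
    prefix, digits, version = match.groups()
    return Accession(prefix, int(digits), int(version) if version else None)


print(parse_accession("SCV000020145"))
print(parse_accession("RCV000009910.2"))  # the version value is hypothetical
```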
SCV
Web display
'SCV' refers to the first 3 letters of the accession number assigned to a submission to ClinVar, e.g. SCV000020145. If you submit a query to ClinVar based on that accession number, you are directed automatically to the page specific to the VCV, or Variation ID, generated from that submission. The SCV accession number and version are displayed in the Submitted interpretations and evidence section. At present, there is no web display in ClinVar specific to an SCV record.
XML releases
In our XML releases called ClinVarFullRelease (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/), the content specific to each SCV is reported in a //ClinVarSet/ClinVarAssertion/ element, with the accession number reported as /ReleaseSet/ClinVarSet/ClinVarAssertion/ClinVarAccession/@Acc and the version as /ReleaseSet/ClinVarSet/ClinVarAssertion/ClinVarAccession/@Version.
In our XML releases called ClinVarVariationRelease (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation/), the content specific to each SCV is reported in a /ClinicalAssertion/ element, with the accession number reported as /ClinicalAssertion/ClinVarAccession/@Accession and the version as /ClinicalAssertion/ClinVarAccession/@Version.
See the README file for more information about our XML releases.
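The two paths above differ in the attribute name (@Acc versus @Accession). A minimal Python sketch of reading both is shown here; the XML fragments are hypothetical, heavily trimmed stand-ins for real records, the Version values are made up, and the function names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed fragments; real records carry many more elements.
FULL_RELEASE = """
<ReleaseSet>
  <ClinVarSet>
    <ClinVarAssertion>
      <ClinVarAccession Acc="SCV000020145" Version="1"/>
    </ClinVarAssertion>
  </ClinVarSet>
</ReleaseSet>
"""

CLINICAL_ASSERTION = """
<ClinicalAssertion>
  <ClinVarAccession Accession="SCV000020145" Version="1"/>
</ClinicalAssertion>
"""


def scv_from_full_release(xml_text):
    """Read (accession, version) pairs using the ClinVarFullRelease path."""
    root = ET.fromstring(xml_text)  # root element is <ReleaseSet>
    path = "./ClinVarSet/ClinVarAssertion/ClinVarAccession"
    return [(node.get("Acc"), node.get("Version")) for node in root.findall(path)]


def scv_from_clinical_assertion(xml_text):
    """Read (accession, version) pairs from a <ClinicalAssertion> element."""
    root = ET.fromstring(xml_text)  # root element is <ClinicalAssertion>
    return [(node.get("Accession"), node.get("Version"))
            for node in root.findall("./ClinVarAccession")]


print(scv_from_full_release(FULL_RELEASE))
print(scv_from_clinical_assertion(CLINICAL_ASSERTION))
```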
Reports in the tab_delimited directory
Two files in the tab-delimited directory on our FTP site (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/) contain information specific to submitted data:
- submission_summary.txt.gz
- summary_of_conflicting_interpretations.txt
submission_summary provides an overview of interpretation, phenotypes, observations, and methods reported in the current version of each submission.
summary_of_conflicting_interpretations reports all pairwise differences in interpretation of a variant, without regard to phenotype.
VCF files
The SCV accession is not reported in ClinVar's VCF files.
RCV
Web display
'RCV' refers to the first 3 letters of the accession calculated by ClinVar to aggregate information from all submitted records interpreting the same phenotype relative to the same variant or set of variants. If you submit a query to ClinVar based on an RCV accession, e.g. RCV000009910, you are directed to the page specific to that record. The Assertion and evidence details section of this record lists all the supporting submitted records with their SCV accessions.
XML releases
In our XML releases called ClinVarFullRelease (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/), the content specific to each RCV is reported in a //ClinVarSet/ReferenceClinVarAssertion/ element, with the accession reported as //ReleaseSet/ClinVarSet/ReferenceClinVarAssertion/ClinVarAccession/@Acc and the version as //ReleaseSet/ClinVarSet/ReferenceClinVarAssertion/ClinVarAccession/@Version.
In our XML releases called ClinVarVariationRelease (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation/), summary information for each related RCV is reported in a //RCVList/RCVAccession/ element, with the accession number reported as //RCVList/RCVAccession/@Accession and the version as //RCVList/RCVAccession/@Version.
See the README file for more information about our XML releases.
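Because the RCV and the SCVs share the ClinVarSet parent in ClinVarFullRelease, one way to relate them is sketched below in Python. This assumes that each ClinVarSet holds one ReferenceClinVarAssertion together with the ClinVarAssertion elements it aggregates; the fragment, the SCV accession, and the Version values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed fragment. Assumption: one ClinVarSet groups a single
# ReferenceClinVarAssertion (RCV) with its supporting ClinVarAssertions (SCVs).
SAMPLE = """
<ReleaseSet>
  <ClinVarSet>
    <ReferenceClinVarAssertion>
      <ClinVarAccession Acc="RCV000009910" Version="1"/>
    </ReferenceClinVarAssertion>
    <ClinVarAssertion>
      <ClinVarAccession Acc="SCV000000001" Version="1"/>
    </ClinVarAssertion>
  </ClinVarSet>
</ReleaseSet>
"""


def scvs_by_rcv(xml_text):
    """Map each RCV accession to the SCV accessions in the same ClinVarSet."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for clinvar_set in root.findall("./ClinVarSet"):
        rcv = clinvar_set.find("./ReferenceClinVarAssertion/ClinVarAccession")
        if rcv is None:
            continue  # skip sets without a reference record
        scvs = clinvar_set.findall("./ClinVarAssertion/ClinVarAccession")
        mapping[rcv.get("Acc")] = [scv.get("Acc") for scv in scvs]
    return mapping


print(scvs_by_rcv(SAMPLE))
```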
Reports in the tab_delimited directory
submission_summary mentions RCV accessions and versions in comments supplied by submitters (Description column). The references to RCV are not verified by ClinVar.
summary_of_conflicting_interpretations mentions RCV accessions and versions in comments supplied by submitters (Submitter1_Description, Submitter2_Description). The references to RCV accessions are not verified by ClinVar.
VCF files
The list of RCV accessions that aggregate information about variants represented in ClinVar is reported in ClinVar's VCF file.
VCV
Web display
'VCV' refers to the first 3 letters of the accession calculated by ClinVar to aggregate information from all submitted records interpreting the same variant or set of variants. If you submit a query to ClinVar based on a VCV accession, e.g. VCV000009325, you are directed to the page specific to that record. The Submitted interpretations and evidence section of this record lists all the supporting submitted records with their SCV accessions.
XML releases
In our XML releases called ClinVarFullRelease (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/), the VCV accession is reported as //ReferenceClinVarAssertion/MeasureSet/@Acc and the version as //ReferenceClinVarAssertion/MeasureSet/@Version.
In our XML releases called ClinVarVariationRelease (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation/), the content specific to each VCV is reported in a //ClinVarVariationRelease/VariationArchive/ element, with the accession number reported as //ClinVarVariationRelease/VariationArchive/@Accession and the version as //ClinVarVariationRelease/VariationArchive/@Version.
See the README file for more information about our XML releases.
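Release files can be large, so a streaming parser may be preferable to loading a whole file. The following Python sketch reads the VCV accession and version from each VariationArchive element with iterparse; the sample document is a hypothetical, trimmed stand-in and its Version value is made up.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for a ClinVarVariationRelease file.
SAMPLE = b"""<ClinVarVariationRelease>
  <VariationArchive Accession="VCV000009325" Version="1"/>
</ClinVarVariationRelease>"""


def iter_vcv(stream):
    """Yield (accession, version) for each VariationArchive in a binary stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "VariationArchive":
            yield elem.get("Accession"), elem.get("Version")
            elem.clear()  # discard the record's children once it has been read


print(list(iter_vcv(io.BytesIO(SAMPLE))))
```

For a real release, the stream would be an opened (and, if compressed, decompressed) release file rather than an in-memory sample.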
Reports in the tab_delimited directory
VCV accessions are not yet reported in any tab-delimited file. Note that the VCV accession number is constructed from the Variation ID by padding it with leading zeros to nine digits and prefixing VCV. Thus the Variation ID can be used in lieu of the VCV accession number; it is reported in the following files:
- variant_summary.txt.gz
- var_citations.txt
- summary_of_conflicting_interpretations.txt
- hgvs4variation.txt.gz
- variation_allele.txt.gz
- submission_summary.txt.gz
See the README file for more information about our tab-delimited files.
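The construction of a VCV accession from a Variation ID described above can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that any version is written as a dot-separated suffix, which is ignored when converting back.

```python
import re

VCV_RE = re.compile(r"VCV(\d{9})(?:\.\d+)?")


def variation_id_to_vcv(variation_id: int) -> str:
    """Prefix VCV and pad the Variation ID with zeros to nine digits."""
    if not 0 < variation_id <= 999_999_999:
        raise ValueError(f"Variation ID out of range: {variation_id}")
    return f"VCV{variation_id:09d}"


def vcv_to_variation_id(accession: str) -> int:
    """Recover the Variation ID, ignoring an optional '.<version>' suffix."""
    match = VCV_RE.fullmatch(accession.strip())
    if match is None:
        raise ValueError(f"not a VCV accession: {accession!r}")
    return int(match.group(1))


print(variation_id_to_vcv(9325))            # VCV000009325
print(vcv_to_variation_id("VCV000009325"))  # 9325
```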
Identifiers specific to ClinVar
Variation ID
ClinVar assigns a unique integer identifier to each set of variants described in submitted records. The majority of submitted records in ClinVar interpret a single variant, and a Variation ID is assigned even if there is only one variant in the set. There are two subclasses of Variation IDs:
- those being interpreted directly (interpreted)
- those being interpreted only in the context of a set of variants (included)
The majority of Variation IDs in ClinVar are interpreted, meaning that they were the focus of a submitted record, with the clinical significance, or interpretation, of that variant provided by the submitter. However, there are submitted records that describe a compound heterozygote, a haplotype, or a diplotype, for which ClinVar has not independently received a submission interpreting the effect of each individual variant. The individual variants defining the complex set, but without a direct interpretation, are represented by Variation IDs of the 'included' class.
Note the example https://www.ncbi.nlm.nih.gov/clinvar/variation/561/. The Variation ID is 561. None of the Allele IDs 38381, 38382, or 15600 has a submitted record in ClinVar directly interpreting that variant, so each of those variants is considered "included". The Variation IDs assigned to each individual variant (242756, 242755, 242821, respectively) are being phased into ClinVar's public reports, and are currently accessible in the XML releases called ClinVarVariationRelease (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation/). The standard for reporting the correspondences is https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variation_allele.txt.gz
Note: If for any reason the description of Variation ID 561 were changed from the set of 3 simple variants, as it is now, to one deletion/insertion event spanning 8 base pairs, the Variation ID would be retained, because it describes the same final sequence at that location, but a single, different Allele ID would be assigned to the deletion/insertion.
Web display
The Variation ID is used to anchor ClinVar's display specific to a set of variants. Take, for example, Variation ID 561:
- the value 561 in the URL https://www.ncbi.nlm.nih.gov/clinvar/variation/561/ is the Variation ID
- the value 561 has been assigned to the set of 3 distinct simple variants reported in the Allele(s) section
- the Variation ID is also displayed explicitly on the Variation Report
XML releases
The ClinVarFullRelease files describe the interpreted set of variants in the element //MeasureSet. The attribute @ID is the Variation ID.
The ClinVarVariationRelease files describe the interpreted set of variants in the following elements:
- /ClinVarVariationRelease/VariationArchive/InterpretedRecord/SimpleAllele
- /ClinVarVariationRelease/VariationArchive/InterpretedRecord/Haplotype
- /ClinVarVariationRelease/VariationArchive/InterpretedRecord/Genotype
In each case, the attribute @VariationID is the Variation ID.
The ClinVarVariationRelease files describe the included set of variants in the following elements:
- /ClinVarVariationRelease/VariationArchive/IncludedRecord/SimpleAllele
- /ClinVarVariationRelease/VariationArchive/IncludedRecord/Haplotype
In each case, the attribute @VariationID is the Variation ID.
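The element paths listed above can be walked as in the following Python sketch, which labels each Variation ID as interpreted or included. The XML is a hypothetical, trimmed fragment built from the example IDs used in this document (561 as an interpreted haplotype, 242821 as an included simple variant); real records carry many more elements and attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed fragment using the example IDs from this document.
SAMPLE = """
<ClinVarVariationRelease>
  <VariationArchive>
    <InterpretedRecord>
      <Haplotype VariationID="561"/>
    </InterpretedRecord>
  </VariationArchive>
  <VariationArchive>
    <IncludedRecord>
      <SimpleAllele VariationID="242821"/>
    </IncludedRecord>
  </VariationArchive>
</ClinVarVariationRelease>
"""

# Record element and the variant elements documented under it.
RECORD_KINDS = {
    "InterpretedRecord": ("SimpleAllele", "Haplotype", "Genotype"),
    "IncludedRecord": ("SimpleAllele", "Haplotype"),
}


def variation_ids(xml_text):
    """Return (Variation ID, element name, record element) tuples."""
    root = ET.fromstring(xml_text)
    found = []
    for archive in root.findall("./VariationArchive"):
        for record_tag, kinds in RECORD_KINDS.items():
            for kind in kinds:
                # Direct children only, so alleles nested inside a
                # Haplotype or Genotype are not reported separately.
                for node in archive.findall(f"./{record_tag}/{kind}"):
                    found.append((int(node.get("VariationID")), kind, record_tag))
    return found


print(variation_ids(SAMPLE))
```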
Reports in the tab_delimited directory
Multiple reports in the tab-delimited directory reference the Variation ID, in a clearly labeled column. The file reporting the relationships between VariationID and AlleleID is variation_allele.txt.gz. The reports that include the Variation ID are:
- variant_summary.txt.gz
- var_citations.txt
- summary_of_conflicting_interpretations.txt
- hgvs4variation.txt.gz
- variation_allele.txt.gz
- submission_summary.txt.gz
VCF files
The Variation ID is reported in the VCF file as the ID, in column 3.
Allele ID
A unique integer identifier, the Allele ID, is assigned to each individual variant in ClinVar. The numbering systems for the Allele ID and the Variation ID described above overlap, so it is important to note the context of any integer identifier.
XML releases
The ClinVarFullRelease files describe each individual variant in the element //Measure. The attribute @ID is the Allele ID.
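Since Variation IDs and Allele IDs are both plain integers, reading them by element keeps the two apart. The Python sketch below collects Allele IDs per Variation ID from a ClinVarFullRelease-style fragment. The fragment is hypothetical and trimmed, and it assumes that Measure elements are nested directly inside the MeasureSet they belong to.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed fragment using the example IDs from this document.
# Assumption: each <Measure> is a direct child of its <MeasureSet>.
SAMPLE = """
<ReleaseSet>
  <ClinVarSet>
    <ReferenceClinVarAssertion>
      <MeasureSet ID="561">
        <Measure ID="15600"/>
        <Measure ID="38381"/>
        <Measure ID="38382"/>
      </MeasureSet>
    </ReferenceClinVarAssertion>
  </ClinVarSet>
</ReleaseSet>
"""


def allele_ids_by_variation(xml_text):
    """Map MeasureSet/@ID (Variation ID) to its Measure/@ID values (Allele IDs)."""
    root = ET.fromstring(xml_text)
    result = {}
    for measure_set in root.findall(".//ReferenceClinVarAssertion/MeasureSet"):
        variation_id = int(measure_set.get("ID"))
        allele_ids = {int(m.get("ID")) for m in measure_set.findall("./Measure")}
        result.setdefault(variation_id, set()).update(allele_ids)
    return {vid: sorted(ids) for vid, ids in result.items()}


print(allele_ids_by_variation(SAMPLE))
```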
Reports in the tab_delimited directory
Multiple reports in the tab-delimited directory reference the Allele ID, in a clearly labeled column. The file reporting the relationships between VariationID and AlleleID is variation_allele.txt.gz. The reports that include the Allele ID are:
- allele_gene.txt
- cross_references.txt
- hgvs4variation.txt.gz
- variant_summary.txt.gz
- var_citations.txt
- variation_allele.txt.gz
VCF files
The Allele ID is reported in the VCF files as the ALLELEID INFO tag.
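Both identifiers can be read from a single VCF data line: the Variation ID from column 3 (ID) and the Allele ID from the ALLELEID tag in the INFO field, which is column 8 in the VCF format. The Python sketch below shows one way to do this; the sample line and all of its values are hypothetical placeholders, not real ClinVar data.

```python
def parse_clinvar_vcf_line(line):
    """Return (variation_id, allele_id) from one tab-separated VCF data line."""
    fields = line.rstrip("\n").split("\t")
    variation_id = int(fields[2])  # column 3: ID
    info = {}
    for item in fields[7].split(";"):  # column 8: INFO
        key, _, value = item.partition("=")
        info[key] = value  # flag-style tags get an empty value
    return variation_id, int(info["ALLELEID"])


def iter_ids(lines):
    """Yield (variation_id, allele_id) pairs, skipping '#' header lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_clinvar_vcf_line(line)


# Hypothetical placeholder records; every value here is made up.
SAMPLE_LINES = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t1000\t12345\tA\tG\t.\t.\tALLELEID=67890;EXAMPLEFLAG\n",
]

print(list(iter_ids(SAMPLE_LINES)))
```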
Relationships between Variation ID and Allele ID
A Variation ID represents one or more Allele IDs. Any Allele ID may be a component of one or more sets of discrete variants (aka Variation ID). For example, consider a submission with an interpretation for a single nucleotide variant (Allele ID n, assigned Variation ID a), and a different submission with an interpretation for that same single nucleotide variant (Allele ID n) in combination with a different single nucleotide variant (Allele ID m), the combination being assigned Variation ID b. Allele ID n is then a component of both Variation ID a and Variation ID b. The standard for reporting the correspondence between Variation ID and Allele ID is https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variation_allele.txt.gz.
To find all Allele IDs represented by a Variation ID, try something like
zcat variation_allele.txt.gz | awk '$1==561 {print}'
which finds all the lines where the value in the first column (Variation ID) is 561, and reports that 561 corresponds to an interpreted (yes in column 4) Haplotype, with the Allele IDs in the 3rd column.
561 Haplotype 15600 yes
561 Haplotype 38381 yes
561 Haplotype 38382 yes
To find all Variation IDs represented by an Allele ID, try something like
zcat variation_allele.txt.gz | awk '$3==15600{print}'
which finds all the lines where the value in the third column (Allele ID) is 15600, and reports that 15600 is part of an interpreted haplotype (yes in column 4), but also is represented by Variation ID 242821 which has not been interpreted.
561 Haplotype 15600 yes
242821 Variant 15600 no
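When both directions of lookup are needed repeatedly, the same file can be loaded once into two maps. The Python sketch below does this for the rows shown above, assuming the column order used in those examples (Variation ID, type, Allele ID, interpreted) and that any header or comment line starts with '#'; the function name build_maps is hypothetical.

```python
import csv
import io
from collections import defaultdict

# The example rows shown above, as tab-delimited text.
SAMPLE = (
    "561\tHaplotype\t15600\tyes\n"
    "561\tHaplotype\t38381\tyes\n"
    "561\tHaplotype\t38382\tyes\n"
    "242821\tVariant\t15600\tno\n"
)


def build_maps(lines):
    """Return (alleles_by_variation, variations_by_allele) as dicts of sets."""
    alleles_by_variation = defaultdict(set)
    variations_by_allele = defaultdict(set)
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # assumption: header and comment lines start with '#'
        variation_id, allele_id = int(row[0]), int(row[2])
        alleles_by_variation[variation_id].add(allele_id)
        variations_by_allele[allele_id].add(variation_id)
    return alleles_by_variation, variations_by_allele


by_variation, by_allele = build_maps(io.StringIO(SAMPLE))
print(sorted(by_variation[561]))   # Allele IDs in Variation ID 561
print(sorted(by_allele[15600]))    # Variation IDs containing Allele ID 15600
```

For the real file, the lines could come from gzip.open("variation_allele.txt.gz", "rt") instead of the in-memory sample.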
Other identifiers in ClinVar's XML releases
ClinVar's XML reports integer @ID values for multiple elements other than MeasureSet and Measure. These values correspond to the unique keys used in the relational database tables that ClinVar uses to represent the data. At present these values can be used to identify an element when processing it from one report to another, e.g. //Trait/@ID, but ClinVar does not consider them public identifiers and reserves the right to alter the numbering system.
Identifiers specific to other NCBI resources
ClinVar maintains identifiers that link to multiple other NCBI resources. These include the BookShelf, dbSNP, dbVar, Gene, MedGen's CUI, PubMed, and PubMedCentral.
- In the XML, these are reported in the XRef element.
- In the tab-delimited directories, these are reported in
  - cross_references.txt
  - var_citations.txt
Identifiers specific to resources outside of NCBI
ClinVar maintains multiple identifiers to resources outside of NCBI.
- In the XML, these are reported in the XRef element.
- In the tab-delimited directories, these are reported in
  - cross_references.txt
  - var_citations.txt
- In other ftp paths (see https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/README)
  - gene_condition_source_id