Method Detail
Submitter Method Handle: 1000GENOMES
Submitter Method ID: PHASE1_INDELS
Submitted method description:
README for 1000 Genomes Phase 1 short indels:
Accurately calling short insertions and deletions from low coverage
sequencing data has proven unexpectedly challenging. The present
data set has gone through many twists and turns, and it is necessary
to give some complex details in order to identify exactly which version
is being deposited in dbSNP. The 1000 Genomes Project's goal has
been to identify 95% of all polymorphic sites above 0.5% minor
allele frequency, with a false positive rate per site not exceeding 5%.
Experimental validation of short indel polymorphisms is challenging as
well. The current set does not reach 95% sensitivity, and it may or
may not meet the 5% false positive rate target (specificity).
The Phase 1 integrated genotype release contains SNPs from both
low coverage and exome capture sequencing, but its short indels
come only from low coverage sequencing. These are merged from
sets made by five different methods: freeBayes (Boston College),
GATK (Broad Institute), Dindel (Sanger), Platypus (Oxford) and
samtools (Sanger), using the 1000 Genomes Phase 1 low coverage
sequence data for 1092 individuals. The merged indel calls were
filtered using GATK Variant Quality Score Recalibration and
restricted to bi-allelic sites only. These are found in directory
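The restriction to bi-allelic sites can be illustrated with a minimal
sketch. It assumes VCF-style records in which the ALT field lists
alternate alleles separated by commas; the function names and the toy
records are hypothetical and are not part of the 1000 Genomes pipeline.

```python
def is_biallelic(alt_field):
    """Return True when a VCF-style ALT field lists exactly one alternate allele."""
    alleles = [a for a in alt_field.split(",") if a and a != "."]
    return len(alleles) == 1


def restrict_to_biallelic(records):
    """Keep only (chrom, pos, ref, alt) records with a single ALT allele."""
    return [rec for rec in records if is_biallelic(rec[3])]


# Toy records for illustration only.
records = [
    ("1", 1000, "A", "AT"),      # bi-allelic insertion: kept
    ("1", 2000, "GTT", "G,GT"),  # multi-allelic site: dropped
    ("2", 3000, "CA", "C"),      # bi-allelic deletion: kept
]
kept = restrict_to_biallelic(records)
```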
Further filtering beyond the restriction to biallelic indels is described
by Hyun Min Kang as follows:
1. Exclusion of excessive 1-bp outliers
We found that a subset of indels from the low-coverage data has a
very high false positive rate. In particular, the following 10 samples
showed an excessive number of singleton indels (~1,000 to 23,000),
most of them 1bp insertions.
Upon further investigation, we found that the excessive 1bp singleton
insertions are due to technical artifacts introduced in the sequencing
step. We removed 162,928 1bp singleton insertions specific only to
the 10 outlier samples.
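The rule in step 1 can be sketched as follows. This is an illustration
under stated assumptions, not the code used for the release: the record
layout (ref, alt, alt_count, carriers) and the sample IDs are
hypothetical placeholders, and a singleton is taken to mean one copy of
the non-reference allele in a single sample.

```python
def is_1bp_insertion(ref, alt):
    """True when ALT is exactly one base longer than REF."""
    return len(alt) - len(ref) == 1


def drop_outlier_singleton_insertions(indels, outlier_samples):
    """Drop 1bp insertions seen once, and only in an outlier sample."""
    kept = []
    for indel in indels:
        carriers = indel["carriers"]  # samples carrying the non-ref allele
        is_singleton = indel["alt_count"] == 1 and len(carriers) == 1
        if (
            is_singleton
            and carriers[0] in outlier_samples
            and is_1bp_insertion(indel["ref"], indel["alt"])
        ):
            continue  # treated as a likely sequencing artifact
        kept.append(indel)
    return kept


# Placeholder IDs; the actual 10 outlier samples are not reproduced here.
OUTLIER_SAMPLES = {"SAMPLE_A", "SAMPLE_B"}

indels = [
    # dropped: 1bp singleton insertion in an outlier sample
    {"id": "i1", "ref": "A", "alt": "AT", "alt_count": 1, "carriers": ["SAMPLE_A"]},
    # kept: carrier is not an outlier sample
    {"id": "i2", "ref": "A", "alt": "AT", "alt_count": 1, "carriers": ["SAMPLE_C"]},
    # kept: deletion, not an insertion
    {"id": "i3", "ref": "AT", "alt": "A", "alt_count": 1, "carriers": ["SAMPLE_A"]},
    # kept: not a singleton
    {"id": "i4", "ref": "G", "alt": "GC", "alt_count": 5,
     "carriers": ["SAMPLE_A", "SAMPLE_D", "SAMPLE_E"]},
]
filtered = drop_outlier_singleton_insertions(indels, OUTLIER_SAMPLES)
kept_ids = [indel["id"] for indel in filtered]
```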
2. In addition, we found a much higher fraction of frameshift indels
among low-coverage specific indels than among the indels shared between
low-coverage and exome data, suggesting that low-coverage specific
coding indels may have an elevated false positive rate. We removed
an additional 3,014 protein-coding frameshift indels exclusive to
low-coverage samples to increase the specificity of the protein-coding
indels.
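A minimal sketch of the frameshift rule in step 2, assuming each indel
carries two precomputed flags: whether it falls in protein-coding
sequence and whether it was also called from exome data. These field
names are hypothetical. An indel is treated as a frameshift when its
length change is not a multiple of 3.

```python
def is_frameshift(ref, alt):
    """True when the indel length change is not a multiple of 3."""
    return (len(alt) - len(ref)) % 3 != 0


def drop_low_coverage_only_frameshifts(indels):
    """Drop coding frameshift indels that lack support in the exome calls."""
    kept = []
    for indel in indels:
        if (
            indel["coding"]
            and not indel["in_exome_calls"]
            and is_frameshift(indel["ref"], indel["alt"])
        ):
            continue  # low-coverage-only coding frameshift: removed
        kept.append(indel)
    return kept


# Toy records for illustration only.
indels = [
    {"id": "f1", "ref": "CA", "alt": "C", "coding": True, "in_exome_calls": False},    # dropped
    {"id": "f2", "ref": "CA", "alt": "C", "coding": True, "in_exome_calls": True},     # kept: shared with exome
    {"id": "f3", "ref": "G", "alt": "GACT", "coding": True, "in_exome_calls": False},  # kept: in-frame (3bp)
    {"id": "f4", "ref": "T", "alt": "TA", "coding": False, "in_exome_calls": False},   # kept: non-coding
]
kept_ids = [indel["id"] for indel in drop_low_coverage_only_frameshifts(indels)]
```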
3. Preliminary evaluations of INDEL call sets demonstrated a high apparent
false positive rate after the above steps, and rare INDELs showed
higher discordance with independent datasets. To extract high quality
INDELs, we required a minimum allele frequency (before integration)
of 0.5%, and additionally applied an SVM approach to filter out
potential false positive INDELs, guided by indel genotypes from the
Affymetrix Axiom genotyping chip (provided by Jeannette Schmidt &
Jeremy Gollub). The SVM was trained using multiple features including
(a) allele balance, (b) inbreeding coefficient, (c) flanking sequence complexity,
(d) homopolymer runs, (e) strand bias, (f) cycle bias, (g) mapping quality,
(h) number of supporting non-ref reads, and (i) distance to nearby INDELs.
A list of these excluded sites can be found in directory
The Affymetrix indel genotypes are in directory
The final release set is currently in directory
Full details of the underlying sequence data are given in files:
All sequence coordinates are stated relative to the GRCh37 human
genome reference sequence.
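The 0.5% allele frequency restriction in step 3 can be sketched as
below. The sketch assumes diploid autosomal sites, so the 1092
individuals contribute 2184 chromosomes, and it assumes the threshold
applies to the minor allele frequency; the README does not spell out
either detail. The SVM filter is not reproduced here.

```python
N_INDIVIDUALS = 1092
N_CHROMOSOMES = 2 * N_INDIVIDUALS  # assumes diploid autosomal sites
MIN_FREQ = 0.005                   # 0.5% threshold from step 3


def minor_allele_frequency(alt_count, n_chromosomes=N_CHROMOSOMES):
    """Frequency of the rarer of the two alleles at a bi-allelic site."""
    af = alt_count / n_chromosomes
    return min(af, 1.0 - af)


def passes_frequency_filter(alt_count, n_chromosomes=N_CHROMOSOMES,
                            min_freq=MIN_FREQ):
    """True when the minor allele frequency is at least min_freq."""
    return minor_allele_frequency(alt_count, n_chromosomes) >= min_freq


# Under these assumptions, at least 11 of 2184 chromosomes must carry
# the minor allele, since 10 / 2184 is below 0.5%.
min_copies = next(c for c in range(N_CHROMOSOMES + 1)
                  if passes_frequency_filter(c))
```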

This method was used in the following submission:

Submitter Handle  Batch Type  Submitter batch id           Release build id
1000GENOMES       Assay       phase_1_low_coverage_indels  136