NCBI
dbSNP

Method Detail
Submitter Method Handle: 1000GENOMES
Submitter Method ID: LOW_COVERAGE_PILOT_INDELS
Submitted method description:
1000 Genomes Pilot Project low coverage short indel calls:
"Short indels" are defined for present purposes as insertions or
deletions of 50 base pairs or less, relative to the NCBI build 36
human genome reference sequence. They are discovered and
genotyped by local realignment of short read sequencing data
using Dindel software (see published paper).
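
As an illustration of the size definition above, the following minimal sketch
classifies a variant as a "short indel" from its reference and alternate
alleles. It assumes VCF-style REF/ALT strings; the function names are
hypothetical and are not part of Dindel or of the project pipeline.

```python
MAX_SHORT_INDEL_BP = 50  # size limit stated in the method description


def indel_length(ref: str, alt: str) -> int:
    """Signed length change: positive for insertions, negative for deletions."""
    return len(alt) - len(ref)


def is_short_indel(ref: str, alt: str) -> bool:
    """True if the alleles differ in length by 1 to 50 bases."""
    size = abs(indel_length(ref, alt))
    return 0 < size <= MAX_SHORT_INDEL_BP


if __name__ == "__main__":
    print(is_short_indel("A", "ATG"))  # 2-bp insertion
    print(is_short_indel("A", "G"))    # substitution, not an indel
```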
Sequence production and mapping:
Lymphoblastoid cell line DNA for each individual is sequenced on
Illumina, Roche 454 or AB SOLiD platforms and mapped to the NCBI
build 36 human genome reference sequence using MAQ for Illumina
data, ssaha2 for Roche 454 data, and Corona_lite 4.01 for AB SOLiD data.
The average depth of sequence coverage at HapMap 3 sites is 5.1x
per individual for CEU, 2.8x for CHB+JPT, and 3.7x for YRI.
Illumina and Roche 454 base call quality scores are recalibrated
using the GATK software and PCR duplicate reads are removed with
Picard MarkDuplicates. Neither base call quality recalibration
nor duplicate removal is done for AB SOLiD data. Complete .bam
files containing all mapped sequence reads are available from
ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/pilot_data/data/NA*/alignment/
in files:
NAxxxxx.SLX.maq.SRP000031.2009_08.bam - for Illumina Solexa data
NAxxxxx.454.ssaha.SRP000031.2009_10.bam - for Roche 454 data
NAxxxxx.SOLID.corona.SRP000031.2009_08.bam - for AB SOLiD data
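
The three file name patterns above can be turned into a small lookup. This is
only a sketch: the platform keys ("illumina", "454", "solid") and the function
name are assumptions made for illustration, the sample ID in the usage line is
a placeholder, and a file is not guaranteed to exist for every sample and
platform.

```python
# File name suffixes copied from the patterns listed above.
BAM_SUFFIX = {
    "illumina": "SLX.maq.SRP000031.2009_08.bam",
    "454": "454.ssaha.SRP000031.2009_10.bam",
    "solid": "SOLID.corona.SRP000031.2009_08.bam",
}


def bam_filename(sample_id: str, platform: str) -> str:
    """Build the expected .bam file name for a sample and platform."""
    try:
        suffix = BAM_SUFFIX[platform]
    except KeyError:
        raise ValueError(f"unknown platform: {platform!r}") from None
    return f"{sample_id}.{suffix}"


if __name__ == "__main__":
    # "NA00001" is a placeholder ID, not a real sample.
    print(bam_filename("NA00001", "illumina"))
```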
Indel discovery and genotyping:
Calls were made and annotated using the following procedure.
1. Extract candidate indel locations from alignments of Illumina,
Roche 454 and AB SOLiD short read genome sequences to the human
genome reference sequence. The candidate indel set was compiled
by Gerton Lunter from candidates provided by the Sanger, Broad,
Oxford, Sanger/LUMC and TGEN groups. There are a total of
8,504,899 candidate sites. All candidates are tested in all
populations. The JPT and CHB populations are analysed jointly.
The full list of candidate sites is available at:
http://www.well.ox.ac.uk/~gerton/1000G/LC/pilot1-indelcalls-17sept09.tgz
2. Realign reads around candidate indels to candidate haplotypes
using the indel caller Dindel (Albers et al., Genome Research, 2011).
Dindel at this stage was used both to produce indel site calls (make
a call whether a candidate indel segregates in the population) and
to produce genotype likelihoods for each individual at a called site.
3. Use linkage disequilibrium (LD) with the HapMap SNP genotypes
for each individual to refine indel genotype calls. This uses Dindel's
genotype likelihoods for each site and each individual with the QCALL
linkage disequilibrium software (Si Quang Le, Richard Durbin, Genome
Research, 2011).
Even though candidate indel sites from all technologies were used,
the support for candidate indels was evaluated only using Illumina
sequence data. For the following individuals there was no Illumina
data, and as a result their genotypes are completely imputed from
those of other individuals. (Any SOLiD/454 data for these samples
was *not* used to compute genotype likelihoods.)
List of individuals without Illumina sequence data whose genotypes
are imputed from other individuals:
CEU: NA12814, NA11840, NA12872, NA12815, NA12812, NA12760,
NA12874, NA12762, NA06985, NA12873, NA12234
YRI: NA19141, NA19143
JPTCHB: NA18969, NA18970
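
When working with the genotypes, it can be useful to flag the samples listed
above, since their indel genotypes carry no direct sequencing evidence. A
minimal sketch follows; the sample IDs are copied from the list above, while
the variable and function names are hypothetical.

```python
from typing import Optional

# Individuals without Illumina data, by population panel (from the list above).
NO_ILLUMINA = {
    "CEU": {
        "NA12814", "NA11840", "NA12872", "NA12815", "NA12812", "NA12760",
        "NA12874", "NA12762", "NA06985", "NA12873", "NA12234",
    },
    "YRI": {"NA19141", "NA19143"},
    "JPTCHB": {"NA18969", "NA18970"},
}


def imputed_panel(sample_id: str) -> Optional[str]:
    """Return the panel name if the sample is fully imputed, else None."""
    for panel, samples in NO_ILLUMINA.items():
        if sample_id in samples:
            return panel
    return None


def is_fully_imputed(sample_id: str) -> bool:
    """True if the sample's indel genotypes come from imputation alone."""
    return imputed_panel(sample_id) is not None
```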
4. The source .vcf files for this submission to dbSNP are currently
available by anonymous ftp in directory:
ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/pilot_data/paper_data_sets/a_map_of_human_variation/low_coverage/indels
These source .vcf files contain two additional types of information
not submitted to dbSNP, described in items 5 and 6:
5. QCALL filters out a small fraction (<0.25%) of the sites called
by Dindel; these are sites where the genotype likelihoods are not
consistent with the local LD structure. This 'NoQCALL' subset is
likely to be enriched for false calls, but it may also contain
potentially interesting targets for association studies, since one
reason for these sites being filtered out by QCALL could be low
LD with nearby SNPs. When a site is filtered out in this way, the
FILTER column in the original 'sites' .vcf file will say 'NoQCALL',
there will be no record in the original 'genotypes' .vcf file, and
the site is not submitted to dbSNP.
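
To recover the 'NoQCALL' subset from a 'sites' .vcf file, the FILTER column can
be inspected directly. The sketch below assumes the standard tab-separated VCF
column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and allows for
several semicolon-separated filter names. The function name is hypothetical and
the records in the usage example are made up for illustration.

```python
from typing import Iterable, List, Tuple

FILTER_COLUMN = 6  # 0-based index of FILTER in a standard VCF record


def split_by_qcall(vcf_lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split VCF site records into (kept, no_qcall) lists.

    Header lines (starting with '#') and blank lines are skipped.
    """
    kept: List[str] = []
    no_qcall: List[str] = []
    for line in vcf_lines:
        record = line.rstrip("\n")
        if not record.strip() or record.startswith("#"):
            continue
        filters = record.split("\t")[FILTER_COLUMN].split(";")
        if "NoQCALL" in filters:
            no_qcall.append(record)
        else:
            kept.append(record)
    return kept, no_qcall


if __name__ == "__main__":
    # Made-up records, not taken from the real call set.
    example = [
        "##fileformat=VCFv4.0",
        "\t".join(["1", "100", ".", "A", "AT", "50", "PASS", "."]),
        "\t".join(["1", "200", ".", "GC", "G", "12", "NoQCALL", "."]),
    ]
    kept, no_qcall = split_by_qcall(example)
    print(len(kept), len(no_qcall))
```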
6. More stringent filters were applied to predicted loss of function
(LOF) coding variants in the original .vcf files. In coding regions,
indel rates are significantly lower due to selection. Since noise levels
(factors resulting in false positive indel calls) are expected to be
approximately constant across the genome, the fraction of indel calls
which are false positives (the false discovery rate) will be increased
in coding regions. To lower the number of false positive indel calls,
we applied more stringent filters to the subset of indels that are called
in the genome-wide set and are predicted to fall into the LOF class.
The more stringent filter requires that the entire range of positions
where an indel would yield the same alternative haplotype sequence
as the originally assigned indel (for instance, within a repeat, deleting
any single repeat unit will give the same alternative haplotype) plus
4 bases of reference sequence on each side of this region, must be
covered by at least one read on the forward strand, and at least one
read on the reverse strand, with at most one mismatch between the
read and the alternative haplotype sequence resulting from the indel
(regardless of base call qualities). This filter removes the excess
of 1-bp frameshift insertions seen in CHB+JPT with respect to CEU
in the less stringently filtered genome-wide indel call set, although
it is expected to remove a significant number of true positive calls
as well. The indels that pass this stringent filter are marked in the
original .vcf files by 'SF' in the INFO field, but this annotation is not
submitted to dbSNP.
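
The region tested by the stringent filter can be sketched for the deletion
case as follows. This is an illustration under stated assumptions, not the
project's implementation: coordinates are 0-based on a reference string,
reads are represented by hypothetical (start, end, strand, mismatches) tuples
with half-open intervals, and "covered" is interpreted as a single read
spanning the whole window. Insertions would be handled analogously.

```python
from typing import Iterable, Tuple

FLANK = 4  # bases of reference required on each side, as described above
Read = Tuple[int, int, str, int]  # (start, end, strand, mismatches), hypothetical


def deletion_equivalence_range(ref: str, start: int, length: int) -> Tuple[int, int]:
    """Leftmost and rightmost start positions of a deletion of `length`
    bases that yield the same alternative haplotype."""
    left = start
    # Shifting left by one keeps the haplotype if the base entering the
    # deleted block equals the base leaving it.
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1
    right = start
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1
    return left, right


def stringent_window(ref: str, start: int, length: int,
                     flank: int = FLANK) -> Tuple[int, int]:
    """Half-open reference interval: ambiguous region plus flanking bases."""
    left, right = deletion_equivalence_range(ref, start, length)
    return max(0, left - flank), min(len(ref), right + length + flank)


def passes_stringent_filter(window: Tuple[int, int], reads: Iterable[Read],
                            max_mismatches: int = 1) -> bool:
    """Require one spanning read per strand with at most one mismatch."""
    lo, hi = window
    strands = {strand for start, end, strand, mm in reads
               if start <= lo and end >= hi and mm <= max_mismatches}
    return {"+", "-"} <= strands


if __name__ == "__main__":
    ref = "ACGTGGCACACATTGACG"  # made-up sequence with a CA repeat
    print(deletion_equivalence_range(ref, 8, 2))
    print(stringent_window(ref, 8, 2))
```

In the made-up sequence, deleting any one "CA" or "AC" unit of the repeat
gives the same haplotype, so the window extends across the whole repeat before
the flanking bases are added.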
(7/19/2010 -- Kees Albers, on behalf of the 1000 Genomes Project
Analysis group, adapted by Tom Blackwell for dbSNP submission.)

This method was used in the following submission:

Submitter Handle  Batch Type  Submitter batch id               Release build id
1000GENOMES       Assay       CEU_low_coverage_pilot_indel     133
1000GENOMES       Assay       YRI_low_coverage_pilot_indel     133
1000GENOMES       Assay       JPTCHB_low_coverage_pilot_indel  133