Method Detail
Submitter Method Handle: 1000GENOMES
Submitter Method ID: LOW_COVERAGE
Submitted method description:
README for 1000 Genomes Pilot 1 SNP calls:
Lymphoblastoid cell line DNA for each individual is sequenced on
Illumina, Roche 454 or AB SOLiD platforms and mapped to the NCBI
version 36 human genome reference sequence using MAQ for Illumina
data, ssaha2 for Roche 454 data and Corona_lite 4.01 for AB SOLiD data.
The average depth of sequence coverage at HapMap 3 sites is 5.1x
per individual for CEU, 2.8x for CHB+JPT, and 3.7x for YRI.
Illumina and Roche 454 base call quality scores are recalibrated
using the GATK software and PCR duplicate reads are removed with
Picard MarkDuplicates. Neither base call quality recalibration
nor duplicate removal was done for AB SOLiD data. Complete BAM
files containing all mapped sequence reads are available from
NA*/alignment/ in files:
NAxxxxx.SLX.maq.SRP000031.2009_08.bam - for Illumina Solexa data
NAxxxxx.454.ssaha.SRP000031.2009_10.bam - for Roche 454 data
NAxxxxx.SOLID.corona.SRP000031.2009_08.bam - for AB SOLiD data
For each population, three primary call sets were made: by Jared
Maguire and colleagues at the Broad Institute using GATK (BI), by
Yun Li and Goncalo Abecasis at the University of Michigan using
MACH (UMich), and by Quang Le and Richard Durbin at the Sanger
Institute using QCALL (SI). The final release SNP loci consist of
all sites called in at least two of the three primary sets that also
pass the following filters:
(1) the call is not removed by local realignment of reads around the
call using the GATK local realignment tool. This filter removes a
relatively small number of spurious calls caused by alignment
problems around indels.
(2) the summed depth of all reads covering the call is less than twice
the average summed depth at HapMap 3 sites -- specifically, less
than or equal to 625 for CEU, 445 for YRI and 330 for CHB+JPT.
This eliminates a small number of calls at high copy number sites.
(3) no more than 20% of all Illumina reads covering the site have a
MAQ mapping quality score of 0. This filter removes calls at sites
in more repetitive sequence regions. The use of a threshold based on
Illumina read mapping quality scores does not indicate that Illumina
reads are the primary source of errors -- this is simply chosen as a
proxy for the ability to accurately map reads from any technology to
the region. (The other short read aligners used, Corona_lite and
ssaha2, do not provide mapping quality scores.)
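The site-level rule above (called in at least two of three primary sets, then filters 1 to 3) can be sketched as follows. This is only an illustration under stated assumptions, not the project's pipeline: the CandidateSite record and its field names are hypothetical, the outcome of the realignment filter is taken as a precomputed flag, and a site with no Illumina reads is assumed to pass filter (3). Only the thresholds (625, 445, 330 and 20%) come from the text.

```python
from dataclasses import dataclass

# Summed-depth ceilings quoted in filter (2), keyed by population.
MAX_SUMMED_DEPTH = {"CEU": 625, "YRI": 445, "CHB+JPT": 330}
# Filter (3): at most this percentage of Illumina reads may have MAQ mapping quality 0.
MAX_MQ0_PERCENT = 20


@dataclass
class CandidateSite:
    """Hypothetical per-site record; the field names are illustrative only."""
    population: str
    called_by: frozenset        # subset of {"BI", "UMich", "SI"}
    survives_realignment: bool  # filter (1), assumed to be computed elsewhere
    summed_depth: int           # depth summed over all reads covering the site
    illumina_reads: int         # Illumina reads covering the site
    illumina_mq0_reads: int     # those reads with MAQ mapping quality 0


def passes_release_filters(site: CandidateSite) -> bool:
    """Return True if the site meets the 2-of-3 rule and filters (1) to (3)."""
    if len(site.called_by) < 2:
        return False
    if not site.survives_realignment:
        return False
    if site.summed_depth > MAX_SUMMED_DEPTH[site.population]:
        return False
    # Integer comparison avoids floating-point rounding at exactly 20%.
    # With zero Illumina reads this check passes (an assumption of this sketch).
    if site.illumina_mq0_reads * 100 > MAX_MQ0_PERCENT * site.illumina_reads:
        return False
    return True
```

For instance, a CEU site called by BI and SI with summed depth 625 and 20 of 100 Illumina reads at mapping quality 0 sits exactly on both thresholds and is kept.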
Mask files for each population indicate for each base in the genome
whether it passes filters 2 and 3. These are available by anonymous
ftp at
release/2010_03/pilot1/supporting/*.mask.fa.gz. Larger files giving
the read depth at each position are in subdirectory /supporting/depths/.
Genotype calls for the sequenced individuals are likewise a consensus of
the three methods: a genotype is called when at least two of the three
methods agree on it. Calls for CEU and YRI are phased based on the SI
calls, which were made on the trio-phased HapMap 3 haplotypes. The
exceptions are individuals NA10851, NA12717, NA12004, NA10847 and NA12414
for CEU, and individuals NA18523, NA18522, NA19129, NA18502, NA18907 and
NA18856 for YRI, for whom there are no trio-phased HapMap 3 haplotypes.
These individuals, and all of the CHB+JPT genotype calls, are phased
according to the UMich primary call set. In the rare cases where there is
no consensus genotype call, the SI call is used for CEU and YRI and the
UMich call for CHB+JPT.
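The genotype rule just described, including the per-population fallback, might look like the sketch below. The function name and the genotype encoding (unphased strings such as "0/1") are assumptions made for illustration; phasing is not modelled, and all three methods are assumed to have produced a call for the individual.

```python
from collections import Counter

# Method whose call is used when no two methods agree, per the text.
FALLBACK_METHOD = {"CEU": "SI", "YRI": "SI", "CHB+JPT": "UMich"}


def consensus_genotype(calls: dict, population: str) -> str:
    """Pick a genotype for one individual at one site.

    calls maps each method name ("BI", "UMich", "SI") to an unphased
    genotype string such as "0/1" (hypothetical encoding).
    """
    genotype, votes = Counter(calls.values()).most_common(1)[0]
    if votes >= 2:
        # Two or three methods agree, so their genotype is called.
        return genotype
    # No consensus: fall back to the designated method for this population.
    return calls[FALLBACK_METHOD[population]]
```

With calls of "0/1" from BI and UMich and "0/0" from SI, the sketch returns "0/1" for any population. If all three methods disagree, it returns the SI call for CEU or YRI and the UMich call for CHB+JPT.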
This 2/3 consensus genotype strategy produces a few thousand loci where
a SNP is called by either two or three methods, but different individuals
show the minor allele in the genotype calls from each method. In this
case, the 2/3 consensus rule legitimately says that every individual is
homozygous reference, and the alternate allele is shown as "." with
allele count zero. These sites are included in the .vcf files but are
excluded from the Pilot 1 submission to dbSNP. For similar reasons, the
third allele at some tri-allelic loci will show an allele count of zero.
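A toy example may make the zero allele count case concrete. The sample names and calls below are invented for illustration and are not project data. Each method calls the site, but in a different individual, so the 2/3 rule makes every individual homozygous reference. The no-consensus fallback is left out of this sketch because it is not needed here.

```python
from collections import Counter

# Hypothetical alternate-allele counts (0, 1 or 2) per individual,
# as reported by each of the three primary call sets at one site.
calls = {
    "BI":    {"sample_A": 1, "sample_B": 0, "sample_C": 0},
    "UMich": {"sample_A": 0, "sample_B": 1, "sample_C": 0},
    "SI":    {"sample_A": 0, "sample_B": 0, "sample_C": 1},
}


def consensus_alt_count(calls_by_method: dict) -> int:
    """Sum alternate alleles over individuals using the 2-of-3 genotype rule."""
    samples = next(iter(calls_by_method.values())).keys()
    total = 0
    for sample in samples:
        votes = Counter(method[sample] for method in calls_by_method.values())
        dosage, agreeing = votes.most_common(1)[0]
        if agreeing >= 2:
            total += dosage
    return total


# Every method sees one alternate allele somewhere, so the site is called,
# yet the consensus genotypes contain no alternate allele at all.
per_method_counts = {name: sum(c.values()) for name, c in calls.items()}
site_alt_count = consensus_alt_count(calls)
```

Here each method reports one alternate allele, while the consensus count over the three individuals is zero, which matches the situation where the alternate allele is shown as "." in the .vcf files.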
(28/3/10 -- Richard Durbin on behalf of the 1000 Genomes Project
Analysis group, adapted by Tom Blackwell for dbSNP submission.)

This method was used in the following submissions:

Submitter Handle  Batch Type  Submitter batch id                Release build id
1000GENOMES       Frequency   pilot_1_CEU_mar_2010              132
1000GENOMES       Frequency   pilot_1_CHB+JPT_mar_2010          132
1000GENOMES       Frequency   pilot_1_YRI_mar_2010              132
1000GENOMES       Assay       phase_1_vqsr_v2b_sites_june_2011  133