Method Detail

Submitter Method Handle:	1000GENOMES
Submitter Method ID:	LOW_COVERAGE_PILOT_INDELS
Submitted method description:


	1000 Genomes Pilot Project low coverage short indel calls:
	"Short indels" are defined for present purposes as insertions or
	deletions of 50 base pairs or less, relative to the NCBI build 36
	human genome reference sequence. They are discovered and
	genotyped by local realignment of short read sequencing data
	using Dindel software (Albers et al., Genome Research, 2011).
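	The size cut-off in this definition can be written as a small check.
	The following is a minimal sketch, assuming alleles are given as
	VCF-style REF/ALT strings; the function names are hypothetical and
	are not part of Dindel.

```python
MAX_SHORT_INDEL_BP = 50  # size limit stated in the method description


def indel_length(ref: str, alt: str) -> int:
    """Signed length change: positive for an insertion, negative for a deletion."""
    return len(alt) - len(ref)


def is_short_indel(ref: str, alt: str) -> bool:
    """True if the allele pair changes length by 1 to 50 bp."""
    size = abs(indel_length(ref, alt))
    return 0 < size <= MAX_SHORT_INDEL_BP
```

	A same-length substitution such as A to C has a length change of
	zero and is therefore not counted as an indel by this check.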
	Sequence production and mapping:
	Lymphoblastoid cell line DNA for each individual is sequenced on
	Illumina, Roche 454 or AB SOLiD platforms and mapped to the NCBI
	build 36 human genome reference sequence using MAQ for Illumina
	data, ssaha2 for Roche 454 data, and Corona_lite 4.01 for AB SOLiD data.
	The average depth of sequence coverage at HapMap 3 sites is 5.1 x
	per individual for CEU, 2.8 x for CHB+JPT, and 3.7 x for YRI.
	Illumina and Roche 454 base call quality scores are recalibrated
	using the GATK software and PCR duplicate reads are removed with
	Picard MarkDuplicates. Neither base call quality recalibration
	nor duplicate removal is done for AB SOLiD data. Complete .bam
	files containing all mapped sequence reads are available from
	ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/pilot_data/data/NA*/alignment/
	in files:
	NAxxxxx.SLX.maq.SRP000031.2009_08.bam - for Illumina Solexa data
	NAxxxxx.454.ssaha.SRP000031.2009_10.bam - for Roche 454 data
	NAxxxxx.SOLID.corona.SRP000031.2009_08.bam - for AB SOLiD data
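	The naming pattern above can be turned into a file location for a
	given sample and platform. This is a sketch only: the platform keys
	and the function name are hypothetical, and the code does not check
	that a file exists for the sample or that the FTP site is still
	reachable at this address.

```python
FTP_ROOT = "ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/pilot_data/data/"

# File name suffixes copied from the list above; the dictionary keys are
# illustrative labels chosen for this sketch.
BAM_SUFFIX = {
    "illumina": "SLX.maq.SRP000031.2009_08.bam",
    "454": "454.ssaha.SRP000031.2009_10.bam",
    "solid": "SOLID.corona.SRP000031.2009_08.bam",
}


def bam_url(sample: str, platform: str) -> str:
    """Build the expected .bam location for one sample and platform."""
    suffix = BAM_SUFFIX[platform]  # raises KeyError for an unknown platform
    return f"{FTP_ROOT}{sample}/alignment/{sample}.{suffix}"
```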
	Indel discovery and genotyping:
	Calls were made and annotated using the following procedure.
	1. Extract candidate indel locations from alignments of Illumina,
	Roche 454 and AB SOLiD short read genome sequences to the human
	genome reference sequence. The candidate indel set was compiled
	by Gerton Lunter from candidates provided by the Sanger, Broad,
	Oxford, Sanger/LUMC and TGEN groups. There are a total of
	8,504,899 candidate sites. All candidates are tested in all
	populations. The JPT and CHB populations are analysed jointly.
	The full list of candidate sites is available at:
	http://www.well.ox.ac.uk/~gerton/1000G/LC/pilot1-indelcalls-17sept09.tgz
	2. Realign reads around candidate indels to candidate haplotypes
	using the indel caller Dindel (Albers et al., Genome Research, 2011).
	Dindel at this stage was used both to produce indel site calls (make
	a call whether a candidate indel segregates in the population) and
	to produce genotype likelihoods for each individual at a called site.
	3. Use linkage disequilibrium (LD) with the HapMap SNP genotypes
	for each individual to refine indel genotype calls. This uses Dindel's
	genotype likelihoods for each site and each individual with the QCALL
	linkage disequilibrium software (Si Quang Le, Richard Durbin, Genome
	Research, 2011).
	Even though candidate indel sites from all technologies were used,
	the support for candidate indels was evaluated only using Illumina
	sequence data. For the following individuals there was no Illumina
	data, and as a result their genotypes are completely imputed from
	those of other individuals. (Any SOLiD/454 data for these samples
	was not used to compute genotype likelihoods.)
	List of individuals without Illumina sequence data whose genotypes
	are imputed from other individuals:
	CEU: NA12814, NA11840, NA12872, NA12815, NA12812, NA12760,
	NA12874, NA12762, NA06985, NA12873, NA12234
	YRI: NA19141, NA19143
	JPTCHB: NA18969, NA18970
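	When working with the genotype files it can help to flag the samples
	whose genotypes are entirely imputed. A minimal lookup sketch follows;
	the sample IDs are copied from the list above, while the variable and
	function names are hypothetical.

```python
from typing import Optional

# Individuals without Illumina sequence data, keyed by population label.
NO_ILLUMINA_DATA = {
    "CEU": {
        "NA12814", "NA11840", "NA12872", "NA12815", "NA12812", "NA12760",
        "NA12874", "NA12762", "NA06985", "NA12873", "NA12234",
    },
    "YRI": {"NA19141", "NA19143"},
    "JPTCHB": {"NA18969", "NA18970"},
}


def imputed_population(sample: str) -> Optional[str]:
    """Return the population label if the sample is fully imputed, else None."""
    for population, samples in NO_ILLUMINA_DATA.items():
        if sample in samples:
            return population
    return None


def is_fully_imputed(sample: str) -> bool:
    """True if the sample's genotypes come only from imputation."""
    return imputed_population(sample) is not None
```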
	4. The source .vcf files for this submission to dbSNP are currently
	available by anonymous ftp in the directory:
	ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/pilot_data/paper_data_sets/a_map_of_human_variation/low_coverage/indels
	These source .vcf files contain two additional types of information
	not submitted to dbSNP, described in items 5 and 6:
	5. QCALL filters out a small fraction (<0.25%) of the sites called
	by Dindel; these are sites where the genotype likelihoods are not
	consistent with the local LD structure. This 'NoQCALL' subset is
	likely to be enriched for false calls, but it may also contain
	potentially interesting targets for association studies, since one
	reason for these sites being filtered out by QCALL could be low
	LD with nearby SNPs. When a site is filtered out in this way, the
	FILTER column in the original 'sites' .vcf file will say 'NoQCALL',
	there will be no record in the original 'genotypes' .vcf file, and
	the site is not submitted to dbSNP.
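	Separating the 'NoQCALL' subset from a 'sites' file amounts to
	reading the FILTER column of each record. The sketch below assumes
	tab-separated VCF records with FILTER as the seventh column and
	multiple filter flags separated by semicolons; the function name is
	hypothetical, and the exact layout of the pilot files should be
	checked before relying on it.

```python
from typing import Iterable, List, Tuple

FILTER_COLUMN = 6  # 0-based index of FILTER in a tab-separated VCF record


def split_by_qcall(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split VCF site records into (kept, no_qcall) lists, skipping headers."""
    kept: List[str] = []
    no_qcall: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank line or header/meta line
        flags = line.split("\t")[FILTER_COLUMN].split(";")
        if "NoQCALL" in flags:
            no_qcall.append(line)
        else:
            kept.append(line)
    return kept, no_qcall
```

	Records in the second list are the ones that, according to the
	description above, have no entry in the 'genotypes' file and are
	absent from dbSNP.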
	6. More stringent filters were applied to predicted loss of function
	(LOF) coding variants in the original .vcf files. In coding regions,
	indel rates are significantly lower due to selection. Since noise levels
	(factors resulting in false positive indel calls) are expected to be
	approximately constant across the genome, the fraction of indel calls
	which are false positives (the false discovery rate) will be increased
	in coding regions. To lower the number of false positive indel calls,
	we applied more stringent filters to the subset of indels that are called
	in the genome-wide set and are predicted to fall into the LOF class.
	The more stringent filter considers the entire range of positions
	where an indel would yield the same alternative haplotype sequence
	as the originally assigned indel (for instance, within a repeat, deleting
	any single repeat unit will give the same alternative haplotype), plus
	4 bases of reference sequence on each side of this range. This region
	must be covered by at least one read on the forward strand and at
	least one read on the reverse strand, with at most one mismatch between
	the read and the alternative haplotype sequence resulting from the indel
	(regardless of base call qualities). This filter removes the excess
	of 1-bp frameshift insertions seen in CHB+JPT with respect to CEU
	in the less stringently filtered genome-wide indel call set, although
	it is expected to remove a significant number of true positive calls
	as well. The indels that pass this stringent filter are marked in the
	original .vcf files by 'SF' in the INFO field, but this annotation is not
	submitted to dbSNP.
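	The stringent filter can be illustrated with a small sketch. It is
	not the code used to produce the 'SF' annotation, and it rests on
	several assumptions that go beyond the description above: the region
	is expressed in coordinates of the alternative haplotype, "covered"
	is read as a single read spanning the whole region on each strand,
	reads are given as (offset on the alternative haplotype, bases in
	reference orientation, strand), and the supplied reference window is
	long enough to hold the flanks. All names are hypothetical.

```python
from typing import List, Tuple

FLANK = 4           # reference bases required on each side (from the text)
MAX_MISMATCHES = 1  # allowed read/haplotype mismatches (from the text)

Read = Tuple[int, str, str]  # (offset on alt haplotype, bases, '+' or '-')


def apply_indel(ref: str, pos: int, del_len: int, ins_seq: str) -> str:
    """Delete del_len bases at pos and insert ins_seq there."""
    return ref[:pos] + ins_seq + ref[pos + del_len:]


def equivalent_positions(ref: str, pos: int, del_len: int, ins_seq: str) -> List[int]:
    """All positions where a same-size indel gives the same alt haplotype."""
    alt = apply_indel(ref, pos, del_len, ins_seq)
    ins_len = len(ins_seq)
    hits = []
    for q in range(len(ref) - del_len + 1):
        # An event at q reproduces alt iff both flanks of alt match ref.
        if alt[:q] == ref[:q] and alt[q + ins_len:] == ref[q + del_len:]:
            hits.append(q)
    return hits


def read_supports(read: Read, alt: str, start: int, end: int) -> bool:
    """True if the read spans alt[start:end] with at most MAX_MISMATCHES."""
    offset, bases, _strand = read
    if offset > start or offset + len(bases) < end:
        return False  # read does not span the whole region
    segment = alt[offset:offset + len(bases)]
    if len(segment) < len(bases):
        return False  # read runs past the end of the haplotype window
    mismatches = sum(a != b for a, b in zip(bases, segment))
    return mismatches <= MAX_MISMATCHES


def passes_stringent_filter(ref: str, pos: int, del_len: int, ins_seq: str,
                            reads: List[Read]) -> bool:
    """Require a supporting read on each strand across the padded region."""
    alt = apply_indel(ref, pos, del_len, ins_seq)
    hits = equivalent_positions(ref, pos, del_len, ins_seq)
    # Region on the alt haplotype: every equivalent placement plus flanks,
    # clipped to the supplied window.
    start = max(0, hits[0] - FLANK)
    end = min(len(alt), hits[-1] + len(ins_seq) + FLANK)
    strands = {r[2] for r in reads if read_supports(r, alt, start, end)}
    return {"+", "-"} <= strands
```

	For example, deleting one CA unit at position 6 of
	CCTTGACACACAGGTTCC gives the same haplotype for any start position
	from 5 to 10, so the reads must span that whole range plus the
	flanks rather than just the originally reported position.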
	(7/19/2010 -- Kees Albers, on behalf of the 1000 Genomes Project
	Analysis group, adapted by Tom Blackwell for dbSNP submission.)

This method was used in the following submissions:

Submitter Handle	Batch Type	Submitter batch id	Release build id
1000GENOMES	Assay	CEU_low_coverage_pilot_indel	133
1000GENOMES	Assay	YRI_low_coverage_pilot_indel	133
1000GENOMES	Assay	JPTCHB_low_coverage_pilot_indel	133
