Method Detail

Submitter Method Handle:	1000GENOMES
Submitter Method ID:	PHASE1_INDELS
Submitted method description:


	README for 1000 Genomes Phase 1 short indels:
	Accurately calling short insertions and deletions from low coverage
	sequencing data has proven unexpectedly challenging. The present
	data set has gone through many twists and turns, and it is necessary
	to give some complex details in order to identify exactly which version
	is being deposited in dbSNP. The 1000 Genomes Project's goal has
	been to identify 95 % of all polymorphic sites above 0.5 % minor
	allele frequency, with a false positive rate per site not exceeding 5 %.
	Experimental validation of short indel polymorphism is challenging as
	well. The current set does not reach 95 % sensitivity, and it may or
	may not achieve a 5 % false positive rate (specificity).
	The Phase 1 integrated genotype release contains SNPs from both
	low coverage and exome capture sequencing, but its short indels
	come only from low coverage sequencing. These are merged from
	sets made by five different methods: freeBayes (Boston College),
	GATK (Broad Institute), Dindel (Sanger), Platypus (Oxford) and
	samtools (Sanger), using the 1000 Genomes Phase 1 low coverage
	sequence data for 1092 individuals. The merged indel calls were
	filtered using GATK Variant Quality Score Recalibration and
	restricted to bi-allelic sites only. These are found in directory
	ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/release/20110521/supporting
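The merge and bi-allelic restriction described above can be illustrated with a minimal sketch. The README does not say how the five call sets were combined, so this assumes a simple union keyed on (chromosome, position, reference allele) and treats any locus with more than one alternate allele as multi-allelic. The function name, the data layout and the toy sites are hypothetical, and the GATK Variant Quality Score Recalibration step is not modelled.

```python
from collections import defaultdict


def merge_biallelic(call_sets):
    """Union per-caller site lists and keep loci with exactly one alt allele.

    call_sets maps a caller name to a list of (chrom, pos, ref, alt) tuples.
    This layout is an assumption made for illustration only.
    """
    alts_by_locus = defaultdict(set)
    for calls in call_sets.values():
        for chrom, pos, ref, alt in calls:
            alts_by_locus[(chrom, pos, ref)].add(alt)

    # A locus is kept only if every caller that reported it agrees on one alt.
    merged = [
        (chrom, pos, ref, next(iter(alts)))
        for (chrom, pos, ref), alts in alts_by_locus.items()
        if len(alts) == 1
    ]
    return sorted(merged)


# Toy input: three of the five callers, with made-up sites.
call_sets = {
    "freeBayes": [("1", 100, "A", "AT"), ("1", 200, "GC", "G")],
    "GATK": [("1", 100, "A", "AT"), ("1", 300, "T", "TA")],
    "Dindel": [("1", 300, "T", "TAA")],
}
merged_sites = merge_biallelic(call_sets)
```

In this toy input the locus at position 300 is dropped because two callers report different alternate alleles there.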
	Further filtering beyond the restriction to biallelic indels is described
	by Hyun Min Kang as follows:
	1. Exclusion of excessive 1-bp outliers
	We identified that a subset of indels from the low-coverage data have
	very high false positive rates. In particular, the following 10 samples
	showed an excessive number of singleton indels (~1,000 to 23,000), most
	of which are 1bp insertions.
	NA12144
	NA20752
	NA18626
	NA19437
	NA19439
	NA19436
	NA19448
	NA18627
	NA19313
	NA19446
	Upon further investigation, we found that the excessive 1bp singleton
	insertions are due to technical artifacts introduced in the sequencing
	step. We removed 162,928 1bp singleton insertions specific only to
	the 10 outlier samples.
	2. In addition, we found a much higher fraction of frameshift indels
	among low-coverage specific indels than among the indels shared between
	low-coverage and exome data, suggesting that low-coverage specific
	coding indels may have elevated false positive rates. We removed an
	additional 3,014 protein-coding frameshift indels exclusive to
	low-coverage samples to increase the specificity of the protein-coding
	indels.
	3. Preliminary evaluations of INDEL call sets demonstrated a high apparent
	false positive rate after the above steps, and rare INDELs demonstrated
	higher discordance with independent datasets. To extract high quality
	INDELs, we restricted the minimum allele frequency (before integration)
	to 0.5%, and additionally applied an SVM approach to filter out
	potential false positive INDELs, guided by indel genotypes from the
	Affymetrix Axiom genotyping chip (provided by Jeannette Schmidt &
	Jeremy Gollub). The SVM was trained using multiple features including
	(a) allele balance, (b) inbreeding coefficient, (c) flanking sequence
	complexity, (d) homopolymer runs, (e) strand bias, (f) cycle bias,
	(g) mapping quality, (h) number of supporting non-ref reads, and
	(i) distance to nearby INDELs.
	A list of these excluded sites can be found in directory
	ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/technical/working/20120131_indel_sites_to_exclude
	The Affymetrix indel genotypes are in directory
	ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/technical/working/20120208_axiom_genotypes
	The final release set is currently in directory
	ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/technical/working/20120312_phase1_v2_indel_cleaned_sites_list
	Full details of the underlying sequence data are given in files:
	ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/sequence_indices/20101123.sequence.index
	ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/phase1/phase1.alignment.index
	All sequence coordinates are stated relative to the GRCh37 human
	genome reference sequence.
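The three filtering steps described by Hyun Min Kang can be sketched as simple predicates over a per-site record. This is an illustration under stated assumptions, not the project's actual pipeline: the `Indel` record and its field names are hypothetical, "singleton" is read as one non-reference allele copy, a frameshift is read as a length change not divisible by 3, the 0.5% threshold is applied to the non-reference allele frequency over 2 x 1092 autosomal chromosomes, and the SVM stage is not modelled.

```python
from dataclasses import dataclass

# The 10 outlier samples listed in step 1 of the README.
OUTLIER_SAMPLES = frozenset({
    "NA12144", "NA20752", "NA18626", "NA19437", "NA19439",
    "NA19436", "NA19448", "NA18627", "NA19313", "NA19446",
})
N_SAMPLES = 1092   # individuals in the Phase 1 low coverage data
MIN_AF = 0.005     # 0.5% allele frequency threshold from step 3


@dataclass
class Indel:
    """Hypothetical per-site record; field names are assumptions."""
    ref: str
    alt: str
    carriers: frozenset           # sample IDs carrying the non-ref allele
    alt_count: int                # number of non-ref allele copies
    coding: bool = False          # overlaps protein-coding sequence
    in_exome_calls: bool = False  # also called from exome data


def length_change(v):
    return len(v.alt) - len(v.ref)


def is_outlier_singleton_1bp_insertion(v):
    # Step 1: 1bp singleton insertions seen only in an outlier sample.
    return (length_change(v) == 1 and v.alt_count == 1
            and len(v.carriers) == 1 and v.carriers <= OUTLIER_SAMPLES)


def is_low_coverage_only_frameshift(v):
    # Step 2: coding frameshift indels not supported by exome calls.
    return v.coding and not v.in_exome_calls and length_change(v) % 3 != 0


def below_min_af(v):
    # Step 3 (frequency part only): assumes diploid autosomal sites.
    return v.alt_count / (2 * N_SAMPLES) < MIN_AF


def passes_filters(v):
    return not (is_outlier_singleton_1bp_insertion(v)
                or is_low_coverage_only_frameshift(v)
                or below_min_af(v))


def het_carriers(n):
    # Made-up sample IDs for n heterozygous carriers.
    return frozenset(f"SAMPLE{i}" for i in range(n))


variants = {
    "artifact": Indel("A", "AT", frozenset({"NA12144"}), 1),
    "frameshift": Indel("GCT", "G", het_carriers(20), 20, coding=True),
    "inframe": Indel("GCTA", "G", het_carriers(20), 20, coding=True),
    "rare": Indel("T", "TA", het_carriers(10), 10),
    "common": Indel("T", "TA", het_carriers(11), 11),
}
kept = [name for name, v in variants.items() if passes_filters(v)]
```

Under these assumptions a site needs at least 11 non-reference allele copies among the 2,184 chromosomes to reach 0.5% (10 / 2184 is about 0.46%, 11 / 2184 is about 0.50%), which is why the "rare" toy site is dropped and the "common" one is kept.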

This method was used in the following submission:

Submitter Handle	Batch Type	Submitter batch id	Release build id
1000GENOMES	Assay	phase_1_low_coverage_indels	136
