|
|
| README for 1000 Genomes Phase 1 short indels: |
| Accurately calling short insertions and deletions from low coverage |
| sequencing data has proven unexpectedly challenging. The present |
| data set has gone through many twists and turns, and it is necessary |
| to give some complex details in order to identify exactly which version |
| is being deposited in dbSNP. The 1000 Genomes Project's goal has |
| been to identify 95 % of all polymorphic sites above 0.5 % minor |
| allele frequency, with a false positive rate per site not exceeding 5 %. |
| Experimental validation of short indel polymorphisms is challenging as |
| well. The current set does not reach 95 % sensitivity, and it may or |
| may not achieve a 5 % false positive rate (specificity). |
| The Phase 1 integrated genotype release contains SNPs from both |
| low coverage and exome capture sequencing, but its short indels |
| come only from low coverage sequencing. These are merged from |
| sets made by five different methods: freeBayes (Boston College), |
| GATK (Broad Institute), Dindel (Sanger), Platypus (Oxford) and |
| samtools (Sanger), using the 1000 Genomes Phase 1 low coverage |
| sequence data for 1092 individuals. The merged indel calls were |
| filtered using GATK Variant Quality Score Recalibration and |
| restricted to bi-allelic sites only. These are found in directory |
| ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp |
| /release/20110521/supporting |
| Further filtering beyond the restriction to biallelic indels is described |
| by Hyun Min Kang as follows: |
| 1. Exclusion of excessive 1-bp outliers |
| We found that a subset of indels from the low-coverage data has a |
| very high false positive rate. In particular, the following 10 samples |
| showed an excessive number of singleton indels (~1,000 to 23,000), |
| most of which are 1bp insertions. |
| NA12144 |
| NA20752 |
| NA18626 |
| NA19437 |
| NA19439 |
| NA19436 |
| NA19448 |
| NA18627 |
| NA19313 |
| NA19446 |
| Upon further investigation, we found that the excessive 1bp singleton |
| insertions are due to technical artifacts introduced in the sequencing |
| step. We removed 162,928 1bp singleton insertions specific only to |
| the 10 outlier samples. |
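| As an illustration of step 1, the following Python sketch drops 1bp |
| insertions carried by exactly one sample when that sample is one of the |
| 10 outliers listed above. The record layout (id, ref, alt, carriers) and |
| the function names are assumptions made for this example, not the format |
| or code used by the project. "Singleton" is taken here to mean "observed |
| in a single sample". |

```python
# Sample IDs copied from the outlier list in this README.
OUTLIER_SAMPLES = {
    "NA12144", "NA20752", "NA18626", "NA19437", "NA19439",
    "NA19436", "NA19448", "NA18627", "NA19313", "NA19446",
}


def is_1bp_insertion(ref, alt):
    """True when the alternate allele is exactly one base longer than the reference."""
    return len(alt) - len(ref) == 1


def is_outlier_singleton_insertion(ref, alt, carriers):
    """True for a 1bp insertion carried by exactly one sample that is an outlier.

    `carriers` is a set of sample IDs carrying the non-reference allele
    (a hypothetical representation chosen for this sketch).
    """
    return (
        is_1bp_insertion(ref, alt)
        and len(carriers) == 1
        and next(iter(carriers)) in OUTLIER_SAMPLES
    )


def drop_outlier_singletons(records):
    """Keep every (variant_id, ref, alt, carriers) record that is not flagged."""
    return [
        rec for rec in records
        if not is_outlier_singleton_insertion(rec[1], rec[2], rec[3])
    ]
```

| Deletions, longer insertions, and insertions shared by more than one |
| sample pass through unchanged in this sketch. |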
| 2. In addition, we found a much higher fraction of frameshift indels in |
| low-coverage specific indels compared to the indels shared between |
| low-coverage and exome data, suggesting that low-coverage specific |
| coding indels may have an elevated false positive rate. We removed |
| an additional 3,014 protein-coding frameshift indels exclusive to |
| low-coverage samples to increase the specificity of the protein-coding |
| indels. |
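| The rule in step 2 can be sketched as follows. An indel is treated as a |
| frameshift when the length difference between its alleles is not a |
| multiple of 3. The dictionary keys (ref, alt, is_coding, in_exome) are |
| hypothetical names introduced for this example; how coding status and |
| exome support were actually determined is not described here. |

```python
def is_frameshift(ref, alt):
    """True when the allele length difference is non-zero and not a multiple of 3."""
    delta = len(alt) - len(ref)
    return delta != 0 and delta % 3 != 0


def drop_low_coverage_only_frameshifts(records):
    """Remove coding frameshift indels that have no support in the exome calls.

    Each record is a dict with the hypothetical keys
    'ref', 'alt', 'is_coding' (bool) and 'in_exome' (bool).
    """
    kept = []
    for rec in records:
        flagged = (
            rec["is_coding"]
            and not rec["in_exome"]
            and is_frameshift(rec["ref"], rec["alt"])
        )
        if not flagged:
            kept.append(rec)
    return kept
```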
| 3. Preliminary evaluations of INDEL call sets demonstrated a high apparent |
| false positive rate after the above steps, and rare INDELs demonstrated |
| higher discordance with independent datasets. To extract high quality |
| INDELs, we restricted the minimum allele frequency (before integration) |
| to 0.5%, and additionally applied an SVM approach to further filter out |
| potential false positive INDELs, guided by the indel genotypes from the |
| Affymetrix Axiom genotyping chip (provided by Jeannette Schmidt & |
| Jeremy Gollub). The SVM was trained using multiple features including |
| (a) allele balance, (b) inbreeding coefficient, (c) flanking sequence complexity, |
| (d) homopolymer runs, (e) strand bias, (f) cycle bias, (g) mapping quality, |
| (h) number of supporting non-ref reads, and (i) distance to nearby INDELs. |
| A list of these excluded sites can be found in directory |
| ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp |
| /technical/working/20120131_indel_sites_to_exclude |
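| To make the 0.5% threshold in step 3 concrete: with 1092 individuals and |
| assuming diploid autosomal sites, there are 2 x 1092 = 2184 chromosomes, |
| so 0.5% corresponds to 10.92 allele copies. The sketch below assumes the |
| threshold is applied to the minor allele frequency computed from an |
| allele count; the function names are illustrative only, and the SVM |
| filter is not reproduced here. |

```python
N_SAMPLES = 1092                 # individuals in Phase 1, from this README
N_CHROMOSOMES = 2 * N_SAMPLES    # assumption: diploid autosomal sites
MIN_FREQUENCY = 0.005            # the 0.5% threshold from step 3


def minor_allele_frequency(alt_count, n_chromosomes=N_CHROMOSOMES):
    """Frequency of the rarer of the two alleles at a bi-allelic site."""
    freq = alt_count / n_chromosomes
    return min(freq, 1.0 - freq)


def passes_frequency_filter(alt_count, n_chromosomes=N_CHROMOSOMES):
    """True when the minor allele frequency is at least MIN_FREQUENCY."""
    return minor_allele_frequency(alt_count, n_chromosomes) >= MIN_FREQUENCY
```

| Under these assumptions a site needs at least 11 copies of the minor |
| allele among the 2184 chromosomes to pass. |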
| The Affymetrix indel genotypes are in directory |
| ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp |
| /technical/working/20120208_axiom_genotypes |
| The final release set is currently in directory |
| ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp |
| /technical/working/20120312_phase1_v2_indel_cleaned_sites_list |
| Full details of the underlying sequence data are given in files: |
| ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp |
| /sequence_indices/20101123.sequence.index |
| /phase1/phase1.alignment.index |
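| The locations in this README are written as a common FTP prefix followed |
| by a path on the next line. A minimal sketch of joining the two parts, |
| using only strings that appear above (the helper name full_url is an |
| assumption for this example): |

```python
FTP_BASE = "ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp"

# Relative paths exactly as listed in this README.
RELATIVE_PATHS = [
    "/release/20110521/supporting",
    "/technical/working/20120131_indel_sites_to_exclude",
    "/technical/working/20120208_axiom_genotypes",
    "/technical/working/20120312_phase1_v2_indel_cleaned_sites_list",
    "/sequence_indices/20101123.sequence.index",
    "/phase1/phase1.alignment.index",
]


def full_url(relative_path, base=FTP_BASE):
    """Join the FTP prefix and a relative path with exactly one slash between them."""
    return base.rstrip("/") + "/" + relative_path.lstrip("/")


if __name__ == "__main__":
    for path in RELATIVE_PATHS:
        print(full_url(path))
```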
| All sequence coordinates are stated relative to the GRCh37 human |
| genome reference sequence. |