|
|
| 1000 Genomes Pilot Project low coverage short indel calls: |
| "Short indels" are defined for present purposes as insertions or |
| deletions of 50 base pairs or less, relative to the NCBI build 36 |
| human genome reference sequence. They are discovered and |
| genotyped by local realignment of short read sequencing data |
| using Dindel software (see published paper). |
| Sequence production and mapping: |
| Lymphoblastoid cell line DNA for each individual is sequenced on |
| Illumina, Roche 454 or AB SOLiD platforms and mapped to the NCBI |
| build 36 human genome reference sequence using MAQ for Illumina |
| data, ssaha2 for Roche 454 data and Corona_lite 4.01 for AB SOLiD data. |
| The average depth of sequence coverage at HapMap 3 sites is 5.1x |
| per individual for CEU, 2.8x for CHB+JPT, and 3.7x for YRI. |
| Illumina and Roche 454 base call quality scores are recalibrated |
| using the GATK software and PCR duplicate reads are removed with |
| Picard MarkDuplicates. Neither base call quality recalibration |
| nor duplicate removal is done for AB SOLiD data. Complete .bam |
| files containing all mapped sequence reads are available from |
| ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/pilot_data/data/ |
| NA*/alignment/ in files: |
| NAxxxxx.SLX.maq.SRP000031.2009_08.bam - for Illumina Solexa data |
| NAxxxxx.454.ssaha.SRP000031.2009_10.bam - for Roche 454 data |
| NAxxxxx.SOLID.corona.SRP000031.2009_08.bam - for AB SOLiD data |
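| The file naming scheme above can be sketched in Python as follows. |
| This is a minimal illustration, not part of any 1000 Genomes tool: |
| the names BASE_URL, SUFFIX and bam_url are hypothetical, and the |
| sample ID "NAxxxxx" is the placeholder used in the list above. The |
| sketch only builds the URL string; it does not check that a file |
| exists for a given sample and platform. |

```python
# Hypothetical helper for illustration only; all names here are assumptions.
BASE_URL = "ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/pilot_data/data/"

# Filename suffix per sequencing platform, copied from the list above.
SUFFIX = {
    "illumina": "SLX.maq.SRP000031.2009_08.bam",
    "454": "454.ssaha.SRP000031.2009_10.bam",
    "solid": "SOLID.corona.SRP000031.2009_08.bam",
}


def bam_url(sample, platform):
    """Build the alignment file URL for one sample ID and one platform key."""
    return f"{BASE_URL}{sample}/alignment/{sample}.{SUFFIX[platform]}"


# "NAxxxxx" is a placeholder; substitute a real sample ID before use.
print(bam_url("NAxxxxx", "illumina"))
```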
| Indel discovery and genotyping: |
| Calls were made and annotated using the following procedure. |
| 1. Extract candidate indel locations from alignments of Illumina, |
| Roche 454 and AB SOLiD short read genome sequences to the human |
| genome reference sequence. The candidate indel set was compiled |
| by Gerton Lunter from candidates provided by the Sanger, Broad, |
| Oxford, Sanger/LUMC and TGEN groups. There are a total of |
| 8,504,899 candidate sites. All candidates are tested in all |
| populations. The JPT and CHB populations are analysed jointly. |
| The full list of candidate sites is available at: |
| http://www.well.ox.ac.uk/~gerton/1000G/LC/pilot1-indelcalls-17sept09.tgz |
| 2. Realign reads around candidate indels to candidate haplotypes |
| using the indel caller Dindel (Albers et al., Genome Research, 2011). |
| At this stage Dindel was used both to produce indel site calls (that |
| is, to decide whether a candidate indel segregates in the population) |
| and to produce genotype likelihoods for each individual at a called site. |
| 3. Use linkage disequilibrium (LD) with the HapMap SNP genotypes |
| for each individual to refine indel genotype calls. Dindel's genotype |
| likelihoods for each site and each individual are passed to the QCALL |
| linkage disequilibrium software (Si Quang Le, Richard Durbin, Genome |
| Research, 2011). |
| Although candidate indel sites from all technologies were used, |
| support for candidate indels was evaluated using Illumina sequence |
| data only. For the following individuals there was no Illumina |
| data, and as a result their genotypes are completely imputed from |
| those of other individuals. (Any SOLiD/454 data for these samples |
| was *not* used to compute genotype likelihoods.) |
| List of individuals without Illumina sequence data whose genotypes |
| are imputed from other individuals: |
| CEU: NA12814, NA11840, NA12872, NA12815, NA12812, NA12760, |
| NA12874, NA12762, NA06985, NA12873, NA12234 |
| YRI: NA19141, NA19143 |
| JPTCHB: NA18969, NA18970 |
| 4. The source .vcf files for this submission to dbSNP are currently |
| available by anonymous ftp in directory: |
| ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/pilot_data/ |
| paper_data_sets/a_map_of_human_variation/low_coverage/indels |
| These source .vcf files contain two additional types of information |
| not submitted to dbSNP, described in items 5 and 6: |
| 5. QCALL filters out a small fraction (<0.25%) of the sites called |
| by Dindel; these are sites where the genotype likelihoods are not |
| consistent with the local LD structure. This 'NoQCALL' subset is |
| likely to be enriched for false calls, but it may also contain |
| potentially interesting targets for association studies, since one |
| reason for these sites being filtered out by QCALL could be low |
| LD with nearby SNPs. When a site is filtered out in this way, the |
| FILTER column in the original 'sites' .vcf file will say 'NoQCALL', |
| there will be no record in the original 'genotypes' .vcf file, and |
| the site is not submitted to dbSNP. |
| 6. More stringent filters were applied to predicted loss of function |
| (LOF) coding variants in the original .vcf files. In coding regions, |
| indel rates are significantly lower due to selection. Since noise levels |
| (factors resulting in false positive indel calls) are expected to be |
| approximately constant across the genome, the fraction of indel calls |
| which are false positives (the false discovery rate) will be increased |
| in coding regions. To lower the number of false positive indel calls, |
| we applied more stringent filters to the subset of indels that are called |
| in the genome-wide set and are predicted to fall into the LOF class. |
| The more stringent filter considers the entire range of positions |
| where an indel would yield the same alternative haplotype sequence |
| as the originally assigned indel (for instance, within a repeat, deleting |
| any single repeat unit gives the same alternative haplotype), plus |
| 4 bases of reference sequence on each side of this region. This window |
| must be covered by at least one read on the forward strand and at least |
| one read on the reverse strand, each with at most one mismatch between |
| the read and the alternative haplotype sequence resulting from the indel |
| (regardless of base call qualities). This filter removes the excess |
| of 1-bp frameshift insertions seen in CHB+JPT with respect to CEU |
| in the less stringently filtered genome-wide indel call set, although |
| it is expected to remove a significant number of true positive calls |
| as well. The indels that pass this stringent filter are marked in the |
| original .vcf files by 'SF' in the INFO field, but this annotation is not |
| submitted to dbSNP. |
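| Items 5 and 6 can be illustrated with a short Python sketch. It is |
| not the project's implementation, and all function names are |
| hypothetical. read_annotations assumes a tab-separated VCF data line |
| with FILTER in the seventh column and INFO in the eighth; the example |
| records are invented. required_window handles deletions only and |
| assumes the window runs from the leftmost equivalent deletion start |
| to the rightmost deleted base, extended by 4 reference bases on each |
| side. The read coverage and mismatch test itself is not shown. |

```python
def read_annotations(sites_line):
    """Return (is_noqcall, is_sf) for one data line of a 'sites' .vcf file.

    Assumes tab-separated columns in standard VCF order, with FILTER as
    column 7 and INFO as column 8, each a semicolon-separated list.
    """
    fields = sites_line.rstrip("\n").split("\t")
    filters = fields[6].split(";")
    info_keys = [entry.split("=")[0] for entry in fields[7].split(";")]
    return "NoQCALL" in filters, "SF" in info_keys


def equivalent_deletion_starts(ref, start, length):
    """List every 0-based start whose deletion of `length` bases from `ref`
    gives the same haplotype as deleting ref[start:start + length]."""
    alt = ref[:start] + ref[start + length:]
    return [q for q in range(len(ref) - length + 1)
            if ref[:q] + ref[q + length:] == alt]


def required_window(ref, start, length, flank=4):
    """Half-open interval [lo, hi) of `ref` that reads would need to cover,
    under the interpretation stated in the lead-in (an assumption)."""
    starts = equivalent_deletion_starts(ref, start, length)
    lo = max(0, starts[0] - flank)
    hi = min(len(ref), starts[-1] + length + flank)
    return lo, hi


# Invented example records; real files may use other FILTER/INFO values.
print(read_annotations("1\t1000\t.\tCA\tC\t50\tNoQCALL\tDP=10"))
print(read_annotations("1\t2000\t.\tC\tCA\t50\tPASS\tDP=10;SF"))

# Deleting one "CA" unit from a CACACA repeat: several starts are equivalent.
local_ref = "GGTTCACACATGGCC"
print(equivalent_deletion_starts(local_ref, 4, 2))
print(required_window(local_ref, 4, 2))
```

| The brute-force comparison is quadratic in the length of the local |
| reference, which is acceptable for a short window around one indel; |
| insertions could be treated analogously by comparing the haplotypes |
| produced by inserting the same sequence at neighbouring positions. |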
| (7/19/2010 -- Kees Albers, on behalf of the 1000 Genomes Project |
| Analysis group, adapted by Tom Blackwell for dbSNP submission.) |