Series GSE36217
Status Public on Mar 03, 2012
Title Comparison of systematic sequencing errors using spike-in standards
Organism Homo sapiens
Experiment type Other
Summary While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a training data set, which is typically either from a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 units, and by as much as 13 units at CpG sites. In addition, since reads mapping to the genome are not used for recalibration, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has fewer dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration.
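The quality-score "units" in the summary are Phred-scaled, Q = -10 log10(p), where p is the probability that a base call is wrong. Because the spike-in sequences are known, every mismatch in a read aligned to a spike-in can be counted as a sequencing error, which gives an empirical quality to compare with the reported one. The sketch below illustrates that comparison only; it is not GATK's implementation. The function names, the pseudocount, and the counts are hypothetical and chosen for illustration.

```python
import math


def phred(p_error: float) -> float:
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(p_error)


def empirical_quality(mismatches: int, observations: int) -> float:
    """Estimate quality from bases aligned to a known spike-in sequence.

    Assumption: a simple pseudocount (1 mismatch in 2 observations) is
    added so that zero observed mismatches does not give log10(0).
    This smoothing is illustrative and not taken from the study.
    """
    p_error = (mismatches + 1) / (observations + 2)
    return phred(p_error)


# Hypothetical counts for one group of bases reported at Q30.
reported_q = 30
observed_q = empirical_quality(mismatches=99, observations=9998)

# A positive difference means the reported score overestimates accuracy.
overestimate = reported_q - observed_q
print(f"reported Q{reported_q}, empirical Q{observed_q:.1f}, "
      f"overestimated by {overestimate:.1f} units")
```

In a real recalibration the counts would be grouped by covariates such as reported quality, machine cycle, and dinucleotide context, which is where the dinucleotide and cycle effects described in the summary would show up.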
 
Overall design Four human RNA samples with equimolar ERCC spike-in standards were sequenced on Illumina. Two human brain/liver/muscle RNA mixtures with dynamic range of ERCC spike-in standards were sequenced on SOLiD.
 
Contributor(s) Salit M, Zook JM, Sen S, McDaniel JD
Citation(s) 22859977
Submission date Mar 01, 2012
Last update date May 15, 2019
Contact name Justin M Zook
E-mail(s) jzook@nist.gov
Phone 3019754133
Organization name National Institute of Standards and Technology
Department Biochemical Science Division
Lab Material Measurement Laboratory
Street address 100 Bureau Dr., MS8313
City Gaithersburg
State/province Maryland
ZIP/Postal code 20899
Country USA
 
Platforms (2)
GPL10999 Illumina Genome Analyzer IIx (Homo sapiens)
GPL13393 AB SOLiD 4 System (Homo sapiens)
Samples (6)
GSM883914 BLM1Plus
GSM883915 BLM2Plus
GSM883916 08-01-0277
Relations
SRA SRP011192
BioProject PRJNA153117

Download family Format
SOFT formatted family file(s) SOFT
MINiML formatted family file(s) MINiML
Series Matrix File(s) TXT

Supplementary file Size Download File type/resource
GSE36217_RAW.tar 5.7 Mb (http)(custom) TAR (of CSV)
SRA Run Selector
Raw data are available in SRA
Processed data provided as supplementary file
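The supplementary archive is listed as a TAR of CSV files. A minimal sketch for checking its contents after download follows, using only the Python standard library. The helper name is hypothetical, the member file names are not given in this record, and matching gzipped members (".csv.gz") is an assumption.

```python
import tarfile
from typing import BinaryIO, List


def list_csv_members(archive: BinaryIO) -> List[str]:
    """Return the sorted names of CSV files inside an open TAR archive.

    Assumption: members end in .csv or .csv.gz; the record does not
    list the actual file names.
    """
    suffixes = (".csv", ".csv.gz")
    with tarfile.open(fileobj=archive, mode="r:*") as tar:
        return sorted(
            member.name
            for member in tar.getmembers()
            if member.isfile() and member.name.lower().endswith(suffixes)
        )


# Usage, assuming the archive was saved under its listed name:
# with open("GSE36217_RAW.tar", "rb") as handle:
#     for name in list_csv_members(handle):
#         print(name)
```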
