Quantifying molecular bias in DNA data storage

Yuan-Jyue Chen; Christopher N Takahashi; Lee Organick; Callista Bee; Siena Dumas Ang; Patrick Weiss; Bill Peck; Georg Seelig; Luis Ceze; Karin Strauss

doi:10.1038/s41467-020-16958-3

Nat Commun. 2020 Jun 29;11(1):3264. doi: 10.1038/s41467-020-16958-3.

Authors

Yuan-Jyue Chen¹, Christopher N Takahashi², Lee Organick², Callista Bee², Siena Dumas Ang³, Patrick Weiss⁴, Bill Peck⁴, Georg Seelig²,⁵, Luis Ceze⁶, Karin Strauss⁷

Affiliations

¹ Microsoft Research, Redmond, Washington, 98052, USA. yuanjc@microsoft.com.
² Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington, 98195, USA.
³ Microsoft Research, Redmond, Washington, 98052, USA.
⁴ Twist Bioscience, San Francisco, California, 94158, USA.
⁵ Department of Electrical and Computer Engineering, University of Washington, Seattle, Washington, 98195, USA.
⁶ Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington, 98195, USA. luisceze@cs.washington.edu.
⁷ Microsoft Research, Redmond, Washington, 98052, USA. kstrauss@microsoft.com.

Abstract

DNA has recently emerged as an attractive medium for archival data storage. Recent work has demonstrated proof-of-principle prototype systems; however, very uneven (biased) sequencing coverage has been reported, which indicates inefficiencies in the storage process. Deviations from the average coverage in the sequence copy distribution can cause either wasteful provisioning in sequencing or an excessive number of missing sequences. Here, we use millions of unique sequences from a DNA-based digital data archival system to study the oligonucleotide copy unevenness problem and show that the two paramount sources of bias are the synthesis and amplification (PCR) processes. Based on these findings, we develop a statistical model for each molecular process as well as for the overall process. We further use our model to explore the trade-offs between synthesis bias, storage physical density, logical redundancy, and sequencing redundancy, providing insights for engineering efficient, robust DNA data storage systems.
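The link between uneven copy distribution and missing sequences can be illustrated with a minimal sketch. It is not the statistical model developed in the paper: it assumes that reads are drawn by Poisson sampling in proportion to each sequence's relative abundance, and it uses a lognormal abundance distribution purely as a stand-in for bias. The function names, the `sigma` parameter, and the pool size are hypothetical.

```python
import math
import random


def expected_dropout(weights, mean_coverage):
    """Expected fraction of sequences that receive zero reads.

    Assumes (for illustration only) Poisson sampling: a sequence with
    relative abundance w is read Poisson(mean_coverage * w / mean(w)) times,
    so it is missed with probability exp(-mean_coverage * w / mean(w)).
    """
    mean_w = sum(weights) / len(weights)
    missed = (math.exp(-mean_coverage * w / mean_w) for w in weights)
    return sum(missed) / len(weights)


def lognormal_weights(n, sigma, seed=0):
    """Hypothetical biased pool: lognormal relative abundances."""
    rng = random.Random(seed)
    return [rng.lognormvariate(0.0, sigma) for _ in range(n)]


uniform = [1.0] * 10_000                      # perfectly even pool
skewed = lognormal_weights(10_000, sigma=1.0)  # assumed uneven pool

for coverage in (5, 10, 20):
    print(
        f"coverage {coverage:>2}: "
        f"even pool dropout = {expected_dropout(uniform, coverage):.2e}, "
        f"skewed pool dropout = {expected_dropout(skewed, coverage):.2e}"
    )
```

Under these assumptions, at the same mean coverage the skewed pool always loses a larger fraction of sequences than the even pool, so recovering them requires either deeper sequencing or more logical redundancy.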

MeSH terms

Bias
Information Storage and Retrieval*
Models, Theoretical
Sequence Analysis, DNA* / methods
Sequence Analysis, DNA* / statistics & numerical data