SRA Data Formats

SRA Data is Available with Simplified Quality Scores

SRA data are available either with full base quality scores (SRA Normalized Format) or with simplified quality scores (SRA Lite), depending on user preference. Both formats can be streamed on demand to the same file types (fastq, sam, etc.), so both are compatible with existing workflows and applications that expect quality scores. However, SRA Lite files are much smaller, which reduces storage footprint and data transfer times and allows dumps to complete faster. The SRA Toolkit defaults to the SRA Normalized Format, which includes full per-base quality scores, but users who do not require full base quality scores for their analysis can request the SRA Lite version to save time on data transfers.
To request SRA Lite data when using the SRA Toolkit, set the "Prefer SRA Lite files with simplified base quality scores" option on the main page of the toolkit configuration. This instructs the tools to preferentially use the SRA Lite format when available (please be sure to use toolkit version 2.11.2 or later to access this feature). The quality scores generated from SRA Lite files will be the same for each base within a given read (quality = 30 or 3, depending on whether the Read Filter flag is set to pass or reject). Data in the SRA Normalized Format will continue to have a .sra file extension, while SRA Lite files have a .sralite file extension.

SRA Data Going Forward

The SRA Normalized Format was created to support FAIR (Findable, Accessible, Interoperable, Reusable) principles, and newer, efficiently sized SRA formats continue this support, making it easier to manipulate and analyze large datasets while also reducing file size and bandwidth requirements. Full base quality scores are not needed for many bioinformatic use cases and workflows, and data formats without them reduce the typical SRA file footprint by ~60%, with commensurate reductions in transfer times when accessing the data. SRA Lite and SRA Normalized Format files are both fully accessible and streamable using the SRA Toolkit.
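As a rough illustration of what a ~60% smaller footprint means in practice, the arithmetic can be sketched as below. The function names, the example file size, and the bandwidth are all hypothetical; only the ~60% figure comes from the text above, and it describes a typical file, so actual savings will vary from run to run.

```python
# Illustrative arithmetic only: the ~60% reduction is the typical figure quoted
# above; the function names and example numbers are assumptions for this sketch.

def estimated_lite_size(normalized_bytes: float, reduction: float = 0.60) -> float:
    """Estimate SRA Lite size from the SRA Normalized Format size."""
    return normalized_bytes * (1.0 - reduction)


def transfer_seconds(size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Idealized transfer time at a constant bandwidth (no protocol overhead)."""
    return size_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    normalized = 10e9  # hypothetical 10 GB .sra file
    bandwidth = 50e6   # hypothetical 50 MB/s link
    lite = estimated_lite_size(normalized)
    print(f"normalized: {transfer_seconds(normalized, bandwidth):.0f} s")
    print(f"lite:       {transfer_seconds(lite, bandwidth):.0f} s")
```

Because transfer time scales linearly with size at a fixed bandwidth, the estimated time drops by the same fraction as the file size.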

SRA Normalized Format - original format with full base quality scores

This is the format provided since the inception of the SRA. It contains base calls, full base quality scores, and alignments.
This format has a .sra file extension and is available from cloud providers and via the SRA Toolkit.

SRA Lite - smaller format with simplified quality scores

This new format contains base calls, simplified quality scores, and alignments. This format has a .sralite file extension and is available from cloud providers and NCBI via the SRA Toolkit.

Output files derived from this format contain simplified quality scores.

SRA Lite files are produced from SRA Normalized Format by assessing overall read quality, setting a per-read quality flag (Read_Filter), and removing base quality scores from the file. In the resulting files, all reads have a Read_Filter flag with value pass or reject. Importantly, it is still possible to produce fastq formatted files from SRA Lite format using the SRA toolkit. In this case, each read will have a constant quality score set to 30 for reads with Read_Filter value "pass" or 3 for reads with a value "reject".
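To make the constant quality scores concrete, here is a minimal sketch of what a fastq record derived from SRA Lite data might look like. It assumes the common Phred+33 ASCII encoding for the quality line; the function names and example reads are hypothetical, and this is not the toolkit's own code.

```python
# Sketch under stated assumptions: Phred+33 encoding; names are illustrative.
PHRED_OFFSET = 33

# Constant per-base scores described above for reads dumped from SRA Lite.
LITE_QUALITY = {"pass": 30, "reject": 3}


def lite_quality_string(read_length: int, read_filter: str) -> str:
    """Return a quality line in which every base carries the same score."""
    score = LITE_QUALITY[read_filter]
    return chr(score + PHRED_OFFSET) * read_length


def lite_fastq_record(name: str, bases: str, read_filter: str) -> str:
    """Format one four-line fastq record with a constant quality line."""
    quality = lite_quality_string(len(bases), read_filter)
    return f"@{name}\n{bases}\n+\n{quality}\n"


if __name__ == "__main__":
    print(lite_fastq_record("read1", "ACGTACGT", "pass"), end="")
    print(lite_fastq_record("read2", "ACGTACGT", "reject"), end="")
```

Under the Phred+33 assumption, a score of 30 is written as `?` and a score of 3 as `$`, so every quality line consists of a single repeated character.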

Illumina fastq and sam/bam specifications support a quality bit that is set by the sequencing instrument, and SRA Lite stores this as a "pass"/"reject" Read_Filter value. If this bit is set in the submitted fastq or bam file, the value is retained. If it is not, SRA will set a pass/reject value based on the quality score distribution within each read. Reads in which more than half of the quality scores are <20 are flagged "reject". Reads that begin or end with a run of more than 10 quality scores <20 are also flagged "reject". Reads that pass these quality checks are flagged "pass".

When dumping data using the fastq-dump, fasterq-dump, or sam-dump utilities in the SRA Toolkit, all reads are included by default. However, the fastq-dump tool has an option to include only passed or only rejected reads:

fastq-dump --read-filter <[pass|reject]>
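The quality checks described above, which apply when the submitted file does not already carry the instrument's quality bit, can be sketched as follows. This is only an illustration of the stated rules; the function and constant names are hypothetical, and the actual SRA implementation may treat edge cases (such as empty reads) differently.

```python
from typing import Sequence

# Thresholds taken from the rules described above; names are illustrative.
LOW_QUALITY = 20   # scores below this value count as low quality
MAX_EDGE_RUN = 10  # a longer low-quality run at either end rejects the read


def leading_low_run(scores: Sequence[int]) -> int:
    """Count consecutive low-quality scores at the start of the sequence."""
    run = 0
    for score in scores:
        if score >= LOW_QUALITY:
            break
        run += 1
    return run


def read_filter(scores: Sequence[int]) -> str:
    """Return "pass" or "reject" for one read's per-base quality scores."""
    low_count = sum(1 for score in scores if score < LOW_QUALITY)
    # Rule 1: more than half of the scores are below 20.
    if 2 * low_count > len(scores):
        return "reject"
    # Rule 2: the read begins or ends with more than 10 scores below 20.
    if leading_low_run(scores) > MAX_EDGE_RUN:
        return "reject"
    if leading_low_run(list(reversed(scores))) > MAX_EDGE_RUN:
        return "reject"
    return "pass"


if __name__ == "__main__":
    print(read_filter([35] * 50))              # uniformly high quality
    print(read_filter([10] * 11 + [35] * 39))  # low-quality run at the start
```

Both thresholds are strict: a read with exactly half of its scores below 20, or with a low-quality run of exactly 10 at one end, still passes in this sketch.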

In order to interact with these files and set your preference for SRA Lite files, please use SRA Toolkit version 2.11.2 or later.

Original Submitted Files - files as submitted to SRA

All original submitted files are available via our Cloud Data Delivery Service. From the SRA Run Selector, users can request that the submitted files for a run be delivered to their own AWS or GCP cloud bucket. This service is provided for both public and authorized-access (dbGaP) data, with no charge for delivery.

Frequently Asked Questions

Answers to Frequently Asked Questions: Data Format FAQ


Engage

NCBI wants your feedback on SRA in the Cloud. Contact sra@ncbi.nlm.nih.gov with questions or if you would like to provide input on new functionality.


Last updated: 2021-10-06T16:21:48Z