Setting up BigQuery
Overview
The Sequence Read Archive (SRA) has moved all of its metadata into BigQuery to give the bioinformatics community programmatic access to these data. You can now search across the entire SRA by sequencing methodology and sample attributes. NCBI is piloting this in BigQuery to help users benefit from elastic scaling and parallel query execution. BigQuery has a large collection of client libraries that you can use within your workflow. You can also interact with it in a web browser.
Get started in BigQuery
Set up account
To access BigQuery, you will need to set up a Google Cloud account:
https://cloud.google.com/
Once you have set up the account, you will need to create your project:
https://cloud.google.com/resource-manager/docs/creating-managing-projects
Record the project ID; you will need it to access BigQuery through the client libraries or the command line.
Payment
You pay for the queries you run against public datasets, so review BigQuery's payment requirements for on-demand queries before you start. BigQuery provides 1 TB of query processing per month for free.
Access methods
We recommend first using the BigQuery query editor to become familiar with SQL and writing queries before attempting to use the command line tools or client libraries.
BigQuery can be accessed through a web browser query editor:
https://console.cloud.google.com/bigquery
BigQuery client library documentation is also available for reference if you plan to access it through the supported programming languages:
https://cloud.google.com/bigquery/docs/reference/libraries
BigQuery command line tools can be downloaded and set up from here:
https://cloud.google.com/sdk/docs/quickstarts
Linking the SRA dataset in BigQuery Console
You will want to pin the SRA dataset to your BigQuery Console to make it easier to access and explore the available metadata. Click the Add Data button on the left side of the screen, in the Explorer panel.
Next, select Pin a project, click on Enter project name, paste nih-sra-datastore into the Pin a project box and click Pin.
Now you can proceed to example queries.
Set up the command line tool
First, you should create your account and your project through the Google web interface.
Next, download and install the Cloud SDK from the link above.
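After installing, it can be useful to confirm that both tools are reachable from your shell before going further. A minimal sketch, assuming the Cloud SDK installs the gcloud and bq executables on your PATH; missing_tools is a hypothetical helper name, not part of the SDK.

```shell
# Hypothetical helper: print the name of every listed tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    # command -v succeeds only when the shell can resolve the tool name.
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage: prints nothing when both Cloud SDK tools are installed.
# missing_tools gcloud bq
```

If a name is printed, re-run the installer or check that the SDK's bin directory is on your PATH.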
Then sign in to your account from the command line by running gcloud auth login.
Once you are signed in, set your project ID by running gcloud config set project PROJECT_ID, where PROJECT_ID is the ID that was set when you created your project (this is different from your project name).
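This step can be sketched as a small shell function that sets the project and then reads the setting back to confirm it took effect. It assumes the Cloud SDK is installed and that you have already signed in; use_project is a hypothetical helper name, and my-sra-project in the usage comment is a placeholder, not a real project ID.

```shell
# Hypothetical helper: set the active project, then read it back to confirm.
# Assumes the Cloud SDK is installed and you have already signed in.
use_project() {
  project_id="$1"
  gcloud config set project "$project_id" || return 1
  # get-value prints the currently configured project ID on stdout.
  active="$(gcloud config get-value project 2>/dev/null)"
  if [ "$active" = "$project_id" ]; then
    echo "active project: $active"
  else
    echo "expected $project_id but found: $active" >&2
    return 1
  fi
}

# Usage (my-sra-project is a placeholder, not a real project ID):
# use_project my-sra-project
```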
Now you will be able to run the following example query:
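The example query itself did not survive in this copy of the page, so the following is a stand-in and not the original. It is a minimal sketch that assumes the SRA metadata is exposed as the table nih-sra-datastore.sra.metadata with acc and organism columns; the helper name sra_accessions and the organism value in the usage comment are illustrative only.

```shell
# Hypothetical helper: print run accessions for one organism as CSV.
# Assumes the table nih-sra-datastore.sra.metadata with columns acc and organism.
sra_accessions() {
  # --quiet suppresses job status lines so only results reach stdout;
  # --max_rows raises the default cap of 100 rows on printed results;
  # @organism is a named query parameter (an empty type defaults to STRING).
  bq --quiet --format=csv query \
    --use_legacy_sql=false \
    --max_rows=1000 \
    --parameter="organism::$1" \
    'SELECT acc FROM `nih-sra-datastore.sra.metadata` WHERE organism = @organism'
}

# Usage: drop the CSV header row and save one accession per line.
# sra_accessions 'Mycobacterium tuberculosis' | tail -n +2 > accession_list.txt
```

Passing the organism as a query parameter avoids quoting problems when the value contains spaces or apostrophes.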
The accession_list.txt file will contain the list of accessions that you can use with the SRA Toolkit to download the data.
More command line query examples can be found here.
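Because on-demand queries are billed by the amount of data they scan, it can be worth checking a query's size before running it. A minimal sketch, assuming bq is installed and you are signed in: the --dry_run flag validates a query and reports the bytes it would process without running it. The helper names are hypothetical, the table name in the usage comment is an assumption, and the percentage assumes the free monthly allowance is counted as 2^40 bytes.

```shell
# Hypothetical helper: report how many bytes a query would scan, without running it.
# Assumes bq (installed with the Cloud SDK) is on PATH and you are signed in.
estimate_query() {
  bq query --use_legacy_sql=false --dry_run "$1"
}

# Hypothetical helper: percentage of a monthly free allowance used by a byte count.
# Assumes the allowance is 2^40 bytes; adjust if your pricing terms differ.
free_tier_percent() {
  awk -v b="$1" 'BEGIN { printf "%.2f\n", b * 100 / (2 ^ 40) }'
}

# Usage (table name is an assumption; pass the byte count from the dry run):
# estimate_query 'SELECT acc FROM `nih-sra-datastore.sra.metadata`'
# free_tier_percent 549755813888
```

Selecting only the columns you need generally lowers the bytes scanned, since BigQuery stores data by column.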
Contact SRA
Contact SRA staff for assistance at sra@ncbi.nlm.nih.gov