Download large genome data packages

Use the datasets command-line tool to get large NCBI Datasets genome data packages

If you want to download genome data for more than 1,000 genomes, or if the genome data package exceeds 15 GB, you’ll need to use the datasets command-line tool (CLI).

The datasets CLI downloads a large NCBI Datasets genome data package as a dehydrated zip archive that contains only metadata and the location of the data on NCBI servers.

You can get the data in three steps:

  1. Download the dehydrated zip archive.
  2. Unzip the downloaded zip archive.
  3. Rehydrate the extracted directory to retrieve the data.
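
The three steps above can be chained in a small shell function. This is only a sketch: fetch_genome_package is a hypothetical helper name, not part of the datasets CLI, and it assumes that datasets and unzip are both on your PATH. It uses only the commands and flags shown on this page.

```shell
# Sketch: download, unzip, and rehydrate one genome data package.
# fetch_genome_package is a hypothetical helper, not a datasets CLI feature.
fetch_genome_package() {
  local accession="$1"   # for example GCF_000001405.40
  local zip_file="$2"    # for example human_GRCh38_dataset.zip
  local out_dir="$3"     # for example my_human_dataset

  # 1. Download the dehydrated zip archive.
  datasets download genome accession "$accession" --dehydrated --filename "$zip_file" || return 1

  # 2. Unzip the archive into the target directory.
  unzip "$zip_file" -d "$out_dir" || return 1

  # 3. Rehydrate the extracted directory to retrieve the data.
  datasets rehydrate --directory "$out_dir/" || return 1
}

# Usage:
#   fetch_genome_package GCF_000001405.40 human_GRCh38_dataset.zip my_human_dataset
```

Each step stops the function on failure, so a failed download is never followed by an unzip or rehydrate attempt.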

1. Download

Download a dehydrated data package (< 5 KB) for the human GRCh38 RefSeq genome using the datasets CLI.

datasets download genome accession GCF_000001405.40 --dehydrated --filename human_GRCh38_dataset.zip

2. Unzip

Unzip the dehydrated zip archive to a directory, for example my_human_dataset:

unzip human_GRCh38_dataset.zip -d my_human_dataset

The output will look like this:

Archive:  human_GRCh38_dataset.zip
  inflating: my_human_dataset/README.md  
  inflating: my_human_dataset/ncbi_dataset/data/assembly_data_report.jsonl  
  inflating: my_human_dataset/ncbi_dataset/fetch.txt  
  inflating: my_human_dataset/ncbi_dataset/data/dataset_catalog.json  

3. Rehydrate

Run the rehydrate command to get the genome sequence:

datasets rehydrate --directory my_human_dataset/

A progress bar will indicate the number of files to be retrieved. When complete, the output looks like this:

Found 1 of 1 files for rehydration
Completed 1 of 1 [================================================] 100%
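
After rehydration, you may want to confirm that the sequence data arrived. The sketch below lists the sequence files under the package directory. It assumes that genome sequences are stored as FASTA files ending in .fna (or .fna.gz if compressed), and list_sequence_files is a hypothetical helper name.

```shell
# Sketch: list rehydrated sequence files in a package directory.
# Assumption: sequence files end in .fna or .fna.gz.
list_sequence_files() {
  local dir="$1"

  # Search the whole directory tree and sort for stable output.
  find "$dir" -type f \( -name '*.fna' -o -name '*.fna.gz' \) | sort
}

# Usage:
#   list_sequence_files my_human_dataset
```

An empty result suggests that rehydration did not complete or that the package contains no sequence files.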
Generated April 19, 2024