Questions and answers for common NCBI Datasets questions



How is version 14 of the Datasets command-line tools (CLI v14.x) different from CLI v13.x and previous versions?

  • Easier access to metadata
  • Smaller data packages (faster downloads)
  • Expanded content for virus genomes
  • Genome sequences are now delivered as a single file by default
  • Simpler command syntax - data files are now included using the --include flag

Easier access to metadata
All metadata can now be printed to the screen, redirected to a file, or piped to the dataformat command-line tool to generate a customized table. Previously, some metadata was only available as part of a downloaded data package. In addition, metadata formats have been standardized across services, and all metadata schemas are documented.

Smaller data packages
Data packages now include a smaller set of files by default, so downloads are faster and more reliable.

For example, the default genome data package includes only the genome sequence and the data report file. All other sequence and annotation files, as well as the sequence report file, can be included on request.

Expanded content for virus genomes
All genomes in NCBI Virus are now available through the datasets virus commands.

Genome sequences are now delivered as a single file by default
By default, all sequences of a genome now arrive in one file. To receive separate files by chromosome instead, use --chromosomes.

Simpler command syntax
We have simplified the way that specific data files and data reports (metadata) are requested. Data files can be specified using a single --include flag instead of multiple exclude flags. For example, to get the genome and protein sequences for the current human reference genome, try:
datasets download genome taxon human --reference --include seq,protein

Additional data reports are also optionally added to data packages using the --include flag.

Where is the data I requested?

Your data is in the subdirectory ncbi_dataset/data/ within the zip archive you downloaded.

I still can’t find my data. Can you help?

We have identified a bug affecting Mac Safari users. When downloading data from the NCBI Datasets web interface, you may see only a README file after the download has completed (while other files appear to be missing). As a workaround to prevent this issue from recurring, we recommend disabling automatic zip archive extraction in Safari until Apple releases a bug fix. For more information, visit: Mac Safari zip archive bug

What file formats can be downloaded using NCBI Datasets?

Datasets offers the following file formats (if available for the requested query):

  • Sequence files in FASTA format: genome, gene, transcript, and protein sequences
  • Annotation files: GTF, GFF3, and GBFF
  • Metadata files: JSON and JSON Lines

What is a data package?

A “data package” is an NCBI Datasets zip archive that contains sequence, annotation, metadata and other biological data. For more detailed information about the gene, genome and virus data packages, please visit: Data packages

How do I work with JSON Lines data reports?

Visit our JSON Lines data report documentation page
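
For a quick start, here is a minimal Python sketch of reading a JSON Lines report. Each non-empty line is one standalone JSON object, so the file can be parsed line by line. The function name read_jsonl and the sample records, including their field names and values, are hypothetical placeholders and not the documented report schema.

```python
import io
import json


def read_jsonl(lines):
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Hypothetical two-record report; real field names are defined in the schema docs.
SAMPLE = io.StringIO(
    '{"accession": "EXAMPLE_1", "organism": {"name": "example organism A"}}\n'
    '{"accession": "EXAMPLE_2", "organism": {"name": "example organism B"}}\n'
)

records = list(read_jsonl(SAMPLE))
for record in records:
    # Print a simple tab-separated view of two fields from each record.
    print(record["accession"], record["organism"]["name"], sep="\t")
```

To read a real report, open the .jsonl file from the data package and pass the file handle to read_jsonl.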

How can I access resources on the NCBI Datasets website programmatically?

We have three options for programmatic access. Click on each link for more information and installation options.

What is the difference between a GenBank (GCA) and RefSeq (GCF) genome assembly?

A GenBank (GCA) genome assembly contains assembled genome sequences submitted by investigators or sequencing centers to GenBank or another member of the International Nucleotide Sequence Database Collaboration (INSDC). The GenBank (GCA) assembly is an archival record that is owned by the submitter and may or may not include annotation. A RefSeq (GCF) genome assembly represents an NCBI-derived copy of a submitted GenBank (GCA) assembly. RefSeq (GCF) assembly records are maintained by NCBI. In some cases the RefSeq (GCF) assembly may not be completely identical to the GenBank (GCA) assembly due to assembly improvements made by NCBI staff. All RefSeq (GCF) genome assemblies include annotation.
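
In scripts, the two assembly types can be told apart by the accession prefix. The sketch below assumes the usual accession layout of a GCA or GCF prefix, an underscore, nine digits, and a version suffix. The function name assembly_source is hypothetical, and the accessions in the usage lines are made-up strings, not real records.

```python
import re

# Assumed layout: prefix (GCA or GCF), underscore, nine digits, dot, version.
_ACCESSION_RE = re.compile(r"^(GCA|GCF)_\d{9}\.\d+$")

_SOURCES = {"GCA": "GenBank", "GCF": "RefSeq"}


def assembly_source(accession):
    """Return 'GenBank' or 'RefSeq' for an assembly accession, else None."""
    match = _ACCESSION_RE.match(accession.strip())
    return _SOURCES[match.group(1)] if match else None


print(assembly_source("GCA_000000000.1"))
print(assembly_source("GCF_000000000.1"))
print(assembly_source("not-an-accession"))
```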

GCA vs GCF table

Generated September 30, 2022