Get genome metadata

Get genome metadata by accession, BioProject, or taxonomic name

Get genome metadata from NCBI Datasets using the website, the command-line tool, or the Python and R client libraries.

Using a taxonomic name

Get genome metadata for all assemblies for an organism and its subspecies using the organism name or NCBI Taxonomy ID.

  1. Start at the NCBI Datasets Genome page
  2. Click the name Homo sapiens in the list of popular species or type homo sapiens in the Taxonomic Name search box and click the species name
  3. Click Select Columns to specify what metadata is shown in the table

Run the following command to get metadata in JSON format:

datasets summary genome taxon human

Use quotes for taxon names that include spaces, such as mus musculus:

datasets summary genome taxon 'mus musculus'
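When the command is assembled by a script rather than typed by hand, the quoting rule above can be left to the Python standard library. The following is a minimal sketch, assuming Python 3.8 or later; `build_taxon_summary_command` is a hypothetical helper, not part of NCBI Datasets, and it only builds the command shown above without running it.

```python
import shlex
from typing import List


def build_taxon_summary_command(taxon: str) -> List[str]:
    """Return the argument list for `datasets summary genome taxon <taxon>`.

    Hypothetical helper for illustration; it does not run the command.
    """
    return ["datasets", "summary", "genome", "taxon", taxon]


# shlex.join() renders an argument list as a shell-safe string and adds
# quotes only where they are needed, such as around names with spaces.
print(shlex.join(build_taxon_summary_command("human")))
print(shlex.join(build_taxon_summary_command("mus musculus")))
```

If the argument list is passed directly to `subprocess.run`, no quoting is needed at all, because each list element reaches the program as a single argument.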

For more information, see the Datasets Python API reference documentation.

Use the get_assembly_metadata_by_taxon method from ncbi-datasets-pylib to get all genome metadata for a single taxon.

from ncbi.datasets.metadata.genome import print_assembly_metadata_by_fields
from ncbi.datasets.metadata.genome import get_assembly_metadata_by_taxon

taxon_name = "human"


# Retrieve and print genomic metadata for assemblies belonging to the specified taxon
for assembly in get_assembly_metadata_by_taxon(taxon_name):
    print_assembly_metadata_by_fields(assembly, ["assembly_accession", "assembly_level", "seq_length"])

For more information, see the Datasets R API reference documentation.

api.genome_instance <- GenomeApi$new()
result_genome <- api.genome_instance$AssemblyDescriptorsByTaxon('human')
prettify(result_genome$toJSONString())

Using BioProject accession

Get genome metadata for genome assemblies belonging to an NCBI BioProject, for example, the Sanger 25 Genomes Project, PRJEB33226.

  1. Start at the NCBI Datasets Homepage
  2. Enter a BioProject accession, for example PRJEB33226, into the search box at the top of the page
  3. Click Search
  4. In the BioProject box, click browse a table of Genomes for this project
  5. Click Select Columns to specify what metadata is shown in the table

Run the following command to get metadata in JSON format:

datasets summary genome accession PRJEB33226

For more information, see the Datasets Python API reference documentation.

Use the get_assembly_metadata_by_bioproject_accessions method from ncbi-datasets-pylib to get genome metadata for all genomes associated with the provided BioProject accessions.

from typing import List

from ncbi.datasets.metadata.genome import print_assembly_metadata_by_fields
from ncbi.datasets.metadata.genome import get_assembly_metadata_by_bioproject_accessions

bioprojects: List[str] = ["PRJEB33226"]


# Retrieve and print genome metadata for a list of BioProject accessions
for assembly in get_assembly_metadata_by_bioproject_accessions(bioprojects):
    print_assembly_metadata_by_fields(assembly, ["assembly_accession", "assembly_level", "seq_length"])

For more information, see the Datasets R API reference documentation.

api.genome_instance <- GenomeApi$new()
result_genome <- api.genome_instance$AssemblyDescriptorsByBioproject('PRJEB33226')
prettify(result_genome$toJSONString())

Using an Assembly accession

Get metadata using an NCBI Assembly accession, for example for the human reference assembly, GRCh38.

  1. Visit the NCBI Assembly page
  2. Paste the Assembly Accession GCF_000001405.39 into the search box at the top of the page
  3. Click Search
  4. Find the desired assembly in the search results and click the assembly name underlined in blue to go to the Assembly record
  5. In the column on the right side of the page, under Access the data, click NCBI Datasets to go to the NCBI Datasets Genomes Page
  6. In the NCBI Datasets Genomes Page, click Select Columns to specify what metadata is shown in the table

Run the following command to get metadata in JSON format:

datasets summary genome accession GCF_000001405.39

For more information, see the Datasets Python API reference documentation.

Use the get_assembly_metadata_by_asm_accessions method from ncbi-datasets-pylib to get genome metadata for all genomes with the provided NCBI Assembly accessions.

from typing import List

from ncbi.datasets.metadata.genome import print_assembly_metadata_by_fields
from ncbi.datasets.metadata.genome import get_assembly_metadata_by_asm_accessions

genome_assembly_accessions: List[str] = ["GCF_000001405.39"]


# Retrieve and print genome metadata for a list of assembly accessions
for assembly in get_assembly_metadata_by_asm_accessions(genome_assembly_accessions):
    print_assembly_metadata_by_fields(assembly, ["assembly_accession", "assembly_level", "seq_length"])

For more information, see the Datasets R API reference documentation.

api.genome_instance <- GenomeApi$new()
result_genome <- api.genome_instance$AssemblyDescriptorsByAccessions('GCF_000001405.39')
prettify(result_genome$toJSONString())
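
The BioProject and Assembly examples above pass different kinds of accession to the same datasets summary genome accession command, while the Python library uses a separate function for each kind. A script that accepts mixed input has to tell them apart first. The sketch below shows one way to do that, assuming the usual accession shapes: PRJ followed by two letters and digits for a BioProject, and GCA_ or GCF_ followed by nine digits and an optional version for an Assembly. `classify_accession` is a hypothetical helper, not part of ncbi-datasets-pylib.

```python
import re

# Assumed accession shapes, e.g. PRJEB33226 and GCF_000001405.39.
BIOPROJECT_RE = re.compile(r"PRJ[A-Z]{2}\d+")
ASSEMBLY_RE = re.compile(r"GC[AF]_\d{9}(\.\d+)?")


def classify_accession(accession: str) -> str:
    """Return "bioproject" or "assembly" for a recognized accession.

    Raises ValueError for anything that matches neither pattern.
    """
    accession = accession.strip()
    if BIOPROJECT_RE.fullmatch(accession):
        return "bioproject"
    if ASSEMBLY_RE.fullmatch(accession):
        return "assembly"
    raise ValueError(f"Unrecognized accession: {accession!r}")


for acc in ["PRJEB33226", "GCF_000001405.39"]:
    print(acc, classify_accession(acc))
```

Accessions classified as "bioproject" could then be passed to get_assembly_metadata_by_bioproject_accessions and those classified as "assembly" to get_assembly_metadata_by_asm_accessions.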

Filtering by genome assembly properties

When getting genome metadata by taxon, Assembly accession, or BioProject accession, you can filter the results by genome assembly properties, including the following:

  • reference status
  • annotation status
  • assembly level
  • year released
  • infraspecies name
  • assembly name
  • submitter name

  1. Start at the NCBI Datasets Genome page
  2. Click the name Homo sapiens in the list of popular species or type homo sapiens in the Taxonomic Name search box and click the species name
  3. Expand the Filters box
  4. To filter by reference status, annotation status, assembly level or year released, use the appropriate slider or switch
  5. To filter by infraspecies name, assembly name or submitter name, enter the term into the Text Filter box
  6. Click Select Columns to specify what metadata is shown in the table

Get metadata for the human reference genome:

datasets summary genome taxon human --reference

Get metadata for annotated human genomes:

datasets summary genome taxon human --annotated

Get metadata for human genomes with the Assembly level of "complete genome" (all chromosomes are gapless):

datasets summary genome taxon human --assembly-level complete_genome

Get metadata for human genomes released after January 1, 2020:

datasets summary genome taxon human --released-since 01/01/2020

Get metadata for human genomes submitted by the T2T Consortium:

datasets summary genome taxon human --search 'T2T Consortium'
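
The filter flags above can also be assembled programmatically. The following is a minimal sketch, assuming Python 3.8 or later; `build_filtered_summary_command` is a hypothetical helper that uses only the flags shown in the commands above. It makes two assumptions that the examples do not confirm: that the flags can be combined in one command, and that --released-since expects MM/DD/YYYY (the example date 01/01/2020 reads the same in either order).

```python
import shlex
from datetime import date
from typing import List, Optional


def build_filtered_summary_command(
    taxon: str,
    reference: bool = False,
    annotated: bool = False,
    assembly_level: Optional[str] = None,
    released_since: Optional[date] = None,
    search: Optional[str] = None,
) -> List[str]:
    """Build (but do not run) a `datasets summary genome taxon` command."""
    args = ["datasets", "summary", "genome", "taxon", taxon]
    if reference:
        args.append("--reference")
    if annotated:
        args.append("--annotated")
    if assembly_level is not None:
        args += ["--assembly-level", assembly_level]
    if released_since is not None:
        # Assumption: the CLI expects MM/DD/YYYY.
        args += ["--released-since", released_since.strftime("%m/%d/%Y")]
    if search is not None:
        args += ["--search", search]
    return args


# Render two of the commands shown above as shell-safe strings.
print(shlex.join(build_filtered_summary_command("human", search="T2T Consortium")))
print(shlex.join(build_filtered_summary_command("human", released_since=date(2020, 1, 1))))
```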

For more information, see the Datasets Python API reference documentation.

All of the genome metadata retrieval functions support filtering. The examples below use the get_assembly_metadata_by_taxon method from ncbi-datasets-pylib to get genome metadata for all genomes that match the selected taxon and filter criteria.

from ncbi.datasets.metadata.genome import print_assembly_metadata_by_fields
from ncbi.datasets.metadata.genome import get_assembly_metadata_by_taxon

taxon_name = "human"


print(f"Reference assemblies for {taxon_name}:")
for assembly in get_assembly_metadata_by_taxon(taxon_name, filters_reference_only=True):
    print_assembly_metadata_by_fields(assembly, ["assembly_accession", "assembly_level", "seq_length"])

print(f"\nAnnotated assemblies for {taxon_name}:")
for assembly in get_assembly_metadata_by_taxon(taxon_name, filters_has_annotation=True):
    print_assembly_metadata_by_fields(assembly, ["assembly_accession", "assembly_level", "seq_length"])

# valid assembly levels are: ['chromosome', 'scaffold', 'contig', 'complete_genome']
print(f"\n{taxon_name} assemblies with complete (all chromosomes are gapless) genomes:")
for assembly in get_assembly_metadata_by_taxon(taxon_name, filters_assembly_level=["complete_genome"]):
    print_assembly_metadata_by_fields(assembly, ["assembly_accession", "assembly_level", "seq_length"])

print(f"\nAssemblies for {taxon_name} released in 2017:")
for assembly in get_assembly_metadata_by_taxon(
    taxon_name, filters_first_release_date="01/01/2017", filters_last_release_date="12/31/2017"
):
    print_assembly_metadata_by_fields(assembly, ["assembly_accession", "assembly_level", "seq_length"])

# filters_search_text includes the species and infraspecies, assembly name and submitter fields
print(f'\n{taxon_name} assemblies including text "T2T Consortium"')
for assembly in get_assembly_metadata_by_taxon(taxon_name, filters_search_text=["T2T Consortium"]):
    print_assembly_metadata_by_fields(assembly, ["assembly_accession", "submitter", "assembly_level", "seq_length"])
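
The release-year example above writes both date bounds by hand. A small helper can derive them from the year instead. This is a minimal sketch: `release_year_filters` is a hypothetical function, and it assumes the MM/DD/YYYY format used by the filters_first_release_date and filters_last_release_date values in the example.

```python
from datetime import date
from typing import Dict


def release_year_filters(year: int) -> Dict[str, str]:
    """Return first and last release date filters covering one calendar year."""
    first_day = date(year, 1, 1)
    last_day = date(year, 12, 31)
    return {
        "filters_first_release_date": first_day.strftime("%m/%d/%Y"),
        "filters_last_release_date": last_day.strftime("%m/%d/%Y"),
    }


print(release_year_filters(2017))
```

The returned dictionary can be unpacked into the call, for example get_assembly_metadata_by_taxon(taxon_name, **release_year_filters(2017)).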

Generated October 22, 2021