jq cheatsheet for genome metadata

jq cheatsheet for parsing genome metadata from the datasets CLI summary command

Try out jq commands on the web: https://jqplay.org/
The examples below were run using the datasets CLI v12.13.2 on September 24, 2021.

Download jq

https://stedolan.github.io/jq/

First, generate a JSON file with metadata for all cow genomes

datasets summary genome taxon cow > cow_genomes.json

Pretty-print the data (and only show the first 10 lines)

Note that the data is hierarchically structured: the busco information is nested within annotation_metadata, and annotation_metadata is nested within the assembly object

jq . cow_genomes.json | head
{
  "assemblies": [
    {
      "assembly": {
        "annotation_metadata": {
          "busco": {
            "busco_lineage": "cetartiodactyla_odb10",
            "busco_ver": "4.0.2 ",
            "complete": 0.98672664,
            "duplicated": 0.005024372,

Show the assembly count

jq '.total_count' cow_genomes.json
7
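
As a cross-check, the number of entries in the assemblies array can be counted with jq's length and compared with total_count. A minimal sketch on a tiny stand-in file: sample.json and the accessions in it are made up for illustration, and only the field names come from the output above.

```shell
# sample.json is a hypothetical two-assembly stand-in for cow_genomes.json
cat > sample.json <<'EOF'
{
  "assemblies": [
    {"assembly": {"assembly_accession": "GCF_000000001.1"}},
    {"assembly": {"assembly_accession": "GCA_000000002.1"}}
  ],
  "total_count": 2
}
EOF

# Count the elements of the assemblies array
jq '.assemblies | length' sample.json

# Compare with the count reported in the file itself
jq '.total_count' sample.json
```

Both commands print 2 for this stand-in file. Whether the two numbers always agree for real datasets output is not confirmed here.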

Only show data for the first assembly in a set of multiple assemblies (and only show the first 10 lines)

Note that assemblies[0] is used to specify the first assembly in the set, assemblies[1] refers to the second assembly, etc.

jq '.assemblies[0]' cow_genomes.json | head
{
  "assembly": {
    "annotation_metadata": {
      "busco": {
        "busco_lineage": "cetartiodactyla_odb10",
        "busco_ver": "4.0.2 ",
        "complete": 0.98672664,
        "duplicated": 0.005024372,
        "fragmented": 0.0045744283,
        "missing": 0.008698912,

Show the BUSCO data for the first assembly in a set

jq '.assemblies[0].assembly.annotation_metadata.busco' cow_genomes.json
{
  "busco_lineage": "cetartiodactyla_odb10",
  "busco_ver": "4.0.2 ",
  "complete": 0.98672664,
  "duplicated": 0.005024372,
  "fragmented": 0.0045744283,
  "missing": 0.008698912,
  "single_copy": 0.98170227,
  "total_count": "13335"
}

Show the gene counts for the first assembly in a set

jq '.assemblies[0].assembly.annotation_metadata.stats.gene_counts' cow_genomes.json
{
  "protein_coding": 21039,
  "total": 35143
}
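
The same path can be applied to every assembly by replacing assemblies[0] with assemblies[]. The sketch below assumes that some assemblies have no annotation_metadata (not confirmed by the output above) and uses jq's alternative operator // to print NA for them. sample.json and all values in it are hypothetical.

```shell
# sample.json is a hypothetical stand-in: one annotated and one unannotated assembly
cat > sample.json <<'EOF'
{
  "assemblies": [
    {"assembly": {
      "assembly_accession": "GCF_000000001.1",
      "annotation_metadata": {"stats": {"gene_counts": {"protein_coding": 20000, "total": 35000}}}
    }},
    {"assembly": {"assembly_accession": "GCA_000000002.1"}}
  ]
}
EOF

# One row per assembly: accession and total gene count.
# A missing key yields null, and `// "NA"` replaces null with the string NA.
jq -r '.assemblies[].assembly
       | [.assembly_accession, (.annotation_metadata.stats.gene_counts.total // "NA")]
       | @tsv' sample.json > gene_counts.tsv

cat gene_counts.tsv
```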

Show the assembly accession, submitter, and submission date for the first assembly in a set, and format the output as a new JSON object with custom key names

jq '.assemblies[0].assembly | {accession: .assembly_accession, submitter: .submitter, date: .submission_date}' cow_genomes.json
{
  "accession": "GCF_002263795.1",
  "submitter": "USDA ARS",
  "date": "2018-04-11"
}

Generate a three-column table of assembly accession, submission date, and submitter

jq -r '.assemblies[].assembly | [.assembly_accession, .submission_date, .submitter] | @tsv' cow_genomes.json
GCF_002263795.1	2018-04-11	USDA ARS
GCF_000003055.6	2014-11-25	Center for Bioinformatics and Computational Biology, University of Maryland
GCF_000003205.5	2011-11-02	Cattle Genome Sequencing International Consortium
GCF_000003205.7	2015-11-19	Cattle Genome Sequencing International Consortium
GCA_000003055.5	2014-11-25	Center for Bioinformatics and Computational Biology, University of Maryland
GCA_000003205.6	2015-11-19	Cattle Genome Sequencing International Consortium
GCA_002263795.2	2018-04-11	USDA ARS
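
A header row can be added by emitting an extra array before the data rows. This is a sketch on a hypothetical sample.json: the accessions, dates, submitters, and the header labels are made up for illustration.

```shell
# sample.json is a hypothetical two-assembly stand-in for cow_genomes.json
cat > sample.json <<'EOF'
{
  "assemblies": [
    {"assembly": {"assembly_accession": "GCF_000000001.1", "submission_date": "2020-01-01", "submitter": "Example Lab"}},
    {"assembly": {"assembly_accession": "GCA_000000002.1", "submission_date": "2019-06-15", "submitter": "Another Lab"}}
  ]
}
EOF

# The comma operator emits the header array first, then one array per assembly.
# Each array is then formatted as a tab-separated line by @tsv.
jq -r '["accession", "submission_date", "submitter"],
       (.assemblies[].assembly | [.assembly_accession, .submission_date, .submitter])
       | @tsv' sample.json > assemblies.tsv

cat assemblies.tsv
```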

Show the assembly accession and the chromosome count for the first assembly in a set

Note that jq's length function is used to count the chromosomes. The chromosome count includes every assembled chromosome, plus 1 for the set of unplaced scaffolds and 1 for each organelle genome. In this example: 29 autosomes + 1 X chromosome + 1 set of unplaced scaffolds + 1 mitochondrial genome = 32

jq '.assemblies[0].assembly | {accession: .assembly_accession, chromosome_count: [.chromosomes[]] | length}' cow_genomes.json
{
  "accession": "GCF_002263795.1",
  "chromosome_count": 32
}
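
The chromosome count can also be listed for every assembly at once. The sketch below uses a hypothetical sample.json: the accessions are made up, and the "name" key inside each chromosomes entry is an assumption, since only the array length matters here. It calls length directly on .chromosomes, which gives the same count as [.chromosomes[]] | length when chromosomes is an array.

```shell
# sample.json is a hypothetical stand-in; the "name" key is an assumed placeholder
cat > sample.json <<'EOF'
{
  "assemblies": [
    {"assembly": {"assembly_accession": "GCF_000000001.1",
                  "chromosomes": [{"name": "1"}, {"name": "X"}, {"name": "MT"}]}},
    {"assembly": {"assembly_accession": "GCA_000000002.1",
                  "chromosomes": [{"name": "1"}]}}
  ]
}
EOF

# One row per assembly: accession and number of entries in the chromosomes array
jq -r '.assemblies[].assembly
       | [.assembly_accession, (.chromosomes | length)]
       | @tsv' sample.json > chromosome_counts.tsv

cat chromosome_counts.tsv
```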
Generated December 7, 2021