Prokaryote gene report

Prokaryote gene record identifiers, protein info, and taxonomic scope

The downloaded prokaryote package contains a prokaryote gene data report in JSON Lines format in the file:

ncbi_dataset/data/data_report.jsonl

Each line of the prokaryote gene data report file is a hierarchical JSON object that represents a single prokaryote gene record. The schema of the record is defined in the tables below, where each row describes either a single field in the report or a sub-structure (a collection of fields). The outermost structure of the report is ProkaryoteGene.
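
As a minimal sketch of reading the report, the Python below streams the file one line at a time and parses each line as one record. The function name `iter_records` is hypothetical, and the path assumes the script runs from the directory where the package was unzipped.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed gene record per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)


if __name__ == "__main__":
    # Assumed location: the package was unzipped into the current directory.
    report = Path("ncbi_dataset/data/data_report.jsonl")
    if report.exists():
        for record in iter_records(str(report)):
            print(record.get("accession"), record.get("geneSymbol"))
```

Streaming line by line keeps memory use flat even for large reports, because only one record is held at a time.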

Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option. Refer to the dataformat CLI tool reference to see how you can use this tool to transform prokaryote gene data reports from JSON Lines to tabular formats.
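
To illustrate how mnemonics relate to JSON fields, here is a rough Python sketch that flattens records into tab-separated text. It is not the dataformat tool and does not reproduce its output format: the mapping `FIELD_PATHS` and the helpers `lookup` and `to_tsv` are hypothetical, the header row uses mnemonics rather than table column names, and joining the `name-evidence-` prefix to a sub-field mnemonic is an assumption about how nested mnemonics compose.

```python
import csv
import io
import json

# Mnemonic -> path of JSON keys, taken from the ProkaryoteGene tables in this
# document. The "name-evidence-<sub-field>" names are an assumption.
FIELD_PATHS = {
    "accession": ("accession",),
    "gene-symbol": ("geneSymbol",),
    "protein-name": ("proteinName",),
    "protein-length": ("proteinLength",),
    "mapping-count": ("numberOfGenomeMappings",),
    "name-evidence-accession": ("proteinNameEvidence", "accession"),
    "name-evidence-category": ("proteinNameEvidence", "category"),
    "name-evidence-source": ("proteinNameEvidence", "source"),
    "description": ("description",),
    "ec-number": ("ecNumber",),
}


def lookup(record, path):
    """Follow a key path into a record; return "" when any key is missing."""
    value = record
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return ""
        value = value[key]
    if isinstance(value, list):  # repeated fields become comma-joined text
        return ",".join(str(v) for v in value)
    return str(value)


def to_tsv(jsonl_text, fields):
    """Render JSON Lines text as TSV with one column per requested mnemonic."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    for line in jsonl_text.splitlines():
        if line.strip():
            record = json.loads(line)
            writer.writerow([lookup(record, FIELD_PATHS[f]) for f in fields])
    return out.getvalue()
```

Calling `to_tsv(text, ["accession", "gene-symbol"])` on the report text would give a two-column table; fields missing from a record come out as empty cells.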

Sample report

{
  "accession": "WP_001435165.1",
  "geneSymbol": "merC",
  "numberOfGenomeMappings": 8,
  "proteinLength": 137,
  "proteinName": "organomercurial transporter MerC",
  "proteinNameEvidence": {
    "accession": "NF010318.0",
    "category": "HMM",
    "source": "NCBI Protein Cluster (PRK)"
  },
  "taxonomyScope": {
    "organismName": "Pseudomonas",
    "taxId": 286
  }
}

ProkaryoteGene Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| accession | accession | Accession | string | The RefSeq WP_ prefixed accession for the protein sequence. | WP_000443665.1 |
| geneSymbol | gene-symbol | Gene Symbol | string | The gene symbol | ligA |
| proteinName | protein-name | Protein Name | string | The protein name | NAD-dependent DNA ligase LigA |
| proteinLength | protein-length | Protein Length | uint32 | Length of the protein | 671 |
| taxonomyScope | | | Organism | | |
| numberOfGenomeMappings | mapping-count | Number of Genome Mappings | uint32 | The number of nucleotide mappings | 7642 |
| proteinNameEvidence | name-evidence- | Protein Name Evidence | ProkaryoteGene.ProteinNameEvidence | | |
| description | description | Description | string | Description | Catalyzes the formation of a phosphodiester at the site of a single-strand break in duplex DNA |
| ecNumber (repeated) | ec-number | EC Number | string | EC Number | 6.5.1.2 |

ProkaryoteGene.ProteinNameEvidence Structure

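| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| accession | accession | Accession | string | Accession | NF005932.1 |
| category | category | Category | string | Category | HMM |
| source | source | Source | string | Source | NCBI Protein Cluster (PRK) |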
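
One way to work with these structures in code is to mirror them as typed classes. The sketch below is an illustration only: the class layout, snake_case attribute names, and the `parse_line` helper are assumptions, every field is treated as optional because a record may omit fields, and `taxonomyScope` is kept as a plain dict for brevity.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProteinNameEvidence:
    accession: Optional[str] = None
    category: Optional[str] = None
    source: Optional[str] = None


@dataclass
class ProkaryoteGene:
    accession: Optional[str] = None
    gene_symbol: Optional[str] = None
    protein_name: Optional[str] = None
    protein_length: Optional[int] = None
    number_of_genome_mappings: Optional[int] = None
    taxonomy_scope: Optional[dict] = None  # Organism structure, left untyped here
    protein_name_evidence: Optional[ProteinNameEvidence] = None
    description: Optional[str] = None
    ec_number: List[str] = field(default_factory=list)  # repeated field

    @classmethod
    def from_dict(cls, data: dict) -> "ProkaryoteGene":
        """Build a ProkaryoteGene from one parsed JSON record."""
        evidence = data.get("proteinNameEvidence")
        return cls(
            accession=data.get("accession"),
            gene_symbol=data.get("geneSymbol"),
            protein_name=data.get("proteinName"),
            protein_length=data.get("proteinLength"),
            number_of_genome_mappings=data.get("numberOfGenomeMappings"),
            taxonomy_scope=data.get("taxonomyScope"),
            protein_name_evidence=(
                ProteinNameEvidence(
                    accession=evidence.get("accession"),
                    category=evidence.get("category"),
                    source=evidence.get("source"),
                )
                if evidence
                else None
            ),
            description=data.get("description"),
            ec_number=list(data.get("ecNumber", [])),
        )


def parse_line(line: str) -> ProkaryoteGene:
    """Parse one line of the JSON Lines report into a ProkaryoteGene."""
    return ProkaryoteGene.from_dict(json.loads(line))
```

Applied to the sample report shown earlier, `parse_line` would yield an object whose `protein_name_evidence.category` is "HMM" and whose `ec_number` is an empty list, since that record has no EC number.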

Organism Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| taxId | tax-id | Taxonomic ID | uint32 | NCBI Taxonomy identifier | 9606; 2697049 |
| organismName | organism-name | Organism Name | string | Scientific name | Homo sapiens; Severe acute respiratory syndrome coronavirus 2 |
| commonName | common-name | Common Name | string | Common name | human; pangolin; MERS; SARS2 |
| lineage (repeated) | | | LineageOrganism | Lineage ordered from superkingdom level to increasingly more specific taxonomic entries | |
| strain | strain | Strain | string | | SE11 |
| pangolinClassification | pangolin | Pangolin Classification | string | | B.1.1.7 |

BioProject Structure

A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium. A BioProject record gives users a single place to find links to the diverse data types generated for that project. The record can be retrieved from NCBI BioProject.

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| accession | accession | Accession | string | BioProject accession | PRJEB35387 |
| title | title | Title | string | Title of the BioProject provided by the submitter | Sciurus carolinensis (grey squirrel) genome assembly, mSciCar1 |
| parentAccessions (repeated) | parent-accessions | Parent Accessions | string | BioProject accession containing multiple children BioProjects | ["PRJNA489243","PRJEB33226","PRJEB40665"] |

BioProjectLineage Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| bioprojects (repeated) | lineage- | Lineage | BioProject | A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium | |

LineageOrganism Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| taxId | coming soon | coming soon | uint32 | NCBI Taxonomy identifier | 11118 |
| name | coming soon | coming soon | string | Scientific name | Coronaviridae |

Scalar Value Types

| Protocol buffers type | Notes | C++ | Python | Java | Go |
| --- | --- | --- | --- | --- | --- |
| double | | double | float | double | float64 |
| float | | float | float | float | float32 |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
| uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
| uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
| sfixed32 | Always four bytes. | int32 | int | int | int32 |
| sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
| bool | | bool | boolean | boolean | bool |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
| bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |
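
The numeric fields in this report (proteinLength, numberOfGenomeMappings, taxId) are typed uint32, which Python represents as a plain int with no upper bound. As a small sketch, a consumer that wants to sanity-check parsed values against that type could use a range test like the one below; the helper name `is_uint32` is hypothetical.

```python
UINT32_MAX = 2**32 - 1  # largest value representable as an unsigned 32-bit integer


def is_uint32(value) -> bool:
    """Return True if value is an int within the uint32 range."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return 0 <= value <= UINT32_MAX
```
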
Generated October 22, 2021