Genome sequence report

Genome assembly sequence accessions, chromosome, and length

The downloaded genome package contains a genome sequence data report in JSON Lines format in the file:

ncbi_dataset/data/<assembly>/sequence_report.jsonl

Each line of the genome assembly sequence data report file is a JSON object that represents a single genome assembly sequence record. The schema of this record is defined in the SequenceInfo table below, where each row describes a single field in the report.
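
Because each line is an independent JSON object, the report can be read one record at a time. The following is a minimal Python sketch using only the standard library; the function name `iter_sequence_records` is illustrative and is not part of any NCBI tool.

```python
import json
from typing import Iterable, Iterator


def iter_sequence_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines report."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, such as a trailing newline
        yield json.loads(line)
```

Pass it any iterable of lines, for example a file handle opened on the sequence_report.jsonl file for your assembly with `open(path, encoding="utf-8")`.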

Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option. Refer to the dataformat CLI tool reference to see how you can use this tool to transform assembly sequence data reports from JSON Lines to tabular formats.

Sample report

{
  "assemblyUnit": "GCF_000001305.15",
  "assignedMoleculeLocationType": "Chromosome",
  "chrName": "1",
  "gcCount": "103993629",
  "genbankAccession": "CM000663.2",
  "length": 248956422,
  "refseqAccession": "NC_000001.11",
  "sortOrder": 1,
  "ucscStyleName": "chr1"
}
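
Note that in the sample record `gcCount` is a quoted string while `length` is a bare number. The sketch below, which assumes this string encoding holds for every `gcCount` value, converts the count to an integer and derives a GC fraction for the sample record. The helper name `gc_fraction` is hypothetical.

```python
import json

# The sample record from this page, reduced to the fields used here.
SAMPLE_RECORD = """
{
  "chrName": "1",
  "gcCount": "103993629",
  "length": 248956422,
  "ucscStyleName": "chr1"
}
"""


def gc_fraction(record: dict) -> float:
    """Return gcCount / length for one sequence record."""
    gc_count = int(record["gcCount"])  # arrives as a string in the JSON
    return gc_count / record["length"]


record = json.loads(SAMPLE_RECORD)
fraction = gc_fraction(record)
print(f"{record['ucscStyleName']}: GC fraction {fraction:.4f}")
```

Calling `int()` also works if the value is already a number, so the helper does not depend on which encoding a given record uses.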

SequenceInfo Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| chrName | chr-name | Chromosome name | string | The name of the associated chromosome. The name “Un” indicates that the chromosome is unknown. | 21, MT, Un |
| ucscStyleName | ucsc-style-name | UCSC style name | string | Name ascribed to this sequence by the UC Santa Cruz genome browser | chr21, chrM, Un |
| sortOrder | ordering | Ordering | uint32 | A sort order value assigned to the sequence | 1, 25 |
| assignedMoleculeLocationType | mol-type | Molecule type | string | The type of molecule represented by the sequence | Chromosome, Mitochondrion |
| refseqAccession | refseq-seq-acc | RefSeq seq accession | string | The RefSeq accession of the sequence | NC_000021.9 |
| assemblyUnit | assm-unit-acc | Assembly-unit accession | string | The NCBI Assembly accession of the associated assembly unit. Assembly units can include the primary assembly and non-nuclear assembly units | GCF_000001305.15 |
| length | seq-length | Seq length | uint32 | The length of the sequence in nucleotides | 46709983 |
| genbankAccession | genbank-seq-acc | GenBank seq accession | string | The GenBank accession of the sequence | CM000683.2 |
| gcCount | gc-count | GC Count | uint64 | The number of GC base-pairs in the chromosome | |
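
To illustrate how the Field and Table Column Name columns of the SequenceInfo table relate, here is a rough Python sketch that writes selected fields as tab-separated text. It is not the dataformat tool and its output may differ from what that tool produces; the function name `records_to_tsv` and the choice of fields are assumptions made for this example.

```python
import csv
import io

# Field -> Table Column Name, copied from the SequenceInfo table.
COLUMN_NAMES = {
    "chrName": "Chromosome name",
    "ucscStyleName": "UCSC style name",
    "refseqAccession": "RefSeq seq accession",
    "genbankAccession": "GenBank seq accession",
    "length": "Seq length",
    "gcCount": "GC Count",
}


def records_to_tsv(records, fields=tuple(COLUMN_NAMES)):
    """Render records as TSV text with one header row."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([COLUMN_NAMES[field] for field in fields])
    for record in records:
        # Fields missing from a record become empty cells.
        writer.writerow([record.get(field, "") for field in fields])
    return out.getvalue()
```

For example, `records_to_tsv(records, fields=("chrName", "length"))` yields a two-column table headed "Chromosome name" and "Seq length".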

BioProject Structure

A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium. A BioProject record provides users a single place to find links to the diverse data types generated for that project. The record can be retrieved from NCBI BioProject.

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| accession | accession | Accession | string | BioProject accession | PRJEB35387 |
| title | title | Title | string | Title of the BioProject provided by the submitter | Sciurus carolinensis (grey squirrel) genome assembly, mSciCar1 |
| parentAccessions (repeated) | parent-accessions | Parent Accessions | string | BioProject accession containing multiple children BioProjects | ["PRJNA489243","PRJEB33226","PRJEB40665"] |

BioProjectLineage Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| bioprojects (repeated) | lineage- | Lineage | BioProject | A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium | |
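
A lineage is a repeated list of BioProject records, each of which may name several parents. The Python sketch below groups child accessions under each parent. It assumes the JSON keys match the Field names in the two tables above (`bioprojects`, `accession`, `parentAccessions`); the accessions and titles are placeholders invented for illustration, not real records.

```python
from collections import defaultdict

# Hypothetical lineage: placeholder accessions, not real BioProjects.
lineage = {
    "bioprojects": [
        {
            "accession": "PRJ_EXAMPLE_CHILD",
            "title": "Placeholder child project",
            "parentAccessions": ["PRJ_EXAMPLE_PARENT"],
        },
        {
            "accession": "PRJ_EXAMPLE_PARENT",
            "title": "Placeholder umbrella project",
            "parentAccessions": ["PRJ_EXAMPLE_ROOT"],
        },
        {"accession": "PRJ_EXAMPLE_ROOT", "title": "Placeholder root project"},
    ]
}


def children_by_parent(lineage: dict) -> dict:
    """Map each parent accession to the accessions that list it as a parent."""
    children = defaultdict(list)
    for project in lineage.get("bioprojects", []):
        # parentAccessions is repeated, so it may be absent or hold many values.
        for parent in project.get("parentAccessions", []):
            children[parent].append(project["accession"])
    return dict(children)
```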

Scalar Value Types

| Protocol buffers type | Notes | C++ | Python | Java | Go |
| --- | --- | --- | --- | --- | --- |
| double | | double | float | double | float64 |
| float | | float | float | float | float32 |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
| uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
| uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
| sfixed32 | Always four bytes. | int32 | int | int | int32 |
| sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
| bool | | bool | boolean | boolean | bool |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
| bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |
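
Python integers have no fixed width, so a value read from a report is not automatically constrained to the uint32 or uint64 range declared for its field. The sketch below shows one way to coerce and range-check such values; the helper `to_uint` is hypothetical, and it accepts either a number or a numeric string because both encodings appear in the sample report.

```python
UINT32_MAX = 2**32 - 1
UINT64_MAX = 2**64 - 1


def to_uint(value, bits: int) -> int:
    """Convert a JSON number or numeric string to an int within the uint range."""
    if bits not in (32, 64):
        raise ValueError("bits must be 32 or 64")
    number = int(value)
    limit = UINT32_MAX if bits == 32 else UINT64_MAX
    if not 0 <= number <= limit:
        raise ValueError(f"{number} does not fit in uint{bits}")
    return number
```

For the sample record, `to_uint(record["length"], 32)` and `to_uint(record["gcCount"], 64)` would both return plain Python integers.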
Generated October 22, 2021