Biological Sequences

Introduction
Bioseq: the Biological Sequence
Seq-id: Identifying the Bioseq
Seq-annot: Annotating the Bioseq
Seq-descr: Describing the Bioseq and Placing It In Context
Seq-inst: Instantiating the Bioseq
Seq-hist: History of a Seq-inst
Seq-data: Encoding the Sequence Data Itself
Tables of Sequence Codes
Mapping Between Different Sequence Alphabets
Data and Tools for Sequence Alphabets
Pubdesc: Publication Describing a Bioseq
Numbering: Applying a Numbering System to a Bioseq
ASN.1 Specification: seq.asn
ASN.1 Specification: seqblock.asn
ASN.1 Specification: seqcode.asn
C Structures and Functions: objseq.h
C Structures and Functions: objpubd.h
C Structures and Functions: objblock.h
C Structures and Functions: objcode.h

Introduction

A biological sequence is a single, continuous molecule of nucleic acid or protein. It can be thought of as a multiple inheritance class hierarchy. One hierarchy is that of the underlying molecule type: DNA, RNA, or protein. The other hierarchy is the way the underlying biological sequence is represented by the data structure. It could be a physical or genetic map, an actual sequence of amino acids or nucleic acids, or some more complicated data structure building a composite view from other entries. An overview of this data model has been presented previously, in the Data Model chapter. The overview will not be repeated here so if you have not read that chapter, do so now. This chapter will concern itself with the details of the specification and representation of biological sequence data.

Bioseq: the Biological Sequence

A Bioseq represents a single, continuous molecule of nucleic acid or protein. It can be anything from a band on a gel to a complete chromosome. It can be a genetic or physical map. All Bioseqs have more common properties than differences. All Bioseqs must have at least one identifier, a Seq-id (i.e. Bioseqs must be citable). Seq-ids are discussed in detail in the chapter Sequence Ids and Locations. All Bioseqs represent an integer coordinate system (even maps). All positions on Bioseqs are given by offsets from the first residue, and thus fall in the range from zero to (length - 1). All Bioseqs may have specific descriptive data elements (descriptors) and/or annotations such as feature tables, alignments, or graphs associated with them.

The differences in Bioseqs arise primarily from the way they are instantiated (represented). Different data elements are required to represent a map than are required to represent a sequence of residues.

The C structure for a Bioseq has pointers for a linked list of Seq-ids, a linked list of Seq-descr, and a linked list of Seq-annot, mapping quite directly from the ASN.1. However, since a Seq-inst is always required for a Bioseq, those fields have been incorporated into the Bioseq itself. There are SeqInstAsnRead() and SeqInstAsnWrite() as separate functions, but they take a pointer to a Bioseq.

A number of #defines are provided in objseq.h for the representation classes, molecule types, and types of sequence encoding used in the Bioseq C structure. Also the macros ISA_na() and ISA_aa() are provided to split Bioseqs into the two major molecule classes. A Bioseq.length equal to -1 means the length is unknown and will not appear in the ASN.1. When actual sequence data is present, Bioseq.seq_data holds the pointer to it. Bioseq.seq_data_type contains a value indicating the type of sequence encoding used (and thus the pointer type to cast Bioseq.seq_data to). Sequence encoding is discussed in more detail below.
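
The conventions above can be illustrated with a small sketch. It uses a simplified stand-in for the Bioseq structure, not the toolkit's own definition. The type DemoBioseq, the DEMO_MOL_* constants, and the DEMO_ISA_NA()/DEMO_ISA_AA() macros are hypothetical names invented for this example. They only mimic the roles of the #defines and the ISA_na()/ISA_aa() macros in objseq.h, and their numeric values are assumptions.

```c
#include <stddef.h>

/* Hypothetical molecule-type codes; the real values live in objseq.h. */
enum { DEMO_MOL_NOT_SET = 0, DEMO_MOL_DNA, DEMO_MOL_RNA, DEMO_MOL_AA, DEMO_MOL_NA };

/* Stand-ins for ISA_na() / ISA_aa(): split molecules into two classes. */
#define DEMO_ISA_NA(m) ((m) == DEMO_MOL_DNA || (m) == DEMO_MOL_RNA || (m) == DEMO_MOL_NA)
#define DEMO_ISA_AA(m) ((m) == DEMO_MOL_AA)

/* Simplified stand-in for the Bioseq C structure (assumed layout). */
typedef struct {
    int   mol;            /* molecule type */
    long  length;         /* -1 means the length is unknown */
    int   seq_data_type;  /* which encoding seq_data uses */
    void *seq_data;       /* NULL when no sequence data is present */
} DemoBioseq;

/* Returns 1 if pos is a valid offset (0 .. length-1), 0 otherwise.
   A Bioseq of unknown length (-1) cannot validate any position. */
int demo_position_is_valid(const DemoBioseq *bsp, long pos)
{
    if (bsp == NULL || bsp->length < 0)
        return 0;
    return pos >= 0 && pos < bsp->length;
}
```

The check reflects the rule that every position on a Bioseq is an offset from the first residue, so the last valid position is length - 1.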

Seq-id: Identifying the Bioseq

Every Bioseq MUST have at least one Seq-id, or sequence identifier. This means a Bioseq is always citable. You can refer to it by a label of some sort. This is a crucial property for different software tools or different scientists to be able to talk about the same thing. There is a wide range of Seq-ids and they are used in different ways. They are discussed in more detail in the Sequence Ids and Locations chapter.

Seq-annot: Annotating the Bioseq

A Seq-annot is a self-contained package of sequence annotations, or information that refers to specific locations on specific Bioseqs. Every Seq-annot can have an Object-id for local use by software, a Dbtag for globally identifying the source of the Seq-annot, and/or a name and description for display and use by a human. These describe the whole package of annotations and make it attributable to a source, independent of the source of the Bioseq.

A Seq-annot may contain a feature table, a set of sequence alignments, or a set of graphs of attributes along the sequence. These are described in detail in the Sequence Annotation chapter.

A Bioseq may have many Seq-annots. This means it is possible for one Bioseq to have feature tables from several different sources, or a feature table and set of alignments. A collection of sequences (see Sets Of Bioseqs) can have Seq-annots as well. Finally, a Seq-annot can stand alone, not directly attached to anything. This is because each element in the Seq-annot has specific references to locations on Bioseqs so the information is very explicitly associated with Bioseqs, not implicitly associated by attachment. This property makes possible the exchange of information about Bioseqs as naturally as the exchange of the Bioseqs themselves, be it among software tools or between scientists or as contributions to public databases.

Seq-descr: Describing the Bioseq and Placing It In Context

A Seq-descr is meant to describe a Bioseq (or set of Bioseqs; see Sets Of Bioseqs) and place it in a biological and/or bibliographic context. Seq-descrs apply to the whole Bioseq. Some Seq-descr classes appear also as features, when used to describe a specific part of a Bioseq. But anything appearing at the Seq-descr level applies to the whole thing.

The C implementation uses a linked list of ValNodes, where the ValNode.choice indicates what kind of Seq-descr this is, and ValNode.data contains either an integer or pointer depending on the type of descriptor. The file objseq.h lists the choices and data types and is summarized in the following table. Under Value is the value of ValNode.choice. Type gives an indication of the data stored in ValNode.data. If "i", then an integer is stored in valnode->data.intvalue. Otherwise a pointer is stored in valnode->data.ptrvalue and the datatype of the pointer is given. The file objseq.h also has a series of #defines for Value below, constructed by prefixing "Seq_descr_" to the Name below and replacing any hyphens (-) in the ASN.1 name with underscores (_) to make it legal C (e.g. #define Seq_descr_mol_type 1).

Seq-descr

Value	Name	Type	Explanation
1	mol-type	i	role of molecule in life
2	modif	ValNodePtr	modifying keywords of mol-type
3	method	i	protein sequencing method used
4	name	CharPtr	a commonly used name (e.g. "SV40")
5	title	CharPtr	a descriptive title or definition
6	org	OrgRefPtr	(single) organism from which mol comes
7	comment	CharPtr	descriptive comment (may have many)
8	num	NumberingPtr	a numbering system for whole Bioseq
9	maploc	DbtagPtr	a map location from a mapping database
10	pir	PirBlockPtr	PIR specific data
11	genbank	GBBlockPtr	GenBank flatfile specific data
12	pub	PubdescPtr	Publication citation and descriptive info from pub
13	region	CharPtr	name of genome region (e.g. B-globin cluster)
14	user	UserObjectPtr	user defined data object for any purpose
15	sp	SPBlockPtr	SWISSPROT specific data
16	neighbors	LinkSetPtr	ids of pre-calculated similar sequences
17	embl	EMBLBlockPtr	EMBL specific data
18	create-date	DatePtr	date entry was created by source database
19	update-date	DatePtr	date entry last updated by source database
20	prf	PrfBlockPtr	PRF specific data
21	pdb	PdbBlockPtr	PDB specific data
22	het	CharPtr	heterogen: non-Bioseq atom/molecule
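
A minimal sketch of walking such a descriptor chain follows. DemoValNode is a simplified stand-in for the toolkit's ValNode, reduced to the fields discussed above. The Demo_descr_* names and both helper functions are hypothetical and are not part of objseq.h. The numeric values are taken from the Seq-descr table.

```c
#include <stddef.h>

/* Simplified stand-in for the toolkit's ValNode (assumed layout). */
typedef struct DemoValNode {
    int choice;                 /* which Seq-descr this node holds */
    union {
        int   intvalue;         /* used when Type is "i" */
        void *ptrvalue;         /* used for all pointer types */
    } data;
    struct DemoValNode *next;   /* next descriptor in the chain */
} DemoValNode;

/* Values follow the Seq-descr table. */
#define Demo_descr_mol_type 1
#define Demo_descr_title    5
#define Demo_descr_comment  7

/* Return the first descriptor with the given choice, or NULL. */
const DemoValNode *demo_find_descr(const DemoValNode *head, int choice)
{
    for (; head != NULL; head = head->next)
        if (head->choice == choice)
            return head;
    return NULL;
}

/* Return the title string from a descriptor chain, or NULL if absent.
   A title descriptor stores a character pointer in data.ptrvalue. */
const char *demo_get_title(const DemoValNode *head)
{
    const DemoValNode *vnp = demo_find_descr(head, Demo_descr_title);
    return vnp != NULL ? (const char *)vnp->data.ptrvalue : NULL;
}
```

The caller must know from ValNode.choice which union member is meaningful. Reading data.ptrvalue from a mol-type node, for example, would be an error.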

mol-type: The Molecule Type

A Seq-descr.mol-type is of type GIBB-mol. It is derived from the molecule information used in the GenInfo BackBone database. It indicates the biological role of the Bioseq in life. It can be genomic (including organelle genomes). It can be a transcription product such as pre-mRNA, mRNA, rRNA, tRNA, snRNA (small nuclear RNA), or scRNA (small cytoplasmic RNA). All amino acid sequences are peptides. No distinction is made at this level about the level of processing of the peptide (but see Prot-ref in the Sequence Annotations chapter). The type other-genetic is provided for "other genetic material" such as B chromosomes or F factors that are not normal genomic material but are also not transcription products. The type genomic-mRNA is provided to describe sequences presented in figures in papers in which the author has combined genomic flanking sequence with cDNA sequence. Since such a figure often does not accurately reflect either the sequence of the mRNA or the sequence of the genome, this practice should be discouraged.

Since GIBB-mol is an ENUMERATED type, the ValNode for the Seq-descr simply places the enumerated value in ValNode.data.intvalue.

modif: Modifying Our Assumptions About a Bioseq

A GIBB-mod began as a GenInfo BackBone component and was found to be of general utility. A GIBB-mod is meant to modify the assumptions one might make about a Bioseq. If a GIBB-mod is not present, it does not mean it does not apply, only that it is part of a reasonable assumption already. For example, a Bioseq with GIBB-mol = genomic would be assumed to be DNA, to be chromosomal, and to be partial (complete genome sequences are still rare). If GIBB-mod = mitochondrial and GIBB-mod = complete are both present in Seq-descr, then we know this is a complete mitochondrial genome. Even though GIBB-mod = DNA is not present we can still assume it is DNA.

The modifier concept permits a lot of flexibility. So a peptide with GIBB-mod = mitochondrial is a mitochondrial protein. There is no implication that it is from a mitochondrial gene, only that it functions in the mitochondrion. The assumption is that peptide sequences are complete, so GIBB-mod = complete is not necessary for most proteins, but GIBB-mod = partial is important information for some. A list of brief explanations of GIBB-mod values follows:

GIBB-mod

Value	Name	Explanation
0	dna	molecule is DNA in life
1	rna	molecule is RNA in life
2	extrachrom	molecule is extrachromosomal
3	plasmid	molecule is or is from a plasmid
4	mitochondrial	molecule is from mitochondrion
5	chloroplast	molecule is from chloroplast
6	kinetoplast	molecule is from kinetoplast
7	cyanelle	molecule is from cyanelle
8	synthetic	molecule was synthesized artificially
9	recombinant	molecule was formed by recombination
10	partial	not a complete sequence for molecule
11	complete	sequence covers complete molecule
12	mutagen	molecule subjected to mutagenesis
13	natmut	molecule is a naturally occurring mutant
14	transposon	molecule is a transposon
15	insertion-seq	molecule is an insertion sequence
16	no-left	partial molecule is missing left end (5' end for nucleic acid, NH2 end for peptide)
17	no-right	partial molecule is missing right end (3' end for nucleic acid, COOH end for peptide)
18	macronuclear	molecule is from macronucleus
19	proviral	molecule is an integrated provirus
20	est	molecule is an expressed sequence tag

Seq-descr.modif is defined as a SET OF GIBB-mod, so it must be implemented as a chain, not as a single value. The ValNode representing a Seq-descr.modif then has ValNode.choice = Seq_descr_modif and a ValNode.data.ptrvalue is the head of a chain of ValNodes. Each member of that chain has a ValNode.data.intvalue set to represent a single GIBB-mod according to the table above.
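
The two-level chain described above can be sketched as follows. DemoValNode is again a simplified stand-in for the toolkit's ValNode, and Demo_descr_modif, the DEMO_MOD_* constants, and demo_has_modif() are hypothetical names. The numeric values come from the Seq-descr and GIBB-mod tables.

```c
#include <stddef.h>

/* Simplified stand-in for the toolkit's ValNode (assumed layout). */
typedef struct DemoValNode {
    int choice;
    union {
        int   intvalue;
        void *ptrvalue;
    } data;
    struct DemoValNode *next;
} DemoValNode;

#define Demo_descr_modif 2   /* value of "modif" in the Seq-descr table */

/* A few GIBB-mod values from the table. */
enum { DEMO_MOD_MITOCHONDRIAL = 4, DEMO_MOD_PARTIAL = 10, DEMO_MOD_COMPLETE = 11 };

/* Returns 1 if any modif descriptor in the Seq-descr chain carries the
   given GIBB-mod value, 0 otherwise. The outer loop walks descriptors;
   the inner loop walks the chain hanging off data.ptrvalue, where each
   node holds one GIBB-mod in data.intvalue. */
int demo_has_modif(const DemoValNode *descr, int mod)
{
    const DemoValNode *d, *m;

    for (d = descr; d != NULL; d = d->next) {
        if (d->choice != Demo_descr_modif)
            continue;
        for (m = (const DemoValNode *)d->data.ptrvalue; m != NULL; m = m->next)
            if (m->data.intvalue == mod)
                return 1;
    }
    return 0;
}
```

A return value of 0 only means the modifier was not stated. As noted earlier, an absent GIBB-mod may still hold as part of a reasonable default assumption.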

method: Protein Sequencing Method

The method Seq-descr gives the method used to obtain a protein sequence. The values for a GIBB-method are also stored in the C structure as integer values mapping directly from the ASN.1 ENUMERATED type. They are:

GIBB-method

Value	Name	Explanation
1	concept-trans	conceptual translation
2	seq-pept	peptide itself was sequenced
3	both	conceptual translation with partial peptide sequencing
4	seq-pept-overlap	peptides sequenced, fragments ordered by overlap
5	seq-pept-homol	peptides sequenced, fragments ordered by homology
6	concept-trans-a	conceptual translation, provided by author of sequence

name: A Descriptive Name

A sequence name is very different from a sequence identifier. A Seq-id uniquely identifies a specific Bioseq. A Seq-id may be no more than an integer and will not necessarily convey any biological or descriptive information in itself. A name is not guaranteed to uniquely identify a single Bioseq, but if used with caution, can be a very useful tool to identify the best current entry for a biological entity. For example, we may wish to associate the name "SV40" with a single Bioseq for the complete genome of SV40. Let us suppose this Bioseq has the Seq-id 10. Then it is discovered that there were errors in the original Bioseq designated 10, and it is replaced by a new Bioseq from a curator with Seq-id 15. The name "SV40" can be moved to Seq-id 15 now. If a biologist wishes to see the "best" or "most typical" sequence of the SV40 genome, she would retrieve on the name "SV40". At an earlier point in time she would get Bioseq 10. At a later point she would get Bioseq 15. Note that her query is always answered in the context of best current data. On the other hand, if she had done a sequence analysis on Bioseq 10 and wanted to compare results, she would cite Seq-id 10, not the name "SV40", since her results apply to the specific Bioseq, 10, not necessarily to the "best" or "most typical" entry for the virus at the moment.

title: A Descriptive Title

A title is a brief, generally one line, description of an entry. It is extremely useful when presenting lists of Bioseqs returned from a query or search. This is the same as the familiar GenBank flatfile DEFINITION line.

Because of the utility of such terse summaries, NCBI has been experimenting with algorithmically generated titles which try to pack as much information as possible into a single line in a regular and readable format. You will see titles of this form appearing on entries produced by the NCBI journal scanning component of GenBank.

DEFINITION atp6=F0-ATPase subunit 6 {RNA edited} [Brassica napus=rapeseed,
           mRNA Mitochondrial, 905 nt]
DEFINITION mprA=metalloprotease, mprR=regulatory protein [Streptomyces
           coelicolor, Muller DSM3030, Genomic, 3 genes, 2040 nt]
DEFINITION pelBC gene cluster: pelB=pectate lyase isozyme B, pelC=pectate
           lyase isozyme C [Erwinia chrysanthemi, 3937, Genomic, 2481 nt]
DEFINITION glycoprotein J...glycoprotein I [simian herpes B virus SHBV,
           prototypic B virus, Genomic, 3 genes, 2652 nt]
DEFINITION glycoprotein B, gB [human herpesvirus-6 HHV6, GS, Peptide, 830
           aa]
DEFINITION {pseudogene} RESA-2=ring-infected erythrocyte surface antigen 2
           [Plasmodium falciparum, FCR3, Genomic, 3195 nt]
DEFINITION microtubule-binding protein tau {exons 4A, 6, 8 and 13/14} [human,
           Genomic, 954 nt, segment 1 of 4]
DEFINITION CAD protein carbamylphosphate synthetase domain {5' end} [Syrian
           hamsters, cell line 165-28, mRNA Partial, 553 nt]
DEFINITION HLA-DPB1 (SSK1)=MHC class II antigen [human, Genomic, 288 nt]

Gene and protein names come first. If both gene name and protein name are known they are linked with "=". If more than two genes are on a Bioseq then the first and last gene are given, separated by "...". A region name, if available, will precede the gene names. Extra comments will appear in {}. Organism, strain names, and molecule type and modifier appear in [] at the end. Note that the whole definition is constructed from structured information in the ASN.1 data structure by software. It is not composed by hand, but is instead a brief, machine generated summary of the entry based on data within the entry. We therefore discourage attempts to machine parse this line. It may change, but the underlying structured data will not. Software should always be designed to process the structured data.
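
To make the direction of data flow concrete, here is a rough sketch of generating such a line from structured fields. It is not NCBI's title generator. The DemoTitleInfo structure and demo_make_title() are hypothetical, and the sketch handles only a single gene/protein pair with one optional comment, ignoring region names, strains, and multi-gene entries.

```c
#include <stdio.h>

/* Hypothetical structured summary of an entry; in real data these
   values would come from descriptors and features, not from a title. */
typedef struct {
    const char *gene;       /* may be NULL */
    const char *protein;    /* may be NULL */
    const char *comment;    /* may be NULL; shown in {} */
    const char *organism;   /* required */
    const char *mol_label;  /* required, e.g. "Genomic" */
    long        length;
    const char *unit;       /* "nt" or "aa" */
} DemoTitleInfo;

/* Writes "gene=protein {comment} [organism, mol_label, length unit]"
   into buf and returns snprintf()'s result (the length the full title
   needs, which exceeds buflen - 1 if the output was truncated). */
int demo_make_title(const DemoTitleInfo *t, char *buf, size_t buflen)
{
    return snprintf(buf, buflen, "%s%s%s%s%s%s [%s, %s, %ld %s]",
                    t->gene ? t->gene : "",
                    (t->gene && t->protein) ? "=" : "",
                    t->protein ? t->protein : "",
                    t->comment ? " {" : "",
                    t->comment ? t->comment : "",
                    t->comment ? "}" : "",
                    t->organism, t->mol_label, t->length, t->unit);
}
```

Because the title is derived output, software that needs the gene name or organism should read the structured fields directly rather than recover them from the generated string.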

org: What Organism Did this Come From?

This descriptor is used when the whole Bioseq comes from a single organism (the usual case). See the Feature Table chapter for a detailed description of the Org-ref (organism reference) data structure.

comment: Commentary Text

A comment that applies to the whole Bioseq may go here. A comment may contain many sentences or paragraphs. A Bioseq may have many comments.

num: Applying a Numbering System to a Bioseq

One may apply a custom numbering system over the full length of the Bioseq with this Seq-descr. See the section on Numbering later in this chapter for a detailed description of the possible forms this can take. To report the numbering system used in a particular publication, the Pubdesc Seq-descr has its own Numbering slot.

maploc: Map Location

The map location given here is a Dbtag, so that a map location assigned to this Bioseq by a map database can be cited (e.g. "GDB", "4q21"). It is not necessarily the map location published by the author of the Bioseq. A map location published by the author would be part of a Pubdesc Seq-descr.

pir: PIR Specific Data

sp: SWISSPROT Data

embl: EMBL Data

prf: PRF Data

pdb: PDB Data

NCBI produces ASN.1 encoded entries from data provided by many different sources. Almost all of the data items from these widely differing sources are mapped into the common ASN.1 specifications described in this document. However, in all cases a small number of elements are unique to a particular data source, or cannot be unambiguously mapped into the common ASN.1 specification. Rather than lose such elements, they are carried in small data structures unique to each data source. These are specified in seqblock.asn and objblock.h.

genbank: GenBank Flatfile Specific Data

A number of data items unique to the GenBank flatfile format do not map readily to the common ASN.1 specification. These fields are partially populated by NCBI for Bioseqs derived from other sources than GenBank to permit the production of valid GenBank flatfile entries from those Bioseqs. Other fields are populated to preserve information coming from older GenBank entries.

pub: Description of a Publication

This Seq-descr is used both to cite a particular bibliographic source and to carry additional information about the Bioseq as it appeared in that publication, such as the numbering system to use, the figure it appeared in, a map location given by the author in that paper, and so on. See the section on the Pubdesc later in this chapter for a more detailed description of this data type.

region: Name of a Genomic Region

A region of genome often has a name which is a commonly understood description for the Bioseq, such as "B-globin cluster".

user: A User-defined Structured Object

This is a place holder for software or databases to add their own structured datatypes to Bioseqs without corrupting the common specification or disabling the automatic ASN.1 syntax checking. A User-object can also be used as a feature. See the chapter on General User Objects for a detailed explanation of User-objects.

neighbors: Bioseqs Related by Sequence Similarity

NCBI computes a list of "neighbors", or closely related Bioseqs based on sequence similarity, for use in the Entrez service. This descriptor exists so that such context-setting information can be included in a Bioseq itself, if desired.

create-date:

This is the date a Bioseq was created for the first time. It is normally supplied by the source database. It may not be present when not normally distributed by the source database.

update-date:

This is the date of the last update to a Bioseq by the source database. For several source databases this is the only date provided with an entry. The nature of the last update done is generally not available in computer readable (or any) form.

het: Heterogen

A "heterogen" is a non-biopolymer atom or molecule associated with Bioseqs from PDB. When a heterogen appears at the Seq-descr level, it means it was resolved in the crystal structure but is not associated with specific residues of the Bioseq. Heterogens which are associated with specific residues of the Bioseq are attached as features.

Seq-inst: Instantiating the Bioseq

Seq-inst.mol gives the physical type of the Bioseq in the living organism. If it is not certain if the Bioseq is DNA (dna) or RNA (rna), then (na) can be used to indicate just "nucleic acid". A protein is always (aa) or "amino acid". The values "not-set" or "other" are provided for internal use by editing and authoring tools, but should not be found on a finished Bioseq being sent to an analytical tool or database.

The representation class to which the Bioseq belongs is encoded in Seq-inst.repr. The values "not-set" or "other" are provided for internal use by editing and authoring tools, but should not be found on a finished Bioseq being sent to an analytical tool or database. The Data Model chapter discusses the representation class hierarchy in general. Specific details follow below.

Seq-inst: Virtual Bioseq

A "virtual" Bioseq is one in which we know the type of molecule, and possibly its length, topology, and/or strandedness, but for which we do not have sequence data. It is not unusual to have some uncertainty about the length of a virtual Bioseq, so Seq-inst.fuzz may be used. The fields Seq-inst.seq-data and Seq-inst.ext are not appropriate for a virtual Bioseq.

Seq-inst: Raw Bioseq

A "raw" Bioseq does have sequence data, so Seq-inst.length must be set and there should be no Seq-inst.fuzz associated with it. Seq-inst.seq-data must be filled in with the sequence itself and a Seq-data encoding must be selected which is appropriate to Seq-inst.mol. The topology and strandedness may or may not be available. Seq-inst.ext is not appropriate.

Seq-inst: Segmented Bioseq

A segmented ("seg") Bioseq has all the properties of a virtual Bioseq, except that Seq-inst.ext of type Seq-ext.seg must be used to indicate the pieces of other Bioseqs to assemble to make the segmented Bioseq. A Seq-ext.seg is defined as a SEQUENCE OF Seq-loc, or a series of locations on other Bioseqs, taken in order.

For example, a segmented Bioseq (called "X") has a SEQUENCE OF Seq-loc which are an interval from position 11 to 20 on Bioseq "A" followed by an interval from position 6 to 15 on Bioseq "B". So "X" is a Bioseq with no internal gaps which is 20 residues long (no Seq-inst.fuzz). The first residue of "X" is the residue found at position 11 in "A". To obtain this residue, software must retrieve Bioseq "A" and examine the residue at "A" position 11. The segmented Bioseq contains no sequence data itself, only pointers to where to get the sequence data and what pieces to assemble in what order.
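
The lookup described in this example can be sketched as below. DemoInterval is a hypothetical stand-in for an interval Seq-loc, and both functions are illustrative only. The toolkit's own utilities for this (such as SeqPort) are not shown. Interval endpoints are taken as inclusive, as in the example.

```c
#include <stddef.h>

/* Hypothetical stand-in for an interval Seq-loc on another Bioseq. */
typedef struct {
    const char *id;   /* identifier of the underlying Bioseq */
    long from;        /* first position of the interval (inclusive) */
    long to;          /* last position of the interval (inclusive) */
} DemoInterval;

/* Length of a segmented Bioseq: the sum of its interval lengths. */
long demo_seg_length(const DemoInterval *segs, size_t nsegs)
{
    long total = 0;
    size_t i;

    for (i = 0; i < nsegs; i++)
        total += segs[i].to - segs[i].from + 1;
    return total;
}

/* Map a position on the segmented Bioseq to a position on one of the
   underlying Bioseqs. Returns the index of the segment that contains
   pos and stores the underlying position in *under_pos, or returns -1
   if pos lies outside the segmented Bioseq. */
int demo_map_seg_pos(const DemoInterval *segs, size_t nsegs,
                     long pos, long *under_pos)
{
    long offset = 0;   /* start of the current segment on the seg Bioseq */
    size_t i;

    if (pos < 0)
        return -1;
    for (i = 0; i < nsegs; i++) {
        long len = segs[i].to - segs[i].from + 1;
        if (pos < offset + len) {
            *under_pos = segs[i].from + (pos - offset);
            return (int)i;
        }
        offset += len;
    }
    return -1;
}
```

For "X" built from 11 to 20 on "A" and 6 to 15 on "B", position 0 maps to position 11 on "A" and position 10 maps to position 6 on "B". Software would then still have to retrieve "A" or "B" to read the residue itself.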

The type of segmented Bioseq described above might be used to represent the putative mRNA by simply pointing to the exons on two pieces of genomic sequence. Suppose however, that we had only sequenced around the exons on the genomic sequence, but wanted to represent the putative complete genomic sequence. Let us assume that Bioseq "A" is the genomic sequence of the first exon and some small amount of flanking DNA, and that Bioseq "B" is the genomic sequence around the second exon. Further, we may know from mapping that the exons are separated by about two kilobases of DNA. We can represent the genomic region by creating a segmented sequence in which the first location is all of Bioseq "A". The second location will be all of a virtual Bioseq (call it "C") whose length is two thousand and which has a Seq-inst.fuzz representing whatever uncertainty we may have about the exact length of the intervening genomic sequence. The third location will be all of Bioseq "B". If "A" is 100 base pairs long and "B" is 200 base pairs, then the segmented entry is 2300 base pairs long ("A"+"C"+"B") and has the same Seq-inst.fuzz as "C" to express the uncertainty of the overall length.

A variation of the case above is when one has no idea at all what the length of the intervening genomic region is. A segmented Bioseq can also represent this case. The Seq-inst.ext location chain would be first all of "A", then a Seq-loc of type "null", then all of "B". The "null" indicates that there is no available information here. The length of the segmented Bioseq is just the sum of the length of "A" and the length of "B", and Seq-inst.fuzz is set to indicate the real length is greater-than the length given. The "null" location does not add to the overall length of the segmented Bioseq and is ignored in determining the integer value of a location on the segmented Bioseq itself. If "A" is 100 base pairs long and "B" is 50 base pairs long, then position 0 on the segmented Bioseq is equivalent to the first residue of "A" and position 100 on the segmented Bioseq is equivalent to the first residue of "B", despite the intervening "null" location indicating the gap of unknown length. Utility functions such as the SeqPort (described in the Sequence Utilities chapter) can be configured to signal when crossing such boundaries, or to ignore them.
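
The length and uncertainty bookkeeping for the two gap cases above can be sketched as follows. All names are hypothetical: DemoSeg reduces each location to "all of some Bioseq" or "null", and DemoFuzz is a three-way flag standing in for Seq-inst.fuzz, which in the real specification is a richer type.

```c
#include <stddef.h>

typedef enum { DEMO_SEG_WHOLE, DEMO_SEG_NULL } DemoSegKind;

/* Coarse stand-in for Seq-inst.fuzz on the segmented Bioseq. */
typedef enum {
    DEMO_FUZZ_NONE,          /* length is exact */
    DEMO_FUZZ_INHERITED,     /* uncertainty comes from a component */
    DEMO_FUZZ_GREATER_THAN   /* real length exceeds the stated length */
} DemoFuzz;

typedef struct {
    DemoSegKind kind;
    long length;     /* length of the referenced Bioseq; unused for null */
    int  has_fuzz;   /* nonzero if that Bioseq's own length is uncertain */
} DemoSeg;

/* Sum the component lengths. A null location adds nothing to the length
   but marks the total as a lower bound; a component with fuzz passes
   its uncertainty on to the total. */
long demo_total_length(const DemoSeg *segs, size_t n, DemoFuzz *fuzz)
{
    long total = 0;
    size_t i;

    *fuzz = DEMO_FUZZ_NONE;
    for (i = 0; i < n; i++) {
        if (segs[i].kind == DEMO_SEG_NULL) {
            *fuzz = DEMO_FUZZ_GREATER_THAN;
            continue;
        }
        total += segs[i].length;
        if (segs[i].has_fuzz && *fuzz == DEMO_FUZZ_NONE)
            *fuzz = DEMO_FUZZ_INHERITED;
    }
    return total;
}

/* Position on the segmented Bioseq at which segment idx begins.
   Null locations are skipped because they occupy no coordinates. */
long demo_seg_start(const DemoSeg *segs, size_t idx)
{
    long start = 0;
    size_t i;

    for (i = 0; i < idx; i++)
        if (segs[i].kind != DEMO_SEG_NULL)
            start += segs[i].length;
    return start;
}
```

With "A" (100), virtual "C" (2000, with fuzz), and "B" (200) the total is 2300 with inherited uncertainty. With "A" (100), null, and "B" (50) the total is 150, flagged as greater-than, and "B" starts at position 100.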

The Bioseqs referenced by a segmented Bioseq should always be from the same Seq-inst.mol class as the segmented Bioseq, but may well come from a mixture of Seq-inst.repr classes (as for example the mixture of virtual and raw Bioseq references used to describe sequenced and unsequenced genomic regions above). Other reasonable mixtures might be raw and map (see below) Bioseqs to describe a region which is fully mapped and partially sequenced, or even a mixture of virtual, raw, and map Bioseqs for a partially mapped and partially sequenced region. The "character" of any region of a segmented Bioseq is always taken from the underlying Bioseq to which it points in that region. However, a segmented Bioseq can have its own annotations. Things like feature tables are not automatically propagated to the segmented Bioseq.

Seq-inst: Reference Bioseq

A reference Bioseq is effectively a segmented Bioseq with only one pointer location. It behaves exactly like a segmented Bioseq in taking its data and "character" from the Bioseq to which it points. Its purpose is not to construct a new Bioseq from others like a segmented Bioseq, but to refer to an existing Bioseq. It could be used to provide a convenient handle to a frequently used region of a larger Bioseq. Or it could be used to develop a customized, personally annotated view of a Bioseq in a public database without losing the "live" link to the public sequence.

In the first example, software would want to be able to use the Seq-loc to gather up annotations and descriptors for the region and display them to the user with corrections to align them appropriately to the sub region. In this form, a scientist may refer to the "lac region" by name, and analyze or annotate it as if it were a separate Bioseq, but each retrieval starts with a fresh copy of the underlying Bioseq and annotations, so corrections or additions made to the underlying Bioseq in the public database will be immediately visible to the scientist, without either having to always look at the whole Bioseq or losing any additional annotations the scientist may have made on the region themselves.

In the second example, software would not propagate annotations or descriptors from the underlying Bioseq by default (because presumably the scientist prefers his own view to the public one) but the connection to the underlying Bioseq is not lost. Thus the public annotations are available on demand and any new annotations added by the scientist share the public coordinate system and can be compared with those done by others.
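
The coordinate correction needed when annotations are carried from the underlying Bioseq onto a reference Bioseq can be sketched as a simple shift. DemoRange and demo_remap_to_region() are hypothetical names. Clipping features that extend past the region is an assumption of this sketch rather than documented toolkit behavior, and strand is ignored.

```c
/* Inclusive range of positions on a Bioseq (hypothetical type). */
typedef struct {
    long from;
    long to;
} DemoRange;

/* Translate a feature interval given in the coordinates of the
   underlying Bioseq into the coordinates of a reference Bioseq that
   points at `region` of it. The part of the feature outside the region
   is clipped. Returns 1 and fills *out when the feature overlaps the
   region, or 0 when it lies entirely outside. */
int demo_remap_to_region(DemoRange region, DemoRange feat, DemoRange *out)
{
    long from = feat.from > region.from ? feat.from : region.from;
    long to   = feat.to   < region.to   ? feat.to   : region.to;

    if (from > to)
        return 0;                 /* no overlap with the region */
    out->from = from - region.from;
    out->to   = to - region.from;
    return 1;
}
```

For a reference Bioseq covering positions 1000 to 1999 of a larger sequence, a feature at 1200 to 1300 appears at 200 to 300. New annotations made on the region can be mapped back by adding region.from, so they remain comparable with annotations on the public sequence.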

Seq-inst: Constructed Bioseq

A constructed (const) Bioseq inherits all the attributes of a raw Bioseq. It is used to represent a Bioseq which has been constructed by assembling other Bioseqs. In this case the component Bioseqs normally overlap each other and there may be considerable redundancy of component Bioseqs. A constructed Bioseq is often also called a "contig" or a "merge".

Most raw Bioseqs in the public databases were constructed by merging overlapping gel or sequencer readings of a few hundred base pairs each. While the const Bioseq data structure can easily accommodate this information, the const Bioseq data type was not really intended for this purpose. It was intended to represent higher level merges of public sequence data and private data, such as when a number of sequence entries from different authors are found to overlap or be contained in each other. In this case a view of the larger sequence region can be constructed by merging the components. The relationship of the merge to the component Bioseqs is preserved in the constructed Bioseq, but it is clear that the constructed Bioseq is a "better" or "more complete" view of the overall region, and could replace the component Bioseqs in some views of the sequence database. In this way an author can submit a data structure to the database which in this author's opinion supersedes his own or other scientists' database entries, without the database actually dropping the entries of the other authors (who may not necessarily agree with the author submitting the constructed Bioseq).

The constructed Bioseq is like a raw, rather than a segmented, Bioseq because Seq-inst.seq-data must be present. The sequence itself is part of the constructed Bioseq. This is because the component Bioseqs may overlap in a number of ways, and expert knowledge or voting rules may have been applied to determine the "correct" or "best" residue from the overlapping regions. The Seq-inst.seq-data contains the sequence which is the final result of such a process.

Seq-inst.ext is not used for the constructed Bioseq. The relationship of the merged sequence to its component Bioseqs is stored in Seq-inst.hist, the history of the Bioseq (described in more detail below). Seq-hist.assembly contains alignments of the constructed Bioseq with its component Bioseqs. Any Bioseq can have a Seq-hist.assembly. A raw Bioseq may use this to show its relationship to its gel readings. The constructed Bioseq is special in that its Seq-hist.assembly shows how a high level view was constructed from other pieces. The sequence in a constructed Bioseq is only posited to exist: since it is constructed from data from possibly many different laboratories, it may never have been sequenced in its entirety from a single biological source.

Seq-inst: Typical or Consensus Bioseq

A consensus (consen) Bioseq is used to represent a pattern typical of a sequence region or family of sequences. There is no assertion that even one sequence exists that is exactly like this one, or even that the Bioseq is a best guess at what a real sequence region looks like. Instead it summarizes attributes of an aligned collection of real sequences. It could be a "typical" ferredoxin made by aligning ferredoxin sequences from many organisms and producing a protein sequence which is by some measure "central" to the group. By using the NCBIpaa encoding for the protein, which permits a probability to be assigned to each position that any of the standard amino acids occurs there, one can create a "weight matrix" or "profile" to define the sequence.

While a consensus Bioseq can represent a frequency profile (including the probability that any amino acid can occur at a position, a type of gap penalty), it cannot represent a regular expression per se. That is because all Bioseqs represent fixed integer coordinate systems. This property is essential for attaching feature tables or expressing alignments. There is no clear way to attach a fixed coordinate system to a regular expression, while one can approximate allowing weighted gaps in specific regions with a frequency profile. Since the consensus Bioseq is like any other, information can be attached to it through a feature table and alignments of the consensus pattern to other Bioseqs can be represented like any other alignment (although it may be computed a special way). Through the alignment, annotated features on the pattern can be related to matched regions of the aligned sequence in a straightforward way.

Seq-hist.assembly can be used in a consensus Bioseq to record the sequence regions used to construct the pattern and their relationships with it. While Seq-hist.assembly for a constructed Bioseq indicates the relationship with Bioseqs which are meant to be superseded by the constructed Bioseq, the consensus Bioseq does not in any way replace the Bioseqs in its Seq-hist.assembly. Rather it is a summary of common features among them, not a "better" or "more complete" version of them.

Seq-inst: Map Bioseqs

A map Bioseq inherits all the properties of a virtual Bioseq. For a consensus genetic map of E.coli, we can posit that the chromosome is DNA, circular, double-stranded, and about 5 million base pairs long. Given this coordinate system, we estimate the positions of genes on it based on genetic evidence. That is, we build a feature table with Gene-ref features on it (explained in more detail in the Feature Table chapter). Thus, a map Bioseq is a virtual Bioseq with a Seq-inst.ext which is a feature table. In this case the feature table is an essential part of instantiating the Bioseq, not simply an annotation on the Bioseq. This is not to say a map Bioseq cannot have a feature table in the usual sense as well. It can. It can also be used in alignments, displays, or by any software that can process or store Bioseqs. This is the great strength of this approach. A genetic or physical map is just another Bioseq and can be stored or analyzed right along with other more typical Bioseqs.

It is understood that within a particular physical or genetic mapping research project more data will have to be present than the map Bioseq can represent. But the same is true for a big sequencing project. The Bioseq is an object for reporting the result of such projects to others in a way that preserves most or all the information of use to workers outside the particular research group. It also preserves enough information to be useful to software tools within the project, such as display tools or analysis tools which were written by others.

A number of attributes of Bioseqs can make such a generic representation more "natural" to a particular research community. For the E.coli map example, above, no E.coli geneticist thinks of the positions of genes in base pairs (yet). So a Num-real Numbering descriptor (see Seq-descr, below) can be attached to the Bioseq, which provides information to convert the internal integer coordinate system of the map Bioseq to "minutes", the floating point numbers from 0.0 to 100.0 that E.coli gene positions are traditionally given in. Seq-loc objects which the Gene-ref features use to indicate their position can represent uncertainty, and thus give some idea of the accuracy of the mapping in a simple way. This representation cannot store order information directly (e.g. B and C are after A and before D, but we don't know the absolute distance and we don't know the relative order of B and C), which would need to be stored in a genetic mapping research database. However, a reasonable enough presentation can be made of this situation using locations and uncertainties to be very useful for a wide variety of purposes. As more sequence and physical map information becomes available, such uncertainties in gene position, at least for the "typical" chromosome, will gradually be resolved and will then map very well to such a generic model.

A physical map Bioseq has strengths and weaknesses similar to those of the genetic map Bioseq. It can represent an ordered map (such as an ordered restriction map) very well and easily. For some contig building approaches, ordering information is essential to the process of building the physical map and would have to be stored and processed separately by the map building research group. However, the map Bioseq serves very well as a vehicle for periodic reports of the group's best view of the physical map for consumption by the scientific public. The map Bioseq data structure maps quite well to the figures such groups publish to summarize their work. The map Bioseq is an electronic summary that can be integrated with other data and software tools.

Seq-hist: History of a Seq-inst

Seq-hist is literally the history of the Seq-inst part of a Bioseq. It does not track changes in annotation at all. However, since the coordinate system provided by the Seq-inst is the critical element for tying annotations and alignments done at various times by various people into a single consistent database, this is the most important element to track.

While Seq-hist can use any valid Seq-id, in practice NCBI will use the best available Seq-id in the Seq-hist. For this purpose, the Seq-id most tightly linked to the exact sequence itself is best. See the Seq-id discussion.

Seq-hist.assembly has been mentioned above. It is a SET OF Seq-align which show the relationship of this Bioseq to any older components that might be merged into it. The Bioseqs included in the assembly are those from which this Bioseq was made or is meant to supersede. The Bioseqs in the assembly need not all be from the author, but could come from anywhere. Assembly just sets the Bioseq in context.

Seq-hist.replaces makes an editorial statement using a Seq-hist-rec: as of a certain date, this Bioseq should replace the following Bioseqs. Databases at NCBI interpret this in a very specific way. Seq-ids in Seq-hist.replaces that are owned by the owner of the Bioseq are removed from the public view of the database, because the author has told us to replace them with this one. If the author does not own some of them, it is taken as advice that the older entries may be obsolete, but they are not removed from the public view.

Seq-hist.replaced-by is a forward pointer. It means this Bioseq was replaced by the following Seq-id(s) on a certain date. In the case described above, in which an author tells NCBI that a new Bioseq replaces some of the author's old ones, not only is the backward pointer (Seq-hist.replaces) provided by the author in the database, but NCBI will update the Seq-hist.replaced-by forward pointer when the old Bioseq is removed from public view. Since such old entries are still available for specific retrieval by the public, if a scientist does have annotation pointing to the old entry, the new entry can be explicitly located. Conversely, the older versions of a Bioseq can easily be located as well. Note that Seq-hist.replaced-by points only one generation forward and Seq-hist.replaces points only one generation back. This makes Bioseqs with a Seq-hist a doubly linked list over their revision history. This is very different from GenBank/EMBL/DDBJ secondary accession numbers, which only indicate "some relationship" between entries. When that relationship happens to be the replacement relationship, they still carry all accession numbers in the secondary accessions, not just the last ones, so reconstructing the entry history is impossible, even in a very general way.
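
The one-generation-back and one-generation-forward pointers can be walked like any doubly linked list. The following is only a minimal sketch in C under simplifying assumptions: HistEntry, hist_latest() and hist_generations_back() are hypothetical names, not part of the NCBI toolkit, and each entry is assumed to replace at most one older entry, whereas a real Seq-hist-rec can list several Seq-ids.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a Bioseq carrying a Seq-hist.
   A real Seq-hist-rec holds a date and a set of Seq-ids; here each
   entry points directly at its neighbor one generation away. */
typedef struct HistEntry {
    const char *label;               /* illustrative identifier only */
    struct HistEntry *replaces;      /* one generation back, or NULL */
    struct HistEntry *replaced_by;   /* one generation forward, or NULL */
} HistEntry;

/* Follow the replaced-by pointers to the newest version. */
HistEntry *hist_latest(HistEntry *e)
{
    assert(e != NULL);
    while (e->replaced_by != NULL)
        e = e->replaced_by;
    return e;
}

/* Count how many older generations precede this entry. */
int hist_generations_back(const HistEntry *e)
{
    int n = 0;
    assert(e != NULL);
    while (e->replaces != NULL) {
        e = e->replaces;
        n++;
    }
    return n;
}
```

Given an annotation that points at an old entry, a caller would use something like hist_latest() to find the current entry, which is exactly what secondary accession numbers do not allow.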

Another fate which may await a Bioseq is that it is completely withdrawn. This is relatively rare but does happen. Seq-hist.deleted can either be set to just TRUE, or the date of the deletion event can be entered (preferred). In the SeqHist C structure, slots for both the deleted boolean and deleted date are present. If the deleted date is present, the ASN.1 will have the Date CHOICE for Seq-hist.deleted, else if the deleted boolean is TRUE the ASN.1 will have the BOOLEAN form.

Seq-data: Encoding the Sequence Data Itself

In the case of a raw or constructed Bioseq, the sequence data itself is stored in Seq-inst.seq-data, which is the data type Seq-data. Seq-data is a CHOICE of different ways of encoding the data, allowing selection of the optimal type for the case in hand. Both nucleic acid and amino acid encoding are given as CHOICEs of Seq-data rather than further subclassing first. But it is still not reasonable to encode a Bioseq of Seq-inst.mol of "aa" using a nucleic acid Seq-data type.

In the C structures all types of Seq-data are stored in ByteStores in Bioseq.seq_data. The encoding is given by the value of Bioseq.seq_data_type. The file objseq.h contains a series of #defines for the values of Bioseq.seq_data_type. These #defines map exactly to the ASN.1 Seq-code-type described below.

The ASN.1 module seqcode.asn and C header objcode.h define tables for recording the allowed values for the various sequence encodings and the ways to display or map between codes. This permits useful information about the allowed encodings to be stored as ASN.1 data and read into a program at runtime. NCBI uses the text file seqcode.prt and the binary version of that, seqcode.val, with its software tools. Some of the data from this file is presented in tables in the following discussion of the different sequence encodings. The "value" is the internal numerical value of a residue in the C code. The "symbol" is a one letter or multi-letter symbol to be used in display to a human. The "name" is a descriptive name for the residue. Other data in seqcode.prt will be discussed in the section on seqcode.asn itself.

IUPACaa: The IUPAC-IUB Encoding of Amino Acids

A set of one letter abbreviations for amino acids was suggested by the IUPAC-IUB Commission on Biochemical Nomenclature, published in J. Biol. Chem. (1968) 243: 3557-3559. It is very widely used in both printed and electronic forms of protein sequence, and many computer programs have been written to analyze data in this form internally (that is, the actual ASCII value of the one letter code is used internally). To support such approaches, the IUPACaa encoding represents each amino acid internally as the ASCII value of its external one letter symbol. Note that this symbol is UPPER CASE. One may choose to display the value as lower case to a user for readability, but the data itself must be the UPPER CASE value.

In the NCBI C code implementation, the values are stored one value per byte.

IUPACaa

Value	Symbol	Name
65	A	Alanine
66	B	Asp or Asn
67	C	Cysteine
68	D	Aspartic Acid
69	E	Glutamic Acid
70	F	Phenylalanine
71	G	Glycine
72	H	Histidine
73	I	Isoleucine
74	J	Leu or Ile
75	K	Lysine
76	L	Leucine
77	M	Methionine
78	N	Asparagine
79	O	Pyrrolysine
80	P	Proline
81	Q	Glutamine
82	R	Arginine
83	S	Serine
84	T	Threonine
86	V	Valine
87	W	Tryptophan
88	X	Undetermined or atypical
89	Y	Tyrosine
90	Z	Glu or Gln

NCBIeaa: Extended IUPAC Encoding of Amino Acids

The official IUPAC amino acid code has some limitations. One is the lack of symbols for termination, gap, or selenocysteine. Such extensions to the IUPAC codes are also commonly used by sequence analysis software. NCBI has created such a code which is simply the IUPACaa code above extended with the additional symbols.

In the NCBI C code implementation, the values are stored one value per byte.

NCBIeaa

Value	Symbol	Name
42	*	Termination
45	-	Gap
65	A	Alanine
66	B	Asp or Asn
67	C	Cysteine
68	D	Aspartic Acid
69	E	Glutamic Acid
70	F	Phenylalanine
71	G	Glycine
72	H	Histidine
73	I	Isoleucine
74	J	Leu or Ile
75	K	Lysine
76	L	Leucine
77	M	Methionine
78	N	Asparagine
79	O	Pyrrolysine
80	P	Proline
81	Q	Glutamine
82	R	Arginine
83	S	Serine
84	T	Threonine
85	U	Selenocysteine
86	V	Valine
87	W	Tryptophan
88	X	Undetermined or atypical
89	Y	Tyrosine
90	Z	Glu or Gln

NCBIstdaa: A Simple Sequential Code for Amino Acids

It is often very useful to separate the external symbol for a residue from its internal representation as a data value. For amino acids NCBI has devised a simple continuous set of values that encompasses the set of "standard" amino acids also represented by the NCBIeaa code above. A continuous set of values means that compact arrays can be used in computer software to look up attributes for residues simply and easily by using the value as an index into the array. The only significance of any particular mapping of a value to an amino acid is that zero is used for gap and the official IUPAC amino acids come first in the list. In general, we recommend the use of this encoding for standard amino acid sequences.

In the NCBI C code implementation, the values are stored one value per byte.

NCBIstdaa

Value	Symbol	Name
0	-	Gap
1	A	Alanine
2	B	Asp or Asn
3	C	Cysteine
4	D	Aspartic Acid
5	E	Glutamic Acid
6	F	Phenylalanine
7	G	Glycine
8	H	Histidine
9	I	Isoleucine
10	K	Lysine
11	L	Leucine
12	M	Methionine
13	N	Asparagine
14	P	Proline
15	Q	Glutamine
16	R	Arginine
17	S	Serine
18	T	Threonine
19	V	Valine
20	W	Tryptophan
21	X	Undetermined or atypical
22	Y	Tyrosine
23	Z	Glu or Gln
24	U	Selenocysteine
25	*	Termination
26	O	Pyrrolysine
27	J	Leu or Ile
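
Because the NCBIstdaa values are continuous from zero, a residue attribute can be looked up by plain array indexing. The sketch below illustrates this with the symbols from the table above. The names ncbistdaa_symbol, stdaa_to_symbol() and symbol_to_stdaa() are made up for this example and are not toolkit functions.

```c
#include <assert.h>

#define NCBISTDAA_COUNT 28

/* One-letter symbols indexed directly by NCBIstdaa value (0-27),
   in the order given in the table above. */
static const char ncbistdaa_symbol[] = "-ABCDEFGHIKLMNPQRSTVWXYZU*OJ";

/* Return the one-letter symbol for an NCBIstdaa value. */
char stdaa_to_symbol(unsigned char value)
{
    assert(value < NCBISTDAA_COUNT);
    return ncbistdaa_symbol[value];
}

/* Search back from a symbol to its NCBIstdaa value; -1 if not found. */
int symbol_to_stdaa(char symbol)
{
    int i;
    for (i = 0; i < NCBISTDAA_COUNT; i++) {
        if (ncbistdaa_symbol[i] == symbol)
            return i;
    }
    return -1;
}
```

The same indexing pattern works for any per-residue attribute (for example a hypothetical array of molecular weights), which is the practical reason for preferring a sequential code over ASCII values.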

NCBI8aa: An Encoding for Modified Amino Acids

Post-translational modifications can introduce a number of non-standard or modified amino acids into biological molecules. The NCBI8aa code will be used to represent up to 250 possible amino acids by using the remaining coding space in the NCBIstdaa code. That is, for the first 26 values, NCBI8aa will be identical to NCBIstdaa. The remaining 224 values will be used for the most commonly encountered modified amino acids. Only the first 250 values will be used to signify amino acids, leaving values in the range of 250-255 to be used for software control codes. Obviously there are a very large number of possible modified amino acids, especially if one takes protein engineering into account. However, the intent here is to only represent commonly found biological forms. This encoding is not yet available since decisions about what amino acids to include have not all been made yet.

IUPAC3aa: A 3 Letter Display Code for Amino Acids

The IUPAC3aa code uses exactly the same values as NCBIstdaa. The only difference is that the symbol is three letters instead of one. This code is purely for display. As such it does not appear as a valid CHOICE in Seq-data for encoding actual sequence data. However, it does appear in the seqcode.asn specification and is stored in seqcode.val. The symbols follow the IUPAC-IUB recommendations for three letter codes where possible.

IUPAC3aa

Value	Symbol	Name
0	---	Gap
1	Ala	Alanine
2	Asx	Asp or Asn
3	Cys	Cysteine
4	Asp	Aspartic Acid
5	Glu	Glutamic Acid
6	Phe	Phenylalanine
7	Gly	Glycine
8	His	Histidine
9	Ile	Isoleucine
10	Lys	Lysine
11	Leu	Leucine
12	Met	Methionine
13	Asn	Asparagine
14	Pro	Proline
15	Gln	Glutamine
16	Arg	Arginine
17	Ser	Serine
18	Thr	Threonine
19	Val	Valine
20	Trp	Tryptophan
21	Xxx	Undetermined or atypical
22	Tyr	Tyrosine
23	Glx	Glu or Gln
24	Sec	Selenocysteine
25	Ter	Termination
26	Pyl	Pyrrolysine
27	Xle	Leu or Ile

NCBIpaa: A Profile Style Encoding for Amino Acids

The NCBIpaa encoding is designed to accommodate a frequency profile describing a protein motif or family in a form which is consistent with the sequences in a Bioseq. Each position in the sequence is defined by 30 values. Each of the 30 values represents the probability that a particular amino acid (or gap, termination, etc.) will occur at that position. One can consider each set of 30 values an array. The amino acid for each cell of the 30 value array corresponds to the NCBIstdaa index scheme. This means that currently only the first 26 array elements will ever have a meaningful value. The remaining 4 cells are available for possible future additions to NCBIstdaa. Each cell represents the probability that the amino acid defined by the NCBIstdaa index to that cell will appear at that position in the motif or protein. The probability is encoded as an 8-bit value from 0-255 corresponding to a probability from 0.0 to 1.0 by interpolation.

This type of encoding would presumably never appear except in a Bioseq of type "consensus". In the C code implementation these amino acids are encoded at 30 bytes per amino acid in a simple linear order. That is, the first 30 bytes are the first amino acid, the second 30 the next amino acid, and so on.
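
The linear layout makes it simple to read a single probability out of an NCBIpaa byte array. The following sketch assumes that "interpolation" means a linear scaling of the byte value (value / 255). The function name ncbipaa_prob() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NCBIPAA_CELLS 30   /* bytes per sequence position, per the text */

/* Probability (0.0 to 1.0) that the residue with NCBIstdaa value `aa`
   occurs at 0-based position `pos`. Assumes linear scaling of the
   stored 0-255 byte. */
double ncbipaa_prob(const unsigned char *data, size_t n_positions,
                    size_t pos, unsigned aa)
{
    assert(data != NULL);
    assert(pos < n_positions);
    assert(aa < NCBIPAA_CELLS);
    return data[pos * NCBIPAA_CELLS + aa] / 255.0;
}
```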

IUPACna: The IUPAC-IUB Encoding for Nucleic Acids

Like the IUPACaa codes, the IUPACna codes are single letters for nucleic acids, and the value is the same as the ASCII value of the recommended IUPAC letter. The IUPAC recommendations for nucleic acid codes also include letters to represent all possible ambiguities at a single position in the sequence except a gap. To make the values non-redundant, U is considered the same as T. Whether a sequence actually contains U or T is easily determined from Seq-inst.mol. Since some software tools are designed to work directly on the ASCII representation of the IUPAC letters, this representation is provided. Note that the ASCII values correspond to the UPPER CASE letters. Using values corresponding to lower case letters in Seq-data is an error. For display to a user, any readable case or font is appropriate.

The C implementation encodes one value for a nucleic acid residue per byte.

IUPACna

Value	Symbol	Name
65	A	Adenine
66	B	G or T or C
67	C	Cytosine
68	D	G or A or T
71	G	Guanine
72	H	A or C or T
75	K	G or T
77	M	A or C
78	N	A or G or C or T
82	R	G or A
83	S	G or C
84	T	Thymine
86	V	G or C or A
87	W	A or T
89	Y	T or C

NCBI4na: A Four Bit Encoding of Nucleic Acids

It is possible to represent the same set of nucleic acid and ambiguities with a four bit code, where one bit corresponds to each possible base and where more than one bit is set to represent ambiguity. The particular encoding used for NCBI4na is the same as that used on the GenBank Floppy Disk Format. A four bit encoding has several advantages over the direct mapping of the ASCII IUPAC codes. One can represent "no base" as 0000. One can match various ambiguous or unambiguous bases by a simple AND. For example, in NCBI4na 0001=A, 0010=C, 0100=G, 1000=T/U. Adenine (0001) then matches Purine (0101) by the AND method. Finally, it is possible to store the sequence in half the space by storing two bases per byte. This is done both in the ASN.1 encoding and in the NCBI C software implementation. Utility functions (see SeqPort()) allow the developer to ignore the complexities of storage while taking advantage of the greater packing. Since nucleic acid sequences can be very long, this is a real savings.

NCBI4na

Value	Symbol	Name
0	-	Gap
1	A	Adenine
2	C	Cytosine
3	M	A or C
4	G	Guanine
5	R	G or A
6	S	G or C
7	V	G or C or A
8	T	Thymine/Uracil
9	W	A or T
10	Y	T or C
11	H	A or C or T
12	K	G or T
13	D	G or A or T
14	B	G or T or C
15	N	A or G or C or T
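
The AND matching and the two-per-byte packing described above can be sketched in a few lines of C. This is an illustration only: the function names are hypothetical, and placing the first residue of each pair in the high-order four bits is an assumption of this sketch. In real programs the SeqPort functions hide these details.

```c
#include <assert.h>
#include <stddef.h>

/* Single-base NCBI4na values, one bit per base, from the table above. */
enum { NA4_A = 1, NA4_C = 2, NA4_G = 4, NA4_T = 8 };

/* Two NCBI4na codes match when they share at least one base bit. */
int na4_match(unsigned a, unsigned b)
{
    return (a & b) != 0;
}

/* Read the code at 0-based position pos from packed data.
   Assumption: even positions occupy the high nibble of each byte. */
unsigned na4_get(const unsigned char *packed, size_t pos)
{
    unsigned char byte = packed[pos / 2];
    return (pos % 2 == 0) ? (unsigned)(byte >> 4) : (unsigned)(byte & 0x0F);
}

/* Store a code at 0-based position pos, leaving its neighbor intact. */
void na4_set(unsigned char *packed, size_t pos, unsigned code)
{
    unsigned char *byte = &packed[pos / 2];
    assert(code <= 15);
    if (pos % 2 == 0)
        *byte = (unsigned char)((*byte & 0x0F) | (code << 4));
    else
        *byte = (unsigned char)((*byte & 0xF0) | code);
}
```

Note that a gap (0) matches nothing under the AND test, including another gap, so callers that need gap-to-gap matching would have to handle that case separately.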

NCBI2na: A Two Bit Encoding for Nucleic Acids

If no ambiguous bases are present in a nucleic acid sequence it can be completely encoded using only two bits per base. This allows encoding into ASN.1 or storage in the NCBI C implementation with a four fold savings in space. As with the four bit packing, the NCBI C utility SeqPort() allows the programmer to ignore the complexities introduced by the packing. The two bit encoding selected is the same as that proposed for the GenBank CDROM.

NCBI2na

Value	Symbol	Name
0	A	Adenine
1	C	Cytosine
2	G	Guanine
3	T	Thymine/Uracil
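
As a rough illustration of the fourfold packing, the sketch below packs an unambiguous sequence four bases per byte and reads single bases back. The function names are hypothetical, and storing the first base in the two highest-order bits is an assumption of this sketch. The SeqPort functions should be preferred in real code.

```c
#include <assert.h>
#include <stddef.h>

/* Map an unambiguous base letter to its NCBI2na value, or -1. */
int na2_from_char(char c)
{
    switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T':
    case 'U': return 3;
    default:  return -1;
    }
}

/* Pack n bases into out, four per byte, first base in the top two
   bits (assumed). out must hold (n + 3) / 4 bytes. Returns 0 on
   success, or -1 if an ambiguous or unknown letter is found. */
int na2_pack(const char *seq, size_t n, unsigned char *out)
{
    size_t i;
    assert(seq != NULL && out != NULL);
    for (i = 0; i < (n + 3) / 4; i++)
        out[i] = 0;
    for (i = 0; i < n; i++) {
        int v = na2_from_char(seq[i]);
        if (v < 0)
            return -1;
        out[i / 4] |= (unsigned char)(v << (6 - 2 * (i % 4)));
    }
    return 0;
}

/* Return the base letter stored at 0-based position pos. */
char na2_get(const unsigned char *packed, size_t pos)
{
    static const char symbols[] = "ACGT";
    unsigned v = (packed[pos / 4] >> (6 - 2 * (pos % 4))) & 0x3;
    return symbols[v];
}
```

The failure return for ambiguous letters reflects the restriction stated above: a sequence containing any ambiguity code has to fall back to NCBI4na.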

NCBI8na: An Eight Bit Sequential Encoding for Modified Nucleic Acids

The first 16 values of NCBI8na are identical with those of NCBI4na. The remaining possible 234 values will be used for common, biologically occurring modified bases such as those found in tRNAs. This full encoding is still being determined at the time of this writing. Only the first 250 values will be used, leaving values in the range of 250-255 to be used as control codes in software.

NCBIpna: A Frequency Profile Encoding for Nucleic Acids

Frequency profiles have been used to describe motifs and signals in nucleic acids. This can be encoded by using five bytes per sequence position. The first four bytes are used to express the probability that particular bases occur at that position, in the order A, C, G, T as in the NCBI2na encoding. The fifth position encodes the probability that a base occurs there at all. Each byte has a value from 0-255 corresponding to a probability from 0.0-1.0.

The sequence is encoded as a simple linear sequence of bytes where the first five bytes code for the first position, the next five for the second position, and so on. Typically the NCBIpna notation would only be found on a Bioseq of type consensus. However, one can imagine other uses for such an encoding, for example to represent knowledge about low resolution sequence data in an easily computable form.
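
Decoding one position of an NCBIpna array is a matter of reading five consecutive bytes. The sketch below assumes a linear scaling of each byte (value / 255). PnaPosition and ncbipna_decode() are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Decoded probabilities for one sequence position. */
typedef struct {
    double a, c, g, t;   /* probability of each base, in NCBI2na order */
    double any;          /* probability that a base occurs here at all */
} PnaPosition;

/* Decode the five bytes that describe 0-based position pos. */
PnaPosition ncbipna_decode(const unsigned char *data, size_t n_positions,
                           size_t pos)
{
    PnaPosition p;
    const unsigned char *cell;
    assert(data != NULL);
    assert(pos < n_positions);
    cell = data + pos * 5;
    p.a   = cell[0] / 255.0;
    p.c   = cell[1] / 255.0;
    p.g   = cell[2] / 255.0;
    p.t   = cell[3] / 255.0;
    p.any = cell[4] / 255.0;
    return p;
}
```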

Tables of Sequence Codes

Various sequence alphabets can be stored in tables of type Seq-code-table, defined in seqcode.asn. An enumerated type, Seq-code-type, is used as a key to each table. Each code can be thought of as a square table essentially like those presented above in describing each alphabet. Each "residue" of the code has a numerical one-byte value used to represent that residue both in ASN.1 data and in internal C structures. The information necessary to display the value is given by the "symbol". A symbol can be in a one-letter series (e.g. A,G,C,T) or more than one letter (e.g. Met, Leu, etc.). The symbol gives a human readable representation that corresponds to each numerical residue value. A name, or explanatory string, is also associated with each.

So, the NCBI2na code above would be coded into a Seq-code-table very simply as:

{ -- NCBI2na

code ncbi2na ,

num 4 , -- continuous 0-3

one-letter TRUE , -- all one letter codes

table {

{ symbol "A", name "Adenine" },

{ symbol "C", name "Cytosine" },

{ symbol "G", name "Guanine" },

{ symbol "T", name "Thymine/Uracil"}

} , -- end of table

comps { -- complements

3,

2,

1,

0

}

} ,

The table has 4 rows (with values 0-3) with one letter symbols. If we wished to represent a code with values which do not start at 0 (such as the IUPAC codes) then we would set the OPTIONAL "start-at" element to the value for the first row in the table.

In the case of nucleic acid codes, the Seq-code-table also has rows for indexes to complement the values represented in the table. In the example above, the complement of 0 ("A") is 3 ("T").

Mapping Between Different Sequence Alphabets

A Seq-map-table provides a mapping from the values of one alphabet to the values of another, very like the way complements are mapped above. A Seq-map-table has two Seq-code-types, one giving the alphabet to map from and the other the alphabet to map to. The Seq-map-table has the same number of rows and the same "start-at" value as the Seq-code-table for the alphabet it maps FROM. This makes the mapping a simple array lookup using the value of a residue of the FROM alphabet and subtracting "start-at". Remember that alphabets are not created equal and mapping from a bigger alphabet to a smaller may result in loss of information.
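
The array lookup can be sketched as follows, using a mapping from IUPACna to NCBI4na built from the two tables given earlier. MapTable, map_residue() and the MAP_INVALID marker (255) are assumptions made for this illustration. The toolkit's own SeqMapTable structure and the contents of seqcode.val may differ.

```c
#include <assert.h>
#include <stddef.h>

#define MAP_INVALID 255   /* illustrative marker: no equivalent residue */

/* Hypothetical, simplified map table (not the toolkit's SeqMapTable). */
typedef struct {
    int start_at;                /* first value of the FROM alphabet */
    int num;                     /* number of rows */
    const unsigned char *table;  /* table[i] maps FROM value start_at + i */
} MapTable;

/* IUPACna letters 'A' (65) through 'Y' (89) mapped to NCBI4na values. */
static const unsigned char iupacna_to_ncbi4na_rows[25] = {
    /* A */ 1,   /* B */ 14,  /* C */ 2,   /* D */ 13,  /* E */ 255,
    /* F */ 255, /* G */ 4,   /* H */ 11,  /* I */ 255, /* J */ 255,
    /* K */ 12,  /* L */ 255, /* M */ 3,   /* N */ 15,  /* O */ 255,
    /* P */ 255, /* Q */ 255, /* R */ 5,   /* S */ 6,   /* T */ 8,
    /* U */ 255, /* V */ 7,   /* W */ 9,   /* X */ 255, /* Y */ 10
};

const MapTable iupacna_to_ncbi4na = { 65, 25, iupacna_to_ncbi4na_rows };

/* Map one residue: subtract start_at, then index the table. */
unsigned char map_residue(const MapTable *m, int from_value)
{
    int idx;
    assert(m != NULL && m->table != NULL);
    idx = from_value - m->start_at;
    if (idx < 0 || idx >= m->num)
        return MAP_INVALID;
    return m->table[idx];
}
```

Mapping in the other direction (NCBI4na to NCBI2na, for instance) would need a policy for ambiguity codes, which is where the loss of information mentioned above shows up.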

Data and Tools for Sequence Alphabets

NCBI provides a collection of Seq-code-tables and Seq-map-tables together in a Seq-code-set as part of the software toolbox. The file is called seqcode.prt (text form) or seqcode.val (binary ASN.1 used by the software). The function SeqCodeSetLoad() will check your NCBI configuration file looking for the path to "DATA", then read seqcode.val into memory using SeqCodeSetAsnRead(). A local static pointer to the loaded SeqCodes is kept in the SeqCode module, and thus need not be kept by the caller. Additional functions use the static pointer to provide access to the codes. SeqCodeTableFind() will return the appropriate SeqCodeTablePtr given a valid sequence code, and SeqMapTableFind() will return the appropriate SeqMapTablePtr given a code to map from and a code to map to. The SeqPort functions use these functions to provide a view of a sequence in any requested alphabet by mapping residues on demand. See the chapter on Writing Sequence Software.

Pubdesc: Publication Describing a Bioseq

A Pubdesc is a data structure used to record how a particular publication described a Bioseq. It contains the citation itself as a Pub-equiv (see the Bibliographic References chapter) so that equivalent forms of the citation (e.g. a MEDLINE uid and a Cit-Art) can all be accommodated in a single data structure. Then a number of additional fields allow a more complete description of what was presented in the publication. These extra fields are generally only filled in for entries produced by the NCBI journal scanning component of GenBank, also known as the Backbone database. This information is not generally available in data from any other database yet.

Pubdesc.name is the name given the sequence in the publication, usually in the figure. Pubdesc.fig gives the figure the Bioseq appeared in so a scientist can locate it in the paper. Pubdesc.num preserves the numbering system used by the author (see Numbering below). Pubdesc.numexc, if TRUE, indicates that a "numbering exception" was found (i.e. the author's numbering did not agree with the number of residues in the sequence). This usually indicates an error in the preparation of the figure. If Pubdesc.poly-a is TRUE, then a poly-A tract was indicated for the Bioseq in the figure, but was not explicitly preserved in the sequence itself (e.g. ...AGAATTTCT (Poly-A) ). Pubdesc.maploc is the map location for this sequence as given by the author in this paper. Pubdesc.seq-raw allows the presentation of the sequence exactly as typed from the figure. This is never used now. Pubdesc.align-group, if present, indicates the Bioseq was presented in a group aligned with other Bioseqs. The align-group value is an arbitrary integer. Other Bioseqs from the same publication which are part of the same alignment will have the same align-group number.

Pubdesc.comment is simply a free text comment associated with this publication. SWISSPROT entries may also have this field filled.

Numbering: Applying a Numbering System to a Bioseq

Internally, locations on Bioseqs are ALWAYS integer offsets in the range 0 to (length - 1). However, it is often helpful to display some other numbering system. The Numbering data structure supports a variety of numbering styles and conventions. In the ASN.1 specification, it is simply a CHOICE of the four possible types. When a Numbering object is supplied as a Seq-descr, then it applies to the complete length of the Bioseq. A Numbering object can also be a feature, in which case it only applies to the interval defined by the feature's location.

Num-cont: A Continuous Integer Numbering System

The most widely used numbering system for sequences is some form of a continuous integer numbering. Num-cont.refnum is the number to assign to the first residue in the Bioseq. If Num-cont.has-zero is TRUE, the numbering system uses zero. When biologists start numbering with a negative number, it is quite common for them to skip zero, going directly from -1 to +1, so the DEFAULT for has-zero is FALSE. This only reflects common usage, not any recommendation in terms of convention. Any useful software tool should support both conventions, since they are both used in the literature. Finally, the most common numbering systems are ascending; however descending numbering systems are encountered from time to time, so Num-cont.ascending would then be set to FALSE.
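
One plausible way to turn an internal offset into a Num-cont display number is sketched below. The NumCont struct and num_cont_display() are hypothetical and only mirror the three ASN.1 fields. The zero-skipping rule is this sketch's reading of the text, not the toolkit's implementation (see sequtil.h for the real conversion functions).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the Num-cont fields. */
typedef struct {
    int refnum;      /* number assigned to the first residue */
    int has_zero;    /* nonzero if the numbering uses 0 */
    int ascending;   /* nonzero for ascending numbering */
} NumCont;

/* Convert a 0-based internal offset to a display number. */
int num_cont_display(const NumCont *nc, int offset)
{
    int n;
    assert(nc != NULL);
    assert(offset >= 0);
    n = nc->ascending ? nc->refnum + offset : nc->refnum - offset;
    if (!nc->has_zero) {
        /* Skip zero once the numbering has crossed it. */
        if (nc->ascending && nc->refnum < 0 && n >= 0)
            n += 1;
        else if (!nc->ascending && nc->refnum > 0 && n <= 0)
            n -= 1;
    }
    return n;
}
```

With refnum -3 and has-zero FALSE, the first five residues are displayed as -3, -2, -1, +1, +2. With has-zero TRUE the fourth residue is displayed as 0 instead.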

Num-real: A Real Number Numbering Scheme

Genetic maps may use real numbers as "map units" since they treat the chromosome as a continuous coordinate system, instead of a discrete, integer coordinate system of base pairs. Thus a Bioseq of type "map" which may use an underlying integer coordinate system from 0 to 5 million may be best presented to the user in the familiar 0.0 to 100.0 map units. Num-real supports a simple linear equation specifying the relationship:

map units = ( Num-real.a * base_pair_position) + Num-real.b

in this example. Since such numbering systems generally have their own units (e.g. "map units", "centisomes", "centimorgans", etc), Num-real.units provides a string for labeling the display.
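
As a small worked example, the sketch below applies the linear relationship given in the seq.asn comment on Num-real (position = (a * int_position) + b) to the 5 million base pair, 100 map unit case. The NumReal struct and num_real_display() are hypothetical names, and the values of a and b are simply the ones implied by those two round numbers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the Num-real fields. */
typedef struct {
    double a;            /* scale factor */
    double b;            /* offset */
    const char *units;   /* label for display, may be NULL */
} NumReal;

/* display position = (a * int_position) + b */
double num_real_display(const NumReal *nr, long int_position)
{
    assert(nr != NULL);
    return nr->a * (double)int_position + nr->b;
}
```

For a 5,000,000 base pair map shown as 0.0 to 100.0 map units, a would be 100.0 / 5000000.0 and b would be 0.0, so internal position 2,500,000 is displayed as approximately 50.0.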

Num-enum: An Enumerated Numbering Scheme

Occasionally biologists do not use a continuous numbering system at all. Crystallographers and immunologists, for example, who do extensive studies on one or a few sequences, may name the individual residues in the sequence as they fit them into a theoretical framework. So one might see residues numbered ... "10" "11" "12" "12A" "12B" "12C" "13" "14" ... To accommodate this sort of scheme the "name" of each residue must be explicitly given by a string, since there is no anticipating any convention that may be used. The Num-enum.num gives the number of residue names (which should agree with the number of residues in the Bioseq, in the case of use as a Seq-descr), followed by the names as strings.

Num-ref: Numbering by Reference to Another Bioseq

Two types of references are allowed. The "sources" references are meant to apply the numbering system of constituent Bioseqs to a segmented Bioseq. This is useful for seeing the mapping from the parts to the whole.

The "aligns" reference requires that the Num-ref-aligns alignment be filled in with an alignment of the target Bioseq with one or more pieces of other Bioseqs. The numbering will come from the aligned pieces.

Numbering: C Structures and Utility Functions

A Numbering object is implemented in C simply as a ValNode, where ValNode.choice is given by a series of #defines in objpubd.h and ValNode.ptrvalue is a pointer to the appropriate data structure for the Numbering type.

In sequtil.h (see the Sequence Utilities chapter) a number of functions are defined which convert from internal to display numbering systems and vice versa. These functions make the use of fairly complex numbering systems fairly straightforward.

ASN.1 Specification: seq.asn

--$Revision: 2.1 $

--**********************************************************************

-- NCBI Sequence elements

-- by James Ostell, 1990

--**********************************************************************

NCBI-Sequence DEFINITIONS ::=

BEGIN

EXPORTS Bioseq, Seq-annot, Pubdesc, Seq-descr, Numbering, Heterogen;

IMPORTS Date, Int-fuzz, Dbtag, Object-id, User-object FROM NCBI-General

Seq-align FROM NCBI-Seqalign

Seq-feat FROM NCBI-Seqfeat

Seq-graph FROM NCBI-Seqres

Pub-equiv FROM NCBI-Pub

Org-ref FROM NCBI-Organism

Seq-id, Seq-loc FROM NCBI-Seqloc

Link-set FROM NCBI-Access

GB-block FROM GenBank-General

PIR-block FROM PIR-General

EMBL-block FROM EMBL-General

SP-block FROM SP-General

PRF-block FROM PRF-General

PDB-block FROM PDB-General;

--*** Sequence ********************************

--*

Bioseq ::= SEQUENCE {

id SET OF Seq-id , -- equivalent identifiers

descr Seq-descr OPTIONAL , -- descriptors

inst Seq-inst , -- the sequence data

annot SET OF Seq-annot OPTIONAL }

--*** Descriptors *****************************

--*

Seq-descr ::= SET OF CHOICE {

mol-type GIBB-mol , -- type of molecule

modif SET OF GIBB-mod , -- modifiers

method GIBB-method , -- sequencing method

name VisibleString , -- a name for this sequence

title VisibleString , -- a title for this sequence

org Org-ref , -- if all from one organism

comment VisibleString , -- a more extensive comment

num Numbering , -- a numbering system

maploc Dbtag , -- map location of this sequence

pir PIR-block , -- PIR specific info

genbank GB-block , -- GenBank specific info

pub Pubdesc , -- a reference to the publication

region VisibleString , -- overall region (globin locus)

user User-object , -- user defined object

sp SP-block , -- SWISSPROT specific info

neighbors Link-set , -- neighboring information

embl EMBL-block , -- EMBL specific information

create-date Date , -- date entry first created/released

update-date Date , -- date of last update

prf PRF-block , -- PRF specific information

pdb PDB-block , -- PDB specific information

het Heterogen } -- cofactor, etc associated but not bound

GIBB-mol ::= ENUMERATED { -- type of molecule represented

unknown (0) ,

genomic (1) ,

pre-mRNA (2) ,

mRNA (3) ,

rRNA (4) ,

tRNA (5) ,

snRNA (6) ,

scRNA (7) ,

peptide (8) ,

other-genetic (9) , -- other genetic material

genomic-mRNA (10) , -- reported a mix of genomic and cdna sequence

other (255) }

GIBB-mod ::= ENUMERATED { -- GenInfo Backbone modifiers

dna (0) ,

rna (1) ,

extrachrom (2) ,

plasmid (3) ,

mitochondrial (4) ,

chloroplast (5) ,

kinetoplast (6) ,

cyanelle (7) ,

synthetic (8) ,

recombinant (9) ,

partial (10) ,

complete (11) ,

mutagen (12) , -- subject of mutagenesis ?

natmut (13) , -- natural mutant ?

transposon (14) ,

insertion-seq (15) ,

no-left (16) , -- missing left end (5' for na, NH2 for aa)

no-right (17) , -- missing right end (3' or COOH)

macronuclear (18) ,

proviral (19) ,

est (20) , -- expressed sequence tag

other (255) }

GIBB-method ::= ENUMERATED { -- sequencing methods

concept-trans (1) , -- conceptual translation

seq-pept (2) , -- peptide was sequenced

both (3) , -- concept transl. w/ partial pept. seq.

seq-pept-overlap (4) , -- sequenced peptide, ordered by overlap

seq-pept-homol (5) , -- sequenced peptide, ordered by homology

concept-trans-a (6) , -- conceptual transl. supplied by author

other (255) }

Numbering ::= CHOICE { -- any display numbering system

cont Num-cont , -- continuous numbering

enum Num-enum , -- enumerated names for residues

ref Num-ref , -- by reference to another sequence

real Num-real } -- supports mapping to a float system

Num-cont ::= SEQUENCE { -- continuous display numbering system

refnum INTEGER DEFAULT 1, -- number assigned to first residue

has-zero BOOLEAN DEFAULT FALSE , -- 0 used?

ascending BOOLEAN DEFAULT TRUE } -- ascending numbers?

Num-enum ::= SEQUENCE { -- any tags to residues

num INTEGER , -- number of tags to follow

names SEQUENCE OF VisibleString } -- the tags

Num-ref ::= SEQUENCE { -- by reference to other sequences

type ENUMERATED { -- type of reference

not-set (0) ,

sources (1) , -- by segmented or const seq sources

aligns (2) } , -- by alignments given below

aligns Seq-align OPTIONAL }

Num-real ::= SEQUENCE { -- mapping to floating point system

a REAL , -- from an integer system used by Bioseq

b REAL , -- position = (a * int_position) + b

units VisibleString OPTIONAL }

Pubdesc ::= SEQUENCE { -- how sequence presented in pub

pub Pub-equiv , -- the citation(s)

name VisibleString OPTIONAL , -- name used in paper

fig VisibleString OPTIONAL , -- figure in paper

num Numbering OPTIONAL , -- numbering from paper

numexc BOOLEAN OPTIONAL , -- numbering problem with paper

poly-a BOOLEAN OPTIONAL , -- poly A tail indicated in figure?

maploc VisibleString OPTIONAL , -- map location reported in paper

seq-raw StringStore OPTIONAL , -- original sequence from paper

align-group INTEGER OPTIONAL , -- this seq aligned with others in paper

comment VisibleString OPTIONAL }-- any comment on this pub in context

Heterogen ::= VisibleString -- cofactor, prosthetic group, inhibitor, etc

--*** Instances of sequences *******************************

--*

Seq-inst ::= SEQUENCE { -- the sequence data itself

repr ENUMERATED { -- representation class

not-set (0) , -- empty

virtual (1) , -- no seq data

raw (2) , -- continuous sequence

seg (3) , -- segmented sequence

const (4) , -- constructed sequence

ref (5) , -- reference to another sequence

consen (6) , -- consensus sequence or pattern

map (7) , -- ordered map (genetic, restriction)

other (255) } ,

mol ENUMERATED { -- molecule class in living organism

not-set (0) , -- > cdna = rna

dna (1) ,

rna (2) ,

aa (3) ,

na (4) , -- just a nucleic acid

other (255) } ,

length INTEGER OPTIONAL , -- length of sequence in residues

fuzz Int-fuzz OPTIONAL , -- length uncertainty

topology ENUMERATED { -- topology of molecule

not-set (0) ,

linear (1) ,

circular (2) ,

tandem (3) , -- some part of tandem repeat

other (255) } DEFAULT linear ,

strand ENUMERATED { -- strandedness in living organism

not-set (0) ,

ss (1) , -- single strand

ds (2) , -- double strand

mixed (3) ,

other (255) } OPTIONAL , -- default ds for DNA, ss for RNA, pept

seq-data Seq-data OPTIONAL , -- the sequence

ext Seq-ext OPTIONAL , -- extensions for special types

hist Seq-hist OPTIONAL } -- sequence history

--*** Sequence Extensions **********************************

--* for representing more complex types

--* const type uses Seq-hist.assembly

Seq-ext ::= CHOICE {

seg Seg-ext , -- segmented sequences

ref Ref-ext , -- hot link to another sequence (a view)

map Map-ext } -- ordered map of markers

Seg-ext ::= SEQUENCE OF Seq-loc

Ref-ext ::= Seq-loc

Map-ext ::= SEQUENCE OF Seq-feat

--*** Sequence History Record ***********************************

--** assembly = records how seq was assembled from others

--** replaces = records sequences made obsolete by this one

--** replaced-by = this seq is made obsolete by another(s)

Seq-hist ::= SEQUENCE {

assembly SET OF Seq-align OPTIONAL ,-- how was this assembled?

replaces Seq-hist-rec OPTIONAL , -- seq makes these seqs obsolete

replaced-by Seq-hist-rec OPTIONAL , -- these seqs make this one obsolete

deleted CHOICE {

bool BOOLEAN ,

date Date } OPTIONAL }

Seq-hist-rec ::= SEQUENCE {

date Date OPTIONAL ,

ids SET OF Seq-id }

--*** Various internal sequence representations ************

--* all are controlled, fixed length forms

Seq-data ::= CHOICE { -- sequence representations

iupacna IUPACna , -- IUPAC 1 letter nuc acid code

iupacaa IUPACaa , -- IUPAC 1 letter amino acid code

ncbi2na NCBI2na , -- 2 bit nucleic acid code

ncbi4na NCBI4na , -- 4 bit nucleic acid code

ncbi8na NCBI8na , -- 8 bit extended nucleic acid code

ncbipna NCBIpna , -- nucleic acid probabilities

ncbi8aa NCBI8aa , -- 8 bit extended amino acid codes

ncbieaa NCBIeaa , -- extended ASCII 1 letter aa codes

ncbipaa NCBIpaa , -- amino acid probabilities

ncbistdaa NCBIstdaa } -- consecutive codes for std aas

IUPACna ::= StringStore -- IUPAC 1 letter codes, no spaces

IUPACaa ::= StringStore -- IUPAC 1 letter codes, no spaces

NCBI2na ::= OCTET STRING -- 00=A, 01=C, 10=G, 11=T

NCBI4na ::= OCTET STRING -- 1 bit each for agct

-- 0001=A, 0010=C, 0100=G, 1000=T/U

-- 0101=Purine, 1010=Pyrimidine, etc

NCBI8na ::= OCTET STRING -- for modified nucleic acids

NCBIpna ::= OCTET STRING -- 5 octets/base, prob for a,c,g,t,n

-- probabilities are coded 0-255 = 0.0-1.0

NCBI8aa ::= OCTET STRING -- for modified amino acids

NCBIeaa ::= StringStore -- ASCII extended 1 letter aa codes

-- IUPAC codes + U=selenocysteine

NCBIpaa ::= OCTET STRING -- 25 octets/aa, prob for IUPAC aas in order:

-- A-Y,B,Z,X,(ter),anything

-- probabilities are coded 0-255 = 0.0-1.0

NCBIstdaa ::= OCTET STRING -- codes 0-25, 1 per byte

--*** Sequence Annotation *************************************

--*

Seq-annot ::= SEQUENCE {

id Object-id OPTIONAL ,

db Dbtag OPTIONAL ,

name VisibleString OPTIONAL ,

desc VisibleString OPTIONAL ,

data CHOICE {

ftable SET OF Seq-feat ,

align SET OF Seq-align ,

graph SET OF Seq-graph } }

END
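
The Numbering types in the specification above map the zero-based integer coordinates of a Bioseq to display numbers. The C fragment below is a minimal sketch of one reading of Num-cont and Num-real, not toolkit code. The names NumContSketch, NumRealSketch, num_cont_display, and num_real_display are hypothetical. It assumes that when has-zero is FALSE the display numbers skip from -1 directly to 1, and it applies the formula given in the Num-real comment.

```c
/* Hypothetical stand-in for Num-cont; not the toolkit's own structure. */
typedef struct {
    long refnum;     /* number assigned to first residue (DEFAULT 1) */
    int  has_zero;   /* is 0 used as a display number? (DEFAULT FALSE) */
    int  ascending;  /* do numbers ascend along the Bioseq? (DEFAULT TRUE) */
} NumContSketch;

/* Hypothetical stand-in for Num-real (units string omitted). */
typedef struct {
    double a;        /* scale applied to the integer position */
    double b;        /* offset added after scaling */
} NumRealSketch;

/* Map a zero-based Bioseq offset to a Num-cont display number. */
long num_cont_display(const NumContSketch *nc, long offset)
{
    long n = nc->ascending ? nc->refnum + offset : nc->refnum - offset;

    if (!nc->has_zero) {
        /* Assumption: a system without zero jumps across it. */
        if (nc->ascending && nc->refnum < 0 && n >= 0)
            n += 1;
        if (!nc->ascending && nc->refnum > 0 && n <= 0)
            n -= 1;
    }
    return n;
}

/* Num-real: position = (a * int_position) + b */
double num_real_display(const NumRealSketch *nr, long int_position)
{
    return nr->a * (double)int_position + nr->b;
}
```

With refnum = -3, has-zero FALSE, and ascending TRUE, offsets 0 through 4 are displayed as -3, -2, -1, 1, 2 under this reading.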

ASN.1 Specification: seqblock.asn

--$Revision: 2.0 $

--*********************************************************************

-- EMBL specific data

-- This block of specifications was developed by Reiner Fuchs of EMBL

--*********************************************************************

EMBL-General DEFINITIONS ::=

BEGIN

EXPORTS EMBL-dbname, EMBL-xref, EMBL-block;

IMPORTS Date, Object-id FROM NCBI-General;

EMBL-dbname ::= CHOICE {

code ENUMERATED {

embl(0),

genbank(1),

ddbj(2),

geninfo(3),

medline(4),

swissprot(5),

pir(6),

pdb(7),

epd(8),

ecd(9),

tfd(10),

flybase(11),

prosite(12),

enzyme(13),

mim(14),

ecoseq(15),

hiv(16) },

name VisibleString }

EMBL-xref ::= SEQUENCE {

dbname EMBL-dbname,

id SEQUENCE OF Object-id }

EMBL-block ::= SEQUENCE {

class ENUMERATED {

not-set(0),

standard(1),

unannotated(2),

other(255) } DEFAULT standard,

div ENUMERATED {

fun(0),

inv(1),

mam(2),

org(3),

phg(4),

pln(5),

pri(6),

pro(7),

rod(8),

syn(9),

una(10),

vrl(11),

vrt(12) } OPTIONAL,

creation-date Date,

update-date Date,

extra-acc SEQUENCE OF VisibleString OPTIONAL,

keywords SEQUENCE OF VisibleString OPTIONAL,

xref SEQUENCE OF EMBL-xref OPTIONAL }

END

--*********************************************************************

-- SWISSPROT specific data

-- This block of specifications was developed by Mark Cavanaugh of

-- NCBI working with Amos Bairoch of SWISSPROT

--*********************************************************************

SP-General DEFINITIONS ::=

BEGIN

EXPORTS SP-block;

IMPORTS Date, Dbtag FROM NCBI-General

Seq-id FROM NCBI-Seqloc;

SP-block ::= SEQUENCE { -- SWISSPROT specific descriptions

class ENUMERATED {

not-set (0) ,

standard (1) , -- conforms to all SWISSPROT checks

prelim (2) , -- only seq and biblio checked

other (255) } ,

extra-acc SET OF VisibleString OPTIONAL , -- old SWISSPROT ids

imeth BOOLEAN DEFAULT FALSE , -- seq known to start with Met

plasnm SET OF VisibleString OPTIONAL, -- plasmid names carrying gene

seqref SET OF Seq-id OPTIONAL, -- xref to other sequences

dbref SET OF Dbtag OPTIONAL , -- xref to non-sequence dbases

keywords SET OF VisibleString OPTIONAL , -- keywords

created Date OPTIONAL , -- creation date

sequpd Date OPTIONAL , -- sequence update

annotupd Date OPTIONAL } -- annotation update

END

--*********************************************************************

-- PIR specific data

-- This block of specifications was developed by Jim Ostell of

-- NCBI

--*********************************************************************

PIR-General DEFINITIONS ::=

BEGIN

EXPORTS PIR-block;

IMPORTS Seq-id FROM NCBI-Seqloc;

PIR-block ::= SEQUENCE { -- PIR specific descriptions

had-punct BOOLEAN OPTIONAL , -- had punctuation in sequence ?

host VisibleString OPTIONAL ,

source VisibleString OPTIONAL , -- source line

summary VisibleString OPTIONAL ,

genetic VisibleString OPTIONAL ,

includes VisibleString OPTIONAL ,

placement VisibleString OPTIONAL ,

superfamily VisibleString OPTIONAL ,

keywords SEQUENCE OF VisibleString OPTIONAL ,

cross-reference VisibleString OPTIONAL ,

date VisibleString OPTIONAL ,

seq-raw VisibleString OPTIONAL , -- seq with punctuation

seqref SET OF Seq-id OPTIONAL } -- xref to other sequences

END

--*********************************************************************

-- GenBank specific data

-- This block of specifications was developed by Jim Ostell of

-- NCBI

--*********************************************************************

GenBank-General DEFINITIONS ::=

BEGIN

EXPORTS GB-block;

IMPORTS Date FROM NCBI-General;

GB-block ::= SEQUENCE { -- GenBank specific descriptions

extra-accessions SEQUENCE OF VisibleString OPTIONAL ,

source VisibleString OPTIONAL , -- source line

keywords SEQUENCE OF VisibleString OPTIONAL ,

origin VisibleString OPTIONAL,

date VisibleString OPTIONAL , -- old form Entry Date

entry-date Date OPTIONAL , -- replaces date

div VisibleString OPTIONAL , -- GenBank division

taxonomy VisibleString OPTIONAL } -- continuation line of organism

END

--**********************************************************************

-- PRF specific definition

-- PRF is a protein sequence database created and maintained by

-- Protein Research Foundation, Minoo-city, Osaka, Japan.

-- Written by A.Ogiwara, Inst.Chem.Res. (Dr.Kanehisa's Lab),

-- Kyoto Univ., Japan

--**********************************************************************

PRF-General DEFINITIONS ::=

BEGIN

EXPORTS PRF-block;

PRF-block ::= SEQUENCE {

extra-src PRF-ExtraSrc OPTIONAL,

keywords SEQUENCE OF VisibleString OPTIONAL

}

PRF-ExtraSrc ::= SEQUENCE {

host VisibleString OPTIONAL,

part VisibleString OPTIONAL,

state VisibleString OPTIONAL,

strain VisibleString OPTIONAL,

taxon VisibleString OPTIONAL

}

END

--*********************************************************************

-- PDB specific data

-- This block of specifications was developed by Jim Ostell and

-- Steve Bryant of NCBI

--*********************************************************************

PDB-General DEFINITIONS ::=

BEGIN

EXPORTS PDB-block;

IMPORTS Date FROM NCBI-General;

PDB-block ::= SEQUENCE { -- PDB specific descriptions

deposition Date , -- deposition date month,year

class VisibleString ,

compound SEQUENCE OF VisibleString ,

source SEQUENCE OF VisibleString ,

exp-method VisibleString OPTIONAL , -- present if NOT X-ray diffraction

replace PDB-replace OPTIONAL } -- replacement history

PDB-replace ::= SEQUENCE {

date Date ,

ids SEQUENCE OF VisibleString } -- entry ids replaced by this one

END

ASN.1 Specification: seqcode.asn

--$Revision: 2.0 $

-- *********************************************************************

-- These are code and conversion tables for NCBI sequence codes

-- ASN.1 for the sequences themselves is defined in seq.asn

-- Seq-map-table and Seq-code-table REQUIRE that codes start with 0

-- and increase continuously. So IUPAC codes, which are upper case

-- letters, will always have 65 zero cells (indices 0-64) before the

-- codes begin. This allows every code to be used as a direct table index

-- Valid names for code tables are:

-- IUPACna

-- IUPACaa

-- IUPACeaa

-- IUPACaa3 3 letter amino acid codes : parallels IUPACeaa

-- display only, not a data exchange type

-- NCBI2na

-- NCBI4na

-- NCBI8na

-- NCBI8aa

-- NCBIstdaa

-- probability types map to IUPAC types for display as characters

NCBI-SeqCode DEFINITIONS ::=

BEGIN

EXPORTS Seq-code-table, Seq-map-table, Seq-code-set;

Seq-code-type ::= ENUMERATED { -- sequence representations

iupacna (1) , -- IUPAC 1 letter nuc acid code

iupacaa (2) , -- IUPAC 1 letter amino acid code

ncbi2na (3) , -- 2 bit nucleic acid code

ncbi4na (4) , -- 4 bit nucleic acid code

ncbi8na (5) , -- 8 bit extended nucleic acid code

ncbipna (6) , -- nucleic acid probabilities

ncbi8aa (7) , -- 8 bit extended amino acid codes

ncbieaa (8) , -- extended ASCII 1 letter aa codes

ncbipaa (9) , -- amino acid probabilities

iupacaa3 (10) , -- 3 letter code only for display

ncbistdaa (11) } -- consecutive codes for std aas, 0-25

Seq-map-table ::= SEQUENCE { -- for tables of sequence mappings

from Seq-code-type , -- code to map from

to Seq-code-type , -- code to map to

num INTEGER , -- number of rows in table

start-at INTEGER DEFAULT 0 , -- index offset of first element

table SEQUENCE OF INTEGER } -- table of values, in from-to order

Seq-code-table ::= SEQUENCE { -- for names of coded values

code Seq-code-type , -- name of code

num INTEGER , -- number of rows in table

one-letter BOOLEAN , -- symbol is ALWAYS 1 letter?

start-at INTEGER DEFAULT 0 , -- index offset of first element

table SEQUENCE OF

SEQUENCE {

symbol VisibleString , -- the printed symbol or letter

name VisibleString } , -- an explanatory name or string

comps SEQUENCE OF INTEGER OPTIONAL } -- pointers to complement nuc acid

Seq-code-set ::= SEQUENCE { -- for distribution

codes SET OF Seq-code-table OPTIONAL ,

maps SET OF Seq-map-table OPTIONAL }

END
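
The indexed lookup that Seq-map-table is designed for can be sketched in C as follows. This is an illustration only, not the toolkit's implementation. SeqMapTableSketch, seq_map_lookup, ncbi2na_residue, ncbi2na_to_iupacna, the NO_MAPPING sentinel, and the two example tables are all hypothetical. The sketch assumes that start-at is the code value of the first table row, and that the first residue of packed NCBI2na data occupies the high-order bits of each octet.

```c
#include <stddef.h>

#define NO_MAPPING (-1)   /* assumption: sentinel for codes with no mapping */

/* Hypothetical in-memory form of a Seq-map-table. */
typedef struct {
    int        num;       /* number of rows in table */
    int        start_at;  /* index offset of first element */
    const int *table;     /* mapped values, one per row */
} SeqMapTableSketch;

/* Indexed lookup: subtract start_at, then read the row directly. */
int seq_map_lookup(const SeqMapTableSketch *map, int code)
{
    int idx = code - map->start_at;

    if (idx < 0 || idx >= map->num)
        return NO_MAPPING;
    return map->table[idx];
}

/* Extract the i-th 2-bit code (00=A, 01=C, 10=G, 11=T) from packed data. */
int ncbi2na_residue(const unsigned char *packed, size_t i)
{
    unsigned shift = (unsigned)(6 - 2 * (i % 4));

    return (packed[i / 4] >> shift) & 0x3;
}

/* Decode n residues to IUPACna letters; out must hold n + 1 bytes. */
void ncbi2na_to_iupacna(const unsigned char *packed, size_t n,
                        const SeqMapTableSketch *map, char *out)
{
    size_t i;

    for (i = 0; i < n; i++) {
        int v = seq_map_lookup(map, ncbi2na_residue(packed, i));
        out[i] = (v == NO_MAPPING) ? '?' : (char)v;
    }
    out[n] = '\0';
}

/* Example table: NCBI2na codes 0-3 map to IUPACna letters. */
static const int ncbi2na_iupacna_rows[4] = { 'A', 'C', 'G', 'T' };
const SeqMapTableSketch ncbi2na_to_iupacna_map =
    { 4, 0, ncbi2na_iupacna_rows };

/* Example table: IUPACna letters 'A'..'T' map to NCBI2na codes.
 * start_at = 65 ('A'), so the 65 leading empty cells are not stored. */
static const int iupacna_ncbi2na_rows[20] = {
    0, -1, 1, -1, -1, -1, 2, -1, -1, -1,      /* A B C D E F G H I J */
    -1, -1, -1, -1, -1, -1, -1, -1, -1, 3     /* K L M N O P Q R S T */
};
const SeqMapTableSketch iupacna_to_ncbi2na_map =
    { 20, 'A', iupacna_ncbi2na_rows };
```

Because every code is a small non-negative integer, conversion between alphabets costs one subtraction and one array read per residue, whichever pair of codes is involved.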

C Structures and Functions: objseq.h

/* objseq.h

* ===========================================================================

* PUBLIC DOMAIN NOTICE

* National Center for Biotechnology Information

* This software/database is a "United States Government Work" under the

* terms of the United States Copyright Act. It was written as part of

* the author's official duties as a United States Government employee and

* thus cannot be copyrighted. This software/database is freely available

* to the public for use. The National Library of Medicine and the U.S.

* Government have not placed any restriction on its use or reproduction.

* Although all reasonable efforts have been taken to ensure the accuracy

* and reliability of the software and data, the NLM and the U.S.

* Government do not and cannot warrant the performance or results that

* may be obtained by using this software or data. The NLM and the U.S.

* Government disclaim all warranties, express or implied, including

* warranties of performance, merchantability or fitness for any particular

* purpose.

* Please cite the author in any work or product based on this material.

* ===========================================================================

* File Name: objseq.h

* Author: James Ostell

* Version Creation Date: 4/1/91

* $Revision: 2.0 $

* File Description: Object manager interface for module NCBI-Seq

* Modifications:

* --------------------------------------------------------------------------

* Date Name Description of modification

* ------- ---------- -----------------------------------------------------

* ==========================================================================

*/

#ifndef _NCBI_Seq_

#define _NCBI_Seq_

#ifndef _ASNTOOL_

#include <asn.h>

#endif

#ifndef _NCBI_General_

#include <objgen.h>

#endif

#ifndef _NCBI_Seqloc_

#include <objloc.h>

#endif

#ifndef _NCBI_Pub_

#include <objpub.h>

#endif

#ifndef _NCBI_Seqalign_

#include <objalign.h>

#endif

#ifndef _NCBI_Pubdesc_

#include <objpubd.h> /* separated out to avoid typedef order problems */

#endif

#ifndef _NCBI_Seqfeat_

#include <objfeat.h> /* include organism for now */

#endif

#ifndef _NCBI_Seqres_

#include <objres.h>

#endif

#ifndef _NCBI_Access_

#include <objacces.h>

#endif

#ifndef _NCBI_SeqBlock_

#include <objblock.h>

#endif

#ifndef _NCBI_SeqCode_

#include <objcode.h>

#endif

#ifdef __cplusplus

extern "C" {

#endif

/*****************************************************************************

* loader

*****************************************************************************/

extern Boolean SeqAsnLoad PROTO((void));

/*****************************************************************************

* internal structures for NCBI-Seq objects

*****************************************************************************/

/*****************************************************************************

* SeqAnnot - Sequence annotations

*****************************************************************************/

typedef struct seqannot {

ObjectIdPtr id;

DbtagPtr db;

CharPtr name,

desc;

Uint1 type; /* 1=ftable, 2=align, 3=graph */

Pointer data;

struct seqannot PNTR next;

} SeqAnnot, PNTR SeqAnnotPtr;

SeqAnnotPtr SeqAnnotNew PROTO((void));

Boolean SeqAnnotAsnWrite PROTO((SeqAnnotPtr sap, AsnIoPtr aip, AsnTypePtr atp));

SeqAnnotPtr SeqAnnotAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

SeqAnnotPtr SeqAnnotFree PROTO((SeqAnnotPtr sap));

/*****************************************************************************

* Sets of SeqAnnots

*****************************************************************************/

Boolean SeqAnnotSetAsnWrite PROTO((SeqAnnotPtr sap, AsnIoPtr aip, AsnTypePtr set, AsnTypePtr element));

SeqAnnotPtr SeqAnnotSetAsnRead PROTO((AsnIoPtr aip, AsnTypePtr set, AsnTypePtr element));

/*****************************************************************************

* SeqHist

*****************************************************************************/

typedef struct seqhist {

SeqAlignPtr assembly;

DatePtr replace_date;

SeqIdPtr replace_ids;

DatePtr replaced_by_date;

SeqIdPtr replaced_by_ids;

Boolean deleted;

DatePtr deleted_date;

} SeqHist, PNTR SeqHistPtr;

SeqHistPtr SeqHistNew PROTO((void));

Boolean SeqHistAsnWrite PROTO((SeqHistPtr shp, AsnIoPtr aip, AsnTypePtr atp));

SeqHistPtr SeqHistAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

SeqHistPtr SeqHistFree PROTO((SeqHistPtr shp));

/*****************************************************************************

* Bioseq.

* Inst is incorporated within Bioseq for efficiency

* seq_data_type

* 0 = not set

* 1 = IUPACna

* 2 = IUPACaa

* 3 = NCBI2na

* 4 = NCBI4na

* 5 = NCBI8na

* 6 = NCBIpna

* 7 = NCBI8aa

* 8 = NCBIeaa

* 9 = NCBIpaa

* 11 = NCBIstdaa

* seq_ext_type

* 0 = none

* 1 = seg-ext

* 2 = ref-ext

* 3 = map-ext

*****************************************************************************/

#define Seq_code_iupacna 1

#define Seq_code_iupacaa 2

#define Seq_code_ncbi2na 3

#define Seq_code_ncbi4na 4

#define Seq_code_ncbi8na 5

#define Seq_code_ncbipna 6

#define Seq_code_ncbi8aa 7

#define Seq_code_ncbieaa 8

#define Seq_code_ncbipaa 9

#define Seq_code_iupacaa3 10

#define Seq_code_ncbistdaa 11

#define Seq_repr_virtual 1

#define Seq_repr_raw 2

#define Seq_repr_seg 3

#define Seq_repr_const 4

#define Seq_repr_ref 5

#define Seq_repr_consen 6

#define Seq_repr_map 7

#define Seq_repr_other 255

#define Seq_mol_dna 1

#define Seq_mol_rna 2

#define Seq_mol_aa 3

#define Seq_mol_na 4

#define Seq_mol_other 255

#define ISA_na(x) ((x==1)||(x==2)||(x==4))

#define ISA_aa(x) (x == 3)

typedef struct bioseq {

SeqIdPtr id; /* Seq-ids */

ValNodePtr descr; /* Seq-descr */

Uint1 repr,

mol;

Int4 length; /* -1 if not set */

IntFuzzPtr fuzz;

Uint1 topology,

strand,

seq_data_type, /* as in Seq_code_type above */

seq_ext_type;

ByteStorePtr seq_data;

Pointer seq_ext;

SeqAnnotPtr annot;

SeqHistPtr hist;

} Bioseq, PNTR BioseqPtr;

BioseqPtr BioseqNew PROTO((void));

Boolean BioseqAsnWrite PROTO((BioseqPtr bsp, AsnIoPtr aip, AsnTypePtr atp));

BioseqPtr BioseqAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

BioseqPtr BioseqFree PROTO((BioseqPtr bsp));

Boolean BioseqInstAsnWrite PROTO((BioseqPtr bsp, AsnIoPtr aip, AsnTypePtr orig));

Boolean BioseqInstAsnRead PROTO((BioseqPtr bsp, AsnIoPtr aip, AsnTypePtr orig));

BioseqPtr PNTR BioseqInMem PROTO((Int2Ptr numptr));

/*****************************************************************************

* Initialize bioseq and seqcode tables and default numbering

*****************************************************************************/

Boolean BioseqLoad PROTO((void));

/*****************************************************************************

* BioseqAsnRead Options

*****************************************************************************/

typedef struct op_objseq {

SeqIdPtr sip; /* seq id to find */

Boolean found_it; /* set to TRUE when BioseqAsnRead matches sip */

Boolean load_by_id; /* if TRUE, load only if sip matches */

} Op_objseq, PNTR Op_objseqPtr;

/* types for AsnIoOption OP_NCBIOBJSEQ */

#define BIOSEQ_CHECK_ID 1 /* match Op_objseq.sip */

/*****************************************************************************

* SeqDescr uses a ValNode with choice =

1 = * mol-type GIBB-mol , -- type of molecule

2 = ** modif SET OF GIBB-mod , -- modifiers

3 = * method GIBB-method , -- sequencing method

4 = name VisibleString , -- a name for this sequence

5 = title VisibleString , -- a title for this sequence

6 = org Org-ref , -- if all from one organism

7 = comment VisibleString , -- a more extensive comment

8 = num Numbering , -- a numbering system

9 = maploc Dbtag , -- map location of this sequence

10 = pir PIR-block , -- PIR specific info

11 = genbank GB-block , -- GenBank specific info

12 = pub Pubdesc -- a reference to the publication

13 = region VisibleString -- name for this region of sequence

14 = user User-object -- user structured data object

15 = sp SP-block -- SWISSPROT specific info

16 = neighbors Link-set -- links to sequence neighbors

17 = embl EMBL-block -- EMBL specific info

18 = create-date Date -- date entry created

19 = update-date Date -- date of last update

20 = prf PRF-block -- PRF specific information

21 = pdb PDB-block -- PDB specific information

22 = het Heterogen -- cofactor, etc associated but not bound

types with * use data.intvalue. Others use data.ptrvalue

** uses a chain of ValNodes which use data.intvalue for enumerated type

*****************************************************************************/

#define Seq_descr_mol_type 1

#define Seq_descr_modif 2

#define Seq_descr_method 3

#define Seq_descr_name 4

#define Seq_descr_title 5

#define Seq_descr_org 6

#define Seq_descr_comment 7

#define Seq_descr_num 8

#define Seq_descr_maploc 9

#define Seq_descr_pir 10

#define Seq_descr_genbank 11

#define Seq_descr_pub 12

#define Seq_descr_region 13

#define Seq_descr_user 14

#define Seq_descr_sp 15

#define Seq_descr_neighbors 16

#define Seq_descr_embl 17

#define Seq_descr_create_date 18

#define Seq_descr_update_date 19

#define Seq_descr_prf 20

#define Seq_descr_pdb 21

#define Seq_descr_het 22

Boolean SeqDescrAsnWrite PROTO((ValNodePtr anp, AsnIoPtr aip, AsnTypePtr atp));

ValNodePtr SeqDescrAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

ValNodePtr SeqDescrFree PROTO((ValNodePtr anp));

/*****************************************************************************

* Pubdesc and Numbering types defined in objpubd.h

*****************************************************************************/

#ifdef __cplusplus

}

#endif
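
The header above represents Seq-descr as a chain of nodes, each tagged with a choice value that selects either data.intvalue or data.ptrvalue. The fragment below sketches how such a chain might be searched. It is self-contained, so it does not include the toolkit headers. DescrNode is a hypothetical stand-in for the toolkit's ValNode, and descr_find, descr_title, and descr_mol_type are hypothetical helper names. The two Seq_descr constants repeat the values listed in objseq.h.

```c
#include <stddef.h>

#define Seq_descr_mol_type 1   /* uses data.intvalue */
#define Seq_descr_title    5   /* uses data.ptrvalue (a string) */

/* Hypothetical stand-in for a ValNode; only the fields used here. */
typedef struct descr_node {
    unsigned char choice;        /* which Seq-descr CHOICE this node holds */
    union {
        long  intvalue;          /* for enumerated descriptors */
        void *ptrvalue;          /* for strings and structured descriptors */
    } data;
    struct descr_node *next;     /* next descriptor in the chain */
} DescrNode;

/* Return the first descriptor with the given choice, or NULL. */
const DescrNode *descr_find(const DescrNode *head, unsigned char choice)
{
    for (; head != NULL; head = head->next) {
        if (head->choice == choice)
            return head;
    }
    return NULL;
}

/* Title descriptors store a string pointer in data.ptrvalue. */
const char *descr_title(const DescrNode *head)
{
    const DescrNode *n = descr_find(head, Seq_descr_title);

    return n ? (const char *)n->data.ptrvalue : NULL;
}

/* mol-type stores a GIBB-mol value in data.intvalue; 0 means unknown. */
long descr_mol_type(const DescrNode *head)
{
    const DescrNode *n = descr_find(head, Seq_descr_mol_type);

    return n ? n->data.intvalue : 0;
}
```

A caller must check the choice value before reading the union, because nothing else in the node records which member is valid.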

C Structures and Functions: objpubd.h

/* objpubd.h