Sequence Features


Introduction
Seq-feat: Structure of a Feature
SeqFeatData: Type Specific Feature Data
Seq-feat Implementation in C
CdRegion: Coding Region
Genetic Codes
Rsite-ref: Reference To A Restriction Enzyme
RNA-ref: Reference To An RNA
Gene-ref: Reference To A Gene
Prot-ref: Reference To A Protein
Txinit: Transcription Initiation
Current Genetic Code Table: gc.prt
ASN.1 Specification: seqfeat.asn
C Structures and Functions: objfeat.h


Introduction

A sequence feature (Seq-feat) is a block of structured data (SeqFeatData) explicitly attached to a region of a Bioseq through one or two Seq-locs (see Sequence Locations and Identifiers). The Seq-feat itself can carry information common to all features, as well as serving as the junction between the SeqFeatData and Seq-loc(s). Since a Seq-feat references a Bioseq through an explicit Seq-loc, a Seq-feat is an entity which can stand alone, or be moved between contexts without loss of information. Thus, information ABOUT Bioseqs can be created, exchanged, and compared independently from the Bioseq itself. This is an important attribute of the NCBI data model.

A feature table is a set of Seq-feats gathered together within a Seq-annot (see Biological Sequences). The Seq-annot allows the features to be attributed to a source and be associated with a title or comment. Seq-feats are normally exchanged "packaged" into a feature table.

Seq-feat: Structure of a Feature

A Seq-feat is a data structure common to all features. The fields it contains can be evaluated by software the same way for all features, ignoring the "data" element which is what makes each feature class unique.

id: Features Can Have Identifiers

At this time unique identifiers for features are even less available or controlled than sequence identifiers. However, as molecular biology informatics becomes more sophisticated, it will become not only useful, but essential to be able to cite features as precisely as NCBI is beginning to be able to cite sequences. The Seq-feat.id slot is where these identifiers will go. The Feat-id object for features, meant to be the equivalent of the Seq-id object for Bioseqs, is not very fully developed yet. It can accommodate feature ids from the NCBI Backbone database, local ids, and the generic Dbtag type. Look for better characterized global ids to appear here in the future as the requirement for structured data exchange becomes increasingly accepted.

data: Structured Data Makes Feature Types Unique

Each type of feature can have a data structure which is specifically designed to accommodate all the requirements of that type with no concern about the requirements of other feature types. Thus a coding region data structure can have fielded elements for reading frame and genetic code, while a tRNA data structure would have information about the amino acid transferred.

This design completely modularizes the components required specifically by each feature type. If a new field is required by a particular feature type, it does not affect any of the others. A new feature type, even a very complex one, can be added without affecting any of the others.

Software can be written in a very modular fashion, reflecting the data design. Functions common to all features (such as determining all features in a sequence region) simply ignore the "data" field and are robust against changes or additions to this component. Functions which process particular types have a well defined data interface unique to each type.

Perhaps a less obvious consequence is code and data reuse. Data objects used in other contexts can be used as features simply by making them a CHOICE in SeqFeatData. For example, the publication feature reuses the Pubdesc type used for Bioseq descriptors. This type includes all the standard bibliographic types (see Bibliographic References) used by MEDLINE or other bibliographic databases. Software which displays, queries, or retrieves publications will work without change on the "data" component of a publication feature because it is EXACTLY THE SAME object. This has profound positive consequences for both data and code development and maintenance.

This modularization also makes it natural to discuss each allowed feature type separately as is done in the SeqFeatData section below.
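As an illustration of a function common to all features, the following minimal sketch finds the features overlapping a sequence region without ever examining the type-specific data. It is not toolkit code: FeatSketch and features_in_region() are hypothetical names, and the location is reduced to a single inclusive interval, whereas a real Seq-loc can be far more complex.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified feature: the location is reduced to a single
   inclusive interval, and the type-specific "data" is an opaque pointer. */
typedef struct {
    long from;        /* first position covered by the feature */
    long to;          /* last position covered by the feature */
    int data_choice;  /* which SeqFeatData type "data" holds */
    void *data;       /* type-specific payload, never examined here */
} FeatSketch;

/* Collect pointers to the features overlapping the region [from, to].
   Returns the number of overlapping features found; at most max_hits
   pointers are stored in hits. */
static size_t features_in_region(const FeatSketch *feats, size_t nfeats,
                                 long from, long to,
                                 const FeatSketch **hits, size_t max_hits)
{
    size_t i, found = 0;

    assert(from <= to);
    for (i = 0; i < nfeats; i++) {
        /* Two inclusive intervals overlap if each starts before the other ends. */
        if (feats[i].to >= from && feats[i].from <= to) {
            if (found < max_hits)
                hits[found] = &feats[i];
            found++;
        }
    }
    return found;
}
```

Because the function never looks at data_choice or data, adding a new feature type would not require any change to it.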

partial: This Feature is Incomplete

If Seq-feat.partial is TRUE, the feature is incomplete in some (unspecified) way. The incompleteness may be specified in more detail in the Seq-feat.location field. This flag allows quick exclusion of incomplete features when doing a database wide survey. It also allows the feature to be flagged when the details of incompleteness may not be known.

Seq-feat.partial should ALWAYS be TRUE if the feature is incomplete, even if Seq-feat.location indicates the incompleteness as well.

except: There is Something Biologically Exceptional

The Seq-feat.except flag is similar to the Seq-feat.partial flag in that it allows a simple warning that there is something unusual about this feature, without attempting to structure a detailed explanation. Again, this allows software scanning features in the database to ignore atypical cases easily. If Seq-feat.except is TRUE, Seq-feat.comment should contain a string explaining the exceptional situation.

Seq-feat.except does not necessarily indicate there is something wrong with the feature, but rather that the biology exceeds the current representational capacity of the feature definition and that this may lead to an incorrect interpretation. For example, a coding region feature on genomic DNA where post-transcriptional editing of the RNA occurs would be a biological exception. If one translates the region using the frame and genetic code given in the feature one does not get the protein it points to, but the data supplied in the feature is, in fact, correct. It just does not take into account the RNA editing process.

Ideally, one should try to avoid or minimize exceptions by the way annotation is done. An approach to minimizing the RNA editing problem is described in the "product" section below. If one is forced to use exception consistently, it is a signal that a new or revised feature type is needed.
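The way the partial and except flags support quick scanning can be sketched as follows. This is only an illustration under stated assumptions: FeatFlagsSketch and the three functions are hypothetical names, and the structure holds just the two flags and the comment rather than a full Seq-feat.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduction of a Seq-feat to the fields discussed above. */
typedef struct {
    int partial;          /* nonzero if the feature is incomplete */
    int except;           /* nonzero if biologically exceptional */
    const char *comment;  /* explanation, expected when except is set */
} FeatFlagsSketch;

/* A feature is "typical" for survey purposes if neither flag is set. */
static int feat_is_typical(const FeatFlagsSketch *f)
{
    assert(f != NULL);
    return !f->partial && !f->except;
}

/* Check the convention that an exceptional feature carries a comment. */
static int feat_except_is_explained(const FeatFlagsSketch *f)
{
    assert(f != NULL);
    return !f->except || (f->comment != NULL && f->comment[0] != '\0');
}

/* Count the features that a survey would keep. */
static size_t count_typical(const FeatFlagsSketch *feats, size_t n)
{
    size_t i, count = 0;

    for (i = 0; i < n; i++)
        if (feat_is_typical(&feats[i]))
            count++;
    return count;
}
```

A survey that wants only complete, unexceptional features can test the two flags alone, without interpreting the location or the comment.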

comment: A Comment About This Feature

No length limit is set on the comment, but practically speaking brief is better.

product: Does This Feature Produce Another Bioseq?

A Seq-feat is unusual in that it can point to two different sequence locations. The "product" location enables two Bioseqs to be linked together in a source/product relationship explicitly. This is very valuable for features which describe a transformation from one Bioseq to another, such as coding region (nucleic acid to protein) or the various RNA types (genomic nucleic acid to RNA product).

This explicit linkage is extremely valuable for connecting diverse types. Linkage of nucleic acid to protein through coding region is clearly of profound biological significance, and it makes data traversal from gene to product or back simple and explicit. Less obvious, but nonetheless useful, is the connection between a tRNA gene and the modified sequence of the tRNA itself, or between a transcribed coding region and an edited mRNA.

Note that such a feature is as valuable in association with its product Bioseq alone as it is with its source Bioseq alone, and could be distributed with either or both.

location: Source Location of This Feature

The Seq-feat.location is the traditional location associated with a feature. While it is possible to use any Seq-loc type in Seq-feat.location, it is recommended to use types which resolve to a single unique sequence. The use of a type like Seq-loc-equiv to represent alternative splicing of exons (similar to the GenBank/EMBL/DDBJ feature table "one-of") is strongly discouraged. Consider the example of such an alternatively spliced coding region. What protein sequence is coded for by such usage? This problem is accentuated by the availability of the "product" slot. Which protein sequence is the product of this coding region? While such a shorthand notation may seem attractive at first glance, it is clearly much more useful to represent each splicing alternative, and its associated protein product, times of expression, etc. separately.

qual: GenBank Style Qualifiers

The GenBank/EMBL/DDBJ feature table uses "qualifiers", a combination of a string key and a string value. Many of these qualifiers do not map to the ASN.1 specification, so this provides a means of carrying them in the Seq-feat for features derived from those sources.

title: A User Defined Name

This field is provided for naming features for display. It would be used by end-user software to allow the user to add locally meaningful names to features. This is not an id, as this is provided by the "id" slot.

ext: A User Defined Structured Extension

The "ext" field allows the extension of a standard feature type with a structured User-object (see General Use Objects) defined by a user. For example, a particular scientist may have additional detailed information about coding regions which does not fit into the standard CdRegion data type. Rather than create a completely new feature type, the CdRegion type can be extended by filling in as many of the standard CdRegion fields as possible, then putting the additional information in the User-object. Software which only expects a standard coding region will operate on the extended feature without a problem, while software that can make use of the additional data in the User-object can operate on exactly the same feature.

cit: Citations For This Feature

This slot is a set of Pubs which are citations about the feature itself, not about the Bioseq as a whole. It can be of any type, although the most common is type "pub", a set of any kind of Pubs. The individual Pubs within the set may be Pub-equivs (see Bibliographic References) to hold equivalent forms for the same publication, so some thought should be given to the process of accessing all the possible levels of information in this seemingly simple field.

exp-ev: Experimental Evidence

If it is known for certain that there is or is not experimental evidence supporting a particular feature, Seq-feat.exp-ev can be "experimental" or "not-experimental" respectively. If the type of evidence supporting the feature is not known, exp-ev should not be given at all.

This field is only a simple flag. It gives no indication of what kind of evidence may be available. A structured field of this type will differ from feature type to feature type, and thus is inappropriate to the generic Seq-feat. Information regarding the quality of the feature can be found in the CdRegion feature and even more detail on methods in the Txinit feature. Other feature types may gain experimental evidence fields appropriate to their types as it becomes clear what a reasonable classification of that evidence might be.

xref: Linking To Other Features

SeqFeatXrefs are copies of the Seq-feat.data field and (optionally) the Seq-feat.id field from other related features. This is a copy operation and is meant to keep some degree of connectivity or completeness with a Seq-feat that is moved out of context. For example, in a collection of data including a nucleic acid sequence and its translated protein product, there would be a Gene feature on the nucleic acid, a Prot-ref feature on the protein, and a CdRegion feature linking all three together. However, if the CdRegion feature is taken by itself, the name of the translated protein and the name of the gene are not immediately available. The Seq-feat.xref provides a simple way to copy the relevant information. Note that there is a danger to any such copy operation in that the original source of the copied data may be modified without updating the copy. Software should be careful about this, and the best course is to take the original data if it is available to the software, using any copies in xref only as a last resort. If the "id" is included in the xref, this makes it easier for software to keep the copy up to date. But it depends on widespread use of feature ids.

SeqFeatData: Type Specific Feature Data

The "data" slot of a Seq-feat is filled with SeqFeatData, which is just a CHOICE of a variety of specific data structures. They are listed under their CHOICE type below, but for most types a detailed discussion will be found under the type name itself later in this chapter, or in another chapter. That is because most types are data objects in their own right, and may find uses in many other contexts than features.

gene: Location Of A Gene

A gene is a feature of its own, rather than a modifier of other features as in the GenBank/EMBL/DDBJ feature tables. A gene is a heritable region of nucleic acid sequence which confers a measurable phenotype. That phenotype may be achieved by many components of the gene including but not limited to coding regions, promoters, enhancers, terminators, and so on. The gene feature is meant to approximately cover the region of nucleic acid considered by workers in the field to be the gene. This admittedly fuzzy concept has an appealing simplicity and fits in well with higher level views of genes such as genetic maps.

The gene feature is implemented with a Gene-ref object, or a "reference to" a gene. The Gene-ref object is discussed below.

org: Source Organism Of The Bioseq

Normally when a whole Bioseq or set of Bioseqs is from the same organism, the Org-ref (reference to Organism) will be found at the descriptor level of the Bioseq or Bioseq-set (see Biological Sequences). However, in some cases the whole Bioseq may not be from the same organism. This may occur naturally (e.g. a provirus integrated into a host chromosome) or artificially (e.g. recombinant DNA techniques).

The org feature is implemented with an Org-ref object, or a "reference to" an organism. The Org-ref is discussed below.

cdregion: Coding Region

A cdregion is a region of nucleic acid which codes for a protein. It can be thought of as "instructions to translate" a nucleic acid, not simply as a series of exons or a reflection of an mRNA or primary transcript. Other features represent those things. Unfortunately, most existing sequences in the database are only annotated for coding region, so transcription and splicing information must be inferred (often inaccurately) from it. We encourage the annotation of transcription features in addition to the coding region. Note that since the cdregion is "instructions to translate", one can represent translational stuttering by having overlapping intervals in the Seq-feat.location. Again, beware of assuming a cdregion definitely reflects transcription.
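The "instructions to translate" view can be sketched as concatenating the intervals of the location in order, with nothing preventing one interval from overlapping the previous one. The sketch below is an illustration only: IntervalSketch and splice_intervals() are hypothetical names, positions are assumed to be 0-based and inclusive, and only the plus strand of a single sequence is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical interval on a single sequence, 0-based and inclusive. */
typedef struct {
    size_t from;
    size_t to;
} IntervalSketch;

/* Concatenate the bases covered by each interval, in the order given,
   into out (always NUL terminated). Overlapping intervals are allowed,
   so a base may be copied more than once. Invalid intervals are skipped.
   Returns the number of bases written. */
static size_t splice_intervals(const char *seq,
                               const IntervalSketch *iv, size_t n,
                               char *out, size_t outsize)
{
    size_t seqlen, used = 0, i, p;

    assert(seq != NULL && out != NULL);
    if (outsize == 0)
        return 0;
    seqlen = strlen(seq);
    for (i = 0; i < n; i++) {
        if (iv[i].from > iv[i].to || iv[i].to >= seqlen)
            continue;  /* skip intervals that do not fit the sequence */
        for (p = iv[i].from; p <= iv[i].to && used + 1 < outsize; p++)
            out[used++] = seq[p];
    }
    out[used] = '\0';
    return used;
}
```

With the sequence ATGAAACCC and the intervals 0-5 and 5-8, the sketch yields ATGAAAACCC, because position 5 is read twice.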

A cdregion feature is implemented with a Cdregion object, discussed below.

prot: Describing A Protein

A protein feature describes and/or names a protein or region of a protein. It uses a Prot-ref object, or "reference to" a protein, described in detail below.

A single amino acid Bioseq can have many protein features on it. It may have one over its full length describing a pro-peptide, then a shorter one describing the mature peptide. An extreme case might be a viral polyprotein which would have one protein feature for the whole polyprotein, then additional protein features for each of the component mature proteins. One should always take into account the "location" slot of a protein feature.

rna: Describing An RNA

An RNA feature can describe both coding intermediates and structural RNAs using an RNA-ref, or "reference to" an RNA. The RNA-ref is described in more detail below. The Seq-feat.location for an RNA can be attached to either the genomic sequence coding for the RNA, or to the sequence of the RNA itself, when available. The determination of whether the Bioseq the RNA feature is attached to is genomic or an RNA type is made by examining the Bioseq.descr.mol-type, not by making assumptions based on the feature. When the genomic Bioseq and the RNA Bioseq are both available, one could attach the RNA Seq-feat.location to the genomic sequence and the Seq-feat.product to the RNA to connect them and capture explicitly the process by which the RNA is created.

pub: Publication About A Bioseq Region

When a publication describes a whole Bioseq, it would normally be at the "descr" slot of the Bioseq. However, if it applies to a sub region of the Bioseq, it is convenient to make it a feature. The pub feature uses a Pubdesc (see Biological Sequences for a detailed description) to describe a publication and how it relates to the Bioseq. To indicate a citation about a specific feature (as opposed to about the sequence region in general), use the Seq-feat.cit slot of that feature.

seq: Tracking Original Sequence Sources

The "seq" feature is a simple way to associate a region of sequence with a region of another. For example, if one wished to annotate a region of a recombinant sequence as being from "pBR322 10-50" one would simply use a Seq-loc (see Sequence Locations and Identifiers) for the interval 10-50 on Seq-id pBR322. Software tools could use such information to provide the pBR322 numbering system over that interval.

This feature is really meant to accommodate older or approximate data about the source of a sequence region and is no more than annotation. More specific and computationally useful ways of doing this are (1) create the recombinant sequence as a segmented sequence directly (see Biological Sequences), (2) use the Seq-hist field of a Bioseq to record its history, (3) create alignments (see Sequence Alignments) which are also valid Seq-annots, to indicate more complex relationships of one Bioseq to others.

imp: Importing Features From Other Data Models

The SeqFeatData types explicitly define only certain well understood or widely used feature types. There may be other features contained in databases converted to this specification which are not represented by this ASN.1 specification. At least for GenBank, EMBL, DDBJ, PIR, and SWISS-PROT, these can be mapped to an Imp-feat structure so the features are not lost, although they are still unique to the source database. All these features have the basic form of a string key, a location (carried as the original string), and a descriptor (another string). In the GenBank/EMBL/DDBJ case, any additional qualifiers can be carried on the Seq-feat.qual slot.

GenBank/EMBL/DDBJ use a "location" called "replace" which is actually an editing operation on the sequence which incorporates literal strings. Since the locations defined in this specification are locations on sequences, and not editing operations, features with replace operators are all converted to Imp-feat so that the original location string can be preserved. This same strategy is taken in the face of incorrectly constructed locations encountered in parsing outside databases into ASN.1.

region: A Named Region

The region feature provides a simple way to name a region of a Bioseq (e.g. "globin locus", "LTR", "subrepeat region", etc).

comment: A Comment On A Region Of Sequence

The comment feature allows a comment to be made about any specified region of sequence. Since comment is already a field in Seq-feat, there is no need for an additional type specific data item in this case, so it is just NULL.

bond: A Bond Between Residues

This feature annotates a bond between two residues. A Seq-loc of type "bond" is expected in Seq-feat.location. Certain types of bonds are given in the ENUMERATED type. If the bond type is "other" the Seq-feat.comment slot should be used to explain the type of the bond. Allowed bond types are:

        disulfide (1)
        thiolester (2)
        xlink (3)
        thioether (4)
        other (255)

site: A Defined Site

The site feature annotates a known site from the following specified list. If the site is "other" then Seq-feat.comment should be used to explain the site.

        active (1)
        binding (2)
        cleavage (3)
        inhibit (4)
        modified (5)
        glycosylation (6)
        myristoylation (7)
        mutagenized (8)
        metal-binding (9)
        phosphorylation (10)
        acetylation (11)
        amidation (12)
        methylation (13)
        hydroxylation (14)
        sulfatation (15)
        oxidative-deamination (16)
        pyrrolidone-carboxylic-acid (17)
        gamma-carboxyglutamic-acid (18)
        blocked (19)
        lipid-binding (20)
        np-binding (21)
        dna-binding (22)
        other (255)

rsite: A Restriction Enzyme Cut Site

A restriction map is basically a feature table with rsite features. Software which generates such a feature table could then use any sequence annotation viewer to display its results. Restriction maps generated by physical methods (before sequence is available) can use this feature to create a map type Bioseq representing the ordered restriction map. For efficiency one would probably create one Seq-feat for each restriction enzyme used and use the Packed-pnt Seq-loc in the location slot. See Rsite-ref, below.

user: A User Defined Feature

An end-user can create a feature completely of their own design by using a User-object (see General Use Objects) for SeqFeatData. This provides a means for controlled addition and testing of new feature types, which may or may not become widely accepted or "graduate" to a defined SeqFeatData type. It is also a means for software to add structured information to Bioseqs for its own use, information which may never be intended to become a widely used standard. All the generic feature operations, including display, deletion, determining which features are carried on a sub region of sequence, etc., can be applied to a user feature with no knowledge of the particular User-object structure or meaning. Yet software which recognizes that User-object can take advantage of it.

If an existing feature type is available but lacks certain additional fields necessary for a special task or view of information, then it should be extended with the Seq-feat.ext slot, rather than building a complete user feature de novo.

txinit: Transcription Initiation

This feature is used to designate the region of transcription initiation, about which considerable knowledge is available. See Txinit, below.

num: Applying Custom Numbering To A Region

A Numbering object can be used as a Bioseq descriptor to associate various numbering systems with an entire Bioseq. When used as a feature, the numbering system applies only to the region in Seq-feat.location. This makes multiple, discontinuous numbering systems available on the same Bioseq. See Biological Sequences for a description of Numbering, and also Seq-feat.seq, above, for an alternative way of applying a sequence name and its numbering system to a sequence region.

psec-str: Protein Secondary Structure

Secondary structure can be annotated on a protein sequence using this type. It can be predicted by algorithm (in which case Seq-feat.exp-ev should be "not-experimental") or by analysis of the known protein structure (Seq-feat.exp-ev = "experimental"). Only three types of secondary structure are currently supported. A "helix" is any helix, a "sheet" is beta sheet, and "turn" is a beta or gamma turn. Given the controversial nature of secondary structure classification (not to mention prediction), we opted to keep it simple until it was clear that more detail was really necessary or understood.

non-std-residue: Unusual Residues

When an unusual residue does not have a direct sequence code, the "best" standard substitute can be used in the sequence and the residue can be labeled with its real name. No attempt is made to enforce a standard nomenclature for this string.

het: Heterogen

In the PDB structural database, non-biopolymer atoms associated with a Bioseq are referred to as "heterogens". When a heterogen appears as a feature, it is assumed to be bonded to the sequence positions in Seq-feat.location. If there is no specific bonding information, the heterogen will appear as a descriptor of the Bioseq. The Seq-loc for the Seq-feat.location will probably be a point or points, not a bond. A Seq-loc of type bond is between sequence residues.

Seq-feat Implementation in C

The C implementation of a Seq-feat is mostly straightforward. However, some explanation of the "id" and "data" slots will be helpful. Both are implemented as a Choice, which is like a ValNode but without a next pointer. Both Choice structures are included as part of a SeqFeat structure. In the tables below the values of Choice.choice and the type in Choice.data.ptrvalue or Choice.data.intvalue are shown.

SeqFeat.id

ASN.1 name       Value in Choice.choice   Type in Choice.data
(not present)    0                        not needed
gibb             1                        integer
giim             2                        GiimPtr
local            3                        ObjectIdPtr
general          4                        DbtagPtr

SeqFeat.data

ASN.1 name        Value in Choice.choice   Type in Choice.data
(not present)     0                        not needed
gene              1                        GeneRefPtr
org               2                        OrgRefPtr
cdregion          3                        CdRegionPtr
prot              4                        ProtRefPtr
rna               5                        RnaRefPtr
pub               6                        PubdescPtr
seq               7                        SeqLocPtr
imp               8                        ImpFeatPtr
region            9                        CharPtr
comment           10                       (not used)
bond              11                       integer
site              12                       integer
rsite             13                       RsiteRefPtr
user              14                       UserObjectPtr
txinit            15                       TxinitPtr
num               16                       NumberingPtr
psec-str          17                       integer
non-std-residue   18                       CharPtr
het               19                       CharPtr
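The way software might dispatch on Choice.choice for the "data" slot can be sketched as below. The types here are simplified stand-ins, not the toolkit definitions: ChoiceSketch, DataValSketch, DataKind, and both functions are hypothetical names, and only the choice values listed for SeqFeat.data above are taken from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the toolkit's data union and Choice. */
typedef union {
    void *ptrvalue;
    int intvalue;
} DataValSketch;

typedef struct {
    unsigned char choice;  /* which SeqFeatData type is present */
    DataValSketch data;    /* interpreted according to choice */
} ChoiceSketch;

typedef enum { DATA_NONE, DATA_INTEGER, DATA_POINTER, DATA_UNKNOWN } DataKind;

/* Report how the data slot should be read for a SeqFeat.data choice. */
static DataKind seqfeat_data_kind(const ChoiceSketch *c)
{
    assert(c != NULL);
    switch (c->choice) {
    case 0:   /* not present */
    case 10:  /* comment: no type-specific data */
        return DATA_NONE;
    case 11:  /* bond */
    case 12:  /* site */
    case 17:  /* psec-str */
        return DATA_INTEGER;
    default:  /* remaining values up to 19 hold a pointer */
        return (c->choice <= 19) ? DATA_POINTER : DATA_UNKNOWN;
    }
}

/* Return the name carried by a region feature (choice 9), else NULL. */
static const char *seqfeat_region_name(const ChoiceSketch *c)
{
    assert(c != NULL);
    return (c->choice == 9) ? (const char *) c->data.ptrvalue : NULL;
}
```

Code that handles a single feature type, as seqfeat_region_name() does for region, needs to know only the one row of the table that applies to it.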

In addition to the usual SeqFeatNew(), SeqFeatAsnRead(), SeqFeatAsnWrite(), and SeqFeatFree() functions, there is a SeqFeatToXref() function which creates an xref and copies the "id" and "data" slots to it. There are also SeqFeatSetAsnRead() and SeqFeatSetAsnWrite() functions for sets of features. Finally, there are special SeqFeatDataAsnRead(), SeqFeatDataAsnWrite(), and SeqFeatDataFree() functions which operate on the "data" component of a SeqFeat structure since there is no separate C structure for SeqFeatData.

Of course, within the software tools for producing GenBank, report, or other formats from ASN.1 are functions to format and display features as well. There are some functions to manipulate the SeqFeatData objects, such as the translation of a CdRegion, and a host of functions to use and compare the Seq-locs of "product" and "location" or easily access and use the sequence regions they point to. These functions are discussed in the Sequence Utilities chapter. Additional functions, described in Exploring The Data, allow one to easily locate features of interest by type, in arbitrarily complex objects.

CdRegion: Coding Region

A CdRegion, in association with a Seq-feat, is considered "instructions to translate" to protein. The Seq-locs used by the Seq-feat do not necessarily reflect the exon structure of the primary transcript (although they often do). A Seq-feat of type CdRegion can point both to the source nucleic acid and to the protein sequence it produces. Most of the information about the source nucleic acid (such as the gene) or the destination protein (such as its name) is associated directly with those Bioseqs. The CdRegion only serves as a link between them, and as a method for explicitly encoding the information needed to derive one from the other.

orf: Open Reading Frame

CdRegion.orf is TRUE if the coding region is only known to be an open reading frame. This is a signal that nothing is known about the protein product, or even if it is produced. In this case the translated protein sequence will be attached, but there will be no other information associated with it. This flag allows such very speculative coding regions to be easily ignored when scanning the database for genuine protein coding regions.

The orf flag is not set when any reasonable argument can be made that the CdRegion is really expressed, such as detection of mRNA or strong sequence similarity to known proteins.

Translation Information

CdRegion has several explicit fields to define how to translate the coding region. Reading frame is explicitly given or defaults to frame one.

The genetic code is assumed to be the universal code unless given explicitly. The code itself is given, rather than requiring software to determine the code at run-time by analyzing the phylogenetic position of the Bioseq. Genetic code is described below.

Occasionally the genetic code is not followed at specific positions in the sequence. Examples are the use of alternate initiation codons only in the first position, the effects of suppressor tRNAs, or the addition of selenocysteine. The Code-break object specifies the three bases of the codon in the Bioseq which is treated differently and the amino acid which is generated at that position. During translation the genetic code is followed except at positions indicated by Code-breaks, where the instructions in the Code-break are followed instead.
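Translation with Code-breaks can be sketched as follows. This is a minimal illustration and not the toolkit's translation function: CodeBreakSketch and translate_sketch() are hypothetical names, the Code-break location is reduced to a 0-based offset of the codon's first base within the coding sequence (a real Code-break carries a Seq-loc), frame one is assumed, and the 64-cell array uses the codon indexing described under Genetic Codes, with the standard code in the ncbieaa alphabet.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Standard genetic code as a 64-cell ncbieaa array (T=0, C=1, A=2, G=3). */
static const char STANDARD_NCBIEAA[] =
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

/* Hypothetical, simplified Code-break. */
typedef struct {
    size_t offset;  /* 0-based offset of the codon's first base */
    char aa;        /* ncbieaa code to emit at that codon */
} CodeBreakSketch;

static int base_value(char base)
{
    switch (base) {
    case 'T': case 't': case 'U': case 'u': return 0;
    case 'C': case 'c': return 1;
    case 'A': case 'a': return 2;
    case 'G': case 'g': return 3;
    default: return -1;
    }
}

/* Translate cds in frame one into out (always NUL terminated).
   Codons with unrecognized bases become 'X'. A Code-break at a codon's
   offset overrides the genetic code. Returns the residues written. */
static size_t translate_sketch(const char *cds, const char *code,
                               const CodeBreakSketch *breaks, size_t nbreaks,
                               char *out, size_t outsize)
{
    size_t len, pos, i, n = 0;

    assert(cds != NULL && code != NULL && out != NULL);
    if (outsize == 0)
        return 0;
    len = strlen(cds);
    for (pos = 0; pos + 3 <= len && n + 1 < outsize; pos += 3) {
        int b1 = base_value(cds[pos]);
        int b2 = base_value(cds[pos + 1]);
        int b3 = base_value(cds[pos + 2]);
        char aa = 'X';

        if (b1 >= 0 && b2 >= 0 && b3 >= 0)
            aa = code[b1 * 16 + b2 * 4 + b3];
        for (i = 0; i < nbreaks; i++)
            if (breaks[i].offset == pos)
                aa = breaks[i].aa;  /* follow the Code-break instead */
        out[n++] = aa;
    }
    out[n] = '\0';
    return n;
}
```

Translating ATGTGATAA with no Code-breaks gives M**, while a Code-break at offset 3 specifying U (selenocysteine in ncbieaa) gives MU*.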

Problems With Translations

In a surprising number of cases an author publishes both a nucleic acid sequence and the protein sequence produced by its coding region, but the translation of the coding region does not yield the published protein sequence. On the basis of the publication it is not possible to know for certain which sequence is correct. In the NCBI Backbone database both sequences are preserved as published by the author, but the conflict flag is set to TRUE in the CdRegion. If available, the number of gaps and mismatches in the alignment of the translated sequence to the published protein sequence are also given so a judgment can be made about the severity of the problem.

Genetic Codes

A Genetic-code is a SET which may include one or more of a name, an integer id, or 64 cell arrays of amino acid codes in different alphabets. Thus, in a CdRegion, one can either refer to a genetic code by name or id, provide the genetic code itself, or both. Tables of genetic codes are provided in the NCBI software release with most possibilities filled in.

The Genetic-code.name is a descriptive name for the genetic code, mainly for display to humans. The integer id refers to the ids in the gc.val (binary ASN.1) or gc.prt (text ASN.1) file of genetic codes maintained by NCBI, distributed with the software tools and Entrez releases, and published in the GenBank/EMBL/DDBJ feature table document. Genetic-code.id is the best way to explicitly refer to a genetic code.

The genetic codes themselves are arrays of 64 amino acid codes. The index to the position in the array of the amino acid is derived from the codon by the following method:

index = (base1 * 16) + (base2 * 4) + base3

where T=0, C=1, A=2, G=3

Note that this encoding of the bases is not the same as any of the standard nucleic acid encodings described in Biological Sequences. This set of values was chosen specifically for genetic codes because it results in the convenient groupings of amino acid by codon preferred for display of genetic code tables.
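The index calculation above can be sketched in C as follows. The function names base_value() and codon_index() are illustrative and are not toolkit functions; accepting U as a synonym for T and lower case letters is an assumption made here for convenience.

```c
#include <assert.h>
#include <stddef.h>

/* Map a base letter to the value used for genetic code indexing:
   T=0, C=1, A=2, G=3. Returns -1 for any other character. */
static int base_value(char base)
{
    switch (base) {
    case 'T': case 't': case 'U': case 'u': return 0;
    case 'C': case 'c': return 1;
    case 'A': case 'a': return 2;
    case 'G': case 'g': return 3;
    default: return -1;
    }
}

/* Return the 0..63 array index for a three-base codon, or -1 if any
   of the three bases is not recognized (including a too-short string). */
static int codon_index(const char *codon)
{
    int b1, b2, b3;

    assert(codon != NULL);
    if ((b1 = base_value(codon[0])) < 0) return -1;
    if ((b2 = base_value(codon[1])) < 0) return -1;
    if ((b3 = base_value(codon[2])) < 0) return -1;
    return b1 * 16 + b2 * 4 + b3;
}
```

For example, codon_index("TTT") is 0, codon_index("ATG") is (2 * 16) + (0 * 4) + 3 = 35, and codon_index("GGG") is 63.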

The genetic code arrays have names which indicate the amino acid alphabet used (e.g. ncbieaa). The same encoding technique is used to specify start codons. Alphabet names are prefixed with "s" (e.g. sncbieaa) to indicate start codon arrays. Each cell of a start codon array contains either the gap code ("-" for ncbieaa) or an amino acid code if it is valid to use the codon as a start codon. Currently all starts are set to code for methionine, since it has never been convincingly demonstrated that a protein can start with any other amino acid. However, if other amino acids are shown to be used as starts, this structure can easily accommodate that information.

The contents of gc.prt, the currently supported genetic codes, are given at the end of this chapter.

C Implementation Of Genetic Codes

GeneticCode is implemented as a ValNodePtr with choice = 254. The ValNodePtr->data.ptrvalue is the head of a linked list of ValNodes, each of which contains one of the possible forms of a particular GeneticCode as follows:

GeneticCode Elements

ASN.1 name    Value in ValNode.choice   Type in ValNode.data
name          1                         CharPtr
id            2                         integer
ncbieaa       3                         CharPtr
ncbi8aa       4                         ByteStorePtr
ncbistdaa     5                         ByteStorePtr
sncbieaa      6                         CharPtr
sncbi8aa      7                         ByteStorePtr
sncbistdaa    8                         ByteStorePtr

GeneticCodeNew() returns a pointer to the ValNode with choice = 254, the element which points to the head of the chain. This is the datum which is returned from GeneticCodeAsnRead() and is passed to GeneticCodeAsnWrite() and GeneticCodeFree(). There are also GeneticCodeTableAsnRead() and GeneticCodeTableAsnWrite() functions. The table functions expect a list of ValNodes with ->choice = 254 linked by their ->next pointers, each with a linked list of ValNodes representing the elements of a genetic code starting from its ValNodePtr->data.ptrvalue.
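
The layout described above can be sketched as follows. To keep the example self-contained it does not include the toolkit headers; DemoValNode is a hypothetical stand-in for the toolkit's ValNode, and demo_gc_element() and demo_gc_id() are illustrative helpers, not toolkit functions.

```c
#include <stddef.h>

/* Hypothetical stand-in for the toolkit's ValNode.  The real type is
 * declared in the toolkit headers using NCBI typedefs. */
typedef struct demo_valnode {
    unsigned char choice;          /* which element this node holds */
    union {
        void *ptrvalue;            /* CharPtr or ByteStorePtr elements */
        int   intvalue;            /* the integer id element */
    } data;
    struct demo_valnode *next;
} DemoValNode;

/* ValNode.choice values from the GeneticCode Elements table */
enum {
    GC_NAME = 1, GC_ID = 2,
    GC_NCBIEAA = 3, GC_NCBI8AA = 4, GC_NCBISTDAA = 5,
    GC_SNCBIEAA = 6, GC_SNCBI8AA = 7, GC_SNCBISTDAA = 8,
    GC_HEAD = 254                  /* head node of one GeneticCode */
};

/* Return the first element with the given choice, or NULL if the
 * node is not a GeneticCode head or has no such element. */
const DemoValNode *demo_gc_element(const DemoValNode *gc, unsigned char choice)
{
    const DemoValNode *elem;

    if (gc == NULL || gc->choice != GC_HEAD)
        return NULL;
    for (elem = (const DemoValNode *)gc->data.ptrvalue;
         elem != NULL; elem = elem->next) {
        if (elem->choice == choice)
            return elem;
    }
    return NULL;
}

/* Return the integer id of a genetic code, or -1 if it has none. */
int demo_gc_id(const DemoValNode *gc)
{
    const DemoValNode *elem = demo_gc_element(gc, GC_ID);
    return elem != NULL ? elem->data.intvalue : -1;
}
```

A table of genetic codes would then be a chain of such head nodes joined through their next pointers, each head carrying its own element list in data.ptrvalue.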

A special function, GeneticCodeTableLoad(), reads gc.val into memory. For this function to work, the gc.val file must be in the directory with the other DATA items, such as the sequence alphabet file, seqcode.val.

GeneticCodeFind(id, name) returns a GeneticCodePtr to the appropriate code, assuming GeneticCodeTableLoad() has previously succeeded. If "name" is NULL, the code is matched by id. If the code cannot be found, NULL is returned.
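
The lookup rule can be illustrated with a small stand-alone sketch. It is not the toolkit's implementation: DemoCode, demo_table, and demo_code_find() are hypothetical names, the table holds only three entries copied from gc.prt, and matching a non-NULL name by exact string comparison is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified record for one genetic code */
typedef struct demo_code {
    int id;
    const char *names[2];    /* a code may carry more than one name */
} DemoCode;

/* A few entries copied from gc.prt, for illustration only */
static const DemoCode demo_table[] = {
    {  1, { "Standard", "SGC0" } },
    {  2, { "Vertebrate Mitochondrial", "SGC1" } },
    { 11, { "Eubacterial", NULL } }
};

/* If name is NULL, match on id; otherwise match on name (assumed to be
 * an exact comparison).  Returns NULL when nothing matches. */
const DemoCode *demo_code_find(int id, const char *name)
{
    size_t i, j;
    size_t n = sizeof demo_table / sizeof demo_table[0];

    for (i = 0; i < n; i++) {
        if (name == NULL) {
            if (demo_table[i].id == id)
                return &demo_table[i];
            continue;
        }
        for (j = 0; j < 2; j++) {
            const char *candidate = demo_table[i].names[j];
            if (candidate != NULL && strcmp(candidate, name) == 0)
                return &demo_table[i];
        }
    }
    return NULL;
}
```

As with the function described above, the caller must check the result for NULL before using it.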

Rsite-ref: Reference To A Restriction Enzyme

This simple data structure just references a restriction enzyme. It is a choice of a simple string (which may or may not be from a controlled vocabulary) or a Dbtag, in order to cite an enzyme from a specific database such as RSITE. The Dbtag is preferred, if available.

Note that this reference is not an Rsite-entry which might contain a host of information about the restriction enzyme, but is only a reference to the enzyme.

RNA-ref: Reference To An RNA

An RNA-ref allows naming and a minimal description of various RNAs. The "type" is a controlled vocabulary for dividing RNAs into broad, well accepted classes. The "pseudo" field is used for RNA pseudogenes.

The "ext" field allows the addition of structure information appropriate to a specific RNA class. The "name" extension allows naming the "other" type or adding a modifier, such as "28S" to rRNA. For tRNA there is a structured extension which has fields for the amino acid transferred, drawn from the standard amino acid alphabets, and a value for one or more codons that this tRNA recognizes. The values of the codons are calculated as a number from 0 to 63 using the same formula as for calculating the index to Genetic Codes, above.
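
Because the codon values use the genetic code index formula, with T=0, C=1, A=2, G=3, a stored value can be turned back into a three-letter codon by reversing the arithmetic. The sketch below shows one way to do this; codon_from_index() is a hypothetical helper, not a toolkit function.

```c
#include <stddef.h>

/* Convert a codon value (0..63) back to its three bases.
 * out must have room for 4 characters (3 bases plus '\0').
 * Returns 1 on success, 0 if the value is out of range or out is NULL. */
int codon_from_index(int index, char *out)
{
    static const char bases[] = "TCAG";   /* T=0, C=1, A=2, G=3 */

    if (out == NULL || index < 0 || index > 63)
        return 0;
    out[0] = bases[index / 16];           /* base1 */
    out[1] = bases[(index / 4) % 4];      /* base2 */
    out[2] = bases[index % 4];            /* base3 */
    out[3] = '\0';
    return 1;
}
```

For example, the value 35 decodes to ATG, since 35 = (2 * 16) + (0 * 4) + 3.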

As nomenclature and attributes for classes of RNAs become better understood and accepted, the RNA-ref.ext will gain additional extensions.

Gene-ref: Reference To A Gene

A Gene-ref is not intended to carry all the information one might want to know about a gene, but to provide a small set of information and reference some larger body of information, such as an entry in a genetic database.

The "locus" field is for the gene symbol, preferably an official one (e.g. "Adh"). The "allele" field is for an allele symbol (e.g. "S"). The "desc" field is for a descriptive name for the gene (e.g. "Alcohol dehydrogenase, SLOW allele"). One should fill in as many of these fields as possible.

The "maploc" field accepts a string with a map location using whatever conventions are appropriate to the organism. This field is hardly definitive; if up-to-date mapping information is needed, a true mapping database should always be consulted.

If "pseudo" is TRUE, this is a pseudogene.

The "db" field allows the Gene-ref to be attached to controlled identifiers from established gene databases. This allows a direct key to a database where gene information will be kept up to date without requiring that the rest of the information in the Gene-ref necessarily be up to date as well. This type of foreign key is essential to keeping loosely connected data up to date and NCBI is encouraging gene databases to make such controlled keys publicly available.

The "syn" field holds synonyms for the gene. It does not attempt to discriminate symbols, alleles, or descriptions.

In addition to the usual C functions, there is a specific GeneRefDup() function to duplicate this object quickly.

Prot-ref: Reference To A Protein

A Prot-ref references a protein in much the same way that a Gene-ref references a gene. The "name" field is a SET OF strings to allow synonyms. The first name is presumed by software tools to be the preferred name. Since there is no controlled vocabulary for protein names, this is the best that can be done at this time. "ADH" and "alcohol dehydrogenase" are both protein names.

The "desc" field is for a description of the protein. This field is often not necessary if the name field is filled in, but may be informative in some cases and essential in cases where the protein has not yet been named (e.g. ORF21 putative protein).

The "ec" field contains a SET of EC numbers. These strings are expected to be only numbers separated by periods (no leading "EC"). Sometimes the last few positions will be occupied by dashes or not filled in at all if the protein has not been fully characterized. Examples of EC numbers are 1.14.13.8, 1.14.14.-, 1.14.14.3, 1.14.--.--, and 1.14.
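
A format check along these lines might look like the sketch below. It is only an illustration of the convention described above, not a toolkit routine: ec_number_looks_valid() is a hypothetical name, and the specific rules it enforces (at most four positions, each position either all digits or all dashes, dashes only in trailing positions) are assumptions drawn from the examples.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if ec looks like an EC number string as described in the
 * text, 0 otherwise.  Assumed rules: 1 to 4 positions separated by
 * periods; each position is all digits or all dashes; once a dash
 * position appears, no digit position may follow. */
int ec_number_looks_valid(const char *ec)
{
    const char *p = ec;
    int fields = 0;
    int seen_dash_field = 0;

    if (ec == NULL || *ec == '\0')
        return 0;

    for (;;) {
        int digits = 0, dashes = 0;

        while (isdigit((unsigned char)*p)) { digits++; p++; }
        while (*p == '-') { dashes++; p++; }

        /* position must be non-empty and must not mix digits and dashes */
        if ((digits == 0) == (dashes == 0))
            return 0;
        if (digits > 0 && seen_dash_field)
            return 0;
        if (dashes > 0)
            seen_dash_field = 1;
        if (++fields > 4)
            return 0;

        if (*p == '\0')
            return 1;
        if (*p != '.')
            return 0;      /* e.g. a leading "EC" or other stray text */
        p++;
    }
}
```

All five example strings given above pass this check, while a string such as "EC 1.14.13.8" is rejected because of the leading "EC".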

The "activity" field allows the various known activities of the protein to be specified. This can be very helpful, especially when the name is not informative.

The "db" field is to accommodate keys from protein databases. While protein nomenclature is not well controlled, there are subfields such as immunology which have controlled names. There are also databases which characterize proteins in other ways than sequence, such as 2-d spot databases which could provide such a key.

In addition to the usual C functions, there is also a ProtRefDup() for quickly duplicating this object.

Txinit: Transcription Initiation

This is an example of a SeqFeatData block designed and built by a domain expert, an approach the NCBI strongly encourages and supports. The Txinit structure was developed by Philip Bucher and David Ghosh. It carries most of the information about transcription initiation represented in the Eukaryotic Promoter Database (EPD). The Txinit structure carries a host of detailed experimental information, far beyond the simple "promoter" features in GenBank/EMBL/DDBJ. EPD is released as a database in its own right and as Txinit Seq-feats. NCBI will be incorporating the EPD in its feature table form to provide expert annotation of the sequence databases in the manner described in the Data Model chapter.

The Txinit object is well described by its comments in the ASN.1 definition. The best source of more in depth discussion of these fields is in the EPD documentation, and so it will not be reproduced here.

Current Genetic Code Table: gc.prt

--**************************************************************************

--  This is the NCBI genetic code table

--  Base 1-3 of each codon have been added as comments to facilitate

--    readability at the suggestion of Peter Rice, EMBL

--*************************************************************************

 

Genetic-code-table ::= {

{

  name "Standard" ,

  name "SGC0" ,

  id 1 ,

  ncbieaa  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",

  sncbieaa "-----------------------------------M----------------------------"

  -- Base1  TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG

  -- Base2  TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG

  -- Base3  TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

} ,

{

  name "Vertebrate Mitochondrial" ,

  name "SGC1" ,

  id 2 ,

  ncbieaa  "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",

  sncbieaa "--------------------------------MMMM---------------M------------"

  -- Base1  TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG

  -- Base2  TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG

  -- Base3  TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

} ,

{

  name "Yeast Mitochondrial" ,

  name "SGC2" ,

  id 3 ,

  ncbieaa  "FFLLSSSSYY**CCWWTTTTPPPPHHQQRRRRIIMMTTTTNNKKSSRRVVVVAAAADDEEGGGG",

  sncbieaa "-----------------------------------M----------------------------"

  -- Base1  TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG

  -- Base2  TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG

  -- Base3  TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

} ,

{

  name "Mold Mitochondrial and Mycoplasma" ,

  name "SGC3" ,

  id 4 ,

  ncbieaa  "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",

  sncbieaa "-----------------------------------M----------------------------"

  -- Base1  TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG

  -- Base2  TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG

  -- Base3  TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

} ,

{

  name "Invertebrate Mitochondrial" ,

  name "SGC4" ,

  id 5 ,

  ncbieaa  "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSSSVVVVAAAADDEEGGGG",

  sncbieaa "---M----------------------------M-MM----------------------------"

  -- Base1  TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG

  -- Base2  TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG

  -- Base3  TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

} ,

{

  name "Ciliate Macronuclear and Dasycladacean" ,

  name "SGC5" ,

  id 6 ,

  ncbieaa  "FFLLSSSSYYQQCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",

  sncbieaa "-----------------------------------M----------------------------"

  -- Base1  TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG

  -- Base2  TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG

  -- Base3  TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

} ,

{

  name "Protozoan Mitochondrial (and Kinetoplast)" ,

  name "SGC6" ,

  id 7 ,

  ncbieaa  "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",

  sncbieaa "--MM---------------M------------MMMM---------------M------------"

  -- Base1  TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG

  -- Base2  TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG

  -- Base3  TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

} ,                                                                               

{

  name "Plant Mitochondrial/Chloroplast (posttranscriptional variant)" ,

  name "SGC7" ,

  id 8 ,

  ncbieaa  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRWIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",

  sncbieaa "--M-----------------------------MMMM---------------M------------"

  -- Base1  TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG

  -- Base2  TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG

  -- Base3  TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

} ,

{

  name "Echinoderm Mitochondrial" ,

  name "SGC8" ,

  id 9 ,

  ncbieaa  "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNNKSSSSVVVVAAAADDEEGGGG",

  sncbieaa "-----------------------------------M----------------------------"

  -- Base1  TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG

  -- Base2  TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG

  -- Base3  TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

} ,

{

  name "Euplotid Macronuclear" ,

  name "SGC9" ,

  id 10 ,

  ncbieaa  "FFLLSSSSYY*QCCCWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",

  sncbieaa "-----------------------------------M----------------------------"

  -- Base1  TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG

  -- Base2  TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG

  -- Base3  TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

} ,

{

  name "Eubacterial" ,

  id 11 ,

  ncbieaa  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",

  sncbieaa "---M---------------M------------M--M---------------M------------"

  -- Base1  TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG

  -- Base2  TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG

  -- Base3  TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

}

}

 

ASN.1 Specification: seqfeat.asn

--$Revision: 2.0 $

--**********************************************************************

--

--  NCBI Sequence Feature elements

--  by James Ostell, 1990

--

--**********************************************************************

 

NCBI-Seqfeat DEFINITIONS ::=

BEGIN

 

EXPORTS Seq-feat, Feat-id;

 

IMPORTS Gene-ref FROM NCBI-Gene

        Prot-ref FROM NCBI-Protein

        Org-ref FROM NCBI-Organism

        RNA-ref FROM NCBI-RNA

        Seq-loc, Giimport-id FROM NCBI-Seqloc

        Pubdesc, Numbering, Heterogen FROM NCBI-Sequence

        Rsite-ref FROM NCBI-Rsite

        Txinit FROM NCBI-TxInit

        Pub-set FROM NCBI-Pub

        Object-id, Dbtag, User-object FROM NCBI-General;

 

--*** Feature identifiers ********************************

--*

 

Feat-id ::= CHOICE {

    gibb INTEGER ,            -- geninfo backbone

    giim Giimport-id ,        -- geninfo import

    local Object-id ,         -- for local software use

    general Dbtag }           -- for use by various databases

 

--*** Seq-feat *******************************************

--*  sequence feature generalization

 

Seq-feat ::= SEQUENCE {

    id Feat-id OPTIONAL ,

    data SeqFeatData ,           -- the specific data

    partial BOOLEAN OPTIONAL ,    -- incomplete in some way?

    except BOOLEAN OPTIONAL ,     -- something funny about this?

    comment VisibleString OPTIONAL ,

    product Seq-loc OPTIONAL ,    -- product of process

    location Seq-loc ,            -- feature made from

    qual SEQUENCE OF Gb-qual OPTIONAL ,  -- qualifiers

    title VisibleString OPTIONAL ,   -- for user defined label

    ext User-object OPTIONAL ,    -- user defined structure extension

    cit Pub-set OPTIONAL ,        -- citations for this feature

    exp-ev ENUMERATED {           -- evidence for existence of feature

        experimental (1) ,        -- any reasonable experimental check

        not-experimental (2) } OPTIONAL , -- similarity, pattern, etc

   xref SET OF SeqFeatXref OPTIONAL }   -- cite other relevant features

 

SeqFeatData ::= CHOICE {

    gene Gene-ref ,

    org Org-ref ,

    cdregion Cdregion ,

    prot Prot-ref ,

    rna RNA-ref ,

    pub Pubdesc ,              -- publication applies to this seq

    seq Seq-loc ,              -- to annotate origin from another seq

    imp Imp-feat ,

    region VisibleString,      -- named region (globin locus)

    comment NULL ,             -- just a comment

    bond ENUMERATED {

        disulfide (1) ,

        thiolester (2) ,

        xlink (3) ,

        thioether (4) ,

        other (255) } ,

   site ENUMERATED {

       active (1) ,

       binding (2) ,

       cleavage (3) ,

       inhibit (4) ,

       modified (5),

       glycosylation (6) ,

       myristoylation (7) ,

       mutagenized (8) ,

       metal-binding (9) ,

       phosphorylation (10) ,

       acetylation (11) ,

       amidation (12) ,

       methylation (13) ,

       hydroxylation (14) ,

       sulfatation (15) ,

       oxidative-deamination (16) ,

       pyrrolidone-carboxylic-acid (17) ,

       gamma-carboxyglutamic-acid (18) ,

       blocked (19) ,

       lipid-binding (20) ,

       np-binding (21) ,

       dna-binding (22) ,

       other (255) } ,

    rsite Rsite-ref ,       -- restriction site  (for maps really)

    user User-object ,      -- user defined structure

    txinit Txinit ,         -- transcription initiation

   num Numbering ,         -- a numbering system

   psec-str ENUMERATED {   -- protein secondary structure

       helix (1) ,         -- any helix

       sheet (2) ,         -- beta sheet

       turn  (3) } ,       -- beta or gamma turn

   non-std-residue VisibleString ,  -- non-standard residue here in seq

   het Heterogen }         -- cofactor, prosthetic grp, etc, bound to seq

 

SeqFeatXref ::= SEQUENCE {

    id Feat-id OPTIONAL ,         -- the feature copied

    data SeqFeatData }           -- the specific data

  

--*** CdRegion ***********************************************

--*

--*  Instructions to translate from a nucleic acid to a peptide

--*    conflict means it's supposed to translate but doesn't

--*

 

 

Cdregion ::= SEQUENCE {

    orf BOOLEAN OPTIONAL ,             -- just an ORF ?

    frame ENUMERATED {

        not-set (0) ,                  -- not set, default to one

        one (1) ,

        two (2) ,

        three (3) } DEFAULT one ,      -- reading frame

    conflict BOOLEAN OPTIONAL ,        -- conflict

    gaps INTEGER OPTIONAL ,            -- number of gaps on conflict/except

    mismatch INTEGER OPTIONAL ,        -- number of mismatches on above

    code Genetic-code OPTIONAL ,       -- genetic code used

    code-break SEQUENCE OF Code-break OPTIONAL ,   -- individual exceptions

    stops INTEGER OPTIONAL }           -- number of stop codons on above

 

                    -- each code is 64 cells long, in the order where

                    -- T=0,C=1,A=2,G=3, TTT=0, TTC=1, TCT=4, etc

                    -- NOTE: this order does NOT correspond to a Seq-data

                    -- encoding.  It is "natural" to codon usage instead.

                    -- the value in each cell is the AA coded for

                    -- start= AA coded only if first in peptide

                    --   in start array, if codon is not a legitimate start

                    --   codon, that cell will have the "gap" symbol for

                    --   that alphabet.  Otherwise it will have the AA

                    --   encoded when that codon is used at the start.

 

Genetic-code ::= SET OF CHOICE {

    name VisibleString ,               -- name of a code

    id INTEGER ,                       -- id in dbase

    ncbieaa VisibleString ,            -- indexed to IUPAC extended

    ncbi8aa OCTET STRING ,             -- indexed to NCBI8aa

   ncbistdaa OCTET STRING ,           -- indexed to NCBIstdaa

    sncbieaa VisibleString ,            -- start, indexed to IUPAC extended

    sncbi8aa OCTET STRING ,             -- start, indexed to NCBI8aa

   sncbistdaa OCTET STRING }           -- start, indexed to NCBIstdaa

 

Code-break ::= SEQUENCE {              -- specific codon exceptions

    loc Seq-loc ,                      -- location of exception

    aa CHOICE {                        -- the amino acid

        ncbieaa INTEGER ,              -- ASCII value of NCBIeaa code

        ncbi8aa INTEGER ,              -- NCBI8aa code

       ncbistdaa INTEGER } }           -- NCBIstdaa code

 

Genetic-code-table ::= SET OF Genetic-code     -- table of genetic codes

 

--*** Import ***********************************************

--*

--*  Features imported from other databases

--*

 

Imp-feat ::= SEQUENCE {

    key VisibleString ,

    loc VisibleString OPTIONAL ,         -- original location string

    descr VisibleString OPTIONAL }       -- text description

 

Gb-qual ::= SEQUENCE {

    qual VisibleString ,

    val VisibleString }

 

END

 

--**********************************************************************

--

--  NCBI Restriction Sites

--  by James Ostell, 1990

--  version 0.8

--

--**********************************************************************

 

NCBI-Rsite DEFINITIONS ::=

BEGIN

 

EXPORTS Rsite-ref;

 

IMPORTS Dbtag FROM NCBI-General;

 

Rsite-ref ::= CHOICE {

    str VisibleString ,     -- may be unparsable

    db  Dbtag }             -- pointer to a restriction site database

 

END

 

--**********************************************************************

--

--  NCBI RNAs

--  by James Ostell, 1990

--  version 0.8

--

--**********************************************************************

 

NCBI-RNA DEFINITIONS ::=

BEGIN

 

EXPORTS RNA-ref, Trna-ext;

 

--*** rnas ***********************************************

--*

--*  various rnas

--*

                         -- minimal RNA sequence

RNA-ref ::= SEQUENCE {

    type ENUMERATED {            -- type of RNA feature

        unknown (0) ,

        premsg (1) ,

        mRNA (2) ,

        tRNA (3) ,

        rRNA (4) ,

        snRNA (5) ,

        scRNA (6) ,

        other (255) } ,

    pseudo BOOLEAN OPTIONAL , 

    ext CHOICE {

        name VisibleString ,        -- for naming "other" type

        tRNA Trna-ext } OPTIONAL }  -- for tRNAs

 

Trna-ext ::= SEQUENCE {                -- tRNA feature extensions

    aa CHOICE {                         -- aa this carries

        iupacaa INTEGER ,

        ncbieaa INTEGER ,

        ncbi8aa INTEGER ,

       ncbistdaa INTEGER } OPTIONAL ,

    codon SET OF INTEGER OPTIONAL }     -- codon(s) as in Genetic-code

                                        -- NOT anti-codons

END

 

--**********************************************************************

--

--  NCBI Genes

--  by James Ostell, 1990

--  version 0.8

--

--**********************************************************************

 

NCBI-Gene DEFINITIONS ::=

BEGIN

 

EXPORTS Gene-ref;

 

IMPORTS Dbtag FROM NCBI-General;

 

--*** Gene ***********************************************

--*

--*  reference to a gene

--*

 

Gene-ref ::= SEQUENCE {

    locus VisibleString OPTIONAL ,     -- Official gene symbol

    allele VisibleString OPTIONAL ,    -- Official allele designation

    desc VisibleString OPTIONAL ,      -- descriptive name

    maploc VisibleString OPTIONAL ,    -- descriptive map location

    pseudo BOOLEAN DEFAULT FALSE ,          -- pseudogene

    db SET OF Dbtag OPTIONAL ,      -- ids in other dbases

   syn SET OF VisibleString OPTIONAL }      -- synonyms for locus

 

END

 

--**********************************************************************

--

--  NCBI Organism

--  by James Ostell, 1990

--  version 0.8

--

--**********************************************************************

 

NCBI-Organism DEFINITIONS ::=

BEGIN

 

EXPORTS Org-ref;

 

IMPORTS Dbtag FROM NCBI-General;

 

--*** Org-ref ***********************************************

--*

--*  Reference to an organism

--*

 

Org-ref ::= SEQUENCE {

    taxname VisibleString OPTIONAL ,   -- scientific name

    common VisibleString OPTIONAL ,    -- common name

    mod SET OF VisibleString OPTIONAL , -- modifier for tissue/strain/line

    db SET OF Dbtag OPTIONAL ,         -- ids in other dbases

    syn SET OF VisibleString OPTIONAL }  -- synonyms for taxname or common

 

END

 

--**********************************************************************

--

--  NCBI Protein

--  by James Ostell, 1990

--  version 0.8

--

--**********************************************************************

 

NCBI-Protein DEFINITIONS ::=

BEGIN

 

EXPORTS Prot-ref;

 

IMPORTS Dbtag FROM NCBI-General;

 

--*** Prot-ref ***********************************************

--*

--*  Reference to a protein name

--*

 

Prot-ref ::= SEQUENCE {

    name SET OF VisibleString OPTIONAL ,      -- protein name

    desc VisibleString OPTIONAL ,      -- description (instead of name)

    ec SET OF VisibleString OPTIONAL , -- E.C. number(s)

    activity SET OF VisibleString OPTIONAL ,  -- activities

    db SET OF Dbtag OPTIONAL }         -- ids in other dbases

 

 

 

END

--********************************************************************

--

--  Transcription Initiation Site Feature Data Block

--  James Ostell, 1991

--  Philip Bucher, David Ghosh

--  version 1.1

--

-- 

--

--********************************************************************

 

NCBI-TxInit DEFINITIONS ::=

BEGIN

 

EXPORTS Txinit;

 

IMPORTS Gene-ref, Prot-ref, Org-ref FROM NCBI-SeqFeat;

 

Txinit ::= SEQUENCE {

    name VisibleString ,    -- descriptive name of initiation site

    syn SEQUENCE OF VisibleString OPTIONAL ,   -- synonyms

    gene SEQUENCE OF Gene-ref OPTIONAL ,  -- gene(s) transcribed

    protein SEQUENCE OF Prot-ref OPTIONAL ,   -- protein(s) produced

    rna SEQUENCE OF VisibleString OPTIONAL ,  -- rna(s) produced

    expression VisibleString OPTIONAL ,  -- tissue/time of expression

    txsystem ENUMERATED {       -- transcription apparatus used at this site

        unknown (0) ,

        pol1 (1) ,      -- eukaryotic Pol I

        pol2 (2) ,      -- eukaryotic Pol II

        pol3 (3) ,      -- eukaryotic Pol III

        bacterial (4) ,

        viral (5) ,

        rna (6) ,       -- RNA replicase

        organelle (7) ,

        other (255) } ,

    txdescr VisibleString OPTIONAL ,   -- modifiers on txsystem

    txorg Org-ref OPTIONAL ,  -- organism supplying transcription apparatus

    mapping-precise BOOLEAN DEFAULT FALSE ,  -- mapping precise or approx

    location-accurate BOOLEAN DEFAULT FALSE , -- does Seq-loc reflect mapping

    inittype ENUMERATED {

        unknown (0) ,

        single (1) ,

        multiple (2) ,

        region (3) } OPTIONAL ,

    evidence SET OF Tx-evidence OPTIONAL }

 

Tx-evidence ::= SEQUENCE {

    exp-code ENUMERATED {

        unknown (0) ,   

        rna-seq (1) ,   -- direct RNA sequencing

        rna-size (2) ,  -- RNA length measurement

        np-map (3) ,    -- nuclease protection mapping with homologous sequence ladder

        np-size (4) ,   -- nuclease protected fragment length measurement

        pe-seq (5) ,    -- dideoxy RNA sequencing

        cDNA-seq (6) ,  -- full-length cDNA sequencing

        pe-map (7) ,    -- primer extension mapping with homologous sequence ladder   

        pe-size (8) ,   -- primer extension product length measurement

        pseudo-seq (9) , -- full-length processed pseudogene sequencing

       rev-pe-map (10) ,   -- see NOTE (1) below

        other (255) } ,

    expression-system ENUMERATED {

        unknown (0) ,

        physiological (1) ,

        in-vitro (2) ,

        oocyte (3) ,

        transfection (4) ,

        transgenic (5) ,

        other (255) } DEFAULT physiological ,

    low-prec-data BOOLEAN DEFAULT FALSE ,

    from-homolog BOOLEAN DEFAULT FALSE }     -- experiment actually done on

                                             --  close homolog

 

   -- NOTE (1) length measurement of a reverse direction primer-extension

   --        product (blocked  by  RNA  5'end)  by  comparison with

   --        homologous sequence ladder (J. Mol. Biol. 199, 587)

 

   

END

C Structures and Functions: objfeat.h

/*  objfeat.h

* ===========================================================================

*

*                            PUBLIC DOMAIN NOTICE                         

*               National Center for Biotechnology Information

*                                                                          

*  This software/database is a "United States Government Work" under the  

*  terms of the United States Copyright Act.  It was written as part of   

*  the author's official duties as a United States Government employee and

*  thus cannot be copyrighted.  This software/database is freely available

*  to the public for use. The National Library of Medicine and the U.S.   

*  Government have not placed any restriction on its use or reproduction. 

*                                                                          

*  Although all reasonable efforts have been taken to ensure the accuracy 

*  and reliability of the software and data, the NLM and the U.S.         

*  Government do not and cannot warrant the performance or results that   

*  may be obtained by using this software or data. The NLM and the U.S.   

*  Government disclaim all warranties, express or implied, including      

*  warranties of performance, merchantability or fitness for any particular

*  purpose.                                                                

*                                                                         

*  Please cite the author in any work or product based on this material.  

*

* ===========================================================================

*

* File Name:  objfeat.h

*

* Author:  James Ostell

*  

* Version Creation Date: 4/1/91

*

* $Revision: 2.0 $

*

* File Description:  Object manager interface for module NCBI-SeqFeat

*

* Modifications: 

* --------------------------------------------------------------------------

* Date    Name        Description of modification

* -------  ----------  -----------------------------------------------------

*

*

* ==========================================================================

*/

 

#ifndef _NCBI_Seqfeat_

#define _NCBI_Seqfeat_

 

#ifndef _ASNTOOL_

#include <asn.h>

#endif

#ifndef _NCBI_General_

#include <objgen.h>

#endif

#ifndef _NCBI_Seqloc_

#include <objloc.h>

#endif

#ifndef _NCBI_Pub_

#include <objpub.h>

#endif

#ifndef _NCBI_Pubdesc_

#include <objpubd.h>

#endif

 

#ifdef __cplusplus

extern "C" {

#endif

 

/*****************************************************************************

*

*   loader

*

*****************************************************************************/

extern Boolean SeqFeatAsnLoad PROTO((void));

 

/*****************************************************************************

*

*   GBQual

*

*****************************************************************************/

typedef struct gbqual {

    CharPtr qual,

        val;

    struct gbqual PNTR next;

} GBQual, PNTR GBQualPtr;

 

GBQualPtr GBQualNew PROTO((void));

Boolean GBQualAsnWrite PROTO((GBQualPtr gbp, AsnIoPtr aip, AsnTypePtr atp));

GBQualPtr GBQualAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

GBQualPtr GBQualFree PROTO((GBQualPtr gbp));

 

/*****************************************************************************

*

*   SeqFeatXref

*      cross references between features

*

*****************************************************************************/

typedef struct seqfeatxref {

    Choice id;     

    Choice data;

    struct seqfeatxref PNTR next;

} SeqFeatXref, PNTR SeqFeatXrefPtr;

 

SeqFeatXrefPtr SeqFeatXrefNew PROTO((void));

Boolean SeqFeatXrefAsnWrite PROTO((SeqFeatXrefPtr sfxp, AsnIoPtr aip, AsnTypePtr atp));

SeqFeatXrefPtr SeqFeatXrefAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

SeqFeatXrefPtr SeqFeatXrefFree PROTO((SeqFeatXrefPtr sfxp));

                       /* free frees whole chain of SeqFeatXref */

/*****************************************************************************

*

*   SeqFeat

*     Feat-id is built into idtype/id

*       1=gibb (in id.intvalue)

*       2=giim (id.ptrvalue)

*       3=local (id.ptrvalue to Object-id)

*       4=general (id.ptrvalue to Dbtag)

*     SeqFeatData is built into datatype/data

*       datatype gives type of SeqFeatData:

*   0 = not set

    1 = gene, data.value.ptrvalue = Gene-ref ,

    2 = org , data.value.ptrvalue = Org-ref ,

    3 = cdregion, data.value.ptrvalue = Cdregion ,

    4 = prot , data.value.ptrvalue = Prot-ref ,

    5 = rna, data.value.ptrvalue = RNA-ref ,

    6 = pub, data.value.ptrvalue = Pubdesc ,  -- publication applies to this seq

    7 = seq, data.value.ptrvalue = Seq-loc ,  -- for tracking source of a seq.

    8 = imp, data.value.ptrvalue = Imp-feat ,

    9 = region, data.value.ptrvalue= VisibleString,      -- for a name

    10 = comment, data.value.ptrvalue= NULL ,             -- just a comment

    11 = bond, data.value.intvalue = ENUMERATED {

        disulfide (1) ,

        thiolester (2) ,

        xlink (3) ,

        other (255) } ,

    12 = site, data.value.intvalue = ENUMERATED {

        active (1) ,

        binding (2) ,

        cleavage (3) ,

        inhibit (4) ,

        modified (5),

        other (255) } ,

    13 = rsite, data.value.ptrvalue = Rsite-ref ,

    14 = user, data.value.ptrvalue = UserObjectPtr ,

    15 = txinit, data.value.ptrvalue = TxinitPtr ,

    16 = num, data.value.ptrvalue = NumberingPtr ,   -- a numbering system

    17 = psec-str, data.value.intvalue = ENUMERATED {   -- protein secondary structure

        helix (1) ,         -- any helix

        sheet (2) ,         -- beta sheet

        turn  (3) } ,       -- beta or gamma turn

    18 = non-std-residue, data.value.ptrvalue = VisibleString ,  -- non-standard residue here in seq

    19 = het, data.value.ptrvalue = CharPtr Heterogen   -- cofactor, prosthetic grp, etc, bound to seq

*  

*

*****************************************************************************/

typedef struct seqfeat {

    Choice id;     

    Choice data;

    Boolean partial ,

        except;

    CharPtr comment;

    ValNodePtr product ,

        location;

    GBQualPtr qual;

    CharPtr title;

    UserObjectPtr ext;

    ValNodePtr cit;       /* citations (Pub-set)  */

    Uint1 exp_ev;

    SeqFeatXrefPtr xref;

    struct seqfeat PNTR next;

} SeqFeat, PNTR SeqFeatPtr;

 

SeqFeatPtr SeqFeatNew PROTO((void));

Boolean SeqFeatAsnWrite PROTO((SeqFeatPtr anp, AsnIoPtr aip, AsnTypePtr atp));

SeqFeatPtr SeqFeatAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

SeqFeatPtr SeqFeatFree PROTO((SeqFeatPtr anp));

 

     /* get a SeqFeatXref from a feature.  Currently only Prot-ref and */

     /* Gene-ref are supported */

 

SeqFeatXrefPtr SeqFeatToXref PROTO((SeqFeatPtr sfp));

 

/*****************************************************************************

*

*   SeqFeatId - used as parts of other things, so is not allocated itself

*

*****************************************************************************/

void SeqFeatIdFree PROTO((ChoicePtr cp));  /* does NOT free cp itself */

Boolean SeqFeatIdAsnWrite PROTO((ChoicePtr cp, AsnIoPtr aip, AsnTypePtr orig));

Boolean SeqFeatIdAsnRead PROTO((AsnIoPtr aip, AsnTypePtr orig, ChoicePtr cp));

       /** NOTE: SeqFeatIdAsnRead() does NOT allocate cp ***/

Boolean SeqFeatIdDup PROTO((ChoicePtr dest, ChoicePtr src));

 

/*****************************************************************************

*

*   SeqFeatData - used as parts of other things, so is not allocated itself

*

*****************************************************************************/

void SeqFeatDataFree PROTO((ChoicePtr cp));  /* does NOT free cp itself */

Boolean SeqFeatDataAsnWrite PROTO((ChoicePtr cp, AsnIoPtr aip, AsnTypePtr orig));

Boolean SeqFeatDataAsnRead PROTO((AsnIoPtr aip, AsnTypePtr orig, ChoicePtr cp));

       /** NOTE: SeqFeatDataAsnRead() does NOT allocate cp ***/

 

/*****************************************************************************

*

*   SeqFeatSet - sets of seqfeats

*

*****************************************************************************/

Boolean SeqFeatSetAsnWrite PROTO((SeqFeatPtr anp, AsnIoPtr aip, AsnTypePtr set, AsnTypePtr element));

SeqFeatPtr SeqFeatSetAsnRead PROTO((AsnIoPtr aip, AsnTypePtr set, AsnTypePtr element));

 

/*****************************************************************************

*

*   CodeBreak

*

*****************************************************************************/

typedef struct cb {

    SeqLocPtr loc;          /* the Seq-loc */

    Choice aa;              /* 1=ncbieaa, 2=ncbi8aa, 3=ncbistdaa */

    struct cb PNTR next;

} CodeBreak, PNTR CodeBreakPtr;

 

CodeBreakPtr CodeBreakNew PROTO((void));

Boolean CodeBreakAsnWrite PROTO((CodeBreakPtr cbp, AsnIoPtr aip, AsnTypePtr atp));

CodeBreakPtr CodeBreakAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

CodeBreakPtr CodeBreakFree PROTO((CodeBreakPtr cbp));

 

/*****************************************************************************

*

*   CdRegion

*

*****************************************************************************/

typedef struct cdregion {

    Boolean orf;

    Uint1 frame;

    Boolean conflict;

    Uint1 gaps,                         /* 255 = any number > 254 */

        mismatch,

        stops;

    ValNodePtr genetic_code;                 /* NULL = not set */

    CodeBreakPtr code_break;

} CdRegion, PNTR CdRegionPtr;

 

CdRegionPtr CdRegionNew PROTO((void));

Boolean CdRegionAsnWrite PROTO((CdRegionPtr cdp, AsnIoPtr aip, AsnTypePtr atp));

CdRegionPtr CdRegionAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

CdRegionPtr CdRegionFree PROTO((CdRegionPtr cdp));

 

/*****************************************************************************

*

*   GeneticCode

*

*      ncbieaa, ncbi8aa, ncbistdaa

*      are arrays 64 cells long, where each cell gives the aa produced

*      by triplets coded by T=0, C=1, A=2, G=3

*      TTT = cell[0]

*      TTC = cell[1]

*      TTA = cell[2]

*      TTG = cell[3]

*      TCT = cell[4]

*      ((base1 * 16) + (base2 * 4) + (base3)) = cell in table

*

*      sncbieaa, sncbi8aa, sncbistdaa

*      are the same arrays as above, except that the AAs they code for apply

*      only to the first AA of a peptide.  This accommodates alternate start

*      codons.  If a codon is not a valid start, the cell contains the "gap"

*      symbol instead of an AA.

*

*      in both cases, IUPAC cannot be used because it has no symbol for

*       stop.

*     

*

*   GeneticCode is a ValNodePtr so variable numbers of elements are

*      easily accommodated.  A ValNodePtr with choice = 254 is the head

*       of the list.  Its elements are a chain of ValNodes beginning with

*       the data.ptrvalue of the GeneticCode (head).  GeneticCodeNew()

*       returns the head.

*  

*   Types in ValNodePtr->choice are:

*      0 = not set

*      1 = name (CharPtr in ptrvalue)

*      2 = id (in intvalue)

*      3 = ncbieaa (CharPtr in ptrvalue)

*      4 = ncbi8aa (ByteStorePtr in ptrvalue)

*      5 = ncbistdaa (ByteStorePtr in ptrvalue)

*      6 = sncbieaa (CharPtr in ptrvalue)

*      7 = sncbi8aa (ByteStorePtr in ptrvalue)

*      8 = sncbistdaa (ByteStorePtr in ptrvalue)

*      255 = read unrecognized type, but passed ASN.1

*  

*****************************************************************************/

typedef ValNodePtr GeneticCodePtr;

 

GeneticCodePtr GeneticCodeNew PROTO((void));

Boolean GeneticCodeAsnWrite PROTO((GeneticCodePtr gcp, AsnIoPtr aip, AsnTypePtr atp));

GeneticCodePtr GeneticCodeAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

GeneticCodePtr GeneticCodeFree PROTO((GeneticCodePtr gcp));

 

Boolean GeneticCodeTableAsnWrite PROTO((GeneticCodePtr gcp, AsnIoPtr aip, AsnTypePtr atp));

GeneticCodePtr GeneticCodeTableAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

 

GeneticCodePtr GeneticCodeFind PROTO((Int4 id, CharPtr name));

GeneticCodePtr GeneticCodeTableLoad PROTO((void));

 

/*****************************************************************************

*

*   ImpFeat

*

*****************************************************************************/

typedef struct impfeat {

    CharPtr key,

        loc,

        descr;

} ImpFeat, PNTR ImpFeatPtr;

 

ImpFeatPtr ImpFeatNew PROTO((void));

Boolean ImpFeatAsnWrite PROTO((ImpFeatPtr ifp, AsnIoPtr aip, AsnTypePtr atp));

ImpFeatPtr ImpFeatAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

ImpFeatPtr ImpFeatFree PROTO((ImpFeatPtr ifp));

 

/*****************************************************************************

*

*   RnaRef

*    Choice used for extensions

*      0 = no extension

*      1 = name, ext.value.ptrvalue = CharPtr

*      2 = trna, ext.value.ptrvalue = tRNA

*

*****************************************************************************/

typedef struct rnaref {

    Uint1 type;

    Boolean pseudo;

    Choice ext;

} RnaRef, PNTR RnaRefPtr;

 

RnaRefPtr RnaRefNew PROTO((void));

Boolean RnaRefAsnWrite PROTO((RnaRefPtr rrp, AsnIoPtr aip, AsnTypePtr atp));

RnaRefPtr RnaRefAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

RnaRefPtr RnaRefFree PROTO((RnaRefPtr rrp));

 

/*****************************************************************************

*

*   tRNA

*

*****************************************************************************/

typedef struct trna {

    Uint1 aatype,  /* 0=not set, 1=iupacaa, 2=ncbieaa, 3=ncbi8aa, 4=ncbistdaa */

        aa;        /* the aa transferred in above code */

    Uint1 codon[6];    /* codons recognized, coded as for Genetic-code */

} tRNA, PNTR tRNAPtr;   /*  0-63 = codon,  255=no data in cell */

 

/*****************************************************************************

*

*   GeneRef

*

*****************************************************************************/

typedef struct generef {

    CharPtr locus,

        allele,

        desc,

        maploc;

    Boolean pseudo;

    ValNodePtr db;          /* ids in other databases */

    ValNodePtr syn;         /* synonyms for locus */

} GeneRef, PNTR GeneRefPtr;

 

GeneRefPtr GeneRefNew PROTO((void));

Boolean GeneRefAsnWrite PROTO((GeneRefPtr grp, AsnIoPtr aip, AsnTypePtr atp));

GeneRefPtr GeneRefAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

GeneRefPtr GeneRefFree PROTO((GeneRefPtr grp));

GeneRefPtr GeneRefDup PROTO((GeneRefPtr grp));

 

/*****************************************************************************

*

*   OrgRef

*

*****************************************************************************/

typedef struct orgref {

    CharPtr taxname,

        common;

    ValNodePtr mod;

    ValNodePtr db;          /* ids in other databases */

    ValNodePtr syn;         /* synonyms for taxname and/or common */

} OrgRef, PNTR OrgRefPtr;

 

OrgRefPtr OrgRefNew PROTO((void));

Boolean OrgRefAsnWrite PROTO((OrgRefPtr orp, AsnIoPtr aip, AsnTypePtr atp));

OrgRefPtr OrgRefAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

OrgRefPtr OrgRefFree PROTO((OrgRefPtr orp));

 

/*****************************************************************************

*

*   ProtRef

*

*****************************************************************************/

typedef struct protref {

    ValNodePtr name;

    CharPtr desc;

    ValNodePtr ec,

        activity;

    ValNodePtr db;          /* ids in other databases */

} ProtRef, PNTR ProtRefPtr;

 

ProtRefPtr ProtRefNew PROTO((void));

Boolean ProtRefAsnWrite PROTO((ProtRefPtr orp, AsnIoPtr aip, AsnTypePtr atp));

ProtRefPtr ProtRefAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

ProtRefPtr ProtRefFree PROTO((ProtRefPtr orp));

ProtRefPtr ProtRefDup PROTO((ProtRefPtr orp));

 

/*****************************************************************************

*

*   RsiteRef

*       uses a ValNode

*       choice = 1 = str

*                2 = db

*

*****************************************************************************/

typedef ValNodePtr RsiteRefPtr;

 

Boolean RsiteRefAsnWrite PROTO((RsiteRefPtr orp, AsnIoPtr aip, AsnTypePtr atp));

RsiteRefPtr RsiteRefAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

RsiteRefPtr RsiteRefFree PROTO((RsiteRefPtr orp));

 

/*****************************************************************************

*

*   Txinit

*       Transcription initiation site

*

*****************************************************************************/

typedef struct txevidence {

    Uint1 exp_code ,

        exp_sys ;

    Boolean low_prec_data ,

        from_homolog;

    struct txevidence PNTR next;

} TxEvidence, PNTR TxEvidencePtr;

 

typedef struct txinit {

    CharPtr name;

    ValNodePtr syn ,

        gene ,

        protein ,

        rna ;

    CharPtr expression;

    Uint1 txsystem;

    CharPtr txdescr;

    OrgRefPtr txorg;

    Boolean mapping_precise,

        location_accurate;

    Uint1 inittype;              /* 255 if not set */

    TxEvidencePtr evidence;

} Txinit, PNTR TxinitPtr;

 

TxinitPtr TxinitNew PROTO((void));

Boolean TxinitAsnWrite PROTO((TxinitPtr txp, AsnIoPtr aip, AsnTypePtr atp));

TxinitPtr TxinitAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

TxinitPtr TxinitFree PROTO((TxinitPtr txp));

 

 

 

#ifdef __cplusplus

}

#endif

 

#endif
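
The GeneticCode comment in the header above describes how a codon selects a cell in a 64-cell table: bases are coded T=0, C=1, A=2, G=3, and the cell is (base1 * 16) + (base2 * 4) + base3. The following is a minimal sketch of that lookup, not part of objfeat.h. The function names base_code, codon_index, and translate_codon are hypothetical, and the table string is assumed to be the ncbieaa row of the standard genetic code (id 1) as distributed in gc.prt. A real program would obtain the table through GeneticCodeFind() rather than hard-coding it.

```c
#include <stddef.h>

/* Assumption: ncbieaa string for the standard genetic code (id 1).
   64 cells, indexed as described in the GeneticCode comment;
   '*' marks a stop codon. */
static const char standard_ncbieaa[] =
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

/* Map a base letter to its table code: T=0, C=1, A=2, G=3.
   Returns -1 for any other character. */
static int base_code(char base)
{
    switch (base) {
        case 'T': case 't': return 0;
        case 'C': case 'c': return 1;
        case 'A': case 'a': return 2;
        case 'G': case 'g': return 3;
        default:            return -1;
    }
}

/* Compute the cell index for a three-letter codon:
   (base1 * 16) + (base2 * 4) + base3.
   Returns -1 if the codon is NULL, too short, or has an unknown base. */
static int codon_index(const char *codon)
{
    int b1, b2, b3;

    if (codon == NULL || codon[0] == '\0' || codon[1] == '\0' || codon[2] == '\0')
        return -1;

    b1 = base_code(codon[0]);
    b2 = base_code(codon[1]);
    b3 = base_code(codon[2]);
    if (b1 < 0 || b2 < 0 || b3 < 0)
        return -1;

    return (b1 * 16) + (b2 * 4) + b3;
}

/* Look up the amino acid for a codon in a 64-cell ncbieaa table.
   Returns 'X' when the codon cannot be indexed. */
static char translate_codon(const char *table, const char *codon)
{
    int cell = codon_index(codon);

    if (table == NULL || cell < 0)
        return 'X';
    return table[cell];
}
```

With this sketch, codon_index("TCT") gives 4, matching the "TCT = cell[4]" line in the header comment. The same index would apply to the sncbieaa start-codon table, where a cell holding the gap symbol means the codon is not a valid start.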