General Use Objects

Introduction
Large Text Blocks: StringStore
The Date
Identifying Things: Object-id
Identifying Things: Dbtag
Identifying People: Person-id
Expressing Uncertainty with Fuzzy Integers: Int-fuzz
Creating Your Own Objects: User-object
ASN.1 Specification: general.asn
C Structures and Functions: objgen.h

Introduction

This section presents the data objects defined in general.asn and objgen.[ch]. They are a miscellaneous collection of generally useful types.

Large Text Blocks: StringStore

A StringStore is defined as a VisibleString for ASN.1 encoding. This type is used to hold very long strings. It is simply a hint to the AsnLib functions to store the incoming data in a ByteStore (see CoreLib chapter) rather than an array to avoid overrunning allocation limits of some computers. OCTET STRINGs (a sequence of opaque bytes) are always kept in ByteStore structures since the length of the object must be stored as well (no terminating '\0' is possible). ByteStores have the advantage of segmenting the long strings, which for nucleic acid data can get very long. The ByteStore will allow us to add data buffering to disk for these large objects as it becomes necessary even on large computers.

The Date

ASN.1 has primitive types for recording dates, but they require the time in seconds as well. For scientific and bibliographic data, it is common that only the date, or even just a part of the date (e.g. month and year), is available. Rather than use artificial zero values for the more precise ASN.1 form, we have created a specialized Date type. Date is a CHOICE of a simple, unparsed string or a structured Date-std. The string form is a fall-back for when the input data is so poorly structured that it is impossible to reliably parse the date fields from it. It should only be used as a last resort to accommodate old data, as it is impossible to compute or index on.

When possible, the "std" form of the Date should be used. In this case year is an integer (e.g. 1992), month is an integer from 1-12 (where January is 1), and day is an integer from 1-31. A string called "season" can be used, particularly for bibliographic citations (e.g. the "spring" issue). When a range of months is given for an issue (e.g. "June‑July") it cannot be represented directly. However, one would like to be able to index on integer months but still not lose the range. This is accomplished by putting 6 in the "month" slot and "‑July" in the "season" slot. The DatePrint() function will put them back together for display, but the issue can still be indexed by month. Year is the only required field in a Date-std.

The "C" structure used for Date can accommodate both the representation of the CHOICE itself (which kind of Date is this?) and the data from either CHOICE. It has a four byte array and a CharPtr. The byte[0] indicates what kind of Date it is. If a "str" type, then the CharPtr points to the string and the other three bytes in the array have no meaning. If a "std" type, then the byte[1] is the year (minus 1900 to save space - the object loaders will add the 1900 back when encoding into ASN.1), byte[2] is the month (or 0 if not given), and byte[3] is the day (or 0 if not given). If the CharPtr is NULL, then the season is not given.
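The byte layout described above can be sketched as follows. This is an illustration only: SketchDate, sketch_date_set_std() and sketch_date_year() are hypothetical names, not the toolkit's own structure or functions. It shows how storing the year minus 1900 in a single byte works, and that a one-byte slot can therefore only hold the years 1900 through 2155.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the Date structure described in the text:
   a four-byte array plus a string pointer. */
typedef struct sketch_date {
    unsigned char data[4];  /* [0]=kind (0=str, 1=std), [1]=year-1900,
                               [2]=month or 0, [3]=day or 0 */
    const char *str;        /* unparsed date, or season, or NULL */
} SketchDate;

/* Fill a "std" date. Pass 0 for month or day to mean "not given". */
static void sketch_date_set_std(SketchDate *d, int year, int month, int day,
                                const char *season)
{
    assert(d != NULL);
    assert(year >= 1900 && year <= 1900 + 255);  /* must fit in one byte */
    d->data[0] = 1;                              /* 1 = std */
    d->data[1] = (unsigned char)(year - 1900);
    d->data[2] = (unsigned char)month;
    d->data[3] = (unsigned char)day;
    d->str = season;                             /* NULL means no season */
}

/* Recover the full year from a "std" date by adding the 1900 back. */
static int sketch_date_year(const SketchDate *d)
{
    assert(d != NULL && d->data[0] == 1);
    return 1900 + d->data[1];
}
```

With this layout, the "June-July" 1992 issue mentioned earlier would be stored with 92 in byte[1], 6 in byte[2], 0 in byte[3], and "-July" as the season string.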

The object loaders contain a number of handy functions for working with Dates in addition to the usual New(), Free(), AsnRead() and AsnWrite() functions. DateWrite() will fill a Date.std with the function arguments. DateRead() will fill pointer arguments with the values from a Date. DateCurr() will create and return a Date.std filled with the current date by accessing the computer system. DateDup() will create a copy of a Date. DatePrint() will format a Date for display into a buffer supplied by the caller. This buffer should normally be at least 30 bytes long. The format is e.g. "Jun 30, 1992".

DateMatch(a, b, all) will return 0 if Date a is the same as Date b, 1 if b is after a, -1 if b is before a. It will return a 2 or -2 (for sorting) if they are different Date types (str and std) that could not be compared. If all is equal to TRUE, then all fields that are set in one Date must be set and must match in the other Date. If all is equal to FALSE, then only the fields set in both are matched. Note that this function can only measure if one date is before another chronologically if both are Date-std types. The string Date types can only be compared lexically (like strcmp()).
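The effect of the "all" argument can be illustrated with a small sketch for two Date-std values. This is not the toolkit's DateMatch(): the names SketchStdDate, sketch_field_match() and sketch_date_match() are hypothetical, the str case (and the 2 or -2 result) is left out, and the sign returned when "all" is TRUE and a field is set in only one date is an assumption, since the text does not specify it.

```c
#include <assert.h>

/* Hypothetical stand-in for a Date-std. 0 in month or day means "not given". */
typedef struct sketch_std_date {
    int year;
    int month;
    int day;
} SketchStdDate;

/* Compare one field of a against the same field of b.
   Returns 0 when the fields agree or are skipped. */
static int sketch_field_match(int fa, int fb, int all)
{
    if (fa == 0 && fb == 0) return 0;   /* set in neither date */
    if (fa == 0 || fb == 0) {
        if (!all) return 0;             /* only fields set in both are matched */
        return (fa == 0) ? 1 : -1;      /* assumption: a missing field sorts first */
    }
    if (fb > fa) return 1;              /* b is after a */
    if (fb < fa) return -1;             /* b is before a */
    return 0;
}

/* Compare year, then month, then day, stopping at the first difference. */
static int sketch_date_match(const SketchStdDate *a, const SketchStdDate *b,
                             int all)
{
    int r;
    assert(a != 0 && b != 0);
    r = sketch_field_match(a->year, b->year, all);
    if (r == 0) r = sketch_field_match(a->month, b->month, all);
    if (r == 0) r = sketch_field_match(a->day, b->day, all);
    return r;
}
```

Under these assumptions, "June 1992" and "June 30, 1992" match when "all" is FALSE, because the day is set in only one of them, but do not match when "all" is TRUE.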

Identifying Things: Object-id

An Object-id is a simple structure used to identify a data object. It is just a CHOICE of an INTEGER or a VisibleString. It must always be used within some defining context (e.g. see Dbtag below) in order to have some global meaning. It allows flexibility in a host system's preference for identifying things by integers or strings.

The ObjectId "C" structure has a 4 byte integer slot and a CharPtr slot. If the CharPtr is NULL, then the integer value is the identifier and the type is "int". If the CharPtr is not NULL, then the Object-id is type "str" and the CharPtr is considered to point at the identifier.

There is an ObjectIdDup() function to make a copy of an ObjectId and an ObjectIdMatch() function which returns TRUE if two ObjectIds are identical, FALSE if they are not.
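The rule that a NULL string pointer selects the integer form can be sketched as below. SketchObjectId and sketch_object_id_match() are hypothetical names used for illustration, not the toolkit's ObjectId or ObjectIdMatch(). Comparing strings case-sensitively, and treating an "int" id and a "str" id as never equal, are assumptions of this sketch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the ObjectId structure described above. */
typedef struct sketch_object_id {
    long id;          /* the identifier when str is NULL */
    const char *str;  /* non-NULL means the identifier is this string */
} SketchObjectId;

/* Return 1 if the two ids are identical, 0 otherwise. */
static int sketch_object_id_match(const SketchObjectId *a,
                                  const SketchObjectId *b)
{
    assert(a != NULL && b != NULL);
    if (a->str == NULL && b->str == NULL)
        return a->id == b->id;               /* both are type "int" */
    if (a->str != NULL && b->str != NULL)
        return strcmp(a->str, b->str) == 0;  /* both are type "str" */
    return 0;                                /* one "int", one "str" */
}
```

Note that once the string pointer is set, the integer slot is ignored, so two "str" ids with the same text match regardless of what the integer slot holds.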

Identifying Things: Dbtag

A Dbtag is an Object-id within the context of a database. The database is just defined by a VisibleString. The strings identifying the database are not centrally controlled, so it is possible that a conflict could occur. If there is a proliferation of Dbtags, then a registry might be considered at NCBI. Dbtags provide a simple, general way for small database providers to supply their own internal identifiers in a way which will, usually, be globally unique as well, yet requires no official sanction. For example, identifiers for features on sequences are not widely available at the present time. However, the Eukaryotic Promoter Database (EPD) can be provided as a set of features on sequences. The internal key to each EPD entry can be propagated as the Feature-id by using a Dbtag where "EPD" is the "db" field and an integer is used in the Object-id, which is the same integer identifying the entry in the normal EPD release.

As for ObjectIds, there are DbtagMatch() and DbtagDup() functions in the object loaders.

Identifying People: Person-id

Person-id provides an extremely flexible way to identify people. There are four CHOICES from very explicit to completely unstructured. When one is building a database, one should select the most structured form possible. However, when one is processing data from other sources, one can only pick the most structured form possible, given the input data.

The first Person-id CHOICE is a Dbtag. It would allow people to be identified by some formal registry. For example, in the USA, it might be possible to identify people by Social Security Number. Theoretically, one could then maintain a link to a person in a database, even if they changed their name. Dbtag would allow other registries, such as professional societies, to be used as well. Frankly, this may be wishful thinking and possibly even socially inadvisable, though from a database standpoint, it would be very useful to have some stable identifier for people.

A Name-std CHOICE is the next most explicit form. It allows a structured, fielded name, which makes it possible to index by last name and to disambiguate (say, "Jones") by first name. This is the best choice when the data is available, and its use should be encouraged by those building new databases wherever reasonable.

The last two choices are string types. MEDLINE stores names in strings in a structured way (e.g. Jones JM). This means one can usually, but not always, parse out last names and can generally build indexes on the assumption that the last name is first. Thus, it is worth distinguishing this case from the pure string form, the last CHOICE. In a pure string, there are no guarantees of any kind made about the structure of the name. It could be last name first, first name first, comma after last name, periods between initials, etc. The string form should be the CHOICE of last resort.

In the "C" structure, the first element indicates the type of the Person-id. The generic Pointer then must be cast to the correct type given that knowledge. So, for a Person-id.dbtag the Pointer is a DbtagPtr. For Person-id.name it is a NameStdPtr. For Person-id.ml or Person-id.str it is a CharPtr.
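The cast-by-type pattern described above might look like the following. The structure names and sketch_person_label() are hypothetical, chosen for illustration. The choice values and the position of the last name in names[0] follow the layout documented in objgen.h. Returning NULL for the dbtag case is a simplification of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Choice values as documented for Person-id in objgen.h. */
enum { PID_NOT_SET = 0, PID_DBTAG = 1, PID_NAME = 2, PID_ML = 3, PID_STR = 4 };

/* Hypothetical stand-ins for the NameStd and PersonId structures. */
typedef struct sketch_name_std {
    const char *names[7];   /* names[0] = last, names[1] = first, ... */
} SketchNameStd;

typedef struct sketch_person_id {
    unsigned char choice;   /* which CHOICE is in use */
    void *data;             /* must be cast according to choice */
} SketchPersonId;

/* Return the best available name string, or NULL if there is none. */
static const char *sketch_person_label(const SketchPersonId *p)
{
    assert(p != NULL);
    switch (p->choice) {
    case PID_NAME:          /* structured name: use the last name */
        return ((const SketchNameStd *)p->data)->names[0];
    case PID_ML:            /* MEDLINE string, e.g. "Jones JM" */
    case PID_STR:           /* unstructured string */
        return (const char *)p->data;
    default:                /* dbtag or not set: no name string here */
        return NULL;
    }
}
```

The switch on the choice value is what keeps the cast safe: the generic pointer is only interpreted as a particular type after the first element has confirmed which type it holds.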

Expressing Uncertainty with Fuzzy Integers: Int-fuzz

Lengths of biological sequences and locations on them are expressed with integers. However, sometimes it is desirable to be able to indicate some uncertainty about that length or location. Unfortunately, most software cannot make good use of such uncertainties, though in most cases this is fine. In order to provide both a simple, single integer view, as well as a more complex fuzzy view when appropriate, we have adopted the following strategy. In the NCBI specifications, all lengths and locations are always given by simple integers. If information about fuzziness is appropriate, then an Int-fuzz is ADDED to the data. In this case, the simple integer can be considered a "best guess" of the length or location. Thus simple software can ignore fuzziness, while it is not lost to more sophisticated uses.

Fuzziness can take a variety of forms. It can be plus or minus some fixed value. It can be somewhere in a range of values. It can be plus or minus a percentage of the best guess value. It may also be certain boundary conditions (greater than the value, less than the value) or refer to the bond BETWEEN residues of the biological sequence (bond to the right of this residue, bond to the left of that residue).
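To make the numeric forms concrete, the sketch below turns a best-guess value plus a fuzz into a pair of bounds. SketchIntFuzz and sketch_fuzz_bounds() are hypothetical names. The choice values, the use of slot a for max and slot b for min, and the storage of the percentage multiplied by 10 follow the ASN.1 and header comments in this chapter. The integer rounding is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Choice values as documented for Int-fuzz in objgen.h. */
enum { FUZZ_PM = 1, FUZZ_RANGE = 2, FUZZ_PCT = 3, FUZZ_LIM = 4 };

/* Hypothetical stand-in for the IntFuzz structure. */
typedef struct sketch_int_fuzz {
    unsigned char choice;
    long a, b;   /* a = p-m, max, pct, or lim; b = min */
} SketchIntFuzz;

/* Compute inclusive bounds around a best-guess value.
   Returns 1 if bounds were computed, 0 for "lim" or unknown choices,
   which do not describe a closed numeric range. */
static int sketch_fuzz_bounds(long value, const SketchIntFuzz *f,
                              long *lo, long *hi)
{
    long delta;
    assert(f != NULL && lo != NULL && hi != NULL);
    switch (f->choice) {
    case FUZZ_PM:                        /* plus or minus a fixed amount */
        *lo = value - f->a;
        *hi = value + f->a;
        return 1;
    case FUZZ_RANGE:                     /* explicit min and max */
        *lo = f->b;
        *hi = f->a;
        return 1;
    case FUZZ_PCT:                       /* pct is stored x10: 25 means 2.5% */
        delta = (value * f->a) / 1000;   /* truncates toward zero */
        *lo = value - delta;
        *hi = value + delta;
        return 1;
    default:
        return 0;
    }
}
```

For example, a best guess of 2000 with a pct fuzz of 25 (that is, 2.5%) gives bounds of 1950 and 2050, while simple software can still use the plain 2000.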

Creating Your Own Objects: User-object

One of the strengths of ASN.1 is that it requires a formal specification of data down to very detailed levels. This enforces clear definitions of data which greatly facilitates exchange of information in useful ways between different databases, software tools, and scientific enterprises. The problem with this approach is that it makes it very difficult for end users to add their own objects to the specification or enhance objects already in the specification. Certainly custom modules can be added to accommodate specific groups' needs, but the data from such custom modules cannot be exchanged or passed through tools which adhere only to the common specification.

We have defined an object called a User-object, which can represent any class of simple, structured, or tabular data in a completely structured way, but which can be defined in any way that meets a user's needs. The User-object itself has a "class" tag which is a string used like the "db" string in Dbtag, to set the context in which this User-object is meaningful. The "class" strings are not centrally controlled, so again it is possible to have a conflict, but unlikely unless activity in this area becomes very great. Within a "class" one can define an object "type" by either a string or an integer. Thus any particular endeavor can define a wide variety of different types for their own use. The combination of "class" and "type" identifies the object to databases and software that may understand and make use of this particular User-object's structure and properties. Yet, the generic definition means software that does not understand the purpose or use of any User-object can still parse it, pass it through, or even print it out for a user to peruse.

The attributes of the User-object are contained in one or more User-fields. Each User-field has a field label, which is either a string or an integer. It may contain any kind of data, strings, real numbers, integers, arrays of anything, or even sub-fields or complete sub-objects. When arrays and repeating fields are supplied, the optional "num" attribute of the User-field is used to tell software how many elements to prepare to receive. Virtually any structured data type from the simplest to the most complex can be built up from these elements.
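The way generic software can walk a chain of User-fields without understanding their purpose can be sketched as follows. SketchUserField is a simplified, hypothetical stand-in for the real structure: labels are strings only, and the data slot is a small union rather than the toolkit's DataVal. The choice numbers 2 (int) and 8 (ints) follow the table in objgen.h, and "num" gives the element count for the array case.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-in for a User-field. */
typedef struct sketch_user_field {
    const char *label;      /* string label (integer labels omitted here) */
    long num;               /* element count for array choices, else 0 */
    unsigned char choice;   /* 1=str, 2=int, 8=ints, as in the header table */
    union {
        const char *str;
        long intvalue;
        const long *ints;
    } data;
    struct sketch_user_field *next;   /* next field in the object */
} SketchUserField;

/* Sum every integer found in a chain of fields, whether it is stored
   as a single "int" or as an "ints" array of num elements. */
static long sketch_sum_ints(const SketchUserField *f)
{
    long total = 0;
    long i;
    for (; f != NULL; f = f->next) {
        if (f->choice == 2) {
            total += f->data.intvalue;
        } else if (f->choice == 8) {
            assert(f->data.ints != NULL);
            for (i = 0; i < f->num; i++)
                total += f->data.ints[i];
        }
        /* all other choices are skipped without being understood */
    }
    return total;
}
```

The function never looks at the labels, which is the point: the generic structure lets a tool process or pass along the fields even when the meaning of the object is known only to the group that defined it.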

The User-object is provided in a number of places in the public ASN.1 specifications to allow users to add their own structured features to Feature-tables or their own custom extensions to existing features. This allows new ideas to be tried out publicly, and allows software tools to be written to accommodate them, without requiring consensus among scientists or constant revisions to specifications. Those new ideas which time and experience indicate have become important concepts in molecular biology can be "graduated" to real ASN.1 specifications in the public scheme. A large body of structured data would presumably already exist in User-objects of this type, and these could all be back fitted into the new specified type, allowing data to "catch up" to the present specification. Those User-objects which do not turn out to be generally useful or important remain as harmless historical artifacts. User-objects could also be used for custom software to attach data only required for use by a particular tool to an existing standard object without harming it for use by standard tools.

ASN.1 Specification: general.asn

--$Revision: 2.0 $

--**********************************************************************

-- NCBI General Data elements

-- by James Ostell, 1990

--**********************************************************************

NCBI-General DEFINITIONS ::=

BEGIN

EXPORTS Date, Person-id, Object-id, Dbtag, Int-fuzz, User-object;

-- StringStore is really a VisibleString. It is used to define very

-- long strings which may need to be stored by the receiving program

-- in special structures, such as a ByteStore, but it's just a hint.

-- AsnTool stores StringStores in ByteStore structures.

-- OCTET STRINGs are also stored in ByteStores by AsnTool

-- typedef struct bsunit { /* for building multiline strings */

-- Nlm_Handle str; /* the string piece */

-- Nlm_Int2 len_avail,

-- len;

-- struct bsunit PNTR next; } /* the next one */

-- Nlm_BSUnit, PNTR Nlm_BSUnitPtr;

-- typedef struct bytestore {

-- Nlm_Int4 seekptr, /* current position */

-- totlen, /* total stored data length in bytes */

-- chain_offset; /* offset in ByteStore of first byte in curchain */

-- Nlm_BSUnitPtr chain, /* chain of elements */

-- curchain; /* the BSUnit containing seekptr */

-- } Nlm_ByteStore, PNTR Nlm_ByteStorePtr;

-- AsnTool incorporates this as a primitive type, so the definition

-- is here just for completeness

-- StringStore ::= [APPLICATION 1] IMPLICIT OCTET STRING

-- Date is used to replace the (overly complex) UTCTime, GeneralizedTime

-- of ASN.1

-- It stores only a date

Date ::= CHOICE {

str VisibleString , -- for those unparsed dates

std Date-std } -- use this if you can

Date-std ::= SEQUENCE { -- NOTE: this is NOT a unix tm struct

year INTEGER , -- full year (including 1900)

month INTEGER OPTIONAL , -- month (1-12)

day INTEGER OPTIONAL , -- day of month (1-31)

season VisibleString OPTIONAL } -- for "spring", "may-june", etc

-- Dbtag is generalized for tagging

-- eg. { "Social Security", str "023-79-8841" }

-- or { "member", id 8882224 }

Dbtag ::= SEQUENCE {

db VisibleString , -- name of database or system

tag Object-id } -- appropriate tag

-- Object-id can tag or name anything

Object-id ::= CHOICE {

id INTEGER ,

str VisibleString }

-- Person-id is to define a std element for people

Person-id ::= CHOICE {

dbtag Dbtag , -- any defined database tag

name Name-std , -- structured name

ml VisibleString , -- MEDLINE name (semi-structured)

-- eg. "Jones RM"

str VisibleString } -- unstructured name

Name-std ::= SEQUENCE { -- Structured names

last VisibleString ,

first VisibleString OPTIONAL ,

middle VisibleString OPTIONAL ,

full VisibleString OPTIONAL , -- full name eg. "J. John Poop, Esq"

initials VisibleString OPTIONAL, -- first + middle initials

suffix VisibleString OPTIONAL , -- Jr, Sr, III

title VisibleString OPTIONAL } -- Dr., Sister, etc

--**** Int-fuzz **********************************************

--*

--* uncertainties in integer values

Int-fuzz ::= CHOICE {

p-m INTEGER , -- plus or minus fixed amount

range SEQUENCE { -- max to min

max INTEGER ,

min INTEGER } ,

pct INTEGER , -- % plus or minus (x10) 0-1000

lim ENUMERATED { -- some limit value

unk (0) , -- unknown

gt (1) , -- greater than

lt (2) , -- less than

tr (3) , -- space to right of position

tl (4) , -- space to left of position

other (255) } } -- something else

--**** User-object **********************************************

--*

--* a general object for a user defined structured data item

--* used by Seq-feat and Seq-descr

User-object ::= SEQUENCE {

class VisibleString OPTIONAL , -- endeavor which designed this object

type Object-id , -- type of object within class

data SEQUENCE OF User-field } -- the object itself

User-field ::= SEQUENCE {

label Object-id , -- field label

num INTEGER OPTIONAL , -- required for strs, ints, reals, oss

data CHOICE { -- field contents

str VisibleString ,

int INTEGER ,

real REAL ,

bool BOOLEAN ,

os OCTET STRING ,

object User-object , -- for using other definitions

strs SEQUENCE OF VisibleString ,

ints SEQUENCE OF INTEGER ,

reals SEQUENCE OF REAL ,

oss SEQUENCE OF OCTET STRING ,

fields SEQUENCE OF User-field ,

objects SEQUENCE OF User-object } }

END

C Structures and Functions: objgen.h

/* objgen.h

* ===========================================================================

* PUBLIC DOMAIN NOTICE

* National Center for Biotechnology Information

* This software/database is a "United States Government Work" under the

* terms of the United States Copyright Act. It was written as part of

* the author's official duties as a United States Government employee and

* thus cannot be copyrighted. This software/database is freely available

* to the public for use. The National Library of Medicine and the U.S.

* Government have not placed any restriction on its use or reproduction.

* Although all reasonable efforts have been taken to ensure the accuracy

* and reliability of the software and data, the NLM and the U.S.

* Government do not and cannot warrant the performance or results that

* may be obtained by using this software or data. The NLM and the U.S.

* Government disclaim all warranties, express or implied, including

* warranties of performance, merchantability or fitness for any particular

* purpose.

* Please cite the author in any work or product based on this material.

* ===========================================================================

* File Name: objgen.h

* Author: James Ostell

* Version Creation Date: 1/1/91

* $Revision: 2.1 $

* File Description: Object manager interface for module NCBI-General

* Modifications:

* --------------------------------------------------------------------------

* Date Name Description of modification

* ------- ---------- -----------------------------------------------------

* ==========================================================================

*/

#ifndef _NCBI_General_

#define _NCBI_General_

#ifndef _ASNTOOL_

#include <asn.h>

#endif

#ifdef __cplusplus

extern "C" {

#endif

/*****************************************************************************

* loader

*****************************************************************************/

extern Boolean GeneralAsnLoad PROTO((void));

/*****************************************************************************

* internal structures for NCBI-General objects

*****************************************************************************/

/*****************************************************************************

* Date, Date-std share the same structure

* any data[2] or data[3] values = 0 means not set or not present

* data [0] - CHOICE of date ,0=str, 1=std

* [1] - year (- 1900)

* [2] - month (1-12) optional

* [3] - day (1-31) optional

*****************************************************************************/

typedef struct date {

Uint1 data[4]; /* see box above */

CharPtr str; /* str or season or NULL */

} NCBI_Date, PNTR NCBI_DatePtr;

#define DatePtr NCBI_DatePtr

NCBI_DatePtr DateNew PROTO((void));

NCBI_DatePtr DateFree PROTO((NCBI_DatePtr dp));

Boolean DateWrite PROTO((NCBI_DatePtr dp, Int2 year, Int2 month, Int2 day, CharPtr season));

Boolean DateRead PROTO((NCBI_DatePtr dp, Int2Ptr year, Int2Ptr month, Int2Ptr day, CharPtr season));

Boolean DatePrint PROTO((NCBI_DatePtr dp, CharPtr buf));

NCBI_DatePtr DateCurr PROTO((void));

NCBI_DatePtr DateDup PROTO((NCBI_DatePtr dp));

Boolean DateAsnWrite PROTO((NCBI_DatePtr dp, AsnIoPtr aip, AsnTypePtr atp));

NCBI_DatePtr DateAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

Int2 DateMatch PROTO((DatePtr a, DatePtr b, Boolean all));

/*****************************************************************************

* Object-id stuff

*****************************************************************************/

typedef struct objid {

Int4 id;

CharPtr str;

} ObjectId, PNTR ObjectIdPtr;

extern ObjectIdPtr ObjectIdNew PROTO((void));

extern ObjectIdPtr ObjectIdFree PROTO(( ObjectIdPtr oid));

extern ObjectIdPtr ObjectIdAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

extern Boolean ObjectIdAsnWrite PROTO((ObjectIdPtr oid, AsnIoPtr aip, AsnTypePtr atp));

extern Boolean ObjectIdMatch PROTO((ObjectIdPtr a, ObjectIdPtr b));

extern ObjectIdPtr ObjectIdDup PROTO((ObjectIdPtr oldid));

/*****************************************************************************

* DBtag stuff

*****************************************************************************/

typedef struct dbtag {

CharPtr db;

ObjectIdPtr tag;

} Dbtag, PNTR DbtagPtr;

extern DbtagPtr DbtagNew PROTO((void));

extern DbtagPtr DbtagFree PROTO(( DbtagPtr dbt));

extern DbtagPtr DbtagAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

extern Boolean DbtagAsnWrite PROTO((DbtagPtr dbt, AsnIoPtr aip, AsnTypePtr atp));

extern Boolean DbtagMatch PROTO((DbtagPtr a, DbtagPtr b));

extern DbtagPtr DbtagDup PROTO((DbtagPtr oldtag));

/*****************************************************************************

* Name-std

* names[0] = last

* [1] = first

* [2] = middle

* [3] = full

* [4] = initials

* [5] = suffix

* [6] = title

*****************************************************************************/

typedef struct namestd {

CharPtr names[7];

} NameStd, PNTR NameStdPtr;

extern NameStdPtr NameStdNew PROTO((void));

extern NameStdPtr NameStdFree PROTO(( NameStdPtr nsp));

extern NameStdPtr NameStdAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

extern Boolean NameStdAsnWrite PROTO((NameStdPtr nsp, AsnIoPtr aip, AsnTypePtr atp));

/*****************************************************************************

* Person-id

* choice = 0 = not set

* 1 = dbtag

* 2 = name

* 3 = ml

* 4 = str

*****************************************************************************/

typedef struct personid {

Uint1 choice; /* which CHOICE, see above */

Pointer data; /* points to appropriate data structure */

} PersonId, PNTR PersonIdPtr;

extern PersonIdPtr PersonIdNew PROTO((void));

extern PersonIdPtr PersonIdFree PROTO(( PersonIdPtr pid));

extern PersonIdPtr PersonIdAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

extern Boolean PersonIdAsnWrite PROTO((PersonIdPtr pid, AsnIoPtr aip, AsnTypePtr atp));

/*****************************************************************************

* Int-fuzz

*****************************************************************************/

typedef struct intfuzz {

Uint1 choice; /* 1=p-m, 2=range, 3=pct, 4=lim */

Int4 a, b; /* a=p-m,max,pct,orlim, b=min */

} IntFuzz, PNTR IntFuzzPtr;

extern IntFuzzPtr IntFuzzNew PROTO((void));

extern IntFuzzPtr IntFuzzFree PROTO(( IntFuzzPtr ifp));

extern IntFuzzPtr IntFuzzAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

extern Boolean IntFuzzAsnWrite PROTO((IntFuzzPtr ifp, AsnIoPtr aip, AsnTypePtr atp));

/*****************************************************************************

* User-field

* data is a DataVal where:

* choice asn1 data. =

1 = str VisibleString , ptrvalue = CharPtr

2 = int INTEGER , intvalue

3 = real REAL , realvalue

4 = bool BOOLEAN , boolvalue

5 = os OCTET STRING , ptrvalue = ByteStorePtr

6 = object User-object , ptrvalue = UserObjectPtr

7 = strs SEQUENCE OF VisibleString , ptrvalue = CharPtr PNTR

8 = ints SEQUENCE OF INTEGER , ptrvalue = Int4Ptr

9 = reals SEQUENCE OF REAL , ptrvalue = FloatHiPtr

10 = oss SEQUENCE OF OCTET STRING , ptrvalue = ByteStorePtr PNTR

11 = fields SEQUENCE OF User-field , ptrvalue = UserFieldPtr

12 = objects SEQUENCE OF User-object } } ptrvalue = UserObjectPtr

* User-object

*****************************************************************************/

typedef struct userfield {

ObjectIdPtr label;

Int4 num;

Uint1 choice;

DataVal data;

struct userfield PNTR next;

} UserField, PNTR UserFieldPtr;

extern UserFieldPtr UserFieldNew PROTO((void));

extern UserFieldPtr UserFieldFree PROTO(( UserFieldPtr ufp));

extern UserFieldPtr UserFieldAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

extern Boolean UserFieldAsnWrite PROTO((UserFieldPtr ufp, AsnIoPtr aip, AsnTypePtr atp));

typedef struct userobj {

CharPtr _class;

ObjectIdPtr type;

UserFieldPtr data;

struct userobj PNTR next; /* for SEQUENCE OF User-object */

} UserObject, PNTR UserObjectPtr;

extern UserObjectPtr UserObjectNew PROTO((void));

extern UserObjectPtr UserObjectFree PROTO(( UserObjectPtr uop));

extern UserObjectPtr UserObjectAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

extern Boolean UserObjectAsnWrite PROTO((UserObjectPtr uop, AsnIoPtr aip, AsnTypePtr atp));

#ifdef __cplusplus

}

#endif