Overview


Introduction
Components Of The Software Development ToolKit
A Few Samples
Using This Document
Contacting NCBI


 Introduction

Molecular biology is generating a host of data which are dramatically altering and deepening our understanding of the processes which underlie all living things. This new knowledge is already affecting medicine, agriculture, biotechnology, and basic science in fundamental and sweeping ways. However, the data on which our growing understanding is based is being accumulated and analyzed in thousands of laboratories all over the world, from large genome centers to small university laboratories, from large pharmaceutical companies to small biotech startups. It is being managed and analyzed on machines from small personal computers to supercomputers, on systems from a few disk files to large commercial database systems. These essential new data require specialized tools for analysis and management, so software tools are being developed in all these different environments at once.  Since molecular biology is an infant science, the data itself is not yet fully understood, so its fundamental properties and relationships are constantly being revised as well. Finally, the raw volume of molecular biology data is growing at an astonishing rate.

In recognition of the essential and growing role of bioinformatics in the United States, the National Center for Biotechnology Information (NCBI) was created by act of Congress in November 1988. This law mandates that NCBI shall:

 

1) Create automated systems for knowledge about molecular biology, biochemistry, and genetics.

2) Perform research into advanced methods of analyzing and interpreting molecular biology data.

3) Enable biotechnology researchers and medical care personnel to use the systems and methods developed.

4) Coordinate efforts to gather biotechnology information worldwide.

To approach these goals, NCBI has been organized into three interoperating branches. The Basic Research Branch (BRB) is a group of scientists who perform research into algorithms and methods for analyzing molecular biology data, publish results in peer-reviewed journals, and keep the other branches abreast of the latest developments from a scientific perspective. The Information Resources Branch (IRB) maintains the infrastructure at NCBI, administers the distribution of data and services provided by NCBI to the community, supports a visiting scientist program to enable researchers to spend time working at NCBI, and interacts with other agencies and bodies. The Information Engineering Branch (IEB) designs and builds databases and software tools for molecular biology information which attempt to incorporate the new approaches and meet the needs of the BRB, while producing data and software tools which are released to the community on a production basis by the IRB.

This document describes the data model and software tools developed by the IEB to achieve its mission. The IEB approaches its task with an understanding of the situation outlined in the first paragraph: molecular biology data comes from and is used in an extremely heterogeneous, distributed, and changing environment, from both computing and biological points of view. The data processed and integrated by IEB will come from many different sources which may use different models of the data, and these models can be expected to change over time. The data will be stored and managed on many different computer systems using many different database management systems. The data itself is expected to be valuable for longer than the life cycle of any particular computer system or program. This means that the data must be described in a controlled and formal way, so that all participants can clearly understand what data components are available in common at any time, but without dependence on any particular software tool or language, database management system, or hardware architecture.

Software developed by IEB must be capable of running on all major hardware platforms used in the scientific community and must be designed to be ported to new systems as the computer industry progresses. It must be capable of providing systems for data retrieval by end-user scientists while also providing software hooks for other programs written by bioinformatics specialists in commercial, academic, or government settings, and by academic researchers.

To achieve the goal of a formal, controlled, yet flexible data specification, IEB has adopted the use of Abstract Syntax Notation 1 (ASN.1), an International Standards Organization standard (ISO 8824, 8825) for describing and encoding data in a machine readable way which is independent of hardware or software architecture and language. IEB has created a formal specification in ASN.1 for biotechnology and bibliographic information. This specification is based on a data model which unifies sequence related data from bands on a gel to genetic maps to sequenced nucleic acid and protein molecules. It provides connections from such data to other specialized datasets such as stock center lists, taxonomies, or structures. The specification is done as a series of connected modules. This means selected modules can be reused by other biotechnology databases and new ones added to meet specialized needs. The ASN.1 specification and encoding provide an essential common ground, changing the many to many mapping between the various information sources and applications to a many to one mapping, both for data models and for software interfaces.

To achieve the goals of software portability and of providing different levels of access from database producer to programmer to end-user, IEB has developed a layered software toolkit. The toolkit is used internally at NCBI to process and analyze data from a variety of sources to build and maintain the unified databases and also serves as the components for the end-user applications NCBI distributes. This means it is subjected to the continuous demands for quality and performance imposed by a large production operation in the course of our daily work. The source code for the toolkit is made available without restriction for use by anyone wishing to take advantage of the work done by NCBI. The software runs on a wide variety of common platforms and is layered to allow programmers to use either very low level or very high level tools to access and manipulate data.

Components Of The Software Development ToolKit

ASN.1

A brief introduction is provided to the ASN.1 language itself in the beginning of the AsnLib chapter. Those familiar with Backus-Naur form should have no trouble reading it immediately, while a short explanation may be required for others. It is a simple, logical way to specify data and is used for many purposes in the computer industry to describe and exchange data. A number of books, articles, and software tools from the computer industry at large are readily available for those who wish a more in-depth knowledge of ASN.1. This is an important aspect of choosing the ASN.1 language to describe biological data. ASN.1 is a formal data description language, developed, tested, and used within the computer industry, not an ad hoc file format developed by biologists. Would you program in an ad hoc programming language developed by biologists? Then why describe data that way?

Data Model For Biological Sequences

The selection of a data description language does not define what it is used for any more than the selection of English defines what a book is about. The IEB has defined a model for biotechnology information (which happens to be specified in ASN.1) which is centered around the concept of a biological sequence as a simple, linear coordinate system. Genetic and physical maps, sequenced pieces of nucleic acids and proteins, and complex assemblies of such components can all be considered specializations of the basic sequence concept of an identified coordinate system. Relationships between sequences (e.g. sequence alignments, sequence assemblies, relationships of genetic to physical maps) can all be considered mappings from one sequence coordinate system to another. Information about sequences can be considered mappings of specialized data objects (e.g. publications, genes, coding regions) to any sequence coordinate system. Such specialized data objects may themselves contain keys to other databases containing more specialized information not necessarily captured by the common data model, but unique to a particular organism, discipline, or database.

CoreLib: Writing Portable Software

The CoreLib is a small set of "C" language functions, macros, and guidelines that permit the writing of programs which compile and execute without change on over fourteen different hardware/operating system/compiler combinations. If one wishes to distribute one's code to as many molecular biologists as possible with as little work as possible, learning to write CoreLib style code is a tremendous advantage. If one wishes to write on one platform, but interface with NCBI software, one should still understand the CoreLib approach (read the introduction in the CoreLib chapter), but one is not required to write CoreLib code oneself.
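
To give a flavor of what CoreLib style can mean in practice, the sketch below shows one way portable type names and a prototype macro could be set up on a modern compiler. The names Int2, Int4, Uint1, PNTR, and PROTO appear in the declarations in the samples later in this chapter, but the definitions here are illustrative assumptions built on <stdint.h>, not the actual CoreLib headers, and SumBytes is a hypothetical function.

```c
#include <stdint.h>

/* Illustrative stand-ins only; the real CoreLib headers are not shown here. */
typedef int16_t Int2;             /* always 2 bytes */
typedef int32_t Int4;             /* always 4 bytes */
typedef uint8_t Uint1;            /* always 1 byte, unsigned */
#define PNTR *                    /* plain pointer here; a platform header could
                                     substitute a qualified pointer */

/* PROTO keeps the argument list for ANSI compilers and drops it for older ones. */
#if defined(__STDC__) || defined(__cplusplus)
#define PROTO(x) x
#else
#define PROTO(x) ()
#endif

/* Hypothetical function declared in the double-parenthesis style. */
extern Int4 SumBytes PROTO((const Uint1 PNTR buf, Int4 len));

Int4 SumBytes(const Uint1 PNTR buf, Int4 len)
{
    Int4 i, total = 0;

    for (i = 0; i < len; i++)     /* add up each byte of the buffer */
        total += buf[i];
    return total;
}
```

The double parentheses in PROTO((...)) let the whole argument list travel as a single macro argument, so one header can serve compilers with and without prototype support.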

AsnLib: Reading and Writing ASN.1

AsnLib is a function library written with CoreLib, which provides functions for reading and validating ASN.1 specifications and generating parse trees to encode and decode data conforming to the specification. The parse trees can be generated dynamically at run-time from any input specification, or parse trees for particular specifications can be produced as "C" language header files to be incorporated into applications. Given a parse tree generated either way, AsnLib provides low level functions for encoding and decoding data in either the text or binary forms of ASN.1, one element at a time. Converters to other languages (ASN.1 to Prolog or ASN.1 to LISP have been done), filters (get all journal titles from an ASN.1 encoded stream of bibliographic citations), or indexing programs (index a file of ASN.1 encoded bibliographic citations on author name) can be written with tools at this level.

Object Loaders: Combining AsnLib and the Data Model

Every ASN.1 specification module in the NCBI data model has a corresponding "object loader" module. This is a pair of "C" language ".c" and ".h" files which typedef a "C" structure for every entity defined in ASN.1 (called an "object" here). For each object there is a function to create it, read it from an ASN.1 stream, write it to an ASN.1 stream, and free it. These take the form of [AsnName]New(), [AsnName]AsnRead(), [AsnName]AsnWrite(), and [AsnName]Free(). If an "object" is considered data associated with methods, these routines define the structure of the data (as mapped from ASN.1) and define routines to load such objects in and out of memory from ASN.1.

In some cases additional functions such as compare, duplicate, find, or print are defined here as well. The Data Access layer returns pointers to these structures and the Utilities layer provides more routines to compare, explore, manipulate, and display these structures. Using the object loader layer incorporates a great deal of NCBI code into your application, but most programmers find this the easiest level to access NCBI data for complex objects such as whole sequence entries.
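
The create/read/write/free convention can be illustrated with a toy object. The sketch below is only an analogy under stated assumptions: DemoCit and all of its functions are hypothetical, and the reader and writer use a character buffer with sscanf and snprintf instead of AsnLib streams.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy object; not part of the NCBI data model. */
typedef struct democit {
    int  year;
    char volume[32];
} DemoCit, *DemoCitPtr;

/* Create a zero-filled object (NULL if out of memory). */
DemoCitPtr DemoCitNew(void)
{
    return (DemoCitPtr) calloc(1, sizeof(DemoCit));
}

/* Free the object; returns NULL so callers can write p = DemoCitFree(p). */
DemoCitPtr DemoCitFree(DemoCitPtr dcp)
{
    free(dcp);
    return NULL;
}

/* Write a text form into buf; returns 1 on success, 0 on failure. */
int DemoCitWrite(const DemoCit *dcp, char *buf, size_t buflen)
{
    int n;

    if (dcp == NULL || buf == NULL)
        return 0;
    n = snprintf(buf, buflen, "Demo-cit ::= { year %d , volume \"%s\" }",
                 dcp->year, dcp->volume);
    return n > 0 && (size_t) n < buflen;
}

/* Read the text form back; returns a new object or NULL if it does not parse.
   An empty volume string is not accepted by this simple parser. */
DemoCitPtr DemoCitRead(const char *text)
{
    DemoCitPtr dcp;

    if (text == NULL || (dcp = DemoCitNew()) == NULL)
        return NULL;
    if (sscanf(text, " Demo-cit ::= { year %d , volume \"%31[^\"]\" }",
               &dcp->year, dcp->volume) != 2)
        return DemoCitFree(dcp);
    return dcp;
}

/* Optional extra in the spirit of a compare function: 0 means equal. */
int DemoCitMatch(const DemoCit *a, const DemoCit *b)
{
    if (a->year != b->year)
        return a->year < b->year ? -1 : 1;
    return strcmp(a->volume, b->volume);
}
```

Writing an object and reading the text back should yield an object for which DemoCitMatch returns 0, which is the round trip the four standard functions are meant to support.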

In the following document, detailed discussions of an ASN.1 module and its corresponding object loader are combined in a single chapter. The chapters are organized by grouping closely related objects together. The discussion in each chapter focuses on particular issues surrounding the implementation of that data type but may not mention every function. The complete ASN.1 specification and object loader ".h" files follow at the end of each such chapter as the comprehensive and definitive specification.

Utilities

A growing number of utility functions have been written that manipulate or analyze the structures defined in the object loaders. For example, one function compares two (arbitrarily complex) locations on sequences and determines if they overlap or if one is contained in the other. Another opens a "port" on any (arbitrarily complex) sequence or part(s) of a sequence(s) and treats it as a single sequence, in any selected sequence alphabet, with operations provided such as "seek to location", "get next residue", "read x residues into a buffer", and so on. A whole family of functions allows the exploration of any arbitrarily complex structure in memory, with a call to a user supplied function when encountering any structure based on its ASN.1 name (e.g. find all coding region features, or find all publications, or find all author names in publications). Finally there are functions that will output a sequence entry in GenBank format, FASTA format, or a report format.
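
As a rough illustration of the location comparison described above, the sketch below classifies two simple closed intervals on the same sequence. Real locations can be arbitrarily complex; DemoInterval, DemoLocRelation, and DemoIntervalCompare are hypothetical names for this sketch, not toolkit functions.

```c
/* Hypothetical simplified location: a closed interval with from <= to. */
typedef struct {
    long from;
    long to;
} DemoInterval;

typedef enum {
    DEMO_NO_OVERLAP,   /* the intervals share no position */
    DEMO_EQUAL,        /* identical endpoints */
    DEMO_A_IN_B,       /* a lies entirely within b */
    DEMO_B_IN_A,       /* b lies entirely within a */
    DEMO_OVERLAP       /* partial overlap only */
} DemoLocRelation;

DemoLocRelation DemoIntervalCompare(DemoInterval a, DemoInterval b)
{
    if (a.to < b.from || b.to < a.from)
        return DEMO_NO_OVERLAP;
    if (a.from == b.from && a.to == b.to)
        return DEMO_EQUAL;
    if (a.from >= b.from && a.to <= b.to)
        return DEMO_A_IN_B;
    if (b.from >= a.from && b.to <= a.to)
        return DEMO_B_IN_A;
    return DEMO_OVERLAP;
}
```

Treating a location as an interval on a coordinate system is what makes the comparison a matter of simple arithmetic on endpoints.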

Data Access

A family of functions supplies high level access to sequence and bibliographic data on the Entrez: Sequences CDROM provided by NCBI. These functions allow the evaluation of Boolean operations on a list of terms, resulting in the sequence ids (or MEDLINE ids) that satisfy the query. Other functions take a sequence or MEDLINE id and retrieve the record from the CDROM, or retrieve its "neighbors", entries which are similar to it.

These same functions have been implemented as Internet network access functions to the NCBI data servers, and will become publicly available in 1993. Software which accesses data on the Entrez: Sequences CDROM using the access functions can be changed to access the network servers by just linking to a different library.

The access functions mean that a programmer can incorporate any or all of the functionality shown by the Entrez application into a program of their own design. This means customized analysis and retrieval systems can be written which nonetheless take advantage of the public data retrieval systems.
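
To make the idea of a Boolean operation on terms concrete, the sketch below computes the AND of two terms, assuming each term has already been resolved to an ascending list of ids. This is an illustration only and says nothing about how the Entrez access functions are implemented; DemoUidAnd is a hypothetical name.

```c
#include <stddef.h>

/* Intersect two ascending id lists a and b.
   Writes at most outmax common ids to out and returns how many were written. */
size_t DemoUidAnd(const long *a, size_t na,
                  const long *b, size_t nb,
                  long *out, size_t outmax)
{
    size_t i = 0, j = 0, n = 0;

    while (i < na && j < nb && n < outmax) {
        if (a[i] < b[j]) {
            i++;                    /* a[i] cannot be in b; advance a */
        } else if (a[i] > b[j]) {
            j++;                    /* b[j] cannot be in a; advance b */
        } else {
            out[n++] = a[i];        /* id satisfies both terms */
            i++;
            j++;
        }
    }
    return n;
}
```

An OR or a BUTNOT operation could be written with the same two-index walk by changing which branch copies an id to the output.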

Vibrant: A Portable Windowing System

Vibrant is a portable windowing system written with CoreLib which allows windowing applications to be written which are source code identical on Macintosh, Microsoft Windows, UNIX X11 Motif and VMS X11 Motif. Vibrant is not meant to provide every possible tool supported by the host system or other commercial products, but rather to vastly simplify writing, in a portable way, basic scientific applications which are compatible with the modern windowing environments now widely used by scientists.

NCBI fondly hopes that eventually a standard windowing API or appropriate tools will emerge from the computer industry. We will only support Vibrant until that time. While we make it available to the public to use as desired, Vibrant is primarily aimed at serving internal NCBI needs.

A Few Samples

This document contains a large mass of detailed information and new ideas. Just as with learning a new language, it is a substantial commitment to learn and understand it all. But knowing it all may not be necessary to get started. This is a quick sample of what is available, to give you a flavor of the toolkit.

These are the ASN.1 definitions used for an article citation (from a book, journal, or proceedings; only journal is shown). The "::=" means "is defined as" and SEQUENCE means "the following items come in order", not a biological sequence. You can probably just read the rest.

Cit-art ::= SEQUENCE {                  -- article in journal or book
    title Title OPTIONAL ,              -- title of paper (ANSI requires)
    authors Auth-list OPTIONAL ,        -- authors (ANSI requires)
    from CHOICE {                       -- journal or book
        journal Cit-jour ,
        book Cit-book ,
        proc Cit-proc } }

Cit-jour ::= SEQUENCE {             -- Journal citation
    title Title ,                   -- title of journal
    imp Imprint }

Auth-list ::= SEQUENCE {
    names CHOICE {
        std SEQUENCE OF Author ,        -- full citations
        ml SEQUENCE OF VisibleString ,  -- MEDLINE, semi-structured
        str SEQUENCE OF VisibleString } , -- free for all
    affil Affil OPTIONAL }        -- author affiliation

Title ::= SET OF CHOICE {
    name VisibleString ,    -- Title, Anal,Coll,Mono    AJB
    tsub VisibleString ,    -- Title, Subordinate       A B
    trans VisibleString ,   -- Title, Translated        AJB
    jta VisibleString ,     -- Title, Abbreviated        J
    iso-jta VisibleString , -- specifically ISO jta      J
    ml-jta VisibleString ,  -- specifically MEDLINE jta  J
    coden VisibleString ,   -- a coden                   J
    issn VisibleString ,    -- ISSN                      J
    abr VisibleString ,     -- Title, Abbreviated         B
    isbn VisibleString }    -- ISBN                       B

Imprint ::= SEQUENCE {                  -- Imprint group
    date Date ,                         -- date of publication
    volume VisibleString OPTIONAL ,
    issue VisibleString OPTIONAL ,
    pages VisibleString OPTIONAL ,
    section VisibleString OPTIONAL ,
    pub Affil OPTIONAL,                     -- publisher, required for book
    cprt Date OPTIONAL,                     -- copyright date, "    "   "
    part-sup VisibleString OPTIONAL ,       -- used in MEDLINE
    language VisibleString DEFAULT "ENG" ,  -- put here for simplicity
    prepub ENUMERATED {                     -- for prepublication citations
        submitted (1) ,                     -- submitted, not accepted
        in-press (2) ,                      -- accepted, not published
        other (255)  } OPTIONAL }

 

That is a very complete and detailed specification. Here is a sample of a journal citation in the text form of ASN.1. You can easily see how it conforms to the specification and how one would locate the journal title, for example.

Cit-art ::= {
  title {
    name "Developmental regulation of a constitutively expressed mouse mRNA
 encoding a 72-kDa heat shock-like protein." } ,
  authors {
    names
      ml {
        "Giebel LB" ,
        "Dworniczak BP" ,
        "Bautz EK" } ,
    affil
      str "Zentrum fur Molekulare Biologie, Universitat Heidelberg (ZMBH),
 Federal Republic of Germany." } ,
  from
    journal {
      title {
        ml-jta "Dev Biol" } ,
      imp {
        date
          std {
            year 1988 ,
            month 1 } ,
        volume "125" ,
        issue "1" ,
        pages "200-7" } } }

Here is the object loader "C" structure and its attendant functions for a Cit-art. There is even a matching function for this object. Details of using the "fromptr" to access the CitJour, CitBook, or CitProc for the article are given in the Bibliographic References chapter. This is just to give the flavor.

/*****************************************************************************
*
*   Cit-art
*
*****************************************************************************/
typedef struct citart {
   ValNodePtr title;       /* choice[1]=name,[2]=tsub,[3]=trans */
   AuthListPtr authors;
   Uint1 from;             /* [1]=journal,[2]=book,[3]=proc */
   Pointer fromptr;
} CitArt, PNTR CitArtPtr;

extern CitArtPtr CitArtNew PROTO((void));
extern CitArtPtr CitArtFree PROTO((CitArtPtr cap));
extern CitArtPtr CitArtAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));
extern Boolean CitArtAsnWrite PROTO((CitArtPtr cap, AsnIoPtr aip, AsnTypePtr atp));
Int2 CitArtMatch PROTO((CitArtPtr a, CitArtPtr b));
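
The "from" and "fromptr" pair is the usual way an ASN.1 CHOICE can be carried in a "C" structure: a small integer says which alternative is present and a generic pointer holds it. The sketch below shows the idea with stand-in types. DemoJour, DemoBook, DemoArt, and DemoArtSourceTitle are hypothetical and much simpler than the real CitJour and CitBook objects, whose use is covered in the Bibliographic References chapter.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the journal and book objects. */
typedef struct { const char *title; } DemoJour;
typedef struct { const char *title; } DemoBook;

/* Same shape as the from/fromptr pair in the structure above. */
typedef struct {
    unsigned char from;     /* 1 = journal, 2 = book, 3 = proc */
    void *fromptr;          /* points to the alternative selected by from */
} DemoArt;

/* Return the title of whatever the article was published in, or NULL. */
const char *DemoArtSourceTitle(const DemoArt *art)
{
    if (art == NULL || art->fromptr == NULL)
        return NULL;
    switch (art->from) {
    case 1:
        return ((const DemoJour *) art->fromptr)->title;
    case 2:
        return ((const DemoBook *) art->fromptr)->title;
    default:
        return NULL;        /* proc (3) and unknown values are not handled here */
    }
}
```

The cast is only safe because the code checks "from" first, which is why the two fields always travel together.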

 

Here is a data access function which retrieves a MEDLINE record (a MedlineEntry) from the Entrez: Sequences CDROM, given a MEDLINE unique identifier (uid). A MedlineEntry contains an article citation (i.e. it reuses the Cit-art object from the bibliographic module then adds the additional index terms and information needed to make a MEDLINE record).

MedlineEntryPtr GetMedline (Int4 uid)
{
   MedlineEntryPtr mep = NULL;

   if (! EntrezInit())                   /* initialize Entrez CDROM */
       return NULL;                      /* failed to initialize */
   mep = EntrezMedlineEntryGet(uid);     /* get the Medline entry */
   EntrezFini();                         /* close CDROM */
   return mep;
}

Here is a code fragment that will exhaustively explore the MedlineEntry structure in memory and call the user supplied callback function when it finds the Imprint in the Cit-jour of the Cit-art (i.e. the article was published in a journal which was printed at a particular time). The string "Cit-art.from.journal.imp" defines a path to the journal imprint following the element names in the ASN.1 specification given above.

void GetImprint PROTO((AsnExpOptStructPtr aeosp));   /* callback, defined below */

void ExploreExample (MedlineEntryPtr mep)
{
   AsnIoPtr aip;

   aip = AsnIoNullOpen();                /* open a stream that writes nothing */
   AsnExpOptNew(aip, "Cit-art.from.journal.imp", NULL, GetImprint);  /* attach callback */
   MedlineEntryAsnWrite(mep, aip, NULL); /* traverse structure */
   AsnIoClose(aip);
   return;
}

 

/*** this is called whenever a journal imprint in an article is found **/
void GetImprint(AsnExpOptStructPtr aeosp)
{
   ImprintPtr ip;

   /*
   ** Make sure we are at the beginning of an Imprint
   */

   if (aeosp->dvp->intvalue != START_STRUCT) return;

   ip = (ImprintPtr)aeosp->the_struct;  /* we have the Imprint */
       /*.... do whatever you want with it */
   return;
}
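
The path-with-callback idea can be shown without any toolkit code. The sketch below walks a small in-memory tree of named elements, builds the dotted path to each one, and calls a user supplied function when the path equals the target. It is a simplified analogy under stated assumptions, not the AsnLib mechanism: DemoNode, DemoVisitFunc, DemoCountHit, and DemoWalk are all hypothetical names.

```c
#include <stdio.h>
#include <string.h>

#define DEMO_PATH_MAX 128

/* Hypothetical element tree: each node has a name, children, and siblings. */
typedef struct demonode {
    const char *name;                 /* element name, e.g. "imp" */
    const struct demonode *child;     /* first child, or NULL */
    const struct demonode *next;      /* next sibling, or NULL */
} DemoNode;

typedef void (*DemoVisitFunc)(const DemoNode *node, void *userdata);

/* Example callback: counts how many times it is called. */
void DemoCountHit(const DemoNode *node, void *userdata)
{
    (void) node;
    ++*(int *) userdata;
}

/* Visit node, its siblings, and all descendants. prefix is the dotted path
   of the parent ("" at the top). Returns the number of callback calls made. */
int DemoWalk(const DemoNode *node, const char *prefix, const char *target,
             DemoVisitFunc func, void *userdata)
{
    char path[DEMO_PATH_MAX];
    int hits = 0;
    int n;

    for (; node != NULL; node = node->next) {
        n = snprintf(path, sizeof path, "%s%s%s",
                     prefix, (*prefix != '\0') ? "." : "", node->name);
        if (n < 0 || (size_t) n >= sizeof path)
            continue;                 /* path too long: skip this subtree */
        if (strcmp(path, target) == 0) {
            func(node, userdata);     /* found an element at the target path */
            hits++;
        }
        hits += DemoWalk(node->child, path, target, func, userdata);
    }
    return hits;
}
```

The caller never writes traversal code for a particular structure; it only names a path and supplies a function, which is the convenience the explore functions aim to provide.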

Finally, here we print out the MedlineEntry in EndNote format and free the memory it used.

void MedlineToFile (MedlineEntryPtr mep)
{
   FILE *fp;

   fp = FileOpen("test.out", "w");
   if (fp != NULL)                       /* skip the output if the open failed */
   {
       MedlineEntryToDocFile (mep, fp);
       FileClose(fp);
   }
   MedlineEntryFree(mep);
   return;
}

Using This Document

This document has a detailed table of contents which can direct you to the topic of interest. For an initial acquaintance with the system, read the Data Model chapter and the introductions to the other chapters. Then, depending on your style and interests, either:

1) Download the software toolkit, build it, and make the demo programs. Print the ".c" files for the demos and look them over. Print the ".asn" files from the \asn directory and look them over. Print the ".h" files from the \object directory. Print "sequtil.h" and "seqport.h" from \api. Print "accentr.h" from \cdromlib. Go back and read the rest of the documentation.

2) Read the documentation by scanning the sections after the introductions in each chapter. Then return in detail to what interests you.

Contacting NCBI

You can download the software tools (all versions) by anonymous ftp to ftp.ncbi.nlm.nih.gov.

cd toolbox\ncbi_tools
bin
get ncbi.tar.Z                   (compressed UNIX tar file)
or
get ncbiZ.exe                    (self extracting DOS archive)
or
get ncbi.sea.hqx                 (self extracting Mac archive)

You can get on an email list to be notified of new releases of software by sending your name, address, institution, and email address to bits-request@ncbi.nlm.nih.gov

You can email to toolbox@ncbi.nlm.nih.gov

You can FAX 301-480-9241, attn. toolbox

You can mail to:

toolbox

NCBI

Bldg 38A, NIH

8600 Rockville Pike

Bethesda, MD 20850

All comments are welcome. If you are part of a larger project or group that wishes to make use of the NCBI tools or to establish data exchange with NCBI, please let us know and we will do whatever we can to ensure your success.