National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, US Department of Health and Human Services
Fall 2003 issue of NCBI News



Entrez Programming Utilities

In dealing with specialized datasets, researchers are often restricted to one of two unattractive choices: either to download an FTP archive containing far more than the data of interest and then parse it locally, or to access the data interactively, even though the volume of data may render this method cumbersome. To help with the latter method, NCBI provides a suite of programs called the Entrez Programming Utilities (E-Utilities) that allow automated access to the Entrez databases.

What Are the E-Utilities?

The E-Utilities are a set of seven server-side programs that provide a stable interface to the search, retrieval, and linking functions of the Entrez system, using a fixed URL syntax. The output provided by the E-Utilities is in XML format, with the notable exception of the EFetch utility, which returns data in a variety of formats. The E-Utilities are designed to be called from within a computer program that can process their output. Calling an E-Utility from any of the common programming languages, including Perl, Python, and Java, is a simple matter of posting a URL.
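
As a minimal sketch of what "posting a URL" can look like in Python, the helper below assembles an E-Utility URL from the base address given at the end of this article. The function name `build_eutils_url`, the script name `esearch.fcgi`, the `https://` scheme, and the parameter names `db` and `term` are assumptions not stated in this article, and the query itself is only illustrative. The actual request is left as a comment so that the sketch runs without network access.

```python
from urllib.parse import urlencode

# Base address quoted in this article; the "https://" scheme is an assumption.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_eutils_url(script, **params):
    """Return a full E-Utility URL for the given script and query parameters."""
    return f"{EUTILS_BASE}/{script}?{urlencode(params)}"


# Illustrative query; the db and term values are placeholders for this sketch.
url = build_eutils_url("esearch.fcgi", db="protein", term="interleukin 22")
print(url)

# To actually call the utility (requires network access):
# from urllib.request import urlopen
# with urlopen(url) as response:
#     xml_text = response.read().decode("utf-8")
```

Keeping URL construction separate from the network call makes it easy to inspect or log each URL before it is posted.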

The E-Utilities Implement Entrez Functions

Each of the E-Utilities performs a basic task within the Entrez system, and six of the E-Utilities have a direct equivalent in interactive Entrez (Box 1). For instance, typing a text query into the NCBI home page and clicking "Go" causes Entrez to search for matches across all Entrez databases and list the number of matching records for each. This "Global Query" function is implemented by EGQuery. If a single database is queried, Entrez first maps the query to a set of integers, or unique identifiers (UIDs), for matching records in the selected database. Entrez UIDs are sometimes referred to as GI numbers for nucleotide and protein, PMIDs for PubMed, and MMDB-IDs for Structures. Entrez queries and the subsequent list of matching UIDs are implemented using ESearch. On the Web, Entrez searches are automatically followed by displays of brief record listings, called Document Summaries (DocSums), for matching records. This functionality is implemented by ESummary. Access to full records in an Entrez database on the Web is provided by clicking on the accession of a displayed DocSum. This function is implemented by EFetch. Accessing records linked to a given record on the Web is as simple as clicking on a link in the Links menu to the right of a DocSum. This linking function is provided by ELink. On the Web, Batch Entrez is used to upload a list of UIDs; this function is provided by EPost. EPost places UIDs on the Entrez History server, which stores the results of previous searches during an Entrez session, as can be done on the Web using the Preview/Index or History tabs. The only E-Utility that does not have a direct Web parallel is EInfo, which provides the vital statistics of Entrez databases such as the date of the last update, a list of links to other databases, and a list of indexed fields.
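
The correspondence described above can be kept at hand as a small lookup table. This is only a sketch: the dictionary name `ENTREZ_EQUIVALENTS` and the helper function are hypothetical, and the descriptions paraphrase this article rather than any official documentation.

```python
# Utility name -> the interactive Entrez function it implements,
# as described in the paragraph above. None marks "no direct Web equivalent".
ENTREZ_EQUIVALENTS = {
    "EGQuery": "Global Query across all Entrez databases",
    "ESearch": "text query returning matching UIDs in one database",
    "ESummary": "Document Summaries (DocSums) for matching records",
    "EFetch": "full records in a chosen format",
    "ELink": "records linked to a given record (Links menu)",
    "EPost": "upload of a UID list to the History server (Batch Entrez)",
    "EInfo": None,  # reports database statistics; no direct Web parallel
}


def utilities_with_web_equivalent():
    """Return the names of utilities that mirror an interactive Entrez function."""
    return sorted(name for name, web in ENTREZ_EQUIVALENTS.items() if web is not None)


print(utilities_with_web_equivalent())
```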

The E-Utilities Search for Data


Suppose that a researcher wants to find all human RefSeq protein records that have links to Online Mendelian Inheritance in Man (OMIM), and thereby have an associated phenotype. This can be done by posting the ESearch URL shown in Example 1 of the ESearch section of Box 1. This URL produces XML output, a portion of which is shown below:

<Count>14988</Count>
<RetMax>20</RetMax>
<RetStart>0</RetStart>
<QueryKey>47</QueryKey>
<WebEnv>0hh9nVItHLfyYJGaMMIh_T0ptRqIsaiaikdx5k_yhaM0S72qC5x-AY</WebEnv>


Included is the number of records (14,988) matching the query along with the two parameters that define the location of the data set on the History server: the Query Key, with a value of 47, and the Web Environment (WebEnv), with a value of "0hh9nVItHLfyYJGaMMI . . . ." The latter is a string associated with the internet cookie for the Entrez session.
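
A calling program would typically pull these three values out of the XML before making further calls. The sketch below does so with Python's standard `xml.etree.ElementTree` module. The fragment quoted above is wrapped in a root element so that it is well-formed; the root name `eSearchResult` and the function name `parse_esearch` are assumptions, since the article shows only a portion of the output.

```python
import xml.etree.ElementTree as ET

# The fragment quoted above, wrapped in an assumed root element.
SAMPLE_XML = """<eSearchResult>
<Count>14988</Count>
<RetMax>20</RetMax>
<RetStart>0</RetStart>
<QueryKey>47</QueryKey>
<WebEnv>0hh9nVItHLfyYJGaMMIh_T0ptRqIsaiaikdx5k_yhaM0S72qC5x-AY</WebEnv>
</eSearchResult>"""


def parse_esearch(xml_text):
    """Extract the record count and the History server location from ESearch XML."""
    root = ET.fromstring(xml_text)
    return {
        "count": int(root.findtext("Count")),
        "query_key": root.findtext("QueryKey"),
        "webenv": root.findtext("WebEnv"),
    }


result = parse_esearch(SAMPLE_XML)
print(result["count"], result["query_key"])
```

The Query Key and WebEnv values are kept as strings because they are only ever passed back to the server, never used in arithmetic.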

The E-Utilities Retrieve Data

Retrieving the actual records identified in the above search is done either with ESummary, to retrieve DocSums, or with EFetch, to retrieve formatted records such as FASTA sequence. Other available sequence formats include GenBank, GenPept, and INSDSeq XML, which can be selected using the &rettype parameter. One consideration to bear in mind is that EFetch is limited to 500 records per URL. Therefore, to retrieve FASTA sequence for all 14,988 records, a loop within the calling program will be required to post 30 URLs (14,988 divided by 500, rounded up), the second of which, to retrieve records 500-999, is shown in the EFetch section of Box 1.

The remaining 28 URLs would differ only in the value of the &retstart parameter, which would increment by 500 in each successive call within the loop.
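
The batching loop can be sketched as follows. This is a minimal illustration, assuming the script name `efetch.fcgi`, the `https://` scheme, and the parameter names `db`, `query_key`, `WebEnv`, and `retmax`, none of which are spelled out in this article; `WEBENV_PLACEHOLDER` stands in for a real WebEnv string. Only the URLs are built; nothing is sent over the network.

```python
from urllib.parse import urlencode

# Base address quoted in this article; the "https://" scheme is an assumption.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
BATCH_SIZE = 500  # per-URL EFetch limit stated in the article


def efetch_urls(total, query_key, webenv, db="protein", rettype="fasta"):
    """Yield one EFetch URL per batch of up to BATCH_SIZE records."""
    for retstart in range(0, total, BATCH_SIZE):
        params = {
            "db": db,
            "query_key": query_key,
            "WebEnv": webenv,
            "retstart": retstart,  # zero-based offset of the first record
            "retmax": BATCH_SIZE,
            "rettype": rettype,
        }
        yield f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


urls = list(efetch_urls(14988, "47", "WEBENV_PLACEHOLDER"))
print(len(urls))  # 30 batches: retstart = 0, 500, ..., 14500
```

Because `range` stops before `total`, the final batch starts at 14,500 and simply returns the remaining 488 records.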

The E-Utilities Limit and Link Datasets

To find the annotated genes associated with a select group of this set of RefSeq proteins, namely those that have interleukin 22 in their title, another ESearch URL can be used, as listed under Example 2 of the ESearch section of Box 1, where "%2347" is the URL encoding for "#47" and refers to our previous query key. The five resulting GIs can be extracted from the XML and used as input to ELink, shown in the ELink section of Box 1.
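
The "%2347" encoding can be reproduced with Python's standard `urllib.parse.quote`. In the sketch below, the helper `history_term` and the query syntax `#47 AND interleukin 22[title]` are assumptions made for illustration; the exact term used in Box 1 is not reproduced in this article.

```python
from urllib.parse import quote

# "#" must be percent-encoded, otherwise it would start a URL fragment.
history_ref = quote("#47", safe="")
print(history_ref)  # %2347


def history_term(query_key, restriction):
    """Combine a History query key with an extra restriction and percent-encode it."""
    return quote(f"#{query_key} AND {restriction}", safe="")


# Hypothetical restriction syntax, for illustration only.
encoded = history_term(47, "interleukin 22[title]")
print(encoded)
```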

Since each GI was assigned in a separate &id parameter, the XML output will contain a separate list of linked GeneIDs for each protein GI. A simple analysis of the results shows that the second, third, and fourth protein GIs are linked to the same GeneID and so represent the three transcriptional variants of the interleukin 22 binding protein, a name that is in turn retrieved by a single ESummary call with that common GeneID.
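
The "simple analysis" amounts to grouping protein GIs by the GeneID they link to. The sketch below shows one way to build an ELink URL with repeated &id parameters and to do that grouping. The script name `elink.fcgi`, the parameter names `dbfrom` and `db`, the database name `gene`, and both function names are assumptions; the identifiers are placeholders, not real GIs or GeneIDs.

```python
from collections import defaultdict
from urllib.parse import urlencode

# Base address quoted in this article; the "https://" scheme is an assumption.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_url(protein_gis, dbfrom="protein", db="gene"):
    """Build an ELink URL with one separate id parameter per protein GI."""
    params = [("dbfrom", dbfrom), ("db", db)] + [("id", gi) for gi in protein_gis]
    return f"{EUTILS_BASE}/elink.fcgi?{urlencode(params)}"


def group_by_gene(links):
    """Invert a {protein GI: [GeneIDs]} mapping into {GeneID: [protein GIs]}."""
    groups = defaultdict(list)
    for gi, gene_ids in links.items():
        for gene_id in gene_ids:
            groups[gene_id].append(gi)
    return dict(groups)


# Placeholder identifiers standing in for parsed ELink output.
links = {
    "gi_1": ["gene_A"],
    "gi_2": ["gene_B"],
    "gi_3": ["gene_B"],
    "gi_4": ["gene_B"],
    "gi_5": ["gene_C"],
}
shared = {gene: gis for gene, gis in group_by_gene(links).items() if len(gis) > 1}
print(shared)
```

Any GeneID that ends up with more than one protein GI is a candidate for a gene with several variants, which is the pattern described above.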

By using additional combinations of E-Utilities calls, a wide array of data pipelines can be constructed easily and used to process large numbers of data records.

For more information, see the following:

E-utility online documentation:

NCBI PowerScripting, a new NCBI course on programming with the E-utils:

Building Customized Data Pipelines Using the Entrez Programming Utilities (eutils):

—ES

Helpful E-Utility URLs and E-Utility Samples - The base E-Utility URL:

eutils.ncbi.nlm.nih.gov/entrez/eutils



