Supported programming languages

Datasets supports multiple languages via its API

NCBI Datasets API v2alpha can be accessed by any programming language that supports HTTP requests. Nearly all languages have libraries providing this support.

NCBI Datasets previously provided client libraries (for example, for Python and R) for programmatic access. These libraries support only NCBI Datasets API v1, not v2alpha. We decided not to provide pre-built libraries for v2alpha primarily because most users prefer to retrieve the data they need with our datasets command-line tool (CLI). If you need more fine-grained access, you can call the v2alpha API directly through its REST interface or use an OpenAPI Generator to generate a client library for the language of your choice.
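
As a minimal sketch of direct REST access, the Python snippet below uses only the standard library to build a request URL and decode a JSON response. The base URL `https://api.ncbi.nlm.nih.gov/datasets/v2alpha` and the `gene/id/{ids}` path are assumptions that this page does not state; check them against the OpenAPI specification before relying on them. The helper names `build_gene_url` and `fetch_json` are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, Iterable

# Assumed base URL for the v2alpha REST API; verify it against the OpenAPI spec.
BASE_URL = "https://api.ncbi.nlm.nih.gov/datasets/v2alpha"


def build_gene_url(gene_ids: Iterable[int]) -> str:
    """Build the (assumed) gene report URL for one or more NCBI Gene IDs."""
    ids = ",".join(str(int(gene_id)) for gene_id in gene_ids)
    return f"{BASE_URL}/gene/id/{ids}"


def fetch_json(url: str, timeout: float = 30.0) -> Dict[str, Any]:
    """Send a GET request and decode the JSON response body."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_json(build_gene_url([2, 2778]))` would request reports for the A2M and GNAS genes, provided the assumed endpoint is correct. Any language with an HTTP library can follow the same pattern.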

Generating OpenAPI client libraries for Datasets

Given the NCBI Datasets OpenAPI 3.0 specification, you can use an OpenAPI Generator to build an NCBI Datasets API v2alpha client library for the language of your choice. The generated library includes both documentation and code for each NCBI Datasets v2alpha REST API function, and you can install and use it following the usual conventions for that language. There are several ways to run the generator; see the OpenAPI Generator documentation for a more thorough explanation. The instructions below show two approaches for building the Python version of the Datasets OpenAPI library: one using npm and one using the OpenAPI Generator Java JAR file. The generate commands pass some additional parameters, including the package name and the project name. These options and other generator parameters are described in the OpenAPI Generator documentation.

Build Python NCBI Datasets API v2alpha library using npm

#!/usr/bin/env bash
OUTPUT_DIR="python_lib"
# get a copy of the Datasets OpenAPI v2alpha Specification
wget https://www.ncbi.nlm.nih.gov/datasets/docs/v2/openapi3/openapi3.docs.yaml
# Initialize a local npm package
npm init --yes
# Install the OpenAPI specification generator. To get help on using the openapi generator for various languages run:
# npx @openapitools/openapi-generator-cli config-help -g <python|go|javascript|typescript|R|...>
npm install @openapitools/openapi-generator-cli
# Create the datasets OpenAPI library for python in the directory ${OUTPUT_DIR}.
# For more info see: https://openapi-generator.tech/docs/usage/#generate
npx @openapitools/openapi-generator-cli generate \
    -i openapi3.docs.yaml \
    -g python \
    -o "${OUTPUT_DIR}" \
    --package-name "ncbi.datasets.openapi" \
    --additional-properties=pythonAttrNoneIfUnset=true,projectName="ncbi-datasets-pylib"

Build Python NCBI Datasets API v2alpha library using OpenAPI Java libraries

#!/usr/bin/env bash
OUTPUT_DIR="python_lib"
# get a copy of the Datasets OpenAPI v2alpha Specification
wget https://www.ncbi.nlm.nih.gov/datasets/docs/v2/openapi3/openapi3.docs.yaml
# Get the OpenAPI library generator (a Java jar file)
wget https://repo1.maven.org/maven2/org/openapitools/openapi-generator-cli/7.2.0/openapi-generator-cli-7.2.0.jar -O openapi-generator-cli.jar
# Create the datasets OpenAPI library for python in the directory ${OUTPUT_DIR}.
# For more info see: https://openapi-generator.tech/docs/usage/#generate
java -jar openapi-generator-cli.jar generate \
    -g python \
    -i openapi3.docs.yaml \
    -o "${OUTPUT_DIR}" \
    --package-name "ncbi.datasets.openapi" \
    --additional-properties=pythonAttrNoneIfUnset=true,projectName="ncbi-datasets-pylib"

Example Python program using the generated OpenAPI library: gene_get_info.py

import io
import sys
from typing import List
from zipfile import ZipFile

from ncbi.datasets.openapi import ApiClient as DatasetsApiClient
from ncbi.datasets.openapi import ApiException as DatasetsApiException
from ncbi.datasets.openapi import GeneApi as DatasetsGeneApi

zipfile_name = "gene_ds.zip"
protein_fasta = "ncbi_dataset/data/protein.faa"

# Download the data package with the GeneApi and save it to disk.
# Gene IDs 2 and 2778 correspond to A2M and GNAS.
with DatasetsApiClient() as api_client:
    gene_ids: List[int] = [2, 2778]
    gene_api = DatasetsGeneApi(api_client)
    try:
        gene_dataset_download = gene_api.download_gene_package_without_preload_content(
            gene_ids,
            include_annotation_type=["FASTA_GENE", "FASTA_PROTEIN"],
        )

        with open(zipfile_name, "wb") as f:
            f.write(gene_dataset_download.data)
    except DatasetsApiException as e:
        sys.exit(f"Exception when calling GeneApi: {e}")

# Print the protein sequences from the downloaded package.
# Zip member names always use forward slashes, so the path is written literally.
try:
    with ZipFile(zipfile_name) as dataset_zip:
        zinfo = dataset_zip.getinfo(protein_fasta)
        with io.TextIOWrapper(dataset_zip.open(zinfo), encoding="utf8") as fh:
            print(fh.read())
except KeyError as e:
    sys.exit(f"File {protein_fasta} not found in {zipfile_name}: {e}")
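
If you need the individual sequences rather than the raw file contents, the records can be split with the standard library alone. The following is a minimal sketch, assuming the data package contains `ncbi_dataset/data/protein.faa` as in the program above; `read_fasta_from_zip` is a hypothetical helper, not part of the generated library.

```python
import io
from typing import Dict, Optional
from zipfile import ZipFile


def read_fasta_from_zip(zip_path, member: str = "ncbi_dataset/data/protein.faa") -> Dict[str, str]:
    """Return {header: sequence} for every record in a FASTA file inside a zip archive.

    zip_path may be a filename or a binary file-like object.
    Raises KeyError if the member is not present in the archive.
    """
    records: Dict[str, str] = {}
    header: Optional[str] = None
    with ZipFile(zip_path) as archive:
        with io.TextIOWrapper(archive.open(member), encoding="utf8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                if line.startswith(">"):
                    # A header line starts a new record
                    header = line[1:]
                    records[header] = ""
                elif header is not None:
                    # Sequence lines are concatenated onto the current record
                    records[header] += line
    return records
```

For example, `for header, seq in read_fasta_from_zip("gene_ds.zip").items(): print(header, len(seq))` would list each protein header with its sequence length.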

Run the above Python program

# Create a virtual env and install the newly built library
virtualenv venv && source venv/bin/activate
pip install ./python_lib
# Run program to retrieve a zip file with gene information, and print the protein.faa file
python gene_get_info.py

Generated April 19, 2024