Get representative protein sequences from Ortholog sets

Each ortholog gene may have many transcripts. Here we describe how you can filter the resulting metadata to only the longest transcript.

Get representative protein sequences from Ortholog sets

Each ortholog gene may have many transcripts. Here we describe how you can filter the resulting metadata to only the longest transcript.

Get one representative protein sequence from a set of orthologs

# [START ortholog_longest_transcript]
from typing import Iterator, List

from ncbi.datasets.openapi import ApiClient as DatasetsApiClient
from ncbi.datasets.openapi import ApiException as DatasetsApiException
from ncbi.datasets import GeneApi as DatasetsGeneApi

input_symbol = "ACE2"
input_taxon = "human"

def find_longest_transcript(gene):
    if not gene.transcripts:
        return None
    num_trs = len(gene.transcripts)
    if num_trs == 0:
        return None
    if num_trs == 1:
        return gene.transcripts[0]

    def get_tr_length(tr):
        return tr.length

    tr_by_len = sorted(gene.transcripts, key=get_tr_length, reverse=True)
    maxlen = get_tr_length(tr_by_len[0])
    longest = []
    for tr in tr_by_len:
        trlen = get_tr_length(tr)
        if trlen < maxlen:
    if len(longest) == 1:
        return longest[0]
    # select ealiest accession.version (oldest sequence)
    return sorted(longest, key=lambda tr: tr.accession_version)[0]

def _genes_for_ortholog(ortholog_set) -> Iterator:
    for gene_match in ortholog_set.genes.genes:
        if gene_match.messages:
            for m in gene_match.messages:
                if m.error:
                    print(f"error: [{m.error.reason}] {m.error.message}")
                if m.warning:
                    print(f"warning: [{m.warning.reason}] {m.warning.message}")
        if gene_match.warnings:
            for m in gene_match.warnings:
                print(f"warning: [{m.reason}] {m.message}")
        if gene_match.gene:
            yield gene_match.gene

def retrieve_orthologs_by_gene_id(gene_api: DatasetsGeneApi, gene_id: str):
    ortholog_set = gene_api.gene_orthologs_by_id(int(gene_id))
    for gene in _genes_for_ortholog(ortholog_set):
        longest_tr = find_longest_transcript(gene)
                "gene_id": gene.gene_id,
                "symbol": gene.symbol,
                "tr-accession": longest_tr.accession_version,
                "tr-length": longest_tr.length,
                "prot-accession": longest_tr.protein.accession_version,
                "prot-length": longest_tr.protein.length,

def gene_ids_for_symbol_and_taxon(gene_api: DatasetsGeneApi, symbols: List[str], taxon: str) -> List[int]:
    """Return a list of gene-ids for a set of symbols and taxon"""
    gene_ids_for_symbols: List[int] = []
        gene_reply = gene_api.gene_metadata_by_tax_and_symbol(
        for gene in gene_reply.genes:
    except DatasetsApiException as e:
        print(f"Exception when calling GeneApi: {e}\n")
    return gene_ids_for_symbols

def retrieve_orthologs_for_symbol(symbol: str, taxon: str):
    with DatasetsApiClient() as api_client:
        gene_api = DatasetsGeneApi(api_client)
            gene_ids = gene_ids_for_symbol_and_taxon(gene_api, [symbol], taxon)
            if len(gene_ids) != 1:
                raise Exception(f"Unexpected number of gene-ids: {gene_ids}")

            retrieve_orthologs_by_gene_id(gene_api, gene_ids[0])
        except DatasetsApiException as e:
            print(f"Exception when calling GeneApi: {e}\n")

# [END ortholog_longest_transcript]

if __name__ == "__main__":
    retrieve_orthologs_for_symbol(symbol=input_symbol, taxon=input_taxon)
  1. Download a gene ortholog data package. In this example, we are using the --taxon-filter flag to restrict the data to calculated orthologs from ferret (TaxId: 9669), mouse (TaxId: 10090) and human (TaxID: 9606).
datasets download ortholog gene-id 672 \
--taxon-filter 9669,10090,9606 \

unzip -d brca1_example
  1. Using dataformat, let’s create a tsv file with the following fields: taxonomic ID, taxonomic name, transcript accession and length, protein accession and length and isoform name. We’ll be working from the data directory:
cd brca1_example/ncbi_dataset/data

dataformat tsv gene \
--inputfile data_report.jsonl \
--fields tax-id,tax-name,transcript-accession,transcript-length,transcript-protein-accession,transcript-protein-length,transcript-protein-isoform > transcript_protein.tsv

cat transcript_protein.tsv


Taxonomic ID    Taxonomic Name  Transcript Accession    Transcript Transcript Length    Transcript Protein Accession    Transcript Protein Length       Transcript Protein Isoform
9669    Mustela putorius furo   XM_004772611.3  6768    XP_004772668.1  1869    isoform X5
9669    Mustela putorius furo   XM_004772612.3  6929    XP_004772669.1  1869    isoform X5
9669    Mustela putorius furo   XM_045080745.1  6937    XP_044936680.1  1869    isoform X5
9669    Mustela putorius furo   XM_013045927.2  6902    XP_012901381.1  1822    isoform X6
9669    Mustela putorius furo   XM_013045926.2  6950    XP_012901380.1  1877    isoform X4
9669    Mustela putorius furo   XM_004772609.3  6953    XP_004772666.1  1878    isoform X2
9669    Mustela putorius furo   XM_004772610.3  6953    XP_004772667.1  1878    isoform X3
9669    Mustela putorius furo   XM_004772608.3  6956    XP_004772665.1  1879    isoform X1
9669    Mustela putorius furo   XM_013045929.2  3642    XP_012901383.1  775     isoform X9
9669    Mustela putorius furo   XM_004772616.3  3642    XP_004772673.1  775     isoform X8
9669    Mustela putorius furo   XM_013045928.2  3645    XP_012901382.1  776     isoform X7
10090   Mus musculus    XR_004936704.1  6499
10090   Mus musculus    XM_006532068.4  3363    XP_006532131.1  707     isoform X4
10090   Mus musculus    XM_006532064.5  6543    XP_006532127.1  1767    isoform X2
10090   Mus musculus    NM_009764.3     6648    NP_033894.3     1812
10090   Mus musculus    XM_036156253.1  6542    XP_036012146.1  1765    isoform X3
10090   Mus musculus    XM_030245495.2  6780    XP_030101355.1  1812    isoform X1
9606    Homo sapiens    NR_027676.2     7152
9606    Homo sapiens    NM_007297.4     7028    NP_009228.2     1816    isoform 3
9606    Homo sapiens    NM_007299.4     3696    NP_009230.2     699     isoform 5
9606    Homo sapiens    NM_007294.4     7088    NP_009225.1     1863    isoform 1
9606    Homo sapiens    NM_007300.4     7151    NP_009231.2     1884    isoform 2
9606    Homo sapiens    NM_007298.3     3699    NP_009229.2     759     isoform 4
  1. Now let’s create a list of TaxIds using dataformat
dataformat tsv gene \
--inputfile data_report.jsonl \
--fields tax-id \
--elide-header > taxids.txt
  1. Now let’s create two files: longest_transcript.txt and longest_protein.txt with the accessions of the longest sequences for each type of data.
cat taxids.txt | while read IDS; do
	LONGEST=$(grep -w "^$IDS" transcript_protein.tsv | sed 's/\t/|/g' | sort -Vr -k6,6 -t'|' | head -n1);
	TRANSCRIPT=$(echo $LONGEST | cut -f3 -d'|');
	PROTEIN=$(echo $LONGEST | cut -f5 -d'|');
	printf "$TRANSCRIPT\n" >> transcript.list;
	printf "$PROTEIN\n" >> protein.list;
  1. Extract the sequences from the original files rna.fna and protein.faa using the lists we created and the program seqtk
  • transcript: seqtk subseq rna.fna transcript.list > rna_longest.fna
  • protein: seqtk subseq protein.faa protein.list > protein_longest.faa
Generated December 6, 2022