Help on the annotated mRNA page

January 27, 2004

 

 The annotated mRNA page


 

     Viewing the mRNA page, tabs and menus

     How to access the alternative variants annotations

     Choice of the sequence to annotate: genome versus cDNA consensus

     cDNA clones supporting this mRNA

·        Annotation of the mRNA and protein variant

-       Summary annotation of the mRNA variant

-       Primers

-       Predicted cellular localization and motifs (Psort)

-       Protein family classification (Pfam)

-       Protein homologies (BlastP)

-       Lineage and Closest homologues (TaxBlast)

-       mRNA structure

·         Graphical representation of the AceView-derived mRNA and protein

 

Viewing the mRNA page, tabs and menus


 

To see this page and the rest of AceView correctly, please enable JavaScript and StyleSheets in your browser (Tools -> Internet Options).

 

The Annotated mRNA(s) page is accessible by clicking on the gray tab at the top of the page; it then becomes blue.

 

 

You now see, on the left, text describing the annotation of the specific mRNA, starting with a menu and a mouse-over submenu. The menu and submenu are transcript dependent: only paragraphs that have content for the particular transcript appear in the menus. Titles in the menu and submenu are clickable (blue); clicking one takes you to the corresponding chapter or paragraph in the text, or to a linked document.

In turn, each paragraph in the text has an associated help page, accessible by clicking on the conspicuous question mark (?). The help gives details on what is described and on how the information is gathered, generated or evaluated in AceView. Please complain if things remain unclear.

 

Example of menu in the mRNA page.

 

mRNA summary

Gene summary

Protein annotation

mRNA structure

cDNA clones

Sequences

Diagram

 

On the right is a diagram depicting the spliced variant, decorated with BlastP homologies, Pfam and Psort motifs, Stops and Met(AUG) in the three frames, and up to 60 supporting mRNAs and ESTs. Reconstructed consensus sequences are highlighted: the .AM AceView reference sequence in pale yellow and the NCBI RefSeq in light turquoise. For each sequence aligned in the mRNA variant, differences from the genome sequence are color-coded, polyA or unaligned sequences are drawn and cDNA clone anomalies are labeled.

For the worm, we indicate the stage and/or tissue and the presence and type of trans-spliced leader.

 

How to access the alternative variants annotations


 

If there are alternative variants, you may view them in either of two ways: by using the toggle at the top left of the graphic mRNA window, which lists all the annotated variants, or by clicking on the letter identifying the variant in the gene summary in the text window on the left.

 

 

Choice of the sequence to annotate: underlying genome versus cDNA consensus


 

We have chosen to annotate the mRNA sequence derived from the genome sequence (by concatenating the exon sequences) rather than the consensus of the cDNAs, which we call .AM (itself depicted alongside the transcript as a cDNA highlighted in yellow). We compared the average quality of the genome sequence with that of individual mRNAs from GenBank and found that the average frequency of “errors” is 8 to 10 times greater in mRNAs than in finished genomic sequences (on a sample of 400 differences, in the worm). Hence we believe that the genome-derived sequence is overall of better quality than the cDNA consensus. We nevertheless give, in the transcript summary, the explicit count of differences between the two sequences, and we provide the cDNA consensus sequence (.AM) in the “Fasta sequences” list in each transcript, and in the “Transcripts” table on the Gene page.

Warning: there are a number of cases where a single base deletion or insertion in the genome sequence has led to a frameshift that breaks the open reading frame. These cases are easy to spot graphically: a line of blue errors in the cDNAs coincides with a shift in the black/green open reading frames. Users should then be aware that there is a problem, and they had better reannotate the transcript themselves, preferring the cDNA consensus sequence .AM that we provide in the Fasta sequences page.
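The genome-derived mRNA and the difference count can be illustrated with a minimal Python sketch. The function names, the toy genome, the exon coordinates (1-based, inclusive) and the consensus string are all hypothetical; AceView's actual code and alignment handling are not described here, and this sketch only counts substitutions between sequences of equal length.

```python
def spliced_mrna(genome: str, exons: list[tuple[int, int]]) -> str:
    """Concatenate exon sequences; coordinates are 1-based and inclusive (assumption)."""
    return "".join(genome[start - 1:end] for start, end in sorted(exons))


def count_differences(a: str, b: str) -> int:
    """Count mismatching positions between two already aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


# Toy data, invented for illustration only.
genome = "AAACCCGGGTTTAAACCC"
exons = [(1, 6), (13, 18)]          # two exons separated by one intron
mrna = spliced_mrna(genome, exons)  # genome-derived mRNA
consensus = "AAACCCAAACCT"          # stand-in for a cDNA consensus (.AM)
differences = count_differences(mrna, consensus)
```

Insertions and deletions, which are the cause of the frameshifts mentioned in the warning, would need a real alignment rather than this position-by-position comparison.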

  

  cDNA clones supporting this mRNA


 

  Example: Selected cDNA clones supporting this mRNA ?

The detailed table about these clones is here.

Complete CDS clones: A21353, NM_002286, X51985.

The table of all clones supporting the mRNA has the standard format for tables of clones in AceView. If you need more explanations, please write to us.

The second paragraph lists the accessions of the most interesting clones; clicking on an accession brings up the corresponding NCBI mRNA or EST record.

  

 Summary annotation of the mRNA


 

      Example: This complete CDS mRNA is 1,035 bp long. It is supported by 4 cDNA clones. We annotate here the sequence derived from the genome, although the best path through the available clones differs from it in 4 positions. The pre-messenger has 5 exons. It covers 9.66 kb on the NCBI build 33, April 2003 genome. The protein (229 amino acids, 26.8 kDa, pI 9.0) contains no Pfam motif. It contains 2 coiled-coil stretches and an endoplasmic reticulum membrane domain [PSORT II]. It is predicted to localize in the nucleus [PSORT II]. TaxBlast (threshold 10^-3) tracks ancestors back to Eutheria.

 

An overview is provided here; it typically contains a summary of all the analyses detailed in the following paragraphs. It is computer generated; therefore it sounds like a cold frog. The information is always presented in the same order:

1.    Is the transcript complete or truncated, and if so, on which side?

2.    What is its observed length? This links to the sequence, shown with lower case for the UTRs and upper case for the coding region; exons are indicated in alternating colors.

3.    How many clones support it?

4.    Does the sequence derived from the cDNA consensus guided by the genome differ from the genome underneath, and if so, how many point differences are there? Look at the yellow highlighted clone-like object in the diagram to see if the sequence differences, indicated by blue marks for insertions or deletions and red marks for single point mutations, are in the coding (wide dark pink area) or in the non coding UTRs.

5.    How many exons are there, and what is the size of the genomic piece under the transcript? This links to the sequence, shown with lower case for the UTRs and upper case for the coding region; exons are indicated in alternating colors, and introns in black lower case.

6.    Then about the derived protein: how many amino acids does it contain? What are its molecular weight and pI (provided by ExPASy)? Note the influence here of the choice of the initiator Met, and remember that the proteins we annotate are deduced from the mRNA sequence, not observed directly. Translation usually starts at an ATG (Met), but at least three other codons, GTG, TTG and CTG, are candidates to be used as initiators in most species (see the codon usage table maintained by the Taxonomy group at NCBI). Confronted with the choice of annotating a protein possibly too long or possibly too short, we decided to accept any of the possible initiator codons as Start: we simply pick whichever codon gives the longest predicted protein. Using this simple rule with no further constraint, we find that 3.3% of the proteins are annotated as starting from one of these “rare” sites rather than from ATG.

7.    Does the protein contain reliable type A Pfam motifs?

8.    Does it contain motifs searched by PSORT II (see details below)?

9.    What is its predicted cellular localization?

10. Finally, are there hits by BlastP at 10^-3, and if so, how far back in evolution can we trace the protein?
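The sequence display convention mentioned in items 2 and 5 (lower case for the UTRs, upper case for the coding region) can be sketched as follows. This is only an illustration in Python: the function name, the 1-based inclusive CDS coordinates and the toy sequence are assumptions, and the alternating exon colors are not reproduced.

```python
def format_transcript(seq: str, cds_start: int, cds_end: int) -> str:
    """Return the mRNA with UTRs in lower case and the CDS in upper case.

    cds_start and cds_end are 1-based, inclusive positions on the mRNA (assumption).
    """
    five_prime_utr = seq[:cds_start - 1].lower()
    cds = seq[cds_start - 1:cds_end].upper()
    three_prime_utr = seq[cds_end:].lower()
    return five_prime_utr + cds + three_prime_utr


# Toy transcript: 3 bp 5'UTR, a 9 bp CDS (ATG AAA TAA), 2 bp 3'UTR.
formatted = format_transcript("GGCATGAAATAACC", 4, 12)
print(formatted)
```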

 

Primers


 

Primers and temperature conditions to amplify the CDS are calculated by Osp (Hillier L, Green P. PCR Methods Appl. 1991 Nov;1(2):124-8). Usually, an annealing temperature 2 or 3 degrees Celsius below the lowest T indicated gives good results.
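The rule of thumb above is simple arithmetic, sketched here in Python. The function name and the example temperatures are hypothetical; the temperatures would be the values T reported for the primer pair, and the 2 to 3 degree offset is the one suggested in the text.

```python
def suggested_annealing_range(primer_temps_celsius: list[float]) -> tuple[float, float]:
    """Return (low, high) annealing temperatures, 3 and 2 degrees C below the lowest T."""
    lowest = min(primer_temps_celsius)
    return (lowest - 3.0, lowest - 2.0)


# Example with invented temperatures for a forward and a reverse primer.
low, high = suggested_annealing_range([58.0, 61.5])
```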

 

Predicted cellular localization and motifs (Psort)


 

·        PSORT II by Kenta Nakai gives the a priori probability for a protein to be found in the various subcellular compartments. But the localization cannot be reliably assessed if the protein is incomplete, because it may be missing dominant signals that would influence its localization, for example a signal peptide. We limit the main localization annotation to complete proteins, and in the mRNA summary or to generate a title for a variant, we impose that the Psort probability for the most likely localization be above 50% for all compartments, except for the nucleus, where we demand above 60%. However in the text, we report the a priori probabilities as they come out of the program, even for partial proteins.

 

The PSORT II principles are explained in this very useful document and in the papers cited therein. PSORT II is based on physical properties of the protein and on the recognition of addressing and other motifs; inferences on the subcellular localization are derived by training the program on a set of 1,531 yeast proteins whose localization is known (thanks to YPD/Proteome).

Validation: This problem is difficult, although extremely important. In C. elegans, with our thresholds and conditions, between 75 and 85% of the proteins whose localization is known are predicted correctly by PSORT II. Large proteins, in particular from the cytoskeleton or near the cell periphery, are often predicted to be nuclear.

 

A PSORT predicted localization example:

PSORT II analysis (K. Nakai http://psort.nibb.ac.jp), trained on yeast data, predicts that the subcellular location of this protein is most likely in the Golgi (33%) or in the endoplasmic reticulum (33%). Less likely possibilities are in the plasma membrane (22%) or secreted (11%).
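The thresholds used for the main localization annotation (complete proteins only, above 50% for any compartment, above 60% for the nucleus) can be written as a small Python sketch. The function name, the dictionary layout and the compartment labels are assumptions made for illustration; this is not AceView's code.

```python
from typing import Optional


def main_localization(probabilities: dict[str, float],
                      protein_is_complete: bool) -> Optional[str]:
    """Return the compartment to report in the summary, or None if no call is made.

    probabilities maps a compartment label to its PSORT II probability (0.0 to 1.0).
    """
    if not protein_is_complete or not probabilities:
        return None
    compartment, p = max(probabilities.items(), key=lambda item: item[1])
    threshold = 0.60 if compartment == "nucleus" else 0.50
    return compartment if p > threshold else None


# The example above: no compartment exceeds 50%, so no main localization is called.
example = {"Golgi": 0.33, "endoplasmic reticulum": 0.33,
           "plasma membrane": 0.22, "secreted": 0.11}
call = main_localization(example, protein_is_complete=True)
```

Under these assumptions, a protein such as the one in the example would still have its probabilities listed in the text, but would get no localization in the summary or in the variant title.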

 

·        PSORT II also provides the coordinates and the sequences of the motifs recognized, as calculated by a variety of programs, some written or modified by Nakai’s team, some made available by others. These usually short motifs include signal peptides, transmembrane domains, coiled-coils, nuclear localization domains, endoplasmic reticulum retention signals, and N-myristoylation and prenylation domains. The sequences are displayed in the table, and they are shown as wide red bars in the graphic.

It may be useful to access all proteins sharing a given motif, and we provide this function by clicking on the name of the motif in the table. The result table has the usual structure for a gene list and allows iterative refinements by query. The list is ordered by map and position on the chromosomes; it provides the gene title, often with phenotypic or functional indications, and an idea of the expression level, through the total number of clones coming from that gene (according to AceView).

 

A PSORT domain table example:

From aa       Domain                          Sequence
1 to 17       N_terminal_Signal_domain        MSLSFLLLLFFSHLILS
234 to 240    Possible nuclear localization   PEKKKPP
236 to 239    Possible nuclear localization   KKKP
264 to 267    ER_membrane_domain              KFRF

 

 

Protein family classification (Pfam)


 

·        This paragraph reports attempts to classify the protein as a member of a previously recognized and described protein family. We use Pfam and HMMER, as detailed below. If the protein convincingly belongs to a family, clicking on the family name brings the more complete description at the Sanger Institute Pfam site, often with the 3D structure of a member. The AceView genes that belong to the same family are enumerated and listed a click away.

·        Details: Proteins are classified into protein families, known as the Pfam A protein family motifs, defined by a very active international collaboration (Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer EL. Nucleic Acids Res. 2002 Jan 1;30(1):276-80). They are searched using the HMMER program provided by Sean Eddy. We keep only the highly significant hits according to the Pfam cut-offs. We download and use systematically the latest release of the Pfam motifs, about 2 weeks before release date. Version 2.8g of the program is indeed slow and computer greedy, but it performs in our tests significantly better than the much faster CD search at NCBI. More recently, we tested HMMER version 3 and it is considerably faster. The Pfam motifs used for the worm and Arabidopsis are from version 8.0, downloaded May 2003; for human December 2003 (build 34), they are from version 10.0 downloaded September 24th, 2003.

·        The Pfam motifs are displayed as orange boxes; clicking on the box, or on the number given in the text, produces the list of all other genes in that species that belong to the same family, in the standard format for genes tables, which allows further refinement of the query.

·        As a bonus of Pfam, and thanks to the work of the EBI Apweiler team (Mulder NJ et al., Nucleic Acids Res. 2003 Jan 1;31(1):315-8), we gain the InterPro descriptions. People interested in family descriptions should also follow the provided link to Pfam, because some of the Pfam descriptions are still better than the corresponding InterPro descriptions, although all descriptions will ultimately be merged into InterPro. GO classifications are also made available by the InterPro group. A word of caution: this annotation is not always perfect and reliable, because the function of a family is often inferred from one or a small number of its members, and the underlying hypothesis of structural to functional conservation is sometimes incorrect. Yet this GO annotation is better than nothing, and the GO terms are displayed in the gene summary and the gene summary table on the “Gene on Genome” page, with an indication of whether they came from Pfam (and were inferred by electronic annotation) or from manual curation of an article. In the table, each GO term points to the set of genes with that annotation.

 

Protein homologies (NCBI BlastP)


 

·        The protein is decorated with NCBI BlastP homologies against the nr database. We only consider hits that would arise by chance at less than 1 per thousand (E < 10^-3). We report the total number of hits in the text but limit the display and the “Table of BlastP hits” to the best 30 hits. The results are described in the text under “Protein homologies” and are shown as blue overlapping boxes in the graphical display.

·        Clicking on the blue box on the graph or in the text under BlastP brings in the BlastP table of results, calculated a few days before the release (so it may be up to 3-4 months old).
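The hit selection described above (keep hits with E < 10^-3, report the total, display only the best ones) might look like the following Python sketch. The `Hit` type, the function name and the placeholder accessions are hypothetical; the display limit is left as a parameter, defaulting to the 30 quoted above.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    accession: str   # placeholder identifiers below, not real records
    evalue: float


def select_hits(hits: list[Hit], max_evalue: float = 1e-3,
                max_displayed: int = 30) -> tuple[int, list[Hit]]:
    """Return (total number of significant hits, best hits kept for display)."""
    significant = sorted((h for h in hits if h.evalue < max_evalue),
                         key=lambda h: h.evalue)
    return len(significant), significant[:max_displayed]


hits = [Hit("hypothetical_1", 1e-50), Hit("hypothetical_2", 0.5),
        Hit("hypothetical_3", 2e-4), Hit("hypothetical_4", 1e-3)]
total, displayed = select_hits(hits)
```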

 

   Lineage and closest homologues (abbreviated as Taxonomy in the Table of Contents)


 

BlastP at NCBI comes with a very useful associated “Taxonomy report”, also known as TaxBlast. From this document, we extract the number of hits in each of a number of selected branches of the tree. The entire tree is represented, but some branches are purposely merged (for example as “other amniota”). The accession of the closest homolog in the most studied species is given, preferring the RefSeq (NM or XM) to GenBank, SwissProt or PIR whenever there is a choice, because we then gain for free a connection to the nicely annotated NCBI LocusLink/Gene database.

The data are represented as a (hopefully self-explanatory) tree.

As always, some caution in the interpretation is recommended, especially when the number of hits in a large branch is small, as in the example below.

 

Example of a possibly ancient gene (nematode spd-5 spindle defective gene).

Based on Taxblast, homologues to this gene are found in the following organism(s):
Archaea 2 hits.
Bacteria 3 hits.
--Other Bacteria 3 hits.
Eukaryota 38 hits.
--Mycetozoa 1 hit.
----Dictyostelium discoideum 1 hit, best hit: AAO52027.1.
--Fungi Metazoa group 35 hits.
----Pseudocoelomata 7 hits.
------Caenorhabditis elegans 7 hits, best hit: NP_491539.1.
----Deuterostomia 28 hits.
------Amniota 28 hits.
--------Mus musculus 6 hits, best hit: NP_032507.1.
--------Rattus norvegicus 8 hits, best hit: XP_230851.1.
--------Homo sapiens 11 hits, best hit: P24043.
--------Other Amniota 3 hits.
--Other Eukaryota 2 hits.

Note that no homologue was found in E. coli or in the most studied bacteria, and only 2 archaea are positive at 10^-3. If I were interested in this gene, I would make sure that the few bacterial hits are not contaminations.

 

   mRNA structure


 

This paragraph gives information on the 5’UTR, the splicing pattern, and the 3’UTR, including a report on the polyadenylation signal. The various sequences are a click away. Tell us if something needs documentation.

Example: mRNA Structure ?
The sequence of the 5kb upstream of the transcript is here.
The 5'UTR contains about 156 bp. The CDS is complete since there is an in frame stop in the 5'UTR 27 bp before the Met.

Splicing: Comparison to the genome sequence shows that 24/24 introns follow the consensus [gt-ag] rule. Coordinates of the introns in the template genomic DNA, where 1 denotes the first genomic base matching the RNA, are:

[ type ]     start    end       length
[ gt-ag ]    409      118864    118456 bp

  The 3'UTR contains about 1613 bp followed by the polyA. The standard AATAAA polyadenylation signal is seen about 30 bp before the polyA. This 3'UTR 1614 bp is among the 5% longest we have seen, it may serve a regulatory function. It contains 23% A, 30% T, 23% G, 21% C.
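The intron length in the example follows from the inclusive coordinates: length = end - start + 1. A short Python sketch, with hypothetical function names, reproduces the row of the example table.

```python
def intron_length(start: int, end: int) -> int:
    """Length of an intron given 1-based, inclusive genomic coordinates."""
    return end - start + 1


def format_intron_row(intron_type: str, start: int, end: int) -> str:
    """Format one row in the spirit of the table above (spacing is arbitrary)."""
    return f"[ {intron_type} ]  {start}  {end}  {intron_length(start, end)} bp"


row = format_intron_row("gt-ag", 409, 118864)
print(row)
```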

 

 

 Graphical representation of the AceView-derived mRNA and protein


 

Each mRNA and protein variant is depicted in its page, decorated with BlastP homologies, Pfam and Psort motifs, Stops and Met(AUG) in the three frames, and up to 60 supporting mRNAs and ESTs. Special sequences are highlighted in this diagram: the .AM AceView reference sequence in pale yellow and the NCBI RefSeq in light turquoise. For each sequence aligned in the mRNA variant, differences from the genome sequence are color-coded, polyA or unaligned sequences are indicated with specific icons and cDNA clone anomalies are labeled.

For the worm, we indicate the stage and/or tissue and the presence and type of trans-spliced leader.

 

[Diagram of the mRNA page graphic; numbered columns, from left to right]

1: Scale in bp
2: Pfam
3: Psort
4: BlastP
5 & 6: 1st and 2nd frame
7: mRNA & Protein
8: 3rd frame
9: NM
10: AM
11: cDNA sequences from databases, decorated with errors

 

1: Scale is in basepairs; base 1 is the first base of the transcript.

When there is a gap in the sequence and a clone with a 5’ and a 3’ read bordering the gap, we estimate the size of the gap from the average insert size of the clones in the libraries.

 

2: Pfam homologies to known protein families are shown in their exact position according to the HMMER program. Click on the orange box to get the list of all genes in this species that contain significant matches to this Pfam family.

Click on the link in the title above for more information on the procedure, and carefully read the text associated to the diagram of your transcript, under Protein family classification for more results, including the InterPro description of the protein family and a link to the Pfam description.

 

3: Psort2 motifs are usually short addressing motifs: they are depicted by wide but usually short red bars. Clicking on some of them (for example transmembrane domains, coiled coils, myristoylation or prenylation domains, ER retention signals, or zinc fingers) brings the list of all genes in the database with a similar motif detected by the same program (in fact a collection of programs assembled and tuned by Kenta Nakai, from Tokyo University). Nuclear localization signals are just short basic stretches; they are very frequent in proteins and do not connect to a list.

Click on the link in the title above for more information on the Psort2 procedure, and carefully read the text associated to the diagram of your transcript, under Predicted cellular localization and motifs (Psort) for more results, in particular the important predicted localization paragraph not summarized in the diagram, but which is actually the main aim and originality of the Psort2 program.

 

4: BlastP/TaxBlast results are shown compressed in a single line, so that the complete extent of all BlastP homologies in the nr protein database, with cut off 10^-3, is visible. Clicking on the blue box brings a table of BlastP hits, with up to the 20 best hits and for each all the characteristics provided by Blast: coordinates, scores and links to the GenBank records. This table was calculated a few days before the release (so it may be up to 3 months old).

Click on either of the two links in the title above for more information on the BlastP or TaxBlast analyses, and carefully read the texts associated to the diagram of your transcript, under Protein homologies and most of all under Lineage and closest homologues. The latter diagram provides you with links to the closest homologues in other species.

 

Each mRNA/protein variant is graphically represented in Tyrian pink as a spliced mRNA. The position of each intron is indicated by a triangle along the mRNA, a kind of scar marking where the intron was. A well defined intron (by AceView criteria) has at least one cDNA clone exactly matching the exons over 8 bp on both sides of the intron. The intron feet are color coded accordingly: pink if the intron is well defined and typical ([gt-ag], [gc-ag], or [at-ac]); blue if it is atypical or not well defined (note that this is less precise than in the gene on genome view, where introns that are not well defined are represented by a straight line rather than a broken blue one). An alternative intron is filled, and a common intron is open. Gaps are not well represented in this view yet.

The protein is represented by the wide pink area in the diagram, while the narrow areas correspond to the UTRs.
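The intron color rule can be summarized in a few lines of Python. This is a sketch under stated assumptions: the function names and the way clone support is passed in (one pair of exactly matching exon lengths per clone, left and right of the intron) are invented for illustration.

```python
TYPICAL_INTRON_TYPES = {"gt-ag", "gc-ag", "at-ac"}


def is_well_defined(clone_matches: list[tuple[int, int]], min_match: int = 8) -> bool:
    """True if at least one clone matches the exons exactly over min_match bp on both sides.

    clone_matches holds, per cDNA clone, the exact match length (bp) on the
    left and on the right of the intron (hypothetical input layout).
    """
    return any(left >= min_match and right >= min_match for left, right in clone_matches)


def intron_foot_color(intron_type: str, clone_matches: list[tuple[int, int]]) -> str:
    """Pink for well defined and typical introns; blue otherwise."""
    typical = intron_type.lower() in TYPICAL_INTRON_TYPES
    return "pink" if typical and is_well_defined(clone_matches) else "blue"


color = intron_foot_color("gt-ag", [(5, 20), (12, 9)])
```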

 

The three reading frames are schematized, with Stops as black lines and Met(AUG) as green lines. Open reading frames (ORFs) lie between two Stops; ORFs above 80 amino acids are outlined as black rectangles; and coding sequences (CDS) are classically considered to go from the first Met to the Stop in the longest open reading frame.

Choice of the initiator Met: When working with mRNA and genome sequences, we never manipulate real protein sequences: all of the proteins we annotate are predicted from the mRNA sequence. To not bias the analysis, we have decided to use the codon usage table maintained by the Taxonomy group at NCBI. There we read that three codons are candidates to be used as initiators in most species, and although ATG is probably used more often than TTG or CTG, we take whichever of these three codons gives us the longest predicted protein. Only actual protein sequencing can definitely say which was the actual initiator, and apart from detecting NH2-terminal signals such as signal peptides, it does not hurt much to annotate a protein that may be too long at the NH2 terminus.
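The "longest predicted protein" rule can be sketched in Python as a scan of the three frames. This is a simplified illustration, not AceView's implementation: the function name is hypothetical, the standard stop codons TAA, TAG and TGA are assumed, the initiator set is the ATG/TTG/CTG trio named above, and only coding sequences that end on a stop inside the mRNA are considered (truncated transcripts would need extra handling).

```python
STOPS = {"TAA", "TAG", "TGA"}
INITIATORS = {"ATG", "TTG", "CTG"}


def longest_cds(mrna: str, initiators: set[str] = INITIATORS):
    """Return (start, end, n_amino_acids) of the longest CDS, or None if there is none.

    start and end are 1-based and inclusive; end is the last base of the stop codon.
    Within each stop-delimited ORF, the first initiator codon gives the longest protein.
    """
    seq = mrna.upper()
    best = None
    for frame in range(3):
        start = None  # index of the first initiator seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                if start is not None:
                    n_aa = (i - start) // 3
                    if best is None or n_aa > best[2]:
                        best = (start + 1, i + 3, n_aa)
                start = None
            elif start is None and codon in initiators:
                start = i
    return best


# Toy mRNA: in the third frame, CTG AAA ATG CCC GGG TAA.
toy = "GGCTGAAAATGCCCGGGTAACC"
with_rare_starts = longest_cds(toy)            # starts at the upstream CTG
atg_only = longest_cds(toy, initiators={"ATG"})  # starts at the ATG
```

On this toy sequence the rule yields a protein two residues longer at the NH2 terminus than an ATG-only rule, which is the kind of trade-off discussed above.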

 

Aligned supporting clones

On the right of the diagram are represented the alignments of the supporting mRNAs and ESTs, a down arrow for 5’ reads and an up arrow for 3’ reads, with color-coded indications of differences from the genome sequence: red for a single basepair difference, transition or transversion; blue for either a single-base insertion or deletion, green for an uncalled base (n). Too many errors may merge as brown. Anomalies are also labeled by red dots (minor) or big squares (major) at the bottom of the clones.

Warning: there are a number of cases where a single base deletion or insertion in the genome sequence has led to a frameshift that breaks the open reading frame. These cases are easy to spot graphically: a line of blue errors in the cDNAs coincides with a shift in the black/green open reading frames. Users should then be aware that there is a problem, and they had better reannotate the transcript themselves, preferring the cDNA consensus sequence .AM that we provide in the Fasta sequences page.

 

In addition, in the worm, we indicate, along the aligned cDNAs, the extra data we have about stage or tissue and trans-spliced leader. The stage is shown below the clone: e, embryo; 1, L1 larva; 2, L2 larva; 4, L4 larva; m, mixed stages enriched in larvae and adults; similarly for the tissue: o, ovary; sp, sperm enriched; cd, cadmium induced. The trans-spliced leader is indicated on top of the clone: 1, SL1; 2, SL2; …; 12, SL12. A suffix ' or m is added to low frequency variants of the standard SL1 to SL12 that we consider to be mutant SLs rather than new SL types, because the corresponding genes cannot be found in the genome sequence (e.g., SL1').

 

 

 


 

