Help

Help on the ways to query AceView genes, transcripts, proteins

Help! How to query AceView?

You only need to read the sections of this help document that describe the features that you want to understand. Each section of this text is intended to be understandable independently.

- Selecting a database – choose which database to use for searches.

- Search – use this box to search for genes by almost anything: names, identifiers, or content.

- Query Pfam – find genes belonging to families, searching by functional or structural attributes in the Pfam annotations.

- BLAST – find an AceView mRNA or protein matching your own sequence.

- Finally, you may be interested to learn about the structured query we are starting to design, in which the user will select a large number of properties from pull-down menus and combine them with “and”, “or”, and “not”. This will allow a selection of genes directly on quantifiable data calculated by AceView as well as on the existence of defined properties. While developing this new query, we are happy to help people hunting for a gene by sending them, upon request, HTML tables of the genes in an area of the chromosome.

Selecting a database

Where it says "Query", select AceView. In the next Select box, choose a species database. Alternatively, click on the species icon.

Querying other databases at NCBI

Where it says "Query", change the selection from AceView to the name of one of the other databases at NCBI, for example, LocusLink. Enter your query text in the "Search" input box and click "Go".

“Search” through AceView by typing in the box

Enter a query in the Search input box and click "Go" or hit return.

For example, type NGFB, FGF or FGF5, AF364604, ada* (expecting the ADAMs and ADAR at least), or sugar metabolism. You can also search the human database for “breast cancer” or for “mitotic spindle”, or try this last query in worm; you will get a list of genes related directly or indirectly to your query.

The search is not case sensitive, so you can use uppercase or lowercase interchangeably. A “?” is a placeholder for a single character and a “*” for any number of characters. For example, “choleste?ol” will match only cholesterol, and “ubi*itin” will match ubiquitin.
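The wildcard rules can be illustrated with a minimal sketch. It assumes that a pattern must match the whole term and that “*” may also match zero characters; the function names are hypothetical and this is not AceView's actual implementation.

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a query using '?' and '*' into a case-insensitive regex."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")       # exactly one character
        elif ch == "*":
            parts.append(".*")      # any number of characters (assumed: zero allowed)
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)

def matches(pattern: str, term: str) -> bool:
    """True if the whole term matches the wildcard pattern, ignoring case."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None
```

With this sketch, `matches("choleste?ol", "cholesterol")` and `matches("ada*", "ADAR")` are both true, while `matches("choleste?ol", "cholestol")` is false because “?” must stand for exactly one character.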

The “Search” is general. Type a gene or protein name, a LocusLink name, an alias symbol, a LocusID, a clone name, or an mRNA or EST sequence accession (but not a GI, protein accession, Unigene number, or Ensembl identifier). Alternatively, try a meaningful expression, a gene ontology term, a phenotype, a function, or a property. The answer is always a gene, a list of genes, or a clone/sequence pointing to a transcript/gene. The query system will ultimately search the entire database and look at all annotations until it finds something that it interprets as what you meant. For this reason, the main principles guiding you when you type a query should be:

1. If you are interested in a specific gene, type the most specific identifier you can think of. It can be the official gene symbol (even if it is a single letter, as are many Drosophila genes), an alias, the name of the gene, the LocusLink accession number, an mRNA or EST accession number (no version is needed, but it does not hurt), or a cDNA clone name as it appears in GenBank. If you are looking for a gene class, you can use the “*” and the “?”. If you are looking for a Caenorhabditis elegans gene, you can also type the predicted CDS name.

Please do not use a protein accession (in particular, an NP_), or a GI; use the associated mRNA or EST accession instead. And be aware that we search only primary data: we do not search for Unigene clusters or Ensembl identifiers or the old AceView names from before build 28.

Caution: From build 29 to 34, we used our new gene naming system but did not attempt to track the variant suffixes .a, .b, …: a given mRNA name could correspond to different sequences in different builds. We are sorry about the confusion that this may have caused. Since January 2004, we have added a date to all variant names, so each variant now has a unique, traceable sequence.

2. If you are interested in all genes related to a word or a group of words, just type your words normally (separated by spaces), but keep in mind that we look for only exact matches to the set of characters you typed, and we add * by default around your words (i.e., we automatically allow extensions of what you typed in both directions). Please be especially careful with spelling mistakes, because we do not attempt to fix them. For best results, type only the roots of the meaningful words (for example, “nerv” rather than “nerve” or “nervous”) that undoubtedly should be found in your gene. Drop unnecessary suffixes or prefixes and plurals and pick the most frequently used synonyms. If you are unsure of the spelling, type only what you are sure of: prefer “lecit choleste” to “lecitin cholestelol” and “phosphor” (hardly ambiguous in fact) to “phosphorylation”. When in doubt, prefer a space or a question mark to an uncertain character. Prefer singular to plural (e.g., use claudin rather than claudins), avoid using dashes or dots in names unless necessary, and do not type words of only 1 or 2 characters, unless they are a gene name.
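To see why word roots work better than full words, here is a minimal sketch of the matching rule described above. Adding “*” on both sides of a word amounts to a substring test, so the sketch uses one. The function name is hypothetical, and the sketch ignores any “?” or “*” typed by the user.

```python
def word_query_matches(query: str, document: str) -> bool:
    """True if every typed word occurs somewhere in the document.

    Wrapping each word as *word* is equivalent to a substring test,
    so 'nerv' matches both 'nerve' and 'nervous'. Case and word order
    are ignored, and no spelling correction is attempted.
    """
    text = document.lower()
    words = query.lower().split()
    return bool(words) and all(word in text for word in words)
```

For the text “lecithin-cholesterol acyltransferase”, the query “lecit choleste” matches, whereas the misspelled “lecitin cholestelol” does not, because neither misspelled word occurs exactly in the text.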

How does it work:

It may help to understand the principle of how the query system works. The search tries to implement the notoriously impossible “what you get is what you wish” paradigm. By observing the queries we receive and the data in the various fields of the database that we want to make accessible, we ended up ordering and classifying the data to be searched and drew up a set of rules based on common sense. The system will ultimately look over the entire database, but it will do so in an orderly fashion to minimize answer time for the simplest and most common queries.

- To simplify, we first expect you are looking for a single gene and try to recognize exact gene symbols, LocusID, GenBank identifiers, and cDNA clones: if we do, we stop the search and you get the “Gene on Genome” view of your gene or the cDNA clone pointing to its gene.

- If this fails, we proceed by searching all other fields and documents in the database, allowing word completion and extension both ways. The words do not need to be in the same order or in the same case, but they need to be in a single document (for example, a single abstract of publication). In case of success, since all objects in the database point to genes, we collect and display the results as a list of genes. The list is ordered by map and position on the chromosomes; it provides the gene title, often with phenotypic or functional indications, and an idea of the expression level, through the total number of clones coming from that gene (according to AceView).

- An inconvenience is that the list may contain genes somewhat distantly related to the query. To compensate, we have added the possibility of recursive searches. On the gene list page, you can refine the query by applying a new query to just that set of genes: type in the query box on that page, and we will once more (and recursively thereafter) search for a match to your text in the name, title, annotation, and paper abstracts of the genes in the list. In the “breast cancer” example above, you may want to type “brca”.
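The three steps above (exact identifier first, then a word search within single documents, then refinement of the resulting list) can be sketched as follows. This is only an illustration under stated assumptions: the `Gene` record, the function names, and the toy entries (including the `ACC000x` identifiers and the "toy abstract" texts) are all hypothetical, and the real system additionally orders the list by map position.

```python
from dataclasses import dataclass, field

@dataclass
class Gene:
    symbol: str
    identifiers: set = field(default_factory=set)   # e.g. accessions, clone names
    documents: list = field(default_factory=list)   # titles, annotations, abstracts

def _texts(gene):
    # The symbol is searched along with every document attached to the gene.
    return [gene.symbol] + list(gene.documents)

def _all_words_in_one_text(query, gene):
    # All words must occur in a single text, in any order and any case.
    words = query.lower().split()
    return bool(words) and any(
        all(word in text.lower() for word in words) for text in _texts(gene)
    )

def search(genes, query):
    # Step 1: exact symbol or identifier match; stop as soon as one is found.
    key = query.strip().lower()
    exact = [g for g in genes
             if key == g.symbol.lower() or key in {i.lower() for i in g.identifiers}]
    if exact:
        return exact
    # Step 2: fall back to the word search over all texts.
    return [g for g in genes if _all_words_in_one_text(query, g)]

def refine(previous_hits, query):
    # Step 3: apply a new word search to the previous list only.
    return [g for g in previous_hits if _all_words_in_one_text(query, g)]

# Invented toy records, for illustration only.
TOY_GENES = [
    Gene("BRCA1", {"ACC0001"}, ["breast cancer 1, early onset"]),
    Gene("GENE2", {"ACC0002"}, ["toy abstract: expression altered in cancer of the breast"]),
    Gene("GENE3", {"ACC0003"}, ["toy abstract: sugar metabolism"]),
]
```

Here `search(TOY_GENES, "cancer breast")` returns the first two toy genes regardless of word order, and `refine(hits, "brca")` narrows that list to BRCA1.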

One possibly irritating feature of the query answer is that it may be difficult to trace the term back among the many documents attached to a gene. Often the terms are found in the abstract of one of the papers, and the user has to open and read them all to find out why the gene belongs to the list. This is not a bug, however: computers have many qualities but lack imagination and creativity, so we can guarantee that the words typed were found somewhere in this gene’s information.

An even more frustrating case is when you get either the “No locus found” page or “Too many hits”. Read these pages carefully. You may find a way out, reformulate your query, or decide to use Query Pfam or BLAST instead. If you are too annoyed, please write to us; we are interested in query problems.

Search for function and protein family annotations through “Query Pfam”

If you are looking for genes belonging to families by searching by name or by functional or structural attributes, click the "Pfam Search" link. This takes you to the Pfam Search page, which has its own search form.

The “Query Pfam” searches for terms in the titles or descriptions of Pfam protein family motifs. The system uses HMMER precomputed Pfam hits and their InterPro protein and gene ontology descriptors (learn more about this). The result is a list of protein family motifs with links to genes containing those motifs.

To query Pfam:

Enter a word or group of words in the search box, select a species, and click "Search".

For example, search "PF00988", "PH", "dehydrogenase", "sugar metabolism", or "mitotic spindle".

The database will search for Pfam motifs that contain the words you entered in any of the searched fields. It searches:

- Pfam protein family name

- Pfam Definition

- Pfam Description

- Pfam Comment

- Titles of papers relating to the motif

- Authors of papers relating to the motif

- Gene Ontology terms identified as relevant to the protein family signature

The results will contain only Pfam motifs that contain all of the words you entered, but it is not required that the words be in the same field. For example, if you enter "sugar metabolism", it would match a motif that mentions "sugar" in the Pfam Description and "metabolism" in the Gene Ontology terms for that motif.

You do not need to enter complete words. For example, the word "metabol" would match both "metabolic" and "metabolism".
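Unlike the main Search, the Pfam query lets the words come from different fields of the same motif. A minimal sketch of that rule, assuming each motif is represented as a dictionary mapping field names to text (a hypothetical layout, not the actual Pfam Search data model):

```python
def motif_matches(query: str, motif: dict) -> bool:
    """True if every query word occurs in at least one field of the motif.

    Different words may be found in different fields, and partial words
    are accepted because the test is a case-insensitive substring match.
    """
    field_texts = [str(value).lower() for value in motif.values()]
    words = query.lower().split()
    return bool(words) and all(
        any(word in text for text in field_texts) for word in words
    )
```

For a motif with “sugar” in its description and “metabolism” in its Gene Ontology terms, the queries “sugar metabolism” and “metabol” both match, while “sugar kinase” does not.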

Pfam Search results

Results are displayed as a table showing the Pfam motif name, the Pfam accession number, the Pfam definition, the number of products found that match that motif, and the number of genes producing proteins that match that motif. The number of genes is often smaller than the number of products because alternative splicing or alternative promoters generate alternative isoforms.

Pfam motifs that are not present in the organism selected are not displayed.

If it is not apparent why a particular motif matched your query, it is probably because there was a match in the Pfam Comment or the Gene Ontology terms.

In the table of results, the Accession column links to a comprehensive description of the Pfam motif at the Sanger Institute. The Gene column links to a list of genes containing the Pfam motif, which in turn connects to the detailed AceView gene descriptions.

Query results are only stored for an hour or so after you make your query. If you leave the results on your screen for a long time, the system will not be able to download or filter them when you come back.

Refining Pfam results

Once you have a table of results, you can make additional queries that only search in the results that you already have. There are several text input boxes on the page. Enter your criteria in as many of the boxes as you find suitable and press "Refine Results".

In the input box labeled "Query", you can make another query that works just like the original query.

You can change the number of lines to show per page, or the first line number on the page.

In the table, there is a check box next to each line number. Check that box to eliminate the line from your result set.

At the top of each column, there is a text box. In that box, you can type a pattern to match in that column. All rows that do not match will be removed from your result set. If you check "include", the rows that match will be retained in your results, and rows that do not match will be removed. If you check "exclude", the rows that match will be removed from your results.

For the Product and Gene columns, you can check one of >, =, or < to select only lines that have a specific number of products or genes.
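The column filters can be pictured with a short sketch. It assumes that result rows are dictionaries and that a column pattern is a case-insensitive substring; the real pattern syntax is not specified here, and the function and column names are hypothetical.

```python
import operator

# The three comparison choices offered for the Product and Gene columns.
_COMPARATORS = {">": operator.gt, "=": operator.eq, "<": operator.lt}

def filter_by_text(rows, column, pattern, mode="include"):
    """Keep matching rows ('include') or drop them ('exclude')."""
    if mode not in ("include", "exclude"):
        raise ValueError("mode must be 'include' or 'exclude'")
    needle = pattern.lower()
    keep_matching = (mode == "include")
    return [row for row in rows
            if (needle in str(row[column]).lower()) == keep_matching]

def filter_by_count(rows, column, comparator, value):
    """Keep rows whose numeric column is >, = or < the given value."""
    compare = _COMPARATORS[comparator]
    return [row for row in rows if compare(row[column], value)]
```

Because each filter returns a new list, filters can be chained, which mirrors entering criteria in several boxes before pressing "Refine Results".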

Downloading Pfam results

Below the table, there are some links that allow you to download your results in a format that is intended to be easily processed by other computers. (The download will always get all the results, even if your display only shows a subset.) More information is available from the Pfam download help page.

Future plans for the Pfam Search

If interest grows, we plan to extend this page to find results from multiple species with a single query.

We plan to make a protein display, linked from the Products column.

Search for AceView mRNAs or proteins resembling your sequence: Blast against AceView

Click the "Blast Search" link. This takes you to the Blast Search page, which has its own search form.

The BLAST search looks for a peptide (using tblastn) or nucleotide (using blastn) sequence in our databases. You can search a single species or any set of the species known to AceView (currently 3). If you use the default, all species will be searched, and you will be able to look at the annotation of the homologues in the other species: a function may have been discovered in one of them, or you may find interesting abstracts.

The system automatically determines whether the sequence you have entered is DNA or peptide. DNA sequences can contain the letters TCAGN. Peptide sequences can contain the letters ABCDEFGHIKLMNPQRSTUVWXYZ*. You can use uppercase or lowercase.
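One plausible way to make this decision is sketched below. The two alphabets are those listed above, but the decision rule itself is an assumption, since the server's actual heuristic is not documented here: a sequence made only of TCAGN is treated as DNA, and anything else within the peptide alphabet is treated as peptide. The function name is hypothetical.

```python
DNA_LETTERS = set("TCAGN")
PEPTIDE_LETTERS = set("ABCDEFGHIKLMNPQRSTUVWXYZ*")

def guess_sequence_type(sequence: str) -> str:
    """Return 'dna' or 'peptide' based on the letters used (assumed rule)."""
    letters = {ch for ch in sequence.upper() if not ch.isspace()}
    if not letters:
        raise ValueError("empty sequence")
    if letters <= DNA_LETTERS:
        return "dna"        # would be searched with blastn
    if letters <= PEPTIDE_LETTERS:
        return "peptide"    # would be searched with tblastn
    bad = "".join(sorted(letters - PEPTIDE_LETTERS))
    raise ValueError(f"unexpected characters: {bad}")
```

A consequence of this rule is that a short peptide composed only of the residues A, C, G, N, and T would be read as DNA; how the real system handles that ambiguity is not stated.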

Matches with E-values higher (i.e., less good) than specified in the query will not be displayed. As usual, longer query sequences produce better matches than short sequences.
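The E-value cutoff amounts to a simple filter, sketched here under the assumption that each hit is a dictionary with an `evalue` key (a hypothetical layout). Sorting the best hits first is an addition for readability, not something the page states.

```python
def filter_hits(hits, max_evalue):
    """Drop hits whose E-value is higher than the cutoff; best hits first."""
    kept = [hit for hit in hits if hit["evalue"] <= max_evalue]
    return sorted(kept, key=lambda hit: hit["evalue"])
```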

The BLAST search uses blastall (from the NCBI Toolkit) to perform a blastn or tblastn search on a database that is constructed from the sequences known to AceView. A citation of the BLAST paper is available here. Extensive information about how BLAST works and what you can do with it is available at https://www.ncbi.nlm.nih.gov/BLAST.

Future plans for a new “Query by feature” in AceView

One of our plans is to design a new Query by feature, which would allow the user to select genes by acting on multiple fields at a time, using a combination of toggles and simple windows where you would type numbers. You could select simultaneously on map position; level and pattern of expression; presence of alternative variants and the various alternative features we calculate in the database; presence of a given protein motif; predicted cellular compartment; taxonomy (are there close genes in bacteria, viruses, invertebrates...); presence of a known phenotype; presence of an antisense gene; length of the UTRs and of the CDS; molecular weight; pI; number of exons and of introns; and types of introns (e.g., at-ac). All of these queries are structured, in that they apply to a single field in the database at a time. It should then be easy to provide a list with exactly the genes people need to consider, for example those in a given region of a chromosome that have a membrane domain and are overexpressed in tumors, or those expressed in brain and antisense to other genes. A second query, either of the standard AceView type (looking for any words in the gene’s records) or an iterative query by feature, should allow researchers to sort sets of genes with very interesting properties indeed and, we hope, help our knowledge of the genes progress faster.

In the meantime, if you are hunting for a gene responsible for a phenotype in a region of a chromosome, and wonder if we can help…

The answer is YES! The new query by features will allow you to look at lists of genes ordered by position in a given area, but in the meantime, it is easy for us to send you an HTML table of the genes lying between two markers, or in a given region that you define by coordinates. The list will have interesting data, and each gene will be clickable and connect to our main AceView server. Just send us an email.