Help on the ways to query AceView
genes, transcripts, proteins
Help! How to query AceView?
You only need to read the
sections of this help document that describe the features that you want to
understand. Each section of this text is intended to be understandable
independently.
-
Selecting a database -
choose which database to use for searches.
-
Search – use this box to
search for genes by almost anything: names, identifiers, or content.
-
Query Pfam – find genes
belonging to families, searching by functional or structural attributes in the
Pfam annotations.
-
BLAST – find an AceView mRNA or
protein matching your own sequence
-
Finally, you may be interested to
learn about the structured query we are starting to
design, in which the user will select a large number of properties from
pull-down menus with “and”, “or”, and “not”, allowing a selection of genes
directly on quantifiable data calculated by AceView as well as on the existence
of defined properties. While developing this new query, we are happy to help
people hunting for a gene by sending them, upon request, HTML tables of the genes
in an area of the chromosome.
Selecting a database
Where it says
"Query", select AceView. In the next Select box, choose a species
database. Alternatively, click on the species icon.
Querying other
databases at NCBI
Where it says
"Query", change the selection from AceView to the name of one of the
other databases at NCBI, for example, LocusLink. Enter your query text in the
"Search" input box and click "Go".
“Search” through AceView by typing in the box
Enter
a query in the Search input box and click "Go" or press Return.
For example, type NGFB, FGF or FGF5, AF364604, ada* (expecting at least the ADAMs and ADAR),
or sugar metabolism. You can also search the human database
for “breast cancer” or for “mitotic spindle”, or try this last query in worm: you will get a list of genes related directly or
indirectly to your query.
The
search is not case-sensitive, so you can use uppercase or lowercase
interchangeably. A “?” is a placeholder
for a single character and a * for any number of characters. For example,
“choleste?ol” will match only cholesterol, and “ubi*itin” will match ubiquitin.
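The wildcard rules above can be illustrated with a minimal Python sketch. It is not the AceView implementation: the function names are hypothetical, and it simply translates “?” and * into a regular expression and matches the whole term without regard to case.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a query pattern into a compiled, case-insensitive regex.

    '?' stands for exactly one character, '*' for any number of characters;
    every other character is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def matches(pattern, term):
    """Return True if the whole term matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None

if __name__ == "__main__":
    for pattern, term in [("choleste?ol", "cholesterol"),
                          ("ubi*itin", "Ubiquitin"),
                          ("ada*", "ADAR")]:
        print(pattern, term, matches(pattern, term))
```

Note that this sketch matches a single term exactly as typed; it does not model the automatic extension of free-text words in both directions.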
The “Search” is general:
type a gene or protein name, a LocusLink name, an alias symbol or a LocusID,
a clone name, or an mRNA or EST sequence accession (but no GI, protein accession, Unigene number, or Ensembl
identifier). Alternatively, try a meaningful expression, a gene ontology term, a
phenotype, a function, or a property. The answer is
always a gene, a list of genes, or a clone/sequence pointing to a
transcript/gene. The query system will ultimately search over the
entire database and look at all annotations until it finds something that it
interprets as what you meant. For this reason, the main principles guiding you
when you type a query should be:
1. If you
are interested in a specific gene,
type the most specific identifier you can think of. It can be the official gene
symbol (even if it is a single letter, as are many Drosophila genes), or an
alias, or the name of the gene, the LocusLink accession number, an mRNA or EST
accession number (no version needed, but it does not hurt), or a cDNA clone
name as it appears in GenBank. If you look for a gene class, you can use the *
and the “?”. If you look for a Caenorhabditis
elegans gene, you can also type the predicted CDS name.
Please do not use a protein accession (in particular, an NP_),
or a GI; use the associated mRNA or
EST accession instead. And be aware that we search only primary data: we do not
search for Unigene clusters or Ensembl identifiers or the old AceView names
from before build 28.
Caution:
From build 29 to 34, we were using our new gene naming system,
yet we did not attempt to track the variant suffixes .a, .b, …: a given mRNA name
could correspond to different sequences in different builds. We are sorry about
the confusion that this may have caused. Since January 2004, we have added a date
to all variant names, so each variant now has a unique, traceable sequence.
2. If you
are interested in all genes related
to a word or a group of words, just type your words normally (separated by
spaces), but keep in mind that we look for only exact matches to the set of
characters you typed, and we add * by default around your words (i.e., we
automatically allow extensions of what you typed in both directions). Please be
especially careful with spelling mistakes,
because we do not attempt to fix them. For best results, type only the roots of the meaningful words (for
example, “nerv” rather than “nerve” or “nervous”) that undoubtedly should be
found in your gene. Drop unnecessary suffixes or prefixes and plurals and pick
the most frequently used synonyms. If you are unsure of the spelling, type only
what you are sure of: prefer “lecit choleste” to “lecitin cholestelol” and
“phosphor” (hardly ambiguous in fact) to “phosphorylation”. When in doubt,
prefer a space or a question mark to an uncertain character. Prefer singular to plural (e.g., use claudin
rather than claudins), avoid using dashes or dots in names unless necessary,
and do not type words of only 1 or 2 characters, unless they are a gene name.
How does it work:
It may help to understand
the principle of how the query system works. The search tries to implement the
notoriously impossible “what you get is what you wish” paradigm. By observing
the queries we receive and the data in the various fields of the database that
we want accessible, we ended up ordering and classifying the data to be
searched and drew up a set of rules based on common sense. The system will ultimately look over the entire
database, but it will do so in an orderly fashion to minimize answer time for
the simplest and most common queries.
-
To simplify, we first expect you are
looking for a single gene and try to recognize exact gene symbols, LocusID,
GenBank identifiers, and cDNA clones: if we do, we stop the search and you get
the “Gene on Genome” view of your gene or the cDNA clone pointing to its gene.
-
If this fails, we proceed by searching
all other fields and documents in the database, allowing word completion and
extension both ways. The words do not need to be in the same order or in the
same case, but they need to be in a single document (for example, a single
abstract of publication). In case of success, since all objects in the database
point to genes, we collect and display the results as a
list of genes. The list is ordered by map and position on the chromosomes;
it provides the gene title, often with phenotypic or functional indications,
and an idea of the expression level, through the total number of clones coming
from that gene (according to AceView).
-
An inconvenience is that the list may
contain genes somewhat distantly related to the query. To compensate, we have
added the possibility of recursive searches. In the list of genes page, you can
refine the query by applying a new query on just that set of genes: just type
in the query box in that page, and we will once more (and recursively so
thereafter) search a match of your text in the name, title, annotation, and
abstracts of papers of the genes in the list. In the example above of “breast cancer”,
you may want, for example, to type “brca”.
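The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the gene records and their texts are toy data invented for the example, the function names are hypothetical, and the real system searches many more fields and orders the result list by map position.

```python
# Toy records, invented for illustration only; not AceView data structures.
GENES = {
    "BRCA1": {"ids": {"BRCA1"},
              "docs": ["breast cancer 1, early onset",
                       "tumor suppressor involved in DNA repair"]},
    "ESR1": {"ids": {"ESR1"},
             "docs": ["estrogen receptor 1",
                      "expression predicts response in breast cancer patients"]},
    "NGFB": {"ids": {"NGFB"},
             "docs": ["nerve growth factor, beta polypeptide"]},
}

def one_doc_has_all(words, docs):
    """True if a single document contains every word, as a substring, in any order."""
    return any(all(w in doc.lower() for w in words) for doc in docs)

def search(query, genes=GENES):
    """Step 1: exact identifier match. Step 2: all words within one document."""
    q = query.strip().lower()
    for name, rec in genes.items():
        if q in {i.lower() for i in rec["ids"]}:
            return [name]          # a single gene was recognized: stop here
    words = q.split()
    if not words:
        return []
    return [n for n, rec in genes.items() if one_doc_has_all(words, rec["docs"])]

def refine(query, names, genes=GENES):
    """Step 3: apply a new query to the genes of a previous result list only."""
    words = query.lower().split()
    if not words:
        return list(names)
    return [n for n in names if one_doc_has_all(words, [n] + genes[n]["docs"])]

if __name__ == "__main__":
    hits = search("breast cancer")
    print(hits)
    print(refine("brca", hits))
```

In this sketch, “breast repair” finds nothing even though both words occur in the BRCA1 record, because they sit in two different documents.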
One possibly irritating feature of the
query answer is that it may be difficult to trace
the matching term among the many documents attached to a gene; often the
terms are found in the abstract of one of the papers, and you have to open
and read them all to find out why the gene belongs to the list. This is not
a bug: computers have many qualities, but they lack imagination and
creativity, so we can guarantee that the words you typed were found somewhere in this
gene’s information.
An even more frustrating case is when you
get either the “No locus found” page or the “Too many hits” page. Read these pages carefully: you
may find a way out, reformulate your query, or decide to use Query Pfam or
BLAST instead. If you are still stuck, please write to us; we
are interested in query problems.
Search for function and protein family annotations through “Query Pfam”
If
you are looking for genes belonging to families by searching by name or by
functional or structural attributes, click the "Pfam Search" link.
This takes you to the Pfam Search page, which has its own search form.
The “Query Pfam” searches
for terms in the titles or descriptions of motifs from the Pfam protein families. The
system uses Pfam hits precomputed with HMMER and their InterPro protein and gene ontology
descriptors (learn more about this). The
result is a list of protein family motifs with links to the genes containing those
motifs.
To query Pfam:
Enter a word or group of
words in the search box, select a species, and click "Search".
For example, search
"PF00988", "PH", "dehydrogenase", "sugar
metabolism", or "mitotic spindle".
The database will search
for Pfam motifs that contain the words you entered in any of the searched
fields. It searches
-
Pfam protein family name
-
Pfam Definition
-
Pfam Description
-
Pfam Comment
-
Titles of papers relating to the motif
-
Authors of papers relating to the
motif
-
Gene Ontology terms identified as
relevant to the protein family signature
The results will contain
only Pfam motifs that contain all of the words you entered, but it is not
required that the words be in the same field. For example, if you enter
"sugar metabolism", it would match a motif that mentions
"sugar" in the Pfam Description and "metabolism" in the
Gene Ontology terms for that motif.
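The matching rule just described (every word must be found, but not necessarily in the same field) can be sketched in Python as below. The field names and the sample motif are hypothetical stand-ins for the seven searched fields listed above, and plain substring matching is assumed, which also accepts incomplete words.

```python
# Hypothetical field names standing in for the searched Pfam fields.
SEARCHED_FIELDS = ("name", "definition", "description", "comment",
                   "paper_titles", "paper_authors", "go_terms")

def motif_matches(query, motif):
    """True if every query word occurs somewhere in the searched fields.

    The words may be spread over different fields, and each word is
    matched as a case-insensitive substring.
    """
    text = " ".join(motif.get(field, "") for field in SEARCHED_FIELDS).lower()
    return all(word in text for word in query.lower().split())

if __name__ == "__main__":
    # Invented motif record, for illustration only.
    motif = {"name": "example_motif",
             "description": "Binds sugar moieties",
             "go_terms": "carbohydrate metabolism"}
    print(motif_matches("sugar metabolism", motif))
    print(motif_matches("sugar kinase", motif))
```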
You do not need to enter complete
words. For example, the word "metabol" would match both
"metabolic" and "metabolism".
Pfam Search results
Results are displayed as a
table showing the Pfam motif name, the Pfam accession number, the Pfam
definition, the number of products found that match that motif, and the number
of genes producing proteins that match that motif. The number of genes is often
smaller than the number of products because alternative splicing or alternative
promoters generate alternative isoforms.
Pfam motifs that are not
present in the organism selected are not displayed.
If it is not apparent why
a particular motif matched your query, it is probably because there was a match
in the Pfam Comment or the Gene Ontology terms.
In the table of results,
the Accession column links to a comprehensive description of the Pfam motif at
the Sanger Institute. The Gene column links to a list of genes containing the
Pfam motif, which in turn connects to the detailed AceView gene descriptions.
Query results are only
stored for an hour or so after you make your query. If you leave the results on
your screen for a long time, the system will not be able to download or filter
them when you come back.
Refining Pfam results
Once you have a table of
results, you can make additional queries that only search in the results that
you already have. There are several text input boxes on the page. Enter your
criteria in as many of the boxes as you find suitable and press "Refine
Results".
In the input box labeled
"Query", you can make another query that works just like the original
query.
You can change the number
of lines to show per page, or the first line number on the page.
In the table, there is a
check box next to each line number. Check that box
to eliminate the line from your result set.
At the top of each column,
there is a text box. In that box, you can type a
pattern to match in that column. All rows that do not match will be
removed from your result set. If you check "include",
the rows that match will be retained in your results, and rows that do not
match will be removed. If you check "exclude",
the rows that match will be removed from your results.
For the Product and Gene
columns, you can check one of >, =, or < to
select only lines that have a specific number of products or genes.
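As an illustration of these refinements, here is a minimal Python sketch of the column filters. The rows are invented, the function names are hypothetical, and the pattern is treated as a plain case-insensitive substring, which is an assumption: the pattern syntax actually accepted by the page is not described here.

```python
import operator

# The three comparison choices offered for the Product and Gene columns.
COMPARATORS = {">": operator.gt, "=": operator.eq, "<": operator.lt}

def filter_column(rows, column, pattern, mode="include"):
    """Keep the matching rows ("include") or drop them ("exclude")."""
    if mode not in ("include", "exclude"):
        raise ValueError("mode must be 'include' or 'exclude'")
    keep_if_match = (mode == "include")
    needle = pattern.lower()
    return [row for row in rows
            if (needle in str(row[column]).lower()) == keep_if_match]

def filter_count(rows, column, op, value):
    """Keep rows whose numeric column is >, = or < the given value."""
    compare = COMPARATORS[op]
    return [row for row in rows if compare(row[column], value)]

if __name__ == "__main__":
    # Invented result rows, for illustration only.
    rows = [
        {"motif": "motif_A", "definition": "short chain dehydrogenase", "products": 12, "genes": 7},
        {"motif": "motif_B", "definition": "alcohol dehydrogenase", "products": 3, "genes": 3},
        {"motif": "motif_C", "definition": "sugar transporter", "products": 5, "genes": 2},
    ]
    kept = filter_column(rows, "definition", "dehydrogenase")
    print([r["motif"] for r in filter_count(kept, "genes", ">", 3)])
```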
Downloading Pfam results
Below the table, there are
some links that allow you to download your results in a format that is intended
to be easily processed by other computers. (The download will always get all the
results, even if your display only shows a subset.) More information is
available from the Pfam
download help page.
Future plans for the Pfam
Search
If interest grows, we plan
to extend this page to find results from multiple species with a single query.
We plan to make a protein
display, linked from the Products column.
Search for AceView mRNAs or proteins resembling your sequence: BLAST against AceView
Click the "Blast
Search" link. This takes you to the Blast Search page, which has its own
search form.
The BLAST search uses
BLAST to search for a peptide (using tblastn) or nucleotide (using blastn)
sequence in our databases. You can search a single species or any set of the species
known to AceView (currently 3); if you use the default, all species will be
searched, and you will be able to look at the annotation of the homologues in
the other species: maybe a function was discovered in that other species, or
you will find interesting abstracts...
The system automatically
determines whether the sequence you have entered is DNA or peptide. DNA
sequences can contain the letters TCAGN. Peptide sequences can contain the
letters ABCDEFGHIKLMNPQRSTUVWXYZ*. You can use uppercase or lowercase.
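One plausible way to make this decision is sketched below in Python. The exact rule used by the server is not documented here, so the logic is an assumption: a sequence made only of the DNA letters is treated as DNA, and anything else made of the peptide letters is treated as peptide. The function name is hypothetical.

```python
DNA_LETTERS = set("TCAGN")
PEPTIDE_LETTERS = set("ABCDEFGHIKLMNPQRSTUVWXYZ*")

def guess_sequence_type(sequence):
    """Return "dna" or "peptide" based on the letters used (assumed rule)."""
    # Ignore whitespace and case; both alphabets accept upper or lowercase.
    letters = {ch for ch in sequence.upper() if not ch.isspace()}
    if not letters:
        raise ValueError("empty sequence")
    if letters <= DNA_LETTERS:
        return "dna"        # would be searched with blastn
    if letters <= PEPTIDE_LETTERS:
        return "peptide"    # would be searched with tblastn
    raise ValueError("unexpected letters: " + "".join(sorted(letters - PEPTIDE_LETTERS)))

if __name__ == "__main__":
    print(guess_sequence_type("acgtnnacgt"))
    print(guess_sequence_type("MKTAYIAKQR"))
```

Under this rule, a short peptide composed only of A, C, G, T and N would be read as DNA, since those letters belong to both alphabets.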
Matches with E-values
higher (i.e., less significant) than the threshold specified in the query will not be displayed. As
usual, longer query sequences produce better matches than short ones.
The
BLAST search uses blastall (from the NCBI Toolkit) to perform a blastn or
tblastn search on a database that is constructed from the sequences known to
AceView. A citation of the BLAST paper is available here.
Extensive information about how BLAST works and what you can do with it is
available at https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/BLAST.
Future plans for a new “Query by feature” in AceView
One of our plans is to design a new Query by feature, which would
allow the user to select genes by acting on multiple fields at a time,
using a combination of toggles and simple boxes where you would type numbers.
You could select simultaneously on map position, level and pattern of
expression, presence of alternative variants and the various alternative
features we calculate in the database, presence of a given protein motif,
predicted cellular compartment, taxonomy (are there close genes in bacteria,
viruses, invertebrates...), presence of a known phenotype, presence of an antisense
gene, and also length of the UTRs and of the CDS, molecular weight, pI, number of exons
and of introns, and types of introns (e.g., at-ac). All of these queries are
structured, in that each applies to a single field in the database at a time. It
should then be easy to provide a list of exactly the genes people need to
consider, for example those in a given region of a chromosome that have a
membrane domain and are overexpressed in tumors, or those expressed in brain
and antisense to other genes. A second query, either of the standard AceView
type (looking for any words in the gene’s records) or an iterative query by
feature, should allow researchers to sort out sets of genes with very interesting
properties indeed and, we hope, help our knowledge of the genes progress
faster.
In
the meantime, if you are hunting for a gene responsible for a phenotype in a region
of a chromosome, and wonder if we can help…
The answer is YES! The new query by features will allow you to look at lists of genes ordered by position in a given area, but in the meantime, it is easy for us to send you an HTML table of the genes lying between two markers, or in a given region that you define by coordinates. The list will have interesting data, and each gene will be clickable and connect to our main AceView server. Just send us an e-mail.