LogOddsLogo
LogOddsLogo is a web-based application designed to make the generation of sequence logos from biological sequence alignments as easy and painless as possible. Building on the WebLogo 3 source code, LogOddsLogo uses per-observation multiple-alignment log-odds scores as measures of information content at each position of a sequence logo. It provides methods for logo generation that have a proper statistical basis and are optimal for recognising functionally relevant alignment columns.
A sequence logo is a graphical representation of an amino acid or nucleic acid multiple sequence alignment. Each logo consists of stacks of symbols, one stack for each position in the sequence. The overall height of the stack indicates the sequence conservation at that position, while the height of symbols within the stack indicates the relative frequency of each amino or nucleic acid at that position. The width of the stack is proportional to the fraction of valid symbols in that position. (Positions with many gaps have thin stacks.) In general, a sequence logo provides a richer and more precise description of, for example, a binding site than a consensus sequence would.
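The relationship between stack height, symbol height and stack width can be sketched in a few lines of Python. This is a minimal illustration of the classic, uncorrected Shannon information measure for a single alignment column, not the log-odds scores that LogOddsLogo itself computes; the function name `column_heights` and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_heights(column, alphabet="ACGT"):
    """Return (stack_height_bits, {symbol: height_bits}, width_fraction).

    Illustrative only: uses the uncorrected Shannon measure with an
    equiprobable background, which is an assumption of this sketch.
    """
    counts = Counter(ch for ch in column.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        return 0.0, {}, 0.0
    freqs = {s: counts[s] / total for s in alphabet if counts[s] > 0}
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    # Stack height: maximum possible entropy minus observed entropy.
    stack = math.log2(len(alphabet)) - entropy
    # Each symbol receives a share of the stack equal to its frequency.
    heights = {s: p * stack for s, p in freqs.items()}
    # Stack width: fraction of valid (non-gap) symbols in the column.
    width = total / len(column)
    return stack, heights, width


alignment = ["ACGT", "ACGA", "AC-T", "AGGC"]  # toy data
for position, col in enumerate(zip(*alignment), start=1):
    stack, heights, width = column_heights("".join(col))
    print(position, round(stack, 3), round(width, 2), heights)
```

A fully conserved column reaches the maximum height of 2 bits for a four-letter alphabet, and a column containing a gap keeps its height but is drawn narrower.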
Yu YK, Capra JA, Stojmirovic A, Landsman D, Altschul SF. Log-odds sequence logos. Submitted for publication, (2014).
Altschul SF, Wootton JC, Zaslavsky E, Yu YK. The construction and use of log-odds substitution scores for multiple sequence alignment. PLoS Comput Biol. 6:e1000852, (2010).
Altschul SF, Gertz EM, Agarwala R, Schäffer AA, Yu YK. PSI-BLAST pseudocounts and the minimum description length principle. Nucleic Acids Res. 37:815-824, (2009).
Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: A sequence logo generator. Genome Research 14:1188-1190, (2004).
Sunyaev SR, et al. PSIC: Profile extraction from sequence alignments with position-specific counts of independent observations. Protein Eng. 12:387-394, (1999).
Robinson AB, Robinson LR. Distribution of glutamine and asparagine residues and their near neighbors in peptides and proteins. Proc. Natl Acad. Sci. USA 88:8880-8884, (1991).
Schneider TD, Stephens RM. Sequence logos: A new way to display consensus sequences. Nucleic Acids Res. 18:6097-6100, (1990).
Unlike its predecessor WebLogo, LogOddsLogo offers two web interfaces for creating sequence logos, one for protein sequences and another for nucleic acid (DNA or RNA) sequences. Each interface offers only the options relevant to the type of sequences it supports. Initially, both interfaces show entries for only the most essential parameters. The advanced parameters can be accessed by clicking on the More parameters link.
The background composition is the composition of the genome from which the sequences have been drawn. The default option is to use an equiprobable background. However, you may also explicitly set the expected CG content for nucleic acid sequences, insist on equiprobable background distributions, or use any of the prescribed compositions in the dropdown box.
For proteins, the background composition, consisting of a set of emission probabilities of amino acids, is used to calculate the scores for the Bayesian Integral Log-odds (BILD) method, the normalized maximum likelihood (NML) method and the Schneider-corrected (SC) method. Currently, we use the Robinson-Robinson frequencies as the background amino acid frequencies for the latter two methods. However, in the command line version of the program, the user may specify a different set of background probabilities. The BILD score, on the other hand, always uses the implicit background probabilities of the selected Dirichlet mixture.
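To make the role of the background composition concrete, the sketch below turns a CG percentage into a nucleotide background and measures how far a column's frequencies depart from it. Two assumptions are made here and are not taken from the LogOddsLogo source: that a CG percentage is split evenly between C and G (and the remainder between A and T), and that relative entropy is used as a generic stand-in for a background-aware score. It is not the BILD, NML or SC computation.

```python
import math


def background_from_cg(cg_percent):
    """Hypothetical helper: split a CG percentage evenly over C/G and A/T."""
    cg = cg_percent / 100.0
    return {"A": (1 - cg) / 2, "T": (1 - cg) / 2, "C": cg / 2, "G": cg / 2}


def relative_entropy_bits(freqs, background):
    """Relative entropy (in bits) of observed frequencies vs. a background."""
    return sum(
        p * math.log2(p / background[symbol])
        for symbol, p in freqs.items()
        if p > 0
    )


equiprobable = background_from_cg(50)   # 0.25 for every base
cg_rich = background_from_cg(80)        # A and T are rare (0.1 each)

column = {"A": 1.0}                     # a fully conserved A column
print(relative_entropy_bits(column, equiprobable))
print(relative_entropy_bits(column, cg_rich))
```

A conserved A is more informative against a CG-rich background than against an equiprobable one, which is why the choice of composition changes the resulting logo.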
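Base pairing (nucleic acids)
Description | Symbols | Color
2 Watson-Crick hydrogen bonds | TAU | dark orange
3 Watson-Crick hydrogen bonds | GC | blue

Classic (nucleic acids)
Description | Symbols | Color
G | G | orange
TU | TU | red
C | C | blue
A | A | green

Hydrophobicity (proteins)
Description | Symbols | Color
Hydrophilic | RKDENQ | blue
Neutral | SGHTAP | green
Hydrophobic | YVMCLFIW | black

Chemistry (proteins)
Description | Symbols | Color
Polar | GSTYC | green
Neutral | QN | purple
Basic | KRH | blue
Acidic | DE | red
Hydrophobic | AVLIPWFM | black

Charge (proteins)
Description | Symbols | Color
Positive | KRH | blue
Negative | DE | red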
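A color scheme of this kind is simply a mapping from groups of symbols to colors. The sketch below encodes the Polar/Neutral/Basic/Acidic/Hydrophobic rows listed above as plain Python data and looks up the color of a residue. The names `CHEMISTRY` and `color_for`, and the fallback color `black`, are illustrative assumptions and not part of the LogOddsLogo API.

```python
# (description, symbols, color) rows copied from the table above.
CHEMISTRY = [
    ("Polar", "GSTYC", "green"),
    ("Neutral", "QN", "purple"),
    ("Basic", "KRH", "blue"),
    ("Acidic", "DE", "red"),
    ("Hydrophobic", "AVLIPWFM", "black"),
]


def color_for(symbol, scheme, default="black"):
    """Return the color of `symbol` in `scheme`.

    `default` is a hypothetical fallback for symbols the scheme does not list.
    """
    for _description, symbols, color in scheme:
        if symbol.upper() in symbols:
            return color
    return default


for residue in "KDQW":
    print(residue, color_for(residue, CHEMISTRY))
```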
The command line client, logoddslogo, provides many more options and greater control over the final logo appearance.
LogOddsLogo is written in Python. It is necessary to have Python 3.7 and the extension package numpy installed before LogOddsLogo will run. LogOddsLogo also requires a recent version of ghostscript to create PNG and PDF output, and pdf2svg to generate SVG output.
The LogOddsLogo source code can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/qmbp/logoddslogo/.
This code is distributed under various open source licenses. Please consult the LICENSE.txt file in the source distribution for details.
After unpacking the LogOddsLogo tarfile, it should be possible to immediately create logos using the command line client (provided that python, numpy and ghostscript have already been installed).
./logoddslogo --format PNG < cap.fa > cap.png
Please consult the file build_examples.sh for more examples.
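The same invocation can be scripted from Python by redirecting standard input and output, as in the shell example above. This is a minimal sketch: the helper names are hypothetical, and it assumes the `logoddslogo` executable sits in the current directory with Python, numpy and ghostscript already installed.

```python
import subprocess


def build_logo_command(executable="./logoddslogo", fmt="PNG"):
    """Build the argument list for the command line client."""
    return [executable, "--format", fmt]


def make_logo(fasta_path, out_path, executable="./logoddslogo", fmt="PNG"):
    """Feed an alignment file on stdin and write the logo from stdout."""
    with open(fasta_path, "rb") as fin, open(out_path, "wb") as fout:
        subprocess.run(
            build_logo_command(executable, fmt),
            stdin=fin,
            stdout=fout,
            check=True,  # raise CalledProcessError on a non-zero exit status
        )


# Example call (requires an unpacked LogOddsLogo distribution):
# make_logo("cap.fa", "cap.png")
```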
To run LogOddsLogo as a standalone web service, run the logo server command:
./logoddslogo --serve
It should now be possible to access LogOddsLogo at http://localhost:8080/.
The command line client and LogOddsLogo libraries can be permanently installed using the supplied setup.py script.
sudo python setup.py install
Run python setup.py help for more installation options. For example, to specifically install the logoddslogo script to /usr/local/bin:
sudo python setup.py install_scripts --install-dir /usr/local/bin
Note that the LogOddsLogo source code fully contains the code for WebLogo, version 3.3. All the changes made at the NCBI are placed in separate directories, and hence LogOddsLogo and WebLogo-3.3 can coexist on the same Python installation. The WebLogo executable is still present in the LogOddsLogo distribution, but it is not installed by the setup script.
To use LogOddsLogo as a web application, first install its dependencies and libraries as above, then place (or link) the logoddslogolib/htdocs directory somewhere within the document root of your webserver. The webserver must be able to execute the CGI scripts proteins.cgi and nucleicacid.cgi. For Apache, you may have to add an ExecCGI option and add a CGI handler in the httpd.conf configuration file.
Something like this:
<Directory "/home/httpd/htdocs/logoddslogo/">
    Options FollowSymLinks MultiViews ExecCGI
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>
...
# To use CGI scripts outside of ScriptAliased directories:
# (You will also need to add "ExecCGI" to the "Options" directive.)
#
AddHandler cgi-script .cgi
It may also be necessary to set the PATH and PYTHONPATH environment variables.
SetEnv PYTHONPATH /path/to/logoddslogo/libraries
The cgi script also has to be able to find the 'gs' ghostscript executable.
The maximum size, in bytes, of uploaded sequence data can be controlled with the WEBLOGO_MAX_FILE_SIZE environment variable.
SetEnv WEBLOGO_MAX_FILE_SIZE 1000000
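For readers wondering how such a limit is typically applied, here is a hedged sketch of a CGI-side check that reads WEBLOGO_MAX_FILE_SIZE and rejects oversized uploads. It is not taken from the LogOddsLogo CGI scripts: the function names are hypothetical, and the fallback value simply mirrors the SetEnv example above rather than any documented default.

```python
import os

# Assumed fallback for this sketch only; it mirrors the SetEnv example,
# and is not a documented LogOddsLogo default.
DEFAULT_MAX_BYTES = 1_000_000


def max_upload_bytes(environ=os.environ):
    """Read the upload limit from the environment, falling back if unset or invalid."""
    raw = environ.get("WEBLOGO_MAX_FILE_SIZE")
    if raw is None:
        return DEFAULT_MAX_BYTES
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_MAX_BYTES
    return value if value > 0 else DEFAULT_MAX_BYTES


def check_upload(data, environ=os.environ):
    """Return `data` unchanged, or raise ValueError if it exceeds the limit."""
    limit = max_upload_bytes(environ)
    if len(data) > limit:
        raise ValueError(f"upload of {len(data)} bytes exceeds limit of {limit}")
    return data


print(max_upload_bytes({"WEBLOGO_MAX_FILE_SIZE": "1000000"}))
```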
logoddslogo, The LogOddsLogo Command Line Interface (CLI)
Please consult the build_examples.sh script for inspiration.
Usage: logoddslogo [options] < sequence_data.fa > sequence_logo.eps

Create sequence logos from biological sequence alignments.

Options:
  --version
        show program's version number and exit
  -h --help
        show this help message and exit

Input/Output Options:
  -f --fin FILENAME
        Sequence input file (default: stdin)
  -D --datatype FORMAT
        Type of multiple sequence alignment or position weight matrix file:
        (clustal, fasta, plain, msf, genbank, nbrf, nexus, phylip, stockholm,
        intelligenetics, table, array, transfac)
  -o --fout FILENAME
        Output file (default: stdout)
  -F --format FORMAT
        Format of output: eps (default), png, png_print, pdf, jpeg, svg,
        logodata

Logo Data Options:
  -M --score-method SCORE_METHOD
        The method for scoring: 'BILD' or 'NML' or 'SC' or 'SU':
          BILD - BILD score
          NML  - Normalized Maximum Likelihood score
          SC   - Schneider score - Corrected
          SU   - Schneider score - Uncorrected
  -d --dmnumber DMNUMBER
        Dirichlet mixture parameter for BILD score. For nucleotide sequences,
        it should be a floating point value corresponding to Dirichlet
        concentration alpha (DEFAULT 1.0). For protein sequences, it should be
        an integer (0 to 8) indicating a particular stored Dirichlet mixture:
          0 - recode3-20 (20 components)
          1 - recode4-20 (20 components)
          2 - recode5-20 (20 components)
          3 - Fournier-20 (20 components)
          4 - dist-20 (20 components)
          5 - dist-ncbi-52 (52 components - DEFAULT)
          6 - dist-ncbi-72 (72 components)
          7 - dist-ncbi-110 (110 components)
          8 - dist-ncbi-134 (134 components)
        This parameter is ignored for any other scoring method.
  -A --sequence-type TYPE
        The type of sequence data: 'protein', 'rna' or 'dna'.
  -a --alphabet ALPHABET
        The set of symbols to count, e.g. 'AGTC'. All characters not in the
        alphabet are ignored. If neither the alphabet nor sequence-type are
        specified then logoddslogo will examine the input data and make an
        educated guess. See also --sequence-type, --ignore-lower-case
  -U --units NUMBER
        A unit of entropy ('bits' (default), 'nats', 'digits'), or a unit of
        free energy ('kT', 'kJ/mol', 'kcal/mol'), or 'probability' for
        probabilities
  --composition COMP.
        The expected composition of the sequences: 'equiprobable (default)',
        a CG percentage, a species name (e.g. 'E. coli','H. sapiens'), or an
        explicit distribution (e.g. "{'A':10, 'C':40, 'G':40, 'T':10}"). For
        proteins, NML and SC use the Robinson-Robinson frequencies as the
        background, although an explicit specification (same format as above)
        is also allowed. BILD always uses the implicit background frequencies
        of the selected Dirichlet mixture.
  --weight NUMBER
        The weight of prior data. Default depends on alphabet length
  --no-weighcounts
        For proteins only, do not use other columns to estimate the number of
        independent observations of an amino acid within the column
        considered. Default is True
  --ovline
        Draw an overline on the positive score region. Default is False
  -i --first-index INDEX
        Index of first position in sequence data (default: 1)
  -l --lower INDEX
        Lower bound of sequence to display
  -u --upper INDEX
        Upper bound of sequence to display

Transformations:
  Optional transformations of the sequence data.
  --ignore-lower-case
        Disregard lower case letters and only count upper case letters in
        sequences.
  --reverse
        reverse sequences
  --complement
        complement DNA sequences

Logo Format Options:
  These options control the format and display of the logo.
  -s --size LOGOSIZE
        Specify a standard logo size (small, medium (default), large)
  -n --stacks-per-line COUNT
        Maximum number of logo stacks per logo line. (default: 40)
  -t --title TEXT
        Logo title text.
  --label TEXT
        A figure label, e.g. '2a'
  -X --show-xaxis YES/NO
        Display sequence numbers along x-axis? (default: True)
  -x --xlabel TEXT
        X-axis label
  --annotate TEXT
        A comma separated list of custom stack annotations, e.g.
        '1,3,4,5,6,7'. Annotation list must be same length as sequences.
  -S --yaxis UNIT
        Height of yaxis in units. (Default: Maximum value with uninformative
        prior.)
  -Y --show-yaxis YES/NO
        Display entropy scale along y-axis? (default: True)
  -y --ylabel TEXT
        Y-axis label (default depends on plot type and units)
  -E --show-ends YES/NO
        Label the ends of the sequence? (default: False)
  -P --fineprint TEXT
        The fine print (default: logoddslogo version)
  --ticmarks NUMBER
        Distance between ticmarks (default: 1.0)
  --errorbars YES/NO
        Display error bars? (default: False)
  --reverse-stacks YES/NO
        Draw stacks with largest letters on top? (default: False)

Color Options:
  Colors can be specified using CSS2 syntax. e.g. 'red', '#FF0000', etc.
  -c --color-scheme SCHEME
        Specify a standard color scheme (auto, base pairing, charge,
        chemistry, classic, hydrophobicity, monochrome)
  -C --color COLOR SYMBOLS DESCRIPTION
        Specify symbol colors, e.g. --color black AG 'Purine' --color red TC
        'Pyrimidine'
  --default-color COLOR
        Symbol color if not otherwise specified.

Advanced Format Options:
  These options provide fine control over the display of the logo.
  -W --stack-width POINTS
        Width of a logo stack (default: 10.8)
  --aspect-ratio POINTS
        Ratio of stack height to width (default: 5)
  --box YES/NO
        Draw boxes around symbols? (default: no)
  --resolution DPI
        Bitmap resolution in dots per inch (DPI). (Default: 96 DPI, except
        png_print, 600 DPI) Low resolution bitmaps (DPI<300) are antialiased.
  --scale-width YES/NO
        Scale the visible stack width by the fraction of symbols in the
        column? (I.e. columns with many gaps of unknowns are narrow.)
        (Default: yes)
  --debug YES/NO
        Output additional diagnostic information. (Default: False)

LogOddsLogo Server:
  Run a standalone webserver on a local port.
  --serve
        Start a standalone LogOddsLogo server for creating sequence logos.
  --port PORT
        Listen to this local port. (Default: 8080)
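The interplay between --score-method, --sequence-type and --dmnumber described in the usage text can be captured in a small argument builder. The sketch below only assembles and validates command line arguments according to the help text; the function name `scoring_args` is hypothetical and nothing here calls LogOddsLogo itself.

```python
SCORE_METHODS = {"BILD", "NML", "SC", "SU"}
SEQUENCE_TYPES = {"protein", "rna", "dna"}


def scoring_args(score_method, sequence_type, dmnumber=None):
    """Build scoring-related arguments for the logoddslogo command line.

    Follows the usage text: --dmnumber only matters for BILD, is an integer
    from 0 to 8 for proteins, and a floating point alpha for nucleotides.
    """
    if score_method not in SCORE_METHODS:
        raise ValueError(f"unknown score method: {score_method!r}")
    if sequence_type not in SEQUENCE_TYPES:
        raise ValueError(f"unknown sequence type: {sequence_type!r}")

    args = ["--score-method", score_method, "--sequence-type", sequence_type]

    # The help text says --dmnumber is ignored for non-BILD methods,
    # so this sketch simply leaves it out in that case.
    if dmnumber is not None and score_method == "BILD":
        if sequence_type == "protein":
            if not isinstance(dmnumber, int) or not 0 <= dmnumber <= 8:
                raise ValueError("protein --dmnumber must be an integer from 0 to 8")
            args += ["--dmnumber", str(dmnumber)]
        else:
            args += ["--dmnumber", str(float(dmnumber))]
    return args


print(["logoddslogo"] + scoring_args("BILD", "protein", 5) + ["--format", "pdf"])
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell quoting problems with options such as an explicit --composition distribution.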
The LogOddsLogo distribution contains two Python packages: logoddslogolib, and weblogo, which is copied without change from WebLogo 3. The package logoddslogolib contains all NCBI changes from WebLogo 3. Please consult the API documentation for WebLogo and LogOddsLogo.
The API documentation was generated using pdoc3.
The WebLogo 3 server can be found here.
The legacy WebLogo 2 server can be found here.
LogOddsLogo was written by Yi-Kuo Yu, Stephen Altschul and Aleksandar Stojmirovic, with input from David Landsman and John Capra. It was based on the WebLogo-3.3 source code and the research of Yi-Kuo Yu and Stephen Altschul on information content of multiple sequence alignments.
WebLogo was created by Gavin E. Crooks, Liana Lareau, Gary Hon, John-Marc Chandonia and Steven E. Brenner. Many others have provided suggestions, bug fixes and moral support.
WebLogo was originally based upon the programs alpro and makelogo, both of which are part of Tom Schneider's delila package. Many thanks are due to him for making this software freely available and for encouraging its use.
Please direct questions and feedback about LogOddsLogo to Yi-Kuo Yu.