LogOddsLogo - User's Manual

Introduction
Creating Sequence Logos using the Web interface
Downloading and Installing LogOddsLogo
Command Line Interface (CLI)
Application Programmer Interface (API)
Miscellanea

Introduction

LogOddsLogo is a web-based application designed to make the generation of sequence logos from biological sequence alignments as easy and painless as possible. Building on the WebLogo 3 source code, LogOddsLogo uses per-observation multiple-alignment log-odds scores as measures of information content at each position of a sequence logo. It provides logo generation methods that have a proper statistical basis and are optimal for recognising functionally relevant alignment columns.

A sequence logo is a graphical representation of an amino acid or nucleic acid multiple sequence alignment. Each logo consists of stacks of symbols, one stack for each position in the sequence. The overall height of the stack indicates the sequence conservation at that position, while the height of symbols within the stack indicates the relative frequency of each amino or nucleic acid at that position. The width of the stack is proportional to the fraction of valid symbols in that position. (Positions with many gaps have thin stacks.) In general, a sequence logo provides a richer and more precise description of, for example, a binding site, than would a consensus sequence.
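The stack geometry described above can be illustrated with a minimal sketch. It uses the classic uncorrected Schneider and Stephens entropy difference as the conservation measure, not the log-odds scores that LogOddsLogo computes by default, and the function name stack_geometry is hypothetical rather than part of the LogOddsLogo libraries.

```python
import math
from collections import Counter


def stack_geometry(column, alphabet="ACGT"):
    """Return (stack height in bits, per-symbol heights, width fraction)
    for one alignment column, given as a string with one character per sequence.

    Height is the uncorrected entropy difference log2(|alphabet|) - H(column);
    this is an illustrative stand-in, not the LogOddsLogo scoring code.
    """
    # Only symbols from the alphabet count; gaps and unknowns are ignored.
    counts = Counter(ch for ch in column if ch in alphabet)
    n = sum(counts.values())
    if n == 0:
        return 0.0, {}, 0.0
    freqs = {sym: c / n for sym, c in counts.items()}
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    total = math.log2(len(alphabet)) - entropy
    # Each symbol takes a share of the stack equal to its relative frequency.
    heights = {sym: p * total for sym, p in freqs.items()}
    # Stack width is proportional to the fraction of valid symbols.
    return total, heights, n / len(column)


total, heights, width = stack_geometry("AAAC-AAA")
print(total, heights, width)
```

A fully conserved DNA column reaches the maximum of 2 bits, while a column with equal numbers of A, C, G and T has zero height.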

References

Yu YK, Capra JA, Stojmirovic A, Landsman D, Altschul SF. Log-odds sequence logos. Submitted for publication, (2014).

Altschul SF, Wootton JC, Zaslavsky E, Yu YK. The construction and use of log-odds substitution scores for multiple sequence alignment. PLoS Comput Biol. 6:e1000852, (2010).

Altschul SF, Gertz EM, Agarwala R, Schäffer AA, Yu YK. PSI-BLAST pseudocounts and the minimum description length principle. Nucleic Acids Res. 37:815-824, (2009).

Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: A sequence logo generator. Genome Research 14:1188-1190, (2004).

Sunyaev SR, et al. PSIC: Profile extraction from sequence alignments with position-specific counts of independent observations. Protein Eng. 12:387-394, (1999).

Robinson AB, Robinson LR. Distribution of glutamine and asparagine residues and their near neighbors in peptides and proteins. Proc. Natl Acad. Sci. USA 88:8880-8884, (1991).

Schneider TD, Stephens RM. Sequence logos: A new way to display consensus sequences. Nucleic Acids Res. 18:6097-6100, (1990).

Creating Sequence Logos using the Web interface

Unlike its predecessor WebLogo, LogOddsLogo offers two web interfaces for creating sequence logos, one for protein sequences and another for nucleic acid (DNA or RNA) sequences. Each interface offers only the options relevant to the type of sequences it supports. Initially, both interfaces show entries for only the most essential parameters. The advanced parameters can be accessed by clicking on the More parameters link.

Basic Parameters

Sequence Data

Enter your multiple sequence alignment or position weight matrix file, or select a file to upload. Supported file formats include CLUSTALW, FASTA, plain flatfile, MSF, NBRF, PIR, NEXUS and PHYLIP for multiple sequence alignments, and transfac for position weight matrices. All sequences must be the same length, else LogOddsLogo will return an error and report the first sequence that differed in length from previous sequences.
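Before uploading, the equal-length requirement can be checked locally. The following is a minimal sketch for FASTA input only; read_fasta and first_length_mismatch are hypothetical helper names, not LogOddsLogo functions, and the check compares every sequence against the first one.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def first_length_mismatch(records):
    """Return the name of the first sequence whose length differs from
    that of the first sequence, or None if all lengths agree."""
    if not records:
        return None
    expected = len(records[0][1])
    for name, seq in records:
        if len(seq) != expected:
            return name
    return None


example = ">s1\nACGT\n>s2\nAC-T\n>s3\nACG\n"
print(first_length_mismatch(read_fasta(example)))  # prints: s3
```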

Output format

PNG : (600 DPI) Print resolution bitmap
PNG : (low res, 96 DPI) Screen resolution bitmap
JPEG : Screen resolution bitmap
EPS : Encapsulated postscript
PDF : Portable Document Format
SVG : Scalable Vector Graphics

Generally speaking, vector formats (EPS, PDF and SVG) are better for printing, while bitmaps (JPEG and PNG) are more suitable for displaying on the screen or embedding into a web page.

Logo size

The physical dimensions of the generated logo. Specifically, this option controls the size of individual symbol stacks.

small : 5.4 points wide (same as 9pt Courier), aspect ratio 5:1
medium : Double the width and height of small.
large : Triple the width and height of small.

The choices have been limited to promote inter-logo consistency. Small logos can fit 80 stacks across a printed page, or 40 across a half page column. The command line interface provides greater control, if so desired.
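The arithmetic behind these sizes can be sketched as follows, assuming 72 points per inch and reading the 5:1 aspect ratio as stack height to stack width. The constant and function names are hypothetical.

```python
POINTS_PER_INCH = 72      # PostScript points per inch
SMALL_WIDTH = 5.4         # stack width of the "small" size, in points
ASPECT_RATIO = 5          # stack height divided by stack width
SCALE = {"small": 1, "medium": 2, "large": 3}


def stack_size(size):
    """Return (width, height) of a single stack in points."""
    width = SMALL_WIDTH * SCALE[size]
    return width, width * ASPECT_RATIO


def line_width_inches(size, stacks):
    """Return the width in inches of a line holding the given number of stacks."""
    return stack_size(size)[0] * stacks / POINTS_PER_INCH


for name in SCALE:
    print(name, stack_size(name))
print(line_width_inches("small", 80))
```

Under these assumptions, 80 small stacks span about 6 inches, which is consistent with the statement that they fit across a printed page. The medium width of 10.8 points matches the default of the --stack-width option in the command line client.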

Sequence type (nucleic acids only)

The type of biological molecule.

Advanced Parameters

Scoring method

BILD (Bayesian Integral Log-odds) score
Normalized Maximum Likelihood score
Schneider and Stephens entropy difference score - Corrected
Schneider and Stephens entropy difference score - Uncorrected

Dirichlet mixture (protein BILD score only)

A stored Dirichlet mixture to be used as a prior for BILD scores. The first five are taken from the UCSC Dirichlet mixtures pages, while the last four were developed in-house at the NCBI.

recode3-20 (20 components)
recode4-20 (20 components)
recode5-20 (20 components)
Fournier-20 (20 components)
dist-20 (20 components)
dist-ncbi-52 (52 components)
dist-ncbi-72 (72 components)
dist-ncbi-110 (110 components)
dist-ncbi-134 (134 components)

Dirichlet concentration parameter (nucleic acid BILD score only)

The nucleic acid logos take as a Dirichlet mixture prior a single Dirichlet distribution determined by the concentration parameter alpha.

Weight counts (proteins only)

Weight amino acid counts for each column by estimating aggregate numbers of independent observations, using the method of Sunyaev et al (1999) as modified by Altschul et al (2009).

Draw an overline on the positive score region

Determine the boundaries for the logo using a one-dimensional Smith-Waterman algorithm and draw an overline above the positive score region.
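As an illustration, a one-dimensional Smith-Waterman search can be read as finding the contiguous run of columns with the highest total score, restarting whenever the running total drops to zero or below. The sketch below follows that reading; best_positive_segment is a hypothetical name, and the actual LogOddsLogo implementation may differ in details such as tie-breaking.

```python
def best_positive_segment(scores):
    """Return (start, end, total) for the highest-scoring contiguous run
    of column scores, with end exclusive (0-based indices).

    Returns None when no run has a positive total.
    """
    best = None
    run_start, run_total = 0, 0.0
    for i, score in enumerate(scores):
        # A run that has fallen to zero or below cannot help what follows,
        # so start a fresh run at the current column.
        if run_total <= 0:
            run_start, run_total = i, 0.0
        run_total += score
        if run_total > 0 and (best is None or run_total > best[2]):
            best = (run_start, i + 1, run_total)
    return best


column_scores = [-1.0, 2.0, 3.0, -0.5, 1.5, -4.0, 0.5]
print(best_positive_segment(column_scores))  # prints: (1, 5, 6.0)
```

In this example the overline would cover the second through fifth columns.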

Stacks per line

If the length of the sequences is greater than this maximum number of stacks per line, then the logo will be split across multiple lines.
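The splitting can be pictured with a short sketch. The function name split_into_lines is hypothetical; the default of 40 is taken from the --stacks-per-line option of the command line client.

```python
def split_into_lines(n_positions, stacks_per_line=40):
    """Return a list of (start, end) column ranges, end exclusive,
    one range per logo line."""
    if stacks_per_line < 1:
        raise ValueError("stacks_per_line must be at least 1")
    return [
        (start, min(start + stacks_per_line, n_positions))
        for start in range(0, n_positions, stacks_per_line)
    ]


print(split_into_lines(95))  # prints: [(0, 40), (40, 80), (80, 95)]
```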

Ignore lower case

Disregard lower case letters and only count upper case letters in sequences?

Units

The units used for the y-axis.

probability: Show residue probabilities, rather than information content.
bits: Information content in bits
nats: Natural units, 1 bit = ln 2 (0.69) nats
kT : Thermal energy units in natural units (Numerically the same as nats)
kJ/mol : Thermal energy (assuming T = 300 K)
kcal/mol : Thermal energy (assuming T = 300 K)
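The relationships between these units can be sketched as follows. The conversion assumes the molar gas constant R = 8.314462618 J/(mol K), T = 300 K as stated above, and 4.184 J per calorie; whether LogOddsLogo uses exactly these constants is an assumption, and convert_bits is a hypothetical name.

```python
import math

GAS_CONSTANT = 8.314462618   # J / (mol K)
TEMPERATURE = 300.0          # K, as assumed in the list above
JOULES_PER_CALORIE = 4.184


def convert_bits(bits):
    """Express an information content given in bits in the other y-axis units."""
    nats = bits * math.log(2)                                # 1 bit = ln 2 nats
    kj_per_mol = nats * GAS_CONSTANT * TEMPERATURE / 1000.0  # RT per nat
    return {
        "bits": bits,
        "nats": nats,
        "kT": nats,                                          # numerically the same as nats
        "kJ/mol": kj_per_mol,
        "kcal/mol": kj_per_mol / JOULES_PER_CALORIE,
    }


for unit, value in convert_bits(2.0).items():
    print(f"{unit}: {value:.3f}")
```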

First position number

The numerical label of the first position in the sequence data in the input file. The label must be an integer. Residue labels for the logo will be relative to this number. (See also: Logo range.)

Logo range

By default, all sequence data from the input file is displayed in the Sequence Logo. With this option, you can instead show a subrange of the sequence data. The numbering of Start and End Positions is relative to the First Position Number. Thus, if the First Position Number is "2", Start is "5" and End is "10", then the 4th through 9th (inclusive) sequence positions of the input file will be displayed, and they will be numbered "5", "6", "7", "8", "9" and "10".
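The example above can be checked with a small sketch that converts the labelled range into a 0-based Python slice. The function name range_to_slice is hypothetical.

```python
def range_to_slice(first_index, start, end):
    """Convert an inclusive Start/End range, numbered relative to the
    First Position Number, into a 0-based (lo, hi) slice with hi exclusive."""
    if start < first_index or end < start:
        raise ValueError("range must satisfy first_index <= start <= end")
    return start - first_index, end - first_index + 1


lo, hi = range_to_slice(first_index=2, start=5, end=10)
columns = list("ABCDEFGHIJKL")      # stand-in for 12 alignment positions
print(columns[lo:hi])               # the 4th through 9th positions
print(list(range(5, 10 + 1)))       # their labels on the x-axis
```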

Composition

The background composition of the genome from which the sequences have been drawn. The default is an equiprobable background. Alternatively, you may explicitly set the expected CG content for nucleic acid sequences, or use any of the prescribed compositions in the dropdown box.
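For nucleic acids, a CG content translates into background probabilities in the obvious way: C and G share the CG fraction equally, and A and T share the remainder. The sketch below shows this together with the normalization of an explicit distribution, such as the one shown for the --composition option of the command line client. Both function names are hypothetical, and the equal split within each pair is an assumption.

```python
def background_from_cg(cg_percent):
    """Return DNA background probabilities for a given CG percentage."""
    if not 0 <= cg_percent <= 100:
        raise ValueError("CG content must be a percentage between 0 and 100")
    cg = cg_percent / 100.0
    at = 1.0 - cg
    return {"A": at / 2, "C": cg / 2, "G": cg / 2, "T": at / 2}


def normalize(weights):
    """Scale explicit symbol weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {sym: w / total for sym, w in weights.items()}


print(background_from_cg(80))
print(normalize({"A": 10, "C": 40, "G": 40, "T": 10}))
```

Under these assumptions, a CG content of 80% and the explicit weights 10, 40, 40, 10 describe the same background.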

For proteins, the background composition, consisting of a set of emission probabilities of amino acids, is used to calculate the scores for the Bayesian Integral Log-odds (BILD) method, the normalized maximum likelihood (NML) method and the Schneider-corrected (SC) method. Currently, we use the Robinson-Robinson frequencies as the background amino acid frequencies for the latter two methods mentioned above. However, in the command line version of the program, the user may specify a different set of background probabilities. BILD score, on the other hand, always uses the implicit background probabilities of the selected Dirichlet mixture.

Scale stack width

Scale the visible stack width by the fraction of symbols in the column? (i.e. columns with many gaps or unknown residues are narrow.)

Title

Give your logo a title.

Figure label

An optional figure label, added to the top left (e.g. '(a)').

X-axis

Add a label to the x-axis, or hide the axis altogether.

Y-axis

The vertical axis indicates the information content of a sequence position. Use this option to toggle the y-axis and override the default axis label.

Y-axis scale

The height of the y-axis in designated units. The automatic option will pick reasonable defaults based on the sequence type and axis unit.

Y-axis tic spacing

The distance between major tic marks on the y-axis.

Sequence end labels

Choose this option to label the 5' & 3' ends of nucleic acid or the N & C termini of amino acid sequences.

Version fineprint

Toggle display of the LogOddsLogo version information in the lower right corner.

Color Scheme

monochrome : All symbols black
Base Pairing (NA default) :
    2 Watson-Crick hydrogen bonds    TAU         dark orange
    3 Watson-Crick hydrogen bonds    GC          blue
Classic (NA) : WebLogo (version 1 and 2) and makelogo default color scheme for nucleic acids: G, orange; T & U, red; C, blue; and A, green.
    G                                            orange
    TU                                           red
    C                                            blue
    A                                            green
Hydrophobicity (AA default) :
    Hydrophilic                      RKDENQ      blue
    Neutral                          SGHTAP      green
    Hydrophobic                      YVMCLFIW    black
Chemistry (AA) : Color amino acids according to chemical properties. WebLogo (version 1 and 2) and makelogo default color. (Note that the WebLogo 2 documentation erroneously lists Q and N under green.)
    Polar                            GSTYC       green
    Neutral                          QN          purple
    Basic                            KRH         blue
    Acidic                           DE          red
    Hydrophobic                      AVLIPWFM    black
Charge (AA) :
    Positive                         KRH         blue
    Negative                         DE          red
Custom : A custom color scheme can be specified in the input field below. Specify colors on the left and associated symbols on the right. Colors are entered using CSS2 (Cascading Style Sheet) syntax. (E.g. 'red', '#F00', '#FF0000', 'rgb(255, 0, 0)', 'rgb(100%, 0%, 0%)' or 'hsl(0, 100%, 50%)' for the color red.)

More Options

The LogOddsLogo command line client, logoddslogo, provides many more options and greater control over the final logo appearance.

Installing LogOddsLogo

Dependencies

LogOddsLogo is written in Python. It is necessary to have Python 3.7 and the extension package numpy installed before LogOddsLogo will run. LogOddsLogo also requires a recent version of ghostscript to create PNG and PDF output, and pdf2svg to generate SVG output.

Download and Installation

The LogOddsLogo source code can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/qmbp/logoddslogo/. This code is distributed under various open source licenses. Please consult the LICENSE.txt file in the source distribution for details.

After unpacking the LogOddsLogo tarfile, it should be possible to immediately create logos using the command line client (provided that python, numpy and ghostscript have already been installed).

./logoddslogo --format PNG < cap.fa > cap.png

Please consult the file build_examples.sh for more examples.
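For scripted use, the same invocation can be driven from Python with the standard subprocess module. This is a minimal sketch: build_command and make_logo are hypothetical helpers, the executable path ./logoddslogo assumes the unpacked source directory is the working directory, and only options listed in the command line help (--fin, --fout, --format, --score-method) are used.

```python
import subprocess


def build_command(fin, fout, fmt="png", score_method=None,
                  executable="./logoddslogo"):
    """Assemble the logoddslogo argument list for one logo."""
    cmd = [executable, "--fin", fin, "--fout", fout, "--format", fmt]
    if score_method is not None:
        cmd += ["--score-method", score_method]
    return cmd


def make_logo(fin, fout, **options):
    """Run logoddslogo and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(fin, fout, **options), check=True)


# Show the command that make_logo("cap.fa", "cap.png") would execute.
print(" ".join(build_command("cap.fa", "cap.png")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.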

To run LogOddsLogo as a standalone web service, run the logo server command:

./logoddslogo --serve

It should now be possible to access LogOddsLogo at http://localhost:8080/.

The command line client and LogOddsLogo libraries can be permanently installed using the supplied setup.py script.

sudo python setup.py install

Run python setup.py help for more installation options. For example, to specifically install the logoddslogo script to /usr/local/bin:

sudo python setup.py install_scripts --install-dir /usr/local/bin

Note that the LogOddsLogo source code fully contains the code for WebLogo, version 3.3. All the changes made at the NCBI are placed in separate directories, and hence LogOddsLogo and WebLogo-3.3 can coexist on the same Python installation. The WebLogo executable is still present in the LogOddsLogo distribution, but it is not installed by the setup script.

Web App

To use LogOddsLogo as a web application, first install its dependencies and libraries as above, then place (or link) the logoddslogolib/htdocs directory somewhere within the document root of your webserver. The webserver must be able to execute the CGI scripts proteins.cgi and nucleicacid.cgi. For Apache, you may have to add an ExecCGI option and add a cgi handler in the httpd.conf configuration file. Something like this:

<Directory "/home/httpd/htdocs/logoddslogo/">
    Options FollowSymLinks MultiViews ExecCGI
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>
...
# To use CGI scripts outside of ScriptAliased directories:
# (You will also need to add "ExecCGI" to the "Options" directive.)
#
AddHandler cgi-script .cgi

It may also be necessary to set the PATH and PYTHONPATH environment variables.

SetEnv PYTHONPATH /path/to/logoddslogo/libraries

The cgi script also has to be able to find the 'gs' ghostscript executable. The maximum bytes of uploaded sequence data can be controlled with the WEBLOGO_MAX_FILE_SIZE environment variable.

SetEnv WEBLOGO_MAX_FILE_SIZE 1000000

`logoddslogo`, The LogOddsLogo Command Line Interface (CLI)

The command line client has many options not available through the web interface. Please consult the bundled build_examples.sh script for inspiration.

Usage: logoddslogo [options]  < sequence_data.fa > sequence_logo.eps

Create sequence logos from biological sequence alignments.

Options:
     --version                  show program's version number and exit
  -h --help                     show this help message and exit

  Input/Output Options:
    -f --fin FILENAME           Sequence input file (default: stdin)
    -D --datatype FORMAT        Type of multiple sequence alignment or
                                position weight matrix file: (clustal, fasta,
                                plain, msf, genbank, nbrf, nexus, phylip,
                                stockholm, intelligenetics, table, array,
                                transfac)
    -o --fout FILENAME          Output file (default: stdout)
    -F --format FORMAT          Format of output: eps (default), png,
                                png_print, pdf, jpeg, svg, logodata

  Logo Data Options:
    -M --score-method SCORE_METHOD
                                The method for scoring: 'BILD' or 'NML' or 'SC'
                                or 'SU':
                                BILD - BILD score
                                NML  - Normalized Maximum Likelihood score
                                SC   - Schneider score - Corrected
                                SU   - Schneider score - Uncorrected
    -d --dmnumber DMNUMBER      Dirichlet mixture parameter for BILD score.
                                For nucleotide sequences, it should be a
                                floating point value corresponding to
                                Dirichlet concentration alpha (DEFAULT 1.0).
                                For protein sequences, it should be an integer
                                (0 to 8) indicating a particular stored
                                Dirichlet mixture:
                                0 - recode3-20 (20 components)
                                1 - recode4-20 (20 components)
                                2 - recode5-20 (20 components)
                                3 - Fournier-20 (20 components)
                                4 - dist-20 (20 components)
                                5 - dist-ncbi-52 (52 components - DEFAULT)
                                6 - dist-ncbi-72 (72 components)
                                7 - dist-ncbi-110 (110 components)
                                8 - dist-ncbi-134 (134 components)
                                This parameter is ignored for any other
                                scoring method.
    -A --sequence-type TYPE     The type of sequence data: 'protein', 'rna' or
                                'dna'.
    -a --alphabet ALPHABET      The set of symbols to count, e.g. 'AGTC'. All
                                characters not in the alphabet are ignored. If
                                neither the alphabet nor sequence-type are
                                specified then logoddslogo will examine the
                                input data and make an educated guess. See
                                also --sequence-type, --ignore-lower-case
    -U --units NUMBER           A unit of entropy ('bits' (default), 'nats',
                                'digits'), or a unit of free energy ('kT',
                                'kJ/mol', 'kcal/mol'), or 'probability' for
                                probabilities
       --composition COMP.      The expected composition of the sequences:
                                'equiprobable (default)', a CG percentage,
                                a species name (e.g. 'E. coli','H. sapiens'),
                                or an explicit distribution
                                (e.g. "{'A':10, 'C':40, 'G':40, 'T':10}").
                                For proteins, NML and SC use the
                                Robinson-Robinson frequencies as the background,
                                although an explicit specification (same format as above)
                                is also allowed. BILD always uses the implicit background
                                frequencies of the selected Dirichlet mixture.
       --weight NUMBER          The weight of prior data.  Default depends on
                                alphabet length
       --no-weighcounts         For proteins only, do not use other columns to
                                estimate the number of independent
                                observations of an amino acid within the
                                column considered.  Default is True
       --ovline                 Draw an overline on the positive score region.
                                Default is False
    -i --first-index INDEX      Index of first position in sequence data
                                (default: 1)
    -l --lower INDEX            Lower bound of sequence to display
    -u --upper INDEX            Upper bound of sequence to display

  Transformations:
    Optional transformations of the sequence data.

       --ignore-lower-case      Disregard lower case letters and only count
                                upper case letters in sequences.
       --reverse                reverse sequences
       --complement             complement DNA sequences

  Logo Format Options:
    These options control the format and display of the logo.

    -s --size LOGOSIZE          Specify a standard logo size (small, medium
                                (default), large)
    -n --stacks-per-line COUNT  Maximum number of logo stacks per logo line.
                                (default: 40)
    -t --title TEXT             Logo title text.
       --label TEXT             A figure label, e.g. '2a'
    -X --show-xaxis YES/NO      Display sequence numbers along x-axis?
                                (default: True)
    -x --xlabel TEXT            X-axis label
       --annotate TEXT          A comma separated list of custom stack
                                annotations, e.g. '1,3,4,5,6,7'.  Annotation
                                list must be same length as sequences.
    -S --yaxis UNIT             Height of yaxis in units. (Default: Maximum
                                value with uninformative prior.)
    -Y --show-yaxis YES/NO      Display entropy scale along y-axis? (default:
                                True)
    -y --ylabel TEXT            Y-axis label (default depends on plot type and
                                units)
    -E --show-ends YES/NO       Label the ends of the sequence? (default:
                                False)
    -P --fineprint TEXT         The fine print (default: logoddslogo version)
       --ticmarks NUMBER        Distance between ticmarks (default: 1.0)
       --errorbars YES/NO       Display error bars? (default: False)
       --reverse-stacks YES/NO  Draw stacks with largest letters on top?
                                (default: False)

  Color Options:
    Colors can be specified using CSS2 syntax. e.g. 'red', '#FF0000', etc.

    -c --color-scheme SCHEME    Specify a standard color scheme (auto, base
                                pairing, charge, chemistry, classic,
                                hydrophobicity, monochrome)
    -C --color COLOR SYMBOLS DESCRIPTION
                                Specify symbol colors, e.g. --color black AG
                                'Purine' --color red TC 'Pyrimidine'
       --default-color COLOR    Symbol color if not otherwise specified.

  Advanced Format Options:
    These options provide fine control over the display of the logo.

    -W --stack-width POINTS     Width of a logo stack (default: 10.8)
       --aspect-ratio POINTS    Ratio of stack height to width (default: 5)
       --box YES/NO             Draw boxes around symbols? (default: no)
       --resolution DPI         Bitmap resolution in dots per inch (DPI).
                                (Default: 96 DPI, except png_print, 600 DPI)
                                Low resolution bitmaps (DPI<300) are
                                antialiased.
       --scale-width YES/NO     Scale the visible stack width by the fraction
                                of symbols in the column?  (I.e. columns with
                                many gaps of unknowns are narrow.)  (Default:
                                yes)
       --debug YES/NO           Output additional diagnostic information.
                                (Default: False)

  LogOddsLogo Server:
    Run a standalone webserver on a local port.

       --serve                  Start a standalone LogOddsLogo server for
                                creating sequence logos.
       --port PORT              Listen to this local port. (Default: 8080)

LogOddsLogo Application Programmer Interface (API)

The LogOddsLogo Python libraries provide even greater flexibility than the command line client. The code is split between two packages: logoddslogolib, which contains all NCBI changes to WebLogo 3, and weblogo, which is copied without change from WebLogo 3. Please consult the API documentation for WebLogo and LogOddsLogo. The API documentation was generated using pdoc3.

Miscellanea

Release Notes and Known Bugs

The LogOddsLogo release notes detail changes to LogOddsLogo and known issues with particular versions.

WebLogo

The WebLogo 3 server can be found here.

The legacy WebLogo 2 server can be found here.

Acknowledgments

LogOddsLogo was written by Yi-Kuo Yu, Stephen Altschul and Aleksandar Stojmirovic, with input from David Landsman and John Capra. It was based on the WebLogo-3.3 source code and the research of Yi-Kuo Yu and Stephen Altschul on information content of multiple sequence alignments.

WebLogo was created by Gavin E. Crooks, Liana Lareau, Gary Hon, John-Marc Chandonia and Steven E. Brenner. Many others have provided suggestions, bug fixes and moral support.

WebLogo was originally based upon the programs alpro and makelogo, both of which are part of Tom Schneider's delila package. Many thanks are due to him for making this software freely available and for encouraging its use.

Feedback

Please direct questions and feedback about LogOddsLogo to Yi-Kuo Yu.
