Distributed Cell Biology Simulations with the E-Cell System

Sugimoto M.


Analytical techniques in computational cell biology such as kinetic parameter estimation, Metabolic Control Analysis (MCA) and bifurcation analysis require large numbers of repetitive simulation runs with different input parameters. The requirements for significant computational resources imposed by those analytical methods have led to an increasing interest in the use of parallel and distributed computing technologies.

We developed a Python-scripting environment that can execute the above mathematical analyses. Also, where possible, it automatically and transparently parallelizes them on either (1) stand-alone PCs, (2) shared-memory multiprocessor (SMP) servers, (3) cluster systems, or (4) a computational grid infrastructure. We named this environment E-Cell Session Manager (ESM). It involves user-friendly flat application program interfaces (APIs) for scripting and a pure object-oriented programming environment for sophisticated implementation of a user's analysis.

In this chapter, we introduce the fundamental concepts behind the design and architecture of ESM. We also describe parameter estimation on ESM, with some script examples.

Introduction

Computer simulations are often used to understand complex biological mechanisms, reproducing dynamic behavior in cells, organs and individuals. Simulation models are important for simultaneously understanding the complex processing of biological phenomena and for revealing their mechanisms in vivo. To establish an in silico model that captures biological behavior, qualitative structural information concerning cellular elements, including gene networks, metabolic pathways and signal transduction cascades, must be provided precisely and in sufficient detail, along with the reaction rate parameters that characterize the dynamics of the model. Quantitative parameters taken from the literature or public databases reduce the credibility of models constructed in this way because they are often noisy and are measured under different conditions. Recently, a number of high-throughput measurement devices for time-course quantitative studies have been developed; these aim to accumulate sufficient and accurate data for use in cell simulations [1]. Thus, sophisticated parameter estimation methods are required to determine parameters that are unavailable from observable data and to build quantitative models.

Estimation of parameters for large-scale models requires high-performance computing facilities because many simulation runs must be repeated with different parameters to produce models that reproduce specific time-courses. Generic parameter estimation approaches based on global optimization, such as genetic algorithms, iterate independent simulations, which can be executed in coarse-grained parallel environments, e.g., cluster machines and grid infrastructures. A number of cell simulators implementing parameter estimation functions with parallel computing have been developed. Systems Biology Workbench (SBW) is an extensible and general framework that includes a biological simulation engine and parameter optimization modules [2]. Grid Cellware is an integrated simulation environment implementing the adaptive Swarm algorithm for parameter estimation [3]. OBIYagns is a parameter estimation system based on an epigone genetic algorithm called distance independent diversity control (DIDC) and has a Web-based graphical interface [4]. These systems exploit clusters or grid infrastructures to distribute simulation runs and so reduce the total calculation time.

After the structure and parameters of a simulation model have been constructed, they need to be evaluated by comparison with known biological data. At this stage, the validity of the model is investigated; this includes its ability to reproduce inter- and intracellular behaviors and its quantitative properties, such as the sensitivity or stability of parameters, using analyses such as Metabolic Control Analysis (MCA) and bifurcation analysis. These analyses can be parallelized at a coarse-grained level because they also repeat independent simulations with different parameters. Typical in silico experiments can be parallelized in the same way, for example over- or under-activating one or more intercellular substrates to simulate gene knockout or overexpression, or cultivating cells under different intracellular conditions such as pH or temperature to maximize or minimize the concentrations of cellular products. Since many simulation applications in computational cell biology require repetitive runs of simulation sessions with different models and boundary parameters, distributed computation schemes are highly suitable for such applications.

Here, we discuss a scheme for job-level parallelism, or distributed computing. Middleware for assigning jobs to distributed environments is already available, e.g., Portable Batch System (PBS, http://www.pbs.org/), Load Sharing Facility (LSF, http://www.platform.com/) and Sun Grid Engine (SGE, http://wwws.sun.com/software/gridware/) at the cluster level and the Globus toolkit [5] at the grid level. While these low-level infrastructures are extremely powerful, they are not compatible with each other, nor are they readily accessible to an average computational biologist. On the other hand, higher-level parallelization systems with a Web-based user interface, such as OBIYagns, may help computer neophytes. Although some of these systems, such as myGrid [6] and ProGenGrid [7], provide editable workflow functions, they lack the programming flexibility needed to implement a user's own analysis algorithm for various research purposes.

In this chapter, we describe the architecture and the design of a distributed computing module of E-Cell3, called E-Cell Session Manager (ESM) [8]. ESM was developed to produce higher-level APIs to provide users with a scripting environment and to transparently distribute multiple E-Cell sessions on stand-alone PC, SMP, cluster and grid environments. We also introduce parameter estimation scripts built on ESM as an example.

Design of ESM

Architecture of ESM

Figure 1 shows the architecture of ESM. It is composed of three layers: 1) a class library for cell simulation (libecs) and its C++ API (libemc), 2) a Python language wrapper of libemc, PyEcs and pyecell which is the interface connecting the bottom and top layers and 3) a library of various front-end and utility modules written in Python. The pyecell library defines an object class called Session representing a single run of the simulator. ESM provides APIs for Python scripting and instantiates many Session objects.

Figure 1. ESM architecture. The bottom layer includes a class library for cell simulation (libecs) and its C++ API (libemc). The top layer represents Python front-end utilities such as ESM, GUI and analytical tools. The middle layer (libemc, PyEcs and pyecell) ...

The class diagram of ESM is depicted in Figure 2. The Session Manager class provides the user with a flat API to create and run simulation sessions. The Session Manager class holds a System Proxy object as its attribute. System Proxy hides the differences between distributed environments and communicates with the computer operating system or with the low-level middleware of the computing environment on which ESM is running (Fig. 3). Session Proxy executes a task in PC and SMP environments or processes a job in cluster and grid environments and holds the status of the process or job (waiting, running, recoverable error, unrecoverable error or finished). Unrecoverable errors are unrepeatable errors, including job submission failures such as an end-of-file (EOF) error caused by a momentary breakdown of the network.

Figure 2. Class diagram of ESM. The Session Manager class provides flat APIs for user scripting. System Proxy is a proxy of a computing environment. Each object of Session Proxy corresponds to a process or a job. Reproduced from Ref. [8] with kind permission of ...

Figure 3. Distributed and stand-alone environment infrastructures communicating with ESM. PC, WS and cluster represent a stand-alone PC, a workstation and a cluster machine, respectively. The user's script received by ESM and related files, such as ESS or EML, ...

Each distributed environment is accommodated by its own subclasses of Session Proxy and System Proxy. For example, when a user uses a cluster machine on which the Sun Grid Engine (SGE) parallel batch middleware is installed, the Session Manager class generates instances of SGE Session Proxy and SGE System Proxy, which are subclasses of Session Proxy and System Proxy, respectively. On an SMP or a PC computer, the subclasses spawn processes on the local computer and use system calls to manage tasks. In other environments, the subclasses contact the lower-level middleware that manages the computing environment to control jobs and obtain job status.
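The subclassing scheme can be pictured with a minimal sketch. The class names, the status constants, the environment keys ("Local", "SGE") and the factory function below are illustrative assumptions and not the actual pyecell classes; only the local variant is filled in, using Python's standard subprocess module.

```python
import subprocess
import sys
from abc import ABC, abstractmethod

# Status names follow the text; the constants themselves are assumptions.
WAITING, RUNNING, FINISHED, FAILED = (
    "waiting", "running", "finished", "unrecoverable error")


class SessionProxy(ABC):
    """One process or job, together with its current status."""

    def __init__(self, command):
        self.command = command
        self.status = WAITING

    @abstractmethod
    def submit(self):
        """Start the process or hand the job to the middleware."""

    @abstractmethod
    def update(self):
        """Refresh and return the current status."""


class LocalSessionProxy(SessionProxy):
    """PC/SMP case: spawn a local process and query it directly."""

    def submit(self):
        self._process = subprocess.Popen(self.command, stdout=subprocess.DEVNULL)
        self.status = RUNNING

    def update(self):
        exit_code = self._process.poll()  # None while the process is still running
        if exit_code is not None:
            # Simplification: every nonzero exit code is treated as unrecoverable.
            self.status = FINISHED if exit_code == 0 else FAILED
        return self.status

    def wait(self):
        self._process.wait()
        return self.update()


class SGESessionProxy(SessionProxy):
    """Cluster case: would delegate to the batch middleware (left unimplemented)."""

    def submit(self):
        raise NotImplementedError("would submit the job through the SGE middleware")

    def update(self):
        raise NotImplementedError("would ask the SGE middleware for the job status")


# Hypothetical mapping from an environment name to the proxy subclass.
PROXY_CLASSES = {"Local": LocalSessionProxy, "SGE": SGESessionProxy}


def make_session_proxy(environment, command):
    """Pick the proxy subclass that matches the computing environment."""
    return PROXY_CLASSES[environment](command)


def demo():
    """Run a trivial local command through the proxy and return its final status."""
    proxy = make_session_proxy("Local", [sys.executable, "-c", "pass"])
    proxy.submit()
    return proxy.wait()
```

The calling code only sees submit and update, so supporting a new environment in this sketch means adding one subclass and one entry in the mapping.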

Scripting ESM

This section introduces how to use ESM and an ESM script to run multiple E-Cell tasks with different parameters. To run ESM, three types of files are required: 1) a model file (EML or E-Cell model description language file), 2) a session script file (ESS or E-Cell session script file which corresponds with a run of E-Cell simulation) and 3) an ESM script file (E-Cell session manager script file). Examples of command lines to spawn ESM are shown in Figure 4. Examples of an ESM script file and an ESS file used in the script are shown in Figures 5 and 6, respectively. Details of an example ESM script are below.

Figure 4. Command line examples for running ESM. The ‘--environment=’ and ‘--concurrency=’ command-line arguments specify the computing environment and the concurrency of the distributed jobs, respectively. The last ...

Figure 5. A sample ESM script. This script runs the session script ‘runsession.py’ 100 times by changing the parameter ‘VALUE_OF_S’ from 0 to 100. In the register step (2), an ESS file ‘runsession.py’ is registered with the ...

Figure 6. A sample E-Cell session script (ESS). This script runs a simulation model for 200 seconds and outputs the value of the variable ‘Variable:/:S’ in the model after the simulation. The initial value of the variable is changed to the value ...

Setting system parameters

The computing environment and the concurrency need to be set when running ESM. The environment parameter specifies what type of facility is to be used. The concurrency parameter specifies the maximum number of CPUs that ESM uses simultaneously. When no concurrency parameter is specified, a default value is used, e.g., one CPU on a stand-alone PC or all available queues on a cluster machine. These parameters are given as command-line arguments to the ‘ecell3-session-manager’ command, which runs the ESM script indicated by the user. Alternatively, the user can set them in an ESM script with the setEnvironment(environment) and setConcurrency(concurrency) methods. Other execution conditions should also be specified here, at the top of the ESM script. For example, ESM generates intermediate files during its calculation under a working directory specified by setTmpRootDir(directory). These files are removed when the procedure reaches the end of the ESM script. Calling the setTmpDirRemoval(deleteflag) method with a false argument prevents deletion of these files and is useful for debugging ESM or ESS scripts.

Registering jobs

The registerEcellSession method in an ESM script registers an E-Cell job. It accepts three arguments: 1) the E-Cell session script (ESS) to be executed, 2) the optional parameters given to the job and 3) the input files (at least a model file) that must be available to the ESS upon execution. In the example in Figure 5, 100 copies of the session script ‘runsession.py’ have been registered with the model file ‘model.eml’. An optional parameter to the script, ‘VALUE_OF_S’, is also given to each session in the range 0, 1, …, 100. When a job is registered, a Session Proxy is instantiated and the registerEcellSession method returns a unique ID.

Running the application

When the run method is called, registered jobs start to execute or are submitted to the lower-level middleware. During this step, System Proxy transfers the ESS file and all other related files to the execution environment. System Proxy communicates with the computer operating system or lower-level middleware at regular intervals to track the process and job status and to update the status of each Session Proxy. This polling is repeated until all jobs and processes have either ‘finished’ normally or stopped in an ‘unrecoverable error’ state. Job execution in this method is parallelized, if possible.
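The polling behaviour described for the run method can be sketched as a simple loop. This is an illustration under stated assumptions: the ScriptedJob class, its poll method, the retry limit and the idea of resubmitting a job after a recoverable error are hypothetical stand-ins, not ESM internals.

```python
import time

FINISHED = "finished"
RECOVERABLE = "recoverable error"
UNRECOVERABLE = "unrecoverable error"


class ScriptedJob:
    """Stand-in job whose sequence of statuses is fixed in advance."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)
        self.status = "waiting"
        self.submissions = 0

    def submit(self):
        self.submissions += 1
        self.status = "running"

    def poll(self):
        # Report the next scripted status; repeat the last one when exhausted.
        self.status = next(self._statuses, self.status)
        return self.status


def run_all(jobs, interval=0.0, max_retries=3):
    """Poll every job until each is finished or has failed for good."""
    retries = {id(job): 0 for job in jobs}
    pending = list(jobs)
    for job in pending:
        job.submit()
    while pending:
        still_pending = []
        for job in pending:
            status = job.poll()
            if status in (FINISHED, UNRECOVERABLE):
                continue  # terminal state: stop tracking this job
            if status == RECOVERABLE:
                if retries[id(job)] >= max_retries:
                    job.status = UNRECOVERABLE  # give up (assumed policy)
                    continue
                retries[id(job)] += 1
                job.submit()  # try again (assumed policy)
            still_pending.append(job)
        pending = still_pending
        if pending:
            time.sleep(interval)  # wait before the next status check
```

In this sketch a job leaves the loop only in one of the two terminal states named in the text, which is why the caller can treat the return of run_all as "everything is done".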

Examining the results

After the ESS runs have completed, a script such as that shown in Figure 5 prints the results of the executed ESS to the screen; getStdout(aJobID) returns the standard output of the job specified by a job ID.
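The register, run and examine steps can be tried without an E-Cell installation using a toy stand-in. The sketch below borrows the method names from the text (setConcurrency, registerEcellSession, run, getStdout), but the MiniSessionManager class is hypothetical: it runs each script sequentially in the current interpreter, and it assumes that job parameters appear as variables in the script's namespace. The placeholder session script only echoes its parameter and does not run a simulation.

```python
import contextlib
import io
import os
import tempfile


class MiniSessionManager:
    """Toy, sequential stand-in for the ESM method names used in the text."""

    def __init__(self):
        self._jobs = {}      # job ID -> (script path, parameters, extra files)
        self._stdout = {}    # job ID -> captured standard output
        self._concurrency = 1

    def setConcurrency(self, concurrency):
        # Recorded only; this toy always runs jobs one after another.
        self._concurrency = concurrency

    def registerEcellSession(self, essfile, arguments, extrafiles):
        job_id = len(self._jobs) + 1
        self._jobs[job_id] = (essfile, dict(arguments), list(extrafiles))
        return job_id

    def run(self):
        for job_id, (essfile, arguments, _extrafiles) in self._jobs.items():
            with open(essfile) as handle:
                source = handle.read()
            buffer = io.StringIO()
            with contextlib.redirect_stdout(buffer):
                # Parameters become global variables of the script (assumption).
                exec(compile(source, essfile, "exec"), dict(arguments))
            self._stdout[job_id] = buffer.getvalue()

    def getStdout(self, job_id):
        return self._stdout[job_id]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        ess = os.path.join(tmp, "runsession.py")
        with open(ess, "w") as handle:
            # Placeholder for a real ESS: echo twice the injected parameter.
            handle.write("print(VALUE_OF_S * 2)\n")
        esm = MiniSessionManager()
        esm.setConcurrency(4)
        # 'model.eml' is passed by name only; the toy never opens it.
        job_ids = [esm.registerEcellSession(ess, {"VALUE_OF_S": s}, ["model.eml"])
                   for s in range(5)]
        esm.run()
        return [esm.getStdout(job_id).strip() for job_id in job_ids]
```

Calling demo() returns one output string per registered job, in registration order, which mirrors how a script collects results by job ID.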

Parameter Estimation on ESM

This section introduces parameter estimation as an example application built on the ESM architecture. The program incorporates a genetic algorithm, a type of evolutionary algorithm, to identify the global minimum of a given fitness function while avoiding local minima. A brief overview of a genetic algorithm is as follows. First, an arbitrary number of individuals is generated, each with a set of search parameters whose values are randomly distributed within the search space. Second, each individual is independently evaluated with the user-defined fitness function. A squared-error function between the given and the simulation-predicted trajectories is often used as the fitness function. Third, individuals are proliferated or removed from the population according to their fitness values. Fourth, individuals are crossed over. Fifth, each individual is mutated. These procedures are repeated until individuals with a sufficient fitness value are found.
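The five steps can be made concrete with a small, self-contained sketch. It is not the ESM implementation: the toy exponential-decay model stands in for a simulation run, and the specific operators (truncation selection, uniform crossover, Gaussian mutation, elitism) and all parameter values are assumptions chosen for brevity.

```python
import math
import random


def simulate(params, times):
    """Toy stand-in for a simulation run: exponential decay a * exp(-k * t)."""
    a, k = params
    return [a * math.exp(-k * t) for t in times]


def squared_error(params, times, observed):
    """Fitness: squared error between observed and predicted trajectories."""
    predicted = simulate(params, times)
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def genetic_algorithm(times, observed, bounds, pop_size=30, generations=40,
                      mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    # Step 1: random individuals inside the search space.
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        # Step 2: evaluate each individual independently (lower is better).
        scored = sorted(population,
                        key=lambda ind: squared_error(ind, times, observed))
        history.append(squared_error(scored[0], times, observed))
        # Step 3: selection; the better half survives.
        survivors = scored[: pop_size // 2]
        children = [list(scored[0])]  # elitism: keep the best one unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            # Step 4: uniform crossover, gene by gene.
            child = [rng.choice(pair) for pair in zip(mother, father)]
            # Step 5: Gaussian mutation, clipped to the search bounds.
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_rate:
                    step = rng.gauss(0, 0.1 * (hi - lo))
                    child[i] = min(hi, max(lo, child[i] + step))
            children.append(child)
        population = children
    best = min(population, key=lambda ind: squared_error(ind, times, observed))
    return best, history
```

Because the best individual is always carried over, the recorded best fitness never gets worse from one generation to the next. Step 2 is the only part that needs a simulation, which is why it is the natural place to distribute work.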

A detailed implementation of the genetic algorithm on ESM is described here. The first step, initialization, includes parsing a file that specifies the parameter estimation process (e.g., Fig. 7) and setting up working conditions, e.g., preparing a temporary working directory with setTmpRootDir(directory) and setting the concurrency with setConcurrency(concurrency). Moreover, instances of individuals are generated with random parameters. The genetic algorithm itself is also initialized according to the values specified in the [GA] section of the setting file. The other parameters are also parsed at this point for use in the following procedures.
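Reading such a setup file can be sketched with Python's standard configparser module, assuming the file uses INI-style sections like [GA] with 'item = value' lines. Only the [GA] section name and the 'RANDOM SEED' item come from the text; the other item names and all values below are hypothetical.

```python
import configparser
import random

# Hypothetical setup text; only [GA] and 'RANDOM SEED' are named in the chapter.
SETUP_TEXT = """
[GA]
RANDOM SEED = 1
POPULATION = 20
MAX GENERATION = 100
"""


def load_ga_settings(text):
    """Parse the [GA] section into a plain dictionary of integers."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    ga = parser["GA"]
    return {
        "seed": ga.getint("RANDOM SEED"),
        "population": ga.getint("POPULATION"),
        "max_generation": ga.getint("MAX GENERATION"),
    }


settings = load_ga_settings(SETUP_TEXT)
# Seeding the generator makes the random initial population reproducible.
rng = random.Random(settings["seed"])
```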

Figure 7. An example of part of a setup file for a step in a process of parameter estimation built on ESS. The format of this file simply follows ‘items = value’. The value of ‘RANDOM SEED’ represents the initializing value for a ...

In the second step, evaluation, E-Cell sessions with different search parameters are registered using the registerEcellSession(essfile, argument, extrafiles) method; ‘essfile’ and ‘extrafiles’ represent an ESS file which includes search parameters and a list of training time-course data files to be used by the fitness function, respectively. Next, the run method is simply called to execute the registered ESS files (Fig. 8). The procedures described in an ESS run a simulation with the given parameters and evaluate the fitness value defined in the ESS. When all spawned sessions have finished successfully, all calculated fitness values are collected. The third step (selection), the fourth step (crossover) and the fifth step (mutation) then follow, after which the procedure returns to the second step.

Figure 8. An example ESS (E-Cell session script file) for parameter estimation. Step 1) a model file is loaded; Step 2) KmS and KcF are set to values given by ESM to the model; Step 3) logger stubs are prepared for the model's time-courses; Step 4) simulation is ...

On a stand-alone PC, the parameter estimation works as a simple genetic algorithm that executes all ESS scripts in sequence. In grid or cluster environments, it behaves as a master-slave parallel GA, where the master process runs on a master node and the fitness values (the procedures written in the ESS) are evaluated concurrently by slave processes distributed over parallel computational resources.
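The difference between the two modes can be sketched with Python's standard concurrent.futures module. This is an analogy rather than the ESM mechanism: worker threads stand in for remote jobs, the toy fitness function stands in for one ESS run, and the function names and target values are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def fitness(params):
    """Toy fitness: squared distance to a known optimum (stand-in for one ESS run)."""
    target = (1.0, 2.0)
    return sum((p - t) ** 2 for p, t in zip(params, target))


def evaluate_population(population, concurrency=1):
    """Return one fitness value per individual, in the original order."""
    if concurrency <= 1:
        # Stand-alone case: evaluate the individuals one after another.
        return [fitness(individual) for individual in population]
    # Master-slave case: the caller (master) hands evaluations to workers
    # and waits until every result has come back before continuing.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(fitness, population))
```

Both branches return the same list, so the selection, crossover and mutation steps do not need to know how the evaluations were executed. The wait for all results is the synchronization point that makes this scheme sensitive to slow nodes.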

Discussion

We have evaluated ESM by implementing simple iterations of E-Cell sessions and a genetic algorithm on ESM. The procedure we described works transparently in both stand-alone and distributed computational environments. An ESM script is helpful for users who might not be familiar with programming in parallel environments. It enables these users to implement analysis algorithms and to parallelize them easily. All scripts can be written in the Python language using ESM's user-friendly API methods. Indeed, the architecture of ESM is so generic that it can execute ordinary scripts, such as Python, Perl and shell scripts, through the registerJobSession method rather than registerEcellSession.

In homogeneous parallel computing environments such as shared-memory machines or PC clusters, it is relatively easy to schedule jobs to minimize the total amount of processing time. On the other hand, heterogeneous environments such as PC grids require more sophisticated scheduling because the topology of remote computation nodes and the network speed between them are generally unpredictable. Such environments degrade the parallelization performance of programs that require synchronous timing of parallelized sessions, such as the master-slave genetic algorithm. One possible remedy is the island-type genetic algorithm, which may reduce these adverse effects by reducing synchronous transactions among remote calculation nodes and is suitable for heterogeneous and coarse-grained parallel environments. The implementation of such algorithms to accommodate heterogeneous environments is left for future work.

Medium-scale infrastructures, in which coarsely distributed computational resources such as multiple PC clusters are connected by grid technology, are common network architectures. The current run method in an ESM script simply distributes all registered jobs, which means that job scheduling depends on the lower-layer middleware. Although submission in the grid version of ESS with cluster options is one way to utilize such middleware efficiently, users must have detailed knowledge of the features of the distributed environments to do so. In the future, we will investigate the implementation of alternative methods with a sophisticated scheduling scheme.

We have also developed various kinds of other analytical scripts that run on ESM: a sensitivity analysis toolkit based on MCA and a bifurcation analysis toolkit that is used to estimate the stability of nonlinear models. We still need to design a scheduling scheme suitable for the algorithms commonly used in biological systems.

Conclusions

We developed a distributed computing module for the E-Cell System that we named the E-Cell Session Manager (ESM). This software is a higher-level job distribution middleware providing a Python-scripting environment to transparently execute simulation runs on any type of stand-alone, cluster or grid computing environment. ESM, its scripting interface and the parameter estimation program built on ESM are included in the E-Cell Simulation Environment Version 3 package. This package can be downloaded from http://www.e-cell.org/.

References

1. Voit EO, Marino S, Lall R. Challenges for the identification of biological systems from in vivo time series data. In Silico Biology. 2005;5(2):83–92. [PubMed: 15972008]
2. Sauro HM, Hucka M, Finney A. et al. Next generation simulation tools: the Systems Biology Workbench and BioSPICE integration. OMICS. 2003;7(4):355–372. [PubMed: 14683609]
3. Dhar PK, Meng TC, Somani S. et al. Grid cellware: the first grid-enabled tool for modelling and simulating cellular processes. Bioinformatics. 2005;21(7):1284–1287. [PubMed: 15546936]
4. Kimura S, Kawasaki T, Hatakeyama M. et al. OBIYagns: a grid-based biochemical simulator with a parameter estimator. Bioinformatics. 2004;20(10):1646–1648. [PubMed: 14962919]
5. Foster I, Kesselman C. Globus: A metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 1997;11(3):115–128.
6. Stevens RD, Robinson AJ, Goble CA. myGrid: personalised bioinformatics on the information grid. Bioinformatics. 2003;19:302–304. [PubMed: 12855473]
7. Aloisio G, Cafaro M, Fiore S. et al. ProGenGrid: a grid-enabled platform for bioinformatics. Stud Health Technol Inform. 2005;112:113–126. [PubMed: 15923721]
8. Sugimoto M, Takahashi K, Kitayama T. Distributed cell biology simulations with E-Cell System. Lecture Notes in Computer Science, Berlin, Springer-Verlag. 2005. pp. 20–31.