Implementation of PDSIMUL for Mesquite

From NESCent Informatics Wiki

Foreword

In evolutionary biology, comparative phylogenetic analysis often requires thousands of simulations to be run and visualized under various parameters and bounds, in order to study a particular hypothesis (about the evolutionary history of a group) or a stochastic model of a process. Mesquite and PDSIMUL (part of the PDAP package) both serve this need, but each has its shortcomings. Mesquite has relatively limited facilities for simulating continuous-valued characters, while PDSIMUL is relatively old, is meant to run under MS-DOS, and is limited to two variables (although, on the positive side, PDSIMUL implements a number of common models of continuous-character evolution).

The proposal is therefore to combine the best features of both by implementing the PDSIMUL modeling functionality as a Mesquite package, consisting of a set of Mesquite modules written in Java. At the same time, the simulations will be extended to more than two dimensions, first modeled with a user-specified correlation and then extended to a covariance (CV) matrix.
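One common way to draw correlated changes for several characters from a covariance matrix is a Cholesky factorization. The sketch below only illustrates that idea: the proposal does not say which method PDSIMUL:Mesquite would use, "CV matrix" is read here as a covariance matrix, and the class and method names are hypothetical.

```java
import java.util.Random;

/** Hypothetical helper; not part of PDSIMUL or Mesquite. */
final class CorrelatedDraws {
    private CorrelatedDraws() {}

    /** Lower-triangular Cholesky factor L of a symmetric positive-definite matrix, so that L * L^T = cov. */
    static double[][] cholesky(double[][] cov) {
        int n = cov.length;
        double[][] l = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j <= i; j++) {
                double sum = cov[i][j];
                for (int k = 0; k < j; k++) {
                    sum -= l[i][k] * l[j][k];
                }
                if (i == j) {
                    if (sum <= 0.0) {
                        throw new IllegalArgumentException("matrix is not positive definite");
                    }
                    l[i][i] = Math.sqrt(sum);
                } else {
                    l[i][j] = sum / l[j][j];
                }
            }
        }
        return l;
    }

    /** One vector of correlated normal changes: x = L * z, with z independent standard normals. */
    static double[] draw(double[][] l, Random rng) {
        int n = l.length;
        double[] z = new double[n];
        for (int i = 0; i < n; i++) {
            z[i] = rng.nextGaussian();
        }
        double[] x = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j <= i; j++) {
                x[i] += l[i][j] * z[j];
            }
        }
        return x;
    }
}
```

With two characters this reduces to the familiar case of a single user-specified correlation; the same code handles any number of characters as long as the matrix is positive definite.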

Summary

The PDSIMUL:Mesquite System (as we will call it throughout the proposal) is intended to simulate the evolution of two or more (possibly correlated) continuous characters along a known phylogenetic tree. It will take input in any of the formats used by PDTREE, and will output simulations either graphically, by hiring Mesquite's simulation plotting modules, or as an ASCII file named *.PSM. The output of PDSIMUL:Mesquite will include the phylogeny itself in the old .INP format of Martins and Garland (1991), followed by a listing of the parameters chosen for the simulations, and finally the simulated data (as in the current PDSIMUL module of the PDAP package).

PDSIMUL:Mesquite will be configured to operate as a module 'callable' by Mesquite's core trunk module. The basic aim of the PDSIMUL:Mesquite module is integration and interoperability with the Mesquite core trunk module. Once integrated with the Mesquite System, PDSIMUL:Mesquite will appear as a "hirable" module (under the name "PDSIMUL:Mesquite trunk core") alongside the other modules available in the Mesquite System (located in various subdirectories) when the Mesquite trunk module starts up.

Thus, when simulations need to be run, the "PDSIMUL:Mesquite trunk module" will be "hired" by the trunk module of Mesquite as an employee to perform the underlying simulation tasks. The "PDSIMUL:Mesquite trunk module" will then provide the user with a GUI to select from a list of available simulations. Once a simulation has been selected, the program will let the user browse the disk for the file that contains the tree.

Having done that, the "PDSIMUL:Mesquite trunk module" will invoke two further modules (or hire them, in Mesquite's terminology). The first is the "Tree & Data Coordinator" module for PDSIMUL:Mesquite (note: this is different from the File Coordinator module dedicated to the Mesquite system as a whole). The "Tree & Data Coordinator" will coordinate the reading and pre-processing of the specified tree. Control will then pass to the "Parameter Handler" module, which is the other hired module. This module will ask whether the user wants to keep the default parameter values or supply user-defined values. It will also query the user for, and store, the global parameters necessary for the simulation environment. The latter involves setting some of the simulation parameters to coincide with the phenotypic data at the tips of the tree and taking input for bounded simulation.

At this point, the simulation-specific "Pre-simulation" module will be hired by the "Parameter Handler" module (with an acknowledgment sent to the "PDSIMUL:Mesquite trunk module") and assigned the duty of simulation. Before any other activity, this module will conduct a confirmation dialogue with the user to set up all the parameters for the selected simulation at the initialization step. Some parameters may turn out to be mutually exclusive. For some such combinations, the program will replace the excluded parameters' values with built-in defaults; for others, unlike the previous version of PDSIMUL, PDSIMUL:Mesquite will 'notice' while you are entering them. For instance, you cannot set the starting point outside of the specified bounds, because the choice will be made through a drop-down or other suitable GUI menu. The "Pre-simulation" module will also give the option to select any Mesquite-specific feature that can be applied to the simulation results (for a graphical view, etc.).

Next, the simulation-specific "Simulation Core" module will be hired. It will begin by presenting the simulation summary and the chosen parameters for the user to review. If this is what was intended, the user presses the confirmation button and the simulation starts. Depending on the number of simulations requested, the bounding method selected, and the bounds chosen, the simulations could take a very long time to complete. (Note: since the system is being implemented in Java, parallel Java threads will be used to improve performance here.)
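As a rough illustration of how independent replicates might be spread over Java threads, the sketch below uses a fixed thread pool from java.util.concurrent. The class name, the placeholder simulateOnce method, and the per-replicate seeding scheme are all assumptions made for this example, not a description of the planned module.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical runner; not part of PDSIMUL or Mesquite. */
final class ParallelReplicates {
    private ParallelReplicates() {}

    /** Placeholder for one simulation: the sum of ten standard normal steps. */
    static double simulateOnce(Random rng) {
        double value = 0.0;
        for (int step = 0; step < 10; step++) {
            value += rng.nextGaussian();
        }
        return value;
    }

    /** Runs the replicates on a fixed pool; each replicate gets its own seeded generator. */
    static double[] run(int replicates, int threads, long baseSeed) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (int i = 0; i < replicates; i++) {
                final long seed = baseSeed + i;
                Callable<Double> task = () -> simulateOnce(new Random(seed));
                futures.add(pool.submit(task));
            }
            double[] results = new double[replicates];
            for (int i = 0; i < replicates; i++) {
                results[i] = futures.get(i).get(); // blocks until replicate i is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("simulation interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a replicate failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Giving each replicate its own seeded generator keeps the results reproducible regardless of how many threads are used, which would also make comparisons against reference output easier.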

Meanwhile, once simulation data is available but before the run completes, if any graphical statistics were requested while setting up the Mesquite-specific global parameters, the Mesquite trunk module will be re-invoked with the details of the task. With the help of its File Coordinator, it will then process all the plotting and other simulation data processing requests.

When the simulation completes, the user will be notified by an information dialogue box, which also asks whether (s)he wants summary statistics for the simulated tip values. If so, the summary statistics and the associated histogram of tip values are added to the simulation result. In general, as the PDSIMUL documentation and general conventions suggest, it is a good idea to run just a few simulations the first time (e.g., 10 or 100), then check the displayed summary statistics, before asking for all 1,000 or whatever.

Finally, under the "Simulation Result Handler" module, the user gets the option to save the whole simulation track (the exact step-by-step archive of the simulation process), to save only the simulation step, or to discard the results. The user is then sent back to the main menu, where (s)he can do more runs by entering new parameters, or quit.

It is worth noting that PDSIMUL:Mesquite preserves and inherits all the features of the ancestral PDSIMUL module of the PDAP package. Like PDSIMUL, it can also be used to create phylogenetically correct null distributions for testing the statistical significance of correlations between characters, whether those correlations are computed in the conventional, nonphylogenetic way, by squared-change parsimony, or even by phylogenetically independent contrasts.

In addition to the modules mentioned above, a standalone Help application will be available to guide the user through the PDSIMUL:Mesquite System and its functional aspects (callable at any time from within the PDSIMUL:Mesquite System).

Discussion

This section covers other PDSIMUL:Mesquite features and integration issues.

Simulation Models

  • User gets option for this in PDSIMUL:Mesquite core trunk

PDSIMUL:Mesquite preserves all five of PDSIMUL's main options for the model of evolutionary change under which continuous characters are simulated:

  1. Gradual Brownian Motion
  2. Speciational Brownian Motion (like gradual but with all branch lengths set equal to one; was termed "punctuational" by Martins and Garland, 1991)
  3. Gradual Ornstein-Uhlenbeck
  4. Speciational Ornstein-Uhlenbeck
  5. Punctuated Equilibrium (change allowed in only one daughter)
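To make the difference between the gradual and speciational Brownian motion options concrete, here is a minimal sketch of a single-branch update. It assumes the standard formulation in which the change along a branch is normally distributed with variance proportional to branch length; the class, method, and parameter names are hypothetical and not taken from PDSIMUL or Mesquite.

```java
import java.util.Random;

/** Hypothetical helper; not part of PDSIMUL or Mesquite. */
final class BrownianStep {
    private BrownianStep() {}

    /** Speciational models treat every branch as if its length were one. */
    static double effectiveLength(double branchLength, boolean speciational) {
        return speciational ? 1.0 : branchLength;
    }

    /**
     * Trait value at the end of a branch under Brownian motion.
     * The variance of the change is rate * (effective branch length).
     */
    static double evolve(double startValue, double branchLength, double rate,
                         boolean speciational, Random rng) {
        double t = effectiveLength(branchLength, speciational);
        double standardDeviation = Math.sqrt(rate * t);
        return startValue + standardDeviation * rng.nextGaussian();
    }
}
```

A full simulation would apply such an update from the root toward the tips, passing each node's value on to its daughters; under Punctuated Equilibrium, only one daughter of each node would receive a change.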

Bounded Simulation

  • User gets option for this in Parameter Handler module

The simulations will be bounded in cases where this is necessary to imitate real-world situations (though bounding can be turned off in the configuration file), because characters do not evolve to be infinitely large or small. For example, latitude can only vary between 0 and 90 degrees from the equator.

Wandering traits can be kept in bounds in many ways, and the user will be required to select one of the following (details of each are discussed in the PDSIMUL documentation):

  • Throw out: If at any point during the course of a simulation one of the traits goes out of bounds, the simulation program starts over.
  • Replace: It checks at each step to see if the next added change will land the trait out of bounds. If the change will, then a different change is used.
  • Flip: It checks each step to see if it lands the trait out of bounds or not. However, if the step would take the trait out of bounds, its sign is flipped.
  • Hard Bounce: It makes the traits bounce off the bounds.
  • Soft Bounce: The idea is that when a trait is near a boundary, it is difficult for it to step toward that boundary. So when a trait is 'near' a boundary, a special function is called to give a multiplier ranging between 0 and 1. This function is:
    sin( (pi/2) * (distance from edge) / (1/2 range) )
  • Truncate Change: If a trait attempts to step past a boundary it is forced to go to and stop on that boundary.
  • Combination: Each time a step that would lead the trait out of bounds appears, the program randomly chooses from all of the previous methods (except for "Throw Out") to resolve the situation.
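A few of the bounding rules above can be sketched in Java as follows. This is an illustrative reading of the descriptions in this list, not PDSIMUL's actual code: the class and method names are hypothetical, and details such as what Flip does when the flipped step is also out of bounds are left open here.

```java
/** Hypothetical helper; not part of PDSIMUL or Mesquite. */
final class Bounds {
    final double lower;
    final double upper;

    Bounds(double lower, double upper) {
        if (!(lower < upper)) {
            throw new IllegalArgumentException("lower must be less than upper");
        }
        this.lower = lower;
        this.upper = upper;
    }

    boolean contains(double x) {
        return x >= lower && x <= upper;
    }

    /** Truncate Change: a step past a boundary stops on that boundary. */
    double truncate(double value, double step) {
        return Math.max(lower, Math.min(upper, value + step));
    }

    /** Flip: if the step would leave the bounds, its sign is reversed. */
    double flip(double value, double step) {
        return contains(value + step) ? value + step : value - step;
    }

    /** Hard Bounce: the trait is reflected off the boundary it would cross. */
    double hardBounce(double value, double step) {
        double x = value + step;
        if (!Double.isFinite(x)) {
            throw new IllegalArgumentException("value and step must be finite");
        }
        while (!contains(x)) {
            x = (x > upper) ? 2.0 * upper - x : 2.0 * lower - x;
        }
        return x;
    }

    /** Soft Bounce multiplier: sin((pi/2) * distance from nearest edge / (range/2)). */
    double softMultiplier(double value) {
        double distance = Math.min(value - lower, upper - value);
        double halfRange = (upper - lower) / 2.0;
        return Math.sin((Math.PI / 2.0) * distance / halfRange);
    }
}
```

For the latitude example with bounds 0 and 90, a trait at 85 taking a step of +10 would end at 90 under Truncate Change, at 75 under Flip, and at 85 under Hard Bounce.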

Implementation of the OU model

  • All these values are entered in the "simulation <specific>" module (<specific> will be replaced with the name of the simulation currently being set up)

When the user chooses the OU process instead of Brownian Motion, the user must also specify the (D)ecay constant, the (A)daptive peak, the velocit(Y) of the peak, and (T), the variance of peak movement.
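The role of these four parameters can be illustrated with a simple discrete-time update. The sketch below is an assumption-laden illustration (a basic Euler-style step), not PDSIMUL's actual update rule; the class names, the extra traitVariance parameter for the Brownian component, and the time step dt are all hypothetical.

```java
import java.util.Random;

/** Hypothetical container for one trait and its adaptive peak; not a PDSIMUL or Mesquite type. */
final class OuState {
    double trait;
    double peak;

    OuState(double trait, double peak) {
        this.trait = trait;
        this.peak = peak;
    }
}

/** Hypothetical helper; not part of PDSIMUL or Mesquite. */
final class OuStep {
    private OuStep() {}

    /**
     * Advances the state by one time step of length dt.
     * decay:         strength of the pull toward the adaptive peak
     * peakVelocity:  deterministic movement of the peak per unit time
     * peakVariance:  variance of random peak movement per unit time
     * traitVariance: variance of the trait's Brownian component per unit time (assumed parameter)
     */
    static void advance(OuState s, double decay, double peakVelocity,
                        double peakVariance, double traitVariance,
                        double dt, Random rng) {
        // The trait is pulled toward the current peak and also receives random noise.
        double pull = decay * (s.peak - s.trait) * dt;
        double noise = Math.sqrt(traitVariance * dt) * rng.nextGaussian();
        s.trait += pull + noise;

        // The peak drifts at a constant velocity with its own random movement.
        s.peak += peakVelocity * dt + Math.sqrt(peakVariance * dt) * rng.nextGaussian();
    }
}
```

With both variances set to zero, the update is deterministic: a trait at 0 with a peak at 10 and a decay constant of 0.5 moves to 5 after one unit step, while a peak velocity of 1 moves the peak to 11.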

Graphical User Interface

The PDSIMUL:Mesquite features will be available in a graphical user interface written using Java/AWT. The visual layout will be consistent with Mesquite's standard interfaces, with added ease of use.

Data Gathering

Data will be gathered by storing the output in specially formatted files (similar to those of the original PDSIMUL). Storing and transferring data over a network file system has not been planned, but can be incorporated if the need arises.

Integration and Interoperability with Mesquite

Mesquite operates via the cooperation of modules, each of which performs some function or duty. There is one core module, the Mesquite trunk module. When this module starts up, it finds all of the available modules. (Mesquite does this by looking into the subdirectories of the mesquite directory.) From each module, it gathers information, including its name and the functions the module promises to perform. (as by Mesquite's documentation)

http://inforex.moved.in/gsoc/pds.png

To integrate it with Mesquite, PDSIMUL:Mesquite will only need to be placed in a subdirectory of Mesquite, with a few parameters in the PDSIMUL:Mesquite configuration file tweaked. Thereafter, the PDSIMUL:Mesquite module will be found by the Mesquite trunk module and deployed (hired) when required, and the rest of the process will take place as discussed in the Summary.
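The discover-then-hire pattern described above can be pictured with the following toy sketch. It is purely illustrative: none of these types are Mesquite classes, and the real Mesquite module API differs; the interface, class, and duty names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for a module that advertises its name and duty; not a Mesquite type. */
interface HirableModule {
    String name();
    String duty();
}

/** Minimal hypothetical module implementation used for illustration. */
final class BasicModule implements HirableModule {
    private final String name;
    private final String duty;

    BasicModule(String name, String duty) {
        this.name = name;
        this.duty = duty;
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public String duty() {
        return duty;
    }
}

/** Hypothetical registry playing the role of a trunk module that knows all available modules. */
final class ModuleRegistry {
    private final List<HirableModule> modules = new ArrayList<>();

    /** Called once per module found at start-up. */
    void register(HirableModule module) {
        modules.add(module);
    }

    /** Returns the first registered module that promises to perform the given duty, if any. */
    Optional<HirableModule> hire(String duty) {
        for (HirableModule module : modules) {
            if (module.duty().equals(duty)) {
                return Optional.of(module);
            }
        }
        return Optional.empty();
    }
}
```

In this picture, the trunk registers every module it finds at start-up and later hires one by the duty it needs performed, which mirrors how the "PDSIMUL:Mesquite trunk module" would be picked up when a simulation is requested.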

Tackling different structure of Models (in PDSIMUL and Mesquite)

Some issues need to be tackled while rewriting PDSIMUL for the PDSIMUL:Mesquite package. One is that models, e.g. the Brownian motion model, are structured differently in PDSIMUL and in standard Mesquite modules (which have a model and a curator). Thus, when writing the PDSIMUL:Mesquite package, we will implement features to handle both structures, along with a function to convert between them as needed. This will most probably change after a detailed study and review of the existing systems.

Development Methodology and Tools

This section describes the development methodology and tools involved.

Development Platform and language

The development will be done in Java using the Eclipse IDE, with the GUI developed in Java AWT.

Testing

To ensure a stable version of PDSIMUL:Mesquite with as few bugs as possible, the development of each module will be followed by its unit testing (using unit test cases). This unit testing will be performed with JUnit (the JUnit plug-in is included with the Eclipse development environment).

To check the speed and accuracy of the simulation results, integration testing will include a comparative study against the output generated by the current PDSIMUL module of the PDAP package.

Work Plan

The design and development of the system is planned as follows:

  • Prior to GSoC Start (Phase-1): Tackling issues such as restructuring models, other complexities by studying code and formulating alternatives
  • Prior to GSoC Start (Phase-2): Deploying the Mesquite and PDSIMUL tools and learning each of their features
  • Week 1: Creating Blueprint for the PDSIMUL:Mesquite and prototyping desirable API
  • Week 2: Developing and unit-testing the PDSIMUL:Mesquite core trunk module
  • Week 3: Developing and unit-testing the Tree & Data Coordinator & Parameter Handler module
  • Week 4: Developing and unit-testing the Pre-simulation & Simulation core module
  • Week 5: Developing and unit-testing the Simulation Result Handler Module and Integration Testing
  • Week 6: Packaging and Documentation
  • Week 7: Incorporating further plug-ins and improvements based on testing and feedback
  • Week 8,9: System implementation, stress testing, and beta testing
  • After GSoC: Maintenance and Support

Mentor

Any interested mentor may please enlist his/her name here

Deliverables

The deliverables for the project will include:

  • The packaged PDSIMUL:Mesquite in Full Installer format (Mesquite and PDSIMUL:Mesquite bundled)
  • A Patch (containing only PDSIMUL:Mesquite; to be applied on existing Mesquite installation)
  • All source files with proper comments and configuration files
  • Comprehensive documentation and Help files

In addition to the deliverables, I will be available in the future to provide help and assistance with the developed system and to handle any bugs.

About Me

Introduction

My name is Abhishek Tiwari. I am located in Lucknow, Uttar Pradesh (India) and am pursuing a bachelor's degree in engineering (Computer Science stream). I have developed several LAMP-based web applications and standalone C++ and Java applications (working as a freelancer on the internet) during my college life, and have continuously done other design and development work (adhering to industry norms) for commercial and non-commercial (academic) purposes. I have a strong grasp of, and experience in, several open-source and Java technologies, as well as hands-on experience with many other popular technologies.
I have good problem-solving and algorithmic skills. I have won several national and international awards in programming contests at various levels, such as international programming contests organized by renowned universities, TopCoder, ACM ICPC, etc. (both in a team and individually).
My experience as an intern in the Platform Development Department at Yodlee Infotech Pvt. Ltd. also attests to my design and development skills. During my internship, I developed a module that was deployed as an integral part of the Service Level Agreement Computation Model (Java and Oracle 9i) at Yodlee, which allowed me to learn and work under the industry standards of Java and related technologies.

Skill Set

C/C++ (Turbo –DOS, GCC, Borland platforms), J2EE, PHP, XML, AJAX, CGI, SPECTR (Specialized Programming Environment for Competitive Training Robot), MATLAB, MySQL, Oracle, etc

Related Experience and Detailed Portfolio

A list of projects undertaken and a detailed portfolio can be viewed at:

http://inforex.moved.in/

Answers to the NESCent Questionnaire

  • Your interests, what makes you excited
    The opportunity to apply Computer Science and algorithmic knowledge to Comparative Biology and related fields in the open-source domain is itself an exciting prospect.
    As for GSoC, it is a great initiative by Google that makes your summer worthwhile. You get to spend your time developing what you love with the best people in the open-source field. The thought that "what you are developing will be used by others in the future" gives a different, thrilling feeling altogether :)
    Needless to say, there is also a T-shirt and a stipend from Google for something you would have liked to do even without getting paid...
    So it's a win-win situation all over the place :)
  • Why you are interested in the project, uniquely suited to undertake it, and what you anticipate to gain from it
    The first and most striking feature of this project is that it deals with comparative biology (a real-world domain). So any contribution from my side will help researchers in this biological field.
    Moreover, it deals with the enhancement and development of a prominent tool, PDSIMUL (which has not been changed since 1993), in Java, which is a technology of keen interest to me. This makes the project well suited to my taste, and my prior experience makes me a good contender to take up the project and develop it to a high standard of quality.
    As for what I would gain from it, there are several things. I would be learning several new tools from the biological field that I have never been exposed to in the past. I could also learn about new career prospects in the combined field of Biological Sciences and Information Technology. Moreover, I would gain the experience of applying my knowledge and skills to a non-CS field as well!
  • A summary of your programming experience and skills
    I have been coding for over 7 years now, of which I have three years of experience working with open-source projects (encompassing PHP, Java, etc.). My acquaintance with open source began with a particular interest in tweaking existing code (and the code for open-source software was available to tweak as much as you want... though the initial outcomes were serious system crashes ;) ). Following that, my interest grew in PHP for web-based development work, and I learned LAMP and Java for desktop applications, after which I put my skill set to use in several freelance projects as well.
  • Programs or projects you have previously authored or contributed to, in particular those available as open-source, including, if applicable, any past Summer of Code involvement
    I have no past Summer of Code involvement, though I have been involved in the development of many open-source projects. They are available for review at my site:
    http://inforex.moved.in/
  • A project plan for the project you are proposing, even if your proposed project is directly based on one of the ideas
    The project plan is given in the Work Plan section above.