Phylohackathon 1/First Call For Suggestions
These suggestions were submitted in response to the action item from the first teleconference on Aug 2, 2006.
There are many novel, powerful phylogenetic analysis tools (MUSCLE, BAli-Phy, HyPhy, newer PAML & MrBayes models) that have marginal or no programmatic "support" in BioPerl (or any other programming toolkit). By "support" I mean specifically:
1) the ability to fully interact and execute the tool via an (OO) programmatic interface
- local execution via command-line/config-file interface
- remote execution via RPC (web service, CORBA, or otherwise)
- both simple- and advanced-usage parameter handling
- robust error handling
2) an ability to obtain analysis results from the tool in a pre-parsed, object oriented manner, via:
- extended output parsers, robust to format changes, novel output "fields" and expert usage; this would likely benefit from the use of nested/swappable grammars for both efficiency and maintenance/evolvability, which is an approach not yet taken by most BioPerl parsers
- extended object models beyond the traditional singular Sequence, Alignment, and Tree objects, by which analysis parameters and results may be accessed (see #3a below)
3) an ability to "pipeline" tool input and output via reuse of common objects (and their associated serialization formats), which in phyloinformatics I propose should occur via:
- a composite sequences/alignment(s)/tree(s) object that tightly "binds" and manages relationships between components and tool parameters/results (e.g. PAML clade-specific site variation models defined along branches of the actual associated Tree object); this is a reference to the ongoing "Bio::CDAT" project
- a standardization on the use of Nexus-formatted text representations as the "raw" interchange format, until such a time when a new standard emerges (be it phyloXML or otherwise).
- potentially a relational schema extension to BioSQL to provide more structured and indexable storage than raw Nexus files would permit
- extending the Bio::*IO subsystems to target Nexus and/or relational schema(s)
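As a rough illustration of point 3, a composite object that keeps an alignment, its trees, and tool results together, and writes them out as Nexus, might look like the sketch below. It is written in Python only for brevity. The class name `PhyloBundle`, its fields, and the leaf-label check are hypothetical and are not taken from Bio::CDAT or any existing toolkit. Tool results are held on the object but not serialized here.

```python
import re
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PhyloBundle:
    """Hypothetical composite of an alignment, its trees, and tool results."""
    alignment: Dict[str, str]  # taxon label -> aligned sequence
    trees: Dict[str, str] = field(default_factory=dict)  # tree name -> Newick string
    results: Dict[str, dict] = field(default_factory=dict)  # tool name -> parameters/results

    def validate(self) -> None:
        """Check that the components are consistent with one another."""
        if len({len(seq) for seq in self.alignment.values()}) > 1:
            raise ValueError("aligned sequences differ in length")
        for name, newick in self.trees.items():
            # Simplification: leaf labels are plain words following "(" or ",".
            leaves = set(re.findall(r"[(,]\s*([A-Za-z0-9_]+)", newick))
            if leaves != set(self.alignment):
                raise ValueError(f"tree {name!r} and alignment have different taxa")

    def to_nexus(self, datatype: str = "DNA") -> str:
        """Serialize the alignment and trees as Nexus DATA and TREES blocks."""
        self.validate()
        nchar = len(next(iter(self.alignment.values()), ""))
        lines = [
            "#NEXUS",
            "BEGIN DATA;",
            f"    DIMENSIONS NTAX={len(self.alignment)} NCHAR={nchar};",
            f"    FORMAT DATATYPE={datatype} MISSING=? GAP=-;",
            "    MATRIX",
        ]
        lines += [f"    {taxon} {seq}" for taxon, seq in self.alignment.items()]
        lines += ["    ;", "END;"]
        if self.trees:
            lines.append("BEGIN TREES;")
            lines += [f"    TREE {name} = {newick}" for name, newick in self.trees.items()]
            lines.append("END;")
        return "\n".join(lines) + "\n"


# Example usage with made-up data.
bundle = PhyloBundle(
    alignment={"A": "ACGT", "B": "ACGA", "C": "AC-T"},
    trees={"best": "((A,B),C);"},
)
nexus_text = bundle.to_nexus()
```

Because validation runs before serialization, a tree whose taxa do not match the alignment is rejected before it reaches the next tool in a pipeline, which is the kind of "binding" between components described above.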
CORBA and other "service-oriented architectures" (PISE, more generic Web Services, etc.) provide the means for multiple programming toolkits to reuse a common service architecture, but come at a significant burden in installation and usage complexity (which may be appropriate for an institutional-level service-provider installation, but not for the average scientist "hacker"). However, the BioPerl "Bio::Tools::Run" hierarchy of command-line tool "wrappers" already provides the foundation for much of the desired functionality mentioned above, including the ability to optionally invoke remote "web services" where available.
Therefore, I propose that a NESCent-sponsored hackathon should initially aim to achieve a more thorough instantiation of the existing BioPerl framework (efforts #1 and #2 above) for phyloinformatic use cases, and use the opportunity to brainstorm ideas about #3. Though this proposal excludes most other programmatic toolkits (BioJava, BioPython, BioRuby, etc.), that is only because I am unaware of the phyloinformatic support in those toolkits. However, there is opportunity in this proposal for BioSQL extension work (and thus impact on the other toolkits via the OBDA).
I have three points to make, regarding i) technology choices, ii) the users we are targeting, and iii) our development strategy.
First, I will just reiterate the two main ideas that came up in our telecon:
- develop interfaces to the CIPRES services in BioPerl, BioJava, whatever
- fill holes in evolutionary analysis options for BioPerl
Either way, if we want people to sit down and work on code together, this is going to take some serious preparation (individual reading, and group teleconferences to develop a clear understanding).
Second, I would like to stress the benefits of reaching out to a wider community than just those involved in the tree-of-life project. Evolutionary approaches can be (and, I would argue, should be) the preferred approach to problems in genome annotation, personalized medicine, rational drug design, and so on. The methods that evolutionary biologists have been refining for 40 years are superior (relative to, e.g., generic pattern-finding methods borrowed from computer science) because it is more accurate to treat biological entities as things that diverged by an evolutionary process than as things that just fell out of the sky. So, why aren't evolutionary methods more widely used? It's not for lack of powerful modeling tools (which we have). It's not because the concepts are too hard to understand (they aren't more complex than for other approaches, e.g., neural nets). I think it is because we have not provided the informatics support that would allow high-throughput, integrative, flexible analysis pipelines.
Third, I would like to see us focus our development in such a way that, at the end of the day, we can carry out some specific kind of analysis, beginning with some pre-existing input data and ending with a result. It does not have to be a completely novel analysis; it could be something that has been done dozens of times before (in fact, this is a good place to start). I have been working on a list of evolutionary informatics use cases. I need to make these more specific, e.g., for the first one about sequence evolution, I should find a recently published paper that includes an analysis of phylogeny and dN/dS for some sequence family (it doesn't matter which), and that has supplementary data files with sequences and so on. Then we can set the goal of reproducing this published analysis using an entirely script-driven approach.
I don't want to come across as apathetic about this, but I do not have very strong opinions about the specific focus or a specific use-case for a hackathon. Not being a Perl programmer, I'm probably the least qualified to comment on the priorities/needs of BioPerl, and I've had a really hard time coming up with a catchy idea for the hack-a-thon. I think improving BioPerl/CIPRES interactions would be a great thing (certainly for CIPRES and quite possibly for both communities). However, CIPRES software has not reached a point of robustness or richness such that we could invite a group of programmers in and expect a week-long programming session focussed entirely on CIPRES to be the fun, fast-paced experience that I think Todd has in mind.
I do like Aaron's suggestions of improving support for using HyPhy and PAML from external scripting languages.
If no one objects, I'll be something of a lurker on these discussions. I'll chime in as thoughts occur to me, and I'd like to participate in the hack-a-thon (of course I understand if you choose not to invite me if the hack-a-thon goes in a BioPerl direction and you want to keep the group small).
I couldn't see Arlin's use-cases on the www.molevol.org TWiki because I didn't have a log-in, but I do have one general comment. The idea of tackling distinct, simple use-cases is appealing for several reasons. I think that it is also really important to leave time for pseudocoding or brainstorming about other use-cases. It is so easy to make decisions in one context that don't work at all when you want to add another analysis.
I'm also a little concerned that I may have sounded too negative during our teleconference. If one asks "What could CIPRES give to BioPerl that could not be done by extending BioPerl?" then the answer is "not much right now" (and I think that it is important not to oversell CIPRES).
If the question is "Is CIPRES the best route for incorporating phylogenetic inference tools into a pipeline designed for very high throughput bioinformatics analyses -- a context in which an estimated tree is produced in a second or two?" then again the answer is negative. Researchers who just want to incorporate some of the most obvious effects of phylogenetic history into their analyses will gravitate to cheap-and-dirty tree estimation procedures.
CIPRES' main software goal is building a tool that infers trees for datasets of the size being produced by the tree-of-life groups. We are trying to tackle this by building a library that engages the computational phylogenetics community and makes it easier for them to work collaboratively. Most of our programmers and users care a lot about phylogenies and phylogenetic analyses, so we tend to target algorithms that run for hours or days (hence our inability to provide the actual computations as a general-access web service).
Given that Perl in general is so good at piecing together programs (and that BioPerl is much older and more mature than CIPRES), it sounds a bit depressing for CIPRES if one couches the questions in terms of what CIPRES can deliver to BioPerl (this was how much of our conference call sounded to me). Ultimately, one of the best things that CIPRES could do for BioPerl is broaden its user community by allowing programmers in C, Java, and Python to use BioPerl implementations (with Rutger's code doing the adapting between BioPerl and CIPRES APIs). I think that this is actually a big contribution. I've seen lots of people (myself included) translate implementations from one language to another rather than deal with inter-language RPC. As a result, our field has a lot of fragile and partially redundant implementations rather than a few well-tested and heavily used implementations. I think that it will be pretty exciting to see menu items for BioPerl analyses appearing in Mesquite menus.
Mark, to access the twiki material that I referenced, you can register as a guest to view it.
Some specific examples of holes in BioPerl would be:
- missing Bio::Run stuff, e.g., the orphan module for Molphy
- other gaps in phylo & popgen support in the BioPerl priority list.
I am particularly keen on having HyPhy well wrapped with bioperl-run. It would be interesting to know how people think this should be done:
- Have a module for each of the analyses in HyPhy?
Like with Phylip or PAML, where one has a module for each analysis.
- Have a generic module Bio::Tools::Run::Phylo::Hyphy.pm
and then have an analysis parameter inside the factory object that would direct the control file editing to the right place?
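To make the trade-off concrete, here is a minimal sketch of the second option, again in Python for brevity rather than as a BioPerl implementation. Every name in it (`GenericWrapper`, `register`, the analysis names) is hypothetical, and the control-file contents are placeholders, not real HyPhy syntax. One generic class looks up a control-file builder from its analysis parameter, and a per-analysis module (the first option) then reduces to a thin subclass that fixes that parameter.

```python
from typing import Callable, Dict

# Registry mapping an analysis name to a function that renders its control file.
CONTROL_FILE_BUILDERS: Dict[str, Callable[[dict], str]] = {}


def register(analysis: str):
    """Decorator that adds a control-file builder to the registry."""
    def decorator(builder: Callable[[dict], str]) -> Callable[[dict], str]:
        CONTROL_FILE_BUILDERS[analysis] = builder
        return builder
    return decorator


@register("example_analysis_a")
def _build_a(params: dict) -> str:
    # Placeholder content; NOT real HyPhy control-file syntax.
    return f"alignment={params['alignment']}\ntree={params['tree']}\n"


@register("example_analysis_b")
def _build_b(params: dict) -> str:
    # Placeholder content; NOT real HyPhy control-file syntax.
    return f"alignment={params['alignment']}\nmodel={params.get('model', 'default')}\n"


class GenericWrapper:
    """Option 2: one generic class, with the analysis chosen by a parameter."""

    def __init__(self, analysis: str, **params):
        if analysis not in CONTROL_FILE_BUILDERS:
            known = ", ".join(sorted(CONTROL_FILE_BUILDERS))
            raise ValueError(f"unknown analysis {analysis!r}; known: {known}")
        self.analysis = analysis
        self.params = params

    def control_file(self) -> str:
        """Render the control file using the builder for this analysis."""
        return CONTROL_FILE_BUILDERS[self.analysis](self.params)


class ExampleAnalysisA(GenericWrapper):
    """Option 1 expressed on top of option 2: a thin per-analysis class."""

    def __init__(self, **params):
        super().__init__("example_analysis_a", **params)
```

With this layout, adding a new analysis means registering one more builder function, and the choice between a generic module and per-analysis modules becomes mostly a question of which interface users find more discoverable.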
I really like the tree visualization capabilities of the ape package from the R project (http://cran.r-project.org/src/contrib/Descriptions/ape.html). It would be nice to bring this package closer to BioPerl, although I am completely illiterate when it comes to R-Perl interfaces.