Phylohackathon 1/Targets

From NESCent Informatics wiki

Critical Use Cases

These are the use cases that we consider critical to address at this hackathon event.

Top tier

  1. Family alignment: identify homologues, generate family alignment, evaluate models (3.2, 3.18)
  2. Reconcile trees (3.19) and Determine concordance between two or more phylogenies (3.15)
  3. Phylogenetic footprinting/shadowing (3.10)
  4. Morphological characters: infer tree (3.12) and calculate support values (3.13)
  5. Estimate divergence times (3.17)
  6. NEXUS: lack of support for (and adherence to) standards

Second tier

  • Determine correlation of a phylogeny with other variables
  • Functional inference: identifying "functional" sites by "evolutionary trace" and related methods (via HyPhy)
  • Tree of life: whole-genome phylogeny and horizontal transfer

Toolkit-specific targets of opportunity

BioJava Targets

We already have a good sequence and alignment model.

  1. NEXUS parser and writer. [Richard] Conformance levels: Level I, Level II, Level III (future), Level IV (future)
    • NEXUS parser based on ideas from JEBL.
    • Implemented DATA, CHARACTERS, DISTANCES, TREES, TAXA
  2. Tree model. [Tobias] (future - partially implemented now but not enough to be fully useful)
    • A general yet easy-to-use API is very hard to design, so this may not get beyond a basic outline implementation for now.
    • Should work on rooted trees, as these appear to be the most important.
  3. Merging functionality from JEBL. [Tobias] - partially implemented but needs more work
    • Creating new Biojava-JEBL adaptor which will work both ways to allow BioJava and JEBL to interoperate without being dependent on each other or
    • Started work on porting JEBL code to BioJava. No one is a fan of making the JEBL jar part of BioJava, so the aim is to keep a clean separation.
  4. Tree manipulation tools. - too much for now.
  5. PhyloXML parser+writer. - too much for now.
  6. Wrappers for external phylo tools, including parsers for their formats if necessary. [Jim] - will do as many as possible
    • Starting with softparsmap as this was specifically requested in a use-case.
    • Phylip alignment format parser (read and write) completed.
  7. Fix softparsmap
    • Generalizing the inputs of softparsmap to use with any species and gene trees - accomplished via Python wrapper script.
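
As an illustration of what a PHYLIP alignment reader and writer has to handle, here is a minimal Python sketch. It is not the BioJava implementation; it assumes the strict sequential variant of the format (a header line with taxon and character counts, then one line per taxon with a name field of exactly 10 characters), and the function names are hypothetical.

```python
def write_phylip(alignment):
    """Serialize {name: sequence} as strict sequential PHYLIP text."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    lines = [f" {len(alignment)} {lengths.pop()}"]
    for name, seq in alignment.items():
        if len(name) > 10:
            raise ValueError(f"name too long for strict PHYLIP: {name!r}")
        # The name is padded to exactly 10 columns; the sequence follows.
        lines.append(f"{name:<10}{seq}")
    return "\n".join(lines) + "\n"


def read_phylip(text):
    """Parse strict sequential PHYLIP (one line per taxon) into a dict."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ntax, nchar = (int(x) for x in lines[0].split())
    alignment = {}
    for line in lines[1:ntax + 1]:
        name, seq = line[:10].strip(), line[10:].replace(" ", "")
        if len(seq) != nchar:
            raise ValueError(f"{name}: expected {nchar} characters, got {len(seq)}")
        alignment[name] = seq
    if len(alignment) != ntax:
        raise ValueError("taxon count does not match the header")
    return alignment
```

Interleaved files and relaxed name lengths, which many programs also accept, are deliberately left out of this sketch.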

BioPerl Targets

  1. Reconcile tree problem (3.19):
    MSA + species_tree (e.g., from Bio::DB::Taxonomy) -> njtree/phyml (or RAP) -> Tree::NHX
    BioPerl already has alignment IO, tree IO
    1. wrapper around njtree [Albert]
    2. wrapper around RAP
    3. NameIssue (e.g., PHYLIP, clustalw): assign short names, run the external analysis, then restore the long names [Weigang]
  2. Sequence family analysis pipeline (3.2) [Jason]
    blast (m9) -> tribemcl -> clustalw -> phylip -> paml
    1. mcl wrapper (fam_id, gene_name)
  3. Phylogenetic footprinting and shadowing (3.10) [Sendu]
    PhastCons, FootPrinter, or use HyPhy?
    need to investigate the use of this software
    1. Some more links about Phylogenetic footprinting
    2. support input and output formats
  4. Morphological characters: infer tree and support characters (3.12) and calculate support values (3.13)
    The challenge here is to make matrix objects compatible between BioPerl, Bio::Phylo (and perhaps Bio::NEXUS).
    BioPerl has an alignment object (i.e. a molecular matrix), but currently no matrix used for categorical data.
    However, objects implementing AlignI could be modified to allow for non-molecular data - by changing the alphabets used to check for valid characters - and so the Bio::Phylo matrix object's API is being adapted to conform to this interface.
  5. Estimate divergence times (3.17)
    r8s - probably not that difficult to wrap as Bio::Tools::Run::Phylo::R8s::NPRS and Bio::Tools::Run::Phylo::R8s::PL [Albert]
    Multidivtime or T3 (Thornian Time Traveller) - a bit convoluted, but relatively similar to what we already do for Codeml.pm
    Beast - need to generalize on input constraints and output
    Jason (or someone else) to find out the input/output formats
    1. generic interface to time estimate
  6. NEXUS serialization format - will eventually integrate parser from Bio::NEXUS
    1. SimpleAlign is now AnnotatableI and FeatureHolderI. Can it also hold tree objects in a 'collection'? Stockholm alignment format allows for the inclusion of tree data in Newick format, so this would be beneficial for both formats.
    2. Would a new general annotation class similar to Heikki's Bio::Seq::MetaI and Bio::Seq::Meta::Array be useful for holding position-based character state data for alignments?
  7. Wrappers for HyPhy (with some help from Sergei), relates to Family and Reconciliation use cases.
    1. Bio::Tools::Run::Phylo::Hyphy::SLAC
    2. Bio::Tools::Run::Phylo::Hyphy::FEL
    3. Bio::Tools::Run::Phylo::Hyphy::REL
    4. Bio::Tools::Run::Phylo::Hyphy::ModelTest
    5. Bio::Tools::Run::Phylo::Hyphy::GABranch
    6. Bio::Tools::Run::Phylo::Hyphy::GARD
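
The NameIssue item above (assign short names, run the external analysis, restore the long names) can be sketched as follows. This is a Python illustration of the idea, not BioPerl code; the function names and the `T000000001` label scheme are assumptions made for the example.

```python
import re


def shorten_names(names, width=10, prefix="T"):
    """Map each long name to a unique fixed-width label such as 'T000000001'."""
    digits = width - len(prefix)
    if len(names) >= 10 ** digits:
        raise ValueError("too many names for the requested label width")
    to_short = {}
    for name in names:
        if name in to_short:
            raise ValueError(f"duplicate name: {name!r}")
        to_short[name] = f"{prefix}{len(to_short) + 1:0{digits}d}"
    return to_short


def restore_names(text, to_short):
    """Replace every short label in program output with its original name."""
    to_long = {short: name for name, short in to_short.items()}
    if not to_long:
        return text
    # One pass over the text, so a restored name is never rewritten again.
    pattern = re.compile("|".join(re.escape(s) for s in to_long))
    return pattern.sub(lambda m: to_long[m.group(0)], text)
```

Because all labels have the same width, no label is a prefix of another, which keeps the substitution unambiguous. The sketch assumes the external program echoes the labels unchanged in its output.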

Bio::Phylo Targets

Work on Bio::Phylo/Bioperl/Bio::NEXUS compatibility. The following Bio::Phylo objects are now compatible with bioperl:

  • the tree object can be wrapped in an adaptor that implements Bio::Tree::TreeI
  • the node object likewise using Bio::Tree::NodeI
  • the matrix object using Bio::Align::AlignI
  • the character state sequence object using Bio::LocatableSeq

This entailed

  1. writing a generalized adaptor infrastructure (Bio::Phylo::Adaptor);
  2. writing concrete adaptor objects (e.g. Bio::Phylo::Adaptor::Bioperl::Tree);
  3. having these adaptors subclass the BioPerl interfaces and implement their methods.

The canonical test so far is to somehow generate a Bio::Phylo object, then pass it to the appropriate Bioperl IO class and write the object to file. This works!
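
The adaptor approach and the canonical test can be illustrated in a few lines. The sketch below is in Python rather than Perl, and every class in it is a hypothetical stand-in: `TreeInterface` plays the role of a host-toolkit interface such as Bio::Tree::TreeI, `NativeTree` the role of a tree from another library, and `write_node_list` the role of an IO class that only knows the host interface.

```python
from abc import ABC, abstractmethod


class TreeInterface(ABC):
    """Stand-in for the interface the host toolkit's writers expect."""

    @abstractmethod
    def get_root_node(self): ...

    @abstractmethod
    def get_nodes(self): ...


class NativeTree:
    """Stand-in for a tree object from another library with its own API."""

    def __init__(self, root, children):
        self.root = root
        self.children = children  # maps a parent label to its child labels

    def traverse(self):
        """Yield node labels in pre-order."""
        stack = [self.root]
        while stack:
            node = stack.pop()
            yield node
            stack.extend(reversed(self.children.get(node, [])))


class TreeAdaptor(TreeInterface):
    """Wraps a NativeTree and answers the host interface's method calls."""

    def __init__(self, native):
        self._native = native

    def get_root_node(self):
        return self._native.root

    def get_nodes(self):
        return list(self._native.traverse())


def write_node_list(tree):
    """Stand-in for a writer that accepts only the host interface."""
    if not isinstance(tree, TreeInterface):
        raise TypeError("expected an object implementing TreeInterface")
    return ",".join(tree.get_nodes())
```

Passing `TreeAdaptor(native_tree)` to the writer mirrors the test described above: the writer never sees the wrapped object's own API.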

BioRuby Targets

We already have alignment and tree models (but they should be brushed up further).

  1. Add parsers/writers for .nex, .nhx, .msf, etc. (we already have Newick, Multi FASTA, ClustalW parsers/writers)
    • NEXUS [Christian]: developed a parser for NEXUS-formatted data. The parser returns trees, data, characters, distances, and taxa as objects that give access to the individual data fields (such as the number of characters). Trees can be returned either as parsed Newick tree objects or as strings; sequences can be returned either as sequence objects or as strings; distances are returned as a matrix. Other blocks are returned as generic blocks, which can return their content in tokenized form. Blocks can also be returned as NEXUS-formatted strings. Documentation and unit tests have been submitted as well.
    • NHX [Naohisa] (done for parser and writer)
    • msf [Naohisa] (done for parser and writer)
    • Phylip sequence alignment format [Naohisa] (done for parser and writer)
    • Molphy sequence alignment format (.mol) [Naohisa] (done for writer)
    • Phylip distance matrix [Naohisa] (done for parser and writer)
    • Document Bio::PhylogeneticTree (renamed to Bio::Tree) [Naohisa] (in progress)
  2. Fill gaps in the following use cases and create workflows.
    • User interface and workflow design, release management [Toshiaki] (in progress)
    • Application wrappers. (planned, small progress)
      • ClustalW and MAFFT wrappers should be refactored. [Naohisa] (done)
    • CIPRES binding?

List of TODOs for top tier use cases

  1. Use case 3.2
    • Writing Bio::Alignment object to file for application (e.g. PHYLIP) (see above)
    • Implement Nei-Gojobori method.
    • Sophisticated command execution helper, if possible.
  2. Use case 3.18
    • Parser for T-Coffee (use ClustalW compatible output format?) [Naohisa] (done, only a few lines added/modified)
    • Cheat: see how other Bio* projects identify paralogs/gene families ;-)
    • Alignment to PHYLIP (same as 3.2) (see above)
  3. Use case 3.10
    • Cheat: see how other Bio* projects identify and extract orthologous sequences (through genome synteny)
    • Develop a synteny model?
    • Add functionality to calculate total tree length in Newick Bio::Tree class [Naohisa] (done)
    • Add window search functionality to the multiple alignment class (we already have this)
  4. Use case 3.12
    • Develop NEXUS model (= develop NEXUS parser, see above)
    • Develop parser for PAUP results (= develop NEXUS parser, see above)
  5. Use case 3.13
    • Develop parser for PAUP/TNT results (= develop NEXUS parser, see above)
  6. Use case 3.17
    • Develop parser for Phylomatic, r8s, multidivtime results
  7. Use case 3.19
    • Review Schreiber format
    • Develop NHX format parser (see above)
    • Review Softparsmap
    • Develop Softparsmap parser
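
The total tree length mentioned under use case 3.10 is the sum of all branch lengths in the tree. A minimal Python sketch of that calculation, working directly on a Newick string rather than on a parsed tree object, could look like this. It is not the BioRuby implementation, and it assumes that colons appear only before branch lengths once bracketed comments (such as NHX annotations) are removed, so quoted labels containing colons are not handled.

```python
import re

# Bracketed comments, including NHX annotations like [&&NHX:S=human].
_COMMENT = re.compile(r"\[[^\]]*\]")
# A colon followed by a decimal number, optionally in scientific notation.
_BRANCH = re.compile(r":\s*([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)")


def total_tree_length(newick):
    """Return the sum of all branch lengths found in a Newick string."""
    stripped = _COMMENT.sub("", newick)
    return sum(float(x) for x in _BRANCH.findall(stripped))
```

Branches without an explicit length contribute nothing to the sum in this sketch.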

BioPython Targets

  1. Develop a wrapper to allow BioPython to interface with functions in HyPhy, particularly those used in the selection analysis of sequences. This addresses use case (3.2) on facilitating the detection of positive selection.
  2. Develop a wrapper that executes HyPhy batch files, allowing complete access to HyPhy functionality

HyPhy Targets

  1. Develop a HyPhy batch language file to address use case (3.18), mapping inferred mutations to branches in the tree topology.
  2. Work with use case stewards to apply HyPhy to selected problems
  3. Implement McDonald-Kreitman algorithm in HyPhy
    • Generalize M-K such that one could estimate divergence across genomes (thousands of alignments)
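
For reference, the core of the McDonald-Kreitman test is a 2x2 contingency table of fixed versus polymorphic and nonsynonymous versus synonymous differences. The sketch below is written in Python, not in the HyPhy batch language, and is not the HyPhy implementation. It computes the neutrality index (Pn/Ps)/(Dn/Ds), the derived quantity alpha = 1 - neutrality index, and a two-sided Fisher exact p-value; the function names are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    observed = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Sum every table that is at most as likely as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))


def mk_test(dn, ds, pn, ps):
    """McDonald-Kreitman summary for one alignment.

    dn, ds: fixed nonsynonymous and synonymous differences between species.
    pn, ps: nonsynonymous and synonymous polymorphisms within species.
    """
    if dn == 0 or ps == 0:
        raise ValueError("neutrality index is undefined when dn or ps is zero")
    neutrality_index = (pn / ps) / (dn / ds)
    return {
        "neutrality_index": neutrality_index,
        "alpha": 1 - neutrality_index,
        "p_value": fisher_exact_two_sided(dn, ds, pn, ps),
    }
```

Calling `mk_test(dn=7, ds=17, pn=2, ps=42)` on these illustrative counts gives a neutrality index well below 1, that is, an excess of fixed nonsynonymous differences. Applying the same function to thousands of alignments, as the generalization above suggests, would additionally require a way to pool or correct the per-gene results, which this sketch does not attempt.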

BioSQL Targets

  1. Develop a relational model for phylogenetic trees in BioSQL that can be used as an optional add-on. The model should accomplish the following basic features.
    • Stores gene trees as well as species trees.
    • Accommodates binary trees, polytomies, and anastomosing trees.
    • Allows attributes with values to be assigned to edges between nodes, such as branch length, support value(s), or posterior probability.
    • Allows attributes with values to be assigned to nodes in the tree, such as duplication/speciation events, calibrated or estimated time, etc.
    • Allows the results of precomputes for accelerating topological queries to be stored.
  2. Script to load trees into the schema from, e.g., NEXUS files.
  3. Script to run the precompute calculations
    • Nested Set values (left and right values)
    • Transitive closure
    • Path through nodes
  4. Define SQL queries to resolve typical topological queries against a database of trees, resulting from, e.g., a bootstrap analysis of a phylogeny, or conflicting gene and species trees. This relates to use cases Reconcile trees (3.19), Determine concordance between two or more phylogenies (3.15), Infer tree (3.12) and Calculate support values (3.13).
    1. Find the LCA (least common ancestor) of nodes A and B
    2. Find the oldest ancestor node of A such that B is not descended from the ancestor
    3. Find the subtree rooted at LCA(A,B) of nodes A and B (minimal spanning clade)
    4. Find the maximum spanning clade that includes nodes A and B but not C (stem query)
    5. Tree pattern match - all trees that have the same topology between chosen set of taxa (or genes)
      • all trees for which the minimum spanning clade of nodes A and B includes node C (as identified by label)
      • all trees for which the minimum spanning clade of nodes A and B does not include node C (as identified by label)
    6. Subsetting trees:
      • all trees that have at least the given nodes, identified by label
    7. Tree projection: obtain the subtree induced by the chosen set of nodes
  5. Create simple web-based user interface to browse and search the database of trees, including topological queries.
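
To make the nested-set precompute and the LCA query concrete, here is a small Python sketch using an in-memory SQLite database. The single `node` table and its column names are assumptions made for the example and are not the BioSQL schema; labels double as primary keys, and the recursive numbering is only suitable for small trees.

```python
import sqlite3


def build_db(parent_of):
    """Create a toy node table from a {label: parent_label_or_None} mapping."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node (label TEXT PRIMARY KEY, parent TEXT, "
        "left_idx INTEGER, right_idx INTEGER)"
    )
    conn.executemany(
        "INSERT INTO node (label, parent) VALUES (?, ?)", list(parent_of.items())
    )
    return conn


def assign_nested_sets(conn):
    """Precompute left/right values with one depth-first traversal."""
    children, root = {}, None
    for label, parent in conn.execute("SELECT label, parent FROM node ORDER BY label"):
        if parent is None:
            root = label
        else:
            children.setdefault(parent, []).append(label)
    counter = 0

    def visit(label):
        nonlocal counter
        counter += 1
        left = counter
        for child in children.get(label, []):
            visit(child)
        counter += 1
        conn.execute(
            "UPDATE node SET left_idx = ?, right_idx = ? WHERE label = ?",
            (left, counter, label),
        )

    visit(root)


def lca(conn, a, b):
    """Deepest node whose interval encloses both a and b."""
    row = conn.execute(
        "SELECT anc.label FROM node anc, node a, node b "
        "WHERE a.label = ? AND b.label = ? "
        "AND anc.left_idx <= a.left_idx AND anc.right_idx >= a.right_idx "
        "AND anc.left_idx <= b.left_idx AND anc.right_idx >= b.right_idx "
        "ORDER BY anc.left_idx DESC LIMIT 1",
        (a, b),
    ).fetchone()
    return row[0] if row else None


def clade(conn, label):
    """All nodes in the subtree rooted at `label`, in pre-order."""
    rows = conn.execute(
        "SELECT d.label FROM node d JOIN node c "
        "ON d.left_idx BETWEEN c.left_idx AND c.right_idx "
        "WHERE c.label = ? ORDER BY d.left_idx",
        (label,),
    )
    return [r[0] for r in rows]
```

With the intervals in place, `clade(conn, lca(conn, "A", "B"))` answers the minimal spanning clade query (item 3 above) without any recursion in SQL. The price is that the left and right values must be recomputed whenever a tree changes.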

In collaboration with the Bioperl group (Aaron Mackey) we also devised a draft mapping of alignment objects as well as character state matrix data into BioSQL.

ToDos:

  1. Implement the tree projection query
  2. Document Pg version, add MySQL and Oracle versions
  3. Add language bindings to Bioperl-db for Bio::Tree
  4. Flesh out and document mapping of Bio::SimpleAlign objects into BioSQL
  5. Add language bindings to Bioperl-db for SimpleAlign objects
  6. Flesh out and document mapping of NEXUS file contents into BioSQL
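
The tree projection listed as the first ToDo can be stated compactly outside the database. The Python sketch below works on in-memory nested tuples rather than on SQL tables, assumes the chosen nodes are leaves identified by label, and ignores branch lengths (a fuller version would add up the lengths of the edges it merges). The function name is hypothetical.

```python
def project(tree, keep):
    """Return the subtree induced by the leaf labels in `keep`.

    A tree is either a leaf label (str) or a tuple of subtrees.
    Returns None when no chosen leaf occurs in `tree`.
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [p for p in (project(child, keep) for child in tree) if p is not None]
    if not kept:
        return None
    if len(kept) == 1:
        # An internal node with a single surviving child is suppressed.
        return kept[0]
    return tuple(kept)
```

For example, projecting `((("A", "B"), "C"), ("D", "E"))` onto the leaves A, C and E drops B and D and collapses the two internal nodes that are left with only one child.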

NEXUS-related Targets

  1. Identify key applications that use NEXUS as input or output
  2. Identify existing parsers
  3. Develop preliminary spec for levels of conformance (examples are guesswork)
    • level I - minimally useful NEXUS implementation (e.g., BioPerl, clustalw)
    • level II - preferred minimum implementation does TAXA, CHARACTERS, TREES with limitations (e.g., PAUP, MrBayes,...)
    • level III - handles commonly used data types, allows private blocks (? Mesquite, Bio::NEXUS)
    • level IV - complete: handles all properly formed files (incl. private blocks)
  4. Plan to evaluate conformance of NEXUS files
    • A validating parser is possible given the formal BNF grammar provided by Iglesias et al. (2001)
    • but this requires a non-LALR(1) parser
    • unless we limit the scope as in Paloverde's isolated treatment of TREES blocks
  5. Plan to evaluate conformance of application software
    • User commands allow read & write (thus round-trip) in some cases (PAUP, MacClade, Mesquite)
  6. Plan to evaluate conformance of software libraries (e.g., NCL, Bio::NEXUS)
    • In this case we can write a test suite in the native language of the library
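
A first, very rough conformance check is simply to list which blocks a file contains and compare them with the blocks a given level expects (the draft level II above names TAXA, CHARACTERS and TREES). The Python sketch below does only that. It is not a validating parser: it assumes comments are not nested and that no quoted token contains `end;`, and the function names are hypothetical.

```python
import re

# Bracketed comments; nested comments are not handled in this sketch.
_COMMENT = re.compile(r"\[[^\]]*\]")
# BEGIN <name>; ... END; (ENDBLOCK; is accepted as a synonym of END;).
_BLOCK = re.compile(
    r"\bbegin\s+(\w+)\s*;(.*?)\bend(?:block)?\s*;", re.IGNORECASE | re.DOTALL
)


def list_blocks(nexus_text):
    """Return the upper-cased names of the blocks in a NEXUS file, in order."""
    if not nexus_text.lstrip().upper().startswith("#NEXUS"):
        raise ValueError("missing #NEXUS header")
    text = _COMMENT.sub("", nexus_text)
    return [match.group(1).upper() for match in _BLOCK.finditer(text)]


def missing_blocks(nexus_text, required=("TAXA", "CHARACTERS", "TREES")):
    """Return the required block names that the file does not contain."""
    present = set(list_blocks(nexus_text))
    return [name for name in required if name not in present]
```

A test suite for a library could run such a scan over each test file first, then compare the library's own view of the file with the blocks found.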

PhyloModeler (formerly CIPRES) Targets

  1. Developing an interface definition language that encapsulates current (and future) substitution models:
    • Currently different programmers are coming up with their own model definition languages, which is anticipated to cause problems for users.
  2. Long term efforts
    • Add wrappers for CIPRES and HyPhy after the meeting
    • Creating own instance of an IDL for describing statistical models
    • Create and test an XML format that applications can exchange (e.g. HyPhy, PAUP, MrBayes, BEAST, CIPRES, Mesquite, Garli)
  3. NEXUS-related activities
    • We used NCL to read the NEXUS test files listed at Supporting NEXUS
    • This identified some bugs in NCL and aspects of some of the test files that are non-conformant.
    • The NCL has been updated to write out NEXUS files so that roundtripping can be implemented
    • The NCL SourceForge SVN repository contains the new code, including an app named normalizer that reads a NEXUS file and writes its contents to stdout (also in NEXUS format). Note that the tar file distribution available on the SourceForge NCL site has not yet been updated, so look for the normalizer app in the SVN repository instead.
    • GARLI was modified to use NCL to improve the interpretation of NEXUS data files