Reconcile Trees Documentation
A related Phyloinformatic Hackathon page describes how one might discover and characterize gene families when starting with genomic sequence or clustered EST sequence (Family Alignment Documentation). The use case described here is the next step: reconciling the structure of one or more gene family trees with the structure of the species tree for the genomes that were used (a species tree that may have been obtained by some other means of phylogenetic analysis). Identifying the differences between the two types of tree, gene family and species, may shed light on how the genes have changed as speciation has occurred in the given clade.
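To make the idea of reconciliation concrete, here is a minimal sketch of the classical LCA-mapping approach on small rooted, binary trees. It is not the algorithm used by softparsmap (which also handles unresolved trees), and the species names, gene names, and the functions `clades`, `lca`, and `reconcile` are all invented for illustration. Each gene tree node is mapped to the smallest species tree clade that contains all of its species; a node that maps to the same clade as one of its children is labelled a duplication, otherwise a speciation.

```python
def clades(tree):
    """List every clade of a nested-tuple tree as a frozenset of leaf names.

    The clade of the tree's own root is always the last entry.
    """
    if isinstance(tree, str):
        return [frozenset([tree])]
    result = []
    leaves = set()
    for child in tree:
        sub = clades(child)
        result.extend(sub)
        leaves |= sub[-1]  # the child's own clade is last in its list
    result.append(frozenset(leaves))
    return result


def lca(species_clades, species_set):
    """Smallest species tree clade that contains every species in species_set."""
    return min((c for c in species_clades if species_set <= c), key=len)


def reconcile(gene_tree, species_tree, species_of):
    """Label each internal gene tree node as 'duplication' or 'speciation'.

    gene_tree and species_tree are nested tuples of leaf names;
    species_of maps each gene name to its species (hypothetical input format).
    """
    species_clades = clades(species_tree)
    events = []

    def visit(node):
        if isinstance(node, str):
            return {node}, lca(species_clades, {species_of[node]})
        genes, child_maps = set(), []
        for child in node:
            child_genes, child_map = visit(child)
            genes |= child_genes
            child_maps.append(child_map)
        node_map = lca(species_clades, set().union(*child_maps))
        # A node mapped to the same clade as a child implies a duplication.
        kind = "duplication" if node_map in child_maps else "speciation"
        events.append((sorted(genes), kind))
        return genes, node_map

    visit(gene_tree)
    return events


if __name__ == "__main__":
    species_tree = (("rice", "maize"), "arabidopsis")
    species_of = {"r1": "rice", "r2": "rice", "m1": "maize", "a1": "arabidopsis"}
    gene_tree = (("r1", "m1"), ("r2", "a1"))
    for genes, kind in reconcile(gene_tree, species_tree, species_of):
        print(kind, genes)
```

In this toy example the gene tree disagrees with the species tree, and the sketch accounts for the disagreement with a single duplication at the root of the gene tree.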
Reconciling the Multiple Trees Created by Analyzing Genomes
See also a table of programs for reconciling trees.
Most trees one encounters will be resolved, meaning that each parent node has exactly 2 child nodes. However, a significant fraction of the trees at Phytome are unresolved, with more than 2 child nodes for a given parent. softparsmap, unlike most related applications, can handle unresolved trees as input and so was chosen for this task.
- Alignments and phylogenies for ~27,000 gene families derived from 140 plant species
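The distinction between resolved and unresolved trees can be checked mechanically. The sketch below assumes trees are available as simple Newick strings (no quoted labels or bracketed comments) and uses two hypothetical helper functions, `parse_newick` and `is_resolved`, written only for this illustration; it is not part of softparsmap or Phytome.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested (label, children) tuples.

    Handles labels and branch lengths, but not quoted labels or comments.
    """
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            pos += 1  # skip ")"
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        # Keep the label and drop any ":branch_length" suffix.
        label = text[start:pos].split(":")[0].strip()
        return (label, children)

    return parse_node()


def is_resolved(node):
    """True if every internal node has exactly two children."""
    _, children = node
    if not children:
        return True  # a leaf
    return len(children) == 2 and all(is_resolved(c) for c in children)


if __name__ == "__main__":
    for newick in ["((A,B),(C,D));", "((A,B,C),D);"]:
        state = "resolved" if is_resolved(parse_newick(newick)) else "unresolved"
        print(newick, state)
```

A check like this could be used to decide up front which trees need a tool that accepts polytomies.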
- Obtain both species trees and corresponding gene family trees (Phytome).
- Compare and reconcile these trees using softparsmap.
Both softparsmap and Phytome are designed in unconventional ways, making this task impossible initially. Specifically:
- softparsmap requires that every gene in a gene tree be described by a GenBank GI number; no other identifiers are accepted.
- Phytome provides gene trees with its own gene and species identifiers and no identifiers from authoritative external databases such as GenBank.
The BioJava group inspected the code of softparsmap and evaluated different work-arounds, finally electing to create a Python wrapper script that generalized the inputs of softparsmap to use with any species and gene tree.
The BioPerl group devised a workflow that could be used in the tree reconciliation problem.
Njtree can be used to reconstruct a gene tree for a gene family with known species phylogenies. Njtree can also be used to build species trees or gene trees with unknown species phylogenies.
- Obtain a multiple sequence alignment (MSA) using an alignment application and its corresponding Bioperl wrapper module.
- Example: clustalw and the Bio::Tools::Run::Alignment::Clustalw module.
- Obtain a species tree using Bio::DB::Taxonomy
- Use the tree and the MSA as inputs to Njtree.
They also created a workaround to address a longstanding problem in the clustalw application: its insistence on shortened names. This workaround can take sequences with longer names, run clustalw with truncated versions of those names, then extract the alignment and replace the truncated names with the correct, longer names. In this way one can easily use clustalw to create the MSA in the workflow described above.
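The round trip described above can be sketched in a few lines. This is an illustration of the general technique, not the BioPerl group's code: it substitutes generated short aliases for truncation (to guarantee uniqueness), the `width` limit of 10 characters is an assumed value, and `toy_aligner` is a stand-in that only pads sequences with gaps where a real workflow would call clustalw.

```python
def align_with_long_names(records, aligner, width=10):
    """Run `aligner` on short aliases, then restore the original long names.

    records: list of (name, sequence) pairs.
    aligner: callable that takes and returns a list of (name, sequence) pairs.
    width:   assumed maximum name length accepted by the aligner.
    """
    digits = width - 1
    alias_to_name = {}
    short_records = []
    for index, (name, seq) in enumerate(records):
        alias = f"s{index:0{digits}d}"  # e.g. "s000000000" when width is 10
        alias_to_name[alias] = name
        short_records.append((alias, seq))
    aligned = aligner(short_records)
    # Swap each alias back to the long name it stands for.
    return [(alias_to_name[alias], seq) for alias, seq in aligned]


def toy_aligner(records, width=10):
    """Stand-in for a real aligner: rejects long names, pads with gaps."""
    for name, _ in records:
        if len(name) > width:
            raise ValueError(f"name too long: {name}")
    longest = max(len(seq) for _, seq in records)
    return [(name, seq.ljust(longest, "-")) for name, seq in records]


if __name__ == "__main__":
    records = [
        ("Arabidopsis_thaliana_gene_0001", "ATGC"),
        ("Oryza_sativa_gene_0002", "AT"),
    ]
    for name, seq in align_with_long_names(records, toy_aligner):
        print(name, seq)
```

Keeping the alias table in memory means the long names never have to pass through the aligner at all.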
To enable this workflow the BioPerl group created a wrapper script for Njtree (Bio::Tools::Run::Phylo::Njtree::Best).
An example script:
#!/usr/bin/perl -w
use strict;
use Bio::Tools::Run::Phylo::Njtree::Best;
use Bio::AlignIO;
use Bio::TreeIO;
# Read the multiple sequence alignment (FASTA format).
my $alignio = Bio::AlignIO->new(-format => 'fasta',
                                -file   => 'ex1.nucl.mfa');
my $aln = $alignio->next_aln;
# Read the tree (NHX format).
my $treeio = Bio::TreeIO->new(-format => 'nhx',
                              -file   => 'ex1.nucl.nhx');
my $tree = $treeio->next_tree;
# Pass the alignment and the tree to the Njtree wrapper and run it.
my $njtree_best = Bio::Tools::Run::Phylo::Njtree::Best->new;
$njtree_best->alignment($aln);
$njtree_best->tree($tree);
my $nhx_tree = $njtree_best->run();
Bear in mind that Bio::TreeIO can accept a few different file formats (cluster, lintree, newick, nexus, nhx, pag, tabtree).
BioPerl & HyPhy
The following wrapper modules were created that could be used to run HyPhy from within BioPerl and address the tree reconciliation problem:
- Bio::Tools::Run::Phylo::Hyphy::GABranch (in progress, waiting for a single-CPU version of the algorithm)
- Bio::Tools::Run::Phylo::Hyphy::GARD (in progress, waiting for a single-CPU version of the algorithm)
This section depends on work that is in progress. Please help by contributing.
The BioRuby group contributed to this general use case by creating tools to read and write NEXUS format. The parser returns trees, data, characters, distances, and taxa as objects that give access to the individual data fields (such as the number of characters). Trees can be returned either as Newick-parsed tree objects or as strings. Sequences can be returned either as sequence objects or as strings. Distances are returned as a matrix. Other blocks are returned as generic blocks, which can return their content in tokenized form. Blocks can also be returned as NEXUS-formatted strings. Documentation and unit tests have been submitted as well.
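As a rough illustration of what block-level NEXUS parsing involves, the sketch below splits a NEXUS document into named blocks and their commands. It is written in Python and does not reflect the BioRuby API; the function `split_nexus_blocks` is hypothetical, and it makes simplifying assumptions (non-nested bracket comments are discarded, quoted tokens containing semicolons are not supported).

```python
import re


def split_nexus_blocks(text):
    """Return a list of (block_name, [commands]) for a simple NEXUS document."""
    text = re.sub(r"\[[^\]]*\]", "", text)  # drop [comments] (non-nested only)
    if not text.lstrip().upper().startswith("#NEXUS"):
        raise ValueError("missing #NEXUS header")
    pattern = re.compile(
        r"\bBEGIN\s+(\w+)\s*;(.*?)\bEND\s*;", re.IGNORECASE | re.DOTALL
    )
    blocks = []
    for match in pattern.finditer(text):
        name = match.group(1).upper()
        # Commands inside a block are separated by semicolons.
        commands = [c.strip() for c in match.group(2).split(";") if c.strip()]
        blocks.append((name, commands))
    return blocks


if __name__ == "__main__":
    example = """#NEXUS
    BEGIN TAXA;
      DIMENSIONS NTAX=3;
      TAXLABELS A B C;
    END;
    BEGIN TREES;
      TREE t1 = ((A,B),C);
    END;
    """
    for name, commands in split_nexus_blocks(example):
        print(name, commands)
```

A fuller parser would go on to turn each known block type (taxa, characters, distances, trees) into a dedicated object, leaving unknown blocks in this generic tokenized form.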