Sandbox

From NESCent Informatics Wiki

Author

Aaron Mackey, Arlin Stoltzfus

Copyright

This document is copyright the authors named above. It can be copied and distributed under the terms of the Perl Artistic License.

Abstract

This HOWTO shows how to use the BioPerl CDAT interface for the analysis of character data and trees.

Introduction

A Bio::CDAT object is a composite "container" data object consisting of:

  1. one or more Bio::CDAT::TaxonSetI objects (which are containers/iterators of Bio::CDAT::TaxonI objects)
  2. zero or more Bio::CDAT::CharMatrixI objects
  3. zero or more Bio::CDAT::ForestI objects (which are containers/iterators of Bio::CDAT::TreeI objects)
  4. one Bio::CDAT::MediatorI object

A Bio::CDAT object can serve as a container/iterator for items #1 through #3, which means it will have get/set methods such as "get_trees", "set_trees", and "add_tree", as well as a resettable "next_tree" iterator.
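
The container/iterator behavior described above can be sketched in plain Perl. This is only an illustration under stated assumptions: the class name My::TreeContainer and the method name "reset_trees" are hypothetical stand-ins, not part of any existing Bio::CDAT code, and trees are represented as plain Newick strings.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a Bio::CDAT-like container; not the real class.
package My::TreeContainer {
    sub new {
        my ($class) = @_;
        return bless { trees => [], cursor => 0 }, $class;
    }

    # Append one tree and return the container so calls can be chained.
    sub add_tree {
        my ($self, $tree) = @_;
        push @{ $self->{trees} }, $tree;
        return $self;
    }

    # Replace the whole collection and rewind the iterator.
    sub set_trees {
        my ($self, @trees) = @_;
        $self->{trees}  = [@trees];
        $self->{cursor} = 0;
        return $self;
    }

    # Return the collection as a list.
    sub get_trees { return @{ $_[0]{trees} } }

    # Return the next tree, or undef once the collection is exhausted.
    sub next_tree {
        my ($self) = @_;
        return undef if $self->{cursor} >= @{ $self->{trees} };
        return $self->{trees}[ $self->{cursor}++ ];
    }

    # Rewind the iterator to the first tree (assumed method name).
    sub reset_trees {
        my ($self) = @_;
        $self->{cursor} = 0;
        return $self;
    }
}

my $container = My::TreeContainer->new;
$container->add_tree('(A,(B,C));')->add_tree('((A,B),C);');

# Walk the iterator to the end, then rewind and start over.
my @seen;
while (defined(my $tree = $container->next_tree)) {
    push @seen, $tree;
}
$container->reset_trees;
my $first_again = $container->next_tree;
```

Keeping the cursor inside the container is what makes the iterator resettable; a real implementation would presumably keep one cursor per component type (taxon sets, matrices, forests).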

The internal relationships between items #1 through #3 are defined and/or regulated by the associated Bio::CDAT::MediatorI object (see below).

All instantiable Bio::CDAT::* classes inherit from Bio::CDAT::Root, while all interface classes inherit from Bio::CDAT::RootI (these classes could be trivially implemented as subclasses of the corresponding BioPerl classes for now, but by declaring them independently we leave room to be BioPerl-independent).

Ideally, items #1 through #3 should be subclassed from some common Bio::CDAT::ComponentI, which would define any (future) shared behavior (for example, that they are Bio::CDAT::AnnotatableI).

Example Code

Here is some example user code (use case: how many times is the maximum parsimony tree seen in 1000 bootstraps?). Note that it explicitly constructs a rather verbose generic mediator (specified in YAML) rather than using a prebuilt named mediator. The YAML syntax can be extended to provide further relational options (cascade delete, etc.). The code also verbosely creates a simple factory with this mediator, rather than just giving the mediator to the IO engine directly. In short, this is not a "simple" use case, but it is intended to demonstrate some of the features, at least syntactically.

<perl>
use strict;
use warnings;

use Bio::CDAT::IO;
use Bio::CDAT::Factory;
use Bio::CDAT::Mediator;
use Bio::CDAT::Workflow::Bootstrap;
use Bio::CDAT::Workflow::InferTree;
use Bio::Tools::Run::Phylo::Phylip::ProtPars;

# Generic mediator built from an explicit YAML relationship specification.
my $mediator = Bio::CDAT::Mediator::Generic->new(<<'EOL');
 taxonset:
   cardinality: one
   required: yes
   tree: many
   matrix: many
 tree:
   cardinality: many
   required: no
   taxonset: one
   matrix: one
 matrix:
   cardinality: many
   required: no
   taxonset: one
   tree: one
EOL

my $factory = Bio::CDAT::Factory->new(-mediator => $mediator);

# Bio::CDAT::IO is the parsing engine; next_cdat returns the Bio::CDAT object.
my $in = Bio::CDAT::IO->new(-align_file   => "TyrKinases.clw",
                            -align_format => "clustalw",
                            -factory      => $factory);
my $cdat = $in->next_cdat();

my $bootnum   = 1000;
my $bootstrap = Bio::CDAT::Workflow::Bootstrap->new(-num => $bootnum);
$cdat->run($bootstrap);     # $cdat now has 1001 matrices.

my $mp         = Bio::Tools::Run::Phylo::Phylip::ProtPars->new();
my $infer_tree = Bio::CDAT::Workflow::InferTree->new(-engine => $mp);
$cdat->run($infer_tree);    # $cdat now has 1001 new trees.

# The first tree is used as the reference (assumed to come from the original alignment).
my $count     = 0;
my $reference = $cdat->next_tree();
while (my $compare = $cdat->next_tree()) {
    $count++ if has_equivalent_topology($reference, $compare);
}

print "The reference topology was seen $count times out of $bootnum\n";

sub has_equivalent_topology {
    my ($tree_a, $tree_b) = @_;
    # left to your imagination ...
}
</perl>

Bio::CDAT::IO

Bio::CDAT objects may be built "de novo", but will usually come from file- or stream-based input. A Bio::CDAT::IO system (akin to BioPerl's Bio::SeqIO and Bio::AlignIO systems) will provide flexible input/output strategies (e.g. reuse of Bio::Nexus, "vanilla" BioPerl, relational storage, etc.). It has the methods "new" (object constructor), "next_cdat" and (optionally) "write_cdat".

Bio::CDAT::Factory

Bio::CDAT objects may be built using differing mediator strategies; thus, such specifications need to be encapsulated in a factory object that the Bio::CDAT::IO parsing engine can use to facilitate the dynamic production of appropriate Bio::CDAT objects. Bio::CDAT::Factory may be further subclassed to provide specialized behavior (e.g. automated NCBI taxon ID lookup, "mrtrans"-like behavior, etc.).

Bio::CDAT::WorkflowI

A Bio::CDAT::WorkflowI subclass provides a "run_workflow" method that can be (indirectly) invoked by a Bio::CDAT object to perform a series of operations in which the components of the Bio::CDAT object may be obtained "in concert" while remaining "policed" by the Bio::CDAT object's Bio::CDAT::MediatorI object. The "run_workflow" method receives parameters that represent the composited TaxonSetI, CharMatrixI, and ForestI objects; its code interacts with the data through these parameters, governed by the policies of the current mediator.

This also provides a mechanism to encapsulate common "use cases" for general reuse (such as "given an MSA, infer a tree"). It is expected that such reusable workflows be appropriately parameterizable upon instantiation.

A Bio::CDAT::WorkflowI subclass may optionally provide additional methods that operate at workflow instantiation, runtime setup, runtime teardown and workflow destruction. This architecture is inspired by that of CGI::Application.
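
The lifecycle hooks mentioned above might look roughly like the following sketch. Everything here is an assumption made for illustration: the class names, the hook names ("init", "setup", "teardown"), and the choice to run "teardown" even when the workflow fails are not specified by this document.

```perl
use strict;
use warnings;

# Hypothetical base class; the hook names are illustrative assumptions.
package My::WorkflowBase {
    sub new {
        my ($class, %args) = @_;
        my $self = bless { %args, log => [] }, $class;
        $self->init;                      # hook: workflow instantiation
        return $self;
    }

    # Default hooks do nothing; subclasses override the ones they need.
    sub init     { }
    sub setup    { }
    sub teardown { }
    sub DESTROY  { }                      # hook: workflow destruction

    sub run {
        my ($self, @components) = @_;
        $self->setup(@components);        # hook: runtime setup
        my $result;
        my $ok  = eval { $result = $self->run_workflow(@components); 1 };
        my $err = $@;
        $self->teardown(@components);     # hook: runtime teardown, runs on failure too
        die $err unless $ok;
        return $result;
    }
}

# A trivial workflow that records which hooks fired and counts matrices.
package My::Workflow::CountMatrices {
    our @ISA = ('My::WorkflowBase');

    sub init     { push @{ $_[0]{log} }, 'init' }
    sub setup    { push @{ $_[0]{log} }, 'setup' }
    sub teardown { push @{ $_[0]{log} }, 'teardown' }

    sub run_workflow {
        my ($self, $taxa, $matrices, $forests) = @_;
        push @{ $self->{log} }, 'run_workflow';
        return scalar @$matrices;
    }
}

my $workflow = My::Workflow::CountMatrices->new;
my $result   = $workflow->run(['t1', 't2'], ['msa'], []);
my @log      = @{ $workflow->{log} };
```

As in CGI::Application, the base class owns the control flow and the subclass only fills in the steps it cares about.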

Note that any operations on the elements of a Bio::CDAT outside of a Bio::CDAT::WorkflowI will *not* be policed by the corresponding Bio::CDAT::MediatorI object; this provides the flexibility of allowing users to muck with the object in multiple steps before (re)assuming relational correctness.

As a bit of syntactic sugar, workflows can be invoked either indirectly (by name) via a Bio::CDAT object:

<perl> $result = $cdat->run("Bio::CDAT::Workflow::InferTreeFromMSA"); </perl>

or directly via an instantiated (and possibly alternatively-parameterized) workflow object that is passed a Bio::CDAT object:

<perl> $result = $workflow->run($cdat); </perl>

or indirectly, by passing an instantiated workflow object to the Bio::CDAT object:

<perl> $result = $cdat->run($workflow); </perl>

In any case, the "run_workflow" method will be called with the components of the Bio::CDAT object. Note that $result is probably not a new Bio::CDAT object, but is the extraneous output of the workflow (this may be used to denote success, or to encapsulate information that another workflow might further recognize or process). We do not define here the expected return value of an executed workflow, except that a failed workflow should throw a (catchable) Bio::CDAT::Error ...
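
The three invocation styles can be reduced to one dispatch point, as in the following sketch. The classes My::CDAT, My::Workflow::CountTrees and My::Error are hypothetical stand-ins written for this example; how the real Bio::CDAT would load a workflow class by name, and what its error class looks like, are assumptions.

```perl
use strict;
use warnings;

# Hypothetical exception class standing in for a catchable workflow error.
package My::Error {
    sub new     { my ($class, %args) = @_; return bless {%args}, $class }
    sub message { return $_[0]{message} }
}

package My::Workflow::CountTrees {
    sub new { my ($class, %args) = @_; return bless {%args}, $class }

    # Receives the components of the container, not the container itself.
    sub run_workflow {
        my ($self, $taxa, $matrices, $forests) = @_;
        die My::Error->new(message => 'no forests to count') unless @$forests;
        return scalar @$forests;
    }

    # Direct form: $workflow->run($cdat) simply hands itself to the container.
    sub run { my ($self, $cdat) = @_; return $cdat->run($self) }
}

package My::CDAT {
    sub new {
        my ($class, %args) = @_;
        return bless { taxa => [], matrices => [], forests => [], %args }, $class;
    }

    # Accepts either a workflow class name or an instantiated workflow object.
    sub run {
        my ($self, $workflow) = @_;
        $workflow = $workflow->new unless ref $workflow;
        return $workflow->run_workflow(@{$self}{qw(taxa matrices forests)});
    }
}

my $cdat     = My::CDAT->new(forests => ['nj', 'mp']);
my $workflow = My::Workflow::CountTrees->new;

my $by_name  = $cdat->run('My::Workflow::CountTrees');   # indirect, by name
my $direct   = $workflow->run($cdat);                    # direct
my $indirect = $cdat->run($workflow);                    # indirect, by object

# A failed workflow throws an object that the caller can catch and inspect.
my $caught;
eval { My::CDAT->new->run($workflow); 1 } or $caught = $@;
```

Routing every form through the container's "run" keeps a single place where the components are handed to "run_workflow".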

Lastly, a Bio::CDAT::WorkflowI object is also a container/iterator of zero or more Bio::CDAT::Workflow::RequirementI objects (see below).

Bio::CDAT::Workflow::RequirementI

A Bio::CDAT::WorkflowI may (optionally) specify some formal requirements of the Bio::CDAT object's elementary composition for the workflow's proper and successful execution. For example, an "infer tree from MSA" workflow may only be valid in the context of an input Bio::CDAT having a single CharMatrix and no (or empty) ForestI objects. If a Bio::CDAT::WorkflowI subclass defines any requirement objects, they will each be given the opportunity to inspect the Bio::CDAT object before and/or after the workflow completes; any RequirementI failure will result in a thrown Bio::CDAT::ErrorI object. This is a mechanism to further ensure type safety.

Because the set of requirements attached to a workflow is mutable, a user-instantiated workflow object can have flexible/dynamic requirements.
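
A minimal sketch of requirement checking follows, assuming hypothetical classes (My::Requirement, My::Workflow::InferTree), a hypothetical "phase" attribute to select before/after checks, and a plain hash in place of a Bio::CDAT object. For brevity it dies with a string rather than throwing an error object.

```perl
use strict;
use warnings;

# Hypothetical requirement: a named predicate checked before or after the run.
package My::Requirement {
    sub new {
        my ($class, %args) = @_;
        return bless { phase => 'before', %args }, $class;
    }

    sub phase { return $_[0]{phase} }

    # Die unless the predicate accepts the container.
    sub inspect {
        my ($self, $cdat) = @_;
        die "requirement '$self->{name}' failed ($self->{phase})\n"
            unless $self->{check}->($cdat);
        return 1;
    }
}

package My::Workflow::InferTree {
    sub new {
        my ($class) = @_;
        return bless { requirements => [] }, $class;
    }

    # Requirements are mutable: callers may add more after construction.
    sub add_requirement {
        my ($self, $requirement) = @_;
        push @{ $self->{requirements} }, $requirement;
        return $self;
    }

    sub run {
        my ($self, $cdat) = @_;
        my @reqs = @{ $self->{requirements} };
        $_->inspect($cdat) for grep { $_->phase eq 'before' } @reqs;
        push @{ $cdat->{forests} }, ['placeholder tree'];   # stand-in for real inference
        $_->inspect($cdat) for grep { $_->phase eq 'after' } @reqs;
        return 1;
    }
}

my $workflow = My::Workflow::InferTree->new;
$workflow->add_requirement(My::Requirement->new(
    name  => 'single matrix',
    phase => 'before',
    check => sub { @{ $_[0]{matrices} } == 1 },
));
$workflow->add_requirement(My::Requirement->new(
    name  => 'one forest produced',
    phase => 'after',
    check => sub { @{ $_[0]{forests} } == 1 },
));

my $ok      = eval { $workflow->run({ matrices => ['msa'], forests => [] }) };
my $failed  = eval { $workflow->run({ matrices => ['msa1', 'msa2'], forests => [] }); 1 };
my $message = $@;
```

The second run is rejected before any inference happens, because the "before" requirement sees two matrices.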

Bio::CDAT::Mediator

These are the rule-sets that govern the cardinality of, and (for cardinalities greater than 1) the relationships between, the component items composited in a Bio::CDAT. For example, a given Bio::CDAT might represent one MSA with many ForestI objects derived from different tree inference algorithms run on that same MSA; when a workflow that calculates a new tree is run on this CDAT, the new ForestI that encapsulates the new tree will be added to the existing ForestI objects (an alternative "one MSA, one Forest" mediator might instead overwrite any existing ForestI with the new ForestI). Thus, workflows need not know the rules of engagement for a particular dataset, deferring to the mediator to "do the right thing" (or throw an error, as appropriate).

In most cases, the default "one-to-one-to-one" mediator will correctly limit the Bio::CDAT to contain only one TaxonSetI, zero-or-one ForestI and zero-or-one CharMatrixI objects.
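
The append-versus-overwrite behavior described above can be sketched with two interchangeable mediator classes. The class names, the "add_forest" method and the helper "store_new_forest" are hypothetical, and forests are represented as plain strings.

```perl
use strict;
use warnings;

# Hypothetical "one MSA, many forests" policy: new forests accumulate.
package My::Mediator::OneToMany {
    sub new { my ($class) = @_; return bless {}, $class }

    sub add_forest {
        my ($self, $forests, $new) = @_;
        push @$forests, $new;
        return $forests;
    }
}

# Hypothetical "one MSA, one forest" policy: a new forest replaces the old one.
package My::Mediator::OneToOne {
    sub new { my ($class) = @_; return bless {}, $class }

    sub add_forest {
        my ($self, $forests, $new) = @_;
        @$forests = ($new);
        return $forests;
    }
}

# A workflow-like routine that defers to whichever mediator it is given.
sub store_new_forest {
    my ($mediator, $forests, $new) = @_;
    return $mediator->add_forest($forests, $new);
}

my @many = ('nj_forest');
store_new_forest(My::Mediator::OneToMany->new, \@many, 'mp_forest');

my @one = ('nj_forest');
store_new_forest(My::Mediator::OneToOne->new, \@one, 'mp_forest');
```

The calling code is identical in both cases; only the mediator decides whether the existing forest survives.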
