PhyloSoC: Tree and data plotting in the phylobase project
I have recently run into a roadblock that results from the way the data objects in the phylobase project were constructed. I have been writing code to reorder the edge matrix, the internal representation of a phylogenetic tree. This matrix has as many rows as there are nodes in the tree (tip and internal); however, the labels for the tips and internal nodes are stored in separate data slots. This makes it difficult to keep track of which edges correspond to which internal nodes. To fix this, the storage mode of node and tip data would have to be changed. This is a fairly significant change and, if I need to make it, it will likely delay progress on my other goals.
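To make the bookkeeping problem concrete, here is a minimal sketch in base R. The tree, the column names, and the label vectors are all made up for illustration and are not the actual phylo4 slots. It shows one possible workaround: combining the two label vectors into a single lookup keyed by node number, so labels still resolve after the rows are shuffled.

```r
# Hypothetical 3-tip tree ((A,B),C): tips are nodes 1-3, internal nodes 4-5.
# Columns are ancestor and descendant; the root (node 4) has no ancestor.
edge <- cbind(ancestor   = c(NA, 4, 5, 5, 4),
              descendant = c(4,  5, 1, 2, 3))

# Labels kept in two separate vectors (assumed layout, for illustration).
tip_label  <- c("A", "B", "C")   # nodes 1..3
node_label <- c("root", "AB")    # nodes 4..5

# One lookup keyed by node number avoids tracking two vectors by hand.
all_label <- c(tip_label, node_label)
names(all_label) <- seq_along(all_label)

# Shuffle the rows, as a reordering function would. Labels still resolve
# because they are looked up by node number, not by row position.
reordered <- edge[c(3, 4, 2, 5, 1), ]
labelled <- data.frame(
  ancestor   = unname(all_label[as.character(reordered[, "ancestor"])]),
  descendant = unname(all_label[as.character(reordered[, "descendant"])]),
  stringsAsFactors = FALSE
)
print(labelled)
```

The root row keeps an `NA` ancestor label, which mirrors the missing ancestor in the edge matrix.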
The other issue is the best way to extract the data from a data object. Typically in R, variables are held in columns with the individual observations in rows. With a phylogenetic tree, the obvious means of holding the data is with the tips in rows. However, if you want to plot a histogram of a particular variable, you are required to have as many columns as there are observations, and it becomes even more difficult if you'd like to have bivariate data plots. There are two possible solutions: change the data format into a list, which can store many values in each cell, or live with the many-column issue. At this point, changing the data format would require a change to the entire phylobase code base and isn't feasible for this project. Instead, I will go ahead with the data frame approach.
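The two layouts can be compared with a small sketch. The measurements and column names below are invented, and neither layout is meant to represent the actual phylo4d data slot. The first is the many-column data frame, the second a list column holding one vector per tip.

```r
# Hypothetical data: three tips, each with several repeated measurements.
obs <- list(A = c(1.2, 1.5, 1.1, 1.4),
            B = c(2.3, 2.1, 2.6, 2.2),
            C = c(0.9, 1.0, 0.8, 1.1))

# Data frame approach: one row per tip, one column per observation.
wide <- as.data.frame(do.call(rbind, obs))
names(wide) <- paste0("obs", seq_len(ncol(wide)))

# A histogram for one tip has to gather values across that row's columns.
tip_A <- unlist(wide["A", ], use.names = FALSE)

# List approach: a single cell per tip holds the whole vector.
nested <- data.frame(tip = names(obs), stringsAsFactors = FALSE)
nested$values <- unname(obs)   # list column, one element per row

# Plot only in an interactive session.
if (interactive()) {
  hist(tip_A, main = "Tip A, from the wide data frame")
  hist(nested$values[[1]], main = "Tip A, from the list column")
}
```

With the wide layout every tip needs the same number of observations (or padding with `NA`), which is part of what makes the many-column approach awkward.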
Last week I:
- read a few reference implementations of plotting and the discussion in Inferring Phylogenies
- ported plot.phylo() and modified it to use phylo4 objects instead of phylo objects
- wrote a function to reorder phylo4 objects into pruningwise order
- wrote a framework for an abstracted plot function
- wrote a rough function for calculating node y coordinates
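As a rough illustration of the last items in this list, the sketch below reorders a plain edge matrix with a postorder traversal, so that every node's descendants are listed before the node itself, and then derives y coordinates from that order. It is not the phylobase code: the tree and the function name `postorder()` are hypothetical, and the result may differ in detail from what ape calls "pruningwise" order.

```r
# Hypothetical 4-tip tree ((A,B),(C,D)): tips 1-4, root 5, internal nodes 6-7.
edge <- cbind(ancestor   = c(5, 6, 6, 5, 7, 7),
              descendant = c(6, 1, 2, 7, 3, 4))
n_tips <- 4

# Visit children before their parent and record nodes in the order finished.
postorder <- function(edge, node) {
  kids <- edge[edge[, "ancestor"] == node, "descendant"]
  c(unlist(lapply(kids, function(k) postorder(edge, k))), node)
}

root  <- setdiff(edge[, "ancestor"], edge[, "descendant"])
visit <- postorder(edge, root)

# Reorder the edge matrix so each row follows the rows of its descendants.
reordered <- edge[match(setdiff(visit, root), edge[, "descendant"]), ]

# y coordinates: tips are spaced evenly in the order visited, and each
# internal node sits at the mean of its children's y values.
y <- numeric(max(edge))
tips_in_order <- visit[visit <= n_tips]
y[tips_in_order] <- seq_along(tips_in_order)
for (node in visit[visit > n_tips]) {
  kids <- edge[edge[, "ancestor"] == node, "descendant"]
  y[node] <- mean(y[kids])
}
```

Because internal nodes are processed in postorder, the y values of a node's children are always available by the time the node itself is placed.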
This week I hope to:
- update code written last week with better comments and to use the existing method abstractions (e.g. for getting the root edge)
- write a function for calculating node x coordinates
- fill out the written framework to allow for basic plotting with all new code
- Step one is completed. Steve and I decided that the first goal should be plotting of simple phylograms. Toward that end, I started with the ape plot.phylo() command and deleted all the code related to fan and radial shaped plotting. I also removed all the code related to up/down vs. left/right plotting, as these are high-level functions in grid (not yet implemented).
At this stage, it will be worthwhile to step back and consider whether the plot.phylo() architecture is appropriate and what alternatives might exist.
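For the node x coordinates planned for this week, one simple possibility is to place each node at the x position of its ancestor plus the length of the connecting edge. The sketch below assumes a plain edge matrix and a parallel vector of made-up edge lengths. It is an illustration, not the function that will end up in phylobase.

```r
# Hypothetical tree ((A,B),(C,D)) with invented edge lengths.
edge <- cbind(ancestor   = c(5, 6, 6, 5, 7, 7),
              descendant = c(6, 1, 2, 7, 3, 4))
edge_length <- c(1.0, 0.5, 0.5, 0.25, 1.25, 1.25)

root <- setdiff(edge[, "ancestor"], edge[, "descendant"])

# x coordinate of a node = x of its ancestor + length of the connecting edge.
# Work outward from the root so that each ancestor is placed first.
x <- numeric(max(edge))
queue <- root
while (length(queue) > 0) {
  node <- queue[1]
  rows <- which(edge[, "ancestor"] == node)
  kids <- edge[rows, "descendant"]
  x[kids] <- x[node] + edge_length[rows]
  queue <- c(queue[-1], kids)
}
```

The root stays at zero, and in this example all four tips end up at the same x value because the invented edge lengths happen to be ultrametric.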
Project description and timeline
Tree and data plotting in the phylobase project
My interest in the project
I am interested in this particular proposal because I have been frustrated by the lack of good links between R’s excellent data plotting ability and its tree plotting abilities. A recent example would be my current project on the evolution of a trait with a circular distribution. The R package circular contains nice functions for plotting circular data as rose diagrams, and the ape and phylobase packages contain functions for plotting phylogenetic trees. However, without significant low-level work, plotting rose diagrams at the tips or internal nodes isn’t possible. A clear interface to plotting operations that allows users to ‘nest’ plots on phylogenies would remove a major obstacle for those trying to produce phylogenetic tree figures suitable for publication.
The phylobase project currently supports two basic phylogenetic tree data types, one where trees have associated data (phylo4d) and another for the trees alone (phylo4). For each of these basic types there is a corresponding “multi” object, which is a list containing several trees. To date, only the single tree objects have methods for visual display. These methods are wrappers for functions in the package ape, which calculate and plot the tree using the R base graphics system. In the case of a phylo4d object, data associated with the tree tips is plotted on top of the same graphics device in a style inspired by the package ade4, though re-implemented in phylobase.
Current plotting functions are a limitation of the phylobase package, as they require phylobase objects to correspond to, or be coercible to, ape package style phylogenies. They also add a dependency on the external package ape for a basic tree operation. Specific limitations stem from the use of the base graphics system: scaling of trees, labels, data and other text is difficult within this system. This is readily observable by plotting the examples in the current plot command help files. If plotted at the default graphics size, an error is encountered; in other examples, improper spacing between legends and other objects is a problem. Lastly, as plots are resized, the tree scales but data objects do not.
The way forward is outlined in the project call: plotting of phylobase objects should not have external dependencies and should be implemented in the grid graphics system.
The grid graphics system was written by Paul Murrell and provides an alternative graphics interface. Other high-level plotting systems in R, such as lattice and ggplot2, are built on top of it. One feature of grid is the ability to scale objects in relation to the size of the plotting region, or not; thus, when a plotting region is resized, labels and text scale appropriately.
A key paradigm of the grid system is the viewport, basically a plotting region that can be nested within another plotting region. This concept is most useful for the case of plotting a tree with associated tip or internal node data. A possible approach would be to generate viewports at each of the tree tips, which could then be filled with a variety of plot types such as histograms, sideways boxplots, or bivariate scatterplots. Likewise, each node could have an associated viewport to display data corresponding to that node.
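The idea of one plotting region per tip can be sketched with grid directly, where these regions are created with `viewport()`. The per-tip data, the layout numbers, and the helper names `tip_viewport()` and `draw_bars()` below are all hypothetical. This is a minimal sketch of the approach, not a proposed phylobase interface.

```r
library(grid)

# Hypothetical per-tip data to display next to three tips.
tip_data <- list(A = c(2, 4, 3), B = c(5, 1, 2), C = c(3, 3, 6))
n_tips <- length(tip_data)

# One viewport per tip: a horizontal band in the right half of the page,
# stacked top to bottom in tip order. Positions are in npc units, so the
# bands rescale with the device.
tip_viewport <- function(i) {
  viewport(x = unit(0.75, "npc"),
           y = unit((n_tips - i + 0.5) / n_tips, "npc"),
           width = unit(0.4, "npc"),
           height = unit(0.8 / n_tips, "npc"),
           name = paste0("tip_", names(tip_data)[i]))
}

# Draw a small bar chart inside whichever viewport is current.
draw_bars <- function(values) {
  n <- length(values)
  grid.rect(x = unit((seq_len(n) - 0.5) / n, "npc"), y = unit(0, "npc"),
            width = unit(0.8 / n, "npc"),
            height = unit(values / max(values), "npc"),
            just = "bottom", gp = gpar(fill = "grey60"))
}

# Draw only in an interactive session.
if (interactive()) {
  grid.newpage()
  for (i in seq_len(n_tips)) {
    pushViewport(tip_viewport(i))
    draw_bars(tip_data[[i]])
    popViewport()
  }
}
```

Because `draw_bars()` only refers to the current viewport, it could be swapped for any other drawing function without changing the layout code, which is the appeal of this design for passing in user-chosen plot types.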
Below is a roadmap for how this project might be implemented. I have tried to make each point a discrete, sequential step, and the steps should be done in the order suggested below. As long as each step is completed, any point would be an appropriate place to stop the project, with no further work needed for the result to be usable.
- For phylo4 objects, replace base graphics calls with grid graphics calls. This should be fairly straightforward, as grid possesses functions similar to many of the higher-level commands, such as lines(), used by the current plotting functions. Time is also allotted to ensure a more thorough understanding of the grid system architecture and how it differs from base graphics. I estimate this step will take 2-3 weeks.
- For the plotting of phylo4d objects, add viewports for node and tip data. Based on my understanding of the grid system, this should be fairly straightforward. The primary challenge will be finding the best way to pass on the data plotting commands. I estimate this step will take 1-2 weeks.
- Transition back-end calls from ape to phylobase-specific code, allowing for more flexible handling of node (internal and tip) data. The code from the ape package is largely applicable to phylobase objects, but may require some abstraction. I estimate this step will take 1-3 weeks.
- I expect to complete the above goals by the midterm review in early July.
- Rewrite calculations in C code as necessary to optimize plotting. Again, the ape package already contains an example of how this might work. The primary challenge here is my beginner-level knowledge of C. I estimate this step will take 2-3 weeks.
- In addition to writing or updating the R help files for the functions described above, I will create tutorials for end users. This would include adding examples of common plot types to the help files, writing sections for the phylobase vignette and updating appropriate wikis such as the R wiki and R phylo wiki. I estimate this step will take 1-2 weeks.
Total time 7-13 weeks
- Time permitting, write functions to zoom and collapse clades. An example zoom function exists in ape, as does a non-graphical collapse function. I estimate this step will take 1-2 weeks.
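To give a feel for the first roadmap step, replacing base graphics calls with grid equivalents, here is a minimal sketch that builds a rectangular phylogram from grid segment objects instead of base line-drawing calls. The tree and the node coordinates are invented and already scaled to the 0-1 range. How phylobase will actually structure its grobs is an open question.

```r
library(grid)

# Hypothetical node coordinates for the tree ((A,B),(C,D)), scaled to 0-1.
# Tips are nodes 1-4, the root is node 5, internal nodes are 6-7.
edge <- cbind(ancestor   = c(5, 6, 6, 5, 7, 7),
              descendant = c(6, 1, 2, 7, 3, 4))
x <- c(0.9, 0.9, 0.9, 0.9, 0.1, 0.6, 0.3)
y <- c(0.2, 0.4, 0.6, 0.8, 0.5, 0.3, 0.7)

anc <- edge[, "ancestor"]
des <- edge[, "descendant"]

# Each edge of a rectangular phylogram is a vertical segment at the
# ancestor's x position followed by a horizontal segment to the descendant.
vertical   <- segmentsGrob(x0 = x[anc], y0 = y[anc], x1 = x[anc], y1 = y[des],
                           default.units = "npc")
horizontal <- segmentsGrob(x0 = x[anc], y0 = y[des], x1 = x[des], y1 = y[des],
                           default.units = "npc")
phylogram  <- gTree(children = gList(vertical, horizontal), name = "phylogram")

# Draw only in an interactive session.
if (interactive()) {
  grid.newpage()
  pushViewport(viewport(width = 0.9, height = 0.9))
  grid.draw(phylogram)
  popViewport()
}
```

Since the coordinates are in npc units, the tree rescales with the enclosing viewport, and the same `phylogram` object could be drawn inside a smaller viewport alongside tip data.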