PhyloSoC: Enhancing the representation of ecophylogenetic tools in R
My Summer of Code project is to greatly expand the R package picante, a toolbox of comparative methods used to assess phylogenetic community structure. I will code three types of published ecological phylogenetic comparative methods. These methods can be used to assess phylogenetic community structure, the effects of environmental variables on phylogenetic community structure, and phylogenetic structure among species interactions. In other words, the methods can be used to ask questions such as: are closely related species found within the same community; are closely related species found in the same environments; and do closely related species interact with the same species. I will also code data handling routines that quickly randomize large matrices. These routines allow for tests of statistical significance with the methods. The product of my Summer of Code experience will be a comprehensive R package of phylogenetic comparative methods that will facilitate the use of phylogenetics in ecological research.
Phylogenetic comparative methods are increasingly used in ecological research. Incorporating phylogenetics into ecology is appealing because it allows ecological questions to be addressed in an evolutionary context, leading to a deeper understanding of natural phenomena. Consequently, new “ecophylogenetic” comparative methods are rapidly being developed (I have adopted this term from Jeannine Cavender-Bares). However, three major barriers prevent many ecologists from using ecophylogenetic methods. First, many of these new tools are not available as open-source software. Second, the available tools are spread across multiple programs, each with its own learning curve. Finally, many of the programs are written in languages that few ecologists know, which prevents end-users from modifying functions to suit their own purposes. The goal of my project is to enhance the ecophylogenetic R package picante by 1) coding more ecophylogenetic methods into the package; and 2) coding new data handling routines that allow randomization tests to be performed quickly with the methods. If funded, I will help integrate phylogenetics and ecology by coding a comprehensive package of ecophylogenetic methods in R, a statistical platform and language that many ecologists already know.
Biography
My research is on how phylogenetics can lead to better understanding of ecological community assembly, species extinction, and species invasion. For example, I resampled roughly 200 freshwater fish communities in central Mexico that had been sampled at least one time previously in the last 100 years. This data set is large with many communities; many species invasions and extinctions; temporal and spatial autocorrelation; environmental data; trait data; and a large phylogenetic tree. Quantifying the interactions of all these factors required me to develop novel statistical tools and computer programs. My interests are thus at the intersection of computational comparative biology, ecology, and evolution.
Two of the papers I have published in graduate school contain open-source R and MATLAB code (Helmus et al. 2007a; Helmus et al. 2007b). I also gained experience with C++ as an undergraduate.
A Summer of Code experience will allow me to greatly expand my programming skills. However, the main reason I am applying is that I want to become part of the NESCent phyloinformatics community. After my Summer of Code is over, I plan to continue working with the developers of phylobase and picante, and with the phyloinformatics group. I feel that working with this group is important since I will likely continue to use and develop ecophylogenetic methods throughout my academic career.
My only other obligation this summer is to make any edits suggested by my committee to my dissertation (I will defend on April 30, but officially graduate at the summer’s end). I will make the Summer of Code project the focus of my summer and will treat it like a full-time job.
I will greatly expand picante by adding three types of ecophylogenetic methods: diversity metrics, species regressions, and bipartite species analyses. Under each section below I have listed: the data each method requires (Data Input), existing code that will guide my R code (Backbone), programs already available for the methods (Current Availability), and my specific products (Deliverable).
1A I will code the PD metric (Faith 1992), the metrics described by Hardy & Senterre (2007), the delta metrics of Clarke & Warwick (1998; 1999), and the metrics developed by my colleagues and me (Helmus et al. 2007a). These metrics represent different phylogenetic components of biodiversity, were developed for different applications and research topics, and/or have different statistical properties. They will add to the metrics already coded in picante (i.e., those of Webb et al. 2002). Coding these metrics will facilitate future research that compares metrics to understand ecological phenomena, as well as research into the similarities and differences among the methods.
-Data Input: a phylo4 object (or other tree object); a community incidence matrix (i.e., an object of species presence/absence or abundance/biomass across communities)
-Backbone: my published and unpublished R code; the cophenetic and vcv.phylo functions; the diversity metric functions in picante (e.g., mnnd)
-Current Availability: Some of these methods are available as computer programs in various formats from the authors.
-Deliverable: a comprehensive R toolkit of the most commonly used phylogenetic biodiversity metrics
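As an illustration of what one such metric involves, here is a minimal base-R sketch of phylogenetic species variability (PSV; Helmus et al. 2007a), computed as (n * trace(C) - sum(C)) / (n * (n - 1)) for the phylogenetic correlation matrix C of the n species present in a community. The function name psv_sketch and the toy matrices are hypothetical and are not picante code; the covariance matrix is written by hand here, standing in for the kind of matrix vcv.phylo returns for the tree ((A:1,B:1):1,C:2).

```r
# Hypothetical helper (not part of picante): PSV for each community (row)
# of a presence/absence matrix, given a phylogenetic covariance matrix.
psv_sketch <- function(comm, phy_vcv) {
  # Match species order to the community matrix, then scale to correlations
  C <- cov2cor(phy_vcv[colnames(comm), colnames(comm)])
  apply(comm, 1, function(row) {
    present <- row > 0
    n <- sum(present)
    if (n < 2) return(NA_real_)  # PSV is undefined for fewer than two species
    Csub <- C[present, present]
    (n * sum(diag(Csub)) - sum(Csub)) / (n * (n - 1))
  })
}

# Toy covariance matrix for the ultrametric tree ((A:1,B:1):1,C:2)
phy_vcv <- matrix(c(2, 1, 0,
                    1, 2, 0,
                    0, 0, 2),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Toy presence/absence matrix: communities in rows, species in columns
comm <- rbind(site1 = c(A = 1, B = 1, C = 0),
              site2 = c(A = 1, B = 0, C = 1),
              site3 = c(A = 1, B = 1, C = 1),
              site4 = c(A = 1, B = 0, C = 0))

psv <- psv_sketch(comm, phy_vcv)
print(psv)
```

In this toy example the community of two sister species (site1) gets the lowest PSV, and the community of two species that share no branch length (site2) gets a PSV of 1.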
1B I will code community regression techniques that partition determinants of phylogenetic community structure using environmental data (Helmus et al. 2007b). These techniques are broadly applicable to many data sets and can be used to understand the causes of observed patterns in phylogenetic structure across communities.
-Data Input: a phylo4 object; an incidence matrix; an environmental matrix
-Backbone: my published MATLAB code; stats package functions
-Current Availability: my published MATLAB code specific to my data set
-Deliverable: a set of R functions to perform these analyses with any data set
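The sketch below shows only one plausible building block for this kind of analysis, not the published method: a separate logistic regression of each species' presence/absence on an environmental variable, using glm from the stats package. The data are simulated, and every name in it (env, temp, comm, fits, slopes, resid_mat) is an assumption made for illustration.

```r
set.seed(1)

# Simulated environmental matrix: one variable measured at 40 sites
n_sites <- 40
env <- data.frame(temp = rnorm(n_sites))

# Simulated presence/absence of three species whose occurrence
# probability depends on temp with different (made-up) slopes
true_slopes <- c(sp1 = 1.5, sp2 = 1.0, sp3 = -1.0)
comm <- sapply(true_slopes, function(slope) {
  rbinom(n_sites, size = 1, prob = plogis(slope * env$temp))
})

# Fit one binomial GLM (logistic regression) per species
fits <- lapply(colnames(comm), function(sp) {
  glm(comm[, sp] ~ temp, data = env, family = binomial)
})
names(fits) <- colnames(comm)

# Estimated environmental slopes, one per species
slopes <- sapply(fits, function(f) coef(f)[["temp"]])

# Response-scale residuals (sites x species): the variation in occurrence
# left over after accounting for the environmental variable
resid_mat <- sapply(fits, residuals, type = "response")

print(round(slopes, 2))
```

Phylogenetic structure could then be examined in the fitted and residual parts separately; how that partition is actually done is specified in Helmus et al. (2007b), not here.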
1C I will code a set of phylogenetic comparative methods that estimate phylogenetic signal in the interactions of species between two trophic levels or two sets of mutualistic species (Ives & Godfray 2006). These methods will help scientists ask specific questions such as: do closely related herbivores, with similar traits, feed on the same plants; and can phylogeny predict which native plants an invasive pollinator will pollinate.
-Data Input: two phylo4 or 4d objects; a species-species matrix of interaction strengths
-Backbone: published and unpublished MATLAB code; stats package functions
-Current Availability: published and unpublished MATLAB code (not user friendly)
-Deliverable: a set of R functions to perform these analyses with any data set
Randomization tests are used widely in ecophylogenetic research to provide p-values for test statistics and to test for nonrandom patterns in community structure. All methods described above are used in conjunction with data randomizations for these purposes. The typical data matrix used in ecophylogenetic research is large, and many randomizations are needed to create probability distributions. However, current functions in R do not quickly perform many randomizations for large data matrices. This is a serious hindrance to performing ecophylogenetic data analyses using only R. I propose to write, or augment, a set of functions that will quickly randomize data matrices. I will do this by either 1) changing existing functions; 2) using matrix indexing with the existing sample and apply functions; or 3) writing a series of C/C++ extensions to perform the loops. Specifically, I will create functions to randomize a matrix while maintaining only row totals, only column totals, or both row and column totals. These are the most widely used randomizations in ecophylogenetic research. Augmenting existing functions may be adequate. For example, I changed a line in the picante function randomizeSample from apply(samp,1,sample) to samp[,sample(ncol(samp),replace=FALSE)]. The edited function runs ~50 times faster than the original.
-Data Input: a data matrix
-Backbone: the sample function; functions in ecodist; functions in picante
-Current Availability (for R only): for-loops; the sample function; randomizeSample
-Deliverable: a set of functions that will quickly randomize data matrices and edits to picante functions to use these quicker randomizations
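The three randomizations named above can be sketched in plain R as follows. This is a minimal illustration on a small presence/absence matrix, not the picante implementation and not optimized for speed. The function names randomize_rows, randomize_cols and trial_swap are hypothetical, and the last one is a simple checkerboard-swap loop of the kind that would be a candidate for a C/C++ extension.

```r
# Shuffle values within each row: every row total is preserved
randomize_rows <- function(m) {
  out <- t(apply(m, 1, sample))  # apply() returns rows as columns, so transpose
  dimnames(out) <- dimnames(m)
  out
}

# Shuffle values within each column: every column total is preserved
randomize_cols <- function(m) {
  out <- apply(m, 2, sample)
  dimnames(out) <- dimnames(m)
  out
}

# Checkerboard swaps on a 0/1 matrix: row AND column totals are preserved
trial_swap <- function(m, n_trials = 1000) {
  for (i in seq_len(n_trials)) {
    rows <- sample(nrow(m), 2)
    cols <- sample(ncol(m), 2)
    sub <- m[rows, cols]
    # A 2x2 checkerboard looks like (1,0 / 0,1) or (0,1 / 1,0)
    is_checkerboard <- sub[1, 1] == sub[2, 2] &&
      sub[1, 2] == sub[2, 1] &&
      sub[1, 1] != sub[1, 2]
    if (is_checkerboard) {
      m[rows, cols] <- sub[2:1, ]  # flip the two rows of the submatrix
    }
  }
  m
}

set.seed(42)
# Toy data: 5 communities (rows) by 6 species (columns)
m <- matrix(rbinom(30, size = 1, prob = 0.5), nrow = 5,
            dimnames = list(paste0("site", 1:5), paste0("sp", 1:6)))

swapped <- trial_swap(m, n_trials = 500)
print(rbind(original = rowSums(m), swapped = rowSums(swapped)))
```

The first two functions rely only on sample and apply. The swap loop makes many small trials in interpreted R, which is where most of the running time would go on a large matrix.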
1. Clarke K. & Warwick R. (1999) The taxonomic distinctness measure of biodiversity: weighting of step lengths between hierarchical levels. Marine Ecology-Progress Series, 184, 21-29
2. Clarke K.R. & Warwick R.M. (1998) A taxonomic distinctness index and its statistical properties. Journal of Applied Ecology, 35, 523-531
3. Faith D.P. (1992) Conservation evaluation and phylogenetic diversity. Biological Conservation, 61, 1-10
4. Hardy O.J. & Senterre B. (2007) Characterizing the phylogenetic structure of communities by an additive partitioning of phylogenetic diversity. Journal of Ecology, 95, 493-506
5. Helmus M.R., Bland T.J., Williams C.K. & Ives A.R. (2007a) Phylogenetic measures of biodiversity. American Naturalist, 169, E68-E83
6. Helmus M.R., Savage K., Diebel M.W., Maxted J.T. & Ives A.R. (2007b) Separating the determinants of phylogenetic community structure. Ecology Letters, 10, 917-925
7. Ives A.R. & Godfray H.C. (2006) Phylogenetic analysis of trophic associations. The American Naturalist, 168, E1-E14
8. Webb C.O., Ackerly D.D., McPeek M.A. & Donoghue M.J. (2002) Phylogenies and community ecology. Annual Review of Ecology Evolution and Systematics, 33, 475-505
see Project Goals for details
Community Bonding - Met Kembel and talked about project - Installed subversion - Learned about picante
May 26-Jun 8: Com1A (2 weeks) -Code diversity metrics
May 26 coded PSV, PSR, PSE, and PSR Area
June 2 -Coded Phylogenetic species clustering -Species Phylogenetic species variability (PSV) -Did not have time to code Phylogenetic Diversity (PD) -Wrote documentation for all the metrics -Could not figure out how to view documentation in R
Jun 9-18: Com1B (1.5 weeks) -Coded regressions
June 9 -Coded PD (Thanks Jonathan). -Figured out with Steve's help how to view documentation in R. -Checked if glm binomial is biased and found logistic regression package with Firth correction. -Edited current code, cleaned up documentation
June 16 -Merged the gsoc branch to the main branch and released picante 0.2 -Coded the Ecology Letters paper by first connecting PSV to the two simple null models... they run really slow... -Coded linear/logistic regression function... this took longer than I anticipated
June 23 -Wrote documentation for the regression and community structure functions -Talked to Tony about abundances and binomial data in the bipartite analysis. It can be done -Coded a backbone linear model for the bipartite analyses. -Got example data sets from Tony and Sawat -No hitches, but it took three solid days to write the backbone function and check it for errors with the example data sets.
June 30 -Edited the bipartite code -Coded a way to handle categorical variables, but not certain if it is correct. -Contacted Ben. -Added a bootstrapping routine for parameter CIs, but does not work yet
July 7 -Made the bootstrapping work -Wrote the documentation. -Completed midterm
Jul 14 -Fixed all bugs in my code so far -Steve released picante 0.3! -Looked at the randmat package and tried to get picante to work with the trialswap C code; it does not work yet. -Read the Writing R Extensions section about calling C code. -Read info on Cprogramming.com -Downloaded and started using Vim (C text editor)
Jul 21 -Skyped with Steve and went through trialswap code line by line. -With Steve's help, trialswap, richness, frequency and independent nulls are all coded in C. -Edited picante functions to handle the new faster C code.
Jul 28 I am now in the final phase of coding. All of my main goals are complete, so I just need to streamline, add a few more functions, and clean up. -Looked at using the pblm to predict species interactions and uploaded a working version. -Looked at adding a species-area function to compare observed to rarefied curves, but I will not add this functionality this summer.
Aug 4 -made the pblm predict function work -wrote the documentation
Aug 11 -make the merge of gsoc to the main branch work.
Aug 18-Sep 1: Final evaluation