Phyloinformatics Summer of Code 2009/Summaries
Student software developers showcase their work
For the third summer in a row, NESCent offered a number of student internships aimed at expanding participation in collaborative open-source software development projects. Interns from around the world were given the opportunity to work remotely on an evoinformatics project of their own choosing, each under the guidance of an experienced mentor.
This summer, NESCent received funding for nine students from the Google Summer of Code™ program, and for an additional four from the NSF Virtual Data Center project. As their profiles demonstrate, the students put their summers to very good use! To meet the students and learn more about their projects, read below.
The Google Summer of Code™ students are shown here, and the wrap-up summaries for the four students funded by the Virtual Data Center project are posted under that program.
Student: Daniel Ayres
Mentor(s): Aaron Darling, Marc Suchard
Project: GPU acceleration for phylogenetic inference using OpenCL
My project was to expand upon BEAGLE, an open-source computing library that accelerates phylogenetic tree inference by using the powerful Graphics Processing Units (GPUs) that are found within modern high-end desktop and laptop computers. My main goal was for BEAGLE to work across a wide range of GPUs by using OpenCL, a vendor-neutral standard. I also expanded the library so that it could be used with a broader set of evolutionary tools.
Student: Nick Matzke
Mentor(s): Stephen Smith, Brad Chapman, David Kidd
Project: Biogeographical Phylogenetics for BioPython
I developed Bio.Geography, a new module for the bioinformatics programming toolkit Biopython. Bio.Geography expands upon Biopython's traditional capabilities for accessing gene and protein sequences from online databases by allowing automated searching, downloading, and parsing of geographic location records from GBIF, the authoritative aggregator of specimen information from natural history collections worldwide. This will enable analyses of evolutionary biogeography that require the areas inhabited by the species at the tips of the phylogeny, particularly for large-scale analyses where it is necessary to process thousands of specimen occurrence records. The module will also facilitate applications such as species mapping, niche modeling, error-checking of museum records, and monitoring range changes.
Student: Eric Talevich
Mentor(s): Brad Chapman, Christian Zmasek
Project: Biopython support for parsing and writing phyloXML
The phyloXML data exchange format provides a consistent way to store and share information about richly annotated phylogenetic trees, including geographic, taxonomic and sequence-level data. However, researchers can only benefit from this if existing libraries and toolkits support this format. To support phyloXML in the bioinformatics programming toolkit Biopython, I created a pair of new modules, Tree and TreeIO, that offer a common interface for reading and writing phylogenetic trees in several file formats. These modules generalized the underlying tree model to provide a foundation for enhanced phylogenetics support in Biopython in the future.
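To give a feel for what a phyloXML reader has to handle, here is a minimal sketch that uses only Python's standard library, not the Tree and TreeIO modules described above. The sample document, the tree name, and the helper `to_newick` are made up for illustration; the only thing taken from the format itself is the `http://www.phyloxml.org` namespace and the nested `clade` elements.

```python
import xml.etree.ElementTree as ET

# phyloXML elements live in this XML namespace.
NS = {"phy": "http://www.phyloxml.org"}

# A tiny, hand-written phyloXML document (illustrative data only).
PHYLOXML = """<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <name>example tree</name>
    <clade>
      <clade branch_length="0.06">
        <clade branch_length="0.102"><name>A</name></clade>
        <clade branch_length="0.23"><name>B</name></clade>
      </clade>
      <clade branch_length="0.4"><name>C</name></clade>
    </clade>
  </phylogeny>
</phyloxml>
"""


def to_newick(clade):
    """Recursively convert a <clade> element into a Newick substring."""
    children = clade.findall("phy:clade", NS)
    name = clade.findtext("phy:name", default="", namespaces=NS)
    length = clade.get("branch_length")
    label = name + (":" + length if length is not None else "")
    if children:
        return "(" + ",".join(to_newick(c) for c in children) + ")" + label
    return label


root = ET.fromstring(PHYLOXML)
phylogeny = root.find("phy:phylogeny", NS)
tree_name = phylogeny.findtext("phy:name", default="", namespaces=NS)
newick = to_newick(phylogeny.find("phy:clade", NS)) + ";"

print(tree_name)
print(newick)
```

The sketch ignores the richer annotations (taxonomy, sequences, geography) that make phyloXML useful in the first place; supporting those is what requires a more general tree model.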
Student: Diana Jaunzeikare
Mentor(s): Christian Zmasek, Pjotr Prins, Naohisa Goto
Project: Implementing phyloXML support in BioRuby
The phyloXML data exchange format facilitates analysis, exchange, storage and reuse of phylogenetic trees and associated data. My goal was to implement reading and writing capabilities for phyloXML within the bioinformatics programming toolkit BioRuby. My code takes advantage of specialized XML libraries to boost the speed at which extremely large phylogenetic datasets can be processed.
Student: Chase Miller
Mentor(s): Mark Jensen, Rutger Vos
Project: BioPerl integration of the NeXML exchange standard and Bio::Phylo toolkit
NeXML is an XML-based data exchange standard for phylogenetic data that represents the next generation of the popular NEXUS standard. My project consisted of integrating the NeXML standard into the BioPerl programming toolkit. To accomplish this, I developed several modules that provided BioPerl native access to the reference NeXML parser in Perl (Bio::Phylo), while still allowing Bio::Phylo and NeXML to co-evolve independently of the much larger BioPerl project.
Student: Adam Smith
Mentor(s): Pjotr Prins, Chris Fields
Project: Mapping the Bio++ Phylogenetics toolkit to R/BioConductor and BioJAVA using BioLib
Bio++ is a programming toolkit written in C++ for sequence analysis, phylogenetics, molecular evolution and population genetics. My goal was to make the Bio++ functionality accessible from high-level scripting languages such as Perl, Python, and Ruby. I accomplished this with a tool called SWIG (Simplified Wrapper and Interface Generator), which generalized the task of programming interfaces to the different scripting languages. The end result is that biologists can spend more time concentrating on solving their biological analysis questions with the most powerful tools available, and less time worrying about interfacing with a toolkit written in an unfamiliar low-level programming language.
Student: Xin (David) Shuai
Mentor(s): Chris Fields, Mark A. Jensen, Pjotr Prins
Project: A BioLib mapping for the libsequence population genetic libraries
BioLib brings together a set of open-source libraries written in C/C++, and makes them available to the higher-level Bio* scripting language toolkits, such as BioPerl and Biopython. Libsequence is a C++ library for population genetic simulation and the evolutionary analysis of molecular data. The goal of my project was to add libsequence to BioLib and build mappings to Perl and Python using a standard framework, SWIG (http://www.swig.org/).
Student: Kasia Hayden
Mentor(s): Peter Midford, Jim Balhoff
Project: Build a Mesquite package to view Phenex-generated Nexml files
Phenex is a stand-alone tool that assists biological experts in transforming phenotypic characters and character states from phylogenetic data matrices into formal phenotype assertions typed by ontology terms. The goal of my project was to take the annotated character matrices produced by Phenex (exported in NeXML format) and make them viewable in Mesquite, one of the most widely used desktop tools for exploring, visualizing, and analyzing phylogenetic data. The annotations can be viewed both as text, in a footnote box, and in a graphical display. For the latter, I used Graphviz, a general-purpose package for drawing graphs in Java.
Student: Dazhi Jiao
Mentor(s): Ryan Scherle, Lucie Chan
Project: PhyloSoC: Enhance the searching functionality of Phylr
Phylr is an implementation of the search functionality described in the emerging PhyloWS “web service” standard for accessing online phylogenetic data resources. My project consisted of enhancing the Phylr service for searching content in a collection of XML files, using a fast text indexing system called Lucene, as well as creating new services for searching relational databases. Specifically, I focused on searching a BioSQL relational database with the PhyloDB extension, because this database is highly flexible in terms of supported metadata and is tightly integrated with the Bio* bioinformatics scripting toolkits. As an initial prototype, I demonstrated how Phylr could search TreeBASE data that has been stored in a BioSQL database.
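The idea behind a text index of the kind mentioned above can be sketched in a few lines of Python. This is only an illustration of an inverted index with AND-style term matching, not Lucene's or Phylr's actual implementation; the document identifiers, the descriptions, and the function names (`tokenize`, `build_index`, `search`) are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical tree descriptions, invented for illustration.
DOCUMENTS = {
    "tree1": "Primates phylogeny inferred from cytochrome b",
    "tree2": "Cytochrome b phylogeny of passerine birds",
    "tree3": "Angiosperm phylogeny from rbcL sequences",
}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


INDEX = build_index(DOCUMENTS)

print(search(INDEX, "cytochrome phylogeny"))
print(search(INDEX, "rbcL"))
```

Because the index is built once and each query only looks up the terms it contains, searching stays fast as the collection grows, which is the main reason to index a set of XML files instead of scanning them on every request.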