Phyloinformatics Summer of Code 2011/Summaries
Student software developers showcase their work
For the fifth summer in a row, NESCent offered a number of internships aimed at introducing students to open-source software development. This summer, eight interns from the 2011 Google Summer of Code™ program worked remotely on an evoinformatics project of their own choosing, each under the guidance of an experienced mentor. NESCent’s 2011 Summer of Code students included Jan Engelhardt, Alexandru Lucian Ginsca, Sarah Hird, Peter Hoffman, Daniel Packer, Andrei-Alin Popescu, Apurv Verma, and Laurel Yohe. Their projects ranged from manipulating next-generation sequencing data for population genetic analysis, to enabling a frequently-used alignment viewer to analyze non-coding RNAs, to generating human readable text that could be integrated into Encyclopedia of Life from ontologies containing phenotype data. As their profiles below demonstrate, the students put their summers to very good use!
Student: Jan Engelhardt
Mentor(s): Jim Procter, Peter Troshin
Project: Extending Jalview’s support for handling RNA
Jalview is a frequently used alignment viewer. I implemented new features for the analysis of non-coding RNAs. The RNA structure viewer VARNA can now be used from within Jalview to analyze individual or consensus secondary structures. There is also a new structure consensus annotation row. I furthermore worked on fixing bugs in the existing code.
Student: Sarah Hird
Mentor(s): Jeremy Brown, Brad Chapman
Project: Manipulating NGS data for population genetic analysis
This project was designed to make summary information about large multi-locus, sequence-based data sets visible, interpretable and easier to reformat. With advances in sequencing capacity (through novel technologies collectively called next-generation sequencing, or NGS), biologists now have orders of magnitude more data than ever before. lociNGS displays important information about a multi-locus data set at both the data-set level and the individual level. Statistics such as coverage, number of variable sites, length and the number of individuals called for each locus are displayed on screen, allowing researchers to judge the quality and usability of individual loci. The program also outputs homologous raw data in a widely used format (FASTA) and will reformat a subset of the data three ways.
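As a rough illustration of the per-locus statistics mentioned above, here is a minimal Python sketch. It is not lociNGS code: the function names `locus_summary` and `to_fasta`, the sample individuals and the rule of ignoring gaps and ambiguous bases are all assumptions made for the example.

```python
def locus_summary(alignment):
    """Summarise one aligned locus given as {individual_name: sequence}."""
    if not alignment:
        raise ValueError("alignment is empty")
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")

    variable = 0
    for column in zip(*seqs):
        # Assumption: gaps and ambiguous calls do not count as variation.
        bases = {b.upper() for b in column} & set("ACGT")
        if len(bases) > 1:
            variable += 1

    return {
        "individuals": len(seqs),
        "length": length,
        "variable_sites": variable,
    }


def to_fasta(alignment, width=60):
    """Write the locus as FASTA text, wrapping sequence lines at `width`."""
    lines = []
    for name, seq in alignment.items():
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Hypothetical toy locus with three individuals.
locus = {
    "ind1": "ACGTACGT",
    "ind2": "ACGTACGA",
    "ind3": "ACCTAC-T",
}
summary = locus_summary(locus)
fasta_text = to_fasta(locus)
```

Coverage is left out of the sketch because it depends on the underlying read data rather than on the aligned sequences alone.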
Student: Peter Hoffman
Mentor(s): Kevin Savidge, Jeremy Brown
Project: DIM SUM 2: GPU computing for an individual-based simulator of movement and demography
DIM SUM is a population demography and individual migration simulation. The objective of my Google Summer of Code project was to improve the speed of DIM SUM significantly by using the graphics card to allow simulations of scenarios with much larger populations and larger landscapes.
Student: Alexandru Lucian Ginsca
Mentor(s): Jim Balhoff, Chris Mungall, Matt Yoder, Cyndy Parr
Project: Export ontology-based phenotype descriptions to the Encyclopedia of Life
The main goal of this project was to generate human readable text that could be integrated into Encyclopedia of Life from ontologies containing phenotype data. One of the tasks was to determine which data could be useful and to find a proper way to display the text in EOL. As an extra feature, the sentences are automatically grouped in paragraphs based on the anatomical system they describe. The end result of the project is a tool that provides phenotype descriptions for a large number of taxa in a form that follows the specifications of the EOL harvesting system so that this text can be integrated in an EOL page with no human supervision.
Student: Daniel Packer
Mentor(s): Jamie Estill, Jim Leebens-Mack, Todd Vision, Bill Piel
Project: Interoperable exchange of gene tree reconciliation maps
Gene and species trees, when mapped together, tell the story of a gene's possible evolutionary path. Software that generates and visualizes these maps, while extraordinarily useful for analysis, is limited by data formats that aren't portable between systems. Our goal was to improve data interoperability between GTR (gene tree reconciliation) mapping software using open source implementations of modern standards. Based on conversations with experts in phylogenetic data standards, our team defined a way in which XML could be used along with standard open source software libraries to store and transmit GTR maps.
Student: Andrei-Alin Popescu
Mentor(s): Emmanuel Paradis, Katarina Huber
Project: Extending APE to handle incomplete distances
Phylogenetics is a field of bioinformatics concerned with reconstructing evolutionary, or phylogenetic, trees for a given set of species (taxa) X. There are many methods for doing this, as well as a multitude of input data types describing the species in X from which an evolutionary tree could be constructed. In the current project, we focus on three such inputs: distances, splits (partitions) and quartets. Distances on the set X of species are represented by an n × n matrix (|X| = n), where the entry at row i, column j represents a dissimilarity between species i and j. We have implemented methods to reconstruct phylogenetic trees from both complete and incomplete distance matrices. A split is a separation of the set X into two groups. Methods for reconstructing phylogenetic trees from sets of splits have been implemented as part of Google Summer of Code. Quartets are sets of four taxa from X, and methods for reconstructing phylogenetic trees from quartets have been implemented. All of the above implementations have been included in the APE R package, with some implementations written in C and then interfaced with R to enhance performance.
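The three kinds of input described above can be pictured with a small Python sketch. This only illustrates the data structures, not the APE implementation (which is in R and C) and not a tree reconstruction method. The taxon labels, the distance values and the helper names `build_matrix`, `missing_pairs` and `make_split` are invented for the example.

```python
from itertools import combinations

taxa = ["A", "B", "C", "D", "E"]  # the set X, so n = 5

# Hypothetical dissimilarities; pairs not listed here are "missing",
# which is what makes the distance matrix incomplete.
known = {
    ("A", "B"): 0.3, ("A", "C"): 0.5, ("A", "D"): 0.6,
    ("B", "C"): 0.4, ("B", "E"): 0.7,
    ("C", "D"): 0.2, ("D", "E"): 0.5,
}


def build_matrix(taxa, known):
    """Return a symmetric n x n matrix with None for unknown entries."""
    n = len(taxa)
    index = {t: i for i, t in enumerate(taxa)}
    matrix = [[None] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 0.0  # a taxon is at distance 0 from itself
    for (a, b), d in known.items():
        i, j = index[a], index[b]
        matrix[i][j] = matrix[j][i] = d
    return matrix


def missing_pairs(taxa, matrix):
    """List the unordered pairs of taxa whose distance is unknown."""
    return [
        (taxa[i], taxa[j])
        for i, j in combinations(range(len(taxa)), 2)
        if matrix[i][j] is None
    ]


def make_split(taxa, side):
    """Represent a split of X as an unordered pair of non-empty groups."""
    side = frozenset(side)
    other = frozenset(taxa) - side
    if not side or not other or not side <= frozenset(taxa):
        raise ValueError("a split needs two non-empty groups drawn from X")
    return frozenset({side, other})


matrix = build_matrix(taxa, known)
gaps = missing_pairs(taxa, matrix)
split = make_split(taxa, {"A", "B"})
quartets = list(combinations(taxa, 4))  # every set of four taxa from X
```

Storing a split as an unordered pair means that `make_split(taxa, {"A", "B"})` and `make_split(taxa, {"C", "D", "E"})` compare equal, which matches the idea that a split has no preferred side.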
Student: Apurv Verma
Mentor(s): Kathryn Iverson, David Kidd
Project: PhyloGeoRef: A Java Library for displaying phylogenetic trees in kml
KML (Keyhole Markup Language) is the file format used to render geospatial data in Google Earth. The phyloGeoRef library takes as input a phylogenetic tree, together with metadata associated with the operational taxonomic units in the tree, and creates a KML file that can be viewed in Google Earth. The library incorporates various techniques to improve visualization, including colorization and pop-up information, to create a rich user experience. To get started see https://github.com/kdiverson/phyloGeoRef.
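To give a sense of what such a KML file contains, the following Python sketch writes one placemark per taxon using only the standard library. It is not the phyloGeoRef API (phyloGeoRef is a Java library) and it does not draw tree branches. The function name `taxa_to_kml`, the taxon names, the coordinates and the notes are made up for illustration.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def taxa_to_kml(taxa):
    """Build a KML document with one Placemark per (name, lat, lon, note)."""
    # Serialize KML elements in the default namespace (no prefix).
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for name, lat, lon, note in taxa:
        placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
        # The description is what Google Earth shows in the pop-up balloon.
        ET.SubElement(placemark, f"{{{KML_NS}}}description").text = note
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML orders coordinates as longitude,latitude,altitude.
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat},0"
    return ET.tostring(kml, encoding="unicode")


# Hypothetical tip taxa with made-up sampling locations.
sample_taxa = [
    ("taxon_A", 35.99, -78.9, "Tip A of the example tree"),
    ("taxon_B", 51.5, -0.12, "Tip B of the example tree"),
]
kml_text = taxa_to_kml(sample_taxa)
```

Saving `kml_text` to a file with a `.kml` extension should give something Google Earth can open. Colored branches and richer pop-ups would need additional KML elements such as `LineString` and `Style`.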
Student: Laurel Yohe
Mentor(s): Rutger Vos, William Piel
Project: Automated submission of rich data to TreeBASE
At present, TreeBASE serves as an archive for phylogenetic data. Data are currently submitted to TreeBASE as NEXUS files. However, this format results in a clunky user interface and does not allow metadata, such as the GenBank number, or additional annotations to be submitted automatically. My project focused on enabling TreeBASE to accept NeXML files so that new annotations of the metadata can be displayed in a user-friendly manner. My first task involved adding support for character sets (e.g. expressing which sequences belong to which genes) to the NeXML file, which is displayed in my figure. I also fixed an important bug that caused matrices to fail to display for several studies. Over the summer, I learned to set up a large code base locally on my computer, gained a profound appreciation of the power of the NeXML file and the importance of making NeXML more useful to the scientific community, and most importantly, learned how to actively communicate with my mentor and the TreeBASE community about ideas, issues, and innovations regarding the TreeBASE project.