Cyberinfrastructure Summer Traineeships 2009

2009 Cyberinfrastructure Summer Traineeships for Data Repository Interoperability

Below are the projects that were accepted, with brief summaries of their goals, the student and mentor(s) for each project, and the locations of each project's page, source code repository, and, if the student kept one, development blog. After the program concluded in August 2009, the students wrote short summaries of their projects.

Cyberinfrastructure Summer Traineeships

We are pleased to announce the availability of summer traineeships for students and postdocs interested in informatics as applied to biodiversity, earth and environmental data. Trainees will gain experience as part of a collaborative open-source software development team. They will be helping to build a Virtual Data Center (VDC) that includes major data repositories in biodiversity, earth and environment science, ecology and evolutionary biology. The VDC summer internship program is inspired by and closely modeled after the Google Summer of CodeTM program. This program is, however, neither part of the Google Summer of Code, nor endorsed by Google or any of its affiliates, and some of the rules differ.

Background

The National Science Foundation is supporting a consortium of data and metadata repositories for biodiversity, earth and environmental science, ecological and evolutionary biology data. The consortium aims to develop the necessary cyberinfrastructure tools and technologies to support the implementation of a Virtual Data Center for these fields. The VDC will be based on a network of existing and new physical data centers ("nodes") that interoperate using open standards and protocols. The network will enable discovery of as well as open, stable, and secure access to data in any of its member nodes.

The consortium is a collaboration among several institutions, including:

Program details

The VDC summer internship program is modeled on the Google Summer of CodeTM program, but differs in several ways, including the rules for eligibility, the application process, and travel requirements.

Broadly speaking, as in the Google program:

  • To participate, you apply with a project proposal and a CV.
  • Your proposal will be scored and ranked by the mentors.
  • If accepted (by being among the top 4 applications), you will be paired with one of our mentors.
  • You need not (and likely will not) be at the same location or institution as your mentor. You are expected to communicate regularly, but online.
  • Projects are about contributing to collaboratively developed open-source software, applying or implementing open standards, and creating open-source code. Primarily, however, they are about learning how to become effective in these things.
  • The length (12 weeks) and timing (May 23 - Aug 17, 2009) of the coding period are aligned with the timetable for the Google program, as are the stipend ($4,500) and payment schedule.
  • Your project proposal may be based on one of the project ideas listed below, or on an idea of your own. We will score your proposal based on your qualifications, what you stand to gain professionally from the internship, evidence that your interest in cyberinfrastructure for interoperability in our fields is likely to be sustained, and, most importantly, the feasibility and thoroughness of your proposal and project plan (see below).

It differs from the Google program in the following ways:

  • The application period was Mar 30-Apr 22, 2009 (shifted one week from Google's application period, and a week and a half longer).
  • In addition to undergraduate and graduate students, current (as of April 20, 2009) postdocs are also eligible to apply. Because we operate under US law, nationals of countries with which the US has trade restrictions are ineligible, as in the Google program.
  • You apply directly to the VDC, not to the Google program.
  • You receive payment from the University of New Mexico.
  • Near (shortly before or after) the beginning of the coding period you will attend a meeting of the mentors and several other technical representatives from the collaborating data repositories (called the Technical Working Group - TWG). The meeting will result in strategic decisions on technologies and infrastructure milestones needed to achieve interoperability, and will provide context for and further inform your project.
  • Near the end or after conclusion of the coding period, you will attend a meeting of consortium members (called the Community Engagement Working Group) who focus on engaging non-member repositories as well as the larger community of scientists in the fields served by the member repositories. You will report on your project so that the working group members can determine the best ways to disseminate your results to the relevant community.

All travel expenses will be reimbursed, and no travel beyond those two meetings will be required.

NESCent and the Phyloinformatics Summer of Code

NESCent has been a mentoring organization in the Google Summer of CodeTM since 2007, and has been accepted to continue to participate in the summer of 2009. You can find the GSoC project ideas on the Phyloinformatics Summer of Code 2009 page, and we encourage you to visit that page, too, to get an overview of the kind of projects we support. (Note, however, that the student application period for GSoC ended April 3, and acceptance decisions were published on April 20.)

News

  • 26 May 2009: The 12-week coding period has started. You can follow the development discussions on the vdc-dev mailing list. --Hlapp
  • 2 May 2009: The four accepted student proposals have been posted online. Congratulations to the successful students! --Hlapp
  • 22 Apr 2009: Student application deadline closed. We received 14 applications. Thank you all for your interest in our program! --Hlapp
  • 10 Apr 2009: Extended student application deadline by one week to April 22, 2009. --Hlapp
  • 07 Apr 2009: Sent student application deadline reminder. --Hlapp
  • 30 Mar 2009: Student application period opens. --Hlapp
  • 29 Mar 2009: Project ideas complete and program announcements broadly disseminated. --Hlapp
  • 23 Mar 2009: Configured and set up Melange as our online application system. --Hlapp
  • 28 Feb 2009: The project ideas page for 2009 (the page you are looking at) is ready for adding project ideas. --Hlapp

Contact

Our organization administrators are Hilmar Lapp (hlapp@nescent.org) and Todd J. Vision (tjv@bio.unc.edu).

You can send any questions that are specifically for the mentors to vdc-twg@ecoinformatics.org. This will reach all mentors, program administrators, and members of the Technical Working Group (TWG).

Aside from the TWG list (to which only TWG members are subscribed, even though anyone can post), there are two other lists that may be of interest:

  • The vdc-community mailing list is low traffic, receives primarily announcements, and lets you stay in the loop with activities and developments of the VDC project, of which this program is a part (the subscription page has a link to the list archives, which are public).
  • The vdc-dev mailing list is for VDC technologists, developers, and Summer Traineeship interns to discuss technology, software development, and programming project-related questions (the list archives are public, too).

We will also semi-regularly hang out on IRC at least on weekdays during working hours (EDT) in #vdc on http://irc.ecoinformatics.org. You're welcome to join us at any time, though be prepared that outside of those times we may not be online on IRC. Email will always work. (If you do not have an IRC client installed, you might find the comparison on Wikipedia, the Google directory, or the IRC Reviews helpful. For Macs, X-Chat Aqua works pretty well. If you have never used IRC, try the IRC Primer at IRC Help, which also has links to lots of other material.)

Accepted projects

Generating Accurate Ranking Algorithms via Machine Learning

This is an application for the Vocabulary Term Mapping project idea.

Abstract: This project aims to generate effective ranking algorithms using random combinations of functions from the Natural Language Toolkit (NLTK). A genetic algorithms approach will allow the testing and assessment of large numbers of combinations.

Student: Christine Dumoulin

Mentor(s): Bruce Wilson (primary), Dave Vieglais, Giri Palanisamy

Project Homepage: Accurate Ranking Algorithms via Machine Learning

Project Blog: http://cdvdc09.blogspot.com/

Source Code: https://code.ecoinformatics.org/code/vdc/projects/machlearn/trunk/
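
For illustration only, the following toy sketch conveys the genetic-algorithm idea behind this project: individuals are weight vectors that combine simple term-similarity scorers (stand-ins for NLTK-derived functions), and the fitness function rewards weights that rank a known-correct candidate term first. All scorers, training pairs, and parameters here are invented for this sketch and are not part of the actual project code.

  # Toy sketch of the genetic-algorithm idea: evolve weights that combine
  # simple similarity scorers (stand-ins for NLTK-derived functions).
  import random

  def jaccard(a, b):
      sa, sb = set(a.lower().split()), set(b.lower().split())
      return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

  def char_trigrams(a, b, n=3):
      ga = {a.lower()[i:i + n] for i in range(max(len(a) - n + 1, 0))}
      gb = {b.lower()[i:i + n] for i in range(max(len(b) - n + 1, 0))}
      return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

  SCORERS = [jaccard, char_trigrams]

  # (query term, candidate terms, index of the known-correct candidate) -- invented
  TRAINING = [
      ("species name", ["scientificName", "decimalLatitude", "eventDate"], 0),
      ("date of collection", ["eventDate", "recordedBy", "locality"], 0),
  ]

  def combined_score(weights, query, candidate):
      return sum(w * f(query, candidate) for w, f in zip(weights, SCORERS))

  def fitness(weights):
      hits = 0
      for query, candidates, correct in TRAINING:
          ranked = sorted(range(len(candidates)),
                          key=lambda i: combined_score(weights, query, candidates[i]),
                          reverse=True)
          hits += int(ranked[0] == correct)
      return hits / len(TRAINING)

  def mutate(weights):
      return [max(0.0, w + random.gauss(0, 0.1)) for w in weights]

  def evolve(generations=50, pop_size=20):
      population = [[random.random() for _ in SCORERS] for _ in range(pop_size)]
      for _ in range(generations):
          population.sort(key=fitness, reverse=True)
          parents = population[:pop_size // 2]
          population = parents + [mutate(random.choice(parents)) for _ in parents]
      return max(population, key=fitness)

  if __name__ == "__main__":
      best = evolve()
      print("best weights:", best, "fitness:", fitness(best))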

Vocabulary Term Mapping

This is an application for the Vocabulary Term Mapping project idea.

Abstract: This project involves implementing a tool that could provide considerable assistance when attempting to map semantically similar terms between metadata, and thus semantically similar or equivalent content in data sets. The tool will return a ranked list of matching candidate terms from input of two sets of terms and their descriptions.

Student: Namrata Lele

Mentor(s): Dave Vieglais (primary), Bruce Wilson

Project Homepage: Vocabulary Term Mapping

Source Code: Vocabulary Term Mapping SVN

Refactoring the EarthGrid SOAP API to REST style and implement for Metacat Project

This is an application for the Refactor the EarthGrid SOAP API to REST style and implement for Metacat project idea.

Abstract: EarthGrid (EcoGrid) is a lightweight API that provides SOAP-based communication between several types of client software and data server applications. This project involves refactoring the current SOAP-based EarthGrid API to REST style, which has certain benefits over SOAP. Within the project scope, this REST API will then be implemented for the Metacat data management system. Moreover, it will serve as a prototype for client software using the EarthGrid API, such as Morpho and Kepler.

Student: H.Serhan Akin

Mentor(s): Matt Jones (primary), Mark Servilla

Project Homepage: Refactoring the EarthGrid SOAP API

Source Code: https://code.ecoinformatics.org/code/vdc/projects/RestStyleEarthGrid

Semantic phyloinformatic web services using the EvoInfo stack

This is an application for the Semantic phyloinformatic web services using the EvoInfo stack project idea, which originally was part of the ideas for the Phyloinformatics Summer of Code 2009 program.

Abstract: This project provides a semantically annotated Web service (WS) demonstrating the power of WSs and the Semantic Web in the evolutionary comparative analysis domain. SAWSDL addresses this challenge by allowing relevant comprehensive data and its structure (NeXML) to map to concepts in an ontology (CDAO). The mappings, ontology references, and NeXML declarations in the SAWSDL thus must all cooperate to realize a fully functional WS that may be used seamlessly by an external client program.

Student: John Harney

Mentor(s): Hilmar Lapp (primary), Rutger Vos

Project Homepage: Semantic phyloinformatic web services using the EvoInfo stack

Project Blog: http://harneyvdc.blogspot.com/

Source Code: Semantic EvoInfoWS SVN trunk

Ideas

Note: if there is more than one mentor for a project, the primary mentor is in bold font. Biographical and other information on the mentors is linked to in the Mentors section.

To prospective applicants: The below are only our project ideas, albeit well thought-out ones. You are welcome to propose your own project if none of those below catches your interest, or if your idea is more exciting to you, provided it still falls within our scope (see 'Before you apply' below). Regardless of what you decide to do, make sure you read and follow the guidelines for applicants below.

Write an EML data table validation application

Rationale 
The Ecological Metadata Language (EML) supports the documentation of tabular data through a well structured XML schema, in addition to specifying the access URI to such tables. Unfortunately, there is often a mismatch between the table documentation and the physical table residing at the end of the URI (e.g., the number of columns described in the EML does not match the number of columns present in the physical table). Since EML is the preferred metadata language used by the ecological community, it is of utmost importance to the VDC project that the corresponding data tables described by EML are accurately documented. To this end, there exists the "Data Manager Library" (a Java library that complements EML) that may be used to develop an application for validating data tables described by the Ecological Metadata Language. It would benefit the VDC project if an EML data table validation application were developed using the existing Data Manager Library.
Approach 
Data table descriptions in EML can be quite complex, especially when the physical format of the table is taken into consideration. Since the Data Manager Library was written to address these complexities, it would be best to utilize the library for accessing and validating documented data tables. A sound approach would be to use the Data Manager Library as part of the business logic in a web-based Servlet application that would allow one or more EML documents that describe tabular data to be uploaded and validated for accuracy and correctness.
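
As a conceptual illustration only (the actual project would use the Java-based Data Manager Library rather than the code below), the core congruence check could look roughly like this: compare the attribute (column) descriptions declared in an EML dataTable against the header of the delimited file found at its distribution URL. EML namespace handling and physical-format details are omitted here.

  # Conceptual sketch of the core validation check, not the Data Manager Library.
  # Compares the number of attribute descriptions in each EML dataTable against
  # the header of the physical CSV table referenced by the EML document.
  import csv
  import io
  import urllib.request
  import xml.etree.ElementTree as ET

  def validate_data_tables(eml_path):
      """Return a list of mismatch descriptions for the dataTables in an EML file."""
      tree = ET.parse(eml_path)
      problems = []
      for table in tree.iter("dataTable"):
          declared = table.findall(".//attributeList/attribute")
          url_elem = table.find(".//distribution/online/url")
          if url_elem is None or not url_elem.text:
              problems.append("dataTable has no online distribution URL")
              continue
          with urllib.request.urlopen(url_elem.text) as resp:
              text = resp.read().decode("utf-8", errors="replace")
          header = next(csv.reader(io.StringIO(text)))
          if len(header) != len(declared):
              problems.append(
                  f"{url_elem.text}: EML declares {len(declared)} columns, "
                  f"physical table has {len(header)}"
              )
      return problems

  # Hypothetical usage:
  # for problem in validate_data_tables("dataset.eml.xml"):
  #     print("MISMATCH:", problem)
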
Challenges 
The Data Manager Library has not been fully tested in a production environment and/or with an extensive collection of EML.
Involved toolkits or projects 
The Ecological Metadata Language schema and the Data Manager Library source code may be found in the ecoinformatics.org CVS repository at http://cvs.ecoinformatics.org/cvs/cvsweb.cgi/eml/.
Degree of difficulty and needed skills 
Moderate. The most difficult component of this project is understanding the EML schema and how it relates to the operation of the Data Manager Library. Knowledge of the Java programming language and Servlet/JSP programming concepts is necessary.
Mentors 
Mark Servilla and Matt Jones

Support for LSIDs for data objects in Dryad

Rationale 
Dryad is a digital repository for data supporting published works in evolutionary biology, and will be a member "node" of the network comprising the VDC. Dryad assigns a persistent, globally unique, and resolvable identifier (GUID) to each data item, using the Handle system (which is natively supported by DSpace, the platform on which Dryad is built). The Handle system, however, is just one widely used standard for GUIDs, and is not uniformly supported across other prospective VDC member nodes. Other pertinent systems are DOIs, which are widely used in the publishing field, and LSIDs, which are increasingly being adopted for ecology and biodiversity data collections (see for example activities at TDWG). A VDC identifier system standard will hence need to consist of a mechanism to federate various GUID schemes, and network member nodes may need to be able to map transparently between several identifier schemes. For example, to map from a handle to a DOI (note that DOIs are themselves based on the Handle system), one could simply re-use the handle suffix and concatenate with the DOI-specific prefix; alternatively, to avoid any interpretation of the structure of the handle, one could use the entire handle as the suffix for constructing corresponding DOIs.
Here we propose to develop the code necessary to map Dryad (i.e., DSpace) handles to LSIDs. LSIDs have the advantage that their assignment is entirely controlled by the data provider and incurs no 3rd-party costs (such as from a registry), making them potentially very suitable for large collections of data items. In addition, the identifier scheme has an explicit notion of versioning, and the resolution protocol explicitly separates metadata from data. On the other hand, LSIDs use a resolution protocol separate from the web (just as DOIs and Handles do), requiring the use of a proxy (the LSID resolver) to integrate them into the semantic web.
Approach 
The code to be developed could either be a part of the DSpace code base, or could be developed separately. It would need to respond to LSID resolution queries, translate these into handles, resolve the handles to the data items, extract the metadata for the data items and convert it to RDF (the LSID spec strongly recommends returning metadata as RDF), and extract the digital data object to return as data. The mapping could, for example, be based on using the suffix of the handle as the objectID in the LSID spec. There is an open-source LSID resolver available that could serve as the basis, and there has been a Google Summer of Code 2008 project for DSpace on returning item metadata in RDF, which could form the basis for dynamically composing the RDF.
Depending on the skills of the applicant, the scope of this project can be extended to include support for PURLs that readily make datasets in Dryad Linked Data-compliant.
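
As a purely illustrative sketch, and assuming a mapping rule that simply re-uses the handle suffix as the LSID objectID (the authority, namespace, and handle prefix below are placeholders, not decisions made for Dryad), the bidirectional handle/LSID mapping could look like this:

  # Illustrative handle <-> LSID mapping; the authority, namespace, handle prefix,
  # and the rule of re-using the handle suffix as objectID are assumptions.
  AUTHORITY = "datadryad.org"      # hypothetical LSID authority
  NAMESPACE = "dryad"              # hypothetical LSID namespace
  HANDLE_PREFIX = "10255"          # hypothetical handle prefix for the repository

  def handle_to_lsid(handle, revision=None):
      """Map a handle like '10255/dryad.123' to an LSID URN."""
      prefix, suffix = handle.split("/", 1)
      if prefix != HANDLE_PREFIX:
          raise ValueError(f"unexpected handle prefix: {prefix}")
      lsid = f"urn:lsid:{AUTHORITY}:{NAMESPACE}:{suffix}"
      return f"{lsid}:{revision}" if revision is not None else lsid

  def lsid_to_handle(lsid):
      """Map an LSID URN back to the corresponding handle."""
      parts = lsid.split(":")
      # urn : lsid : authority : namespace : objectID [ : revision ]
      if parts[:2] != ["urn", "lsid"] or parts[2] != AUTHORITY or parts[3] != NAMESPACE:
          raise ValueError(f"not an LSID issued by this mapping: {lsid}")
      return f"{HANDLE_PREFIX}/{parts[4]}"

  # Example:
  #   handle_to_lsid("10255/dryad.123") -> "urn:lsid:datadryad.org:dryad:dryad.123"
  #   lsid_to_handle("urn:lsid:datadryad.org:dryad:dryad.123") -> "10255/dryad.123"
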
Challenges 
There are likely issues in the details, such as given a handle constructed from an LSID, how to obtain the metadata and how to obtain the data (bitstream in DSpace speak) without hard-coding assumptions or screen scraping. If the code is to be integrated into the DSpace code base rather than being a stand-alone component, this would need to be coordinated with the DSpace developers.
Involved toolkits or projects 
Degree of difficulty and needed skills 
Moderate to medium, depending on the coding skills (Java would be ideal, but as a stand-alone tool Perl is an option too) and the level of expertise in semantic web technologies, specifically RDF.
Mentors 
Hilmar Lapp, Ryan Scherle

Support for PURLs for data objects in Dryad

Rationale 
Dryad is a digital repository for data supporting published works in evolutionary biology, and will be a member "node" of the network comprising the VDC. Dryad assigns a persistent, globally unique, and resolvable identifier (GUID) to each data item, using the Handle system (see the LSIDs in Dryad project for more information on VDC and identifier systems). However, Handles (and therefore also DOIs), the LSID scheme increasingly adopted for ecology and biodiversity data collections, and any other identifier schemes that are not based on HTTP URIs need to be converted to HTTP URIs using proxy prefixes (such as http://hdl.handle.net/, http://dx.doi.org/, or http://lsid.tdwg.org/) to integrate into the semantic web, because any semantic web client would otherwise need to include special-purpose software that resolves them. The GUID standard of the semantic web is HTTP URIs, which should ideally be resolvable to an RDF document about the object they identify.
This project idea consists of creating a software component that exposes digital objects stored in a DSpace repository to the semantic web through de-referenceable PURLs that are Linked Data-compliant. In the case of Dryad, individual data objects (datasets) are typically part of a set that is associated with a publication; the RDF document returned should be able to reflect those relationships.
Approach 
The code to be developed could be a part of the DSpace code base, or could simply be developed separately. It would need to expose the digital objects stored in DSpace through a scheme for constructing and parsing the PURLs, for example following the template http://purl.<repositoryDomain>/item/<handleSuffix>, which would be http://purl.datadryad.org/item/<handleSuffix> for Dryad. Dereferencing such a PURL would then be translated into the corresponding handle; depending on content negotiation and/or inclusion of the requested format in the request (such as http://purl.datadryad.org/html/item/<handleSuffix> or http://purl.datadryad.org/rdf/item/<handleSuffix>), the handle would be resolved either to an HTML page for the data item(s) or to the metadata for the data items, extracted and converted to RDF. The relationship to other data items or the publication item could be described using OAI-ORE.
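
A minimal sketch of the PURL construction and dereferencing logic described above; the repository domain, handle prefix, and format routing below are assumptions for illustration, not an agreed design:

  # Sketch of PURL construction and dereferencing for DSpace items, following the
  # template http://purl.<repositoryDomain>/item/<handleSuffix> from the idea text.
  PURL_DOMAIN = "purl.datadryad.org"
  HANDLE_PREFIX = "10255"            # hypothetical handle prefix

  def purl_for_handle(handle):
      """Construct the item PURL for a handle like '10255/dryad.123'."""
      prefix, suffix = handle.split("/", 1)
      assert prefix == HANDLE_PREFIX, "handle not managed by this repository"
      return f"http://{PURL_DOMAIN}/item/{suffix}"

  def dereference(path, accept_header="text/html"):
      """Decide how to answer a request for a PURL path such as
      '/item/dryad.123', '/html/item/dryad.123', or '/rdf/item/dryad.123'."""
      parts = [p for p in path.split("/") if p]
      if parts[0] in ("html", "rdf"):          # format given in the URL
          fmt, suffix = parts[0], parts[-1]
      else:                                    # fall back to content negotiation
          fmt = "rdf" if "rdf" in accept_header else "html"
          suffix = parts[-1]
      handle = f"{HANDLE_PREFIX}/{suffix}"
      if fmt == "html":
          # redirect to the repository's splash page for the item
          return ("303 See Other", f"http://hdl.handle.net/{handle}")
      # otherwise: look up the item metadata via the handle and return it as RDF
      return ("200 OK", f"<RDF describing handle {handle}, e.g. via OAI-ORE>")

  # Example:
  #   purl_for_handle("10255/dryad.123") -> "http://purl.datadryad.org/item/dryad.123"
  #   dereference("/rdf/item/dryad.123") -> ("200 OK", "<RDF describing handle 10255/dryad.123, ...>")
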
Challenges 
There are likely issues in the details, such as given a handle constructed from a PURL, how to obtain the metadata and how to obtain the data (bitstream in DSpace speak) without hard-coding assumptions or screen scraping. If the code is to be integrated into the DSpace code base rather than being a stand-alone component, this would need to be coordinated with the DSpace developers.
Involved toolkits or projects 
Degree of difficulty and needed skills 
Moderate to medium, depending on the coding skills (Java would be ideal) and the level of expertise in semantic web technologies, specifically RDF.
Mentors 
Hilmar Lapp, Ryan Scherle

Support for OAI-ORE in the DSpace SWORD implementation

Rationale 
DSpace is the software underlying the Dryad digital data repository, which will be a member "node" of the network comprising the VDC. One of the tasks that will need to become federated between the VDC member nodes is dataset deposition (submission). SWORD (Simple Web-service Offering Repository Deposit), a profile of the Atom Publishing Protocol (APP), is a recent (and the first) standard for repository deposition, and there is an implementation for DSpace.
The Dryad submission interface is implemented as a component external to DSpace, because the publication and dataset model used by Dryad cannot easily be accommodated by the built-in submission system. Ideally, Dryad's submission component would use the SWORD interface to Dryad, acting simultaneously as a proof-of-concept application, but the current SWORD implementation for DSpace lacks support for connecting a collection of datasets with the publication that they are associated with. Instead, items are deposited one at a time.
This project idea would add to the SWORD component for DSpace the ability to accept multiple items, such as a number of data objects and a publication, and then establish the necessary links between them. With this in place, the Dryad submission component can be changed to use the SWORD interface for deposition, serving as a proof-of-concept application for other repositories or the VDC.
Approach 
The relationship between data items and the publication item would probably best be described using OAI-ORE. Therefore, the first step would be to add support for ingesting OAI-ORE packages to the DSpace SWORD implementation such that the relationships would be stored in DSpace. Subsequently the Dryad stand-alone submission system would be modified to actually use the SWORD API.
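
For illustration, and without presuming how the DSpace SWORD code would store such links internally, an OAI-ORE resource map connecting a publication item to its data items can be expressed as a small RDF graph. The sketch below uses the rdflib Python library and invented item URIs:

  # Sketch of an OAI-ORE resource map linking a publication item to its data items.
  # The item URIs are invented; persistence within DSpace/SWORD is out of scope.
  from rdflib import Graph, Namespace, URIRef
  from rdflib.namespace import DCTERMS, RDF

  ORE = Namespace("http://www.openarchives.org/ore/terms/")

  def resource_map(publication_uri, data_uris):
      g = Graph()
      g.bind("ore", ORE)
      g.bind("dcterms", DCTERMS)
      rem = URIRef(publication_uri + "#rem")           # the resource map itself
      agg = URIRef(publication_uri + "#aggregation")   # the aggregation it describes
      g.add((rem, RDF.type, ORE.ResourceMap))
      g.add((rem, ORE.describes, agg))
      g.add((agg, RDF.type, ORE.Aggregation))
      g.add((agg, ORE.aggregates, URIRef(publication_uri)))
      for uri in data_uris:
          g.add((agg, ORE.aggregates, URIRef(uri)))
          # hypothetical link from a data item back to the publication it supports
          g.add((URIRef(uri), DCTERMS.isPartOf, URIRef(publication_uri)))
      return g

  if __name__ == "__main__":
      g = resource_map(
          "http://example.org/item/pub-1",
          ["http://example.org/item/data-1", "http://example.org/item/data-2"],
      )
      print(g.serialize(format="turtle"))
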
Challenges 
This will require understanding (or learning to understand) the DSpace API to store the relationships properly, as well as understanding the code base of the SWORD implementation, and finally the Dryad submission system. The OAI-ORE specification is relatively new, and will need to be understood as well to the extent necessary for this work (see David Tarrant et al, Code4Lib Journal (2009) for some background reading).
Involved toolkits or projects 
Degree of difficulty and needed skills 
Moderate to medium, depending on the coding skills (Java) and the level of expertise in or familiarity with OAI standards and OAI-ORE in particular, and experience with the DSpace API. A lot of the necessary pieces of code already exist, but need to be added to and modified.
Mentors 
Hilmar Lapp, Ryan Scherle

Add support for LSID (DOI/Handle/ARK) to Metacat

Rationale
Life Science Identifiers (LSID), Digital Object Identifiers (DOIs), Handles, and Archival Resource Keys (ARK) are used to uniquely identify an object on the web. LSIDs are Uniform Resource Names (URN) with a specific format which includes a naming authority, a namespace, an object ID and a version number. Metacat and Morpho use a proprietary identifier scheme, which should be changed to use one or more of these identifier formats, preferably accepting all four.
Approach
Metacat creates IDs of the form <namespace>.<objectid>.<revision>. This format can easily be translated into an LSID, and is handled currently by Metacat. The various client APIs that Metacat uses need to be upgraded to support use of LSID/DOI/Handle identifiers in addition to the current Metacat identifier. For the most part this will require an identifier-neutral mapping layer that can store the identifier and extract the relevant identity and revision information from the identifier.
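
A minimal sketch of the identifier translation described above, parsing a Metacat docid of the form <namespace>.<objectid>.<revision> and emitting the other identifier forms; the authority names and prefixes are placeholders, not values used by any real deployment:

  # Sketch of translating a Metacat docid (<namespace>.<objectid>.<revision>)
  # into LSID / DOI / Handle / ARK forms. All prefixes below are placeholders.
  AUTHORITY = "knb.ecoinformatics.org"   # hypothetical LSID authority
  DOI_PREFIX = "10.9999"                 # placeholder DOI prefix
  HANDLE_PREFIX = "9999"                 # placeholder handle prefix
  ARK_NAAN = "99999"                     # placeholder ARK name assigning authority

  def parse_docid(docid):
      """Split 'knb-lter-xyz.123.4' into (namespace, objectid, revision)."""
      namespace, objectid, revision = docid.rsplit(".", 2)
      return namespace, objectid, int(revision)

  def as_lsid(docid):
      ns, obj, rev = parse_docid(docid)
      return f"urn:lsid:{AUTHORITY}:{ns}:{obj}:{rev}"

  def as_doi(docid):
      return f"doi:{DOI_PREFIX}/{docid}"

  def as_handle(docid):
      return f"{HANDLE_PREFIX}/{docid}"

  def as_ark(docid):
      return f"ark:/{ARK_NAAN}/{docid}"

  # Example:
  #   as_lsid("knb-lter-xyz.123.4") -> "urn:lsid:knb.ecoinformatics.org:knb-lter-xyz:123:4"
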
Challenges
The identifier currently in Metacat is used throughout the code base as it is a fundamental key to the data system. Refactoring to support multiple identifier schemes will require substantial code changes along with a new identifier model.
Involved toolkits or projects
Metacat, Java, J2EE
Degree of difficulty and needed skills
Moderate: The programming tasks are straightforward but fairly extensive and woven throughout the codebase, so there is a large possibility for unforeseen side effects.
Mentors
Matt Jones

Refactor the EarthGrid SOAP API to REST style and implement for Metacat

Rationale
Past cyberinfrastructure efforts such as SEEK developed the EarthGrid programmatic interface as a common and lightweight API for common data management operations against data management systems such as Metacat, DiGIR, and SRB. This common SOAP API allows different software clients like Kepler to communicate with all of these EarthGrid-enabled data servers using a common API. This project entails refactoring the EarthGrid SOAP API to use a REST architecture (more REST info). The benefits of REST over SOAP are extensive for decoupling the implementation of a system from its interface. Once this refactoring has occurred, this REST interface would be implemented for a single data management server (Metacat) that already uses the EarthGrid API, which should make it a simple matter of mapping existing Metacat operations onto the new API. The VDC is developing a simple REST-style API for managing interactions between data management clients and data management servers, and this EarthGrid REST API will be used as the basis for our first prototypes. Ultimately, we plan to convert client applications such as Morpho and Kepler to use the VDC REST API.
Approach
The Metacat system already supports both a servlet-based API and the EarthGrid SOAP API for communications. This project would involve 1) converting the current EarthGrid service API as specified in SOAP into a set of REST resource types, with their accompanying behaviors associated with the HTTP verbs, and then 2) implementing these resources and their state transfers in Metacat by invoking existing Metacat operations on the resources.
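
As one possible illustration of step 1 (and only that; the operation names are generic stand-ins rather than the actual EarthGrid operation names), the mapping from SOAP-style operations to REST resources and HTTP verbs could be captured in a simple routing table:

  # Illustrative routing table mapping generic repository operations (stand-ins
  # for the actual EarthGrid SOAP operations) onto REST resources and HTTP verbs.
  REST_ROUTES = {
      # operation          HTTP verb  resource template
      "query":            ("GET",    "/object?query={queryDocument}"),
      "read":             ("GET",    "/object/{identifier}"),
      "insert":           ("POST",   "/object"),
      "update":           ("PUT",    "/object/{identifier}"),
      "delete":           ("DELETE", "/object/{identifier}"),
      "registerService":  ("POST",   "/service"),
  }

  def describe(operation, **params):
      """Show how a legacy operation call would be expressed as a REST request."""
      verb, template = REST_ROUTES[operation]
      return verb, template.format(**params)

  # Example:
  #   describe("read", identifier="knb-lter-xyz.123.4")
  #     -> ("GET", "/object/knb-lter-xyz.123.4")
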
Challenges
The EarthGrid REST API will correspond at a high level to existing Metacat APIs, so there should be reasonably good mappings to existing functions, but there are bound to be issues in the granularity of functions.
Involved toolkits or projects
Java, J2EE, EarthGrid, Metacat, REST, SOAP
Degree of difficulty and needed skills
Moderate: will need to learn multiple code bases and learn the underlying technologies behind SOAP and REST services.
Mentors
Matt Jones

GUID Test and Monitor Service

Rationale
Globally Unique Identifiers (GUIDs) are a critical component of open data sharing systems to avoid collision between data objects. There are several systems for implementing resolvable GUIDs (RGUIDs), which enable a user to retrieve metadata about the object, and/or the object itself, over the Internet. Some widely used mechanisms for RGUIDs are plain old URLs (URI, http://www.w3.org/Addressing/), persistent URLs (PURL, http://purl.oclc.org/), Digital Object Identifiers (DOI, http://www.doi.org/), and Life Sciences Identifiers (LSID, http://en.wikipedia.org/wiki/LSID). All of these schemes rely upon at least one service endpoint that will take an RGUID and return something useful about it, and so all schemes are subject to failure if there is some problem with access to or function of the service(s). Likewise, systems such as a distributed data archive framework relying upon these RGUID services would also be subject to these quality-of-service issues. As such, an important addition to the universe of RGUID implementations would be a mechanism for testing (e.g. Does the RGUID resolve? Is the service accessible? Is the service responding in a timely manner?) and monitoring (alerting administrators in the event of failure).
Approach
  1. Define the testing and monitoring system requirements, architecture, internal data model, and programmatic and user interfaces.
  2. Identify requirements for a statistically appropriate measure of accessibility for a potentially very large number of individual RGUIDs for the various schemes.
  3. Implement or modify existing code to produce RGUID clients for the common schemes, with each client presenting a common API for integration with the testing and monitoring framework (a minimal sketch of such a common client API follows this list).
  4. Implement a simple, extensible framework for executing tests of all protocols at arbitrarily scheduled times.
  5. Implement the monitoring service, which is able to generate continuous reporting statistics about the availability of an arbitrary number of RGUID service endpoints.
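
The following sketch illustrates what the common client API from step 3 might look like: each identifier scheme gets a small resolver class with the same resolve() method, and a check simply times the resolution attempt. The proxy URLs are the well-known public ones; everything else (class names, and reducing LSID resolution to an HTTP proxy lookup) is a simplification for illustration.

  # Sketch of a common client API for resolvable GUID (RGUID) schemes and a
  # trivial availability check, for illustration only.
  import time
  import urllib.request

  class HttpResolver:
      """Base class: resolve an identifier by dereferencing an HTTP URL."""
      def url_for(self, identifier):
          raise NotImplementedError

      def resolve(self, identifier, timeout=10):
          """Return (ok, elapsed_seconds) for a resolution attempt."""
          start = time.monotonic()
          try:
              with urllib.request.urlopen(self.url_for(identifier), timeout=timeout) as r:
                  ok = 200 <= r.status < 400
          except Exception:
              ok = False
          return ok, time.monotonic() - start

  class UrlResolver(HttpResolver):
      def url_for(self, identifier):
          return identifier                      # plain old URL

  class PurlResolver(HttpResolver):
      def url_for(self, identifier):
          return identifier                      # PURLs are themselves HTTP URLs

  class DoiResolver(HttpResolver):
      def url_for(self, identifier):
          doi = identifier[4:] if identifier.startswith("doi:") else identifier
          return "http://dx.doi.org/" + doi

  class HandleResolver(HttpResolver):
      def url_for(self, identifier):
          return "http://hdl.handle.net/" + identifier

  class LsidResolver(HttpResolver):
      def url_for(self, identifier):
          return "http://lsid.tdwg.org/" + identifier   # simplification: proxy only

  def check_all(targets):
      """targets: iterable of (resolver, identifier); yields monitoring records."""
      for resolver, identifier in targets:
          ok, elapsed = resolver.resolve(identifier)
          yield {"identifier": identifier, "ok": ok, "seconds": round(elapsed, 3)}
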
Challenges
The different RGUID systems have conceptually similar operations, though they differ in their underlying implementation. Providing a common client API and implementation will present some difficulties. It is expected that greater challenges will be faced in the efficient implementation and operation of the monitoring service, as a very large number of RGUID service endpoints and individual identifiers may need to be checked on a regular schedule.
Involved Toolkits or Projects
Mentors
Dave Vieglais (vieglais at ku edu)

Catalog of Metadata

Rationale
There exist many different standards for the structure and content of metadata documents, and this already large number is growing. Furthermore, since many of these metadata documents are not exactly interesting reading, there is a strong tendency for researchers and implementors to go out on their own and create a new standard rather than spend the time and effort necessary to discover, let alone digest, these typically cumbersome, dense documents. This of course simply exacerbates the situation, as these fringe projects become successful and their corresponding metadata standards add to the already large pile of documents. By providing a simple catalog of existing metadata standards and a mechanism for searching them for keywords and/or phrases, it is hoped that the proliferation of metadata standards (at least for our target community) will be reduced, because a) a list of standards will be readily available in the catalog, and b) simple search, browse, and keywording/tagging mechanisms will assist interested parties in discovering standards of direct or related interest.
Approach
Several components need to be adapted or implemented for a successful outcome to be reached, though perhaps the most important is the identification of crucial terms to use when indexing the metadata documents (i.e. metadata about the metadata). Something as simple as Dublin Core (http://dublincore.org/) is expected to provide considerable assistance for discovering relevant documents within the system. The major components to be configured / adapted / implemented are listed below; a toy sketch of the indexing and search core follows the list:
  1. An information retrieval indexing service such as the Apache Lucene based SOLR system providing fuzzy search capabilities
  2. A set of tools that can parse various formats of metadata documents and generate index entries for the indexing service
  3. A web interface for searching and browsing the catalog, and a management interface for editing references to metadata documents.
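
In practice item 1 would be a SOLR/Lucene instance, but a toy pure-Python sketch conveys the core idea of indexing Dublin Core-style records about each metadata standard and searching them by keyword; the records below are invented examples.

  # Toy catalog sketch: index Dublin Core-style records describing metadata
  # standards and search them by keyword. A real system would use SOLR/Lucene.
  from collections import defaultdict

  class MetadataCatalog:
      def __init__(self):
          self.records = {}                    # id -> record dict
          self.index = defaultdict(set)        # keyword -> set of record ids

      def add(self, record_id, record):
          """record: dict with Dublin Core-ish fields (title, subject, description...)."""
          self.records[record_id] = record
          for value in record.values():
              for word in str(value).lower().split():
                  self.index[word.strip(".,()")].add(record_id)

      def search(self, query):
          """Return records matching all query keywords (crudest possible ranking)."""
          terms = [t.lower() for t in query.split()]
          if not terms:
              return []
          ids = set.intersection(*(self.index.get(t, set()) for t in terms))
          return [self.records[i] for i in ids]

  catalog = MetadataCatalog()
  catalog.add("eml", {
      "title": "Ecological Metadata Language (EML)",
      "subject": "ecology environmental data tables",
      "description": "XML metadata specification for ecological data sets",
  })
  catalog.add("fgdc", {
      "title": "FGDC CSDGM",
      "subject": "geospatial metadata",
      "description": "Content Standard for Digital Geospatial Metadata",
  })
  print([r["title"] for r in catalog.search("ecological metadata")])
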
Challenges
Apart from the usual issues encountered when combining several libraries and applications developed for different purposes, it is expected that some interesting intellectual challenges will arise as well. One immediate issue is that there is no guarantee a consistent set of terms can be identified for the meta-metadata index, especially when considering the complexity of some scientific metadata standards such as EML (http://cvs.ecoinformatics.org/cvs/cvsweb.cgi/eml/) and FGDC (http://www.fgdc.gov/metadata/csdgm/). Building an efficient content extraction framework that operates over several different types of content (e.g. HTML, plain text, MS Word documents) with minimal reconfiguration requires both skill with low-level document parsing techniques and higher-level knowledge extraction tools, the latter of which are likely to require some reconfiguration to operate effectively with scientific rather than general literature.
Involved Toolkits or Projects
Mentor
Dave Vieglais (vieglais at ku edu)

Vocabulary Term Mapping

Rationale
One common and significant hurdle encountered while working with (merging, extracting, indexing) multiple data sets is the determination and mapping of semantically equivalent terms. Metadata descriptions of data sets can provide crucial information for determining data element semantics, though in many cases determining equivalence (or otherwise) of terms described therein is a challenge exacerbated by the sheer number of terms that may be described (e.g. even a simple standard such as the Darwin Core has over a hundred different terms described). When mapping terms between two different documents there are N1 x N2 comparisons to be made, where N1 and N2 are the numbers of terms defined in documents 1 and 2, respectively. This large number of comparisons can be overwhelming, and it would be very helpful to have a tool that, given two sets of terms and their descriptions, was able to return a ranked list of matching candidate terms (i.e. for each term in document 1, find all potentially related terms in document 2 and order them by relevance). A simple tool such as this could provide considerable assistance when attempting to map semantically similar terms between metadata, and thus semantically similar or equivalent content in data sets.
Approach
Assuming one can obtain, through some device, a list of terms and their human-readable descriptions from metadata documents, the goal of this project is to take two such lists and, for each term + description in list 1, find all the related terms in list 2, rank them by relevance, and provide a summary of the rules that determined the matches and relevance. Algorithms for determining similarity can range from simple word + thesaurus counting tools through more sophisticated algorithms available in the various natural language processing toolkits.
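
A baseline sketch of the term-matching core: score every term in list 2 against each term + description in list 1 using a simple token-overlap measure (a stand-in for more sophisticated NLP- or thesaurus-based scoring), and return a ranked candidate list. The example vocabularies below are invented.

  # Baseline sketch of vocabulary term mapping: for each term in list 1, rank all
  # terms in list 2 by a simple token-overlap similarity of term + description.
  STOPWORDS = {"the", "a", "an", "of", "on", "in", "which", "was", "and", "who", "when"}

  def tokens(term, description):
      words = (term + " " + description).lower().replace("_", " ").split()
      return {w for w in words if w not in STOPWORDS}

  def similarity(entry_a, entry_b):
      ta, tb = tokens(*entry_a), tokens(*entry_b)
      return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

  def rank_candidates(list1, list2, top_n=3):
      """list1, list2: lists of (term, description). Returns a dict mapping each
      term in list1 to its ranked candidate terms from list2 with scores."""
      result = {}
      for entry in list1:
          scored = sorted(((similarity(entry, cand), cand[0]) for cand in list2),
                          reverse=True)
          result[entry[0]] = [(term, round(score, 2)) for score, term in scored[:top_n]]
      return result

  # Invented example vocabularies:
  list1 = [
      ("scientific_name", "Full scientific name of the organism"),
      ("collection_date", "Date on which the specimen was collected"),
  ]
  list2 = [
      ("scientificName", "Full scientific name including genus and species"),
      ("eventDate", "Date when the collecting event occurred"),
      ("recordedBy", "Person who recorded the occurrence"),
  ]
  for term, candidates in rank_candidates(list1, list2).items():
      print(term, "->", candidates)
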
Challenges
Automated information extraction is an ongoing area of research, although it is expected that considerable progress could be made on this project with the use of existing open-source tools. One significant challenge may be related to the availability of relevant vocabularies and thesauri for the topics of interest in a programmatically useful format such as SKOS (http://www.w3.org/2004/02/skos/). The relevance calculation algorithm is obviously core to the efficient functioning of this system, and will likely need considerable attention to achieve satisfactory results when operating on sophisticated technical descriptions from the various sciences for which metadata documents are available.
Involved Toolkits or Projects
Mentor
Dave Vieglais (vieglais at ku edu)

Mentors

What should prospective applicants know?

Important dates

  • Students apply online, between March 30-April 22, 2009 (in the time zone of your choice). Note: The application period has closed for 2009, and accepted students have been announced.
  • Accepted applicants announced May 3, 2009.
  • The coding period starts May 23 and ends Aug 17, 2009 (at 19:00 UTC)
  • Midterm evaluation takes place between July 6-13, 2009.
  • End-of-term evaluation takes place Aug 17-24, 2009.

Eligibility

To be eligible, you must be a student (whether full or part-time) or a postdoc.

  • We define student as an individual enrolled in or accepted into an accredited institution including (but not necessarily limited to) colleges, universities, masters programs, PhD programs and undergraduate programs. We may require you to supply documentation from your institution (such as transcripts) as proof of enrollment or admission status. There are no requirements for school or field of study in order to participate.
  • You must be eligible to work in the country in which you reside. For students in the U.S. on an F-1 visa, you are welcome to apply as long as you have U.S. work authorization. For F-1 students who have to apply for CPT, the University of New Mexico will furnish you with a letter you can provide to your university to get CPT established once your application has been accepted.
  • The student or postdoc requirement is met if fulfilled on April 20, 2009, even if your enrollment or postdoc ends before the end of the program.
  • Residents and nationals of countries that US law prohibits us (and other US-based entities) from engaging in trade with (currently, these are Iran, Syria, Cuba, Sudan, North Korea and Myanmar (Burma)) are ineligible.

Before you apply

  • Pick the idea that appeals most to you in terms of goals, context, and required skills, or you can apply with your own project idea.
  • If you want to apply with your own idea, contact us early on to get feedback on whether your idea is within the scope we support. If you don't, and your idea is outside of our scope, it may simply be declined without further review. Projects supported by this program are expected to contribute to interoperability among data centers in the fields of biodiversity, earth and environmental sciences, ecology and evolutionary biology.
  • Ask us questions about the project idea you have in mind.
  • Write a project proposal draft, include a project plan (see below), and bounce those off of us.

Have I mentioned yet that you should be in touch with us before you apply?

When you apply

To apply, please provide a current CV and the following in your application material.

  1. Your interests, what makes you excited.
  2. Why you are interested in the project, why you are uniquely suited to undertake it, and what you anticipate gaining from it.
  3. A summary of your programming experience and skills
  4. Programs or projects you have previously authored or contributed to, in particular those available as open-source.
  5. A project plan for the project you are proposing, even if your proposed project is directly based on one of the ideas above.
    • A project plan in principle divides up the whole project into a series of manageable milestones and timelines that, when all accomplished, logically lead to the end goal(s) of the project. Put another way, a project plan explains what you expect you will need to be doing, and what you expect you need to have accomplished, at which time, so that at the end you reach the goals of the project.
    • Do not take this part lightly. A compelling plan takes a significant amount of work. Applications with no or a hastily composed project plan will likely not be competitive.
    • A good plan will require you to thoroughly think about the project itself and how one might want to go about the work.
    • We don't expect you to have all the experience, background, and knowledge to come up with the final, real work plan on your own at the time you apply. We do expect your plan to demonstrate, however, that you have made the effort and thoroughly dissected the goals into tasks and successive accomplishments that make sense.
    • We strongly recommend that you bounce your proposed project and your project plan draft off our mentors by emailing vdc-twg@ecoinformatics.org (see below). You will inevitably discover through this exercise that you are missing a lot of the pieces - we are there to help you fill those in as best as we can.
  6. Your possibly conflicting obligations or plans for the summer during the coding period.
    • Although there are no hard and fast rules about how much you can do in parallel to your Summer of Code project, we do expect the project to be your primary focus of attention over the summer. If you look at your Summer of Code project as a part-time occupation, please don't apply for our organization.
    • That notwithstanding, if you have the time-management skills to manage other work obligations concurrent with your Summer of Code project, feel encouraged to make your case and support it with evidence.
    • Most important of all, be upfront. If it turns out later that you weren't clear about other obligations, at best (i.e., if your accomplishment record at that point is spotless) it destroys our trust. Also, if you are accepted, don't take on additional obligations before discussing those with your mentor.
    • One of the most common reasons for students to struggle or fail is being overstretched. Don't set yourself up for that - at best it would greatly diminish the amount of fun you'll have with your project.
  7. Please also note if you expect that attending the meetings at the beginning and at the end of the coding period may present a problem for you.

You submit your application online (login, create your profile, register as a student, and then create and submit your application).

Get in touch with us

Please send any questions you have and ideas and work plans for projects you would like to propose to vdc-twg@ecoinformatics.org. This will reach all mentors and our administrators.

  • We strongly recommend you do this even if you want to work directly on one of our project ideas above. It gives you an opportunity to get feedback on what our expectations might be, and you might want to ask for more specifics.
  • The value of frequent and early communication in contributing to a distributed and collaboratively developed project can hardly be overemphasized. The same is true for becoming part of a community, even if only temporarily. If you can't find the time to engage in communication with your prospective mentors before you have been accepted, how will you argue that you will be a good communicator once accepted?

Reference Facts & Links

Data repositories, technologies, and standards involved

ORNL DAAC 
The Oak Ridge National Laboratory (ORNL) Distributed Active Archive Center (DAAC) is a NASA-funded repository for biogeochemical and field-research data. It is part of the NASA Earth Observing System Data and Information System (EOSDIS), which includes several other DAACs, each focusing on a particular science application area.
EML 
Ecological Metadata Language (EML) is a metadata specification that is in widespread use in the ecological and environmental sciences throughout the world. It is an open-source, grass-roots effort at finding a common metadata specification that can evolve with community needs. Although EML was developed by the ecology community, EML can be used to document a wide variety of scientific data spanning multiple disciplines.
Dryad 
Dryad is a digital repository for data supporting published works in evolutionary biology. It is based on the DSpace software, and is being developed by NESCent in collaboration with the Metadata Research Center (MRC) at the UNC School for Information and Library Science (SILS).
Links: source code, project wiki, metadata application profile
Dublin Core 
GBIF 
LSID 
Metacat 
Metacat is an open source data repository system for heterogeneous data and metadata. It can store, version, and search multiple metadata standards and arbitrary data schemas, and can be accessed through both web and desktop clients such as Morpho. It was developed collaboratively at the National Center for Ecological Analysis and Synthesis (NCEAS) and the Long Term Ecological Research Network (LTER), and is currently the basis for the multi-institutional Knowledge Network for Biocomplexity (KNB) data sharing network.
TDWG 

Disclaimers

This program, and any of its features or terms, is not part of the Google Summer of CodeTM program, and is not endorsed by Google or any Google employee. Any links from this page to Google or to the Google Summer of CodeTM program and auxiliary pages are solely due to this program having been inspired by and having similar terms to Google's, and should not be construed to indicate an official relationship or endorsement.

If you have any questions about this program, do not direct them to Google, or to any of the official Google Summer of Code program channels such as mailing lists, Google Groups, or the #gsoc IRC channel. Use only the specific channels mentioned under Contact.

The VDC Summer Internship Program is funded by the U.S. National Science Foundation (NSF). Any opinions, findings, conclusions, or recommendations expressed at this site are those of the authors and do not necessarily reflect the views of the National Science Foundation.