Polyphasic Taxonomy/Microbial Informatics 1999 Research Accomplishments

1. Ribosome Database Project (RDP-II)

Over the past reporting period, RDP-II work concentrated on two areas: internal infrastructure enhancements and user updates. In the previous report, we identified upgraded curator tools as an important requirement for bringing the RDP up to date. We have implemented several GUI-based curator editing tools as well as a first-generation system for automatically harvesting rRNA records from GenBank. This system takes GenBank's native ASN.1 format as input. The first pass extracts rRNA sequence regions and outputs the data in an intermediate XML format. Several filter programs then process the records further, correcting common GenBank data problems, removing very small partial sequences, and sorting by type. A new loader program loads the resulting XML-formatted records. This filter-based approach makes it easy to respond to changes in the data structure supplied by GenBank, a not uncommon occurrence. The XML documents produced conform to a new custom XML DTD. We intend to use this format as a common intermediate for data from other sources and tools. We chose XML because of its wide acceptance and the availability of XML-based tools.
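
The filter stage described above can be pictured as a chain of small, independent steps applied to the intermediate XML records. The following is a minimal sketch only, not the RDP-II code: the element names (`records`, `record`, `sequence`), the `type` attribute, and the length cutoff are all assumptions made for illustration, since the actual DTD is not reproduced in this report.

```python
import xml.etree.ElementTree as ET

# Hypothetical cutoff for "very small partial sequences"; the real value is not given.
MIN_LENGTH = 10


def clean_sequence(records):
    """Normalize sequence text: remove all whitespace and uppercase the bases."""
    for rec in records:
        seq = rec.find("sequence")
        seq.text = "".join((seq.text or "").split()).upper()
        yield rec


def drop_short(records, min_length=MIN_LENGTH):
    """Discard records whose sequence is shorter than min_length."""
    for rec in records:
        if len(rec.find("sequence").text) >= min_length:
            yield rec


def sort_by_type(records):
    """Order records by their (assumed) 'type' attribute."""
    return sorted(records, key=lambda rec: rec.get("type", ""))


def run_pipeline(xml_text):
    """Apply each filter in turn and return a new <records> element."""
    records = ET.fromstring(xml_text).findall("record")
    for stage in (clean_sequence, drop_short, sort_by_type):
        records = stage(records)
    out = ET.Element("records")
    out.extend(records)
    return out
```

Because each stage only consumes and produces records, a change in the upstream data would, under these assumptions, be handled by editing or inserting a single stage rather than rewriting the loader.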

We also used XML to develop a general framework for linking tools to the RDP data, and we are using this framework in-house with several Java GUI curation tools. One such tool links the database to the ae2 multiple sequence alignment editor. The user is presented with a hierarchical, phylogenetic view of the available sequences and can choose any subset for editing. A second GUI tool allows the user to edit the RDP's taxonomic hierarchy graphically. Rearrangements of and additions to the hierarchy, as well as changes to selected sequence attributes, can be made using a 'point and click' approach. These tools communicate with an intermediate server that supplies both the hierarchical view and the sequence data in XML. These (and other planned) curator tools require long transactions, potentially lasting hours. To ensure data consistency, the intermediate XML server implements a strict on-demand check-out paradigm for data from the DBMS. We chose the strict (pessimistic) check-out mode of operation because of the difficulty of merging multiple versions of some of our data types. With our relatively small curatorial staff, this has worked well. Eventually, we expect to move some end-user (read-only) tools to the same XML intermediate server framework.
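
The strict check-out rule can be illustrated with a small in-memory registry: an item may be held by only one curator at a time, and a second request is refused rather than merged later. This is a sketch under stated assumptions, not the server's implementation; the class and method names are hypothetical, and a real server would persist the check-out state in the DBMS so that it survives hours-long sessions.

```python
import threading


class CheckoutError(Exception):
    """Raised when a check-out or check-in request violates the rule."""


class CheckoutRegistry:
    """Tracks which curator currently holds each item (hypothetical API)."""

    def __init__(self):
        self._holders = {}  # item id -> curator name
        self._lock = threading.Lock()  # guards _holders against concurrent requests

    def check_out(self, item_id, curator):
        """Grant the item to curator unless someone else already holds it."""
        with self._lock:
            holder = self._holders.get(item_id)
            if holder is not None and holder != curator:
                raise CheckoutError(f"{item_id} is checked out by {holder}")
            self._holders[item_id] = curator

    def check_in(self, item_id, curator):
        """Release the item; only the current holder may do so."""
        with self._lock:
            if self._holders.get(item_id) != curator:
                raise CheckoutError(f"{item_id} is not checked out by {curator}")
            del self._holders[item_id]

    def holder(self, item_id):
        """Return the current holder, or None if the item is available."""
        with self._lock:
            return self._holders.get(item_id)
```

The trade-off is that a curator may have to wait for an item, which is acceptable with a small staff and avoids merging divergent versions of an alignment or hierarchy.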

In addition to the new curatorial tools, we made a number of modifications to our website. Many of these are not directly visible to the end user and involve an internal rearrangement of the website structure. Changes that directly enhance the site for users include a new, easier-to-use hierarchical tool for sequence selection and enhancements to the interfaces of the existing programs. The sequence selection tool produces plain HTML, an important consideration because of the variety of browsers used by our users. The tool nevertheless appears dynamic, because custom pages are produced on demand to fulfill opening, closing, and selection requests. This tool can be used to select sequences either for downloading or for inclusion in other analysis functions.
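
One way a plain-HTML page can appear dynamic is to carry the set of open nodes in each link, so that every click requests a freshly generated page. The sketch below shows that idea only; the sample hierarchy, the `select` URL, and the `open` query parameter are assumptions for illustration, and sequence selection is omitted for brevity.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical hierarchy: node name -> list of child names.
TREE = {
    "Bacteria": ["Proteobacteria", "Firmicutes"],
    "Proteobacteria": ["Escherichia coli"],
    "Firmicutes": ["Bacillus subtilis"],
}


def render(node, open_nodes, base_url="select"):
    """Return a static HTML list item for node; open_nodes is a frozenset of names."""
    children = TREE.get(node, [])
    label = escape(node)
    if not children:
        return f"<li>{label}</li>"
    # The link points at the page with this node's open/closed state flipped.
    toggled = open_nodes ^ {node}
    query = urlencode({"open": ",".join(sorted(toggled))})
    marker = "-" if node in open_nodes else "+"
    html = f'<li><a href="{base_url}?{query}">[{marker}]</a> {label}'
    if node in open_nodes:
        items = "".join(render(child, open_nodes, base_url) for child in children)
        html += f"<ul>{items}</ul>"
    return html + "</li>"
```

Calling `render("Bacteria", frozenset({"Bacteria"}))` yields a page fragment with the top level expanded; no client-side scripting is involved, so the output should work in any browser that handles links and nested lists.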

From the user's perspective, the most important enhancement was our new 7.1 data release on September 17, 1999. For this release, we concentrated on bringing our sequences for bacterial type material up to date. These sequences provide an important framework linking phylogeny and taxonomy. As such, they are the most critical sequences for enhancing the RDP's usefulness for end users. In addition to capturing over 1100 additional sequences, we identified other sequences in our alignments as coming from type material and updated the nomenclature of all bacterial sequences to reflect recent name changes. This new release contains 3324 bacterial and archaeal type sequences representing 2460 separate species.

In early 1999, the RDP solicited user comments via a user survey. Most users felt that RDP-II should devote more resources to prokaryotic small subunit sequences, and that RDP-II could improve by releasing data in a more timely manner.

