Last year eXist participated in the Google Summer of Code for the first time and after what we feel was a successful summer, we have decided to apply to participate again this year.
In 2007 we had two students, one working on an XQJ implementation and the other on Fulltext extensions for XQuery. The first version of the XQJ implementation is almost complete and waiting to be merged with the main code base. The Fulltext extensions for XQuery are waiting for our XQuery parser refactorings to be merged.
Suggested projects for 2008 are listed below, but students may also propose their own projects. We suggest discussing your proposal with us before submitting your application to ensure that it is suitable and viable within the Google Summer of Code 2008 time frame.
For all questions concerning the Summer of Code, contact our GSoC administrators at eXistAdmin@gmail.com, send an email to the exist-open mailing list or meet us in IRC. For short questions, IRC is the preferred medium.
No projects have been accepted yet.
XQuery programs can get quite complex (scripts with more than 1000 lines are not uncommon), especially if they use a lot of modules. However, debugging the code is currently a tedious, time-consuming job due to the lack of tool support. While some commercial XML editors do already include XQuery debuggers (e.g. Oxygen), eXist lacks an appropriate debugging API to interface with them.
A remote debugging API should be implemented on top of the eXist server. This should at least include the ability to stop XQuery execution at predefined breakpoints, inspect the current query context and switch into single-step execution. A basic command-line or graphical debugging interface should be shipped with eXist. The Oxygen team has already expressed interest in supporting eXist from their commercial XQuery debugger.
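The three minimum capabilities listed above (breakpoints, context inspection, single-step execution) can be illustrated with a small sketch. Everything here is hypothetical: `DebugSession`, `run` and the toy "program" of variable bindings are invented for illustration and do not reflect any existing eXist API or the protocol a real remote debugger would use.

```python
from dataclasses import dataclass, field


@dataclass
class DebugSession:
    """Hypothetical debug controller; not part of eXist's API."""
    breakpoints: set = field(default_factory=set)
    single_step: bool = False

    def should_pause(self, line: int) -> bool:
        # Pause at every line in single-step mode, otherwise only at breakpoints.
        return self.single_step or line in self.breakpoints


def run(program, session, on_pause):
    """Evaluate (line, variable, value) steps, pausing where the session asks.

    `on_pause` receives the line and a copy of the current context and
    returns "step" (single-step from here) or "continue".
    """
    context = {}
    for line, name, value in program:
        if session.should_pause(line):
            command = on_pause(line, dict(context))
            session.single_step = (command == "step")
        context[name] = value
    return context


# Toy program: each step binds one variable on one source line.
program = [(1, "a", 1), (2, "b", 2), (3, "c", 3), (4, "d", 4)]
session = DebugSession(breakpoints={2})
paused_at = []


def on_pause(line, context):
    # Record where we stopped and which variables were visible.
    paused_at.append((line, sorted(context)))
    # Step once after the breakpoint, then resume normal execution.
    return "step" if line == 2 else "continue"


final_context = run(program, session, on_pause)
```

In this sketch the debugger front end is just a callback; a remote API would replace `on_pause` with a network exchange, but the decision logic in `should_pause` would stay the same.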
Resources:
eXist aims to be compliant with the XQuery 1.0 specification. It would be valuable to implement its "sister" recommendation, XSLT 2.0, as well, thus allowing XSLT 2.0 processing on (potentially huge) persistent documents. Most of the code is already there, since both recommendations are built on top of XPath 2.0.
However, this is still to be implemented:
Clean separation of XPath 2.0 and XQuery 1.0 code. eXist used to have a dedicated package for XPath: it has to be revived, and the XQuery 1.0 specific classes have to be moved to a dedicated package. Functions, including the experimental grouping ones (whose performance has to be improved), have to be moved as well.
Write a dedicated XSLT 2.0 frontend to the existing XQuery 1.0 parser that would be used to build the expression tree.
Particular attention should be paid to performance. Recent code already gives the programmer more help with regard to performance, and implementing an XSLT 2.0 processor could bring further improvements in this area.
Resources:
XSL Transformations (XSLT) Version 2.0. W3C Recommendation.
XQ2XML: XML syntaxes for XQuery. A test suite by David Carlisle that provides an XSLT 2.0 syntax for some of the XQuery test suite tests.
Implement a federated search service over distributed eXist databases. There are various reasons why users may have more than one database instance deployed, for example, to distribute load or to keep sensitive data in its own data store. Another important area of application would be in the context of grid computing.
Unfortunately, there's no simple way to combine results from distributed data stores in a single XQuery. eXist's query engine can only operate on local resources. It can retrieve data from external locations, but only to parse them into a local DOM tree, which is then used for querying. A distributed search facility would allow eXist to directly forward parts of an expression to a remote database instance. The XQuery specification already provides the necessary framework: the collection() and doc() functions both accept arbitrary URIs, so collections as well as resources can be at external locations.
The main challenge will be to properly merge intermediate results from different database instances and track references to remote node sequences throughout the query.
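As a rough illustration of the merging problem, the sketch below combines partial results from two instances while keeping track of where each node lives. It assumes each instance returns its results already sorted by an atomized value; the `RemoteNode` type, the instance names and the sample data are all hypothetical and say nothing about how eXist would actually represent remote references.

```python
import heapq
from typing import NamedTuple


class RemoteNode(NamedTuple):
    instance: str   # which database instance holds the node
    node_id: int    # node identifier local to that instance
    value: str      # atomized value returned with the reference


def merge_results(*partials, key=lambda n: n.value):
    """Merge per-instance result lists that are each already sorted by `key`."""
    return list(heapq.merge(*partials, key=key))


# Node id 3 exists on both instances, so the id alone is ambiguous.
from_a = [RemoteNode("db-a", 7, "apple"), RemoteNode("db-a", 3, "pear")]
from_b = [RemoteNode("db-b", 3, "banana"), RemoteNode("db-b", 9, "zebra")]

merged = merge_results(from_a, from_b)
```

The point of carrying `instance` alongside `node_id` is that node identifiers are only unique within one database, so any later step of the query that dereferences a node needs both.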
Until now, most database parameters have been configured in a central XML configuration file. This file is read only once, during database startup. While many parameters should indeed not be changed at runtime, there are a few settings which could be modified without requiring a db restart. Examples include the current cache size settings, job scheduling, or index plugins. However, eXist currently provides no interface to modify settings at runtime, although some read-only settings are already exposed via Java Management Extensions (JMX).
The goal of the project would be to provide a common interface to dynamically configure certain aspects of the database instance at runtime. Ideally access to this interface should be provided via JMX. Additionally, the existing JMX mbeans should be extended to provide more control over jobs running on the db instance. For an administrator, it should be possible to view all running queries or jobs, and modify their access permissions or even kill a process.
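The distinction between settings that can change at runtime and those that require a restart could be modelled roughly as follows. This is a language-neutral sketch only: it does not use JMX, and the `RuntimeConfig` class, the setting names and their values are invented for illustration.

```python
class RuntimeConfig:
    """Hypothetical settings registry; names and values are illustrative only."""

    def __init__(self, settings, mutable):
        self._settings = dict(settings)
        self._mutable = set(mutable)  # settings that may change without a restart

    def get(self, name):
        return self._settings[name]

    def set(self, name, value):
        if name not in self._settings:
            raise KeyError(name)
        if name not in self._mutable:
            raise PermissionError(f"{name} requires a restart to change")
        self._settings[name] = value


config = RuntimeConfig(
    settings={"cache_size_mb": 48, "data_dir": "/tmp/data"},
    mutable={"cache_size_mb"},
)

# A runtime-modifiable setting is updated in place.
config.set("cache_size_mb", 128)

# A startup-only setting is rejected rather than silently changed.
try:
    config.set("data_dir", "/elsewhere")
    rejected = False
except PermissionError:
    rejected = True
```

A management layer such as JMX would sit in front of an interface like this, exposing `get` for every setting and `set` only for the mutable ones.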
Sorting a set of nodes is a frequent operation in many XQuery applications. The "order by" clause in XQuery is very powerful and allows the definition of an arbitrary number of ordering specifications to be applied on the tuple stream returned by a FLWOR expression.
However, ordering is quite expensive: for each tuple in the return sequence we have to evaluate all ordering expressions once and atomize the result, i.e. transform it into an atomic sequence. Atomization requires access to the actual node stored in the db, thus generating a huge amount of IO. As a result, "order by" expressions should always be applied with care. Query execution times will increase linearly with the size of the return sequence.
To improve this, eXist should at least provide indexed access to the atomized values needed for the ordering. Unfortunately, the existing index structures cannot be used directly: the range index maps atomized node values to a sequence of node ids, while order by would need to order node ids by their node value. So either the existing range index has to be extended to support value lookups by node id or a new index structure has to be implemented.
Other XQuery operations could benefit from such an index as well: this includes the aggregate functions min, max and sum, as well as distinct-values.
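The difference between the two index directions can be made concrete with a small in-memory sketch. The node ids and values are invented sample data, and plain dictionaries stand in for what would be on-disk index structures in eXist; nothing here reflects the actual range index implementation.

```python
from collections import defaultdict

# Illustrative data: node id -> atomized value, as if read from stored nodes.
stored = {101: 30, 102: 10, 103: 20, 104: 10}

# Range-index direction described above: value -> node ids.
range_index = defaultdict(list)
for node_id, value in stored.items():
    range_index[value].append(node_id)

# Reverse direction needed for ordering: node id -> value.
value_by_node = {nid: v for v, nids in range_index.items() for nid in nids}


def order_by(node_ids):
    """Sort node ids by their indexed value, without touching stored nodes."""
    return sorted(node_ids, key=value_by_node.__getitem__)


ordered = order_by([101, 102, 103, 104])

# Aggregates and distinct-values can be answered from the same lookup.
smallest = min(value_by_node[n] for n in (101, 103))
distinct = sorted(set(value_by_node.values()))
```

With the reverse lookup in place, sorting and aggregation read only index entries; the stored nodes themselves are never atomized, which is where the IO saving would come from.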
The following people are available as mentors. Once your project has been accepted, you will be assigned one or two mentors to support you directly, but all mentors will provide support where required.
Adam Retter (GSoC Administrator)
Wolfgang Meier
Leif-Jöran Olsson
Dannes Wessels
Pierrick Brihaye
Andrzej Jan Taramina
Piotr Kaminski