Application Server Configuration

(2Q21)


This section deals with the configuration of the eXist-db Application Server in eXist-db's main configuration file conf.xml.

Main configuration file

The main configuration file for eXist-db is called conf.xml, which is loaded from the root directory of the distribution (as specified by the system property exist.home).

The configuration file conf.xml is divided into twelve sections:

  1. <db-connection>: Configures the storage back-end.

  2. <lock-manager>: Configures the Lock Manager.

  3. <repository>: Settings for the package repository.

  4. <binary-manager>: Settings for the Binary Manager.

  5. <indexer>: Controls the indexing process.

  6. <scheduler>: Job scheduler for system or user jobs such as backups.

  7. <parser>: Default settings for parsing structured documents.

  8. <serializer>: Default settings for the serializer (external data representation).

  9. <transformer>: Default settings for the XSLT Transformer.

  10. <validation>: Settings for XML validation.

  11. <xquery>: Enable and configure extension modules that contain XQuery functions.

  12. <xupdate>: Configuration options related to XUpdate processing.

The following sections describe the most commonly modified of the above elements, including how to change the default behavior of eXist-db's handling of whitespace characters.

db-connection element

This element contains basic default storage settings for eXist-db, including memory and system limits. Only one <db-connection> should be specified. An example configuration for the native back-end is shown below:

<db-connection cacheSize="48M" collectionCache="24M" database="native" files="../data" pageSize="4096" nodesBuffer="-1">
  <pool min="1" max="15" sync-period="240000" wait-before-shutdown="60000"/>
  <!-- default-permissions collection="0775" resource="0775" / -->
  <recovery enabled="yes" sync-on-commit="no" group-commit="no" size="100M" journal-dir="../data"/>
  <watchdog query-timeout="-1" output-size-limit="10000"/>
  <default-permissions collection="0775" resource="0775"/>
</db-connection>

db-connection attributes

database

This attribute selects a database system type. Since relational database back-ends are no longer supported by the current release of eXist, only native is available.

files

This attribute specifies the directory where the native back-end will keep its database files, and so it is necessary that this directory exists. If a relative path is specified, it will be based on the root directory as defined in the exist.home system property. If this data directory does not have write permissions (see User Authentication and Access Control), eXist will internally switch to read-only mode such that any attempt to change the database will throw an exception.

cacheSize

This attribute sets the maximum amount of main memory used by all page buffers (i.e. assuming all page buffers are at full capacity). The database uses this parameter to calculate the maximum size of each internal cache. You can increase this value if your system allows for greater memory use.

While indexing documents, eXist will reserve the amount of memory specified in cacheSize - even if not all caches are filled - and will not use it for temporary data.

The cacheSize should not be more than half of the JVM heap size (set by the JVM -Xmx parameter). If the JVM heap is smaller than 512 megabytes, cacheSize should be smaller still, e.g. one third of the heap.
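
As a rough illustration of the sizing rule above, the fragment below assumes a JVM started with -Xmx1024m (an assumed value, not a recommendation) and keeps cacheSize at a quarter of the heap, below the one-half ceiling. The remaining attributes are copied from the example at the start of this section.

```xml
<!-- Sketch only: assumes the JVM was started with -Xmx1024m.
     cacheSize is 256M, a quarter of the heap and below the one-half ceiling. -->
<db-connection cacheSize="256M" collectionCache="24M" database="native"
               files="../data" pageSize="4096" nodesBuffer="-1">
  <!-- child elements such as <pool> and <recovery> omitted -->
</db-connection>
```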

collectionCache

Determines the size of the collection cache, which is a separate caching space. This setting usually does not need to be changed unless the database holds more than a few thousand collections. Increase it carefully, perhaps up to 128M.

pageSize

This specifies the number of bytes used for internal data and B-tree pages. This should be equal to or a multiple of the page size used by the filesystem (usually a multiple of 4096).

nodesBuffer

Size of the temporary buffer used by eXist for caching index data while indexing a document. If set to -1, eXist will use the entire free memory to buffer index entries and will flush the cache once the memory is full.

If set to a value > 0, the buffer will be fixed to the given size. The specified number corresponds to the number of nodes the buffer can hold, in thousands. Usually, a good default could be nodesBuffer="1000".

The default setting, nodesBuffer="-1", can be problematic if you frequently need to store large documents in a multi-user environment. In this case, the index operation may consume most of the memory resources, which means that concurrent threads will be slowed down or even come to a halt.
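
For a multi-user system that regularly stores large documents, a fixed buffer might look like the sketch below. It uses the nodesBuffer="1000" value mentioned above; since the number is given in thousands of nodes, this corresponds to roughly one million nodes. The other attribute values are copied from the earlier example and are not tuned for any particular workload.

```xml
<!-- Sketch only: fixes the indexing buffer instead of using all free memory.
     nodesBuffer is given in thousands of nodes, so 1000 is about one million nodes. -->
<db-connection cacheSize="48M" collectionCache="24M" database="native"
               files="../data" pageSize="4096" nodesBuffer="1000">
  <!-- child elements omitted -->
</db-connection>
```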

db-connection/pool element

These settings control the internal database connection pool.

min | max

These options specify the minimum and maximum size of the connection pool. This pool restricts the number of parallel (basic) operations that can be executed by the database. Settings should be somewhere between 1 and 20.

Please note that this has nothing to do with the HTTP and XMLRPC server settings - these servers have their own connection pools.

sync-period

This option defines how often the database will flush its internal buffers to disk (in milliseconds). The sync-thread will interrupt normal database operation after the specified time and write all dirty pages to disk. It also writes a checkpoint to the transaction log. In case of a database crash, only transactions which started after the last checkpoint have to be redone or rolled back. The sync-period should thus not be set too long.

wait-before-shutdown

This option specifies the maximum amount of time (in milliseconds) that the database will allow for any running processes to complete upon database shutdown. After that, eXist will try to kill the remaining processes.

If wait-before-shutdown is set to a positive number, eXist will stop the db after the specified timeout, even if there were still running database operations. In this case, no checkpoint will be written to the transaction log. If there were any open transactions, eXist will trigger a recovery run after restart.

If wait-before-shutdown is set to -1, eXist will not shut down before all active database operations returned. This is a safe setting, but it may require a manual intervention to stop the JVM.
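
The pool settings described above could be combined as in the following sketch. The values are illustrative assumptions: a pool of up to 20 brokers, a sync every two minutes, and a two-minute grace period at shutdown.

```xml
<!-- Sketch only; values are illustrative.
     sync-period: flush dirty pages and write a checkpoint every 120000 ms.
     wait-before-shutdown: allow running operations 120000 ms to finish. -->
<pool min="1" max="20" sync-period="120000" wait-before-shutdown="120000"/>
```

Setting wait-before-shutdown="-1" instead would make the database wait for all active operations, at the cost of possibly needing to stop the JVM manually.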

db-connection/query-pool element

This element configures the global pool for compiled XQuery expressions. For each XQuery, up to a maximum number of compiled copies are kept in the pool, and a copy is removed if it has not been used within the defined timeout. The XQuery pool is multi-threaded.

<query-pool> Attributes:

max-stack-size

The maximum number of queries in the query-pool.

size

The number of copies of the same query kept in the query-pool. Value "-1" effectively disables caching. Queries cannot be shared by threads, each thread needs a private copy of a query.

timeout

The amount of time that a query will be cached in the query-pool in milliseconds.

timeout-check-interval

The time between checks for timed-out queries. With a value of "-1" the timeout check is switched off, so cached queries remain in the cache forever.
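
A query-pool declaration using the four attributes above might look like this sketch. All values are illustrative, and the unit of timeout-check-interval is assumed to be milliseconds, like timeout.

```xml
<!-- Sketch only; placed inside <db-connection>. Values are illustrative.
     max-stack-size: maximum number of queries in the pool.
     size: copies of the same query kept in the pool.
     timeout-check-interval is assumed to be in milliseconds. -->
<query-pool max-stack-size="64" size="128" timeout="120000" timeout-check-interval="30000"/>
```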

db-connection/recovery element

This element configures the journalling and recovery of the database. With recovery enabled, the database is able to recover from an unclean database shutdown due to, for example, power failures, OS reboots, and hanging processes. For this to work correctly, all database operations must be logged to a journal file. The location, size and other parameters for this file can be set using the <recovery> element.

<recovery> Attributes:

enabled

If this attribute is set to yes, automatic recovery is enabled.

size

This attribute sets the maximum allowed size of the journal file. Once the journal reaches this limit, a checkpoint will be triggered and the journal will be cleaned. However, the database waits for running transactions to return before processing this checkpoint. In the event one of these transactions writes a lot of data to the journal file, the file will grow until the transaction has completed. Hence, the size limit is not enforced in all cases.

journal-dir

This attribute sets the directory where journal files are to be written. If no directory is specified, the default path is to the data directory.

sync-on-commit

This attribute determines whether or not to protect the journal during operating system failures. That is, it determines whether the database forces a file-sync on the journal after every commit. If this attribute is set to yes, the journal is protected against operating system failures. However, this will slow performance - especially on Windows systems. If set to no, eXist will rely on the operating system to flush out the journal contents to disk. In the worst case scenario, in which there is a complete system failure, some committed transactions might not have yet been written to the journal, and so will be rolled back.

group-commit

If set to yes, eXist will not sync the journal file immediately after every transaction commit. Instead, it will wait until the current file buffer (32 KB) is full. This can speed up eXist on some systems where a file sync is an expensive operation (mainly Windows XP; not necessary on Linux).

However, group-commit="yes" will increase the chance that an already committed operation is rolled back after a database crash.

force-restart

Try to restart the db even if crash recovery failed. This is dangerous because there might be corruptions inside the data files. The transaction log will be cleared, all locks removed and the db re-indexed.

Set this option to yes if you need to make sure that the db is online, even after a fatal crash. Errors encountered during recovery are written to the log files. Scan the log files to see if any problems occurred.

consistency-check

If set to yes, a consistency check will be run on the database if an error was detected during crash recovery. This option requires force-restart to be set to yes, otherwise it has no effect.

The consistency check outputs a report to the directory {files}/sanity and if inconsistencies are found in the db, it writes an emergency backup to the same directory.
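
The recovery attributes above can be combined as in the following sketch, which favors durability and availability over speed. The values are illustrative; note that consistency-check only takes effect here because force-restart is also set to yes, and that force-restart carries the risks described above.

```xml
<!-- Sketch only; values are illustrative.
     sync-on-commit="yes" forces a file sync on the journal after every commit.
     consistency-check="yes" has an effect only together with force-restart="yes". -->
<recovery enabled="yes"
          sync-on-commit="yes"
          group-commit="no"
          size="100M"
          journal-dir="../data"
          force-restart="yes"
          consistency-check="yes"/>
```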

db-connection/watchdog element

This is the global configuration for the query watchdog. The watchdog monitors all query processes, and can terminate any long-running queries if they exceed one of the predefined limits. These limits are as follows:

<watchdog> Attributes:

query-timeout

This attribute sets the maximum amount of time (expressed in milliseconds) that the query can take before it is killed. The setting can be overwritten in an XQuery by specifying the option exist:timeout:

declare option exist:timeout "time-in-ms";

Please check the documentation on XQuery options.

output-size-limit

This attribute limits the size of XML fragments constructed using XQuery, and thus sets the maximum amount of main memory a query is allowed to use. This limit is expressed as the maximum number of nodes allowed for an in-memory DOM tree. The purpose of this option is to avoid memory shortages on the server in cases where users are allowed to run queries that produce very large output fragments. The setting can be overwritten in an XQuery by specifying the option exist:output-size-limit:

declare option exist:output-size-limit "size-hint";
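
On the server side, the two limits could be set globally as in this sketch. The 60-second timeout is an illustrative assumption; the output limit repeats the value from the example at the start of this section.

```xml
<!-- Sketch only: terminate queries running longer than 60000 ms and
     cap in-memory fragments at 10000 nodes. Values are illustrative. -->
<watchdog query-timeout="60000" output-size-limit="10000"/>
```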

db-connection/default-permissions element

Specifies the default permissions for all resources and collections in eXist (see User Authentication and Access Control). When this is not configured, the default mode (similar to the Unix chmod command) is 0775 for both the resource and collection attributes. A different default value may be set for a database instance. Local overrides are also possible.
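
A more restrictive default could be declared as in the sketch below. The modes are illustrative assumptions, read like Unix modes: full access on collections for owner and group, read and write on resources for owner and group, and nothing for others.

```xml
<!-- Sketch only; modes are illustrative and read like Unix chmod modes. -->
<default-permissions collection="0770" resource="0660"/>
```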

lock-manager element

This element contains settings for eXist-db's Lock Manager and Lock Table. The majority of these Lock Manager settings should not be modified unless otherwise suggested by the eXist-db Core Development Team.

lock-table/@disabled

Disables the database Lock Table which tracks database locks. The Lock Table is enabled by default and allows reporting on database locking via JMX.

Tracking locks via the Lock Table imposes a small overhead per-Lock. Once users have finished testing their system to ensure correct operation, they may wish to disable this in production to ensure the absolute best performance.

Unless necessary, it is recommended to leave this enabled.

document/@use-path-locks

Experimental: Causes path locks to be used for documents as well as collection locks.

This has a performance and concurrency impact, but will ensure that you cannot have deadlocks between Collections and Documents.

Unless necessary, it is recommended to leave this at its default value.
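
The two settings above map onto the configuration roughly as in this sketch. Only the attributes discussed here are shown; the boolean spelling "true"/"false" and the value shown for use-path-locks are assumptions, so check the conf.xml shipped with your version before copying this.

```xml
<!-- Sketch only: the "true"/"false" spelling is an assumption;
     other lock-manager attributes are omitted. -->
<lock-manager>
  <!-- Lock Table stays enabled, so locks can be reported via JMX -->
  <lock-table disabled="false"/>
  <!-- Experimental path locks for documents left switched off -->
  <document use-path-locks="false"/>
</lock-manager>
```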

indexer element

This element sets parameters on how XML files are to be indexed by eXist. An example configuration is shown below:

<indexer caseSensitive="yes" index-depth="5" preserve-whitespace-mixed-content="no" suppress-whitespace="none">
  <modules>
    <module id="ngram-index" file="ngram.dbx" n="3" class="org.exist.indexing.ngram.NGramIndex"/>
    <!-- <module id="spatial-index" connectionTimeout="10000" flushAfter="300" class="org.exist.indexing.spatial.GMLHSQLIndex"/> -->
    <module id="lucene-index" buffer="32" class="org.exist.indexing.lucene.LuceneIndex"/>
    <!-- The following index can be used to speed up 'order by' expressions by pre-ordering a node set. -->
    <module id="sort-index" class="org.exist.indexing.sort.SortIndex"/>
    <!-- New range index based on Apache Lucene. Replaces the old range index which is hard-wired into eXist core. -->
    <module id="range-index" class="org.exist.indexing.range.RangeIndex"/>
    <!-- The following module is not really an index (though it sits in the index pipeline). It gathers relevant statistics on the distribution of elements in the database, which can be used by the query optimizer for additional optimizations. -->
    <!-- <module id="index-stats" file="stats.dbx" class="org.exist.storage.statistics.IndexStatistics" /> -->
  </modules>
  <!-- Default index settings. Default settings apply if there's no collection-specific configuration for a collection. -->
  <index>
    <!-- settings go here -->
  </index>
</indexer>

indexer attributes

caseSensitive

Specifies whether string comparisons are to be case-sensitive. This option applies to XPath equality tests (i.e. the = operator), as well as functions such as contains(), starts-with() and ends-with().

This setting does not apply to operators or functions of the full-text index (e.g. &=, |=, near()) nor the n-gram index, which are never case-sensitive.

Warning:

Setting caseSensitive="no" violates the XQuery specs! The option should be regarded as a dirty workaround, which will be removed in the future. Please use the n-gram or full-text indexes for case-insensitive queries or - if that is impossible - specify a collation.

suppress-whitespace

Specifies how the <indexer> is to treat whitespace at the start or end of a character sequence. This option only applies to newly stored files, and therefore changing it has no effect on previously stored documents. Possible values for this attribute are:

  1. leading - Suppresses leading whitespace.

  2. trailing - Suppresses trailing whitespace.

  3. both - Suppresses leading and trailing whitespace.

  4. none - Preserves all whitespace.

Note that suppressing whitespace at the start or end of character sequences does effectively change the document!

preserve-whitespace-mixed-content

Controls how ignorable whitespace is handled. If set to no, ignorable whitespace, e.g. between the end tag of an element and the start tag of another, will not be stored in the persistent DOM. This leads to a smaller DOM and usually increases the readability of the XML. Ignorable whitespace is not considered part of the logical document model, so removing it doesn't change the document.
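
The two whitespace attributes could be combined as in the sketch below, which trims whitespace at both ends of character sequences and drops ignorable whitespace. The other attribute values are copied from the earlier indexer example; keep in mind that suppress-whitespace only affects newly stored documents and does change their content.

```xml
<!-- Sketch only: trims leading and trailing whitespace in character sequences
     and does not store ignorable whitespace in the persistent DOM. -->
<indexer caseSensitive="yes" index-depth="5"
         preserve-whitespace-mixed-content="no"
         suppress-whitespace="both">
  <!-- <modules> and <index> omitted -->
</indexer>
```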

tokenizer

This attribute invokes the Java class used to tokenize a string into a sequence of single words or tokens, which are stored to the full-text index. Currently only SimpleTokenizer is available.

index-depth

This attribute specifies the depth of the DOM index, or the tree level up to which elements will be added to the index. For example, a value of 2 results in the document root node and all its child elements being indexed; a value of 1 only indexes the root node.

The DOM index maps unique node identifiers to the nodes' storage locations in the DOM file. Generating this index is time- and memory-consuming. It is furthermore primarily needed to access nodes by their unique node identifier, for example, when serializing XML data for query results or XUpdate - which are operations not normally considered time-critical. Moreover, most XPath expressions can do without this index since they use short-cuts to access the node directly.

Normally only top-level elements are added to the DOM index, whereas attributes and text nodes are always excluded. This results in much smaller index sizes and, consequently, a smaller dom.dbx file size. Usually, setting the index-depth to a value of 2 offers a reasonable compromise of index size and performance.

However, if your documents are deeply-structured, you might consider increasing this setting to a level of 3, 4 or 5. For example, if the longest path from the document root to an element node has greater than ten node levels, an index-depth setting of 4 or 5 would probably help to increase overall query performance for some types of queries.
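
To make the levels concrete, the hypothetical document below is annotated for index-depth="2", following the rule stated above that the root node and its child elements are indexed while attributes and text nodes never are. The element names are invented for illustration.

```xml
<!-- Sketch only: a hypothetical document, annotated for index-depth="2". -->
<library>                      <!-- level 1 (root): in the DOM index -->
  <book id="b1">               <!-- level 2: in the DOM index; the id attribute is not -->
    <title>Example</title>     <!-- level 3: not in the DOM index -->
  </book>
</library>
```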

validation

This attribute defines the default setting for the validation of documents by the XML parser. If it is set to no, documents will never be validated against an existing DTD or schema. A value of auto will leave document validation to the SAX parser.

indexer/modules element

This section configures optional indexing modules. Beginning with version 1.2, eXist features a modularized indexing architecture, which allows new indexes to be plugged into the indexing pipeline. The <modules> section lists and configures the indexes that will be available to the database:

<modules>
  <module id="ngram-index" class="org.exist.indexing.ngram.NGramIndex" file="ngram.dbx" n="3"/>
  <!-- <module id="spatial-index" class="org.exist.indexing.spatial.GMLHSQLIndex" connectionTimeout="10000" flushAfter="300" /> -->
</modules>

The only common attributes for each <module> element are class and id. The other attributes, as well as any nested elements, are specific to the index implementation. Detailed information is available in the article on Configuring Database Indexes.

indexer/stopwords element

The file attribute for this element points to a file containing a list of stop-words. Stop-words are not added to the full-text index.

indexer/index element

This configuration element specifies the default index settings. These settings are applied if neither the collection nor any of its ancestors provide a collection configuration.

Configuring indexes via the default settings is not recommended. If you need a global collection configuration, store one for the root collection /db. For more information, see Configuring Indexes.

scheduler element

This section is used to configure asynchronous jobs with eXist's internal scheduler. Three types of jobs are supported:

startup jobs

Startup jobs are executed once during database startup, but before the database becomes available. These jobs are synchronous. The database is blocked to outside requests and no other operations will run at the same time.

system jobs

System jobs require the database to be in a consistent state. The scheduler will run them in an exclusive environment. Once the job is triggered, the database will block all new requests and wait for running operations to complete. It then executes the job. All other database operations will be stopped until the job returns or throws an exception. Any exception will be caught and a warning written to the log.

user jobs

User jobs may be scheduled at any time and may be mutually exclusive or non-exclusive.

Below is an example which configures a backup as a system job:

<job type="system" name="databackup" class="org.exist.storage.DataBackup" period="120000">
  <parameter name="output-dir" value="backup"/>
  <parameter name="suffix" value=".zip"/>
  <parameter name="prefix" value="backup-"/>
  <parameter name="collection" value="/db"/>
  <parameter name="user" value="admin"/>
  <parameter name="password"/>
  <parameter name="zip-files-max" value="28"/>
</job>

Each job is configured in a <job> element which accepts a number of standard attributes:

job attributes

type

The type of the job to schedule. Must be either startup, system or user.

class

If the job is written in Java, this should be the name of a class that extends one of:

  • org.exist.scheduler.StartupJob

  • org.exist.storage.SystemTask

  • org.exist.scheduler.UserJavaJob

xquery

If the job is written in XQuery (not suitable for system jobs), this should be a path to the XQuery stored in the database, e.g. /db/myCollection/myJob.xql. XQuery jobs will be launched under the guest account initially. The running XQuery may switch permissions through calls to xmldb:login().

cron-trigger

Defines a firing pattern for the job using cron-style syntax. Not applicable to start-up jobs.

unschedule-on-exception

Either true (default) or false. If true and an exception is encountered the job is unscheduled for further execution until a restart. Otherwise, the exception is ignored.

period

Can be used to define an explicit period for firing the job instead of a cron style syntax. Expressed in milliseconds. Not applicable to start-up jobs.

delay

Can be used for periodic jobs to delay the start of a job. If unspecified, jobs will start as soon as the database and scheduler are initialised.

repeat

Can be used for periodic jobs to define how many periods a job should be executed. If unspecified, jobs will repeat indefinitely.

Every job can take additional parameters, which are passed as name/value pairs.
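
The attributes above could be used for user jobs as in the following sketch. The job names, the parameter, and the schedules are hypothetical. The cron expression assumes Quartz-style syntax with a leading seconds field, and delay is assumed to be in milliseconds like period; verify both against your version.

```xml
<!-- Sketch only: job names, the parameter, and the schedules are hypothetical.
     The cron expression assumes Quartz-style syntax (seconds first): 02:00 every day. -->
<scheduler>
  <job type="user" name="nightly-report"
       xquery="/db/myCollection/myJob.xql"
       cron-trigger="0 0 2 * * ?"
       unschedule-on-exception="false">
    <parameter name="recipient" value="ops"/>
  </job>

  <!-- Periodic alternative: delayed start, a 10-minute period, limited to 6 periods. -->
  <job type="user" name="short-poll"
       xquery="/db/myCollection/myJob.xql"
       period="600000" delay="60000" repeat="6"/>
</scheduler>
```

A job uses either cron-trigger or period, not both; the second job shows the period form with its optional delay and repeat attributes.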

serializer element

The serializer is responsible for serializing XML documents or document fragments back into XML. This configuration element defines default settings for various parameters, which can also be specified programmatically. All settings can be overwritten by XQuery serialization options.

serializer attributes

enable-xinclude

This attribute determines whether <xinclude> tags are to be expanded during serialization. Setting the value to false will leave <xinclude> tags unexpanded.

enable-xsl

Setting this attribute to true tells the serializer to pass its output to an XSL stylesheet when it encounters an XSL processing-instruction at the start of the document.

add-exist-id

This attribute tells the serializer to add additional debug attributes to each element. This information includes the internal identifier of the node and source document. Values:

  1. all - Adds debug information to every node in the output.

  2. element - Adds debug information to top-level elements only.

  3. none (default) - Disables debugging feature.

indent

By default, the serializer pretty-prints the resulting XML source code. Setting this option to no disables pretty-printing.

match-tagging-elements

The database can highlight matches in the text content of a node by tagging the matching text string with <exist:match>. This only works for XPath expressions using the full-text index. Set the parameter to yes to enable this feature.
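
A serializer declaration covering some of the attributes above might look like the sketch below. The yes/no spelling of the values is an assumption; the values themselves are illustrative.

```xml
<!-- Sketch only: the yes/no spelling of the values is an assumption.
     Pretty-printing and XInclude expansion on, XSL processing and debug ids off. -->
<serializer add-exist-id="none" enable-xinclude="yes" enable-xsl="no" indent="yes"/>
```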

transformer element

This section determines which XSLT processor will be used by eXist. By default, eXist relies on Saxon.

validation element

Defines the default validation settings active when parsing XML and links to catalog files. Catalog files are used to locate DTDs, schemas and resolve external entities in general.

Please refer to the corresponding documentation on XML Validation.

xupdate element

Inserting new nodes into a document can lead to fragmentation in the DOM storage file. eXist will thus trigger a de-fragmentation run if the fragmentation exceeds a certain limit. The frequency of such de-fragmentation runs can be configured in the <xupdate> section. The main parameter is called allowed-fragmentation:

<xupdate allowed-fragmentation="20" enable-consistency-checks="no"/>

xupdate attributes

allowed-fragmentation

This attribute defines the maximum number of page splits allowed within a document before a de-fragmentation run is triggered.

enable-consistency-checks

This attribute is for debugging purposes only. If the parameter is set to yes, a consistency check will be run on modified documents after every XUpdate request. This checks whether the persistent DOM is complete, and all pointers in the structural index point to valid storage addresses that contain valid nodes.

xquery element

<xquery enable-java-binding="no" enable-query-rewriting="no" enforce-index-use="always" disable-deprecated-functions="no" raise-error-on-failed-retrieval="no" backwardCompatible="no">
  <builtin-modules>
    <!-- Default Modules -->
    <module class="org.exist.xquery.functions.util.UtilModule" uri="http://exist-db.org/xquery/util"/>
    <!-- ... more modules ... -->
  </builtin-modules>
</xquery>

The <xquery> section is used to enable/disable certain core features of the XQuery engine. It also lists the XQuery extension modules that will be known to the query engine by default.

xquery attributes

enable-java-binding

Set to yes to enable the java binding. Giving users full access to all Java classes should be considered a security risk and the feature is thus disabled by default.

disable-deprecated-functions

Set to yes to disable XQuery functions marked as deprecated.

enforce-index-use

Controls whether available range indexes should be used if only some collections in the context set define a matching index. Available settings are:

  • always to always use an index, even if it does not apply to the entire set of collections being queried.

  • strict to only use indexes if they are defined for the entire collection set.

For example, if you have two collections: /db/one and /db/two, and you define a range index on a certain element <node> in /db/one, but not in /db/two, the query engine would not use the index with setting strict if you query both collections. At compile time, eXist doesn't know if <node> exists in both collections and will not use the index if it determines that an index definition does only apply to a part of the collection set being queried. To use the index, you would need to start your XPath expression with a call to collection(), selecting the correct collection with the index defined.

If enforce-index-use is set to always, the query engine only checks if one collection in the collection set has a matching index defined on it. This may lead to an incomplete query result if one forgets certain collections.

In other words, when enforce-index-use is set to "always", it is the query writer's responsibility to make sure indexes are defined properly. But experience has shown it is easier for users to understand that a certain result is incomplete because an index is missing, whereas they have trouble seeing that a performance issue is caused by inconsistent indexing.
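
Switching to the stricter behavior is a single attribute change, sketched below with the other attributes copied from the example at the start of this section. With this setting, a query spanning both /db/one and /db/two from the example above would not use an index defined only on /db/one.

```xml
<!-- Sketch only: use an index only if it is defined for the entire collection set. -->
<xquery enable-java-binding="no" enable-query-rewriting="no"
        enforce-index-use="strict"
        disable-deprecated-functions="no"
        raise-error-on-failed-retrieval="no" backwardCompatible="no">
  <!-- <builtin-modules> omitted -->
</xquery>
```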

raise-error-on-failed-retrieval

Set to yes if a call to doc(), xmldb:document(), collection() or xmldb:xcollection() should raise an error (FODC0002) when an XML resource can not be retrieved.

Set to no if a call to doc(), xmldb:document(), collection() or xmldb:xcollection() should return an empty sequence when an XML resource can not be retrieved.

enable-query-rewriting

The query engine can often achieve considerable performance improvements by rewriting an XQuery expression into a more efficient form (see the documentation about indexing). However, these features are relatively new. If you have doubts about the correctness of a query result, you may temporarily set enable-query-rewriting to no and see if the result changes in any way. If it does, you have hit a bug which should be reported.

backwardCompatible

Set to yes to enable XPath 1.0 backwards compatibility. The setting mainly affects automatic type conversions, which were less strict in XPath 1.0 than in later versions.

xquery/builtin-modules element

This section lists the XQuery extension modules which will be known to the query engine. The modules in this list can be imported into a query without specifying a location. For example:

<module class="org.exist.xquery.modules.file.FileModule" uri="http://exist-db.org/xquery/file"/>

This establishes a static mapping between the module URI for the file module and the Java class which implements it. When using that module, it is sufficient to provide the correct URI in the import; no location needs to be specified, as in:

import module namespace file="http://exist-db.org/xquery/file";

Instead of providing a Java class, one can also specify a src URI which must point to the XQuery source code of the module, for instance:

<module uri="http://exist-db.org/xquery/kwic" src="resource:org/exist/xquery/lib/kwic.xql"/>

For the src attribute, eXist understands the same types of URIs as in an ordinary XQuery import statement.
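
Putting the two forms together, a builtin-modules section could look like the sketch below. It only reuses the module declarations shown earlier in this section; a real conf.xml lists many more modules.

```xml
<builtin-modules>
  <!-- Modules implemented in Java: class plus module URI -->
  <module class="org.exist.xquery.functions.util.UtilModule" uri="http://exist-db.org/xquery/util"/>
  <module class="org.exist.xquery.modules.file.FileModule" uri="http://exist-db.org/xquery/file"/>
  <!-- Module implemented in XQuery: module URI plus src pointing at the source -->
  <module uri="http://exist-db.org/xquery/kwic" src="resource:org/exist/xquery/lib/kwic.xql"/>
</builtin-modules>
```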