Synchronize directory to a collection

(3Q20)


This article describes how to synchronize filesystem data to an eXist-db collection.

Introduction

An editing workflow often operates on a filesystem directory whose contents are eventually published using eXist-db. The exist-db-addons library, available in Maven Central, enables automatic publication of a directory to a collection.

Files in the directory specified by the parameter datadir are synchronized recursively to the collection specified by the parameter collection. If the target collection does not exist, it is created. Files and collections that are new, or newer than their counterparts in the target collection, are written to that collection. Files and collections that are not present in the source directory are removed from the collection; this can be turned off via the boolean parameter removeNotInSource.

The owner and group of collections and documents can be set with the parameters owner and group; otherwise they are the same as the owner and group of the parent collection of the collection given in the collection parameter. After syncing, the cache is cleared to prevent problems; this can be turned off via the boolean parameter clearCache.

NOTE: if an exception occurs during syncing, the sync partially succeeds: collections and files added or removed before the exception remain added or removed.

DataSyncTask is meant to be used as a start-up task; DataSyncTaskCron is meant to be scheduled as a cron job.

Usage

Below is a setup of eXist-db for data synchronization.

Include exist-db-addons

For example in a Dockerfile:

ARG EXISTADDONSVERSION=2.3
COPY exist-db-addons-${EXISTADDONSVERSION}.jar $EXIST_HOME/lib/
ENV CLASSPATH=$EXIST_HOME/lib/exist.uber.jar:$EXIST_HOME/lib/exist-db-addons-${EXISTADDONSVERSION}.jar

Or include a dependency in eXist-db's pom.xml:

<dependency>
  <groupId>org.fryske-akademy</groupId>
  <artifactId>exist-db-addons</artifactId>
  <version>2.3</version>
</dependency>

Configure in conf.xml

Sync at start-up, see scheduler

<job class="org.fryske_akademy.exist.jobs.DataSyncTask" type="system" period="10" repeat="0">
  <parameter name="collection" value="xmldb:exist:///db/apps/teidictjson/data"/>
  <parameter name="datadir" value="/data"/>
</job>

Sync at 2am, see scheduler

<job class="org.fryske_akademy.exist.jobs.DataSyncTaskCron" type="system" cron-trigger="0 0 2 ? * *">
  <parameter name="collection" value="xmldb:exist:///db/apps/teidictjson/data"/>
  <parameter name="datadir" value="/data"/>
</job>
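
The optional parameters described in the introduction can be added to either job in the same way. The following is a sketch only: the parameter names removeNotInSource, clearCache, owner and group come from the description above, but the values shown are illustrative assumptions. In particular, it is assumed here that the boolean parameters are written as true or false, and the user admin and group dba are placeholders for whatever accounts exist in your database.

```xml
<job class="org.fryske_akademy.exist.jobs.DataSyncTask" type="system" period="10" repeat="0">
  <parameter name="collection" value="xmldb:exist:///db/apps/teidictjson/data"/>
  <parameter name="datadir" value="/data"/>
  <!-- assumed value format: keep resources that are absent from the source directory -->
  <parameter name="removeNotInSource" value="false"/>
  <!-- assumed value format: skip clearing the cache after syncing -->
  <parameter name="clearCache" value="false"/>
  <!-- placeholder owner and group; without these, the parent collection's owner and group are used -->
  <parameter name="owner" value="admin"/>
  <parameter name="group" value="dba"/>
</job>
```

Leaving out all four optional parameters gives the behaviour described in the introduction: resources missing from the source directory are removed, the cache is cleared, and ownership follows the parent collection.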