Distributed Versioning for Geospatial Data (Part 2)

Distributed Versioning Implementation

Published in December 2012, this is the second paper in a series of three that proposes a new approach to working with geospatial data, recommending a shift toward treating geospatial data as programmers treat source code. This paper delves into the technology needed to create a distributed versioning system for geospatial data and presents the work OpenGeo has already carried out.

1. Initial Attempts: Non-Distributed Versioning

In 2006, OpenGeo undertook substantial research and development on an open source versioning implementation. This first versioning system was not distributed; instead it looked to other implementations of centralized versioning for inspiration 1. To achieve this, tracking tables were added to PostGIS and extensions were built to the WFS Transactional (WFS-T) open standard to handle diffs and rollbacks.

While this approach worked reasonably well, we knew the system was going to get substantially more complicated when it came to adding concepts of “branches” and “merges.” Worse yet, the model would certainly break down when we attempted to create distributed versions. Why? Because we were treating geospatial data in traditional ways, using databases and adding extra code on top of them.

At the time OpenGeo held off from fully investigating the potential for distributed version control, as our first priority was to solve the non-distributed versioning problem. Later, when we began using distributed version tools for software development in our own work we took another look. That’s when we realized that a shift in perspective could make the geospatial data versioning effort smoother and more effective. To read more about this ‘shift’ please see Distributed Versioning for Geospatial, the previous paper in this series.

2. Approaches

Although this paper primarily presents a specific approach to handling distributed versioning of geospatial data, OpenGeo has conducted preliminary investigations into a number of alternative approaches.

The main challenge with management of geospatial data is its immensity — the sheer size of the data undercuts many of the assumptions made by management tools designed for other purposes. There are two general approaches taken to cope with this constraint: try to add geospatial capabilities to existing non-geospatial software, or borrow the best ideas from non-geospatial software and build something new that can handle geospatial data.

At OpenGeo we’ve spent time on the latter, adapting the core ideas of Git 2, an open source distributed versioning system, and redesigning these ideas to better fit geospatial data. We now have a working implementation, but it needs to be developed further.

In the future we hope to explore three other potential approaches. One is to use a non-geospatial program, like Git, to directly store geospatial data. The second, building on that idea, is a hybrid approach: using a spatial database for the operation’s “head,” providing fast access and spatial indexing, with Git handling the corpus – all the versioning and history. The third approach is to leverage the core technology of CouchDB, a NoSQL database with peer-based replication. This approach keeps the backend and front end of the software separable: if some other open source innovation emerges, we can adopt it in the right place and likely reuse or adapt the UX research and front-end tools for geospatial collaboration with other distributed versioning backends.

3. An OpenGeo Implementation

OpenGeo has invested heavily to create a library to enable distributed version control (GeoGit). This library is not directly compatible with Git or GitHub, but adapts the core concepts of Git to geospatial data.

The code for the core repository can be found at http://github.com/opengeo/GeoGit. This code backs both a GeoSynchronization Service module (an OGC specification for synchronizing data) and the versioning constructs of the Web Feature Service 2.0 standard. Eventually we plan to bring both implementations into the standard distributions of GeoServer. Others have been helping out as well: one group in Australia is expanding and leveraging the same code base to work in uDig, a desktop GIS, as well as creating a GeoTools datastore that can connect to the DVS, so other Java-based software can also leverage the code.

The best place to start gaining an understanding of the core concepts of Git is the article “Git for Computer Scientists”. This article explains that the core data structure of Git is a Directed Acyclic Graph (DAG) of various objects. Every object in the graph is identified by a SHA-1 hash, giving a globally unique identifier to every file, commit, and tree. These hashes make version and provenance tracking a core part of the data structure. Merges and diffs become very easy to compute, even across repositories, because every change has its own SHA-1 hash.
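The content-addressed model described above can be illustrated with a small sketch. This is a toy in Python, not GeoGit’s actual API; the names `sha1_of`, `put`, and `store` are all hypothetical, and stand in for the real object database:

```python
import hashlib

def sha1_of(content: bytes) -> str:
    """Return the SHA-1 hex digest that identifies this content."""
    return hashlib.sha1(content).hexdigest()

# A toy content-addressed store: objects are looked up by the hash of
# their contents, so identical content is stored exactly once and any
# change to the content produces a new key.
store = {}

def put(content: bytes) -> str:
    key = sha1_of(content)
    store[key] = content
    return key

blob_id = put(b"POINT (1 2)")   # a feature's geometry as raw bytes
same_id = put(b"POINT (1 2)")   # identical content -> identical key
```

Because the key is derived from the content itself, two repositories that hold the same object will always agree on its identifier, which is what makes cross-repository diffs and merges cheap.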

4. Mapping to Geospatial

The primary difference between standard Git and what’s needed for geospatial is the structure of the tree. Typically Git is used to manage source code, which is generally comprised of a small set of files nested in a number of directories, forming a tree with a large number of branches and relatively few leaves. By contrast, with data you get a very flat tree – not many branches but a lot of leaves. Because Git was built to manage source code, not data, trying to shove large datasets into Git can lead to poor performance and struggles to achieve meaningful diffs and merges.

For geospatial versioning, we instead optimize for a typical geospatial representation that doesn’t have much nesting. Instead of files as the base content, we have geospatial “features”, and instead of file directories we have “featureTypes” to split up the tree as needed.

We use the commit data structure in the same way as Git – a “commit” is a type of object that points to a tree capturing the full state of the project at a given moment in time, as well as to its parent commit or commits. Together these objects form the Directed Acyclic Graph: the trees hold every state in the history, and the commits mark each point of change.
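The commit-and-tree relationship can be sketched as follows. This is a simplified Python illustration, not GeoGit’s internal format (which stores binary objects, not JSON); `hash_obj` and the dict layouts are assumptions for the example:

```python
import hashlib
import json

def hash_obj(obj: dict) -> str:
    """Deterministically hash a JSON-serializable object."""
    return hashlib.sha1(json.dumps(obj, sort_keys=True).encode()).hexdigest()

# A "tree" maps feature IDs to content hashes; a "commit" points to a
# tree (the full state at that moment) and to its parent commits,
# which is what links the objects into a Directed Acyclic Graph.
tree_v1 = {"roads.1": "a" * 40}
tree_v2 = {"roads.1": "b" * 40}            # the feature was edited

commit_1 = {"tree": hash_obj(tree_v1), "parents": []}
commit_2 = {"tree": hash_obj(tree_v2), "parents": [hash_obj(commit_1)]}
```

Walking the `parents` links from the newest commit recovers the full history, and each commit’s tree gives the complete state at that point.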

5. Implementation Details

In standard Git, changes are tracked on files of source code. When these files are checked into a Git repository, Git creates a canonical representation – its own blob format with metadata about the file – so the system is always aware of each file’s state.

With geospatial distributed version control we create a canonical representation of a feature. The current implementation uses the Hessian binary format with the OGC’s Well Known Binary format to represent the geometries. This canonical representation is separate from the actual data, which stays in the database under version control. All that’s really needed for a datastore to be versioned is for it to produce stable feature IDs. Finding an ideal canonical representation is a path of further research, as the Hessian format is only the second implementation tried thus far.
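The essential property of a canonical representation is determinism: the same logical feature must always serialize to the same bytes, so it always hashes to the same identifier. A toy Python sketch of that property (the real implementation uses the Hessian binary format with WKB geometries; `canonical_form` and the pipe-delimited layout here are purely illustrative):

```python
import hashlib

def canonical_form(attributes: dict) -> bytes:
    """Toy canonical form: attributes serialized in a fixed (sorted)
    order. GeoGit itself uses the Hessian binary format with
    geometries as Well Known Binary; all we need here is that the
    same feature always produces the same bytes."""
    parts = [f"{k}={attributes[k]}" for k in sorted(attributes)]
    return "|".join(parts).encode()

# Two logically identical features hash identically regardless of the
# order their attributes arrive in from the underlying datastore.
f1 = {"name": "Main St", "geom": "LINESTRING (0 0, 1 1)"}
f2 = {"geom": "LINESTRING (0 0, 1 1)", "name": "Main St"}
```

Combined with stable feature IDs from the datastore, this is what lets the system decide cheaply whether a feature has changed between two states.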

When you version a datastore, you build up the canonical representation of the features it holds, and start building the DAG of changes. This process is the same as checking a file into a standard Git repository – the system builds up a canonical representation of the file and starts tracking every change.

Subsequent changes are made in the core datastore but are also recorded in the repository. For that repository we use Berkeley DB Java Edition, a solid, robust, and small key-value store. The canonical representation doesn’t hold any feature names – just the contents of the feature. The structure (attribute types and names) of the feature lives in another data structure. The canonical representation stores only the attribute data, with the geometry represented as Well-Known Binary to take advantage of the widely available tools for reading and comparing it.

The geospatial Git tree holds a featureID instead of a filename, plus a pointer to the featureType. In time, that featureType will be a blob itself, so that we can support the evolution of featureTypes with the same system. For now, the featureType comes from GeoServer’s catalog.

A diff is calculated by simply comparing two commits that represent states at given times. The calculation traverses the two trees and finds the differences. Because everything is keyed by SHA-1 hashes, just as in Git, diffs are quick and easy to calculate. The hashes are aggregated, so an entire tree is summarized by a single hash of its whole contents.
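A minimal sketch of such a tree comparison, in Python (a toy, not GeoGit code; here a “tree” is just a flat mapping from feature ID to content hash, whereas the real implementation also skips whole subtrees whose aggregate hashes match):

```python
def diff_trees(a: dict, b: dict) -> dict:
    """Toy tree diff: trees map feature IDs to content hashes.
    Features whose hashes match are unchanged and skipped; only
    additions, deletions, and modifications are reported."""
    changes = {}
    for fid in a.keys() | b.keys():
        if a.get(fid) != b.get(fid):
            changes[fid] = (a.get(fid), b.get(fid))  # (old, new)
    return changes

old = {"roads.1": "aaa", "roads.2": "bbb"}
new = {"roads.1": "aaa", "roads.2": "ccc", "roads.3": "ddd"}
```

Here `diff_trees(old, new)` reports `roads.2` as modified and `roads.3` as added, while `roads.1` is skipped because its hash is unchanged.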

Although the current implementation of the repository is Berkeley DB Java Edition, many other kvp repositories are possible (for example, S3 on Amazon for cloud repositories). The repository doesn’t have any real awareness of what it stores; it uses the SHA-1 hashes established by versioning for the keys and accesses whatever content was stored there. Both the canonical representation and the repository storage are designed to be flexible to allow for implementations of future research.
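Because the repository only needs to get and put opaque bytes by SHA-1 key, its interface is tiny, which is what makes swapping backends practical. A hedged Python sketch of that interface (the `Repository` class and its methods are hypothetical; the production code wraps Berkeley DB Java Edition behind a comparable abstraction):

```python
import hashlib

class Repository:
    """A toy object repository: a plain key-value store keyed by the
    SHA-1 of the stored bytes. Berkeley DB JE, Amazon S3, or any
    other key-value backend could sit behind this same interface,
    since the repository never interprets what it stores."""

    def __init__(self):
        self._kv = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        self._kv[key] = content
        return key

    def get(self, key: str) -> bytes:
        return self._kv[key]

repo = Repository()
key = repo.put(b"feature attribute data")
```

A cloud implementation would only need to replace the dictionary with, say, S3 object calls; everything above this layer is unchanged.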

The data to be served can be obtained directly from PostGIS provided the request is for the spatial database “head” in its current revision. This process takes full advantage of any indexes on the data, and synchronization between the spatial database and the versioned repository occurs with every WFS transaction. This synchronization is conducted by a specially versioned PostGIS datastore in GeoTools to ensure that all edits are executed properly.

6. Current Challenges

As the project stands, editing the versioned PostGIS datastore with an application that modifies the database directly (for example, a desktop GIS like QGIS or ArcGIS) would result in inconsistency. This type of editing would be akin to directly editing a remote repository in Git instead of pushing the changes through a local repository. With a non-versioned backend, syncing requires a full scan, which is computationally expensive. To improve this process we suggest building in versioning at the database level. This would allow a repository sync that does not require a full scan; instead it would request the diffs and incorporate those into the repository.
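The diff-based sync suggested above can be sketched in a few lines of Python. This is an illustration of the idea only; `apply_diff` and the `(old_hash, new_hash)` change format are assumptions, not the project’s API:

```python
def apply_diff(repo_tree: dict, changes: dict) -> dict:
    """Incorporate database-level diffs into the repository's tree
    instead of re-scanning every feature. `changes` maps feature IDs
    to (old_hash, new_hash); a new_hash of None means deletion."""
    updated = dict(repo_tree)
    for fid, (_, new_hash) in changes.items():
        if new_hash is None:
            updated.pop(fid, None)   # feature was deleted
        else:
            updated[fid] = new_hash  # feature was added or modified
    return updated

head = {"roads.1": "aaa"}
# One modification and one addition arrive as diffs from the database:
head = apply_diff(head, {"roads.1": ("aaa", "bbb"), "roads.2": (None, "ccc")})
```

The cost here is proportional to the number of changed features, not the size of the dataset, which is the whole advantage over a full scan.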

To work well with mixed editing environments where not everything goes through WFS-T, it is necessary to find strategies that ensure the repository gets notified. This notification system could be a PostGIS trigger or plugins for other GIS software that edits the data. In the proprietary world, it could be a trigger or routine in Geodatabase or a plugin for ArcGIS for Desktop.

The other major challenge we face in development is that inescapable feature of geospatial data: huge datasets. We must also be sure that we are testing with long revision histories to adequately capture the type of work that is done in the geospatial field. Thankfully, we have access to a large dataset with OpenStreetMap, which should enable us to test for both of these characteristics and help us optimize the code.

7. Protocols Implemented

Currently, the OGC GeoSynchronization Service (GSS) and the versioning constructs of Web Feature Service 2.0 are both implemented in the GeoGit library. These protocols focus on single repositories with linear versions. With no possibility of branching or any sort of diff or rollback, they represent only a small subset of what is possible. In time, our goal is to expose a RESTful service that captures the full potential of distributed versioning.

OpenGeo is not currently implementing the original Versioning WFS (WFS-V) protocol that extends WFS-T (which Andrea Aime created in OpenGeo’s first limited versioning attempt). Instead we plan on creating new User Interfaces with GXP (pushing changes to underlying GeoExt and OpenLayers libraries where appropriate) to take advantage of its many capabilities and build upon them. These client libraries will drive a new web versioning API.

WFS-T is supported, using the “handle” construct to bring in commit messages, so standard WFS transactions from non-version-aware clients will still get versioned. And clients using WFS 2.0 may request different versions. Also of interest is ESRI’s REST GeoServices spec, which was recently submitted to the OGC. It has no versioning constructs yet, though they could be added; even without versioning additions, the standard GeoServices could be used by clients to edit with transparent versioning, though as it stands no commit messages about the changes would appear.

8. Syncing GeoServers

In this first round of development a GeoSynchronization Service DataStore for GeoServer was implemented. This addition enables a synchronization demonstration between two GeoServers, in which an edit made to the “master” GeoServer is reflected in a syncing GeoServer. The syncing GeoServer uses a geospatial DVS datastore, communicating over the GeoSynchronization Service standard. This process needs significantly more work to get to production.

9. Backends

The current implementation has only been tested against PostGIS but is designed to work against any datastore that can give out stable IDs. Any database backend, including Oracle, ArcSDE, SQL Server, and DB2, should be compatible with the current implementation, though additional testing is needed. The versioning database would support edits and versions and would push them to the repository. Users could edit directly against the Oracle backend, with the same edits visible in the versioning view, just as SVN changes can be tracked by a Git repository. If editing entered an inconsistent state, the system could easily correct itself by obtaining the latest diffs instead of requiring a full scan. In time, it should be possible to plug the geospatial version control system on top of an existing Oracle Workspace Manager implementation and import its history, having commits sync automatically.

10. Next Steps

The next step is to develop a versioning front end – a full JavaScript UI that provides an intuitive user experience for working with different versions of geospatial data. Building this front end will drive the creation of a REST API to access the full range of what the geospatial versioning offers. The initial implementation will likely function only with linear revision history without branching or merging; still, it should help establish user interfaces for geospatial diffs and rollbacks.

Once a functioning user interface is created, we’ll need to test the functionality with real-world implementations; the user feedback will help smoke test and evolve the code. Likely additions include mobile interfaces, offline syncing, and more complex workflows, such as approval queues and “cloning” repositories. Our plan is to keep the front-end and backend code bases orthogonal to one another, so we can more easily experiment with them independently, working towards a standard API for the communication.

The final paper of this series, Distributed Versioning, Potential Development, details a potential path forward for future development.

11. Footnotes