Distributed Versioning for Geospatial Data (Part 1)

Published in December 2012, this is the first paper in a series of three that propose a new approach to working with spatial data, recommending a shift from treating spatial data simply as data to treating it as programmers treat source code. By treating spatial data as an evolving resource that is collaboratively developed and maintained, we can begin to address challenges to spatial data management, including inefficient workflows between multiple people and organizations, disconnected and low-bandwidth situations, and the burden of metadata creation. We propose that organizations can benefit from crowdsourcing spatial data while retaining control over their information repositories and maintaining authoritative data sources.

“OpenGeo is working to shift how we treat geospatial information, our vision includes geospatial information that is more analogous to source code than traditional data. Viewing, and using, this information as material in a collaborative infrastructure will have profound implications, including data that can be built collaboratively and updated in near-real time. This fundamental shift can alleviate major problems that plague users of geospatial information.”
Chris Holmes
Chairman & Founder, OpenGeo

1. Introduction

Working with spatial data is a painstaking task; data management is so demanding that it becomes an end in itself. Many spend the majority of their time gathering, maintaining, finding, cleaning, and managing data, and very little time gaining insight or finding solutions to problems. At OpenGeo we believe that a shift in how we treat spatial data will have profound implications, enabling data to be built collaboratively and updated in near-real time. This requires shifting away from treating spatial data merely as stores of rows and columns that are optimized to be sliced in different ways, copied extensively, and occasionally updated. Instead, it should be treated in much the same way software developers treat source code: as material that can be collaborated on in a way that constantly tracks its origin and evolution, even when copied and edited by disparate users.

2. Source Code and the Development of “Version Control” Systems

Source code, the raw material behind executable software programs, is most effective when built collaboratively by teams of programmers. To facilitate this, technology has quickly evolved to support the collaborative nature of software development. The most significant of these tools is the Version Control System (VCS), which keeps track of changes made to source code files. This tracking concept is incredibly powerful because it links code changes to the programmers who make them. The “version control” concept has spread from software to become commonplace on many platforms, such as online wikis and web content management systems, and as “track changes” functions in content-creation programs like Microsoft Word and Excel.

Some attempts at version control for spatial data have been made: Esri’s ArcSDE [1] system has some “versioning” extensions, and Oracle Database has Workspace Manager [2], which can be used with spatial data. However, to track changes, both programs add unwieldy extra tables to the database and still require extensive coordination in organizational processes to work effectively. OpenStreetMap (OSM) is built to accommodate versioning, but only offers collaboration around a single, centralized, canonical database.

As the broader software world has continued to innovate, new classes of Distributed Version Control Systems (DVCS) have reached maturity and widespread use. Where earlier version control systems had clients connecting to a central server to coordinate the versioning, as the global collaborative mapping project OpenStreetMap does, a DVCS takes a peer-to-peer approach. In a DVCS, each client’s copy of a repository is a repository in its own right, able to synchronize with any other peer for exchanging patches and further software development. The best known DVCS is Git, developed by Linus Torvalds to manage Linux’s source code. With excellent hosted social collaboration environments, like GitHub, Git’s influence has been growing.
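To make the peer-to-peer idea concrete, the following is a minimal sketch in Python of repositories that can clone from and pull from any peer. It is a toy model under stated assumptions, not Git's or GeoGit's actual data model: the `Repository` class, its methods, and the road data are all hypothetical names introduced for illustration.

```python
import hashlib
import json


class Repository:
    """Toy repository (hypothetical): commits keyed by a content hash."""

    def __init__(self, name):
        self.name = name
        self.commits = {}   # commit id -> commit record
        self.head = None    # id of the latest commit on this copy

    def commit(self, author, message, data):
        """Record a change on top of the current head and return its id."""
        record = {"parent": self.head, "author": author,
                  "message": message, "data": data}
        payload = json.dumps(record, sort_keys=True).encode("utf-8")
        commit_id = hashlib.sha1(payload).hexdigest()
        self.commits[commit_id] = record
        self.head = commit_id
        return commit_id

    def clone(self, name):
        """Return a full, independent copy, including all history."""
        copy = Repository(name)
        copy.commits = dict(self.commits)
        copy.head = self.head
        return copy

    def pull(self, peer):
        """Fetch commits we lack from any peer; fast-forward when possible."""
        missing = [cid for cid in peer.commits if cid not in self.commits]
        for cid in missing:
            self.commits[cid] = peer.commits[cid]
        if self._is_ancestor(self.head, peer.head):
            self.head = peer.head
        return len(missing)

    def _is_ancestor(self, ancestor, descendant):
        """True if `ancestor` is reachable by walking parents from `descendant`."""
        cid = descendant
        while cid is not None:
            if cid == ancestor:
                return True
            cid = self.commits[cid]["parent"]
        return ancestor is None


# Three peers; no copy is privileged over the others.
census = Repository("census")
census.commit("census", "initial roads", {"road/1": "Main St"})
city = census.clone("city")
ngo = census.clone("ngo")

city.commit("city", "add Elm St", {"road/2": "Elm St"})
fetched = ngo.pull(city)  # the NGO pulls directly from the city
```

In this sketch the NGO receives the city's commit without the original repository being involved at all, which is the property that distinguishes a DVCS from a client-server system. Divergent histories would need a merge step, which is omitted here.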

3. Applying Distributed Versioning to Geospatial

OpenGeo’s goal is to adapt the core concepts of distributed versioning to geospatial data. Our initial work in this area has been through a project called GeoGit (to learn more about GeoGit please read the second paper in this series). Instead of treating geospatial information as merely data – something to be collected and stored in a database – we treat it more like source code, which is the carefully tended but generally unobserved foundation for software.

The comparison is apt in many ways. Most people who use software have no interest in gaining access to the source code, and most people who use maps don’t need to engage with the data behind them. But for those who use the geospatial data itself, like those who design or build specialized software, the ability to access and alter the building blocks is essential. With source code, a developer can change the software, adding to or changing its functionality and appearance. With raw spatial data, an expert can fix mistakes, conduct analysis and modeling, and merge a publicly available dataset with the data they have collected themselves.

Moreover, enabling true collaboration for geospatial data can have profound implications for general users of maps, just as open-source collaboration has transformed software. There now exists a vast commons of powerful software that anyone can use and improve. In the geospatial community we have seen significant strides toward accurate and applicable information with projects like OpenStreetMap and the citizen-reporting and crisis-mapping platform Ushahidi. However, moving beyond sourcing information from crowds and towards a data commons collaboratively developed and shared by governments, NGOs, commercial companies, and individuals will require a substantial paradigm shift to a distributed versioning model.

A number of issues that have plagued users of geospatial data can be addressed by this shift: multi-organization collaboration, crowdsourced vs. authoritative data, multi-user collaboration, offline connectivity, low and intermittent bandwidth, and metadata.

4. Multi-Organization Collaboration

What if organizations could not only share their respective data but also directly contribute to the same data store? A major failing of the modern geospatial data system is how often the same information is collected by multiple sources and agencies. The Spatial Data Infrastructure movement attempted to mitigate this duplication by making organizations aware of what others have done, but we think technology can further this effort. With the current state of the art in geospatial data management, such coordination can still only be achieved through a central authority manually integrating several sources of information.

Shifting to a distributed versioning model eases the collaboration effort. Each organization has its own copy of a given layer of data and controls the quality assurance (QA) process on any changes. No organization need cede control to an external authority; instead, each can review change sets that are tracked in each repository and passed between organizations as discrete units flowing up, down, and through the chain.

As a practical example, consider the TIGER dataset managed by the U.S. Census Bureau. Currently, the Census gathers and aggregates data from a number of sources, but once that data is in TIGER, it is out of the hands of the original provider (such as a city that supplied its road network). The Census continually updates its TIGER database, but if the city wants to integrate those changes into the data it originally submitted, it must first resolve any changes that were made by the Census in its process of normalizing the data. Meanwhile, another organization in the same area may start using the road network for its application (such as an NGO tracking its vehicles). The NGO uses the data from its vehicles to update the road network, making changes to the TIGER data that may or may not have already been collected by the city or the Census since the data’s publication.

In the current spatial data environment there is no effective, scalable means for these change sets to flow between organizations. However, if there were a functioning distributed versioning system, each organization could access the others’ repositories and pull in changes through its own QA process. Such a system leads to easier collaboration between various levels of government, NGOs, and commercial data providers, without necessitating any centrally controlled coordination. In the TIGER example, both the city and the NGO could continue to update their copies of the Census data, with the Census able to pull in those changes at will, and the city and NGO able to selectively integrate not only official TIGER updates but also each other’s changes to the dataset.
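The idea of pulling another organization's change sets through a local QA process can be sketched as follows. This is an illustration only: the `ChangeSet` type, the `pull_with_qa` function, the two checks, and the road attributes are hypothetical, and a real QA process would be far richer than two predicates.

```python
from dataclasses import dataclass


@dataclass
class ChangeSet:
    """Hypothetical unit of change passed between organizations."""
    author: str
    feature_id: str
    attributes: dict


def has_name(feature):
    """QA check (assumed rule): every road must carry a non-empty name."""
    return bool(feature.get("name"))


def plausible_speed(feature):
    """QA check (assumed rule): a speed limit, if present, must be 1-130 km/h."""
    speed = feature.get("speed_kph")
    return speed is None or 0 < speed <= 130


def pull_with_qa(layer, incoming, checks):
    """Apply only the change sets whose resulting feature passes every check."""
    accepted, rejected = [], []
    for change in incoming:
        # Build the feature as it would look after the change is applied.
        candidate = {**layer.get(change.feature_id, {}), **change.attributes}
        if all(check(candidate) for check in checks):
            layer[change.feature_id] = candidate
            accepted.append(change)
        else:
            rejected.append(change)
    return accepted, rejected


# The city selectively integrates changes published by the NGO.
city_roads = {"road/1": {"name": "Main St", "speed_kph": 50}}
ngo_changes = [
    ChangeSet("ngo", "road/1", {"speed_kph": 40}),
    ChangeSet("ngo", "road/2", {"name": "Depot Rd"}),
    ChangeSet("ngo", "road/3", {"name": "", "speed_kph": 30}),  # fails has_name
]
accepted, rejected = pull_with_qa(city_roads, ngo_changes,
                                  [has_name, plausible_speed])
```

Because the checks are supplied by the receiving organization, the Census, the city, and the NGO could each run the same pull with different rules, and none of them needs to cede control of its own copy.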

5. Crowdsourced vs. Authoritative Data

One of the central debates in the GIS world is over the use of crowdsourced versus authoritative data. OpenStreetMap, Google MapMaker, and Ushahidi are all produced using crowdsourced systems (also referred to as “Volunteered Geographic Information” or “User-Generated Content”). Authoritative data is generally managed by “official” (usually government or NGO) sources capable of certifying their information.

The two models exhibit disparate workflows. Crowdsourcing allows anyone to edit or input data, making it more responsive and up to date, but possibly more error-prone (or at least perceived to be so). The authoritative model has one centrally controlled dataset that can only be altered by that authority, resulting in a slower process but greater accountability.

Within the geospatial community, these two models for creating and maintaining data are pitted against one another as conflicting strategies. With current data management tools, this characterization is generally accurate, but cutting-edge research centers on attempts to integrate the input of the crowd into the authoritative source.

With distributed versioning tools, however, organizations can easily have the best of both worlds, crowdsourced and authoritative. Instead of just “sourcing” information from a crowd, they can collaborate with not only other organizations but also any individual or group that wishes to improve the data. The authoritative data provider can continue to publish versions of its data that are fully tested and checked on a slower cycle, so that its users can rely on its set always remaining accurate and certified. However, those who use the data can maintain their own complete copies of the repository and apply their own updates. These updates can periodically be pulled back into the authoritative repository, where they can undergo the extensive quality checking process required to certify that set.

One example of where this workflow could prove useful is in the maintenance of the USGS National Hydrography Dataset (NHD). There are already numerous stewards who are actively attempting to help maintain the NHD, but despite this interest and commitment, many become so frustrated with the difficulty of getting their edits approved by the authoritative source that they abandon the attempt completely.

Distributed versioning could solve this problem: these stewards could maintain their own data repositories for their particular areas of interest. They could also apply the same quality assurance checks as the central authority, which would enable their data to be easily pulled into the central database. However, the central authority need not be involved in order for partnering stewards to exchange updates with one another. These inter-steward exchanges would also facilitate quicker updating to the central authority, because there would be fewer conflicts during the merging of the updates the various stewards submit.

The workflow could also be inverted, so that the “crowd” repository functions as the central source, with various authoritative sources drawing from it and performing checks on its data. This workflow is akin to that of transportation authority Portland TriMet’s plan for managing its street centerlines. At present, TriMet’s best and least expensive option is OpenStreetMap, but requiring its employees to edit OSM is inefficient, as the tools involved in editing OSM are cumbersome and the information the transit authority needs is not necessarily relevant to most OSM users. Ideally, with a decentralized versioning system, TriMet would instead clone an OSM repository of its area and apply its changes and updates to it. TriMet would make its changes available to OSM, and could easily get OSM’s updates, but there would be no need to coordinate the entire process [3].

6. Multi-User Collaboration

A distributed versioning system for geospatial data can also make an impact on multi-user data editing environments. This use case is currently addressed primarily by tools like ArcSDE and Oracle Workspace Manager. A drawback to these tools is that they are built on top of traditional Relational Database Management Systems, which are not fundamentally designed for collaboration. This disconnect manifests primarily in difficulties “merging” changes from multiple users, because the changes are tacked on to the data structure as separate elements.

Centralized versioning shows strain when many users edit multiple versions of a database. Reconciliation of changes becomes difficult when users are working with divergent iterations of the data. Because of these difficulties, many systems that set out to use geospatial versioning instead end up implementing more primitive multi-user editing techniques [4].

In distributed versioning systems, the history of revision is built into the core data structures, making merges easier and more accurate. One of the inherent goals of distributed versioning is to enable changes to be easily passed across any repository, so these systems are able to withstand increasingly complex workflows without the merging process becoming prohibitively slow.
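One reason built-in history helps is that it gives a merge a common ancestor to compare against. The sketch below shows a three-way merge of a single feature's attributes, a standard technique in version control systems; it is not taken from GeoGit, and the function name and road attributes are hypothetical. It assumes `None` is never a legitimate attribute value, since `None` is used here to mean "absent".

```python
def three_way_merge(base, ours, theirs):
    """Merge two edited copies of a feature using their common ancestor.

    Returns the merged attributes and a list of keys both sides changed
    in different ways (conflicts). Conflicting keys keep our value until
    someone resolves them by hand.
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:          # both sides agree (or neither changed it)
            value = o
        elif o == b:        # only they changed it: take theirs
            value = t
        elif t == b:        # only we changed it: keep ours
            value = o
        else:               # both changed it differently
            conflicts.append(key)
            value = o
        if value is not None:   # None means the attribute was deleted
            merged[key] = value
    return merged, conflicts


base = {"name": "Main St", "lanes": 2, "surface": "gravel"}
ours = {"name": "Main Street", "lanes": 2, "surface": "paved"}
theirs = {"name": "Main St", "lanes": 4, "surface": "asphalt"}

merged, conflicts = three_way_merge(base, ours, theirs)
```

Here the rename and the lane change combine automatically because each was made on only one side, while the two different surface values are flagged for review. Without the ancestor, a tool comparing only the two edited copies could not tell which side had changed which attribute.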

In addition to the limits of centralized versioning described above, the current options for geospatial versioning are expensive and limited to desktop-based tools. Rebuilding geospatial versioning tools from the ground up – making them open source and optimized for the web – will significantly lower barriers to collaboration. The potential for variants on data gathering tools will also expand with this change; rather than relying exclusively on trained experts to gather geospatial data, an organization could create custom data-collection applications that run on commodity mobile devices powered by Android and iOS. Such innovations would have the potential to lower the total cost of data collection and maintenance.

7. Disconnected Editing

Handling offline updates is another major issue for geospatial data gathering. Users often go into the field to collect data, where network connections are not reliable. Presently, the best method involves copying the data to a mobile device and attempting to track changes. If the offline user makes edits while the central database is also changing that information, syncing the data when back online becomes extremely challenging.

A true distributed versioning system can smooth this process. Shifting the perspective from treating the mobile device instance as a copy of the data to approaching it as an active repository in itself enables changes to be made in a less tentative manner. With change tracking built into the data format, changes can be synced when back online, with each change compared individually to the most current version of data in other repositories.

This peer-to-peer approach would also enable exchange of edits between two offline devices that have not synced up with a “master.” It then becomes possible for a larger desktop server in a field office, even one that is completely off the grid, to sync with mobile devices that can be taken on expeditions to gather more information. The desktop server could sync periodically with another repository that is stored in a data center, and the mobile devices could sync with either the field server or the data-center server. While “smarter” mobile devices could have their own repositories, more “primitive” devices could still keep records of changes and exchange information with the smarter repositories when they sync.
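The topology described above, where devices sync with each other and with a field server before anything reaches a data center, can be sketched with a simple exchange of edit logs. This is a deliberately reduced model: the `Replica` class, the `sync` function, and the well observations are hypothetical, each edit is treated as an independent record, and conflict handling between edits to the same feature is left out.

```python
class Replica:
    """Hypothetical device or server holding a log of tracked edits."""

    def __init__(self, name):
        self.name = name
        self.edits = {}      # edit id -> (feature id, observed value)
        self._counter = 0

    def record(self, feature_id, value):
        """Log a local edit under an id that is unique across replicas."""
        self._counter += 1
        edit_id = f"{self.name}-{self._counter}"
        self.edits[edit_id] = (feature_id, value)
        return edit_id


def sync(a, b):
    """Two-way exchange: afterwards both replicas hold the union of edits.

    Returns how many edits each side received, as (to_a, to_b).
    """
    to_b = {k: v for k, v in a.edits.items() if k not in b.edits}
    to_a = {k: v for k, v in b.edits.items() if k not in a.edits}
    b.edits.update(to_b)
    a.edits.update(to_a)
    return len(to_a), len(to_b)


phone = Replica("phone")
tablet = Replica("tablet")
field_server = Replica("field-server")
data_center = Replica("data-center")

# Two surveyors collect data with no network connection.
phone.record("well/1", "dry")
tablet.record("well/2", "flowing")

sync(phone, tablet)               # device to device, both offline
sync(tablet, field_server)        # back at the off-grid field office
sync(field_server, data_center)   # periodic link to the data center
```

In this run the phone never contacts the field server or the data center, yet its edit arrives there by way of the tablet. Only edits the other side lacks are transferred in each exchange.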

8. Intermittent / Low-Bandwidth Situations

The Open Geospatial Consortium (OGC) protocols for accessing data are bandwidth-heavy and have no built-in mechanisms for caching and pulling changes. However, the change-tracking capabilities of distributed versioning offer a major advantage in intermittent and low-bandwidth situations. As opposed to downloading the dataset in its entirety, it becomes possible to download or upload just the changes. In intermittent-bandwidth situations these updates could be made whenever the user is online and isn’t engaged in higher-priority traffic.

Such an infrastructure also addresses low-bandwidth conditions, as a user could conduct the “check out” of the base data over a higher-bandwidth channel (either connecting temporarily to a faster network or receiving the data on a disc or drive), and then only pull the changes. Executing pulls intermittently could also save bandwidth, because not every intermediate change needs to be sent. The user only needs the delta (difference) in data from what was last sent or received.
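A rough sense of the saving can be had from a small sketch that computes the delta between two snapshots of a layer and applies it on the receiving side. The function names, the dictionary layout of the delta, and the synthetic road features are assumptions made for illustration; JSON length is used only as a stand-in for bytes on the wire.

```python
import json


def compute_delta(old, new):
    """Describe how to turn snapshot `old` into snapshot `new`."""
    return {
        "added": {k: new[k] for k in new.keys() - old.keys()},
        "removed": sorted(old.keys() - new.keys()),
        "changed": {k: new[k] for k in old.keys() & new.keys()
                    if old[k] != new[k]},
    }


def apply_delta(snapshot, delta):
    """Return a new snapshot with the delta applied; the input is untouched."""
    result = dict(snapshot)
    for key in delta["removed"]:
        result.pop(key, None)
    result.update(delta["added"])
    result.update(delta["changed"])
    return result


# Base data checked out once over a fast link (synthetic features).
old = {f"road/{i}": {"name": f"Road {i}", "lanes": 2} for i in range(1000)}

# The publisher's copy after a few edits.
new = dict(old)
new["road/7"] = {"name": "Road 7", "lanes": 4}        # changed
del new["road/9"]                                      # removed
new["road/1000"] = {"name": "Road 1000", "lanes": 2}   # added

delta = compute_delta(old, new)
full_size = len(json.dumps(new))
delta_size = len(json.dumps(delta))
```

With three edits out of a thousand features, `delta_size` is a small fraction of `full_size`. If a feature changes several times between pulls, only its latest state appears in the delta, which is why intermittent pulls need not transmit every intermediate change.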

9. Metadata

For users of geospatial data, the perceived lack of “good metadata” is a constant cause for concern. Data managers are expected to not only produce high-quality geospatial information, but also to describe in detail everything that someone else might want to know about it. Not surprisingly, metadata is often left incomplete, as its creation is a large burden on the data provider. Worse yet, huge swaths of data are never released because some data providers have accepted the maxim that “data with no metadata is worse than no data at all.”

It is worth considering how anomalous this directive is in the digital world. No other common digital platform requires users to fill out long documents of tedious information about the information. Web pages, documents, and photographs all build metadata into the format, and derive additional metadata through use.

With respect to software, source code need not include a detailed document describing who made it, various keywords, and abstracts about the program. No one worries about this lack of metadata, though, because every change made to the source code is tracked by a version control system. Every change essentially has “metadata” built in.

From this base of built-in source code metadata, many useful systems have developed to offer insightful information about the software. Ohloh.net conducts analysis of source code repositories and summarizes the information. Sites like GitHub build a social layer on top of the base tracking, building profiles for developers that enable people to follow the programmers that interest them. Feedback loops allow those same software creators to know instantly whether others are downloading or following what they build. This network creates social pressure to create metadata to explain the software. Many of the systems that track the software also offer a user-friendly infrastructure for getting in touch with the authors of source code, enabling direct dialogue.

All these infrastructures are currently lacking in the realm of geospatial information. There is little incentive to create a labor-intensive metadata document, and no system for receiving feedback about it, and thus it’s difficult to know whether anyone is interested in or benefiting from the metadata.

There is much more to be said about a proper infrastructure for geospatial metadata. However, the core shift of automatically tracking changes at the level of every edit, and building in additional automated tools at the layer level, should enable an infrastructure that lifts the burden from data providers. These built-in tools will track not only edits but also full provenance of the data itself, describing the processes and combinations of original data that went into its creation. Without needing to invest additional time and resources, it will be possible to track the history of any dataset, revealing who did what when, and using that information to build useful and accessible metadata documents.
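As a small illustration of metadata falling out of tracked changes, the sketch below summarizes a commit history into the kind of fields a metadata document usually asks a person to fill in by hand. The history entries are invented sample data, and the `summarize` function and its field names are hypothetical rather than part of any metadata standard.

```python
from collections import Counter

# Invented sample history; a real one would be read from the repository.
history = [
    {"author": "city", "date": "2012-03-01",
     "message": "Import road centerlines"},
    {"author": "census", "date": "2012-05-14",
     "message": "Normalize street names"},
    {"author": "ngo", "date": "2012-09-30",
     "message": "Add tracks from vehicle GPS logs"},
    {"author": "city", "date": "2012-11-02",
     "message": "Close Elm St bridge"},
]


def summarize(history):
    """Derive who, when, and what from tracked changes alone."""
    dates = [entry["date"] for entry in history]  # ISO dates sort as text
    return {
        "contributors": sorted({entry["author"] for entry in history}),
        "first_edit": min(dates),
        "last_edit": max(dates),
        "edits_by_author": dict(Counter(e["author"] for e in history)),
        "lineage": [entry["message"] for entry in history],
    }


summary = summarize(history)
```

Nothing in `summary` required anyone to write a separate document: contributors, currency, and lineage are all by-products of edits that were tracked anyway.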

10. Conclusion

A shift in perspective and approach for spatial data can lay the groundwork for innovation in a number of crucial areas. By developing distributed versioning systems for geospatial information, we enable multiple users and organizations to maintain their own data repositories, and to easily share their updates with one another. Through distributed repositories and built-in change tracking, we simplify editing and updating in disconnected and low-bandwidth situations. Additionally, these tracked changes become the foundation for metadata that can grow into dataset histories and help to establish geospatially oriented social communities.

OpenGeo has initiated a first implementation of this approach by creating a GeoServer library that enables distributed version control. We welcome all input and collaboration. The path to realizing our full vision will be long, but we believe it is worth pursuing, and working together will help us all get there faster.

For more information on distributed versioning for geospatial data, and to find out more about OpenGeo’s implementation plan, please see “Distributed Versioning, Implementation” and “Distributed Versioning, Potential Development,” the next two papers in this series.