On a Friday evening a few months ago I was having an end-of-week-blow-off-steam conversation with a coworker, both of us thinking out loud about geospatial data and cloud tech.
About 1.5 beers into the conversation, we were spitballing architectures using some of the excitingly sensible standards gaining adoption in geospatial right now. Using STAC and COG for discovery and efficient access of raster data is a great start – and a pattern we’ve already proven for our customers – but what about vector?
COG allows GeoTIFFs (even very large ones) to be efficiently accessed over a network using a combination of two tricks. First, the file structure of a COG is arranged in an efficient and predictable way. This then allows a client to use plain old HTTP range requests (a `Range` header on an ordinary GET) to read just the parts of the image it needs. The result is that a simple web browser can easily work with 4GB images – with no server, container, or lambda involved! But where’s the vector equivalent?
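The range-request trick is simple enough to sketch in a few lines. This is an illustrative simulation, not real COG code: the "server" is just an in-memory byte buffer, but a real client would send the exact same `Range` header over the network.

```python
# Minimal sketch of the HTTP range-request trick that COG relies on.
# The server side is simulated with an in-memory buffer so the example
# runs without a network; the header format is the real one.

def build_range_header(start: int, length: int) -> dict:
    """Build the Range header asking for bytes [start, start + length)."""
    return {"Range": f"bytes={start}-{start + length - 1}"}

def serve_range(data: bytes, header: dict) -> bytes:
    """Simulate a server answering with 206 Partial Content."""
    spec = header["Range"].removeprefix("bytes=")
    start, end = (int(p) for p in spec.split("-"))
    return data[start:end + 1]

# A client that knows the file layout reads just the bytes it needs:
tiff = bytes(range(256)) * 16          # stand-in for a large GeoTIFF
header = build_range_header(1024, 8)   # 8 bytes at offset 1024
chunk = serve_range(tiff, header)
print(header["Range"], len(chunk))     # bytes=1024-1031 8
```

Because the client only ever asks for byte ranges, any static file host (S3, a plain web server) works – no geospatial software runs server-side.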
Is this even a problem?
Traditional vector formats are opaque, requiring that a client download the entire file or files (looking at you, shapefile) before parsing. Services like OGC’s Web Feature Service are great, but require infrastructure – a lambda and database at a minimum. I love vector tiles – but they are tricky to parse and contain different content at different levels.
After listening to horror stories about tiny budgets and bandwidth at this year’s FOSS4G in Dar es Salaam, it became clear to me that we can’t expect organizations to bear the cost of providing fast data infrastructure any more than we should expect consumers to have fast connections and expensive tools. Those who can benefit most from public data are least equipped to consume it, and I believe we can do better.
So how to do the same for vector data? What if we could do the same thing COG does – but apply it to vector? Can we make a format that’s web friendly and requires no infrastructure, and still provide efficient access to large vector datasets?
Turns out, you can.
GeoJSON is the go-to format for web mapping. It’s simpler than GML, easy to parse, human readable, an IETF standard, and supported by nearly every geo library I’ve ever seen – making it a great starting point.
Like all JSON, a GeoJSON file is parsed as a single document, usually containing one feature collection that itself holds all of the feature data. We can’t download just part of a GeoJSON file without making it invalid – and even if we could, we wouldn’t know which part of the file we need.
The answer is to take the same approach as with COG: reorganize it, and store self-descriptive information in a header.
The format of the file is very simple and differs from standard GeoJSON in only two ways. First, it starts with a header (again, in JSON format) that takes up the first 10 KB of the file. Second, the file contains multiple top-level JSON objects.
The header has some simple discovery metadata, and then a list of the collections and their location in the file. The rest of the file is made up of valid GeoJSON collections.
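The read path falls out of that layout directly: grab the fixed-size header, look up a collection's byte location, then fetch just those bytes. Here is a minimal sketch using an in-memory file; the field names (`collections`, `start`, `length`) are my illustration, not taken from the spec, and over HTTP the two seek-and-read pairs would instead be two range requests.

```python
import io
import json

HEADER_SIZE = 10 * 1024  # the first 10 KB of the file is reserved for the header

def write_cogeojson(collections: dict) -> bytes:
    """Build a toy Cloud-Optimized GeoJSON file: a 10 KB JSON header,
    then each feature collection as its own top-level JSON object.
    The header field names here are illustrative, not from the spec."""
    body, index = b"", {}
    for name, fc in collections.items():
        blob = json.dumps(fc).encode()
        index[name] = {"start": HEADER_SIZE + len(body), "length": len(blob)}
        body += blob
    header = json.dumps({"collections": index}).encode().ljust(HEADER_SIZE)
    return header + body

def read_collection(f, name: str) -> dict:
    """Read the header, then seek to and read a single collection.
    Over HTTP, each seek-and-read pair would be one range request."""
    f.seek(0)
    header = json.loads(f.read(HEADER_SIZE))
    loc = header["collections"][name]
    f.seek(loc["start"])
    return json.loads(f.read(loc["length"]))

fc = {"type": "FeatureCollection", "features": []}
data = write_cogeojson({"tile-0-0": fc})
print(read_collection(io.BytesIO(data), "tile-0-0") == fc)  # True
```

Padding the header to a fixed size means a client always knows what to request first, without any negotiation.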
One thing that is intentionally left out is how to subdivide the data – that is entirely up to the data publisher. Sometimes tiling makes sense, or R-trees, or time ranges, or feature types, or whatever. The person publishing the data is free to choose whatever they think makes the most sense.
This demo uses freely available cadastral data containing about 100,000 features in a 168MB GeoJSON file.
This file was converted to Cloud-Optimized GeoJSON by splitting the data’s bounding box into an eight by eight grid and sorting features into those bins before writing the file back out. One interesting side effect was shrinking the file by about 11MB due to stripping out extra whitespace.
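The binning step from the demo can be sketched roughly like this. It is a simplification under stated assumptions: features are assigned by a single representative coordinate, whereas real code would use each feature's own bounding box and decide how to handle features spanning cells.

```python
# Sketch of the 8x8 binning used in the demo: split the dataset's
# bounding box into a grid and sort features into those cells.
# Simplifying assumption: each feature is placed by its first coordinate.

def grid_cell(x, y, bbox, n=8):
    """Return the (col, row) cell of the n-by-n grid over bbox containing (x, y)."""
    minx, miny, maxx, maxy = bbox
    col = min(int((x - minx) / (maxx - minx) * n), n - 1)
    row = min(int((y - miny) / (maxy - miny) * n), n - 1)
    return col, row

def bin_features(features, bbox, n=8):
    """Group point-like GeoJSON features into grid cells keyed by (col, row)."""
    bins = {}
    for feat in features:
        x, y = feat["geometry"]["coordinates"][:2]
        bins.setdefault(grid_cell(x, y, bbox, n), []).append(feat)
    return bins

pts = [{"geometry": {"type": "Point", "coordinates": [i, i]}} for i in range(8)]
bins = bin_features(pts, bbox=(0, 0, 8, 8))
print(len(bins))  # 8 – each diagonal point lands in its own cell
```

Each bin then becomes one feature collection in the output file, and its entry in the header records where those bytes live.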
Both the original and optimized versions were uploaded to an S3 bucket, made public, and accessed with a simple OpenLayers app. The app loads the header, shows collection boundaries, and lets a user click to load the associated data. To make things more representative of disadvantaged users, Chrome was throttled to 3G speeds.
The results are pretty impressive – loading the header from the optimized file takes 656 milliseconds vs 15 minutes for regular GeoJSON, or about 1378 times faster! Picking a tile and loading a fairly dense collection (2.9MB) takes 16.7 seconds, or about 53 times faster than loading the entire dataset.
These performance gains are impressive in the context of a single request but are even more so when we calculate the cost savings for both data provider and consumer.
Take a test drive and see for yourself!
Please check out the first cut of the spec and growing documentation in this GitHub repo. Issues, suggestions, comments and pull requests are all welcome!