Tools for working with open building datasets
- Free software: Apache Software License 2.0
- Documentation: https://opengeos.github.io/open-buildings
- Creator: Chris Holmes
This repo is intended to be a set of useful scripts for working with open building datasets, initially Google's Open Buildings dataset and Overture's building dataset, specifically to help translate them into Cloud Native Geospatial formats and then use those. The outputs will live on https://beta.source.coop, with one repository for Google and one for Overture, so most people can just make use of those directly.
The main operation that most people will be interested in is the 'get-buildings' command, which lets you supply a GeoJSON file to a command-line interface and downloads all buildings in the area supplied, output in common GIS formats (GeoPackage, FlatGeobuf, Shapefile, GeoJSON and GeoParquet).
The rest of the CLIs and scripts are intended to show the process of transforming the data, and they've since expanded into a way to benchmark performance.
This is basically my first Python project, and certainly my first open source one. It is only possible due to ChatGPT, as I'm not a Python programmer, and not a great programmer in general (I coded professionally for about 2 years, then shifted to doing lots of other stuff). So it's likely not great code, but it's been fun to iterate on it and it seems like it might be useful to others. And contributions are welcome! I'm working on making the issue tracker accessible, so anyone who wants to try out some open source coding can jump in.
Install with pip:
This should add a CLI that you can then use. If it's working then:
Should print out a help message. You should then be able to run the CLI (download 1.json):
You can also stream the JSON in directly in one line:
The main tool for most people is get_buildings. It queries complete global building datasets for the GeoJSON provided, outputting results in common geospatial formats. The full options and explanation can be found in the command's --help output.
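If you don't have a GeoJSON file for your area of interest yet, one can be built from a bounding box with a few lines of Python. This is only a minimal sketch: the function name `bbox_to_geojson` and the example coordinates are made up for illustration, and whether get_buildings expects a bare geometry, a Feature or a FeatureCollection should be checked against its help output.

```python
import json


def bbox_to_geojson(min_lon, min_lat, max_lon, max_lat):
    """Return a GeoJSON Feature holding a rectangular Polygon for the bounding box."""
    # Exterior ring in counterclockwise order, closed by repeating the first point.
    ring = [
        [min_lon, min_lat],
        [max_lon, min_lat],
        [max_lon, max_lat],
        [min_lon, max_lat],
        [min_lon, min_lat],
    ]
    return {
        "type": "Feature",
        "properties": {},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }


# Hypothetical area of interest; swap in your own coordinates (longitude, latitude).
aoi = bbox_to_geojson(30.05, -1.96, 30.07, -1.94)
aoi_json = json.dumps(aoi)
print(aoi_json)
```

The printed JSON can be saved to a file and passed to the command, or piped in if you are streaming the input.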
Google Building processing
In the Google portion of the CLI there are two functions:
- convert takes as input either a single CSV file or a directory of CSV files, downloaded locally from the Google Buildings dataset. It can write out GeoParquet, FlatGeobuf, GeoPackage and Shapefile, and can process the data using DuckDB, GeoPandas or OGR.
- benchmark runs the convert command against one or more different formats, and one or more different processes, and reports how long each took.
A sample output for benchmark, run on 219_buildings.csv, a 101 MB CSV file, is:
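The idea behind benchmark (run every process against every format and time each run) can be sketched roughly as below. This is not the project's actual code: `run_benchmark`, `fake_convert` and the file name are hypothetical stand-ins, and a real run would call the actual conversion for each process.

```python
import time


def run_benchmark(converters, formats, source):
    """Time every (process, format) pair and return {(process, format): seconds}."""
    results = {}
    for process, convert in converters.items():
        for fmt in formats:
            start = time.perf_counter()
            convert(source, fmt)
            results[(process, fmt)] = time.perf_counter() - start
    return results


def fake_convert(source, fmt):
    """Hypothetical stand-in for a real conversion; it just sleeps briefly."""
    time.sleep(0.01)


timings = run_benchmark(
    {"duckdb": fake_convert, "pandas": fake_convert},
    ["parquet", "fgb"],
    "buildings.csv",  # hypothetical input file name
)
for (process, fmt), seconds in sorted(timings.items()):
    print(f"{process:8} {fmt:8} {seconds:.2f}s")
```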
The full options can be found with --help after each command, and I'll put them here for reference:
Warning: --no-gpq doesn't actually work right now; see https://github.com/opengeos/open-buildings/issues/4 to track. It is just always set to true, so DuckDB times with Parquet will be inflated (you can change it in the Python code via a global variable). Note also that the ogr process does not work with --skip-split-multis, but will just report very minimal times since it skips doing anything; see https://github.com/opengeos/open-buildings/issues/5 to track.
I'm mostly focused on GeoParquet and FlatGeobuf, as good cloud-native geo formats. I included GeoPackage and Shapefile mostly for benchmarking purposes. GeoPackage I think is a good option for Esri and other more legacy software that is slow to adopt new formats. Shapefile is total crap for this use case: it fails on files bigger than 4 gigabytes, and lots of the source S2 Google Building CSVs are bigger, so it's not useful for translating. The truncation of field names is also annoying, since the CSV file didn't try to make short names (nor should it, the limit is silly).
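To make the field name complaint concrete, here is a small sketch of what a 10-character limit (the limit of the dBASE attribute table that Shapefile uses) does to longer column names. The column names are an assumption based on the Google Open Buildings CSV layout, and plain slicing is a simplification: real writers may rename fields differently, for example to avoid collisions.

```python
def truncate_field_names(names, limit=10):
    """Map each column name to its first `limit` characters."""
    return {name: name[:limit] for name in names}


# Assumed column names; check them against your own CSV header.
columns = ["latitude", "longitude", "area_in_meters", "confidence", "full_plus_code"]
truncated = truncate_field_names(columns)
for original, short in truncated.items():
    if original != short:
        print(f"{original} -> {short}")
```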
GeoPackage is particularly slow with DuckDB, it's likely got a bit of a bug in it. But it works well with Pandas and OGR.
When I was processing V2 of the Google Buildings dataset I did most of the initial work with GeoPandas, which was awesome, and has the best GeoParquet implementation. But the size of the data made its all-in-memory processing untenable. I ended up using PostGIS a decent bit, but near the end of that process I discovered DuckDB, and was blown away by its speed and ability to manage memory well. So for this tool I was mostly focused on those two.
Note also that DuckDB's fgb, gpkg and shp outputs currently don't include projection information, so if you want to use the output you'd need to run ogr2ogr on it. It sounds like that may get fixed pretty soon, so I'm not going to add a step that includes the ogr conversion.
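If you do need projection information in the meantime, one way to add it is ogr2ogr's -a_srs option, which assigns a coordinate reference system to the output without reprojecting. The sketch below only builds the command; the helper name, the file names and the EPSG:4326 default are assumptions (the default assumes the data is in WGS 84 longitude/latitude, so check that against your source).

```python
import shlex


def build_assign_srs_command(src, dst, srs="EPSG:4326"):
    """Build an ogr2ogr command that copies src to dst and assigns a CRS to it."""
    # ogr2ogr takes the destination before the source.
    return ["ogr2ogr", "-a_srs", srs, dst, src]


# Hypothetical file names for illustration.
cmd = build_assign_srs_command("buildings.fgb", "buildings_with_crs.fgb")
print(shlex.join(cmd))
# To actually run it (requires GDAL to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```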
OGR was added later, and does not yet do the key step of splitting multi-polygons, since it's just using ogr2ogr as a sub-process and I've yet to find a way to do that from the CLI (though knowing GDAL/OGR there probably is one, so please let me know). To run the benchmark with it you need to use --skip-split-multis or else the times on it will be 0 (except for Shapefile, since it doesn't differentiate between multipolygons and regular polygons). I hope to add that functionality and get it on par, which may mean using Fiona. But it seems like that may affect performance, since Fiona doesn't use the GDAL/OGR column-oriented API.
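For readers unfamiliar with the multi-polygon splitting step, here is what it means on plain GeoJSON features. It is an illustration of the concept only, not how the DuckDB or GeoPandas processes in this repo do it, and the function name is made up.

```python
def split_multipolygons(features):
    """Yield features, replacing each MultiPolygon with one Polygon feature per part."""
    for feature in features:
        geometry = feature.get("geometry") or {}
        if geometry.get("type") == "MultiPolygon":
            for polygon_coords in geometry["coordinates"]:
                yield {
                    "type": "Feature",
                    # Each part gets its own copy of the original attributes.
                    "properties": dict(feature.get("properties") or {}),
                    "geometry": {"type": "Polygon", "coordinates": polygon_coords},
                }
        else:
            yield feature


square_a = [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]
square_b = [[[2, 2], [3, 2], [3, 3], [2, 3], [2, 2]]]
sample = [
    {"type": "Feature", "properties": {"id": 1},
     "geometry": {"type": "MultiPolygon", "coordinates": [square_a, square_b]}},
    {"type": "Feature", "properties": {"id": 2},
     "geometry": {"type": "Polygon", "coordinates": square_a}},
]
split = list(split_multipolygons(sample))
print(len(split), "features after splitting")
```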
There are 3 options that you can set as global variables in the Python code, but that are not yet CLI options. These are:
- RUN_GPQ_CONVERSION: whether GeoParquet output from DuckDB runs gpq on the DuckDB Parquet output by default, which adds a good chunk of processing time. This makes the DuckDB processing slower than it would be if DuckDB natively wrote GeoParquet metadata, which I believe is on their roadmap, so that will likely emerge as the fastest benchmark time. You can set RUN_GPQ_CONVERSION to false in the Python code if you want to get a sense of it. In the above benchmark, running Parquet with DuckDB without the gpq conversion at the end resulted in a time of 0.76 seconds.
- PARQUET_COMPRESSION: which compression to use for Parquet encoding. Note that not all processes support all compression options, and the OGR converter currently ignores this option.
- SKIP_DUCK_GPKG: whether to skip the GeoPackage conversion option on DuckDB, since it takes a long time to run.
All contributions are welcome; I love running open source projects. I'm clearly just learning to code Python, so there's no judgement about crappy code. And I'm super happy to learn from others about better code. Feel free to chime in on the issues, make new ones, grab one, or make a PR. There's lots of low-hanging fruit in terms of things to add. And if you're just starting out programming, don't hesitate to ask even basic things in the discussions.