In this blogpost I present my new Python package: PyGEOS. PyGEOS (documentation) is a library that exposes geospatial operations from GEOS into Python. Arrays of geometries can be operated on with almost zero Python interpreter overhead, leading to performance increases of up to 100 times compared to current shapely or geopandas usage.
I aim to keep the API (and performance) of PyGEOS close to that of PostGIS: the de facto standard library of geospatial analysis on large datasets.
This work has been inspired by the efforts of Joris Van den Bossche and Matthew Rocklin on accelerating geopandas. I recommend reading their earlier blogposts on this subject:
Also, thanks Joris for combining efforts on turning PyGEOS into a mature project.
PyGEOS aims to provide vectorized geospatial operations to the Python ecosystem. “vectorized” means that the operations not only work on single geometry objects: they work on N-dimensional arrays in an elementwise fashion. A short example:
import pygeos import numpy as np points = [ pygeos.Geometry("POINT (1 9)"), pygeos.Geometry("POINT (3 5)"), pygeos.Geometry("POINT (7 6)") ] box = pygeos.box(2, 2, 7, 7) pygeos.contains(box, points)
>>> array([False, True, False])
Here, we computed whether the three
points are contained in
box in one go.
GEOS (Geometry Engine - Open Source)
PyGEOS merely wraps GEOS in Python. The core
pygeos.Geometry object contains a C-pointer to the actual
Working with raw C-pointers has a big advantage: there is no overhead from the Python interpreter when accessing the object. This is why PyGEOS is fast. The flip side is that memory leaks and segfaults are very easy to come by.
pygeos.Geometry uses the Python internals to be able to clean the underlying
GEOSGeometry when it is not used anymore. In general, each Python object keeps a reference count, which is increased when you make another reference (
box2 = box), and decreased when the variable goes out of scope.
Whenever the reference count becomes zero, Python (more specifically: the garbage collector) automatically calls the deallocation method on
pygeos.Geometry, which destroys the underlying
GEOSGeometry. The bottomline is that once this is set up, we do need to worry about memory management anymore.
The functions in PyGEOS are Numpy universal functions or ufuncs (further reading). This makes them array-aware and able to do all the broadcasting gymnastics you may know from NumPy. An example:
import pygeos import numpy as np # create a 5 boxes that have increasing x-coordinates poly_x = np.array([pygeos.box(0 + i, 0, 10 + i, 10) for i in range(5)]) # create a 5 boxes that have increasing y-coordinates poly_y = np.array([pygeos.box(0, 0 + i, 10, 10 + i) for i in range(5)]) # intersect all combinations squares = pygeos.intersection(poly_x[:, np.newaxis], poly_y[np.newaxis, :]) # compute the areas pygeos.area(squares)
array([[100., 90., 80., 70., 60.], [ 90., 81., 72., 63., 54.], [ 80., 72., 64., 56., 48.], [ 70., 63., 56., 49., 42.], [ 60., 54., 48., 42., 36.]])
Reprojections using PyPROJ
PyGEOS provides a way to
apply functions to coordinates directly. This can be used to translate, scale, or reproject you geometries. An example using
pyproj to project a point from lat/lon to webmercator:
import pyproj def transform(coords): x, y = pyproj.transform( pyproj.CRS("EPSG:4326"), # WGS84 lat/lon pyproj.CRS("EPSG:3857"), # webmercator *coords.T, ) return np.array([x, y]).T pygeos.apply(pygeos.Geometry("POINT (5.2 52)"), transform)
<pygeos.Geometry POINT (5.79e+05 6.8e+06)>
PyGEOS adds almost no overhead on the underlying GEOS operations. As a demonstration, I compare performance with looping over shapely objects.
from shapely.geometry import Point, Polygon points = [Point(i, i) for i in range(10000)] poly = Polygon([(10, 10), (10, 100), (100, 100), (100, 10)]) %timeit [poly.contains(point) for point in points] %timeit [poly.distance(point) for point in points]
28.9 ms ± 354 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 43.9 ms ± 551 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
points2 = pygeos.points(np.arange(10000), np.arange(10000)) poly2 = pygeos.polygons([(10, 10), (10, 100), (100, 100), (100, 10)]) %timeit pygeos.contains(poly2, points2) %timeit pygeos.distance(poly2, points2)
212 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 10.1 ms ± 86.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As you can see, the performance increase is significant: a factor 4 for
distance and a whopping
contains! Note that this example poses a very lightweight GEOS operation, in which case the Python overhead is the dominating factor. But still, the results show that PyGEOS’s approach to wrapping Geometry objects works. Please also refer to this blogpost to see that the reported performance is the same as the performance on geopandas’ Cython branch.
To summarize, PyGEOS provides performant geospatial operations using the familiar NumPy array interface.
Stuff we want to do in the future:
- let geopandas make use of PyGEOS
- possible integration with shapely
- multithreading: release the GIL during GEOS operations for better performance
- spatial joins using geospatial indexes
- providing “prepared geometries” to increase performance when multiple predicates are applied on the same geometries
- wrapping more GEOS functions
- and more, see https://github.com/pygeos/pygeos/issues