# Introducing PyGEOS

In this blogpost I present my new Python package: *PyGEOS*. PyGEOS (documentation) is a library that exposes geospatial operations from GEOS into Python. Arrays of geometries can be operated on with almost zero Python interpreter overhead, leading to performance increases of up to 100 times compared to current *shapely* or *geopandas* usage.

I aim to keep the API (and performance) of PyGEOS close to that of PostGIS: the de facto standard library of geospatial analysis on large datasets.

This work has been inspired by the efforts of Joris Van den Bossche and Matthew Rocklin on accelerating *geopandas*. I recommend reading their earlier blogposts on this subject:

- https://jorisvandenbossche.github.io/blog/2017/09/19/geopandas-cython/
- http://matthewrocklin.com/blog/work/2017/09/21/accelerating-geopandas-1

Also, thanks Joris for combining efforts on turning PyGEOS into a mature project.

## PyGEOS

PyGEOS aims to provide vectorized geospatial operations to the Python ecosystem. “vectorized” means that the operations not only work on single geometry objects: they work on N-dimensional arrays in an elementwise fashion. A short example:

```
import pygeos
import numpy as np
points = [
pygeos.Geometry("POINT (1 9)"),
pygeos.Geometry("POINT (3 5)"),
pygeos.Geometry("POINT (7 6)")
]
box = pygeos.box(2, 2, 7, 7)
pygeos.contains(box, points)
```

```
>>> array([False, True, False])
```

Here, we computed whether the three `points`

are contained in `box`

in one go.

## GEOS (Geometry Engine - Open Source)

PyGEOS merely wraps GEOS in Python. The core `pygeos.Geometry`

object contains a C-pointer to the actual `GEOSGeometry`

struct.

```
box._ptr
```

```
94625728372472
```

Working with raw C-pointers has a big advantage: there is no overhead from the Python interpreter when accessing the object. This is why PyGEOS is fast. The flip side is that memory leaks and segfaults are very easy to come by.

### Garbage collecting

The `pygeos.Geometry`

uses the Python internals to be able to clean the underlying `GEOSGeometry`

when it is not used anymore. In general, each Python object keeps a *reference count*, which is increased when you make another reference (`box2 = box`

), and decreased when the variable goes out of scope.

Whenever the reference count becomes zero, Python (more specifically: the *garbage collector*) automatically calls the deallocation method on `pygeos.Geometry`

, which destroys the underlying `GEOSGeometry`

. The bottomline is that once this is set up, we do need to worry about memory management anymore.

This was the first time I actually did something with the Python C-API. I recommend reading the Python docs or having a look at the PyGEOS source if you’re interested in how this works exactly.

## NumPy ufuncs

The functions in PyGEOS are Numpy *universal functions* or *ufuncs* (further reading). This makes them array-aware and able to do all the broadcasting gymnastics you may know from NumPy. An example:

```
import pygeos
import numpy as np
# create a 5 boxes that have increasing x-coordinates
poly_x = np.array([pygeos.box(0 + i, 0, 10 + i, 10) for i in range(5)])
# create a 5 boxes that have increasing y-coordinates
poly_y = np.array([pygeos.box(0, 0 + i, 10, 10 + i) for i in range(5)])
# intersect all combinations
squares = pygeos.intersection(poly_x[:, np.newaxis], poly_y[np.newaxis, :])
# compute the areas
pygeos.area(squares)
```

```
array([[100., 90., 80., 70., 60.],
[ 90., 81., 72., 63., 54.],
[ 80., 72., 64., 56., 48.],
[ 70., 63., 56., 49., 42.],
[ 60., 54., 48., 42., 36.]])
```

## Reprojections using PyPROJ

PyGEOS provides a way to `apply`

functions to coordinates directly. This can be used to translate, scale, or reproject you geometries. An example using `pyproj`

to project a point from lat/lon to webmercator:

```
import pyproj
def transform(coords):
x, y = pyproj.transform(
pyproj.CRS("EPSG:4326"), # WGS84 lat/lon
pyproj.CRS("EPSG:3857"), # webmercator
*coords.T,
)
return np.array([x, y]).T
pygeos.apply(pygeos.Geometry("POINT (5.2 52)"), transform)
```

```
<pygeos.Geometry POINT (5.79e+05 6.8e+06)>
```

## Performance

PyGEOS adds almost no overhead on the underlying GEOS operations. As a demonstration, I compare performance with looping over *shapely* objects.

```
from shapely.geometry import Point, Polygon
points = [Point(i, i) for i in range(10000)]
poly = Polygon([(10, 10), (10, 100), (100, 100), (100, 10)])
%timeit [poly.contains(point) for point in points]
%timeit [poly.distance(point) for point in points]
```

```
28.9 ms ± 354 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
43.9 ms ± 551 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```
points2 = pygeos.points(np.arange(10000), np.arange(10000))
poly2 = pygeos.polygons([(10, 10), (10, 100), (100, 100), (100, 10)])
%timeit pygeos.contains(poly2, points2)
%timeit pygeos.distance(poly2, points2)
```

```
212 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
10.1 ms ± 86.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

As you can see, the performance increase is significant: a factor 4 for `distance`

and a whopping `136`

for `contains`

! Note that this example poses a very lightweight GEOS operation, in which case the Python overhead is the dominating factor. But still, the results show that PyGEOS’s approach to wrapping Geometry objects works. Please also refer to this blogpost to see that the reported performance is the same as the performance on geopandas’ Cython branch.

## Outlook

To summarize, PyGEOS provides performant geospatial operations using the familiar NumPy array interface.

Stuff we want to do in the future:

- let geopandas make use of PyGEOS
- possible integration with shapely
- multithreading: release the GIL during GEOS operations for better performance
- spatial joins using geospatial indexes
- providing “prepared geometries” to increase performance when multiple predicates are applied on the same geometries
- wrapping more GEOS functions
- and more, see https://github.com/pygeos/pygeos/issues