Saving a GeoPandas GeoDataFrame to a remote Cloud storage bucket (with GeoJSON, Parquet, and Feather examples) and then reading it back from the bucket — demonstrating Storage Client IO
This article requires the use of Google Cloud Storage, a Cloud bucket, a Service Account (with a JSON token key), and the correct permissions to write and read. Using the google.cloud.storage SDK, this article demonstrates how to upload a GeoPandas GeoDataFrame in the GeoJSON, Parquet, and Feather formats, and how to read each back.
Imports
import io
import json
import tempfile
import google.cloud.storage
import pandas as pd
import geopandas as gpd
from google.oauth2 import service_account
from shapely import wkb
from datetime import timedelta
Setting Up Storage Client
There are a few ways to connect to remote storage; below, a GCP Service Account and its JSON token key are used to authorize the Client to perform the necessary actions.
Python google.oauth2 service_account Credentials
# Load the Service Account key and build authorized credentials.
with open('/path/to/svcacct/token.json') as f:
    tok = json.load(f)
creds = service_account.Credentials.from_service_account_info(tok)
# Authorize a Storage Client and point it at the target bucket.
store_client = google.cloud.storage.client.Client(credentials=creds)
bucket = google.cloud.storage.bucket.Bucket(client=store_client, name='bucket-name')
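As an aside, if the key file lives on disk anyway, the same credentials can be built in one step. A minimal sketch, not part of the original flow; the token path and bucket name are the same placeholders as above:

# Equivalent shortcut: build the credentials straight from the key file.
creds = service_account.Credentials.from_service_account_file(
    '/path/to/svcacct/token.json')
store_client = google.cloud.storage.client.Client(credentials=creds)
bucket = store_client.bucket('bucket-name')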
Once the Storage Client and Bucket are ready, we can proceed to creating some faux data, uploading it to the remote GCP storage bucket, and reading it back.
Creating Some Faux Data
Using a GeoPandas example, the following code generates a GeoDataFrame that will be exported to different file formats in the subsequent sections.
# https://geopandas.org/en/stable/gallery/create_geopandas_from_pandas.html
df = pd.DataFrame(
    {'City': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota', 'Caracas'],
     'Country': ['Argentina', 'Brazil', 'Chile', 'Colombia', 'Venezuela'],
     'Latitude': [-34.58, -15.78, -33.45, 4.60, 10.48],
     'Longitude': [-58.66, -47.91, -70.66, -74.08, -66.86]})
gdf = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude))
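One optional tweak, not in the original gallery example: the frame above has no CRS set, while the GeoJSON reader will assume WGS84 on the way back in. Declaring the CRS up front keeps the exports consistent:

# Optionally declare the CRS (EPSG:4326, i.e. WGS84 lon/lat) so every
# exported file carries the same coordinate metadata.
gdf = gdf.set_crs('EPSG:4326')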
GeoJSON
GeoJSON, a file format that is core to the SpatioTemporal Asset Catalog (STAC), can be streamed into a BytesIO buffer and uploaded as an in-memory/stream file by the Blob Client.
# Serialize the GeoDataFrame to GeoJSON in an in-memory buffer,
# then rewind the buffer and upload it straight to the bucket.
with io.BytesIO() as buf:
    gdf.to_file(buf, driver='GeoJSON')
    new_blob = google.cloud.storage.blob.Blob(
        name='upload-example/dataframe.json',
        bucket=bucket
    )
    buf.seek(0)
    new_blob.upload_from_file(buf)
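Optionally, the upload call accepts a content type so the object is stored with a sensible Content-Type; the last line of the cell above could instead read as below, where 'application/geo+json' is one common MIME choice rather than a requirement:

new_blob.upload_from_file(buf, content_type='application/geo+json')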
To read the GeoJSON back in, we can stream the Blob directly into a GeoPandas GeoDataFrame using a generated signed URL, which allows the data to be read from the remote store until an access-expiry deadline.
json_blob = google.cloud.storage.blob.Blob(
    name='upload-example/dataframe.json',
    bucket=bucket
)
# A short-lived signed URL lets GeoPandas read straight from the bucket.
url = json_blob.generate_signed_url(timedelta(seconds=5))
_df = gpd.read_file(url, driver='GeoJSON')
_df
The table content is preserved throughout the GeoJSON save/read process, as seen below.
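For a check that goes beyond eyeballing the table, the geometries can be compared directly. A quick sketch; note that if the optional set_crs tweak above was skipped, the read-back frame will carry an EPSG:4326 CRS that the original lacks, and GeoPandas may warn about the mismatch:

# Element-wise geometry comparison between the round-tripped and
# original frames; prints True when every point survived intact.
print(_df.geometry.geom_equals(gdf.geometry).all())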
Parquet and Feather
Unlike GeoJSON, the Parquet and Feather file formats involve a different approach to saving and have an extra step during the reading phase. Leveraging part of the approach behind rasterio's MemoryFile(), we can stage the file we want to save in a tempfile.NamedTemporaryFile and then upload it to the remote storage.
# Write the Parquet file to a named temporary file on disk, then hand
# the still-open file object to the Blob Client for upload. The upload
# reads from the start of tmpf, which we never wrote through directly.
with tempfile.NamedTemporaryFile(prefix='savetest') as tmpf:
    gdf.to_parquet(tmpf.name)
    new_blob = google.cloud.storage.blob.Blob(
        name='upload-example/dataframe.parquet',
        bucket=bucket
    )
    new_blob.upload_from_file(tmpf)
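The Feather export follows the exact same staging pattern. A sketch below, where the blob name simply mirrors the Parquet one, and GeoDataFrame.to_feather (which, like to_parquet, requires pyarrow) does the serialization:

# Same staging trick, but serializing to Feather instead of Parquet.
with tempfile.NamedTemporaryFile(prefix='savetest') as tmpf:
    gdf.to_feather(tmpf.name)
    feather_blob = google.cloud.storage.blob.Blob(
        name='upload-example/dataframe.feather',
        bucket=bucket
    )
    feather_blob.upload_from_file(tmpf)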
To read the Parquet and Feather formats back down, the following cell demonstrates the steps needed. Unlike the GeoJSON, we will need to read the Blob with Pandas, because the GeoPandas read_file is unable to read Parquet or Feather from a remote URL.
_blob = google.cloud.storage.blob.Blob(
    name='upload-example/dataframe.parquet',
    bucket=bucket
)
url = _blob.generate_signed_url(timedelta(seconds=5))
_df = pd.read_parquet(url)
# The geometry column arrives as WKB; decode it back into shapely objects.
_df['geometry'] = _df['geometry'].apply(wkb.loads)
_df
If we were to run the above cell without applying wkb.loads to the geometry column, we would end up with a table whose geometries are still WKB-encoded binary strings. Applying wkb.loads converts each encoded binary string into the correct geometry feature, as shown below.
Finally, convert back into a GeoPandas GeoDataFrame:
gpd.GeoDataFrame(_df, geometry='geometry')
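The Feather blob reads back the same way. A sketch, assuming a reasonably recent pandas whose read_feather accepts a URL, and reusing the blob name from the Feather upload sketch above:

_fblob = google.cloud.storage.blob.Blob(
    name='upload-example/dataframe.feather',
    bucket=bucket
)
url = _fblob.generate_signed_url(timedelta(seconds=5))
_fdf = pd.read_feather(url)
# Feather stores the geometry as WKB too, so decode it the same way.
_fdf['geometry'] = _fdf['geometry'].apply(wkb.loads)
gpd.GeoDataFrame(_fdf, geometry='geometry')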
Other Cloud Providers
This approach also translates to Azure (untested on AWS or DigitalOcean). Azure uses its own Python storage-client API, of course, but the general approach of saving the file and reading it back is similar enough to carry over with minimal changes.
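For a concrete sense of that translation, here is a minimal, untested sketch of the same GeoJSON upload against Azure Blob Storage, assuming the azure-storage-blob package; the connection string, container name, and blob name are placeholders:

# Upload the same in-memory GeoJSON buffer to an Azure blob.
from azure.storage.blob import BlobClient

blob_client = BlobClient.from_connection_string(
    conn_str='<connection-string>',
    container_name='upload-example',
    blob_name='dataframe.json',
)
with io.BytesIO() as buf:
    gdf.to_file(buf, driver='GeoJSON')
    buf.seek(0)
    blob_client.upload_blob(buf, overwrite=True)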