Saving DataFrames to Remote Storage with Python

Benjamin · 4 min read · Mar 18, 2023


Saving a GeoPandas GeoDataFrame (with Parquet and Feather examples alongside GeoJSON) to a remote Cloud storage bucket and then reading it back from the bucket, demonstrating Storage Client IO


This article requires the use of Google Cloud Storage, a Cloud bucket, a Service Account (with a JSON token key), and the correct permissions to write and read. Using the google.cloud.storage SDK, this article demonstrates how to upload and read back GeoJSON, Parquet, and Feather files produced from a GeoPandas GeoDataFrame.

Imports

import io
import os
import json
import tempfile

import google.cloud.storage

import pandas as pd
import geopandas as gpd

from google.oauth2 import service_account

from shapely import wkb
from datetime import timedelta

Setting Up Storage Client

There are a few ways to connect to remote storage; the example below uses a GCP Service Account and JSON token key to build google.oauth2 service account Credentials that authorize the google.cloud.storage Client to perform the necessary actions.


# Load the Service Account token and build Credentials from it
with open('/path/to/svcacct/token.json') as f:
    tok = json.load(f)
creds = service_account.Credentials.from_service_account_info(tok)

# Authorize the Storage Client and create a handle to the target bucket
store_client = google.cloud.storage.Client(credentials=creds)
bucket = google.cloud.storage.Bucket(client=store_client, name='bucket-name')
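
As an aside, if Application Default Credentials are already configured in the environment (for example via gcloud auth application-default login), the Client can be constructed without an explicit key file. A minimal sketch, assuming ADC is set up:

# Assumes Application Default Credentials are configured in the environment
store_client = google.cloud.storage.Client()
bucket = store_client.bucket('bucket-name')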

Once the Storage Client is ready, we can proceed to creating some faux data and uploading it to, then reading it from, a remote GCP storage bucket.

Creating Some Faux Data

Using a GeoPandas example, the following code generates a GeoDataFrame that will be exported to different file formats in the subsequent sections.

# https://geopandas.org/en/stable/gallery/create_geopandas_from_pandas.html
df = pd.DataFrame(
    {'City': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota', 'Caracas'],
     'Country': ['Argentina', 'Brazil', 'Chile', 'Colombia', 'Venezuela'],
     'Latitude': [-34.58, -15.78, -33.45, 4.60, 10.48],
     'Longitude': [-58.66, -47.91, -70.66, -74.08, -66.86]})

gdf = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude))
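
Optionally, since these are latitude/longitude pairs, the coordinate reference system can be declared explicitly; a small sketch, assuming the points are WGS84 (EPSG:4326):

# Declare the CRS on the GeoDataFrame (assuming WGS84 lat/lon)
gdf = gdf.set_crs(epsg=4326)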

GeoJSON

GeoJSON, a file format that is core to the SpatioTemporal Asset Catalog (STAC), can be streamed into a BytesIO buffer and uploaded as an in-memory/stream file by the Blob Client.

with io.BytesIO() as buf:
    gdf.to_file(buf, driver='GeoJSON')
    new_blob = google.cloud.storage.blob.Blob(
        name='upload-example/dataframe.json',
        bucket=bucket
    )
    buf.seek(0)  # rewind the buffer so the upload reads from the start
    new_blob.upload_from_file(buf)
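
To confirm the upload landed, the bucket contents can be listed by prefix; a quick sanity check using the same Client:

# List blobs under the upload prefix to verify the write succeeded
for blob in store_client.list_blobs(bucket, prefix='upload-example/'):
    print(blob.name)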

To read the GeoJSON back in, we can stream the Blob directly into a GeoPandas GeoDataFrame using a generated signed URL, which allows the data to be read from the remote store until a short access expiry deadline passes.

json_blob = google.cloud.storage.blob.Blob(
    name='upload-example/dataframe.json',
    bucket=bucket
)

url = json_blob.generate_signed_url(timedelta(seconds=5))
_df = gpd.read_file(url, driver='GeoJSON')
_df

The table content is preserved throughout the GeoJSON save/read round trip, as seen below.
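
If a signed URL is undesirable (for instance, to avoid minting even a short-lived public link), the Blob can instead be downloaded into memory and handed to GeoPandas directly; a sketch of that alternative:

# Download the Blob into memory and read it without a signed URL
data = json_blob.download_as_bytes()
_df = gpd.read_file(io.BytesIO(data), driver='GeoJSON')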

Parquet and Feather

Unlike GeoJSON, the Parquet and Feather file formats require a different approach to saving and have an extra step during the reading phase. Borrowing part of the approach of rasterio's MemoryFile(), we can stage the file we want to save in a tempfile.NamedTemporaryFile and then upload it to the remote storage.

with tempfile.NamedTemporaryFile(prefix='savetest') as tmpf:
    gdf.to_parquet(tmpf.name)
    new_blob = google.cloud.storage.blob.Blob(
        name='upload-example/dataframe.parquet',
        bucket=bucket
    )
    new_blob.upload_from_file(tmpf)
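
Feather follows the same staging pattern; a sketch that swaps in to_feather and a hypothetical .feather blob name:

# Same temporary-file staging, but writing Feather instead of Parquet
with tempfile.NamedTemporaryFile(prefix='savetest') as tmpf:
    gdf.to_feather(tmpf.name)
    new_blob = google.cloud.storage.blob.Blob(
        name='upload-example/dataframe.feather',
        bucket=bucket
    )
    new_blob.upload_from_file(tmpf)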

The following cell demonstrates the steps needed to read the Parquet and Feather formats back down. Unlike the GeoJSON case, we need to read the Blob with Pandas, because GeoPandas' read_file is unable to read Parquet or Feather from a remote URL.

_blob = google.cloud.storage.blob.Blob(
    name='upload-example/dataframe.parquet',
    bucket=bucket
)

url = _blob.generate_signed_url(timedelta(seconds=5))
_df = pd.read_parquet(url)
_df['geometry'] = _df['geometry'].apply(wkb.loads)
_df

If we were to run the above cell without mapping the geometry column through wkb.loads, we would end up with a table where the geometries are still encoded.

Applying wkb.loads to the geometry column will convert the encoded binary string to the correct geometry feature, as shown below.
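
To illustrate what wkb.loads is undoing, here is a single point round-tripped through WKB in isolation; a minimal sketch using shapely directly:

# Encode a point to WKB bytes and decode it back to a geometry
from shapely.geometry import Point
raw = wkb.dumps(Point(-58.66, -34.58))
print(wkb.loads(raw))  # POINT (-58.66 -34.58)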

Finally, convert back into a GeoPandas GeoDataFrame:

gpd.GeoDataFrame(_df, geometry='geometry')
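
Reading the Feather copy back follows the same shape; since pd.read_feather accepts a path or file-like object, one option is to download the Blob bytes into a buffer first. A sketch, assuming the .feather blob uploaded earlier:

# Download the Feather blob and decode its geometry column the same way
feather_blob = google.cloud.storage.blob.Blob(
    name='upload-example/dataframe.feather',
    bucket=bucket
)
_fdf = pd.read_feather(io.BytesIO(feather_blob.download_as_bytes()))
_fdf['geometry'] = _fdf['geometry'].apply(wkb.loads)
gpd.GeoDataFrame(_fdf, geometry='geometry')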

Other Cloud Providers

This approach also translates to Azure (it is untested on AWS and DigitalOcean). Azure uses its own Python storage Client API, of course, but the general approach to saving a file and reading it back is similar enough that it carries over with minimal changes.
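
For instance, a rough Azure analogue of the GeoJSON upload, sketched with the azure-storage-blob package (the container name and connection string below are placeholders, and this sketch is untested):

# Hypothetical Azure equivalent using azure-storage-blob
from azure.storage.blob import BlobServiceClient

svc = BlobServiceClient.from_connection_string('<connection-string>')
container = svc.get_container_client('container-name')

with io.BytesIO() as buf:
    gdf.to_file(buf, driver='GeoJSON')
    buf.seek(0)  # rewind before handing the buffer to the upload
    container.upload_blob(name='upload-example/dataframe.json', data=buf)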
