Cloud-Native Integration between Snowflake on AWS and Databricks on GCP

Prasanth Mathesh
Cloud Data and Analytics
4 min read · Jun 21, 2021


Enable data transfers between cloud storage services using vendor-neutral platforms

Introduction

Organizations that run multi-cloud setups often end up sharing data between their cloud storage accounts. There are various reasons and methodologies for these data transfers; here we will see how to set up an integration between Snowflake and Databricks using GCS as the storage layer, in a quick and seamless way, without an orchestration tool or serverless functions to monitor the transfers.

Databricks

Databricks was recently launched on the GCP Marketplace. Once a workspace is created (hosted here in the GCP europe-west2 region), Databricks itself creates a GCS bucket for its internal use, with the necessary roles and permissions.

Snowflake

In the Snowflake environment (hosted on AWS), create a storage integration for GCS.

create storage integration gcs_databricks
type = external_stage
storage_provider = gcs
enabled = true
storage_allowed_locations = ('gcs://databricks-7725329731828953/')
;
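
Once the integration is created, Snowflake provisions a GCP service account for it. Describing the integration (a standard follow-up step, sketched below) returns the STORAGE_GCP_SERVICE_ACCOUNT property, which is the account that must be granted access to the bucket:

-- fetch the GCP service account Snowflake uses for this integration
desc storage integration gcs_databricks;
-- note the STORAGE_GCP_SERVICE_ACCOUNT value; it is granted bucket access in a later step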

Create a stage for the GCS integration.

create stage aws_gcp_databricks
storage_integration = gcs_databricks
url = 'gcs://databricks-7725329731828953/'
;

If we list the files in the bucket at this point, it will throw a permission error.
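
For example, listing the stage we just created fails at this stage:

-- this fails until the Snowflake service account is granted access to the bucket
list @aws_gcp_databricks;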

Create a new role in GCS with the necessary permissions, or add the account to an existing role that already has access to the storage. The role created for the Databricks service already has these permissions, so I added the Snowflake service account as an additional member.

Upload a file to the bucket and list it in Snowflake.

Jobs/tasks can be scheduled in the Databricks or Snowflake environment to unload data into GCS or to copy data from GCS, as sketched below.
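
As a minimal Snowflake-side sketch, a native task could run the unload on a schedule; the task name, warehouse, interval, and source table below are assumptions, and json_plain is the file format used with the COPY INTO command later in this article:

-- a rough sketch of a scheduled unload to the GCS stage (names are assumptions)
create task unload_ratings_to_gcs
warehouse = compute_wh
schedule = '60 minute'
as
copy into '@aws_gcp_databricks/rating'
from (select object_construct(*) from ratings_agg)
file_format = (format_name = 'json_plain');

-- tasks are created suspended and must be resumed to start running
alter task unload_ratings_to_gcs resume;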

Let's do a transfer from Snowflake.

List files in GCS.

Files landing in the S3 data lake can be sent to the GCS data lake as-is, or with minimal transformations, using SQL commands and without loading them into the data warehouse. This is one of the basics of the lakehouse architecture.

I have kept the MovieLens dataset in S3. The Snowflake S3 stage (@hist_parquet) is created the same way as the GCS stage we saw earlier in this article.
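
For reference, the S3 stage and file format used in the query below (@hist_parquet and csv_plain) would look roughly like this; the integration name and bucket path are placeholders, not the actual ones:

-- assumed definitions for the objects referenced in the next query
create file format csv_plain
type = csv
skip_header = 1;

create stage hist_parquet
storage_integration = s3_integration -- placeholder integration name
url = 's3://my-datalake-bucket/' -- placeholder bucket
;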

The SQL below reads data from the S3 stage and composes a JSON object.

-- count the number of 4.0 ratings per user, then emit each row as a JSON object
with temp as
(
select $1 as userid, $3 as rating, count($3) as rating_count
from @hist_parquet/test/ratings.csv (file_format => csv_plain) t
where $3 = '4.0'
group by $1, $3
)
select object_construct(*)
from temp
;

The same SQL is used in the COPY INTO command.

copy into '@aws_gcp_databricks/rating' from
(
with temp as
(
select $1 as userid, $3 as rating, count($3) as rating_count
from @hist_parquet/test/ratings.csv (file_format => csv_plain) t
where $3 = '4.0'
group by $1, $3
)
select object_construct(*)
from temp
) file_format = (format_name = 'json_plain')
;

The data is pushed to GCS upon execution of the COPY command.
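
On the Databricks side, the unloaded files can then be read straight from the bucket, for example with Spark SQL; the rating/ prefix matches the COPY INTO target above, and this assumes the workspace service account can read that path:

-- read the JSON files that Snowflake unloaded into the GCS bucket
select *
from json.`gs://databricks-7725329731828953/rating/`;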

Instead of polling, create Google Cloud Pub/Sub notifications for object changes in GCS.

gsutil notification create -t recommend-input -f json gs://databricks-7725329731828953/

After enabling the subscription, a pull can be performed in Google Cloud Pub/Sub.

Databricks has its own file system, DBFS; let's upload a file to /FileStore.

The uploaded file can be accessed from Snowflake using the GCS stage.
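
A quick way to verify this from Snowflake is to filter the stage listing; the FileStore prefix in the pattern is an assumption about how the DBFS path appears inside the bucket:

-- list only the objects uploaded via the Databricks /FileStore path (prefix assumed)
list @aws_gcp_databricks pattern = '.*FileStore.*';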

Conclusion

Integration between SaaS/PaaS services running on cloud platforms such as AWS, GCP, and Azure can be done seamlessly via API and SQL commands, without any orchestration or ETL tools. The data can also be consumed in near real time between the two platforms, since both have a high-bandwidth data path to GCS through their respective service accounts. Additionally, the decoupled storage architecture enables storing raw and processed data in a single place, or with a cheaper storage provider, and avoids the need for additional data transfer services.
