SnowPark - The Scala developer experience on the Snowflake Data Cloud

Prasanth Mathesh
Cloud Data and Analytics
Jun 27, 2021 · 4 min read

--

Running Scala UDFs at scale inside the Snowflake Data Cloud

Overview

Last year Snowflake announced a number of features such as support for unstructured data, Snowtire, SnowPark, and Snowsight. Among these, one of the most anticipated features, SnowPark, is now available for preview. Until recently, Snowflake had no native integration with ML applications and no support for a data quality framework at the extraction/ingestion phase. CI/CD for data pipelines, data quality, and attribute-level lineage were quite tricky and depended on other vendor services. The SnowPark Scala API has opened the gate to addressing these gaps.

SnowPark

The following are some of the key features supported by SnowPark as of today.

  • Execution of Snowflake SQL from program code.
  • Lazy analysis on the server, which reduces the amount of data transferred between the client and the Snowflake database.
  • UDFs created on the client can be pushed to the server and executed there (a minimal sketch follows this list).
  • Custom ML models can be trained in other languages such as Python, exported as a JAR, and invoked from SnowPark.
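
For the UDF push-down in particular, here is a minimal sketch. It assumes a Session named session has already been created (as shown in the Create Session section below) and uses the sample REGION table; the column names follow the TPC-H sample schema.

import com.snowflake.snowpark.functions._

// Register a Scala lambda as a UDF; SnowPark uploads it so it executes inside Snowflake
val regionLabel = udf((name: String) => "REGION_" + name.trim)

// Apply the UDF to a column; the computation is pushed down to the server
session.table("region")
  .select(regionLabel(col("R_NAME")).as("LABELLED_NAME"))
  .show()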

The Scala code used in this article is kept in GitHub. Though SnowPark is at a nascent stage of development, let's look at some of its current features.

Create Session

Create a session and select a table.
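
The session builder takes a map of connection properties. Below is a minimal sketch with placeholder values; the choice of SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 (where the sample REGION and NATION tables live) is an assumption about the setup, and key-pair or SSO authentication can be used instead of a password.

import com.snowflake.snowpark._

// Placeholder connection properties for the SnowPark session
val configMap = Map(
  "URL"       -> "https://<account_identifier>.snowflakecomputing.com",
  "USER"      -> "<user>",
  "PASSWORD"  -> "<password>",
  "ROLE"      -> "SYSADMIN",
  "WAREHOUSE" -> "COMPUTE_WH",
  "DB"        -> "SNOWFLAKE_SAMPLE_DATA",
  "SCHEMA"    -> "TPCH_SF1"
)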

// Build a session from the connection properties and run a simple query
val session = Session.builder.configs(configMap).create
val df = session.sql("select * from region")
df.show()

Join

Read two tables from Snowflake, join them, and save the result into another table.

val reg = session.sql("select * from region")
val nat = session.sql("select * from nation")
val joined_df = reg.join(nat, reg("R_REGIONKEY") === nat("N_REGIONKEY"), "inner")
joined_df.show()
// Persist the joined result as a new table
joined_df.write.mode(SaveMode.Overwrite).saveAsTable("joined_df")

Joined Result

Below is the Snowflake SQL generated by the server for the above operation.

The final table created in Snowflake.
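
To verify the result from SnowPark itself, the saved table can be read back; this is a quick check against the joined_df table created above.

// Read back the table created by saveAsTable and inspect it
val saved = session.table("joined_df")
saved.show()
println(s"Row count: ${saved.count()}")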

Read JSON API

Web apps can natively write data into Snowflake, and this will change the way data pipelines are built. New York City provides an open data API; let's read it as JSON.

Read the JSON and create a data frame.

// Fetch the NYC Open Data feed as a raw JSON string
def getJsonData: String = {
  val url = "https://data.cityofnewyork.us/resource/2npr-yv2b.json"
  scala.io.Source.fromURL(url).mkString
}

// Create a DataFrame holding the raw JSON string
val df = session.createDataFrame(Seq(getJsonData))
df.show()

Once the JSON is read and all cleansing/processing is done, the data frame can be merged/inserted into a table, or the data can be moved into a Snowflake internal stage. Below is the stage created for storing the JSON.
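
As a sketch, the payload can be written to a local file and then uploaded into the internal stage with a PUT. The stage name and local path here are assumptions, and session.file.put assumes the file API is available in your SnowPark version.

import java.nio.file.{Files, Paths}

// Create an internal stage (hypothetical name) to hold the raw JSON files
val dataStageName = "json_stage"
session.sql(s"create stage if not exists $dataStageName").collect()

// Persist the API response locally, then PUT it into the stage
val localPath = "/tmp/nyc_open_data.json"
Files.write(Paths.get(localPath), getJsonData.getBytes("UTF-8"))
session.file.put(s"file://$localPath", s"@$dataStageName")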

Once uploaded, list the files via the session API.

session.sql(s"ls @$dataStageName").show()

File Put

The dependency JARs and exported custom model JARs can be placed into the internal stage and used for UDF operations.
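
Here is a minimal sketch of wiring an uploaded JAR into a UDF. The stage and JAR names are hypothetical; addDependency tells SnowPark to ship the JAR along with the UDF so its classes can be called server-side.

import com.snowflake.snowpark.functions._

// Make a JAR already uploaded to the internal stage available to server-side UDFs
session.addDependency("@json_stage/custom_model.jar")

// UDFs registered after this point can call classes packaged in that JAR;
// the lambda body here is just a stand-in for an actual model scoring call
val scoreUdf = udf((payload: String) => payload.length)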

Conclusion

The SnowPark Scala API paves the way for limitless possibilities, such as enabling organizations to bring their purpose-built solutions into Snowflake data pipelines without much rework or development effort. Additionally, the adoption of custom-built Java/Scala data quality frameworks, ML inference (which can save almost 80 per cent of costs), and real-time sharing of ML predictions can now be seen on the Snowflake data platform.
