Snowflake Deep Dive
4 min read · Nov 26, 2020
Article by Brijesh Kumar Singh, Big Data & Cloud Senior Solutions Architect
Snowflake — An Introduction
- Snowflake is a columnar MPP data warehousing solution built for the cloud that supports the ANSI SQL most users are already familiar with.
- One of Snowflake's main advantages is the separation of storage and compute, which allows numerous concurrent users and processes to be scaled out on demand.
- Enterprise architects find Snowflake a unique, forward-thinking data warehouse solution designed for the cloud that aligns well with the ONE VIEW data architecture.
Features
- Security and Data Protection: offers enhanced authentication through Multi-Factor Authentication (MFA), federated authentication, Single Sign-On (SSO), and OAuth. All communication between client and server is protected by TLS.
- Standard and Extended SQL Support: Snowflake data warehouse supports most DDL and DML commands of SQL. It also supports advanced DML, transactions, lateral views, stored procedures, etc.
- Connectivity: Snowflake supports an extensive set of client connectors and drivers, such as the Python connector, Spark connector, Node.js driver, and .NET driver (see the connection sketch after this list).
- Data Sharing: You can securely share your data with other Snowflake accounts.
- Database Replication and Failover: Support for replicating and syncing databases across multiple Snowflake accounts in different regions
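As a quick illustration of the connector support mentioned in the Connectivity item above, here is a minimal sketch that opens a session with the Snowflake Python connector (snowflake-connector-python) and runs a simple query. The account identifier, credentials, warehouse, and database names are placeholders, not values from this article.

```python
# Minimal sketch using the Snowflake Python connector (snowflake-connector-python).
# All connection values below are placeholders, not real credentials.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.us-east-1",   # hypothetical account identifier
    user="ANALYST_USER",
    password="********",
    warehouse="ANALYTICS_WH",      # virtual warehouse that will run the query
    database="SALES_DB",
    schema="PUBLIC",
)
try:
    cur = conn.cursor()
    # The same ANSI SQL users already know can be submitted as-is.
    cur.execute("SELECT CURRENT_VERSION()")
    print(cur.fetchone()[0])
finally:
    conn.close()
```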
Traditional Data Warehouse — Shared Disk
- Shared-disk architecture has one storage layer accessible by all cluster nodes
- Easier to manage storage: all processing nodes in the database cluster have direct access to all data, and that data is consistent because all modifications to the data are written to the shared disk
- Performance affected by disk contention
Traditional Data Warehouse — Shared Nothing
- Storage and compute are decentralized
- Shared-nothing architecture distributes the cluster nodes, each with its own disk storage, CPU, and memory
- Data can be partitioned and stored across these cluster nodes, since each node has its own disk storage
- Performance scales as storage and compute are increased
Snowflake Architecture — A Hybrid Model
- Snowflake combines both approaches: a central storage layer that all compute can access (as in shared-disk) with independent MPP compute clusters that process queries (as in shared-nothing)
Snowflake Architecture — Components
Cloud Service Layer:
- a collection of services that coordinate activities across Snowflake, such as authentication, infrastructure and metadata management, query parsing and optimization, and access control
- also runs on compute instances provisioned by Snowflake from the cloud provider
Compute Layer:
- Snowflake uses “Virtual Warehouses” for running queries. Virtual warehouses are MPP compute clusters consisting of multiple nodes with CPU and memory provisioned on the cloud by Snowflake.
- Multiple virtual warehouses can be created in Snowflake for various requirements depending on workloads
- Generally, a virtual warehouse is an independent compute cluster and does not interact with other virtual warehouses.
Virtual Warehouse Advantage:
- can be started or stopped at any time and scaled without impacting running queries
- can be set to auto-suspend and auto-resume, so that a warehouse is suspended after a specified period of inactivity and resumed when a query is submitted
- can also be set to auto-scale between a minimum and maximum cluster size (see the sketch below)
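As a sketch of how these settings might be applied, the example below creates a warehouse with auto-suspend, auto-resume, and a min/max cluster range through the Python connector. The warehouse name, size, and thresholds are illustrative, and the multi-cluster options assume an edition that supports multi-cluster warehouses.

```python
# Sketch: create a virtual warehouse with auto-suspend, auto-resume, and
# auto-scaling between a minimum and maximum cluster count.
# Account, user, warehouse name, size, and thresholds are all illustrative.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345", user="ADMIN_USER", password="********", role="SYSADMIN",
)
conn.cursor().execute("""
    CREATE WAREHOUSE IF NOT EXISTS reporting_wh
      WITH WAREHOUSE_SIZE = 'MEDIUM'
           AUTO_SUSPEND = 300        -- suspend after 300 s of inactivity
           AUTO_RESUME = TRUE        -- resume when a query is submitted
           MIN_CLUSTER_COUNT = 1     -- auto-scale lower bound
           MAX_CLUSTER_COUNT = 3     -- auto-scale upper bound
""")
conn.close()
```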
Storage Layer:
- organizes the data into multiple micro partitions that are internally optimized and compressed
- uses a columnar format
Snowflake vs Redshift — Comparison
Snowflake vs Redshift — Performance Comparison
Single Concurrency:
- Small queries: Redshift performs 20% faster
- Medium queries: Snowflake performs 500% faster
- Large queries: Snowflake performs 800% faster
Multi Concurrency:
- Snowflake outperforms Redshift as workload/concurrency increases
- Redshift concurrency maxes out at 15, while Snowflake does not have any such limit (to be verified)
Scalability:
- Redshift performance does not scale as concurrency increases, while Snowflake performance increases linearly
Database Features:
- Redshift doesn’t support semi-structured data types like Array, Object, and Variant, but Snowflake does (see the sketch after this list)
- Redshift limits the Varchar data type to 65,535 characters. In Snowflake, strings are limited to 16 MB and the default is the maximum string size, so there’s no performance overhead
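To make the semi-structured point concrete, here is a minimal sketch that stores JSON in a VARIANT column and unnests it with LATERAL FLATTEN. The table, column, and connection values are hypothetical placeholders.

```python
# Sketch: store JSON in a VARIANT column and unnest it with LATERAL FLATTEN.
# Table, column, and connection details are illustrative placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345", user="ANALYST_USER", password="********",
    warehouse="ANALYTICS_WH", database="SALES_DB", schema="PUBLIC",
)
cur = conn.cursor()

cur.execute("CREATE TEMPORARY TABLE events (payload VARIANT)")
cur.execute("""
    INSERT INTO events
    SELECT PARSE_JSON('{"name": "u1", "tags": ["a", "b"]}')
""")

# FLATTEN turns each element of the "tags" array into its own row.
cur.execute("""
    SELECT payload:name::string AS user_name,
           t.value::string      AS tag
    FROM events,
         LATERAL FLATTEN(input => payload:tags) t
""")
for row in cur.fetchall():
    print(row)
conn.close()
```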
Snowflake vs Redshift — Performance Comparison Graph
Migration — Redshift to Snowflake
Schema Migration:
- Convert Redshift schemas/tables/views to Snowflake schema objects. This includes dropping Redshift-specific references such as distribution keys and sort keys, and adding a Snowflake cluster key for large tables (see the sketch after this list)
- Execute schema creation script on Snowflake.
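A conversion script can be as simple as stripping the Redshift-specific physical-design clauses from the exported DDL and appending a cluster key. The sketch below is a deliberately simplified illustration using regular expressions; the sample DDL is made up, and a real migration would also need data type mapping and proper DDL parsing.

```python
# Simplified sketch: strip Redshift-specific physical-design clauses from a
# DDL statement and append a Snowflake cluster key for a large table.
# The sample DDL is made up; this only illustrates the idea.
import re

redshift_ddl = """
CREATE TABLE sales.orders (
    order_id   BIGINT,
    order_date DATE,
    amount     DECIMAL(18,2)
)
DISTSTYLE KEY
DISTKEY (order_id)
SORTKEY (order_date);
"""

def to_snowflake_ddl(ddl, cluster_key=None):
    # Drop DISTSTYLE / DISTKEY / SORTKEY clauses, which Snowflake does not use.
    ddl = re.sub(r"DISTSTYLE\s+\w+", "", ddl, flags=re.IGNORECASE)
    ddl = re.sub(r"DISTKEY\s*\([^)]*\)", "", ddl, flags=re.IGNORECASE)
    ddl = re.sub(r"(COMPOUND\s+|INTERLEAVED\s+)?SORTKEY\s*\([^)]*\)", "", ddl,
                 flags=re.IGNORECASE)
    ddl = ddl.strip().rstrip(";").strip()
    # Optionally add a Snowflake cluster key for very large tables.
    if cluster_key:
        ddl += f"\nCLUSTER BY ({cluster_key})"
    return ddl + ";"

print(to_snowflake_ddl(redshift_ddl, cluster_key="order_date"))
```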
Users/Roles Migration:
- Develop a script to read Redshift privileges and create the appropriate GRANT statements for Snowflake (see the sketch after this list)
- Execute script on Snowflake
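One possible shape for such a script is sketched below: it reads the user list from Redshift's pg_user catalog (Redshift exposes the Postgres system catalogs) and emits Snowflake CREATE USER and GRANT ROLE statements. The connection details and the role mapping are hypothetical, and a real script would also enumerate schema- and table-level privileges.

```python
# Simplified sketch: read users from Redshift (which exposes the Postgres
# pg_user catalog) and generate Snowflake CREATE USER / GRANT statements.
# Connection details and the role mapping are hypothetical placeholders.
import psycopg2

redshift = psycopg2.connect(
    host="my-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="analytics", user="admin", password="********",
)

# Hypothetical mapping of Redshift users to Snowflake roles.
ROLE_FOR_USER = {"etl_user": "ETL_ROLE", "analyst1": "ANALYST_ROLE"}

with redshift.cursor() as cur:
    # Skip Redshift's internal rdsdb user.
    cur.execute("SELECT usename FROM pg_user WHERE usename <> 'rdsdb'")
    users = [row[0] for row in cur.fetchall()]

statements = []
for user in users:
    role = ROLE_FOR_USER.get(user, "PUBLIC")
    statements.append(f"CREATE USER IF NOT EXISTS {user} MUST_CHANGE_PASSWORD = TRUE;")
    statements.append(f"GRANT ROLE {role} TO USER {user};")

# Review the generated script before running it against Snowflake.
print("\n".join(statements))
redshift.close()
```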
Data Migration:
- Unload Redshift table data to an S3 bucket using a custom script that loops through all the tables. A naming convention such as
cluster_name/schema_name/table_name/filename.partitionnumber
should be followed
- Create another script to load the data from S3 into Snowflake (both scripts are sketched below)
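A rough sketch of the two scripts is shown below: for each table, a Redshift UNLOAD writes gzipped, pipe-delimited files to S3 under the naming convention above, and a Snowflake COPY INTO loads them into the matching table (assumed to exist already from the schema migration step). The bucket, IAM role, credentials, and table list are placeholders, and in production a Snowflake storage integration or external stage would normally replace inline credentials.

```python
# Rough sketch of the two migration scripts: UNLOAD each Redshift table to S3,
# then COPY the resulting files into the matching Snowflake table.
# Bucket, IAM role, credentials, and table list are placeholders.
import psycopg2
import snowflake.connector

TABLES = ["sales.orders", "sales.customers"]           # tables to migrate
BUCKET = "s3://my-migration-bucket"                     # placeholder bucket
IAM_ROLE = "arn:aws:iam::123456789012:role/unload"      # placeholder role

redshift = psycopg2.connect(
    host="my-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="analytics", user="admin", password="********",
)
redshift.autocommit = True

snow = snowflake.connector.connect(
    account="xy12345", user="LOADER", password="********",
    warehouse="LOAD_WH", database="ANALYTICS", schema="SALES",
)

for table in TABLES:
    schema, name = table.split(".")
    folder = f"{BUCKET}/my_cluster/{schema}/{name}/"
    prefix = f"{folder}{name}."     # follows cluster/schema/table/file naming

    # 1) Unload the table from Redshift to S3 as gzipped, pipe-delimited files.
    with redshift.cursor() as cur:
        cur.execute(f"""
            UNLOAD ('SELECT * FROM {table}')
            TO '{prefix}'
            IAM_ROLE '{IAM_ROLE}'
            DELIMITER '|' GZIP ALLOWOVERWRITE
        """)

    # 2) Load the unloaded files into the corresponding Snowflake table,
    #    which is assumed to exist already from the schema migration step.
    snow.cursor().execute(f"""
        COPY INTO {name}
        FROM '{folder}'
        CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>')
        FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' COMPRESSION = GZIP)
    """)

redshift.close()
snow.close()
```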
User Migration:
- Change the endpoint from the Redshift URL to the Snowflake URL. The user ID and password remain the same, but an account_id must be added to the connection string (see the sketch after this list).
- Queries need not be changed if they are ANSI SQL compliant. Maintain a deck of known syntax differences between Snowflake and Redshift.
- To run any query, the user must choose a warehouse and a database.
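The endpoint change can be illustrated as below: the Redshift connection is addressed by cluster host and port, while the Snowflake connection is addressed by an account identifier and explicitly selects a warehouse and database. All hosts, accounts, credentials, and table names are placeholders.

```python
# Sketch: the same user switching from a Redshift endpoint to a Snowflake
# account identifier. Hosts, account, credentials, and table are placeholders.
import psycopg2
import snowflake.connector

# Before: Redshift is addressed by its cluster endpoint and port.
redshift = psycopg2.connect(
    host="my-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="report_user", password="********",
)

# After: Snowflake is addressed by an account identifier; user and password
# stay the same, but a warehouse and database must be chosen to run queries.
snow = snowflake.connector.connect(
    account="xy12345.us-east-1",          # account_id added to the connection
    user="report_user", password="********",
    warehouse="REPORTING_WH", database="ANALYTICS", schema="SALES",
)

# ANSI-compliant queries run unchanged on either system.
for conn in (redshift, snow):
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM orders")
    print(cur.fetchone()[0])
    conn.close()
```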
ETL Migration:
- COPY commands that load data into Redshift will need to be modified to meet Snowflake's requirements, as sketched below.
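As an example of the kind of change involved, the sketch below shows a hypothetical Redshift COPY for gzipped CSV files in S3 and a roughly equivalent Snowflake COPY INTO with an explicit file format and inline credentials (a stage or storage integration would be the usual production choice). All names, paths, and credentials are placeholders.

```python
# Sketch: a Redshift COPY command and a roughly equivalent Snowflake COPY INTO.
# Table names, S3 paths, roles, and credentials are placeholders.

redshift_copy = """
    COPY sales.orders
    FROM 's3://etl-bucket/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/load'
    CSV GZIP TIMEFORMAT 'auto'
"""

snowflake_copy = """
    COPY INTO sales.orders
    FROM 's3://etl-bucket/orders/'
    CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>')
    FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
    ON_ERROR = 'ABORT_STATEMENT'
"""

# The Snowflake version would be submitted through the connector, e.g.:
#   conn.cursor().execute(snowflake_copy)
```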