Snowflake Deep Dive
4 min read · Nov 26, 2020
Article by Brijesh Kumar Singh, Big Data & Cloud Senior Solutions Architect
Snowflake — An Introduction
- Snowflake is a columnar MPP data warehousing solution built for the cloud that supports the ANSI SQL most users are already familiar with.
- One of Snowflake's main advantages is the separation of storage and compute, which allows numerous concurrent users and processes to be scaled out on demand.
- Enterprise architects find Snowflake a unique, forward-thinking data warehouse solution designed for the cloud that aligns well with the ONE VIEW data architecture.
Features
- Security and Data Protection: offers enhanced authentication through Multi-Factor Authentication (MFA), federated authentication, Single Sign-On (SSO), and OAuth. All communication between client and server is protected by TLS.
- Standard and Extended SQL Support: Snowflake data warehouse supports most DDL and DML commands of SQL. It also supports advanced DML, transactions, lateral views, stored procedures, etc.
- Connectivity: Snowflake supports an extensive set of client connectors and drivers, such as the Python connector, Spark connector, Node.js driver, and .NET driver (see the connection sketch after this list).
- Data Sharing: You can securely share your data with other Snowflake accounts.
- Database Replication and Failover: Support for replicating and syncing databases across multiple Snowflake accounts in different regions
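As a quick illustration of the connector support mentioned in the Connectivity item above, here is a minimal sketch that opens a session with the Snowflake Python connector (snowflake-connector-python) and runs a simple query. The account identifier, credentials, warehouse, and database names are placeholders, not values from this article.

```python
# Minimal sketch using the Snowflake Python connector (snowflake-connector-python).
# All connection values below are placeholders, not real credentials.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.us-east-1",   # hypothetical account identifier
    user="ANALYST_USER",
    password="********",
    warehouse="ANALYTICS_WH",      # virtual warehouse that will run the query
    database="SALES_DB",
    schema="PUBLIC",
)
try:
    cur = conn.cursor()
    # The same ANSI SQL users already know can be submitted as-is.
    cur.execute("SELECT CURRENT_VERSION()")
    print(cur.fetchone()[0])
finally:
    conn.close()
```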
Traditional Data Warehouse — Shared Disk
- Shared-disk architecture has one storage layer accessible by all cluster nodes
- Easier to manage storage: all processing nodes in the database cluster have direct access to all data, and that data is consistent because all modifications to the data are written to the shared disk
- Performance affected by disk contention
Traditional Data Warehouse — Shared Nothing
- Storage and compute are decentralized
- Shared-nothing architecture distributes the cluster nodes, each with its own disk storage, CPU, and memory
- Data can be partitioned and stored across these cluster nodes, since each node has its own disk storage
- Performance scales as storage and compute are increased
Snowflake Architecture — A Hybrid Model
- Snowflake combines both approaches: a central storage layer that all compute can access (as in shared-disk) with independent MPP compute clusters that process queries (as in shared-nothing)
Snowflake Architecture — Components
Cloud Service Layer:
- a collection of services that coordinate activities across Snowflake, such as authentication, infrastructure and metadata management, query parsing and optimization, and access control
- also runs on compute instances provisioned by Snowflake from the cloud provider
Compute Layer:
- Snowflake uses “Virtual Warehouses” for running queries. Virtual warehouses are MPP compute clusters consisting of multiple nodes with CPU and memory provisioned on the cloud by Snowflake.
- Multiple virtual warehouses can be created in Snowflake for various requirements depending on workloads
- Generally, a virtual warehouse is an independent compute cluster and does not interact with other virtual warehouses.
Virtual Warehouse Advantage:
- can be started or stopped at any time and scaled without impacting running queries
- can be set to auto-suspend and auto-resume, so that a warehouse is suspended after a specified period of inactivity and resumed when a query is submitted
- can also be set to auto-scale between a minimum and maximum cluster size (see the sketch below)
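As a sketch of how these settings might be applied, the example below creates a warehouse with auto-suspend, auto-resume, and a min/max cluster range through the Python connector. The warehouse name, size, and thresholds are illustrative, and the multi-cluster options assume an edition that supports multi-cluster warehouses.

```python
# Sketch: create a virtual warehouse with auto-suspend, auto-resume, and
# auto-scaling between a minimum and maximum cluster count.
# Account, user, warehouse name, size, and thresholds are all illustrative.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345", user="ADMIN_USER", password="********", role="SYSADMIN",
)
conn.cursor().execute("""
    CREATE WAREHOUSE IF NOT EXISTS reporting_wh
      WITH WAREHOUSE_SIZE = 'MEDIUM'
           AUTO_SUSPEND = 300        -- suspend after 300 s of inactivity
           AUTO_RESUME = TRUE        -- resume when a query is submitted
           MIN_CLUSTER_COUNT = 1     -- auto-scale lower bound
           MAX_CLUSTER_COUNT = 3     -- auto-scale upper bound
""")
conn.close()
```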
Storage Layer:
- organizes the data into multiple micro partitions that are internally optimized and compressed
- uses a columnar format
Snowflake vs Redshift — Comparison
Snowflake vs Redshift — Performance Comparison
Single Concurrency:
- Small queries: Redshift performs 20% faster
- Medium queries: Snowflake performs 500% faster
- Large queries: Snowflake performs 800% faster
Multi Concurrency:
- Snowflake outperforms Redshift as workload/concurrency increases
- Redshift concurrency maxes out at 15, while Snowflake does not have any such limit (to be verified)
Scalability:
- Redshift performance does not scale as concurrency increases, while Snowflake performance increases linearly
Database Features:
- Redshift doesn’t support semi-structured data types like Array, Object, and Variant, but Snowflake does (see the sketch after this list)
- Redshift limits the Varchar data type to 65,535 characters. In Snowflake, strings are limited to 16 MB and the default is the maximum string size, so there’s no performance overhead
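To make the semi-structured point concrete, here is a minimal sketch that stores JSON in a VARIANT column and unnests it with LATERAL FLATTEN. The table, column, and connection values are hypothetical placeholders.

```python
# Sketch: store JSON in a VARIANT column and unnest it with LATERAL FLATTEN.
# Table, column, and connection details are illustrative placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345", user="ANALYST_USER", password="********",
    warehouse="ANALYTICS_WH", database="SALES_DB", schema="PUBLIC",
)
cur = conn.cursor()

cur.execute("CREATE TEMPORARY TABLE events (payload VARIANT)")
cur.execute("""
    INSERT INTO events
    SELECT PARSE_JSON('{"name": "u1", "tags": ["a", "b"]}')
""")

# FLATTEN turns each element of the "tags" array into its own row.
cur.execute("""
    SELECT payload:name::string AS user_name,
           t.value::string      AS tag
    FROM events,
         LATERAL FLATTEN(input => payload:tags) t
""")
for row in cur.fetchall():
    print(row)
conn.close()
```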
Snowflake vs Redshift — Performance Comparison Graph
Migration — Redshift to Snowflake
Schema Migration:
- Convert Redshift schemas/tables/views to Snowflake schema objects. This includes dropping Redshift-specific references such as distribution keys and sort keys, and adding a Snowflake cluster key for large tables (see the sketch after this list)
- Execute schema creation script on Snowflake.
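A conversion script can be as simple as stripping the Redshift-specific physical-design clauses from the exported DDL and appending a cluster key. The sketch below is a deliberately simplified illustration using regular expressions; the sample DDL is made up, and a real migration would also need data type mapping and proper DDL parsing.

```python
# Simplified sketch: strip Redshift-specific physical-design clauses from a
# DDL statement and append a Snowflake cluster key for a large table.
# The sample DDL is made up; this only illustrates the idea.
import re

redshift_ddl = """
CREATE TABLE sales.orders (
    order_id   BIGINT,
    order_date DATE,
    amount     DECIMAL(18,2)
)
DISTSTYLE KEY
DISTKEY (order_id)
SORTKEY (order_date);
"""

def to_snowflake_ddl(ddl, cluster_key=None):
    # Drop DISTSTYLE / DISTKEY / SORTKEY clauses, which Snowflake does not use.
    ddl = re.sub(r"DISTSTYLE\s+\w+", "", ddl, flags=re.IGNORECASE)
    ddl = re.sub(r"DISTKEY\s*\([^)]*\)", "", ddl, flags=re.IGNORECASE)
    ddl = re.sub(r"(COMPOUND\s+|INTERLEAVED\s+)?SORTKEY\s*\([^)]*\)", "", ddl,
                 flags=re.IGNORECASE)
    ddl = ddl.strip().rstrip(";").strip()
    # Optionally add a Snowflake cluster key for very large tables.
    if cluster_key:
        ddl += f"\nCLUSTER BY ({cluster_key})"
    return ddl + ";"

print(to_snowflake_ddl(redshift_ddl, cluster_key="order_date"))
```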
Users/Roles Migration:
- Develop a script to read Redshift privileges and create the appropriate GRANT statements for Snowflake (see the sketch after this list)
- Execute script on Snowflake
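One possible shape for such a script is sketched below: it reads the user list from Redshift's pg_user catalog (Redshift exposes the Postgres system catalogs) and emits Snowflake CREATE USER and GRANT ROLE statements. The connection details and the role mapping are hypothetical, and a real script would also enumerate schema- and table-level privileges.

```python
# Simplified sketch: read users from Redshift (which exposes the Postgres
# pg_user catalog) and generate Snowflake CREATE USER / GRANT statements.
# Connection details and the role mapping are hypothetical placeholders.
import psycopg2

redshift = psycopg2.connect(
    host="my-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="analytics", user="admin", password="********",
)

# Hypothetical mapping of Redshift users to Snowflake roles.
ROLE_FOR_USER = {"etl_user": "ETL_ROLE", "analyst1": "ANALYST_ROLE"}

with redshift.cursor() as cur:
    # Skip Redshift's internal rdsdb user.
    cur.execute("SELECT usename FROM pg_user WHERE usename <> 'rdsdb'")
    users = [row[0] for row in cur.fetchall()]

statements = []
for user in users:
    role = ROLE_FOR_USER.get(user, "PUBLIC")
    statements.append(f"CREATE USER IF NOT EXISTS {user} MUST_CHANGE_PASSWORD = TRUE;")
    statements.append(f"GRANT ROLE {role} TO USER {user};")

# Review the generated script before running it against Snowflake.
print("\n".join(statements))
redshift.close()
```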
Data Migration:
- Unload Redshift table data to an S3 bucket using a custom script that loops through all the tables. A naming convention such as
cluster_name/schema_name/table_name/filename.partitionnumber
should be followed
- Create another script to load the data from S3 into Snowflake (both scripts are sketched below)
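A rough sketch of the two scripts is shown below: for each table, a Redshift UNLOAD writes gzipped, pipe-delimited files to S3 under the naming convention above, and a Snowflake COPY INTO loads them into the matching table (assumed to exist already from the schema migration step). The bucket, IAM role, credentials, and table list are placeholders, and in production a Snowflake storage integration or external stage would normally replace inline credentials.

```python
# Rough sketch of the two migration scripts: UNLOAD each Redshift table to S3,
# then COPY the resulting files into the matching Snowflake table.
# Bucket, IAM role, credentials, and table list are placeholders.
import psycopg2
import snowflake.connector

TABLES = ["sales.orders", "sales.customers"]           # tables to migrate
BUCKET = "s3://my-migration-bucket"                     # placeholder bucket
IAM_ROLE = "arn:aws:iam::123456789012:role/unload"      # placeholder role

redshift = psycopg2.connect(
    host="my-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="analytics", user="admin", password="********",
)
redshift.autocommit = True

snow = snowflake.connector.connect(
    account="xy12345", user="LOADER", password="********",
    warehouse="LOAD_WH", database="ANALYTICS", schema="SALES",
)

for table in TABLES:
    schema, name = table.split(".")
    folder = f"{BUCKET}/my_cluster/{schema}/{name}/"
    prefix = f"{folder}{name}."     # follows cluster/schema/table/file naming

    # 1) Unload the table from Redshift to S3 as gzipped, pipe-delimited files.
    with redshift.cursor() as cur:
        cur.execute(f"""
            UNLOAD ('SELECT * FROM {table}')
            TO '{prefix}'
            IAM_ROLE '{IAM_ROLE}'
            DELIMITER '|' GZIP ALLOWOVERWRITE
        """)

    # 2) Load the unloaded files into the corresponding Snowflake table,
    #    which is assumed to exist already from the schema migration step.
    snow.cursor().execute(f"""
        COPY INTO {name}
        FROM '{folder}'
        CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>')
        FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' COMPRESSION = GZIP)
    """)

redshift.close()
snow.close()
```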
User Migration:
- Change the endpoint from the Redshift URL to the Snowflake URL. The user ID and password remain the same, but an account_id must be added to the connection string (see the sketch after this list).
- Queries need not be changed if they are ANSI SQL compliant. Maintain a deck of known syntax differences between Snowflake and Redshift.
- To run any query, the user must choose a warehouse and a database.
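The endpoint change can be illustrated as below: the Redshift connection is addressed by cluster host and port, while the Snowflake connection is addressed by an account identifier and explicitly selects a warehouse and database. All hosts, accounts, credentials, and table names are placeholders.

```python
# Sketch: the same user switching from a Redshift endpoint to a Snowflake
# account identifier. Hosts, account, credentials, and table are placeholders.
import psycopg2
import snowflake.connector

# Before: Redshift is addressed by its cluster endpoint and port.
redshift = psycopg2.connect(
    host="my-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="report_user", password="********",
)

# After: Snowflake is addressed by an account identifier; user and password
# stay the same, but a warehouse and database must be chosen to run queries.
snow = snowflake.connector.connect(
    account="xy12345.us-east-1",          # account_id added to the connection
    user="report_user", password="********",
    warehouse="REPORTING_WH", database="ANALYTICS", schema="SALES",
)

# ANSI-compliant queries run unchanged on either system.
for conn in (redshift, snow):
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM orders")
    print(cur.fetchone()[0])
    conn.close()
```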
ETL Migration:
- COPY commands that load data into Redshift will need to be modified to meet Snowflake's requirements, as sketched below.
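As an example of the kind of change involved, the sketch below shows a hypothetical Redshift COPY for gzipped CSV files in S3 and a roughly equivalent Snowflake COPY INTO with an explicit file format and inline credentials (a stage or storage integration would be the usual production choice). All names, paths, and credentials are placeholders.

```python
# Sketch: a Redshift COPY command and a roughly equivalent Snowflake COPY INTO.
# Table names, S3 paths, roles, and credentials are placeholders.

redshift_copy = """
    COPY sales.orders
    FROM 's3://etl-bucket/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/load'
    CSV GZIP TIMEFORMAT 'auto'
"""

snowflake_copy = """
    COPY INTO sales.orders
    FROM 's3://etl-bucket/orders/'
    CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>')
    FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
    ON_ERROR = 'ABORT_STATEMENT'
"""

# The Snowflake version would be submitted through the connector, e.g.:
#   conn.cursor().execute(snowflake_copy)
```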