1

Redshift vs. Snowflake: The Definitive Guide

 2 years ago
source link: https://dzone.com/articles/redshift-vs-snowflake-the-definitive-guide
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

What Is Snowflake?

At its core Snowflake is a data platform. It's not specifically based on any cloud service which means it can run any of the major cloud providers like Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP). As a SaaS (Software-as-a-Service) solution, it helps organizations consolidate data from different sources into a central repository for analytics purposes to help solve Business Intelligence use cases.

Once data is loaded into Snowflake, data scientists, engineers, and analysts can use business logic to transform and model that data in a way that makes sense for their company. With Snowflake users can easily query data using simple SQL. This information is then used to power reports and dashboards so business stakeholders can make key decisions based on relevant insights.

Founded in 2012 and launching officially in 2014, Snowflake currently holds the record for the largest software IPO in history. At the time of writing this article, Snowflake is valued at approximately $78.49 billion. One of the biggest advantages of Snowflake is the fact that it was developed solely for the cloud and that means it comes with none of the operational overhead or conventional baggage that other technologies bring to the table.

What Is Redshift?

Funny enough, AWS Redshift was strategically named as a deliberate dig at Oracle since all of Oracle's branding is red. Like Snowflake, Redshift is also a cloud-based data warehouse designed to tackle Business Intelligence use cases among other things. However, whereas Snowflake is a SaaS offering, Redshift is a PaaS (Platform-as-a-Service) solution. Redshift was actually one of the very first cloud data warehouses to become available on the market, launching officially in 2013. Similar to Snowflake, Redshift lets users query data using SQL for analysis and reporting purposes.

However, Redshift is a native AWS service that is built to work in unison with other AWS technologies. It’s designed to be a central hub to connect other AWS offerings and even query against a data lake directly in some cases. As a PaaS solution AWS comes with more baggage compared to Snowflake because users have to optimize the platform in order to get the most out of the solution.

Architecture

Snowflake

One of the unique factors about Snowflake is the fact that it has completely decoupled storage and compute processing layers, making it extremely easy to scale up or down as needed. Query processing within Snowflake is tackled using massively parallel processing (MPP) to process queries. On top of this, Snowflake is based entirely around ANSI SQL, so the barrier to entry is extremely low.

Compute nodes or clusters within Snowflake are known as individual warehouses. Snowflake stores a portion of every data set locally within these clusters. Micro partitions are then used to optimize all data and compress it into a columnar storage format so that it can be kept within cloud storage. In addition to this, Snowflake automatically tackles everything from file size, compression, structure, metadata, statistics, and nearly every other data object that is not directly visible to the user.

Every Snowflake warehouse leverages an independent compute cluster. On top of this, warehouses do not share resources with other virtual warehouses. This provides a huge advantage, allowing Snowflake to support near-unlimited concurrency for both queries and users. Since Snowflake is cloud-agnostic, it even runs on any of the three major clouds, AWS, GCP, and Azure.

Redshift

As a traditional data warehouse, Redshift is functionally similar to Snowflake in that it leverages MPP but it is structurally very different. Firstly, Redshift can only run on AWS, it’s not cloud-agnostic like Snowflake, so there is immediately less flexibility. Storage and computing capabilities within Redshift are also tightly coupled and it is not innately serverless like Snowflake. However, it is important to note that Redshift does offer a decoupled feature through its RA3 nodes in addition to a serverless option.

Redshift leverages clusters like Snowflake. However, these clusters can be configured with more granularity because Redshift provides different node types. Nodes within a cluster are partitioned into slices, where each slice represents an allocated portion of an individual node’s disk and memory space. All of this is managed by a leader node that handles the communication, compiles code, and assigns a portion of the data to each compute node. Because of this, cluster size can vary drastically.

Storage within Redshift is duplicated from S3 and data can be compressed and stored in a columnar format similar to Snowflake. Querying data in Redshift is done through PostgresSQL 8, which is an older version of SQL that is much more challenging to use compared to Snowflake.

Scalability

Snowflake

When it comes to scalability, Snowflake has a unique auto-scaling and auto suspend feature to start and stop warehouses during idle and busy periods. Resizing individual nodes in Snowflake is not possible. However, Snowflake does give users the ability to resize clusters in a single click and since each warehouse is on its own compute cluster, workloads can be isolated individually to enable unlimited concurrent queries. Certain queries can be inefficient and difficult to provision for in Snowflake since individual nodes cannot be adjusted and the only way to get larger nodes is to leverage bigger warehouses.

Redshift

Redshift has much fewer features around scalability when compared to Snowflake. Autoscaling is limited and it can take anywhere from several minutes to several hours to adjust clusters since individual nodes have to be added and removed manually. The difficulty of scaling in Redshift is largely dependent upon the size of the clusters and the amount of data. Redshift does offer a feature known as concurrency scaling, but it is not as robust as Snowflake’s autoscaling feature and it adds on additional costs.

Security and Compliance

Snowflake

One of the best facets of Snowflake security is RBAC (role-based access control), which is used to manage users and privileges. Snowflake also uses MFA (multi-factor authentication) for increased account security. Network access is controlled through IP whitelisting/blacklisting which can be defined through the network policies.

Since it operates across clouds, Snowflake can use VPNs (virtual private networks) like Azure Private Link or AWS PrivateLink. In addition to this, Snowflake also automatically encrypts all files stored in the internal stages for loading and unloading data automatically. Within Snowflake all data is automatically encrypted. In addition to this, Snowflake has also earned a number of certifications like SOC 1 Type II and SOC 2 Type II compliance, HIPAA compliance, PCI DSS compliance, FedRAMP compliance, etc.

Redshift

Aside from handling all of the infrastructure and hardware security, Redshift also manages the security around resource provisioning. Access management is handled by AWS Identity and Acess Management (IAM). Only users with the appropriate credentials can create, configure, and delete clusters. Access to any Redshift resource is handled by AWS account privileges. Security groups can be assigned to specific cluster instances to grant inbound access.

Clusters can also be launched within the AWS Virtual Private Cloud (VPC) to further protect access. Cluster encryption can also be enabled when a cluster is launched. Redshift offers secure sockets layer (SSL) connections that encrypt the connection between the SQL client and the cluster. Within Redshift, all data is automatically encrypted at rest and users have the ability to protect data kept in S3 storage with Advanced Encryption Standard (AES-256). Compliance within Redshift is no problem as AWS has even more certificates than Snowflake.

Data Support

Snowflake

Snowflake supports both Semi-Structured (JSON, Avro, Orc, CSV, JSON, Parquet) and Structured Data. Snowflake also launched support for Unstructured data in late 2021, enabling developers to leverage even more development languages (ex: Python, Java, Scala). This gives Snowflake even more flexibility as a data platform since it can be treated as both a data lake and data warehouse.

Redshift

Redshift also supports Structured and Semi-Structured Data. Redshift also supports PartiQL which is a query language designed to process Semi-Structured Data in relational databases more efficiently. Unfortunately, Redshift does not support Unstructured Data. However, Unstructured Data can be stored within S3 “data lake” and queried using Amazon Redshift Spectrum.

Administration

Snowflake

As a native SaaS solution Snowflake lets users easily manage users, roles, permissions, and overall security. Query performance and other general performance tuning happen automatically within Snowflake as data volume increases. There is also hardly ever a need for a full-time administrator within Snowflake because there is built-in optimization and automatic clustering. Nearly every aspect of maintenance is fully automated and handled by Snowflake and since it is a SQL-based tool the barrier to entry is extremely low.

Redshift

Since Redshift is a PaaS solution, there is a substantial amount of manual maintenance required to keep Redshift up and running. For example, user access is managed on a cluster basis and this means that users will need multiple credentials and permission grants to work across different clusters and client connections. On top of this administrators have to perform regular database vacuuming (the process of re-sorting and cleaning up rows and unclaimed space within a database) tasks.

Data Protection

Snowflake

For data protection Snowflake has two core features that it offers known as Time Travel and Fail-safe. In its simplest form, Snowflake’s Time Travel feature keeps a state of data before it is updated. This feature can be applied to databases, tables, and schemas. Time Travel is usually limited to one day, but Enterprise customers can choose any period of time up to 90 days. Fail-safe works in unison with Time Travel. It’s a 7-day period that starts immediately after the Time Travel retention period ends. This feature exists to protect and recover historical data (data that has been deleted or changed).

Redshift

Redshift also provides a data protection feature, but it works slightly differently when compared to Snowflake. Redshift uses what’s known as Snapshots which are point-in-time backups of clusters. There are manual and automated snapshots, both of which are stored internally in Amazon S3. When a user restores from a backup, Redshift creates a new cluster that can be queried immediately.

Snapshots can also be configured to restore a single table instead of an entire cluster. When automated snapshots are enabled, Redshift will periodically take snapshots of specific clusters about every 8 hours or every 5 GB per node of data changes (whichever comes first). Automated snapshots are enabled by default when a cluster is created, but they can also be scheduled to run on a different time basis.

On the other hand, manual snapshots can be taken at any time and the data is retained indefinitely even after a cluster is deleted. The default retention period is one day, but this can be modified in the Redshift console. For example, RA3 nodes can be automated to set a retention period from 1-35 days. Users have no ability to delete snapshots in Redshift because they are automatically deleted at the end of the retention period.

Pricing

Snowflake

The pricing model within Snowflake is based on individual warehouse usage. All virtual warehouses come in several different t-shirt sizes (X-Small, Small, Medium, Large, X-Large, etc.). An X-small warehouse uses one cluster. For each increase in warehouse size, the number of clusters doubles along with the cost.

For an X-Small warehouse, pricing starts at approximately 0.0003 credits per second. Depending on the Snowflake tier, the cost per credit can vary substantially, but the on-demand pricing for Snowflake Standard Edition is $2. The largest Snowflake warehouse (6X-Large) burns 512 credits per hour or .1422 credits per second.

Within Snowflake, customers can pre-purchase credits or pay on demand based on usage. Pre-purchasing credits saves a substantial amount of money since Snowflake offers special deals when customers sign longer contracts and pay upfront.

Although credit costs can vary substantially within Snowflake depending on the business tier, storage costs within Snowflake are relatively straightforward. On-demand users pay a flat rate of $40 per month for each TB of data and upfront customers pay $23 per TB of data stored.

Redshift

Redshift's pricing structure is slightly more complex compared to Snowflake because there are more options specifically around node types. Redshift offers two main types of nodes, DC2 and RA3.

DC2 nodes are tightly coupled with storage. With RA3 nodes, storage and compute are managed independently. The price for a DC2 instance ranges from $0.25 per hour for the smallest size to $4.80 per hour for the largest size. On the other hand, RA3 instances start at $1.086 per hour for the smallest size and $13.04 for the largest size.

Redshift also offers a serverless option for users who don’t want to provision and scale hardware. With this option Redshift automatically scales up or down to meet the requirements of analytic workloads and shuts down during periods of inactivity.

Consumption is calculated per minute based on RPU (Redshift Processing Unit) hours. The price for an RPU is $0.45 per hour. One important aspect to note about Redshift is that AWS offers massive discounts for long-term contract commitments.

At a quick glance, storage within Redshift is slightly more expensive than Snowflake at $24 per TB. However, data can also be stored within S3 and queried directly using Redshift Spectrum. The storage cost within S3 begins at $23 per TB, but this can be pushed down depending on the priority at which the data needs to be accessed because data can be kept in different storage layers within S3. This makes redshift substantially less expensive than Snowflake in some cases.

Cloud Infrastructure

One of the core things to keep in mind, when it comes to Snowflake and Redshift is that Snowflake is a SaaS solution, and Redshift is a PaaS solution. Users of Snowflake don’t have to maintain any infrastructure. The same cannot be said for Redshift. On top of this, Snowflake runs across clouds, so it is extremely flexible. On the other hand, Redshift is solely limited to AWS and it cannot run on another cloud provider like Azure or GCP (Google Cloud Platform).

Performance

Every benchmark can tell its own unique story and be tailored to perform in a certain way. Out of the box with no fine-tuning, Snowflake tends to beat out every other warehouse solution in terms of performance. This is not to say there is not a specific subset of data or use case where Redshift would outperform Snowflake. However, there is a substantial amount of leg work involved to even get Redshift up and running. Query optimization within Redshift is effective, but it can be time-consuming.

The key aspect to understand is that Snowflake provides immediate value; users are not forced to manage any infrastructure or do any query optimization. Additionally, the architecture within Snowflake is much more flexible since it has separated compute from storage. This is one of the main reasons that Snowflake is able to achieve such good performance.

The Main Difference Between Redshift and Snowflake

Redshift and Snowflake both operate in a similar manner since each solution is based on SQL, but Redshift is substantially older and harder to use. They also have several common features around data sharing and data recovery. The solutions themself are quite different though. Snowflake was designed to work straight out of the box and be easy to use with no maintenance. It's extremely efficient at tackling workloads that require frequent scaling. Users simply have to load their data into Snowflake and they can immediately start querying it for analysis.

Since Redshift is a much older tool, it carries some legacy baggage from the earlier years before the cloud was so prominent. It’s simply not possible for Redshift to provide the same value that Snowflake does in the same timeframe. Users are forced to set up infrastructure and hardware to even get started with Redshift.

The main advantage that Redshift has over Snowflake is the fact that it integrates natively with the rest of the AWS ecosystem. Snowflake has many integrations that integrate with other AWS services but it's not equal.

Snowflake is a modern data warehouse, but it can become expensive very quickly. This is another area in that Redshift has a major advantage, as Redshift is generally cheaper in the long run. Ultimately there are some workloads that Snowflake handles exceptionally well and there are some workloads that Redshift tackles more efficiently (ex: Machine Learning using AWS Sage Maker). Either way, the business demands, use cases, and resources should always be kept at the forefront of any decision-making when evaluating cloud data warehouse solutions.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK