
Revolutionary features of Snowflake that set it apart — A deep dive



Lately, I got the chance to attend the Snowflake Summit in Sydney. I learnt a few things and decided to try them on actual data. I was tentative at first, but after using it on 40GB of data, I realized there are some features that make Snowflake competitive with Redshift, Azure SQL, and similar MPP data warehouses in the market.

So, this post covers two kinds of features: those that are missing or only partially implemented in other MPP warehouses (or that an engineer would otherwise have to build by hand during ETL), and those that are nice to have but make no real difference if absent.


Differentiating features

In no particular order, these are some of the useful features that set Snowflake apart from other MPP warehouses.

Clone the table

Snowflake calls it 'Zero copy clone', and it is one of Snowflake's differentiating features. Let's take a use case where an engineer or an analyst has to clone a table, either to alter it further or just to take a snapshot of it. As we know, a typical MPP warehouse creates a deep copy of the table, which consumes additional space, whereas when you clone a table in Snowflake, the new table just stores pointers to the original table's data. In other words, the data is actually the same but referenced through two different tables.

Now, when you alter the new table, only the changed column is stored separately, while the rest of the columns still use the same pointers. Since Snowflake, like other MPP warehouses, stores data in a columnar format, this feature becomes very useful.
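
For reference, the clone itself is a single statement. A minimal sketch, assuming a hypothetical orders table:

-- the clone initially shares all of its storage with the original table
create table orders_snapshot clone orders;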

Drop & Undrop the table

As the name suggests, this command lets you undrop a dropped table. The feature is useful when an engineer who has drop access on a resource accidentally drops it. At first the feature does not seem to make sense, because you would not give drop access to an engineer who might inadvertently drop a useful table. But considering that some companies prefer a decentralized approach to their data landscape, you might end up with different accounts handled by less experienced people who may drop a useful table in their instance. And believe me, that is happening a lot nowadays.

So, the following command lets you undrop the table:

Undrop table {table_name};

The output/result pane will show the following response once the above command is executed.

[Screenshot: result pane confirming the table was successfully undropped]

Snowflake gives you 24 hours by default to undrop a table, but if you are using the Enterprise edition, the retention period can be extended to up to 90 days.

Roll back a table after commit

Essentially this feature is a rollback of an update statement even after you have committed it. In other words, even after running something like:

update {table_name} set {column_name} = 'value';

you can still get the pre-update data back.
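
A minimal sketch of how this can be done through Snowflake's time travel; the query ID placeholder below is illustrative:

-- inspect the table as it was before the committed update
select * from {table_name} before (statement => '{query_id_of_update}');

-- restore that state by cloning it into a new table
create table {table_name}_restored clone {table_name} before (statement => '{query_id_of_update}');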

Change compute on a per-query basis

[Screenshot: choosing a different warehouse for a query in the worksheet]

As can be seen in the above window, we can change the cluster configuration before executing a particular query, letting Snowflake use higher compute when the query we are about to execute is heavy, and vice-versa. So, when you want to run a query, just select the query, change the warehouse, and click Run. The best part is that you can include this in your ETL and make the most of it. I will leave it to you to explore. And yes, 'warehouse' here has a different meaning than the one we are used to.
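
In a script, the same switch takes a couple of statements; the warehouse name and size below are illustrative:

-- route the following statements to a bigger warehouse
use warehouse heavy_etl_wh;

-- or resize a warehouse on the fly
alter warehouse heavy_etl_wh set warehouse_size = 'LARGE';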

Automatic scale down

So, this one actually caught my eye. Before delving into it, let me first tell you that Snowflake is a data warehouse in which compute and storage are billed separately. In other words, you will save a lot of money if the cluster (a warehouse, here) gets suspended when it is not being used.

[Screenshot: warehouse configuration showing the Auto Suspend setting and scaling policy]

While configuring the warehouse, you can specify the Auto Suspend setting, which essentially means the compute of the cluster scales down to nothing if it is inactive for that specific period. You can also specify a scaling policy of your liking to dynamically scale the cluster/warehouse up and down. The best part is that you can even do this dynamically through a script.
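
For example, auto suspend can be set at creation time or adjusted later from a script; the warehouse name and timings below are illustrative:

-- suspend after 300 seconds of inactivity, resume on the next query
create warehouse reporting_wh warehouse_size = 'XSMALL' auto_suspend = 300 auto_resume = true;

-- tighten the idle window later
alter warehouse reporting_wh set auto_suspend = 60;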

All cloud providers' storage staged in one place

[Screenshot: the Stages view with the Create button]

You need to let Snowflake know about the existing staging area where the data currently sits. The best part is that one stage in a database schema can point to an S3 location in AWS while another stage points to Azure Blob storage. By clicking 'Create', you can create the stage and choose the S3 location. The same is true for Azure Blob as well.

[Screenshots: creating a stage that points to an S3 location, then querying it from a worksheet]

And when you run the query shown above, it will show you all the keys/files/blobs under a particular path.

[Screenshot: query result listing the files under the staged path]
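
If you prefer SQL over the UI, the same listing is available through the LIST command; the stage name below is illustrative:

-- show every file currently sitting in the stage
list @my_s3_stage;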

As far as I know, the integration of Google Cloud Storage is still underway and will be available shortly.

One can also create the stage using the following command:

create stage {stage_name} url = '{s3_location}';

Copy Parquet files

At the time of writing this post, Redshift on AWS does not have a Parquet option in its COPY statement, and you need to spin up ADLA (Azure Data Lake Analytics) to do the same on Azure. In Snowflake, by contrast, you can copy the Parquet data sitting in staging straight into a table with the help of the following steps.

First, you need to create a file format, which is essentially the parsing specification you give Snowflake so that it parses the data correctly before loading it into the table.

[Screenshots: the file format creation dialog]

Here, you can specify the format type (Parquet), compression, delimiters, etc. These are more or less the same options available in the Redshift COPY statement.
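
Both steps can also be scripted. A minimal sketch, assuming a hypothetical stage, target table, and column names:

-- 1. describe how the staged files should be parsed
create file format parquet_ff type = 'PARQUET' compression = 'SNAPPY';

-- 2. load selected fields from the Parquet files into the target table
copy into my_table
from (select $1:id, $1:name from @my_s3_stage)
file_format = (format_name = 'parquet_ff');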

View S3 data before loading it into a table

Snowflake lets you query the staging area before loading the data into a table, so you can see exactly what records are sitting in your staging area. I will let you figure out how it can be done. Needless to say, this is very helpful in avoiding load errors.
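
For the impatient, here is one way to do it; a sketch assuming a hypothetical stage and file format:

-- $1, $2, ... refer to the columns of the staged files
select t.$1, t.$2
from @my_s3_stage (file_format => 'my_csv_ff') t
limit 10;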

Nice-to-have features

These are features that exist in Snowflake and would be nice to have in other MPP warehouses, but their absence does not make much of a difference.

Cache query results

Snowflake caches the results of every query executed in the last 24 hours. So, unless the underlying data in the table has changed, the same query will return the same cached results. Admittedly, there are hardly any instances where anyone needs to execute the exact same query again unless the data has changed, but it's a nice feature to have anyway.

When you check the history, you can see that the second query executed much faster than the first, even though both queries are exactly the same.
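
And if you ever want to benchmark without the cache getting in the way, it can be switched off for the session:

-- disable the result cache so repeated queries hit the warehouse again
alter session set use_cached_result = false;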

Computations, Operations and Storage in one UI

Most cloud providers have taken care of computation, operations, and storage, but when you want to know the cluster configuration, you have to open the cloud provider's portal separately to view it. In Snowflake, the same window shows what query was executed, when it was executed, and with which cluster configuration it was executed. All in one place. Again, it's a nice feature to have but does not add much value.

[Screenshot: the Snowflake web UI with the Warehouses, Databases, Worksheets, History and Shares panes]

In the above window, the Warehouses pane shows the cluster configuration, Databases lists the databases along with actions such as Create, Worksheets is the built-in SQL editor, History gives the list of queries executed over the last year, and Shares shows what you have shared and with whom.


History through the UI

Until now, if we needed to extract value from history, we either had to run system queries or build a separate ETL that ingests the history logs and pushes the data into a dedicated history table. But in Snowflake, you can view the list of queries executed by you, and by other users (if you are SYSADMIN or higher), over the last year. A useful feature from a data governance perspective.
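
The same history is also queryable in SQL, for example through the INFORMATION_SCHEMA table function; the column selection below is illustrative:

-- most recent queries first
select query_text, user_name, start_time, warehouse_name
from table(information_schema.query_history())
order by start_time desc
limit 20;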


Switch roles

If you are SYSADMIN or ACCOUNTADMIN, you can assume other roles and check whether a role you have created has adequate governance or not.
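
Switching is a one-liner; the custom role name below is illustrative:

-- act as the role you want to verify
use role analyst_role;

-- switch back when done
use role sysadmin;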


Sharing a table with another account

More often than not, we encounter a situation where we need to share a particular table with an analyst, data scientist, or business analyst whose team has a different Snowflake account. Snowflake lets you share the table securely so that the other team can work on the data as they want.
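
A minimal sketch of what that looks like in SQL, assuming hypothetical object names and a hypothetical consumer account:

-- create the share and grant it access down to the table
create share sales_share;
grant usage on database sales_db to share sales_share;
grant usage on schema sales_db.public to share sales_share;
grant select on table sales_db.public.orders to share sales_share;

-- make the share visible to the other team's account
alter share sales_share add accounts = other_team_account;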

Wrap up

I should not call this last one a feature but a practice. A practice that Snowflake is trying to facilitate: it purposefully lets you create the staging area in your cold storage so that you can see what the data looks like before loading it into the warehouse. A very useful practice which, I am afraid, not many companies have imbibed yet. However, right now, the result window only shows the most recent result, and the GUI cannot show multiple results in tabs so that one output can be compared with another. From my point of view, this is one major feature that Snowflake is lacking right now, but I can't blame them, because no other cloud vendor has fully embedded a SQL editor, let alone multiple query results.

In the end, I have only used this on 40GB of data and did not get the chance to use it on our 20TB data warehouse, but from what I have heard, Snowflake gets a bit expensive if it is not used wisely. Well, great features come at a bit of a cost.

