A deep dive into Snowflake's Caching Mechanism
Let's learn about how Snowflake keeps your queries running smoothly!
Speed absolutely matters - the faster you can get the result, the better. Luckily, if there is something wonderful about Snowflake, is definitely its ability to optimize query execution and to provide us with results super fast.
Cache… what is that?
You might have heard the word cache before, and most probably you already have an idea about what it is. But, just in case you’re not sure about it, let me explain it to you in a short sentence:
Cache is a temporary storage location where frequently used/accessed data is kept in order to be quickly retrieved when needed.
So, whenever data is stored in cache, it will be easily accessible without having to reprocess the data or having to execute a query again.
Caching in Snowflake
If I tell you that I was born on a Tuesday at 19:00, you will not instantly forget about it, most probably you will be able to remember this information for a few hours or even for the next days. So, if later in the day, somebody asks you “Hey, by any chance do you know the day and hour when The Snowflake Journal author was born?” you don’t have to come read this information, you probably will remember and be able to give that person a quick answer.
Snowflake works in a similar way. Whenever some query is executed in the platform, the results are not just deleted or forgot. Instead, Snowflake will keep the results in cache (in this case, a specific one called Results Cache). This means that if you want to execute the same query later, Snowflake doesn’t have to access the data, perform transformations and give you the information - it will instantly give you whatever has been placed in the Results Cache.
The different caching layers in Snowflake
Snowflake’s cache is a multi-layer approach. We don’t have only one type of caching, but three of them with different scopes. Let’s explore them
Results Cache
This is the most famous cache layer. As mentioned before, is the layer of cache that will keep the results of your queries.
Let’s imagine that you execute the following query
SELECT * FROM SALES WHERE DATE = ‘2024-01-01';
and you get the following results:
Number of Rows: 120.000.000
Time of execution: 26 seconds
these 120 million rows will be stored in the Results Cache. Now, if you (or some other user) execute the same query again, the results might be the following:
Numbers of Rows: 120.000.00
Time of execution : 0.5 seconds
This is because Snowflake, instead of querying the whole dataset again in the Storage layer, provided you with the results that were stored in the Results Cache.
Metadata Cache
Metadata cache is an amazing layer in the Snowflake’s caching system. This layer will store relevant metadata for the objects that we create in our databases, including maximum, minimum or range of values, count of rows or clustering information.
This layer of metadata is really helpful since it can help us to execute certain queries without even needing to use a Virtual Warehouse.
For example, if we need to count the number of rows in a table:
SELECT COUNT(*) FROM SALES;
Since, as mentioned before, the number of rows in a table is cached in the metadata cache layer, Snowflake will provide us with that value without performing any calculations - this is, the warehouse won’t be consuming resources.
Data Cache - Local Disk Cache
Finally, there is a last caching layer in Snowflake: the Local Disk Cache. In simple terms, this is the cache belonging to each node of our virtual warehouse (remember, node = virtual machine).
As the name mentions, this kind of cache is related to the data, not to the final result of a query. This means that what will be cached is not the end-result set (transformations, calculations applied, etc…), but a chunk of data.
As an example, let’s consider that we query our SALES table and we filter for the first day of the year:
SELECT * FROM SALES WHERE DATE = ‘2024-01-01’;
Snowflake will query the data from SALES and offer us the rows where the filtering condition is met. In addition to this, Snowflake will store in the nodes the SALES data related to ‘2024-01-01’. So, if after executing that query you need to tune your query to only extract specific columns:
SELECT PRODUCT, PRICE FROM SALES WHERE DATE = ‘2024-01-01’;
Snowflake won’t scan the data from the storage. Instead, since the data for that specific date is cached, we will get the PRODUCT and PRICE columns from that chunk of cached data, without needing to filter all the data from the storage again.
Now that we have understood how caching works in Snowflake, let’s learn about the limitations of the cache.
Snowflake’s Caching Mechanism: Nuances to consider
Cache Limitations when querying
For Snowflake to use the Results Cache, the query must be identical to the query previously executed. Also, session settings can influence the cache usage (e.g. warehouse used).Cache Invalidation
Our cached data will be bypassed or invalidated based on different factors:Modifying table structure (add, modify or remove column)
Modifying table data (insert, update or delete rows)
Warehouse Suspension = Data Cache cleared
24 hours window - after a period of 24 hours, cache will be cleared (even if any of the previous scenarios have happened)
Sharing Cache between different users/developers
The results cache can be reutilized by different users, but they must use the same role and session settings. If you query using the role DEVELOPMENT, a colleague of yours will be able to leverage results cache only if they use the same DEVELOPMENT role.
Sadly, this is the end of this crash-course on Snowflake Cache - Short but effective!
In this article, we have covered a bunch of things about caching and how it works in Snowflake. If you still have some doubts about the possible benefits, any drawbacks or you simply just want to discuss, feel free to reach out to me and we can talk about it.
I hope to see you soon around here, so remember to click that Subscribe button!
Happy Querying!