When do you need to shift from Redshift, and what are the alternatives?
When Amazon named Redshift, it was referring to a “shift” from “Red” (Oracle) to the now-blue Redshift logo. This wasn’t the redshift astronomers refer to, where light waves stretch out and shift toward the red end of the spectrum as an object moves away at high speed.
Oracle was legacy in many ways. It wasn’t really designed to be a data warehouse or for the cloud. Rather than build a new cloud data warehouse from scratch, Amazon took ParAccel, an on-premises data warehouse, modified it to be a cloud data warehouse as a service, called it Redshift, and released it in 2012.
Today, Redshift is still the most widely deployed data warehouse. But it is also officially legacy SaaS. It is the older technology. Now, just because it’s legacy doesn’t mean you need to move. After all, as the old joke goes, the difference between legacy and other software is that legacy software is used.
The two questions are:
- When do you know you need to shift from Redshift?
- What are your options?
The short answers are:
- When you hit problems with speed, scale and concurrency.
- You should first make sure you’ve taken advantage of all the tuning options and improvements to Redshift. If you have and have clearly hit your limits, then Snowflake is clearly one option, so you could shift from “blue to blue”. Or you could do a true double red shift all the way to Firebolt.
Now for the longer answers.
Limitations with Redshift
When Amazon created Redshift from on-premises data warehouse technology, it helped them be first to market. Over the years, this has made Redshift one of the most mature cloud data warehouses, with many improvements that you should try before you give up on Redshift.
- Concurrency scaling: You can have Redshift automatically scale clusters. It’s not unlimited scale as the docs imply (see below). But this has become standard practice for user and query concurrency.
- Redshift RA3: With RA3, you can now scale up and down your clusters dynamically, and not all the data needs to be held in the compute nodes, just what is needed by the running queries.
- Redshift AQUA (Advanced Query Accelerator): a hardware-based cache, available with ra3.4xlarge and ra3.16xlarge nodes, that sits on top of managed storage to improve performance. It’s currently in preview, so time will tell.
But it also gave Redshift many traditional data warehouse limitations.
- Batch-oriented - Redshift does not support continuous ingestion at any reasonable scale. This is a problem with most cloud data warehouses. It typically is caused by table- or partition-level locking for columnar storage.
- Slow performance - With Redshift, as with many other cloud data warehouses, queries that run for the first time or access new data can take seconds to minutes by the time you get to 1 Terabyte (TB) of data. The Fivetran benchmark shows that “first-time” queries with caches cleared take 8-11 seconds on average for most cloud data warehouses, including Redshift, with just 1TB of data.
- Limited scalability - Because compute and storage are not decoupled, any cluster needs to hold all the data. Concurrency scaling allows multiple clusters, and RA3 will let you scale dynamically. But you still can’t spin up multiple clusters for different workloads. Each cluster has to support all the data needed by every user’s queries.
- Low query concurrency - Redshift can support teams of analysts and more traditional BI workloads. But it hits a concurrency limit at around 50 queries across all of its queues, no matter how many clusters or how much compute is provisioned. That makes supporting larger employee groups or end customers more challenging.
Once you hit these limits with speed, scale and concurrency, it doesn’t matter how much money you throw at the problem. You need to start planning for a new solution.
Redshift alternatives for scale
What are the options? The obvious one is Snowflake, especially when you’re on AWS already.
If you are not concerned about speed or cost, Snowflake is great. It has a decoupled storage and compute architecture that provides all the scale. But Snowflake does not deliver faster performance. Just look at the Fivetran benchmarks. The customer benchmarks I’ve seen also put Redshift and Snowflake in roughly the same ballpark.
Snowflake is not the best choice for low-latency applications. It does not support continuous ingestion at scale; the minimum ingestion interval is 1 minute.
Snowflake can also be reassuringly expensive. I have mostly heard of bills increasing when customers moved from Redshift to Snowflake, as well as some stories of running out of credits.
Redshift alternatives for speed, scale and cost
Anytime you need to support ad hoc, interactive, operational or customer-facing analytics use cases, speed and cost become major concerns. When they do, you need to consider doing a double red shift, beyond Snowflake as well, to something newer.
Firebolt was created because Eldad Farkash, who was also a founder of Sisense, could not find a data warehouse to power interactive analytics at scale. Eldad has written high-performance analytics databases since he was a teenager. After looking at various cloud data warehouses, he realized it was time to build a new cloud data warehouse.
What Eldad realized was that the modern decoupled storage and compute architecture needed to be re-written for speed and efficiency. He saw three major bottlenecks:
- Data access - Most cloud data warehouses fetch entire segments or partitions of data over the network despite the network being the biggest bottleneck. In AWS, for example, the 10, 25 or 100Gbps networks transport roughly 1, 2.5 or 10 Gigabytes (GB) per second at most. When working with Terabytes of data, data access takes seconds or more. Fetching exact data ranges instead of larger segments can cut access times 10x or more.
- Query execution - Query optimization makes a huge difference in performance. It’s one of the reasons Amazon chose to use ParAccel originally: building all the optimization takes time. And yet most cloud data warehouses lack a lot of proven optimization techniques, from indexing to cost-based optimization.
- Compute efficiency - Decoupled storage and compute architectures have been a blessing and a curse. They allowed vendors to use nearly unlimited scale to improve performance instead of improving efficiency. Because vendors make more money when they sell more compute, there is an incentive to stay inefficient.
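To put rough numbers on the data access bottleneck, here is a back-of-the-envelope sketch in Python. It uses only the round throughput figures quoted above; the 1 TB scan size and the assumption that fetching exact ranges moves 10x less data are hypothetical inputs for illustration, not measurements, and real systems spread the work across many nodes.

```python
# Back-of-the-envelope transfer times for a single network link.
# Throughput values are the article's round figures, not benchmarks.
EFFECTIVE_GB_PER_SEC = {10: 1.0, 25: 2.5, 100: 10.0}  # link speed (Gbps) -> GB/s


def transfer_seconds(data_gb: float, link_gbps: int) -> float:
    """Seconds needed to move data_gb gigabytes over one link of link_gbps."""
    return data_gb / EFFECTIVE_GB_PER_SEC[link_gbps]


if __name__ == "__main__":
    full_segments_gb = 1000.0               # hypothetical: whole segments, about 1 TB
    exact_ranges_gb = full_segments_gb / 10  # hypothetical: exact ranges, 10x less data
    for gbps in (10, 25, 100):
        full = transfer_seconds(full_segments_gb, gbps)
        ranged = transfer_seconds(exact_ranges_gb, gbps)
        print(f"{gbps:>3} Gbps: full segments {full:.0f}s, exact ranges {ranged:.0f}s")
```

Because transfer time is proportional to the bytes moved, any reduction in data fetched translates directly into the same reduction in access time on a given link.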
Firebolt ended up adding a number of innovations around data access, query optimization and node efficiency to improve performance and costs.
- Firebolt File Format (F3). Firebolt created a new storage format for data access that uses sparse (primary) indexes to help support continuous ingestion and to fetch specific data ranges from within segments. The result is much faster data access.
- Indexing on top of F3 to achieve sub-second performance. This includes sparse indexing that accelerates data access and queries, and enables continuous ingestion as well; aggregating indexing for faster group by and other aggregating operations; and join indexing to accelerate joins by replacing expensive scan operations.
- A next-generation query engine designed for multi-workload performance. It includes a number of optimizations including vectorized processing, JIT compilation, cost-based optimization, indexing, and a host of tuning options.
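The general idea behind a sparse index can be sketched in a few lines of Python. This is a generic illustration of the technique, not Firebolt's implementation or the F3 format: the granule size, function names and in-memory list are all hypothetical. The index keeps one key per block of sorted rows, so a range lookup only has to read the blocks that might contain matching keys.

```python
from bisect import bisect_left, bisect_right

GRANULE_SIZE = 4  # hypothetical rows per granule; real systems use far larger blocks


def build_sparse_index(sorted_keys):
    """Keep only the first key of each granule instead of one entry per row."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), GRANULE_SIZE)]


def granules_for_range(sparse_index, low, high):
    """Half-open range of granule numbers that may hold keys in [low, high]."""
    # Step back one granule: it may still contain `low` (conservative, so it
    # stays correct when keys repeat across granule boundaries).
    first = max(bisect_left(sparse_index, low) - 1, 0)
    # Granules whose first key is greater than `high` cannot match.
    last = bisect_right(sparse_index, high)
    return first, last


def read_range(sorted_keys, sparse_index, low, high):
    """Return matching keys and the number of rows actually read."""
    first, last = granules_for_range(sparse_index, low, high)
    candidates = sorted_keys[first * GRANULE_SIZE:last * GRANULE_SIZE]
    return [k for k in candidates if low <= k <= high], len(candidates)


if __name__ == "__main__":
    keys = list(range(100))          # data sorted by the primary key
    index = build_sparse_index(keys)  # 25 entries instead of 100
    matches, rows_read = read_range(keys, index, 10, 13)
    print(matches, rows_read)
```

In this toy example the lookup reads 8 of the 100 rows to return 4 matches. The index stays small because it grows with the number of granules rather than the number of rows.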
The performance and price-performance become clear in customer benchmarks like this one, with 17-102x faster performance at roughly the same cost. Companies can choose the best price-performance because the ratio is generally linear. If you cut the costs in half, you halve the performance as well.
Taking the red or the blue pill
Adding in a reference to The Matrix seemed like a good idea when I was writing this, especially since the Redshift and Snowflake logos are blue, and Firebolt’s is red.
For most, once you’ve exhausted all your options for improving Redshift speed, scale and efficiency, the two main options are to take the blue pill and choose Snowflake, or to take the red pill, and choose Firebolt. If you do choose Snowflake, it will improve scale, but at a higher cost and without a dramatic performance improvement.
If you choose Firebolt, you will be able to address speed, scale and cost, or efficiency, all at once. The choice is yours. Choose wisely. Or just try Firebolt and Snowflake out now, because the best way to decide is with all the numbers in hand.