Databricks was built by the founders of Spark as an analytics platform to support machine learning use cases. It leverages the Spark framework to process data residing in a data lake and is supported on AWS, GCP and Azure. Databricks coined the marketing term “Lakehouse '' architecture to illustrate the unification of data lake and data warehouse use cases. Customers still manage Spark clusters that process data residing in a Delta lake. Conversion of data to Delta Lake format is required to leverage the functionality of Delta Lake. Databricks Sql is a relatively new addition to simplify access to data stored in a data lake.
Athena is serverless and built on a decoupled storage and compute architecture that queries data directly in S3, without the need to ingest/copy the data. It runs in multi-tenancy with shared resources. Users do not have control over the compute resources Athena chooses to allocate per query from the shared resource pool.
Databricks allow for autoscaling of clusters based on utilization. Additionally, increasing concurrency associated with a sql endpoint can be accomplished through the addition of clusters. Query concurrency per cluster is maxed at 10. However, scaling with additional clusters for concurrency is possible. Databricks provides a choice of instance types.
Athena is a shared multi-tenant resource, with no guarantees on the amount or availability of the resources allocated for your queries. From a data volume perspective, it can scale to large volumes, but large data volumes can suffer from very long run times and frequent timeouts. Query concurrency is maxed at 20. If scalability is a top priority, Athena is probably not the best choice.
Databricks is designed to leverage the Spark framework for processing large volumes of data. It leverages compressed Parquet files in a Delta Lake. To reduce the amount of data processed, it uses data pruning on partitions and Parquet file metadata. Databricks does not provide any indexes.
Athena (and Presto) are designed to query data where it is, sacrificing storage-compute optimizations. This makes it very convenient for easy and immediate querying but at the expense of performance. This typically puts Athena behind cloud data warehouses in terms of performance. But Athena still does relatively well in performance benchmarks, especially when external storage is managed by experts. While it supports partitions, there is no support for indexing, and together with the fact that resources are pooled from a shared multi-tenant service, low-latency and consistent performance are not Athena’s sweet spot. A cloud data warehouse is more performant than Athena in most cases.
Databricks is a mature Spark based platform proven for processing streaming data. It is widely used for Machine Learning use cases by data scientists through the use of integrated notebooks. From a low latency query perspective, while it offers features like Delta Cache, it does not provide specialized indexes that can deliver low latency queries.
Athena is a great choice for Ad-Hoc analytics. You can keep the data where it is, and start querying without worrying about hardware or pretty much anything else, given that Athena is serverless and takes care of everything behind the scenes. However, it is not a great fit when you need consistent and fast query performance, and/or high concurrency. This is why it is typically not the best choice for operational and customer-facing applications. It can be also easily and flexibly used for batch processing, which is often leveraged for ML use cases.