Databricks’ $15B Gamble: Can Proprietary Moats Hold Against the Open-Source Flood?

Databricks just raised another $15B, yet the core ingredients behind their success are turning into commodities at an astonishing pace.

I have a long history with Databricks dating all the way back to 2014, before it was generally available. At the time, I was working at a startup that processed massive data streams from live video feeds, relying on large Hadoop/HBase clusters. Our options for automating that infrastructure ranged from clunky home-brewed tooling (Chef/Puppet/Ansible) to the modest improvements offered by Hadoop distributions like Cloudera, Hortonworks, and MapR. When Databricks burst onto the scene, it was a revelation: suddenly, we could forget about the drudgery of managing distributed infrastructure and zero in on solving real business problems. That was invaluable for a lean startup.

But the world has moved on. Operating niche clusters like Mesos or YARN exclusively for batch data jobs used to be a daunting weight on ops teams. Kubernetes emerged to orchestrate containers in a generic, scalable manner, and the Spark Kubernetes scheduler slotted right in. Meanwhile, tools like spark-operator (originally open-sourced by Google and now maintained under Kubeflow) have made spinning up Spark clusters declarative and arguably more elegant than Databricks's API-driven approach.
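To make "declarative" concrete, here is a hedged sketch of a spark-operator `SparkApplication` manifest. The image, S3 path, namespace, and service account are placeholders; check the field names against the CRD version you install.

```yaml
# Declarative Spark job via the Kubeflow spark-operator (sketch).
# Image, application file, and versions below are placeholders.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: daily-etl
  namespace: spark-jobs          # placeholder namespace
spec:
  type: Python
  mode: cluster
  image: my-registry/spark:3.5.1                     # placeholder image
  mainApplicationFile: s3a://my-bucket/jobs/etl.py   # placeholder path
  sparkVersion: "3.5.1"
  driver:
    cores: 1
    memory: 2g
    serviceAccount: spark        # assumes a pre-created service account
  executor:
    instances: 4
    cores: 2
    memory: 4g
```

Applying this with `kubectl apply -f` replaces the imperative "create cluster, submit job, tear down" API dance with a single versioned file you can review and diff like any other code.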

At the same time, the once-formidable complexities of running Kubernetes have dramatically decreased. Combine AWS EKS with open source Terraform modules, and you have a recipe for maintaining substantial infrastructure with minimal overhead. Throw Karpenter (just-in-time nodes for Kubernetes clusters) into the mix – a tool that schedules workloads onto the cheapest spot instances that satisfy their CPU/memory requests – and you can slash compute costs substantially compared to Databricks's more rigid instance-family approach.
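The spot-chasing behavior above boils down to a short Karpenter `NodePool` spec. This is a hedged sketch against the Karpenter v1 API; the `EC2NodeClass` name is an assumption (it must exist separately), and field names should be verified against the Karpenter version you deploy.

```yaml
# Karpenter NodePool (v1 API, sketch): let Karpenter pick the cheapest
# spot instance that fits each pod's CPU/memory requests.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-general
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default            # assumes an EC2NodeClass named "default"
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]       # spot only; add "on-demand" as a fallback
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # bin-pack and scale down
```

Because no instance families are pinned, Karpenter is free to pick whatever spot capacity is cheapest at launch time – the opposite of locking a cluster to a fixed instance type up front.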

Databricks does shine with Photon, its vectorized query engine, which accelerates certain queries quite impressively. But purely from a cost perspective (a crucial metric for OLAP), my benchmarks consistently show Spark on Kubernetes running significantly cheaper (at least ~50% savings) for similar workloads, despite being slower. And let's be honest: the open source community has a knack for catching up on cutting-edge innovations (the ideas in Photon are not necessarily novel – see MonetDB/X100 as a case study).
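The "slower but cheaper" trade-off is easy to formalize: what matters for batch ETL is dollars per job, not wall-clock speed. A minimal sketch, with entirely made-up placeholder prices and runtimes (not my benchmark numbers):

```python
# Hypothetical illustration: a slower cluster can still win on cost per job.
# All rates and runtimes are placeholders, not real benchmark results.

def job_cost(hourly_rate_usd: float, runtime_hours: float) -> float:
    """Total compute cost for one run of a batch job."""
    return hourly_rate_usd * runtime_hours

# Managed platform: faster engine, but compute plus a DBU-style platform fee.
managed = job_cost(hourly_rate_usd=10.0 + 4.0, runtime_hours=1.0)

# Spark on Kubernetes with spot nodes: ~30% slower, much cheaper compute.
spot = job_cost(hourly_rate_usd=3.5, runtime_hours=1.3)

print(f"managed=${managed:.2f}  spot=${spot:.2f}  "
      f"savings={1 - spot / managed:.0%}")
```

The point of the arithmetic: a Photon-class speedup has to exceed the price premium before it wins on cost, and for throughput-bound ETL it often doesn't.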

Databricks also touts better I/O performance for reading and writing data in S3 with its proprietary connectors. Yet Steve Loughran's work on Hadoop's vectored I/O API and the associated S3A integration narrows that gap significantly for those willing to invest a bit of elbow grease.
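The "elbow grease" is mostly configuration. A hedged sketch of the relevant `spark-defaults.conf` knobs – the vectored-read keys follow the Hadoop 3.3.5+ S3A documentation, and the values here are illustrative, so verify both against your Hadoop/Spark versions:

```properties
# S3A tuning sketch (verify key names against your Hadoop version).
# Vectored read coalescing, used by the Parquet reader on Hadoop 3.3.5+:
spark.hadoop.fs.s3a.vectored.read.min.seek.size    4K
spark.hadoop.fs.s3a.vectored.read.max.merged.size  1M
# S3A "magic" committer: safe, fast job commits straight to S3:
spark.hadoop.fs.s3a.committer.name                 magic
# Spark's own columnar Parquet reader (on by default, shown for clarity):
spark.sql.parquet.enableVectorizedReader           true
```

None of this requires proprietary code – just a recent Hadoop line on the classpath and a few settings.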

It’s telling that Databricks, after pouring resources into proprietary offerings such as the Delta format and Unity Catalog, eventually opened them up – at least in part, or arguably in name only in Unity Catalog’s case. The pattern is familiar: build proprietary, then open-source key components when market forces demand it, because customers hesitate to bet their technical stack on proprietary tools with no like-for-like replacement, and they churn when costs bite.

One might suspect this dynamic shows up in Databricks’s customer numbers: over 10,000 customers, but only a few hundred paying north of $1M annually. It’s a great solution for smaller teams looking for convenience and lower overhead. But when costs skyrocket, companies tend to reevaluate, and open source alternatives often deliver nearly the same capabilities for most ETL jobs at a fraction of the price.

So the question looms: How will Databricks spend that fresh $15B to stave off the open source tide? Doubling down on proprietary tech? Or embracing the community more fully? The future is bright – and more open than ever. Let me know your thoughts, and if you’d like some open source code to bootstrap your own Spark-on-Kubernetes setup, just say the word.