I love open source, but open-source software for data infrastructure is on the way out. There, I said it. You might think I've got a screw loose, given how broadly open source is adopted today, but hear me out. Yes, open source is ubiquitous in data management today, but the era of open-source innovation is all but over. In the age of the public cloud, there is no longer a reason to build or use open source for data infrastructure, and a new category of software that I'm calling open services will render open-source data tools irrelevant.
How We Got to an Open-Source World
The last decade has been a bonanza for open-source software in the data world, to which I had front-row seats as a founding member of the Hadoop and RocksDB projects. Many will point to Hadoop, open sourced in 2006, as the technology that made Big Data a thing. A plethora—some will call it a zoo—of open-source projects soon followed, including household names like Spark, Kafka, and MongoDB.
The open-source wave was all about adoption: getting software into the hands of users with as little friction as possible. Users simply downloaded, installed, and used software at any time. And this was the promise of open source! Open-source software proved very developer-friendly, as developers could easily access the software and documentation. They could experiment, build, and deploy without having to deal with vendors and enterprise sales. To no one's surprise, that ease of adoption has led to a proliferation of open-source data infrastructure software in recent years, at the expense of traditional commercial software.
But Open Source Isn't a Silver Bullet
Open source neutralized many barriers to adoption but, in the context of data infrastructure, it was still rarely simple to install, configure, manage, and administer. Enter the public cloud. Open-source data technologies needed scale-out processing and storage, which the cloud readily provided. However, considerable complexity remained in managing the software layer, which IaaS could not solve.
To make data infrastructure software easier to use and adopt, many vendors began offering their software as cloud services. Count Hadoop, Spark, Kafka, MongoDB, and Elasticsearch among the open-source projects that have as-a-service offerings, which abstract away both the hardware and the software. In many instances, it is these cloud services that are the growth engines for vendors. And just as open source was a step up from commercial software in terms of ease of adoption, cloud services are the next evolution in simplifying the consumption of data infrastructure.
The Age of Open Services in Data Infrastructure
Cloud services are characterized by their bundling of hardware, software, and operations into a utility model, making them eminently accessible to developers. An open service takes this concept a step further by implementing an API that is a well-defined standard, widely used across multiple software platforms, or both. As an example, Snowflake is a data warehouse offered as an open service that exposes SQL as its API. Just as users could avoid vendor lock-in by using open-source software, developing on an open API allows users to migrate from one service provider to another if needed.
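To make the migration point concrete, here is a minimal sketch of application code written against an open API rather than a particular provider. It assumes the provider's Python driver follows the standard DB-API 2.0 interface (PEP 249); the orders table, the function names, and the use of an in-memory SQLite database as a stand-in "provider" are all hypothetical and chosen only so the example runs on its own.

```python
import sqlite3


def total_by_region(conn):
    """Run a plain SQL aggregate through any DB-API 2.0 (PEP 249) connection.

    Nothing here names a specific vendor: the function relies only on
    cursor(), execute(), and fetchall(), plus standard SQL.
    """
    cur = conn.cursor()
    cur.execute(
        "SELECT region, SUM(amount) AS total "
        "FROM orders GROUP BY region ORDER BY region"
    )
    rows = cur.fetchall()
    cur.close()
    return [(region, total) for region, total in rows]


def make_local_provider():
    """Hypothetical stand-in provider: an in-memory SQLite database.

    In a real deployment this would be replaced by a connection obtained
    from a cloud service's own DB-API driver.
    """
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
    cur.execute(
        "INSERT INTO orders VALUES ('east', 10), ('west', 5), ('east', 7)"
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    print(total_by_region(make_local_provider()))
```

Under these assumptions, switching providers means changing only the function that creates the connection; the query logic stays put. In practice, SQL dialects and driver details such as parameter placeholder styles differ between services, so real migrations are rarely this clean, but the cost is bounded by how closely each service sticks to the standard.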
For developers, open services are easier to adopt than open source. So if data offerings can be open services, why do we need open source? I believe that the time for open sourcing new, disruptive data technologies is over. Existing open-source software will continue to run its course, but there is no incentive for builders or users to choose open source over open services for new data offerings.
Ironically, it was ease of adoption that drove the open-source wave, and it is ease of adoption of open services that will precipitate the demise of open source in data management. Just as the last decade was the era of open-source data infrastructure, the next decade belongs to open services in the cloud.
I've focused on how the ease of use of an open service disrupts open source. In my next blog post, I'll share more thoughts on the economics of the cloud and how they influence the design of new data management technology.