14 thoughts on the economics of the open-source data space and how to become the next MongoDB or Databricks

Image by the author.

The data space is booming, with companies like MongoDB (valued at 18 billion USD), Databricks (30 billion USD), Confluent, and many others. The startup space is overflowing with money, and lots of founders want a share of the pie.

But in my opinion, the data space is set up to be dominated by open-source solutions in the near future. Open-source spaces have a very clear winner-takes-most dynamic, which makes them extremely hard to compete in. And that’s not even considering the fact that, a priori, you don’t get paid to provide open-source solutions.

And yet, open-source-based companies…

How to measure a data team’s success, why dagster is a kool tool, and how Airbyte compares to meltano in the EL(T) open-source space.

Data will power every piece of our existence in the near future. I collect “Data Points” to help understand & shape this future.

If you want to support this, please share it on Twitter, LinkedIn, or Facebook.

(1)🔮 Measuring Data Teams

Einat Orr, co-founder of lakeFS and long-time data hero, makes a good case for using meaningful metrics to evaluate data teams. The three metrics she suggests are:

  1. Data quality
  2. Data development velocity
  3. Data uptime

She makes a good case for the three and gives some insights on how to treat each of these metrics. I like that approach and find it feasible as…
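To make the three metrics a bit more concrete, here is a minimal sketch of how they could be computed. The definitions are my own assumptions for illustration (quality as the share of passed data checks, velocity as the mean lead time of data requests, uptime as the share of a time window in which the data was available); they are not taken from the linked post, and `TeamMetrics` and `compute_metrics` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TeamMetrics:
    quality: float        # assumed: share of data checks that passed
    velocity_days: float  # assumed: mean days from request to delivery
    uptime: float         # assumed: share of the window with data available


def compute_metrics(
    checks_passed: int,
    checks_total: int,
    lead_times_days: Sequence[float],
    downtime_hours: float,
    window_hours: float,
) -> TeamMetrics:
    """Turn raw counts into the three (assumed) team-level metrics."""
    return TeamMetrics(
        quality=checks_passed / checks_total,
        velocity_days=sum(lead_times_days) / len(lead_times_days),
        uptime=1 - downtime_hours / window_hours,
    )


# Example: a 30-day window with 6 hours of data downtime.
metrics = compute_metrics(
    checks_passed=95,
    checks_total=100,
    lead_times_days=[2, 4, 6],
    downtime_hours=6,
    window_hours=24 * 30,
)
print(metrics)
```

Whatever definitions a team settles on, tracking them over time is likely more useful than any single snapshot.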

How to revive your dead dashboards, a cool new graphDB called terminusDB, and developer experience for data guys.

(1)🔮 Make Dashboards Less Dead

This is a great post by Tristan Handy. I think there’s a good use case for dashboards just as there is for any of your “BI artifacts”. This post contains an excellent plan on making dashboards “less dead”.

The post contains a roll-out plan for trustworthy dashboards & reports written by Alexander Jia:

  • Create a single source of truth in your dwh for…

How AI creates an award-winning whisky, how data companies make money, and which tools you can use to version data.

(1)🔮 Data Open Source Business Models

I just stumbled across some weird data orchestrator business models, so I started researching…

I’m sharing this article because, as I said before, I believe the data space will be dominated by open-source solutions pretty soon. As such, I think it’s interesting to understand how open-source companies actually make money and make sure they survive. Something we as end-users actually have…

Finally, a good SQL linter, how to choose a data orchestrator, and why decoupling makes the BI world better.

(1)🔥 SQLFluff, Finally a Good SQL Linter

Finally, there’s a SQL linter that actually might make the cut. Almost all SQL linters fail on two problems: the variety of SQL dialects (say hello to Redshift & Snowflake) and the ability to lint templated SQL, usually Jinja.

SQLFluff at least got these two problems nailed and seems to be in good shape. So if you haven’t found your SQL linter…

Airbyte follows up with a CDK, Querybook is open-sourced, and how to choose a data discovery platform.

(1)🎁 Airbyte’s CDK

I think open-source data integration is the future of data integration. Both of the current newcomers in this space, meltano and Airbyte, are facing some hurdles. One of the biggest is the ease of contribution to their projects. Last month, meltano launched their SDK for building connectors, and only a month later Airbyte followed suit.

They do provide a speed run through the…

No, DaC is not just versioning data! It’s applying the whole software engineering toolchain to data. For that, we need principles.

This post is part of a small series beginning with: Data as Code — Achieving Zero Production Defects for Analytics Datasets.

Image by Sven Balnojan.

Data as Code is a simple concept. Just like Infrastructure as Code. It just says “Treat your data as code”. And yet, after IaC appeared on the ThoughtWorks Radar in 2011, it still took roughly 10 years to “settle in”, and it is still in an uneasy spot where IaC advocates feel they need to remind people of the following:

“ …. Saying “treat infrastructure like code” isn’t enough; we need to ensure the hard-won learnings from the software world…

Opinion

Conway’s law has an evil corollary that goes unnoticed in the dev world but wrecks your data org.

“You think it’s a hack, but all you’re hacking apart is the value of your data.” (the author)

Image by Sven Balnojan.

Melvin Conway, a brilliant computer scientist who also invented the notion of a coroutine, has become pretty famous in the last 20 years for a law named after him:

“Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.”

Turns out this is very important as we move towards the world of domain-driven design and micro-service architectures. The law is also actionable with something called the “Inverse Conway Maneuver”.

What’s the t in EtLT? How to conduct manual data checks and the rise of the data engineer.

(1)🔮 E T! LT

With the rise of EL(T) over ETL, we took a great step towards much simpler and better processes in the data world. But it is becoming apparent that in some cases a little (t), as in E(t)L(T), is actually needed in some form. Because for some data sources, we only want parts of the data, not the complete raw data…
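A minimal sketch of what such a little (t) could look like: a projection step between extract and load that keeps only an allow-list of fields. The field names, `small_t`, and `load` are hypothetical and only serve as an illustration; a real pipeline would plug this into whatever EL tool is in use.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical allow-list: the only fields we want to land in the warehouse.
KEEP_FIELDS: Tuple[str, ...] = ("id", "created_at", "amount")


def small_t(
    records: Iterable[Dict], keep: Tuple[str, ...] = KEEP_FIELDS
) -> Iterator[Dict]:
    """The little (t): project each extracted record onto the allow-list."""
    for record in records:
        yield {key: record[key] for key in keep if key in record}


def load(records: Iterable[Dict]) -> List[Dict]:
    """Stand-in for the L step: here we simply collect the rows in a list."""
    return list(records)


# Stand-in for the E step: raw records with fields we do not want to load.
extracted = [
    {"id": 1, "created_at": "2021-05-01", "amount": 10.0, "email": "a@example.com"},
    {"id": 2, "created_at": "2021-05-02", "amount": 12.5, "notes": "free text"},
]

loaded = load(small_t(extracted))
print(loaded)
```

The heavy (T) still happens later in the warehouse; the little (t) only decides which parts of the raw data get loaded at all.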

What are the cool OSS data projects this year? What does a good data roadmap look like? And how to pitch the data mesh paradigm to the C-level.

(1)🔥 Hot OSS Data Projects

Pete Soderling ran a Data Council survey on the hot data projects for 2021, which features a bunch of interesting ones you should have a look at. The good old Apache Airflow and the transformation tool dbt are on the list, but also a few other interesting tools you might not have seen.

A few I’d like to highlight: Apache…

Sven Balnojan

Ph.D., Product Manager, DevOps & Data enthusiast, and author of “Three Data Point Thursday”: https://www.getrevue.co/profile/svenbalnojan.
