Data Lakehouse vs Data Warehouse: We Have Built Both
Figure: Data Lakehouse Architecture. End-to-end data flow with medallion layers (Bronze, Silver, Gold).
The data lakehouse has become the default recommendation in data engineering circles, and for good reason. But it is not the right choice for every team. We have built five data platforms in the past 18 months: two traditional warehouses (Snowflake, Redshift), two lakehouses (Databricks, S3+Spark+dbt), and one hybrid. Here is what we have learned about when each approach works best.
What a Lakehouse Actually Is
A data lakehouse combines the cheap, scalable storage of a data lake (S3, ADLS) with the structured querying and governance features of a warehouse. You store everything in open formats like Parquet or Delta Lake, and a query engine (Spark, Trino, or Databricks SQL) provides the SQL interface. The key benefit: you get one copy of the data that serves both analytics and ML workloads.
A traditional warehouse (Snowflake, Redshift, BigQuery) stores data in its own proprietary format and manages both storage and compute. The benefit: it just works. You load data in, write SQL, get fast results. No tuning, no file format decisions, no cluster management.
When a Warehouse Wins
Your team is mostly analysts and BI developers. If the primary use case is dashboards, ad-hoc SQL queries, and reporting, a warehouse is simpler and faster to get running. Snowflake's auto-scaling and caching make it extremely responsive for interactive queries without any tuning.
Data volumes are under 10TB. At smaller scales, the cost advantage of a lakehouse disappears. Snowflake's storage costs are higher per GB than S3, but the reduced engineering complexity more than makes up for it when your total data is relatively small.
You need it running in weeks, not months. A Snowflake setup with Fivetran for ingestion and dbt for transformation can be production-ready in 2-3 weeks. A lakehouse with custom Spark jobs, Delta Lake, and a serving layer takes 2-3 months to build and stabilize.
When a Lakehouse Wins
You need to serve both analytics and ML. If data scientists are running Spark jobs and pandas notebooks alongside analysts writing SQL, a lakehouse lets both teams work from the same data. With a warehouse, you end up copying data to S3 for the ML team, which creates sync headaches.
Data volumes exceed 50TB and are growing fast. Storage on S3 costs $0.023/GB/month. Storage on Snowflake costs $0.040/GB/month (or more depending on edition). At 100TB, that is $1,700 per month in storage savings. Compute costs differ too: Spark on spot instances is significantly cheaper than Snowflake credits for heavy transformation workloads.
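The storage arithmetic above is easy to sanity-check. A quick sketch using the per-GB prices quoted in this post (prices change and vary by region and edition, so treat them as illustrative):

```python
# Storage cost comparison using the per-GB prices quoted above.
# These are illustrative figures; check current pricing pages before relying on them.
S3_PER_GB = 0.023         # $/GB/month, S3 Standard
SNOWFLAKE_PER_GB = 0.040  # $/GB/month, Snowflake on-demand storage

def monthly_storage_savings(total_tb: float) -> float:
    """Monthly saving from keeping data on S3 instead of Snowflake storage."""
    total_gb = total_tb * 1000  # decimal TB -> GB
    return total_gb * (SNOWFLAKE_PER_GB - S3_PER_GB)

print(round(monthly_storage_savings(100)))  # -> 1700
```

At 100TB the gap is real money, but note it is storage only; the bigger swing at scale usually comes from the compute side (spot-instance Spark vs warehouse credits).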
You need streaming and batch on the same data. Delta Lake and Apache Iceberg support both batch and streaming writes to the same tables. This is hard to replicate with a traditional warehouse without building a separate streaming layer.
The Medallion Architecture
Every lakehouse we build uses the medallion pattern: Bronze (raw data as ingested), Silver (cleaned, validated, deduplicated), and Gold (business-level aggregations ready for dashboards and reports). This gives you clear data quality boundaries and makes debugging straightforward. When a dashboard number looks wrong, you trace it back through Gold to Silver to Bronze and find where the issue entered.
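The layer boundaries are easiest to see in code. Here is a minimal plain-Python sketch of the Bronze-to-Gold flow (real implementations would be Spark jobs or dbt models; the field names and validation rules here are hypothetical):

```python
# Medallion layers sketched with plain Python dicts.
# Bronze: raw records exactly as ingested, duplicates and bad rows included.
bronze = [
    {"order_id": 1, "amount": "120.50", "region": "EU"},
    {"order_id": 1, "amount": "120.50", "region": "EU"},  # duplicate
    {"order_id": 2, "amount": None,     "region": "US"},  # invalid amount
    {"order_id": 3, "amount": "75.00",  "region": "US"},
]

def to_silver(rows):
    """Silver: validated, typed, deduplicated on the business key."""
    seen, out = set(), []
    for r in rows:
        if r["amount"] is None or r["order_id"] in seen:
            continue  # drop (or quarantine) bad and duplicate rows
        seen.add(r["order_id"])
        out.append({**r, "amount": float(r["amount"])})
    return out

def to_gold(rows):
    """Gold: business-level aggregate, ready for a dashboard."""
    revenue = {}
    for r in rows:
        revenue[r["region"]] = revenue.get(r["region"], 0.0) + r["amount"]
    return revenue

silver = to_silver(bronze)
print(to_gold(silver))  # -> {'EU': 120.5, 'US': 75.0}
```

The debugging story falls out of this structure: a wrong Gold number is either a bad aggregation (Gold), a validation gap (Silver), or a source problem (Bronze), and each layer can be inspected independently.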
Tech Stack Comparison
For our warehouse projects, the stack is typically: Fivetran or Airbyte for ingestion, Snowflake or Redshift for storage and compute, dbt for transformation, and Looker or Metabase for dashboards.
For lakehouse projects: custom connectors or Airbyte for ingestion, S3 for storage, Spark (on Databricks or EMR) for processing, Delta Lake or Iceberg for the table format, dbt or Spark SQL for transformation, and Redshift Spectrum or Trino for the SQL serving layer.
The lakehouse stack has more pieces, which means more things to monitor and maintain. Budget for a dedicated data engineer to keep the pipelines running.
Cost Comparison From Real Projects
For a mid-size company processing 5TB of new data per month with 50TB total:
- Snowflake: $8,000-12,000/month (storage + compute credits)
- Databricks Lakehouse: $5,000-9,000/month (storage + DBUs)
- DIY Lakehouse (S3+EMR+dbt): $3,000-6,000/month (but higher engineering cost)
The DIY lakehouse is cheapest in infrastructure but most expensive in engineering time. Databricks sits in the middle. Snowflake is the most expensive but requires the least specialized engineering.
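One way to make that trade-off concrete is to fold engineering headcount into the monthly figure. The sketch below uses the midpoints of the infrastructure ranges above; the loaded engineer cost and staffing levels are assumptions for illustration, not figures from our projects, so substitute your own numbers:

```python
# All-in monthly cost = infrastructure + engineering headcount.
# ENGINEER_MONTHLY and the headcounts are hypothetical assumptions.
ENGINEER_MONTHLY = 15_000  # assumed fully-loaded cost per data engineer

options = {
    # (infra midpoint $/month, dedicated engineers needed - assumed)
    "Snowflake":     (10_000, 0.5),
    "Databricks":    (7_000,  1.0),
    "DIY lakehouse": (4_500,  2.0),
}

for name, (infra, engineers) in options.items():
    total = infra + engineers * ENGINEER_MONTHLY
    print(f"{name}: ${total:,.0f}/month all-in")
```

Under these assumptions the ranking inverts: the cheapest infrastructure becomes the most expensive platform once you pay the people needed to run it. That is the real argument for starting with a warehouse when the team is small.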
Our Recommendation
If your team is under 5 data engineers and your primary use case is BI/analytics, start with Snowflake or BigQuery. You can always migrate to a lakehouse later if you outgrow it. If you already have a data engineering team, your data volumes are large, and you need to serve ML workloads alongside analytics, build a lakehouse from the start. The upfront investment pays off within 6-12 months.