Cut spend, debug fast: CDC-first lakehouse

Kapil Poreddy, Senior Software Engineer @ Walmart

Peak season exposes every weakness in a data platform: partitions explode, indexes bloat, and “we’ll fold it back later” never happens. In this talk, I share a CDC-first lakehouse playbook that cuts spend and makes debugging fast. We’ll stitch streaming change data into an Apache Iceberg™, Delta, or Apache Hudi™ table; use burst-merge-foldback patterns to tame partitions after scale-ups; trim storage via index-policy hygiene and payload compression (e.g., Protobuf + zstd); and, my favorite, reconstruct entity state from snapshots via time travel for forensics. The stack is built on open tools (Kafka/Connect, Trino/Presto, DuckDB), with cloud-agnostic notes (incl. Cosmos change feed). You’ll leave with a blueprint, a checklist, and a 10-minute demo you can rerun in your org.

Key takeaways:

  • Architecture: CDC-first lakehouse reference diagram with insert-only + compaction.
  • Cost wins: Partition lifecycle (burst → merge → foldback) and index-policy pruning that actually sticks.
  • Speed: Payload choices (JSON→Proto), columnar + zstd/Snappy, and when to dictionary-encode.
  • Forensics: Reconstruct any entity point-in-time from CDC + snapshots in minutes.
  • Checklist: Runbook for pre-peak hardening and post-peak foldback.
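
To make the forensics takeaway concrete, here is a minimal sketch of point-in-time reconstruction: start from the latest snapshot at or before the target time, then replay the entity's CDC events up to that time. It is an illustration under stated assumptions, not the speaker's implementation. The ChangeEvent shape, the "upsert"/"delete" op names, and the reconstruct function are all hypothetical. In a real lakehouse, the snapshot would typically come from the table format's time-travel read and the events from the CDC log.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical CDC record; real connectors emit richer envelopes."""
    entity_id: str
    ts: int                  # event time, e.g. epoch millis
    op: str                  # "upsert" or "delete" (assumed op names)
    fields: Dict[str, Any]   # changed columns for an upsert


def reconstruct(
    entity_id: str,
    as_of: int,
    snapshot_ts: int,
    snapshot_state: Optional[Dict[str, Any]],
    events: Iterable[ChangeEvent],
) -> Optional[Dict[str, Any]]:
    """Return the entity's state at `as_of`, or None if it did not exist.

    `snapshot_state` is the entity's row as of `snapshot_ts` (None if the
    entity was absent from that snapshot).
    """
    if as_of < snapshot_ts:
        raise ValueError("pick a snapshot taken at or before as_of")

    state = dict(snapshot_state) if snapshot_state is not None else None

    # Keep only this entity's changes in the window (snapshot_ts, as_of].
    window = sorted(
        (e for e in events
         if e.entity_id == entity_id and snapshot_ts < e.ts <= as_of),
        key=lambda e: e.ts,
    )

    for event in window:
        if event.op == "delete":
            state = None
        elif event.op == "upsert":
            # Merge changed fields over the previous state.
            state = {**(state or {}), **event.fields}
        else:
            raise ValueError(f"unknown op: {event.op}")
    return state
```

For example, given a snapshot row {"status": "NEW", "total": 42} at ts=100 and an upsert setting status to "PAID" at ts=110, calling reconstruct with as_of=115 returns the snapshot row with status replaced by "PAID". The sketch orders events by timestamp only. A production version would likely need a tie-breaker such as a log offset or sequence number for events that share a timestamp.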

Where & when?

Open Source Data Summit 2025 was held on November 13th, 2025.

What is the cost of access to the live virtual sessions?

OSDS is always free and open to all.

What is Open Source Data Summit?

OSDS is a peer-to-peer gathering of data industry professionals, experts, and enthusiasts to explore the dynamic landscape of open source data tools and storage.

The central theme of OSDS revolves around the advantages of open source data products and their pivotal role in modern data ecosystems.

OSDS is the annual peer hub for knowledge exchange, fostering a deeper understanding of open source options and their role in shaping the data-driven future.

Who attends OSDS?

OSDS is attended by data engineers, data architects, developers, DevOps practitioners and managers, and data leadership.

Anyone looking for enriched perspectives on open source data tools and practical insights to navigate the evolving data landscape should attend this event.

Example topics for Open Source Data Summit:
  • Benefits of open source data tools
  • Cost/performance trade-offs
  • Building data storage solutions
  • Challenges surrounding open source data tool integration
  • Solutions for the cost of storing, accessing, and managing data
  • Data streams and ingestion
  • Hub-and-spoke data integration models
  • Choosing the right engine for your workload

Are you interested in speaking or sponsoring the next Open Source Data Summit?

Submit a talk proposal here or reach out to astronaut@solutionmonday.com.

That's a wrap for 2025! Enter your email address below for access to the 2025 sessions on-demand and news about the 2026 summit.