
Explore open data systems

Open source strategies for speed, performance, and savings

Nov 13, 2025 | Live Virtual Conference

2025 Speakers

Ajit Panda, Head of Data Security & Governance @ Uber

Chu Huang, Senior Software Engineer @ Uber

Jyoti Shah, Director of Application Development @ ADP

Kapil Poreddy, Senior Software Engineering Manager @ Walmart

Vinoth Chandar, Founder & CEO @ Onehouse

Jocelyn Byrne Houle, Sr. Director Product Management @ Securiti

Nancy Amandi, Data Engineer @ Moniepoint

Vivek Singh, Principal Database Specialist @ AWS

Pavi Subenderan, Software Engineer @ Uber

Bhakti Hinduja, Director of Software Engineering @ Ampere

Kai Wahner, Global Field CTO @ Confluent

Kyle Weller, VP of Product @ Onehouse

Raunak Kumar, Senior Manager, GTM Analytics @ Intercom

Shiyan (Raymond) Xu, Founding Team Member @ Onehouse

Milind Chitgupakar, Founder & CEO @ Yeedu

David Handermann, Project Management Committee Chair @ Apache NiFi

Tamara Janina Fingerlin, Senior Developer Advocate @ Astronomer

Senthil Thangavel, Staff Engineer @ PayPal

Sathish Srinivasan, Principal Software Engineer @ Oracle

Vidhyadar Unni Krishnan, Lead SDE, Commercial Payment Solutions @ Visa

Esteban Puerta, Founder @ CloudShip AI

Andrew Gelinas, Co-Founder @ Solution Monday

Karan Gupta, Cloud & AI Data Engineer

2025 Agenda | November 13, 2025

Morning Keynote
8:00 AM PT
The Open Efficiency Playbook: Why fast keeps your lakehouse open

Vinoth Chandar, Founder & CEO @ Onehouse

Open gives portability. Efficiency keeps it. This keynote shows how to assemble a vendor-neutral lakehouse with Apache Hudi™ or Apache Iceberg™, Apache XTable (incubating) for cross-format metadata without copies, and multi-catalog support so one physical dataset serves Apache Spark™, Trino, Flink, and Ray. Then we apply a cost SLO rubric and engine-fit rules to cut ETL spend and raise throughput. We finish by standardizing data layout work so every engine runs faster and cheaper, using reproducible OSS patterns.

Key takeaways:

  • Build a vendor-neutral core with Apache Hudi™ or Apache Iceberg™, XTable, and multi-catalog
  • Track simple cost SLOs: $ per TB transformed, $ per 100M rows, time to first result (see the sketch after this list)
  • Match engines to jobs: OLAP on Trino, ClickHouse, or StarRocks; DS/ML on Ray or Dask; streaming on Flink; and when to keep Spark
  • Cut ETL waste by reducing scans, shuffles, and rewrites
  • Standardize layout ops: partition evolution, compaction, clustering and sort, stats and pruning, metadata hygiene
  • Avoid cost spikes from managed table services and prefer reproducible OSS techniques
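
To make the cost SLO rubric concrete, here is a minimal sketch with illustrative numbers; the function and field names are assumptions, not material from the talk:

```python
# Minimal cost-SLO sketch: the three metrics named above, computed from
# illustrative job stats. Field names and values are assumptions.

def cost_slos(job_cost_usd, tb_transformed, rows_written, first_result_s):
    return {
        "usd_per_tb_transformed": job_cost_usd / tb_transformed,
        "usd_per_100m_rows": job_cost_usd / (rows_written / 100_000_000),
        "time_to_first_result_s": first_result_s,
    }

# Example: a $42 Spark job that transformed 3.5 TB into 900M rows,
# with the first downstream result available after 95 seconds.
slos = cost_slos(job_cost_usd=42.0, tb_transformed=3.5,
                 rows_written=900_000_000, first_result_s=95)
for name, value in slos.items():
    print(f"{name}: {value:,.2f}")
```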
Breakout Session
8:35 AM PT
The streaming-first lakehouse: Handling high-frequency mutable workloads with Apache Hudi™

Shiyan (Raymond) Xu, Founding Team Member @ Onehouse

Handling updates and deletes in a streaming lakehouse is a monumental challenge: processing high-frequency mutable workloads can lead to performance degradation, small file issues, and resource wastage due to conflicts when concurrent writers are involved. How can you build a truly streaming-first lakehouse without sacrificing performance? This session demystifies Apache Hudi™’s streaming-first designs built to handle these exact problems.

We'll dive into how Apache Hudi™ uses Merge-on-Read (MOR) tables to efficiently absorb frequent updates and record-level indexing to maintain low-latency writes for mutable data. Discover how auto-file sizing and asynchronous compaction proactively solve the "small file problem." We'll also cover Apache Hudi™ 1.0’s Non-Blocking Concurrency Control (NBCC) to avoid costly retries and the LSM Timeline for optimized metadata access.

Key takeaways:

  • Understand Apache Hudi™'s core designs for handling streaming mutable workloads at scale.
  • Solve challenging workloads involving high-frequency updates and large, mutable datasets.
  • Leverage Apache Hudi™'s advanced concurrency and metadata optimizations to build a stable, low-latency lakehouse.
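
For orientation, a minimal PySpark sketch of the kind of write configuration the session covers: a Merge-on-Read table with the record-level index enabled and compaction left to run asynchronously. Option keys follow the Apache Hudi™ docs, but verify them against your Hudi version; the table name, fields, and path are illustrative assumptions.

```python
# Minimal sketch: MOR write with record-level index and async compaction.
# Verify option keys against your Hudi version; names/paths are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-mor-sketch").getOrCreate()
events_df = spark.createDataFrame(
    [("e1", "2025-11-13T08:00:00Z", "click")], ["event_id", "ts", "kind"])

(events_df.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.index.type", "RECORD_INDEX")       # record-level index
    .option("hoodie.metadata.record.index.enable", "true")
    .option("hoodie.compact.inline", "false")          # keep writers unblocked
    .option("hoodie.compact.schedule.inline", "true")  # schedule async compaction
    .mode("append")
    .save("/tmp/lake/events"))
```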
Breakout Session
8:35 AM PT
From silos to signals: Quick wins for data quality

Jyoti Shah, Director of Application Development @ ADP

Bad data erodes trust and wastes time. Jyoti Kunal Shah will share the five pillars of data quality, the hidden costs of poor data, and common root causes like siloed systems and manual errors. Attendees will leave with practical quick wins—such as adding simple anomaly detection or schema validation to pipelines—that can be implemented this week to show immediate value.

Key takeaways:

  • Spot the hidden costs and risks of poor data quality.
  • Apply the five pillars of data quality to your own organization.
  • Identify common pitfalls like siloed systems and weak governance.
  • Walk away with low-effort optimizations (e.g., simple anomaly detection or schema validation, as sketched below) you can start tomorrow.
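
In the spirit of those quick wins, a minimal, dependency-free sketch of a schema check that could be dropped into an existing pipeline; the columns and rules are illustrative assumptions:

```python
# Minimal schema-validation quick win: fail fast when a batch of records
# drifts from the expected schema. Columns and rules are illustrative.
EXPECTED = {"order_id": int, "amount": float, "currency": str}

def validate_batch(records):
    errors = []
    for i, row in enumerate(records):
        missing = EXPECTED.keys() - row.keys()
        if missing:
            errors.append(f"row {i}: missing {sorted(missing)}")
            continue
        for col, typ in EXPECTED.items():
            if not isinstance(row[col], typ):
                errors.append(f"row {i}: {col} should be {typ.__name__}")
    if errors:
        raise ValueError("schema validation failed:\n" + "\n".join(errors))

validate_batch([{"order_id": 1, "amount": 9.99, "currency": "USD"}])
```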
Breakout Session
9:10 AM PT
Where Spark leaves you in the dark: Engine choices that cut costs

Kyle Weller, VP of Product @ Onehouse

Compare Apache Spark™, Trino, ClickHouse, and StarRocks. See where each excels and when to mix engines. Review build vs buy tradeoffs. Watch a demo of Onehouse Quanton that shows 2x to 3x savings on Spark workloads. Leave with a simple roadmap to run specialized engines for lower cost and higher reliability.

Key takeaways:

  • Match engines to workloads for price and performance wins
  • When to keep Spark and when to shift to Trino, ClickHouse, or StarRocks
  • Build vs buy criteria that prevent hidden ops costs
  • How Quanton optimizes Spark for 2x to 3x savings
  • A practical deployment framework for multi-engine stacks
Breakout Session
9:10 AM PT
Ensuring data quality across the organization

Nancy Amandi, Data Engineer @ Moniepoint

This session makes the case for data quality, then shows how to achieve it. Nancy will cover unit, regression, and anomaly tests; compare common frameworks with simple blueprints; and walk through a sample CI or Airflow pipeline with tests wired in, ending with test examples tied to business impact.

Key takeaways:

  • Why data quality matters for trust and cost
  • When to use unit, regression, or anomaly tests
  • Simple blueprints for leading data quality frameworks
  • A working example of a CI or Airflow pipeline with tests integrated
  • How test choices map to business impact across different contexts
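
As one possible shape for that pipeline example, a minimal Apache Airflow sketch with a row-count anomaly check gating the publish step; the DAG layout, task names, baseline, and 50% threshold are assumptions, not the session's actual demo:

```python
# Minimal Airflow sketch: a load task, a data-quality gate, then publish.
# The DAG layout, baseline, and 50% threshold are illustrative assumptions.
import pendulum
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=pendulum.datetime(2025, 1, 1), catchup=False)
def orders_pipeline():
    @task
    def load_orders() -> int:
        # ...extract/load logic goes here; return the number of rows loaded
        return 10_000

    @task
    def anomaly_check(row_count: int, baseline: int = 10_000) -> int:
        # Regression-style check: fail the run on a >50% swing vs. baseline
        if abs(row_count - baseline) / baseline > 0.5:
            raise ValueError(f"row count {row_count} outside expected range")
        return row_count

    @task
    def publish(row_count: int) -> None:
        print(f"published {row_count} rows")

    publish(anomaly_check(load_orders()))

orders_pipeline()
```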
Breakout Session
9:45 AM PT
Cut spend, debug fast: CDC-first lakehouse

Kapil Poreddy, Senior Software Engineering Manager @ Walmart

Peak season exposes every weakness in a data platform: partitions explode, indexes bloat, and “we’ll fold it back later” never happens. In this talk, I share a CDC-first lakehouse playbook that cuts spend and makes debugging fast. We’ll stitch streaming change data into an Apache Iceberg™/Delta/Apache Hudi™ table, use burst-merge-foldback patterns to tame partitions after scale-ups, trim storage via index-policy hygiene and payload compression (e.g., Protobuf + zstd), and—my favorite—time-travel reconstruction of entity state from snapshots for forensics. Powered by open tools (Kafka/Connect, Trino/Presto, DuckDB) with cloud-agnostic notes (incl. Cosmos change feed), you’ll leave with a blueprint, a checklist, and a 10-minute demo you can rerun in your org.

Key takeaways:

  • Architecture: CDC-first lakehouse reference diagram with insert-only + compaction.
  • Cost wins: Partition lifecycle (burst → merge → foldback) and index-policy pruning that actually sticks.
  • Speed: Payload choices (JSON→Proto), columnar + zstd/Snappy, and when to dictionary-encode.
  • Forensics: Reconstruct any entity point-in-time from CDC + snapshots in minutes.
  • Checklist: runbook for pre-peak hardening and post-peak foldback.
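
To make the payload-compression point concrete, a small sketch comparing raw JSON to a zstd-compressed payload using the zstandard package; the record shape and sizes are illustrative assumptions, and serializing with Protobuf first follows the same pattern:

```python
# Illustrative payload-size comparison: raw JSON vs. zstd-compressed.
# A real pipeline would serialize with Protobuf first; same pattern applies.
import json
import zstandard as zstd  # pip install zstandard

records = [{"id": i, "sku": f"SKU-{i % 100}", "qty": i % 7, "price": 19.99}
           for i in range(10_000)]
raw = json.dumps(records).encode("utf-8")

compressed = zstd.ZstdCompressor(level=3).compress(raw)
print(f"json: {len(raw):,} bytes, zstd: {len(compressed):,} bytes "
      f"({len(raw) / len(compressed):.1f}x smaller)")

# Round-trip to confirm the encoding is lossless.
assert json.loads(zstd.ZstdDecompressor().decompress(compressed)) == records
```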
Breakout Session
9:45 AM PT
PostgreSQL operational excellence: Core best practices for reliability and performance

Vivek Singh, Principal Database Specialist @ AWS

This 30-minute session covers essential PostgreSQL operational best practices for production environments, focusing on configuration parameters, monitoring approaches, and maintenance procedures critical for reliability and performance. Attendees will receive a practical checklist of immediately applicable improvements.

Key takeaways: 

  • Critical configuration parameters for performance
  • Essential monitoring metrics and alerting thresholds
  • Vacuum and maintenance strategies
  • Index optimization techniques
  • Backup and recovery procedures
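
As a taste of the monitoring angle, a short sketch that reads dead-tuple counts from pg_stat_user_tables to flag tables where vacuum may be lagging; the connection string and the 20% threshold are assumptions, not recommendations from the session:

```python
# Sketch: flag tables whose dead-tuple ratio suggests vacuum is lagging.
# The DSN and the 20% threshold are illustrative assumptions.
import psycopg2  # pip install psycopg2-binary

QUERY = """
    SELECT relname, n_live_tup, n_dead_tup
    FROM pg_stat_user_tables
    WHERE n_live_tup + n_dead_tup > 0
    ORDER BY n_dead_tup DESC
    LIMIT 20;
"""

with psycopg2.connect("dbname=app user=monitor") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for table, live, dead in cur.fetchall():
            ratio = dead / (live + dead)
            if ratio > 0.20:
                print(f"{table}: {ratio:.0%} dead tuples -- check autovacuum")
```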
Breakout Session
10:20 AM PT
Offensive data security for the real world

Ajit Panda, Head of Data Security & Governance @ Uber

Pavi Subenderan, Software Engineer @ Uber

Chu Huang, Senior Software Engineer @ Uber

Ajit, Pavi, and Chu join us from Uber to share how Offensive Data Security moves beyond reactive defense by proactively identifying vulnerabilities and simulating real-world exploits. They’ll highlight the GenAI-powered Data Security Auditor that automates policy and governance audits, and a security-aware recommendation engine for data pipeline editors, making data security faster, smarter, and more resilient.

Breakout Session
10:20 AM PT
Data streaming meets the lakehouse: Apache Iceberg™ for unified real-time and batch analytics

Kai Wahner, Global Field CTO @ Confluent

Modern enterprises need both the speed of streaming and the depth of lakehouse analytics. Apache Iceberg™ bridges these worlds with an open table format that supports ACID transactions, schema evolution, and vendor-neutral access from engines like Kafka, Flink, Apache Spark™, and Trino. This session explores how combining data streaming with Iceberg-powered lakehouses enables a “Shift Left” approach—governing data at the source, storing it once, and making it instantly reusable for real-time or batch workloads.

Key takeaways:

  • Why Apache Iceberg™ is the open standard for unifying streaming and lakehouse data
  • How to turn real-time streams into governed, reusable lakehouse tables
  • Patterns to cut duplication and reduce Reverse ETL costs
  • Practical ways to improve data quality and governance with minimal overhead
  • Steps to future-proof architectures with open formats and multi-engine access
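
For a flavor of the pattern, a minimal Spark Structured Streaming sketch that lands a Kafka topic in an Apache Iceberg™ table; the broker, topic, catalog, and table names are assumptions, the cluster needs the Iceberg runtime and Kafka connector configured, and Flink SQL offers an equivalent path:

```python
# Minimal sketch: Kafka -> Apache Iceberg™ with Spark Structured Streaming.
# Broker, topic, catalog, and table names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("stream-to-iceberg").getOrCreate()

events = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
    .select(col("key").cast("string").alias("order_key"),
            col("value").cast("string").alias("payload"),
            col("timestamp")))

query = (events.writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/orders")
    .toTable("lakehouse.db.orders"))  # stored once, queryable by any engine
query.awaitTermination()
```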
Breakout Session
10:55 AM PT
Enabling enterprise AI at scale for financial services: How unified data governance powers compliant AI innovation

Jocelyn Byrne Houle, Sr. Director Product Management @ Securiti

Breakout Session
10:55 AM PT
Building open source data pipelines for marketing mix modeling at scale

Raunak Kumar, Senior Manager, GTM Analytics @ Intercom

This session provides a technical walkthrough of marketing mix modeling (MMM) pipelines built with open-source tools, covering:

  • Ingestion: streaming ad and CRM data with Kafka/CDC connectors.
  • Storage: structuring multi-terabyte datasets in Apache Iceberg™/Delta with partitioning and compaction.
  • Feature engineering: harmonizing spend and conversion data with Apache Spark™/Polars and time-aligned features for regression-based MMM.
  • Validation: handling schema drift, metadata management, and experiment logs with open-source catalogs.
  • Engine trade-offs: using DuckDB/Polars for exploration vs. Apache Spark™ for scaled model training.

Key takeaways:

  • A reproducible blueprint for MMM pipelines using open-source tools.
  • Benchmarks and trade-offs across engines and formats.
  • Patterns for schema governance, cost control, and feature generation.
  • Lessons from real SaaS deployments at scale.
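
To illustrate the exploration side of that engine trade-off, a small DuckDB sketch that time-aligns weekly spend and conversions as inputs to a regression-based MMM; the Parquet path and column names are illustrative assumptions:

```python
# Sketch: quick MMM feature exploration in DuckDB before scaling to Spark.
# The Parquet path and column names are illustrative assumptions.
import duckdb  # pip install duckdb

weekly = duckdb.sql("""
    SELECT date_trunc('week', spend_date) AS week,
           channel,
           sum(spend)       AS spend,
           sum(conversions) AS conversions
    FROM read_parquet('data/marketing_spend/*.parquet')
    GROUP BY 1, 2
    ORDER BY 1, 2
""").df()  # pandas DataFrame, ready for a regression library
print(weekly.head())
```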
Breakout Session
11:30 AM PT
Orchestrating your robots: Agent interaction and human-in-the-loop with Apache Airflow® 3.1

Tamara Janina Fingerlin, Senior Developer Advocate @ Astronomer

Breakout Session
11:30 AM PT
Quantify the effort and outcome of removing technical debt from a large project

David Handermann, Project Management Committee Chair @ Apache NiFi

Breakout Session
12:05 PM PT
From tokens to thoughts: Breaking down LLM inference

Bhakti Hinduja, Director of Software Engineering @ Ampere

Breakout Session
12:05 PM PT
Modern security patterns for serverless apps: A multi-layer defense model

Karan Gupta, Cloud & AI Data Engineer

Breakout Session
12:40 PM PT
Distributed computing made easy: Harnessing Ray for scalable data processing

Sathish Srinivasan, Principal Software Engineer @ Oracle

Breakout Session
12:40 PM PT
RAG vs fine-tuning vs hybrid: A practitioner's guide to choosing the right approach for enterprise AI applications

Senthil Thangavel, Staff Engineer @ PayPal

Breakout Session
1:15 PM PT
Saving big data from big bills: Spark efficiency for an AI-ready era

Milind Chitgupakar, Founder & CEO @ Yeedu

Breakout Session
1:15 PM PT
Own agents, don't ship keys: Build secure, intelligent agents you control

Esteban Puerta, Founder @ CloudShip AI

Where & when?

Open Source Data Summit 2025 will be held on November 13th, 2025.

What is the cost of access to the live virtual sessions?

OSDS is always free and open to all.

What is Open Source Data Summit?

OSDS is a peer-to-peer gathering of data industry professionals, experts, and enthusiasts to explore the dynamic landscape of open source data tools and storage.

The central theme of OSDS revolves around the advantages of open source data products and their pivotal role in modern data ecosystems.

OSDS is the annual peer hub for knowledge exchange, fostering a deeper understanding of open source options and their role in shaping the data-driven future.

Who attends OSDS?

OSDS is attended by data engineers, data architects, developers, DevOps practitioners and managers, and data leadership.

Anyone looking for enriched perspectives on open source data tools and practical insights to navigate the evolving data landscape should attend this event.

On November 13th, 2025, we'll be back for discussions about:
  • Benefits of open source data tools
  • Cost/performance trade-offs
  • Building data storage solutions
  • Challenges surrounding open source data tool integration
  • Solutions for the cost of storing, accessing, and managing data
  • Data streams and ingestion
  • Hub-and-spoke data integration models
  • Choosing the right engine for your workload
Are you interested in speaking or sponsoring the next Open Source Data Summit?

Submit a talk proposal here or reach out to astronaut@solutionmonday.com.

Fill out the form below to register for a free ticket to Open Source Data Summit 2025!