Explore open data systems
Open source strategies for speed, performance, and savings
Nov 13, 2025 | Live Virtual Conference
2025 Speakers | Open Source Data Summit 2025
Ajit Panda
Head of Data Security & Governance
Chu Huang
Senior Software Engineer
Jyoti Shah
Director of Application Development
Kapil Poreddy
Senior Software Engineering Manager
Vinoth Chandar
Founder & CEO
Jocelyn Byrne Houle
Sr. Director Product Management
Nancy Amandi
Data Engineer
Vivek Singh
Principal Database Specialist
Pavi Subenderan
Software Engineer
Bhakti Hinduja
Director of Software Engineering
Kai Wahner
Global Field CTO
Kyle Weller
VP of Product
Raunak Kumar
Senior Manager, GTM Analytics
Shiyan (Raymond) Xu
Founding Team Member
Milind Chitgupakar
Founder & CEO
David Handermann
Project Management Committee Chair
Tamara Janina Fingerlin
Senior Developer Advocate
Senthil Thangavel
Staff Engineer
Sathish Srinivasan
Principal Software Engineer
Vidhyadar Unni Krishnan
Lead SDE, Commercial Payment Solutions
Esteban Puerta
Founder
Andrew Gelinas
Co-Founder
Karan Gupta
Cloud & Data Engineer
2025 Agenda | Open Source Data Summit | November 13, 2025
Morning Keynote
8:00 AM PT
The Open Efficiency Playbook: Why fast keeps your lakehouse open
Vinoth Chandar, Founder & CEO @ Onehouse
Open gives portability. Efficiency keeps it. This keynote shows how to assemble a vendor-neutral lakehouse with Apache Hudi™ or Apache Iceberg™, Apache XTable (incubating) for cross-format metadata without copies, and multi-catalog so one physical dataset serves Apache Spark™, Trino, Flink, and Ray. Then we apply a cost SLO rubric and engine fit rules to cut ETL spend and raise throughput. We finish by standardizing data layout work so every engine runs faster and cheaper using reproducible OSS patterns.
Key takeaways:
- Build a vendor-neutral core with Apache Hudi™ or Apache Iceberg™, XTable, and multi-catalog
- Track simple cost SLOs: $ per TB transformed, $ per 100M rows, time to first result
- Match engines to jobs: OLAP on Trino or ClickHouse or StarRocks, DS/ML on Ray or Dask, streaming on Flink, when to keep Spark
- Cut ETL waste by reducing scans, shuffles, and rewrites
- Standardize layout ops: partition evolution, compaction, clustering and sort, stats and pruning, metadata hygiene
- Avoid cost spikes from managed table services and prefer reproducible OSS techniques
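As a concrete illustration of the cost SLOs listed above, here is a minimal Python sketch; the metric names and example numbers are illustrative, not figures from the keynote.

```python
# Minimal sketch: computing the cost SLOs named above from job-level metrics.
# The field names and numbers are illustrative, not from any specific platform.
from dataclasses import dataclass


@dataclass
class JobRun:
    name: str
    dollars: float                  # total cost attributed to the run
    tb_transformed: float           # terabytes read + written by the transformation
    rows_written: int               # rows landed in the target table
    seconds_to_first_result: float


def cost_slos(run: JobRun) -> dict:
    """Return the three cost SLOs suggested in the keynote takeaways."""
    return {
        "dollars_per_tb_transformed": run.dollars / run.tb_transformed,
        "dollars_per_100m_rows": run.dollars / (run.rows_written / 100_000_000),
        "time_to_first_result_s": run.seconds_to_first_result,
    }


if __name__ == "__main__":
    nightly = JobRun("nightly_etl", dollars=84.0, tb_transformed=3.5,
                     rows_written=420_000_000, seconds_to_first_result=95.0)
    print(cost_slos(nightly))
    # {'dollars_per_tb_transformed': 24.0, 'dollars_per_100m_rows': 20.0, ...}
```

Tracking these per pipeline over time makes engine-fit decisions (Spark vs. Trino vs. Ray) a measurable comparison rather than a preference.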
Breakout Session
8:35 AM PT
The streaming-first lakehouse: Handling high-frequency mutable workloads with Apache Hudi™
Shiyan Xu, Founding Team Member @ Onehouse
Handling updates and deletes in a streaming lakehouse is a monumental challenge: processing high-frequency mutable workloads can lead to performance degradation, small file issues, and resource wastage due to conflicts when concurrent writers are involved. How can you build a truly streaming-first lakehouse without sacrificing performance? This session demystifies Apache Hudi™’s streaming-first designs built to handle these exact problems.
We'll dive into how Apache Hudi™ uses Merge-on-Read (MOR) tables to efficiently absorb frequent updates and record-level indexing to maintain low-latency writes for mutable data. Discover how auto-file sizing and asynchronous compaction proactively solve the "small file problem." We'll also cover Apache Hudi™ 1.0’s Non-Blocking Concurrency Control (NBCC) to avoid costly retries and the LSM Timeline for optimized metadata access.
Key Takeaways:
- Understand Apache Hudi™'s core designs for handling streaming mutable workloads at scale.
- Solve challenging workloads involving high-frequency updates and large, mutable datasets.
- Leverage Apache Hudi™'s advanced concurrency and metadata optimizations to build a stable, low-latency lakehouse.
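For orientation, here is a minimal PySpark sketch of the write path described above: a Merge-on-Read table with record-level indexing and compaction kept off the writer's critical path. The table name, key fields, and paths are hypothetical, and the option keys follow recent Apache Hudi™ releases, so verify them against your version.

```python
# Minimal sketch of a streaming-friendly Hudi upsert: MOR table type, record-level
# index, and asynchronous compaction. All names and paths are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-mor-sketch")
         # Requires the Hudi Spark bundle jar on the classpath.
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

updates_df = spark.createDataFrame(
    [(1, "shipped", "2025-11-13T08:00:00Z")],
    ["order_id", "status", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # absorb updates in log files
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.metadata.record.index.enable": "true",          # record-level index
    "hoodie.index.type": "RECORD_INDEX",
    "hoodie.compact.inline": "false",                       # keep writes low latency
    "hoodie.datasource.compaction.async.enable": "true",    # compact asynchronously
}

(updates_df.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3a://lake/bronze/orders"))
```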
Breakout Session
8:35 AM PT
From silos to signals: Quick wins for data quality
Jyoti Shah, Director of Application Development @ ADP
Bad data erodes trust and wastes time. Jyoti Shah will share the five pillars of data quality, the hidden costs of poor data, and common root causes like siloed systems and manual errors. Attendees will leave with practical quick wins, such as adding simple anomaly detection or schema validation to pipelines, that can be implemented this week to show immediate value.
Key takeaways:
- Spot the hidden costs and risks of poor data quality.
- Apply the five pillars of data quality to your own organization.
- Identify common pitfalls like siloed systems and weak governance.
- Walk away with low-effort optimizations (e.g., simple anomaly detection or schema validation) you can start tomorrow.
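To make those quick wins concrete, here is a minimal sketch of two of them, schema validation and a crude row-count anomaly check, using plain pandas; the column names and thresholds are illustrative.

```python
# Minimal sketch of two "quick wins": schema validation and a row-count anomaly
# check. Column names, dtypes, and thresholds are illustrative placeholders.
import pandas as pd

EXPECTED_SCHEMA = {"employee_id": "int64", "pay_date": "datetime64[ns]", "net_pay": "float64"}


def validate_schema(df: pd.DataFrame) -> list[str]:
    """Return human-readable schema problems (an empty list means the load looks fine)."""
    problems = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return problems


def row_count_anomaly(today: int, history: list[int], tolerance: float = 0.5) -> bool:
    """Flag a load whose row count deviates more than `tolerance` from the recent median."""
    median = sorted(history)[len(history) // 2]
    return abs(today - median) > tolerance * median


if __name__ == "__main__":
    df = pd.DataFrame({"employee_id": [1, 2],
                       "pay_date": pd.to_datetime(["2025-11-01", "2025-11-01"]),
                       "net_pay": [1200.0, 950.5]})
    print(validate_schema(df))                    # [] -> schema looks fine
    print(row_count_anomaly(10, [100, 98, 102]))  # True -> today's load is suspicious
```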
Breakout Session
9:10 AM PT
Where Spark leaves you in the dark: Engine choices that cut costs
Kyle Weller, VP of Product @ Onehouse
Compare Apache Spark™, Trino, ClickHouse, and StarRocks. See where each excels and when to mix engines. Review build vs buy tradeoffs. Watch a demo of Onehouse Quanton that shows 2x to 3x savings on Spark workloads. Leave with a simple roadmap to run specialized engines for lower cost and higher reliability.
Key takeaways:
- Match engines to workloads for price and performance wins
- When to keep Spark and when to shift to Trino, ClickHouse, or StarRocks
- Build vs buy criteria that prevent hidden ops costs
- How Quanton optimizes Spark for 2x to 3x savings
- A practical deployment framework for multi-engine stacks
Breakout Session
9:10 AM PT
Ensuring data quality across the organization
Nancy Amandi, Data Engineer @ Moniepoint
This session makes the case for data quality, then shows how to achieve it. Nancy will cover unit, regression, and anomaly tests, walk through a comparison of common frameworks with simple blueprints, and look at a sample CI or Airflow pipeline with tests wired in, ending with test examples tied to business impact.
Key Takeaways
- Why data quality matters for trust and cost
- When to use unit, regression, or anomaly tests
- Simple blueprints for leading data quality frameworks
- A working example of a CI or Airflow pipeline with tests integrated
- How test choices map to business impact across different contexts
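As a minimal illustration of wiring a test into an Airflow pipeline, the sketch below adds a quality gate between load and publish steps; the DAG id, table, and check logic are illustrative only.

```python
# Minimal sketch of a data-quality gate inside an Airflow pipeline. The DAG id,
# sample rows, and invariant are placeholders, not from the session itself.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def payments_with_quality_gate():

    @task
    def load() -> list[dict]:
        # Stand-in for the real ingestion step.
        return [{"payment_id": 1, "amount": 40.0}, {"payment_id": 2, "amount": -5.0}]

    @task
    def quality_check(rows: list[dict]) -> list[dict]:
        # Fail the run (and block downstream tasks) if a basic invariant breaks.
        bad = [r for r in rows if r["amount"] < 0]
        if bad:
            raise ValueError(f"{len(bad)} rows with negative amount: {bad}")
        return rows

    @task
    def publish(rows: list[dict]) -> None:
        print(f"publishing {len(rows)} validated rows")

    publish(quality_check(load()))


payments_with_quality_gate()
```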
Breakout Session
9:45 AM PT
Cut spend, debug fast: CDC-first lakehouse
Kapil Poreddy, Senior Software Engineer @ Walmart
Peak season exposes every weakness in a data platform: partitions explode, indexes bloat, and “we’ll fold it back later” never happens. In this talk, I share a CDC-first lakehouse playbook that cuts spend and makes debugging fast. We’ll stitch streaming change data into an Apache Iceberg™/Delta/Apache Hudi™ table, use burst-merge-foldback patterns to tame partitions after scale-ups, trim storage via index-policy hygiene and payload compression (e.g., Protobuf + zstd), and—my favorite—time-travel reconstruction of entity state from snapshots for forensics. Powered by open tools (Kafka/Connect, Trino/Presto, DuckDB) with cloud-agnostic notes (incl. Cosmos change feed), you’ll leave with a blueprint, a checklist, and a 10-minute demo you can rerun in your org.
Key takeaways:
- Architecture: CDC-first lakehouse reference diagram with insert-only + compaction.
- Cost wins: Partition lifecycle (burst → merge → foldback) and index-policy pruning that actually sticks.
- Speed: Payload choices (JSON→Proto), columnar + zstd/Snappy, and when to dictionary-encode.
- Forensics: Reconstruct any entity point-in-time from CDC + snapshots in minutes.
- Checklist: runbook for pre-peak hardening and post-peak foldback.
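A minimal sketch of the time-travel forensics step, assuming an Apache Iceberg™ table read through Spark; the table, entity id, and timestamp are hypothetical, and Hudi and Delta expose comparable as-of options.

```python
# Minimal sketch of "time-travel reconstruction": read the table as of a past
# instant and diff one entity's state against today. Names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-forensics-sketch").getOrCreate()

# State of the table just before the incident (Iceberg time travel by timestamp).
before = (spark.read
          .option("as-of-timestamp", "1699862400000")  # epoch millis
          .table("lake.orders"))

# Current state of the same table.
now = spark.read.table("lake.orders")

entity_then = before.filter("order_id = 12345")
entity_now = now.filter("order_id = 12345")

# Rows of this entity's state that changed between the two snapshots.
entity_then.exceptAll(entity_now).show()
```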
Breakout Session
9:45 AM PT
PostgreSQL operational excellence: Core best practices for reliability and performance
Vivek Singh, Principal Database Specialist @ AWS
This 30-minute session covers essential PostgreSQL operational best practices for production environments, focusing on configuration parameters, monitoring approaches, and maintenance procedures critical for reliability and performance. Attendees will receive a practical checklist of immediately applicable improvements.
Key takeaways:
- Critical configuration parameters for performance
- Essential monitoring metrics and alerting thresholds
- Vacuum and maintenance strategies
- Index optimization techniques
- Backup and recovery procedures
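One of the monitoring checks above can be sketched in a few lines: flag tables whose dead-tuple ratio suggests vacuum is falling behind. The connection string and the 20% threshold are illustrative; tune both for your environment.

```python
# Minimal sketch of a vacuum-health check against pg_stat_user_tables.
import psycopg2

QUERY = """
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum
FROM pg_stat_user_tables
WHERE n_live_tup > 0
  AND n_dead_tup::float / n_live_tup > %s
ORDER BY n_dead_tup DESC;
"""

with psycopg2.connect("dbname=app user=monitor host=localhost") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY, (0.20,))   # flag tables with > 20% dead tuples
        for relname, live, dead, last_av in cur.fetchall():
            print(f"{relname}: {dead}/{live} dead tuples, last autovacuum {last_av}")
```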
Breakout Session
10:20 AM PT
Offensive data security for the real world
Ajit Panda, Head of Data Security & Governance @ Uber
Pavi Subenderan, Software Engineer @ Uber
Chu Huang, Senior Software Engineer @ Uber
Ajit, Pavi, and Chu join us from Uber to share how Offensive Data Security moves beyond reactive defense by proactively identifying vulnerabilities and simulating real-world exploits. They’ll highlight the GenAI-powered Data Security Auditor that automates policy and governance audits, and a security-aware recommendation engine for data pipeline editors, making data security faster, smarter, and more resilient.
Breakout Session
10:20 AM PT
Data streaming meets the lakehouse - Apache Iceberg™ for unified real-time and batch analytics
Kai Wahner, Global Field CTO @ Confluent
Modern enterprises need both the speed of streaming and the depth of lakehouse analytics. Apache Iceberg™ bridges these worlds with an open table format that supports ACID transactions, schema evolution, and vendor-neutral access from engines like Kafka, Flink, Apache Spark™, and Trino. This session explores how combining data streaming with Iceberg-powered lakehouses enables a “Shift Left” approach—governing data at the source, storing it once, and making it instantly reusable for real-time or batch workloads.
Key Takeaways:
- Why Apache Iceberg™ is the open standard for unifying streaming and lakehouse data
- How to turn real-time streams into governed, reusable lakehouse tables
- Patterns to cut duplication and reduce Reverse ETL costs
- Practical ways to improve data quality and governance with minimal overhead
- Steps to future-proof architectures with open formats and multi-engine access
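A minimal sketch of the streaming-to-lakehouse pattern described above, using Spark Structured Streaming to land a Kafka topic in an Apache Iceberg™ table; the broker, topic, catalog, and table names are hypothetical, and the session builder assumes the Iceberg Spark runtime and a catalog are already configured.

```python
# Minimal sketch: consume a Kafka topic and append it to an Iceberg table so the
# same data serves both real-time and batch engines. All names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-iceberg-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load()
          .select(col("key").cast("string"),
                  col("value").cast("string"),
                  col("timestamp")))

query = (events.writeStream
         .format("iceberg")
         .outputMode("append")
         .option("checkpointLocation", "s3a://lake/checkpoints/orders")
         .toTable("lakehouse.raw.orders"))

query.awaitTermination()
```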
Breakout Session
10:55 AM PT
Enabling enterprise AI at scale for financial services: How unified data governance powers compliant AI innovation
Jocelyn Byrne Houle, Sr. Director Product Management @ Securiti
Enterprise AI adoption is accelerating, but organizations face a critical challenge: managing data security, privacy, governance, and compliance across fragmented systems and hybrid multicloud environments. Traditional approaches using disconnected point tools create complexity, inconsistent controls, and become blockers to AI velocity rather than enablers.
This session explores how a unified data controls framework—powered by common data intelligence—enables organizations to safely harness the power of data and AI. We'll examine practical strategies for discovering AI models, mapping data flows, assessing risks, implementing controls, and ensuring continuous compliance with global AI standards including NIST AI RMF and the EU AI Act.
Key takeaways:
- Comprehensive AI & data discovery across public clouds, private clouds, and SaaS applications to eliminate "Shadow AI" and gain complete visibility
- Automated data+AI flow mapping that connects AI models to data sources, processing paths, sensitive data usage, and compliance obligations for full provenance tracking
- AI security controls addressing OWASP Top 10 vulnerabilities for LLMs and mitigating risks like prompt injection, data poisoning, and insecure model outputs
- Unified access governance establishing least privilege controls for structured and unstructured data with automated access pattern tracking
- Data quality and trust frameworks that profile, classify, and validate data automatically—building confidence in AI training data and decision-making
- Real-world implementation patterns for transitioning from disconnected tools to a unified data command center that accelerates AI innovation while reducing cost and complexity
Breakout Session
10:55 AM PT
Building open source data pipelines for marketing mix modeling at scale
Raunak Kumar, Senior Manager, GTM Analytics @ Intercom
This session provides a technical walkthrough of MMM pipelines built with open-source tools, covering:
- Ingestion: streaming ad and CRM data with Kafka/CDC connectors.
- Storage: structuring multi-terabyte datasets in Apache Iceberg™/Delta with partitioning and compaction.
- Feature engineering: harmonizing spend and conversion data with Apache Spark™/Polaris and time-aligned features for regression-based MMM.
- Validation: handling schema drift, metadata management, and experiment logs with open-source catalogs.
- Engine trade-offs: using DuckDB/Polars for exploration vs. Apache Spark™ for scaled model training.
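As a small illustration of the exploration and feature-engineering steps above, here is a DuckDB sketch that time-aligns daily ad spend and conversions into weekly channel features for a regression-based MMM; the file paths and column names are illustrative.

```python
# Minimal sketch: build weekly, time-aligned MMM features with DuckDB. Paths and
# column names are placeholders; reading from S3 needs DuckDB's httpfs extension.
import duckdb

weekly_features = duckdb.sql("""
    WITH spend AS (
        SELECT date_trunc('week', spend_date) AS week,
               channel,
               sum(spend_usd) AS spend_usd
        FROM read_parquet('s3://lake/marketing/ad_spend/*.parquet')
        GROUP BY 1, 2
    ),
    conversions AS (
        SELECT date_trunc('week', conversion_date) AS week,
               count(*) AS conversions
        FROM read_parquet('s3://lake/crm/conversions/*.parquet')
        GROUP BY 1
    )
    SELECT s.week, s.channel, s.spend_usd, c.conversions
    FROM spend s
    JOIN conversions c USING (week)
    ORDER BY s.week, s.channel
""").df()   # hand the aligned features to the modeling step as a DataFrame

print(weekly_features.head())
```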
Key takeaways
- A reproducible blueprint for MMM pipelines using open-source tools.
- Benchmarks and trade-offs across engines and formats.
- Patterns for schema governance, cost control, and feature generation.
- Lessons from real SaaS deployments at scale.
Breakout Session
11:30 AM PT
Orchestrating your robots: Agent-interaction and human-in-the-loop with Apache Airflow® 3.1
Tamara Janina Fingerlin, Senior Developer Advocate @ Astronomer
Apache Airflow® is the open-source standard for workflow orchestration. Already a staple of the modern data engineering stack, Airflow is increasingly used to manage AI pipelines, including data preprocessing, model fine-tuning, inference execution, and multi-agent workflow coordination.
This talk covers the latest features available in Airflow 3.1 for AIOps pipelines, from event-driven scheduling to human-in-the-loop operators, as well as the Airflow AI SDK, a new open source package to orchestrate LLMs and agents.
Key takeaways:
- Create a multi-agent pipeline as an Apache Airflow® Dag
- Orchestrate Pydantic AI Agents with the @task.agent decorator of the Airflow AI SDK
- Add human decision making to your orchestration with the new human-in-the-loop operators
- Use AssetWatchers to schedule pipelines based on messages in a message queue
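For context, the sketch below shows a hand-rolled version of this pattern: calling a Pydantic AI agent from a plain Airflow task. The Airflow AI SDK's @task.agent decorator covered in the talk wraps this kind of pattern; the model name, prompt, and result attribute here are assumptions, not the SDK's API.

```python
# Minimal hand-rolled sketch of agent orchestration from a regular Airflow task.
# The model, prompt, and result attribute are assumptions for illustration only.
from datetime import datetime

from airflow.decorators import dag, task
from pydantic_ai import Agent

summarizer = Agent("openai:gpt-4o",
                   system_prompt="Summarize pipeline incidents in one line.")


@dag(schedule=None, start_date=datetime(2025, 1, 1), catchup=False)
def incident_triage():

    @task
    def collect_logs() -> str:
        return "2025-11-13 08:02 task load_orders failed: S3 timeout after 3 retries"

    @task
    def summarize(log_text: str) -> str:
        result = summarizer.run_sync(log_text)
        return str(result.output)   # attribute name may differ across pydantic-ai versions

    summarize(collect_logs())


incident_triage()
```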
Breakout Session
11:30 AM PT
Quantify the effort and outcome of removing technical debt from a large project
David Handermann, Project Management Committee Chair @ Apache NiFi
Key takeaways:
- Understand the process for major version releases in an Apache Software Foundation project
- Learn the primary new features of Apache NiFi 2
- Quantify the effort and outcome of removing technical debt from a large project
- Consider the infrastructure benefits of optimized NiFi clustering on Kubernetes
- See the possibilities for building new NiFi Processors in native Python
Breakout Session
12:05 PM PT
From tokens to thoughts: Breaking down LLM inference
Bhakti Hinduja, Director of Software Engineering @ Ampere
This talk delivers a technical deep dive into LLM inference bottlenecks, focusing on the shift from compute-bound prefill to memory-bound decode phases. Using existing benchmarks and KV cache scaling, it demonstrates why memory bandwidth dominates AI serving architectures. Optimization strategies such as quantization, PagedAttention, speculative decoding, and dynamic batching are compared across leading inference frameworks such as vLLM.
Emerging alternatives to transformers—including State Space Models (SSMs) and Mixture of Experts (MoE)—are evaluated for scalability and efficiency. The session highlights hierarchical inference architectures for embodied AI, covering real-time robotics and edge deployments with advanced compute. AGI infrastructure requirements are quantified, emphasizing the coming 1000x scale-up in compute and memory and outlining three evolving inference regimes: cloud hyperscale, edge/mobile, and task-specific accelerators.
Key Takeaways:
- Memory optimization (quantization, efficient caching) is the primary constraint
- MoE, SSM, and hybrid models address transformer scalability limits
- Hierarchical approaches enable real-time and edge AI
- Production teams should prepare for rapid increases in inference workloads and infrastructure needs
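The shift to memory-bound decode is easy to see with a back-of-the-envelope calculation. The sketch below sizes the KV cache for an illustrative Llama-70B-class configuration; the dimensions are assumptions, not benchmark data from the talk.

```python
# Back-of-the-envelope KV cache sizing: 2 tensors (K and V) per layer, per token,
# per sequence. Model dimensions below are illustrative (grouped-query attention,
# fp16 weights); swap in your own to see why decode becomes memory-bound.
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


if __name__ == "__main__":
    per_seq = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                             seq_len=4096, batch=1)
    batch32 = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                             seq_len=4096, batch=32)
    print(f"one 4k-token sequence: {per_seq / 1e9:.2f} GB")   # ~1.3 GB
    print(f"batch of 32 sequences: {batch32 / 1e9:.2f} GB")   # ~43 GB
    # Every generated token re-reads this cache, so decode throughput is bounded
    # by memory bandwidth long before the accelerator runs out of FLOPs.
```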
Breakout Session
12:05 PM PT
Modern security patterns for serverless apps: A multi-layer defense model
Karan Gupta, Cloud & AI Data Engineer
As teams shift to serverless and hybrid cloud, security must be embedded at every layer, not concentrated at the edge. This session presents a practical multi-layer security framework for open source serverless data pipelines that balances transparency, interoperability, and cost control. Using Kubeless for event execution, HashiCorp Vault for secrets and key management, Open Policy Agent for policy-as-code, Grafana Loki for log aggregation, and Apache Airflow for orchestrated controls, we show how to protect workloads end to end across identity, data, network, and runtime. A real financial data scenario illustrates secure movement across multi-cloud providers while sustaining real-time analytics and regulatory compliance. Attendees leave with a repeatable blueprint for secure, scalable, and open serverless architectures that avoid lock-in and improve resilience.
Key takeaways
- A clear reference model for multi-layer security in serverless and hybrid cloud
- How to implement least privilege and short-lived credentials with Vault and OIDC
- Writing portable guardrails using OPA policies at the API, data, and workflow layers
- Instrumentation patterns using Loki to enable traceable, tamper-evident audit logs
- Orchestrating security controls in Airflow runs, including preflight checks and rollbacks
- Network and runtime hardening for serverless functions, from egress control to SBOM checks
- A cost-aware approach that reduces risk without vendor lock-in, with metrics to prove impact
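A minimal sketch of the policy-as-code layer: ask a running Open Policy Agent server whether a pipeline step may move a dataset before executing it. The OPA URL, policy package path, and input fields are hypothetical.

```python
# Minimal sketch of an OPA guardrail check via OPA's REST data API. The policy
# path and input fields are placeholders for illustration.
import requests

OPA_URL = "http://localhost:8181/v1/data/pipelines/allow_transfer"


def transfer_allowed(dataset: str, source_cloud: str, dest_cloud: str) -> bool:
    payload = {"input": {"dataset": dataset,
                         "source_cloud": source_cloud,
                         "dest_cloud": dest_cloud}}
    resp = requests.post(OPA_URL, json=payload, timeout=5)
    resp.raise_for_status()
    # OPA returns {"result": <value of the rule>}; a missing result means "deny".
    return bool(resp.json().get("result", False))


if __name__ == "__main__":
    if transfer_allowed("payments_ledger", "aws", "gcp"):
        print("transfer permitted by policy")
    else:
        raise SystemExit("blocked: policy denied cross-cloud transfer")
```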
Breakout Session
12:40 PM PT
Distributed computing made easy: Harnessing Ray for scalable data processing
Sathish Srinivasan, Principal Engineer @ Oracle
Modern data workloads—from ETL pipelines to machine learning training—are increasingly distributed by necessity, yet most frameworks make scaling complex. In this session, we’ll explore how Ray, an open-source framework from UC Berkeley’s RISELab, simplifies distributed computing by making it as intuitive as writing local Python code. We’ll dive into core abstractions (tasks, actors, object store), showcase Ray Data, Ray Train, and Ray Serve, and demonstrate how to scale from a single laptop to a Kubernetes cluster on Oracle OKE with no code changes. Attendees will learn practical patterns for building and deploying scalable, fault-tolerant, and GPU-accelerated pipelines for data processing and ML workloads — all with Python simplicity.
Audience Takeaways:
- Understand Ray’s unified distributed computing model
- Learn deployment best practices on Kubernetes and cloud platforms
- See live demos of Ray powering real-world data & ML pipelines
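To show how little ceremony Ray's core abstraction requires, here is a minimal task example; the workload and cluster address are illustrative, and the same code runs on a laptop or a cluster by changing only ray.init().

```python
# Minimal sketch of Ray tasks: ordinary Python functions become distributed work
# units with @ray.remote. Data and cluster settings are placeholders.
import ray

ray.init()  # on a cluster, e.g. ray.init(address="auto")


@ray.remote
def normalize(chunk: list[float]) -> list[float]:
    # Stand-in for an expensive per-partition transformation.
    high = max(chunk)
    return [x / high for x in chunk]


chunks = [[1.0, 2.0, 4.0], [10.0, 5.0], [3.0, 9.0, 27.0]]
futures = [normalize.remote(c) for c in chunks]   # scheduled in parallel
print(ray.get(futures))                           # gather the results
```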
Breakout Session
12:40 PM PT
RAG vs fine-tuning vs hybrid: A practitioner's guide to choosing the right approach for enterprise AI applications
Senthil Thangavel, Staff Engineer @ PayPal
Fine-tuning LLMs costs $10-50k per iteration and requires months of data prep, while RAG systems deploy in weeks at 10% of the cost. Most enterprises default to fine-tuning for knowledge problems when RAG would suffice, burning budgets on retraining as data drifts. Pure RAG struggles with reasoning and style consistency, but hybrid approaches deliver 3-5x better ROI. Teams that start with RAG and selectively apply fine-tuning only for behavior changes see faster deployment, better explainability, and lower maintenance costs.
Key takeaways:
- Start with RAG for knowledge retrieval, compliance docs, and factual Q&A before considering fine-tuning
- Fine-tune only for behavior changes like style, format, or domain-specific reasoning patterns
- Build evaluation pipelines first with metrics for retrieval accuracy, hallucination rates, and latency
- Use hybrid patterns: RAG for facts + LoRA for style, or instruction tuning + RAG for context
- Implement semantic caching and reranking to cut RAG latency from 3-5 seconds to under 1 second (see the sketch below)
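A minimal sketch of the semantic caching idea from the last takeaway: embed the incoming query and reuse a cached answer when a previous query is close enough in cosine similarity. The embedding function here is a placeholder for whatever model you already use, and the 0.92 threshold is illustrative.

```python
# Minimal sketch of semantic caching for RAG. `embed` is a stand-in for a real
# embedding model, and the similarity threshold is a placeholder.
import numpy as np


def embed(text: str) -> np.ndarray:
    # Placeholder embedding: deterministic per text, unit-normalized. Replace with
    # your real embedding model's output.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)


class SemanticCache:
    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []   # (query embedding, answer)

    def lookup(self, query: str) -> str | None:
        q = embed(query)
        for vec, answer in self.entries:
            if float(np.dot(q, vec)) >= self.threshold:   # cosine sim of unit vectors
                return answer                              # cache hit: skip retrieval + LLM
        return None

    def store(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))


cache = SemanticCache()
cache.store("What is our refund policy?", "Refunds are issued within 14 days.")
print(cache.lookup("What is our refund policy?"))   # hit -> cached answer, no LLM call
```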
Breakout Session
1:15 PM PT
Saving big data from big bills: Spark efficiency for an AI-ready era
Milind Chitgupakar, Founder & CEO @ Yeedu
Apache Spark™ still runs the bulk of enterprise data work, but the bottleneck is cost. Idle clusters, large shuffles, and skew routinely waste budgets, while common tweaks deliver only 10 to 20% savings. New engines and single-node options promise 2 to 5x better price performance, though refactors may be needed. Teams that cut Spark spend now free up budget for AI and win on ROI.
Key takeaways:
- Measure and attack idle time, shuffles, and skew first
- Expect only modest gains from autoscaling, right-sizing, tuning, and caching
- Consider Polars or DuckDB for targeted pipelines if refactors are feasible
- Evaluate Velox, Turbo, Photon, and similar engines for vectorized, CPU aware execution with 2 to 5x gains
- Optimize for total cost and ROI, not just runtime
- Use ETL savings to fund AI and analytics work
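As a starting point for the "attack shuffles and skew first" takeaway, the sketch below enables Spark's standard adaptive-execution and skew-join settings; the config keys are stock Spark 3.x options, and the values shown are illustrative rather than recommendations for every workload.

```python
# Minimal sketch: baseline Spark 3.x settings for adaptive execution and skew
# handling. Values are illustrative starting points.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-efficiency-sketch")
         .config("spark.sql.adaptive.enabled", "true")                     # AQE on
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # fewer tiny shuffle partitions
         .config("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions
         .config("spark.sql.autoBroadcastJoinThreshold",
                 str(64 * 1024 * 1024))                                    # broadcast small dims instead of shuffling
         .getOrCreate())

# After a run, compare shuffle read/write in the Spark UI against the input size:
# shuffle volumes that repeatedly exceed input size are the first place to look
# for wasted spend.
```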
Breakout Session
1:15 PM PT
Own agents, don't ship keys: Build secure, intelligent agents you control
Esteban Puerta, Founder @ CloudShip AI
In this hands-on session, you'll build your own secure agents using tools you already know. We'll combine TFLint, Semgrep, and Trivy into a lightweight agent that scans infrastructure, code, and images locally, never sharing credentials externally. You'll learn how to work with MCP, understand the anatomy of tool-based agents and good agent design, and use agents practically in your CI/CD pipelines. By the end, you'll have a solid overview of operational agents and how to add intelligence to your pipelines.
Key takeaways:
- Build your own security agents that run with your tools—TFLint, Semgrep, Trivy—no new stack needed
- Understand the anatomy of MCP based agents
- Improve security scans with added intelligence
- Work with TFLint, Semgrep, Checkov, Syft, and tfsec
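A minimal sketch of the "lightweight agent" shape described above: shell out to the scanners you already have, collect their JSON findings locally, and summarize them without sending code or credentials anywhere. The CLI flags follow each tool's documented JSON output options, so verify them against your installed versions; paths and image names are illustrative.

```python
# Minimal sketch of a local, tool-based scanning agent. Paths, image names, and
# exact CLI flags should be checked against your own tool versions.
import json
import subprocess


def run_json(cmd: list[str]) -> dict | list:
    """Run a scanner CLI and parse its JSON output (non-zero exit often just means findings)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return json.loads(proc.stdout or "{}")


def scan_everything(tf_dir: str, code_dir: str, image: str) -> dict:
    return {
        "tflint": run_json(["tflint", "--chdir", tf_dir, "--format", "json"]),
        "semgrep": run_json(["semgrep", "scan", "--json", code_dir]),
        "trivy": run_json(["trivy", "image", "--format", "json", image]),
    }


if __name__ == "__main__":
    findings = scan_everything("./infra", "./src", "registry.local/app:latest")
    for tool, result in findings.items():
        print(tool, "->", len(json.dumps(result)), "bytes of findings kept local")
```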
Where & when?
Open Source Data Summit 2025 will be held on November 13th, 2025.
What is the cost of access to the live virtual sessions?
OSDS is always free and open to all.
What is Open Source Data Summit?
OSDS is a peer-to-peer gathering of data industry professionals, experts, and enthusiasts to explore the dynamic landscape of open source data tools and storage.
The central theme of OSDS revolves around the advantages of open source data products and their pivotal role in modern data ecosystems.
OSDS is the annual peer hub for knowledge exchange, fostering a deeper understanding of open source options and their role in shaping the data-driven future.
Who attends OSDS?
OSDS is attended by data engineers, data architects, developers, DevOps practitioners and managers, and data leadership.
Anyone looking for enriched perspectives on open source data tools and practical insights to navigate the evolving data landscape should attend this event.
On November 13th, 2025, we'll be back for discussions about:
- Benefits of open source data tools
- Cost/performance trade-offs
- Building data storage solutions
- Challenges surrounding open source data tool integration
- Solutions for the cost of storing, accessing, and managing data
- Data streams and ingestion
- Hub-and-spoke data integration models
- Choosing the right engine for your workload
Are you interested in speaking or sponsoring the next Open Source Data Summit?
Submit a talk proposal here or reach out to astronaut@solutionmonday.com.