Explore open data systems
Open source strategies for speed, performance, and savings
Nov 13, 2025 | Live Virtual Conference
2025 Speakers | Open Source Data Summit 2025
Ajit Panda
Head of Data Security & Governance
Chu Huang
Senior Software Engineer
Jyoti Shah
Director of Application Development
Kapil Poreddy
Senior Software Engineering Manager
Vinoth Chandar
Founder & CEO
Jocelyn Byrne Houle
Sr. Director Product Management
Nancy Amandi
Data Engineer
Vivek Singh
Principal Database Specialist
Pavi Subenderan
Software Engineer
Bhakti Hinduja
Director of Software Engineering
Kai Wahner
Global Field CTO
Kyle Weller
VP of Product
Raunak Kumar
Senior Manager, GTM Analytics
Shiyan (Raymond) Xu
Founding Team Member
Milind Chitgupakar
Founder & CEO
David Handermann
Project Management Committee Chair
Tamara Janina Fingerlin
Senior Developer Advocate
Senthil Thangavel
Staff Engineer
Sathish Srinivasan
Principal Software Engineer
Vidhyadar Unni Krishnan
Lead SDE, Commercial Payment Solutions
Esteban Puerta
Founder
Andrew Gelinas
Co-Founder
Karan Gupta
Cloud & Data Engineer
2025 Agenda | Open Source Data Summit | November 13, 2025
Morning Keynote
8:00 AM PT
The Open Efficiency Playbook: Why fast keeps your lakehouse open
Vinoth Chandar, Founder & CEO @ Onehouse
Open gives portability. Efficiency keeps it. This keynote shows how to assemble a vendor-neutral lakehouse with Apache Hudi™ or Apache Iceberg™, Apache XTable (incubating) for cross-format metadata without copies, and multi-catalog so one physical dataset serves Apache Spark™, Trino, Flink, and Ray. Then we apply a cost SLO rubric and engine fit rules to cut ETL spend and raise throughput. We finish by standardizing data layout work so every engine runs faster and cheaper using reproducible OSS patterns.
Key takeaways:
- Build a vendor-neutral core with Apache Hudi™ or Apache Iceberg™, XTable, and multi-catalog
- Track simple cost SLOs: $ per TB transformed, $ per 100M rows, time to first result
- Match engines to jobs: OLAP on Trino or ClickHouse or StarRocks, DS/ML on Ray or Dask, streaming on Flink, when to keep Spark
- Cut ETL waste by reducing scans, shuffles, and rewrites
- Standardize layout ops: partition evolution, compaction, clustering and sort, stats and pruning, metadata hygiene
- Avoid cost spikes from managed table services and prefer reproducible OSS techniques
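As a concrete illustration of the cost SLOs listed above, here is a minimal Python sketch; the metric names and example numbers are illustrative, not figures from the keynote.

```python
# Minimal sketch: computing the cost SLOs named above from job-level metrics.
# The field names and numbers are illustrative, not from any specific platform.
from dataclasses import dataclass


@dataclass
class JobRun:
    name: str
    dollars: float                  # total cost attributed to the run
    tb_transformed: float           # terabytes read + written by the transformation
    rows_written: int               # rows landed in the target table
    seconds_to_first_result: float


def cost_slos(run: JobRun) -> dict:
    """Return the three cost SLOs suggested in the keynote takeaways."""
    return {
        "dollars_per_tb_transformed": run.dollars / run.tb_transformed,
        "dollars_per_100m_rows": run.dollars / (run.rows_written / 100_000_000),
        "time_to_first_result_s": run.seconds_to_first_result,
    }


if __name__ == "__main__":
    nightly = JobRun("nightly_etl", dollars=84.0, tb_transformed=3.5,
                     rows_written=420_000_000, seconds_to_first_result=95.0)
    print(cost_slos(nightly))
    # {'dollars_per_tb_transformed': 24.0, 'dollars_per_100m_rows': 20.0, ...}
```

Tracking these per pipeline over time makes engine-fit decisions (Spark vs. Trino vs. Ray) a measurable comparison rather than a preference.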
Breakout Session
8:35 AM PT
The streaming-first lakehouse: Handling high-frequency mutable workloads with Apache Hudi™
Shiyan Xu, Founding Team Member @ Onehouse
Handling updates and deletes in a streaming lakehouse is a monumental challenge: processing high-frequency mutable workloads can lead to performance degradation, small file issues, and resource wastage due to conflicts when concurrent writers are involved. How can you build a truly streaming-first lakehouse without sacrificing performance? This session demystifies Apache Hudi™’s streaming-first designs built to handle these exact problems.
We'll dive into how Apache Hudi™ uses Merge-on-Read (MOR) tables to efficiently absorb frequent updates and record-level indexing to maintain low-latency writes for mutable data. Discover how auto-file sizing and asynchronous compaction proactively solve the "small file problem." We'll also cover Apache Hudi™ 1.0’s Non-Blocking Concurrency Control (NBCC) to avoid costly retries and the LSM Timeline for optimized metadata access.
Key Takeaways:
- Understand Apache Hudi™'s core designs for handling streaming mutable workloads at scale.
- Solve challenging workloads involving high-frequency updates and large, mutable datasets.
- Leverage Apache Hudi™'s advanced concurrency and metadata optimizations to build a stable, low-latency lakehouse.
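For orientation, here is a minimal PySpark sketch of the write path described above: a Merge-on-Read table with record-level indexing and compaction kept off the writer's critical path. The table name, key fields, and paths are hypothetical, and the option keys follow recent Apache Hudi™ releases, so verify them against your version.

```python
# Minimal sketch of a streaming-friendly Hudi upsert: MOR table type, record-level
# index, and asynchronous compaction. All names and paths are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-mor-sketch")
         # Requires the Hudi Spark bundle jar on the classpath.
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

updates_df = spark.createDataFrame(
    [(1, "shipped", "2025-11-13T08:00:00Z")],
    ["order_id", "status", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # absorb updates in log files
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.metadata.record.index.enable": "true",          # record-level index
    "hoodie.index.type": "RECORD_INDEX",
    "hoodie.compact.inline": "false",                       # keep writes low latency
    "hoodie.datasource.compaction.async.enable": "true",    # compact asynchronously
}

(updates_df.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3a://lake/bronze/orders"))
```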
Breakout Session
8:35 AM PT
From silos to signals: Quick wins for data quality
Jyoti Shah, Director of Application Development @ ADP
Bad data erodes trust and wastes time. Jyoti Shah will share the five pillars of data quality, the hidden costs of poor data, and common root causes like siloed systems and manual errors. Attendees will leave with practical quick wins, such as adding simple anomaly detection or schema validation to pipelines, that can be implemented this week to show immediate value.
Key takeaways:
- Spot the hidden costs and risks of poor data quality.
- Apply the five pillars of data quality to your own organization.
- Identify common pitfalls like siloed systems and weak governance.
- Walk away with low-effort optimizations (e.g., simple anomaly detection or schema validation) you can start tomorrow.
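To make those quick wins concrete, here is a minimal sketch of two of them, schema validation and a crude row-count anomaly check, using plain pandas; the column names and thresholds are illustrative.

```python
# Minimal sketch of two "quick wins": schema validation and a row-count anomaly
# check. Column names, dtypes, and thresholds are illustrative placeholders.
import pandas as pd

EXPECTED_SCHEMA = {"employee_id": "int64", "pay_date": "datetime64[ns]", "net_pay": "float64"}


def validate_schema(df: pd.DataFrame) -> list[str]:
    """Return human-readable schema problems (an empty list means the load looks fine)."""
    problems = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return problems


def row_count_anomaly(today: int, history: list[int], tolerance: float = 0.5) -> bool:
    """Flag a load whose row count deviates more than `tolerance` from the recent median."""
    median = sorted(history)[len(history) // 2]
    return abs(today - median) > tolerance * median


if __name__ == "__main__":
    df = pd.DataFrame({"employee_id": [1, 2],
                       "pay_date": pd.to_datetime(["2025-11-01", "2025-11-01"]),
                       "net_pay": [1200.0, 950.5]})
    print(validate_schema(df))                    # [] -> schema looks fine
    print(row_count_anomaly(10, [100, 98, 102]))  # True -> today's load is suspicious
```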
Breakout Session
9:10 AM PT
Where Spark leaves you in the dark: Engine choices that cut costs
Kyle Weller, VP of Product @ Onehouse
Compare Apache Spark™, Trino, ClickHouse, and StarRocks. See where each excels and when to mix engines. Review build vs buy tradeoffs. Watch a demo of Onehouse Quanton that shows 2x to 3x savings on Spark workloads. Leave with a simple roadmap to run specialized engines for lower cost and higher reliability.
Key takeaways:
- Match engines to workloads for price and performance wins
- When to keep Spark and when to shift to Trino, ClickHouse, or StarRocks
- Build vs buy criteria that prevent hidden ops costs
- How Quanton optimizes Spark for 2x to 3x savings
- A practical deployment framework for multi-engine stacks
Breakout Session
9:10 AM PT
Ensuring data quality across the organization
Nancy Amandi, Data Engineer @ Moniepoint
This session makes the case for data quality, then shows how to achieve it. Nancy will cover unit, regression, and anomaly tests, walk through a comparison of common frameworks with simple blueprints, and look at a sample CI or Airflow pipeline with tests wired in, ending with test examples tied to business impact.
Key Takeaways
- Why data quality matters for trust and cost
- When to use unit, regression, or anomaly tests
- Simple blueprints for leading data quality frameworks
- A working example of a CI or Airflow pipeline with tests integrated
- How test choices map to business impact across different contexts
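As a minimal illustration of wiring a test into an Airflow pipeline, the sketch below adds a quality gate between load and publish steps; the DAG id, table, and check logic are illustrative only.

```python
# Minimal sketch of a data-quality gate inside an Airflow pipeline. The DAG id,
# sample rows, and invariant are placeholders, not from the session itself.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def payments_with_quality_gate():

    @task
    def load() -> list[dict]:
        # Stand-in for the real ingestion step.
        return [{"payment_id": 1, "amount": 40.0}, {"payment_id": 2, "amount": -5.0}]

    @task
    def quality_check(rows: list[dict]) -> list[dict]:
        # Fail the run (and block downstream tasks) if a basic invariant breaks.
        bad = [r for r in rows if r["amount"] < 0]
        if bad:
            raise ValueError(f"{len(bad)} rows with negative amount: {bad}")
        return rows

    @task
    def publish(rows: list[dict]) -> None:
        print(f"publishing {len(rows)} validated rows")

    publish(quality_check(load()))


payments_with_quality_gate()
```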
Breakout Session
9:45 AM PT
Cut spend, debug fast: CDC-first lakehouse
Kapil Poreddy, Senior Software Engineer @ Walmart
Peak season exposes every weakness in a data platform: partitions explode, indexes bloat, and “we’ll fold it back later” never happens. In this talk, I share a CDC-first lakehouse playbook that cuts spend and makes debugging fast. We’ll stitch streaming change data into an Apache Iceberg™/Delta/Apache Hudi™ table, use burst-merge-foldback patterns to tame partitions after scale-ups, trim storage via index-policy hygiene and payload compression (e.g., Protobuf + zstd), and—my favorite—time-travel reconstruction of entity state from snapshots for forensics. Powered by open tools (Kafka/Connect, Trino/Presto, DuckDB) with cloud-agnostic notes (incl. Cosmos change feed), you’ll leave with a blueprint, a checklist, and a 10-minute demo you can rerun in your org.
Key takeaways:
- Architecture: CDC-first lakehouse reference diagram with insert-only + compaction.
- Cost wins: Partition lifecycle (burst → merge → foldback) and index-policy pruning that actually sticks.
- Speed: Payload choices (JSON→Proto), columnar + zstd/Snappy, and when to dictionary-encode.
- Forensics: Reconstruct any entity point-in-time from CDC + snapshots in minutes.
- Checklist: runbook for pre-peak hardening and post-peak foldback.
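A minimal sketch of the time-travel forensics step, assuming an Apache Iceberg™ table read through Spark; the table, entity id, and timestamp are hypothetical, and Hudi and Delta expose comparable as-of options.

```python
# Minimal sketch of "time-travel reconstruction": read the table as of a past
# instant and diff one entity's state against today. Names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-forensics-sketch").getOrCreate()

# State of the table just before the incident (Iceberg time travel by timestamp).
before = (spark.read
          .option("as-of-timestamp", "1699862400000")  # epoch millis
          .table("lake.orders"))

# Current state of the same table.
now = spark.read.table("lake.orders")

entity_then = before.filter("order_id = 12345")
entity_now = now.filter("order_id = 12345")

# Rows of this entity's state that changed between the two snapshots.
entity_then.exceptAll(entity_now).show()
```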
Breakout Session
9:45 AM PT
PostgreSQL operational excellence: Core best practices for reliability and performance
Vivek Singh, Principal Database Specialist @ AWS
This 30-minute session covers essential PostgreSQL operational best practices for production environments, focusing on configuration parameters, monitoring approaches, and maintenance procedures critical for reliability and performance. Attendees will receive a practical checklist of immediately applicable improvements.
Key takeaways:
- Critical configuration parameters for performance
- Essential monitoring metrics and alerting thresholds
- Vacuum and maintenance strategies
- Index optimization techniques
- Backup and recovery procedures
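One of the monitoring checks above can be sketched in a few lines: flag tables whose dead-tuple ratio suggests vacuum is falling behind. The connection string and the 20% threshold are illustrative; tune both for your environment.

```python
# Minimal sketch of a vacuum-health check against pg_stat_user_tables.
import psycopg2

QUERY = """
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum
FROM pg_stat_user_tables
WHERE n_live_tup > 0
  AND n_dead_tup::float / n_live_tup > %s
ORDER BY n_dead_tup DESC;
"""

with psycopg2.connect("dbname=app user=monitor host=localhost") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY, (0.20,))   # flag tables with > 20% dead tuples
        for relname, live, dead, last_av in cur.fetchall():
            print(f"{relname}: {dead}/{live} dead tuples, last autovacuum {last_av}")
```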
Breakout Session
10:20 AM PT
Offensive data security for the real world
Ajit Panda, Head of Data Security & Governance @ Uber
Pavi Subenderan, Software Engineer @ Uber
Chu Huang, Senior Software Engineer @ Uber
Ajit, Pavi, and Chu join us from Uber to share how Offensive Data Security moves beyond reactive defense by proactively identifying vulnerabilities and simulating real-world exploits. They’ll highlight the GenAI-powered Data Security Auditor that automates policy and governance audits, and a security-aware recommendation engine for data pipeline editors, making data security faster, smarter, and more resilient.
Breakout Session
10:20 AM PT
Data streaming meets the lakehouse - Apache Iceberg™ for unified real-time and batch analytics
Kai Wahner, Global Field CTO @ Confluent
Modern enterprises need both the speed of streaming and the depth of lakehouse analytics. Apache Iceberg™ bridges these worlds with an open table format that supports ACID transactions, schema evolution, and vendor-neutral access from engines like Kafka, Flink, Apache Spark™, and Trino. This session explores how combining data streaming with Iceberg-powered lakehouses enables a “Shift Left” approach—governing data at the source, storing it once, and making it instantly reusable for real-time or batch workloads.
Key Takeaways:
- Why Apache Iceberg™ is the open standard for unifying streaming and lakehouse data
- How to turn real-time streams into governed, reusable lakehouse tables
- Patterns to cut duplication and reduce Reverse ETL costs
- Practical ways to improve data quality and governance with minimal overhead
- Steps to future-proof architectures with open formats and multi-engine access
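A minimal sketch of the streaming-to-lakehouse pattern described above, using Spark Structured Streaming to land a Kafka topic in an Apache Iceberg™ table; the broker, topic, catalog, and table names are hypothetical, and the session builder assumes the Iceberg Spark runtime and a catalog are already configured.

```python
# Minimal sketch: consume a Kafka topic and append it to an Iceberg table so the
# same data serves both real-time and batch engines. All names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-iceberg-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load()
          .select(col("key").cast("string"),
                  col("value").cast("string"),
                  col("timestamp")))

query = (events.writeStream
         .format("iceberg")
         .outputMode("append")
         .option("checkpointLocation", "s3a://lake/checkpoints/orders")
         .toTable("lakehouse.raw.orders"))

query.awaitTermination()
```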
Breakout Session
10:55 AM PT
Enabling enterprise AI at scale for financial services: How unified data governance powers compliant AI innovation
Jocelyn Byrne Houle, Sr. Director Product Management @ Securiti
Enterprise AI adoption is accelerating, but organizations face a critical challenge: managing data security, privacy, governance, and compliance across fragmented systems and hybrid multicloud environments. Traditional approaches using disconnected point tools create complexity, inconsistent controls, and become blockers to AI velocity rather than enablers.
This session explores how a unified data controls framework—powered by common data intelligence—enables organizations to safely harness the power of data and AI. We'll examine practical strategies for discovering AI models, mapping data flows, assessing risks, implementing controls, and ensuring continuous compliance with global AI standards including NIST AI RMF and the EU AI Act.
Key takeaways:
- Comprehensive AI & data discovery across public clouds, private clouds, and SaaS applications to eliminate "Shadow AI" and gain complete visibility
- Automated data+AI flow mapping that connects AI models to data sources, processing paths, sensitive data usage, and compliance obligations for full provenance tracking
- AI security controls addressing OWASP Top 10 vulnerabilities for LLMs and mitigating risks like prompt injection, data poisoning, and insecure model outputs
- Unified access governance establishing least privilege controls for structured and unstructured data with automated access pattern tracking
- Data quality and trust frameworks that profile, classify, and validate data automatically—building confidence in AI training data and decision-making
- Real-world implementation patterns for transitioning from disconnected tools to a unified data command center that accelerates AI innovation while reducing cost and complexity
Breakout Session
10:55 AM PT
Building open source data pipelines for marketing mix modeling at scale
Raunak Kumar, Senior Manager, GTM Analytics @ Intercom
This session provides a technical walkthrough of MMM pipelines built with open-source tools, covering:
- Ingestion: streaming ad and CRM data with Kafka/CDC connectors.
- Storage: structuring multi-terabyte datasets in Apache Iceberg™/Delta with partitioning and compaction.
- Feature engineering: harmonizing spend and conversion data with Apache Spark™/Polaris and time-aligned features for regression-based MMM.
- Validation: handling schema drift, metadata management, and experiment logs with open-source catalogs.
- Engine trade-offs: using DuckDB/Polars for exploration vs. Apache Spark™ for scaled model training.
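As a small illustration of the exploration and feature-engineering steps above, here is a DuckDB sketch that time-aligns daily ad spend and conversions into weekly channel features for a regression-based MMM; the file paths and column names are illustrative.

```python
# Minimal sketch: build weekly, time-aligned MMM features with DuckDB. Paths and
# column names are placeholders; reading from S3 needs DuckDB's httpfs extension.
import duckdb

weekly_features = duckdb.sql("""
    WITH spend AS (
        SELECT date_trunc('week', spend_date) AS week,
               channel,
               sum(spend_usd) AS spend_usd
        FROM read_parquet('s3://lake/marketing/ad_spend/*.parquet')
        GROUP BY 1, 2
    ),
    conversions AS (
        SELECT date_trunc('week', conversion_date) AS week,
               count(*) AS conversions
        FROM read_parquet('s3://lake/crm/conversions/*.parquet')
        GROUP BY 1
    )
    SELECT s.week, s.channel, s.spend_usd, c.conversions
    FROM spend s
    JOIN conversions c USING (week)
    ORDER BY s.week, s.channel
""").df()   # hand the aligned features to the modeling step as a DataFrame

print(weekly_features.head())
```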
Key takeaways
- A reproducible blueprint for MMM pipelines using open-source tools.
- Benchmarks and trade-offs across engines and formats.
- Patterns for schema governance, cost control, and feature generation.
- Lessons from real SaaS deployments at scale.
Breakout Session
11:30 AM PT
Orchestrating your robots: Agent-interaction and human-in-the-loop with Apache Airflow® 3.1
Tamara Janina Fingerlin, Senior Developer Advocate @ Astronomer
Apache Airflow® is the open-source standard for workflow orchestration. Already a staple of the modern data engineering stack, Airflow is increasingly used to manage AI pipelines, including data preprocessing, model fine-tuning, inference execution, and multi-agent workflow coordination.
This talk covers the latest features available in Airflow 3.1 for AIOps pipelines, from event-driven scheduling to human-in-the-loop operators, as well as the Airflow AI SDK, a new open source package to orchestrate LLMs and agents.
Key takeaways:
- Create a multi-agent pipeline as an Apache Airflow® Dag
- Orchestrate Pydantic AI Agents with the @task.agent decorator of the Airflow AI SDK
- Add human decision making to your orchestration with the new human-in-the-loop operators
- Use AssetWatchers to schedule pipelines based on messages in a message queue
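For context, the sketch below shows a hand-rolled version of this pattern: calling a Pydantic AI agent from a plain Airflow task. The Airflow AI SDK's @task.agent decorator covered in the talk wraps this kind of pattern; the model name, prompt, and result attribute here are assumptions, not the SDK's API.

```python
# Minimal hand-rolled sketch of agent orchestration from a regular Airflow task.
# The model, prompt, and result attribute are assumptions for illustration only.
from datetime import datetime

from airflow.decorators import dag, task
from pydantic_ai import Agent

summarizer = Agent("openai:gpt-4o",
                   system_prompt="Summarize pipeline incidents in one line.")


@dag(schedule=None, start_date=datetime(2025, 1, 1), catchup=False)
def incident_triage():

    @task
    def collect_logs() -> str:
        return "2025-11-13 08:02 task load_orders failed: S3 timeout after 3 retries"

    @task
    def summarize(log_text: str) -> str:
        result = summarizer.run_sync(log_text)
        return str(result.output)   # attribute name may differ across pydantic-ai versions

    summarize(collect_logs())


incident_triage()
```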
Breakout Session
11:30 AM PT
Quantify the effort and outcome of removing technical debt from a large project
David Handermann, Project Management Committee Chair @ Apache NiFi
Key takeaways:
- Understand the process for major version releases in an Apache Software Foundation project
- Learn the primary new features of Apache NiFi 2
- Quantify the effort and outcome of removing technical debt from a large project
- Consider the infrastructure benefits of optimized NiFi clustering on Kubernetes
- See the possibilities for building new NiFi Processors in native Python
Breakout Session
12:05 PM PT
From tokens to thoughts: Breaking down LLM inference
Bhakti Hinduja, Director of Software Engineering @ Ampere
This talk delivers a technical deep dive into LLM inference bottlenecks, focusing on the shift from compute-bound prefill to memory-bound decode phases. Using existing benchmarks and KV cache scaling, it demonstrates why memory bandwidth dominates AI serving architectures. Optimization strategies such as quantization, PagedAttention, speculative decoding, and dynamic batching are compared across leading inference frameworks such as vLLM.
Emerging alternatives to transformers—including State Space Models (SSMs) and Mixture of Experts (MoE)—are evaluated for scalability and efficiency. The session highlights hierarchical inference architectures for embodied AI, covering real-time robotics and edge deployments with advanced compute. AGI infrastructure requirements are quantified, emphasizing the coming 1000x scale-up in compute and memory and outlining three evolving inference regimes: cloud hyperscale, edge/mobile, and task-specific accelerators.
Key Takeaways:
- Memory optimization (quantization, efficient caching) is the primary constraint
- MoE, SSM, and hybrid models address transformer scalability limits
- Hierarchical approaches enable real-time and edge AI
- Production teams should prepare for rapid increases in inference workloads and infrastructure needs
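The shift to memory-bound decode is easy to see with a back-of-the-envelope calculation. The sketch below sizes the KV cache for an illustrative Llama-70B-class configuration; the dimensions are assumptions, not benchmark data from the talk.

```python
# Back-of-the-envelope KV cache sizing: 2 tensors (K and V) per layer, per token,
# per sequence. Model dimensions below are illustrative (grouped-query attention,
# fp16 weights); swap in your own to see why decode becomes memory-bound.
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


if __name__ == "__main__":
    per_seq = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                             seq_len=4096, batch=1)
    batch32 = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                             seq_len=4096, batch=32)
    print(f"one 4k-token sequence: {per_seq / 1e9:.2f} GB")   # ~1.3 GB
    print(f"batch of 32 sequences: {batch32 / 1e9:.2f} GB")   # ~43 GB
    # Every generated token re-reads this cache, so decode throughput is bounded
    # by memory bandwidth long before the accelerator runs out of FLOPs.
```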
Breakout Session
12:05 PM PT
Modern security patterns for serverless apps: A multi-layer defense model
Karan Gupta, Cloud & AI Data Engineer
As teams shift to serverless and hybrid cloud, security must be embedded at every layer, not concentrated at the edge. This session presents a practical multi-layer security framework for open source serverless data pipelines that balances transparency, interoperability, and cost control. Using Kubeless for event execution, HashiCorp Vault for secrets and key management, Open Policy Agent for policy-as-code, Grafana Loki for log aggregation, and Apache Airflow for orchestrated controls, we show how to protect workloads end to end across identity, data, network, and runtime. A real financial data scenario illustrates secure movement across multi-cloud providers while sustaining real-time analytics and regulatory compliance. Attendees leave with a repeatable blueprint for secure, scalable, and open serverless architectures that avoid lock-in and improve resilience.
Key takeaways
- A clear reference model for multi-layer security in serverless and hybrid cloud
- How to implement least privilege and short-lived credentials with Vault and OIDC
- Writing portable guardrails using OPA policies at the API, data, and workflow layers
- Instrumentation patterns using Loki to enable traceable, tamper-evident audit logs
- Orchestrating security controls in Airflow runs, including preflight checks and rollbacks
- Network and runtime hardening for serverless functions, from egress control to SBOM checks
- A cost-aware approach that reduces risk without vendor lock-in, with metrics to prove impact
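A minimal sketch of the policy-as-code layer: ask a running Open Policy Agent server whether a pipeline step may move a dataset before executing it. The OPA URL, policy package path, and input fields are hypothetical.

```python
# Minimal sketch of an OPA guardrail check via OPA's REST data API. The policy
# path and input fields are placeholders for illustration.
import requests

OPA_URL = "http://localhost:8181/v1/data/pipelines/allow_transfer"


def transfer_allowed(dataset: str, source_cloud: str, dest_cloud: str) -> bool:
    payload = {"input": {"dataset": dataset,
                         "source_cloud": source_cloud,
                         "dest_cloud": dest_cloud}}
    resp = requests.post(OPA_URL, json=payload, timeout=5)
    resp.raise_for_status()
    # OPA returns {"result": <value of the rule>}; a missing result means "deny".
    return bool(resp.json().get("result", False))


if __name__ == "__main__":
    if transfer_allowed("payments_ledger", "aws", "gcp"):
        print("transfer permitted by policy")
    else:
        raise SystemExit("blocked: policy denied cross-cloud transfer")
```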
Breakout Session
12:40 PM PT
Distributed computing made easy: Harnessing Ray for scalable data processing
Sathish Srinivasan, Principal Engineer @ Oracle
Modern data workloads—from ETL pipelines to machine learning training—are increasingly distributed by necessity, yet most frameworks make scaling complex. In this session, we’ll explore how Ray, an open-source framework from UC Berkeley’s RISELab, simplifies distributed computing by making it as intuitive as writing local Python code. We’ll dive into core abstractions (tasks, actors, object store), showcase Ray Data, Ray Train, and Ray Serve, and demonstrate how to scale from a single laptop to a Kubernetes cluster on Oracle OKE with no code changes. Attendees will learn practical patterns for building and deploying scalable, fault-tolerant, and GPU-accelerated pipelines for data processing and ML workloads — all with Python simplicity.
Audience Takeaways:
- Understand Ray’s unified distributed computing model
- Learn deployment best practices on Kubernetes and cloud platforms
- See live demos of Ray powering real-world data & ML pipelines
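To show how little ceremony Ray's core abstraction requires, here is a minimal task example; the workload and cluster address are illustrative, and the same code runs on a laptop or a cluster by changing only ray.init().

```python
# Minimal sketch of Ray tasks: ordinary Python functions become distributed work
# units with @ray.remote. Data and cluster settings are placeholders.
import ray

ray.init()  # on a cluster, e.g. ray.init(address="auto")


@ray.remote
def normalize(chunk: list[float]) -> list[float]:
    # Stand-in for an expensive per-partition transformation.
    high = max(chunk)
    return [x / high for x in chunk]


chunks = [[1.0, 2.0, 4.0], [10.0, 5.0], [3.0, 9.0, 27.0]]
futures = [normalize.remote(c) for c in chunks]   # scheduled in parallel
print(ray.get(futures))                           # gather the results
```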
Breakout Session
12:40 PM PT
RAG vs fine-tuning vs hybrid: A practitioner's guide to choosing the right approach for enterprise AI applications
Senthil Thangavel, Staff Engineer @ PayPal
Fine-tuning LLMs costs $10-50k per iteration and requires months of data prep, while RAG systems deploy in weeks at 10% of the cost. Most enterprises default to fine-tuning for knowledge problems when RAG would suffice, burning budgets on retraining as data drifts. Pure RAG struggles with reasoning and style consistency, but hybrid approaches deliver 3-5x better ROI. Teams that start with RAG and selectively apply fine-tuning only for behavior changes see faster deployment, better explainability, and lower maintenance costs.
Key takeaways:
- Start with RAG for knowledge retrieval, compliance docs, and factual Q&A before considering fine-tuning
- Fine-tune only for behavior changes like style, format, or domain-specific reasoning patterns
- Build evaluation pipelines first with metrics for retrieval accuracy, hallucination rates, and latency
- Use hybrid patterns: RAG for facts + LoRA for style, or instruction tuning + RAG for context
- Implement semantic caching and reranking to cut RAG latency from 3-5 seconds to under 1 second (see the sketch below)
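A minimal sketch of the semantic caching idea from the last takeaway: embed the incoming query and reuse a cached answer when a previous query is close enough in cosine similarity. The embedding function here is a placeholder for whatever model you already use, and the 0.92 threshold is illustrative.

```python
# Minimal sketch of semantic caching for RAG. `embed` is a stand-in for a real
# embedding model, and the similarity threshold is a placeholder.
import numpy as np


def embed(text: str) -> np.ndarray:
    # Placeholder embedding: deterministic per text, unit-normalized. Replace with
    # your real embedding model's output.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)


class SemanticCache:
    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []   # (query embedding, answer)

    def lookup(self, query: str) -> str | None:
        q = embed(query)
        for vec, answer in self.entries:
            if float(np.dot(q, vec)) >= self.threshold:   # cosine sim of unit vectors
                return answer                              # cache hit: skip retrieval + LLM
        return None

    def store(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))


cache = SemanticCache()
cache.store("What is our refund policy?", "Refunds are issued within 14 days.")
print(cache.lookup("What is our refund policy?"))   # hit -> cached answer, no LLM call
```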
Breakout Session
1:15 PM PT
Saving big data from big bills: Spark efficiency for an AI-ready era
Milind Chitgupakar, Founder & CEO @ Yeedu
Apache Spark™ still runs the bulk of enterprise data work, but the bottleneck is cost. Idle clusters, large shuffles, and skew routinely waste budgets, while common tweaks deliver only 10 to 20% savings. New engines and single-node options promise 2 to 5x better price performance, though refactors may be needed. Teams that cut Spark spend now free up budget for AI and win on ROI.
Key takeaways:
- Measure and attack idle time, shuffles, and skew first
- Expect only modest gains from autoscaling, right-sizing, tuning, and caching
- Consider Polars or DuckDB for targeted pipelines if refactors are feasible
- Evaluate Velox, Turbo, Photon, and similar engines for vectorized, CPU aware execution with 2 to 5x gains
- Optimize for total cost and ROI, not just runtime
- Use ETL savings to fund AI and analytics work
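As a starting point for the "attack shuffles and skew first" takeaway, the sketch below enables Spark's standard adaptive-execution and skew-join settings; the config keys are stock Spark 3.x options, and the values shown are illustrative rather than recommendations for every workload.

```python
# Minimal sketch: baseline Spark 3.x settings for adaptive execution and skew
# handling. Values are illustrative starting points.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-efficiency-sketch")
         .config("spark.sql.adaptive.enabled", "true")                     # AQE on
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # fewer tiny shuffle partitions
         .config("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions
         .config("spark.sql.autoBroadcastJoinThreshold",
                 str(64 * 1024 * 1024))                                    # broadcast small dims instead of shuffling
         .getOrCreate())

# After a run, compare shuffle read/write in the Spark UI against the input size:
# shuffle volumes that repeatedly exceed input size are the first place to look
# for wasted spend.
```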
Breakout Session
1:15 PM PT
Own agents, don't ship keys: Build secure, intelligent agents you control
Esteban Puerta, Founder @ CloudShip AI
In this hands-on session, you'll build your own secure agents using tools you already know. We'll combine TFLint, Semgrep, and Trivy into a lightweight agent that scans infrastructure, code, and images locally, never sharing credentials externally. You'll learn how to work with MCP, understand the anatomy of tool-based agents and good agent design, and use agents practically in your CI/CD pipelines. By the end, you'll have a solid overview of operational agents and how to add intelligence to your pipelines.
Key takeaways:
- Build your own security agents that run with your tools—TFLint, Semgrep, Trivy—no new stack needed
- Understand the anatomy of MCP based agents
- Improve security scans with added intelligence
- Work with TFLint, Semgrep, Checkov, Syft, and tfsec
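A minimal sketch of the "lightweight agent" shape described above: shell out to the scanners you already have, collect their JSON findings locally, and summarize them without sending code or credentials anywhere. The CLI flags follow each tool's documented JSON output options, so verify them against your installed versions; paths and image names are illustrative.

```python
# Minimal sketch of a local, tool-based scanning agent. Paths, image names, and
# exact CLI flags should be checked against your own tool versions.
import json
import subprocess


def run_json(cmd: list[str]) -> dict | list:
    """Run a scanner CLI and parse its JSON output (non-zero exit often just means findings)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return json.loads(proc.stdout or "{}")


def scan_everything(tf_dir: str, code_dir: str, image: str) -> dict:
    return {
        "tflint": run_json(["tflint", "--chdir", tf_dir, "--format", "json"]),
        "semgrep": run_json(["semgrep", "scan", "--json", code_dir]),
        "trivy": run_json(["trivy", "image", "--format", "json", image]),
    }


if __name__ == "__main__":
    findings = scan_everything("./infra", "./src", "registry.local/app:latest")
    for tool, result in findings.items():
        print(tool, "->", len(json.dumps(result)), "bytes of findings kept local")
```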
Where & when?
Open Source Data Summit 2025 will be held on November 13th, 2025.
What is the cost of access to the live virtual sessions?
OSDS is always free and open to all.
What is Open Source Data Summit?
OSDS is a peer-to-peer gathering of data industry professionals, experts, and enthusiasts to explore the dynamic landscape of open source data tools and storage.
The central theme of OSDS revolves around the advantages of open source data products and their pivotal role in modern data ecosystems.
OSDS is the annual peer hub for knowledge exchange, fostering a deeper understanding of open source options and their role in shaping the data-driven future.
Who attends OSDS?
OSDS is attended by data engineers, data architects, developers, DevOps practitioners and managers, and data leadership.
Anyone looking for enriched perspectives on open source data tools and practical insights to navigate the evolving data landscape should attend this event.
On November 13th, 2025, we'll be back for discussions about:
- Benefits of open source data tools
- Cost/performance trade-offs
- Building data storage solutions
- Challenges surrounding open source data tool integration
- Solutions for the cost of storing, accessing, and managing data
- Data streams and ingestion
- Hub-and-spoke data integration models
- Choosing the right engine for your workload
Are you interested in speaking or sponsoring the next Open Source Data Summit?
Submit a talk proposal here or reach out to astronaut@solutionmonday.com.