Explore the open source data landscape
November 15, 2023 | Live Virtual Conference
Open Source Data Summit 2023 Speakers
Jordan West
Staff Software Engineer
Eric Gonzalez
VP, Business Intelligence Architecture
Praveen Neppali Naga
VP of Engineering & Data Science
Ankur Ranjan
Data Engineer III
Kapil Surlaker
VP of Engineering
Raghu Ramakrishnan
CTO for Data
Vaishnavi Muraldihar
Data Engineer
Justin Borgman
Chairman & CEO
Prasad Pathak
Data Engineer
Jay Kreps
Co-Founder & CEO
Siddharth Jain
Senior Engineering Manager
Bhavani Sudha Saktheeswaran
Software Engineer
Girish Baliga
Director of Engineering
Nishith Agarwal
Head of Data & ML Platforms
Shirshanka Das
Co-founder & CTO
Balaji Varadarajan
Senior Staff Software Engineer
Shana Schipers
Principal Specialist SA, Analytics
Tun Shwe
VP of Data
Jay Clifford
Developer Advocate
Ayush Bijawat
Senior Data Engineer
Pritam Dey
Senior Software Engineer
Ashvin Agrawal
Senior Researcher
Tim Brown
Engineering
Michael Del Balso
CEO & Co-Founder
Sagar Sumit
Software Engineer
Ronak Shah
Head of Data & Product
Patrick McFadin
VP Developer Relations
Sarfaraz Hussain
Senior Data Engineer
Anoop Johnson
Senior Staff Software Engineer
James Greenhill
Data Engineer
Soumil Shah
Data Engineering Team Lead
Justin Levandoski
Director of Engineering
David Anderson
Software Practice Lead
Nadine Farah
Head of Developer Relations
Manfred Moser
Director of Technical Content
Watch All of the Open Source 2023 Sessions On-Demand
Keynote: Paradigm shift: Why open source should now be the bedrock of every data strategy
Vinoth Chandar, Founder & CEO @ Onehouse
Until recently, success with open data infrastructure has been a secret, limited to the most advanced data teams operating at scale. However, today's open source data ecosystem is stronger than ever, offering the freedom to either build on open source or use managed services built on open source.
In our opening keynote, Apache Hudi creator and originator of the data lakehouse architecture, Vinoth Chandar, will bring into focus the current and future state of the open source data ecosystem, which is now capable of serving diverse data needs spanning ingestion, storage, analytics, stream processing, AI/ML, and real-time analytics.
Join this keynote to hear Vinoth uncover:
- Build vs. buy for open source data
- The right tools for your data
- The best-of-breed engines for different use cases
- The significance of data interoperability
- Common vendor selection pitfalls
Session: Practicalities of deploying open source databases
Jordan West, Staff Software Engineer @ Netflix
Deploying open source databases often takes more than downloading the artifact and running it on your server. Control planes need to be built, internal platform tools (logging and metrics) need to be integrated, and it's not uncommon to fork the code for your own needs, raising questions such as how updates are managed and how changes are contributed back. This talk will look at how Netflix has deployed Apache Cassandra and how it manages every aspect of that deployment, from automation and monitoring to development.
Session: A petabyte-scale vector store for generative AI
Patrick McFadin, VP Developer Relations @ DataStax
This talk will focus on the work in the Apache Cassandra® project to develop a vector store capable of handling petabytes of data, discussing why this capacity is critical for future AI applications. I will also connect how this pertains to exciting new generative AI technologies like Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and Forward-Looking Active Retrieval Augmented Generation (FLARE), all of which contribute to the growing need for such scalable solutions. The needs of autonomous agents will drive the next wave of data infrastructure. Are you ready?
Key Takeaways:
- Understand the future of generative AI and why current laptop-scale models will soon be obsolete
- Apache Cassandra® and its role in creating a petabyte-scale vector store for AI applications
- Vector-powered AI technologies such as LLMs, RAG, and FLARE
- How AI agents can leverage such scalable solutions for better decision-making
- Importance of planning and managing future growth in AI applications and how to avoid painful migrations later
- Use cases with frameworks like LangChain, LlamaIndex, and CassIO
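To make the core idea concrete, here is a minimal, illustrative sketch of what a vector store does at its heart: rank stored embeddings by cosine similarity to a query vector. This is plain Python with toy three-dimensional vectors, not Cassandra's actual implementation, which handles this at petabyte scale with approximate-nearest-neighbor indexing.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query, store, k=2):
    # Rank stored (id, vector) pairs by similarity to the query.
    scored = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

store = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
}
results = top_k([1.0, 0.0, 0.0], store, k=2)
print(results[0][0])  # → doc-a
```

In a RAG pipeline, the top-k documents retrieved this way are fed into the LLM prompt as context; the session covers how frameworks like LangChain and CassIO wire this up against Cassandra.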
Session: Scaling and governing Robinhood's data lakehouse
Balaji Varadarajan, Senior Staff Software Engineer @ Robinhood
Pritam Dey, Technical Lead @ Robinhood
Robinhood Markets’ mission is to democratize finance for all. Continuous data analysis and data-driven decision-making are fundamental to achieving this. The data required for analysis comes from varied sources - OLTP databases, event streams, and various 3rd party sources. A reliable lakehouse with an interoperable data ecosystem and fast data ingestion service is needed to power various reporting and business-critical pipelines and dashboards. Being in the financial domain, effective data governance is crucial for regulatory compliance and ensuring that data is consistent and trustworthy.
In this talk, we will describe the evolution of the big data ecosystem in Robinhood not only in terms of the scale of data stored and queries made but also the use cases that it supports. We go in-depth into the lakehouse along with the data ingestion services we built using open source tools to reduce the data freshness latency for our core datasets from one day to under 15 minutes. Finally, we will also describe our approach to data governance and compliance and how we are leveraging open source components in implementing this.
Session: Diving into Uber's cutting-edge data infrastructure
Girish Baliga, Director of Engineering @ Uber
Join this talk with Uber's Girish Baliga to hear how Uber is leveraging open source technologies like Presto, Spark, Flink, Pinot, Hudi, and HDFS to tailor solutions that meet their unique business requirements. Learn how embracing open source empowers Uber to achieve scalability, superior performance, and multi-cloud compatibility, paving the way for future-proofing while fostering collaboration with industry partners in both consumer and enterprise sectors.
Session: Incremental data processing: A path to streamlined data engineering pipelines
Prasad Pathak, Data Engineer @ Tesla
In this session, we invite you to explore the paradigm shift brought about by incremental data processing. While the allure of complete data processing remains, incremental processing offers a compelling alternative. We will delve into the core principles of incremental data processing, illuminating its benefits, including accelerated data processing, reduced operational costs, and enhanced data quality.
Along this journey, we'll navigate the challenges inherent to incremental processing and unveil best practices to mitigate them. Through a detailed example pipeline design, we will illustrate how incremental data processing can be seamlessly integrated into your existing data engineering architecture, empowering organizations to stay competitive in an era driven by data. Join Prasad to unlock the transformative potential of incremental data processing and revolutionize your data engineering pipelines.
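The core pattern the session describes can be sketched in a few lines: instead of rescanning the full dataset on every run, an incremental job keeps a checkpoint (here, the last processed event time) and processes only records that arrived after it. This is an illustrative toy, not Tesla's actual pipeline; the checkpoint field and transformation are stand-ins.

```python
def run_incremental(records, checkpoint):
    """Process only records newer than the checkpoint; return results and the new checkpoint."""
    new = [r for r in records if r["event_time"] > checkpoint]
    processed = [r["value"] * 2 for r in new]  # stand-in transformation
    new_checkpoint = max((r["event_time"] for r in new), default=checkpoint)
    return processed, new_checkpoint

records = [
    {"event_time": 1, "value": 10},
    {"event_time": 2, "value": 20},
    {"event_time": 3, "value": 30},
]

# First run: everything is new, so everything is processed.
out1, ckpt = run_incremental(records, checkpoint=0)

# Second run: one newly arrived record; only it is processed.
records.append({"event_time": 4, "value": 40})
out2, ckpt = run_incremental(records, checkpoint=ckpt)
print(out2)  # → [80]
```

The cost savings follow directly: the second run touches one record instead of four, and the gap widens as the table grows.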
Session: Building an open source control plane for data
Shirshanka Das, Co-Founder & CTO @ Acryl
Today's data landscape, spanning operational, streaming, lakehouse, and warehouse systems, grapples with the complexities of decentralization. Do you seek new approaches, tools, and solutions to address challenges in data discovery, management, governance, quality, and observability? This session will cover:
- The control plane for data and its role in the decentralized data stack
- Using the open-source DataHub Project to implement the control plane for data
- How data contracts help you define and enforce decentralized governance
- How this helps reduce operational overhead and keep cloud costs down
- Why the industry is gravitating towards these ideas—and why the DataHub Project is key
Session: Enabling Walmart's data lakehouse with Apache Hudi
Ankur Ranjan, Data Engineer III @ Walmart
Ayush Bijawat, Senior Data Engineer @ Walmart
Ankur and Ayush are diving into the strategic shift from data lake to lakehouse architecture at Walmart, with a focus on the importance played by the open table format Apache Hudi. Attendees of this talk will gain a comprehensive understanding of the Hudi architecture, including insights into its underlying file structure, file groups, and the critical timeline feature.
Their session explores the crucial choice between Copy-on-Write (CoW) and Merge-on-Read (MoR) tables, highlighting why CoW tables were chosen for specific use cases. Moreover, the discussion will showcase the various optimization benefits unlocked by Apache Hudi, such as file resizing, enhanced UPSERT and DELETE support, schema evolution, and partial updates.
The session also offers a practical perspective on enabling Apache Hudi within the Google BigQuery ecosystem, making it a must-attend event for those seeking to leverage cutting-edge data management solutions.
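For readers new to the CoW/MoR choice the session discusses, the table type is a single write-time option in Hudi's Spark datasource configuration. The sketch below shows illustrative options for a Copy-on-Write table; the table and field names are hypothetical, not Walmart's schema, and the option keys follow the Apache Hudi Spark datasource docs.

```python
# Hypothetical Hudi write options for a Copy-on-Write table.
# Table/field names are illustrative; option keys follow the
# Apache Hudi Spark datasource configuration.
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",  # vs. MERGE_ON_READ
    "hoodie.datasource.write.operation": "upsert",
}

# In a Spark job these options would be applied roughly as:
#   df.write.format("hudi").options(**hudi_options).mode("append").save(path)
print(hudi_options["hoodie.datasource.write.table.type"])  # → COPY_ON_WRITE
```

CoW rewrites affected data files on each write, favoring read-heavy workloads; MoR appends row-based delta logs and compacts later, favoring write-heavy ones.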
Session: Data plumbing basics: Build, deploy, and scale ML models for your time series data
Tun Shwe, VP of Data @ Quix
Jay Clifford, Developer Advocate @ InfluxData
“Collect”, “Store” and “Act” are the three key pillars to building any time series-based solution. While acting upon your data holds paramount importance, it simultaneously presents a puzzle of questions:
- How do I query, transform, and process my stored time series data?
- How can I build and run anomaly detection or forecasting algorithms on my time series data?
- How can I efficiently scale and expand my time series analytics engine?
In this talk, we will explore the methodology for crafting a scalable time series data pipeline, leveraging the event streaming platform, Quix, and the time series database, InfluxDB. We will also walk through the process of creating, training, and deploying a machine-learning model, utilizing the power and flexibility of Keras and Hugging Face for anomaly detection.
Key Takeaways:
- Collecting and storing time series data: Grasp the nuances of managing time series data with InfluxDB
- Model training and storage: Dive into model creation and training using Keras and Hugging Face
- Pipeline construction and deployment: Explore building and deploying a resilient time series data pipeline with Quix
Join this session to learn how to build a foundational but scalable architecture that you can plumb into your own time series solutions.
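As a taste of the "Act" pillar, here is a deliberately simple anomaly detector: flag points that deviate from the rolling mean by more than a few standard deviations. It is a stand-in for the Keras/Hugging Face model the speakers build, using only the Python standard library; the window size and threshold are arbitrary illustrative values.

```python
import statistics

def detect_anomalies(series, window=5, threshold=3.0):
    # Flag indices whose value sits more than `threshold` standard
    # deviations away from the mean of the preceding `window` points.
    anomalies = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        mean = statistics.mean(recent)
        stdev = statistics.pstdev(recent)
        if stdev > 0 and abs(series[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies

# Steady sensor signal with one spike at index 8.
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 9.8, 10.0, 42.0, 10.1]
print(detect_anomalies(readings))  # → [8]
```

In the architecture the session presents, a transformation like this would run continuously against the event stream, with InfluxDB as the store and Quix hosting the deployed model.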
Session: Making decisions that are right for your data platform
Nishith Agarwal, Head of Data & ML Platforms @ Lyra Health
Today, with the multitude of cloud, vendor, and managed solutions, we often find ourselves at the crossroads of whether we should buy or build solutions. In this talk, I’ll walk through my experience building data platforms for the teams at Walmart Labs, Uber, and Lyra Health. We'll discuss key considerations, strategies, and capabilities of tooling and infrastructure required to set up a world-class data platform and provide a framework to help determine which solutions are good to build in-house and which ones are worth shopping around for.
Panel: The growing role of open source technology in today's data architectures
Vinoth Chandar, Founder & CEO @ Onehouse
Raghu Ramakrishnan, CTO @ Microsoft
Jay Kreps, Co-Founder & CEO @ Confluent
Kapil Surlaker, VP of Engineering @ LinkedIn
Justin Borgman, CEO @ Starburst
Praveen Neppali Naga, VP of Engineering & Data Science @ Uber
Justin Levandoski, Director of Engineering @ Google
The panel will discuss their storied experiences, success with open source data, user needs that are driving the shift to open data infrastructure, the importance of open democratized data access, how to think about build-vs-buy, and what the future holds for the industry at large. We invite you to attend and tap into their expertise to understand more deeply the indispensable role that open source plays in today's data landscape.
The panelists on this lively panel bring experience across:
- Building data infrastructure services on one of the largest public cloud providers
- Creating industry-leading open stream processing and real-time analytics systems
- Funding over a dozen major open source projects in the industry
- Bringing to life a new world of open data lakes and open query engines
- Solving some of the hardest problems in data today at planet scale
Session: Our emergency move from Postgres to OLAP
James Greenhill, Data Engineer @ PostHog
If you are interested in OLAP data stores, decoupling apps, and schemas, this talk is for you. PostHog recently found product-market fit and made an emergency migration to an OLAP solution that fit its bill of requirements. James joins us to transparently share PostHog's learnings: what went well and how the team protected themselves from themselves. He'll also share some thoughts on the mistakes (and friends) that were made along the way.
Session: Choosing an open table format for your transactional data lake
Shana Schipers, Principal Specialist SA, Analytics @ AWS
Join Shana for an explorative talk on the burgeoning popularity of transactional data lakes within modern data platforms. These innovative solutions, such as those supported by AWS, have unlocked a realm of possibilities, making previously challenging tasks like Change Data Capture (CDC) and GDPR compliance seamlessly achievable. By embracing open-table formats like Apache Hudi, Apache Iceberg, and Delta Lake, organizations can combine the power of transactional capabilities with the flexibility of Amazon S3 data lakes, all at scale. However, with numerous options available, there's no "one size fits all" solution.
This session aims to be your compass in this journey, offering guidance on selecting the right table format for your specific use case. We'll explore essential criteria and attention points to accelerate your decision-making process and ensure the success of your data lake endeavors.
Session: Maximizing efficiency by templating Glue jobs and serverless architecture in Hudi data lakes
Soumil Shah, Data Engineering Team Lead @ JobTarget
By combining the power of templated Glue jobs and a serverless architecture within Hudi data lakes, this project revolutionizes how organizations handle large volumes of data. In this project, we explore the challenges of data lake management and introduce a solution that leverages the flexibility of Glue jobs to automate data ingestion, transformation, and loading tasks. The templated approach significantly reduces the effort required to manage complex ETL processes.
Additionally, the project incorporates a serverless architecture, minimizing operational overhead and costs associated with traditional infrastructure management. This approach not only streamlines data processing but also ensures scalability, fault tolerance, and high availability.
Through practical examples, source code, and architectural insights, this project equips data engineers and architects with the knowledge and tools to implement these techniques in their own data lake ecosystems. Attendees will learn how to maximize efficiency, optimize ETL workflows, and harness the full potential of Hudi data lakes, ultimately improving their organization's data management capabilities.
Session: Open data foundations with OneTable - Hudi, Delta, and Iceberg interoperability
Ashvin Agrawal, Senior Researcher @ Microsoft
Tim Brown, Engineering @ Onehouse
Anoop Johnson, Senior Staff Software Engineer @ Google
OneTable is a brand new open source project that unlocks omni-directional interoperability between the popular lakehouse projects Apache Hudi, Delta Lake, and Apache Iceberg. When your data is at rest in your lake, Hudi, Delta, and Iceberg are not so different. They each offer a metadata layer over a set of parquet files. OneTable offers lightweight conversion mechanisms that can take a source metadata format and sync it into one or more target metadata formats.
This session will feature a live demo and describe real-world applications of how to build open data foundations that can accelerate your workloads into a variety of open source query engines including Spark, Presto, Trino, Flink, and more. We will describe the technical foundations for Hudi, Delta, and Iceberg and lay out the nuts and bolts of how OneTable seamlessly converts data between these formats. We will detail our journey to create the project, share the vision for the future, and show how you can join this new open community.
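The key insight behind OneTable can be illustrated conceptually (this is not the OneTable API, just a toy model): all three formats describe the same parquet data files, so conversion means regenerating a thin metadata layer rather than rewriting any data.

```python
# Conceptual illustration only -- not the OneTable API. The three table
# formats share the same parquet data files; "converting" a table means
# producing a new metadata layer over the same file list.
data_files = ["part-0001.parquet", "part-0002.parquet"]

def make_metadata(fmt, files):
    # Each format records the same file list under its own metadata layout.
    return {"format": fmt, "snapshot": sorted(files)}

source = make_metadata("hudi", data_files)
targets = {fmt: make_metadata(fmt, data_files) for fmt in ("delta", "iceberg")}

# All three "tables" point at exactly the same data files.
assert all(m["snapshot"] == source["snapshot"] for m in targets.values())
print(sorted(targets))  # → ['delta', 'iceberg']
```

Because no parquet files are copied or rewritten, the sync is lightweight and can run in any direction, which is what makes omni-directional interoperability practical.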
Session: Overhauling data management at Apna
Sarfaraz Hussain, Senior Data Engineer @ Apna
Ronak Shah, Head of Data & Product @ Apna
Sarfaraz and Ronak will walk through the transformation of data management at Apna, a leading technology company. In their talk they'll cover:
- A comprehensive look at Apna’s historical data landscape, including the challenges and limitations it posed for the organization
- The catalysts that drove the need for a significant data transformation
- The methodologies, tools, and best practices employed to construct their new data platform internally
- Apna’s revamped data storage architecture which is designed to enhance scalability, performance, and data accessibility
- Their data ingestion and processing techniques, along with insights into the streamlined processes that ensure efficient and reliable data handling
Session: Apache Hudi 1.0 preview: A database experience on the data lake
Bhavani Sudha Saktheeswaran, Software Engineer @ Onehouse
Sagar Sumit, Software Engineer @ Onehouse
Hudi is a top-level Apache open-source project and community that is breaking the boundaries of what is possible on a data lake. This talk unveils the essence of Apache Hudi 1.0, a pivotal version that will encapsulate a ground-up reimagination of Hudi's transactional database layer while staying true to its foundational principles. Diving deep, we'll explore:
- State of the project: How Hudi today is a versatile data lake platform, enabling automated, near real-time data ingestion and incremental processing, integrated seamlessly with powerful frameworks such as Apache Spark, Flink, and Kafka Connect.
- Hudi 1.0: An insightful look into how Hudi is architecting the foundational blocks of its database kernel. From implementing non-blocking concurrency control, and faster access methods with improved indexing and metadata, to leveraging an LSM tree-style timeline for infinite time travel – Hudi is redesigning every facet to optimize data lakes at scale.
- The road ahead: Understanding the potential of transforming Hudi's core into a universal database experience for the lake, diving into deep query engine integrations, employing a hybrid server architecture, and expanding capabilities for complex data types, including images, videos, and formats conducive to ML/AI.
As Hudi 1.0 prepares to set a new benchmark in the world of streaming data lakes, this talk invites feedback, ideas, and collaborations to augment its scope and deliver unparalleled value to the user community. Join us to be part of this transformational journey!
Session: Options for real-time data pipelines
Siddharth Jain, Senior Engineering Manager @ Wayfair
Traditionally, data pipelines have been set up to run as a batch. Now, cloud services give you the capability to run pipelines in real time. Real-time pipelines can drive operational efficiencies and provide an opportunity for businesses to react quickly. However, not all analytics needs demand a real-time approach, and knowing which do and which don't is essential to achieving efficiency. Teams can easily fall into the trap of using services that enable real-time analytics, but it comes at a cost. Siddharth will discuss options like batching and micro-batching that can address most use cases at a lower cost.
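The micro-batching middle ground mentioned above can be sketched in a few lines: instead of handling every event individually (real-time) or once a day (batch), events are buffered and flushed in small groups, trading a little latency for much cheaper per-event processing. The class below is an illustrative toy, not any particular cloud service's API.

```python
# Illustrative micro-batching sketch: buffer events and hand them to a
# sink in fixed-size groups instead of one at a time.
class MicroBatcher:
    def __init__(self, batch_size, sink):
        self.batch_size = batch_size
        self.sink = sink        # callable invoked once per flushed batch
        self.buffer = []

    def add(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Emit whatever has accumulated, then reset the buffer.
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()

flushed = []
batcher = MicroBatcher(batch_size=3, sink=flushed.append)
for event in range(7):
    batcher.add(event)
batcher.flush()  # drain the remainder
print(flushed)  # → [[0, 1, 2], [3, 4, 5], [6]]
```

Seven events triggered only three downstream calls; at scale, that ratio is where the cost savings over true per-event processing come from.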
Panel: A discussion on batch, streaming, and real-time data processing for ML
Eric Gonzalez, VP, Business Intelligence Architecture @ Eastern Bank
Vaishnavi Muraldihar, Data Engineer @ Intuit
Michael Del Balso, CEO & Co-Founder @ Tecton
This panel will critically explore differing perspectives on batch, streaming, and real-time data processing, shedding light on common misconceptions, cost considerations, decision-making criteria, and the future of these architectures.
The panelists will discuss the most prevalent misconception in this domain, address the cost challenges of streaming architectures, and share insights on potential cost reduction strategies. They will also explore the decision-making criteria that guide organizations in choosing between these processing methods and discuss ways to make streaming data processing more accessible.
Furthermore, the advantages of streaming data for machine learning will be examined, emphasizing its potential to revolutionize real-time insights and predictive capabilities.
Panel: A discussion about contributing to open source projects
Nadine Farah, Head of Developer Relations @ Onehouse
Manfred Moser, Trino Contributor
David Anderson, Apache Flink Committer
Bhavani Sudha Saktheeswaran, Apache Hudi PMC
Join us for an important panel discussion on essential considerations when choosing to get involved with an open source project. Our panelists will discuss:
- How to assess a project's worthiness for contribution
- Steps for learning about and contributing to open source initiatives
- How open source contribution can impact your career
- Effective communication of project direction and growth from a technical standpoint
Session: Build user-facing analytics applications that scale
Albert Wong, Head of Developer Relations @ CelerData
This session covers building scalable user-facing analytics applications with StarRocks, a high-performance analytical database.
Albert walks through its fast sub-second query responses, high scalability for petabytes of data, and user-friendly compatibility with standard SQL. He'll cover StarRocks' features, architecture, and practical usage.
Where & when?
Open Source Data Summit 2024 was held on October 2nd, 2024.
What is the cost of access to the live virtual sessions?
OSDS is always free and open for all to attend.
What is Open Source Data Summit?
OSDS is a peer-to-peer gathering of data industry professionals, experts, and enthusiasts to explore the dynamic landscape of open source data tools and storage.
The central theme of OSDS revolves around the advantages of open source data products and their pivotal role in modern data ecosystems.
OSDS is the annual peer hub for knowledge exchange that fosters a deeper understanding of open source options and their role in shaping the data-driven future.
Who attends OSDS?
OSDS is attended by data engineers, data architects, developers, DevOps practitioners and managers, and data leadership.
Anyone who is looking for enriched perspectives on open source data tools and practical insights to navigate the evolving data landscape should attend this event.
On October 2nd, 2024 we convened for discussions about:
- Benefits of open source data tools
- Cost/performance trade-offs
- Building data storage solutions
- Challenges surrounding open source data tool integration
- Solutions for the cost of storing, accessing, and managing data
- Data streams and ingestion
- Hub-and-spoke data integration models
- Choosing the right engine for your workload
Are you interested in speaking or sponsoring the next Open Source Data Summit?
Submit a talk proposal here or reach out to astronaut@solutionmonday.com.
Register for on-demand access to the OSDS 2024 sessions and announcements about OSDS 2025
"*" indicates required fields