That's a wrap! Thank you to all those who spoke, sponsored, and attended Open Source Data Summit 2024!

Explore the open source data landscape

October 2, 2024 | Live Virtual Conference

Open Source Data Summit 2024 Speaker Lineup

Sivanagaraju Gadiparthi

Lead - Data & Analytics

Shravana Krishnamurthy

Director of Engineering

Vinoth Chandar

Founder & CEO

Joe Reis

CEO

Lakshmana Yenduri

Sr. Staff Software Engineer

Denis Krivenko

Data Engineering Consultant

David Regalado

Founder

Y Ethan Guo

Software Engineer

Tim Meehan

Software Engineer

Bhavani Sudha Saktheeswaran

Software Engineer

Balaji Varadarajan

Engineer

Stephen Said

Senior Solutions Architect

Matthias Rudolph

Associate Solutions Architect

Ashvin Agrawal

Research Software Engineer

Priyanka Naik

Principal Software Engineer

Dipankar Mazumdar

Staff Data Engineer Advocate

Audra Montenegro

Community Program Manager

Emil Emilov

Principal Software Engineer

Denny Lee

Sr. Staff Developer Advocate

Lisa N. Cao

Apache Gravitino Product Manager

Shirshanka Das

Co-Founder & CTO

Kyle Weller

Head of Product

Neha Pawar

Head of Data Platform

Manfred Moser

Director of Trino Community Leadership

Russell Spitzer

Principal Engineer

Ajit Panda

Head of Data Security & Governance

Will Morrison

Senior Director of Specialty Services

Shuguang Xiang

Lead Data Engineer

Shi Kai Ng

Lead Software Engineer

Andrew Gelinas

Co-Founder

Official Agenda | Open Source Data Summit 2024

Opening Remarks

Andrew Gelinas, Co-Founder @ Solution Monday

The new normal: Unbundling your data platform with an open data lakehouse

Vinoth Chandar, Founder & CEO @ Onehouse

As organizations face growing demands for supporting diverse use cases for their data, using a tightly coupled data warehouse as the primary data store is becoming increasingly impractical. Over the past year, major data vendors and cloud providers have aligned towards an alternative quietly adopted by many forward-thinking data organizations - an unbundled data platform rooted in an open data lakehouse.

In this opening keynote, the founder of Onehouse and originator of the data lakehouse architecture, Vinoth Chandar, lays out a blueprint for building your next data platform on top of this emerging data architecture that decouples and democratizes data across different data systems while minimizing vendor lock-in and monolithic data infrastructure. This approach enables a modular architecture where each component is chosen based on specific use cases and requirements in a composable fashion, fostering innovation and efficiency across the data ecosystem.

Join this session to learn about:

Why this tectonic shift is happening now.
Tangible benefits of embracing unbundling.
Choosing the apt data storage, pipelines, query engines, and processing frameworks.
Success stories from leading organizations.
Ongoing work in the industry towards this end state.

The rise of open source data catalogs featuring: Unity Catalog, DataHub, Apache Gravitino, and Apache Polaris (Incubating)

Denny Lee, Sr. Staff Developer Advocate @ Databricks

Lisa N. Cao, Apache Gravitino Product Manager @ Datastrato

Shirshanka Das, Co-Founder & CTO @ Acryl Data

Kyle Weller, Head of Product @ Onehouse

Russell Spitzer, Principal Engineer @ Snowflake

In the last 5 years, the data landscape has rapidly evolved to a preference for independent, neutral storage decoupled from compute layers, databases, warehouses, etc. While this architectural pattern, often called a data lakehouse, provides excellent freedom for your data, the tradeoff is that it is naturally lacking in data governance.

Data catalogs on data lakes are rapidly becoming a hotspot of focus and investment for many organizations and communities. With a large ecosystem of options, a few key open source data catalogs have recently risen in popularity: Unity Catalog, DataHub, Apache Gravitino, and Apache Polaris. Come join this panel discussion, which features experts and community leaders from each of these prominent open source communities. We will discuss why open source driven data governance is important, and we will dive into what each of these catalogs is doing for the industry.

Building a modern data lake to optimize digital offers for banking partners

Shravana Krishnamurthy, Director of Engineering @ Cardlytics

Cardlytics empowers advertisers with industry-leading purchase insights, enabling them to launch and optimize digital offers. By leveraging extensive purchase data from over 200 million bank customers, we identify opportunities, target real individuals within their banking environments, and measure the actual sales impact of our ads. Partnering with financial institutions, we run rewards programs that drive customer loyalty and deepen bank relationships. With a data scale covering $3.5 trillion in spend and 1 in 2 U.S. transactions, Cardlytics provides unmatched precision in Return on Ad Spend (ROAS) metrics, helping brands drive incremental sales and grow market share.

In this talk, Shravana Krishnamurthy will share insights on building a modern data lake architecture at Cardlytics using Hudi, Airflow, Spark, Lake Formation, Athena, and EMR. The discussion will cover key learnings on Hudi concepts, including indexing strategies, file sizing, and the development of streaming pipelines that ensure efficient data processing. Additionally, Shravana will highlight the use of Superset for data quality and monitoring.

Say goodbye to the Lambda architecture

David Regalado, Founder @ Data Engineering Latam

In the Lambda Architecture, an immutable sequence of records is captured and fed into a batch system and a stream processing system in parallel. You implement your transformation logic twice: once in the batch system and once in the stream processing system. You stitch together the results from both systems at query time to produce a complete answer.

The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is painful.
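To make the duplication concrete, here is a minimal sketch in plain Python. It is not taken from the talk and uses no real batch or streaming framework: the event log, the function names, and the batch cutoff are all hypothetical. It only shows the same aggregation written twice and then stitched together at query time.

```python
from collections import Counter

# Hypothetical immutable event log: (event_id, user, amount).
EVENTS = [
    (1, "ana", 10),
    (2, "ben", 5),
    (3, "ana", 7),
    (4, "ben", 1),
]


def batch_view(events, up_to_id):
    """Batch layer: recompute totals from scratch over events the last batch run covered."""
    totals = Counter()
    for event_id, user, amount in events:
        if event_id <= up_to_id:
            totals[user] += amount
    return totals


class SpeedLayer:
    """Speed layer: the same aggregation, reimplemented incrementally for newer events."""

    def __init__(self):
        self.totals = Counter()

    def on_event(self, event):
        _, user, amount = event
        self.totals[user] += amount


def query(batch_totals, speed_totals, user):
    """Serving layer: stitch the batch and speed views together at query time."""
    return batch_totals[user] + speed_totals[user]


BATCH_CUTOFF = 2  # assumed: the last batch run covered events 1 and 2
batch_totals = batch_view(EVENTS, BATCH_CUTOFF)

speed = SpeedLayer()
for event in EVENTS:
    if event[0] > BATCH_CUTOFF:
        speed.on_event(event)

ana_total = query(batch_totals, speed.totals, "ana")
ben_total = query(batch_totals, speed.totals, "ben")
```

The summing logic lives in both `batch_view` and `SpeedLayer.on_event`. Any change to the business rule has to be made in both places and kept consistent.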

Apache Beam aims to solve this problem and David will share how that's working.

Unleashing Hudi 1.0: Re-inventing the data lakehouse wheel we created

Balaji Varadarajan, Senior Staff Software Engineer @ Applied Intuition

Y Ethan Guo, Software Engineer @ Onehouse

Join us at the Open Source Data Summit for an in-depth exploration of Hudi 1.0, a groundbreaking release set to redefine what a data lakehouse can do. This talk will delve into Hudi’s cutting-edge features designed to optimize storage for existing structured data while paving the way for unstructured data and blobs in an increasingly AI-driven world. Discover how Hudi’s novel concurrency control mechanisms eliminate the need for blocking writers, enabling seamless, high-throughput writes and updates. Additionally, we will unveil Hudi’s powerful new indexing subsystem, showcasing its unique secondary and vector indexing capabilities. These advancements empower users with unprecedented query performance and flexibility in managing their data lakehouses.

Whether you’re a data engineer, architect, or enthusiast, this session will provide valuable insights into leveraging Hudi 1.0 to achieve superior efficiency and scalability in your data operations. Don’t miss this opportunity to learn from the experts and stay ahead in the rapidly evolving data landscape.

Trino Gateway: Because one Trino cluster is not enough

Manfred Moser, Director of Trino Community Leadership @ Starburst

Will Morrison, Senior Director of Specialty Systems @ Starburst

Mature organizations using Trino often end up running more than one cluster. These clusters often serve different departments, datasets, or data locations, separate batch from analytics use cases, or simply keep production apart from test. Trino Gateway is the tool used to simplify access to all of these clusters.

Manfred and Will’s session dives into the power of the Trino Gateway. It acts as a load balancer, proxy server, and configurable routing gateway for multiple Trino clusters, and many larger deployments across the Trino community rely on it in production. Learn more about how the Trino community uses Trino Gateway to achieve workload distribution, automatic query routing, and more.
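As a rough illustration of the routing idea, the toy sketch below picks a group of clusters from a property of the incoming query and then spreads load round-robin within that group. It is not Trino Gateway's implementation or API. The class names, the routing rule, and the cluster URLs are all hypothetical.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Cluster:
    name: str
    url: str  # hypothetical coordinator URL


class ToyGateway:
    """Toy router: choose a routing group, then round-robin across its clusters."""

    def __init__(self, groups):
        # groups maps a routing group name to the list of clusters behind it
        self._cyclers = {group: cycle(clusters) for group, clusters in groups.items()}

    def routing_group(self, source):
        # Illustrative rule only: scheduled jobs go to "batch", everything else to "adhoc"
        return "batch" if source == "airflow" else "adhoc"

    def route(self, source):
        return next(self._cyclers[self.routing_group(source)])


gateway = ToyGateway({
    "adhoc": [
        Cluster("adhoc-1", "http://adhoc-1.example:8080"),
        Cluster("adhoc-2", "http://adhoc-2.example:8080"),
    ],
    "batch": [Cluster("etl-1", "http://etl-1.example:8080")],
})

# Four incoming queries, identified here only by their (hypothetical) source tag
picked = [gateway.route(source).name for source in ["cli", "airflow", "cli", "cli"]]
```

Clients only need to know the gateway's address. Which cluster actually runs a query is decided by the rule and can change without touching the clients.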

Batch vs. stream data processing: Navigating the best of both worlds

Sivanagaraju Gadiparthi, Lead Data & Analytics @ ADP

Sivanagaraju joins us to walk through a comprehensive overview of the evolving paradigms in data processing, focusing on batch and stream processing.

His talk outlines their historical development, key features, and common use cases, with examples of relevant tools and frameworks. A comparative analysis highlights the strengths and limitations of each approach, emphasizing their impact on infrastructure and cost.

Sivanagaraju's talk will also introduce hybrid approaches like the Lambda and Kappa Architectures, showcasing their practical applications. His talk will conclude with best practices, considerations for implementation, and a look at future trends, offering valuable insights for organizations optimizing their data processing strategies.

Apache XTable, cross table interoperability with Delta, Iceberg, and Hudi

Ashvin Agrawal, Research Software Engineer @ Microsoft

Dipankar Mazumdar, Staff Data Engineer Advocate @ Onehouse

Apache XTable is a new open source project incubating in the Apache Software Foundation that unlocks omni-directional interoperability between the popular lakehouse projects Delta Lake, Apache Iceberg, and Apache Hudi. With a budding community over the past year, XTable has proven to be a bridge of unification for the data lakehouse industry.

In this session, we will highlight what is new in the project and the community. We will review the technical details of how the metadata translation works and showcase real-world examples of how users have adopted the project. Come see a live demo of how to use XTable with a variety of open source query engines including Spark, Presto, Trino, Flink, and more.

Achieving near real-time compliance at scale: How Uber manages sensitive data across an exabyte of global systems

Ajit Panda, Head of Data Security & Governance @ Uber

Uber holds an exabyte of data across various systems, including its data lake, the datastores backing up critical apps, and unstructured data scattered across various drives. At the same time, Uber operates across the globe and must comply with the regulations mandated by the countries and states it operates in.

In the last few years, Uber has come a long way: from being reactive, spending huge amounts of time and resources on rewrites and risking reliability issues, to being compliant in near real time. This is a major shift in how compliance is enforced in the industry.

In this talk, Ajit will focus on how Uber classifies data to identify various types of sensitive information, the tools it has built over time to enforce compliance, and fine-grained monitoring.

Racing through big data: A comparative analysis of Apache Spark, Hadoop, and Flink in batch processing

Lakshmana Yenduri, Sr. Staff Software Engineer @ Visa

In the era of Big Data, efficient data processing architectures are crucial for the timely analysis of vast datasets to extract valuable insights. Apache Hadoop (AH), Apache Spark (AS), and Apache Flink (AF) are prominent contenders in large-scale data processing.

Lakshmana's talk focuses on batch processing to evaluate performance, aiming to shed light on execution time with large datasets. In experiments on datasets ranging from 1 GB to 5 GB, AS emerged as the frontrunner, showing significant performance advantages over AF and AH. Despite the absence of parallelism, Spark maintained its lead, indicating its potential for scalable batch processing. Further research is needed to explore Spark's performance in distributed environments.

Overall, this talk underscores Spark's significance in batch processing large datasets, contributing to our understanding of Big Data processing and informing data analytics workflows.

Building together: How user communities drive open source data projects

Audra Montenegro, Community Program Manager @ CNCF

Bhavani Sudha Saktheeswaran, Software Engineer @ Onehouse

Priyanka Naik, Principal Software Engineer @ Palo Alto Networks

Neha Pawar, Head of Data Platform @ StarTree

Our panel will explore how open-source communities drive project success, with Audra, Neha, Sudha, and Priyanka discussing ways to activate, maintain, and measure community engagement. They will share strategies for leveraging community resources, ensuring security in open-source solutions, and empowering organizational contributors. Using examples from projects like Apache Hudi and Apache Pinot and the perspective of those evaluating projects, the panel will highlight practical approaches for building and sustaining thriving open-source ecosystems.

Enhancing interoperability of open table formats with Apache XTable

Stephen Said, Senior Solutions Architect @ AWS

Matthias Rudolph, Associate Solutions Architect @ AWS

XTable is an incubating Apache project for conversion between open table formats (OTF) which improves the interoperability of analytical data. For instance, XTable converts Delta Lake to Iceberg without data duplication.

In this session, Stephen and Matthias will give an introduction to XTable and demonstrate it in practice. They'll present OTF conversion with XTable in a data pipeline on Apache Airflow and how to run XTable in a background conversion mechanism.

Mixed model arts - The convergence of data modeling across apps, analytics, and AI

Joe Reis, Author, Fundamentals of Data Engineering & CEO @ Ternary Data

For decades, data modeling has been fragmented by use cases: applications, analytics, and machine learning/AI. This leads to data siloing and “throwing data over the wall.”

As the emergence of AI, streaming data, and “shifting left” change data modeling, these siloed approaches are insufficient for the diverse world of data use cases. Today's practitioners must possess an end-to-end understanding of the myriad techniques for modeling data throughout the data lifecycle. This presentation covers “mixed model arts,” which advocates converging various data modeling methods and innovating new ones.

APIs and community in the composable data ecosystem

Tim Meehan, Software Engineer @ IBM

Tim's session explores the evolution and challenges of composable data systems, emphasizing the disaggregation of traditional data warehouses into specialized components such as storage, compute, ingestion, and query processing.

This shift toward composability, driven by cost, reliability, and performance, has led to new challenges in integration, particularly around APIs, protocols, and community engagement.

Key insights include best practices for fostering adoption, such as using HTTP, adopting modern languages, and maintaining clear governance. The discussion also covers lessons learned from Presto's evolution and speculates on future disaggregation trends, offering a roadmap for navigating this complex landscape.

Optimizing data lake infrastructure for sub-second query latency

Emil Emilov, Principal Software Engineer @ Conductor

Emil will share his journey of building and optimizing a data lake infrastructure using various open-source projects and a cloud-native data platform for high-performance user-facing analytics.

The talk will include real-world challenges and solutions around partitioning, custom bucketing, and optimizing query engines to handle massive datasets while achieving sub-second query latency. Attendees will gain insights into the nuances of data skipping and pruning, and best practices for data modeling to avoid large scans.
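A minimal sketch of partition pruning and min/max-based data skipping is shown below. It assumes a hypothetical table whose files carry a partition value and per-file column statistics. The file names, the `FileStats` type, and the values are invented for illustration and do not describe Conductor's system or any specific table format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    partition: str  # partition value, here a date
    min_id: int     # smallest value of the filter column stored in the file
    max_id: int     # largest value of the filter column stored in the file


# Hypothetical file-level metadata for one table
FILES = [
    FileStats("d=2024-10-01/a.parquet", "2024-10-01", 1, 100),
    FileStats("d=2024-10-01/b.parquet", "2024-10-01", 101, 200),
    FileStats("d=2024-10-02/c.parquet", "2024-10-02", 1, 150),
    FileStats("d=2024-10-02/d.parquet", "2024-10-02", 151, 300),
]


def files_to_scan(files, partition, wanted_id):
    """Return only the files that could contain rows matching both predicates."""
    # Partition pruning: discard every file outside the requested partition
    candidates = [f for f in files if f.partition == partition]
    # Data skipping: discard files whose [min, max] range cannot contain the value
    return [f.path for f in candidates if f.min_id <= wanted_id <= f.max_id]


to_scan = files_to_scan(FILES, "2024-10-02", 170)
```

In this example only one of the four files has to be read. Pruning works this well only when the data layout keeps the min/max ranges of different files from overlapping, which is why partitioning and bucketing choices matter.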

This session is ideal for developers and data engineers tackling the complexity of scaling analytics on large data lakes, particularly those focused on delivering high-performance, user-facing applications.

Open data analytics platforms on Kubernetes

Denis Krivenko, Senior Data Engineer @ Coody

Today, cloud service providers offer easy access to enterprise-grade data platforms for users at any scale, from individuals to multi-million corporations. However:

What if a public cloud cannot be used?
What if the solution should be cloud agnostic?
What if neither Hadoop nor a data warehouse is considered a solution?

In today’s cloud age, there has been an increased reliance on Kubernetes, which allows its adopters to deploy, scale, and manage incredibly complex solutions in a few minutes.

In this session, Denis will talk about how to leverage the power of Kubernetes to build a cloud native data platform.

He will share the case and the reasons for creating an analytics data platform, walk through its architecture principles, and dive deep into how to build the solution on open source projects and adopt a GitOps approach.

Enabling near real-time data analytics on the data lake

Shuguang Xiang, Lead Data Engineer @ Grab

Shi Kai Ng, Lead Software Engineer @ Grab

This session explores the challenges Grab faced in achieving near real-time data analytics using conventional Change Data Capture (CDC) with Hive tables, and its transition to a more efficient approach using Flink CDC integrated with Hudi. The discussion will focus on improving data freshness, enabling self-serve data ingestion, and overcoming bottlenecks in the traditional pipeline. Key use cases from Grab Taxi will illustrate the significant improvements in operational efficiency, data availability, and decision-making enabled by this new architecture.
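For readers new to CDC, the core operation is applying an ordered stream of row-level changes to a table keyed by primary key. The sketch below shows that idea in plain Python. It does not use the Flink CDC or Hudi APIs, and the change-record shape, the `apply_changes` helper, and the sample bookings are all hypothetical.

```python
def apply_changes(table, changes):
    """Apply an ordered change log to a table stored as {primary_key: row}."""
    for change in changes:
        key = change["key"]
        if change["op"] == "delete":
            table.pop(key, None)  # deleting a missing key is a no-op
        else:
            # Inserts and updates are both treated as upserts on the primary key
            table[key] = change["row"]
    return table


# Hypothetical current state of a bookings table
bookings = {1: {"status": "requested"}}

# Hypothetical change events captured from the source database, in commit order
changes = [
    {"op": "update", "key": 1, "row": {"status": "completed"}},
    {"op": "insert", "key": 2, "row": {"status": "requested"}},
    {"op": "delete", "key": 2},
    {"op": "insert", "key": 3, "row": {"status": "requested"}},
]

snapshot = apply_changes(bookings, changes)
```

Because each change touches only the affected keys, the table can be kept fresh continuously instead of being rebuilt in periodic bulk loads.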

Where & when?

Open Source Data Summit 2024 was held on October 2nd, 2024.

What is the cost of access to the live virtual sessions?

OSDS is always free and open for all to attend.

What is Open Source Data Summit?

OSDS is a peer-to-peer gathering of data industry professionals, experts, and enthusiasts to explore the dynamic landscape of open source data tools and storage.

The central theme of OSDS revolves around the advantages of open source data products and their pivotal role in modern data ecosystems.

OSDS is the annual peer hub for knowledge exchange that fosters a deeper understanding of open source options and their role in shaping the data-driven future.

Who attends OSDS?

OSDS is attended by data engineers, data architects, developers, DevOps practitioners and managers, and data leadership.

Anyone who is looking for enriched perspectives on open source data tools and practical insights to navigate the evolving data landscape should attend this event.

On October 2nd, 2024 we convened for discussions about:

Benefits of open source data tools
Cost/performance trade-offs
Building data storage solutions
Challenges surrounding open source data tool integration
Solutions for the cost of storing, accessing, and managing data
Data streams and ingestion
Hub-and-spoke data integration models
Choosing the right engine for your workload

Are you interested in speaking or sponsoring the next Open Source Data Summit?

Submit a talk proposal here or reach out to astronaut@solutionmonday.com.

Register for on-demand access to the OSDS 2024 sessions and announcements about OSDS 2025


Thank you to the sponsors who've made Open Source Data Summit possible for all to attend