That's a wrap! Thank you to all those who spoke, sponsored, and attended Open Source Data Summit 2024!
Explore the open source data landscape
October 2, 2024 | Live Virtual Conference
Open Source Data Summit 2024 Speaker Lineup
Sivanagaraju Gadiparthi
Lead - Data & Analytics
Shravana Krishnamurthy
Director of Engineering
Vinoth Chandar
Founder & CEO
Joe Reis
CEO
Lakshmana Yenduri
Sr. Staff Software Engineer
Denis Krivenko
Data Engineering Consultant
David Regalado
Founder
Y Ethan Guo
Software Engineer
Tim Meehan
Software Engineer
Bhavani Sudha Saktheeswaran
Software Engineer
Balaji Varadarajan
Engineer
Stephen Said
Senior Solutions Architect
Matthias Rudolph
Associate Solutions Architect
Ashvin Agrawal
Research Software Engineer
Priyanka Naik
Principal Software Engineer
Dipankar Mazumdar
Staff Data Engineer Advocate
Audra Montenegro
Community Program Manager
Emil Emilov
Principal Software Engineer
Denny Lee
Sr. Staff Developer Advocate
Lisa N. Cao
Apache Gravitino Product Manager
Shirshanka Das
Co-Founder & CTO
Kyle Weller
Head of Product
Neha Pawar
Head of Data Platform
Manfred Moser
Director of Trino Community Leadership
Russell Spitzer
Principal Engineer
Ajit Panda
Head of Data Security & Governance
Will Morrison
Senior Director of Specialty Services
Shuguang Xiang
Lead Data Engineer
Shi Kai Ng
Lead Software Engineer
Andrew Gelinas
Co-Founder
Official Agenda | Open Source Data Summit 2024
Opening Remarks
Andrew Gelinas, Co-Founder @ Solution Monday
The new normal: Unbundling your data platform with an open data lakehouse
Vinoth Chandar, Founder & CEO @ Onehouse
As organizations face growing demands for supporting diverse use cases for their data, using a tightly coupled data warehouse as the primary data store is becoming increasingly impractical. Over the past year, major data vendors and cloud providers have aligned towards an alternative quietly adopted by many forward-thinking data organizations - an unbundled data platform rooted in an open data lakehouse.
In this opening keynote, the founder of Onehouse and originator of the data lakehouse architecture, Vinoth Chandar, lays out a blueprint for building your next data platform on top of this emerging data architecture that decouples and democratizes data across different data systems while minimizing vendor lock-in and monolithic data infrastructure. This approach enables a modular architecture where each component is chosen based on specific use cases and requirements in a composable fashion, fostering innovation and efficiency across the data ecosystem.
Join this session to learn about:
- Why this tectonic shift is happening now.
- Tangible benefits of embracing unbundling.
- Choosing the right data storage, pipelines, query engines, and processing frameworks.
- Success stories from leading organizations.
- Ongoing work in the industry towards this end state.
The rise of open source data catalogs featuring: Unity Catalog, DataHub, Apache Gravitino, and Apache Polaris (Incubating)
Denny Lee, Sr. Staff Developer Advocate @ Databricks
Lisa N. Cao, Apache Gravitino Product Manager @ Datastrato
Shirshanka Das, Co-Founder & CTO @ Acryl Data
Kyle Weller, Head of Product @ Onehouse
Russell Spitzer, Principal Engineer @ Snowflake
In the last 5 years, the data landscape has rapidly evolved toward a preference for independent, neutral storage decoupled from compute layers, databases, warehouses, etc. While this architectural pattern, often called a data lakehouse, provides excellent freedom for your data, the tradeoff is that it naturally lacks data governance.
Data catalogs on data lakes are rapidly becoming a focus of attention and investment for many organizations and communities. Within a large ecosystem of options, a few key open source data catalogs have recently risen in popularity: Unity Catalog, DataHub, Apache Gravitino, and Apache Polaris. Come join this panel discussion, which features experts and community leaders from each of these prominent open source communities. We will discuss why open source-driven data governance is important and dive into what each of these catalogs is doing for the industry.
Building a modern data lake to optimize digital offers for banking partners
Shravana Krishnamurthy, Director of Engineering @ Cardlytics
Cardlytics empowers advertisers with industry-leading purchase insights, enabling them to launch and optimize digital offers. By leveraging extensive purchase data from over 200 million bank customers, we identify opportunities, target real individuals within their banking environments, and measure the actual sales impact of our ads. Partnering with financial institutions, we run rewards programs that drive customer loyalty and deepen bank relationships. With a data scale covering $3.5 trillion in spend and 1 in 2 U.S. transactions, Cardlytics provides unmatched precision in Return on Ad Spend (ROAS) metrics, helping brands drive incremental sales and grow market share.
In this talk, Shravana Krishnamurthy will share insights on building a modern data lake architecture at Cardlytics using Hudi, Airflow, Spark, Lake Formation, Athena, and EMR. The discussion will cover key learnings on Hudi concepts, including indexing strategies, file sizing, and the development of streaming pipelines that ensure efficient data processing. Additionally, Shravana will highlight the use of Superset for data quality and monitoring.
Say goodbye to the Lambda architecture
David Regalado, Founder @ Data Engineering Latam
In the Lambda Architecture, an immutable sequence of records is captured and fed into a batch system and a stream processing system in parallel. You implement your transformation logic twice, once in the batch system and once in the stream processing system. You stitch together the results from both systems at query time to produce a complete answer.
The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is painful.
Apache Beam aims to solve this problem and David will share how that's working.
Unleashing Hudi 1.0: Re-inventing the data lakehouse wheel we created
Balaji Varadarajan, Senior Staff Software Engineer @ Applied Intuition
Y Ethan Guo, Software Engineer @ Onehouse
Join us at the Open Source Data Summit for an in-depth exploration of Hudi 1.0, a groundbreaking release set to redefine what a data lakehouse can do. This talk will delve into Hudi’s cutting-edge features designed to optimize storage for existing structured data while paving the way for unstructured data and blobs in an increasingly AI-driven world. Discover how Hudi’s novel concurrency control mechanisms eliminate the need for blocking writers, enabling seamless, high-throughput writes and updates. Additionally, we will unveil Hudi’s powerful new indexing subsystem, showcasing its unique secondary and vector indexing capabilities. These advancements empower users with unprecedented query performance and flexibility in managing their data lakehouses.
Whether you’re a data engineer, architect, or enthusiast, this session will provide valuable insights into leveraging Hudi 1.0 to achieve superior efficiency and scalability in your data operations. Don’t miss this opportunity to learn from the experts and stay ahead in the rapidly evolving data landscape.
Trino Gateway: Because one Trino cluster is not enough
Manfred Moser, Director of Trino Community Leadership @ Starburst
Will Morrison, Senior Director of Specialty Systems @ Starburst
Mature organizations using Trino often end up running more than one cluster. These clusters often serve different departments, datasets, and data locations, handle batch vs. analytics use cases, or simply separate production from test. Trino Gateway is the tool used to simplify access to all of these clusters.
Manfred and Will’s session dives into the power of the Trino Gateway. It acts as a load balancer, proxy server, and configurable routing gateway for multiple Trino clusters, and many larger deployments across the Trino community rely on it in production. Learn more about how the Trino community uses Trino Gateway to achieve workload distribution, automatic query routing, and more.
Batch vs. stream data processing: Navigating the best of both worlds
Sivanagaraju Gadiparthi, Lead Data & Analytics @ ADP
Sivanagaraju joins us to walk through a comprehensive overview of the evolving paradigms in data processing, focusing on batch and stream processing.
His talk outlines their historical development, key features, and common use cases, with examples of relevant tools and frameworks. A comparative analysis highlights the strengths and limitations of each approach, emphasizing their impact on infrastructure and cost.
Sivanagaraju's talk will also introduce hybrid approaches like the Lambda and Kappa Architectures, showcasing their practical applications. His talk will conclude with best practices, considerations for implementation, and a look at future trends, offering valuable insights for organizations optimizing their data processing strategies.
Apache XTable, cross table interoperability with Delta, Iceberg, and Hudi
Ashvin Agrawal, Research Software Engineer @ Microsoft
Dipankar Mazumdar, Staff Data Engineer Advocate @ Onehouse
Apache XTable is a new open source project incubating at the Apache Software Foundation that unlocks omni-directional interoperability between the popular lakehouse projects Delta Lake, Apache Iceberg, and Apache Hudi. With a budding community over the past year, XTable has proven to be a bridge of unification for the data lakehouse industry.
In this session we will highlight what is new in the project and the community. We will review the technical details of how the metadata translation works and showcase real-world examples of how users have adopted the project. Come see a live demo of how to use XTable with a variety of open source query engines including Spark, Presto, Trino, Flink, and more.
Achieving near real-time compliance at scale: How Uber manages sensitive data across an exabyte of global systems
Ajit Panda, Head of Data Security & Governance @ Uber
Uber has an exabyte of data across various systems, such as its data lake, datastores backing critical apps, and unstructured data spread across various drives. At the same time, Uber operates across the globe and needs to ensure it complies with regulations as mandated by the countries and states it operates in.
In the last few years, Uber has come a long way: from being reactive, spending a huge amount of time and resources on rewrites and risking reliability issues, to being compliant in near real-time. This is a huge turnaround in how compliance is enforced in the industry.
In this talk, Ajit will focus on how Uber classifies data to identify various types of sensitive information, the tools they have built over time to enforce compliance, and fine-grained monitoring.
Racing through big data: A comparative analysis of Apache Spark, Hadoop, and Flink in batch processing
Lakshmana Yenduri, Sr. Staff Software Engineer @ Visa
In the era of Big Data, efficient data processing architectures are crucial for the timely analysis of vast datasets to extract valuable insights. Apache Hadoop (AH), Apache Spark (AS), and Apache Flink (AF) are prominent contenders in large-scale data processing.
Lakshmana's talk focuses on batch processing to evaluate performance, aiming to shed light on execution time with large datasets. Through experiments ranging from 1 GB to 5 GB, AS emerged as the frontrunner, showing significant performance advantages over AF and AH. Despite the absence of parallelism, Spark maintained its lead, indicating its potential for scalable batch processing. Further research is needed to explore Spark's performance in distributed environments.
Overall, this talk underscores Spark's significance in batch processing large datasets, contributing to our understanding of Big Data processing and informing data analytics workflows.
Building together: How user communities drive open source data projects
Audra Montenegro, Community Program Manager @ CNCF
Bhavani Sudha Saktheeswaran, Software Engineer @ Onehouse
Priyanka Naik, Principal Software Engineer @ Palo Alto Networks
Neha Pawar, Head of Data Platform @ Startree
Our panel will explore how open-source communities drive project success, with Audra, Neha, Sudha, and Priyanka discussing ways to activate, maintain, and measure community engagement. They will share strategies for leveraging community resources, ensuring security in open-source solutions, and empowering organizational contributors. Using examples from projects like Apache Hudi and Apache Pinot and the perspective of those evaluating projects, the panel will highlight practical approaches for building and sustaining thriving open-source ecosystems.
Enhancing interoperability of open table formats with Apache XTable
Stephen Said, Senior Solutions Architect @ AWS
Matthias Rudolph, Associate Solutions Architect @ AWS
XTable is an incubating Apache project for conversion between open table formats (OTFs), which improves the interoperability of analytical data. For instance, XTable converts Delta Lake to Iceberg without data duplication.
In this session, Stephen and Matthias will give an introduction to XTable and demonstrate it in practice. They'll present OTF conversion with XTable in a data pipeline on Apache Airflow and how to run XTable in a background conversion mechanism.
Mixed model arts - The convergence of data modeling across apps, analytics, and AI
Joe Reis, Author, Fundamentals of Data Engineering & CEO @ Ternary Data
For decades, data modeling has been fragmented by use cases: applications, analytics, and machine learning/AI. This leads to data siloing and “throwing data over the wall.”
With AI, streaming data, and “shifting left” changing data modeling, these siloed approaches are insufficient for the diverse world of data use cases. Today's practitioners must possess an end-to-end understanding of the myriad techniques for modeling data throughout the data lifecycle. This presentation covers “mixed model arts,” which advocates converging various data modeling methods and innovating new ones.
APIs and community in the composable data ecosystem
Tim Meehan, Software Engineer @ IBM
Tim's session explores the evolution and challenges of composable data systems, emphasizing the disaggregation of traditional data warehouses into specialized components such as storage, compute, ingestion, and query processing.
This shift toward composability, driven by cost, reliability, and performance, has led to new challenges in integration, particularly around APIs, protocols, and community engagement.
Key insights include best practices for fostering adoption, such as using HTTP, adopting modern languages, and maintaining clear governance. The discussion also covers lessons learned from Presto's evolution and speculates on future disaggregation trends, offering a roadmap for navigating this complex landscape.
Effective data platforming with open source tools for faster insights
Priyanka Naik, Principal Software Engineer @ Palo Alto Networks
Priyanka joins OSDS 2024 to discuss:
- The architecture of the data streaming platform we built at Palo Alto Networks on K8s, using open source tools like Strimzi, Kafka, and Kafka Connect along with Confluent Community licensed tools like Schema Registry and ksqlDB, to support corporate risk intelligence, health, and compliance
- Application of core software engineering principles in architecting open source data platforms and its benefits
- The general problem of corporate risk intelligence and compliance reporting in Infosec organizations and the benefits of solving it
- Some drawbacks that were identified in the data platform solutions and how we overcame those
Optimizing data lake infrastructure for sub-second query latency
Emil Emilov, Principal Software Engineer @ Conductor
Emil will share his journey of building and optimizing a data lake infrastructure using various open-source projects and a cloud-native data platform for high-performance user-facing analytics.
The talk will include real-world challenges and solutions around partitioning, custom bucketing, and optimizing query engines to handle massive datasets while achieving sub-second query latency. Attendees will gain insights into the nuances of data skipping and pruning, and best practices for data modeling to avoid large scans.
This session is ideal for developers and data engineers tackling the complexity of scaling analytics on large data lakes, particularly those focused on delivering high-performance, user-facing applications.
Open data analytics platforms on Kubernetes
Denis Krivenko, Senior Data Engineer @ Coody
Today cloud service providers offer easy access to enterprise-grade data platforms for users at any scale, from individuals to multi-million corporations. However,
- What if a public cloud cannot be used?
- What if the solution should be cloud agnostic?
- What if Hadoop or a data warehouse is not considered a solution?
In today’s cloud age, there has been an increased reliance on Kubernetes, which allows its adopters to deploy, scale, and manage incredibly complex solutions in a few minutes.
In this session, Denis will talk about how to leverage the power of Kubernetes to build a cloud native data platform.
He will share the case and the reasons for creating an analytics data platform, walk through its architecture principles, and dive deep into how to build the solution based on open source projects and adopt a GitOps approach.
Enabling near real-time data analytics on the data lake
Shuguang Xiang, Lead Data Engineer @ Grab
Shi Kai Ng, Lead Software Engineer @ Grab
This session explores the challenges faced in achieving near real-time data analytics using conventional Change Data Capture (CDC) with Hive tables and their transition to a more efficient approach using Flink CDC integrated with Hudi. The discussion will focus on improving data freshness, enabling self-serve data ingestion, and overcoming bottlenecks in the traditional pipeline. Key use cases from Grab Taxi will illustrate the significant improvements in operational efficiency, data availability, and decision-making enabled by this new architecture.
Where & when?
Open Source Data Summit 2024 was held on October 2nd, 2024.
What is the cost of access to the live virtual sessions?
OSDS is always free and open for all to attend.
What is Open Source Data Summit?
OSDS is a peer-to-peer gathering of data industry professionals, experts, and enthusiasts to explore the dynamic landscape of open source data tools and storage.
The central theme of OSDS revolves around the advantages of open source data products and their pivotal role in modern data ecosystems.
OSDS is the annual peer hub for knowledge exchange that fosters a deeper understanding of open source options and their role in shaping the data-driven future.
Who attends OSDS?
OSDS is attended by data engineers, data architects, developers, DevOps practitioners and managers, and data leadership.
Anyone who is looking for enriched perspectives on open source data tools and practical insights to navigate the evolving data landscape should attend this event.
On October 2nd, 2024 we convened for discussions about:
- Benefits of open source data tools
- Cost/performance trade-offs
- Building data storage solutions
- Challenges surrounding open source data tool integration
- Solutions for the cost of storing, accessing, and managing data
- Data streams and ingestion
- Hub-and-spoke data integration models
- Choosing the right engine for your workload
Are you interested in speaking or sponsoring the next Open Source Data Summit?
Submit a talk proposal here or reach out to astronaut@solutionmonday.com.
Register for on-demand access to the OSDS 2024 sessions and announcements about OSDS 2025