
Explore the open source data landscape

October 2, 2024 | Live Virtual Conference

Announced Open Source Data Summit 2024 Speakers


Sivanagaraju Gadiparthi

Lead - Data & Analytics


Shravana Krishnamurthy

Director of Engineering


Vinoth Chandar

Founder & CEO


Joe Reis

CEO


Lakshmana Yenduri

Sr. Staff Software Engineer


Denis Krivenko

Data Engineering Consultant


David Regalado

Founder


Y Ethan Guo

Software Engineer


Tim Meehan

Software Engineer


Bhavani Sudha Saktheeswaran

Software Engineer


Balaji Varadarajan

Engineer


Stephen Said

Senior Solutions Architect


Matthias Rudolph

Associate Solutions Architect


Ashvin Agrawal

Research Software Engineer


Priyanka Naik

Principal Software Engineer


Dipankar Mazumdar

Staff Data Engineer Advocate


Audra Montenegro

Community Program Manager


Emil Emilov

Principal Software Engineer


Announced Sessions | Open Source Data Summit 2024

Morning Keynote
The new normal: Unbundling your data platform with an open data lakehouse

Vinoth Chandar, Founder & CEO @ Onehouse

As organizations face growing demands for supporting diverse use cases for their data, using a tightly coupled data warehouse as the primary data store is becoming increasingly impractical. Over the past year, major data vendors and cloud providers have aligned towards an alternative quietly adopted by many forward-thinking data organizations - an unbundled data platform rooted in an open data lakehouse.

In this opening keynote, the founder of Onehouse and originator of the data lakehouse architecture, Vinoth Chandar, lays out a blueprint for building your next data platform on top of this emerging data architecture that decouples and democratizes data across different data systems while minimizing vendor lock-in and monolithic data infrastructure. This approach enables a modular architecture where each component is chosen based on specific use cases and requirements in a composable fashion, fostering innovation and efficiency across the data ecosystem.

Join this session to learn about:

  • Why this tectonic shift is happening now.
  • Tangible benefits of embracing unbundling.
  • Choosing the right data storage, pipelines, query engines, and processing frameworks.
  • Success stories from leading organizations.
  • Ongoing work in the industry towards this end state.

Session
Open data analytics platforms on Kubernetes

Denis Krivenko, Senior Data Engineer @ Platform24

Today, cloud service providers offer easy access to enterprise-grade data platforms for users at any scale, from individuals to multimillion-dollar corporations. However,

  • What if a public cloud cannot be used?
  • What if the solution should be cloud agnostic?
  • What if neither Hadoop nor a data warehouse is considered a viable solution?

In today’s cloud age, there has been an increased reliance on Kubernetes, which allows its adopters to deploy, scale, and manage incredibly complex solutions in a few minutes.

In this session, Denis will talk about how to leverage the power of Kubernetes to build a cloud native data platform.

He will share the use case and the reasons for creating an analytics data platform, walk through its architecture principles, and dive deep into building the solution on open source projects with a GitOps approach.

Session
Batch vs. Stream Data Processing: Navigating the best of both worlds

Sivanagaraju Gadiparthi, Lead Data & Analytics @ ADP

Sivanagaraju joins us to walk through a comprehensive overview of the evolving paradigms in data processing, focusing on batch and stream processing.

His talk outlines their historical development, key features, and common use cases, with examples of relevant tools and frameworks. A comparative analysis highlights the strengths and limitations of each approach, emphasizing their impact on infrastructure and cost.

Sivanagaraju's talk will also introduce hybrid approaches like the Lambda and Kappa Architectures, showcasing their practical applications. His talk will conclude with best practices, considerations for implementation, and a look at future trends, offering valuable insights for organizations optimizing their data processing strategies.

Session
APIs and community in the composable data ecosystem

Tim Meehan, Software Engineer @ IBM

Tim's session explores the evolution and challenges of composable data systems, emphasizing the disaggregation of traditional data warehouses into specialized components such as storage, compute, ingestion, and query processing.

This shift toward composability, driven by cost, reliability, and performance, has led to new challenges in integration, particularly around APIs, protocols, and community engagement.

Key insights include best practices for fostering adoption, such as using HTTP, adopting modern languages, and maintaining clear governance. The discussion also covers lessons learned from Presto's evolution and speculates on future disaggregation trends, offering a roadmap for navigating this complex landscape.

Session
Racing through big data: A comparative analysis of Apache Spark, Hadoop, and Flink in batch processing

Lakshmana Yenduri, Sr. Staff Software Engineer @ Visa

In the era of Big Data, efficient data processing architectures are crucial for the timely analysis of vast datasets to extract valuable insights. Apache Hadoop (AH), Apache Spark (AS), and Apache Flink (AF) are prominent contenders in large-scale data processing.

Lakshmana's talk focuses on batch processing and evaluates execution time on large datasets. In experiments on datasets ranging from 1 GB to 5 GB, AS emerged as the frontrunner, showing significant performance advantages over AF and AH. Despite the absence of parallelism, Spark maintained its lead, indicating its potential for scalable batch processing. Further research is needed to explore Spark's performance in distributed environments.

Overall, this talk underscores Spark's significance in batch processing large datasets, contributing to our understanding of Big Data processing and informing data analytics workflows.

Session
Apache XTable, cross table interoperability with Delta, Iceberg and Hudi

Ashvin Agrawal, Research Software Engineer @ Microsoft

Dipankar Mazumdar, Staff Data Engineer Advocate @ Onehouse

Apache XTable is a new open source project incubating in the Apache Software Foundation that unlocks omni-directional interoperability between the popular lakehouse projects Delta Lake, Apache Iceberg, and Apache Hudi. With a budding community over the past year, XTable has proven to be a bridge of unification for the data lakehouse industry.

In this session we will highlight what is new in the project and the community. We will review the technical details of how the metadata translation works and showcase real-world examples of how users have adopted the project. Come see a live demo of how to use XTable with a variety of open source query engines including Spark, Presto, Trino, Flink, and more.

Panel Discussion
Community support with open source tools

Audra Montenegro, Community Program Manager @ CNCF

Bhavani Sudha Saktheeswaran, Software Engineer @ Onehouse

Priyanka Naik, Principal Software Engineer @ Palo Alto Networks

Session
Mastering data efficiency at Cardlytics with a precision focused data architecture

Shravana Krishnamurthy, Director of Engineering @ Cardlytics

Cardlytics empowers advertisers with industry-leading purchase insights, enabling them to launch and optimize digital offers. By leveraging extensive purchase data from over 200 million bank customers, we identify opportunities, target real individuals within their banking environments, and measure the actual sales impact of our ads. Partnering with financial institutions, we run rewards programs that drive customer loyalty and deepen bank relationships. With a data scale covering $3.5 trillion in spend and 1 in 2 U.S. transactions, Cardlytics provides unmatched precision in Return on Ad Spend (ROAS) metrics, helping brands drive incremental sales and grow market share.

In this talk, Shravana Krishnamurthy will share insights on building a modern data lake architecture at Cardlytics using Hudi, Airflow, Spark, Lake Formation, Athena, and EMR. The discussion will cover key learnings on Hudi concepts, including indexing strategies, file sizing, and the development of streaming pipelines that ensure efficient data processing. Additionally, Shravana will highlight the use of Superset for data quality and monitoring.

Session
Enhancing interoperability of open table formats with Apache XTable

Stephen Said, Senior Solutions Architect @ AWS

Matthias Rudolph, Associate Solutions Architect @ AWS

XTable is an incubating Apache project for conversion between open table formats (OTF) which improves the interoperability of analytical data. For instance, XTable converts Delta Lake to Iceberg without data duplication.

In this session, Stephen and Matthias will give an introduction to XTable and demonstrate it in practice. They'll present OTF conversion with XTable in a data pipeline on Apache Airflow and how to run XTable in a background conversion mechanism.

Session
Mixed model arts - The convergence of data modeling across apps, analytics, and AI

Joe Reis, Author, Fundamentals of Data Engineering & CEO @ Ternary Data

For decades, data modeling has been fragmented by use cases: applications, analytics, and machine learning/AI. This leads to data siloing and “throwing data over the wall.”

As the emergence of AI, streaming data, and “shifting left” change data modeling, these siloed approaches are insufficient for the diverse world of data use cases. Today's practitioners must possess an end-to-end understanding of the myriad techniques for modeling data throughout the data lifecycle. This presentation covers “mixed model arts,” which advocates converging various data modeling methods and innovating new ones.

Session
Optimizing data lake infrastructure for sub-second query latency

Emil Emilov, Principal Software Engineer @ Conductor

Emil will share his journey of building and optimizing a data lake infrastructure using various open-source projects and a cloud-native data platform for high-performance user-facing analytics.

The talk will include real-world challenges and solutions around partitioning, custom bucketing, and optimizing query engines to handle massive datasets while achieving sub-second query latency. Attendees will gain insights into the nuances of data skipping and pruning, and best practices for data modeling to avoid large scans.

This session is ideal for developers and data engineers tackling the complexity of scaling analytics on large data lakes, particularly those focused on delivering high-performance, user-facing applications.

Session
From data chaos to compliance clarity: A scalable data streaming solution

Priyanka Naik, Principal Software Engineer @ Palo Alto Networks

Priyanka joins OSDS 2024 to discuss:

  • The architecture of the data streaming platform built at Palo Alto Networks on K8s, using open source tools like Strimzi, Kafka, and Kafka Connect, along with Confluent Community licensed tools like Schema Registry and ksqlDB, to support corporate risk intelligence, health, and compliance.
  • Application of core software engineering principles in architecting open source data platforms and its benefits
  • The general problem of Corporate Risk intelligence and compliance reporting in Infosec organizations and the benefits of solving it
  • Some drawbacks that were identified in the data platform solutions and how we overcame those

Session
Unleashing Hudi 1.0: Re-inventing the data lakehouse wheel we created

Balaji Varadarajan, Senior Staff Software Engineer @ Onehouse

Y Ethan Guo, Software Engineer @ Onehouse

Join us at the Open Source Data Summit for an in-depth exploration of Hudi 1.0, a groundbreaking release set to redefine what a data lakehouse can do. This talk will delve into Hudi’s cutting-edge features designed to optimize storage for existing structured data while paving the way for unstructured data and blobs in an increasingly AI-driven world. Discover how Hudi’s novel concurrency control mechanisms eliminate the need for blocking writers, enabling seamless, high-throughput writes and updates. Additionally, we will unveil Hudi’s powerful new indexing subsystem, showcasing its unique secondary and vector indexing capabilities. These advancements empower users with unprecedented query performance and flexibility in managing their data lakehouses.

Whether you’re a data engineer, architect, or enthusiast, this session will provide valuable insights into leveraging Hudi 1.0 to achieve superior efficiency and scalability in your data operations. Don’t miss this opportunity to learn from the experts and stay ahead in the rapidly evolving data landscape.

Session
Say goodbye to the Lambda architecture

David Regalado, Founder @ Data Engineering Latam

In the Lambda Architecture, an immutable sequence of records is captured and fed into a batch system and a stream processing system in parallel. You implement your transformation logic twice, once in the batch system and once in the stream processing system. You stitch together the results from both systems at query time to produce a complete answer.

The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is painful.

Apache Beam aims to solve this problem, and David will share how that's working.
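
To make the duplication described in this abstract concrete, here is a minimal sketch in plain Python. It is not Apache Beam code and is not taken from David's talk. The record shape, the function and class names, and the sample data are all hypothetical, chosen only to show the same aggregation written once for a batch layer and again for a speed layer, then merged at query time.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical record shape: (user_id, amount).
Event = Tuple[str, int]


def batch_totals(events: Iterable[Event]) -> Counter:
    """Batch layer: recompute per-user totals over the full historical log."""
    totals: Counter = Counter()
    for user, amount in events:
        totals[user] += amount
    return totals


class SpeedLayer:
    """Speed layer: the same summation, re-implemented one event at a time."""

    def __init__(self) -> None:
        self.totals: Counter = Counter()

    def on_event(self, event: Event) -> None:
        user, amount = event
        self.totals[user] += amount


def query(batch_view: Counter, speed_view: Counter, user: str) -> int:
    """Serving layer: stitch the two views together at query time."""
    return batch_view[user] + speed_view[user]


# Records already covered by the last batch run (illustrative data).
historical = [("ana", 10), ("ben", 5), ("ana", 7)]
# Records that arrived after that batch run (illustrative data).
recent = [("ana", 3), ("ben", 1)]

batch_view = batch_totals(historical)
speed = SpeedLayer()
for event in recent:
    speed.on_event(event)

ana_total = query(batch_view, speed.totals, "ana")
ben_total = query(batch_view, speed.totals, "ben")
print(ana_total, ben_total)
```

The summation rule lives in both `batch_totals` and `SpeedLayer.on_event`, so any change to the business logic has to be made, tested, and deployed in two places. That is the maintenance burden the abstract describes.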

Where & when?

Open Source Data Summit 2024 will be held on October 2nd, 2024.

What is the cost of access to the live virtual sessions?

OSDS is always free and open for all to attend.

What is Open Source Data Summit?

OSDS is a peer-to-peer gathering of data industry professionals, experts, and enthusiasts to explore the dynamic landscape of open source data tools and storage.

The central theme of OSDS revolves around the advantages of open source data products and their pivotal role in modern data ecosystems.

OSDS is the annual peer hub for knowledge exchange that fosters a deeper understanding of open source options and their role in shaping the data-driven future.

Who attends OSDS?

OSDS is attended by data engineers, data architects, developers, DevOps practitioners and managers, and data leadership.

Anyone who is looking for enriched perspectives on open source data tools and practical insights to navigate the evolving data landscape should attend this event.

Join again on October 2nd, 2024 for discussions around:
  • Benefits of open source data tools
  • Cost/performance trade-offs
  • Building data storage solutions
  • Challenges surrounding open source data tool integration
  • Solutions for the cost of storing, accessing, and managing data
  • Data streams and ingestion
  • Hub-and-spoke data integration models
  • Choosing the right engine for your workload

Interested in speaking or sponsoring Open Source Data Summit 2024?

Submit a talk proposal here or reach out to astronaut@solutionmonday.com.

Don't miss out on important updates! Register for access to Open Source Data Summit 2024

Register for OSDS 2024 Access

Thank you to our previous sponsors who made Open Source Data Summit possible