Explore the open source data landscape
November 15, 2023 | Live Virtual Conference
Open Source Data Summit 2023 Speakers
Jordan West
Staff Software Engineer
Eric Gonzalez
VP, Business Intelligence Architecture
Praveen Neppali Naga
VP of Engineering & Data Science
Ankur Ranjan
Data Engineer III
Kapil Surlaker
VP of Engineering
Raghu Ramakrishnan
CTO for Data
Vaishnavi Muraldihar
Data Engineer
Justin Borgman
Chairman & CEO
Prasad Pathak
Data Engineer
Jay Kreps
Co-Founder & CEO
Siddharth Jain
Senior Engineering Manager
Bhavani Sudha Saktheeswaran
Software Engineer
Girish Baliga
Director of Engineering
Nishith Agarwal
Head of Data & ML Platforms
Shirshanka Das
Co-founder & CTO
Balaji Varadarajan
Senior Staff Software Engineer
Shana Schipers
Principal Specialist SA, Analytics
Tun Shwe
VP of Data
Jay Clifford
Developer Advocate
Ayush Bijawat
Senior Data Engineer
Pritam Dey
Senior Software Engineer
Ashvin Agrawal
Senior Researcher
Tim Brown
Engineering
Michael Del Balso
CEO & Co-Founder
Sagar Sumit
Software Engineer
Ronak Shah
Head of Data & Product
Patrick McFadin
VP Developer Relations
Sarfaraz Hussain
Senior Data Engineer
Anoop Johnson
Senior Staff Software Engineer
James Greenhill
Data Engineer
Soumil Shah
Data Engineering Team Lead
Justin Levandoski
Director of Engineering
David Anderson
Software Practice Lead
Nadine Farah
Head of Developer Relations
Manfred Moser
Director of Technical Content
Watch All of the Open Source 2023 Sessions On-Demand
Keynote: Paradigm shift: Why open source should now be the bedrock of every data strategy
Vinoth Chandar, Founder & CEO @ Onehouse
Until recently, success with open data infrastructure has been a secret, limited to the most advanced data teams operating at scale. However, today's open source data ecosystem is stronger than ever, offering the freedom to either build on open source or use managed services built on open source.
In our opening keynote, Apache Hudi creator and originator of the data lakehouse architecture, Vinoth Chandar, will bring into focus the current and future state of the open source data ecosystem, which is now capable of serving diverse data needs spanning ingestion, storage, analytics, stream processing, AI/ML, and real-time analytics.
Join this keynote to hear Vinoth uncover:
- Build vs. buy for open source data
- The right tools for your data
- The best-of-breed engines for different use cases
- The significance of data interoperability
- Common vendor selection pitfalls
Session: Practicalities of deploying open source databases
Jordan West, Staff Software Engineer @ Netflix
Deploying open source databases often takes more than downloading the artifact and running it on your server. Control planes need to be built, internal platform tools (logging and metrics) need to be integrated, and it's not uncommon to fork the code for your own needs, raising questions such as how updates are managed and how changes are contributed back. This talk will look at how Netflix has deployed Apache Cassandra and how it manages every aspect of that deployment, from automation and monitoring to development.
Session: A petabyte-scale vector store for generative AI
Patrick McFadin, VP Developer Relations @ DataStax
This talk will focus on the work in the Apache Cassandra® project to develop a vector store capable of handling petabytes of data, discussing why this capacity is critical for future AI applications. I will also connect how this pertains to exciting new generative AI technologies like Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and Forward-Looking Active Retrieval Augmented Generation (FLARE), all of which contribute to the growing need for such scalable solutions. The needs of autonomous agents will drive the next wave of data infrastructure. Are you ready?
Key Takeaways:
- Understand the future of generative AI and why current laptop-scale models will soon be obsolete
- Apache Cassandra® and its role in creating a petabyte-scale vector store for AI applications
- Vector-powered AI technologies such as LLMs, RAG, and FLARE
- How AI agents can leverage such scalable solutions for better decision-making
- Importance of planning and managing future growth in AI applications and how to avoid painful migrations later
- Use cases with frameworks like LangChain, LlamaIndex, and CassIO
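To make the core idea concrete, here is a minimal, illustrative sketch of what a vector store does at its heart: rank stored embeddings by cosine similarity to a query vector. This is plain Python with toy three-dimensional vectors, not Cassandra's actual implementation, which handles this at petabyte scale with approximate-nearest-neighbor indexing.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query, store, k=2):
    # Rank stored (id, vector) pairs by similarity to the query.
    scored = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

store = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
}
results = top_k([1.0, 0.0, 0.0], store, k=2)
print(results[0][0])  # → doc-a
```

In a RAG pipeline, the top-k documents retrieved this way are fed into the LLM prompt as context; the session covers how frameworks like LangChain and CassIO wire this up against Cassandra.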
Session: Scaling and governing Robinhood's data lakehouse
Balaji Varadarajan, Senior Staff Software Engineer @ Robinhood
Pritam Dey, Technical Lead @ Robinhood
Robinhood Markets’ mission is to democratize finance for all. Continuous data analysis and data-driven decision-making are fundamental to achieving this. The data required for analysis comes from varied sources - OLTP databases, event streams, and various 3rd party sources. A reliable lakehouse with an interoperable data ecosystem and fast data ingestion service is needed to power various reporting and business-critical pipelines and dashboards. Being in the financial domain, effective data governance is crucial for regulatory compliance and ensuring that data is consistent and trustworthy.
In this talk, we will describe the evolution of the big data ecosystem in Robinhood not only in terms of the scale of data stored and queries made but also the use cases that it supports. We go in-depth into the lakehouse along with the data ingestion services we built using open source tools to reduce the data freshness latency for our core datasets from one day to under 15 minutes. Finally, we will also describe our approach to data governance and compliance and how we are leveraging open source components in implementing this.
Session: Diving into Uber's cutting-edge data infrastructure
Girish Baliga, Director of Engineering @ Uber
Join this talk with Uber's Girish Baliga to hear how Uber is leveraging open source technologies like Presto, Spark, Flink, Pinot, Hudi, and HDFS to tailor solutions that meet their unique business requirements. Learn how embracing open source empowers Uber to achieve scalability, superior performance, and multi-cloud compatibility, paving the way for future-proofing while fostering collaboration with industry partners in both consumer and enterprise sectors.
Session: Incremental data processing: A path to streamlined data engineering pipelines
Prasad Pathak, Data Engineer @ Tesla
In this session, we invite you to explore the paradigm shift brought about by incremental data processing. While the allure of complete data processing remains, incremental processing offers a compelling alternative. We will delve into the core principles of incremental data processing, illuminating its benefits, including accelerated data processing, reduced operational costs, and enhanced data quality.
Along this journey, we'll navigate the challenges inherent to incremental processing and unveil best practices to mitigate them. Through a detailed example pipeline design, we will illustrate how incremental data processing can be seamlessly integrated into your existing data engineering architecture, empowering organizations to stay competitive in an era driven by data. Join Prasad to unlock the transformative potential of incremental data processing and revolutionize your data engineering pipelines.
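The core pattern the session describes can be sketched in a few lines: instead of rescanning the full dataset on every run, an incremental job keeps a checkpoint (here, the last processed event time) and processes only records that arrived after it. This is an illustrative toy, not Tesla's actual pipeline; the checkpoint field and transformation are stand-ins.

```python
def run_incremental(records, checkpoint):
    """Process only records newer than the checkpoint; return results and the new checkpoint."""
    new = [r for r in records if r["event_time"] > checkpoint]
    processed = [r["value"] * 2 for r in new]  # stand-in transformation
    new_checkpoint = max((r["event_time"] for r in new), default=checkpoint)
    return processed, new_checkpoint

records = [
    {"event_time": 1, "value": 10},
    {"event_time": 2, "value": 20},
    {"event_time": 3, "value": 30},
]

# First run: everything is new, so everything is processed.
out1, ckpt = run_incremental(records, checkpoint=0)

# Second run: one newly arrived record; only it is processed.
records.append({"event_time": 4, "value": 40})
out2, ckpt = run_incremental(records, checkpoint=ckpt)
print(out2)  # → [80]
```

The cost savings follow directly: the second run touches one record instead of four, and the gap widens as the table grows.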
Session: Building an open source control plane for data
Shirshanka Das, Co-Founder & CTO @ Acryl
Today's data landscape, spanning operational, streaming, lakehouse, and warehouse systems, grapples with the complexities of decentralization. Do you seek new approaches, tools, and solutions to address challenges in data discovery, management, governance, quality, and observability? This session will cover:
- The control plane for data and its role in the decentralized data stack
- Using the open-source DataHub Project to implement the control plane for data
- How data contracts help you define and enforce decentralized governance
- How this helps reduce operational overhead and keep cloud costs down
- Why the industry is gravitating towards these ideas—and why the DataHub Project is key
Session: Enabling Walmart's data lakehouse with Apache Hudi
Ankur Ranjan, Data Engineer III @ Walmart
Ayush Bijawat, Senior Data Engineer @ Walmart
Ankur and Ayush are diving into the strategic shift from data lake to lakehouse architecture at Walmart, with a focus on the importance played by the open table format Apache Hudi. Attendees of this talk will gain a comprehensive understanding of the Hudi architecture, including insights into its underlying file structure, file groups, and the critical timeline feature.
Their session explores the crucial choice between Copy-on-Write (CoW) and Merge-on-Read (MoR) tables, highlighting why CoW tables were chosen for specific use cases. Moreover, the discussion will showcase the various optimization benefits unlocked by Apache Hudi, such as file resizing, enhanced UPSERT and DELETE support, schema evolution, and partial updates.
The session also offers a practical perspective on enabling Apache Hudi within the Google BigQuery ecosystem, making it a must-attend event for those seeking to leverage cutting-edge data management solutions.
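For readers new to the CoW/MoR choice the session discusses, the table type is a single write-time option in Hudi's Spark datasource configuration. The sketch below shows illustrative options for a Copy-on-Write table; the table and field names are hypothetical, not Walmart's schema, and the option keys follow the Apache Hudi Spark datasource docs.

```python
# Hypothetical Hudi write options for a Copy-on-Write table.
# Table/field names are illustrative; option keys follow the
# Apache Hudi Spark datasource configuration.
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",  # vs. MERGE_ON_READ
    "hoodie.datasource.write.operation": "upsert",
}

# In a Spark job these options would be applied roughly as:
#   df.write.format("hudi").options(**hudi_options).mode("append").save(path)
print(hudi_options["hoodie.datasource.write.table.type"])  # → COPY_ON_WRITE
```

CoW rewrites affected data files on each write, favoring read-heavy workloads; MoR appends row-based delta logs and compacts later, favoring write-heavy ones.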
Session: Data plumbing basics: Build, deploy, and scale ML models for your time series data
Tun Shwe, VP of Data @ Quix
Jay Clifford, Developer Advocate @ InfluxData
“Collect”, “Store” and “Act” are the three key pillars to building any time series-based solution. While acting upon your data holds paramount importance, it simultaneously presents a puzzle of questions:
- How do I query, transform, and process my stored time series data?
- How can I build and run anomaly detection or forecasting algorithms on my time series data?
- How can I efficiently scale and expand my time series analytics engine?
In this talk, we will explore the methodology for crafting a scalable time series data pipeline, leveraging the event streaming platform, Quix, and the time series database, InfluxDB. We will also walk through the process of creating, training, and deploying a machine-learning model, utilizing the power and flexibility of Keras and Hugging Face for anomaly detection.
Key Takeaways:
- Collecting and storing time series data: Grasp the nuances of managing time series data with InfluxDB
- Model training and storage: Dive into model creation and training using Keras and Hugging Face
- Pipeline construction and deployment: Explore building and deploying a resilient time series data pipeline with Quix
Join this session to learn how to build a foundational but scalable architecture that you can plumb into your own time series solutions.
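As a taste of the "Act" pillar, here is a deliberately simple anomaly detector: flag points that deviate from the rolling mean by more than a few standard deviations. It is a stand-in for the Keras/Hugging Face model the speakers build, using only the Python standard library; the window size and threshold are arbitrary illustrative values.

```python
import statistics

def detect_anomalies(series, window=5, threshold=3.0):
    # Flag indices whose value sits more than `threshold` standard
    # deviations away from the mean of the preceding `window` points.
    anomalies = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        mean = statistics.mean(recent)
        stdev = statistics.pstdev(recent)
        if stdev > 0 and abs(series[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies

# Steady sensor signal with one spike at index 8.
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 9.8, 10.0, 42.0, 10.1]
print(detect_anomalies(readings))  # → [8]
```

In the architecture the session presents, a transformation like this would run continuously against the event stream, with InfluxDB as the store and Quix hosting the deployed model.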
Session: Making decisions that are right for your data platform
Nishith Agarwal, Head of Data & ML Platforms @ Lyra Health
Today, with the multitude of cloud, vendor, and managed solutions, we often find ourselves at the crossroads of whether we should buy or build solutions. In this talk, I’ll walk through my experience building data platforms for the teams at Walmart Labs, Uber, and Lyra Health. We'll discuss key considerations, strategies, and capabilities of tooling and infrastructure required to set up a world-class data platform and provide a framework to help determine which solutions are good to build in-house and which ones are worth shopping around for.
Panel: The growing role of open source technology in today's data architectures
Vinoth Chandar, Founder & CEO @ Onehouse
Raghu Ramakrishnan, CTO @ Microsoft
Jay Kreps, Co-Founder & CEO @ Confluent
Kapil Surlaker, VP of Engineering @ LinkedIn
Justin Borgman, CEO @ Starburst
Praveen Neppali Naga, VP of Engineering & Data Science @ Uber
Justin Levandoski, Director of Engineering @ Google
The panel will discuss their storied experiences, success with open source data, user needs that are driving the shift to open data infrastructure, the importance of open democratized data access, how to think about build-vs-buy, and what the future holds for the industry at large. We invite you to attend and tap into their expertise to understand more deeply the indispensable role that open source plays in today's data landscape.
The panelists on this lively panel bring experience across:
- Building data infrastructure services on one of the largest public cloud providers
- Creating industry-leading open stream processing and real-time analytics systems
- Funding over a dozen major open source projects in the industry
- Bringing to life a new world of open data lakes and open query engines
- Solving some of the hardest problems in data today at planet scale
Session: Our emergency move from Postgres to OLAP
James Greenhill, Data Engineer @ PostHog
If you are interested in OLAP data stores, decoupling apps, and schemas, this talk is for you. PostHog recently found product-market fit and made an emergency migration to an OLAP solution that fit its bill of requirements. James joins us to transparently share PostHog's learnings: what went well and how the team protected themselves from themselves. He'll also share some thoughts on the mistakes (and friends) that were made along the way.
Session: Choosing an open table format for your transactional data lake
Shana Schipers, Principal Specialist SA, Analytics @ AWS
Join Shana for an explorative talk on the burgeoning popularity of transactional data lakes within modern data platforms. These innovative solutions, such as those supported by AWS, have unlocked a realm of possibilities, making previously challenging tasks like Change Data Capture (CDC) and GDPR compliance seamlessly achievable. By embracing open-table formats like Apache Hudi, Apache Iceberg, and Delta Lake, organizations can combine the power of transactional capabilities with the flexibility of Amazon S3 data lakes, all at scale. However, with numerous options available, there's no "one size fits all" solution.
This session aims to be your compass in this journey, offering guidance on selecting the right table format for your specific use case. We'll explore essential criteria and attention points to accelerate your decision-making process and ensure the success of your data lake endeavors.
Session: Maximizing efficiency by templating Glue jobs and serverless architecture in Hudi data lakes
Soumil Shah, Data Engineering Team Lead @ JobTarget
By combining the power of templated Glue jobs and a serverless architecture within Hudi data lakes, this project revolutionizes how organizations handle large volumes of data. In this project, we explore the challenges of data lake management and introduce a solution that leverages the flexibility of Glue jobs to automate data ingestion, transformation, and loading tasks. The templated approach significantly reduces the effort required to manage complex ETL processes.
Additionally, the project incorporates a serverless architecture, minimizing operational overhead and costs associated with traditional infrastructure management. This approach not only streamlines data processing but also ensures scalability, fault tolerance, and high availability.
Through practical examples, source code, and architectural insights, this project equips data engineers and architects with the knowledge and tools to implement these techniques in their own data lake ecosystems. Attendees will learn how to maximize efficiency, optimize ETL workflows, and harness the full potential of Hudi data lakes, ultimately improving their organization's data management capabilities.
Session: Open data foundations with OneTable - Hudi, Delta, and Iceberg interoperability
Ashvin Agrawal, Senior Researcher @ Microsoft
Tim Brown, Engineering @ Onehouse
Anoop Johnson, Senior Staff Software Engineer @ Google
OneTable is a brand new open source project that unlocks omni-directional interoperability between the popular lakehouse projects Apache Hudi, Delta Lake, and Apache Iceberg. When your data is at rest in your lake, Hudi, Delta, and Iceberg are not so different. They each offer a metadata layer over a set of parquet files. OneTable offers lightweight conversion mechanisms that can take a source metadata format and sync it into one or more target metadata formats.
This session will feature a live demo and describe real-world applications of how to build open data foundations that can accelerate your workloads into a variety of open source query engines including Spark, Presto, Trino, Flink, and more. We will describe the technical foundations for Hudi, Delta, and Iceberg and lay out the nuts and bolts of how OneTable seamlessly converts data between these formats. We will detail our journey to create the project, share the vision for the future, and show how you can join this new open community.
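The key insight behind OneTable can be illustrated conceptually (this is not the OneTable API, just a toy model): all three formats describe the same parquet data files, so conversion means regenerating a thin metadata layer rather than rewriting any data.

```python
# Conceptual illustration only -- not the OneTable API. The three table
# formats share the same parquet data files; "converting" a table means
# producing a new metadata layer over the same file list.
data_files = ["part-0001.parquet", "part-0002.parquet"]

def make_metadata(fmt, files):
    # Each format records the same file list under its own metadata layout.
    return {"format": fmt, "snapshot": sorted(files)}

source = make_metadata("hudi", data_files)
targets = {fmt: make_metadata(fmt, data_files) for fmt in ("delta", "iceberg")}

# All three "tables" point at exactly the same data files.
assert all(m["snapshot"] == source["snapshot"] for m in targets.values())
print(sorted(targets))  # → ['delta', 'iceberg']
```

Because no parquet files are copied or rewritten, the sync is lightweight and can run in any direction, which is what makes omni-directional interoperability practical.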
Session: Overhauling data management at Apna
Sarfaraz Hussain, Senior Data Engineer @ Apna
Ronak Shah, Head of Data & Product @ Apna
Sarfaraz and Ronak will walk through the transformation of data management at Apna, a leading technology company. In their talk they'll cover:
- A comprehensive look at Apna’s historical data landscape, including the challenges and limitations it posed for the organization
- The catalysts that drove the need for a significant data transformation
- The methodologies, tools, and best practices employed to construct their new data platform internally
- Apna’s revamped data storage architecture which is designed to enhance scalability, performance, and data accessibility
- Their data ingestion and processing techniques, along with insights into the streamlined processes that ensure efficient and reliable data handling
Session: Apache Hudi 1.0 preview: A database experience on the data lake
Bhavani Sudha Saktheeswaran, Software Engineer @ Onehouse
Sagar Sumit, Software Engineer @ Onehouse
Hudi is a top-level Apache open-source project and community that is breaking the boundaries of what is possible on a data lake. This talk unveils the essence of Apache Hudi 1.0, a pivotal version that will encapsulate a ground-up reimagination of Hudi's transactional database layer while staying true to its foundational principles. Diving deep, we'll explore:
- State of the project: How Hudi today is a versatile data lake platform, enabling automated, near real-time data ingestion and incremental processing, integrated seamlessly with powerful frameworks such as Apache Spark, Flink, and Kafka Connect.
- Hudi 1.0: An insightful look into how Hudi is architecting the foundational blocks of its database kernel. From implementing non-blocking concurrency control, and faster access methods with improved indexing and metadata, to leveraging an LSM tree-style timeline for infinite time travel – Hudi is redesigning every facet to optimize data lakes at scale.
- The road ahead: Understanding the potential of transforming Hudi's core into a universal database experience for the lake, diving into deep query engine integrations, employing a hybrid server architecture, and expanding capabilities for complex data types, including images, videos, and formats conducive to ML/AI.
As Hudi 1.0 prepares to set a new benchmark in the world of streaming data lakes, this talk invites feedback, ideas, and collaborations to augment its scope and deliver unparalleled value to the user community. Join us to be part of this transformational journey!
Session: Options for real-time data pipelines
Siddharth Jain, Senior Engineering Manager @ Wayfair
Traditionally, data pipelines have been set up to run as a batch. Now, cloud services give you the capability to run pipelines in real time. Real-time pipelines can drive operational efficiencies and provide an opportunity for businesses to react quickly. However, not all analytics needs demand a real-time approach, and knowing which do and which don't is essential to achieving efficiency. Teams can easily fall into the trap of using services that enable real-time analytics, but it comes at a cost. Siddharth will discuss options like batching and micro-batching that can address most use cases at a lower cost.
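The micro-batching middle ground mentioned above can be sketched in a few lines: instead of handling every event individually (real-time) or once a day (batch), events are buffered and flushed in small groups, trading a little latency for much cheaper per-event processing. The class below is an illustrative toy, not any particular cloud service's API.

```python
# Illustrative micro-batching sketch: buffer events and hand them to a
# sink in fixed-size groups instead of one at a time.
class MicroBatcher:
    def __init__(self, batch_size, sink):
        self.batch_size = batch_size
        self.sink = sink        # callable invoked once per flushed batch
        self.buffer = []

    def add(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Emit whatever has accumulated, then reset the buffer.
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()

flushed = []
batcher = MicroBatcher(batch_size=3, sink=flushed.append)
for event in range(7):
    batcher.add(event)
batcher.flush()  # drain the remainder
print(flushed)  # → [[0, 1, 2], [3, 4, 5], [6]]
```

Seven events triggered only three downstream calls; at scale, that ratio is where the cost savings over true per-event processing come from.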
Panel: A discussion on batch, streaming, and real-time data processing for ML
Eric Gonzalez, VP, Business Intelligence Architecture @ Eastern Bank
Vaishnavi Muraldihar, Data Engineer @ Intuit
Michael Del Balso, CEO & Co-Founder @ Tecton
This panel will critically explore differing perspectives on batch, streaming, and real-time data processing, shedding light on common misconceptions, cost considerations, decision-making criteria, and the future of these architectures.
The panelists will discuss the most prevalent misconception in this domain, address the cost challenges of streaming architectures, and share insights on potential cost reduction strategies. They will also explore the decision-making criteria that guide organizations in choosing between these processing methods and discuss ways to make streaming data processing more accessible.
Furthermore, the advantages of streaming data for machine learning will be examined, emphasizing its potential to revolutionize real-time insights and predictive capabilities.
Panel: A discussion about contributing to open source projects
Nadine Farah, Head of Developer Relations @ Onehouse
Manfred Moser, Trino Contributor
David Anderson, Apache Flink Committer
Bhavani Sudha Saktheeswaran, Apache Hudi PMC
Join us for an important panel discussion on essential considerations when choosing to get involved with an open source project. Our panelists will discuss:
- How to assess a project's worthiness for contribution
- Steps for learning about and contributing to open source initiatives
- How open source contribution can impact your career
- Effective communication of project direction and growth from a technical standpoint
Session: Build user-facing analytics applications that scale
Albert Wong, Head of Developer Relations @ CelerData
This session covers building scalable user-facing analytics applications with StarRocks, a high-performance analytical database.
Albert walks through its fast sub-second query responses, high scalability for petabytes of data, and user-friendly compatibility with standard SQL. He'll cover StarRocks' features, architecture, and practical usage.
Where & when?
Open Source Data Summit 2024 was held on October 2nd, 2024.
What is the cost of access to the live virtual sessions?
OSDS is always free and open for all to attend.
What is Open Source Data Summit?
OSDS is a peer-to-peer gathering of data industry professionals, experts, and enthusiasts to explore the dynamic landscape of open source data tools and storage.
The central theme of OSDS revolves around the advantages of open source data products and their pivotal role in modern data ecosystems.
OSDS is the annual peer hub for knowledge exchange that fosters a deeper understanding of open source options and their role in shaping the data-driven future.
Who attends OSDS?
OSDS is attended by data engineers, data architects, developers, DevOps practitioners and managers, and data leadership.
Anyone who is looking for enriched perspectives on open source data tools and practical insights to navigate the evolving data landscape should attend this event.
On October 2nd, 2024 we convened for discussions about:
- Benefits of open source data tools
- Cost/performance trade-offs
- Building data storage solutions
- Challenges surrounding open source data tool integration
- Solutions for the cost of storing, accessing, and managing data
- Data streams and ingestion
- Hub-and-spoke data integration models
- Choosing the right engine for your workload
Are you interested in speaking or sponsoring the next Open Source Data Summit?
Submit a talk proposal here or reach out to astronaut@solutionmonday.com.
Register for on-demand access to the OSDS 2024 sessions and announcements about OSDS 2025
"*" indicates required fields