Top 16 Data Engineers in the Field

Data engineering sits at the heart of modern analytics, bridging raw data and actionable insights.

The individuals below represent the best data engineers globally, excelling in open-source contributions, leadership at high-impact tech companies, influential blogging and community engagement, and competition accolades. They have built and maintained the platforms and tools that power data-driven organizations. Each profile includes background information and links to their active public profiles so you can follow their work.

  1. Ali Ghodsi
  2. Martin Kleppmann
  3. Jay Kreps
  4. Tristan Handy
  5. Wes McKinney
  6. DJ Patil
  7. Dhruba Borthakur
  8. Doug Cutting
  9. Jordan Tigani
  10. Frank McSherry
  11. Neha Narkhede
  12. Maxime Beauchemin
  13. Zhamak Dehghani
  14. Stephan Ewen
  15. Reynold Xin
  16. Holden Karau

Now, let’s delve deeper into their remarkable careers and contributions.

Ali Ghodsi

Nationality: Iranian

Ali is the CEO and co-founder of Databricks, and was one of the original creators of Apache Spark. A former academic from Sweden (PhD in distributed computing), Ali helped turn Spark from a research project into an open-source powerhouse.

In 2013 he co-founded Databricks to commercialize Spark, and became CEO in 2016. Under his leadership, Databricks has become a leader in unified data analytics (recently valued over $60B) while remaining committed to open source (e.g., releasing Delta Lake and MLflow). Ali is known for his vision of the “Lakehouse” architecture that blends data lakes and warehouses. He’s an active voice in the data community, appearing in keynote talks and interviews about the future of AI and data.

Martin Kleppmann

Nationality: British

Martin is a researcher in distributed systems at the University of Cambridge and author of the acclaimed book Designing Data-Intensive Applications. Martin’s book (published 2017) has become a “bible” for data engineers, distilling the principles behind databases, streams, and distributed algorithms.

Previously, Martin was an engineer in industry – he co-founded and sold two startups and worked on large-scale data infrastructure at LinkedIn. He also co-developed Apache Samza (a stream processing framework) during his time at LinkedIn, contributing to early adoption of stream processing. Currently as an academic, he focuses on local-first collaboration software and CRDTs, pushing the frontier of realtime collaborative data systems. Martin continues to engage the community through his blog, talks, and open-source projects, bridging theoretical advances with practical engineering.

Jay Kreps

Nationality: American

Jay is the co-founder and CEO of Confluent and one of the original creators of Apache Kafka.

At LinkedIn, Jay and his colleagues built Kafka to handle the company’s massive event streams, and it was open-sourced in 2011. Kafka’s publish/subscribe model has since become a standard for streaming data across thousands of organizations. In 2014 Jay left LinkedIn to found Confluent, bringing a cloud-native Kafka platform to enterprises. He has overseen Confluent’s growth (now a public company) and the evolution of Kafka into an ecosystem. Jay also popularized the idea of the “log” as the heart of data systems in his book I ❤️ Logs. He frequently blogs and speaks about streaming architectures and has helped shape how modern companies think about real-time data.

Tristan Handy

Nationality: American

Tristan is the Founder and CEO of dbt Labs, the company behind dbt (data build tool). Tristan launched dbt in 2016 (initially as an open-source project at Fishtown Analytics) to empower data analysts to adopt software engineering best practices in analytics – namely, writing modular SQL transformations with version control and testing.

dbt has since sparked the analytics engineering movement and is used by over 60,000 companies. Tristan has grown Fishtown into dbt Labs, a venture-backed firm that now offers dbt Cloud and has become a hub of the modern data stack. He’s also known for his thought leadership via the weekly “Analytics Engineering Roundup” newsletter and podcast, where thousands tune in to hear Tristan discuss data team practices. His mix of community-building and product vision has made dbt an indispensable tool in the data engineer’s toolbox.

Wes McKinney

Nationality: American

Wes is the original creator of pandas, the ubiquitous Python data analysis library, and a co-creator of Apache Arrow. His work fundamentally improved how data scientists and engineers handle data in Python.

Wes created pandas in 2008 to bring R-like DataFrames to Python, and later wrote the book Python for Data Analysis. He went on to found Ursa Labs and co-create Arrow, a cross-language in-memory data format that has become an industry standard (enabling zero-copy data sharing between systems). Today Wes is a co-founder of Voltron Data (after Ursa Labs merged into it), where he continues to develop Arrow and related tools. He’s also a Principal Architect at Posit (RStudio) as of 2024, bridging Python and R ecosystems. Wes is an active open source advocate, often sharing insights on GitHub and Twitter, and has received awards for his contributions to data science software.

DJ Patil

Nationality: American

DJ Patil is often cited as one of the most influential data scientists in the world and was the first-ever U.S. Chief Data Scientist. With a background in mathematics, DJ helped coin the term “data scientist” during his tenure as Chief Scientist at LinkedIn in the late 2000s, where he led the development of LinkedIn’s early data products.

He also held senior data roles at eBay and PayPal. As U.S. Chief Data Scientist, DJ advocated for data-driven policymaking and worked on initiatives in health care, criminal justice, and education, demonstrating the social impact of data engineering. After leaving government, he entered venture capital and is currently a General Partner at GreatPoint Ventures, while advising startups and public organizations on data strategy.

DJ Patil remains a prominent public speaker on the power of data, and his blend of technical and leadership experience continues to inspire the next generation of data professionals.

Dhruba Borthakur

Nationality: Indian

Dhruba is the CTO and co-founder of Rockset, a real-time analytics database startup, and a veteran engineer behind key big data storage technologies. At Yahoo, Dhruba was one of the founding engineers of the Hadoop Distributed File System (HDFS), which provided petabyte-scale storage to the early big data world.

Later at Facebook, he was the founding engineer of RocksDB, an embedded key-value storage engine, on Facebook’s database team. In 2016 he co-founded Rockset to build a cloud-native analytical database for fast SQL on semi-structured data. Rockset’s indexing technology owes much to Dhruba’s deep storage expertise. (Notably, Rockset was acquired by OpenAI in 2024, indicating the value of its technology.) Beyond these, Dhruba has contributed to Apache HBase and worked on the Haystack photo storage system at Facebook.

He frequently shares his knowledge in database conferences and continues to push the envelope of low-latency analytics in the cloud era.

Doug Cutting

Nationality: American

Doug is the creator of Apache Hadoop and a legend in open-source big data. He also created Apache Lucene and co-created Apache Nutch (web crawler) in the early 2000s.

Hadoop, which grew out of Doug’s work on Nutch and at Yahoo in the mid-2000s, paired an open-source implementation of the MapReduce model described in Google’s paper with the Hadoop Distributed File System (HDFS), forming the backbone of the big data movement. In 2009 Doug joined Cloudera as Chief Architect, helping to bring Hadoop to the enterprise. He also served as Chairman of the Apache Software Foundation. Even after the Hadoop era, Doug continues to champion open data platforms.

His work enabled the era of distributed data lakes, and terms like “Hadoop ecosystem” exist largely thanks to him. In recent years he’s been an advocate for data privacy and open source governance.

Jordan Tigani

Nationality: American

Jordan is the co-founder and CEO of MotherDuck, a startup bringing the power of the open-source DuckDB project to the cloud. Prior to MotherDuck, Jordan was a founding engineer and longtime leader on Google BigQuery – he helped build BigQuery’s storage and metadata systems in its early 2010s launch, and later served as BigQuery’s Director of Engineering.

After Google, he was Chief Product Officer at SingleStore, another database company. In 2021, Jordan co-founded MotherDuck to integrate DuckDB’s in-process analytics with cloud scalability, aiming to provide fast analytics on smaller-scale data without complex infrastructure. Under his leadership, MotherDuck has gained buzz (raising over $100M at a $400M valuation). Jordan is also known for co-authoring the book Google BigQuery: The Definitive Guide and for his engaging conference talks on data architecture.

He brings a pragmatic perspective on when to leverage “big” data tech versus “duck-sized” data solutions.

Frank McSherry

Nationality: American

Frank is the chief scientist and co-founder of Materialize, and a researcher famed for his work on streaming dataflow systems. While at Microsoft Research, Frank co-invented Timely Dataflow and Differential Dataflow, two innovative computational models for incremental computing.

These became the foundation for Materialize’s real-time SQL database, which can maintain complex query results continuously. Materialize’s engine is built on Timely/Differential, frameworks that Frank open-sourced. In academia, Frank is known for contributions to database theory and differential privacy. At Materialize, he focuses on engineering the product’s core. Frank is respected for bringing rigorous computer science into practical systems; he often writes blog posts and speaks about how Materialize achieves its “streaming tables” magic.

He also remains active in the Rust community and is an advocate for open science in software.

Neha Narkhede

Nationality: Indian

Neha is the co-founder and former CTO of Confluent, and a co-creator of Apache Kafka during her time at LinkedIn.

At LinkedIn, Neha was instrumental in developing Kafka as a reliable, high-throughput distributed messaging system that now handles trillions of events per day across industries. In 2014 she co-founded Confluent to build a streaming data platform around Kafka, and led its technology strategy as CTO. Neha has since moved on to co-found Oscilar (her second startup, focused on AI-driven risk management), where she is CEO. She has been recognized in Forbes’s Top 50 Women in Tech and MIT Tech Review’s Innovators Under 35.

Neha is also a sought-after speaker on streaming architectures and entrepreneurship, and serves as a board member for Confluent.

Maxime Beauchemin

Nationality: French

Max is the original creator of Apache Airflow and Apache Superset, two widely used open-source data tools. At Airbnb circa 2014, Max created Airflow to automate complex data pipelines (now a top workflow orchestrator in ETL and data engineering).

He later created Superset as an open-source business intelligence and data visualization platform. In 2019, Max founded Preset, a startup providing a managed Superset platform, where he is CEO. With past data engineering stints at Facebook, Airbnb, and Lyft, Max has consistently built tools to fill gaps in the data ecosystem. He’s an open-source evangelist and active on social media, where he shares thoughts on data tool design.

Max’s contributions have saved countless data engineers from “reinventing the wheel” by providing production-ready frameworks.

Zhamak Dehghani

The future of data is decentralized, domain-oriented, and product-driven.

Nationality: Iranian

Zhamak is best known as the creator of the “Data Mesh” paradigm – a decentralized approach to enterprise data architecture. She introduced the concept in 2019 while at ThoughtWorks, via influential articles that challenged traditional monolithic data lakes.

In 2022, Zhamak authored Data Mesh: Delivering Data-Driven Value at Scale, elaborating on treating data as a product and organizing teams around data domains. To further this vision, she founded Nextdata in 2023 and serves as CEO, aiming to productize data mesh principles. In April 2025, Zhamak’s company launched Nextdata OS, a platform for building “autonomous data products” that operationalize data mesh ideas. Zhamak is a frequent keynote speaker and thought leader in data management, advocating for federated governance and self-serve data infrastructure.

Her work is reshaping how large organizations manage analytical data at scale.

Stephan Ewen

Nationality: German

Stephan is a co-creator of Apache Flink and was the CTO of Ververica (formerly data Artisans), the company that commercialized Flink.

Stephan started Flink as part of his research at TU Berlin, and helped shape it into a powerful open-source stream processing engine known for high-throughput, low-latency processing. In 2014 he co-founded data Artisans in Berlin to bring Flink to industry; the company was later acquired by Alibaba and rebranded Ververica. Stephan oversaw Flink’s evolution to handle real-time streaming at companies like Alibaba, Netflix, and Uber. In 2022, he left Ververica to start a new venture called Restate (focused on stateful event processing), where he is now Founder and CTO.

With over a decade of building streaming systems, Stephan remains a prominent voice in stream processing, often sharing his insights on topics like event-driven architecture and state management.

Reynold Xin

Nationality: Chinese

Reynold is a co-founder and Chief Architect at Databricks, and one of the original developers of Apache Spark. At Databricks, Reynold has overseen major technical contributions to Spark – he led the creation of Spark SQL/DataFrames and the Project Tungsten engine for optimizing in-memory computation.

These efforts greatly improved Spark’s performance and usability, expanding it from batch jobs to a general analytics engine. Reynold remains deeply involved in Spark’s development (he’s a Spark PMC member) and in Databricks’ product strategy. He frequently shares new features at Databricks’ Data + AI Summits and on the Databricks blog, helping engineers understand advanced topics like adaptive query execution and Photon engine optimizations.

With his academic background (Berkeley AMPLab) and practical leadership, Reynold is a key driver of Spark’s continued evolution in the open-source community.

Holden Karau

Nationality: Canadian

Holden is an open-source engineer and author known for her contributions to Apache Spark. She became a Spark committer in the project’s early days and co-authored several influential books, including Learning Spark (2015) and High Performance Spark (2017).

Holden has worked at companies like IBM, Google, and Netflix on large-scale data platforms, often focusing on improving Spark’s usability. As a transgender woman in tech, Holden is also a champion for diversity and mentorship in the data community. She is a frequent speaker at conferences (known for her fun live-coding talks) and shares knowledge through blogs and YouTube. In recent years, Holden founded a startup exploring AI applications while continuing to contribute to Spark and related projects.

Her approachable teaching style and deep expertise have made her a beloved figure in the Spark community.

Wrap Up

These legends represent exceptional talent, making them extremely challenging to headhunt. However, there are thousands of other highly skilled IT professionals available to hire with our help. Contact us, and we will be happy to discuss your hiring needs.

Note: We’ve dedicated significant time and effort to creating and verifying this curated list of top talent. However, if you believe a correction or addition is needed, feel free to reach out. We’ll gladly review and update the page.
