Top 20 Apache Spark Experts and Consultants

Apache Spark’s global community is driven by a diverse set of experts – from core project contributors and startup founders to influential bloggers and competition winners.
Below is an updated list of top Spark experts, selected for their open-source contributions, technical leadership, community influence, and cutting-edge work with big data and AI. Each entry includes a brief bio, current role/location, and key public profiles.
- Matei Zaharia
- Patrick Wendell
- Reynold Xin
- Sean Owen
- Jean-Georges Perrin
- Michael Armbrust
- Hyukjin Kwon
- Jason Dai
- Sandy Ryza
- Felix Cheung
- Wenchen Fan
- Xiangrui Meng
- Ram Sriharsha
- Holden Karau
- Shivaram Venkataraman
- Nick Pentreath
- Jacek Laskowski
- Denny Lee
- Allison Wang
- Xiao Li
Matei Zaharia

Nationality: USA
Matei started Apache Spark as a UC Berkeley Ph.D. project in 2009, envisioning a faster alternative to MapReduce.
He co-founded Databricks in 2013 and serves as its Chief Technology Officer (CTO) while remaining actively involved in research (he received the 2014 ACM Doctoral Dissertation Award). Matei also spearheaded other open-source projects (MLflow, Delta Lake) and teaches distributed systems (recently joining the faculty at Berkeley).
- LinkedIn: Matei Zaharia
- X (Twitter): @matei_zaharia
- GitHub: mateiz
Patrick Wendell
Nationality: USA
Patrick is a co-founder of Databricks and one of Spark’s earliest committers.
He acted as release manager for multiple Spark versions, shaping the project’s direction in its formative years. Now Databricks’ VP of Engineering, Patrick oversees teams in machine learning and data science platforms while still coding and reviewing core Spark changes. He’s respected for his distributed systems expertise and hands-on leadership in the open-source community.
- LinkedIn: Patrick Wendell
- X (Twitter): @patrickwendell
- GitHub: pwendell
Reynold Xin
Nationality: USA
Reynold is a co-founder and Chief Architect at Databricks, best known for his influential work on Apache Spark’s engine.
As an original Spark committer, he designed core components such as GraphX, Project Tungsten, and Structured Streaming, and co-led the DataFrame API. Reynold continues to guide Spark’s technical evolution (he even led a Databricks team to win the Sort Benchmark in 2014) and remains a key voice in the big data community.
- LinkedIn: Reynold Xin
- X (Twitter): @rxin
- GitHub: rxin
Sean Owen
Nationality: British
Sean is an Apache Spark PMC member and was one of the earliest evangelists for Spark in data science.
As Director of Data Science at Cloudera (in London), he championed Spark’s use for scalable ML and authored tools like Oryx (a real-time recommender on Spark). He later joined Databricks to lead its global data science practice and contributed to MLlib and the growth of the Spark community. Sean co-authored “Advanced Analytics with Spark” and has been a Kaggle competition master, blending practical machine learning with big data tech. He continues his applied research as a Staff Research Scientist at Databricks’ MosaicML division, focusing on ML systems.
- LinkedIn: Sean Owen
- X (Twitter): @SeanOwenPhD
- GitHub: srowen
Jean-Georges “jgp” Perrin
Nationality: French
Jean-Georges is the author of “Spark in Action, 2nd Edition” (Manning, 2020) – a comprehensive guide to building data engineering pipelines with Spark (foreword by IBM’s Rob Thomas).
An IBM Champion and veteran consultant, Perrin has promoted Spark across Europe, speaking at conferences like Spark Summit and All Things Open. In 2021 he joined PayPal as an engineer focusing on data mesh architecture, and in 2023 became Chief Innovation Officer at AbeaData (a data platform startup). With 20+ years in software (he’s worked with IBM Informix and WebSphere), JGP was among the first to bring Apache Spark to French enterprises and continues to advocate for open-source big data solutions.
- LinkedIn: Jean-Georges “jgp” Perrin
- X (Twitter): @jgperrin
- GitHub: jgperrin
- Website: jgp.ai
Michael Armbrust
Nationality: USA
Michael is the mastermind behind Spark SQL – he created Spark’s relational DataFrame API and SQL optimizer, and later built Structured Streaming and Delta Lake.
As a Spark PMC member and Databricks engineering director, he leads development of Spark’s core SQL engine and next-generation features. Michael’s contributions (e.g. the Catalyst optimizer and “Tungsten” execution engine) have made Spark the go-to unified analytics engine. He continues to innovate (heading towards Spark 4.x) and frequently shares new features at industry conferences.
- LinkedIn: Michael Armbrust
- X (Twitter): @michaelarmbrust
- GitHub: marmbrus
Hyukjin Kwon
Nationality: South Korean
Hyukjin is an Apache Spark PMC member and the lead engineer for PySpark APIs.
He spearheaded the Koalas project (pandas API on Spark), now merged into Spark 3.x, which lets pandas workloads scale transparently. As a Staff Software Engineer at Databricks, Hyukjin focuses on bridging Python and Spark – optimizing pandas UDFs, Arrow integration, and PySpark performance. He is a frequent speaker at PyData and Spark + AI Summits, sharing best practices for PySpark at scale.
- LinkedIn: Hyukjin Kwon
- X (Twitter): @hyukjinkwon
- GitHub: HyukjinKwon
Jason Dai
Nationality: Chinese
Jason is a Senior Principal Engineer at Intel and one of the most influential Spark contributors at the intersection of big data and AI. He created BigDL, an open-source deep learning library for Spark, enabling distributed AI workloads on Spark clusters.
As Intel’s first-ever Engineering Fellow in China, Jason leads a global team pushing the boundaries of analytics and AI on unified platforms. He has contributed to Spark MLlib and is also a PMC member of Apache Spark. Jason frequently speaks at industry conferences and has written authoritative articles on scaling deep learning with Spark (including work on Analytics Zoo). In 2023, he was recognized for his leadership in developing Intel’s oneAPI AI analytics toolkit that integrates with Spark. Jason holds a Ph.D. and has published extensively, but he focuses on practical solutions – helping enterprises use Spark with BigDL for tasks like large-scale recommendation systems and image analytics. For companies looking to implement machine learning or deep learning at scale on Spark, Jason’s expertise is second to none.
Sandy Ryza
Nationality: USA
Sandy co-authored “Advanced Analytics with Spark” (2015), sharing real-world machine learning recipes on Spark.
As an early Spark committer at Cloudera, he worked on MLlib and Spark’s integration with Hadoop, and contributed to improving Spark’s job scheduling and metrics. Sandy later led data science at Clover Health and is now Lead Engineer at Dagster Labs, building next-gen data orchestration tools. He remains a respected thought leader in data engineering, blending his experience in Spark, Hadoop, and now workflow management.
- LinkedIn: Sandy Ryza
- X (Twitter): @s_ryz
- GitHub: sryza
Felix Cheung
Nationality: Canadian
Felix is a longtime Spark PMC member and open-source leader.
He served as Spark Technical Lead at Uber, where he built the “Spark-as-a-Service” platform for hundreds of teams. Felix also contributes to Apache Zeppelin and Apache ORC, and has mentored several Apache incubator projects. After a stint as VP Engineering at SafeGraph, he joined NVIDIA to work on accelerating Spark with GPUs. Felix’s deep expertise in Spark and machine learning infrastructure, along with his advocacy for open source, makes him a sought-after speaker and advisor.
- LinkedIn: Felix Cheung
- GitHub: felixcheung
Wenchen Fan
Nationality: Chinese
Wenchen is a senior Spark committer known for his work on Spark’s SQL and Catalyst optimizer.
Based in China, he specializes in core engine improvements – from adaptive query execution to DataSource v2 APIs – and has been one of the most active contributors to Spark 3.x. Wenchen is a Spark PMC member and Apache Software Foundation member, bridging the global community. He currently leads Spark development at Databricks (remotely from Hangzhou) focusing on performance and SQL enhancements.
- LinkedIn: Wenchen Fan
- GitHub: cloud-fan
Xiangrui Meng
Nationality: USA
Xiangrui has been a key figure in Spark’s machine learning library, MLlib, since its early days.
As a Spark PMC member, he co-authored the official MLlib research paper and helped implement its core algorithms. At Databricks, Xiangrui drove MLlib’s development (from ALS to DataFrame-based Pipelines) and more recently works on integrating Spark with emerging AI tools (he contributes to MLflow and GPU acceleration efforts). With a Stanford CS background, he balances theory and practice in scalable ML.
- LinkedIn: Xiangrui Meng
Ram Sriharsha
San Francisco, USA
Ram is an Apache Spark PMC member known for his contributions to Spark’s MLlib and runtime performance.
At Hortonworks, he led efforts to integrate Spark with Hadoop and authored improvements in Spark’s memory management and ML pipelines. Ram later joined the vector database startup Pinecone, where he is now Chief Technology Officer, applying his distributed systems know-how to AI similarity search. He holds a Ph.D. in theoretical physics, which underpins his analytical approach. A frequent speaker on data science platforms, Ram remains involved in open source (e.g., Apache Arrow) and is uniquely versed in both the Spark and emerging AI/ML stack.
- LinkedIn: Ram Sriharsha
- GitHub: harsha2010
Holden Karau
Nationality: Canadian
Holden is an open-source engineer and prolific author in the Spark community. A Spark committer since the project’s early days, she co-authored some of the most widely read books on Spark, including Learning Spark (O’Reilly, 2015) and High Performance Spark (2017).
Holden has worked at companies like IBM, Google, and Apple on big data platforms, and is known for her contributions to Spark’s Python APIs and testing infrastructure (she created the popular spark-testing-base library to simplify unit testing Spark code). Beyond coding, Holden is a frequent conference speaker and blogger, acclaimed for her fun and accessible presentations on complex Spark internals. She’s also an advocate for diversity in tech and open source. Currently, Holden is the founder of a startup using AI to help consumers (while still contributing to open source on the side), demonstrating that she remains passionate about solving real-world problems with Spark and AI.
- LinkedIn: Holden Karau
- X (Twitter): @holdenkarau
- GitHub: holdenk
Shivaram Venkataraman
Wisconsin, USA
Shivaram was a key contributor to Spark during his Ph.D. at Berkeley – he helped develop SparkR (R language API) and worked on optimizing machine learning pipelines on Spark.
An Apache Spark committer, he also co-authored the popular MLlib paper. Now in academia, Shivaram leads research on large-scale data systems at the University of Wisconsin–Madison. He continues to bridge the gap between advanced research (e.g., scheduling and storage optimizations for Spark) and practical big data applications, which gives him a unique dual perspective.
- Website: shivaram.org
Nick Pentreath
Nationality: South African
Nick is a principal engineer and machine learning specialist who has been a prominent contributor to Apache Spark’s MLlib library. He is the author of Machine Learning with Spark (Packt, 2015), one of the first books to show how to build ML pipelines on Spark.
Nick was an early member of the Spark Technology Center at IBM, where he worked on advancing Spark’s machine learning capabilities and helped enterprises deploy Spark for AI solutions. He later joined the Apache Spark PMC, contributing code to MLlib and mentoring its growth. Nick’s expertise spans machine learning, recommendation systems, and deep learning integration with Spark. He is also an active open-source contributor beyond Spark (including projects in the Hadoop ecosystem and model deployment frameworks). Currently based in Cape Town and working with the Apache Software Foundation, Nick consults and advises companies globally on big data ML architecture. His combination of competition-level machine learning skills and real-world Spark experience puts him among the elite Spark consultants.
- LinkedIn: Nick Pentreath
- X (Twitter): @mlnick
- GitHub: mlnick
Jacek Laskowski
Nationality: Polish
Jacek is a freelance consultant and technical instructor specializing in Apache Spark, Kafka, and related big data technologies. Widely regarded as a Spark guru, he is the author of the online books The Internals of Apache Spark and Mastering Apache Spark 2.x, which are go-to resources for developers seeking a deep understanding of Spark’s inner workings.
Jacek has been recognized as a Databricks Beacon (MVP) for his community contributions. He spends much of his time training engineering teams and writing detailed blog posts dissecting Spark components (from Spark SQL’s Catalyst optimizer to Structured Streaming). With over 20 years of IT experience, Jacek has helped numerous companies in Europe adopt and optimize Spark for their data pipelines. His enthusiasm for sharing knowledge and his hands-on approach to solving Spark problems have made him one of the top independent Spark consultants in the world.
- LinkedIn: Jacek Laskowski
- X (Twitter): @jaceklaskowski
- GitHub: jaceklaskowski
Denny Lee
Nationality: USA
Denny is a hands-on data engineer and Apache Spark contributor with 20+ years of experience.
He is a Senior Staff Developer Advocate at Databricks, focusing on the open-source Delta Lake project and lakehouse best practices. Denny was part of Microsoft’s early big data team (bringing Hadoop to Azure) and co-founded the HDInsight Spark service. He co-authored Learning Spark 2E and is a maintainer of Delta Lake. A long-time Seattleite, Denny shares his expertise via blogs, talks, and even podcasts on data engineering. He’s known for approachable explanations of complex Spark topics (and for his enthusiasm for cycling and coffee!)
Allison Wang
Nationality: USA
Allison is an Apache Spark committer specializing in PySpark and SQL APIs.
As a software engineer at Databricks, she helped develop Spark’s new Python DataFrames API and Python UDF improvements, bridging the gap between pandas users and Spark clusters. Allison has been instrumental in PySpark’s recent performance boosts (e.g., Arrow-based vectorized UDFs) and is passionate about making Spark more user-friendly for Python and data science communities. She’s a frequent presenter at PyData and Spark meetups, and in 2023 co-led efforts like PySpark’s “Project Zen” for usability.
- LinkedIn: Allison Wang
- X (Twitter): @allisonwang42
- GitHub: allisonwang-db
Xiao Li
San Francisco, USA
Xiao is an Engineering Director at Databricks and a distinguished Spark committer who oversees Spark SQL and the Databricks Runtime teams.
With a Ph.D. in database systems, he brings deep expertise in query optimization and reliability. Xiao has driven many Spark SQL enhancements – e.g., Adaptive Query Execution and ANSI SQL compliance – ensuring Spark’s SQL engine meets enterprise needs. He was previously an IBM Master Inventor working on DB2 replication. Now, he leads multiple teams pushing Spark towards lakehouse capabilities, while remaining a hands-on contributor and Apache Spark PMC member.
- LinkedIn: Xiao Li
These legends represent exceptional talent, making them extremely challenging to headhunt. However, there are thousands of other highly skilled IT professionals available to hire with our help. Contact us, and we will be happy to discuss your hiring needs.
Note: We’ve dedicated significant time and effort to creating and verifying this curated list of top talent. However, if you believe a correction or addition is needed, feel free to reach out. We’ll gladly review and update the page.