Top 15 Apache Spark Developers

Apache Spark’s global community is driven by a diverse set of experts – from core project contributors and startup founders to influential bloggers and competition winners.
Below is an updated list of top Spark experts, selected for their open-source contributions, technical leadership, community influence, and cutting-edge work with big data and AI. Each entry includes a brief bio, current role/location, and key public profiles.
- Matei Zaharia
- Patrick Wendell
- Reynold Xin
- Sean Owen
- Jean-Georges “jgp” Perrin
- Michael Armbrust
- Hyukjin Kwon
- Sandy Ryza
- Felix Cheung
- Wenchen Fan
- Xiangrui Meng
- Ram Sriharsha
- Shivaram Venkataraman
- Denny Lee
- Allison Wang
- Xiao Li
Matei Zaharia

California, USA
Matei started Apache Spark as a UC Berkeley Ph.D. project in 2009, envisioning a faster alternative to MapReduce.
He co-founded Databricks in 2013 and serves as its CTO while remaining actively involved in research (he received the 2014 ACM Doctoral Dissertation Award). Matei also spearheaded other open-source projects (MLflow, Delta Lake) and teaches distributed systems, having recently joined the faculty at UC Berkeley.
- LinkedIn: Matei Zaharia
- X (Twitter): @matei_zaharia
- GitHub: mateiz
Patrick Wendell
California, USA
Patrick is a co-founder of Databricks and one of Spark’s earliest committers.
He acted as release manager for multiple Spark versions, shaping the project’s direction in its formative years. Now Databricks’ VP of Engineering, Patrick oversees teams in machine learning and data science platforms while still coding and reviewing core Spark changes. He’s respected for his distributed systems expertise and hands-on leadership in the open-source community.
- LinkedIn: Patrick Wendell
- X (Twitter): @patrickwendell
- GitHub: pwendell
Reynold Xin
San Francisco, USA
Reynold is a co-founder and Chief Architect at Databricks, best known for his influential work on Apache Spark’s engine.
As an original Spark committer, he designed core components such as GraphX, Project Tungsten, and Structured Streaming, and co-led the DataFrame API. Reynold continues to guide Spark’s technical evolution (he led the Databricks team that won the Sort Benchmark in 2014) and remains a key voice in the big data community.
- LinkedIn: Reynold Xin
- X (Twitter): @rxin
- GitHub: rxin
Sean Owen
London, UK
Sean is an Apache Spark PMC member and was one of the earliest evangelists for Spark in data science.
As Director of Data Science at Cloudera (in London), he championed Spark’s use for scalable ML and authored tools like Oryx (a real-time recommender on Spark). He later joined Databricks to lead its global data science practice and contributed to MLlib and the growth of the Spark community. Sean co-authored “Advanced Analytics with Spark” and has been a Kaggle competition master, blending practical machine learning with big data tech. He now works as a Staff Research Scientist in Databricks’ MosaicML division, focusing on ML systems.
- LinkedIn: Sean Owen
- X (Twitter): @SeanOwenPhD
- GitHub: srowen
Jean-Georges “jgp” Perrin
Strasbourg, France
Jean-Georges is the author of “Spark in Action, 2nd Edition” (Manning, 2020) – a comprehensive guide to building data engineering pipelines with Spark (foreword by IBM’s Rob Thomas).
An IBM Champion and veteran consultant, Perrin has promoted Spark across Europe, speaking at conferences like Spark Summit and All Things Open. In 2021 he joined PayPal as an engineer focusing on data mesh architecture, and in 2023 became Chief Innovation Officer at AbeaData (a data platform startup). With 20+ years in software (he’s worked with IBM Informix and WebSphere), JGP was among the first to bring Apache Spark to French enterprises and continues to advocate for open-source big data solutions.
- LinkedIn: Jean-Georges “jgp” Perrin
- X (Twitter): @jgperrin
- GitHub: jgperrin
- Website: jgp.ai
Michael Armbrust
California, USA
Michael is the mastermind behind Spark SQL – he created Spark’s relational DataFrame API and SQL optimizer, and later built Structured Streaming and Delta Lake.
As a Spark PMC member and Databricks engineering director, he leads development of Spark’s core SQL engine and next-generation features. Michael’s contributions (e.g. the Catalyst optimizer and “Tungsten” execution engine) have made Spark the go-to unified analytics engine. He continues to innovate (heading towards Spark 4.x) and frequently shares new features at industry conferences.
- LinkedIn: Michael Armbrust
- X (Twitter): @michaelarmbrust
- GitHub: marmbrus
Hyukjin Kwon
South Korea
Hyukjin is an Apache Spark PMC member and the lead engineer for PySpark APIs.
He spearheaded the Koalas project (pandas API on Spark), which has since been merged into Spark 3.x, making pandas workloads scale transparently. As a Staff Software Engineer at Databricks, Hyukjin focuses on bridging Python and Spark – optimizing pandas UDFs, Arrow integration, and PySpark performance. He is a frequent speaker at PyData and Spark + AI Summits, sharing best practices for PySpark at scale.
- LinkedIn: Hyukjin Kwon
- X (Twitter): @hyukjinkwon
- GitHub: HyukjinKwon
Sandy Ryza
San Francisco, USA
Sandy co-authored “Advanced Analytics with Spark” (2015), sharing real-world machine learning recipes on Spark.
As an early Spark committer at Cloudera, he worked on MLlib and Spark’s integration with Hadoop, and contributed improvements to Spark’s job scheduling and metrics. Sandy later led data science at Clover Health and is now Lead Engineer at Dagster Labs, building next-gen data orchestration tools. He remains a respected thought leader in data engineering, blending his experience in Spark, Hadoop, and now workflow management.
- LinkedIn: Sandy Ryza
- X (Twitter): @s_ryz
- GitHub: sryza
Felix Cheung
Toronto, Canada
Felix is a longtime Spark PMC member and open-source leader.
He served as Spark Technical Lead at Uber, where he built the “Spark-as-a-Service” platform for hundreds of teams. Felix also contributes to Apache Zeppelin and Apache ORC, and has mentored several Apache incubator projects. After a stint as VP Engineering at SafeGraph, he joined NVIDIA to work on accelerating Spark with GPUs. Felix’s deep expertise in Spark and machine learning infrastructure, along with his advocacy for open source, makes him a sought-after speaker and advisor.
- LinkedIn: Felix Cheung
- GitHub: felixcheung
Wenchen Fan
Hangzhou, China
Wenchen is a senior Spark committer known for his work on Spark’s SQL and Catalyst optimizer.
Based in China, he specializes in core engine improvements – from adaptive query execution to DataSource v2 APIs – and has been one of the most active contributors to Spark 3.x. Wenchen is a Spark PMC member and Apache Software Foundation member, bridging the global community. He currently leads Spark development at Databricks (remotely from Hangzhou) focusing on performance and SQL enhancements.
- LinkedIn: Wenchen Fan
- GitHub: cloud-fan
Xiangrui Meng
USA
Xiangrui has been a key figure in Spark’s machine learning library, MLlib, since its early days.
As a Spark PMC member, he co-authored the official MLlib research paper and helped implement its core algorithms. At Databricks, Xiangrui drove MLlib’s development (from ALS to DataFrame-based Pipelines) and more recently works on integrating Spark with emerging AI tools (he contributes to MLflow and GPU acceleration efforts). With a Stanford CS background, he balances theory and practice in scalable ML.
- LinkedIn: Xiangrui Meng
Ram Sriharsha
San Francisco, USA
Ram is an Apache Spark PMC member known for his contributions to Spark’s MLlib and runtime performance.
At Hortonworks, he led efforts to integrate Spark with Hadoop and authored improvements in Spark’s memory management and ML pipelines. Ram later joined the vector database startup Pinecone, where he is now Chief Technology Officer, applying his distributed systems know-how to AI similarity search. He holds a Ph.D. in theoretical physics, which underpins his analytical approach. A frequent speaker on data science platforms, Ram remains involved in open source (e.g., Apache Arrow) and is well versed in both the Spark and emerging AI/ML stacks.
- LinkedIn: Ram Sriharsha
- GitHub: harsha2010
Shivaram Venkataraman
Wisconsin, USA
Shivaram was a key contributor to Spark during his Ph.D. at Berkeley – he helped develop SparkR (R language API) and worked on optimizing machine learning pipelines on Spark.
An Apache Spark committer, he also co-authored the popular MLlib paper. Now in academia, Shivaram leads research on large-scale data systems at the University of Wisconsin–Madison. He continues to bridge the gap between advanced research (e.g., scheduling and storage optimizations for Spark) and practical big data applications, which gives him a dual perspective on both.
- Website: shivaram.org
Denny Lee
Seattle, USA
Denny is a hands-on data engineer and Apache Spark contributor with 20+ years of experience.
He is a Senior Staff Developer Advocate at Databricks, focusing on the open-source Delta Lake project and lakehouse best practices. Denny was part of Microsoft’s early big data team (bringing Hadoop to Azure) and co-founded the HDInsight Spark service. He co-authored Learning Spark 2E and is a maintainer of Delta Lake. A long-time Seattleite, Denny shares his expertise via blogs, talks, and podcasts on data engineering. He’s known for approachable explanations of complex Spark topics (and for his enthusiasm for cycling and coffee!).
Allison Wang
California, USA
Allison is an Apache Spark committer specializing in PySpark and SQL APIs.
As a software engineer at Databricks, she helped develop Spark’s new Python DataFrame API and Python UDF improvements, bridging the gap between pandas users and Spark clusters. Allison has been instrumental in PySpark’s recent performance boosts (e.g., Arrow-based vectorized UDFs) and is passionate about making Spark more user-friendly for the Python and data science communities. She’s a frequent presenter at PyData and Spark meetups, and in 2023 co-led efforts like PySpark’s “Project Zen” for usability.
- LinkedIn: Allison Wang
- X (Twitter): @allisonwang42
- GitHub: allisonwang-db
Xiao Li
San Francisco, USA
Xiao is an Engineering Director at Databricks and a distinguished Spark committer who oversees Spark SQL and the Databricks Runtime teams.
With a Ph.D. in database systems, he brings deep expertise in query optimization and reliability. Xiao has driven many Spark SQL enhancements – e.g., Adaptive Query Execution and ANSI SQL compliance – ensuring Spark’s SQL engine meets enterprise needs. He was previously an IBM Master Inventor working on DB2 replication. Now, he leads multiple teams pushing Spark towards lakehouse capabilities, while remaining a hands-on contributor and Apache Spark PMC member.
- LinkedIn: Xiao Li
These legends represent exceptional talent, making them extremely challenging to headhunt. However, there are thousands of other highly skilled IT professionals available to hire with our help. Contact us, and we will be happy to discuss your hiring needs.
Note: We’ve dedicated significant time and effort to creating and verifying this curated list of top talent. If you intend to share or make use of it in any way, we kindly ask that you include a backlink to the original source – EchoGlobal.