Is Apache Spark still relevant?

Yes, Apache Spark remains widely used for large-scale data processing, real-time analytics, and machine learning. It is supported by major cloud providers and has an active open-source community.

How much do Apache Spark developers make?

Salaries vary by region, but Spark developers often earn between $90,000 and $150,000 per year in the United States, with senior specialists commanding higher pay due to demand in big data and AI projects.

What do Apache Spark consultants do?

They help organizations design and optimize data pipelines, integrate Spark with other tools, improve performance of distributed applications, and train in-house teams to manage big data workflows effectively.

Is it hard to find Apache Spark programmers?

Skilled Spark engineers are in demand, and while not as common as general software developers, they are accessible through specialized recruitment agencies, tech staffing firms, and global talent platforms.

What companies use Apache Spark?

Spark is used by companies such as Netflix, Uber, Airbnb, and Shopify to process large datasets, support real-time decision-making, and power advanced analytics applications.

Top 22 Apache Spark Experts and Consultants (2026)

Apache Spark’s global community is driven by a diverse set of experts – from core project contributors and startup founders to influential bloggers and competition winners.

Below is an updated list of top Spark experts, selected for their open-source contributions, technical leadership, community influence, and cutting-edge work with big data and AI. Each entry includes a brief bio, current role/location, and key public profiles.

Matei Zaharia
Patrick Wendell
Reynold Xin
Sean Owen
Jean-Georges Perrin
Michael Armbrust
Hyukjin Kwon
Jason Dai
Sandy Ryza
Felix Cheung
Wenchen Fan
Xiangrui Meng
Ram Sriharsha
Holden Karau
Shivaram Venkataraman
Nick Pentreath
Jacek Laskowski
Denny Lee
Allison Wang
Vivek Raj
Xiao Li
Tathagata Das

Matei Zaharia

Nationality: USA

Matei started Apache Spark as a UC Berkeley Ph.D. project in 2009, envisioning a faster alternative to MapReduce.

He co-founded Databricks in 2013 and serves as its Chief Technologist (CTO) while remaining actively involved in research (he received the ACM Doctoral Dissertation Award in 2014). Matei also spearheaded other open-source projects (MLflow, Delta Lake) and teaches distributed systems (he recently joined the faculty at UC Berkeley).

LinkedIn: Matei Zaharia
X (Twitter): @matei_zaharia
GitHub: mateiz

Patrick Wendell

Nationality: USA

Patrick is a co-founder of Databricks and one of Spark’s earliest committers.

He acted as release manager for multiple Spark versions, shaping the project’s direction in its formative years. Now Databricks’ VP of Engineering, Patrick oversees teams in machine learning and data science platforms while still coding and reviewing core Spark changes. He’s respected for his distributed systems expertise and hands-on leadership in the open-source community.

LinkedIn: Patrick Wendell
X (Twitter): @patrickwendell
GitHub: pwendell

Reynold Xin

Nationality: USA

Reynold is a co-founder and Chief Architect at Databricks, best known for his influential work on Apache Spark’s engine.

As an original Spark committer, he designed core components such as GraphX, Project Tungsten, and Structured Streaming, and co-led work on the DataFrame API. Reynold continues to guide Spark’s technical evolution (he led the Databricks team that won the Sort Benchmark in 2014) and remains a key voice in the big data community.

LinkedIn: Reynold Xin
X (Twitter): @rxin
GitHub: rxin

Sean Owen

Nationality: British

Sean is an Apache Spark PMC member and was one of the earliest evangelists for Spark in data science.

As Director of Data Science at Cloudera (in London), he championed Spark’s use for scalable ML and authored tools like Oryx (a real-time recommender on Spark). He later joined Databricks to lead its global data science practice and contributed to MLlib and the growth of the Spark community. Sean co-authored “Advanced Analytics with Spark” and has been a Kaggle competition master, blending practical machine learning with big data tech. He continues his applied research as a Staff Research Scientist at Databricks’ MosaicML division, focusing on ML systems.

LinkedIn: Sean Owen
X (Twitter): @SeanOwenPhD
GitHub: srowen

Jean-Georges “jgp” Perrin

Nationality: French

Jean-Georges is the author of “Spark in Action, 2nd Edition” (Manning, 2020) – a comprehensive guide to building data engineering pipelines with Spark (foreword by IBM’s Rob Thomas).

An IBM Champion and veteran consultant, Perrin has promoted Spark across Europe, speaking at conferences like Spark Summit and All Things Open. In 2021 he joined PayPal as an engineer focusing on data mesh architecture, and in 2023 became Chief Innovation Officer at AbeaData (a data platform startup). With 20+ years in software (he has worked with IBM Informix and WebSphere), JGP was among the first to bring Apache Spark to French enterprises and continues to advocate for open-source big data solutions.

LinkedIn: Jean-Georges “jgp” Perrin
X (Twitter): @jgperrin
GitHub: jgperrin
Website: jgp.ai

Michael Armbrust

Nationality: USA

Michael is the mastermind behind Spark SQL – he created Spark’s relational DataFrame API and SQL optimizer, and later built Structured Streaming and Delta Lake.

As a Spark PMC member and Databricks engineering director, he leads development of Spark’s core SQL engine and next-generation features. Michael’s contributions (e.g., the Catalyst optimizer and the “Tungsten” execution engine) have made Spark the go-to unified analytics engine. He continues to innovate (heading towards Spark 4.x) and frequently shares new features at industry conferences.

LinkedIn: Michael Armbrust
X (Twitter): @michaelarmbrust
GitHub: marmbrus

Hyukjin Kwon

Nationality: South Korean

Hyukjin is an Apache Spark PMC member and the lead engineer for PySpark APIs.

He spearheaded the Koalas project (the pandas API on Spark), which has since been merged into Spark 3.x, making pandas workloads scale transparently. As a Staff Software Engineer at Databricks, Hyukjin focuses on bridging Python and Spark – optimizing pandas UDFs, Arrow integration, and PySpark performance. He is a frequent speaker at PyData and Spark + AI Summits, sharing best practices for PySpark at scale.

LinkedIn: Hyukjin Kwon
X (Twitter): @hyukjinkwon
GitHub: HyukjinKwon

Jason Dai

Nationality: Chinese

Jason is a Senior Principal Engineer at Intel and one of the most influential Spark contributors at the intersection of big data and AI. He created BigDL, an open-source deep learning library for Spark, enabling distributed AI workloads on Spark clusters.

As Intel’s first-ever Engineering Fellow in China, Jason leads a global team pushing the boundaries of analytics and AI on unified platforms. He has contributed to Spark MLlib and is also a PMC member of Apache Spark. Jason frequently speaks at industry conferences and has written authoritative articles on scaling deep learning with Spark (including work on Analytics-Zoo). In 2023, he was recognized for his leadership in developing Intel’s oneAPI AI analytics toolkit that integrates with Spark. Jason holds a Ph.D. and has extensive research publications, but he focuses on practical solutions – helping enterprises use Spark with BigDL for tasks like large-scale recommendation systems and image analytics. For companies looking to implement machine learning or deep learning at scale on Spark, Jason’s expertise is second to none.

LinkedIn: Jason Dai
GitHub: jason-dai

Sandy Ryza

Nationality: USA

Sandy co-authored “Advanced Analytics with Spark” (2015), sharing real-world machine learning recipes on Spark.

As an early Spark committer at Cloudera, he worked on MLlib and Spark’s integration with Hadoop, and contributed to improving Spark’s job scheduling and metrics. Sandy later led data science at Clover Health and is now Lead Engineer at Dagster Labs, building next-gen data orchestration tools. He remains a respected thought leader in data engineering, blending his experience in Spark, Hadoop, and now workflow management.

LinkedIn: Sandy Ryza
X (Twitter): @s_ryz
GitHub: sryza

Felix Cheung

Nationality: Canadian

Felix is a longtime Spark PMC member and open-source leader.

He served as Spark Technical Lead at Uber, where he built the “Spark-as-a-Service” platform for hundreds of teams. Felix also contributes to Apache Zeppelin and Apache ORC, and has mentored several Apache incubator projects. After a stint as VP Engineering at SafeGraph, he joined NVIDIA to work on accelerating Spark with GPUs. Felix’s deep expertise in Spark and machine learning infrastructure, along with his advocacy for open source, makes him a sought-after speaker and advisor.

LinkedIn: Felix Cheung
GitHub: felixcheung

Wenchen Fan

Nationality: Chinese

Wenchen is a senior Spark committer known for his work on Spark’s SQL and Catalyst optimizer.

Based in China, he specializes in core engine improvements – from adaptive query execution to DataSource v2 APIs – and has been one of the most active contributors to Spark 3.x. Wenchen is a Spark PMC member and Apache Software Foundation member, bridging the global community. He currently leads Spark development at Databricks (remotely from Hangzhou) focusing on performance and SQL enhancements.

LinkedIn: Wenchen Fan
GitHub: cloud-fan

Xiangrui Meng

Nationality: USA

Xiangrui has been a key figure in Spark’s machine learning library, MLlib, since its early days.

As a Spark PMC member, he co-authored the official MLlib research paper and helped implement its core algorithms. At Databricks, Xiangrui drove MLlib’s development (from ALS to DataFrame-based Pipelines) and more recently works on integrating Spark with emerging AI tools (he contributes to MLflow and GPU acceleration efforts). With a Stanford CS background, he balances theory and practice in scalable ML.

LinkedIn: Xiangrui Meng

Ram Sriharsha

San Francisco, USA

Ram is an Apache Spark PMC member known for his contributions to Spark’s MLlib and runtime performance.

At Hortonworks, he led efforts to integrate Spark with Hadoop and authored improvements in Spark’s memory management and ML pipelines. Ram later joined the vector database startup Pinecone, where he is now Chief Technology Officer, applying his distributed systems know-how to AI similarity search. He holds a Ph.D. in theoretical physics, which underpins his analytical approach. A frequent speaker on data science platforms, Ram remains involved in open source (e.g., Apache Arrow) and is well versed in both the Spark and emerging AI/ML stacks.

LinkedIn: Ram Sriharsha
GitHub: harsha2010

Holden Karau

Nationality: Canadian

Holden is an open-source engineer and author in the Spark community. A Spark committer since the project’s early days, she co-authored some of the most widely read books on Spark, including Learning Spark (O’Reilly, 2015) and High Performance Spark (2017).

Holden has worked at companies like IBM, Google, and Apple on big data platforms, and is known for her contributions to Spark’s Python APIs and testing infrastructure (she created the popular spark-testing-base library to simplify unit testing Spark code). Beyond coding, Holden is a frequent conference speaker and blogger, acclaimed for her fun and accessible presentations on complex Spark internals. She’s also an advocate for diversity in tech and open source. Currently, Holden is the founder of a startup using AI to help consumers (while still contributing to open source on the side), demonstrating that she remains passionate about solving real-world problems with Spark and AI.

LinkedIn: Holden Karau
X (Twitter): @holdenkarau
GitHub: holdenk

Shivaram Venkataraman

Wisconsin, USA

Shivaram was a key contributor to Spark during his Ph.D. at Berkeley – he helped develop SparkR (R language API) and worked on optimizing machine learning pipelines on Spark.

An Apache Spark committer, he also co-authored the popular MLlib paper. Now in academia, Shivaram leads research on large-scale data systems at University of Wisconsin–Madison. He continues to bridge the gap between advanced research (e.g., scheduling and storage optimizations for Spark) and practical big data applications, earning him a unique dual perspective.

Website: shivaram.org

Nick Pentreath

Nationality: South African

Nick is a principal engineer and machine learning specialist who has been a prominent contributor to Apache Spark’s MLlib library. He is the author of Machine Learning with Spark (Packt, 2015), one of the first books to show how to build ML pipelines on Spark.

Nick was an early member of the Spark Technology Center at IBM, where he worked on advancing Spark’s machine learning capabilities and helped enterprises deploy Spark for AI solutions. He later joined the Apache Spark PMC, contributing code to MLlib and mentoring its growth. Nick’s expertise spans machine learning, recommendation systems, and deep learning integration with Spark. He is also an active open-source contributor beyond Spark (including projects in the Hadoop ecosystem and model deployment frameworks). Currently based in Cape Town and working with the Apache Software Foundation, Nick consults and advises companies globally on big data ML architecture. His combination of competition-level machine learning skills and real-world Spark experience puts him among the elite Spark consultants.

LinkedIn: Nick Pentreath
X (Twitter): @mlnick
GitHub: mlnick

Jacek Laskowski

Nationality: Polish

Jacek is a freelance consultant and technical instructor specializing in Apache Spark, Kafka, and related big data technologies. Widely regarded as a Spark guru, he is the author of the online books The Internals of Apache Spark and Mastering Apache Spark 2.x, which are go-to resources for developers seeking a deep understanding of Spark’s inner workings.

Jacek has been recognized as a Databricks Beacon (MVP) for his community contributions. He spends much of his time training engineering teams and writing detailed blog posts dissecting Spark components (from Spark SQL’s Catalyst optimizer to Structured Streaming). With over 20 years of IT experience, Jacek has helped numerous companies in Europe adopt and optimize Spark for their data pipelines. His enthusiasm for sharing knowledge and his hands-on approach to solving Spark problems have made him one of the top independent Spark consultants in the world.

LinkedIn: Jacek Laskowski
X (Twitter): @jaceklaskowski
GitHub: jaceklaskowski

Denny Lee

Nationality: USA

Denny is a hands-on data engineer and Apache Spark contributor with more than 20 years of experience.

He is a Senior Staff Developer Advocate at Databricks, focusing on the open-source Delta Lake project and lakehouse best practices. Denny was part of Microsoft’s early big data team (bringing Hadoop to Azure) and co-founded the HDInsight Spark service. He co-authored Learning Spark, 2nd Edition and is a maintainer of Delta Lake. A long-time Seattleite, Denny shares his expertise via blogs, talks, and podcasts on data engineering. He’s known for approachable explanations of complex Spark topics (and for his enthusiasm for cycling and coffee!).

LinkedIn: Denny Lee
GitHub: dennyglee

Allison Wang

Nationality: USA

Allison is an Apache Spark committer specializing in PySpark and SQL APIs.

As a software engineer at Databricks, she helped develop Spark’s new Python DataFrames API and Python UDF improvements, bridging the gap between pandas users and Spark clusters. Allison has been instrumental in PySpark’s recent performance boosts (e.g., Arrow-based vectorized UDFs) and is passionate about making Spark more user-friendly for Python and data science communities. She’s a frequent presenter at PyData and Spark meetups, and in 2023 co-led efforts like PySpark’s “Project Zen” for usability.

LinkedIn: Allison Wang
X (Twitter): @allisonwang42
GitHub: allisonwang-db

Vivek Raj

Nationality: Indian

Vivek is a seasoned Azure Data Engineer with more than 13 years of experience in building scalable data pipelines, especially using Apache Spark and Azure Databricks.

In his LinkedIn article “Difference between Apache Spark and Databricks”, he explains that while Apache Spark is an open‑source distributed compute engine, Databricks is a managed cloud platform built atop Spark—offering automated clusters, collaborative notebooks, Delta Lake integration, and easier operational management. Vivek clarifies that Spark itself provides the core tools for data processing (Spark Core, SQL, Streaming, MLlib) while Databricks enhances it with orchestration, optimized execution, and lakehouse ecosystem support like Delta Lake and BI tooling.

His hands‑on expertise in PySpark, Azure Data Factory, Azure Databricks, and Delta Lake aligns closely with these distinctions and underscores his strength in architecting modern lakehouse and big data solutions.

LinkedIn: Vivek Raj

Xiao Li

San Francisco, USA

Xiao is an Engineering Director at Databricks and a distinguished Spark committer who oversees Spark SQL and the Databricks Runtime teams.

With a Ph.D. in database systems, he brings deep expertise in query optimization and reliability. Xiao has driven many Spark SQL enhancements – e.g., Adaptive Query Execution and ANSI SQL compliance – ensuring Spark’s SQL engine meets enterprise needs. He was previously an IBM Master Inventor working on DB2 replication. Now, he leads multiple teams pushing Spark towards lakehouse capabilities, while remaining a hands-on contributor and Apache Spark PMC member.

LinkedIn: Xiao Li

Tathagata Das

Nationality: Indian-American

Tathagata is one of the original Apache Spark developers and a Spark PMC member, best known for leading Spark Streaming and helping shape Structured Streaming.

His work focuses on streaming correctness, state management, and operational reliability, and he has also contributed to Delta Lake and related data management components in the Databricks ecosystem. Tathagata is a co-author on the early Spark Streaming research work presented at SOSP 2013 (Discretized Streams).

LinkedIn: Tathagata Das
GitHub: tdas

Wrap Up

These experts represent exceptional talent, making them extremely challenging to headhunt. However, there are thousands of other highly skilled IT professionals available to hire with our help. Contact us, and we will be happy to discuss your hiring needs.

Note: We’ve dedicated significant time and effort to creating and verifying this curated list of top talent. However, if you believe a correction or addition is needed, feel free to reach out. We’ll gladly review and update the page.
