Top 19 Data Warehouse (DWH) Engineers

Data warehouse engineering sits at the heart of modern analytics, bridging vast data stores with actionable insights.
The individuals below represent some of the best data warehouse (DWH) engineers globally, excelling in open-source contributions, leadership at high-impact tech companies, influential blogging and community engagement, and even competition accolades. They include famous open-source developers, startup tech founders, tech influencers, large company engineers, and data champions.
- Benoit Dageville
- Vinoth Chandar
- Ashish Thusoo
- Martin Traverso
- Thierry Cruanes
- Joydeep Sen Sarma
- Daniel Abadi
- Wes McKinney
- Reynold Xin
- David Phillips
- Hannes Mühleisen
- Fangjin Yang
- Kishore Gopalakrishna
- Mark Raasveldt
- Ryan Blue
- Jeff Hammerbacher
- Jordan Tigani
- Marcel Kornacker
- Barry Zane
Now, let’s dive into each of these experts’ profiles:
Benoit Dageville

Nationality: French
Benoit is co-founder of Snowflake and currently President of Products at the cloud data platform company. A distinguished database engineer, Benoit spent 15 years at Oracle as a lead architect for the Oracle Parallel SQL engine and the Oracle RAC database cluster.
In 2012, he left Oracle to start Snowflake, re-imagining the data warehouse for the cloud era. Benoit’s deep expertise in parallel query execution and self-tuning systems helped Snowflake build a scalable, fully-managed SQL data warehouse from scratch. Today Snowflake is an industry giant, and Benoit (holder of 80+ patents) still codes on weekends to stay connected to the technology.
He is widely regarded for “revolutionizing the future of computing” in data analytics, by proving that a cloud-native data warehouse could deliver performance, elasticity, and simplicity beyond traditional on-prem systems.
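To make that elasticity concrete, here is a minimal sketch using the official snowflake-connector-python package; the account, credentials, warehouse, and table names are hypothetical placeholders.

```python
# Minimal sketch with the official snowflake-connector-python package.
# Account, credentials, warehouse, and table names are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="myorg-myaccount",
    user="analyst",
    password="***",
    warehouse="ANALYTICS_WH",
    database="SALES_DB",
    schema="PUBLIC",
)
cur = conn.cursor()

# Elasticity: resize compute on demand, independently of the stored data.
cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'LARGE'")

# An ordinary analytical query against the shared storage layer.
cur.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
for region, total in cur.fetchall():
    print(region, total)

conn.close()
```

Resizing the warehouse changes compute capacity without moving any data, which is exactly the storage/compute separation Benoit's team designed for.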
- LinkedIn: Benoit Dageville
- GitHub: sfc-gh-bdagevil
Vinoth Chandar
Nationality: Indian
Hudi was born out of a need to make incremental data management efficient on data lakes. It revolutionized how we think about streaming data and the lakehouse architecture.
Vinoth is the original creator of Apache Hudi, an open-source data platform that brings streaming data and database-like upsert abilities to data lakes. At Uber around 2016, Vinoth confronted the challenge of managing ever-growing incremental data in Hadoop – rides data, user events – where rewriting entire partitions daily was too slow. So he built Hudi (Hadoop Upserts and Incrementals) to enable upsert, delete, change capture, and incremental consumption on data lake files, significantly improving freshness and efficiency.
Hudi was open-sourced in 2017 and pioneered the concept of “lakehouse” before the term existed: it keeps data in columnar files but adds a transaction log and index, allowing near-real-time updates and point-in-time query views. This unlocked use cases like updating trip fares and replaying changelogs on Apache Hive/Spark. Hudi was adopted by AWS (which offers native support in services like EMR and Athena) and by many companies needing delta handling on S3.
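As an illustration, here is a minimal PySpark sketch of a Hudi upsert, assuming Spark was launched with the Hudi bundle on its classpath; the table path and field names are hypothetical.

```python
# Minimal PySpark sketch of a Hudi upsert; assumes Spark was started with
# the Hudi bundle (e.g. --packages org.apache.hudi:hudi-spark3-bundle_2.12:<ver>).
# Table path and field names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-demo").getOrCreate()

updates = spark.createDataFrame(
    [("trip-1", 14.50, "2024-01-02"), ("trip-2", 9.75, "2024-01-02")],
    ["trip_id", "fare", "ts"],
)

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",  # key to upsert on
    "hoodie.datasource.write.precombine.field": "ts",      # latest record wins
    "hoodie.datasource.write.operation": "upsert",
}

# The upsert rewrites only the affected file groups, not whole partitions.
(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/warehouse/trips"))
```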
Vinoth went on to chair the Apache Hudi project management committee and guided Hudi’s community growth (Hudi graduated to an Apache Top-Level Project in 2020, and won the SIGMOD Systems Award 2021). In 2021, he founded Onehouse, aiming to provide a managed lakehouse service on top of Hudi. As CEO of Onehouse, he continues to innovate – e.g., building a self-optimizing data layout for Hudi and integrating it with query engines.
- LinkedIn: Vinoth Chandar
- X (Twitter): @byte_array
- GitHub: vinothchandar
Ashish Thusoo
Nationality: Indian
Ashish is the co-creator of Apache Hive and served as Facebook’s data infrastructure lead during the late 2000s.
At Facebook, Ashish and his team developed Hive out of necessity – to enable SQL querying on Hadoop – thereby kickstarting the SQL-on-big-data revolution. He was the founding VP of Apache Hive at the ASF, helping turn it into one of the most popular data warehouse solutions on Hadoop.
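Hive's core idea – plain SQL compiled into distributed jobs over Hadoop data – still reads simply today. Here is a minimal sketch using the community PyHive client; the HiveServer2 host and table name are hypothetical.

```python
# Minimal sketch with the community PyHive client; the HiveServer2 host
# and table name are hypothetical.
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000, database="default")
cur = conn.cursor()

# Hive compiles this SQL into distributed jobs over data in HDFS,
# so analysts never have to write MapReduce by hand.
cur.execute("""
    SELECT event_date, COUNT(*) AS events
    FROM page_views
    GROUP BY event_date
""")
for row in cur.fetchall():
    print(row)
```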
In 2011, Ashish co-founded Qubole, a cloud big-data platform, and as CEO he grew it over nine years to 350+ people (Qubole was later acquired). Under his leadership, Qubole pioneered the concept of a self-service big data platform on AWS, Azure, and other clouds, simplifying Hadoop, Spark, and Hive for the cloud. After Qubole, Ashish joined AWS as GM for AI/ML (leading SageMaker teams), and in 2023 he founded a stealth startup in the generative AI space.
With 25+ years in tech, Ashish Thusoo is a true data pioneer – from inventing Hive (which earned him a 2021 ACM SIGMOD Systems Award) to shaping big data in the cloud.
- LinkedIn: Ashish Thusoo
- X (Twitter): @ashishthusoo
Martin Traverso
Nationality: Canadian
Martin is the co-creator of Presto (now known as Trino), a distributed SQL query engine for big data, and a co-founder of the Trino Software Foundation. While at Facebook, Martin (along with colleagues) originally designed and developed Presto to allow data analysts to run interactive SQL queries on Facebook’s large Hadoop data warehouse.
In 2019, he and the other Presto creators forked the project to continue its development under an open foundation; the fork, initially called PrestoSQL, was renamed Trino in late 2020. Martin is also a co-founder and CTO of Starburst Data, which commercializes Trino for enterprise analytics. With Trino/Presto, Martin proved that a massively parallel SQL engine could query heterogeneous data sources (Hive, Cassandra, etc.) with high performance on a distributed cluster. He remains deeply involved in the Trino project’s direction and community.
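A minimal sketch of that federation idea, using the official trino Python client; the cluster address and catalog/table names are hypothetical.

```python
# Minimal sketch with the official `trino` Python client; the cluster
# address and the catalog/table names are hypothetical.
import trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="analyst")
cur = conn.cursor()

# One SQL statement spanning two systems: a Hive table on object storage
# joined to a Cassandra table, planned and executed by the same engine.
cur.execute("""
    SELECT o.order_id, c.name
    FROM hive.sales.orders AS o
    JOIN cassandra.crm.customers AS c ON o.customer_id = c.id
    LIMIT 10
""")
print(cur.fetchall())
```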
Prior to Starburst, Martin worked as a software engineer at Facebook and as a software architect at Proofpoint and Ning. His accomplishments – creating one of the fastest distributed query engines and successfully fostering its open-source ecosystem – have made Martin Traverso a leading figure in cloud data warehousing.
- LinkedIn: Martin Traverso
- X (Twitter): @mtraverso
- GitHub: martint
Thierry Cruanes
Nationality: French
Thierry is Snowflake’s other technical co-founder and a leading expert in SQL query optimization and parallel execution.
He spent 13 years at Oracle focusing on the database optimizer and parallel query layers (eventually leading Oracle’s query optimization group). Prior to Oracle, he worked on data mining algorithms at IBM’s European Center for Applied Mathematics.
Thierry earned a PhD in database systems and holds 40+ patents. In 2012, he teamed up with Benoît Dageville to start Snowflake, bringing his deep optimizer knowledge to design Snowflake’s ground-breaking architecture. Cruanes helped architect Snowflake’s elastic cloud data warehouse, ensuring it could automatically optimize and scale SQL queries across massive clusters. Like Dageville, he stays technically involved (both founders famously “still code on weekends” to stay in touch with their creation).
Under his and Dageville’s guidance, Snowflake went from an idea to a record-breaking IPO, validating their bet against the prevailing Hadoop paradigm.
- LinkedIn: Thierry Cruanes
Joydeep Sen Sarma
Nationality: Indian
Joydeep partnered with Ashish Thusoo to revolutionize Facebook’s data analytics. As Facebook’s Data Infrastructure Lead, Joydeep co-developed Apache Hive in 2008, creating a SQL layer on Hadoop that opened big data to non-programmers. Hive enabled Facebook’s 150+ PB data warehouse to be queried by thousands of employees using SQL, a breakthrough that inspired the modern data engineering ecosystem.
Joydeep was an Apache Hive founding committer and also led development of Facebook’s petabyte-scale messaging analytics. In 2011, he and Thusoo co-founded Qubole, where Joydeep served as CTO and built a cloud-native platform offering Hadoop, Spark, and Hive as a service. At Qubole, he continued pushing the envelope (e.g., integrating Hive with cloud object storage and accelerating workloads via Presto). After Qubole, Joydeep moved on to a new venture (ClearFeed AI), but he remains an influential big data veteran.
An IIT Delhi alum and former Yahoo engineer, Joydeep also has competitive accolades (he was All-India Rank 18 in the IIT entrance exam). His blend of top-tier engineering and entrepreneurial execution helped shape the big data warehousing landscape that led to today’s cloud DWs.
- LinkedIn: Joydeep Sen Sarma
Daniel Abadi
Nationality: American
Daniel is a prominent database researcher whose work on column stores and distributed analytics has profoundly influenced data warehousing systems. As a PhD student at MIT, Abadi was a core contributor to the C-Store project – a novel column-oriented database that demonstrated how columnar storage and compression yield order-of-magnitude better analytic query performance. C-Store directly led to the founding of Vertica (with Stonebraker), bringing those ideas to market and helping launch the modern columnar DW era.
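A toy Python illustration (not C-Store itself) of why column orientation pays off for analytic scans:

```python
# Toy illustration (not C-Store itself) of why columnar layout helps scans.
import numpy as np

n = 100_000
# Row-oriented: each record carries every field, so a scan drags them all along.
rows = [{"id": i, "price": float(i % 100), "region": i % 5} for i in range(n)]
row_total = sum(r["price"] for r in rows)

# Column-oriented: the price column is one dense, cache-friendly array,
# which also compresses far better than interleaved row data.
price_col = np.arange(n, dtype=np.float64) % 100
col_total = float(price_col.sum())

assert row_total == col_total  # same answer, very different memory traffic
```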
Abadi then turned to merging Hadoop with databases: his lab built HadoopDB, which federated relational database nodes with Hadoop for scalable SQL on big data. He co-founded Hadapt in 2010 to commercialize HadoopDB, pioneering the “SQL-on-Hadoop” concept (Hadapt was acquired by Teradata in 2014).
Currently a professor at the University of Maryland, Abadi continues to be an influential voice. He formulated the PACELC theorem, an extension of CAP for distributed systems, and he is an ACM Fellow (2020) recognized for contributions to stream, graph, and distributed databases.
- LinkedIn: Daniel Abadi
- X (Twitter): @daniel_abadi
- GitHub: abadid
Wes McKinney
Nationality: American
Wes might be best known for creating pandas, the Python data analysis library, but his work has been transformative for data warehousing and analytics, especially in bridging analytical databases with data science tools. Wes developed pandas in 2008 to provide flexible tabular data manipulation in Python, which has since become a de facto standard for millions of analysts.
In the late 2010s, he turned to solving the fragmented data ecosystem problem: he co-created Apache Arrow, an open standard for columnar in-memory data that allows zero-copy data sharing between systems. Arrow and its companion Apache Parquet format are now foundational in cloud data warehouses and lakehouses – enabling efficient storage and transfer of columnar data.
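A minimal sketch of the Arrow/Parquet pairing using the pyarrow package; the file name is hypothetical.

```python
# Minimal sketch of the Arrow/Parquet pairing; the file name is hypothetical.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"city": ["Paris", "Oslo"], "temp_c": [14.2, 7.9]})

# Convert to Arrow's columnar in-memory format (zero-copy where possible)...
table = pa.Table.from_pandas(df)

# ...persist as Parquet, the columnar on-disk companion format...
pq.write_table(table, "weather.parquet")

# ...and read it back into any Arrow-speaking engine without re-serialization.
print(pq.read_table("weather.parquet").to_pandas())
```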
Wes also co-founded Ursa Labs (now Voltron Data) to advance open-source analytics computing. He has authored the widely read book Python for Data Analysis and is now a director at Posit (RStudio), focusing on interoperability of tools. By creating pandas, Wes empowered Python to become a major player in data wrangling.
- LinkedIn: Wes McKinney
- X (Twitter): @wesmckinn
- GitHub: wesm
- Website/Blog: wesmckinney.com
Reynold Xin
Nationality: Chinese-American
Reynold is the architect behind much of Spark’s SQL and DataFrame capabilities, making him a crucial player in modern data warehousing on big data. As a Berkeley PhD student and later a Databricks co-founder, Reynold was the lead developer of Spark SQL, DataFrames, and the Catalyst optimizer – the components that allow Spark to run SQL at scale. He also led Project Tungsten and Structured Streaming; together, these efforts transformed Spark from a batch engine into a full analytics engine.
Under Reynold’s engineering leadership, Spark set records like the 2014 Sort Benchmark (beating Hadoop by 30x). Reynold serves as Chief Architect at Databricks, overseeing the platform’s technical evolution. He’s known for pushing the envelope on data optimization – e.g., introducing vectorized Parquet readers and dynamic partition pruning in Spark, which significantly improved performance for warehouse-style queries.
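A minimal PySpark sketch showing SQL and the DataFrame API compiling to the same Catalyst plan; the data and column names are hypothetical.

```python
# Minimal PySpark sketch; the data and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

df = spark.createDataFrame(
    [("us", 10.0), ("eu", 7.5), ("us", 3.25)], ["region", "amount"]
)
df.createOrReplaceTempView("orders")

# SQL and the DataFrame API compile to the same optimized Catalyst plan.
agg_sql = spark.sql("SELECT region, SUM(amount) AS total FROM orders GROUP BY region")
agg_df = df.groupBy("region").agg(F.sum("amount").alias("total"))

agg_sql.explain()  # inspect the physical plan Catalyst produced
agg_df.show()
```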
An immigrant from China, Reynold has openly shared his story (“Four of the Databricks founders are immigrants” he notes). He exemplifies how innovation in open-source (Spark) can spawn an entire industry. By enabling fast SQL on “big data”, Reynold Xin’s work helped catalyze the trend of companies moving warehouse workloads to data lake platforms.
- LinkedIn: Reynold Xin
- X (Twitter): @rxin
- GitHub: rxin
David Phillips
Nationality: American
David was the third key engineer on the Facebook Presto team and is a co-founder of the Presto (Trino) Software Foundation. With a background in high-performance systems (previously at HP Labs), David brought strong software engineering rigor to Presto. He implemented major features like prepared statement support, syntax enhancements, and many connectors. Notably, David has been a public face of Presto/Trino: he co-authored the book Trino: The Definitive Guide and often speaks about query engines.
In 2018, after the community fork to keep the open-source project independent (the fork was renamed Trino in 2020), David co-founded the foundation and joined Starburst to work full-time on Trino. He served as an Apache Software Foundation board member as well. David’s contributions include improving Presto’s Hive integration and security features, making it more enterprise-ready. On X (as @electrum32) he shares deep insights on SQL engines.
His meticulous approach to correctness and completeness has helped Trino evolve from a Facebook internal tool to a broadly adopted analytics engine. Thousands of companies rely on Trino today, and David Phillips’s steady stewardship of the project and community is a big reason why. He shows that open-source success is not just about initial invention, but sustained iteration and community building.
- LinkedIn: David Phillips
- X (Twitter): @electrum32
- GitHub: electrum
Hannes Mühleisen
Nationality: German
Hannes is the co-inventor of DuckDB, an in-process analytic database dubbed the “SQLite for analytics”. While a researcher at CWI Amsterdam, Hannes (along with his colleague Mark Raasveldt) designed DuckDB to provide fast columnar query processing embedded within other applications.
Released in 2019, DuckDB can run complex analytical SQL queries entirely within a Python/R process or even in a browser, without needing a separate database server. This light-weight, zero-dependency design – combined with very efficient vectorized execution – has led to DuckDB’s popularity for data science, analytics, and even as a component in larger data platforms. Hannes received a PhD in parallel databases and has continued to publish cutting-edge research (he won the Dutch ICT Young Researcher Award 2025 for his work). Now as co-founder and CEO of DuckDB Labs, he oversees the open-source project’s rapid development and adoption.
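A minimal sketch of that in-process model; the Parquet file name is a hypothetical placeholder.

```python
# Minimal sketch of DuckDB's in-process model: no server, just a library.
# The Parquet file name is a hypothetical placeholder.
import duckdb

# Query a Parquet file directly from inside the host process; DuckDB's
# vectorized engine scans it in place.
result = duckdb.sql("""
    SELECT passenger_count, AVG(fare) AS avg_fare
    FROM 'taxi.parquet'
    GROUP BY passenger_count
    ORDER BY passenger_count
""").fetchall()
print(result)
```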
Hannes’s vision is to “embed SQL analytics everywhere” from edge devices to Jupyter notebooks – a goal that seems increasingly plausible as DuckDB is integrated into countless tools. He balances academic excellence with practical engineering (DuckDB often outperforms more “enterprise” systems on single-node workloads). By making serious analytics available in a 5MB library, Hannes Mühleisen is changing how and where we can warehouse data – bringing analytical SQL to the data, rather than always bringing data to a remote warehouse server.
- LinkedIn: Hannes Mühleisen
- GitHub: hannes
Fangjin Yang
Nationality: Canadian
Fangjin “FJ” is one of the key engineers behind Apache Druid, a high-performance real-time analytics database. In 2011, at ad-tech startup Metamarkets, Fangjin and team were struggling to achieve sub-second queries on streaming event data using existing tools. So they built Druid – an OLAP datastore combining columnar storage, distributed processing, and an innovative indexing structure for slice-and-dice analytics on huge event streams.
Fangjin was a main committer and led the push to open-source Druid in 2012. Thanks to its speed (ingesting millions of events/sec and answering queries in <1s on trillions of rows), Druid gained massive adoption at companies like Netflix, Alibaba, and Airbnb. In 2015, Fangjin co-founded Imply to offer Druid as a managed platform and continue its development.
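A minimal sketch against Druid's SQL-over-HTTP endpoint (POST /druid/v2/sql); the router address is hypothetical, and `wikipedia` is the datasource from Druid's own tutorials.

```python
# Minimal sketch against Druid's SQL endpoint (POST /druid/v2/sql); the
# router address is hypothetical, `wikipedia` is Druid's tutorial datasource.
import requests

resp = requests.post(
    "http://druid-router.example.com:8888/druid/v2/sql",
    json={"query": """
        SELECT channel, COUNT(*) AS edits
        FROM wikipedia
        WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
        GROUP BY channel
        ORDER BY edits DESC
        LIMIT 5
    """},
    timeout=30,
)
print(resp.json())
```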
As CEO of Imply, he has guided Druid’s evolution. Fangjin’s contributions are not just technical but community-oriented: he has evangelized Druid’s concept of a “real-time data slice-and-dice engine” for operational analytics, which addresses use cases traditional warehouses couldn’t. Under his leadership, Imply raised major funding as Druid’s popularity soared. Fangjin holds degrees from University of Waterloo, and early in his career he also worked on Hadoop at Cisco.
- LinkedIn: Fangjin Yang
- X (Twitter): @fangjin
Kishore Gopalakrishna
Nationality: Indian
Kishore is the original creator of Apache Pinot, a distributed OLAP datastore built to serve real-time analytical queries with ultra-low latency. At LinkedIn around 2014, Kishore and his colleagues faced the challenge of providing members with real-time insights on fresh data. Traditional warehouses were too slow, so they built Pinot to power LinkedIn’s user-facing analytics, capable of handling 100k+ queries/sec on live event data with millisecond latency.
Kishore open-sourced Pinot in 2015, and it soon became the backbone for analytics at LinkedIn (70+ products), Uber (for metrics), Stripe, and others. Pinot’s secret sauce is combining ideas from search (inverted indexes) with OLAP: it pre-aggregates where useful, uses a columnar store, and allows both real-time (streaming) and batch ingestion. In 2019, Kishore co-founded StarTree to provide a cloud service and further development for Pinot.
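A minimal sketch using the community pinotdb DB-API client; the broker address, table, and columns are hypothetical.

```python
# Minimal sketch with the community `pinotdb` DB-API client; the broker
# address, table, and columns are hypothetical.
from pinotdb import connect

conn = connect(host="pinot-broker.example.com", port=8099,
               path="/query/sql", scheme="http")
cur = conn.cursor()

# The kind of low-latency, user-facing aggregation Pinot was built for.
cur.execute("""
    SELECT country, COUNT(*) AS views
    FROM profile_views
    WHERE viewTime > ago('PT1H')
    GROUP BY country
    ORDER BY views DESC
    LIMIT 10
""")
print(cur.fetchall())
```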
Now as StarTree’s CEO, he has expanded Pinot’s reach – adding features like tiered storage and JSON indexing, and making it easy for companies to build their own customer-facing analytics (such as real-time dashboards). Kishore, who previously worked on Hadoop at Yahoo, has effectively dedicated his career to solving big data query problems.
- LinkedIn: Kishore Gopalakrishna
- X (Twitter): @KishoreBytes
Mark Raasveldt
Nationality: Dutch
Mark co-developed DuckDB alongside Hannes Mühleisen and is the technical lead (CTO) of DuckDB Labs. With a Master’s from Leiden University, Mark joined CWI as a researcher and soon became instrumental in writing DuckDB’s core C++ engine. He specializes in query execution and storage layout – Mark implemented DuckDB’s vectorized processing model and many of its optimizations for fast memory access.
He also ensured DuckDB supports a broad SQL dialect and ACID transactions despite its small footprint. Mark’s work on embedding DuckDB in Python (via a Pandas integration) and R has made it very accessible to data scientists. In fact, he and Hannes have published about using DuckDB for in-memory analytics and have demonstrated that it can outperform larger systems on many analytics tasks.
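A minimal sketch of that pandas integration; the DataFrame contents are hypothetical.

```python
# Minimal sketch of the pandas integration: DuckDB can scan an in-memory
# DataFrame by name, with no copy into a database server.
import duckdb
import pandas as pd

df = pd.DataFrame({"region": ["us", "eu", "us"], "amount": [10.0, 7.5, 3.25]})

# DuckDB resolves `df` from the surrounding Python scope (a "replacement scan").
out = duckdb.sql("SELECT region, SUM(amount) AS total FROM df GROUP BY region").df()
print(out)
```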
Now Mark leads the engineering team continuing to add features like row-level indexing, parallel joins, and an extension framework to DuckDB. He is also a visible advocate for open data formats – e.g., working on enhancements to Apache Arrow integration. Under Mark’s technical direction, DuckDB’s popularity has exploded, with tens of thousands of developers using it in Jupyter, in streaming pipelines, even inside web browsers.
- LinkedIn: Mark Raasveldt
- X (Twitter): @mraasveldt
- GitHub: Mytherin
Ryan Blue
Nationality: American
Iceberg solved the data mutation and reliability gap in data lakes, making cloud object storage a viable foundation for modern data warehousing.
Ryan is a software engineer who co-created Apache Iceberg, a high-performance table format for huge analytic datasets on data lakes. While at Netflix in 2017, Ryan experienced pain points with the Hive table format. He spearheaded the creation of Iceberg to bring fully ACID-compliant transactions, schema evolution, partition pruning, and other warehouse-like capabilities to data lakes. Open-sourced through Apache in 2018, Iceberg has since been embraced by the community and vendors as a key component of the “data lakehouse” architecture.
Ryan continued to lead the Iceberg project as it gained momentum – he became Iceberg PMC chair and co-founded Tabular, a startup to provide managed Iceberg catalogs, in 2021. Iceberg’s design (immutable files with transaction logs, no metastore bottleneck) addresses many of the consistency and performance issues of earlier approaches, and it allows multiple engines (Spark, Trino, Flink, etc.) to safely work on the same data.
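A hedged PySpark sketch of those ideas, assuming an Iceberg-enabled Spark build; the catalog name, warehouse path, and table are hypothetical.

```python
# Hedged PySpark sketch; assumes an Iceberg-enabled Spark build (the Iceberg
# runtime jar on the classpath). Catalog name, warehouse path, and table
# are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# An ACID table on object storage, with cheap schema evolution.
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp())")
spark.sql("ALTER TABLE demo.db.events ADD COLUMN source STRING")  # no data rewrite

# Every commit is an atomic snapshot; time travel is then
# `SELECT ... FROM demo.db.events VERSION AS OF <snapshot_id>`.
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.events.snapshots").show()
```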
Ryan is also an ASF member who previously contributed to Parquet and Hive. By solving the “data mutation and reliability” gap on data lakes, Ryan Blue has helped make cloud object storage a viable foundation for warehouses, influencing products like Snowflake’s external tables and Databricks’ Delta Lake.
- LinkedIn: Ryan Blue
- GitHub: rdblue
Jeff Hammerbacher
Nationality: American
Jeff is often credited with coining the term “Data Scientist”, but his impact on data warehousing is also significant through the teams and technologies he helped create. As an early Facebook engineer (2006–2008), Jeff led the development of the company’s data analytics infrastructure. He was Mark Zuckerberg’s first data analyst and quickly grew a team that built a Hadoop-based data warehouse ingesting on the order of 15 TB of new data per day – one of the largest of its era. That same team created Hive, the SQL-on-Hadoop layer built by Ashish Thusoo and Joydeep Sen Sarma (profiled above).
In 2008, at just 25, he left Facebook to co-found Cloudera, the first big data platform company, bringing Hadoop (and by extension distributed warehousing capabilities) to the enterprise. At Cloudera, Jeff was Chief Scientist and worked on making Hadoop reliable and easy for companies to use for storage and analytics. His vision was to enable the “Facebook data experience” everywhere. Cloudera’s distribution included Hive, Impala, and other warehouse-related projects that Jeff championed early on.
He later worked on bioinformatics, but his influence continued: many of Facebook’s and Cloudera’s data engineers went on to create Apache Spark, Kafka, and more. Jeff’s famous 2011 quote encapsulates the zeitgeist: “The best minds of my generation are thinking about how to make people click ads” – a tongue-in-cheek nod that many brilliant engineers were, like him, optimizing data at web companies.
- LinkedIn: Jeff Hammerbacher
Jordan Tigani
Nationality: American
Jordan was one of the early engineers who built Google BigQuery, the first serverless cloud data warehouse, and he continues to innovate in the “small data” space today. He joined Google in 2011 as a founding engineer on BigQuery, contributing to its Dremel-based architecture that allows SQL queries over huge datasets in seconds. Jordan later became BigQuery’s Director of Engineering and then its Product Manager, guiding it to widespread success.
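A minimal sketch with the official google-cloud-bigquery client, assuming Google Cloud credentials are configured; the public dataset it queries is real.

```python
# Minimal sketch with the official google-cloud-bigquery client; assumes
# Google Cloud credentials are configured. The public dataset is real.
from google.cloud import bigquery

client = bigquery.Client()  # no cluster to size or manage: fully serverless

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(row["name"], row["total"])
```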
He co-authored Google BigQuery: The Definitive Guide, sharing his deep knowledge of distributed SQL engines. After nearly a decade at Google, Jordan moved to SingleStore as Chief Product Officer, and in 2022 he co-founded MotherDuck, a startup integrating DuckDB with cloud services. As MotherDuck’s CEO, he’s championing the idea that not all analytics needs “big data” – sometimes simpler, local processing can be more efficient. This contrarian vision was summed up in his talk “The Death of Big Data, Time to Think Small”.
Jordan is a prominent voice in the data community, blending practical engineering with product insight. He’s known for his humor and for bridging worlds: BigQuery’s success was about bringing Google’s research to everyone, and MotherDuck aims to bring advanced analytics to those with modest data volumes. In both cases, Jordan Tigani has helped broaden access to analytical power.
- LinkedIn: Jordan Tigani
- X (Twitter): @jrdntgn
Marcel Kornacker
Nationality: German
Marcel built one of the first native analytical databases for Hadoop, paving the way for today’s low-latency big data warehouses. Marcel holds a PhD from UC Berkeley and worked at Google, where he was a tech lead on F1, Google’s distributed RDBMS that powers AdWords. In 2010, he joined Cloudera with a mission: to create an open-source MPP query engine that brings real-time, interactive SQL to Hadoop.
The result was Apache Impala, which Marcel architected and launched in 2012. Impala was inspired by Google F1’s approach (which Marcel knew intimately) and delivered dramatically faster performance than Hive – “instantaneous” SQL on Hadoop achieved by bypassing MapReduce and running long-lived daemon processes on each node. This made Hadoop clusters capable of true data warehousing workloads in seconds, not minutes. Marcel’s contribution was both technical and visionary: Wired memorably described his move as a man busting out of Google to rebuild its “top-secret query machine”.
Under his leadership, Impala implemented innovations like a distributed query planner, runtime code generation, and direct use of HDFS file formats (Parquet) – all to maximize speed. Impala’s open-source release even preceded Google’s own published paper on F1.
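A minimal sketch using the community impyla client; the daemon address and table name are hypothetical.

```python
# Minimal sketch with the community `impyla` client; the daemon address
# and table name are hypothetical.
from impala.dbapi import connect

conn = connect(host="impalad.example.com", port=21050)
cur = conn.cursor()

# The query goes straight to long-running impalad daemons, skipping
# MapReduce job startup entirely; that is what made interactive latency possible.
cur.execute("SELECT device, COUNT(*) AS n FROM telemetry GROUP BY device LIMIT 10")
print(cur.fetchall())
```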
- LinkedIn: Marcel Kornacker
- GitHub: mkornacker
Barry Zane
Nationality: American
Barry is a serial database architect whose work directly enabled Amazon Redshift, one of the first cloud data warehouses. In 2005, Barry co-founded ParAccel, Inc., aiming to build a high-performance MPP analytic database to compete with Netezza and Vertica. As CTO at ParAccel, he led the engineering of its columnar, shared-nothing DBMS that achieved top performance in TPC-H benchmarks.
ParAccel’s technology was so solid that in 2012 Amazon selected it as the basis for Amazon Redshift, licensing the engine to launch its own cloud data warehouse. Thus, ParAccel essentially became Redshift’s core (ParAccel itself was later acquired by Actian in 2013). Redshift’s success in the 2010s was a watershed for cloud DW adoption, and Barry Zane’s database kernel was at the heart of it. Before ParAccel, Barry had already made his mark: he was a co-founder of Netezza (though he left early) and, earlier, a developer on Applix’s TM1 OLAP engine.
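A minimal sketch using AWS's redshift_connector package; the cluster endpoint, credentials, and table are hypothetical.

```python
# Minimal sketch with AWS's `redshift_connector` package; the cluster
# endpoint, credentials, and table are hypothetical.
import redshift_connector

conn = redshift_connector.connect(
    host="examplecluster.abc123.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="analyst",
    password="***",
)
cur = conn.cursor()

# A classic MPP pattern inherited from ParAccel: the aggregation runs in
# parallel across the cluster's compute slices over columnar storage.
cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
print(cur.fetchall())
conn.close()
```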
After ParAccel, he founded SPARQL City to work on graph databases for analytics. He holds multiple patents in query optimization and storage. In recent years, Barry has been with Cambridge Semantics working on a distributed graph/SQL database, but he keeps a lower profile than some on this list.
- LinkedIn: Barry Zane
Wrap Up
These legends represent exceptional talent, making them extremely challenging to headhunt. However, there are thousands of other highly skilled IT professionals available to hire with our help. Contact us, and we will be happy to discuss your hiring needs.
Note: We’ve dedicated significant time and effort to creating and verifying this curated list of top talent. However, if you believe a correction or addition is needed, feel free to reach out. We’ll gladly review and update the page.