Apache Hudi

Data Infrastructure and Analytics

San Francisco, CA 15,006 followers

Open source pioneer of the lakehouse, reimagining batch processing with an incremental framework for low-latency analytics

About us

Open source pioneer of the lakehouse, reimagining old-school batch processing with a powerful new incremental framework for low-latency analytics. Hudi brings database and data warehouse capabilities to the data lake, making it possible to create a unified data lakehouse for ETL, analytics, AI/ML, and more. Apache Hudi is battle-tested at scale, powering some of the largest data lakes on the planet. Apache Hudi provides an open foundation that connects to other popular open source tools such as Spark, Presto, Trino, Flink, Hive, and more. Being an open source table format is not enough: Apache Hudi is also a comprehensive platform of open services and tools that are necessary to operate your data lakehouse in production at scale. Most importantly, Apache Hudi is a community built by a diverse group of engineers from all around the globe! Hudi is a friendly and inviting open source community that is growing every day. Join the community on GitHub: https://github.com/apache/hudi or find links to email lists and Slack channels on the Hudi website: https://hudi.apache.org/

Website
https://hudi.apache.org/
Industry
Data Infrastructure and Analytics
Company size
201-500 employees
Headquarters
San Francisco, CA
Type
Nonprofit
Founded
2016
Specialties
ApacheHudi, DataEngineering, ApacheSpark, ApacheFlink, TrinoDB, Presto, DataAnalytics, DataLakehouse, AWS, GCP, Azure, ChangeDataCapture, and StreamProcessing

Updates

  • Static bloom filters hit a wall when capacity is exceeded—false positives skyrocket. Hudi takes a smarter path 💡. Our InternalDynamicBloomFilter uses a matrix of filters. When the active one reaches its threshold (`nr`), we add a new row. Checks scan all rows: a match in any means the key's likely there 📊. Expansion happens lazily—only when needed, no upfront allocation. Key for keeping memory tight with thousands of file groups 📈. Hit the max (`maxNr`)? It switches to round-robin overwrites. False positives rise, but everything keeps running without crashes or endless growth 🔄. Vital for upsert indexing: Hudi scans Parquet footer blooms to spot potential key files during writes. In ongoing Spark streaming with unknown cardinality, static filters waste space or fail quietly. Dynamic ones adjust on the fly 🚀. #ApacheHudi #DataEngineering #BigData #Spark #BloomFilters
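The row-growth behaviour described in this post can be sketched in a few lines of Python. This is an illustration only, not Hudi's `InternalDynamicBloomFilter`: the class names `SimpleBloom` and `DynamicBloom`, the `max_rows` parameter (standing in for the cap the post calls `maxNr`), and the SHA-256 double hashing are all assumptions made for the example.

```python
import hashlib


class SimpleBloom:
    """A plain fixed-size Bloom filter (one 'row' of the matrix)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing derived from one SHA-256 digest (assumption of this sketch).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


class DynamicBloom:
    """Grows by adding rows lazily; past max_rows it reuses rows round-robin."""

    def __init__(self, num_bits, num_hashes, nr, max_rows):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.nr = nr                # keys per row before moving on
        self.max_rows = max_rows    # hypothetical cap on the number of rows
        self.rows = [SimpleBloom(num_bits, num_hashes)]
        self.active = 0
        self.keys_in_active = 0

    def add(self, key):
        if self.keys_in_active >= self.nr:
            if len(self.rows) < self.max_rows:
                # Lazy expansion: a new row is allocated only when needed.
                self.rows.append(SimpleBloom(self.num_bits, self.num_hashes))
                self.active = len(self.rows) - 1
            else:
                # At the cap: keep writing into existing rows in turn.
                # False positives rise, but memory stays bounded.
                self.active = (self.active + 1) % len(self.rows)
            self.keys_in_active = 0
        self.rows[self.active].add(key)
        self.keys_in_active += 1

    def might_contain(self, key):
        # A hit in any row means the key may be present.
        return any(row.might_contain(key) for row in self.rows)
```

Because rows are never cleared in this sketch, lookups still never produce false negatives after the cap is reached; only the false-positive rate degrades.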

  • Engineers sometimes describe compaction as housekeeping 🧹. That undersells it and can lead to terrible side-effects ⚠️. In Merge-on-Read systems, compaction is really deferred write work 📝. You kept ingestion fast by appending changes to logs instead of rewriting columnar files immediately 🚀. Compaction is the stage where those accumulated deltas get folded back into optimized base files 🔄. Seen that way, the real design question is not “should I compact?” It is “when should I pay back deferred write cost?” ⏳ That is why compaction strategy affects all four of these: • ingest latency ⏱️ • read performance 📈 • storage efficiency 💾 • background resource usage ⚙️ Hudi made that tradeoff explicit early, which is why compaction is a first-class table service rather than an afterthought. #ApacheHudi #DataLakehouse #DataEngineering

  • "MOR isn't merely a storage optimization; it's an architectural shift." 🚀 This is a great breakdown by Sivabalan Narayanan on why Merge-On-Read is about more than just "fast writes"—it's about fundamentally decoupling ingestion-time mutations from storage-time optimization. A must-read for anyone scaling a modern #DataLakehouse.

    MOR isn't merely a storage optimization; it's an architectural shift. Many descriptions of Merge-On-Read simplify it to "COW is for reads, MOR is for writes." While this framing is technically accurate, it lacks analytical depth. MOR fundamentally involves time-shifting work by separating ingestion-time mutation handling from storage-time optimization. This perspective transforms modern lakehouse architecture from a mere collection of features into a coherent outcome of a single design decision. This decision was pioneered by Hudi in 2017, with Iceberg arriving at the same conclusion in 2021 and Delta following in 2023. Hudi was likely ahead of its time, as the broader community did not fully recognize the necessity for such foundational architecture. The post explores the architectural argument, the genuine costs of MOR, the timeline gap between Hudi and other formats, and production evidence, including ByteDance managing 400 PB on a single Hudi MOR table and Walmart identifying MOR as "the only open file format able to handle" their update-heavy workload. This is the first in a series as I resume my technical writing after a hiatus. More insights will be shared in the coming months. Link in comments. #DataLakehouse #ApacheHudi #MergeOnRead #DataEngineering

    • Timeline showing when each lakehouse format introduced Merge-On-Read: Hudi in March 2017, Iceberg position+equality deletes in August 2021, Delta deletion vectors in April 2023, Iceberg v3 deletion vectors in April 2025. Callouts show ByteDance running a 400 PB single Hudi MOR table in 2021, and Walmart finding Hudi MOR was the only open file format able to handle their update-heavy workload in 2023.
  • The Peloton Hudi talk starts with the actual pain, not the technology ⚠️. The old architecture relied on daily snapshots from PostgreSQL 📸. That created four predictable problems: • Reporting waited for snapshots to finish ⏳ • Recommender systems were stuck on stale data 🗑️ • Analytics stayed tightly coupled to operational systems 🔗 • Service migrations became all-at-once exercises 🚚 This is why snapshot-based ingestion architectures age badly once the business starts demanding fresher data and more independent service evolution 📈. Peloton’s move to CDC-driven ingestion with Hudi was not just about speed. It was about decoupling the platform from an outdated batch ingestion model. That is the more interesting lesson 💡. #ApacheHudi #DataLakehouse #DataEngineering #CDC

  • Apache Hudi reposted this

    I rejected Snowflake, Databricks, Iceberg, and Redshift. Then picked two winners for a multi TB lakehouse. Here is why (and what I would change in 2026): In one of my projects: → 300k+ events/sec → 20K operational centres → CDC-heavy → Sub-minute freshness required I evaluated everything. Here is what happened: a). 𝗥𝗲𝗱𝘀𝗵𝗶𝗳𝘁: Cost model broke at our write volume. Not built for high-frequency upserts. Gone in week 1. b). 𝗦𝗻𝗼𝘄𝗳𝗹𝗮𝗸𝗲: Snowpipe could not match our throughput. Per-credit pricing at 20+ TB with continuous writes = blank cheque. Beautiful product, wrong workload. c). 𝗗𝗲𝗹𝘁𝗮 𝗟𝗮𝗸𝗲/𝗗𝗮𝘁𝗮𝗯𝗿𝗶𝗰𝗸𝘀: We were on AWS EMR. Migrating = rewriting everything + accepting pricing we could not control. Running Delta outside Databricks in 2023? Multi-engine support was not there. Platform tax or ecosystem gaps: either way, it didn't fit. d). 𝗜𝗰𝗲𝗯𝗲𝗿𝗴: Wanted this to win. Metadata design is beautiful. But compaction overhead = 3× Hudi's for our write pattern and no mature CDC-native Upsert support. Read-heavy? Pick Iceberg. Write-heavy CDC at scale? Was not ready. e). 𝗛𝘂𝗱𝗶 (Primary lakehouse): Built at Uber for this exact problem. MoR tables, async compaction, native CDC Upserts. (Though we found a few bugs in MOR compaction, so we worked with Hudi committers and ran a forked version.) --> Result: 2-min query latency. 99% SLA. Real-time tracking shipped. Wait, we have another winner. f). 𝗕𝗶𝗴𝗤𝘂𝗲𝗿𝘆: For one system, we went a different route. Kafka Connect → BigQuery directly. Ran upserts natively. Optimized clustering, partitioning, query pruning: squeezed it to the absolute core. (Our GCP account manager probably thought we were broke. We were just efficient. BigQuery is a beauty, if you know which knobs to turn.) Latency: ~10 min (heavy process, limited slots due to cost constraints). Acceptable for this use case. 𝗧𝗵𝗲 𝗿𝗲𝗮𝗹 𝗹𝗲𝘀𝘀𝗼𝗻: We did not pick ONE tool. We picked TWO. Apache Hudi for write-heavy CDC. BigQuery for managed simplicity (15-min latency, acceptable SLA). 
Different problems. Different tools. Same platform team. Anyone telling you there is one best answer for everything has not operated at scale. 𝟮𝟬𝟮𝟲 𝘂𝗽𝗱𝗮𝘁𝗲: Iceberg has matured massively. Snowflake supports Iceberg natively. Databricks open-sourced Unity Catalog. I would re-evaluate today. But the framework stays the same: match the tool to the workload, not the hype. Question for you: What's your lakehouse stack? And what tradeoffs did YOU accept? Drop it in the comments.

  • Hudi packs a compact storage engine drawing from BitCask 💡. Our BitCaskDiskMap appends values to one log file. A lightweight in-memory ConcurrentHashMap holds only key metadata: offset, size, timestamp. That keeps memory use low and enables quick offset-based reads 📊. For the random reads that fetch an existing value to merge against, we use a ThreadLocal BufferedRandomAccessFile: no blocking, no locks needed 🔒. Reads recompute and check CRC checksums, catching disk corruption right away, before it hits queries ✅. ThreadLocal caching for compression handlers cuts down on allocations during heavy spills ⚙️. It's the go-to disk layer for ExternalSpillableMap. In MOR merges pushing memory limits, data spills here: fast append writes, efficient random access via offsets 🚀. #ApacheHudi #DataEngineering #BigData #DataLakehouse #Spark
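A minimal Python sketch of the append-only log plus in-memory offset index that this post describes. It is not Hudi's `BitCaskDiskMap`: the names `AppendOnlyDiskMap` and `ValueMetadata` are hypothetical, the checksum is kept in the in-memory index rather than in the file, and each read opens its own file handle as a simple stand-in for a per-thread reader.

```python
import os
import tempfile
import time
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueMetadata:
    offset: int       # where the value starts in the log file
    size: int         # number of bytes written
    crc: int          # checksum recorded at write time
    timestamp: float  # when the value was written


class AppendOnlyDiskMap:
    def __init__(self, path):
        self.path = path
        self.index = {}                  # key -> ValueMetadata, values stay on disk
        self._log = open(path, "ab")
        self._end = os.path.getsize(path)

    def put(self, key, value):
        # Writes are pure appends; an overwrite just repoints the index entry.
        self._log.write(value)
        self._log.flush()
        self.index[key] = ValueMetadata(self._end, len(value), zlib.crc32(value), time.time())
        self._end += len(value)

    def get(self, key):
        meta = self.index[key]
        # A fresh read-only handle per call means readers never share a file position.
        with open(self.path, "rb") as reader:
            reader.seek(meta.offset)
            value = reader.read(meta.size)
        # Recompute the checksum so corruption surfaces at read time.
        if zlib.crc32(value) != meta.crc:
            raise IOError(f"checksum mismatch for key {key!r}")
        return value

    def close(self):
        self._log.close()
```

Stale values for overwritten keys stay in the log in this sketch; that is acceptable for a temporary spill file that is deleted once the merge finishes.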

  • Wide tables make mutation costs ugly. One of the most under-appreciated capabilities in table design is partial updates 🔄. If a wide record changes in only a few fields, full-row rewrite logic is wasteful by default ⚠️. That is especially painful for: • wide dimensions for ML training 🤖 • profile/dimension tables that have heavier records 📊 • sparse attribute updates 📉 • CDC streams with partial changes 🔄 Hudi’s support for partial updates makes write cost scale with the amount of changed information rather than with total row width 🚀. This lets Hudi efficiently absorb changes that touch only a few columns in upstream databases. #ApacheHudi #DataLakehouse #DataEngineering
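To make the idea concrete, here is a small Python sketch of a partial update: only the record key and the changed columns are written, and the full row is reconstructed at merge time. The helper names `changed_fields` and `merge_partial` are hypothetical and do not correspond to Hudi APIs; the sketch also assumes both rows share the same schema.

```python
def changed_fields(old_row, new_row, key_field):
    """Build the partial record: the key plus only the columns that changed."""
    partial = {key_field: new_row[key_field]}
    for column, value in new_row.items():
        if column != key_field and old_row.get(column) != value:
            partial[column] = value
    return partial


def merge_partial(base_row, partial):
    """Overlay the partial record on the base row to recover the full row."""
    merged = dict(base_row)
    merged.update(partial)
    return merged
```

For a row with hundreds of columns where one value changes, the partial record carries two fields instead of the whole row, which is the cost relationship the post describes.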

  • Most in-memory data structures fail hard when they run out of memory ⚠️. Hudi's ExternalSpillableMap exists to prevent large merges from OOMing 💥. It holds two tiers: a standard in-memory HashMap and a disk-backed map (BitCask or RocksDB) 💾. Records stay in memory until a configurable byte threshold is crossed, then transparently spill to disk 📥. The interesting part is how it estimates memory usage 📊. Every 100 records, it recalculates payload size using an exponential moving average: 0.9 * old estimate + 0.1 * new sample. This smooths out variance from differently-sized records without expensive per-record measurement. A 0.8 sizing factor caps the in-memory tier at 80% of the configured limit, leaving 20% headroom and preventing thrashing near the boundary 🛡️. The disk map is lazily initialized: it only materializes when the first spill actually happens. This is the core data structure behind MOR table merges, where a single base file might need to be merged against hundreds of delta log files. Without controlled spilling, these merges would OOM on large partitions. #ApacheHudi #DataEngineering #BigData #DataLakehouse #Spark
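The sampling and spill logic in this post can be sketched as follows. This is a simplified Python illustration, not Hudi's `ExternalSpillableMap`: the class name `SpillableMapSketch` is hypothetical, a plain dict stands in for the disk-backed map, and the `size_of` callback is an assumption that replaces Hudi's own size estimator.

```python
import sys


class SpillableMapSketch:
    SIZING_FACTOR = 0.8   # in-memory tier may use 80% of the configured limit
    SAMPLE_EVERY = 100    # re-sample the record size every 100 puts

    def __init__(self, max_memory_bytes, size_of=sys.getsizeof):
        self.in_memory_budget = int(max_memory_bytes * self.SIZING_FACTOR)
        self.size_of = size_of
        self.memory = {}
        self.disk = None                  # created lazily on the first spill
        self.estimated_record_size = 0.0
        self.puts = 0

    def _update_estimate(self, value):
        if self.estimated_record_size == 0:
            # First record seeds the estimate.
            self.estimated_record_size = float(self.size_of(value))
        elif self.puts % self.SAMPLE_EVERY == 0:
            # Exponential moving average: 0.9 * old estimate + 0.1 * new sample.
            self.estimated_record_size = (
                0.9 * self.estimated_record_size + 0.1 * self.size_of(value)
            )
        self.puts += 1

    def put(self, key, value):
        self._update_estimate(value)
        if key in self.memory:
            self.memory[key] = value
        elif self.disk is not None and key in self.disk:
            self.disk[key] = value
        elif (len(self.memory) + 1) * self.estimated_record_size <= self.in_memory_budget:
            self.memory[key] = value
        else:
            if self.disk is None:
                self.disk = {}            # stand-in for a disk-backed map
            self.disk[key] = value

    def get(self, key):
        if key in self.memory:
            return self.memory[key]
        if self.disk is not None and key in self.disk:
            return self.disk[key]
        raise KeyError(key)
```

Memory use is approximated as record count times the estimated record size, so no per-record measurement is needed between samples.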

  • Bloom filters get all the spotlight in Hudi’s Bloom Index. ✨ But the part that makes it scale happens before you read a single bloom filter. ⚙️ Each Hudi base file stores the min + max record key in its footer. 🧾 Hudi uses those (minKey, maxKey) ranges to build an interval tree per partition. 🌳 When a record shows up, you traverse the tree and shortlist only the files whose key range could even contain it. 🎯 Why it matters: - Naive approach: N files × M incoming records = O(N × M) key-range checks - With the interval tree: O(M log N) In practice, that means you prune hundreds or thousands of files up front, then run Bloom filters on a tiny candidate set. A detail I really like: the tree isn’t self-balancing. Hudi just shuffles the file list before building the tree, so you get expected log depth without dragging in AVL / red-black rotations. Lookup path: Interval tree → a few candidate files → Bloom filter → exact match Range pruning first. Probabilistic filtering second. That combo is why the Bloom Index keeps working even when a partition has thousands of files.
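The range-pruning step can be sketched in Python as an unbalanced interval tree built from a shuffled file list. This illustrates the technique and is not Hudi's implementation; the names `RangeNode`, `build_tree`, and `find_candidates` are hypothetical, and the `seed` parameter exists only to make the example repeatable.

```python
import random


class RangeNode:
    def __init__(self, lo, hi, file_id):
        self.lo = lo
        self.hi = hi
        self.file_id = file_id
        self.max_hi = hi      # largest max key anywhere in this subtree
        self.left = None
        self.right = None


def _insert(node, lo, hi, file_id):
    # Plain binary-search-tree insert ordered by min key; no rotations.
    if node is None:
        return RangeNode(lo, hi, file_id)
    if lo < node.lo:
        node.left = _insert(node.left, lo, hi, file_id)
    else:
        node.right = _insert(node.right, lo, hi, file_id)
    node.max_hi = max(node.max_hi, hi)
    return node


def build_tree(file_ranges, seed=None):
    """file_ranges: iterable of (file_id, min_key, max_key)."""
    shuffled = list(file_ranges)
    # Shuffling gives expected logarithmic depth without self-balancing.
    random.Random(seed).shuffle(shuffled)
    root = None
    for file_id, lo, hi in shuffled:
        root = _insert(root, lo, hi, file_id)
    return root


def find_candidates(root, key):
    """Return ids of files whose [min_key, max_key] range contains key."""
    matches = []

    def visit(node):
        if node is None or key > node.max_hi:
            return            # nothing in this subtree can contain the key
        visit(node.left)
        if node.lo <= key:
            if key <= node.hi:
                matches.append(node.file_id)
            visit(node.right)  # right subtree only has min keys >= node.lo

    visit(root)
    return matches
```

Only the files returned by `find_candidates` would then need their Bloom filters checked.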

  • Stream-table duality is one of those key ideas in data systems that just flips a light bulb on once you get it. 💡 A table? It's basically a sequence of changes built up over time. 📊 A stream of changes? You can roll it up into the current table state. 🔄 Databases nail this with logs and CDC. But data lakes? They got stuck with immutable tables forever, focusing on files and big batch snapshots. 🚀 That's where Apache Hudi steps in—it weaves this duality right into the lakehouse. 🛠️ Ingest CDC or any mutable data into a "table", then query that table as an incremental "stream" of changes. 📈 Suddenly, your lake isn't just dumb storage. It's a smart, stateful foundation for pipelines. For many teams, this flips the script from 'data dumps into lake storage' to 'data powers real-time processing.' #ApacheHudi #DataLakehouse #DataEngineering
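A toy Python sketch of the duality: a changelog is folded into table state, and the table is then read back as a stream of rows changed after a given commit. The function names `materialize` and `incremental_query` and the tuple layout are assumptions for illustration, not Hudi APIs. Deletes are applied to the table but do not appear in this simplified incremental read.

```python
def materialize(changelog):
    """Fold (commit_time, op, key, value) changes into current table state."""
    table = {}
    for commit_time, op, key, value in changelog:
        if op == "delete":
            table.pop(key, None)
        else:
            # Each row remembers the commit that last wrote it.
            table[key] = (value, commit_time)
    return table


def incremental_query(table, after_commit):
    """Read the table as a stream: rows written after the given commit."""
    changed = [
        (commit_time, key, value)
        for key, (value, commit_time) in table.items()
        if commit_time > after_commit
    ]
    return sorted(changed)
```

A downstream pipeline that stores the last commit time it processed can call `incremental_query` repeatedly and receive only new changes each time.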

