MayaNAS

Beyond the FUSE Debate: Why Native ZFS on Object Storage Changes Everything

MinIO says filesystem on object store is a bad idea. JuiceFS disagrees. They're both missing the point. MayaNAS with objbacker.io isn't FUSE—it's native ZFS with kernel-level object storage integration.

December 26, 2024 · 12 min read · Supramani Sammandam

The Debate That Misses the Point

In early 2024, MinIO published a provocative article titled "Filesystem on Object Store is a Bad Idea". JuiceFS responded with their rebuttal, defending POSIX filesystems on object storage.

Both make valid points. Both miss the bigger picture. There's a third approach that sidesteps the entire FUSE debate: native ZFS integration with object storage via objbacker.io.

MinIO's Argument: FUSE Is Fundamentally Broken

MinIO's Position

MinIO argues that layering POSIX over object storage creates fundamental incompatibilities:

  • Performance degradation — POSIX is "IOPS-centric, chatty, expensive and hard to scale"
  • Semantic incompatibility — Object storage relies on atomic, immutable operations
  • Data integrity risks — Uncommitted data can be lost during crashes
  • Security gaps — POSIX permissions can't map to S3 IAM

To prove their point, MinIO benchmarked s3fs-fuse copying a 10GB file. The result? Over 5 minutes with I/O errors, versus their native S3 API.

Here's the problem: they generalized an entire architectural approach based on ONE implementation—and they picked the worst one. s3fs-fuse is widely known to be rudimentary. It's a rookie mistake to dismiss an entire category based on its weakest example.

They also never showed their own mc cp timing for comparison. Classic strawman argument.

The Fundamental Flaw in MinIO's Argument

MinIO claims: "There is simply no need for a filesystem utility in the middle of MinIO and your application!"

This ignores a fundamental reality: Linux syscalls are POSIX.

Reality check: Applications call read(), write(), open(), stat(). These are POSIX syscalls. Until Linux provides native S3 syscalls (it doesn't), applications need a POSIX interface. Period.
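
A quick way to see this on any Linux box is to trace a trivial command and watch the POSIX calls go by. A minimal sketch; the traced command and file are illustrative:

# Every file access is openat/read/write/close: POSIX all the way down
strace -e trace=openat,read,write,close cat /etc/hostname > /dev/null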

Not every application can be rewritten to use S3 SDKs. Consider:

  • HPC workloads — Supercomputers and scientific computing run on Lustre, GPFS, BeeGFS—all POSIX
  • Legacy applications — Millions of lines of code expecting file I/O
  • Databases — PostgreSQL, MySQL, Oracle all use file I/O
  • ML frameworks — PyTorch, TensorFlow checkpoint to files
  • Analytics tools — Spark, Hadoop expect HDFS/POSIX interfaces
  • Media workflows — Video editing, rendering pipelines use files

The world's fastest supercomputers—Frontier, Aurora, LUMI—all use POSIX filesystems. National labs, research institutions, and enterprises have decades of code built on POSIX I/O. Telling them to "just use S3" is disconnected from reality.

And what about decades of data management practices?

  • Snapshots — Instant, space-efficient point-in-time copies
  • Clones — Writable snapshots for dev/test
  • Replication — Efficient block-level sync
  • Compression — Transparent, algorithm-selectable
  • Checksums — End-to-end data integrity
  • Scrubbing — Proactive corruption detection
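
Every item above maps to a first-class ZFS command rather than a bolt-on tool. A minimal sketch, with hypothetical pool and dataset names:

# Atomic point-in-time snapshot of an entire dataset
zfs snapshot tank/projects@before-upgrade

# Writable clone of that snapshot for dev/test
zfs clone tank/projects@before-upgrade tank/projects-test

# Incremental, block-level replication to another host
zfs send -i @nightly tank/projects@before-upgrade | ssh backup zfs receive tank/projects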

"But S3 has versioning!" Yes—at the object level, not the bucket level.

Capability | S3 Versioning | ZFS Snapshots
Scope | Per-object | Entire filesystem (atomic)
Point-in-time consistency | No (each object versioned independently) | Yes (all files consistent at snapshot moment)
Rollback entire dataset | No (must restore objects individually) | Yes (instant rollback)
Space efficiency | Full copy per version | Copy-on-write (only deltas stored)
Writable clones | No | Yes (instant, space-efficient)
Send/Receive replication | No | Yes (incremental block-level sync)

With S3 versioning, you can't say "restore my bucket to how it was at 3pm yesterday." You'd have to script through every object, find the right version, restore each one—and hope nothing was added or deleted in between. That's not data management. That's data archaeology.
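
With ZFS, that exact request is one command, assuming a snapshot from 3pm exists (names hypothetical):

# Atomic rollback of the whole dataset; -r discards any later snapshots
zfs rollback -r tank/projects@2024-12-25_1500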

Object storage has none of this natively. MinIO's answer? "Rewrite your applications." That's not a solution—it's an abdication.

JuiceFS's Rebuttal: Implementation Matters

JuiceFS's Position

JuiceFS correctly points out that s3fs-fuse is not a real filesystem—it's a protocol converter. Their approach:

  • Separate metadata — Redis/TiKV for fast metadata operations
  • Intelligent chunking — Optimized data splitting and caching
  • POSIX compatibility — Built from the ground up, not bolted on

JuiceFS ran the fair comparison MinIO avoided:

Method | 10GB Write Time | Notes
MinIO mc cp | 27.65s | Native S3 multipart upload
JuiceFS POSIX | 28.10s | FUSE + Redis metadata
s3fs-fuse | 3m 6s (6x slower) | Temp file + upload

JuiceFS proved that a well-implemented POSIX layer can match native S3 performance. But they're still using FUSE. They still need external metadata (Redis). They're still working around the fundamental architecture rather than solving it.

The Third Way: Native ZFS Integration

MayaNAS + objbacker.io

What if you didn't need FUSE at all? What if the filesystem natively understood object storage as a block device tier?

  • No FUSE — objbacker.io is a native ZFS VDEV (kernel-level)
  • No external metadata — ZFS special VDEV keeps metadata on SSD
  • Already POSIX — ZFS is battle-tested POSIX, not retrofitted
  • Hybrid architecture — Hot data on NVMe, cold data on object storage

MayaNAS doesn't put a filesystem on top of object storage. It extends ZFS to use object storage as a block device tier. The distinction is fundamental.
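
In practice, creating a hybrid pool looks like ordinary ZFS administration. A sketch only: the data device below stands in for the objbacker.io-backed vdev (its actual device name and setup syntax are product-specific), while the special vdev syntax is standard ZFS:

# Hybrid pool: object-storage vdev for bulk data, NVMe special vdev
# for metadata and small blocks (device names are hypothetical)
zpool create tank /dev/objbacker0 special /dev/nvme0n1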

Why objbacker.io Is Different

┌─────────────────────────────────────────────────────────────────┐
│                     APPLICATION LAYER                           │
│                    (NFS/SMB Clients)                           │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                         ZFS (Kernel)                            │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │   ARC Cache  │  │  L2ARC SSD   │  │   ZIL (SLOG on SSD)  │  │
│  └──────────────┘  └──────────────┘  └──────────────────────┘  │
│                              │                                  │
│  ┌───────────────────────────┴───────────────────────────────┐ │
│  │                      VDEV Layer                            │ │
│  │  ┌─────────────────┐         ┌─────────────────────────┐  │ │
│  │  │  Special VDEV   │         │    objbacker.io VDEV    │  │ │
│  │  │  (NVMe SSD)     │         │    (Object Storage)     │  │ │
│  │  │                 │         │                         │  │ │
│  │  │  • Metadata     │         │  • 1MB blocks           │  │ │
│  │  │  • Small blocks │         │  • Large files          │  │ │
│  │  │  • Dedup tables │         │  • Cold data            │  │ │
│  │  └────────┬────────┘         └────────────┬────────────┘  │ │
│  └───────────┼───────────────────────────────┼───────────────┘ │
└──────────────┼───────────────────────────────┼─────────────────┘
               │                               │
               ▼                               ▼
        ┌──────────────┐              ┌──────────────────┐
        │  Local NVMe  │              │   S3/GCS/Azure   │
        │    SSD       │              │  Object Storage  │
        └──────────────┘              └──────────────────┘

Key Architectural Differences

Aspect | s3fs-fuse | JuiceFS | MayaNAS + objbacker.io
FUSE Required | Yes | Yes | No (native)
Metadata Storage | Object storage | Redis/TiKV | ZFS special VDEV on SSD (integrated)
External Dependencies | None | Redis/TiKV cluster | None (self-contained)
POSIX Compliance | Partial | High (engineered for it) | Full (it's ZFS)
Data Integrity | Limited | Good | Best (ZFS checksums + CoW)
Caching | Basic | Client-side | ARC + L2ARC in kernel
Snapshots | No | Yes | ZFS native (instant)

The Metadata Problem—Solved by ZFS Architecture

MinIO's core criticism is valid: metadata operations on object storage are slow. Every ls, every stat, every directory traversal becomes an HTTP request.

JuiceFS solved this by adding Redis. Now you need a Redis cluster. More infrastructure. More failure modes.

MayaNAS Solution: ZFS special VDEV. Configure a small NVMe SSD as a special device, and ZFS automatically stores all metadata and small blocks there. No external database. No additional infrastructure. The filesystem handles it natively.

The ZFS special VDEV stores:

  • All metadata — Directory entries, file attributes, extended attributes
  • Small blocks — Files smaller than a configurable threshold (e.g., 64KB)
  • Deduplication tables — If dedup is enabled

Only large data blocks (1MB recordsize) go to object storage via objbacker.io. This means a 10GB file becomes ~10,000 objects—but metadata operations hit local NVMe at microsecond latency.
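
In ZFS terms, that routing is just two dataset properties (pool and dataset names hypothetical):

# Metadata always lands on the special vdev; additionally route any
# block smaller than 64K there instead of to object storage
zfs set special_small_blocks=64K tank/fs

# 1MB records are what objbacker.io ships to object storage
zfs set recordsize=1M tank/fs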

Performance: Real Numbers

Understanding the Benchmark Methodology

MinIO's 10GB file copy test is equivalent to running:

fio --name=test --size=10G --numjobs=1 --rw=write --bs=1M --direct=1

That's a single-threaded sequential write. It's the simplest possible storage benchmark. And here's the critical limitation they don't mention:

mc cp cannot scale. It writes to one object in one bucket. No multiple buckets. No multiple prefixes. No parallelism. It's fundamentally single-stream, capped by single-object PUT throughput limits.

Real workloads don't operate this way. They have concurrent I/O, multiple files, parallel threads. This is where architecture matters.

MayaNAS Validated Throughput

  • AWS (6 buckets): 3.7 GB/s read, 2.5 GB/s write
  • GCP (20 buckets): 8.14 GB/s read, 6.2 GB/s write

Test configurations:
AWS: c5n.9xlarge, 6 S3 buckets, 1MB recordsize, special VDEV on NVMe
GCP: n2-standard-48, 20 GCS buckets, 75 Gbps TIER_1 networking

Test 1: 10GB File Copy (mc cp equivalent)

MinIO's benchmark used mc cp with a 10GB taxi dataset CSV. We replicated this with a simple Linux cp command—the most straightforward comparison possible:

# Create 10GB test file (random data, incompressible)
dd if=/dev/urandom of=/tmp/10gb-testfile bs=1M count=10240

# Copy to MayaNAS (ZFS + objbacker.io)
time cp /tmp/10gb-testfile /minio-pool/testfs/

Method | Time | Throughput | How It Works
MinIO mc cp | ~28s | ~360 MB/s | Multipart to single object in single bucket
JuiceFS POSIX | ~28s | ~360 MB/s | FUSE + chunking + Redis coordination
s3fs-fuse | 3m 6s | ~55 MB/s | Temp file → single upload
MayaNAS cp | 3.97s (7x faster) | 2.52 GB/s | Parallel 1MB blocks across multiple buckets

Verified with dstat: Network send during the 4-second test showed 2.0-2.5 GB/s sustained, peaking at 3.04 GB/s. This is actual bytes hitting cloud object storage—not cached, not local.
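
For reference, this is the kind of one-liner behind that check; dstat just samples interface counters (the exact flags from the original run aren't shown, so take this as a sketch):

# Sample network send/receive once per second during the copy
dstat --net 1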

Same test with fio for precise measurement:

fio --name=test --size=10G --numjobs=1 --rw=write --bs=1M \
    --ioengine=psync --directory=/pool/test --end_fsync=1

# Result: 10GB in 5.13s = 2.09 GB/s (with explicit fsync)

The difference? objbacker.io writes to multiple buckets in parallel. Each bucket handles ~400 MB/s, and they aggregate. No FUSE context switches. No userspace overhead. No Redis round-trips.

The Scaling Advantage mc cp Can Never Match

Here's what MinIO can't do: mc cp writes a single object to a single bucket. Even with multipart upload, it's still one destination. You can't stripe across buckets. You can't use multiple prefixes in parallel. The architecture has a hard ceiling.

Configuration | mc cp | MayaNAS + objbacker.io
1 bucket | ~360 MB/s (max) | ~400 MB/s
6 buckets | ~360 MB/s (can't use them) | 2.5 GB/s (6x parallel)
20 buckets | ~360 MB/s (can't use them) | 6.2 GB/s (20x parallel)

With objbacker.io, adding more buckets directly increases throughput. The ZFS VDEV layer stripes I/O across all configured buckets automatically. This is the architectural advantage of treating object storage as block devices rather than as a file destination.
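
One rough way to observe the striping from outside, assuming hypothetical bucket names (how objbacker.io lays out and names its objects is internal to the product):

# After a large write, each bucket should hold a comparable share
# of the 1MB-sized blocks
for b in mayanas-b1 mayanas-b2 mayanas-b3; do
    echo -n "$b: "
    aws s3 ls "s3://$b" --recursive | wc -l
done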

Read Performance: Where ZFS Really Shines

What about reading? S3 supports byte-range GET requests, and mc cp can use ~4 parallel connections for downloads. But again—it's reading one object from one bucket.

ZFS has aggressive read-ahead prefetching for sequential workloads:

  • Pattern detection — ZFS detects sequential access and prefetches ahead
  • Parallel bucket reads — With objbacker.io, prefetch triggers concurrent GETs across all buckets
  • ARC cache — Hot data stays in memory, eliminating repeated object fetches
  • L2ARC on SSD — Warm data cached on local NVMe for sub-millisecond access
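
Prefetch is enabled by default in OpenZFS; on Linux you can confirm it hasn't been turned off:

# 0 means ZFS read-ahead prefetch is active
cat /sys/module/zfs/parameters/zfs_prefetch_disable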

10GB Cold Read Benchmark (cache dropped)

# Cache dropped before test to ensure cold read from object storage:
echo 3 | sudo tee /proc/sys/vm/drop_caches

fio --name=test --size=10G --numjobs=1 --rw=read --bs=1M \
    --ioengine=psync --directory=/pool/test

Method | 10GB Cold Read | Throughput
mc cp (S3 GET) | ~28s | ~360 MB/s
MayaNAS + objbacker.io | 8.4s | 1.27 GB/s (3.5x faster)

Peak throughput reached 1.98 GB/s before the 10GB file ended—ZFS prefetch was still ramping up. With larger files, throughput stabilizes even higher as prefetch fully engages across all buckets.

Read Behavior | mc cp | MayaNAS + objbacker.io
Parallel connections | ~4 (single object) | Unlimited (across all buckets)
Prefetching | None | ZFS automatic read-ahead
Caching | None (re-fetch every time) | ARC (memory) + L2ARC (SSD)
Repeat reads | Full object fetch | Cache hit (microseconds)

For sustained sequential reads with larger datasets, ZFS prefetch combined with multi-bucket striping delivers 8.14 GB/s on GCP—far beyond what any single-object download can achieve.

Why ZFS 1MB Recordsize Matters

ZFS with objbacker.io uses 1MB recordsize. A 10GB file becomes ~10,000 S3 PUT operations. This is optimal for object storage because:

  • Reduced API costs — Fewer PUT requests than smaller blocks
  • Better throughput — Each PUT transfers meaningful data
  • TRIM support — Deleted blocks are removed from object storage (cost savings)
  • Compression efficiency — 1MB gives LZ4/ZSTD good compression ratios
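
The API-cost point above is easy to quantify: ZFS's default recordsize is 128K, which would need eight times the requests for the same file:

# PUT requests to write a 10 GiB file, by recordsize
echo $((10 * 1024))       # 1M records:   10,240 PUTs
echo $((10 * 1024 * 8))   # 128K records: 81,920 PUTs (8x the API cost)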

Test 2: Pandas Small-File Iterations (JuiceFS benchmark)

JuiceFS also tested a pandas workload: 100 iterations of read-modify-write on a small CSV file. This tests metadata operations and small-file handling:

# 100 iterations: read CSV → append row → write CSV
import pandas as pd

test_file = "/pool/test/data.csv"                # path on the MayaNAS mount (illustrative)
new_row = pd.DataFrame([{"value": 1}])           # example row; column name is illustrative
new_row.to_csv(test_file, index=False)           # seed the file so the first read succeeds

for i in range(100):
    df = pd.read_csv(test_file)
    df = pd.concat([df, new_row], ignore_index=True)
    df.to_csv(test_file, index=False)

Method | 100 Iterations | Per Operation
MinIO direct | 0.83s | 8.3ms
s3fs-fuse | 0.78s | 7.8ms
JuiceFS POSIX | 0.43s | 4.3ms
MayaNAS | 0.12s (3.6x faster) | 1.2ms

Why is MayaNAS so fast? ZFS ARC (Adaptive Replacement Cache):

100 pandas iterations (0.12s total):
     ↓
All reads/writes hit ZFS ARC (memory)
     ↓
TXG commit at end → single write to storage
     ↓
File size: 1,508 bytes → tiny, unnoticeable commit

ZFS keeps hot data in ARC during the 100 iterations. The final commit is fast because of layered intelligence:

  • ZFS ARC — Hot data stays in memory during iterations
  • ZFS Special VDEV — If configured, small files commit directly to local NVMe SSD
  • objbacker.io — Even without special VDEV, handles small/unaligned writes efficiently

No 100 round-trips to object storage. No API cost explosion.

This is intelligent tiering. MinIO's approach (direct S3 writes) means every iteration hits object storage—100 PUT operations, 100 latency penalties, 100x the API cost. ZFS + objbacker.io batches intelligently at multiple layers. You get POSIX semantics without the overhead.
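
You can watch this batching from the pool itself: the object-storage vdev sits idle during the iterations, then shows a single small burst at TXG commit (pool name hypothetical):

# Per-vdev I/O, sampled every second; writes appear only at TXG commits
zpool iostat -v tank 1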

When Each Approach Makes Sense

Use MinIO Directly When:

  • Your application is S3-native (data lakes, analytics)
  • You don't need POSIX semantics
  • Simple archival storage

Use JuiceFS When:

  • You need POSIX but can manage Redis infrastructure
  • Kubernetes-native deployments
  • You're already invested in their ecosystem

Use MayaNAS + objbacker.io When:

  • You need maximum performance with POSIX
  • You want no external dependencies (no Redis, no FUSE)
  • You need enterprise features: snapshots, clones, replication, compression
  • You have mixed workloads: hot data on SSD, cold data on object storage
  • You want data integrity: ZFS checksums, scrubbing, self-healing
  • You're running NFS/SMB for traditional file sharing

Conclusion: Architecture Matters

MinIO's argument boiled down to: "Don't put a filesystem between your application and object storage." But here's what our benchmarks prove—having a filesystem in between doesn't break anything. It makes everything better.

Pandas doesn't slow down with MayaNAS between it and object storage. It speeds up. cp doesn't become a bottleneck. It becomes 7x faster than mc cp. Why? Because an intelligent filesystem layer does what applications cannot:

  • Intelligent caching: Hot data stays in ARC, cold data lives in object storage
  • Transaction batching: 100 writes become a handful of efficient TXG commits
  • Parallel streaming: Multi-bucket striping saturates network bandwidth
  • Aggressive read-ahead: ZFS prefetches what your application will need next
  • Write coalescing: Small random writes become large sequential objects

MinIO generalized from one bad implementation (s3fs-fuse) to condemn all filesystems. That's like condemning all databases because one SQL implementation was slow.

Architecture matters. MayaNAS with objbacker.io isn't a filesystem bolted onto object storage. It's ZFS, the most advanced filesystem in production, extended to use object storage as a native tier. The result: your applications run faster, not slower, with a filesystem in between.

The question was never "filesystem or object storage?" The question is "Why would you talk directly to object storage when such an intelligent filesystem layer is available?"

objbacker.io answers that question.

Try MayaNAS

Deploy MayaNAS on AWS, Azure, or GCP with Terraform. Full ZFS functionality with object storage economics.

GitHub Learn More