[6.18] Track btrfs patches by kakra · Pull Request #40 · kakra/linux

kakra · 2025-12-14T01:03:18Z

Export patch series: https://github.com/kakra/linux/pull/40.patch

btrfs: tiered allocation hints and queue-based read balancing

Special Thanks to @Forza-tng for extensive testing, feedback, and maintaining the documentation guide:
👉 Btrfs Allocator Hints and Read Policies Guide by Forza-tng

This PR introduces a set of patches to improve Btrfs performance and flexibility in heterogeneous storage environments (mixed SSD/HDD, tiered storage, bcache).

1. Allocator Hints (Data Placement)

Allows preferring specific devices for data or metadata allocations. This works by storing a hint in the persistent device item on-disk.

Essential for tiered setups: Force metadata onto SSDs (for speed) while keeping bulk data on HDDs.
Graceful removal: Mark devices to accept no new allocations, so you can drain them naturally or via balance without racing against new writes.

2. Read Balancing Policies

Extends Btrfs RAID1 read balancing with a dynamic queue policy. The standard PID-based policy is often insufficient for mixed-device pools or high-IOPS workloads.

pid: (Default) Static hashing by process ID. Good for simple setups.
round-robin: Distributes reads equally. Good for aggregate throughput on identical disks.
queue: (Recommended) Routes requests to the device with the fewest in-flight requests (shortest queue).
- Adapts instantly to device load and speed differences.
- Avoids "stalling" on busy devices.
- In benchmarks, this policy consistently delivered the highest IOPS and lowest latency, especially under mixed load.
devid: Pin reads to a specific device ID (mostly for testing).

(Note: Previous experimental latency-based policies were dropped in favor of queue due to better stability and lower complexity.)
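
To make the difference between the policies concrete, here is a minimal Python sketch of how each one might choose a mirror. It is only a model of the behaviour described above, not the kernel code: the `Mirror` class, its `in_flight` field, and the function names are hypothetical.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Mirror:
    devid: int
    in_flight: int  # hypothetical: requests currently queued on this device


def pick_pid(mirrors, pid):
    # Static hashing: the same process always reads from the same mirror.
    return mirrors[pid % len(mirrors)]


_rr_counter = count()


def pick_round_robin(mirrors):
    # Rotate through the mirrors regardless of how busy each one is.
    return mirrors[next(_rr_counter) % len(mirrors)]


def pick_queue(mirrors):
    # Choose the mirror with the fewest in-flight requests;
    # ties go to the first mirror in the list.
    return min(mirrors, key=lambda m: m.in_flight)


mirrors = [Mirror(devid=1, in_flight=12), Mirror(devid=2, in_flight=3)]
print(pick_pid(mirrors, pid=4242).devid)  # depends only on the PID
print(pick_queue(mirrors).devid)          # depends only on the current load
```

In this model, `pid` keeps sending a process to devid 1 even while that device is busy, whereas `queue` moves to devid 2 as soon as its queue is shorter.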

3. Decoupling from Experimental Status

Important: Upstream Kernels (6.13+) use CONFIG_BTRFS_EXPERIMENTAL to gate various unstable work-in-progress features. To allow using the allocator hints and read policies without enabling potentially unstable upstream code, these features have been moved out of the experimental gate.

Recommendation: Remove the line CONFIG_BTRFS_EXPERIMENTAL from your .config before running make oldconfig. The build system will then prompt you specifically for the new options (CONFIG_BTRFS_ALLOCATOR_HINTS, CONFIG_BTRFS_READ_POLICIES), allowing you to enable them safely without turning on other experimental Btrfs features.

Quickstart Guide

Setting Allocator Hints

Enable CONFIG_BTRFS_ALLOCATOR_HINTS in kernel config.
Run btrfs device usage /mnt/path to identify your device IDs.
Set the hint via sysfs (this persists on-disk, no udev rules needed!):
echo <TYPE> | sudo tee /sys/fs/btrfs/<UUID>/devinfo/<DEVID>/type

Available Types:

0: Prefer data (Default for HDDs).
1: Prefer metadata (Recommended for SSDs/NVMe).
2: Metadata only (Use with caution).
3: Data only (Use with caution).
4: None preferred (Avoids new allocations, useful to drain a drive).
Added: 5: None (Strictly prevents ANY new allocation, useful for parallel device remove).

After changing hints, a rebalance of metadata/data is required to move existing extents to their preferred location.
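
As a rough mental model of the type values listed above, the following Python sketch orders devices for a new chunk. The tiering is an assumption made for illustration: devices that prefer or are dedicated to the chunk type come first, devices preferring the other type act as fallback, "none preferred" is a last resort, and everything else is never used. The patches themselves may rank devices differently, and the names `Hint`, `TIERS`, and `candidates` are hypothetical.

```python
from enum import IntEnum


class Hint(IntEnum):
    # Values match the "Available Types" list.
    PREFER_DATA = 0
    PREFER_METADATA = 1
    METADATA_ONLY = 2
    DATA_ONLY = 3
    NONE_PREFERRED = 4
    NONE = 5


# Assumed fallback tiers per chunk type (0 = first choice).
# Hints missing from a table are never eligible for that chunk type.
TIERS = {
    "data": {
        Hint.DATA_ONLY: 0,
        Hint.PREFER_DATA: 0,
        Hint.PREFER_METADATA: 1,
        Hint.NONE_PREFERRED: 2,
    },
    "metadata": {
        Hint.METADATA_ONLY: 0,
        Hint.PREFER_METADATA: 0,
        Hint.PREFER_DATA: 1,
        Hint.NONE_PREFERRED: 2,
    },
}


def candidates(devices, chunk):
    """Return device IDs ordered by tier, dropping ineligible devices."""
    table = TIERS[chunk]
    eligible = [(table[h], devid) for devid, h in devices.items() if h in table]
    return [devid for _, devid in sorted(eligible)]


# Invented pool: two HDDs, two SSDs, one device being drained.
devices = {
    1: Hint.PREFER_DATA,
    2: Hint.PREFER_DATA,
    4: Hint.PREFER_METADATA,
    6: Hint.METADATA_ONLY,
    7: Hint.NONE,
}
print(candidates(devices, "metadata"))
print(candidates(devices, "data"))
```

Under these assumptions, devid 6 never receives data chunks and devid 7 receives nothing at all, which is why existing chunks only move after a balance.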

Enabling Read Policy

Enable CONFIG_BTRFS_READ_POLICIES in kernel config.
Set boot parameter: btrfs.read_policy=queue
Or switch at runtime: echo queue | sudo tee /sys/fs/btrfs/<UUID>/read_policy

Diagnostic Statistics

Adds per-device read statistics to /sys/fs/btrfs/<UUID>/devinfo/<DEVID>/read_stats.

ios %lu wait %llu avg %llu age %llu ignored %llu

ios: Total read I/O count.
wait: Total accumulated wait time (ns).
avg: Cumulative average read latency (ns).
age: "Fairness" counter. Increments when the device is skipped/ignored during selection. Resets to 0 when selected. A constantly high age indicates the device is being avoided by the policy.
ignored: Total count of times this device was a candidate but skipped.
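
For scripting, the key/value layout shown above is easy to parse. The following Python sketch assumes the file holds a single line in exactly that format; the sample values are invented for illustration and `parse_read_stats` is a hypothetical helper name.

```python
def parse_read_stats(line):
    """Parse 'ios N wait N avg N age N ignored N' into a dict of ints."""
    tokens = line.split()
    if len(tokens) % 2:
        raise ValueError(f"unexpected read_stats format: {line!r}")
    return {key: int(value) for key, value in zip(tokens[::2], tokens[1::2])}


# Invented sample line; real values come from
# /sys/fs/btrfs/<UUID>/devinfo/<DEVID>/read_stats
sample = "ios 1200 wait 3600000000 avg 3000000 age 0 ignored 57"
stats = parse_read_stats(sample)
print(stats["avg"] / 1e6, "ms average read latency")
```

Comparing `ios` and `ignored` across the devices of one filesystem gives a quick picture of how the active policy distributes reads.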

Benchmark Results

The following benchmarks (based on kernel 6.12 LTS) compare the new policies against the defaults. Tests were performed on a mixed HDD RAID10 array with bcache, comparing an idle system vs. a system under heavy background load (defrag).

queue proved to be the superior all-rounder, effectively isolating foreground workloads from background noise.

Scenario: No Background Load

Policy	RandRead 4k QD1 (Lat)	RandRead 4k QD32 (IOPS)	SeqRead 1M (BW)
`pid`	65 IOPS	537 IOPS	261 MiB/s
`round-robin`	241 IOPS	1180 IOPS	231 MiB/s
`latency-rr`*	702 IOPS	2477 IOPS	240 MiB/s
`queue`	1181 IOPS	3647 IOPS	272 MiB/s

Scenario: Heavy Background Load (Defrag)

Policy	RandRead 4k QD1 (Lat)	RandRead 4k QD32 (IOPS)	SeqRead 1M (BW)
`pid`	~0 IOPS (Stalled)	505 IOPS	121 MiB/s
`round-robin`	38 IOPS	717 IOPS	126 MiB/s
`latency-rr`*	585 IOPS	1562 IOPS	235 MiB/s
`queue`	967 IOPS	2437 IOPS	247 MiB/s

(* latency-rr was an experimental hybrid policy used during testing. It was superseded by queue due to better performance and simplicity.)
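
To put the tables into relative terms, this short Python sketch computes the ratio of each policy's results to the `pid` baseline. The figures are copied from the tables above; treating the stalled `pid` run as 0 IOPS is an assumption, and `speedup` is a hypothetical helper name.

```python
# Per policy: (RandRead QD1 IOPS, RandRead QD32 IOPS, SeqRead MiB/s).
idle = {
    "pid": (65, 537, 261),
    "round-robin": (241, 1180, 231),
    "queue": (1181, 3647, 272),
}
loaded = {
    "pid": (0, 505, 121),  # "~0 IOPS (Stalled)" treated as 0
    "round-robin": (38, 717, 126),
    "queue": (967, 2437, 247),
}


def speedup(results, policy, baseline="pid"):
    """Ratio of each metric vs. the baseline; None where the baseline is 0."""
    return tuple(
        round(new / old, 1) if old else None
        for new, old in zip(results[policy], results[baseline])
    )


print(speedup(idle, "queue"))
print(speedup(loaded, "queue"))
```

The ratios show that the gain is concentrated in random reads, while sequential bandwidth only differs noticeably once background load is present.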

Changes in this version (Kernel 6.18 Port)

Ported to Linux 6.18.
Refactored Kconfig: Features are individually selectable and moved out of "Experimental".
Robustness: Fixed potential NULL pointer dereferences in stats tracking during mount/unmount and race conditions in round-robin calculation.
Simplified: Dropped complex EMA-based latency tracking in favor of the robust queue policy.

FAQ: Why is this not upstream?

1. Allocator Hints

The allocator hint patches (originally developed by Goffredo Baroncelli, now maintained here) have been discussed on the mailing list but were not merged for design reasons:

Free Space Calculation (df): Btrfs calculates available space assuming any chunk can be allocated on any device (respecting RAID profiles). Restricting allocations via hints makes this calculation unreliable. Tools might report free space while Btrfs returns ENOSPC (No space left on device) because the allowed devices for a specific chunk type are full, even if other devices are empty.
Maintenance: The original author ceased updates for newer kernels; this repository bridges that gap.
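
The free space mismatch can be illustrated with a small Python model. It deliberately ignores RAID profiles and uses an invented pool; the hint names follow the Quickstart list, and `naive_free` and `allocatable` are hypothetical helpers, not btrfs functions.

```python
GiB = 1024 ** 3

# Invented pool: devid -> (hint, unallocated bytes).
pool = {
    1: ("data_only", 0 * GiB),
    2: ("data_only", 0 * GiB),
    4: ("metadata_only", 40 * GiB),
    6: ("metadata_only", 40 * GiB),
}


def naive_free(pool):
    # A hint-unaware estimate: every device can take any chunk type.
    return sum(free for _, free in pool.values())


def allocatable(pool, chunk):
    # Space that can actually back a new chunk of the given type.
    blocked = "metadata_only" if chunk == "data" else "data_only"
    return sum(free for hint, free in pool.values() if hint != blocked)


print(naive_free(pool) // GiB)           # space a hint-unaware tool would report
print(allocatable(pool, "data") // GiB)  # space available for new data chunks
```

In this example the naive estimate still sees 80 GiB, while no new data chunk can be allocated, which is the ENOSPC situation described above.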

Compatibility Note: This patch reuses the existing (unused) type field in the device item on disk. It does not change the on-disk format version. Unpatched kernels simply ignore the value, ensuring data remains accessible (though allocation preferences will be lost until booted with a patched kernel).

2. Read Policies (`queue`)

The new queue policy is an experimental addition in this patch set. It is unlikely to be accepted upstream in its current form because of a layer violation:

The filesystem layer (Btrfs) directly accesses internal block-layer statistics (in-flight queue depth) to make routing decisions. The Linux kernel generally enforces strict separation between these subsystems.
However, benchmarks show this cross-layer optimization yields significant performance gains in mixed setups, justifying its inclusion here.

The following kernel message may be logged if `add_inline_refs()` or `add_keyed_refs()` block for too long:

> kernel: rcu: INFO: rcu_sched self-detected stall on CPU
> kernel: rcu: 10-....: (2100 ticks this GP) idle=0494/1/0x4000000000000000 softirq=164826140/164826187 fqs=1052
> kernel: rcu: (t=2100 jiffies g=358306033 q=2241752 ncpus=16)
> kernel: CPU: 10 UID: 0 PID: 1524681 Comm: map_0x178e45670 Not tainted 6.12.21-gentoo #1
> kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> kernel: RIP: 0010:btrfs_get_64+0x65/0x110
> kernel: Code: d3 ed 48 8b 4f 70 48 8b 31 83 e6 40 74 11 0f b6 49 40 41 bc 00 10 00 00 49 d3 e4 49 83 ec 01 4a 8b 5c ed 70 49 21 d4 45 89 c9 <48> 2b 1d 7c 99 09 01 49 01 c1 8b 55 08 49 8d 49 08 44 8b 75 0c 48
> kernel: RSP: 0018:ffffbb7ad531bba0 EFLAGS: 00000202
> kernel: RAX: 0000000000001f15 RBX: fffff437ea382200 RCX: fffff437cb891200
> kernel: RDX: 000001922b68df2a RSI: 0000000000000000 RDI: ffffa434c3e66d20
> kernel: RBP: ffffa434c3e66d20 R08: 000001922b68c000 R09: 0000000000000015
> kernel: R10: 6c0000000000000a R11: 0000000009fe7000 R12: 0000000000000f2a
> kernel: R13: 0000000000000001 R14: ffffa43192e6d230 R15: ffffa43160c4c800
> kernel: FS: 000055d07085e6c0(0000) GS:ffffa4452bc80000(0000) knlGS:0000000000000000
> kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> kernel: CR2: 00007fff204ecfc0 CR3: 0000000121a0b000 CR4: 00000000001506f0
> kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> kernel: Call Trace:
> kernel: <IRQ>
> kernel: ? rcu_dump_cpu_stacks+0xd3/0x100
> kernel: ? rcu_sched_clock_irq+0x4ff/0x920
> kernel: ? update_process_times+0x6c/0xa0
> kernel: ? tick_nohz_handler+0x82/0x110
> kernel: ? tick_do_update_jiffies64+0xd0/0xd0
> kernel: ? __hrtimer_run_queues+0x10b/0x190
> kernel: ? hrtimer_interrupt+0xf1/0x200
> kernel: ? __sysvec_apic_timer_interrupt+0x44/0x50
> kernel: ? sysvec_apic_timer_interrupt+0x60/0x80
> kernel: </IRQ>
> kernel: <TASK>
> kernel: ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> kernel: ? btrfs_get_64+0x65/0x110
> kernel: find_parent_nodes+0x1b84/0x1dc0
> kernel: btrfs_find_all_leafs+0x31/0xd0
> kernel: ? queued_write_lock_slowpath+0x30/0x70
> kernel: iterate_extent_inodes+0x6f/0x370
> kernel: ? update_share_count+0x60/0x60
> kernel: ? extent_from_logical+0x139/0x190
> kernel: ? release_extent_buffer+0x96/0xb0
> kernel: iterate_inodes_from_logical+0xaa/0xd0
> kernel: btrfs_ioctl_logical_to_ino+0xaa/0x150
> kernel: __x64_sys_ioctl+0x84/0xc0
> kernel: do_syscall_64+0x47/0x100
> kernel: entry_SYSCALL_64_after_hwframe+0x4b/0x53
> kernel: RIP: 0033:0x55d07617eaaf
> kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
> kernel: RSP: 002b:000055d07085bc20 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> kernel: RAX: ffffffffffffffda RBX: 000055d0402f8550 RCX: 000055d07617eaaf
> kernel: RDX: 000055d07085bca0 RSI: 00000000c038943b RDI: 0000000000000003
> kernel: RBP: 000055d07085bea0 R08: 00007fee46c84080 R09: 0000000000000000
> kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
> kernel: R13: 000055d07085bf80 R14: 000055d07085bf48 R15: 000055d07085c0b0
> kernel: </TASK>

The RCU stall could be because there's a large number of backrefs for some extents and we're spending too much time looping over them without ever yielding the CPU. Avoid the stall warning by adding `cond_resched()`.

Link: https://lore.kernel.org/linux-btrfs/CAMthOuP_AE9OwiTQCrh7CK73xdTZvHsLTB1JU2WBK6cCc05JYg@mail.gmail.com/T/#md2e3504a1885c63531f8eefc70c94cff571b7a72
Signed-off-by: Kai Krakow <kk@netactive.de>

Signed-off-by: Kai Krakow <kai@kaishome.de>

CHN-beta · 2025-12-16T04:30:30Z

I tried this patch with CachyOS kernel 6.18.0. After recompiling and rebooting, /sys/fs/btrfs/<UUID>/devinfo/<DEVID>/type did not exist. I can confirm that the patch is applied, and the corresponding kernel options are set.

$ zcat /proc/config.gz | grep -i btrfs
CONFIG_BTRFS_FS=m
CONFIG_BTRFS_FS_POSIX_ACL=y
# CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set
# CONFIG_BTRFS_DEBUG is not set
# CONFIG_BTRFS_ASSERT is not set
CONFIG_BTRFS_ALLOCATOR_HINTS=y
# CONFIG_BTRFS_PER_DEVICE_IO_STATS is not set
CONFIG_BTRFS_READ_POLICIES=y
CONFIG_BTRFS_EXPERIMENTAL=y
$ ls /sys/fs/btrfs/f2dfd4a4-276d-4451-999e-a39457f032b5/devinfo/2
error_stats  fsid  in_fs_metadata  missing  replace_target  scrub_speed_max  writeable

Is there anything I am missing?

Forza-tng · 2025-12-16T05:51:27Z

Thanks for the report. I can confirm the same.

kakra · 2025-12-16T12:07:23Z

Thanks for the report. Will fix...

Add the following flags to give a hint about which chunk should be allocated on which disk. The following flags are created:

- BTRFS_DEV_ALLOCATION_PREFERRED_DATA: preferred data chunk, but metadata chunk allowed
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA: preferred metadata chunk, but data chunk allowed
- BTRFS_DEV_ALLOCATION_METADATA_ONLY: only metadata chunk allowed
- BTRFS_DEV_ALLOCATION_DATA_ONLY: only data chunk allowed

Co-authored-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Kai Krakow <kai@kaishome.de>

Co-authored-by: Goffredo Baroncelli <kreijack@inwind.it> Signed-off-by: Kai Krakow <kai@kaishome.de>

v2: Adds a check to prevent modification while the file system is still mounting.

Todo:
- Transactions should not be triggered from sysfs writes, see: https://lore.kernel.org/linux-btrfs/20251213200920.1808679-1-kai@kaishome.de/

Link: #36 (comment)
Reported-by: Eli Venter <eli@genedx.com>
Co-authored-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Kai Krakow <kai@kaishome.de>

kakra · 2025-12-16T14:14:24Z

@CHN-beta @Forza-tng Thanks for reporting and confirming. This was a bug I introduced when I made allocator hints configurable via make menuconfig: I used the wrong definition in the C code, so the allocator hints were never compiled in and were never active. For the same reason, the code slipped through my compile tests, which now show two syntax issues. I never verified that the type fields still existed; I'll keep this in mind for the future. Sorry.

Important: This means that anyone who has used the 6.18 patches until now never had allocator hints enabled. Please use the new patches, then verify that devinfo/*/type exists (it will still have your original type value). If it exists, the patch is working now. But you'll need to check whether your btrfs moved metadata to the slow devices:

# btrfs filesystem usage -T {BTRFS-MOUNT-PATH}
...
                  Data    Metadata System
Id Path           RAID1   RAID1    RAID1    Unallocated Total     Slack
-- -------------- ------- -------- -------- ----------- --------- -------
 1 /dev/bcache2   2.51TiB        -        -     1.12TiB   3.63TiB 3.50KiB
 2 /dev/bcache0   2.52TiB        -        -     1.12TiB   3.63TiB 3.50KiB
 4 /dev/nvme0n1p2       - 86.00GiB 32.00MiB    41.97GiB 128.00GiB       -
 6 /dev/nvme1n1p2       - 86.00GiB 32.00MiB    41.97GiB 128.00GiB       -
 7 /dev/bcache3   2.52TiB        -        -     1.12TiB   3.64TiB       -
 8 /dev/bcache1   2.50TiB        -        -     1.14TiB   3.64TiB       -
-- -------------- ------- -------- -------- ----------- --------- -------
   Total          5.03TiB 86.00GiB 32.00MiB     4.57TiB  14.79TiB 7.00KiB
   Used           4.92TiB 30.35GiB  1.05MiB

If it lists devices with unexpected metadata, take note of the affected device IDs, then run a metadata balance filtered by device ID (separate the IDs with spaces):

for ID in {SLOW_DEV_IDs}; do btrfs balance start -mdevid=$ID --enqueue {BTRFS-MOUNT-PATH}; done

E.g., run for ID in 1 7; do ... if your metadata ended up on device IDs 1 and 7 against your intent. Balance will then rewrite all metadata chunks on devices 1 and 7, effectively re-allocating them on the fast dedicated devices, without touching existing metadata chunks on other devices. It should be a fast and safe operation.
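
The check can also be scripted. The following Python sketch scans `btrfs filesystem usage -T` output for devices outside a chosen set of fast devices that hold metadata, then builds the same balance commands as the shell loop. It assumes the column layout shown above (one Data, one Metadata, and one System column); the sample table is invented, and `devices_with_metadata` and `balance_commands` are hypothetical helper names.

```python
import shlex


def devices_with_metadata(usage_table, fast_ids):
    """Return IDs of devices outside fast_ids that hold metadata chunks."""
    hits = []
    for line in usage_table.splitlines():
        cols = line.split()
        # Device rows: Id Path Data Metadata System Unallocated Total Slack
        if len(cols) < 4 or not cols[0].isdigit():
            continue
        devid, metadata = int(cols[0]), cols[3]
        if metadata != "-" and devid not in fast_ids:
            hits.append(devid)
    return hits


def balance_commands(dev_ids, mount_path):
    # One metadata balance per affected device, queued one after another.
    return [
        f"btrfs balance start -mdevid={devid} --enqueue {shlex.quote(mount_path)}"
        for devid in dev_ids
    ]


# Invented sample in which device 1 unexpectedly holds metadata.
sample = """\
                  Data    Metadata System
Id Path           RAID1   RAID1    RAID1    Unallocated Total     Slack
-- -------------- ------- -------- -------- ----------- --------- -------
 1 /dev/bcache2   2.51TiB  1.00GiB        -     1.12TiB   3.63TiB 3.50KiB
 4 /dev/nvme0n1p2       - 86.00GiB 32.00MiB    41.97GiB 128.00GiB       -
"""
for cmd in balance_commands(devices_with_metadata(sample, {4, 6}), "/mnt/pool"):
    print(cmd)
```

The sketch only prints the commands, so they can be reviewed before running them as root.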

Thanks.

CHN-beta · 2025-12-17T03:10:26Z

> I never verified if the type fields still existed, I'll keep this in mind for the future. Sorry.

Please don't worry about it. Everyone makes mistakes sometimes, and this one didn't actually cause any damage. Thank you for your contribution!

When this mode is enabled, the chunk allocation policy is modified as follows:

Each disk may have a different tag:

- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)

Where:

- ALLOCATION_PREFERRED_X means that it is preferred to use this disk for the X chunk type (the other type may be allowed when the space is low)
- ALLOCATION_X_ONLY means that it is used *only* for the X chunk type. This means also that it is a preferred choice.

Each time the allocator allocates a chunk of type X, first it takes the disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X. If the space is not enough, it uses also the disks tagged as ALLOCATION_METADATA_ONLY. If the space is not enough, it uses also the other disks, with the exception of the one marked as ALLOCATION_PREFERRED_Y, where Y is the other type of chunk (i.e. not X).

Co-authored-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Kai Krakow <kai@kaishome.de>

This is useful where you want to prevent new allocations of chunks on a disk which is going to be removed from the pool anyways, e.g. due to bad blocks or because it's slow. Signed-off-by: Kai Krakow <kai@kaishome.de>

This is useful where you want to prevent new allocations of chunks to a set of multiple disks which are going to be removed from the pool. This acts as a multiple `btrfs dev remove` on steroids that can remove multiple disks in parallel without moving data to disks which would be removed in the next round. In such cases, it will avoid moving the same data multiple times, and thus avoid placing it on potentially bad disks. Thanks to @Zygo for the explanation and suggestion. Link: kdave/btrfs-progs#907 (comment) Signed-off-by: Kai Krakow <kai@kaishome.de>

This adds read stats per device to devinfo to evaluate the effects of different read policies better. This adds a new file /sys/fs/btrfs/BTRFS-UUID/devinfo/ID/read_stats. Signed-off-by: Kai Krakow <kai@kaishome.de>

Read policies seem safe and stable enough to move it out of the experimental feature set. This allows us to add more policies without forcing users to enable the full experimental feature set. Signed-off-by: Kai Krakow <kai@kaishome.de>

Select the preferred stripe based on the mirror with the least in-flight requests. Signed-off-by: Kai Krakow <kai@kaishome.de>

kakra · 2025-12-17T03:41:42Z

Updated the branch to improve the style of some if statements.

kakra · 2025-12-17T13:08:13Z

Added a Github workflow to automatically compile and build-check the patches.

redjard · 2026-06-23T19:28:17Z

Hope this isn't inappropriate to ask here, what's the status (or chances) of getting these patches (especially queue and allocator hints) upstreamed into the mainline kernel?

kakra · 2026-06-23T21:59:26Z

> Hope this isn't inappropriate to ask here, what's the status (or chances) of getting these patches (especially queue and allocator hints) upstreamed into the mainline kernel?

Happy to answer that:

Chances are near zero. This is not a quality or functional problem. It's more an organisational issue.

Allocator hints currently confuse free space calculation because the current approach to calculate df does not expect that we may allocate from another device or cannot allocate from some devices. Admins are expected to use the btrfs command-line tooling instead. Upstream is currently working on a different approach to allocation hints. Our implementation will stay compatible until then and cause no conflicts, but at some point in the future you may need to migrate, and we will drop our patches. I will send proper instructions when the time comes. Don't expect this to happen any time soon - read: not in the next 5-10 years. :-D
Our implementation of queue depth balancing breaks the rules of separation of concerns in the Linux kernel: file systems should not peek into underlying storage details. Thus, it won't be accepted until I find a better solution. One solution could be to provide an API from the lower levels, or try something similar to what mdraid does. For the type of systems that this patch targets, it's not a real issue. Crossing boundaries of concerns, however, may be an issue if you run filesystems which have to process 100k IOPS or more - but these have very different scaling characteristics and probably don't benefit from this patch anyways.
Our way of persisting file system data from sysfs writes is currently not acceptable for the kernel because it bypasses btrfs transactions (thus, changes can get lost in certain situations). I will probably fix this with the next LTS cycle. But even if we fixed this, chances aren't good due to the above points. We will see as time passes.

I hope this helps. Let me know if I should improve the initial FAQ above.

voidpointertonull · 2026-06-24T11:44:59Z

> Allocator hints currently confuse free space calculation because the current approach to calculate df does not expect that we may allocate from another device or cannot allocate from some devices.

I've wondered about this problem for a while, so guess I'll use this "revive" of topic to ask: Is this actually a problem for preferred location too? I was under the impression that in that case there's no lie about free space, because for example devices preferring metadata storage could be also used for data.

Also, if free space calculation problems being made worse is the only blocker, then isn't that feasible to work around by metadata only devices not contributing to total space at all? While that would only cover a subset of what's offered here, I believe that to be the most important part.

I'm also generally confused about free space calculation concerns raising the bar of acceptance so high while free space fragmentation can still lead to the dreaded ENOSPC issue, so it's not like a perfect (or even good) system would be degraded, and the location hints being opt-in also prevent most of the surprise even if something does get worse (which I still doubt with only the preferred flags being in use).

> Don't expect this to happen any time soon - read: not in the next 5-10 years. :-D

That's long enough to "solve" the issue for a lot of people by just waiting until technology advances to the point of being capable of hiding bad design decisions, which then leads to the problem no longer being considered worthy of treatment.

Quite like how dirty page cache flushing still sucks, it's just less apparent with faster and generally better storage devices.

Forza-tng · 2026-06-24T13:00:01Z

for free space calc, I think we could come up with a simple rule.

preferred allocations : no change today
only data: no special treatment
only metadata: exclude from free space calculation
no allocation: exclude from free space calculation
*...
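
The visible part of this rule list can be sketched in a few lines of Python. This is only an illustration of the proposal, not existing btrfs behaviour: the hint names follow the Quickstart list, treating "no allocation" as covering both drain-style hints is an assumption, and the pool is invented.

```python
GiB = 1024 ** 3

# Hints that the proposed rule would leave out of the df estimate.
# Mapping "no allocation" to both drain-style hints is an assumption.
EXCLUDED = {"metadata_only", "none_preferred", "none"}


def reported_free(devices):
    """Sum unallocated bytes over devices that still count toward df."""
    return sum(free for hint, free in devices if hint not in EXCLUDED)


# Invented example pool: (hint, unallocated bytes) per device.
devices = [
    ("prefer_data", 1000 * GiB),
    ("data_only", 500 * GiB),
    ("metadata_only", 40 * GiB),
    ("none", 200 * GiB),
]
print(reported_free(devices) // GiB)
```

With this rule, only the two data-capable devices contribute to the reported free space in the example.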

kakra · 2026-06-24T21:00:02Z

> That's long enough to "solve" the issue for a lot of people by just waiting until technology advances to the point of being capable of hiding bad design decisions, which then leads to the problem no longer being considered worthy of treatment.

I think file systems are generally a slow-moving target... Btrfs took maybe 10 years to be usable without major corruption issues or other annoyances. It took another (overlapping) 10 years to fix most performance issues. We are probably already in the phase of adding features.

I can think of some, and I'm working on these ideas:

For example, I'm thinking of a feature to select faster devices for allocating smaller extents. The size classes that have been added to the chunk allocation should be able to provide such an allocation strategy. It will likely also mess with free space calculation, tho. But it will also partially implement hot data placement, and we can maybe get rid of bcache. Metadata placement already solves part of it, so I could already exclude metadata from bcache. Let's be honest: bcache is nice but it is far from providing the actual latency and throughput that could be expected. In particular, it doesn't handle small seeks very well, which is exactly what it promises to be good at. True, it's still way faster than seeking purely on spinning disks, but it's also far from getting near native SSD speeds.

Next on the plan would be to provide some maintenance process to migrate small extents to faster devices, and big extents to slower devices.

I'm not sure, tho, how to properly decide or persist "data hotness" without generating a big write overhead penalty.

Given that, most of this is experimental. Finally getting to an implementation which solves hot/cold tier scenarios would be the final upstream candidate. Until then, we shouldn't push half baked features into upstream. The current state of the patches only partially solves a bigger problem, and may not be the optimal solution, so it probably shouldn't go into upstream anyways.

> Quite like how dirty page cache flushing still sucks, it's just less apparent with faster and generally better storage devices.

Yeah, but given current pricing, this puts back pressure on improved implementations for slow or mixed devices. I always believed that you should have cold data on slow devices, and hot or small data on fast devices, but in one single file system, without the headache of manually managing that. Bcache mostly solves this but it is a distinct layer from btrfs: it doesn't know about btrfs, or how CoW relocates data and makes orphaned blocks obsolete in the cache, or that mirrored stripes contain identical data. So with btrfs, it mostly makes bad decisions unless your data sits there unmodified. I therefore believe my idea for small/big placement may already make such a big improvement that we could get rid of bcache as an intermediate layer (which consumes quite a few CPU cycles). With this patch series using the queue read balancing, we can at least prefer exactly those devices which already have cached blocks in bcache, resulting in less usage penalty (duplicated data) in bcache itself.

Also, a note on page cache: As far as I understood, the page cache cannot identify the same (shared) extent from two different btrfs subvolumes as the same data, so it will be read twice and kept twice in memory. That's one other potential optimization to solve. I wonder if that's why zfs implemented its own RAM cache.

I'm not sure what you mean by dirty page cache flushing: in general, or for btrfs specifically? In "my" other patch series (#45), I incorporated cache and memory management changes (le9). Well, it can behave quite well in many scenarios, even exceptionally well sometimes, but it can also get a lot worse in others. So I understand why such changes haven't gone upstream. IOW, improving some scenarios can make others worse, not just slightly, but by a lot. So it doesn't always seem easy, even if some solutions look obvious. The results of using the le9 patches pretty much prove that the page cache isn't so easy to properly change or improve.

And then there's one other elephant in the room: Linux tries to be a kernel for any workload (servers, databases, desktops, gaming, big data) - it can only be tuned up to some level. I think we'd really need specialized versions that go deeper than just tuning some knobs on the scheduler or memory manager.

kakra · 2026-06-24T21:03:30Z

> for free space calc, I think we could come up with a simple rule

Yep, this would probably solve most of the headaches with it. It seems simple and straightforward but nonetheless, it hasn't been implemented yet. So there's probably some deeper reason why. I think one blocker would be btrfs file systems that use mixed metadata/data. I'd say btrfs should probably not run on very small devices anyways, and that mode of operation should be ripped out of btrfs.

It's one of those "bad ideas" that slipped into the code base without enough thought - similar to what I outlined above why the current ideas of data placement should not yet land in btrfs.

fdmanana and others added 2 commits December 13, 2025 16:25

btrfs: add new Kconfig option for btrfs allocator hints

8ff33f9

Signed-off-by: Kai Krakow <kai@kaishome.de>

kakra mentioned this pull request Dec 14, 2025

[6.12] Track btrfs patches #36

Closed

kakra and others added 3 commits December 16, 2025 14:49

btrfs: export dev_item.type in /sys/fs/btrfs/<uuid>/devinfo/<devid>/type

34ed0e4

Co-authored-by: Goffredo Baroncelli <kreijack@inwind.it> Signed-off-by: Kai Krakow <kai@kaishome.de>

kakra force-pushed the rebase-6.18/btrfs-patches branch from 7e81d2c to 8a8411c Compare December 16, 2025 13:57

kakra and others added 6 commits December 17, 2025 04:19

btrfs: add allocator_hint for no allocation preferred

d9b8b0c

btrfs: add io read stats per device to devinfo

8305177

btrfs: move read policies out of experimental

4a6d012

btrfs: add in-flight queue read policy

6137992

kakra force-pushed the rebase-6.18/btrfs-patches branch from 8a8411c to 6137992 Compare December 17, 2025 03:40

GITHUB: btrfs: add build workflow

6d23e96

kakra requested a review from Copilot December 20, 2025 15:53

Copilot AI reviewed Dec 20, 2025
