[6.18] Track btrfs patches#40
The following kernel message may be logged if `add_inline_refs()` or `add_keyed_refs()` block for too long: > kernel: rcu: INFO: rcu_sched self-detected stall on CPU > kernel: rcu: 10-....: (2100 ticks this GP) idle=0494/1/0x4000000000000000 softirq=164826140/164826187 fqs=1052 > kernel: rcu: (t=2100 jiffies g=358306033 q=2241752 ncpus=16) > kernel: CPU: 10 UID: 0 PID: 1524681 Comm: map_0x178e45670 Not tainted 6.12.21-gentoo #1 > kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 > kernel: RIP: 0010:btrfs_get_64+0x65/0x110 > kernel: Code: d3 ed 48 8b 4f 70 48 8b 31 83 e6 40 74 11 0f b6 49 40 41 bc 00 10 00 00 49 d3 e4 49 83 ec 01 4a 8b 5c ed 70 49 21 d4 45 89 c9 <48> 2b 1d 7c 99 09 01 49 01 c1 8b 55 08 49 8d 49 08 44 8b 75 0c 48 > kernel: RSP: 0018:ffffbb7ad531bba0 EFLAGS: 00000202 > kernel: RAX: 0000000000001f15 RBX: fffff437ea382200 RCX: fffff437cb891200 > kernel: RDX: 000001922b68df2a RSI: 0000000000000000 RDI: ffffa434c3e66d20 > kernel: RBP: ffffa434c3e66d20 R08: 000001922b68c000 R09: 0000000000000015 > kernel: R10: 6c0000000000000a R11: 0000000009fe7000 R12: 0000000000000f2a > kernel: R13: 0000000000000001 R14: ffffa43192e6d230 R15: ffffa43160c4c800 > kernel: FS: 000055d07085e6c0(0000) GS:ffffa4452bc80000(0000) knlGS:0000000000000000 > kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > kernel: CR2: 00007fff204ecfc0 CR3: 0000000121a0b000 CR4: 00000000001506f0 > kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > kernel: Call Trace: > kernel: <IRQ> > kernel: ? rcu_dump_cpu_stacks+0xd3/0x100 > kernel: ? rcu_sched_clock_irq+0x4ff/0x920 > kernel: ? update_process_times+0x6c/0xa0 > kernel: ? tick_nohz_handler+0x82/0x110 > kernel: ? tick_do_update_jiffies64+0xd0/0xd0 > kernel: ? __hrtimer_run_queues+0x10b/0x190 > kernel: ? hrtimer_interrupt+0xf1/0x200 > kernel: ? __sysvec_apic_timer_interrupt+0x44/0x50 > kernel: ? 
sysvec_apic_timer_interrupt+0x60/0x80 > kernel: </IRQ> > kernel: <TASK> > kernel: ? asm_sysvec_apic_timer_interrupt+0x16/0x20 > kernel: ? btrfs_get_64+0x65/0x110 > kernel: find_parent_nodes+0x1b84/0x1dc0 > kernel: btrfs_find_all_leafs+0x31/0xd0 > kernel: ? queued_write_lock_slowpath+0x30/0x70 > kernel: iterate_extent_inodes+0x6f/0x370 > kernel: ? update_share_count+0x60/0x60 > kernel: ? extent_from_logical+0x139/0x190 > kernel: ? release_extent_buffer+0x96/0xb0 > kernel: iterate_inodes_from_logical+0xaa/0xd0 > kernel: btrfs_ioctl_logical_to_ino+0xaa/0x150 > kernel: __x64_sys_ioctl+0x84/0xc0 > kernel: do_syscall_64+0x47/0x100 > kernel: entry_SYSCALL_64_after_hwframe+0x4b/0x53 > kernel: RIP: 0033:0x55d07617eaaf > kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00 > kernel: RSP: 002b:000055d07085bc20 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 > kernel: RAX: ffffffffffffffda RBX: 000055d0402f8550 RCX: 000055d07617eaaf > kernel: RDX: 000055d07085bca0 RSI: 00000000c038943b RDI: 0000000000000003 > kernel: RBP: 000055d07085bea0 R08: 00007fee46c84080 R09: 0000000000000000 > kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003 > kernel: R13: 000055d07085bf80 R14: 000055d07085bf48 R15: 000055d07085c0b0 > kernel: </TASK>
The RCU stall could be because there's a large number of backrefs for some extents and we're spending too much time looping over them without ever yielding the CPU. Avoid the stall warning by adding `cond_resched()`.
Link: https://lore.kernel.org/linux-btrfs/CAMthOuP_AE9OwiTQCrh7CK73xdTZvHsLTB1JU2WBK6cCc05JYg@mail.gmail.com/T/#md2e3504a1885c63531f8eefc70c94cff571b7a72
Signed-off-by: Kai Krakow <kk@netactive.de>
Signed-off-by: Kai Krakow <kai@kaishome.de>
|
I tried this patch with CachyOS kernel 6.18.0. After recompiling and rebooting, Is there anything I am missing? |
Thanks for the report. I can confirm the same. |
|
Thanks for the report. Will fix... |
Add the following flags to give a hint about which chunk should be allocated on which disk. The following flags are created:
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA: preferred data chunk, but metadata chunk allowed
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA: preferred metadata chunk, but data chunk allowed
- BTRFS_DEV_ALLOCATION_METADATA_ONLY: only metadata chunk allowed
- BTRFS_DEV_ALLOCATION_DATA_ONLY: only data chunk allowed
Co-authored-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Kai Krakow <kai@kaishome.de>
Co-authored-by: Goffredo Baroncelli <kreijack@inwind.it> Signed-off-by: Kai Krakow <kai@kaishome.de>
v2: Adds a check to prevent modification while the file system is still mounting.
Todo:
- Transactions should not be triggered from sysfs writes, see: https://lore.kernel.org/linux-btrfs/20251213200920.1808679-1-kai@kaishome.de/
Link: #36 (comment)
Reported-by: Eli Venter <eli@genedx.com>
Co-authored-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Kai Krakow <kai@kaishome.de>
7e81d2c to
8a8411c
|
@CHN-beta @Forza-tng Thanks for reporting and confirming. This was actually a bug I introduced when I made allocator hints configurable via

Important: This means that whoever used the 6.18 patches until now never had allocator hints enabled. Please use the new patches, then verify that

If it lists devices with unexpected meta data, take note of the affected device IDs, then run a meta data balance filtered for device ID (separate each ID by spaces): `for ID in {SLOW_DEV_IDs}; do btrfs balance start -mdevid=$ID --enqueue {BTRFS-MOUNT-PATH}; done`

E.g., run

Thanks. |
Please don't worry about it. Everyone makes mistakes sometimes, and this one didn't actually cause any damage. Thank you for your contribution! |
When this mode is enabled, the chunk allocation policy is modified as follows:

Each disk may have a different tag:
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)

Where:
- ALLOCATION_PREFERRED_X means that it is preferred to use this disk for the X chunk type (the other type may be allowed when the space is low)
- ALLOCATION_X_ONLY means that it is used *only* for the X chunk type. This means also that it is a preferred choice.

Each time the allocator allocates a chunk of type X, first it takes the disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X. If the space is not enough, it uses also the disks tagged as ALLOCATION_METADATA_ONLY. If the space is not enough, it uses also the other disks, with the exception of the one marked as ALLOCATION_PREFERRED_Y, where Y is the other type of chunk (i.e. not X).

Co-authored-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Kai Krakow <kai@kaishome.de>
This is useful where you want to prevent new allocations of chunks on a disk which is going to be removed from the pool anyways, e.g. due to bad blocks or because it's slow. Signed-off-by: Kai Krakow <kai@kaishome.de>
This is useful where you want to prevent new allocations of chunks to a set of multiple disks which are going to be removed from the pool. This acts as a multiple `btrfs dev remove` on steroids that can remove multiple disks in parallel without moving data to disks which would be removed in the next round. In such cases, it will avoid moving the same data multiple times, and thus avoid placing it on potentially bad disks. Thanks to @Zygo for the explanation and suggestion. Link: kdave/btrfs-progs#907 (comment) Signed-off-by: Kai Krakow <kai@kaishome.de>
This adds read stats per device to devinfo to evaluate the effects of different read policies better. This adds a new file /sys/fs/btrfs/BTRFS-UUID/devinfo/ID/read_stats. Signed-off-by: Kai Krakow <kai@kaishome.de>
Read policies seem safe and stable enough to move it out of the experimental feature set. This allows us to add more policies without forcing users to enable the full experimental feature set. Signed-off-by: Kai Krakow <kai@kaishome.de>
Select the preferred stripe based on the mirror with the least in-flight requests. Signed-off-by: Kai Krakow <kai@kaishome.de>
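The selection rule named in this commit message (pick the mirror with the fewest in-flight requests) can be illustrated with a small userspace sketch. This is not the patch's code: `struct mirror`, its `inflight` field and `pick_mirror()` are hypothetical names, and the tie-break (lowest index wins) is an assumption.

```c
#include <stddef.h>

/* Hypothetical per-mirror state; the kernel keeps this elsewhere. */
struct mirror {
	unsigned int inflight; /* read requests currently queued on the device */
};

/*
 * Return the index of the mirror with the fewest in-flight requests.
 * Ties go to the lowest index (an assumption of this sketch).
 * The caller must pass n >= 1.
 */
size_t pick_mirror(const struct mirror *mirrors, size_t n)
{
	size_t best = 0;

	for (size_t i = 1; i < n; i++) {
		if (mirrors[i].inflight < mirrors[best].inflight)
			best = i;
	}
	return best;
}
```

A static policy such as PID hashing would keep sending a process to the same mirror even while that device is busy; a rule like the one above shifts reads to whichever mirror currently has the shorter queue.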
8a8411c to
6137992
|
Updated the branch to improve the style of some if statements. |
|
Added a Github workflow to automatically compile and build-check the patches. |
|
Hope this isn't inappropriate to ask here, what's the status (or chances) of getting these patches (especially queue and allocator hints) upstreamed into the mainline kernel? |
I happily answer that: Chances are near zero. This is not a quality or functional problem. It's more an organisational issue.
I hope this helps. Let me know if I should improve the initial FAQ above. |
I've wondered about this problem for a while, so guess I'll use this "revive" of topic to ask: Is this actually a problem for preferred location too? I was under the impression that in that case there's no lie about free space, because for example devices preferring metadata storage could be also used for data. Also, if free space calculation problems being made worse is the only blocker, then isn't that feasible to work around by metadata only devices not contributing to total space at all? While that would only cover a subset of what's offered here, I believe that to be the most important part. I'm also generally confused about free space calculation concerns raising the bar of acceptance so high while free space fragmentation can still lead to the dreaded ENOSPC issue, so it's not like a perfect (or even good) system would be degraded, and the location hints being opt-in also prevent most of the surprise even if something does get worse (which I still doubt with only the preferred flags being in use).
That's long enough to "solve" the issue for a lot of people by just waiting until technology advances to the point of being capable of hiding bad design decisions, which then leads to the problem no longer being considered worthy of treatment. Quite like how dirty page cache flushing still sucks, it's just less apparent with faster and generally better storage devices. |
for free space calc, I think we could come up with a simple rule.
|
I think file systems are generally a slow moving target... Btrfs took maybe 10 years to be usable without major corruption issues or other annoyances. It took another (overlapping) 10 years to fix most performance issues. We are probably already in the phase of adding features. I can think of some, and I'm working on these ideas: For example, I'm thinking of a feature to select faster devices for allocating smaller extents. The size classes that have been added to the chunk allocation should be able to provide such an allocation strategy. It will likely also mess with free space calculation, tho. But it will also partially implement hot data placement, and we can maybe get rid of bcache. Meta data placement already solves part of it so I could already exclude meta data from bcache. Let's be honest: bcache is nice but it is far from providing the actual latency and throughput that could be expected. Especially small seeks it doesn't handle very well, exactly what it promises to be good at. True, it's still way faster than seeking purely on spinning disks, but it's also far from getting near native SSD speeds. Next on the plan would be to provide some maintenance process to migrate small extents to faster devices, and big extents to slower devices. I'm not sure, tho, how to properly decide or persist "data hotness" without generating a big write overhead penalty. Given that, most of this is experimental. Finally getting to an implementation which solves hot/cold tier scenarios would be the final upstream candidate. Until then, we shouldn't push half baked features into upstream. The current state of the patches only partially solves a bigger problem, and may not be the optimal solution, so it probably shouldn't go into upstream anyways.
Yeah, but given current pricing, this puts back pressure on improved implementations for slow or mixed devices. I always believed that you should have cold data on slow devices, and hot or small data on fast devices, but in one single file system, without the headache of manually managing that. Bcache mostly solves this but it is a distinct layer from btrfs, it doesn't know about btrfs, or how CoW relocates data and makes orphaned blocks obsolete in the cache, or that mirrored stripes contain identical data, so with btrfs, it can mostly make bad decisions except when your data sits there unmodified. So I believe my idea for small/big placement may already make such a big improvement that we would get rid of bcache as an intermediate layer (consuming not particularly few CPU cycles). With this patch series using the queue read balancing, we can at least prefer exactly those devices which already have cached blocks in bcache, resulting in less usage penalty (duplicated data) in bcache itself. Also, a note on page cache: As far as I understood, the page cache cannot identify the same (shared) extent on two different btrfs subvolumes as the same data, so it will be read twice, and will be kept twice in memory. That's one other potential optimization to solve. I wonder if that's why zfs implemented its own RAM cache. I'm not sure what you mean by dirty page cache flushing: in general, or for btrfs specifically? In "my" other patch series (#45), I incorporated cache and memory management changes (le9). Well, it can behave quite well in many scenarios, even exceptionally well sometimes, but it can also get a lot worse in others. So I understand why such changes haven't gone upstream. IOW, just improving some scenarios can make it worse for others, not only slightly, but a lot. So it doesn't always seem easy, even if some solutions look obvious.
Looking at the results of using the le9 patches, this pretty much proves that page cache isn't so easy to properly change or improve. And then there's one other elephant in the room: Linux tries to be a kernel for any workload (servers, databases, desktops, gaming, big data), so it can only be tuned up to some level. I think we'd really need specialized versions that go deeper than just tuning some knobs on the scheduler or memory manager. |
Yep, this would probably solve most of the headaches with it. It seems simple and straightforward but nonetheless, it hasn't been implemented yet. So there's probably some deeper cause why. I think one blocker would be btrfs file systems that use mixed metadata/data. I'd say btrfs should probably not run on very small devices anyways, and that mode of operation should be ripped out of btrfs. It's one of those "bad ideas" that slipped into the code base without enough thought, similar to what I outlined above why the current ideas of data placement should not yet land in btrfs. |
Export patch series: https://github.com/kakra/linux/pull/40.patch
btrfs: tiered allocation hints and queue-based read balancing
Special Thanks to @Forza-tng for extensive testing, feedback, and maintaining the documentation guide:
👉 Btrfs Allocator Hints and Read Policies Guide by Forza-tng
This PR introduces a set of patches to improve Btrfs performance and flexibility in heterogeneous storage environments (mixed SSD/HDD, tiered storage, bcache).
1. Allocator Hints (Data Placement)
Allows preferring specific devices for data or metadata allocations. This works by storing a hint in the persistent device item on-disk.
2. Read Balancing Policies
Extends Btrfs RAID1 read balancing with a dynamic `queue` policy. The standard PID-based policy is often insufficient for mixed-device pools or high-IOPS workloads.

- `pid`: (Default) Static hashing by process ID. Good for simple setups.
- `round-robin`: Distributes reads equally. Good for aggregate throughput on identical disks.
- `queue`: (Recommended) Routes requests to the device with the fewest in-flight requests (shortest queue).
- `devid`: Pin reads to a specific device ID (mostly for testing).

(Note: Previous experimental latency-based policies were dropped in favor of `queue` due to better stability and lower complexity.)

3. Decoupling from Experimental Status

Important: Upstream Kernels (6.13+) use `CONFIG_BTRFS_EXPERIMENTAL` to gate various unstable work-in-progress features. To allow using the allocator hints and read policies without enabling potentially unstable upstream code, these features have been moved out of the experimental gate.

Recommendation: Remove the line `CONFIG_BTRFS_EXPERIMENTAL` from your `.config` before running `make oldconfig`. The build system will then prompt you specifically for the new options (`CONFIG_BTRFS_ALLOCATOR_HINTS`, `CONFIG_BTRFS_READ_POLICIES`), allowing you to enable them safely without turning on other experimental Btrfs features.

Quickstart Guide
Setting Allocator Hints

- `CONFIG_BTRFS_ALLOCATOR_HINTS` in kernel config.
- `btrfs device usage /mnt/path` to identify your device IDs.
- `echo <TYPE> | sudo tee /sys/fs/btrfs/<UUID>/devinfo/<DEVID>/type`

Available Types:

- `0`: Prefer data (Default for HDDs).
- `1`: Prefer metadata (Recommended for SSDs/NVMe).
- `2`: Metadata only (Use with caution).
- `3`: Data only (Use with caution).
- `4`: None preferred (Avoids new allocations, useful to drain a drive).
- `5`: None (Strictly prevents ANY new allocation, useful for parallel device remove).

After changing hints, a rebalance of metadata/data is required to move existing extents to their preferred location.
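To make the six types listed under "Available Types" more concrete, here is a minimal sketch of how a device's hint could be turned into an allocation rank for a given chunk type. The numeric values come from that list; everything else is an assumption for illustration: the names `enum alloc_hint` and `hint_rank()` are hypothetical, and the exact tier ordering in the patches may differ.

```c
#include <stdbool.h>

/* Values as listed under "Available Types"; names are hypothetical. */
enum alloc_hint {
	HINT_PREFER_DATA     = 0,
	HINT_PREFER_METADATA = 1,
	HINT_METADATA_ONLY   = 2,
	HINT_DATA_ONLY       = 3,
	HINT_NONE_PREFERRED  = 4,
	HINT_NONE            = 5,
};

/*
 * Rank a device for a new chunk: lower ranks are tried first,
 * -1 means the device is never used for this chunk type.
 * Assumed ordering: matching "only"/"preferred" devices first (0),
 * devices preferring the other type next (1), "none preferred" last (2).
 */
int hint_rank(enum alloc_hint hint, bool metadata_chunk)
{
	switch (hint) {
	case HINT_METADATA_ONLY:
		return metadata_chunk ? 0 : -1;
	case HINT_DATA_ONLY:
		return metadata_chunk ? -1 : 0;
	case HINT_PREFER_METADATA:
		return metadata_chunk ? 0 : 1;
	case HINT_PREFER_DATA:
		return metadata_chunk ? 1 : 0;
	case HINT_NONE_PREFERRED:
		return 2;
	case HINT_NONE:
	default:
		return -1;
	}
}
```

Under these assumptions, an SSD set to `1` and an HDD set to `0` both remain usable for either chunk type, just with opposite priority, while types `2` and `3` take a device out of the pool for the other chunk type entirely.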
Enabling Read Policy

- `CONFIG_BTRFS_READ_POLICIES` in kernel config.
- `btrfs.read_policy=queue`
- `echo queue | sudo tee /sys/fs/btrfs/<UUID>/read_policy`

Diagnostic Statistics

Adds per-device read statistics to `/sys/fs/btrfs/<UUID>/devinfo/<DEVID>/read_stats`.

- `ios`: Total read I/O count.
- `wait`: Total accumulated wait time (ns).
- `avg`: Cumulative average read latency (ns).
- `age`: "Fairness" counter. Increments when the device is skipped/ignored during selection. Resets to 0 when selected. A constantly high age indicates the device is being avoided by the policy.
- `ignored`: Total count of times this device was a candidate but skipped.

Benchmark Results
The following benchmarks (based on kernel 6.12 LTS) compare the new policies against the defaults. Tests were performed on a mixed HDD RAID10 array with bcache, comparing an idle system vs. a system under heavy background load (defrag).
`queue` proved to be the superior all-rounder, effectively isolating foreground workloads from background noise.

Scenario: No Background Load

[Table: results for `pid`, `round-robin`, `latency-rr`*, `queue`]

Scenario: Heavy Background Load (Defrag)

[Table: results for `pid`, `round-robin`, `latency-rr`*, `queue`]

(`latency-rr` was an experimental hybrid policy used during testing, superseded by `queue` due to better performance and simplicity)

Changes in this version (Kernel 6.18 Port)

`queue` policy.

FAQ: Why is this not upstream?
1. Allocator Hints
The allocator hint patches (originally developed by Goffredo Baroncelli, now maintained here) have been discussed on the mailing list but were not merged for design reasons:
`df`): Btrfs calculates available space assuming any chunk can be allocated on any device (respecting RAID profiles). Restricting allocations via hints makes this calculation unreliable. Tools might report free space while Btrfs returns `ENOSPC` (No space left on device) because the allowed devices for a specific chunk type are full, even if other devices are empty.

Compatibility Note: This patch reuses the existing (unused) `type` field in the device item on disk. It does not change the on-disk format version. Unpatched kernels simply ignore the value, ensuring data remains accessible (though allocation preferences will be lost until booted with a patched kernel).

2. Read Policies (`queue`)

The new `queue` policy is an experimental addition in this patch set. It is unlikely to be accepted upstream in its current form due to Layer Violation: