On Mon, 13 May 2024 at 09:28, David Sterba <dsterba@xxxxxxxx> wrote:
> git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git tags/for-6.10-tag
So I initially blamed a GPU driver for the following problem, but Dave
Airlie seems to think it's unlikely that problem would cause this kind
of corruption, so now it looks like it might just be btrfs itself:
BUG: Bad page state in process kworker/u261:13 pfn:31fb9a
page: refcount:0 mapcount:0 mapping:00000000ff0b239e index:0x37ce8 pfn:0x31fb9a
aops:btree_aops ino:1
flags: 0x2fffc600000020c(referenced|uptodate|workingset|node=0|zone=2|lastcpupid=0x3fff)
page_type: 0xffffffff()
raw: 02fffc600000020c dead000000000100 dead000000000122 ffff9b191efb0338
raw: 0000000000037ce8 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: non-NULL mapping
CPU: 18 PID: 141351 Comm: kworker/u261:13 Tainted: G W 6.9.0-07381-g3860ca371740 #60
Workqueue: btrfs-delayed-meta btrfs_work_helper
Call Trace:
bad_page+0xe0/0xf0
free_unref_page_prepare+0x363/0x380
? __count_memcg_events+0x63/0xd0
free_unref_page+0x33/0x1f0
? __mem_cgroup_uncharge+0x80/0xb0
__folio_put+0x62/0x80
release_extent_buffer+0xad/0x110
btrfs_force_cow_block+0x68f/0x890
btrfs_cow_block+0xe5/0x240
btrfs_search_slot+0x30e/0x9f0
btrfs_lookup_inode+0x31/0xb0
__btrfs_update_delayed_inode+0x5c/0x350
? kfree+0x80/0x250
__btrfs_commit_inode_delayed_items+0x7a1/0x7d0
btrfs_async_run_delayed_root+0xf7/0x1b0
btrfs_work_helper+0xc0/0x320
process_scheduled_works+0x196/0x360
worker_thread+0x2b8/0x370
? pr_cont_work+0x190/0x190
kthread+0x111/0x120
? kthread_blkcg+0x30/0x30
ret_from_fork+0x30/0x40
? kthread_blkcg+0x30/0x30
ret_from_fork_asm+0x11/0x20
Note the line
page dumped because: non-NULL mapping
but the actual mapping pointer isn't a valid kernel pointer. I suspect
that may be due to pointer hashing, though. I'm not convinced that's a
great idea for this case, but hey, here we are. Sometimes those "don't
leak kernel pointers" things cause problems for debugging.
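One way to check the pointer-hashing suspicion is to read the mapping straight out of the two "raw:" lines, which are not hashed. The sketch below is only an illustration: it assumes the usual 64-bit little-endian struct page word order (flags, two lru words, mapping, index, private, mapcount/refcount, memcg_data), which may not match every kernel configuration, and the function name decode_raw_page is made up for this example.

```python
# Decode the two "raw:" lines of the page dump into named fields.
# ASSUMPTION: the 8 words follow the common 64-bit struct page layout;
# this is not taken from the report itself.
RAW = [
    "02fffc600000020c dead000000000100 dead000000000122 ffff9b191efb0338",
    "0000000000037ce8 0000000000000000 00000000ffffffff 0000000000000000",
]


def decode_raw_page(lines):
    """Hypothetical helper: map the 8 dumped words to assumed field names."""
    words = [int(w, 16) for line in lines for w in line.split()]
    if len(words) != 8:
        raise ValueError("expected 8 words, got %d" % len(words))
    return {
        "flags": words[0],
        "lru_next": words[1],
        "lru_prev": words[2],
        "mapping": words[3],
        "index": words[4],
        "private": words[5],
        # Word 6 packs two 32-bit fields; the low half comes first in memory
        # on a little-endian machine.
        "mapcount_raw": words[6] & 0xFFFFFFFF,
        "refcount": words[6] >> 32,
        "memcg_data": words[7],
    }


page = decode_raw_page(RAW)
for name, value in page.items():
    print("%-12s 0x%016x" % (name, value))
```

Under that assumed layout the mapping word is ffff9b191efb0338, which looks like an ordinary kernel address, while the 00000000ff0b239e printed on the "page:" line has the shape of a hashed %p value. The index word (0x37ce8) and the zero refcount agree with the "page:" line, which is some evidence that the layout guess is right.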
Anyway, it looks like the btrfs_cow_block -> btrfs_force_cow_block ->
release_extent_buffer -> __folio_put path might be releasing a page
that is still attached to a mapping. Perhaps some page counting
imbalance?
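To make the suspected failure mode concrete, here is a toy userspace model of a reference-count imbalance. It is not kernel code and does not claim to describe what btrfs actually does: the Page class, put_page and bad_page_reason are hypothetical names, and the check is only loosely modeled on the free-time sanity test that produces the "non-NULL mapping" message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """Toy stand-in for a page: one count, one optional mapping."""
    refcount: int = 0
    mapping: Optional[str] = None


def bad_page_reason(page):
    """Return why a page about to be freed looks wrong, or None if it is fine."""
    if page.mapping is not None:
        return "non-NULL mapping"
    if page.refcount != 0:
        return "nonzero refcount"
    return None


def put_page(page):
    """Drop one reference; on the last drop, run the free-time check."""
    page.refcount -= 1
    if page.refcount > 0:
        return "in use"
    reason = bad_page_reason(page)
    return "freed" if reason is None else "BUG: Bad page state: " + reason


# Balanced case: the mapping and the extent buffer each hold a reference.
ok = Page(refcount=2, mapping="btree_inode")
ok.mapping = None                 # detach from the mapping first ...
first = put_page(ok)              # ... and drop the mapping's reference
balanced = put_page(ok)           # the extent buffer's final put

# Imbalanced case: only one reference was taken, but the page is still
# attached to the mapping when the extent buffer drops it.
bad = Page(refcount=1, mapping="btree_inode")
imbalanced = put_page(bad)

print(first, "|", balanced, "|", imbalanced)
```

In the model, the only difference between the two cases is a missing reference, and that alone is enough for the final put to hit the check with the mapping still set.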
This all happened under fairly normal - for me - workstation loads. I
was (of course) doing an allmodconfig kernel build after a pull, and I
had a handful of terminals and the web browser open. Nothing
particularly interesting or odd.
Does the above make any btrfs people go "Ahh, I see how that would be
a problem"?
Linus