Whenever we attempt to do a non-aligned direct IO write with O_DSYNC, we
end up triggering an assertion and crashing. Example reproducer:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
# Do a direct IO write with O_DSYNC into a non-aligned range...
xfs_io -f -d -s -c "pwrite -S 0xab -b 64K 1111 64K" $MNT/foobar
umount $MNT
When running the reproducer an assertion fails and produces the following
trace:
[ 2418.403134] assertion failed: !current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA, in fs/btrfs/space-info.c:1467
[ 2418.403745] ------------[ cut here ]------------
[ 2418.404306] kernel BUG at fs/btrfs/ctree.h:3286!
[ 2418.404862] invalid opcode: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC PTI
[ 2418.405451] CPU: 1 PID: 64705 Comm: xfs_io Tainted: G D 5.10.15-btrfs-next-87 #1
[ 2418.406026] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 2418.407228] RIP: 0010:assertfail.constprop.0+0x18/0x26 [btrfs]
[ 2418.407835] Code: e6 48 c7 (...)
[ 2418.409078] RSP: 0018:
ffffb06080d13c98 EFLAGS:
00010246
[ 2418.409696] RAX:
000000000000006c RBX:
ffff994c1debbf08 RCX:
0000000000000000
[ 2418.410302] RDX:
0000000000000000 RSI:
0000000000000027 RDI:
00000000ffffffff
[ 2418.410904] RBP:
ffff994c21770000 R08:
0000000000000000 R09:
0000000000000000
[ 2418.411504] R10:
0000000000000000 R11:
0000000000000001 R12:
0000000000010000
[ 2418.412111] R13:
ffff994c22198400 R14:
ffff994c21770000 R15:
0000000000000000
[ 2418.412713] FS:
00007f54fd7aff00(0000) GS:
ffff994d35200000(0000) knlGS:
0000000000000000
[ 2418.413326] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 2418.413933] CR2:
000056549596d000 CR3:
000000010b928003 CR4:
0000000000370ee0
[ 2418.414528] DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
[ 2418.415109] DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
[ 2418.415669] Call Trace:
[ 2418.416254] btrfs_reserve_data_bytes.cold+0x22/0x22 [btrfs]
[ 2418.416812] btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
[ 2418.417380] btrfs_buffered_write+0x1b0/0x7f0 [btrfs]
[ 2418.418315] btrfs_file_write_iter+0x2a9/0x770 [btrfs]
[ 2418.418920] new_sync_write+0x11f/0x1c0
[ 2418.419430] vfs_write+0x2bb/0x3b0
[ 2418.419972] __x64_sys_pwrite64+0x90/0xc0
[ 2418.420486] do_syscall_64+0x33/0x80
[ 2418.420979] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2418.421486] RIP: 0033:0x7f54fda0b986
[ 2418.421981] Code: 48 c7 c0 (...)
[ 2418.423019] RSP: 002b:
00007ffc40569c38 EFLAGS:
00000246 ORIG_RAX:
0000000000000012
[ 2418.423547] RAX:
ffffffffffffffda RBX:
0000000000000000 RCX:
00007f54fda0b986
[ 2418.424075] RDX:
0000000000010000 RSI:
000056549595e000 RDI:
0000000000000003
[ 2418.424596] RBP:
0000000000000000 R08:
0000000000000000 R09:
0000000000000400
[ 2418.425119] R10:
0000000000000400 R11:
0000000000000246 R12:
00000000ffffffff
[ 2418.425644] R13:
0000000000000400 R14:
0000000000010000 R15:
0000000000000000
[ 2418.426148] Modules linked in: btrfs blake2b_generic (...)
[ 2418.429540] ---[ end trace
ef2aeb44dc0afa34 ]---
1) At btrfs_file_write_iter() we set current->journal_info to
BTRFS_DIO_SYNC_STUB;
2) We then call __btrfs_direct_write(), which calls btrfs_direct_IO();
3) We can't do the direct IO write because it starts at a non-aligned
offset (1111). So at btrfs_direct_IO() we return -EINVAL (coming from
check_direct_IO() which does the alignment check), but we leave
current->journal_info set to BTRFS_DIO_SYNC_STUB - we only clear it
at btrfs_dio_iomap_begin(), because we assume we always get there;
4) Then at __btrfs_direct_write() we see that the attempt to do the
direct IO write was not successful, 0 bytes written, so we fallback
to a buffered write by calling btrfs_buffered_write();
5) There we call btrfs_check_data_free_space() which in turn calls
btrfs_alloc_data_chunk_ondemand() and that calls
btrfs_reserve_data_bytes() with flush == BTRFS_RESERVE_FLUSH_DATA;
6) Then at btrfs_reserve_data_bytes() we have current->journal_info set to
BTRFS_DIO_SYNC_STUB, therefore not NULL, and flush has the value
BTRFS_RESERVE_FLUSH_DATA, triggering the second assertion:
int btrfs_reserve_data_bytes(struct btrfs_fs_info *fs_info, u64 bytes,
enum btrfs_reserve_flush_enum flush)
{
struct btrfs_space_info *data_sinfo = fs_info->data_sinfo;
int ret;
ASSERT(flush == BTRFS_RESERVE_FLUSH_DATA ||
flush == BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE);
ASSERT(!current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA);
(...)
So fix that by setting the journal to NULL whenever check_direct_IO()
returns a failure.
This bug only affects 5.10 kernels, and the regression was introduced in
5.10-rc1 by commit
0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround").
The bug does not exist in 5.11 kernels due to commit
ecfdc08b8cc65d
("btrfs: remove dio iomap DSYNC workaround"), which depends on a large
patchset that went into the merge window for 5.11. So this is a fix only
for 5.10.x stable kernels, as there are people hitting this bug.
Fixes: 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround")
CC: stable@vger.kernel.org # 5.10 (and only 5.10)
Acked-by: David Sterba <dsterba@suse.com>
Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=
1181605
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>