Qu Wenruo [Wed, 24 Apr 2019 07:22:53 +0000 (15:22 +0800)]
btrfs: tree-checker: Allow error injection for tree-checker
Allowing error injection for btrfs_check_leaf_full() and
btrfs_check_node() is useful to test the failure path of btrfs write
time tree check.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 22 Apr 2019 13:07:31 +0000 (16:07 +0300)]
btrfs: Document btrfs_csum_one_bio
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 17 Apr 2019 10:31:06 +0000 (11:31 +0100)]
Btrfs: improve performance on fsync of files with multiple hardlinks
Commit 41bd6067692382 ("Btrfs: fix fsync of files with multiple hard links
in new directories") introduced a path that makes fsync fall back to a full
transaction commit in order to avoid losing hard links and new ancestors
of the fsynced inode. That path is triggered only when the inode has more
than one hard link and either has a new hard link created in the current
transaction or the inode was evicted and reloaded in the current
transaction.
That path ends up getting triggered very often (hundreds of times) during
the course of pgbench benchmarks, resulting in performance drops of about
20%.
This change restores the performance by not triggering the full transaction
commit in those cases, and instead iterating the fs/subvolume tree in search
of all possible new ancestors, for all hard links, to log them.
Reported-by: Zhao Yuhu <zyuhu@suse.com>
Tested-by: James Wang <jnwang@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 22 Apr 2019 15:43:42 +0000 (16:43 +0100)]
Btrfs: fix race between send and deduplication that lead to failures and crashes
Send operates on read only trees and expects them to never change while it
is using them. This is part of its initial design, and this expectation is
due to two different reasons:
1) When it was introduced, no operations were allowed to modify read-only
subvolumes/snapshots (including defrag for example).
2) It keeps send from having an impact on other filesystem operations.
Namely send does not need to keep locks on the trees nor needs to hold on
to transaction handles and delay transaction commits. This ends up being
a consequence of the former reason.
However the deduplication feature was introduced later (in September 2013,
while send was introduced in July 2012) and it allowed for deduplication
with destination files that belong to read-only trees (subvolumes and
snapshots).
That means that having a send operation (either full or incremental) running
in parallel with a deduplication that has the destination inode in one of
the trees used by the send operation, can result in tree nodes and leaves
getting freed and reused while send is using them. This problem is similar
to the problem solved for the root nodes getting freed and reused when a
snapshot is made against one tree that is currently being used by a send
operation, fixed in commits [1] and [2]. These commits explain in detail
how the problem happens and the explanation is valid for any node or leaf
that is not the root of a tree as well. This problem was also discussed
and explained recently in a thread [3].
The problem is very easy to reproduce when using send with large trees
(snapshots) and just a few concurrent deduplication operations that target
files in the trees used by send. A stress test case is being sent for
fstests that triggers the issue easily. The most common error to hit is
the send ioctl returning -EIO with the following messages in dmesg/syslog:
[1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
[1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27
The first one is very easy to hit while the second one happens much less
frequently, except for very large trees (in that test case, snapshots
with 100000 files having large xattrs to get deep and wide trees).
Less frequently, at least one BUG_ON can be hit:
[1631742.130080] ------------[ cut here ]------------
[1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
[1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
[1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G B D W 5.0.0-rc8-btrfs-next-45 #1
[1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
[1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
[1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
[1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
[1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
[1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[1631742.138455] FS: 00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
[1631742.139010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
[1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1631742.141272] Call Trace:
[1631742.141826] ? do_raw_spin_unlock+0x49/0xc0
[1631742.142390] tree_advance+0x173/0x1d0 [btrfs]
[1631742.142948] btrfs_compare_trees+0x268/0x690 [btrfs]
[1631742.143533] ? process_extent+0x1070/0x1070 [btrfs]
[1631742.144088] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[1631742.144645] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[1631742.145161] ? trace_sched_stick_numa+0xe0/0xe0
[1631742.145685] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[1631742.146179] ? account_entity_enqueue+0xd3/0x100
[1631742.146662] ? reweight_entity+0x154/0x1a0
[1631742.147135] ? update_curr+0x20/0x2a0
[1631742.147593] ? check_preempt_wakeup+0x103/0x250
[1631742.148053] ? do_vfs_ioctl+0xa2/0x6f0
[1631742.148510] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[1631742.148942] do_vfs_ioctl+0xa2/0x6f0
[1631742.149361] ? __fget+0x113/0x200
[1631742.149767] ksys_ioctl+0x70/0x80
[1631742.150159] __x64_sys_ioctl+0x16/0x20
[1631742.150543] do_syscall_64+0x60/0x1b0
[1631742.150931] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[1631742.151326] RIP: 0033:0x7f4cd9f5add7
(...)
[1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
[1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
[1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
[1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
[1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
(...)
[1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---
That BUG_ON happens because while send is using a node, that node is COWed
by a concurrent deduplication, gets freed and gets reused as a leaf (because
a transaction commit happened in between), so when it attempts to read a
slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
contents were wiped out and it now matches a leaf (which can even belong to
some other tree now), hitting the BUG_ON(level == 0).
Fix this concurrency issue by not allowing send and deduplication to run
in parallel if both operate on the same readonly trees, returning EAGAIN
to user space and logging an explicit warning in dmesg/syslog.
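The exclusion described above can be modelled in user space roughly as follows. This is only a sketch under stated assumptions: struct root_model, the two counters and the *_begin()/*_end() helpers are illustrative names rather than the patch's actual code, a small atomic_flag spinlock stands in for whatever lock protects the counters in the kernel, and the warning that gets logged is left out.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a read-only subvolume/snapshot root. */
struct root_model {
	atomic_flag lock;        /* tiny spinlock protecting both counters */
	int send_in_progress;    /* sends currently reading this root */
	int dedupe_in_progress;  /* dedupes whose destination is in this root */
};

static void root_lock(struct root_model *root)
{
	while (atomic_flag_test_and_set_explicit(&root->lock,
						 memory_order_acquire))
		; /* spin until the flag was previously clear */
}

static void root_unlock(struct root_model *root)
{
	atomic_flag_clear_explicit(&root->lock, memory_order_release);
}

/* Returns 0 if send may start, -EAGAIN if a dedupe targets this root. */
static int send_begin(struct root_model *root)
{
	int ret = 0;

	root_lock(root);
	if (root->dedupe_in_progress > 0)
		ret = -EAGAIN;
	else
		root->send_in_progress++;
	root_unlock(root);
	return ret;
}

static void send_end(struct root_model *root)
{
	root_lock(root);
	root->send_in_progress--;
	root_unlock(root);
}

/* Returns 0 if dedupe may start, -EAGAIN if a send is using this root. */
static int dedupe_begin(struct root_model *root)
{
	int ret = 0;

	root_lock(root);
	if (root->send_in_progress > 0)
		ret = -EAGAIN;
	else
		root->dedupe_in_progress++;
	root_unlock(root);
	return ret;
}

static void dedupe_end(struct root_model *root)
{
	root_lock(root);
	root->dedupe_in_progress--;
	root_unlock(root);
}
```

Checking the other side's counter and bumping one's own under the same lock is what keeps the two operations from both slipping past the check at the same time.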
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
[3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 15 Apr 2019 08:29:36 +0000 (09:29 +0100)]
Btrfs: send, flush dellaloc in order to avoid data loss
When we set a subvolume to read-only mode we do not flush delalloc for any
of its inodes (except if the filesystem is mounted with -o flushoncommit),
since it does not affect correctness for any subsequent operations - except
for a future send operation. The send operation will not be able to see the
delalloc data since the respective file extent items, inode item updates,
backreferences, etc, have not yet hit the subvolume and extent trees.
Effectively this means data loss, since the send stream will not contain
any data from existing delalloc. Another problem from this is that if the
writeback starts and finishes while the send operation is in progress, we
have the subvolume tree being modified concurrently, which can result
in send failing unexpectedly with EIO or hitting runtime errors, assertion
failures, BUG_ONs, etc.
Simple reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ btrfs subvolume create /mnt/sv
$ xfs_io -f -c "pwrite -S 0xea 0 108K" /mnt/sv/foo
$ btrfs property set /mnt/sv ro true
$ btrfs send -f /tmp/send.stream /mnt/sv
$ od -t x1 -A d /mnt/sv/foo
0000000 ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea
*
0110592
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ btrfs receive -f /tmp/send.stream /mnt
$ echo $?
0
$ od -t x1 -A d /mnt/sv/foo
0000000
# ---> empty file
Since this is a problem that affects send only, fix it in send by flushing
delalloc for all the roots used by the send operation before send starts
to process the commit roots.
This is a problem that affects send since it was introduced (commit
31db9f7c23fbf7 ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive"))
but backporting it to older kernels has some dependencies:
- For kernels between 3.19 and 4.20, it depends on commit 3cd24c698004d2
("btrfs: use tagged writepage to mitigate livelock of snapshot") because
the function btrfs_start_delalloc_snapshot() does not exist before that
commit. So one has to either pick that commit or replace the calls to
btrfs_start_delalloc_snapshot() in this patch with calls to
btrfs_start_delalloc_inodes().
- For kernels older than 3.19 it also requires commit e5fa8f865b3324
("Btrfs: ensure send always works on roots without orphans") because
it depends on the function ensure_commit_roots_uptodate() which that
commit introduced.
- No dependencies for 5.0+ kernels.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 3.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 15 Apr 2019 13:50:51 +0000 (14:50 +0100)]
Btrfs: do not start a transaction during fiemap
During fiemap, for regular extents (non inline) we need to check if they
are shared and if they are, set the shared bit. Checking if an extent is
shared requires checking the delayed references of the currently running
transaction, since some reference might have not yet hit the extent tree
and be only in the in-memory delayed references.
However we were using a transaction join for this, which creates a new
transaction when there is no transaction currently running. That means
that two more potential failures can happen: creating the transaction and
committing it. Further, if no write activity is currently happening in the
system, and fiemap calls keep being done, we end up creating and
committing transactions that do nothing.
In some extreme cases this can result in the commit of the transaction
created by fiemap failing with ENOSPC when updating the root item of a
subvolume tree, because a join does not reserve any space, leading to a
trace like the following:
heisenberg kernel: ------------[ cut here ]------------
heisenberg kernel: BTRFS: Transaction aborted (error -28)
heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs]
(...)
heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2
heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018
heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs]
(...)
heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286
heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006
heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0
heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007
heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078
heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068
heisenberg kernel: FS: 0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000
heisenberg kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0
heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
heisenberg kernel: Call Trace:
heisenberg kernel: commit_fs_roots+0x166/0x1d0 [btrfs]
heisenberg kernel: ? _cond_resched+0x15/0x30
heisenberg kernel: ? btrfs_run_delayed_refs+0xac/0x180 [btrfs]
heisenberg kernel: btrfs_commit_transaction+0x2bd/0x870 [btrfs]
heisenberg kernel: ? start_transaction+0x9d/0x3f0 [btrfs]
heisenberg kernel: transaction_kthread+0x147/0x180 [btrfs]
heisenberg kernel: ? btrfs_cleanup_transaction+0x530/0x530 [btrfs]
heisenberg kernel: kthread+0x112/0x130
heisenberg kernel: ? kthread_bind+0x30/0x30
heisenberg kernel: ret_from_fork+0x35/0x40
heisenberg kernel: ---[ end trace 05de912e30e012d9 ]---
Since fiemap (and btrfs_check_shared()) is a read-only operation, do not do
a transaction join, to avoid the overhead of creating a new transaction (if
there is currently no running transaction) and introducing a potential
point of failure when the new transaction gets committed. Instead, use a
transaction attach to grab a handle for the currently running transaction,
if any.
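The difference between a join and an attach can be illustrated with a small user-space model. It is a sketch only: struct fs_model, model_join(), model_attach() and model_check_shared() are hypothetical names, and the model assumes that an attach reports a missing running transaction as -ENOENT, which a read-only caller can treat as "nothing in memory to inspect".

```c
#include <errno.h>
#include <stddef.h>

struct txn_model {
	unsigned long transid;
};

/* Hypothetical stand-in for the filesystem's transaction state. */
struct fs_model {
	struct txn_model *running;  /* NULL when no transaction is running */
	struct txn_model slot;      /* storage for the running transaction */
	unsigned long created;      /* how many transactions were created */
};

/* join: reuse the running transaction, or create one if there is none. */
static struct txn_model *model_join(struct fs_model *fs)
{
	if (!fs->running) {
		fs->slot.transid++;
		fs->running = &fs->slot;
		fs->created++;      /* this one must be committed later */
	}
	return fs->running;
}

/* attach: only hand out a running transaction, never create one. */
static struct txn_model *model_attach(struct fs_model *fs, int *err)
{
	*err = fs->running ? 0 : -ENOENT;
	return fs->running;
}

/* A read-only check built on attach. */
static int model_check_shared(struct fs_model *fs)
{
	int err;
	struct txn_model *trans = model_attach(fs, &err);

	if (!trans && err != -ENOENT)
		return err;
	/*
	 * With trans == NULL there are no in-memory delayed references,
	 * so only already committed state would be consulted here.
	 */
	return 0;
}
```

In this model a caller that only reads never increments the created counter, so it never leaves behind a transaction that has to be committed.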
Reported-by: Christoph Anton Mitterer <calestyo@scientia.net>
Link: https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/
Fixes: afce772e87c36c ("btrfs: fix check_shared for fiemap ioctl")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 10:43:36 +0000 (11:43 +0100)]
btrfs: remove unused parameter fs_info from btrfs_set_disk_extent_flags
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 10:42:34 +0000 (11:42 +0100)]
btrfs: remove unused parameter fs_info from btrfs_add_delayed_extent_op
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 10:29:46 +0000 (11:29 +0100)]
btrfs: remove unused parameter fs_info from emit_last_fiemap_cache
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 14:59:22 +0000 (15:59 +0100)]
btrfs: remove unused parameter fs_info from CHECK_FE_ALIGNED
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 14:03:48 +0000 (15:03 +0100)]
btrfs: remove unused parameter fs_info from from tree_advance
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 14:02:46 +0000 (15:02 +0100)]
btrfs: remove unused parameter fs_info from tree_move_down
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 13:51:10 +0000 (14:51 +0100)]
btrfs: remove unused parameter fs_info from btrfs_extend_item
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 13:49:12 +0000 (14:49 +0100)]
btrfs: remove unused parameter fs_info from btrfs_truncate_item
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 13:44:57 +0000 (14:44 +0100)]
btrfs: remove unused parameter fs_info from split_item
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 4 Apr 2019 06:45:37 +0000 (14:45 +0800)]
btrfs: qgroup: Don't scan leaf if we're modifying reloc tree
Since reloc tree doesn't contribute to qgroup numbers, just skip them.
This should catch the final cause of unnecessary data ref processing
when running balance of metadata with qgroups on.
The 4G data 16 snapshots test (*) should explain it pretty well:
             | delayed subtree | refactor delayed ref | this patch
---------------------------------------------------------------------
relocated    |           22653 |                22673 |      22744
qgroup dirty |          122792 |                48360 |         70
time         |          24.494 |               11.606 |      3.944
Finally, we're at the stage where qgroup + metadata balance costs no
obvious overhead.
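The skip described at the top of this message boils down to a check on which tree is being modified. A minimal sketch, assuming the reloc tree is identified by an objectid of -8 cast to u64; the macro and function names here are illustrative, not the patch's code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed objectid of the reloc tree (illustrative constant). */
#define MODEL_TREE_RELOC_OBJECTID ((uint64_t)-8)

/*
 * Decide whether a leaf should be scanned for qgroup accounting.
 * Reloc trees do not contribute to qgroup numbers, so they are skipped.
 */
static bool should_trace_leaf(bool quota_enabled, uint64_t root_objectid)
{
	if (!quota_enabled)
		return false;
	if (root_objectid == MODEL_TREE_RELOC_OBJECTID)
		return false;
	return true;
}
```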
Test environment:
Test VM:
- vRAM 8G
- vCPU 8
- block dev virtio-blk, 'unsafe' cache mode
- host block 850evo
Test workload:
- Copy 4G data from /usr/ to one subvolume
- Create 16 snapshots of that subvolume, and modify 3 files in each
snapshot
- Enable quota, rescan
- Time "btrfs balance start -m"
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 4 Apr 2019 06:45:36 +0000 (14:45 +0800)]
btrfs: extent-tree: Use btrfs_ref to refactor btrfs_free_extent()
Similar to btrfs_inc_extent_ref(), use btrfs_ref to replace the long
parameter list and the confusing @owner parameter.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 4 Apr 2019 06:45:35 +0000 (14:45 +0800)]
btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref()
Use the new btrfs_ref structure and replace parameter list to clean up
the usage of owner and level to distinguish the extent types.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 4 Apr 2019 06:45:34 +0000 (14:45 +0800)]
btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes()
Since add_pinned_bytes() only needs to know if the extent is metadata
and if it's a chunk tree extent, btrfs_ref is a perfect match for it, as
we don't need various owner/level tricks to determine extent type.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 4 Apr 2019 06:45:33 +0000 (14:45 +0800)]
btrfs: ref-verify: Use btrfs_ref to refactor btrfs_ref_tree_mod()
It's a perfect match for btrfs_ref_tree_mod() to use btrfs_ref, as
btrfs_ref describes a metadata/data reference update comprehensively.
Now we have one less function using the confusing owner/level trick.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 4 Apr 2019 06:45:32 +0000 (14:45 +0800)]
btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_data_ref()
Just like btrfs_add_delayed_tree_ref(), use btrfs_ref to refactor
btrfs_add_delayed_data_ref().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 4 Apr 2019 06:45:31 +0000 (14:45 +0800)]
btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_tree_ref()
btrfs_add_delayed_tree_ref() has a longer and longer parameter list, and
some callers like btrfs_inc_extent_ref() are using @owner as level for
delayed tree ref.
Instead of making the parameter list longer, use btrfs_ref to refactor
it, so each parameter assignment is self-explanatory without the dirty
level/owner trick, and provide the basis for later refactoring.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 4 Apr 2019 06:45:30 +0000 (14:45 +0800)]
btrfs: extent-tree: Open-code process_func in __btrfs_mod_ref
The process_func function pointer is local to __btrfs_mod_ref() and
points to either btrfs_inc_extent_ref() or btrfs_free_extent().
Open code it to make later delayed ref refactor easier, so we can
refactor btrfs_inc_extent_ref() and btrfs_free_extent() in different
patches.
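What open-coding means here can be shown with a small before/after sketch. The two callees below are hypothetical stand-ins that share one simplified signature; they are not the real btrfs_inc_extent_ref() and btrfs_free_extent().

```c
#include <stdbool.h>

/* Call counters so the two variants can be compared. */
static int inc_calls;
static int free_calls;

static int model_inc_extent_ref(unsigned long long bytenr)
{
	(void)bytenr;
	inc_calls++;
	return 0;
}

static int model_free_extent(unsigned long long bytenr)
{
	(void)bytenr;
	free_calls++;
	return 0;
}

/* Before: both callees are reached through one local function pointer. */
static int mod_ref_indirect(unsigned long long bytenr, bool inc)
{
	int (*process_func)(unsigned long long) =
		inc ? model_inc_extent_ref : model_free_extent;

	return process_func(bytenr);
}

/* After: open-coded, each callee is invoked at its own call site. */
static int mod_ref_open_coded(unsigned long long bytenr, bool inc)
{
	if (inc)
		return model_inc_extent_ref(bytenr);
	return model_free_extent(bytenr);
}
```

With the pointer gone, the two callees no longer have to share a signature, so each can be changed in its own patch.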
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 4 Apr 2019 06:45:29 +0000 (14:45 +0800)]
btrfs: delayed-ref: Introduce better documented delayed ref structures
Current delayed ref interface has several problems:
- Longer and longer parameter lists
  bytenr
  num_bytes
  parent
  ---------- so far so good
  ref_root
  owner
  offset
  ---------- I don't feel good now
- Different interpretation of the same parameter
  Above @owner for data ref is inode number (u64),
  while for tree ref, it's level (int).
  They are even in different size ranges.
  For level we only need 0 ~ 8, while for ino it's
  BTRFS_FIRST_FREE_OBJECTID ~ BTRFS_LAST_FREE_OBJECTID.
  And @offset doesn't even make sense for tree ref.
  Such parameter reuse may look clever as a hidden union, but it
  destroys code readability.
To solve both problems, we introduce a new structure, btrfs_ref:
- Structure instead of long parameter list
This makes later expansion easier, and is better documented.
- Use btrfs_ref::type to distinguish data and tree ref
- Use proper union to store data/tree ref specific structures.
- Use separate functions to fill data/tree ref data, with a common generic
function to fill common bytenr/num_bytes members.
All parameters will find their place in btrfs_ref, and an extra member,
@real_root, inspired by ref-verify code, is newly introduced for later
qgroup code, to record which tree triggered this extent modification.
This patch doesn't touch any code, but provides the basis for further
refactoring.
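The shape of such a structure might look roughly like the following user-space sketch. All type, member and function names are illustrative guesses based on the description above, not the definitions added by the patch.

```c
#include <stdint.h>
#include <string.h>

enum ref_type_model {
	REF_NOT_SET,
	REF_DATA,
	REF_METADATA,
};

/* Members that only make sense for a data reference. */
struct data_ref_model {
	uint64_t ref_root;
	uint64_t ino;      /* inode number, a full u64 */
	uint64_t offset;
};

/* Members that only make sense for a tree reference. */
struct tree_ref_model {
	uint64_t root;
	int level;         /* small range, 0 ~ 8 */
};

struct ref_model {
	enum ref_type_model type;  /* selects the valid union member */
	uint64_t real_root;        /* tree that triggered the modification */
	uint64_t bytenr;
	uint64_t num_bytes;
	uint64_t parent;
	union {
		struct data_ref_model data_ref;
		struct tree_ref_model tree_ref;
	};
};

/* Fill the members common to both kinds of reference. */
static void init_generic_ref(struct ref_model *ref, uint64_t bytenr,
			     uint64_t num_bytes, uint64_t parent)
{
	memset(ref, 0, sizeof(*ref));
	ref->bytenr = bytenr;
	ref->num_bytes = num_bytes;
	ref->parent = parent;
}

static void init_tree_ref(struct ref_model *ref, int level, uint64_t root)
{
	ref->type = REF_METADATA;
	ref->tree_ref.level = level;
	ref->tree_ref.root = root;
}

static void init_data_ref(struct ref_model *ref, uint64_t ref_root,
			  uint64_t ino, uint64_t offset)
{
	ref->type = REF_DATA;
	ref->data_ref.ref_root = ref_root;
	ref->data_ref.ino = ino;
	ref->data_ref.offset = offset;
}
```

A caller fills the common part first and then exactly one of the two specific parts, so a level is never passed where an inode number is expected.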
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 17 Apr 2019 10:30:30 +0000 (11:30 +0100)]
Btrfs: do not start a transaction at iterate_extent_inodes()
When finding out which inodes have references on a particular extent, done
by backref.c:iterate_extent_inodes(), from the BTRFS_IOC_LOGICAL_INO (both
v1 and v2) ioctl and from scrub we use the transaction join API to grab a
reference on the currently running transaction, since in order to give
accurate results we need to inspect the delayed references of the currently
running transaction.
However, if there is currently no running transaction, the join operation
will create a new transaction. This is inefficient as the transaction will
eventually be committed, doing unnecessary IO and introducing a potential
point of failure that will lead to a transaction abort due to -ENOSPC, as
recently reported [1].
That's because the join creates the transaction but does not reserve any
space, so when attempting to update the root item of the root passed to
btrfs_join_transaction(), during the transaction commit, we can end up
failing with -ENOSPC. Users of a join operation are supposed to actually
do some filesystem changes and reserve space by some means, which is not
the case of iterate_extent_inodes(): it is a read-only operation for all
contexts from which it is called.
The reported [1] -ENOSPC failure stack trace is the following:
heisenberg kernel: ------------[ cut here ]------------
heisenberg kernel: BTRFS: Transaction aborted (error -28)
heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs]
(...)
heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2
heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018
heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs]
(...)
heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286
heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006
heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0
heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007
heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078
heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068
heisenberg kernel: FS: 0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000
heisenberg kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0
heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
heisenberg kernel: Call Trace:
heisenberg kernel: commit_fs_roots+0x166/0x1d0 [btrfs]
heisenberg kernel: ? _cond_resched+0x15/0x30
heisenberg kernel: ? btrfs_run_delayed_refs+0xac/0x180 [btrfs]
heisenberg kernel: btrfs_commit_transaction+0x2bd/0x870 [btrfs]
heisenberg kernel: ? start_transaction+0x9d/0x3f0 [btrfs]
heisenberg kernel: transaction_kthread+0x147/0x180 [btrfs]
heisenberg kernel: ? btrfs_cleanup_transaction+0x530/0x530 [btrfs]
heisenberg kernel: kthread+0x112/0x130
heisenberg kernel: ? kthread_bind+0x30/0x30
heisenberg kernel: ret_from_fork+0x35/0x40
heisenberg kernel: ---[ end trace 05de912e30e012d9 ]---
So fix that by using the attach API, which does not create a transaction
when there is currently no running transaction.
[1] https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:34:54 +0000 (16:34 +0100)]
btrfs: get fs_info from device in btrfs_rm_dev_replace_free_srcdev
We can read fs_info from the device and can drop it from the parameters.
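The pattern behind this kind of cleanup can be sketched as follows. The structures and function names are simplified, hypothetical stand-ins, not the kernel's definitions.

```c
#include <stddef.h>

struct fs_info_model {
	int id;
};

/* The device already carries a back pointer to its filesystem. */
struct device_model {
	struct fs_info_model *fs_info;
	const char *name;
};

/* Before: fs_info is passed in even though the device knows it. */
static int dev_op_old(struct fs_info_model *fs_info,
		      struct device_model *device)
{
	(void)device;
	return fs_info->id;
}

/* After: the parameter is dropped and fs_info is read from the device. */
static int dev_op_new(struct device_model *device)
{
	struct fs_info_model *fs_info = device->fs_info;

	return fs_info->id;
}
```

Besides shortening the call sites, this removes the possibility of a caller passing an fs_info that does not match the device.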
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:32:55 +0000 (16:32 +0100)]
btrfs: get fs_info from device in btrfs_scrub_cancel_dev
We can read fs_info from the device and can drop it from the parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:31:53 +0000 (16:31 +0100)]
btrfs: get fs_info from device in btrfs_rm_dev_item
We can read fs_info from the device and can drop it from the parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 13:40:41 +0000 (14:40 +0100)]
btrfs: get fs_info from eb in __push_leaf_left
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 13:39:45 +0000 (14:39 +0100)]
btrfs: get fs_info from eb in __push_leaf_right
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 10 Apr 2019 16:46:04 +0000 (19:46 +0300)]
btrfs: Remove bio_offset argument from submit_bio_hook
None of the implementers of the submit_bio_hook use the bio_offset
parameter, so simply remove it. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 10 Apr 2019 14:24:42 +0000 (17:24 +0300)]
btrfs: Always pass 0 bio_offset for btree_submit_bio_start
The btree submit hook queues the async csum and forwards the bio_offset
parameter passed to btree_submit_bio_hook. This is redundant since
btree_submit_bio_start calls btree_csum_one_bio which doesn't use the
offset at all. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 10 Apr 2019 14:24:41 +0000 (17:24 +0300)]
btrfs: Pass 0 for bio_offset to btrfs_wq_submit_bio
Buffered writeback always calls btrfs_csum_one_bio with the last 2
arguments being 0, irrespective of what bio_offset has been passed to
btrfs_submit_bio_start. Make this apparent by explicitly passing 0 for
bio_offset when calling btrfs_wq_submit_bio from btrfs_submit_bio_hook.
This will allow for further simplifications down the line. No functional
changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 10 Apr 2019 14:24:40 +0000 (17:24 +0300)]
btrfs: Remove 'tree' argument from read_extent_buffer_pages
This function always uses the btree inode's io_tree. Stop taking the
tree as a function argument and instead access it internally from
read_extent_buffer_pages. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 10 Apr 2019 14:24:39 +0000 (17:24 +0300)]
btrfs: Change submit_bio_hook to taking an inode directly
The only possible 'private_data' that is passed to this function is
actually an inode. Make that explicit by changing the signature of the
call back. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 10 Apr 2019 14:24:38 +0000 (17:24 +0300)]
btrfs: Define submit_bio_hook's type directly
There is no need to use a typedef to define the type of the function and
then use that to define the respective member in extent_io_ops. Define
struct's member directly. No functional changes.
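The two ways of declaring such a member can be compared in a short sketch. The hook's parameter list and all names below are made up for illustration and do not reflect the real submit_bio_hook signature.

```c
/* Before: a typedef names the function type, the member then uses it. */
typedef int (model_submit_bio_hook_t)(void *inode, int mirror_num);

struct ops_with_typedef {
	model_submit_bio_hook_t *submit_bio_hook;
};

/* After: the member's function pointer type is spelled out in place. */
struct ops_direct {
	int (*submit_bio_hook)(void *inode, int mirror_num);
};

/* A hook implementation that is assignable to either member. */
static int model_hook(void *inode, int mirror_num)
{
	(void)inode;
	return mirror_num;
}
```

Both members have the same type, so dropping the typedef changes nothing for the implementations or the callers.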
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 13:11:21 +0000 (14:11 +0100)]
btrfs: get fs_info from block group in search_free_space_info
We can read fs_info from the block group cache structure and can drop it
from the parameters. Though the transaction is also available, it's not
guaranteed to be non-NULL.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 12:53:49 +0000 (13:53 +0100)]
btrfs: get fs_info from block group in btrfs_find_space_cluster
We can read fs_info from the block group cache structure and can drop it
from the parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 12:49:09 +0000 (13:49 +0100)]
btrfs: get fs_info from block group in write_pinned_extent_entries
We can read fs_info from the block group cache structure and can drop it
from the parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 12:47:15 +0000 (13:47 +0100)]
btrfs: get fs_info from block group in load_free_space_cache
We can read fs_info from the block group cache structure and can drop it
from the parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 12:40:19 +0000 (13:40 +0100)]
btrfs: get fs_info from block group in lookup_free_space_inode
We can read fs_info from the block group cache structure and can drop it
from the parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 11:12:32 +0000 (12:12 +0100)]
btrfs: get fs_info from block group in pin_down_extent
We can read fs_info from the block group cache structure and can drop it
from the parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 11:01:07 +0000 (12:01 +0100)]
btrfs: get fs_info from block group in next_block_group
We can read fs_info from the block group cache structure and can drop it
from the parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 17 Apr 2019 09:28:47 +0000 (10:28 +0100)]
Btrfs: remove no longer used function to run delayed refs asynchronously
It used to be called from only two places (truncate path and releasing a
transaction handle), but commits 28bad2125767c5 ("btrfs: fix truncate
throttling") and db2462a6ad3dc4 ("btrfs: don't run delayed refs in the end
transaction logic") removed their calls to this function, so it's not used
anymore. Just remove it and all its helpers.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Fri, 12 Apr 2019 08:02:59 +0000 (16:02 +0800)]
btrfs: cleanup btrfs_setxattr_trans and drop transaction parameter
Previous patch made sure that btrfs_setxattr_trans() is called only when
the transaction is NULL. Clean up btrfs_setxattr_trans() and drop the
parameter.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Fri, 12 Apr 2019 08:02:58 +0000 (16:02 +0800)]
btrfs: split btrfs_setxattr calls regarding transaction
When the caller has already created the transaction handle,
btrfs_setxattr() will use it. Also adds assert in btrfs_setxattr().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Fri, 12 Apr 2019 08:02:57 +0000 (16:02 +0800)]
btrfs: remove redundant readonly root check in btrfs_setxattr_trans
btrfs_setxattr_trans() is called by 5 functions as below and all of them
do updates. None of them would be run on a read-only root.
So it's ok to remove the readonly root check here as it's a high-level
condition.
1.
__btrfs_set_acl()
btrfs_init_acl()
btrfs_init_inode_security()
2.
__btrfs_set_acl()
btrfs_set_acl()
3.
btrfs_set_prop()
btrfs_set_prop_trans()
/ \
btrfs_ioctl_setflags() btrfs_xattr_handler_set_prop()
4.
btrfs_xattr_handler_set()
5.
btrfs_initxattrs()
btrfs_xattr_security_init()
btrfs_init_inode_security()
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Fri, 12 Apr 2019 08:02:56 +0000 (16:02 +0800)]
btrfs: export btrfs_setxattr
Preparatory patch, as we are going to split the calls with and without
transaction to use the respective btrfs_setxattr() and
btrfs_setxattr_trans() functions. Export btrfs_setxattr() for calls
outside of xattr.c.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Fri, 12 Apr 2019 08:02:55 +0000 (16:02 +0800)]
btrfs: rename do_setxattr to btrfs_setxattr
When trans is not NULL btrfs_setxattr() calls do_setxattr() directly
with a check for readonly root. Rename do_setxattr() to btrfs_setxattr() in
preparation to call do_setxattr() directly instead. Preparatory patch,
no functional changes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Fri, 12 Apr 2019 08:02:54 +0000 (16:02 +0800)]
btrfs: rename btrfs_setxattr to btrfs_setxattr_trans
Rename btrfs_setxattr() to btrfs_setxattr_trans(), so that do_setxattr()
can be renamed to btrfs_setxattr().
Preparatory patch, no functional changes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Mon, 15 Apr 2019 13:15:25 +0000 (21:15 +0800)]
btrfs: trace: Introduce trace events for all btrfs tree locking events
Unlike btrfs_tree_lock() and btrfs_tree_read_lock(), the remaining
functions in locking.c will not sleep, thus it doesn't make much sense to
record their execution time.
Those events are introduced mainly for user space tools to audit and
detect lock leakage or deadlock.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Mon, 15 Apr 2019 13:15:24 +0000 (21:15 +0800)]
btrfs: trace: Introduce trace events for sleepable tree lock
There are two tree lock events which can sleep:
- btrfs_tree_read_lock()
- btrfs_tree_lock()
Sometimes we may need to look into the concurrency picture of the fs.
For that case, we need the execution time of above two functions and the
owner of @eb.
Here we introduce trace events for user space tools like bcc, to get
the execution time of above two functions, and get detailed owner info
where eBPF code can't.
All the overhead is hidden behind the trace events, so if events are not
enabled, there is no overhead.
These trace events also output bytenr and generation, allowing them to be
paired with unlock events to pin down deadlocks.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
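As a rough illustration of how a user space tool might pair lock events with unlock events using bytenr and generation, here is a self-contained sketch. The event layout (demo_event) and the function demo_count_leaked_locks are assumptions made for this example; they do not reflect the actual trace event format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum demo_event_type { DEMO_LOCK, DEMO_UNLOCK };

/* Hypothetical decoded trace record. */
struct demo_event {
	enum demo_event_type type;
	uint64_t bytenr;
	uint64_t generation;
};

static int demo_same_key(const struct demo_event *a, const struct demo_event *b)
{
	return a->bytenr == b->bytenr && a->generation == b->generation;
}

/*
 * Count lock events that never see a matching unlock. For every distinct
 * (bytenr, generation) key the locks and unlocks are balanced against each
 * other; any positive remainder is reported as leaked.
 */
size_t demo_count_leaked_locks(const struct demo_event *ev, size_t n)
{
	size_t leaked = 0;

	assert(ev != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		long balance = 0;
		int first = 1;

		/* Handle each key only at its first occurrence. */
		for (size_t j = 0; j < i; j++) {
			if (demo_same_key(&ev[j], &ev[i])) {
				first = 0;
				break;
			}
		}
		if (!first)
			continue;

		for (size_t j = i; j < n; j++) {
			if (!demo_same_key(&ev[j], &ev[i]))
				continue;
			balance += (ev[j].type == DEMO_LOCK) ? 1 : -1;
		}
		if (balance > 0)
			leaked += (size_t)balance;
	}
	return leaked;
}
```

A real tool would also need the owner and timing fields mentioned above; this sketch only shows the pairing step.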
Filipe Manana [Mon, 15 Apr 2019 08:29:19 +0000 (09:29 +0100)]
Btrfs: remove no longer used member num_dirty_bgs from transaction
The member num_dirty_bgs of struct btrfs_transaction is not used anymore,
it is set and incremented but nothing reads its value anymore. Its last
read use was removed by commit 64403612b73a94 ("btrfs: rework
btrfs_check_space_for_delayed_refs"). So just remove that member.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:51:44 +0000 (16:51 +0100)]
btrfs: get fs_info from trans in btrfs_run_dev_replace
We can read fs_info from the transaction and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:50:38 +0000 (16:50 +0100)]
btrfs: get fs_info from trans in btrfs_run_dev_stats
We can read fs_info from the transaction and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:36:39 +0000 (16:36 +0100)]
btrfs: get fs_info from trans in btrfs_finish_sprout
We can read fs_info from the transaction and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:29:13 +0000 (16:29 +0100)]
btrfs: get fs_info from trans in init_first_rw_device
We can read fs_info from the transaction and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 13:42:33 +0000 (14:42 +0100)]
btrfs: get fs_info from trans in copy_for_split
We can read fs_info from the transaction and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 13:32:45 +0000 (14:32 +0100)]
btrfs: get fs_info from trans in insert_ptr
We can read fs_info from the transaction and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 13:18:06 +0000 (14:18 +0100)]
btrfs: get fs_info from trans in balance_node_right
We can read fs_info from the transaction and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 13:16:45 +0000 (14:16 +0100)]
btrfs: get fs_info from trans in push_node_left
We can read fs_info from the transaction and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 12:51:56 +0000 (13:51 +0100)]
btrfs: get fs_info from trans in btrfs_write_out_cache
We can read fs_info from the transaction and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 12:42:57 +0000 (13:42 +0100)]
btrfs: get fs_info from trans in create_free_space_inode
We can read fs_info from the transaction and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 12:28:05 +0000 (13:28 +0100)]
btrfs: get fs_info from trans in btrfs_set_log_full_commit
We can read fs_info from the transaction and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 12:25:34 +0000 (13:25 +0100)]
btrfs: get fs_info from trans in btrfs_need_log_full_commit
We can read fs_info from the transaction and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 12:20:49 +0000 (13:20 +0100)]
btrfs: get fs_info from trans in btrfs_create_tree
We can read fs_info from the transaction and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 11:10:15 +0000 (12:10 +0100)]
btrfs: get fs_info from trans in update_block_group
We can read fs_info from the transaction and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 11:04:08 +0000 (12:04 +0100)]
btrfs: get fs_info from trans in btrfs_write_dirty_block_groups
We can read fs_info from the transaction and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 11:02:55 +0000 (12:02 +0100)]
btrfs: get fs_info from trans in btrfs_setup_space_cache
We can read fs_info from the transaction and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 10:57:46 +0000 (11:57 +0100)]
btrfs: get fs_info from trans in write_one_cache_group
We can read fs_info from the transaction and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 10 Apr 2019 13:16:11 +0000 (16:16 +0300)]
btrfs: Remove redundant inode argument from btrfs_add_ordered_sum
Ordered csums are keyed off of a btrfs_ordered_extent, which already has
a reference to the inode. This implies that an explicit inode argument
is redundant. So remove it.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 4 Apr 2019 03:47:08 +0000 (11:47 +0800)]
btrfs: Do mandatory tree block check before submitting bio
There are at least 2 reports about a memory bit flip sneaking into
on-disk data.
Currently we only have a relaxed check triggered at
btrfs_mark_buffer_dirty() time. As it's not mandatory and only for
CONFIG_BTRFS_FS_CHECK_INTEGRITY enabled builds, it doesn't help users to
detect such problems.
This patch will address the hole by triggering a comprehensive check on
tree blocks before writing them back to disk.
The design points are:
- Timing of the check: Tree block write hook
This timing is chosen to reduce the overhead.
The comprehensive check should be as expensive as a checksum
calculation.
Doing full check at btrfs_mark_buffer_dirty() is too expensive for end
user.
- Loose empty leaf check
Originally for an empty leaf, tree-checker will report error if it's
not a tree root.
The problem for such check at write time is:
* False alert for tree root created in current transaction
In that case, the commit root still needs to be written to disk.
And since current root can differ from commit root, then it will
cause false alert.
This happens for log tree.
* False alert for relocated tree block
Relocated tree block can be written to disk due to memory pressure,
in that case an empty csum tree root can be written to disk and
cause false alert, since csum root node hasn't been updated.
Previous patch of removing comprehensive empty leaf owner check has
paved the way for this patch.
The example error output will be something like:
BTRFS critical (device dm-3): corrupt leaf: root=2 block=1350630375424 slot=68, bad key order, prev (10510212874240 169 0) current (1714119868416 169 0)
BTRFS error (device dm-3): block=1350630375424 write time tree block corruption detected
BTRFS: error (device dm-3) in btrfs_commit_transaction:2220: errno=-5 IO failure (Error while writing out transaction)
BTRFS info (device dm-3): forced readonly
BTRFS warning (device dm-3): Skipping commit of aborted transaction.
BTRFS: error (device dm-3) in cleanup_transaction:1839: errno=-5 IO failure
BTRFS info (device dm-3): delayed_refs has NO entry
Reported-by: Leonard Lausen <leonard@lausen.nl>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
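The "bad key order" report in the example output above comes down to comparing adjacent keys in a leaf. The following sketch shows one plausible shape of such a check; demo_key, demo_comp_keys and demo_find_bad_key_order are illustrative names, and the real tree-checker performs many more checks than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced key: ordered by objectid, then type, then offset. */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Return <0, 0 or >0 when a sorts before, equal to or after b. */
int demo_comp_keys(const struct demo_key *a, const struct demo_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Return the first slot whose key is not strictly greater than the key in
 * the previous slot, or -1 when all keys are in order.
 */
int demo_find_bad_key_order(const struct demo_key *keys, int nritems)
{
	assert(keys != NULL || nritems == 0);
	for (int slot = 1; slot < nritems; slot++) {
		if (demo_comp_keys(&keys[slot - 1], &keys[slot]) >= 0)
			return slot;
	}
	return -1;
}
```

Fed with the two keys from the log, the second slot is flagged because its objectid is smaller than the previous one, which is consistent with a flipped bit in memory.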
Qu Wenruo [Thu, 4 Apr 2019 03:47:07 +0000 (11:47 +0800)]
btrfs: tree-checker: Remove comprehensive root owner check
Commit 1ba98d086fe3 ("Btrfs: detect corruption when non-root leaf has
zero item") introduced comprehensive root owner checker.
However it's a pretty expensive tree search to locate the owner root,
especially when it gets reused by mandatory read and write time
tree-checker.
This patch will remove that check, and completely rely on owner based
empty leaf check, which is much faster and still works fine for most
cases.
And since we skip the old root owner check, now write time tree check
can be merged with btrfs_check_leaf_full().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Robbie Ko [Tue, 26 Mar 2019 03:56:11 +0000 (11:56 +0800)]
Btrfs: fix data bytes_may_use underflow with fallocate due to failed quota reserve
When doing fallocate, we first add the range to the reserve_list and
then reserve the quota. If quota reservation fails, we'll release all
reserved parts of reserve_list.
However, cur_offset is not updated to indicate that this range has
already been inserted into the list. Therefore, the same range is freed
twice: once in the list_for_each_entry loop, and once at the end of the
function. This will result in a WARN_ON on bytes_may_use when we free the
remaining space.
At the end, under the 'out' label we have a call to:
btrfs_free_reserved_data_space(inode, data_reserved, alloc_start, alloc_end - cur_offset);
The start offset, third argument, should be cur_offset.
Everything from alloc_start to cur_offset was freed by the
list_for_each_entry_safe loop.
Fixes: 18513091af94 ("btrfs: update btrfs_space_info's bytes_may_use timely")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
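The double free described in the fallocate fix above can be checked with simple arithmetic. This is a simplified model, not the kernel code: demo_bytes_freed_on_error and its parameters are hypothetical, and it assumes the list loop frees everything from alloc_start up to the end of the last range inserted into the list.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Total bytes released on the error path.
 *
 * The list loop releases [alloc_start, last_inserted_end). Afterwards the
 * code under the 'out' label releases alloc_end - cur_offset more bytes.
 * When cur_offset was never advanced it still equals alloc_start, so the
 * already released part is released a second time.
 */
uint64_t demo_bytes_freed_on_error(uint64_t alloc_start, uint64_t alloc_end,
				   uint64_t last_inserted_end,
				   int update_cur_offset)
{
	uint64_t cur_offset = alloc_start;
	uint64_t freed;

	assert(alloc_start <= last_inserted_end);
	assert(last_inserted_end <= alloc_end);

	/* Released by the list loop. */
	freed = last_inserted_end - alloc_start;

	/* The fix: remember how far the list already covers the range. */
	if (update_cur_offset)
		cur_offset = last_inserted_end;

	/* Released at the end of the function. */
	freed += alloc_end - cur_offset;
	return freed;
}
```

With a reservation of 1000 bytes and 600 bytes already on the list, the unfixed variant releases 1600 bytes, which is the kind of underflow that trips the WARN_ON, while the fixed variant releases exactly 1000.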
David Sterba [Wed, 20 Mar 2019 15:45:15 +0000 (16:45 +0100)]
btrfs: get fs_info from eb in read_one_dev
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:43:07 +0000 (16:43 +0100)]
btrfs: get fs_info from eb in read_one_chunk
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:40:48 +0000 (16:40 +0100)]
btrfs: get fs_info from eb in btrfs_check_chunk_valid
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:38:52 +0000 (16:38 +0100)]
btrfs: get fs_info from eb in should_balance_chunk
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:25:00 +0000 (16:25 +0100)]
btrfs: get fs_info from eb in btrfs_check_node
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:24:18 +0000 (16:24 +0100)]
btrfs: get fs_info from eb in btrfs_check_leaf_relaxed
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:23:29 +0000 (16:23 +0100)]
btrfs: get fs_info from eb in btrfs_check_leaf_full
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 27 Mar 2019 12:24:18 +0000 (14:24 +0200)]
btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit
Instead of always calling the allocator to search for a free extent,
that satisfies the input criteria, switch btrfs_trim_free_extents to
using find_first_clear_extent_bit. With this change it's no longer
necessary to read the device tree in order to figure out holes in
the devices.
Now the code always searches an in-memory data structure to figure out the
space range which contains the requested range, which should result in speed
improvements.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 27 Mar 2019 12:24:17 +0000 (14:24 +0200)]
btrfs: Implement find_first_clear_extent_bit
This function is very similar to find_first_extent_bit except that it
locates the first contiguous span of space which does not have bits set.
Its intended use is in the freespace trimming code.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
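To make the idea of "first contiguous span without bits set" concrete, here is a simplified sketch over a sorted array of set ranges. It is not the kernel implementation, which walks an rbtree of extent states and has its own rules for the returned start; demo_range and demo_first_clear_range are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A range with bits set; both ends are inclusive. */
struct demo_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Find the first clear span at or after 'start' in a sorted,
 * non-overlapping array of set ranges. Returns 0 and fills
 * start_ret and end_ret on success, or -1 when no clear space is left.
 */
int demo_first_clear_range(const struct demo_range *set, size_t n,
			   uint64_t start,
			   uint64_t *start_ret, uint64_t *end_ret)
{
	uint64_t cur = start;

	assert(set != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (set[i].end < cur)
			continue;	/* entirely before the search point */
		if (set[i].start > cur) {
			/* A gap starts at cur and ends before this range. */
			*start_ret = cur;
			*end_ret = set[i].start - 1;
			return 0;
		}
		/* cur lies inside a set range, skip past it. */
		if (set[i].end == UINT64_MAX)
			return -1;
		cur = set[i].end + 1;
	}
	/* Nothing set from cur onwards. */
	*start_ret = cur;
	*end_ret = UINT64_MAX;
	return 0;
}
```

In the trimming code the set ranges would correspond to allocated or already trimmed device space, so the returned span is a candidate for discard.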
Nikolay Borisov [Wed, 27 Mar 2019 12:24:16 +0000 (14:24 +0200)]
btrfs: Optimize unallocated chunks discard
Currently unallocated chunks are always trimmed. For example
2 consecutive trims on large storage would trim freespace twice
irrespective of whether the space was actually allocated or not between
those trims.
Optimise this behavior by exploiting the newly introduced alloc_state
tree of btrfs_device. A new CHUNK_TRIMMED bit is used to mark
those unallocated chunks which have been trimmed and have not been
allocated afterwards. On chunk allocation the respective underlying devices'
physical space will have its CHUNK_TRIMMED flag cleared. This avoids
submitting discards for space which hasn't been changed since the last
time discard was issued.
This applies to the single mount period of the filesystem as the
information is not stored permanently.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
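The effect of the CHUNK_TRIMMED bit can be modelled in a few lines. This is a toy model under stated assumptions: demo_region, demo_trim, demo_alloc and demo_release are hypothetical, and each region stands for a stretch of device space tracked in memory only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One stretch of device space and its in-memory state. */
struct demo_region {
	uint64_t len;
	bool allocated;
	bool trimmed;
};

/* Discard every unallocated region not trimmed yet; return bytes discarded. */
uint64_t demo_trim(struct demo_region *r, size_t n)
{
	uint64_t discarded = 0;

	assert(r != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (r[i].allocated || r[i].trimmed)
			continue;
		discarded += r[i].len;
		r[i].trimmed = true;
	}
	return discarded;
}

/* Allocating a chunk clears the trimmed state of the underlying space. */
void demo_alloc(struct demo_region *r)
{
	r->allocated = true;
	r->trimmed = false;
}

/* Releasing the chunk makes the space eligible for trimming again. */
void demo_release(struct demo_region *r)
{
	r->allocated = false;
}
```

Two consecutive calls to demo_trim() discard the space only once; a region is discarded again only after it has been allocated and released in between. As in the commit, nothing here survives a remount.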
Nikolay Borisov [Wed, 27 Mar 2019 12:24:15 +0000 (14:24 +0200)]
btrfs: Factor out in_range macro
This is used in more than one place so let's factor it out in ctree.h.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
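For readers unfamiliar with the helper, a range check macro of this kind typically looks like the sketch below. The macro body is written from memory of the common idiom and should be treated as an assumption rather than a quote of ctree.h; demo_offset_in_extent is a hypothetical wrapper added for the example.

```c
#include <assert.h>
#include <stdint.h>

/* True when b lies in the half-open interval [first, first + len). */
#define in_range(b, first, len) ((b) >= (first) && (b) < (first) + (len))

/* Wrapper that also guards against first + len wrapping around. */
int demo_offset_in_extent(uint64_t offset, uint64_t start, uint64_t len)
{
	assert(len <= UINT64_MAX - start);
	return in_range(offset, start, len);
}
```

Note that the end of the interval is exclusive, so an offset equal to start + len is outside the range.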
Nikolay Borisov [Wed, 27 Mar 2019 12:24:14 +0000 (14:24 +0200)]
btrfs: Remove 'trans' argument from find_free_dev_extent(_start)
Now that these functions no longer require a handle to transaction to
inspect pending/pinned chunks the argument can be removed. At the same
time also remove any surrounding code which acquired the handle.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Wed, 27 Mar 2019 12:24:12 +0000 (14:24 +0200)]
btrfs: replace pending/pinned chunks lists with io tree
The pending chunks list contains chunks that are allocated in the
current transaction but haven't been created yet. The pinned chunks
list contains chunks that are being released in the current transaction.
Both describe chunks that are not reflected on disk as in use but are
unavailable just the same.
The pending chunks list is anchored by the transaction handle, which
means that we need to hold a reference to a transaction when working
with the list.
The way we use them is by iterating over both lists to perform
comparisons on the stripes they describe for each device. This is
backwards and requires that we keep a transaction handle open while
we're trimming.
This patchset adds an extent_io_tree to btrfs_device that maintains
the allocation state of the device. Extents are set dirty when
chunks are first allocated -- when the extent maps are added to the
mapping tree. They're cleared when last removed -- when the extent
maps are removed from the mapping tree. This matches the lifespan
of the pending and pinned chunks list and allows us to do trims
on unallocated space safely without pinning the transaction for what
may be a lengthy operation. We can also use this io tree to mark
which chunks have already been trimmed so we don't repeat the operation.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 12 Feb 2019 14:13:14 +0000 (16:13 +0200)]
btrfs: Transpose btrfs_close_devices/btrfs_mapping_tree_free in close_ctree
Following the introduction of the alloc_state tree, some of the callees
of btrfs_mapping_tree_free will have to interact with the btrfs_device
of the constituent devices. Enable this by moving the code responsible
for freeing devices after the last user (btrfs_mapping_tree_free).
Otherwise the kernel could crash due to use-after-free.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 27 Mar 2019 12:24:11 +0000 (14:24 +0200)]
btrfs: Stop using call_rcu for device freeing
btrfs_device structs are freed from RCU context since device iteration
is protected by RCU. Currently this is achieved by using call_rcu since
no blocking functions are called within btrfs_free_device. Future
refactoring of pending/pinned chunks will require calling sleeping
functions.
This patch is in preparation for these changes by simply switching from
RCU callbacks to explicit calls of synchronize_rcu and calling
btrfs_free_device directly. This is functionally equivalent, making sure
that there are no readers at that time.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 27 Mar 2019 12:24:10 +0000 (14:24 +0200)]
btrfs: Implement set_extent_bits_nowait
It will be used in a future patch that will require modifying an
extent_io_tree struct under a spinlock.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 25 Mar 2019 12:31:26 +0000 (14:31 +0200)]
btrfs: Introduce new bits for device allocation tree
Rather than hijacking the existing defines let's just define new bits,
with more descriptive names. Instead of using yet more (currently at 18)
bits for the new flags, use the fact those flags will be specific to
the device allocation tree so define them using existing EXTENT_* flags.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
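The approach of giving tree-specific names to existing bit positions, instead of consuming new bits, can be sketched as follows. The generic flag names and values (DEMO_EXTENT_DIRTY, DEMO_EXTENT_DEFRAG) and the choice of which generic bit backs which alias are assumptions for illustration only.

```c
#include <assert.h>

/* Hypothetical generic bits shared by all extent io trees. */
#define DEMO_EXTENT_DIRTY	(1U << 0)
#define DEMO_EXTENT_DEFRAG	(1U << 1)

/*
 * Aliases that are only meaningful in the device allocation tree. They
 * reuse existing bit positions, so no additional bits are consumed.
 */
#define DEMO_CHUNK_ALLOCATED	DEMO_EXTENT_DIRTY
#define DEMO_CHUNK_TRIMMED	DEMO_EXTENT_DEFRAG
#define DEMO_CHUNK_STATE_MASK	(DEMO_CHUNK_ALLOCATED | DEMO_CHUNK_TRIMMED)

/* Set the given bits; only device allocation tree bits are accepted. */
unsigned int demo_set_chunk_bits(unsigned int state, unsigned int bits)
{
	assert((bits & ~DEMO_CHUNK_STATE_MASK) == 0);
	return state | bits;
}

/* Clear the given bits; only device allocation tree bits are accepted. */
unsigned int demo_clear_chunk_bits(unsigned int state, unsigned int bits)
{
	assert((bits & ~DEMO_CHUNK_STATE_MASK) == 0);
	return state & ~bits;
}
```

This works because a given tree only ever interprets its own set of names, so the same bit can carry a different meaning in a different tree.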
Nikolay Borisov [Mon, 25 Mar 2019 12:31:25 +0000 (14:31 +0200)]
btrfs: Populate ->orig_block_len during read_one_chunk
Chunks read from disk currently don't get their ->orig_block_len member
set. In contrast, when a new chunk is allocated, the respective
extent_map's ->orig_block_len is assigned the size of the stripe of this
chunk.
Let's apply the same strategy for chunks which are read from
disk. Not only does this codify the invariant that ->orig_block_len
always contains the size of the stripe for a chunk (when the em belongs
to the mapping tree), but it's also a preparatory patch for further work
around tracking chunk allocation in an extent tree rather than
pinned/pending lists.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 25 Mar 2019 12:31:24 +0000 (14:31 +0200)]
btrfs: Rename and export clear_btree_io_tree
This function is going to be used to clear out the device extent
allocation information. Give it a more generic name and export it. This
is in preparation to replacing the pending/pinned chunk lists with an
extent tree. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 25 Mar 2019 12:31:23 +0000 (14:31 +0200)]
btrfs: Handle pending/pinned chunks before blockgroup relocation during device shrink
During device shrink pinned/pending chunks (i.e. those which have been
deleted/created respectively, in the current transaction and haven't
touched disk) need to be accounted for. Presently
this happens after the main relocation loop in btrfs_shrink_device,
which could lead to making another go in the body of the function.
Since there is no hard requirement to perform pinned/pending chunks
handling after the relocation loop, move the code before it. This leads
to simplifying the code flow around - i.e. no need to use 'goto again'.
A notable side effect of this change is that modification of the
device's size requires a transaction to be started and committed before
the relocation loop starts. This is necessary to ensure that relocation
process sees the shrunk device size.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 25 Mar 2019 12:31:22 +0000 (14:31 +0200)]
btrfs: combine device update operations during transaction commit
We currently overload the pending_chunks list to handle updating
btrfs_device->commit_bytes_used. We don't actually care about the
extent mapping or even the device mapping for the chunk - we just need
the device, and we can end up processing it multiple times. The
fs_devices->resized_list does more or less the same thing, but with the
disk size. They are called consecutively during commit and have more or
less the same purpose.
We can combine the two lists into a single list that attaches to the
transaction and contains a list of devices that need updating. Since we
always add the device to a list when we change bytes_used or
disk_total_size, there's no harm in copying both values at once.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 25 Mar 2019 12:31:21 +0000 (14:31 +0200)]
btrfs: Honour FITRIM range constraints during free space trim
Up until now trimming the freespace was done irrespective of what the
arguments of the FITRIM ioctl were. For example fstrim's -o/-l arguments
were entirely ignored. Fix it by correctly handling those parameters.
This requires breaking if the found freespace extent is after the end of
the passed range as well as completing trim after trimming
fstrim_range::len bytes.
Fixes: 499f377f49f0 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
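The two conditions named in the FITRIM fix above (stop once a free extent starts past the end of the requested range, and stop after the requested number of bytes) can be sketched like this. demo_extent and demo_trim_free_extents are hypothetical names, the extents are assumed to be sorted by start, and clipping partially overlapping extents to the range is an extra assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A free extent on the device; the end is start + len, exclusive. */
struct demo_extent {
	uint64_t start;
	uint64_t len;
};

/* Trim only inside [range_start, range_start + range_len). */
uint64_t demo_trim_free_extents(const struct demo_extent *free_ext, size_t n,
				uint64_t range_start, uint64_t range_len)
{
	uint64_t range_end;
	uint64_t trimmed = 0;

	assert(free_ext != NULL || n == 0);

	/* Clamp the exclusive end so that start + len cannot wrap around. */
	if (range_len > UINT64_MAX - range_start)
		range_end = UINT64_MAX;
	else
		range_end = range_start + range_len;

	for (size_t i = 0; i < n; i++) {
		uint64_t s = free_ext[i].start;
		uint64_t e = s + free_ext[i].len;

		if (s >= range_end)
			break;		/* past the range, nothing more to do */
		if (e <= range_start)
			continue;	/* entirely before the range */

		/* Clip the extent to the requested range. */
		if (s < range_start)
			s = range_start;
		if (e > range_end)
			e = range_end;

		trimmed += e - s;
		if (trimmed >= range_len)
			break;		/* requested length is satisfied */
	}
	return trimmed;
}
```

Passing a length of UINT64_MAX models an unrestricted trim, in which case every free extent is processed.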
Robbie Ko [Fri, 29 Mar 2019 10:03:27 +0000 (18:03 +0800)]
Btrfs: send, improve clone range
Improve clone_range in two scenarios.
1. Remove the limit on inode size when finding clone inodes. We can do a
partial clone, so there is no need to limit the size of the candidate
inode. When cloning a range, we clone only the legal range, determined by
bytenr, offset, len and inode size.
2. In the scenarios of rewrite or clone_range, data_offset rarely
matches exactly, so the chance to clone is missed.
e.g.
1. Write a 1M file
dd if=/dev/zero of=1M bs=1M count=1
2. Clone 1M file
cp --reflink 1M clone
3. Rewrite 4k on the clone file
dd if=/dev/zero of=clone bs=4k count=1 conv=notrunc
The disk layout is as follows:
item 16 key (257 EXTENT_DATA 0) itemoff 15353 itemsize 53
extent data disk byte
1103101952 nr 1048576
extent data offset 0 nr 1048576 ram 1048576
extent compression(none)
...
item 22 key (258 EXTENT_DATA 0) itemoff 14959 itemsize 53
extent data disk byte
1104150528 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression(none)
item 23 key (258 EXTENT_DATA 4096) itemoff 14906 itemsize 53
extent data disk byte
1103101952 nr 1048576
extent data offset 4096 nr 1044480 ram 1048576
extent compression(none)
During send, inode 258 file offset 4096~1048576 (item 23) has a chance to
use clone_range, but because its data_offset does not match that of inode 257
(item 16), the clone is missed and the actual data has to be transferred.
Fix this by checking whether the current data_offset overlaps
with the file extent item and, if so, adjusting offset and
extent_len so that we can clone correctly.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
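The overlap adjustment described above can be worked through with the numbers from the example layout. The following is a sketch of the interval arithmetic only; demo_file_extent and demo_clone_overlap are invented names and the send code itself handles more cases (inode size, extent types and so on).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced file extent item. */
struct demo_file_extent {
	uint64_t file_offset;	/* key offset within the file */
	uint64_t disk_bytenr;	/* start of the extent on disk */
	uint64_t data_offset;	/* offset into the on-disk extent */
	uint64_t num_bytes;	/* bytes of the extent that are referenced */
};

/*
 * Compute which part of dst can be cloned from src when both reference the
 * same on-disk extent but with different data offsets. Returns 1 and fills
 * the source file offset, destination file offset and length on overlap,
 * or 0 when nothing can be cloned.
 */
int demo_clone_overlap(const struct demo_file_extent *src,
		       const struct demo_file_extent *dst,
		       uint64_t *src_file_off, uint64_t *dst_file_off,
		       uint64_t *len)
{
	uint64_t start, end, src_end, dst_end;

	assert(src != NULL && dst != NULL);
	if (src->disk_bytenr != dst->disk_bytenr)
		return 0;

	/* Intersect the two referenced ranges inside the on-disk extent. */
	src_end = src->data_offset + src->num_bytes;
	dst_end = dst->data_offset + dst->num_bytes;
	start = src->data_offset > dst->data_offset ?
		src->data_offset : dst->data_offset;
	end = src_end < dst_end ? src_end : dst_end;
	if (start >= end)
		return 0;

	/* Translate the intersection back into file offsets. */
	*src_file_off = src->file_offset + (start - src->data_offset);
	*dst_file_off = dst->file_offset + (start - dst->data_offset);
	*len = end - start;
	return 1;
}
```

For item 16 (inode 257) as the source and item 23 (inode 258) as the destination, the intersection starts at data offset 4096, so the clone covers 1044480 bytes beginning at file offset 4096 in both files.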
Anand Jain [Tue, 2 Apr 2019 10:07:41 +0000 (18:07 +0800)]
btrfs: prop: open code btrfs_set_prop in inherit_prop
When an inode inherits property from its parent, we call btrfs_set_prop().
btrfs_set_prop() does elaborate checks, which are not required in the
context of inheriting a property. Instead just open-code only the required
items from btrfs_set_prop() and then call btrfs_setxattr() directly. So
now the only user of btrfs_set_prop() is gone (except for the wrapper
function btrfs_set_prop_trans()).
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Fri, 29 Mar 2019 06:03:17 +0000 (14:03 +0800)]
btrfs: drop unused parameter in mount_subvol
@device_name in mount_subvol() is not used, drop it. Also see:
5bedc48a8f9e ("btrfs: drop unused parameters from mount_subvol").
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:22:58 +0000 (16:22 +0100)]
btrfs: tree-checker: get fs_info from eb in check_inode_item
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>