Pull AFS updates from David Howells:
"There's some core VFS changes which affect a couple of filesystems:
- Make the inode hash table RCU safe and providing some RCU-safe
accessor functions. The search can then be done without taking the
inode_hash_lock. Care must be taken because the object may be being
deleted and no wait is made.
- Allow iunique() to avoid taking the inode_hash_lock.
- Allow AFS's callback processing to avoid taking the inode_hash_lock
when using the inode table to find an inode to notify.
- Improve Ext4's time updating. Konstantin Khlebnikov said "For now,
I've plugged this issue with try-lock in ext4 lazy time update.
This solution is much better."
Then there's a set of changes to make a number of improvements to the
AFS driver:
- Improve callback (ie. third party change notification) processing
by:
(a) Relying more on the fact we're doing this under RCU and by
using fewer locks. This makes use of the RCU-based inode
searching outlined above.
(b) Moving to keeping volumes in a tree indexed by volume ID
rather than a flat list.
(c) Making the server and volume records logically part of the
cell. This means that a server record now points directly at
the cell and the tree of volumes is there. This removes an N:M
mapping table, simplifying things.
- Improve keeping NAT or firewall channels open for the server
callbacks to reach the client by actively polling the fileserver on
a timed basis, instead of only doing it when we have an operation
to process.
- Improving detection of delayed or lost callbacks by including the
parent directory in the list of file IDs to be queried when doing a
bulk status fetch from lookup. We can then check to see if our copy
of the directory has changed under us without us getting notified.
- Determine aliasing of cells (such as a cell that is pointed to be a
DNS alias). This allows us to avoid having ambiguity due to
apparently different cells using the same volume and file servers.
- Improve the fileserver rotation to do more probing when it detects
that all of the addresses to a server are listed as non-responsive.
It's possible that an address that previously stopped responding
has become responsive again.
Beyond that, lay some foundations for making some calls asynchronous:
- Turn the fileserver cursor struct into a general operation struct
and hang the parameters off of that rather than keeping them in
local variables and hang results off of that rather than the call
struct.
- Implement some general operation handling code and simplify the
callers of operations that affect a volume or a volume component
(such as a file). Most of the operation is now done by core code.
- Operations are supplied with a table of operations to issue
different variants of RPCs and to manage the completion, where all
the required data is held in the operation object, thereby allowing
these to be called from a workqueue.
- Put the standard "if (begin), while(select), call op, end" sequence
into a canned function that just emulates the current behaviour for
now.
There are also some fixes interspersed:
- Don't let the EACCES from ICMP6 mapping reach the user as such,
since it's confusing as to whether it's a filesystem error. Convert
it to EHOSTUNREACH.
- Don't use the epoch value acquired through probing a server. If we
have two servers with the same UUID but in different cells, it's
hard to draw conclusions from them having different epoch values.
- Don't interpret the argument to the CB.ProbeUuid RPC as a
fileserver UUID and look up a fileserver from it.
- Deal with servers in different cells having the same UUIDs. In the
event that a CB.InitCallBackState3 RPC is received, we have to
break the callback promises for every server record matching that
UUID.
- Don't let afs_statfs return values that go below 0.
- Don't use running fileserver probe state to make server selection
and address selection decisions on. Only make decisions on final
state as the running state is cleared at the start of probing"
Acked-by: Al Viro <viro@zeniv.linux.org.uk> (fs/inode.c part)
* tag 'afs-next-
20200604' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (27 commits)
afs: Adjust the fileserver rotation algorithm to reprobe/retry more quickly
afs: Show more a bit more server state in /proc/net/afs/servers
afs: Don't use probe running state to make decisions outside probe code
afs: Fix afs_statfs() to not let the values go below zero
afs: Fix the by-UUID server tree to allow servers with the same UUID
afs: Reorganise volume and server trees to be rooted on the cell
afs: Add a tracepoint to track the lifetime of the afs_volume struct
afs: Detect cell aliases 3 - YFS Cells with a canonical cell name op
afs: Detect cell aliases 2 - Cells with no root volumes
afs: Detect cell aliases 1 - Cells with root volumes
afs: Implement client support for the YFSVL.GetCellName RPC op
afs: Retain more of the VLDB record for alias detection
afs: Fix handling of CB.ProbeUuid cache manager op
afs: Don't get epoch from a server because it may be ambiguous
afs: Build an abstraction around an "operation" concept
afs: Rename struct afs_fs_cursor to afs_operation
afs: Remove the error argument from afs_protocol_error()
afs: Set error flag rather than return error from file status decode
afs: Make callback processing more efficient.
afs: Show more information in /proc/net/afs/servers
...
{
struct sockaddr_rxrpc srx;
struct socket *socket;
- unsigned int min_level;
int ret;
_enter("");
srx.transport.sin6.sin6_family = AF_INET6;
srx.transport.sin6.sin6_port = htons(AFS_CM_PORT);
- min_level = RXRPC_SECURITY_ENCRYPT;
- ret = kernel_setsockopt(socket, SOL_RXRPC, RXRPC_MIN_SECURITY_LEVEL,
- (void *)&min_level, sizeof(min_level));
+ ret = rxrpc_sock_set_min_security_level(socket->sk,
+ RXRPC_SECURITY_ENCRYPT);
if (ret < 0)
goto error_2;
if (call->type->destructor)
call->type->destructor(call);
- afs_put_server(call->net, call->server, afs_server_trace_put_call);
- afs_put_cb_interest(call->net, call->cbi);
+ afs_unuse_server_notime(call->net, call->server, afs_server_trace_put_call);
afs_put_addrlist(call->alist);
kfree(call->request);
struct bio_vec *bv, pgoff_t first, pgoff_t last,
unsigned offset)
{
+ struct afs_operation *op = call->op;
struct page *pages[AFS_BVEC_MAX];
unsigned int nr, n, i, to, bytes = 0;
nr = min_t(pgoff_t, last - first + 1, AFS_BVEC_MAX);
- n = find_get_pages_contig(call->mapping, first, nr, pages);
+ n = find_get_pages_contig(op->store.mapping, first, nr, pages);
ASSERTCMP(n, ==, nr);
msg->msg_flags |= MSG_MORE;
for (i = 0; i < nr; i++) {
to = PAGE_SIZE;
if (first + i >= last) {
- to = call->last_to;
+ to = op->store.last_to;
msg->msg_flags &= ~MSG_MORE;
}
bv[i].bv_page = pages[i];
*/
static int afs_send_pages(struct afs_call *call, struct msghdr *msg)
{
+ struct afs_operation *op = call->op;
struct bio_vec bv[AFS_BVEC_MAX];
unsigned int bytes, nr, loop, offset;
- pgoff_t first = call->first, last = call->last;
+ pgoff_t first = op->store.first, last = op->store.last;
int ret;
- offset = call->first_offset;
- call->first_offset = 0;
+ offset = op->store.first_offset;
+ op->store.first_offset = 0;
do {
afs_load_bvec(call, msg, bv, first, last, offset);
bytes = msg->msg_iter.count;
nr = msg->msg_iter.nr_segs;
- ret = rxrpc_kernel_send_data(call->net->socket, call->rxcall, msg,
+ ret = rxrpc_kernel_send_data(op->net->socket, call->rxcall, msg,
bytes, afs_notify_end_request_tx);
for (loop = 0; loop < nr; loop++)
put_page(bv[loop].bv_page);
first += nr;
} while (first <= last);
- trace_afs_sent_pages(call, call->first, last, first, ret);
+ trace_afs_sent_pages(call, op->store.first, last, first, ret);
return ret;
}
*/
tx_total_len = call->request_size;
if (call->send_pages) {
- if (call->last == call->first) {
- tx_total_len += call->last_to - call->first_offset;
+ struct afs_operation *op = call->op;
+
+ if (op->store.last == op->store.first) {
+ tx_total_len += op->store.last_to - op->store.first_offset;
} else {
/* It looks mathematically like you should be able to
* combine the following lines with the ones above, but
* unsigned arithmetic is fun when it wraps...
*/
- tx_total_len += PAGE_SIZE - call->first_offset;
- tx_total_len += call->last_to;
- tx_total_len += (call->last - call->first - 1) * PAGE_SIZE;
+ tx_total_len += PAGE_SIZE - op->store.first_offset;
+ tx_total_len += op->store.last_to;
+ tx_total_len += (op->store.last - op->store.first - 1) * PAGE_SIZE;
}
}
ret = call->type->deliver(call);
state = READ_ONCE(call->state);
+ if (ret == 0 && call->unmarshalling_error)
+ ret = -EBADMSG;
switch (ret) {
case 0:
afs_queue_call_work(call);
if (state == AFS_CALL_CL_PROC_REPLY) {
- if (call->cbi)
+ if (call->op)
set_bit(AFS_SERVER_FL_MAY_HAVE_CB,
- &call->cbi->server->flags);
+ &call->op->server->flags);
goto call_complete;
}
ASSERTCMP(state, >, AFS_CALL_CL_PROC_REPLY);
/*
* Log protocol error production.
*/
- noinline int afs_protocol_error(struct afs_call *call, int error,
+ noinline int afs_protocol_error(struct afs_call *call,
enum afs_eproto_cause cause)
{
- trace_afs_protocol_error(call, error, cause);
- return error;
+ trace_afs_protocol_error(call, cause);
+ if (call)
+ call->unmarshalling_error = true;
+ return -EBADMSG;
}
truncate_inode_pages_final(&inode->i_data);
/*
+ * For inodes with journalled data, transaction commit could have
+ * dirtied the inode. Flush worker is ignoring it because of I_FREEING
+ * flag but we still need to remove the inode from the writeback lists.
+ */
+ if (!list_empty_careful(&inode->i_io_list)) {
+ WARN_ON_ONCE(!ext4_should_journal_data(inode));
+ inode_io_list_del(inode);
+ }
+
+ /*
* Protect us against freezing - iput() caller didn't have to have any
* protection against it
*/
*/
down_read(&EXT4_I(inode)->i_data_sem);
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
- retval = ext4_ext_map_blocks(handle, inode, map, flags &
- EXT4_GET_BLOCKS_KEEP_SIZE);
+ retval = ext4_ext_map_blocks(handle, inode, map, 0);
} else {
- retval = ext4_ind_map_blocks(handle, inode, map, flags &
- EXT4_GET_BLOCKS_KEEP_SIZE);
+ retval = ext4_ind_map_blocks(handle, inode, map, 0);
}
up_read((&EXT4_I(inode)->i_data_sem));
#endif
map->m_flags = 0;
- ext_debug("ext4_map_blocks(): inode %lu, flag %d, max_blocks %u,"
- "logical block %lu\n", inode->i_ino, flags, map->m_len,
- (unsigned long) map->m_lblk);
+ ext_debug(inode, "flag 0x%x, max_blocks %u, logical block %lu\n",
+ flags, map->m_len, (unsigned long) map->m_lblk);
/*
* ext4_map_blocks returns an int, and m_len is an unsigned int
*/
down_read(&EXT4_I(inode)->i_data_sem);
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
- retval = ext4_ext_map_blocks(handle, inode, map, flags &
- EXT4_GET_BLOCKS_KEEP_SIZE);
+ retval = ext4_ext_map_blocks(handle, inode, map, 0);
} else {
- retval = ext4_ind_map_blocks(handle, inode, map, flags &
- EXT4_GET_BLOCKS_KEEP_SIZE);
+ retval = ext4_ind_map_blocks(handle, inode, map, 0);
}
if (retval > 0) {
unsigned int status;
return ret;
}
}
+
+ if (retval < 0)
+ ext_debug(inode, "failed with err %d\n", retval);
return retval;
}
* filesystems.
*/
if (i_size_changed || inline_data)
- ext4_mark_inode_dirty(handle, inode);
+ ret = ext4_mark_inode_dirty(handle, inode);
if (pos + len > inode->i_size && !verity && ext4_can_truncate(inode))
/* if we have allocated more blocks and copied
struct ext4_map_blocks map;
struct ext4_io_submit io_submit; /* IO submission data */
unsigned int do_map:1;
+ unsigned int scanned_until_end:1;
};
static void mpage_release_unused_pages(struct mpage_da_data *mpd,
if (mpd->first_page >= mpd->next_page)
return;
+ mpd->scanned_until_end = 0;
index = mpd->first_page;
end = mpd->next_page - 1;
if (invalidate) {
invalid_block = ~0;
map->m_flags = 0;
- ext_debug("ext4_da_map_blocks(): inode %lu, max_blocks %u,"
- "logical block %lu\n", inode->i_ino, map->m_len,
+ ext_debug(inode, "max_blocks %u, logical block %lu\n", map->m_len,
(unsigned long) map->m_lblk);
/* Lookup extent status tree firstly */
return err;
}
-#define BH_FLAGS ((1 << BH_Unwritten) | (1 << BH_Delay))
+#define BH_FLAGS (BIT(BH_Unwritten) | BIT(BH_Delay))
/*
* mballoc gives us at most this number of blocks...
if (err < 0)
return err;
}
- return lblk < blocks;
+ if (lblk >= blocks) {
+ mpd->scanned_until_end = 1;
+ return 0;
+ }
+ return 1;
}
/*
* mapping, or maybe the page was submitted for IO.
* So we return to call further extent mapping.
*/
- if (err < 0 || map_bh == true)
+ if (err < 0 || map_bh)
goto out;
/* Page fully mapped - let IO run! */
err = mpage_submit_page(mpd, page);
dioread_nolock = ext4_should_dioread_nolock(inode);
if (dioread_nolock)
get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
- if (map->m_flags & (1 << BH_Delay))
+ if (map->m_flags & BIT(BH_Delay))
get_blocks_flags |= EXT4_GET_BLOCKS_DELALLOC_RESERVE;
err = ext4_map_blocks(handle, inode, map, get_blocks_flags);
nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
tag);
if (nr_pages == 0)
- goto out;
+ break;
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
pagevec_release(&pvec);
cond_resched();
}
+ mpd->scanned_until_end = 1;
return 0;
out:
pagevec_release(&pvec);
struct inode *inode = mapping->host;
int needed_blocks, rsv_blocks = 0, ret = 0;
struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
- bool done;
struct blk_plug plug;
bool give_up_on_write = false;
retry:
if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
tag_pages_for_writeback(mapping, mpd.first_page, mpd.last_page);
- done = false;
blk_start_plug(&plug);
/*
* started.
*/
mpd.do_map = 0;
+ mpd.scanned_until_end = 0;
mpd.io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
if (!mpd.io_submit.io_end) {
ret = -ENOMEM;
if (ret < 0)
goto unplug;
- while (!done && mpd.first_page <= mpd.last_page) {
+ while (!mpd.scanned_until_end && wbc->nr_to_write > 0) {
/* For each extent of pages we use new io_end */
mpd.io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
if (!mpd.io_submit.io_end) {
trace_ext4_da_write_pages(inode, mpd.first_page, mpd.wbc);
ret = mpage_prepare_extent_to_map(&mpd);
- if (!ret) {
- if (mpd.map.m_len)
- ret = mpage_map_and_submit_extent(handle, &mpd,
+ if (!ret && mpd.map.m_len)
+ ret = mpage_map_and_submit_extent(handle, &mpd,
&give_up_on_write);
- else {
- /*
- * We scanned the whole range (or exhausted
- * nr_to_write), submitted what was mapped and
- * didn't find anything needing mapping. We are
- * done.
- */
- done = true;
- }
- }
/*
* Caution: If the handle is synchronous,
* ext4_journal_stop() can wait for transaction commit
* new_i_size is less that inode->i_size
* bu greater than i_disksize.(hint delalloc)
*/
- ext4_mark_inode_dirty(handle, inode);
+ ret = ext4_mark_inode_dirty(handle, inode);
}
}
if (ret2 < 0)
ret = ret2;
ret2 = ext4_journal_stop(handle);
- if (!ret)
+ if (unlikely(ret2 && !ret))
ret = ret2;
return ret ? ret : copied;
ret = ext4_readpage_inline(inode, page);
if (ret == -EAGAIN)
- return ext4_mpage_readpages(page->mapping, NULL, page, 1,
- false);
+ return ext4_mpage_readpages(inode, NULL, page);
return ret;
}
-static int
-ext4_readpages(struct file *file, struct address_space *mapping,
- struct list_head *pages, unsigned nr_pages)
+static void ext4_readahead(struct readahead_control *rac)
{
- struct inode *inode = mapping->host;
+ struct inode *inode = rac->mapping->host;
- /* If the file has inline data, no need to do readpages. */
+ /* If the file has inline data, no need to do readahead. */
if (ext4_has_inline_data(inode))
- return 0;
+ return;
- return ext4_mpage_readpages(mapping, pages, NULL, nr_pages, true);
+ ext4_mpage_readpages(inode, rac, NULL);
}
static void ext4_invalidatepage(struct page *page, unsigned int offset,
static const struct address_space_operations ext4_aops = {
.readpage = ext4_readpage,
- .readpages = ext4_readpages,
+ .readahead = ext4_readahead,
.writepage = ext4_writepage,
.writepages = ext4_writepages,
.write_begin = ext4_write_begin,
static const struct address_space_operations ext4_journalled_aops = {
.readpage = ext4_readpage,
- .readpages = ext4_readpages,
+ .readahead = ext4_readahead,
.writepage = ext4_writepage,
.writepages = ext4_writepages,
.write_begin = ext4_write_begin,
static const struct address_space_operations ext4_da_aops = {
.readpage = ext4_readpage,
- .readpages = ext4_readpages,
+ .readahead = ext4_readahead,
.writepage = ext4_writepage,
.writepages = ext4_writepages,
.write_begin = ext4_da_write_begin,
loff_t len)
{
handle_t *handle;
+ int ret;
+
loff_t size = i_size_read(inode);
WARN_ON(!inode_is_locked(inode));
if (IS_ERR(handle))
return PTR_ERR(handle);
ext4_update_i_disksize(inode, size);
- ext4_mark_inode_dirty(handle, inode);
+ ret = ext4_mark_inode_dirty(handle, inode);
ext4_journal_stop(handle);
- return 0;
+ return ret;
}
static void ext4_wait_dax_page(struct ext4_inode_info *ei)
loff_t first_block_offset, last_block_offset;
handle_t *handle;
unsigned int credits;
- int ret = 0;
+ int ret = 0, ret2 = 0;
trace_ext4_punch_hole(inode, offset, length, 0);
ext4_handle_sync(handle);
inode->i_mtime = inode->i_ctime = current_time(inode);
- ext4_mark_inode_dirty(handle, inode);
+ ret2 = ext4_mark_inode_dirty(handle, inode);
+ if (unlikely(ret2))
+ ret = ret2;
if (ret >= 0)
ext4_update_inode_fsync_trans(handle, inode, 1);
out_stop:
{
struct ext4_inode_info *ei = EXT4_I(inode);
unsigned int credits;
- int err = 0;
+ int err = 0, err2;
handle_t *handle;
struct address_space *mapping = inode->i_mapping;
ext4_orphan_del(handle, inode);
inode->i_mtime = inode->i_ctime = current_time(inode);
- ext4_mark_inode_dirty(handle, inode);
+ err2 = ext4_mark_inode_dirty(handle, inode);
+ if (unlikely(err2 && !err))
+ err = err2;
ext4_journal_stop(handle);
trace_ext4_truncate_exit(inode);
return 0;
}
- struct other_inode {
- unsigned long orig_ino;
- struct ext4_inode *raw_inode;
- };
-
- static int other_inode_match(struct inode * inode, unsigned long ino,
- void *data)
+ static void __ext4_update_other_inode_time(struct super_block *sb,
+ unsigned long orig_ino,
+ unsigned long ino,
+ struct ext4_inode *raw_inode)
{
- struct other_inode *oi = (struct other_inode *) data;
+ struct inode *inode;
+
+ inode = find_inode_by_ino_rcu(sb, ino);
+ if (!inode)
+ return;
- if ((inode->i_ino != ino) ||
- (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW |
+ if ((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW |
I_DIRTY_INODE)) ||
((inode->i_state & I_DIRTY_TIME) == 0))
- return 0;
+ return;
+
spin_lock(&inode->i_lock);
if (((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW |
I_DIRTY_INODE)) == 0) &&
spin_unlock(&inode->i_lock);
spin_lock(&ei->i_raw_lock);
- EXT4_INODE_SET_XTIME(i_ctime, inode, oi->raw_inode);
- EXT4_INODE_SET_XTIME(i_mtime, inode, oi->raw_inode);
- EXT4_INODE_SET_XTIME(i_atime, inode, oi->raw_inode);
- ext4_inode_csum_set(inode, oi->raw_inode, ei);
+ EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
+ EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
+ EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
+ ext4_inode_csum_set(inode, raw_inode, ei);
spin_unlock(&ei->i_raw_lock);
- trace_ext4_other_inode_update_time(inode, oi->orig_ino);
- return -1;
+ trace_ext4_other_inode_update_time(inode, orig_ino);
+ return;
}
spin_unlock(&inode->i_lock);
- return -1;
}
/*
static void ext4_update_other_inodes_time(struct super_block *sb,
unsigned long orig_ino, char *buf)
{
- struct other_inode oi;
unsigned long ino;
int i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
int inode_size = EXT4_INODE_SIZE(sb);
- oi.orig_ino = orig_ino;
/*
* Calculate the first inode in the inode table block. Inode
* numbers are one-based. That is, the first inode in a block
* (assuming 4k blocks and 256 byte inodes) is (n*16 + 1).
*/
ino = ((orig_ino - 1) & ~(inodes_per_block - 1)) + 1;
+ rcu_read_lock();
for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
if (ino == orig_ino)
continue;
- oi.raw_inode = (struct ext4_inode *) buf;
- (void) find_inode_nowait(sb, ino, other_inode_match, &oi);
+ __ext4_update_other_inode_time(sb, orig_ino, ino,
+ (struct ext4_inode *)buf);
}
+ rcu_read_unlock();
}
/*
inode->i_gid = attr->ia_gid;
error = ext4_mark_inode_dirty(handle, inode);
ext4_journal_stop(handle);
+ if (unlikely(error))
+ return error;
}
if (attr->ia_valid & ATTR_SIZE) {
* Whenever the user wants stuff synced (sys_sync, sys_msync, sys_fsync)
* we start and wait on commits.
*/
-int ext4_mark_inode_dirty(handle_t *handle, struct inode *inode)
+int __ext4_mark_inode_dirty(handle_t *handle, struct inode *inode,
+ const char *func, unsigned int line)
{
struct ext4_iloc iloc;
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
trace_ext4_mark_inode_dirty(inode, _RET_IP_);
err = ext4_reserve_inode_write(handle, inode, &iloc);
if (err)
- return err;
+ goto out;
if (EXT4_I(inode)->i_extra_isize < sbi->s_want_extra_isize)
ext4_try_to_expand_extra_isize(inode, sbi->s_want_extra_isize,
iloc, handle);
- return ext4_mark_iloc_dirty(handle, inode, &iloc);
+ err = ext4_mark_iloc_dirty(handle, inode, &iloc);
+out:
+ if (unlikely(err))
+ ext4_error_inode_err(inode, func, line, 0, err,
+ "mark_inode_dirty error");
+ return err;
}
/*
*/
#ifdef CONFIG_SYSCTL
int proc_nr_inodes(struct ctl_table *table, int write,
- void __user *buffer, size_t *lenp, loff_t *ppos)
+ void *buffer, size_t *lenp, loff_t *ppos)
{
inodes_stat.nr_inodes = get_nr_inodes();
inodes_stat.nr_unused = get_nr_inodes_unused();
spin_lock(&inode_hash_lock);
spin_lock(&inode->i_lock);
- hlist_add_head(&inode->i_hash, b);
+ hlist_add_head_rcu(&inode->i_hash, b);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_hash_lock);
}
{
spin_lock(&inode_hash_lock);
spin_lock(&inode->i_lock);
- hlist_del_init(&inode->i_hash);
+ hlist_del_init_rcu(&inode->i_hash);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_hash_lock);
}
*/
spin_lock(&inode->i_lock);
inode->i_state |= I_NEW;
- hlist_add_head(&inode->i_hash, head);
+ hlist_add_head_rcu(&inode->i_hash, head);
spin_unlock(&inode->i_lock);
if (!creating)
inode_sb_list_add(inode);
inode->i_ino = ino;
spin_lock(&inode->i_lock);
inode->i_state = I_NEW;
- hlist_add_head(&inode->i_hash, head);
+ hlist_add_head_rcu(&inode->i_hash, head);
spin_unlock(&inode->i_lock);
inode_sb_list_add(inode);
spin_unlock(&inode_hash_lock);
struct hlist_head *b = inode_hashtable + hash(sb, ino);
struct inode *inode;
- spin_lock(&inode_hash_lock);
- hlist_for_each_entry(inode, b, i_hash) {
- if (inode->i_ino == ino && inode->i_sb == sb) {
- spin_unlock(&inode_hash_lock);
+ hlist_for_each_entry_rcu(inode, b, i_hash) {
+ if (inode->i_ino == ino && inode->i_sb == sb)
return 0;
- }
}
- spin_unlock(&inode_hash_lock);
-
return 1;
}
static unsigned int counter;
ino_t res;
+ rcu_read_lock();
spin_lock(&iunique_lock);
do {
if (counter <= max_reserved)
res = counter++;
} while (!test_inode_iunique(sb, res));
spin_unlock(&iunique_lock);
+ rcu_read_unlock();
return res;
}
}
EXPORT_SYMBOL(find_inode_nowait);
+ /**
+ * find_inode_rcu - find an inode in the inode cache
+ * @sb: Super block of file system to search
+ * @hashval: Key to hash
+ * @test: Function to test match on an inode
+ * @data: Data for test function
+ *
+ * Search for the inode specified by @hashval and @data in the inode cache,
+ * where the helper function @test will return 0 if the inode does not match
+ * and 1 if it does. The @test function must be responsible for taking the
+ * i_lock spin_lock and checking i_state for an inode being freed or being
+ * initialized.
+ *
+ * If successful, this will return the inode for which the @test function
+ * returned 1 and NULL otherwise.
+ *
+ * The @test function is not permitted to take a ref on any inode presented.
+ * It is also not permitted to sleep.
+ *
+ * The caller must hold the RCU read lock.
+ */
+ struct inode *find_inode_rcu(struct super_block *sb, unsigned long hashval,
+ int (*test)(struct inode *, void *), void *data)
+ {
+ struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+ struct inode *inode;
+
+ RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
+ "suspicious find_inode_rcu() usage");
+
+ hlist_for_each_entry_rcu(inode, head, i_hash) {
+ if (inode->i_sb == sb &&
+ !(READ_ONCE(inode->i_state) & (I_FREEING | I_WILL_FREE)) &&
+ test(inode, data))
+ return inode;
+ }
+ return NULL;
+ }
+ EXPORT_SYMBOL(find_inode_rcu);
+
+ /**
+ * find_inode_by_rcu - Find an inode in the inode cache
+ * @sb: Super block of file system to search
+ * @ino: The inode number to match
+ *
+ * Search for the inode specified by @hashval and @data in the inode cache,
+ * where the helper function @test will return 0 if the inode does not match
+ * and 1 if it does. The @test function must be responsible for taking the
+ * i_lock spin_lock and checking i_state for an inode being freed or being
+ * initialized.
+ *
+ * If successful, this will return the inode for which the @test function
+ * returned 1 and NULL otherwise.
+ *
+ * The @test function is not permitted to take a ref on any inode presented.
+ * It is also not permitted to sleep.
+ *
+ * The caller must hold the RCU read lock.
+ */
+ struct inode *find_inode_by_ino_rcu(struct super_block *sb,
+ unsigned long ino)
+ {
+ struct hlist_head *head = inode_hashtable + hash(sb, ino);
+ struct inode *inode;
+
+ RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
+ "suspicious find_inode_by_ino_rcu() usage");
+
+ hlist_for_each_entry_rcu(inode, head, i_hash) {
+ if (inode->i_ino == ino &&
+ inode->i_sb == sb &&
+ !(READ_ONCE(inode->i_state) & (I_FREEING | I_WILL_FREE)))
+ return inode;
+ }
+ return NULL;
+ }
+ EXPORT_SYMBOL(find_inode_by_ino_rcu);
+
int insert_inode_locked(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
if (likely(!old)) {
spin_lock(&inode->i_lock);
inode->i_state |= I_NEW | I_CREATING;
- hlist_add_head(&inode->i_hash, head);
+ hlist_add_head_rcu(&inode->i_hash, head);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_hash_lock);
return 0;
{
struct super_block *sb = inode->i_sb;
const struct super_operations *op = inode->i_sb->s_op;
+ unsigned long state;
int drop;
WARN_ON(inode->i_state & I_NEW);
return;
}
+ state = inode->i_state;
if (!drop) {
- inode->i_state |= I_WILL_FREE;
+ WRITE_ONCE(inode->i_state, state | I_WILL_FREE);
spin_unlock(&inode->i_lock);
+
write_inode_now(inode, 1);
+
spin_lock(&inode->i_lock);
- WARN_ON(inode->i_state & I_NEW);
- inode->i_state &= ~I_WILL_FREE;
+ state = inode->i_state;
+ WARN_ON(state & I_NEW);
+ state &= ~I_WILL_FREE;
}
- inode->i_state |= I_FREEING;
+ WRITE_ONCE(inode->i_state, state | I_FREEING);
if (!list_empty(&inode->i_lru))
inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
* @inode: inode owning the block number being requested
* @block: pointer containing the block to find
*
- * Replaces the value in *block with the block number on the device holding
+ * Replaces the value in ``*block`` with the block number on the device holding
* corresponding to the requested block number in the file.
* That is, asked for block 4 of inode 1 the function will replace the
- * 4 in *block, with disk block relative to the disk start that holds that
+ * 4 in ``*block``, with disk block relative to the disk start that holds that
* block of the file.
*
* Returns -EINVAL in case of error, 0 otherwise. If mapping falls into a
- * hole, returns 0 and *block is also set to 0.
+ * hole, returns 0 and ``*block`` is also set to 0.
*/
int bmap(struct inode *inode, sector_t *block)
{
#include <linux/capability.h>
#include <linux/semaphore.h>
#include <linux/fcntl.h>
-#include <linux/fiemap.h>
#include <linux/rculist_bl.h>
#include <linux/atomic.h>
#include <linux/shrinker.h>
struct bdi_writeback;
struct bio;
struct export_operations;
+struct fiemap_extent_info;
struct hd_geometry;
struct iovec;
struct kiocb;
struct page;
struct address_space;
struct writeback_control;
+struct readahead_control;
/*
* Write life time hint values.
*/
int (*readpages)(struct file *filp, struct address_space *mapping,
struct list_head *pages, unsigned nr_pages);
+ void (*readahead)(struct readahead_control *);
int (*write_begin)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
#endif /* #ifdef CONFIG_EPOLL */
struct address_space *f_mapping;
errseq_t f_wb_err;
+ errseq_t f_sb_err; /* for syncfs */
} __randomize_layout
__attribute__((aligned(4))); /* lest something weird decides that 2 is OK */
/* Being remounted read-only */
int s_readonly_remount;
+ /* per-sb errseq_t for reporting writeback errors via syncfs */
+ errseq_t s_wb_err;
+
/* AIO completions deferred from interrupt context */
struct workqueue_struct *s_dio_done_wq;
struct hlist_head s_pins;
extern int vfs_rmdir(struct inode *, struct dentry *);
extern int vfs_unlink(struct inode *, struct dentry *, struct inode **);
extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *, struct inode **, unsigned int);
-extern int vfs_whiteout(struct inode *, struct dentry *);
+
+static inline int vfs_whiteout(struct inode *dir, struct dentry *dentry)
+{
+ return vfs_mknod(dir, dentry, S_IFCHR | WHITEOUT_MODE, WHITEOUT_DEV);
+}
extern struct dentry *vfs_tmpfile(struct dentry *dentry, umode_t mode,
int open_flag);
extern void inode_init_owner(struct inode *inode, const struct inode *dir,
umode_t mode);
extern bool may_open_dev(const struct path *path);
-/*
- * VFS FS_IOC_FIEMAP helper definitions.
- */
-struct fiemap_extent_info {
- unsigned int fi_flags; /* Flags as passed from user */
- unsigned int fi_extents_mapped; /* Number of mapped extents */
- unsigned int fi_extents_max; /* Size of fiemap_extent array */
- struct fiemap_extent __user *fi_extents_start; /* Start of
- fiemap_extent array */
-};
-int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
- u64 phys, u64 len, u32 flags);
-int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
/*
* This is the "filldir" function type, used by readdir() to let
*
* I_CREATING New object's inode in the middle of setting up.
*
+ * I_DONTCACHE Evict inode as soon as it is not used anymore.
+ *
* Q: What is the difference between I_WILL_FREE and I_FREEING?
*/
#define I_DIRTY_SYNC (1 << 0)
#define I_WB_SWITCH (1 << 13)
#define I_OVL_INUSE (1 << 14)
#define I_CREATING (1 << 15)
+#define I_DONTCACHE (1 << 16)
#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
#define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
#ifdef CONFIG_BLOCK
extern int register_blkdev(unsigned int, const char *);
extern void unregister_blkdev(unsigned int, const char *);
-extern void bdev_unhash_inode(dev_t dev);
extern struct block_device *bdget(dev_t);
extern struct block_device *bdgrab(struct block_device *bdev);
extern void bd_set_size(struct block_device *, loff_t size);
extern const struct file_operations def_blk_fops;
extern const struct file_operations def_chr_fops;
#ifdef CONFIG_BLOCK
-extern int ioctl_by_bdev(struct block_device *, unsigned, unsigned long);
extern int blkdev_ioctl(struct block_device *, fmode_t, unsigned, unsigned long);
extern long compat_blkdev_ioctl(struct file *, unsigned, unsigned long);
extern int blkdev_get(struct block_device *bdev, fmode_t mode, void *holder);
extern int revalidate_disk(struct gendisk *);
extern int check_disk_change(struct block_device *);
extern int __invalidate_device(struct block_device *, bool);
-extern int invalidate_partition(struct gendisk *, int);
#endif
unsigned long invalidate_mapping_pages(struct address_space *mapping,
pgoff_t start, pgoff_t end);
return errseq_sample(&mapping->wb_err);
}
+/**
+ * file_sample_sb_err - sample the current errseq_t to test for later errors
+ * @mapping: mapping to be sampled
+ *
+ * Grab the most current superblock-level errseq_t value for the given
+ * struct file.
+ */
+static inline errseq_t file_sample_sb_err(struct file *file)
+{
+ return errseq_sample(&file->f_path.dentry->d_sb->s_wb_err);
+}
+
static inline int filemap_nr_thps(struct address_space *mapping)
{
#ifdef CONFIG_READ_ONLY_THP_FOR_FS
extern int generic_delete_inode(struct inode *inode);
static inline int generic_drop_inode(struct inode *inode)
{
- return !inode->i_nlink || inode_unhashed(inode);
+ return !inode->i_nlink || inode_unhashed(inode) ||
+ (inode->i_state & I_DONTCACHE);
}
+extern void d_mark_dontcache(struct inode *inode);
extern struct inode *ilookup5_nowait(struct super_block *sb,
unsigned long hashval, int (*test)(struct inode *, void *),
int (*match)(struct inode *,
unsigned long, void *),
void *data);
+ extern struct inode *find_inode_rcu(struct super_block *, unsigned long,
+ int (*)(struct inode *, void *), void *);
+ extern struct inode *find_inode_by_ino_rcu(struct super_block *, unsigned long);
extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struct inode *, void *), void *);
extern int insert_inode_locked(struct inode *);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
extern int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t *count, unsigned int flags);
+extern ssize_t generic_file_buffered_read(struct kiocb *iocb,
+ struct iov_iter *to, ssize_t already_read);
extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
DIO_SKIP_HOLES = 0x02,
};
-void dio_end_io(struct bio *bio);
-
ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
struct block_device *bdev, struct iov_iter *iter,
get_block_t get_block,
extern const char *vfs_get_link(struct dentry *, struct delayed_call *);
extern int vfs_readlink(struct dentry *, char __user *, int);
-extern int __generic_block_fiemap(struct inode *inode,
- struct fiemap_extent_info *fieinfo,
- loff_t start, loff_t len,
- get_block_t *get_block);
-extern int generic_block_fiemap(struct inode *inode,
- struct fiemap_extent_info *fieinfo, u64 start,
- u64 len, get_block_t *get_block);
-
extern struct file_system_type *get_filesystem(struct file_system_type *fs);
extern void put_filesystem(struct file_system_type *fs);
extern struct file_system_type *get_fs_type(const char *name);
extern int file_update_time(struct file *file);
-static inline bool io_is_direct(struct file *filp)
-{
- return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
-}
-
static inline bool vma_is_dax(const struct vm_area_struct *vma)
{
return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
int res = 0;
if (file->f_flags & O_APPEND)
res |= IOCB_APPEND;
- if (io_is_direct(file))
+ if (file->f_flags & O_DIRECT)
res |= IOCB_DIRECT;
if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))
res |= IOCB_DSYNC;
struct ctl_table;
int proc_nr_files(struct ctl_table *table, int write,
- void __user *buffer, size_t *lenp, loff_t *ppos);
+ void *buffer, size_t *lenp, loff_t *ppos);
int proc_nr_dentry(struct ctl_table *table, int write,
- void __user *buffer, size_t *lenp, loff_t *ppos);
+ void *buffer, size_t *lenp, loff_t *ppos);
int proc_nr_inodes(struct ctl_table *table, int write,
- void __user *buffer, size_t *lenp, loff_t *ppos);
+ void *buffer, size_t *lenp, loff_t *ppos);
int __init get_filesystem_list(char *buf);
#define __FMODE_EXEC ((__force int) FMODE_EXEC)