summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2017-01-19NFSv4.1: nfs4_fl_prepare_ds must be careful about reporting success.NeilBrown
commit cfd278c280f997cf2fe4662e0acab0fe465f637b upstream. Various places assume that if nfs4_fl_prepare_ds() turns a non-NULL 'ds', then ds->ds_clp will also be non-NULL. This is not necessasrily true in the case when the process received a fatal signal while nfs4_pnfs_ds_connect is waiting in nfs4_wait_ds_connect(). In that case ->ds_clp may not be set, and the devid may not recently have been marked unavailable. So add a test for ds_clp == NULL and return NULL in that case. Fixes: c23266d532b4 ("NFS4.1 Fix data server connection race") Signed-off-by: NeilBrown <neilb@suse.com> Acked-by: Olga Kornievskaia <aglo@umich.edu> Acked-by: Adamson, Andy <William.Adamson@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19NFS: Fix a performance regression in readdirTrond Myklebust
commit 79f687a3de9e3ba2518b4ea33f38ca6cbe9133eb upstream. Ben Coddington reports that commit 311324ad1713, by adding the function nfs_dir_mapping_need_revalidate() that checks page cache validity on each call to nfs_readdir() causes a performance regression when the directory is being modified. If the directory is changing while we're iterating through the directory, POSIX does not require us to invalidate the page cache unless the user calls rewinddir(). However, we still do want to ensure that we use readdirplus in order to avoid a load of stat() calls when the user is doing an 'ls -l' workload. The fix should be to invalidate the page cache immediately when we're setting the NFS_INO_ADVISE_RDPLUS bit. Reported-by: Benjamin Coddington <bcodding@redhat.com> Fixes: 311324ad1713 ("NFS: Be more aggressive in using readdirplus...") Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19pNFS: Fix race in pnfs_wait_on_layoutreturnTrond Myklebust
commit ee284e35d8c71bf5d4d807eaff6f67a17134b359 upstream. We must put the task to sleep while holding the inode->i_lock in order to ensure atomicity with the test for NFS_LAYOUT_RETURN. Fixes: 500d701f336b ("NFS41: make close wait for layoutreturn") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19NFS: fix typo in parameter descriptionWei Yongjun
commit f36ab161bebe464d33b998294eff29b17a9c8918 upstream. Fix typo in parameter description. Fixes: 5405fc44c337 ("NFSv4.x: Add kernel parameter to control the callback server") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19btrfs: fix error handling when run_delayed_extent_op failsJeff Mahoney
commit aa7c8da35d1905d80e840d075f07d26ec90144b5 upstream. In __btrfs_run_delayed_refs, the error path when run_delayed_extent_op fails sets locked_ref->processing = 0 but doesn't re-increment delayed_refs->num_heads_ready. As a result, we end up triggering the WARN_ON in btrfs_select_ref_head. Fixes: d7df2c796d7 (Btrfs: attach delayed ref updates to delayed ref heads) Reported-by: Jon Nelson <jnelson-suse@jamponi.net> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19btrfs: fix locking when we put back a delayed ref that's too newJeff Mahoney
commit d0280996437081dd12ed1e982ac8aeaa62835ec4 upstream. In __btrfs_run_delayed_refs, when we put back a delayed ref that's too new, we have already dropped the lock on locked_ref when we set ->processing = 0. This patch keeps the lock to cover that assignment. Fixes: d7df2c796d7 (Btrfs: attach delayed ref updates to delayed ref heads) Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19sysctl: Drop reference added by grab_header in proc_sys_readdirZhou Chengming
commit 93362fa47fe98b62e4a34ab408c4a418432e7939 upstream. Fixes CVE-2016-9191, proc_sys_readdir doesn't drop reference added by grab_header when return from !dir_emit_dots path. It can cause any path called unregister_sysctl_table will wait forever. The calltrace of CVE-2016-9191: [ 5535.960522] Call Trace: [ 5535.963265] [<ffffffff817cdaaf>] schedule+0x3f/0xa0 [ 5535.968817] [<ffffffff817d33fb>] schedule_timeout+0x3db/0x6f0 [ 5535.975346] [<ffffffff817cf055>] ? wait_for_completion+0x45/0x130 [ 5535.982256] [<ffffffff817cf0d3>] wait_for_completion+0xc3/0x130 [ 5535.988972] [<ffffffff810d1fd0>] ? wake_up_q+0x80/0x80 [ 5535.994804] [<ffffffff8130de64>] drop_sysctl_table+0xc4/0xe0 [ 5536.001227] [<ffffffff8130de17>] drop_sysctl_table+0x77/0xe0 [ 5536.007648] [<ffffffff8130decd>] unregister_sysctl_table+0x4d/0xa0 [ 5536.014654] [<ffffffff8130deff>] unregister_sysctl_table+0x7f/0xa0 [ 5536.021657] [<ffffffff810f57f5>] unregister_sched_domain_sysctl+0x15/0x40 [ 5536.029344] [<ffffffff810d7704>] partition_sched_domains+0x44/0x450 [ 5536.036447] [<ffffffff817d0761>] ? __mutex_unlock_slowpath+0x111/0x1f0 [ 5536.043844] [<ffffffff81167684>] rebuild_sched_domains_locked+0x64/0xb0 [ 5536.051336] [<ffffffff8116789d>] update_flag+0x11d/0x210 [ 5536.057373] [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450 [ 5536.064186] [<ffffffff81167acb>] ? cpuset_css_offline+0x1b/0x60 [ 5536.070899] [<ffffffff810fce3d>] ? trace_hardirqs_on+0xd/0x10 [ 5536.077420] [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450 [ 5536.084234] [<ffffffff8115a9f5>] ? css_killed_work_fn+0x25/0x220 [ 5536.091049] [<ffffffff81167ae5>] cpuset_css_offline+0x35/0x60 [ 5536.097571] [<ffffffff8115aa2c>] css_killed_work_fn+0x5c/0x220 [ 5536.104207] [<ffffffff810bc83f>] process_one_work+0x1df/0x710 [ 5536.110736] [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710 [ 5536.117461] [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0 [ 5536.123697] [<ffffffff810bcd70>] ? process_one_work+0x710/0x710 [ 5536.130426] [<ffffffff810c3f7e>] kthread+0xfe/0x120 [ 5536.135991] [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40 [ 5536.142041] [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230 One cgroup maintainer mentioned that "cgroup is trying to offline a cpuset css, which takes place under cgroup_mutex. The offlining ends up trying to drain active usages of a sysctl table which apprently is not happening." The real reason is that proc_sys_readdir doesn't drop reference added by grab_header when return from !dir_emit_dots path. So this cpuset offline path will wait here forever. See here for details: http://www.openwall.com/lists/oss-security/2016/11/04/13 Fixes: f0c3b5093add ("[readdir] convert procfs") Reported-by: CAI Qian <caiqian@redhat.com> Tested-by: Yang Shukui <yangshukui@huawei.com> Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19mnt: Protect the mountpoint hashtable with mount_lockEric W. Biederman
commit 3895dbf8985f656675b5bde610723a29cbce3fa7 upstream. Protecting the mountpoint hashtable with namespace_sem was sufficient until a call to umount_mnt was added to mntput_no_expire. At which point it became possible for multiple calls of put_mountpoint on the same hash chain to happen on the same time. Kristen Johansen <kjlx@templeofstupid.com> reported: > This can cause a panic when simultaneous callers of put_mountpoint > attempt to free the same mountpoint. This occurs because some callers > hold the mount_hash_lock, while others hold the namespace lock. Some > even hold both. > > In this submitter's case, the panic manifested itself as a GP fault in > put_mountpoint() when it called hlist_del() and attempted to dereference > a m_hash.pprev that had been poisioned by another thread. Al Viro observed that the simple fix is to switch from using the namespace_sem to the mount_lock to protect the mountpoint hash table. I have taken Al's suggested patch moved put_mountpoint in pivot_root (instead of taking mount_lock an additional time), and have replaced new_mountpoint with get_mountpoint a function that does the hash table lookup and addition under the mount_lock. The introduction of get_mounptoint ensures that only the mount_lock is needed to manipulate the mountpoint hashtable. d_set_mounted is modified to only set DCACHE_MOUNTED if it is not already set. This allows get_mountpoint to use the setting of DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry happens exactly once. Fixes: ce07d891a089 ("mnt: Honor MNT_LOCKED when detaching mounts") Reported-by: Krister Johansen <kjlx@templeofstupid.com> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19btrfs: fix crash when tracepoint arguments are freed by wq callbacksDavid Sterba
commit ac0c7cf8be00f269f82964cf7b144ca3edc5dbc4 upstream. Enabling btrfs tracepoints leads to instant crash, as reported. The wq callbacks could free the memory and the tracepoints started to dereference the members to get to fs_info. The proposed fix https://marc.info/?l=linux-btrfs&m=148172436722606&w=2 removed the tracepoints but we could preserve them by passing only the required data in a safe way. Fixes: bc074524e123 ("btrfs: prefix fsid to all trace events") Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19xfs: Timely free truncated dirty pagesJan Kara
commit 0a417b8dc1f10b03e8f558b8a831f07ec4c23795 upstream. Commit 99579ccec4e2 "xfs: skip dirty pages in ->releasepage()" started to skip dirty pages in xfs_vm_releasepage() which also has the effect that if a dirty page is truncated, it does not get freed by block_invalidatepage() and is lingering in LRU list waiting for reclaim. So a simple loop like: while true; do dd if=/dev/zero of=file bs=1M count=100 rm file done will keep using more and more memory until we hit low watermarks and start pagecache reclaim which will eventually reclaim also the truncate pages. Keeping these truncated (and thus never usable) pages in memory is just a waste of memory, is unnecessarily stressing page cache reclaim, and reportedly also leads to anonymous mmap(2) returning ENOMEM prematurely. So instead of just skipping dirty pages in xfs_vm_releasepage(), return to old behavior of skipping them only if they have delalloc or unwritten buffers and fix the spurious warnings by warning only if the page is clean. CC: Brian Foster <bfoster@redhat.com> CC: Vlastimil Babka <vbabka@suse.cz> Reported-by: Petr Tůma <petr.tuma@d3s.mff.cuni.cz> Fixes: 99579ccec4e271c3d4d4e7c946058766812afdab Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19ocfs2: fix crash caused by stale lvb with fsdlm pluginEric Ren
commit e7ee2c089e94067d68475990bdeed211c8852917 upstream. The crash happens rather often when we reset some cluster nodes while nodes contend fiercely to do truncate and append. The crash backtrace is below: dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18) ocfs2: End replay journal (node 318952601, slot 2) on device (253,18) ocfs2: Beginning quota recovery on device (253,18) for slot 2 ocfs2: Finishing quota recovery on device (253,18) for slot 2 (truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode) (truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1 ------------[ cut here ]------------ kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470! invalid opcode: 0000 [#1] SMP Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4 Supported: No, Unsupported modules are loaded CPU: 1 PID: 30154 Comm: truncate Tainted: G OE N 4.4.21-69-default #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014 task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000 RIP: 0010:[<ffffffffa05c8c30>] [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2] RSP: 0018:ffff880074e6bd50 EFLAGS: 00010282 RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246 RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414 R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448 R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020 FS: 00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0 Call Trace: ocfs2_setattr+0x698/0xa90 [ocfs2] notify_change+0x1ae/0x380 do_truncate+0x5e/0x90 do_sys_ftruncate.constprop.11+0x108/0x160 entry_SYSCALL_64_fastpath+0x12/0x6d Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff RIP [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2] It's because ocfs2_inode_lock() get us stale LVB in which the i_size is not equal to the disk i_size. We mistakenly trust the LVB because the underlaying fsdlm dlm_lock() doesn't set lkb_sbflags with DLM_SBF_VALNOTVALID properly for us. But, why? The current code tries to downconvert lock without DLM_LKF_VALBLK flag to tell o2cb don't update RSB's LVB if it's a PR->NULL conversion, even if the lock resource type needs LVB. This is not the right way for fsdlm. The fsdlm plugin behaves different on DLM_LKF_VALBLK, it depends on DLM_LKF_VALBLK to decide if we care about the LVB in the LKB. If DLM_LKF_VALBLK is not set, fsdlm will skip recovering RSB's LVB from this lkb and set the right DLM_SBF_VALNOTVALID appropriately when node failure happens. The following diagram briefly illustrates how this crash happens: RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB; The 1st round: Node1 Node2 RSB1: PR RSB1(master): NULL->EX ocfs2_downconvert_lock(PR->NULL, set_lvb==0) ocfs2_dlm_lock(no DLM_LKF_VALBLK) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - dlm_lock(no DLM_LKF_VALBLK) convert_lock(overwrite lkb->lkb_exflags with no DLM_LKF_VALBLK) RSB1: NULL RSB1: EX reset Node2 dlm_recover_rsbs() recover_lvb() /* The LVB is not trustable if the node with EX fails and * no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1. */ if(!(kb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance to return; * to invalid the LVB here. */ The 2nd round: Node 1 Node2 RSB1(become master from recovery) ocfs2_setattr() ocfs2_inode_lock(NULL->EX) /* dlm_lock() return the stale lvb without setting DLM_SBF_VALNOTVALID */ ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */ ocfs2_truncate_file() mlog_bug_on_msg(disk isize != i_size_read(inode)) /* crash! */ The fix is quite straightforward. We keep to set DLM_LKF_VALBLK flag for dlm_lock() if the lock resource type needs LVB and the fsdlm plugin is uesed. Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com Signed-off-by: Eric Ren <zren@suse.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: fix max_retries _show and _store functionsCarlos Maiolino
commit ff97f2399edac1e0fb3fa7851d5fbcbdf04717cf upstream. max_retries _show and _store functions should test against cfg->max_retries, not cfg->retry_timeout Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: fix crash and data corruption due to removal of busy COW extentsChristoph Hellwig
commit a1b7a4dea6166cf46be895bce4aac67ea5160fe8 upstream. There is a race window between write_cache_pages calling clear_page_dirty_for_io and XFS calling set_page_writeback, in which the mapping for an inode is tagged neither as dirty, nor as writeback. If the COW shrinker hits in exactly that window we'll remove the delayed COW extents and writepages trying to write it back, which in release kernels will manifest as corruption of the bmap btree, and in debug kernels will trip the ASSERT about now calling xfs_bmapi_write with the COWFORK flag for holes. A complex customer load manages to hit this window fairly reliably, probably by always having COW writeback in flight while the cow shrinker runs. This patch adds another check for having the I_DIRTY_PAGES flag set, which is still set during this race window. While this fixes the problem I'm still not overly happy about the way the COW shrinker works as it still seems a bit fragile. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: use the actual AG length when reserving blocksDarrick J. Wong
commit 20e73b000bcded44a91b79429d8fa743247602ad upstream. We need to use the actual AG length when making per-AG reservations, since we could otherwise end up reserving more blocks out of the last AG than there are actual blocks. Complained-about-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: fix double-cleanup when CUI recovery failsDarrick J. Wong
commit 7a21272b088894070391a94fdd1c67014020fa1d upstream. Dan Carpenter reported a double-free of rcur if _defer_finish fails while we're recovering CUI items. Fix the error recovery to prevent this. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: use GPF_NOFS when allocating btree cursorsDarrick J. Wong
commit b24a978c377be5f14e798cb41238e66fe51aab2f upstream. Use NOFS for allocating btree cursors, since they can be called under the ilock. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: ignore leaf attr ichdr.count in verifier during log replayEric Sandeen
commit 2e1d23370e75d7d89350d41b4ab58c7f6a0e26b2 upstream. When we create a new attribute, we first create a shortform attribute, and try to fit the new attribute into it. If that fails, we copy the (empty) attribute into a leaf attribute, and do the copy again. Thus there can be a transient state where we have an empty leaf attribute. If we encounter this during log replay, the verifier will fail. So add a test to ignore this part of the leaf attr verification during log replay. Thanks as usual to dchinner for spotting the problem. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: don't cap maximum dedupe request lengthDarrick J. Wong
commit 1bb33a98702d8360947f18a44349df75ba555d5d upstream. After various discussions on linux-fsdevel, it has been decided that it is not necessary to cap the length of a dedupe request, and that correctly-written userspace client programs will be able to absorb the change. Therefore, remove the length clamping behavior. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: don't allow di_size with high bit setDarrick J. Wong
commit ef388e2054feedaeb05399ed654bdb06f385d294 upstream. The on-disk field di_size is used to set i_size, which is a signed integer of loff_t. If the high bit of di_size is set, we'll end up with a negative i_size, which will cause all sorts of problems. Since the VFS won't let us create a file with such length, we should catch them here in the verifier too. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: error out if trying to add attrs and anextents > 0Darrick J. Wong
commit 0f352f8ee8412bd9d34fb2a6411241da61175c0e upstream. We shouldn't assert if somehow we end up trying to add an attr fork to an inode that apparently already has attr extents because this is an indication of on-disk corruption. Instead, return an error code to userspace. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: don't crash if reading a directory results in an unexpected holeDarrick J. Wong
commit 96a3aefb8ffde23180130460b0b2407b328eb727 upstream. In xfs_dir3_data_read, we can encounter the situation where err == 0 and *bpp == NULL if the given bno offset happens to be a hole; this leads to a crash if we try to set the buffer type after the _da_read_buf call. Holes can happen due to corrupt or malicious entries in the bmbt data, so be a little more careful when we're handling buffers. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: complain if we don't get nextents bmap recordsDarrick J. Wong
commit 356a3225222e5bc4df88aef3419fb6424f18ab69 upstream. When reading into memory all extents of a btree-format inode fork, complain if the number of extents we find is not the same as the number of extents reported in the inode core. This is needed to stop an IO action from accessing the garbage areas of the in-core fork. [dchinner: removed redundant assert] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: check for bogus values in btree block headersDarrick J. Wong
commit bb3be7e7c1c18e1b141d4cadeb98cc89ecf78099 upstream. When we're reading a btree block, make sure that what we retrieved matches the owner and level; and has a plausible number of records. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: forbid AG btrees with level == 0Darrick J. Wong
commit d2a047f31e86941fa896e0e3271536d50aba415e upstream. There is no such thing as a zero-level AG btree since even a single-node zero-records btree has one level. Btree cursor constructors read cur_nlevels straight from disk and then access things like cur_bufs[cur_nlevels - 1] which is /really/ bad if cur_nlevels is zero! Therefore, strengthen the verifiers to prevent this possibility. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: handle cow fork in xfs_bmap_trace_exlistEric Sandeen
commit c44a1f22626c153976289e1cd67bdcdfefc16e1f upstream. By inspection, xfs_bmap_trace_exlist isn't handling cow forks, and will trace the data fork instead. Fix this by setting state appropriately if whichfork == XFS_COW_FORK. ()___() < @ @ > | | {o_o} (|) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: pass state not whichfork to trace_xfs_extlistEric Sandeen
commit 7710517fc37b1899722707883b54694ea710b3c0 upstream. When xfs_bmap_trace_exlist called trace_xfs_extlist, it sent in the "whichfork" var instead of the bmap "state" as expected (even though state was already set up for this purpose). As a result, the xfs_bmap_class in tracing code used "whichfork" not state in xfs_iext_state_to_fork(), and got the wrong ifork pointer. It all goes downhill from there, including an ASSERT when ifp_bytes is empty by the time it reaches xfs_iext_get_ext(): XFS: Assertion failed: idx < ifp->if_bytes / sizeof(xfs_bmbt_rec_t) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: Move AGI buffer type setting to xfs_read_agiEric Sandeen
commit 200237d6746faaeaf7f4ff4abbf13f3917cee60a upstream. We've missed properly setting the buffer type for an AGI transaction in 3 spots now, so just move it into xfs_read_agi() and set it if we are in a transaction to avoid the problem in the future. This is similar to how it is done in i.e. the dir3 and attr3 read functions. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: pass post-eof speculative prealloc blocks to bmapiBrian Foster
commit f782088c9e5d08e9494c63e68b4e85716df3e5f8 upstream. xfs_file_iomap_begin_delay() implements post-eof speculative preallocation by extending the block count of the requested delayed allocation. Now that xfs_bmapi_reserve_delalloc() has been updated to handle prealloc blocks separately and tag the inode, update xfs_file_iomap_begin_delay() to use the new parameter and rely on the former to tag the inode. Note that this patch does not change behavior. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: use new extent lookup helpers xfs_file_iomap_begin_delayChristoph Hellwig
commit 656152e552e5cbe0c11ad261b524376217c2fb13 upstream. And only lookup the previous extent inside xfs_iomap_prealloc_size if we actually need it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: clean up cow fork reservation and tag inodes correctlyBrian Foster
commit 0260d8ff5f76617e3a55a1c471383ecb4404c3ad upstream. COW fork reservation is implemented via delayed allocation. The code is modeled after the traditional delalloc allocation code, but is slightly different in terms of how preallocation occurs. Rather than post-eof speculative preallocation, COW fork preallocation is implemented via a COW extent size hint that is designed to minimize fragmentation as a reflinked file is split over time. xfs_reflink_reserve_cow() still uses logic that is oriented towards dealing with post-eof speculative preallocation, however, and is stale or not necessarily correct. First, the EOF alignment to the COW extent size hint is implemented in xfs_bmapi_reserve_delalloc() (which does so correctly by aligning the start and end offsets) and so is not necessary in xfs_reflink_reserve_cow(). The backoff and retry logic on ENOSPC is also ineffective for the same reason, as xfs_bmapi_reserve_delalloc() will simply perform the same allocation request on the retry. Finally, since the COW extent size hint aligns the start and end offset of the range to allocate, the end_fsb != orig_end_fsb logic is not sufficient. Indeed, if a write request happens to end on an aligned offset, it is possible that we do not tag the inode for COW preallocation even though xfs_bmapi_reserve_delalloc() may have preallocated at the start offset. Kill the unnecessary, duplicate code in xfs_reflink_reserve_cow(). Remove the inode tag logic as well since xfs_bmapi_reserve_delalloc() has been updated to tag the inode correctly. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: use new extent lookup helpers in __xfs_reflink_reserve_cowChristoph Hellwig
commit 2755fc4438501c8c28e7783df890e889f6772bee upstream. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: track preallocation separately in xfs_bmapi_reserve_delalloc()Brian Foster
commit 974ae922efd93b07b6cdf989ae959883f6f05fd8 upstream. Speculative preallocation is currently processed entirely by the callers of xfs_bmapi_reserve_delalloc(). The caller determines how much preallocation to include, adjusts the extent length and passes down the resulting request. While this works fine for post-eof speculative preallocation, it is not as reliable for COW fork preallocation. COW fork preallocation is implemented via the cowextszhint, which aligns the start offset as well as the length of the extent. Further, it is difficult for the caller to accurately identify when preallocation occurs because the returned extent could have been merged with neighboring extents in the fork. To simplify this situation and facilitate further COW fork preallocation enhancements, update xfs_bmapi_reserve_delalloc() to take a separate preallocation parameter to incorporate into the allocation request. The preallocation blocks value is tacked onto the end of the request and adjusted to accommodate neighboring extents and extent size limits. Since xfs_bmapi_reserve_delalloc() now knows precisely how much preallocation was included in the allocation, it can also tag the inodes appropriately to support preallocation reclaim. Note that xfs_bmapi_reserve_delalloc() callers are not yet updated to use the preallocation mechanism. This patch should not change behavior outside of correctly tagging reflink inodes when start offset preallocation occurs (which the caller does not handle correctly). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: remove prev argument to xfs_bmapi_reserve_delallocChristoph Hellwig
commit 65c5f419788d623a0410eca1866134f5e4628594 upstream. We can easily lookup the previous extent for the cases where we need it, which saves the callers from looking it up for us later in the series. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: always succeed when deduping zero bytesDarrick J. Wong
commit fba3e594ef0ad911fa8f559732d588172f212d71 upstream. It turns out that btrfs and xfs had differing interpretations of what to do when the dedupe length is zero. Change xfs to follow btrfs' semantics so that the userland interface is consistent. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: factor rmap btree size into the indlen calculationsDarrick J. Wong
commit fd26a88093bab6529ea2de819114ca92dbd1d71d upstream. When we're estimating the amount of space it's going to take to satisfy a delalloc reservation, we need to include the space that we might need to grow the rmapbt. This helps us to avoid running out of space later when _iomap_write_allocate needs more space than we reserved. Eryu Guan observed this happening on generic/224 when sunit/swidth were set. Reported-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: new inode extent list lookup helpersChristoph Hellwig
commit 93533c7855c3c78c8a900cac65c8d669bb14935d upstream. xfs_iext_lookup_extent looks up a single extent at the passed in offset, and returns the extent covering the area, or the one behind it in case of a hole, as well as the index of the returned extent in arguments, as well as a simple bool as return value that is set to false if no extent could be found because the offset is behind EOF. It is a simpler replacement for xfs_bmap_search_extent that leaves looking up the rarely needed previous extent to the caller and has a nicer calling convention. xfs_iext_get_extent is a helper for iterating over the extent list, it takes an extent index as input, and returns the extent at that index in it's expanded form in an argument if it exists. The actual return value is a bool whether the index is valid or not. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: fix unbalanced inode reclaim flush lockingBrian Foster
commit 98efe8af1c9ffac47e842b7a75ded903e2f028da upstream. Filesystem shutdown testing on an older distro kernel has uncovered an imbalanced locking pattern for the inode flush lock in xfs_reclaim_inode(). Specifically, there is a double unlock sequence between the call to xfs_iflush_abort() and xfs_reclaim_inode() at the "reclaim:" label. This actually does not cause obvious problems on current kernels due to the current flush lock implementation. Older kernels use a counting based flush lock mechanism, however, which effectively breaks the lock indefinitely when an already unlocked flush lock is repeatedly unlocked. Though this only currently occurs on filesystem shutdown, it has reproduced the effect of elevating an fs shutdown to a system-wide crash or hang. As it turns out, the flush lock is not actually required for the reclaim logic in xfs_reclaim_inode() because by that time we have already cycled the flush lock once while holding ILOCK_EXCL. Therefore, remove the additional flush lock/unlock cycle around the 'reclaim:' label and update branches into this label to release the flush lock where appropriate. Add an assert to xfs_ifunlock() to help prevent future occurences of the same problem. Reported-by: Zorro Lang <zlang@redhat.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: check minimum block size for CRC filesystemsDarrick J. Wong
commit bec9d48d7a303a5bb95c05961ff07ec7eeb59058 upstream. Check the minimum block size on v5 filesystems. [dchinner: cleaned up XFS_MIN_CRC_BLOCKSIZE check] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: provide helper for counting extents from if_bytesEric Sandeen
commit 5d829300bee000980a09ac2ccb761cb25867b67c upstream. The open-coded pattern: ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t) is all over the xfs code; provide a new helper xfs_iext_count(ifp) to count the number of inline extents in an inode fork. [dchinner: pick up several missed conversions] Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: don't BUG() on mixed direct and mapped I/OBrian Foster
commit 04197b341f23b908193308b8d63d17ff23232598 upstream. We've had reports of generic/095 causing XFS to BUG() in __xfs_get_blocks() due to the existence of delalloc blocks on a direct I/O read. generic/095 issues a mix of various types of I/O, including direct and memory mapped I/O to a single file. This is clearly not supported behavior and is known to lead to such problems. E.g., the lack of exclusion between the direct I/O and write fault paths means that a write fault can allocate delalloc blocks in a region of a file that was previously a hole after the direct read has attempted to flush/inval the file range, but before it actually reads the block mapping. In turn, the direct read discovers a delalloc extent and cannot proceed. While the appropriate solution here is to not mix direct and memory mapped I/O to the same regions of the same file, the current BUG_ON() behavior is probably overkill as it can crash the entire system. Instead, localize the failure to the I/O in question by returning an error for a direct I/O that cannot be handled safely due to delalloc blocks. Be careful to allow the case of a direct write to post-eof delalloc blocks. This can occur due to speculative preallocation and is safe as post-eof blocks are not accompanied by dirty pages in pagecache (conversely, preallocation within eof must have been zeroed, and thus dirtied, before the inode size could have been increased beyond said blocks). Finally, provide an additional warning if a direct I/O write occurs while the file is memory mapped. This may not catch all problematic scenarios, but provides a hint that some known-to-be-problematic I/O methods are in use. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: don't skip cow forks w/ delalloc blocks in cowblocks scanBrian Foster
commit 399372349a7f9b2d7e56e4fa4467c69822d07024 upstream. The cowblocks background scanner currently clears the cowblocks tag for inodes without any real allocations in the cow fork. This excludes inodes with only delalloc blocks in the cow fork. While we might never expect to clear delalloc blocks from the cow fork in the background scanner, it is not necessarily correct to clear the cowblocks tag from such inodes. For example, if the background scanner happens to process an inode between a buffered write and writeback, the scanner catches the inode in a state after delalloc blocks have been allocated to the cow fork but before the delalloc blocks have been converted to real blocks by writeback. The background scanner then incorrectly clears the cowblocks tag, even if part of the aforementioned delalloc reservation will not be remapped to the data fork (i.e., extra blocks due to the cowextsize hint). This means that any such additional blocks in the cow fork might never be reclaimed by the background scanner and could persist until the inode itself is reclaimed. To address this problem, only skip and clear inodes without any cow fork allocations whatsoever from the background scanner. While we generally do not want to cancel delalloc reservations from the background scanner, the pagecache dirty check following the cowblocks check should prevent that situation. If we do end up with delalloc cow fork blocks without a dirty address space mapping, this is probably an indication that something has gone wrong and the blocks should be reclaimed, as they may never be converted to a real allocation. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: check return value of _trans_reserve_quota_nblksDarrick J. Wong
commit 4fd29ec47212c8cbf98916af519019ccc5e58e49 upstream. Check the return value of xfs_trans_reserve_quota_nblks for errors. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12xfs: don't call xfs_sb_quota_from_disk twiceEric Sandeen
commit e6fc6fcf4447c9266038c55c25e4c7c14bee110c upstream. Source xfsprogs commit: ee3754254e8c186c99b6cdd4d59f741759d04acb Kernel commit 5ef828c4 ("xfs: avoid false quotacheck after unclean shutdown") made xfs_sb_from_disk() also call xfs_sb_quota_from_disk by default. However, when this was merged to libxfs, existing separate calls to libxfs_sb_quota_from_disk remained, and calling it twice in a row on a V4 superblock leads to issues, because: if (sbp->sb_qflags & XFS_PQUOTA_ACCT) { ... sbp->sb_pquotino = sbp->sb_gquotino; sbp->sb_gquotino = NULLFSINO; and after the second call, we have set both pquotino and gquotino to NULLFSINO. Fix this by making it safe to call twice, and also remove the extra calls to libxfs_sb_quota_from_disk. This is only spotted when running xfstests with "-m crc=0" because the sb_from_disk change came about after V5 became default, and the above behavior only exists on a V4 superblock. Reported-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12f2fs: hide a maybe-uninitialized warningArnd Bergmann
commit 230436b3ef3fd7d4a1da19edf5e87bb2d74e0fc2 upstream. gcc is unsure about the use of last_ofs_in_node, which might happen without a prior initialization: fs/f2fs//git/arm-soc/fs/f2fs/data.c: In function ‘f2fs_map_blocks’: fs/f2fs/data.c:799:54: warning: ‘last_ofs_in_node’ may be used uninitialized in this function [-Wmaybe-uninitialized] if (prealloc && dn.ofs_in_node != last_ofs_in_node + 1) { As pointed out by Chao Yu, the code is actually correct as 'prealloc' is only set if the last_ofs_in_node has been set, the two always get updated together. This initializes last_ofs_in_node to dn.ofs_in_node for each new dnode at the start of the 'next_block' loop, which at that point is a correct initialization as well. I assume that compilers that correctly track the contents of the variables and do not warn about the condition also figure out that they can eliminate the extra assignment here. Fixes: 46008c6d4232 ("f2fs: support in batch multi blocks preallocation") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12f2fs: remove percpu_count due to performance regressionJaegeuk Kim
commit 35782b233f37e48ecc469d9c7232f3f6a7fad41a upstream. This patch removes percpu_count usage due to performance regression in iozone. Fixes: 523be8a6b3 ("f2fs: use percpu_counter for page counters") Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12fscrypt: fix renaming and linking special filesEric Biggers
commit 42d97eb0ade31e1bc537d086842f5d6e766d9d51 upstream. Attempting to link a device node, named pipe, or socket file into an encrypted directory through rename(2) or link(2) always failed with EPERM. This happened because fscrypt_has_permitted_context() saw that the file was unencrypted and forbid creating the link. This behavior was unexpected because such files are never encrypted; only regular files, directories, and symlinks can be encrypted. To fix this, make fscrypt_has_permitted_context() always return true on special files. This will be covered by a test in my encryption xfstests patchset. Fixes: 9bd8212f981e ("ext4 crypto: add encryption policy and password salt support") Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Richard Weinberger <richard@nod.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-09pNFS: Fix a deadlock between read resends and layoutreturnTrond Myklebust
commit 54e4a0dfa25d9365c4e80a639e80d9213eb6edbe upstream. We must not call nfs_pageio_init_read() on a new nfs_pageio_descriptor while holding a reference to a layout segment, as that can deadlock pnfs_update_layout(). Fixes: d67ae825a59d6 ("pnfs/flexfiles: Add the FlexFile Layout Driver") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-09pNFS: Clear NFS_LAYOUT_RETURN_REQUESTED when invalidating the layout stateidTrond Myklebust
commit ae5a459d5f65c3e83f3e14068dde5fb9c9d81807 upstream. We must ensure that we don't schedule a layoutreturn if the layout stateid has been marked as invalid. Fixes: 2a59a0411671e ("pNFS: Fix pnfs_set_layout_stateid() to clear...") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-09pNFS: Don't clear the layout stateid if a layout return is outstandingTrond Myklebust
commit 7b650994ab07434ae58a247dc9ac87d2488ca75c upstream. If we no longer hold any layout segments, we're normally expected to consider the layout stateid to be invalid. However we cannot assume this if we're about to, or in the process of sending a layoutreturn. Fixes: 334a8f37115b ("pNFS: Don't forget the layout stateid if...") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-09pNFS: On error, do not send LAYOUTGET until the LAYOUTRETURN has completedTrond Myklebust
commit 6604b203fb6394ed1f24c21bfa3c207e5ae8e461 upstream. If there is an I/O error, we should not call LAYOUTGET until the LAYOUTRETURN that reports the error is complete. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>