summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2018-03-28staging: ncpfs: memory corruption in ncp_read_kernel()Dan Carpenter
commit 4c41aa24baa4ed338241d05494f2c595c885af8f upstream. If the server is malicious then *bytes_read could be larger than the size of the "target" buffer. It would lead to memory corruption when we do the memcpy(). Reported-by: Dr Silvio Cesare of InfoSect <Silvio Cesare <silvio.cesare@gmail.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28nfsd: remove blocked locks on client teardownJeff Layton
commit 68ef3bc3166468678d5e1fdd216628c35bd1186f upstream. We had some reports of panics in nfsd4_lm_notify, and that showed a nfs4_lockowner that had outlived its so_client. Ensure that we walk any leftover lockowners after tearing down all of the stateids, and remove any blocked locks that they hold. With this change, we also don't need to walk the nbl_lru on nfsd_net shutdown, as that will happen naturally when we tear down the clients. Fixes: 76d348fadff5 (nfsd: have nfsd4_lock use blocking locks for v4.1+ locks) Reported-by: Frank Sorenson <fsorenso@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: stable@vger.kernel.org # 4.9 Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24nfsd4: permit layoutget of executable-only filesBenjamin Coddington
[ Upstream commit 66282ec1cf004c09083c29cb5e49019037937bbd ] Clients must be able to read a file in order to execute it, and for pNFS that means the client needs to be able to perform a LAYOUTGET on the file. This behavior for executable-only files was added for OPEN in commit a043226bc140 "nfsd4: permit read opens of executable-only files". This fixes up xfstests generic/126 on block/scsi layouts. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24cifs: small underflow in cnvrtDosUnixTm()Dan Carpenter
[ Upstream commit 564277eceeca01e02b1ef3e141cfb939184601b4 ] January is month 1. There is no zero-th month. If someone passes a zero month then it means we read from one space before the start of the total_days_of_prev_months[] array. We may as well also be strict about days as well. Fixes: 1bd5bbcb6531 ("[CIFS] Legacy time handling for Win9x and OS/2 part 1") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24pNFS: Fix a deadlock when coalescing writes and returning the layoutTrond Myklebust
[ Upstream commit 61f454e30c18a28924e96be12592c0d5e24bcc81 ] Consider the following deadlock: Process P1 Process P2 Process P3 ========== ========== ========== lock_page(page) lseg = pnfs_update_layout(inode) lo = NFS_I(inode)->layout pnfs_error_mark_layout_for_return(lo) lock_page(page) lseg = pnfs_update_layout(inode) In this scenario, - P1 has declared the layout to be in error, but P2 holds a reference to a layout segment on that inode, so the layoutreturn is deferred. - P2 is waiting for a page lock held by P3. - P3 is asking for a new layout segment, but is blocked waiting for the layoutreturn. The fix is to ensure that pnfs_error_mark_layout_for_return() does not set the NFS_LAYOUT_RETURN flag, which blocks P3. Instead, we allow the latter to call LAYOUTGET so that it can make progress and unblock P2. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24jbd2: Fix lockdep splat with generic/270 testJan Kara
[ Upstream commit c52c47e4b4fbe4284602fc2ccbfc4a4d8dc05b49 ] I've hit a lockdep splat with generic/270 test complaining that: 3216.fsstress.b/3533 is trying to acquire lock: (jbd2_handle){++++..}, at: [<ffffffff813152e0>] jbd2_log_wait_commit+0x0/0x150 but task is already holding lock: (jbd2_handle){++++..}, at: [<ffffffff8130bd3b>] start_this_handle+0x35b/0x850 The underlying problem is that jbd2_journal_force_commit_nested() (called from ext4_should_retry_alloc()) may get called while a transaction handle is started. In such case it takes care to not wait for commit of the running transaction (which would deadlock) but only for a commit of a transaction that is already committing (which is safe as that doesn't wait for any filesystem locks). In fact there are also other callers of jbd2_log_wait_commit() that take care to pass tid of a transaction that is already committing and for those cases, the lockdep instrumentation is too restrictive and leading to false positive reports. Fix the problem by calling jbd2_might_wait_for_commit() from jbd2_log_wait_commit() only if the transaction isn't already committing. Fixes: 1eaa566d368b214d99cbb973647c1b0b8102a9ae Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24orangefs: do not wait for timeout if umountingMartin Brandenburg
[ Upstream commit b5a9d61eebdd0016ccb383b25a5c3d04977a6549 ] When the computer is turned off, all the processes are killed and then all the filesystems are umounted. OrangeFS should not wait for the userspace daemon to come back in that case. This only works for plain umount(2). To actually take advantage of this interactively, `umount -f' is needed; otherwise umount will issue a statfs first, which will wait for the userspace daemon to come back. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24Btrfs: fix extent map leak during fallocate error pathFilipe Manana
[ Upstream commit be2d253cc98244765323a7c94cc1ac5cd5a17072 ] If the call to btrfs_qgroup_reserve_data() failed, we were leaking an extent map structure. The failure can happen either due to an -ENOMEM condition or, when quotas are enabled, due to -EDQUOT for example. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24Btrfs: send, fix file hole not being preserved due to inline extentFilipe Manana
[ Upstream commit e1cbfd7bf6dabdac561c75d08357571f44040a45 ] Normally we don't have inline extents followed by regular extents, but there's currently at least one harmless case where this happens. For example, when the page size is 4Kb and compression is enabled: $ mkfs.btrfs -f /dev/sdb $ mount -o compress /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar $ xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar In this case we get a compressed inline extent, representing 4Kb of data, followed by a hole extent and then a regular data extent. The inline extent was not expanded/converted to a regular extent exactly because it represents 4Kb of data. This does not cause any apparent problem (such as the issue solved by commit e1699d2d7bf6 ("btrfs: add missing memset while reading compressed inline extents")) except trigger an unexpected case in the incremental send code path that makes us issue an operation to write a hole when it's not needed, resulting in more writes at the receiver and wasting space at the receiver. So teach the incremental send code to deal with this particular case. The issue can be currently triggered by running fstests btrfs/137 with compression enabled (MOUNT_OPTIONS="-o compress" ./check btrfs/137). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24Btrfs: fix incorrect space accounting after failure to insert inline extentFilipe Manana
[ Upstream commit 1c81ba237bcecad9bc885a1ddcf02d725ea38482 ] When using compression, if we fail to insert an inline extent we incorrectly end up attempting to free the reserved data space twice, once through extent_clear_unlock_delalloc(), because we pass it the flag EXTENT_DO_ACCOUNTING, and once through a direct call to btrfs_free_reserved_data_space_noquota(). This results in a trace like the following: [ 834.576240] ------------[ cut here ]------------ [ 834.576825] WARNING: CPU: 2 PID: 486 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs] [ 834.579501] Modules linked in: btrfs crc32c_generic xor raid6_pq ppdev i2c_piix4 acpi_cpufreq psmouse tpm_tis parport_pc pcspkr serio_raw tpm_tis_core sg parport evdev i2c_core tpm button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs] [ 834.592116] CPU: 2 PID: 486 Comm: kworker/u32:4 Not tainted 4.10.0-rc8-btrfs-next-37+ #2 [ 834.593316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 834.595273] Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs] [ 834.596103] Call Trace: [ 834.596103] dump_stack+0x67/0x90 [ 834.596103] __warn+0xc2/0xdd [ 834.596103] warn_slowpath_null+0x1d/0x1f [ 834.596103] btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs] [ 834.596103] compress_file_range.constprop.42+0x2fa/0x3fc [btrfs] [ 834.596103] ? submit_compressed_extents+0x3a7/0x3a7 [btrfs] [ 834.596103] async_cow_start+0x32/0x4d [btrfs] [ 834.596103] btrfs_scrubparity_helper+0x187/0x3e7 [btrfs] [ 834.596103] btrfs_delalloc_helper+0xe/0x10 [btrfs] [ 834.596103] process_one_work+0x273/0x4e4 [ 834.596103] worker_thread+0x1eb/0x2ca [ 834.596103] ? rescuer_thread+0x2b6/0x2b6 [ 834.596103] kthread+0x100/0x108 [ 834.596103] ? __list_del_entry+0x22/0x22 [ 834.596103] ret_from_fork+0x2e/0x40 [ 834.611656] ---[ end trace 719902fe6bdef08f ]--- So fix this by not calling directly btrfs_free_reserved_data_space_noquota() if an error happened. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24NFS: don't try to cross a mountpount when there isn't one there.NeilBrown
[ Upstream commit 99bbf6ecc694dfe0b026e15359c5aa2a60b97a93 ] consider the sequence of commands: mkdir -p /import/nfs /import/bind /import/etc mount --bind / /import/bind mount --make-private /import/bind mount --bind /import/etc /import/bind/etc exportfs -o rw,no_root_squash,crossmnt,async,no_subtree_check localhost:/ mount -o vers=4 localhost:/ /import/nfs ls -l /import/nfs/etc You would not expect this to report a stale file handle. Yet it does. The manipulations under /import/bind cause the dentry for /etc to get the DCACHE_MOUNTED flag set, even though nothing is mounted on /etc. This causes nfsd to call nfsd_cross_mnt() even though there is no mountpoint. So an upcall to mountd for "/etc" is performed. The 'crossmnt' flag on the export of / causes mountd to report that /etc is exported as it is a descendant of /. It assumes the kernel wouldn't ask about something that wasn't a mountpoint. The filehandle returned identifies the filesystem and the inode number of /etc. When this filehandle is presented to rpc.mountd, via "nfsd.fh", the inode cannot be found associated with any name in /etc/exports, or with any mountpoint listed by getmntent(). So rpc.mountd says the filehandle doesn't exist. Hence ESTALE. This is fixed by teaching nfsd not to trust DCACHE_MOUNTED too much. It is just a hint, not a guarantee. Change nfsd_mountpoint() to return '1' for a certain mountpoint, '2' for a possible mountpoint, and 0 otherwise. Then change nfsd_crossmnt() to check if follow_down() actually found a mountpount and, if not, to avoid performing a lookup if the location is not known to certainly require an export-point. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24pNFS: Fix use after free issues in pnfs_do_read()Trond Myklebust
[ Upstream commit 6aeafd05eca9bc8ab6b03d7e56d09ffd18190f44 ] The assumption should be that if the caller returns PNFS_ATTEMPTED, then hdr has been consumed, and so we should not be testing hdr->task.tk_status. If the caller returns PNFS_TRY_AGAIN, then we need to recoalesce and free hdr. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24NFS: Fix missing pg_cleanup after nfs_pageio_cond_complete()Benjamin Coddington
[ Upstream commit 43b7d964ed30dbca5c83c90cb010985b429ec4f9 ] Commit a7d42ddb3099727f58366fa006f850a219cce6c8 ("nfs: add mirroring support to pgio layer") moved pg_cleanup out of the path when there was non-sequental I/O that needed to be flushed. The result is that for layouts that have more than one layout segment per file, the pg_lseg is not cleared, so we can end up hitting the WARN_ON_ONCE(req_start >= seg_end) in pnfs_generic_pg_test since the pg_lseg will be pointing to that previously-flushed layout segment. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Fixes: a7d42ddb3099 ("nfs: add mirroring support to pgio layer") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24btrfs: fix a bogus warning when converting only data or metadataAdam Borowski
[ Upstream commit 14506127979a5a3d0c5d9b4cc76ce9d4ec23b717 ] If your filesystem has, eg, data:raid0 metadata:raid1, and you run "btrfs balance -dconvert=raid1", the meta.target field will be uninitialized. That's otherwise ok, as it's unused except for this warning. Thus, let's use the existing set of raid levels for the comparison. As a side effect, non-convert balances will now nag about data>metadata. Signed-off-by: Adam Borowski <kilobyte@angband.pl> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24CIFS: Enable encryption during session setup phasePavel Shilovsky
commit cabfb3680f78981d26c078a26e5c748531257ebb upstream. In order to allow encryption on SMB connection we need to exchange a session key and generate encryption and decryption keys. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Cc: Steve French <smfrench@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24SMB3: Validate negotiate request must always be signedSteve French
commit 4587eee04e2ac7ac3ac9fa2bc164fb6e548f99cd upstream. According to MS-SMB2 3.2.55 validate_negotiate request must always be signed. Some Windows can fail the request if you send it unsigned See kernel bugzilla bug 197311 CC: Stable <stable@vger.kernel.org> Acked-by: Ronnie Sahlberg <lsahlber.redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22btrfs: Fix use-after-free when cleaning up fs_devs with a single stale deviceNikolay Borisov
commit fd649f10c3d21ee9d7542c609f29978bdf73ab94 upstream. Commit 4fde46f0cc71 ("Btrfs: free the stale device") introduced btrfs_free_stale_device which iterates the device lists for all registered btrfs filesystems and deletes those devices which aren't mounted. In a btrfs_devices structure has only 1 device attached to it and it is unused then btrfs_free_stale_devices will proceed to also free the btrfs_fs_devices struct itself. Currently this leads to a use after free since list_for_each_entry will try to perform a check on the already freed memory to see if it has to terminate the loop. The fix is to use 'break' when we know we are freeing the current fs_devs. Fixes: 4fde46f0cc71 ("Btrfs: free the stale device") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22btrfs: alloc_chunk: fix DUP stripe size handlingHans van Kranenburg
commit 92e222df7b8f05c565009c7383321b593eca488b upstream. In case of using DUP, we search for enough unallocated disk space on a device to hold two stripes. The devices_info[ndevs-1].max_avail that holds the amount of unallocated space found is directly assigned to stripe_size, while it's actually twice the stripe size. Later on in the code, an unconditional division of stripe_size by dev_stripes corrects the value, but in the meantime there's a check to see if the stripe_size does not exceed max_chunk_size. Since during this check stripe_size is twice the amount as intended, the check will reduce the stripe_size to max_chunk_size if the actual correct to be used stripe_size is more than half the amount of max_chunk_size. The unconditional division later tries to correct stripe_size, but will actually make sure we can't allocate more than half the max_chunk_size. Fix this by moving the division by dev_stripes before the max chunk size check, so it always contains the right value, instead of putting a duct tape division in further on to get it fixed again. Since in all other cases than DUP, dev_stripes is 1, this change only affects DUP. Other attempts in the past were made to fix this: * 37db63a400 "Btrfs: fix max chunk size check in chunk allocator" tried to fix the same problem, but still resulted in part of the code acting on a wrongly doubled stripe_size value. * 86db25785a "Btrfs: fix max chunk size on raid5/6" unintentionally broke this fix again. The real problem was already introduced with the rest of the code in 73c5de0051. The user visible result however will be that the max chunk size for DUP will suddenly double, while it's actually acting according to the limits in the code again like it was 5 years ago. Reported-by: Naohiro Aota <naohiro.aota@wdc.com> Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html Fixes: 73c5de0051 ("btrfs: quasi-round-robin for chunk allocation") Fixes: 86db25785a ("Btrfs: fix max chunk size on raid5/6") Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comment ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22fs/aio: Use RCU accessors for kioctx_table->table[]Tejun Heo
commit d0264c01e7587001a8c4608a5d1818dba9a4c11a upstream. While converting ioctx index from a list to a table, db446a08c23d ("aio: convert the ioctx list to table lookup v3") missed tagging kioctx_table->table[] as an array of RCU pointers and using the appropriate RCU accessors. This introduces a small window in the lookup path where init and access may race. Mark kioctx_table->table[] with __rcu and use the approriate RCU accessors when using the field. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jann Horn <jannh@google.com> Fixes: db446a08c23d ("aio: convert the ioctx list to table lookup v3") Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org # v3.12+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22fs/aio: Add explicit RCU grace period when freeing kioctxTejun Heo
commit a6d7cff472eea87d96899a20fa718d2bab7109f3 upstream. While fixing refcounting, e34ecee2ae79 ("aio: Fix a trinity splat") incorrectly removed explicit RCU grace period before freeing kioctx. The intention seems to be depending on the internal RCU grace periods of percpu_ref; however, percpu_ref uses a different flavor of RCU, sched-RCU. This can lead to kioctx being freed while RCU read protected dereferences are still in progress. Fix it by updating free_ioctx() to go through call_rcu() explicitly. v2: Comment added to explain double bouncing. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jann Horn <jannh@google.com> Fixes: e34ecee2ae79 ("aio: Fix a trinity splat") Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org # v3.13+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22lock_parent() needs to recheck if dentry got __dentry_kill'ed under itAl Viro
commit 3b821409632ab778d46e807516b457dfa72736ed upstream. In case when dentry passed to lock_parent() is protected from freeing only by the fact that it's on a shrink list and trylock of parent fails, we could get hit by __dentry_kill() (and subsequent dentry_kill(parent)) between unlocking dentry and locking presumed parent. We need to recheck that dentry is alive once we lock both it and parent *and* postpone rcu_read_unlock() until after that point. Otherwise we could return a pointer to struct dentry that already is rcu-scheduled for freeing, with ->d_lock held on it; caller's subsequent attempt to unlock it can end up with memory corruption. Cc: stable@vger.kernel.org # 3.12+, counting backports Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22fs: Teach path_connected to handle nfs filesystems with multiple roots.Eric W. Biederman
commit 95dd77580ccd66a0da96e6d4696945b8cea39431 upstream. On nfsv2 and nfsv3 the nfs server can export subsets of the same filesystem and report the same filesystem identifier, so that the nfs client can know they are the same filesystem. The subsets can be from disjoint directory trees. The nfsv2 and nfsv3 filesystems provides no way to find the common root of all directory trees exported form the server with the same filesystem identifier. The practical result is that in struct super s_root for nfs s_root is not necessarily the root of the filesystem. The nfs mount code sets s_root to the root of the first subset of the nfs filesystem that the kernel mounts. This effects the dcache invalidation code in generic_shutdown_super currently called shrunk_dcache_for_umount and that code for years has gone through an additional list of dentries that might be dentry trees that need to be freed to accomodate nfs. When I wrote path_connected I did not realize nfs was so special, and it's hueristic for avoiding calling is_subdir can fail. The practical case where this fails is when there is a move of a directory from the subtree exposed by one nfs mount to the subtree exposed by another nfs mount. This move can happen either locally or remotely. With the remote case requiring that the move directory be cached before the move and that after the move someone walks the path to where the move directory now exists and in so doing causes the already cached directory to be moved in the dcache through the magic of d_splice_alias. If someone whose working directory is in the move directory or a subdirectory and now starts calling .. from the initial mount of nfs (where s_root == mnt_root), then path_connected as a heuristic will not bother with the is_subdir check. As s_root really is not the root of the nfs filesystem this heuristic is wrong, and the path may actually not be connected and path_connected can fail. The is_subdir function might be cheap enough that we can call it unconditionally. Verifying that will take some benchmarking and the result may not be the same on all kernels this fix needs to be backported to. So I am avoiding that for now. Filesystems with snapshots such as nilfs and btrfs do something similar. But as the directory tree of the snapshots are disjoint from one another and from the main directory tree rename won't move things between them and this problem will not occur. Cc: stable@vger.kernel.org Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Fixes: 397d425dc26d ("vfs: Test for and handle paths that are unreachable from their mnt_root") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22userns: Don't fail follow_automount based on s_user_nsEric W. Biederman
[ Upstream commit bbc3e471011417598e598707486f5d8814ec9c01 ] When vfs_submount was added the test to limit automounts from filesystems that with s_user_ns != &init_user_ns accidentially left in follow_automount. The test was never about any security concerns and was always about how do we implement this for filesystems whose s_user_ns != &init_user_ns. At the moment this check makes no difference as there are no filesystems that both set FS_USERNS_MOUNT and implement d_automount. Remove this check now while I am thinking about it so there will not be odd booby traps for someone who does want to make this combination work. vfs_submount still needs improvements to allow this combination to work, and vfs_submount contains a check that presents a warning. The autofs4 filesystem could be modified to set FS_USERNS_MOUNT and it would need not work on this code path, as userspace performs the mounts. Fixes: 93faccbbfa95 ("fs: Better permission checking for submounts") Fixes: aeaa4a79ff6a ("fs: Call d_automount with the filesystems creds") Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22reiserfs: Make cancel_old_flush() reliableJan Kara
[ Upstream commit 71b0576bdb862e964a82c73327cdd1a249c53e67 ] Currently canceling of delayed work that flushes old data using cancel_old_flush() does not prevent work from being requeued. Thus in theory new work can be queued after cancel_old_flush() from reiserfs_freeze() has run. This will become larger problem once flush_old_commits() can requeue the work itself. Fix the problem by recording in sbi->work_queue that flushing work is canceled and should not be requeued. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22f2fs: relax node version check for victim data in gcJaegeuk Kim
[ Upstream commit c13ff37e359bb3eacf4e1760dcea8d9760aa7459 ] - has_not_enough_free_secs node_secs: 0 dent_secs: 0 freed:0 free_segments:103 reserved:104 - f2fs_gc - get_victim_by_default alloc_mode 0, gc_mode 1, max_search 2672, offset 4654, ofs_unit 1 - do_garbage_collect start_segno 3976, end_segno 3977 type 0 - is_alive nid 22797, blkaddr 2131882, ofs_in_node 0, version 0x8/0x0 - gc_data_segment 766, segno 3976, block 512/426 not alive So, this patch fixes subtle corrupted case where node version does not match to summary version which results in infinite loop by gc. Reported-by: Yunlei He <heyunlei@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18NFS: Fix unstable write completionTrond Myklebust
commit c4f24df942a181699c5bab01b8e5e82b925f77f3 upstream. We do want to respect the FLUSH_SYNC argument to nfs_commit_inode() to ensure that all outstanding COMMIT requests to the inode in question are complete. Currently we may exit early from both nfs_commit_inode() and nfs_write_inode() even if there are COMMIT requests in flight, or unstable writes on the commit list. In order to get the right semantics w.r.t. sync_inode(), we don't need to have nfs_commit_inode() reset the inode dirty flags when called from nfs_wb_page() and/or nfs_wb_all(). We just need to ensure that nfs_write_inode() leaves them in the right state if there are outstanding commits, or stable pages. Reported-by: Scott Mayhew <smayhew@redhat.com> Fixes: dc4fd9ab01ab ("nfs: don't wait on commit in nfs_commit_inode()...") Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18NFS: Fix an incorrect type in struct nfs_direct_reqTrond Myklebust
commit d9ee65539d3eabd9ade46cca1780e3309ad0f907 upstream. The start offset needs to be of type loff_t. Fixed: 5fadeb47dcc5c ("nfs: count DIO good bytes correctly with mirroring") Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18ext4: inplace xattr block update fails to deduplicate blocksTahsin Erdogan
commit ec00022030da5761518476096626338bd67df57a upstream. When an xattr block has a single reference, block is updated inplace and it is reinserted to the cache. Later, a cache lookup is performed to see whether an existing block has the same contents. This cache lookup will most of the time return the just inserted entry so deduplication is not achieved. Running the following test script will produce two xattr blocks which can be observed in "File ACL: " line of debugfs output: mke2fs -b 1024 -I 128 -F -O extent /dev/sdb 1G mount /dev/sdb /mnt/sdb touch /mnt/sdb/{x,y} setfattr -n user.1 -v aaa /mnt/sdb/x setfattr -n user.2 -v bbb /mnt/sdb/x setfattr -n user.1 -v aaa /mnt/sdb/y setfattr -n user.2 -v bbb /mnt/sdb/y debugfs -R 'stat x' /dev/sdb | cat debugfs -R 'stat y' /dev/sdb | cat This patch defers the reinsertion to the cache so that we can locate other blocks with the same contents. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-11btrfs: preserve i_mode if __btrfs_set_acl() failsErnesto A. Fernández
commit d7d824966530acfe32b94d1ed672e6fe1638cd68 upstream. When changing a file's acl mask, btrfs_set_acl() will first set the group bits of i_mode to the value of the mask, and only then set the actual extended attribute representing the new acl. If the second part fails (due to lack of space, for example) and the file had no acl attribute to begin with, the system will from now on assume that the mask permission bits are actual group permission bits, potentially granting access to the wrong users. Prevent this by restoring the original mode bits if __btrfs_set_acl fails. Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-03xfs: quota: check result of register_shrinker()Aliaksei Karaliou
[ Upstream commit 3a3882ff26fbdbaf5f7e13f6a0bccfbf7121041d ] xfs_qm_init_quotainfo() does not check result of register_shrinker() which was tagged as __must_check recently, reported by sparse. Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com> [darrick: move xfs_qm_destroy_quotainos nearer xfs_qm_init_quotainos] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-03xfs: quota: fix missed destroy of qi_tree_lockAliaksei Karaliou
[ Upstream commit 2196881566225f3c3428d1a5f847a992944daa5b ] xfs_qm_destroy_quotainfo() does not destroy quotainfo->qi_tree_lock while destroys quotainfo->qi_quotaofflock. Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-03sget(): handle failures of register_shrinker()Al Viro
[ Upstream commit 9ee332d99e4d5a97548943b81c54668450ce641b ] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-03f2fs: fix a bug caused by NULL extent treeYunlei He
commit dad48e73127ba10279ea33e6dbc8d3905c4d31c0 upstream. Thread A: Thread B: -f2fs_remount -sbi->mount_opt.opt = 0; <--- -f2fs_iget -do_read_inode -f2fs_init_extent_tree -F2FS_I(inode)->extent_tree is NULL -default_options && parse_options -remount return <--- -f2fs_map_blocks -f2fs_lookup_extent_tree -f2fs_bug_on(sbi, !et); The same problem with f2fs_new_inode. Signed-off-by: Yunlei He <heyunlei@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-28fs/dax.c: fix inefficiency in dax_writeback_mapping_range()Jan Kara
commit 1eb643d02b21412e603b42cdd96010a2ac31c05f upstream. dax_writeback_mapping_range() fails to update iteration index when searching radix tree for entries needing cache flushing. Thus each pagevec worth of entries is searched starting from the start which is inefficient and prone to livelocks. Update index properly. Link: http://lkml.kernel.org/r/20170619124531.21491-1-jack@suse.cz Fixes: 9973c98ecfda3 ("dax: add support for fsync/sync") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25binfmt_elf: compat: avoid unused function warningArnd Bergmann
When CONFIG_ELF_CORE is disabled, we get a harmless warning in the compat version of binfmt_elf: fs/compat_binfmt_elf.c:58:13: error: 'cputime_to_compat_timeval' defined but not used [-Werror=unused-function] This was addressed in mainline Linux as part of a larger rework with commit cd19c364b313 ("fs/binfmt: Convert obsolete cputime type to nsecs"). For 4.9 and earlier, this just shuts up the warning by adding an #ifdef around the function definition. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25reiserfs: avoid a -Wmaybe-uninitialized warningArnd Bergmann
commit ab4949640d6674b617b314ad3c2c00353304bab9 upstream. The latest gcc-7.0.1 snapshot warns about an unintialized variable use: In file included from fs/reiserfs/lbalance.c:8:0: fs/reiserfs/lbalance.c: In function 'leaf_item_bottle.isra.3': fs/reiserfs/reiserfs.h:1279:13: error: '*((void *)&n_ih+8).v' may be used uninitialized in this function [-Werror=maybe-uninitialized] v2->v = (v2->v & cpu_to_le64(15ULL << 60)) | cpu_to_le64(offset); ~~^~~ fs/reiserfs/reiserfs.h:1279:13: error: '*((void *)&n_ih+8).v' may be used uninitialized in this function [-Werror=maybe-uninitialized] v2->v = (v2->v & cpu_to_le64(15ULL << 60)) | cpu_to_le64(offset); This happens because the offset/type pair that is stored in ih.key.u.k_offset_v2 is actually uninitialized when we call set_le_ih_k_offset() and set_le_ih_k_type(). After we have called both, all data is correct, but the first of the two reads uninitialized data for the type field and writes it back before it gets overwritten. This works around the warning by initializing the k_offset_v2 through the slightly larger memcpy(). [JK: Remove now unused define and make it obvious we initialize the key] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25btrfs: Fix possible off-by-one in btrfs_search_path_in_treeNikolay Borisov
[ Upstream commit c8bcbfbd239ed60a6562964b58034ac8a25f4c31 ] The name char array passed to btrfs_search_path_in_tree is of size BTRFS_INO_LOOKUP_PATH_MAX (4080). So the actual accessible char indexes are in the range of [0, 4079]. Currently the code uses the define but this represents an off-by-one. Implications: Size of btrfs_ioctl_ino_lookup_args is 4096, so the new byte will be written to extra space, not some padding that could be provided by the allocator. btrfs-progs store the arguments on stack, but kernel does own copy of the ioctl buffer and the off-by-one overwrite does not affect userspace, but the ending 0 might be lost. Kernel ioctl buffer is allocated dynamically so we're overwriting somebody else's memory, and the ioctl is privileged if args.objectid is not 256. Which is in most cases, but resolving a subvolume stored in another directory will trigger that path. Before this patch the buffer was one byte larger, but then the -1 was not added. Fixes: ac8e9819d71f907 ("Btrfs: add search and inode lookup ioctls") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ added implications ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22vfs: don't do RCU lookup of empty pathnamesLinus Torvalds
commit c0eb027e5aef70b71e5a38ee3e264dc0b497f343 upstream. Normal pathname lookup doesn't allow empty pathnames, but using AT_EMPTY_PATH (with name_to_handle_at() or fstatat(), for example) you can trigger an empty pathname lookup. And not only is the RCU lookup in that case entirely unnecessary (because we'll obviously immediately finalize the end result), it is actively wrong. Why? An empth path is a special case that will return the original 'dirfd' dentry - and that dentry may not actually be RCU-free'd, resulting in a potential use-after-free if we were to initialize the path lazily under the RCU read lock and depend on complete_walk() finalizing the dentry. Found by syzkaller and KASAN. Reported-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Vegard Nossum <vegard.nossum@gmail.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Eric Biggers <ebiggers3@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGEGang He
commit ff26cc10aec128c3f86b5611fd5f59c71d49c0e3 upstream. If we can't get inode lock immediately in the function ocfs2_inode_lock_with_page() when reading a page, we should not return directly here, since this will lead to a softlockup problem when the kernel is configured with CONFIG_PREEMPT is not set. The method is to get a blocking lock and immediately unlock before returning, this can avoid CPU resource waste due to lots of retries, and benefits fairness in getting lock among multiple nodes, increase efficiency in case modifying the same file frequently from multiple nodes. The softlockup crash (when set /proc/sys/kernel/softlockup_panic to 1) looks like: Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: <IRQ> dump_stack+0x5c/0x82 panic+0xd5/0x21e watchdog_timer_fn+0x208/0x210 __hrtimer_run_queues+0xcc/0x200 hrtimer_interrupt+0xa6/0x1f0 smp_apic_timer_interrupt+0x34/0x50 apic_timer_interrupt+0x96/0xa0 </IRQ> RIP: 0010:unlock_page+0x17/0x30 RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10 RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004 RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300 RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00 R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518 R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300 ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2] ocfs2_readpage+0x41/0x2d0 [ocfs2] filemap_fault+0x12b/0x5c0 ocfs2_fault+0x29/0xb0 [ocfs2] __do_fault+0x1a/0xa0 __handle_mm_fault+0xbe8/0x1090 handle_mm_fault+0xaa/0x1f0 __do_page_fault+0x235/0x4b0 trace_do_page_fault+0x3c/0x110 async_page_fault+0x28/0x30 RIP: 0033:0x7fa75ded638e RSP: 002b:00007ffd6657db18 EFLAGS: 00010287 RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700 RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700 RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000 R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770 R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000 About performance improvement, we can see the testing time is reduced, and CPU utilization decreases, the detailed data is as follows. I ran multi_mmap test case in ocfs2-test package in a three nodes cluster. Before applying this patch: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2754 ocfs2te+ 20 0 170248 6980 4856 D 80.73 0.341 0:18.71 multi_mmap 1505 root rt 0 222236 123060 97224 S 2.658 6.015 0:01.44 corosync 5 root 20 0 0 0 0 S 1.329 0.000 0:00.19 kworker/u8:0 95 root 20 0 0 0 0 S 1.329 0.000 0:00.25 kworker/u8:1 2728 root 20 0 0 0 0 S 0.997 0.000 0:00.24 jbd2/sda1-33 2721 root 20 0 0 0 0 S 0.664 0.000 0:00.07 ocfs2dc-3C8CFD4 2750 ocfs2te+ 20 0 142976 4652 3532 S 0.664 0.227 0:00.28 mpirun ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared Tests with "-b 4096 -C 32768" Thu Dec 28 14:44:52 CST 2017 multi_mmap..................................................Passed. Runtime 783 seconds. After apply this patch: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2508 ocfs2te+ 20 0 170248 6804 4680 R 54.00 0.333 0:55.37 multi_mmap 155 root 20 0 0 0 0 S 2.667 0.000 0:01.20 kworker/u8:3 95 root 20 0 0 0 0 S 2.000 0.000 0:01.58 kworker/u8:1 2504 ocfs2te+ 20 0 142976 4604 3480 R 1.667 0.225 0:01.65 mpirun 5 root 20 0 0 0 0 S 1.000 0.000 0:01.36 kworker/u8:0 2482 root 20 0 0 0 0 S 1.000 0.000 0:00.86 jbd2/sda1-33 299 root 0 -20 0 0 0 S 0.333 0.000 0:00.13 kworker/2:1H 335 root 0 -20 0 0 0 S 0.333 0.000 0:00.17 kworker/1:1H 535 root 20 0 12140 7268 1456 S 0.333 0.355 0:00.34 haveged 1282 root rt 0 222284 123108 97224 S 0.333 6.017 0:01.33 corosync ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared Tests with "-b 4096 -C 32768" Thu Dec 28 15:04:12 CST 2017 multi_mmap..................................................Passed. Runtime 487 seconds. Link: http://lkml.kernel.org/r/1514447305-30814-1-git-send-email-ghe@suse.com Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock") Signed-off-by: Gang He <ghe@suse.com> Reviewed-by: Eric Ren <zren@suse.com> Acked-by: alex chen <alex.chen@huawei.com> Acked-by: piaojun <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22Btrfs: fix unexpected -EEXIST when creating new inodeLiu Bo
commit 900c9981680067573671ecc5cbfa7c5770be3a40 upstream. The highest objectid, which is assigned to new inode, is decided at the time of initializing fs roots. However, in cases where log replay gets processed, the btree which fs root owns might be changed, so we have to search it again for the highest objectid, otherwise creating new inode would end up with -EEXIST. cc: <stable@vger.kernel.org> v4.4-rc6+ Fixes: f32e48e92596 ("Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22Btrfs: fix btrfs_evict_inode to handle abnormal inodes correctlyLiu Bo
commit e8f1bc1493855e32b7a2a019decc3c353d94daf6 upstream. This regression is introduced in commit 3d48d9810de4 ("btrfs: Handle uninitialised inode eviction"). There are two problems, a) it is ->destroy_inode() that does the final free on inode, not ->evict_inode(), b) clear_inode() must be called before ->evict_inode() returns. This could end up hitting BUG_ON(inode->i_state != (I_FREEING | I_CLEAR)); in evict() because I_CLEAR is set in clear_inode(). Fixes: commit 3d48d9810de4 ("btrfs: Handle uninitialised inode eviction") Cc: <stable@vger.kernel.org> # v4.7-rc6+ Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22Btrfs: fix extent state leak from tree logLiu Bo
commit 55237a5f2431a72435e3ed39e4306e973c0446b7 upstream. It's possible that btrfs_sync_log() bails out after one of the two btrfs_write_marked_extents() which convert extent state's state bit into EXTENT_NEED_WAIT from EXTENT_DIRTY/EXTENT_NEW, however only EXTENT_DIRTY and EXTENT_NEW are searched by free_log_tree() so that those extent states with EXTENT_NEED_WAIT lead to memory leak. cc: <stable@vger.kernel.org> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22Btrfs: fix crash due to not cleaning up tree log block's dirty bitsLiu Bo
commit 1846430c24d66e85cc58286b3319c82cd54debb2 upstream. In cases that the whole fs flips into readonly status due to failures in critical sections, then log tree's blocks are still dirty, and this leads to a crash during umount time, the crash is about use-after-free, umount -> close_ctree -> stop workers -> iput(btree_inode) -> iput_final -> write_inode_now -> ... -> queue job on stop'd workers cc: <stable@vger.kernel.org> v3.12+ Fixes: 681ae50917df ("Btrfs: cleanup reserved space when freeing tree log on error") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22Btrfs: fix deadlock in run_delalloc_nocowLiu Bo
commit e89166990f11c3f21e1649d760dd35f9e410321c upstream. @cur_offset is not set back to what it should be (@cow_start) if btrfs_next_leaf() returns something wrong, and the range [cow_start, cur_offset) remains locked forever. cc: <stable@vger.kernel.org> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22ext4: save error to disk in __ext4_grp_locked_error()Zhouyi Zhou
commit 06f29cc81f0350261f59643a505010531130eea0 upstream. In the function __ext4_grp_locked_error(), __save_error_info() is called to save error info in super block block, but does not sync that information to disk to info the subsequence fsck after reboot. This patch writes the error information to disk. After this patch, I think there is no obvious EXT4 error handle branches which leads to "Remounting filesystem read-only" will leave the disk partition miss the subsequence fsck. Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22ext4: fix a race in the ext4 shutdown pathHarshad Shirwadkar
commit abbc3f9395c76d554a9ed27d4b1ebfb5d9b0e4ca upstream. This patch fixes a race between the shutdown path and bio completion handling. In the ext4 direct io path with async io, after submitting a bio to the block layer, if journal starting fails, ext4_direct_IO_write() would bail out pretending that the IO failed. The caller would have had no way of knowing whether or not the IO was successfully submitted. So instead, we return -EIOCBQUEUED in this case. Now, the caller knows that the IO was submitted. The bio completion handler takes care of the error. Tested: Ran the shutdown xfstest test 461 in loop for over 2 hours across 4 machines resulting in over 400 runs. Verified that the race didn't occur. Usually the race was seen in about 20-30 iterations. Signed-off-by: Harshad Shirwadkar <harshads@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22jbd2: fix sphinx kernel-doc build warningsTobin C. Harding
commit f69120ce6c024aa634a8fc25787205e42f0ccbe6 upstream. Sphinx emits various (26) warnings when building make target 'htmldocs'. Currently struct definitions contain duplicate documentation, some as kernel-docs and some as standard c89 comments. We can reduce duplication while cleaning up the kernel docs. Move all kernel-docs to right above each struct member. Use the set of all existing comments (kernel-doc and c89). Add documentation for missing struct members and function arguments. Signed-off-by: Tobin C. Harding <me@tobin.cc> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22mbcache: initialize entry->e_referenced in mb_cache_entry_create()Alexander Potapenko
commit 3876bbe27d04b848750d5310a37d6b76b593f648 upstream. KMSAN reported use of uninitialized |entry->e_referenced| in a condition in mb_cache_shrink(): ================================================================== BUG: KMSAN: use of uninitialized memory in mb_cache_shrink+0x3b4/0xc50 fs/mbcache.c:287 CPU: 2 PID: 816 Comm: kswapd1 Not tainted 4.11.0-rc5+ #2877 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x172/0x1c0 lib/dump_stack.c:52 kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:927 __msan_warning_32+0x61/0xb0 mm/kmsan/kmsan_instr.c:469 mb_cache_shrink+0x3b4/0xc50 fs/mbcache.c:287 mb_cache_scan+0x67/0x80 fs/mbcache.c:321 do_shrink_slab mm/vmscan.c:397 [inline] shrink_slab+0xc3d/0x12d0 mm/vmscan.c:500 shrink_node+0x208f/0x2fd0 mm/vmscan.c:2603 kswapd_shrink_node mm/vmscan.c:3172 [inline] balance_pgdat mm/vmscan.c:3289 [inline] kswapd+0x160f/0x2850 mm/vmscan.c:3478 kthread+0x46c/0x5f0 kernel/kthread.c:230 ret_from_fork+0x29/0x40 arch/x86/entry/entry_64.S:430 chained origin: save_stack_trace+0x37/0x40 arch/x86/kernel/stacktrace.c:59 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:302 [inline] kmsan_save_stack mm/kmsan/kmsan.c:317 [inline] kmsan_internal_chain_origin+0x12a/0x1f0 mm/kmsan/kmsan.c:547 __msan_store_shadow_origin_1+0xac/0x110 mm/kmsan/kmsan_instr.c:257 mb_cache_entry_create+0x3b3/0xc60 fs/mbcache.c:95 ext4_xattr_cache_insert fs/ext4/xattr.c:1647 [inline] ext4_xattr_block_set+0x4c82/0x5530 fs/ext4/xattr.c:1022 ext4_xattr_set_handle+0x1332/0x20a0 fs/ext4/xattr.c:1252 ext4_xattr_set+0x4d2/0x680 fs/ext4/xattr.c:1306 ext4_xattr_trusted_set+0x8d/0xa0 fs/ext4/xattr_trusted.c:36 __vfs_setxattr+0x703/0x790 fs/xattr.c:149 __vfs_setxattr_noperm+0x27a/0x6f0 fs/xattr.c:180 vfs_setxattr fs/xattr.c:223 [inline] setxattr+0x6ae/0x790 fs/xattr.c:449 path_setxattr+0x1eb/0x380 fs/xattr.c:468 SYSC_lsetxattr+0x8d/0xb0 fs/xattr.c:490 SyS_lsetxattr+0x77/0xa0 fs/xattr.c:486 entry_SYSCALL_64_fastpath+0x13/0x94 origin: save_stack_trace+0x37/0x40 arch/x86/kernel/stacktrace.c:59 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:302 [inline] kmsan_internal_poison_shadow+0xb1/0x1a0 mm/kmsan/kmsan.c:198 kmsan_kmalloc+0x7f/0xe0 mm/kmsan/kmsan.c:337 kmem_cache_alloc+0x1c2/0x1e0 mm/slub.c:2766 mb_cache_entry_create+0x283/0xc60 fs/mbcache.c:86 ext4_xattr_cache_insert fs/ext4/xattr.c:1647 [inline] ext4_xattr_block_set+0x4c82/0x5530 fs/ext4/xattr.c:1022 ext4_xattr_set_handle+0x1332/0x20a0 fs/ext4/xattr.c:1252 ext4_xattr_set+0x4d2/0x680 fs/ext4/xattr.c:1306 ext4_xattr_trusted_set+0x8d/0xa0 fs/ext4/xattr_trusted.c:36 __vfs_setxattr+0x703/0x790 fs/xattr.c:149 __vfs_setxattr_noperm+0x27a/0x6f0 fs/xattr.c:180 vfs_setxattr fs/xattr.c:223 [inline] setxattr+0x6ae/0x790 fs/xattr.c:449 path_setxattr+0x1eb/0x380 fs/xattr.c:468 SYSC_lsetxattr+0x8d/0xb0 fs/xattr.c:490 SyS_lsetxattr+0x77/0xa0 fs/xattr.c:486 entry_SYSCALL_64_fastpath+0x13/0x94 ================================================================== Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Cc: stable@vger.kernel.org # v4.6 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-17ovl: fix failure to fsync lower dirAmir Goldstein
commit d796e77f1dd541fe34481af2eee6454688d13982 upstream. As a writable mount, it is not expected for overlayfs to return EINVAL/EROFS for fsync, even if dir/file is not changed. This commit fixes the case of fsync of directory, which is easier to address, because overlayfs already implements fsync file operation for directories. The problem reported by Raphael is that new PostgreSQL 10.0 with a database in overlayfs where lower layer in squashfs fails to start. The failure is due to fsync error, when PostgreSQL does fsync on all existing db directories on startup and a specific directory exists lower layer with no changes. Reported-by: Raphael Hertzog <raphael@ouaza.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Tested-by: Raphaël Hertzog <hertzog@debian.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-17btrfs: Handle btrfs_set_extent_delalloc failure in fixup workerNikolay Borisov
commit f3038ee3a3f1017a1cbe9907e31fa12d366c5dcb upstream. This function was introduced by 247e743cbe6e ("Btrfs: Use async helpers to deal with pages that have been improperly dirtied") and it didn't do any error handling then. This function might very well fail in ENOMEM situation, yet it's not handled, this could lead to inconsistent state. So let's handle the failure by setting the mapping error bit. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>