summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2017-01-07mm: workingset: fix use-after-free in shadow node shrinkerJohannes Weiner
Several people report seeing warnings about inconsistent radix tree nodes followed by crashes in the workingset code, which all looked like use-after-free access from the shadow node shrinker. Dave Jones managed to reproduce the issue with a debug patch applied, which confirmed that the radix tree shrinking indeed frees shadow nodes while they are still linked to the shadow LRU: WARNING: CPU: 2 PID: 53 at lib/radix-tree.c:643 delete_node+0x1e4/0x200 CPU: 2 PID: 53 Comm: kswapd0 Not tainted 4.10.0-rc2-think+ #3 Call Trace: delete_node+0x1e4/0x200 __radix_tree_delete_node+0xd/0x10 shadow_lru_isolate+0xe6/0x220 __list_lru_walk_one.isra.4+0x9b/0x190 list_lru_walk_one+0x23/0x30 scan_shadow_nodes+0x2e/0x40 shrink_slab.part.44+0x23d/0x5d0 shrink_node+0x22c/0x330 kswapd+0x392/0x8f0 This is the WARN_ON_ONCE(!list_empty(&node->private_list)) placed in the inlined radix_tree_shrink(). The problem is with 14b468791fa9 ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking"), which passes an update callback into the radix tree to link and unlink shadow leaf nodes when tree entries change, but forgot to pass the callback when reclaiming a shadow node. While the reclaimed shadow node itself is unlinked by the shrinker, its deletion from the tree can cause the left-most leaf node in the tree to be shrunk. If that happens to be a shadow node as well, we don't unlink it from the LRU as we should. Consider this tree, where the s are shadow entries: root->rnode | [0 n] | | [s ] [sssss] Now the shadow node shrinker reclaims the rightmost leaf node through the shadow node LRU: root->rnode | [0 ] | [s ] Because the parent of the deleted node is the first level below the root and has only one child in the left-most slot, the intermediate level is shrunk and the node containing the single shadow is put in its place: root->rnode | [s ] The shrinker again sees a single left-most slot in a first level node and thus decides to store the shadow in root->rnode directly and free the node - which is a leaf node on the shadow node LRU. root->rnode | s Without the update callback, the freed node remains on the shadow LRU, where it causes later shrinker runs to crash. Pass the node updater callback into __radix_tree_delete_node() in case the deletion causes the left-most branch in the tree to collapse too. Also add warnings when linked nodes are freed right away, rather than wait for the use-after-free when the list is scanned much later. Fixes: 14b468791fa9 ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking") Reported-by: Dave Chinner <david@fromorbit.com> Reported-by: Hugh Dickins <hughd@google.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reported-and-tested-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Chris Leech <cleech@redhat.com> Cc: Lee Duncan <lduncan@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-07mm: stop leaking PageTablesHugh Dickins
4.10-rc loadtest (even on x86, and even without THPCache) fails with "fork: Cannot allocate memory" or some such; and /proc/meminfo shows PageTables growing. Commit 953c66c2b22a ("mm: THP page cache support for ppc64") that got merged in rc1 removed the freeing of an unused preallocated pagetable after do_fault_around() has called map_pages(). This is usually a good optimization, so that the followup doesn't have to reallocate one; but it's not sufficient to shift the freeing into alloc_set_pte(), since there are failure cases (most commonly VM_FAULT_RETRY) which never reach finish_fault(). Check and free it at the outer level in do_fault(), then we don't need to worry in alloc_set_pte(), and can restore that to how it was (I cannot find any reason to pte_free() under lock as it was doing). And fix a separate pagetable leak, or crash, introduced by the same change, that could only show up on some ppc64: why does do_set_pmd()'s failure case attempt to withdraw a pagetable when it never deposited one, at the same time overwriting (so leaking) the vmf->prealloc_pte? Residue of an earlier implementation, perhaps? Delete it. Fixes: 953c66c2b22a ("mm: THP page cache support for ppc64") Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Neuling <mikey@neuling.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-01Merge branch 'libnvdimm-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull DAX updates from Dan Williams: "The completion of Jan's DAX work for 4.10. As I mentioned in the libnvdimm-for-4.10 pull request, these are some final fixes for the DAX dirty-cacheline-tracking invalidation work that was merged through the -mm, ext4, and xfs trees in -rc1. These patches were prepared prior to the merge window, but we waited for 4.10-rc1 to have a stable merge base after all the prerequisites were merged. Quoting Jan on the overall changes in these patches: "So I'd like all these 6 patches to go for rc2. The first three patches fix invalidation of exceptional DAX entries (a bug which is there for a long time) - without these patches data loss can occur on power failure even though user called fsync(2). The other three patches change locking of DAX faults so that ->iomap_begin() is called in a more relaxed locking context and we are safe to start a transaction there for ext4" These have received a build success notification from the kbuild robot, and pass the latest libnvdimm unit tests. There have not been any -next releases since -rc1, so they have not appeared there" * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: ext4: Simplify DAX fault path dax: Call ->iomap_begin without entry lock during dax fault dax: Finish fault completely when loading holes dax: Avoid page invalidation races and unnecessary radix tree traversals mm: Invalidate DAX radix tree entries only if appropriate ext2: Return BH_New buffers for zeroed blocks
2016-12-29mm/filemap: fix parameters to test_bit()Olof Johansson
mm/filemap.c: In function 'clear_bit_unlock_is_negative_byte': mm/filemap.c:933:9: error: too few arguments to function 'test_bit' return test_bit(PG_waiters); ^~~~~~~~ Fixes: b91e1302ad9b ('mm: optimize PageWaiters bit use for unlock_page()') Signed-off-by: Olof Johansson <olof@lixom.net> Brown-paper-bag-by: Linus Torvalds <dummy@duh.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-29mm: optimize PageWaiters bit use for unlock_page()Linus Torvalds
In commit 62906027091f ("mm: add PageWaiters indicating tasks are waiting for a page bit") Nick Piggin made our page locking no longer unconditionally touch the hashed page waitqueue, which not only helps performance in general, but is particularly helpful on NUMA machines where the hashed wait queues can bounce around a lot. However, the "clear lock bit atomically and then test the waiters bit" sequence turns out to be much more expensive than it needs to be, because you get a nasty stall when trying to access the same word that just got updated atomically. On architectures where locking is done with LL/SC, this would be trivial to fix with a new primitive that clears one bit and tests another atomically, but that ends up not working on x86, where the only atomic operations that return the result end up being cmpxchg and xadd. The atomic bit operations return the old value of the same bit we changed, not the value of an unrelated bit. On x86, we could put the lock bit in the high bit of the byte, and use "xadd" with that bit (where the overflow ends up not touching other bits), and look at the other bits of the result. However, an even simpler model is to just use a regular atomic "and" to clear the lock bit, and then the sign bit in eflags will indicate the resulting state of the unrelated bit #7. So by moving the PageWaiters bit up to bit #7, we can atomically clear the lock bit and test the waiters bit on x86 too. And architectures with LL/SC (which is all the usual RISC suspects), the particular bit doesn't matter, so they are fine with this approach too. This avoids the extra access to the same atomic word, and thus avoids the costly stall at page unlock time. The only downside is that the interface ends up being a bit odd and specialized: clear a bit in a byte, and test the sign bit. Nick doesn't love the resulting name of the new primitive, but I'd rather make the name be descriptive and very clear about the limitation imposed by trying to work across all relevant architectures than make it be some generic thing that doesn't make the odd semantics explicit. So this introduces the new architecture primitive clear_bit_unlock_is_negative_byte(); and adds the trivial implementation for x86. We have a generic non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)" combination) which can be overridden by any architecture that can do better. According to Nick, Power has the same hickup x86 has, for example, but some other architectures may not even care. All these optimizations mean that my page locking stress-test (which is just executing a lot of small short-lived shell scripts: "make test" in the git source tree) no longer makes our page locking look horribly bad. Before all these optimizations, just the unlock_page() costs were just over 3% of all CPU overhead on "make test". After this, it's down to 0.66%, so just a quarter of the cost it used to be. (The difference on NUMA is bigger, but there this micro-optimization is likely less noticeable, since the big issue on NUMA was not the accesses to 'struct page', but the waitqueue accesses that were already removed by Nick's earlier commit). Acked-by: Nick Piggin <npiggin@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Andrew Lutomirski <luto@kernel.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-26mm: Invalidate DAX radix tree entries only if appropriateJan Kara
Currently invalidate_inode_pages2_range() and invalidate_mapping_pages() just delete all exceptional radix tree entries they find. For DAX this is not desirable as we track cache dirtiness in these entries and when they are evicted, we may not flush caches although it is necessary. This can for example manifest when we write to the same block both via mmap and via write(2) (to different offsets) and fsync(2) then does not properly flush CPU caches when modification via write(2) was the last one. Create appropriate DAX functions to handle invalidation of DAX entries for invalidate_inode_pages2_range() and invalidate_mapping_pages() and wire them up into the corresponding mm functions. Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-25mm: add PageWaiters indicating tasks are waiting for a page bitNicholas Piggin
Add a new page flag, PageWaiters, to indicate the page waitqueue has tasks waiting. This can be tested rather than testing waitqueue_active which requires another cacheline load. This bit is always set when the page has tasks on page_waitqueue(page), and is set and cleared under the waitqueue lock. It may be set when there are no tasks on the waitqueue, which will cause a harmless extra wakeup check that will clears the bit. The generic bit-waitqueue infrastructure is no longer used for pages. Instead, waitqueues are used directly with a custom key type. The generic code was not flexible enough to have PageWaiters manipulation under the waitqueue lock (which simplifies concurrency). This improves the performance of page lock intensive microbenchmarks by 2-3%. Putting two bits in the same word opens the opportunity to remove the memory barrier between clearing the lock bit and testing the waiters bit, after some work on the arch primitives (e.g., ensuring memory operand widths match and cover both bits). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Andrew Lutomirski <luto@kernel.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-25mm: Use owner_priv bit for PageSwapCache, valid when PageSwapBackedNicholas Piggin
A page is not added to the swap cache without being swap backed, so PageSwapBacked mappings can use PG_owner_priv_1 for PageSwapCache. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Andrew Lutomirski <luto@kernel.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds
This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-20mm: fadvise: avoid expensive remote LRU cache draining after FADV_DONTNEEDJohannes Weiner
When FADV_DONTNEED cannot drop all pages in the range, it observes that some pages might still be on per-cpu LRU caches after recent instantiation and so initiates remote calls to all CPUs to flush their local caches. However, in most cases, the fadvise happens from the same context that instantiated the pages, and any pre-LRU pages in the specified range are most likely sitting on the local CPU's LRU cache, and so in many cases this results in unnecessary remote calls, which, in a loaded system, can hold up the fadvise() call significantly. [ I didn't record it in the extreme case we observed at Facebook, unfortunately. We had a slow-to-respond system and noticed it lru_add_drain_all() leading the profile during fadvise calls. This patch came out of thinking about the code and how we commonly call FADV_DONTNEED. FWIW, I wrote a silly directory tree walker/searcher that recurses through /usr to read and FADV_DONTNEED each file it finds. On a 2 socket 40 ht machine, over 1% is spent in lru_add_drain_all(). With the patch, that cost is gone; the local drain cost shows at 0.09%. ] Try to avoid the remote call by flushing the local LRU cache before even attempting to invalidate anything. It's a cheap operation, and the local LRU cache is the most likely to hold any pre-LRU pages in the specified fadvise range. Link: http://lkml.kernel.org/r/20161214210017.GA1465@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-17Merge uncontroversial parts of branch 'readlink' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull partial readlink cleanups from Miklos Szeredi. This is the uncontroversial part of the readlink cleanup patch-set that simplifies the default readlink handling. Miklos and Al are still discussing the rest of the series. * git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: vfs: make generic_readlink() static vfs: remove ".readlink = generic_readlink" assignments vfs: default to generic_readlink() vfs: replace calling i_op->readlink with vfs_readlink() proc/self: use generic_readlink ecryptfs: use vfs_get_link() bad_inode: add missing i_op initializers
2016-12-14Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more updates from Andrew Morton: - a few misc things - kexec updates - DMA-mapping updates to better support networking DMA operations - IPC updates - various MM changes to improve DAX fault handling - lots of radix-tree changes, mainly to the test suite. All leading up to reimplementing the IDA/IDR code to be a wrapper layer over the radix-tree. However the final trigger-pulling patch is held off for 4.11. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits) radix tree test suite: delete unused rcupdate.c radix tree test suite: add new tag check radix-tree: ensure counts are initialised radix tree test suite: cache recently freed objects radix tree test suite: add some more functionality idr: reduce the number of bits per level from 8 to 6 rxrpc: abstract away knowledge of IDR internals tpm: use idr_find(), not idr_find_slowpath() idr: add ida_is_empty radix tree test suite: check multiorder iteration radix-tree: fix replacement for multiorder entries radix-tree: add radix_tree_split_preload() radix-tree: add radix_tree_split radix-tree: add radix_tree_join radix-tree: delete radix_tree_range_tag_if_tagged() radix-tree: delete radix_tree_locate_item() radix-tree: improve multiorder iterators btrfs: fix race in btrfs_free_dummy_fs_info() radix-tree: improve dump output radix-tree: make radix_tree_find_next_bit more useful ...
2016-12-14radix-tree: delete radix_tree_range_tag_if_tagged()Matthew Wilcox
This is an exceptionally complicated function with just one caller (tag_pages_for_writeback). We devote a large portion of the runtime of the test suite to testing this one function which has one caller. By introducing the new function radix_tree_iter_tag_set(), we can eliminate all of the complexity while keeping the performance. The caller can now use a fairly standard radix_tree_for_each() loop, and it doesn't need to worry about tricksy things like 'start' wrapping. The test suite continues to spend a large amount of time investigating this function, but now it's testing the underlying primitives such as radix_tree_iter_resume() and the radix_tree_for_each_tagged() iterator which are also used by other parts of the kernel. Link: http://lkml.kernel.org/r/1480369871-5271-57-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@infradead.org> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14radix-tree: delete radix_tree_locate_item()Matthew Wilcox
This rather complicated function can be better implemented as an iterator. It has only one caller, so move the functionality to the only place that needs it. Update the test suite to follow the same pattern. Link: http://lkml.kernel.org/r/1480369871-5271-56-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Konstantin Khlebnikov <koct9i@gmail.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14radix-tree: improve multiorder iteratorsMatthew Wilcox
This fixes several interlinked problems with the iterators in the presence of multiorder entries. 1. radix_tree_iter_next() would only advance by one slot, which would result in the iterators returning the same entry more than once if there were sibling entries. 2. radix_tree_next_slot() could return an internal pointer instead of a user pointer if a tagged multiorder entry was immediately followed by an entry of lower order. 3. radix_tree_next_slot() expanded to a lot more code than it used to when multiorder support was compiled in. And I wasn't comfortable with entry_to_node() being in a header file. Fixing radix_tree_iter_next() for the presence of sibling entries necessarily involves examining the contents of the radix tree, so we now need to pass 'slot' to radix_tree_iter_next(), and we need to change the calling convention so it is called *before* dropping the lock which protects the tree. Also rename it to radix_tree_iter_resume(), as some people thought it was necessary to call radix_tree_iter_next() each time around the loop. radix_tree_next_slot() becomes closer to how it looked before multiorder support was introduced. It only checks to see if the next entry in the chunk is a sibling entry or a pointer to a node; this should be rare enough that handling this case out of line is not a performance impact (and such impact is amortised by the fact that the entry we just processed was a multiorder entry). Also, radix_tree_next_slot() used to force a new chunk lookup for untagged entries, which is more expensive than the out of line sibling entry skipping. Link: http://lkml.kernel.org/r/1480369871-5271-55-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14dax: protect PTE modification on WP fault by radix tree entry lockJan Kara
Currently PTE gets updated in wp_pfn_shared() after dax_pfn_mkwrite() has released corresponding radix tree entry lock. When we want to writeprotect PTE on cache flush, we need PTE modification to happen under radix tree entry lock to ensure consistent updates of PTE and radix tree (standard faults use page lock to ensure this consistency). So move update of PTE bit into dax_pfn_mkwrite(). Link: http://lkml.kernel.org/r/1479460644-25076-20-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: export follow_pte()Jan Kara
DAX will need to implement its own version of page_check_address(). To avoid duplicating page table walking code, export follow_pte() which does what we need. Link: http://lkml.kernel.org/r/1479460644-25076-18-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: change return values of finish_mkwrite_fault()Jan Kara
Currently finish_mkwrite_fault() returns 0 when PTE got changed before we acquired PTE lock and VM_FAULT_WRITE when we succeeded in modifying the PTE. This is somewhat confusing since 0 generally means success, it is also inconsistent with finish_fault() which returns 0 on success. Change finish_mkwrite_fault() to return 0 on success and VM_FAULT_NOPAGE when PTE changed. Practically, there should be no behavioral difference since we bail out from the fault the same way regardless whether we return 0, VM_FAULT_NOPAGE, or VM_FAULT_WRITE. Also note that VM_FAULT_WRITE has no effect for shared mappings since the only two places that check it - KSM and GUP - care about private mappings only. Generally the meaning of VM_FAULT_WRITE for shared mappings is not well defined and we should probably clean that up. Link: http://lkml.kernel.org/r/1479460644-25076-17-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: provide helper for finishing mkwrite faultsJan Kara
Provide a helper function for finishing write faults due to PTE being read-only. The helper will be used by DAX to avoid the need of complicating generic MM code with DAX locking specifics. Link: http://lkml.kernel.org/r/1479460644-25076-16-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: move part of wp_page_reuse() into the single call siteJan Kara
wp_page_reuse() handles write shared faults which is needed only in wp_page_shared(). Move the handling only into that location to make wp_page_reuse() simpler and avoid a strange situation when we sometimes pass in locked page, sometimes unlocked etc. Link: http://lkml.kernel.org/r/1479460644-25076-15-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: use vmf->page during WP faultsJan Kara
So far we set vmf->page during WP faults only when we needed to pass it to the ->page_mkwrite handler. Set it in all the cases now and use that instead of passing page pointer explicitly around. Link: http://lkml.kernel.org/r/1479460644-25076-14-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: pass vm_fault structure into do_page_mkwrite()Jan Kara
We will need more information in the ->page_mkwrite() helper for DAX to be able to fully finish faults there. Pass vm_fault structure to do_page_mkwrite() and use it there so that information propagates properly from upper layers. Link: http://lkml.kernel.org/r/1479460644-25076-13-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: factor out common parts of write fault handlingJan Kara
Currently we duplicate handling of shared write faults in wp_page_reuse() and do_shared_fault(). Factor them out into a common function. Link: http://lkml.kernel.org/r/1479460644-25076-12-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: move handling of COW faults into DAX codeJan Kara
Move final handling of COW faults from generic code into DAX fault handler. That way generic code doesn't have to be aware of peculiarities of DAX locking so remove that knowledge and make locking functions private to fs/dax.c. Link: http://lkml.kernel.org/r/1479460644-25076-11-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: factor out functionality to finish page faultsJan Kara
Introduce finish_fault() as a helper function for finishing page faults. It is rather thin wrapper around alloc_set_pte() but since we'd want to call this from DAX code or filesystems, it is still useful to avoid some boilerplate code. Link: http://lkml.kernel.org/r/1479460644-25076-10-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: allow full handling of COW faults in ->fault handlersJan Kara
Patch series "dax: Clear dirty bits after flushing caches", v5. Patchset to clear dirty bits from radix tree of DAX inodes when caches for corresponding pfns have been flushed. In principle, these patches enable handlers to easily update PTEs and do other work necessary to finish the fault without duplicating the functionality present in the generic code. I'd like to thank Kirill and Ross for reviews of the series! This patch (of 20): To allow full handling of COW faults add memcg field to struct vm_fault and a return value of ->fault() handler meaning that COW fault is fully handled and memcg charge must not be canceled. This will allow us to remove knowledge about special DAX locking from the generic fault code. Link: http://lkml.kernel.org/r/1479460644-25076-9-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: add orig_pte field into vm_faultJan Kara
Add orig_pte field to vm_fault structure to allow ->page_mkwrite handlers to fully handle the fault. This also allows us to save some passing of extra arguments around. Link: http://lkml.kernel.org/r/1479460644-25076-8-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: use passed vm_fault structure for in wp_pfn_shared()Jan Kara
Instead of creating another vm_fault structure, use the one passed to wp_pfn_shared() for passing arguments into pfn_mkwrite handler. Link: http://lkml.kernel.org/r/1479460644-25076-7-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: trim __do_fault() argumentsJan Kara
Use vm_fault structure to pass cow_page, page, and entry in and out of the function. That reduces number of __do_fault() arguments from 4 to 1. Link: http://lkml.kernel.org/r/1479460644-25076-6-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: use passed vm_fault structure in __do_fault()Jan Kara
Instead of creating another vm_fault structure, use the one passed to __do_fault() for passing arguments into fault handler. Link: http://lkml.kernel.org/r/1479460644-25076-5-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: use pgoff in struct vm_fault instead of passing it separatelyJan Kara
struct vm_fault has already pgoff entry. Use it instead of passing pgoff as a separate argument and then assigning it later. Link: http://lkml.kernel.org/r/1479460644-25076-4-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: use vmf->address instead of of vmf->virtual_addressJan Kara
Every single user of vmf->virtual_address typed that entry to unsigned long before doing anything with it so the type of virtual_address does not really provide us any additional safety. Just use masked vmf->address which already has the appropriate type. Link: http://lkml.kernel.org/r/1479460644-25076-3-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: join struct fault_env and vm_faultJan Kara
Currently we have two different structures for passing fault information around - struct vm_fault and struct fault_env. DAX will need more information in struct vm_fault to handle its faults so the content of that structure would become event closer to fault_env. Furthermore it would need to generate struct fault_env to be able to call some of the generic functions. So at this point I don't think there's much use in keeping these two structures separate. Just embed into struct vm_fault all that is needed to use it for both purposes. Link: http://lkml.kernel.org/r/1479460644-25076-2-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: unexport __get_user_pages_unlocked()Lorenzo Stoakes
Unexport the low-level __get_user_pages_unlocked() function and replaces invocations with calls to more appropriate higher-level functions. In hva_to_pfn_slow() we are able to replace __get_user_pages_unlocked() with get_user_pages_unlocked() since we can now pass gup_flags. In async_pf_execute() and process_vm_rw_single_vec() we need to pass different tsk, mm arguments so get_user_pages_remote() is the sane replacement in these cases (having added manual acquisition and release of mmap_sem.) Additionally get_user_pages_remote() reintroduces use of the FOLL_TOUCH flag. However, this flag was originally silently dropped by commit 1e9877902dc7 ("mm/gup: Introduce get_user_pages_remote()"), so this appears to have been unintentional and reintroducing it is therefore not an issue. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20161027095141.2569-3-lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: add locked parameter to get_user_pages_remote()Lorenzo Stoakes
Patch series "mm: unexport __get_user_pages_unlocked()". This patch series continues the cleanup of get_user_pages*() functions taking advantage of the fact we can now pass gup_flags as we please. It firstly adds an additional 'locked' parameter to get_user_pages_remote() to allow for its callers to utilise VM_FAULT_RETRY functionality. This is necessary as the invocation of __get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of this and no other existing higher level function would allow it to do so. Secondly existing callers of __get_user_pages_unlocked() are replaced with the appropriate higher-level replacement - get_user_pages_unlocked() if the current task and memory descriptor are referenced, or get_user_pages_remote() if other task/memory descriptors are referenced (having acquiring mmap_sem.) This patch (of 2): Add a int *locked parameter to get_user_pages_remote() to allow VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked(). Taking into account the previous adjustments to get_user_pages*() functions allowing for the passing of gup_flags, we are now in a position where __get_user_pages_unlocked() need only be exported for his ability to allow VM_FAULT_RETRY behaviour, this adjustment allows us to subsequently unexport __get_user_pages_unlocked() as well as allowing for future flexibility in the use of get_user_pages_remote(). [sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change] Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: add support for releasing multiple instances of a pageAlexander Duyck
Add a function that allows us to batch free a page that has multiple references outstanding. Specifically this function can be used to drop a page being used in the page frag alloc cache. With this drivers can make use of functionality similar to the page frag alloc cache without having to do any workarounds for the fact that there is no function that frees multiple references. Link: http://lkml.kernel.org/r/20161110113606.76501.70752.stgit@ahduyck-blue-test.jf.intel.com Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no> Cc: Helge Deller <deller@gmx.de> Cc: James Hogan <james.hogan@imgtec.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Keguang Zhang <keguang.zhang@gmail.com> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Steven Miao <realmz6@gmail.com> Cc: Tobias Klauser <tklauser@distanz.ch> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm, compaction: allow compaction for GFP_NOFS requestsMichal Hocko
compaction has been disabled for GFP_NOFS and GFP_NOIO requests since the direct compaction was introduced by commit 56de7263fcf3 ("mm: compaction: direct compact when a high-order allocation fails"). The main reason is that the migration of page cache pages might recurse back to fs/io layer and we could potentially deadlock. This is overly conservative because all the anonymous memory is migrateable in the GFP_NOFS context just fine. This might be a large portion of the memory in many/most workkloads. Remove the GFP_NOFS restriction and make sure that we skip all fs pages (those with a mapping) while isolating pages to be migrated. We cannot consider clean fs pages because they might need a metadata update so only isolate pages without any mapping for nofs requests. The effect of this patch will be probably very limited in many/most workloads because higher order GFP_NOFS requests are quite rare, although different configurations might lead to very different results. David Chinner has mentioned a heavy metadata workload with 64kB block which to quote him: : Unfortunately, there was an era of cargo cult configuration tweaks in the : Ceph community that has resulted in a large number of production machines : with XFS filesystems configured this way. And a lot of them store large : numbers of small files and run under significant sustained memory : pressure. : : I slowly working towards getting rid of these high order allocations and : replacing them with the equivalent number of single page allocations, but : I haven't got that (complex) change working yet. We can do the following to simulate that workload: $ mkfs.xfs -f -n size=64k <dev> $ mount <dev> /mnt/scratch $ time ./fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 \ -d /mnt/scratch/0 -d /mnt/scratch/1 \ -d /mnt/scratch/2 -d /mnt/scratch/3 \ -d /mnt/scratch/4 -d /mnt/scratch/5 \ -d /mnt/scratch/6 -d /mnt/scratch/7 \ -d /mnt/scratch/8 -d /mnt/scratch/9 \ -d /mnt/scratch/10 -d /mnt/scratch/11 \ -d /mnt/scratch/12 -d /mnt/scratch/13 \ -d /mnt/scratch/14 -d /mnt/scratch/15 and indeed is hammers the system with many high order GFP_NOFS requests as per a simle tracepoint during the load: $ echo '!(gfp_flags & 0x80) && (gfp_flags &0x400000)' > $TRACE_MNT/events/kmem/mm_page_alloc/filter I am getting 5287609 order=0 37 order=1 1594905 order=2 3048439 order=3 6699207 order=4 66645 order=5 My testing was done in a kvm guest so performance numbers should be taken with a grain of salt but there seems to be a difference when the patch is applied: * Original kernel FSUse% Count Size Files/sec App Overhead 1 1600000 0 4300.1 20745838 3 3200000 0 4239.9 23849857 5 4800000 0 4243.4 25939543 6 6400000 0 4248.4 19514050 8 8000000 0 4262.1 20796169 9 9600000 0 4257.6 21288675 11 11200000 0 4259.7 19375120 13 12800000 0 4220.7 22734141 14 14400000 0 4238.5 31936458 16 16000000 0 4231.5 23409901 18 17600000 0 4045.3 23577700 19 19200000 0 2783.4 58299526 21 20800000 0 2678.2 40616302 23 22400000 0 2693.5 83973996 and xfs complaining about memory allocation not making progress [ 2304.372647] XFS: fs_mark(3289) possible memory allocation deadlock size 65624 in kmem_alloc (mode:0x2408240) [ 2304.443323] XFS: fs_mark(3285) possible memory allocation deadlock size 65728 in kmem_alloc (mode:0x2408240) [ 4796.772477] XFS: fs_mark(3424) possible memory allocation deadlock size 46936 in kmem_alloc (mode:0x2408240) [ 4796.775329] XFS: fs_mark(3423) possible memory allocation deadlock size 51416 in kmem_alloc (mode:0x2408240) [ 4797.388808] XFS: fs_mark(3424) possible memory allocation deadlock size 65728 in kmem_alloc (mode:0x2408240) * Patched kernel FSUse% Count Size Files/sec App Overhead 1 1600000 0 4289.1 19243934 3 3200000 0 4241.6 32828865 5 4800000 0 4248.7 32884693 6 6400000 0 4314.4 19608921 8 8000000 0 4269.9 24953292 9 9600000 0 4270.7 33235572 11 11200000 0 4346.4 40817101 13 12800000 0 4285.3 29972397 14 14400000 0 4297.2 20539765 16 16000000 0 4219.6 18596767 18 17600000 0 4273.8 49611187 19 19200000 0 4300.4 27944451 21 20800000 0 4270.6 22324585 22 22400000 0 4317.6 22650382 24 24000000 0 4065.2 22297964 So the dropdown at Count 19200000 didn't happen and there was only a single warning about allocation not making progress [ 3063.815003] XFS: fs_mark(3272) possible memory allocation deadlock size 65624 in kmem_alloc (mode:0x2408240) This suggests that the patch has helped even though there is not all that much of anonymous memory as the workload mostly generates fs metadata. I assume the success rate would be higher with more anonymous memory which should be the case in many workloads. [akpm@linux-foundation.org: fix comment] Link: http://lkml.kernel.org/r/20161012114721.31853-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "After a lot of discussion and work we have finally reachanged a basic understanding of what is necessary to make unprivileged mounts safe in the presence of EVM and IMA xattrs which the last commit in this series reflects. While technically it is a revert the comments it adds are important for people not getting confused in the future. Clearing up that confusion allows us to seriously work on unprivileged mounts of fuse in the next development cycle. The rest of the fixes in this set are in the intersection of user namespaces, ptrace, and exec. I started with the first fix which started a feedback cycle of finding additional issues during review and fixing them. Culiminating in a fix for a bug that has been present since at least Linux v1.0. Potentially these fixes were candidates for being merged during the rc cycle, and are certainly backport candidates but enough little things turned up during review and testing that I decided they should be handled as part of the normal development process just to be certain there were not any great surprises when it came time to backport some of these fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: Revert "evm: Translate user/group ids relative to s_user_ns when computing HMAC" exec: Ensure mm->user_ns contains the execed files ptrace: Don't allow accessing an undumpable mm ptrace: Capture the ptracer's creds not PT_PTRACE_CAP mm: Add a user_ns owner to mm_struct and fix ptrace permission checks
2016-12-14vfs,mm: fix return value of read() at s_maxbytesLinus Torvalds
We truncated the possible read iterator to s_maxbytes in commit c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()"), but our end condition handling was wrong: it's not an error to try to read at the end of the file. Reading past the end should return EOF (0), not EINVAL. See for example https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1649342 http://lists.gnu.org/archive/html/bug-coreutils/2016-12/msg00008.html where a md5sum of a maximally sized file fails because the final read is exactly at s_maxbytes. Fixes: c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()") Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com> Cc: Wei Fang <fangwei1@huawei.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "This merge request includes the dax-4.0-iomap-pmd branch which is needed for both ext4 and xfs dax changes to use iomap for DAX. It also includes the fscrypt branch which is needed for ubifs encryption work as well as ext4 encryption and fscrypt cleanups. Lots of cleanups and bug fixes, especially making sure ext4 is robust against maliciously corrupted file systems --- especially maliciously corrupted xattr blocks and a maliciously corrupted superblock. Also fix ext4 support for 64k block sizes so it works well on ppcle. Fixed mbcache so we don't miss some common xattr blocks that can be merged" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits) dax: Fix sleep in atomic contex in grab_mapping_entry() fscrypt: Rename FS_WRITE_PATH_FL to FS_CTX_HAS_BOUNCE_BUFFER_FL fscrypt: Delay bounce page pool allocation until needed fscrypt: Cleanup page locking requirements for fscrypt_{decrypt,encrypt}_page() fscrypt: Cleanup fscrypt_{decrypt,encrypt}_page() fscrypt: Never allocate fscrypt_ctx on in-place encryption fscrypt: Use correct index in decrypt path. fscrypt: move the policy flags and encryption mode definitions to uapi header fscrypt: move non-public structures and constants to fscrypt_private.h fscrypt: unexport fscrypt_initialize() fscrypt: rename get_crypt_info() to fscrypt_get_crypt_info() fscrypto: move ioctl processing more fully into common code fscrypto: remove unneeded Kconfig dependencies MAINTAINERS: fscrypto: recommend linux-fsdevel for fscrypto patches ext4: do not perform data journaling when data is encrypted ext4: return -ENOMEM instead of success ext4: reject inodes with negative size ext4: remove another test in ext4_alloc_file_blocks() Documentation: fix description of ext4's block_validity mount option ext4: fix checks for data=ordered and journal_async_commit options ...
2016-12-13Merge branch 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue updates from Tejun Heo: "Mostly patches to initialize workqueue subsystem earlier and get rid of keventd_up(). The patches were headed for the last merge cycle but got delayed due to a bug found late minute, which is fixed now. Also, to help debugging, destroy_workqueue() is more chatty now on a sanity check failure." * 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: move wq_numa_init() to workqueue_init() workqueue: remove keventd_up() debugobj, workqueue: remove keventd_up() usage slab, workqueue: remove keventd_up() usage power, workqueue: remove keventd_up() usage tty, workqueue: remove keventd_up() usage mce, workqueue: remove keventd_up() usage workqueue: make workqueue available early during boot workqueue: dump workqueue state on sanity check failures in destroy_workqueue()
2016-12-13Merge branch 'for-4.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu update from Tejun Heo: "This includes just one patch to reject non-power-of-2 alignments and trigger warning. Interestingly, this actually caught a bug in XEN ARM64" * 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: ensure the requested alignment is power of two
2016-12-13Merge tag 'char-misc-4.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver updates from Greg KH: "Here's the big char/misc driver patches for 4.10-rc1. Lots of tiny changes over lots of "minor" driver subsystems, the largest being some new FPGA drivers. Other than that, a few other new drivers, but no new driver subsystems added for this kernel cycle, a nice change. All of these have been in linux-next with no reported issues" * tag 'char-misc-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (107 commits) uio-hv-generic: store physical addresses instead of virtual Tools: hv: kvp: configurable external scripts path uio-hv-generic: new userspace i/o driver for VMBus vmbus: add support for dynamic device id's hv: change clockevents unbind tactics hv: acquire vmbus_connection.channel_mutex in vmbus_free_channels() hyperv: Fix spelling of HV_UNKOWN mei: bus: enable non-blocking RX mei: fix the back to back interrupt handling mei: synchronize irq before initiating a reset. VME: Remove shutdown entry from vme_driver auxdisplay: ht16k33: select framebuffer helper modules MAINTAINERS: add git url for fpga fpga: Clarify how write_init works streaming modes fpga zynq: Fix incorrect ISR state on bootup fpga zynq: Remove priv->dev fpga zynq: Add missing \n to messages fpga: Add COMPILE_TEST to all drivers uio: pruss: add clk_disable() char/pcmcia: add some error checking in scr24x_read() ...
2016-12-13Merge tag 'pm-4.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "Again, cpufreq gets more changes than the other parts this time (one new driver, one old driver less, a bunch of enhancements of the existing code, new CPU IDs, fixes, cleanups) There also are some changes in cpuidle (idle injection rework, a couple of new CPU IDs, online/offline rework in intel_idle, fixes and cleanups), in the generic power domains framework (mostly related to supporting power domains containing CPUs), and in the Operating Performance Points (OPP) library (mostly related to supporting devices with multiple voltage regulators) In addition to that, the system sleep state selection interface is modified to make it easier for distributions with unchanged user space to support suspend-to-idle as the default system suspend method, some issues are fixed in the PM core, the latency tolerance PM QoS framework is improved a bit, the Intel RAPL power capping driver is cleaned up and there are some fixes and cleanups in the devfreq subsystem Specifics: - New cpufreq driver for Broadcom STB SoCs and a Device Tree binding for it (Markus Mayer) - Support for ARM Integrator/AP and Integrator/CP in the generic DT cpufreq driver and elimination of the old Integrator cpufreq driver (Linus Walleij) - Support for the zx296718, r8a7743 and r8a7745, Socionext UniPhier, and PXA SoCs in the the generic DT cpufreq driver (Baoyou Xie, Geert Uytterhoeven, Masahiro Yamada, Robert Jarzmik) - cpufreq core fix to eliminate races that may lead to using inactive policy objects and related cleanups (Rafael Wysocki) - cpufreq schedutil governor update to make it use SCHED_FIFO kernel threads (instead of regular workqueues) for doing delayed work (to reduce the response latency in some cases) and related cleanups (Viresh Kumar) - New cpufreq sysfs attribute for resetting statistics (Markus Mayer) - cpufreq governors fixes and cleanups (Chen Yu, Stratos Karafotis, Viresh Kumar) - Support for using generic cpufreq governors in the intel_pstate driver (Rafael Wysocki) - Support for per-logical-CPU P-state limits and the EPP/EPB (Energy Performance Preference/Energy Performance Bias) knobs in the intel_pstate driver (Srinivas Pandruvada) - New CPU ID for Knights Mill in intel_pstate (Piotr Luc) - intel_pstate driver modification to use the P-state selection algorithm based on CPU load on platforms with the system profile in the ACPI tables set to "mobile" (Srinivas Pandruvada) - intel_pstate driver cleanups (Arnd Bergmann, Rafael Wysocki, Srinivas Pandruvada) - cpufreq powernv driver updates including fast switching support (for the schedutil governor), fixes and cleanus (Akshay Adiga, Andrew Donnellan, Denis Kirjanov) - acpi-cpufreq driver rework to switch it over to the new CPU offline/online state machine (Sebastian Andrzej Siewior) - Assorted cleanups in cpufreq drivers (Wei Yongjun, Prashanth Prakash) - Idle injection rework (to make it use the regular idle path instead of a home-grown custom one) and related powerclamp thermal driver updates (Peter Zijlstra, Jacob Pan, Petr Mladek, Sebastian Andrzej Siewior) - New CPU IDs for Atom Z34xx and Knights Mill in intel_idle (Andy Shevchenko, Piotr Luc) - intel_idle driver cleanups and switch over to using the new CPU offline/online state machine (Anna-Maria Gleixner, Sebastian Andrzej Siewior) - cpuidle DT driver update to support suspend-to-idle properly (Sudeep Holla) - cpuidle core cleanups and misc updates (Daniel Lezcano, Pan Bian, Rafael Wysocki) - Preliminary support for power domains including CPUs in the generic power domains (genpd) framework and related DT bindings (Lina Iyer) - Assorted fixes and cleanups in the generic power domains (genpd) framework (Colin Ian King, Dan Carpenter, Geert Uytterhoeven) - Preliminary support for devices with multiple voltage regulators and related fixes and cleanups in the Operating Performance Points (OPP) library (Viresh Kumar, Masahiro Yamada, Stephen Boyd) - System sleep state selection interface rework to make it easier to support suspend-to-idle as the default system suspend method (Rafael Wysocki) - PM core fixes and cleanups, mostly related to the interactions between the system suspend and runtime PM frameworks (Ulf Hansson, Sahitya Tummala, Tony Lindgren) - Latency tolerance PM QoS framework imorovements (Andrew Lutomirski) - New Knights Mill CPU ID for the Intel RAPL power capping driver (Piotr Luc) - Intel RAPL power capping driver fixes, cleanups and switch over to using the new CPU offline/online state machine (Jacob Pan, Thomas Gleixner, Sebastian Andrzej Siewior) - Fixes and cleanups in the exynos-ppmu, exynos-nocp, rk3399_dmc, rockchip-dfi devfreq drivers and the devfreq core (Axel Lin, Chanwoo Choi, Javier Martinez Canillas, MyungJoo Ham, Viresh Kumar) - Fix for false-positive KASAN warnings during resume from ACPI S3 (suspend-to-RAM) on x86 (Josh Poimboeuf) - Memory map verification during resume from hibernation on x86 to ensure a consistent address space layout (Chen Yu) - Wakeup sources debugging enhancement (Xing Wei) - rockchip-io AVS driver cleanup (Shawn Lin)" * tag 'pm-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (127 commits) devfreq: rk3399_dmc: Don't use OPP structures outside of RCU locks devfreq: rk3399_dmc: Remove dangling rcu_read_unlock() devfreq: exynos: Don't use OPP structures outside of RCU locks Documentation: intel_pstate: Document HWP energy/performance hints cpufreq: intel_pstate: Support for energy performance hints with HWP cpufreq: intel_pstate: Add locking around HWP requests PM / sleep: Print active wakeup sources when blocking on wakeup_count reads PM / core: Fix bug in the error handling of async suspend PM / wakeirq: Fix dedicated wakeirq for drivers not using autosuspend PM / Domains: Fix compatible for domain idle state PM / OPP: Don't WARN on multiple calls to dev_pm_opp_set_regulators() PM / OPP: Allow platform specific custom set_opp() callbacks PM / OPP: Separate out _generic_set_opp() PM / OPP: Add infrastructure to manage multiple regulators PM / OPP: Pass struct dev_pm_opp_supply to _set_opp_voltage() PM / OPP: Manage supply's voltage/current in a separate structure PM / OPP: Don't use OPP structure outside of rcu protected section PM / OPP: Reword binding supporting multiple regulators per device PM / OPP: Fix incorrect cpu-supply property in binding cpuidle: Add a kerneldoc comment to cpuidle_use_deepest_state() ..
2016-12-13Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer updates from Jens Axboe: "This is the main block pull request this series. Contrary to previous release, I've kept the core and driver changes in the same branch. We always ended up having dependencies between the two for obvious reasons, so makes more sense to keep them together. That said, I'll probably try and keep more topical branches going forward, especially for cycles that end up being as busy as this one. The major parts of this pull request is: - Improved support for O_DIRECT on block devices, with a small private implementation instead of using the pig that is fs/direct-io.c. From Christoph. - Request completion tracking in a scalable fashion. This is utilized by two components in this pull, the new hybrid polling and the writeback queue throttling code. - Improved support for polling with O_DIRECT, adding a hybrid mode that combines pure polling with an initial sleep. From me. - Support for automatic throttling of writeback queues on the block side. This uses feedback from the device completion latencies to scale the queue on the block side up or down. From me. - Support from SMR drives in the block layer and for SD. From Hannes and Shaun. - Multi-connection support for nbd. From Josef. - Cleanup of request and bio flags, so we have a clear split between which are bio (or rq) private, and which ones are shared. From Christoph. - A set of patches from Bart, that improve how we handle queue stopping and starting in blk-mq. - Support for WRITE_ZEROES from Chaitanya. - Lightnvm updates from Javier/Matias. - Supoort for FC for the nvme-over-fabrics code. From James Smart. - A bunch of fixes from a whole slew of people, too many to name here" * 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits) blk-stat: fix a few cases of missing batch flushing blk-flush: run the queue when inserting blk-mq flush elevator: make the rqhash helpers exported blk-mq: abstract out blk_mq_dispatch_rq_list() helper blk-mq: add blk_mq_start_stopped_hw_queue() block: improve handling of the magic discard payload blk-wbt: don't throttle discard or write zeroes nbd: use dev_err_ratelimited in io path nbd: reset the setup task for NBD_CLEAR_SOCK nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME nvme-fabrics: Add target support for FC transport nvme-fabrics: Add host support for FC transport nvme-fabrics: Add FC transport LLDD api definitions nvme-fabrics: Add FC transport FC-NVME definitions nvme-fabrics: Add FC transport error codes to nvme.h Add type 0x28 NVME type code to scsi fc headers nvme-fabrics: patch target code in prep for FC transport support nvme-fabrics: set sqe.command_id in core not transports parser: add u64 number parser nvme-rdma: align to generic ib_event logging helper ...
2016-12-12Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge updates from Andrew Morton: - various misc bits - most of MM (quite a lot of MM material is awaiting the merge of linux-next dependencies) - kasan - printk updates - procfs updates - MAINTAINERS - /lib updates - checkpatch updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (123 commits) init: reduce rootwait polling interval time to 5ms binfmt_elf: use vmalloc() for allocation of vma_filesz checkpatch: don't emit unified-diff error for rename-only patches checkpatch: don't check c99 types like uint8_t under tools checkpatch: avoid multiple line dereferences checkpatch: don't check .pl files, improve absolute path commit log test scripts/checkpatch.pl: fix spelling checkpatch: don't try to get maintained status when --no-tree is given lib/ida: document locking requirements a bit better lib/rbtree.c: fix typo in comment of ____rb_erase_color lib/Kconfig.debug: make CONFIG_STRICT_DEVMEM depend on CONFIG_DEVMEM MAINTAINERS: add drm and drm/i915 irc channels MAINTAINERS: add "C:" for URI for chat where developers hang out MAINTAINERS: add drm and drm/i915 bug filing info MAINTAINERS: add "B:" for URI where to file bugs get_maintainer: look for arbitrary letter prefixes in sections printk: add Kconfig option to set default console loglevel printk/sound: handle more message headers printk/btrfs: handle more message headers printk/kdb: handle more message headers ...
2016-12-12Merge branch 'smp-hotplug-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull smp hotplug updates from Thomas Gleixner: "This is the final round of converting the notifier mess to the state machine. The removal of the notifiers and the related infrastructure will happen around rc1, as there are conversions outstanding in other trees. The whole exercise removed about 2000 lines of code in total and in course of the conversion several dozen bugs got fixed. The new mechanism allows to test almost every hotplug step standalone, so usage sites can exercise all transitions extensively. There is more room for improvement, like integrating all the pointlessly different architecture mechanisms of synchronizing, setting cpus online etc into the core code" * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits) tracing/rb: Init the CPU mask on allocation soc/fsl/qbman: Convert to hotplug state machine soc/fsl/qbman: Convert to hotplug state machine zram: Convert to hotplug state machine KVM/PPC/Book3S HV: Convert to hotplug state machine arm64/cpuinfo: Convert to hotplug state machine arm64/cpuinfo: Make hotplug notifier symmetric mm/compaction: Convert to hotplug state machine iommu/vt-d: Convert to hotplug state machine mm/zswap: Convert pool to hotplug state machine mm/zswap: Convert dst-mem to hotplug state machine mm/zsmalloc: Convert to hotplug state machine mm/vmstat: Convert to hotplug state machine mm/vmstat: Avoid on each online CPU loops mm/vmstat: Drop get_online_cpus() from init_cpu_node_state/vmstat_cpu_dead() tracing/rb: Convert to hotplug state machine oprofile/nmi timer: Convert to hotplug state machine net/iucv: Use explicit clean up labels in iucv_init() x86/pci/amd-bus: Convert to hotplug state machine x86/oprofile/nmi: Convert to hotplug state machine ...
2016-12-12mm/percpu.c: fix panic triggered by BUG_ON() falselyzijun_hu
As shown by pcpu_build_alloc_info(), the number of units within a percpu group is deduced by rounding up the number of CPUs within the group to @upa boundary/ Therefore, the number of CPUs isn't equal to the units's if it isn't aligned to @upa normally. However, pcpu_page_first_chunk() uses BUG_ON() to assert that one number is equal to the other roughly, so a panic is maybe triggered by the BUG_ON() incorrectly. In order to fix this issue, the number of CPUs is rounded up then compared with units's and the BUG_ON() is replaced with a warning and return of an error code as well, to keep system alive as much as possible. Link: http://lkml.kernel.org/r/57FCF07C.2020103@zoho.com Signed-off-by: zijun_hu <zijun_hu@htc.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12kasan: eliminate long stalls during quarantine reductionDmitry Vyukov
Currently we dedicate 1/32 of RAM for quarantine and then reduce it by 1/4 of total quarantine size. This can be a significant amount of memory. For example, with 4GB of RAM total quarantine size is 128MB and it is reduced by 32MB at a time. With 128GB of RAM total quarantine size is 4GB and it is reduced by 1GB. This leads to several problems: - freeing 1GB can take tens of seconds, causes rcu stall warnings and just introduces unexpected long delays at random places - if kmalloc() is called under a mutex, other threads stall on that mutex while a thread reduces quarantine - threads wait on quarantine_lock while one thread grabs a large batch of objects to evict - we walk the uncached list of object to free twice which makes all of the above worse - when a thread frees objects, they are already not accounted against global_quarantine.bytes; as the result we can have quarantine_size bytes in quarantine + unbounded amount of memory in large batches in threads that are in process of freeing Reduce size of quarantine in smaller batches to reduce the delays. The only reason to reduce it in batches is amortization of overheads, the new batch size of 1MB should be well enough to amortize spinlock lock/unlock and few function calls. Plus organize quarantine as a FIFO array of batches. This allows to not walk the list in quarantine_reduce() under quarantine_lock, which in turn reduces contention and is just faster. This improves performance of heavy load (syzkaller fuzzing) by ~20% with 4 CPUs and 32GB of RAM. Also this eliminates frequent (every 5 sec) drops of CPU consumption from ~400% to ~100% (one thread reduces quarantine while others are waiting on a mutex). Some reference numbers: 1. Machine with 4 CPUs and 4GB of memory. Quarantine size 128MB. Currently we free 32MB at at time. With new code we free 1MB at a time (1024 batches, ~128 are used). 2. Machine with 32 CPUs and 128GB of memory. Quarantine size 4GB. Currently we free 1GB at at time. With new code we free 8MB at a time (1024 batches, ~512 are used). 3. Machine with 4096 CPUs and 1TB of memory. Quarantine size 32GB. Currently we free 8GB at at time. With new code we free 4MB at a time (16K batches, ~8K are used). Link: http://lkml.kernel.org/r/1478756952-18695-1-git-send-email-dvyukov@google.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12kasan: support panic_on_warnDmitry Vyukov
If user sets panic_on_warn, he wants kernel to panic if there is anything barely wrong with the kernel. KASAN-detected errors are definitely not less benign than an arbitrary kernel WARNING. Panic after KASAN errors if panic_on_warn is set. We use this for continuous fuzzing where we want kernel to stop and reboot on any error. Link: http://lkml.kernel.org/r/1476694764-31986-1-git-send-email-dvyukov@google.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>