summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2016-08-02memcg: put soft limit reclaim out of way if the excess tree is emptyMichal Hocko
We've had a report about soft lockups caused by lock bouncing in the soft reclaim path: BUG: soft lockup - CPU#0 stuck for 22s! [kav4proxy-kavic:3128] RIP: 0010:[<ffffffff81469798>] [<ffffffff81469798>] _raw_spin_lock+0x18/0x20 Call Trace: mem_cgroup_soft_limit_reclaim+0x25a/0x280 shrink_zones+0xed/0x200 do_try_to_free_pages+0x74/0x320 try_to_free_pages+0x112/0x180 __alloc_pages_slowpath+0x3ff/0x820 __alloc_pages_nodemask+0x1e9/0x200 alloc_pages_vma+0xe1/0x290 do_wp_page+0x19f/0x840 handle_pte_fault+0x1cd/0x230 do_page_fault+0x1fd/0x4c0 page_fault+0x25/0x30 There are no memcgs created so there cannot be any in the soft limit excess obviously: [...] memory 0 1 1 so all this just seems to be mem_cgroup_largest_soft_limit_node trying to get spin_lock_irq(&mctz->lock) just to find out that the soft limit excess tree is empty. This is just pointless wasting of cycles and cache line bouncing during heavy parallel reclaim on large machines. The particular machine wasn't very healthy and most probably suffering from a memory leak which just caused the memory reclaim to trash heavily. But bouncing on the lock certainly didn't help... Fix this by optimistic lockless check and bail out early if the tree is empty. This is theoretically racy but that shouldn't matter all that much. First of all soft limit is a best effort feature and it is slowly getting deprecated and its usage should be really scarce. Bouncing on a lock without a good reason is surely much bigger problem, especially on large CPU machines. Link: http://lkml.kernel.org/r/1470073277-1056-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02mm, hugetlb: fix huge_pte_alloc BUG_ONMichal Hocko
Zhong Jiang has reported a BUG_ON from huge_pte_alloc hitting when he runs his database load with memory online and offline running in parallel. The reason is that huge_pmd_share might detect a shared pmd which is currently migrated and so it has migration pte which is !pte_huge. There doesn't seem to be any easy way to prevent from the race and in fact seeing the migration swap entry is not harmful. Both callers of huge_pte_alloc are prepared to handle them. copy_hugetlb_page_range will copy the swap entry and make it COW if needed. hugetlb_fault will back off and so the page fault is retries if the page is still under migration and waits for its completion in hugetlb_fault. That means that the BUG_ON is wrong and we should update it. Let's simply check that all present ptes are pte_huge instead. Link: http://lkml.kernel.org/r/20160721074340.GA26398@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: zhongjiang <zhongjiang@huawei.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02mm/hugetlb: avoid soft lockup in set_max_huge_pages()Jia He
In powerpc servers with large memory(32TB), we watched several soft lockups for hugepage under stress tests. The call traces are as follows: 1. get_page_from_freelist+0x2d8/0xd50 __alloc_pages_nodemask+0x180/0xc20 alloc_fresh_huge_page+0xb0/0x190 set_max_huge_pages+0x164/0x3b0 2. prep_new_huge_page+0x5c/0x100 alloc_fresh_huge_page+0xc8/0x190 set_max_huge_pages+0x164/0x3b0 This patch fixes such soft lockups. It is safe to call cond_resched() there because it is out of spin_lock/unlock section. Link: http://lkml.kernel.org/r/1469674442-14848-1-git-send-email-hejianet@gmail.com Signed-off-by: Jia He <hejianet@gmail.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02mm: move swap-in anonymous page into active listMinchan Kim
Every swap-in anonymous page starts from inactive lru list's head. It should be activated unconditionally when VM decide to reclaim because page table entry for the page always usually has marked accessed bit. Thus, their window size for getting a new referece is 2 * NR_inactive + NR_active while others is NR_inactive + NR_active. It's not fair that it has more chance to be referenced compared to other newly allocated page which starts from active lru list's head. Johannes: : The page can still have a valid copy on the swap device, so prefering to : reclaim that page over a fresh one could make sense. But as you point : out, having it start inactive instead of active actually ends up giving it : *more* LRU time, and that seems to be without justification. Rik: : The reason newly read in swap cache pages start on the inactive list is : that we do some amount of read-around, and do not know which pages will : get used. : : However, immediately activating the ones that DO get used, like your patch : does, is the right thing to do. Link: http://lkml.kernel.org/r/1469762740-17860-1-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02mm: fail prefaulting if page table allocation failsVegard Nossum
I ran into this: BUG: sleeping function called from invalid context at mm/page_alloc.c:3784 in_atomic(): 0, irqs_disabled(): 0, pid: 1434, name: trinity-c1 2 locks held by trinity-c1/1434: #0: (&mm->mmap_sem){......}, at: [<ffffffff810ce31e>] __do_page_fault+0x1ce/0x8f0 #1: (rcu_read_lock){......}, at: [<ffffffff81378f86>] filemap_map_pages+0xd6/0xdd0 CPU: 0 PID: 1434 Comm: trinity-c1 Not tainted 4.7.0+ #58 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0x65/0x84 panic+0x185/0x2dd ___might_sleep+0x51c/0x600 __might_sleep+0x90/0x1a0 __alloc_pages_nodemask+0x5b1/0x2160 alloc_pages_current+0xcc/0x370 pte_alloc_one+0x12/0x90 __pte_alloc+0x1d/0x200 alloc_set_pte+0xe3e/0x14a0 filemap_map_pages+0x42b/0xdd0 handle_mm_fault+0x17d5/0x28b0 __do_page_fault+0x310/0x8f0 trace_do_page_fault+0x18d/0x310 do_async_page_fault+0x27/0xa0 async_page_fault+0x28/0x30 The important bits from the above is that filemap_map_pages() is calling into the page allocator while holding rcu_read_lock (sleeping is not allowed inside RCU read-side critical sections). According to Kirill Shutemov, the prefaulting code in do_fault_around() is supposed to take care of this, but missing error handling means that the allocation failure can go unnoticed. We don't need to return VM_FAULT_OOM (or any other error) here, since we can just let the normal fault path try again. Fixes: 7267ec008b5c ("mm: postpone page table allocation until we have page to map") Link: http://lkml.kernel.org/r/1469708107-11868-1-git-send-email-vegard.nossum@oracle.com Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Hillf Danton" <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Paolo Bonzini: - ARM: GICv3 ITS emulation and various fixes. Removal of the old VGIC implementation. - s390: support for trapping software breakpoints, nested virtualization (vSIE), the STHYI opcode, initial extensions for CPU model support. - MIPS: support for MIPS64 hosts (32-bit guests only) and lots of cleanups, preliminary to this and the upcoming support for hardware virtualization extensions. - x86: support for execute-only mappings in nested EPT; reduced vmexit latency for TSC deadline timer (by about 30%) on Intel hosts; support for more than 255 vCPUs. - PPC: bugfixes. * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (302 commits) KVM: PPC: Introduce KVM_CAP_PPC_HTM MIPS: Select HAVE_KVM for MIPS64_R{2,6} MIPS: KVM: Reset CP0_PageMask during host TLB flush MIPS: KVM: Fix ptr->int cast via KVM_GUEST_KSEGX() MIPS: KVM: Sign extend MFC0/RDHWR results MIPS: KVM: Fix 64-bit big endian dynamic translation MIPS: KVM: Fail if ebase doesn't fit in CP0_EBase MIPS: KVM: Use 64-bit CP0_EBase when appropriate MIPS: KVM: Set CP0_Status.KX on MIPS64 MIPS: KVM: Make entry code MIPS64 friendly MIPS: KVM: Use kmap instead of CKSEG0ADDR() MIPS: KVM: Use virt_to_phys() to get commpage PFN MIPS: Fix definition of KSEGX() for 64-bit KVM: VMX: Add VMCS to CPU's loaded VMCSs before VMPTRLD kvm: x86: nVMX: maintain internal copy of current VMCS KVM: PPC: Book3S HV: Save/restore TM state in H_CEDE KVM: PPC: Book3S HV: Pull out TM state save/restore into separate procedures KVM: arm64: vgic-its: Simplify MAPI error handling KVM: arm64: vgic-its: Make vgic_its_cmd_handle_mapi similar to other handlers KVM: arm64: vgic-its: Turn device_id validation into generic ID validation ...
2016-08-01powerpc/mm/hugetlb: Add flush_hugetlb_tlb_rangeAneesh Kumar K.V
Some archs like ppc64 need to do special things when flushing tlb for hugepage. Add a new helper to flush hugetlb tlb range. This helps us to avoid flushing the entire tlb mapping for the pid. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-29Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: "This fixes error propagation from writeback to fsync/close for writeback cache mode as well as adding a missing capability flag to the INIT message. The rest are cleanups. (The commits are recent but all the code actually sat in -next for a while now. The recommits are due to conflict avoidance and the addition of Cc: stable@...)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: use filemap_check_errors() mm: export filemap_check_errors() to modules fuse: fix wrong assignment of ->flags in fuse_send_init() fuse: fuse_flush must check mapping->flags for errors fuse: fsync() did not return IO errors fuse: don't mess with blocking signals new helper: wait_event_killable_exclusive() fuse: improve aio directIO write performance for size extending writes
2016-07-29mm: export filemap_check_errors() to modulesMiklos Szeredi
Can be used by fuse, btrfs and f2fs to replace opencoded variants. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-07-28Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more updates from Andrew Morton: "The rest of MM" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (101 commits) mm, compaction: simplify contended compaction handling mm, compaction: introduce direct compaction priority mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations mm, page_alloc: make THP-specific decisions more generic mm, page_alloc: restructure direct compaction handling in slowpath mm, page_alloc: don't retry initial attempt in slowpath mm, page_alloc: set alloc_flags only once in slowpath lib/stackdepot.c: use __GFP_NOWARN for stack allocations mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB mm, kasan: account for object redzone in SLUB's nearest_obj() mm: fix use-after-free if memory allocation failed in vma_adjust() zsmalloc: Delete an unnecessary check before the function call "iput" mm/memblock.c: fix index adjustment error in __next_mem_range_rev() mem-hotplug: alloc new page from a nearest neighbor node when mem-offline mm: optimize copy_page_to/from_iter_iovec mm: add cond_resched() to generic_swapfile_activate() Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements" mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode mm: hwpoison: remove incorrect comments make __section_nr() more efficient ...
2016-07-28mm, compaction: simplify contended compaction handlingVlastimil Babka
Async compaction detects contention either due to failing trylock on zone->lock or lru_lock, or by need_resched(). Since 1f9efdef4f3f ("mm, compaction: khugepaged should not give up due to need_resched()") the code got quite complicated to distinguish these two up to the __alloc_pages_slowpath() level, so different decisions could be taken for khugepaged allocations. After the recent changes, khugepaged allocations don't check for contended compaction anymore, so we again don't need to distinguish lock and sched contention, and simplify the current convoluted code a lot. However, I believe it's also possible to simplify even more and completely remove the check for contended compaction after the initial async compaction for costly orders, which was originally aimed at THP page fault allocations. There are several reasons why this can be done now: - with the new defaults, THP page faults no longer do reclaim/compaction at all, unless the system admin has overridden the default, or application has indicated via madvise that it can benefit from THP's. In both cases, it means that the potential extra latency is expected and worth the benefits. - even if reclaim/compaction proceeds after this patch where it previously wouldn't, the second compaction attempt is still async and will detect the contention and back off, if the contention persists - there are still heuristics like deferred compaction and pageblock skip bits in place that prevent excessive THP page fault latencies Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm, compaction: introduce direct compaction priorityVlastimil Babka
In the context of direct compaction, for some types of allocations we would like the compaction to either succeed or definitely fail while trying as hard as possible. Current async/sync_light migration mode is insufficient, as there are heuristics such as caching scanner positions, marking pageblocks as unsuitable or deferring compaction for a zone. At least the final compaction attempt should be able to override these heuristics. To communicate how hard compaction should try, we replace migration mode with a new enum compact_priority and change the relevant function signatures. In compact_zone_order() where struct compact_control is constructed, the priority is mapped to suitable control flags. This patch itself has no functional change, as the current priority levels are mapped back to the same migration modes as before. Expanding them will be done next. Note that !CONFIG_COMPACTION variant of try_to_compact_pages() is removed, as the only caller exists under CONFIG_COMPACTION. Link: http://lkml.kernel.org/r/20160721073614.24395-8-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocationsVlastimil Babka
After the previous patch, we can distinguish costly allocations that should be really lightweight, such as THP page faults, with __GFP_NORETRY. This means we don't need to recognize khugepaged allocations via PF_KTHREAD anymore. We can also change THP page faults in areas where madvise(MADV_HUGEPAGE) was used to try as hard as khugepaged, as the process has indicated that it benefits from THP's and is willing to pay some initial latency costs. We can also make the flags handling less cryptic by distinguishing GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding __GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed. The patch effectively changes the current GFP_TRANSHUGE users as follows: * get_huge_zero_page() - the zero page lifetime should be relatively long and it's shared by multiple users, so it's worth spending some effort on it. We use GFP_TRANSHUGE, and __GFP_NORETRY is not added. This also restores direct reclaim to this allocation, which was unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag by default to madvise and add a stall-free defrag option") * alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency is not an issue. So if khugepaged "defrag" is enabled (the default), do reclaim via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the PF_KTHREAD check from page alloc. As a side-effect, khugepaged will now no longer check if the initial compaction was deferred or contended. This is OK, as khugepaged sleep times between collapsion attempts are long enough to prevent noticeable disruption, so we should allow it to spend some effort. * migrate_misplaced_transhuge_page() - already was masking out __GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is equivalent. * alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise) are now allocating without __GFP_NORETRY. Other vma's keep using __GFP_NORETRY if direct reclaim/compaction is at all allowed (by default it's allowed only for madvised vma's). The rest is conversion to GFP_TRANSHUGE(_LIGHT). [mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT] Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm, page_alloc: make THP-specific decisions more genericVlastimil Babka
Since THP allocations during page faults can be costly, extra decisions are employed for them to avoid excessive reclaim and compaction, if the initial compaction doesn't look promising. The detection has never been perfect as there is no gfp flag specific to THP allocations. At this moment it checks the whole combination of flags that makes up GFP_TRANSHUGE, and hopes that no other users of such combination exist, or would mind being treated the same way. Extra care is also taken to separate allocations from khugepaged, where latency doesn't matter that much. It is however possible to distinguish these allocations in a simpler and more reliable way. The key observation is that after the initial compaction followed by the first iteration of "standard" reclaim/compaction, both __GFP_NORETRY allocations and costly allocations without __GFP_REPEAT are declared as failures: /* Do not loop if specifically requested */ if (gfp_mask & __GFP_NORETRY) goto nopage; /* * Do not retry costly high order allocations unless they are * __GFP_REPEAT */ if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT)) goto nopage; This means we can further distinguish allocations that are costly order *and* additionally include the __GFP_NORETRY flag. As it happens, GFP_TRANSHUGE allocations do already fall into this category. This will also allow other costly allocations with similar high-order benefit vs latency considerations to use this semantic. Furthermore, we can distinguish THP allocations that should try a bit harder (such as from khugepageed) by removing __GFP_NORETRY, as will be done in the next patch. Link: http://lkml.kernel.org/r/20160721073614.24395-6-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm, page_alloc: restructure direct compaction handling in slowpathVlastimil Babka
The retry loop in __alloc_pages_slowpath is supposed to keep trying reclaim and compaction (and OOM), until either the allocation succeeds, or returns with failure. Success here is more probable when reclaim precedes compaction, as certain watermarks have to be met for compaction to even try, and more free pages increase the probability of compaction success. On the other hand, starting with light async compaction (if the watermarks allow it), can be more efficient, especially for smaller orders, if there's enough free memory which is just fragmented. Thus, the current code starts with compaction before reclaim, and to make sure that the last reclaim is always followed by a final compaction, there's another direct compaction call at the end of the loop. This makes the code hard to follow and adds some duplicated handling of migration_mode decisions. It's also somewhat inefficient that even if reclaim or compaction decides not to retry, the final compaction is still attempted. Some gfp flags combination also shortcut these retry decisions by "goto noretry;", making it even harder to follow. This patch attempts to restructure the code with only minimal functional changes. The call to the first compaction and THP-specific checks are now placed above the retry loop, and the "noretry" direct compaction is removed. The initial compaction is additionally restricted only to costly orders, as we can expect smaller orders to be held back by watermarks, and only larger orders to suffer primarily from fragmentation. This better matches the checks in reclaim's shrink_zones(). There are two other smaller functional changes. One is that the upgrade from async migration to light sync migration will always occur after the initial compaction. This is how it has been until recent patch "mm, oom: protect !costly allocations some more", which introduced upgrading the mode based on COMPACT_COMPLETE result, but kept the final compaction always upgraded, which made it even more special. It's better to return to the simpler handling for now, as migration modes will be further modified later in the series. The second change is that once both reclaim and compaction declare it's not worth to retry the reclaim/compact loop, there is no final compaction attempt. As argued above, this is intentional. If that final compaction were to succeed, it would be due to a wrong retry decision, or simply a race with somebody else freeing memory for us. The main outcome of this patch should be simpler code. Logically, the initial compaction without reclaim is the exceptional case to the reclaim/compaction scheme, but prior to the patch, it was the last loop iteration that was exceptional. Now the code matches the logic better. The change also enable the following patches. Link: http://lkml.kernel.org/r/20160721073614.24395-5-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm, page_alloc: don't retry initial attempt in slowpathVlastimil Babka
After __alloc_pages_slowpath() sets up new alloc_flags and wakes up kswapd, it first tries get_page_from_freelist() with the new alloc_flags, as it may succeed e.g. due to using min watermark instead of low watermark. It makes sense to to do this attempt before adjusting zonelist based on alloc_flags/gfp_mask, as it's still relatively a fast path if we just wake up kswapd and successfully allocate. This patch therefore moves the initial attempt above the retry label and reorganizes a bit the part below the retry label. We still have to attempt get_page_from_freelist() on each retry, as some allocations cannot do that as part of direct reclaim or compaction, and yet are not allowed to fail (even though they do a WARN_ON_ONCE() and thus should not exist). We can reuse the call meant for ALLOC_NO_WATERMARKS attempt and just set alloc_flags to ALLOC_NO_WATERMARKS if the context allows it. As a side-effect, the attempts from direct reclaim/compaction will also no longer obey watermarks once this is set, but there's little harm in that. Kswapd wakeups are also done on each retry to be safe from potential races resulting in kswapd going to sleep while a process (that may not be able to reclaim by itself) is still looping. Link: http://lkml.kernel.org/r/20160721073614.24395-4-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm, page_alloc: set alloc_flags only once in slowpathVlastimil Babka
In __alloc_pages_slowpath(), alloc_flags doesn't change after it's initialized, so move the initialization above the retry: label. Also make the comment above the initialization more descriptive. The only exception in the alloc_flags being constant is ALLOC_NO_WATERMARKS, which may change due to TIF_MEMDIE being set on the allocating thread. We can fix this, and make the code simpler and a bit more effective at the same time, by moving the part that determines ALLOC_NO_WATERMARKS from gfp_to_alloc_flags() to gfp_pfmemalloc_allowed(). This means we don't have to mask out ALLOC_NO_WATERMARKS in numerous places in __alloc_pages_slowpath() anymore. The only two tests for the flag can instead call gfp_pfmemalloc_allowed(). Link: http://lkml.kernel.org/r/20160721073614.24395-3-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUBAlexander Potapenko
For KASAN builds: - switch SLUB allocator to using stackdepot instead of storing the allocation/deallocation stacks in the objects; - change the freelist hook so that parts of the freelist can be put into the quarantine. [aryabinin@virtuozzo.com: fixes] Link: http://lkml.kernel.org/r/1468601423-28676-1-git-send-email-aryabinin@virtuozzo.com Link: http://lkml.kernel.org/r/1468347165-41906-3-git-send-email-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm, kasan: account for object redzone in SLUB's nearest_obj()Alexander Potapenko
When looking up the nearest SLUB object for a given address, correctly calculate its offset if SLAB_RED_ZONE is enabled for that cache. Previously, when KASAN had detected an error on an object from a cache with SLAB_RED_ZONE set, the actual start address of the object was miscalculated, which led to random stacks having been reported. When looking up the nearest SLUB object for a given address, correctly calculate its offset if SLAB_RED_ZONE is enabled for that cache. Fixes: 7ed2f9e663854db ("mm, kasan: SLAB support") Link: http://lkml.kernel.org/r/1468347165-41906-2-git-send-email-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm: fix use-after-free if memory allocation failed in vma_adjust()Kirill A. Shutemov
There's one case when vma_adjust() expands the vma, overlapping with *two* next vma. See case 6 of mprotect, described in the comment to vma_merge(). To handle this (and only this) situation we iterate twice over main part of the function. See "goto again". Vegard reported[1] that he sees out-of-bounds access complain from KASAN, if anon_vma_clone() on the *second* iteration fails. This happens because we free 'next' vma by the end of first iteration and don't have a way to undo this if anon_vma_clone() fails on the second iteration. The solution is to do all required allocations upfront, before we touch vmas. The allocation on the second iteration is only required if first two vmas don't have anon_vma, but third does. So we need, in total, one anon_vma_clone() call. It's easy to adjust 'exporter' to the third vma for such case. [1] http://lkml.kernel.org/r/1469514843-23778-1-git-send-email-vegard.nossum@oracle.com Link: http://lkml.kernel.org/r/1469625255-126641-1-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28zsmalloc: Delete an unnecessary check before the function call "iput"Markus Elfring
iput() tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Link: http://lkml.kernel.org/r/559cf499-4a01-25f9-c87f-24d906626a57@users.sourceforge.net Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm/memblock.c: fix index adjustment error in __next_mem_range_rev()zijun_hu
Fix region index adjustment error when parameter type_b of __next_mem_range_rev() == NULL. Signed-off-by: zijun_hu <zijun_hu@htc.com> Cc: Alexander Kuleshov <kuleshovmail@gmail.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Wei Yang <weiyang@linux.vnet.ibm.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Richard Leitner <dev@g0hl1n.net> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mem-hotplug: alloc new page from a nearest neighbor node when mem-offlineXishi Qiu
If we offline a node, alloc the new page from a nearest neighbor node instead of the current node or other remote nodes, because re-migrate is a waste of time and the distance of the remote nodes is often very large. Also use GFP_HIGHUSER_MOVABLE to alloc new page if the zone is movable zone or highmem zone. Link: http://lkml.kernel.org/r/5795E18B.5060302@huawei.com Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm: add cond_resched() to generic_swapfile_activate()Mikulas Patocka
generic_swapfile_activate() can take quite long time, it iterates over all blocks of a file, so add cond_resched to it. I observed about 1 second stalls when activating a swapfile that was almost unfragmented - this patch fixes it. Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1607221710580.4818@file01.intranet.prod.int.rdu2.redhat.com Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"Michal Hocko
This reverts commit f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"). There has been a report about OOM killer invoked when swapping out to a dm-crypt device. The primary reason seems to be that the swapout out IO managed to completely deplete memory reserves. Ondrej was able to bisect and explained the issue by pointing to f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"). The reason is that the swapout path is not throttled properly because the md-raid layer needs to allocate from the generic_make_request path which means it allocates from the PF_MEMALLOC context. dm layer uses mempool_alloc in order to guarantee a forward progress which used to inhibit access to memory reserves when using page allocator. This has changed by f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements") which has dropped the __GFP_NOMEMALLOC protection when the memory pool is depleted. If we are running out of memory and the only way forward to free memory is to perform swapout we just keep consuming memory reserves rather than throttling the mempool allocations and allowing the pending IO to complete up to a moment when the memory is depleted completely and there is no way forward but invoking the OOM killer. This is less than optimal. The original intention of f9054c70d28b was to help with the OOM situations where the oom victim depends on mempool allocation to make a forward progress. David has mentioned the following backtrace: schedule schedule_timeout io_schedule_timeout mempool_alloc __split_and_process_bio dm_request generic_make_request submit_bio mpage_readpages ext4_readpages __do_page_cache_readahead ra_submit filemap_fault handle_mm_fault __do_page_fault do_page_fault page_fault We do not know more about why the mempool is depleted without being replenished in time, though. In any case the dm layer shouldn't depend on any allocations outside of the dedicated pools so a forward progress should be guaranteed. If this is not the case then the dm should be fixed rather than papering over the problem and postponing it to later by accessing more memory reserves. mempools are a mechanism to maintain dedicated memory reserves to guaratee forward progress. Allowing them an unbounded access to the page allocator memory reserves is going against the whole purpose of this mechanism. Bisected by Ondrej Kozina. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20160721145309.GR26379@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Ondrej Kozina <okozina@redhat.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: NeilBrown <neilb@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Ondrej Kozina <okozina@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Mel Gorman <mgorman@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT modeHugh Dickins
At present MIGRATE_SYNC_LIGHT is allowing __isolate_lru_page() to isolate a PageWriteback page, which __unmap_and_move() then rejects with -EBUSY: of course the writeback might complete in between, but that's not what we usually expect, so probably better not to isolate it. When tested by stress-highalloc from mmtests, this has reduced the number of page migrate failures by 60-70%. Link: http://lkml.kernel.org/r/20160721073614.24395-2-vbabka@suse.cz Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm: hwpoison: remove incorrect commentsNaoya Horiguchi
dequeue_hwpoisoned_huge_page() can be called without page lock hold, so let's remove incorrect comment. The reason why the page lock is not really needed is that dequeue_hwpoisoned_huge_page() checks page_huge_active() inside hugetlb_lock, which allows us to avoid trying to dequeue a hugepage that are just allocated but not linked to active list yet, even without taking page lock. Link: http://lkml.kernel.org/r/20160720092901.GA15995@www9186uo.sakura.ne.jp Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Zhan Chen <zhanc1@andrew.cmu.edu> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28make __section_nr() more efficientZhou Chengming
When CONFIG_SPARSEMEM_EXTREME is disabled, __section_nr can get the section number with a subtraction directly. Link: http://lkml.kernel.org/r/1468988310-11560-1-git-send-email-zhouchengming1@huawei.com Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Li Bin <huawei.libin@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28kmemleak: don't hang if user disables scanning earlyVegard Nossum
If the user tries to disable automatic scanning early in the boot process using e.g.: echo scan=off > /sys/kernel/debug/kmemleak then this command will hang until SECS_FIRST_SCAN (= 60) seconds have elapsed, even though the system is fully initialised. We can fix this using interruptible sleep and checking if we're supposed to stop whenever we wake up (like the rest of the code does). Link: http://lkml.kernel.org/r/1468835005-2873-1-git-send-email-vegard.nossum@oracle.com Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm/memblock.c: add new infrastructure to address the mem limit issueDennis Chen
In some cases, memblock is queried by kernel to determine whether a specified address is RAM or not. For example, the ACPI core needs this information to determine which attributes to use when mapping ACPI regions(acpi_os_ioremap). Use of incorrect memory types can result in faults, data corruption, or other issues. Removing memory with memblock_enforce_memory_limit() throws away this information, and so a kernel booted with 'mem=' may suffer from the issues described above. To avoid this, we need to keep those NOMAP regions instead of removing all above the limit, which preserves the information we need while preventing other use of those regions. This patch adds new infrastructure to retain all NOMAP memblock regions while removing others, to cater for this. Link: http://lkml.kernel.org/r/1468475036-5852-2-git-send-email-dennis.chen@arm.com Signed-off-by: Dennis Chen <dennis.chen@arm.com> Acked-by: Steve Capper <steve.capper@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Kaly Xin <kaly.xin@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm: fix memcg stack accounting for sub-page stacksAndy Lutomirski
We should account for stacks regardless of stack size, and we need to account in sub-page units if THREAD_SIZE < PAGE_SIZE. Change the units to kilobytes and Move it into account_kernel_stack(). Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat") Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.org Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm: track NR_KERNEL_STACK in KiB instead of number of stacksAndy Lutomirski
Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone. This only makes sense if each kernel stack exists entirely in one zone, and allowing vmapped stacks could break this assumption. Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all architectures. Keep it simple and use KiB. Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.org Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm: CONFIG_ZONE_DEVICE stop depending on CONFIG_EXPERTDan Williams
When it was first introduced CONFIG_ZONE_DEVICE depended on disabling CONFIG_ZONE_DMA, a configuration choice reserved for "experts". However, now that the ZONE_DMA conflict has been eliminated it no longer makes sense to require CONFIG_EXPERT. Link: http://lkml.kernel.org/r/146687646274.39261.14267596518720371009.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Eric Sandeen <sandeen@redhat.com> Reported-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28memblock: include <asm/sections.h> instead of <asm-generic/sections.h>Christoph Hellwig
asm-generic headers are generic implementations for architecture specific code and should not be included by common code. Thus use the asm/ version of sections.h to get at the linker sections. Link: http://lkml.kernel.org/r/1468285103-7470-1-git-send-email-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm, THP: clean up return value of madvise_free_huge_pmdHuang Ying
The definition of return value of madvise_free_huge_pmd is not clear before. According to the suggestion of Minchan Kim, change the type of return value to bool and return true if we do MADV_FREE successfully on entire pmd page, otherwise, return false. Comments are added too. Link: http://lkml.kernel.org/r/1467135452-16688-2-git-send-email-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm/zsmalloc: use helper to clear page->flags bitGanesh Mahendran
Use ClearPagePrivate/ClearPagePrivate2 helpers to clear PG_private/PG_private_2 in page->flags Link: http://lkml.kernel.org/r/1467882338-4300-7-git-send-email-opensource.ganesh@gmail.com Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm/zsmalloc: add __init,__exit attributeGanesh Mahendran
Add __init,__exit attribute for function that only called in module init/exit to save memory. Link: http://lkml.kernel.org/r/1467882338-4300-6-git-send-email-opensource.ganesh@gmail.com Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm/zsmalloc: keep comments consistent with codeGanesh Mahendran
Some minor commebnt changes: 1). update zs_malloc(),zs_create_pool() function header 2). update "Usage of struct page fields" Link: http://lkml.kernel.org/r/1467882338-4300-5-git-send-email-opensource.ganesh@gmail.com Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm/zsmalloc: avoid calculate max objects of zspage twiceGanesh Mahendran
Currently, if a class can not be merged, the max objects of zspage in that class may be calculated twice. This patch calculate max objects of zspage at the begin, and pass the value to can_merge() to decide whether the class can be merged. Also this patch remove function get_maxobj_per_zspage(), as there is no other place to call this function. Link: http://lkml.kernel.org/r/1467882338-4300-4-git-send-email-opensource.ganesh@gmail.com Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm/zsmalloc: use class->objs_per_zspage to get num of max objectsGanesh Mahendran
num of max objects in zspage is stored in each size_class now. So there is no need to re-calculate it. Link: http://lkml.kernel.org/r/1467882338-4300-3-git-send-email-opensource.ganesh@gmail.com Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm/zsmalloc: take obj index back from find_alloced_objGanesh Mahendran
the obj index value should be updated after return from find_alloced_obj() to avoid CPU burning caused by unnecessary object scanning. Link: http://lkml.kernel.org/r/1467882338-4300-2-git-send-email-opensource.ganesh@gmail.com Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm/zsmalloc: use obj_index to keep consistent with othersGanesh Mahendran
This is a cleanup patch. Change "index" to "obj_index" to keep consistent with others in zsmalloc. Link: http://lkml.kernel.org/r/1467882338-4300-1-git-send-email-opensource.ganesh@gmail.com Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm: bail out in shrink_inactive_list()Minchan Kim
With node-lru, if there are enough reclaimable pages in highmem but nothing in lowmem, VM can try to shrink inactive list although the requested zone is lowmem. The problem is that if the inactive list is full of highmem pages then a direct reclaimer searching for a lowmem page waste CPU scanning uselessly. It just burns out CPU. Even, many direct reclaimers are stalled by too_many_isolated if lots of parallel reclaimer are going on although there are no reclaimable memory in inactive list. I tried the experiment 4 times in 32bit 2G 8 CPU KVM machine to get elapsed time. hackbench 500 process 2 = Old = 1st: 289s 2nd: 310s 3rd: 112s 4th: 272s = Now = 1st: 31s 2nd: 132s 3rd: 162s 4th: 50s. [akpm@linux-foundation.org: fixes per Mel] Link: http://lkml.kernel.org/r/1469433119-1543-1-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm, vmscan: account for skipped pages as a partial scanMel Gorman
Page reclaim determines whether a pgdat is unreclaimable by examining how many pages have been scanned since a page was freed and comparing that to the LRU sizes. Skipped pages are not reclaim candidates but contribute to scanned. This can prematurely mark a pgdat as unreclaimable and trigger an OOM kill. This patch accounts for skipped pages as a partial scan so that an unreclaimable pgdat will still be marked as such but by scaling the cost of a skip, it'll avoid the pgdat being marked prematurely. Link: http://lkml.kernel.org/r/1469110261-7365-6-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm: consider whether to decivate based on eligible zones inactive ratioMel Gorman
Minchan Kim reported that with per-zone lru state it was possible to identify that a normal zone with 8^M anonymous pages could trigger OOM with non-atomic order-0 allocations as all pages in the zone were in the active list. gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0 Call Trace: __alloc_pages_nodemask+0xe52/0xe60 ? new_slab+0x39c/0x3b0 new_slab+0x39c/0x3b0 ___slab_alloc.constprop.87+0x6da/0x840 ? __alloc_skb+0x3c/0x260 ? enqueue_task_fair+0x73/0xbf0 ? poll_select_copy_remaining+0x140/0x140 __slab_alloc.isra.81.constprop.86+0x40/0x6d ? __alloc_skb+0x3c/0x260 kmem_cache_alloc+0x22c/0x260 ? __alloc_skb+0x3c/0x260 __alloc_skb+0x3c/0x260 alloc_skb_with_frags+0x4e/0x1a0 sock_alloc_send_pskb+0x16a/0x1b0 ? wait_for_unix_gc+0x31/0x90 unix_stream_sendmsg+0x28d/0x340 sock_sendmsg+0x2d/0x40 sock_write_iter+0x6c/0xc0 __vfs_write+0xc0/0x120 vfs_write+0x9b/0x1a0 ? __might_fault+0x49/0xa0 SyS_write+0x44/0x90 do_fast_syscall_32+0xa6/0x1e0 Mem-Info: active_anon:101103 inactive_anon:102219 isolated_anon:0 active_file:503 inactive_file:544 isolated_file:0 unevictable:0 dirty:0 writeback:34 unstable:0 slab_reclaimable:6298 slab_unreclaimable:74669 mapped:863 shmem:0 pagetables:100998 bounce:0 free:23573 free_pcp:1861 free_cma:0 Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 809 1965 1965 Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB lowmem_reserve[]: 0 0 9247 9247 HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB 54409 total pagecache pages 53215 pages in swap cache Swap cache stats: add 300982, delete 247765, find 157978/226539 Free swap = 3803244kB Total swap = 4192252kB 524186 pages RAM 295934 pages HighMem/MovableOnly 9642 pages reserved 0 pages cma reserved The problem is due to the active deactivation logic in inactive_list_is_low: Node 0 active_anon:404412kB inactive_anon:409040kB IOW, (inactive_anon of node * inactive_ratio > active_anon of node) due to highmem anonymous stat so VM never deactivates normal zone's anonymous pages. This patch is a modified version of Minchan's original solution but based upon it. The problem with Minchan's patch is that any low zone with an imbalanced list could force a rotation. In this patch, a zone-constrained global reclaim will rotate the list if the inactive/active ratio of all eligible zones needs to be corrected. It is possible that higher zone pages will be initially rotated prematurely but this is the safer choice to maintain overall LRU age. Link: http://lkml.kernel.org/r/20160722090929.GJ10438@techsingularity.net Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm: remove reclaim and compaction retry approximationsMel Gorman
If per-zone LRU accounting is available then there is no point approximating whether reclaim and compaction should retry based on pgdat statistics. This is effectively a revert of "mm, vmstat: remove zone and node double accounting by approximating retries" with the difference that inactive/active stats are still available. This preserves the history of why the approximation was retried and why it had to be reverted to handle OOM kills on 32-bit systems. Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm, vmscan: remove highmem_file_pagesMel Gorman
With the reintroduction of per-zone LRU stats, highmem_file_pages is redundant so remove it. [mgorman@techsingularity.net: wrong stat is being accumulated in highmem_dirtyable_memory] Link: http://lkml.kernel.org/r/20160725092324.GM10438@techsingularity.netLink: http://lkml.kernel.org/r/1469110261-7365-3-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm: add per-zone lru list statMinchan Kim
When I did stress test with hackbench, I got OOM message frequently which didn't ever happen in zone-lru. gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0 .. .. __alloc_pages_nodemask+0xe52/0xe60 ? new_slab+0x39c/0x3b0 new_slab+0x39c/0x3b0 ___slab_alloc.constprop.87+0x6da/0x840 ? __alloc_skb+0x3c/0x260 ? _raw_spin_unlock_irq+0x27/0x60 ? trace_hardirqs_on_caller+0xec/0x1b0 ? finish_task_switch+0xa6/0x220 ? poll_select_copy_remaining+0x140/0x140 __slab_alloc.isra.81.constprop.86+0x40/0x6d ? __alloc_skb+0x3c/0x260 kmem_cache_alloc+0x22c/0x260 ? __alloc_skb+0x3c/0x260 __alloc_skb+0x3c/0x260 alloc_skb_with_frags+0x4e/0x1a0 sock_alloc_send_pskb+0x16a/0x1b0 ? wait_for_unix_gc+0x31/0x90 ? alloc_set_pte+0x2ad/0x310 unix_stream_sendmsg+0x28d/0x340 sock_sendmsg+0x2d/0x40 sock_write_iter+0x6c/0xc0 __vfs_write+0xc0/0x120 vfs_write+0x9b/0x1a0 ? __might_fault+0x49/0xa0 SyS_write+0x44/0x90 do_fast_syscall_32+0xa6/0x1e0 sysenter_past_esp+0x45/0x74 Mem-Info: active_anon:104698 inactive_anon:105791 isolated_anon:192 active_file:433 inactive_file:283 isolated_file:22 unevictable:0 dirty:0 writeback:296 unstable:0 slab_reclaimable:6389 slab_unreclaimable:78927 mapped:474 shmem:0 pagetables:101426 bounce:0 free:10518 free_pcp:334 free_cma:0 Node 0 active_anon:418792kB inactive_anon:423164kB active_file:1732kB inactive_file:1132kB unevictable:0kB isolated(anon):768kB isolated(file):88kB mapped:1896kB dirty:0kB writeback:1184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1478632 all_unreclaimable? yes DMA free:3304kB min:68kB low:84kB high:100kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:4088kB kernel_stack:0kB pagetables:2480kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 809 1965 1965 Normal free:3436kB min:3604kB low:4504kB high:5404kB present:897016kB managed:858460kB mlocked:0kB slab_reclaimable:25556kB slab_unreclaimable:311712kB kernel_stack:164608kB pagetables:30844kB bounce:0kB free_pcp:620kB local_pcp:104kB free_cma:0kB lowmem_reserve[]: 0 0 9247 9247 HighMem free:33808kB min:512kB low:1796kB high:3080kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:372252kB bounce:0kB free_pcp:428kB local_pcp:72kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 DMA: 2*4kB (UM) 2*8kB (UM) 0*16kB 1*32kB (U) 1*64kB (U) 2*128kB (UM) 1*256kB (U) 1*512kB (M) 0*1024kB 1*2048kB (U) 0*4096kB = 3192kB Normal: 33*4kB (MH) 79*8kB (ME) 11*16kB (M) 4*32kB (M) 2*64kB (ME) 2*128kB (EH) 7*256kB (EH) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3244kB HighMem: 2590*4kB (UM) 1568*8kB (UM) 491*16kB (UM) 60*32kB (UM) 6*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 33064kB Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB 25121 total pagecache pages 24160 pages in swap cache Swap cache stats: add 86371, delete 62211, find 42865/60187 Free swap = 4015560kB Total swap = 4192252kB 524186 pages RAM 295934 pages HighMem/MovableOnly 9658 pages reserved 0 pages cma reserved The order-0 allocation for normal zone failed while there are a lot of reclaimable memory(i.e., anonymous memory with free swap). I wanted to analyze the problem but it was hard because we removed per-zone lru stat so I couldn't know how many of anonymous memory there are in normal/dma zone. When we investigate OOM problem, reclaimable memory count is crucial stat to find a problem. Without it, it's hard to parse the OOM message so I believe we should keep it. With per-zone lru stat, gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0 Mem-Info: active_anon:101103 inactive_anon:102219 isolated_anon:0 active_file:503 inactive_file:544 isolated_file:0 unevictable:0 dirty:0 writeback:34 unstable:0 slab_reclaimable:6298 slab_unreclaimable:74669 mapped:863 shmem:0 pagetables:100998 bounce:0 free:23573 free_pcp:1861 free_cma:0 Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 809 1965 1965 Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB lowmem_reserve[]: 0 0 9247 9247 HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB 54409 total pagecache pages 53215 pages in swap cache Swap cache stats: add 300982, delete 247765, find 157978/226539 Free swap = 3803244kB Total swap = 4192252kB 524186 pages RAM 295934 pages HighMem/MovableOnly 9642 pages reserved 0 pages cma reserved With that, we can see normal zone has a 86M reclaimable memory so we can know something goes wrong(I will fix the problem in next patch) in reclaim. [mgorman@techsingularity.net: rename zone LRU stats in /proc/vmstat] Link: http://lkml.kernel.org/r/20160725072300.GK10438@techsingularity.net Link: http://lkml.kernel.org/r/1469110261-7365-2-git-send-email-mgorman@techsingularity.net Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm, vmscan: release/reacquire lru_lock on pgdat changeMel Gorman
With node-lru, the locking is based on the pgdat. As Minchan pointed out, there is an opportunity to reduce LRU lock release/acquire in check_move_unevictable_pages by only changing lock on a pgdat change. [mgorman@techsingularity.net: remove double initialisation] Link: http://lkml.kernel.org/r/20160719074835.GC10438@techsingularity.net Link: http://lkml.kernel.org/r/1468853426-12858-3-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm, vmscan: remove redundant check in shrink_zones()Mel Gorman
As pointed out by Minchan Kim, shrink_zones() checks for populated zones in a zonelist but a zonelist can never contain unpopulated zones. While it's not related to the node-lru series, it can be cleaned up now. Link: http://lkml.kernel.org/r/1468853426-12858-2-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Suggested-by: Minchan Kim <minchan@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>