summaryrefslogtreecommitdiff
path: root/include/linux/mmzone.h
AgeCommit message (Collapse)Author
2015-04-07mm: move zone lock to a different cache line than order-0 free page listsMel Gorman
Huang Ying reported the following problem due to commit 3484b2de9499 ("mm: rearrange zone fields into read-only, page alloc, statistics and page reclaim lines") from the Intel performance tests 24b7e5819ad5cbef 3484b2de9499df23c4604a513b ---------------- -------------------------- %stddev %change %stddev \ | \ 152288 \261 0% -46.2% 81911 \261 0% aim7.jobs-per-min 237 \261 0% +85.6% 440 \261 0% aim7.time.elapsed_time 237 \261 0% +85.6% 440 \261 0% aim7.time.elapsed_time.max 25026 \261 0% +70.7% 42712 \261 0% aim7.time.system_time 2186645 \261 5% +32.0% 2885949 \261 4% aim7.time.voluntary_context_switches 4576561 \261 1% +24.9% 5715773 \261 0% aim7.time.involuntary_context_switches The problem is specific to very large machines under stress. It was not reproducible with the machines I had used to justify the original patch because large numbers of CPUs are required. When pressure is high enough, the cache line is bouncing between CPUs trying to acquire the lock and the holder of the lock adjusting free lists. The intention was that the acquirer of the lock would automatically have the cache line holding the free lists but according to Huang, this is not a universal win. One possibility is to move the zone lock to its own cache line but it increases the size of the zone. This patch moves the lock to the other end of the free lists where they do not contend under high pressure. It does mean the page allocator paths now require more cache lines but Huang reports that it restores performance to previous levels on large machines %stddev %change %stddev \ | \ 84568 \261 1% +94.3% 164280 \261 1% aim7.jobs-per-min 2881944 \261 2% -35.1% 1870386 \261 8% aim7.time.voluntary_context_switches 681 \261 1% -3.4% 658 \261 0% aim7.time.user_time 5538139 \261 0% -12.1% 4867884 \261 0% aim7.time.involuntary_context_switches 44174 \261 1% -46.0% 23848 \261 1% aim7.time.system_time 426 \261 1% -48.4% 219 \261 1% aim7.time.elapsed_time 426 \261 1% -48.4% 219 \261 1% aim7.time.elapsed_time.max 468 \261 1% -43.1% 266 \261 2% uptime.boot Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Huang Ying <ying.huang@intel.com> Tested-by: Huang Ying <ying.huang@intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11mm: microoptimize zonelist operationsVlastimil Babka
next_zones_zonelist() returns a zoneref pointer, as well as a zone pointer via extra parameter. Since the latter can be trivially obtained by dereferencing the former, the overhead of the extra parameter is unjustified. This patch thus removes the zone parameter from next_zones_zonelist(). Both callers happen to be in the same header file, so it's simple to add the zoneref dereference inline. We save some bytes of code size. add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-105 (-105) function old new delta nr_free_zone_pages 129 115 -14 __alloc_pages_nodemask 2300 2285 -15 get_page_from_freelist 2652 2576 -76 add/remove: 0/0 grow/shrink: 1/0 up/down: 10/0 (10) function old new delta try_to_compact_pages 569 579 +10 Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11mm: fix typo of MIGRATE_RESERVE in commentBaoquan He
Found it when I want to jump to the definition of MIGRATE_RESERVE ctags. Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm/page_ext: resurrect struct page extending code for debuggingJoonsoo Kim
When we debug something, we'd like to insert some information to every page. For this purpose, we sometimes modify struct page itself. But, this has drawbacks. First, it requires re-compile. This makes us hesitate to use the powerful debug feature so development process is slowed down. And, second, sometimes it is impossible to rebuild the kernel due to third party module dependency. At third, system behaviour would be largely different after re-compile, because it changes size of struct page greatly and this structure is accessed by every part of kernel. Keeping this as it is would be better to reproduce errornous situation. This feature is intended to overcome above mentioned problems. This feature allocates memory for extended data per page in certain place rather than the struct page itself. This memory can be accessed by the accessor functions provided by this code. During the boot process, it checks whether allocation of huge chunk of memory is needed or not. If not, it avoids allocating memory at all. With this advantage, we can include this feature into the kernel in default and can avoid rebuild and solve related problems. Until now, memcg uses this technique. But, now, memcg decides to embed their variable to struct page itself and it's code to extend struct page has been removed. I'd like to use this code to develop debug feature, so this patch resurrect it. To help these things to work well, this patch introduces two callbacks for clients. One is the need callback which is mandatory if user wants to avoid useless memory allocation at boot-time. The other is optional, init callback, which is used to do proper initialization after memory is allocated. Detailed explanation about purpose of these functions is in code comment. Please refer it. Others are completely same with previous extension code in memcg. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: embed the memcg pointer directly into struct pageJohannes Weiner
Memory cgroups used to have 5 per-page pointers. To allow users to disable that amount of overhead during runtime, those pointers were allocated in a separate array, with a translation layer between them and struct page. There is now only one page pointer remaining: the memcg pointer, that indicates which cgroup the page is associated with when charged. The complexity of runtime allocation and the runtime translation overhead is no longer justified to save that *potential* 0.19% of memory. With CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding after the page->private member and doesn't even increase struct page, and then this patch actually saves space. Remaining users that care can still compile their kernels without CONFIG_MEMCG. text data bss dec hex filename 8828345 1725264 983040 11536649 b00909 vmlinux.old 8827425 1725264 966656 11519345 afc571 vmlinux.new [mhocko@suse.cz: update Documentation/cgroups/memory.txt] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-13mm/page_alloc: fix incorrect isolation behavior by rechecking migratetypeJoonsoo Kim
Before describing bugs itself, I first explain definition of freepage. 1. pages on buddy list are counted as freepage. 2. pages on isolate migratetype buddy list are *not* counted as freepage. 3. pages on cma buddy list are counted as CMA freepage, too. Now, I describe problems and related patch. Patch 1: There is race conditions on getting pageblock migratetype that it results in misplacement of freepages on buddy list, incorrect freepage count and un-availability of freepage. Patch 2: Freepages on pcp list could have stale cached information to determine migratetype of buddy list to go. This causes misplacement of freepages on buddy list and incorrect freepage count. Patch 4: Merging between freepages on different migratetype of pageblocks will cause freepages accouting problem. This patch fixes it. Without patchset [3], above problem doesn't happens on my CMA allocation test, because CMA reserved pages aren't used at all. So there is no chance for above race. With patchset [3], I did simple CMA allocation test and get below result: - Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation - run kernel build (make -j16) on background - 30 times CMA allocation(8MB * 30 = 240MB) attempts in 5 sec interval - Result: more than 5000 freepage count are missed With patchset [3] and this patchset, I found that no freepage count are missed so that I conclude that problems are solved. On my simple memory offlining test, these problems also occur on that environment, too. This patch (of 4): There are two paths to reach core free function of buddy allocator, __free_one_page(), one is free_one_page()->__free_one_page() and the other is free_hot_cold_page()->free_pcppages_bulk()->__free_one_page(). Each paths has race condition causing serious problems. At first, this patch is focused on first type of freepath. And then, following patch will solve the problem in second type of freepath. In the first type of freepath, we got migratetype of freeing page without holding the zone lock, so it could be racy. There are two cases of this race. 1. pages are added to isolate buddy list after restoring orignal migratetype CPU1 CPU2 get migratetype => return MIGRATE_ISOLATE call free_one_page() with MIGRATE_ISOLATE grab the zone lock unisolate pageblock release the zone lock grab the zone lock call __free_one_page() with MIGRATE_ISOLATE freepage go into isolate buddy list, although pageblock is already unisolated This may cause two problems. One is that we can't use this page anymore until next isolation attempt of this pageblock, because freepage is on isolate buddy list. The other is that freepage accouting could be wrong due to merging between different buddy list. Freepages on isolate buddy list aren't counted as freepage, but ones on normal buddy list are counted as freepage. If merge happens, buddy freepage on normal buddy list is inevitably moved to isolate buddy list without any consideration of freepage accouting so it could be incorrect. 2. pages are added to normal buddy list while pageblock is isolated. It is similar with above case. This also may cause two problems. One is that we can't keep these freepages from being allocated. Although this pageblock is isolated, freepage would be added to normal buddy list so that it could be allocated without any restriction. And the other problem is same as case 1, that it, incorrect freepage accouting. This race condition would be prevented by checking migratetype again with holding the zone lock. Because it is somewhat heavy operation and it isn't needed in common case, we want to avoid rechecking as much as possible. So this patch introduce new variable, nr_isolate_pageblock in struct zone to check if there is isolated pageblock. With this, we can avoid to re-check migratetype in common case and do it only if there is isolated pageblock or migratetype is MIGRATE_ISOLATE. This solve above mentioned problems. Changes from v3: Add one more check in free_one_page() that checks whether migratetype is MIGRATE_ISOLATE or not. Without this, abovementioned case 1 could happens. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Heesub Shin <heesub.shin@samsung.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Ritesh Harjani <ritesh.list@gmail.com> Cc: Gioh Kim <gioh.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm: clean up zone flagsJohannes Weiner
Page reclaim tests zone_is_reclaim_dirty(), but the site that actually sets this state does zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY), sending the reader through layers indirection just to track down a simple bit. Remove all zone flag wrappers and just use bitops against zone->flags directly. It's just as readable and the lines are barely any longer. Also rename ZONE_TAIL_LRU_DIRTY to ZONE_DIRTY to match ZONE_WRITEBACK, and remove the zone_flags_t typedef. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: page_alloc: reduce cost of the fair zone allocation policyMel Gorman
The fair zone allocation policy round-robins allocations between zones within a node to avoid age inversion problems during reclaim. If the first allocation fails, the batch counts are reset and a second attempt made before entering the slow path. One assumption made with this scheme is that batches expire at roughly the same time and the resets each time are justified. This assumption does not hold when zones reach their low watermark as the batches will be consumed at uneven rates. Allocation failure due to watermark depletion result in additional zonelist scans for the reset and another watermark check before hitting the slowpath. On UMA, the benefit is negligible -- around 0.25%. On 4-socket NUMA machine it's variable due to the variability of measuring overhead with the vmstat changes. The system CPU overhead comparison looks like 3.16.0-rc3 3.16.0-rc3 3.16.0-rc3 vanilla vmstat-v5 lowercost-v5 User 746.94 774.56 802.00 System 65336.22 32847.27 40852.33 Elapsed 27553.52 27415.04 27368.46 However it is worth noting that the overall benchmark still completed faster and intuitively it makes sense to take as few passes as possible through the zonelists. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: move zone->pages_scanned into a vmstat counterMel Gorman
zone->pages_scanned is a write-intensive cache line during page reclaim and it's also updated during page free. Move the counter into vmstat to take advantage of the per-cpu updates and do not update it in the free paths unless necessary. On a small UMA machine running tiobench the difference is marginal. On a 4-node machine the overhead is more noticable. Note that automatic NUMA balancing was disabled for this test as otherwise the system CPU overhead is unpredictable. 3.16.0-rc3 3.16.0-rc3 3.16.0-rc3 vanillarearrange-v5 vmstat-v5 User 746.94 759.78 774.56 System 65336.22 58350.98 32847.27 Elapsed 27553.52 27282.02 27415.04 Note that the overhead reduction will vary depending on where exactly pages are allocated and freed. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: rearrange zone fields into read-only, page alloc, statistics and page ↵Mel Gorman
reclaim lines The arrangement of struct zone has changed over time and now it has reached the point where there is some inappropriate sharing going on. On x86-64 for example o The zone->node field is shared with the zone lock and zone->node is accessed frequently from the page allocator due to the fair zone allocation policy. o span_seqlock is almost never used by shares a line with free_area o Some zone statistics share a cache line with the LRU lock so reclaim-intensive and allocator-intensive workloads can bounce the cache line on a stat update This patch rearranges struct zone to put read-only and read-mostly fields together and then splits the page allocator intensive fields, the zone statistics and the page reclaim intensive fields into their own cache lines. Note that the type of lowmem_reserve changes due to the watermark calculations being signed and avoiding a signed/unsigned conversion there. On the test configuration I used the overall size of struct zone shrunk by one cache line. On smaller machines, this is not likely to be noticable. However, on a 4-node NUMA machine running tiobench the system CPU overhead is reduced by this patch. 3.16.0-rc3 3.16.0-rc3 vanillarearrange-v5r9 User 746.94 759.78 System 65336.22 58350.98 Elapsed 27553.52 27282.02 Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mem-hotplug: improve zone_movable_is_highmem logicWang Nan
In original code, zone_movable_is_highmem() assumes ZONE_MOVABLE not highmem if CONFIG_HAVE_MEMBLOCK_NODE_MAP is not set. In online_pages, it extracts pages from the previous zone before ZONE_MOVABLE. Which is logically inconsistent: If HAVE_MEMBLOCK_NODE_MAP is turned off but HIGHMEM is on, zone_movable_is_highmem() makes movable zone not highmem, but online_pages() extracts pages from ZONE_HIGHMEM. This inconsistency doesn't cause real problem currently, because all architectures support online_pages also have HAVE_MEMBLOCK_NODE_MAP. However, fixing it makes code clear, and also helps futher coding. Signed-off-by: Wang Nan <wangnan0@huawei.com> Cc: Zhang Zhen <zhangzhen@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Jiang Liu <liuj97@gmail.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm: page_alloc: use unsigned int for order in more placesMel Gorman
X86 prefers the use of unsigned types for iterators and there is a tendency to mix whether a signed or unsigned type if used for page order. This converts a number of sites in mm/page_alloc.c to use unsigned int for order where possible. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm: page_alloc: reduce number of times page_to_pfn is calledMel Gorman
In the free path we calculate page_to_pfn multiple times. Reduce that. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm: page_alloc: use word-based accesses for get/set pageblock bitmapsMel Gorman
The test_bit operations in get/set pageblock flags are expensive. This patch reads the bitmap on a word basis and use shifts and masks to isolate the bits of interest. Similarly masks are used to set a local copy of the bitmap and then use cmpxchg to update the bitmap if there have been no other changes made in parallel. In a test running dd onto tmpfs the overhead of the pageblock-related functions went from 1.27% in profiles to 0.5%. In addition to the performance benefits, this patch closes races that are possible between: a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype() reads part of the bits before and other part of the bits after set_pageblock_migratetype() has updated them. b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic read-modify-update set bit operation in set_pageblock_skip() will cause lost updates to some bits changed in the set_pageblock_migratetype(). Joonsoo Kim first reported the case a) via code inspection. Vlastimil Babka's testing with a debug patch showed that either a) or b) occurs roughly once per mmtests' stress-highalloc benchmark (although not necessarily in the same pageblock). Furthermore during development of unrelated compaction patches, it was observed that frequent calls to {start,undo}_isolate_page_range() the race occurs several thousands of times and has resulted in NULL pointer dereferences in move_freepages() and free_one_page() in places where free_list[migratetype] is manipulated by e.g. list_move(). Further debugging confirmed that migratetype had invalid value of 6, causing out of bounds access to the free_list array. That confirmed that the race exist, although it may be extremely rare, and currently only fatal where page isolation is performed due to memory hot remove. Races on pageblocks being updated by set_pageblock_migratetype(), where both old and new migratetype are lower MIGRATE_RESERVE, currently cannot result in an invalid value being observed, although theoretically they may still lead to unexpected creation or destruction of MIGRATE_RESERVE pageblocks. Furthermore, things could get suddenly worse when memory isolation is used more, or when new migratetypes are added. After this patch, the race has no longer been observed in testing. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-and-tested-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm, compaction: add per-zone migration pfn cache for async compactionDavid Rientjes
Each zone has a cached migration scanner pfn for memory compaction so that subsequent calls to memory compaction can start where the previous call left off. Currently, the compaction migration scanner only updates the per-zone cached pfn when pageblocks were not skipped for async compaction. This creates a dependency on calling sync compaction to avoid having subsequent calls to async compaction from scanning an enormous amount of non-MOVABLE pageblocks each time it is called. On large machines, this could be potentially very expensive. This patch adds a per-zone cached migration scanner pfn only for async compaction. It is updated everytime a pageblock has been scanned in its entirety and when no pages from it were successfully isolated. The cached migration scanner pfn for sync compaction is updated only when called for sync compaction. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Greg Thelen <gthelen@google.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mem-hotplug: implement get/put_online_memsVladimir Davydov
kmem_cache_{create,destroy,shrink} need to get a stable value of cpu/node online mask, because they init/destroy/access per-cpu/node kmem_cache parts, which can be allocated or destroyed on cpu/mem hotplug. To protect against cpu hotplug, these functions use {get,put}_online_cpus. However, they do nothing to synchronize with memory hotplug - taking the slab_mutex does not eliminate the possibility of race as described in patch 2. What we need there is something like get_online_cpus, but for memory. We already have lock_memory_hotplug, which serves for the purpose, but it's a bit of a hammer right now, because it's backed by a mutex. As a result, it imposes some limitations to locking order, which are not desirable, and can't be used just like get_online_cpus. That's why in patch 1 I substitute it with get/put_online_mems, which work exactly like get/put_online_cpus except they block not cpu, but memory hotplug. [ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by myself, because it used an rw semaphore for get/put_online_mems, making them dead lock prune. ] This patch (of 2): {un}lock_memory_hotplug, which is used to synchronize against memory hotplug, is currently backed by a mutex, which makes it a bit of a hammer - threads that only want to get a stable value of online nodes mask won't be able to proceed concurrently. Also, it imposes some strong locking ordering rules on it, which narrows down the set of its usage scenarios. This patch introduces get/put_online_mems, which are the same as get/put_online_cpus, but for memory hotplug, i.e. executing a code inside a get/put_online_mems section will guarantee a stable value of online nodes, present pages, etc. lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm: page_alloc: do not cache reclaim distancesMel Gorman
pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by zone_reclaim due to its distance. As it is expected that zone_reclaim_mode will be rarely enabled it is unreasonable for all machines to take a penalty. Fortunately, the zone_reclaim_mode() path is already slow and it is the path that takes the hit. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03mm: keep page cache radix tree nodes in checkJohannes Weiner
Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03mm: thrash detection-based file cache sizingJohannes Weiner
The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-10mm: fix GFP_THISNODE callers and clarifyJohannes Weiner
GFP_THISNODE is for callers that implement their own clever fallback to remote nodes. It restricts the allocation to the specified node and does not invoke reclaim, assuming that the caller will take care of it when the fallback fails, e.g. through a subsequent allocation request without GFP_THISNODE set. However, many current GFP_THISNODE users only want the node exclusive aspect of the flag, without actually implementing their own fallback or triggering reclaim if necessary. This results in things like page migration failing prematurely even when there is easily reclaimable memory available, unless kswapd happens to be running already or a concurrent allocation attempt triggers the necessary reclaim. Convert all callsites that don't implement their own fallback strategy to __GFP_THISNODE. This restricts the allocation a single node too, but at the same time allows the allocator to enter the slowpath, wake kswapd, and invoke direct reclaim if necessary, to make the allocation happen when memory is full. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Jan Stancek <jstancek@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21mm: numa: limit scope of lock for NUMA migrate rate limitingMel Gorman
NUMA migrate rate limiting protects a migration counter and window using a lock but in some cases this can be a contended lock. It is not critical that the number of pages be perfect, lost updates are acceptable. Reduce the importance of this lock. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21mm: get rid of unnecessary pageblock scanning in setup_zone_migrate_reserveYasuaki Ishimatsu
Yasuaki Ishimatsu reported memory hot-add spent more than 5 _hours_ on 9TB memory machine since onlining memory sections is too slow. And we found out setup_zone_migrate_reserve spent >90% of the time. The problem is, setup_zone_migrate_reserve scans all pageblocks unconditionally, but it is only necessary if the number of reserved block was reduced (i.e. memory hot remove). Moreover, maximum MIGRATE_RESERVE per zone is currently 2. It means that the number of reserved pageblocks is almost always unchanged. This patch adds zone->nr_migrate_reserve_block to maintain the number of MIGRATE_RESERVE pageblocks and it reduces the overhead of setup_zone_migrate_reserve dramatically. The following table shows time of onlining a memory section. Amount of memory | 128GB | 192GB | 256GB| --------------------------------------------- linux-3.12 | 23.9 | 31.4 | 44.5 | This patch | 8.3 | 8.3 | 8.6 | Mel's proposal patch | 10.9 | 19.2 | 31.3 | --------------------------------------------- (millisecond) 128GB : 4 nodes and each node has 32GB of memory 192GB : 6 nodes and each node has 32GB of memory 256GB : 8 nodes and each node has 32GB of memory (*1) Mel proposed his idea by the following threads. https://lkml.org/lkml/2013/10/30/272 [akpm@linux-foundation.org: tweak comment] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Tested-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: vmscan: fix do_try_to_free_pages() livelockLisa Du
This patch is based on KOSAKI's work and I add a little more description, please refer https://lkml.org/lkml/2012/6/14/74. Currently, I found system can enter a state that there are lots of free pages in a zone but only order-0 and order-1 pages which means the zone is heavily fragmented, then high order allocation could make direct reclaim path's long stall(ex, 60 seconds) especially in no swap and no compaciton enviroment. This problem happened on v3.4, but it seems issue still lives in current tree, the reason is do_try_to_free_pages enter live lock: kswapd will go to sleep if the zones have been fully scanned and are still not balanced. As kswapd thinks there's little point trying all over again to avoid infinite loop. Instead it changes order from high-order to 0-order because kswapd think order-0 is the most important. Look at 73ce02e9 in detail. If watermarks are ok, kswapd will go back to sleep and may leave zone->all_unreclaimable =3D 0. It assume high-order users can still perform direct reclaim if they wish. Direct reclaim continue to reclaim for a high order which is not a COSTLY_ORDER without oom-killer until kswapd turn on zone->all_unreclaimble= . This is because to avoid too early oom-kill. So it means direct_reclaim depends on kswapd to break this loop. In worst case, direct-reclaim may continue to page reclaim forever when kswapd sleeps forever until someone like watchdog detect and finally kill the process. As described in: http://thread.gmane.org/gmane.linux.kernel.mm/103737 We can't turn on zone->all_unreclaimable from direct reclaim path because direct reclaim path don't take any lock and this way is racy. Thus this patch removes zone->all_unreclaimable field completely and recalculates zone reclaimable state every time. Note: we can't take the idea that direct-reclaim see zone->pages_scanned directly and kswapd continue to use zone->all_unreclaimable. Because, it is racy. commit 929bea7c71 (vmscan: all_unreclaimable() use zone->all_unreclaimable as a name) describes the detail. [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()] Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Nick Piggin <npiggin@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Bob Liu <lliubbo@gmail.com> Cc: Neil Zhang <zhangwm@marvell.com> Cc: Russell King - ARM Linux <linux@arm.linux.org.uk> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Lisa Du <cldu@marvell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: page_alloc: fair zone allocator policyJohannes Weiner
Each zone that holds userspace pages of one workload must be aged at a speed proportional to the zone size. Otherwise, the time an individual page gets to stay in memory depends on the zone it happened to be allocated in. Asymmetry in the zone aging creates rather unpredictable aging behavior and results in the wrong pages being reclaimed, activated etc. But exactly this happens right now because of the way the page allocator and kswapd interact. The page allocator uses per-node lists of all zones in the system, ordered by preference, when allocating a new page. When the first iteration does not yield any results, kswapd is woken up and the allocator retries. Due to the way kswapd reclaims zones below the high watermark while a zone can be allocated from when it is above the low watermark, the allocator may keep kswapd running while kswapd reclaim ensures that the page allocator can keep allocating from the first zone in the zonelist for extended periods of time. Meanwhile the other zones rarely see new allocations and thus get aged much slower in comparison. The result is that the occasional page placed in lower zones gets relatively more time in memory, even gets promoted to the active list after its peers have long been evicted. Meanwhile, the bulk of the working set may be thrashing on the preferred zone even though there may be significant amounts of memory available in the lower zones. Even the most basic test -- repeatedly reading a file slightly bigger than memory -- shows how broken the zone aging is. In this scenario, no single page should be able stay in memory long enough to get referenced twice and activated, but activation happens in spades: $ grep active_file /proc/zoneinfo nr_inactive_file 0 nr_active_file 0 nr_inactive_file 0 nr_active_file 8 nr_inactive_file 1582 nr_active_file 11994 $ cat data data data data >/dev/null $ grep active_file /proc/zoneinfo nr_inactive_file 0 nr_active_file 70 nr_inactive_file 258753 nr_active_file 443214 nr_inactive_file 149793 nr_active_file 12021 Fix this with a very simple round robin allocator. Each zone is allowed a batch of allocations that is proportional to the zone's size, after which it is treated as full. The batch counters are reset when all zones have been tried and the allocator enters the slowpath and kicks off kswapd reclaim. Allocation and reclaim is now fairly spread out to all available/allowable zones: $ grep active_file /proc/zoneinfo nr_inactive_file 0 nr_active_file 0 nr_inactive_file 174 nr_active_file 4865 nr_inactive_file 53 nr_active_file 860 $ cat data data data data >/dev/null $ grep active_file /proc/zoneinfo nr_inactive_file 0 nr_active_file 0 nr_inactive_file 666622 nr_active_file 4988 nr_inactive_file 190969 nr_active_file 937 When zone_reclaim_mode is enabled, allocations will now spread out to all zones on the local node, not just the first preferred zone (which on a 4G node might be a tiny Normal zone). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Paul Bolle <paul.bollee@gmail.com> Cc: Zlatko Calusic <zcalusic@bitsync.net> Tested-by: Kevin Hilman <khilman@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09mm: remove unused functions is_{normal_idx, normal, dma32, dma}Zhang Yanfei
These functions are nowhere used, so remove them. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03sparsemem: add BUILD_BUG_ON when sizeof mem_section is non-power-of-2Cody P Schafer
Instead of leaving a hidden trap for the next person who comes along and wants to add something to mem_section, add a big fat warning about it needing to be a power-of-2, and insert a BUILD_BUG_ON() in sparse_init() to catch mistakes. Right now non-power-of-2 mem_sections cause a number of WARNs at boot (which don't clearly point to the size of mem_section as an issue), but the system limps on (temporarily, at least). This is based upon Dave Hansen's earlier RFC where he ran into the same issue: "sparsemem: fix boot when SECTIONS_PER_ROOT is not power-of-2" http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/03077.html Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jiang Liu <liuj97@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm: use a dedicated lock to protect totalram_pages and zone->managed_pagesJiang Liu
Currently lock_memory_hotplug()/unlock_memory_hotplug() are used to protect totalram_pages and zone->managed_pages. Other than the memory hotplug driver, totalram_pages and zone->managed_pages may also be modified at runtime by other drivers, such as Xen balloon, virtio_balloon etc. For those cases, memory hotplug lock is a little too heavy, so introduce a dedicated lock to protect totalram_pages and zone->managed_pages. Now we have a simplified locking rules totalram_pages and zone->managed_pages as: 1) no locking for read accesses because they are unsigned long. 2) no locking for write accesses at boot time in single-threaded context. 3) serialize write accesses at runtime by acquiring the dedicated managed_page_count_lock. Also adjust zone->managed_pages when freeing reserved pages into the buddy system, to keep totalram_pages and zone->managed_pages in consistence. [akpm@linux-foundation.org: don't export adjust_managed_page_count to modules (for now)] Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: <sworddragon2@aol.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mmzone: note that node_size_lock should be manipulated via pgdat_resize_lock()Cody P Schafer
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm: fix comment referring to non-existent size_seqlock, change to span_seqlockCody P Schafer
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm: vmscan: block kswapd if it is encountering pages under writebackMel Gorman
Historically, kswapd used to congestion_wait() at higher priorities if it was not making forward progress. This made no sense as the failure to make progress could be completely independent of IO. It was later replaced by wait_iff_congested() and removed entirely by commit 258401a6 (mm: don't wait on congested zones in balance_pgdat()) as it was duplicating logic in shrink_inactive_list(). This is problematic. If kswapd encounters many pages under writeback and it continues to scan until it reaches the high watermark then it will quickly skip over the pages under writeback and reclaim clean young pages or push applications out to swap. The use of wait_iff_congested() is not suited to kswapd as it will only stall if the underlying BDI is really congested or a direct reclaimer was unable to write to the underlying BDI. kswapd bypasses the BDI congestion as it sets PF_SWAPWRITE but even if this was taken into account then it would cause direct reclaimers to stall on writeback which is not desirable. This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is encountering too many pages under writeback. If this flag is set and kswapd encounters a PageReclaim page under writeback then it'll assume that the LRU lists are being recycled too quickly before IO can complete and block waiting for some IO to complete. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm: vmscan: have kswapd writeback pages based on dirty pages encountered, ↵Mel Gorman
not priority Currently kswapd queues dirty pages for writeback if scanning at an elevated priority but the priority kswapd scans at is not related to the number of unqueued dirty encountered. Since commit "mm: vmscan: Flatten kswapd priority loop", the priority is related to the size of the LRU and the zone watermark which is no indication as to whether kswapd should write pages or not. This patch tracks if an excessive number of unqueued dirty pages are being encountered at the end of the LRU. If so, it indicates that dirty pages are being recycled before flusher threads can clean them and flags the zone so that kswapd will start writing pages until the zone is balanced. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "Usual stuff, mostly comment fixes, typo fixes, printk fixes and small code cleanups" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (45 commits) mm: Convert print_symbol to %pSR gfs2: Convert print_symbol to %pSR m32r: Convert print_symbol to %pSR iostats.txt: add easy-to-find description for field 6 x86 cmpxchg.h: fix wrong comment treewide: Fix typo in printk and comments doc: devicetree: Fix various typos docbook: fix 8250 naming in device-drivers pata_pdc2027x: Fix compiler warning treewide: Fix typo in printks mei: Fix comments in drivers/misc/mei treewide: Fix typos in kernel messages pm44xx: Fix comment for "CONFIG_CPU_IDLE" doc: Fix typo "CONFIG_CGROUP_CGROUP_MEMCG_SWAP" mmzone: correct "pags" to "pages" in comment. kernel-parameters: remove outdated 'noresidual' parameter Remove spurious _H suffixes from ifdef comments sound: Remove stray pluses from Kconfig file radio-shark: Fix printk "CONFIG_LED_CLASS" doc: put proper reference to CONFIG_MODULE_SIG_ENFORCE ...
2013-03-27mmzone: correct "pags" to "pages" in comment.Cody P Schafer
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-03-22mm: zone_end_pfn is too smallRuss Anderson
Booting with 32 TBytes memory hits BUG at mm/page_alloc.c:552! (output below). The key hint is "page 4294967296 outside zone". 4294967296 = 0x100000000 (bit 32 is set). The problem is in include/linux/mmzone.h: 530 static inline unsigned zone_end_pfn(const struct zone *zone) 531 { 532 return zone->zone_start_pfn + zone->spanned_pages; 533 } zone_end_pfn is "unsigned" (32 bits). Changing it to "unsigned long" (64 bits) fixes the problem. zone_end_pfn() was added recently in commit 108bcc96ef70 ("mm: add & use zone_end_pfn() and zone_spans_pfn()") Output from the failure. No AGP bridge found page 4294967296 outside zone [ 4294967296 - 4327469056 ] ------------[ cut here ]------------ kernel BUG at mm/page_alloc.c:552! invalid opcode: 0000 [#1] SMP Modules linked in: CPU 0 Pid: 0, comm: swapper Not tainted 3.9.0-rc2.dtp+ #10 RIP: free_one_page+0x382/0x430 Process swapper (pid: 0, threadinfo ffffffff81942000, task ffffffff81955420) Call Trace: __free_pages_ok+0x96/0xb0 __free_pages+0x25/0x50 __free_pages_bootmem+0x8a/0x8c __free_memory_core+0xea/0x131 free_low_memory_core_early+0x4a/0x98 free_all_bootmem+0x45/0x47 mem_init+0x7b/0x14c start_kernel+0x216/0x433 x86_64_start_reservations+0x2a/0x2c x86_64_start_kernel+0x144/0x153 Code: 89 f1 ba 01 00 00 00 31 f6 d3 e2 4c 89 ef e8 66 a4 01 00 e9 2c fe ff ff 0f 0b eb fe 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 eb f3 <0f> 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 0f 0b eb fe 49 Signed-off-by: Russ Anderson <rja@sgi.com> Reported-by: George Beshers <gbeshers@sgi.com> Acked-by: Hedi Berriche <hedi@sgi.com> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mmzone: add pgdat_{end_pfn,is_empty}() helpers & consolidate.Cody P Schafer
Add pgdat_end_pfn() and pgdat_is_empty() helpers which match the similar zone_*() functions. Change node_end_pfn() to be a wrapper of pgdat_end_pfn(). Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: add zone_is_empty() and zone_is_initialized()Cody P Schafer
Factoring out these 2 checks makes it more clear what we are actually checking for. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: add & use zone_end_pfn() and zone_spans_pfn()Cody P Schafer
Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code duplication. This also switches to using them in compaction (where an additional variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and kmemleak. Note that in compaction.c I avoid calling zone_end_pfn() repeatedly because I expect at some point the sycronization issues with start_pfn & spanned_pages will need fixing, either by actually using the seqlock or clever memory barrier usage. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: move page flags layout to separate headerPeter Zijlstra
This is a preparation patch for moving page->_last_nid into page->flags that moves page flag layout information to a separate header. This patch is necessary because otherwise there would be a circular dependency between mm_types.h and mm.h. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Simon Jeons <simon.jeons@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: remove MIGRATE_ISOLATE check in hotpathMinchan Kim
Several functions test MIGRATE_ISOLATE and some of those are hotpath but MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION(ie, CMA, memory-hotplug and memory-failure) which are not common config option. So let's not add unnecessary overhead and code when we don't enable CONFIG_MEMORY_ISOLATION. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-04mm: fix zone_watermark_ok_safe() accounting of isolated pagesBartlomiej Zolnierkiewicz
Commit 702d1a6e0766 ("memory-hotplug: fix kswapd looping forever problem") added an isolated pageblocks counter (nr_pageblock_isolate in struct zone) and used it to adjust free pages counter in zone_watermark_ok_safe() to prevent kswapd looping forever problem. Then later, commit 2139cbe627b8 ("cma: fix counting of isolated pages") fixed accounting of isolated pages in global free pages counter. It made the previous zone_watermark_ok_safe() fix unnecessary and potentially harmful (cause now isolated pages may be accounted twice making free pages counter incorrect). This patch removes the special isolated pageblocks counter altogether which fixes zone_watermark_ok_safe() free pages check. Reported-by: Tomasz Stanislawski <t.stanislaws@samsung.com> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-16Merge tag 'balancenuma-v11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma Pull Automatic NUMA Balancing bare-bones from Mel Gorman: "There are three implementations for NUMA balancing, this tree (balancenuma), numacore which has been developed in tip/master and autonuma which is in aa.git. In almost all respects balancenuma is the dumbest of the three because its main impact is on the VM side with no attempt to be smart about scheduling. In the interest of getting the ball rolling, it would be desirable to see this much merged for 3.8 with the view to building scheduler smarts on top and adapting the VM where required for 3.9. The most recent set of comparisons available from different people are mel: https://lkml.org/lkml/2012/12/9/108 mingo: https://lkml.org/lkml/2012/12/7/331 tglx: https://lkml.org/lkml/2012/12/10/437 srikar: https://lkml.org/lkml/2012/12/10/397 The results are a mixed bag. In my own tests, balancenuma does reasonably well. It's dumb as rocks and does not regress against mainline. On the other hand, Ingo's tests shows that balancenuma is incapable of converging for this workloads driven by perf which is bad but is potentially explained by the lack of scheduler smarts. Thomas' results show balancenuma improves on mainline but falls far short of numacore or autonuma. Srikar's results indicate we all suffer on a large machine with imbalanced node sizes. My own testing showed that recent numacore results have improved dramatically, particularly in the last week but not universally. We've butted heads heavily on system CPU usage and high levels of migration even when it shows that overall performance is better. There are also cases where it regresses. Of interest is that for specjbb in some configurations it will regress for lower numbers of warehouses and show gains for higher numbers which is not reported by the tool by default and sometimes missed in treports. Recently I reported for numacore that the JVM was crashing with NullPointerExceptions but currently it's unclear what the source of this problem is. Initially I thought it was in how numacore batch handles PTEs but I'm no longer think this is the case. It's possible numacore is just able to trigger it due to higher rates of migration. These reports were quite late in the cycle so I/we would like to start with this tree as it contains much of the code we can agree on and has not changed significantly over the last 2-3 weeks." * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits) mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable mm/rmap: Convert the struct anon_vma::mutex to an rwsem mm: migrate: Account a transhuge page properly when rate limiting mm: numa: Account for failed allocations and isolations as migration failures mm: numa: Add THP migration for the NUMA working set scanning fault case build fix mm: numa: Add THP migration for the NUMA working set scanning fault case. mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG mm: sched: numa: Control enabling and disabling of NUMA balancing mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships mm: numa: migrate: Set last_nid on newly allocated page mm: numa: split_huge_page: Transfer last_nid on tail page mm: numa: Introduce last_nid to the page frame sched: numa: Slowly increase the scanning period as NUMA faults are handled mm: numa: Rate limit setting of pte_numa if node is saturated mm: numa: Rate limit the amount of memory that is migrated between nodes mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting mm: numa: Migrate pages handled during a pmd_numa hinting fault mm: numa: Migrate on reference policy ...
2012-12-12mm: introduce new field "managed_pages" to struct zoneJiang Liu
Currently a zone's present_pages is calcuated as below, which is inaccurate and may cause trouble to memory hotplug. spanned_pages - absent_pages - memmap_pages - dma_reserve. During fixing bugs caused by inaccurate zone->present_pages, we found zone->present_pages has been abused. The field zone->present_pages may have different meanings in different contexts: 1) pages existing in a zone. 2) pages managed by the buddy system. For more discussions about the issue, please refer to: http://lkml.org/lkml/2012/11/5/866 https://patchwork.kernel.org/patch/1346751/ This patchset tries to introduce a new field named "managed_pages" to struct zone, which counts "pages managed by the buddy system". And revert zone->present_pages to count "physical pages existing in a zone", which also keep in consistence with pgdat->node_present_pages. We will set an initial value for zone->managed_pages in function free_area_init_core() and will adjust it later if the initial value is inaccurate. For DMA/normal zones, the initial value is set to: (spanned_pages - absent_pages - memmap_pages - dma_reserve) Later zone->managed_pages will be adjusted to the accurate value when the bootmem allocator frees all free pages to the buddy system in function free_all_bootmem_node() and free_all_bootmem(). The bootmem allocator doesn't touch highmem pages, so highmem zones' managed_pages is set to the accurate value "spanned_pages - absent_pages" in function free_area_init_core() and won't be updated anymore. This patch also adds a new field "managed_pages" to /proc/zoneinfo and sysrq showmem. [akpm@linux-foundation.org: small comment tweaks] Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Maciej Rutecki <maciej.rutecki@gmail.com> Tested-by: Chris Clayton <chris2553@googlemail.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11mm: cma: remove watermark hacksMarek Szyprowski
Commits 2139cbe627b8 ("cma: fix counting of isolated pages") and d95ea5d18e69 ("cma: fix watermark checking") introduced a reliable method of free page accounting when memory is being allocated from CMA regions, so the workaround introduced earlier by commit 49f223a9cd96 ("mm: trigger page reclaim in alloc_contig_range() to stabilise watermarks") can be finally removed. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11mm: numa: Structures for Migrate On Fault per NUMA migration rate limitingAndrea Arcangeli
This defines the per-node data used by Migrate On Fault in order to rate limit the migration. The rate limiting is applied independently to each destination node. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-11-16memcg: fix hotplugged memory zone oopsHugh Dickins
When MEMCG is configured on (even when it's disabled by boot option), when adding or removing a page to/from its lru list, the zone pointer used for stats updates is nowadays taken from the struct lruvec. (On many configurations, calculating zone from page is slower.) But we have no code to update all the lruvecs (per zone, per memcg) when a memory node is hotadded. Here's an extract from the oops which results when running numactl to bind a program to a newly onlined node: BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60 IP: __mod_zone_page_state+0x9/0x60 Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0) Call Trace: __pagevec_lru_add_fn+0xdf/0x140 pagevec_lru_move_fn+0xb1/0x100 __pagevec_lru_add+0x1c/0x30 lru_add_drain_cpu+0xa3/0x130 lru_add_drain+0x2f/0x40 ... The natural solution might be to use a memcg callback whenever memory is hotadded; but that solution has not been scoped out, and it happens that we do have an easy location at which to update lruvec->zone. The lruvec pointer is discovered either by mem_cgroup_zone_lruvec() or by mem_cgroup_page_lruvec(), and both of those do know the right zone. So check and set lruvec->zone in those; and remove the inadequate attempt to set lruvec->zone from lruvec_init(), which is called before NODE_DATA(node) has been allocated in such cases. Ah, there was one exceptionr. For no particularly good reason, mem_cgroup_force_empty_list() has its own code for deciding lruvec. Change it to use the standard mem_cgroup_zone_lruvec() and mem_cgroup_get_lru_size() too. In fact it was already safe against such an oops (the lru lists in danger could only be empty), but we're better proofed against future changes this way. I've marked this for stable (3.6) since we introduced the problem in 3.5 (now closed to stable); but I have no idea if this is the only fix needed to get memory hotadd working with memcg in 3.6, and received no answer when I enquired twice before. Reported-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09CMA: migrate mlocked pagesMinchan Kim
Presently CMA cannot migrate mlocked pages so it ends up failing to allocate contiguous memory space. This patch makes mlocked pages be migrated out. Of course, it can affect realtime processes but in CMA usecase, contiguous memory allocation failing is far worse than access latency to an mlocked page being variable while CMA is running. If someone wants to make the system realtime, he shouldn't enable CMA because stalls can still happen at random times. [akpm@linux-foundation.org: tweak comment text, per Mel] Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm, numa: reclaim from all nodes within reclaim distanceDavid Rientjes
RECLAIM_DISTANCE represents the distance between nodes at which it is deemed too costly to allocate from; it's preferred to try to reclaim from a local zone before falling back to allocating on a remote node with such a distance. To do this, zone_reclaim_mode is set if the distance between any two nodes on the system is greather than this distance. This, however, ends up causing the page allocator to reclaim from every zone regardless of its affinity. What we really want is to reclaim only from zones that are closer than RECLAIM_DISTANCE. This patch adds a nodemask to each node that represents the set of nodes that are within this distance. During the zone iteration, if the bit for a zone's node is set for the local node, then reclaim is attempted; otherwise, the zone is skipped. [akpm@linux-foundation.org: fix CONFIG_NUMA=n build] Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: compaction: clear PG_migrate_skip based on compaction and reclaim activityMel Gorman
Compaction caches if a pageblock was scanned and no pages were isolated so that the pageblocks can be skipped in the future to reduce scanning. This information is not cleared by the page allocator based on activity due to the impact it would have to the page allocator fast paths. Hence there is a requirement that something clear the cache or pageblocks will be skipped forever. Currently the cache is cleared if there were a number of recent allocation failures and it has not been cleared within the last 5 seconds. Time-based decisions like this are terrible as they have no relationship to VM activity and is basically a big hammer. Unfortunately, accurate heuristics would add cost to some hot paths so this patch implements a rough heuristic. There are two cases where the cache is cleared. 1. If a !kswapd process completes a compaction cycle (migrate and free scanner meet), the zone is marked compact_blockskip_flush. When kswapd goes to sleep, it will clear the cache. This is expected to be the common case where the cache is cleared. It does not really matter if kswapd happens to be asleep or going to sleep when the flag is set as it will be woken on the next allocation request. 2. If there have been multiple failures recently and compaction just finished being deferred then a process will clear the cache and start a full scan. This situation happens if there are multiple high-order allocation requests under heavy memory pressure. The clearing of the PG_migrate_skip bits and other scans is inherently racy but the race is harmless. For allocations that can fail such as THP, they will simply fail. For requests that cannot fail, they will retry the allocation. Tests indicated that scanning rates were roughly similar to when the time-based heuristic was used and the allocation success rates were similar. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: compaction: Restart compaction from near where it left offMel Gorman
This is almost entirely based on Rik's previous patches and discussions with him about how this might be implemented. Order > 0 compaction stops when enough free pages of the correct page order have been coalesced. When doing subsequent higher order allocations, it is possible for compaction to be invoked many times. However, the compaction code always starts out looking for things to compact at the start of the zone, and for free pages to compact things to at the end of the zone. This can cause quadratic behaviour, with isolate_freepages starting at the end of the zone each time, even though previous invocations of the compaction code already filled up all free memory on that end of the zone. This can cause isolate_freepages to take enormous amounts of CPU with certain workloads on larger memory systems. This patch caches where the migration and free scanner should start from on subsequent compaction invocations using the pageblock-skip information. When compaction starts it begins from the cached restart points and will update the cached restart points until a page is isolated or a pageblock is skipped that would have been scanned by synchronous compaction. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: compaction: cache if a pageblock was scanned and no pages were isolatedMel Gorman
When compaction was implemented it was known that scanning could potentially be excessive. The ideal was that a counter be maintained for each pageblock but maintaining this information would incur a severe penalty due to a shared writable cache line. It has reached the point where the scanning costs are a serious problem, particularly on long-lived systems where a large process starts and allocates a large number of THPs at the same time. Instead of using a shared counter, this patch adds another bit to the pageblock flags called PG_migrate_skip. If a pageblock is scanned by either migrate or free scanner and 0 pages were isolated, the pageblock is marked to be skipped in the future. When scanning, this bit is checked before any scanning takes place and the block skipped if set. The main difficulty with a patch like this is "when to ignore the cached information?" If it's ignored too often, the scanning rates will still be excessive. If the information is too stale then allocations will fail that might have otherwise succeeded. In this patch o CMA always ignores the information o If the migrate and free scanner meet then the cached information will be discarded if it's at least 5 seconds since the last time the cache was discarded o If there are a large number of allocation failures, discard the cache. The time-based heuristic is very clumsy but there are few choices for a better event. Depending solely on multiple allocation failures still allows excessive scanning when THP allocations are failing in quick succession due to memory pressure. Waiting until memory pressure is relieved would cause compaction to continually fail instead of using reclaim/compaction to try allocate the page. The time-based mechanism is clumsy but a better option is not obvious. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>