summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2011-11-07vm: fix vm_pgoff wrap in upward expansionHugh Dickins
commit 42c36f63ac1366ab0ecc2d5717821362c259f517 upstream. Commit a626ca6a6564 ("vm: fix vm_pgoff wrap in stack expansion") fixed the case of an expanding mapping causing vm_pgoff wrapping when you had downward stack expansion. But there was another case where IA64 and PA-RISC expand mappings: upward expansion. This fixes that case too. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-07vm: fix vm_pgoff wrap in stack expansionLinus Torvalds
commit a626ca6a656450e9f4df91d0dda238fff23285f4 upstream. Commit 982134ba6261 ("mm: avoid wrapping vm_pgoff in mremap()") fixed the case of a expanding mapping causing vm_pgoff wrapping when you used mremap. But there was another case where we expand mappings hiding in plain sight: the automatic stack expansion. This fixes that case too. This one also found by Robert Święcki, using his nasty system call fuzzer tool. Good job. Reported-and-tested-by: Robert Święcki <robert@swiecki.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29mm: fix wrong vmap address calculations with odd NR_CPUS valuesClemens Ladisch
commit f982f91516fa4cfd9d20518833cd04ad714585be upstream. Commit db64fe02258f ("mm: rewrite vmap layer") introduced code that does address calculations under the assumption that VMAP_BLOCK_SIZE is a power of two. However, this might not be true if CONFIG_NR_CPUS is not set to a power of two. Wrong vmap_block index/offset values could lead to memory corruption. However, this has never been observed in practice (or never been diagnosed correctly); what caught this was the BUG_ON in vb_alloc() that checks for inconsistent vmap_block indices. To fix this, ensure that VMAP_BLOCK_SIZE always is a power of two. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=31572 Reported-by: Pavel Kysilka <goldenfish@linuxsoft.cz> Reported-by: Matias A. Fonzo <selk@dragora.org> Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Cc: Nick Piggin <npiggin@suse.de> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-07-13mm: prevent concurrent unmap_mapping_range() on the same inodeMiklos Szeredi
commit 2aa15890f3c191326678f1bd68af61ec6b8753ec upstream. Michael Leun reported that running parallel opens on a fuse filesystem can trigger a "kernel BUG at mm/truncate.c:475" Gurudas Pai reported the same bug on NFS. The reason is, unmap_mapping_range() is not prepared for more than one concurrent invocation per inode. For example: thread1: going through a big range, stops in the middle of a vma and stores the restart address in vm_truncate_count. thread2: comes in with a small (e.g. single page) unmap request on the same vma, somewhere before restart_address, finds that the vma was already unmapped up to the restart address and happily returns without doing anything. Another scenario would be two big unmap requests, both having to restart the unmapping and each one setting vm_truncate_count to its own value. This could go on forever without any of them being able to finish. Truncate and hole punching already serialize with i_mutex. Other callers of unmap_mapping_range() do not, and it's difficult to get i_mutex protection for all callers. In particular ->d_revalidate(), which calls invalidate_inode_pages2_range() in fuse, may be called with or without i_mutex. This patch adds a new mutex to 'struct address_space' to prevent running multiple concurrent unmap_mapping_range() on the same mapping. [ We'll hopefully get rid of all this with the upcoming mm preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex lockbreak" patch in particular. But that is for 2.6.39 ] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Reported-by: Michael Leun <lkml20101129@newton.leun.net> Reported-by: Gurudas Pai <gurudas.pai@oracle.com> Tested-by: Gurudas Pai <gurudas.pai@oracle.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-07-13mm: fix negative commitlimit when gigantic hugepages are allocatedRafael Aquini
commit b0320c7b7d1ac1bd5c2d9dff3258524ab39bad32 upstream. When 1GB hugepages are allocated on a system, free(1) reports less available memory than what really is installed in the box. Also, if the total size of hugepages allocated on a system is over half of the total memory size, CommitLimit becomes a negative number. The problem is that gigantic hugepages (order > MAX_ORDER) can only be allocated at boot with bootmem, thus its frames are not accounted to 'totalram_pages'. However, they are accounted to hugetlb_total_pages() What happens to turn CommitLimit into a negative number is this calculation, in fs/proc/meminfo.c: allowed = ((totalram_pages - hugetlb_total_pages()) * sysctl_overcommit_ratio / 100) + total_swap_pages; A similar calculation occurs in __vm_enough_memory() in mm/mmap.c. Also, every vm statistic which depends on 'totalram_pages' will render confusing values, as if system were 'missing' some part of its memory. Impact of this bug: When gigantic hugepages are allocated and sysctl_overcommit_memory == OVERCOMMIT_NEVER. In a such situation, __vm_enough_memory() goes through the mentioned 'allowed' calculation and might end up mistakenly returning -ENOMEM, thus forcing the system to start reclaiming pages earlier than it would be ususal, and this could cause detrimental impact to overall system's performance, depending on the workload. Besides the aforementioned scenario, I can only think of this causing annoyances with memory reports from /proc/meminfo and free(1). [akpm@linux-foundation.org: standardize comment layout] Reported-by: Russ Anderson <rja@sgi.com> Signed-off-by: Rafael Aquini <aquini@linux.com> Acked-by: Russ Anderson <rja@sgi.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-07-13migrate: don't account swapcache as shmemAndrea Arcangeli
commit 99a15e21d96f6857dafab1e5167e5e8183215c9c upstream. swapcache will reach the below code path in migrate_page_move_mapping, and swapcache is accounted as NR_FILE_PAGES but it's not accounted as NR_SHMEM. Hugh pointed out we must use PageSwapCache instead of comparing mapping to &swapper_space, to avoid build failure with CONFIG_SWAP=n. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-07-13ksm: fix NULL pointer dereference in scan_get_next_rmap_item()Hugh Dickins
commit 2b472611a32a72f4a118c069c2d62a1a3f087afd upstream. Andrea Righi reported a case where an exiting task can race against ksmd::scan_get_next_rmap_item (http://lkml.org/lkml/2011/6/1/742) easily triggering a NULL pointer dereference in ksmd. ksm_scan.mm_slot == &ksm_mm_head with only one registered mm CPU 1 (__ksm_exit) CPU 2 (scan_get_next_rmap_item) list_empty() is false lock slot == &ksm_mm_head list_del(slot->mm_list) (list now empty) unlock lock slot = list_entry(slot->mm_list.next) (list is empty, so slot is still ksm_mm_head) unlock slot->mm == NULL ... Oops Close this race by revalidating that the new slot is not simply the list head again. Andrea's test case: #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #define BUFSIZE getpagesize() int main(int argc, char **argv) { void *ptr; if (posix_memalign(&ptr, getpagesize(), BUFSIZE) < 0) { perror("posix_memalign"); exit(1); } if (madvise(ptr, BUFSIZE, MADV_MERGEABLE) < 0) { perror("madvise"); exit(1); } *(char *)NULL = 0; return 0; } Reported-by: Andrea Righi <andrea@betterlinux.com> Tested-by: Andrea Righi <andrea@betterlinux.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-06-23mm: fix ENOSPC returned by handle_mm_fault()Hugh Dickins
commit e0dcd8a05be438b3d2e49ef61441ea3a463663f8 upstream. Al Viro observes that in the hugetlb case, handle_mm_fault() may return a value of the kind ENOSPC when its caller is expecting a value of the kind VM_FAULT_SIGBUS: fix alloc_huge_page()'s failure returns. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-06-23mm/page_alloc.c: prevent unending loop in __alloc_pages_slowpath()Andrew Barry
commit cfa54a0fcfc1017c6f122b6f21aaba36daa07f71 upstream. I believe I found a problem in __alloc_pages_slowpath, which allows a process to get stuck endlessly looping, even when lots of memory is available. Running an I/O and memory intensive stress-test I see a 0-order page allocation with __GFP_IO and __GFP_WAIT, running on a system with very little free memory. Right about the same time that the stress-test gets killed by the OOM-killer, the utility trying to allocate memory gets stuck in __alloc_pages_slowpath even though most of the systems memory was freed by the oom-kill of the stress-test. The utility ends up looping from the rebalance label down through the wait_iff_congested continiously. Because order=0, __alloc_pages_direct_compact skips the call to get_page_from_freelist. Because all of the reclaimable memory on the system has already been reclaimed, __alloc_pages_direct_reclaim skips the call to get_page_from_freelist. Since there is no __GFP_FS flag, the block with __alloc_pages_may_oom is skipped. The loop hits the wait_iff_congested, then jumps back to rebalance without ever trying to get_page_from_freelist. This loop repeats infinitely. The test case is pretty pathological. Running a mix of I/O stress-tests that do a lot of fork() and consume all of the system memory, I can pretty reliably hit this on 600 nodes, in about 12 hours. 32GB/node. Signed-off-by: Andrew Barry <abarry@cray.com> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Rik van Riel<riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-06-23kmemleak: Do not return a pointer to an object that kmemleak did not getCatalin Marinas
commit 52c3ce4ec5601ee383a14f1485f6bac7b278896e upstream. The kmemleak_seq_next() function tries to get an object (and increment its use count) before returning it. If it could not get the last object during list traversal (because it may have been freed), the function should return NULL rather than a pointer to such object that it did not get. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Acked-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-04-14mm: avoid wrapping vm_pgoff in mremap()Linus Torvalds
commit 982134ba62618c2d69fbbbd166d0a11ee3b7e3d8 upstream. The normal mmap paths all avoid creating a mapping where the pgoff inside the mapping could wrap around due to overflow. However, an expanding mremap() can take such a non-wrapping mapping and make it bigger and cause a wrapping condition. Noticed by Robert Swiecki when running a system call fuzzer, where it caused a BUG_ON() due to terminally confusing the vma_prio_tree code. A vma dumping patch by Hugh then pinpointed the crazy wrapped case. Reported-and-tested-by: Robert Swiecki <robert@swiecki.net> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-28shmem: let shared anonymous be nonlinear againHugh Dickins
commit bee4c36a5cf5c9f63ce1d7372aa62045fbd16d47 upstream. Up to 2.6.22, you could use remap_file_pages(2) on a tmpfs file or a shared mapping of /dev/zero or a shared anonymous mapping. In 2.6.23 we disabled it by default, but set VM_CAN_NONLINEAR to enable it on safe mappings. We made sure to set it in shmem_mmap() for tmpfs files, but missed it in shmem_zero_setup() for the others. Fix that at last. Reported-by: Kenny Simpson <theonetruekenny@yahoo.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-21percpu: fix pcpu_last_unit_cpuTejun Heo
commit 46b30ea9bc3698bc1d1e6fd726c9601d46fa0a91 upstream. pcpu_first/last_unit_cpu are used to track which cpu has the first and last units assigned. This in turn is used to determine the span of a chunk for man/unmap cache flushes and whether an address belongs to the first chunk or not in per_cpu_ptr_to_phys(). When the number of possible CPUs isn't power of two, a chunk may contain unassigned units towards the end of a chunk. The logic to determine pcpu_last_unit_cpu was incorrect when there was an unused unit at the end of a chunk. It failed to ignore the unused unit and assigned the unused marker NR_CPUS to pcpu_last_unit_cpu. This was discovered through kdump failure which was caused by malfunctioning per_cpu_ptr_to_phys() on a kvm setup with 50 possible CPUs by CAI Qian. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: CAI Qian <caiqian@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-21mm: page allocator: update free page counters after pages are placed on the ↵Mel Gorman
free list commit 72853e2991a2702ae93aaf889ac7db743a415dd3 upstream. When allocating a page, the system uses NR_FREE_PAGES counters to determine if watermarks would remain intact after the allocation was made. This check is made without interrupts disabled or the zone lock held and so is race-prone by nature. Unfortunately, when pages are being freed in batch, the counters are updated before the pages are added on the list. During this window, the counters are misleading as the pages do not exist yet. When under significant pressure on systems with large numbers of CPUs, it's possible for processes to make progress even though they should have been stalled. This is particularly problematic if a number of the processes are using GFP_ATOMIC as the min watermark can be accidentally breached and in extreme cases, the system can livelock. This patch updates the counters after the pages have been added to the list. This makes the allocator more cautious with respect to preserving the watermarks and mitigates livelock possibilities. [akpm@linux-foundation.org: avoid modifying incoming args] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-21mm: page allocator: drain per-cpu lists after direct reclaim allocation failsMel Gorman
commit 9ee493ce0a60bf42c0f8fd0b0fe91df5704a1cbf upstream. When under significant memory pressure, a process enters direct reclaim and immediately afterwards tries to allocate a page. If it fails and no further progress is made, it's possible the system will go OOM. However, on systems with large amounts of memory, it's possible that a significant number of pages are on per-cpu lists and inaccessible to the calling process. This leads to a process entering direct reclaim more often than it should increasing the pressure on the system and compounding the problem. This patch notes that if direct reclaim is making progress but allocations are still failing that the system is already under heavy pressure. In this case, it drains the per-cpu lists and tries the allocation a second time before continuing. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-21mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory ↵Christoph Lameter
is low and kswapd is awake commit aa45484031ddee09b06350ab8528bfe5b2c76d1c upstream. Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is cheaper than scanning a number of lists. To avoid synchronization overhead, counter deltas are maintained on a per-cpu basis and drained both periodically and when the delta is above a threshold. On large CPU systems, the difference between the estimated and real value of NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than number of real free page in buddy, the VM can allocate pages below min watermark, at worst reducing the real number of pages to zero. Even if the OOM killer kills some victim for freeing memory, it may not free memory if the exit path requires a new page resulting in livelock. This patch introduces a zone_page_state_snapshot() function (courtesy of Christoph) that takes a slightly more accurate view of an arbitrary vmstat counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid the watermark being accidentally broken. The estimate is not perfect and may result in cache line bounces but is expected to be lighter than the IPI calls necessary to continually drain the per-cpu counters while kswapd is awake. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-21mm: fix possible cause of a page_mapped BUGHugh Dickins
commit a3e8cc643d22d2c8ed36b9be7d9c9ca21efcf7f7 upstream. Robert Swiecki reported a BUG_ON(page_mapped) from a fuzzer, punching a hole with madvise(,, MADV_REMOVE). That path is under mutex, and cannot be explained by lack of serialization in unmap_mapping_range(). Reviewing the code, I found one place where vm_truncate_count handling should have been updated, when I switched at the last minute from one way of managing the restart_addr to another: mremap move changes the virtual addresses, so it ought to adjust the restart_addr. But rather than exporting the notion of restart_addr from memory.c, or converting to restart_pgoff throughout, simply reset vm_truncate_count to 0 to force a rescan if mremap move races with preempted truncation. We have no confirmation that this fixes Robert's BUG, but it is a fix that's worth making anyway. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Kerin Millar <kerframil@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-21install_special_mapping skips security_file_mmap check.Tavis Ormandy
commit 462e635e5b73ba9a4c03913b77138cd57ce4b050 upstream. The install_special_mapping routine (used, for example, to setup the vdso) skips the security check before insert_vm_struct, allowing a local attacker to bypass the mmap_min_addr security restriction by limiting the available pages for special mappings. bprm_mm_init() also skips the check, and although I don't think this can be used to bypass any restrictions, I don't see any reason not to have the security check. $ uname -m x86_64 $ cat /proc/sys/vm/mmap_min_addr 65536 $ cat install_special_mapping.s section .bss resb BSS_SIZE section .text global _start _start: mov eax, __NR_pause int 0x80 $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o $ ./install_special_mapping & [1] 14303 $ cat /proc/14303/maps 0000f000-00010000 r-xp 00000000 00:00 0 [vdso] 00010000-00011000 r-xp 00001000 00:19 2453665 /home/taviso/install_special_mapping 00011000-ffffe000 rwxp 00000000 00:00 0 [stack] It's worth noting that Red Hat are shipping with mmap_min_addr set to 4096. Signed-off-by: Tavis Ormandy <taviso@google.com> Acked-by: Kees Cook <kees@ubuntu.com> Acked-by: Robert Swiecki <swiecki@google.com> [ Changed to not drop the error code - akpm ] Reviewed-by: James Morris <jmorris@namei.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-21vmscan: raise the bar to PAGEOUT_IO_SYNC stallsWu Fengguang
commit e31f3698cd3499e676f6b0ea12e3528f569c4fa3 upstream. Fix "system goes unresponsive under memory pressure and lots of dirty/writeback pages" bug. http://lkml.org/lkml/2010/4/4/86 In the above thread, Andreas Mohr described that Invoking any command locked up for minutes (note that I'm talking about attempted additional I/O to the _other_, _unaffected_ main system HDD - such as loading some shell binaries -, NOT the external SSD18M!!). This happens when the two conditions are both meet: - under memory pressure - writing heavily to a slow device OOM also happens in Andreas' system. The OOM trace shows that 3 processes are stuck in wait_on_page_writeback() in the direct reclaim path. One in do_fork() and the other two in unix_stream_sendmsg(). They are blocked on this condition: (sc->order && priority < DEF_PRIORITY - 2) which was introduced in commit 78dc583d (vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC) one year ago. That condition may be too permissive. In Andreas' case, 512MB/1024 = 512KB. If the direct reclaim for the order-1 fork() allocation runs into a range of 512KB hard-to-reclaim LRU pages, it will be stalled. It's a severe problem in three ways. Firstly, it can easily happen in daily desktop usage. vmscan priority can easily go below (DEF_PRIORITY - 2) on _local_ memory pressure. Even if the system has 50% globally reclaimable pages, it still has good opportunity to have 0.1% sized hard-to-reclaim ranges. For example, a simple dd can easily create a big range (up to 20%) of dirty pages in the LRU lists. And order-1 to order-3 allocations are more than common with SLUB. Try "grep -v '1 :' /proc/slabinfo" to get the list of high order slab caches. For example, the order-1 radix_tree_node slab cache may stall applications at swap-in time; the order-3 inode cache on most filesystems may stall applications when trying to read some file; the order-2 proc_inode_cache may stall applications when trying to open a /proc file. Secondly, once triggered, it will stall unrelated processes (not doing IO at all) in the system. This "one slow USB device stalls the whole system" avalanching effect is very bad. Thirdly, once stalled, the stall time could be intolerable long for the users. When there are 20MB queued writeback pages and USB 1.1 is writing them in 1MB/s, wait_on_page_writeback() will stuck for up to 20 seconds. Not to mention it may be called multiple times. So raise the bar to only enable PAGEOUT_IO_SYNC when priority goes below DEF_PRIORITY/3, or 6.25% LRU size. As the default dirty throttle ratio is 20%, it will hardly be triggered by pure dirty pages. We'd better treat PAGEOUT_IO_SYNC as some last resort workaround -- its stall time is so uncomfortably long (easily goes beyond 1s). The bar is only raised for (order < PAGE_ALLOC_COSTLY_ORDER) allocations, which are easy to satisfy in 1TB memory boxes. So, although 6.25% of memory could be an awful lot of pages to scan on a system with 1TB of memory, it won't really have to busy scan that much. Andreas tested an older version of this patch and reported that it mostly fixed his problem. Mel Gorman helped improve it and KOSAKI Motohiro will fix it further in the next patch. Reported-by: Andreas Mohr <andi@lisas.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-21mm: fix corruption of hibernation caused by reusing swap during image savingKAMEZAWA Hiroyuki
commit 966cca029f739716fbcc8068b8c6dfe381f86fc3 upstream. Since 2.6.31, swap_map[]'s refcounting was changed to show that a used swap entry is just for swap-cache, can be reused. Then, while scanning free entry in swap_map[], a swap entry may be able to be reclaimed and reused. It was caused by commit c9e444103b5e7a5 ("mm: reuse unused swap entry if necessary"). But this caused deta corruption at resume. The scenario is - Assume a clean-swap cache, but mapped. - at hibernation_snapshot[], clean-swap-cache is saved as clean-swap-cache and swap_map[] is marked as SWAP_HAS_CACHE. - then, save_image() is called. And reuse SWAP_HAS_CACHE entry to save image, and break the contents. After resume: - the memory reclaim runs and finds clean-not-referenced-swap-cache and discards it because it's marked as clean. But here, the contents on disk and swap-cache is inconsistent. Hance memory is corrupted. This patch avoids the bug by not reclaiming swap-entry during hibernation. This is a quick fix for backporting. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Reported-by: Ondreg Zary <linux@rainbow-software.org> Tested-by: Ondreg Zary <linux@rainbow-software.org> Tested-by: Andrea Gelmini <andrea.gelmini@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-21mm: fix up some user-visible effects of the stack guard pageLinus Torvalds
commit d7824370e26325c881b665350ce64fb0a4fde24a upstream. This commit makes the stack guard page somewhat less visible to user space. It does this by: - not showing the guard page in /proc/<pid>/maps It looks like lvm-tools will actually read /proc/self/maps to figure out where all its mappings are, and effectively do a specialized "mlockall()" in user space. By not showing the guard page as part of the mapping (by just adding PAGE_SIZE to the start for grows-up pages), lvm-tools ends up not being aware of it. - by also teaching the _real_ mlock() functionality not to try to lock the guard page. That would just expand the mapping down to create a new guard page, so there really is no point in trying to lock it in place. It would perhaps be nice to show the guard page specially in /proc/<pid>/maps (or at least mark grow-down segments some way), but let's not open ourselves up to more breakage by user space from programs that depends on the exact deails of the 'maps' file. Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools source code to see what was going on with the whole new warning. Reported-and-tested-by: François Valenduc <francois.valenduc@tvcablenet.be Reported-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-21mm: fix page table unmap for stack guard page properlyLinus Torvalds
commit 11ac552477e32835cb6970bf0a70c210807f5673 upstream. We do in fact need to unmap the page table _before_ doing the whole stack guard page logic, because if it is needed (mainly 32-bit x86 with PAE and CONFIG_HIGHPTE, but other architectures may use it too) then it will do a kmap_atomic/kunmap_atomic. And those kmaps will create an atomic region that we cannot do allocations in. However, the whole stack expand code will need to do anon_vma_prepare() and vma_lock_anon_vma() and they cannot do that in an atomic region. Now, a better model might actually be to do the anon_vma_prepare() when _creating_ a VM_GROWSDOWN segment, and not have to worry about any of this at page fault time. But in the meantime, this is the straightforward fix for the issue. See https://bugzilla.kernel.org/show_bug.cgi?id=16588 for details. Reported-by: Wylda <wylda@volny.cz> Reported-by: Sedat Dilek <sedat.dilek@gmail.com> Reported-by: Mike Pagano <mpagano@gentoo.org> Reported-by: François Valenduc <francois.valenduc@tvcablenet.be> Tested-by: Ed Tomlinson <edt@aei.ca> Cc: Pekka Enberg <penberg@kernel.org> Cc: Greg KH <gregkh@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-21mm: fix missing page table unmap for stack guard page failure caseLinus Torvalds
commit 5528f9132cf65d4d892bcbc5684c61e7822b21e9 upstream. .. which didn't show up in my tests because it's a no-op on x86-64 and most other architectures. But we enter the function with the last-level page table mapped, and should unmap it at exit. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-21mm: keep a guard page below a grow-down stack segmentLinus Torvalds
commit 320b2b8de12698082609ebbc1a17165727f4c893 upstream. This is a rather minimally invasive patch to solve the problem of the user stack growing into a memory mapped area below it. Whenever we fill the first page of the stack segment, expand the segment down by one page. Now, admittedly some odd application might _want_ the stack to grow down into the preceding memory mapping, and so we may at some point need to make this a process tunable (some people might also want to have more than a single page of guarding), but let's try the minimal approach first. Tested with trivial application that maps a single page just below the stack, and then starts recursing. Without this, we will get a SIGSEGV _after_ the stack has smashed the mapping. With this patch, we'll get a nice SIGBUS just as the stack touches the page just above the mapping. Requested-by: Keith Packard <keithp@keithp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-21perf_events: Fix perf_counter_mmap() hook in mprotect()Pekka Enberg
commit 63bfd7384b119409685a17d5c58f0b56e5dc03da upstream. As pointed out by Linus, commit dab5855 ("perf_counter: Add mmap event hooks to mprotect()") is fundamentally wrong as mprotect_fixup() can free 'vma' due to merging. Fix the problem by moving perf_event_mmap() hook to mprotect_fixup(). Note: there's another successful return path from mprotect_fixup() if old flags equal to new flags. We don't, however, need to call perf_event_mmap() there because 'perf' already knows the VMA is executable. Reported-by: Dave Jones <davej@redhat.com> Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-21nommu: yield CPU while disposing VMSteven J. Magnani
commit 04c3496152394d17e3bc2316f9731ee3e8a026bc upstream. Depending on processor speed, page size, and the amount of memory a process is allowed to amass, cleanup of a large VM may freeze the system for many seconds. This can result in a watchdog timeout. Make sure other tasks receive some service when cleaning up large VMs. Signed-off-by: Steven J. Magnani <steve@digidescorp.com> Cc: Greg Ungerer <gerg@snapgear.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-21mm/vfs: revalidate page->mapping in do_generic_file_read()Dave Hansen
commit 8d056cb965b8fb7c53c564abf28b1962d1061cd3 upstream. 70 hours into some stress tests of a 2.6.32-based enterprise kernel, we ran into a NULL dereference in here: int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc, unsigned long from) { ----> struct inode *inode = page->mapping->host; It looks like page->mapping was the culprit. (xmon trace is below). After closer examination, I realized that do_generic_file_read() does a find_get_page(), and eventually locks the page before calling block_is_partially_uptodate(). However, it doesn't revalidate the page->mapping after the page is locked. So, there's a small window between the find_get_page() and ->is_partially_uptodate() where the page could get truncated and page->mapping cleared. We _have_ a reference, so it can't get reclaimed, but it certainly can be truncated. I think the correct thing is to check page->mapping after the trylock_page(), and jump out if it got truncated. This patch has been running in the test environment for a month or so now, and we have not seen this bug pop up again. xmon info: 1f:mon> e cpu 0x1f: Vector: 300 (Data Access) at [c0000002ae36f770] pc: c0000000001e7a6c: .block_is_partially_uptodate+0xc/0x100 lr: c000000000142944: .generic_file_aio_read+0x1e4/0x770 sp: c0000002ae36f9f0 msr: 8000000000009032 dar: 0 dsisr: 40000000 current = 0xc000000378f99e30 paca = 0xc000000000f66300 pid = 21946, comm = bash 1f:mon> r R00 = 0025c0500000006d R16 = 0000000000000000 R01 = c0000002ae36f9f0 R17 = c000000362cd3af0 R02 = c000000000e8cd80 R18 = ffffffffffffffff R03 = c0000000031d0f88 R19 = 0000000000000001 R04 = c0000002ae36fa68 R20 = c0000003bb97b8a0 R05 = 0000000000000000 R21 = c0000002ae36fa68 R06 = 0000000000000000 R22 = 0000000000000000 R07 = 0000000000000001 R23 = c0000002ae36fbb0 R08 = 0000000000000002 R24 = 0000000000000000 R09 = 0000000000000000 R25 = c000000362cd3a80 R10 = 0000000000000000 R26 = 0000000000000002 R11 = c0000000001e7b60 R27 = 0000000000000000 R12 = 0000000042000484 R28 = 0000000000000001 R13 = c000000000f66300 R29 = c0000003bb97b9b8 R14 = 0000000000000001 R30 = c000000000e28a08 R15 = 000000000000ffff R31 = c0000000031d0f88 pc = c0000000001e7a6c .block_is_partially_uptodate+0xc/0x100 lr = c000000000142944 .generic_file_aio_read+0x1e4/0x770 msr = 8000000000009032 cr = 22000488 ctr = c0000000001e7a60 xer = 0000000020000000 trap = 300 dar = 0000000000000000 dsisr = 40000000 1f:mon> t [link register ] c000000000142944 .generic_file_aio_read+0x1e4/0x770 [c0000002ae36f9f0] c000000000142a14 .generic_file_aio_read+0x2b4/0x770 (unreliable) [c0000002ae36fb40] c0000000001b03e4 .do_sync_read+0xd4/0x160 [c0000002ae36fce0] c0000000001b153c .vfs_read+0xec/0x1f0 [c0000002ae36fd80] c0000000001b1768 .SyS_read+0x58/0xb0 [c0000002ae36fe30] c00000000000852c syscall_exit+0x0/0x40 --- Exception: c00 (System Call) at 00000080a840bc54 SP (fffca15df30) is in userspace 1f:mon> di c0000000001e7a6c c0000000001e7a6c e9290000 ld r9,0(r9) c0000000001e7a70 418200c0 beq c0000000001e7b30 # .block_is_partially_uptodate+0xd0/0x100 c0000000001e7a74 e9440008 ld r10,8(r4) c0000000001e7a78 78a80020 clrldi r8,r5,32 c0000000001e7a7c 3c000001 lis r0,1 c0000000001e7a80 812900a8 lwz r9,168(r9) c0000000001e7a84 39600001 li r11,1 c0000000001e7a88 7c080050 subf r0,r8,r0 c0000000001e7a8c 7f805040 cmplw cr7,r0,r10 c0000000001e7a90 7d6b4830 slw r11,r11,r9 c0000000001e7a94 796b0020 clrldi r11,r11,32 c0000000001e7a98 419d00a8 bgt cr7,c0000000001e7b40 # .block_is_partially_uptodate+0xe0/0x100 c0000000001e7a9c 7fa55840 cmpld cr7,r5,r11 c0000000001e7aa0 7d004214 add r8,r0,r8 c0000000001e7aa4 79080020 clrldi r8,r8,32 c0000000001e7aa8 419c0078 blt cr7,c0000000001e7b20 # .block_is_partially_uptodate+0xc0/0x100 Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: <arunabal@in.ibm.com> Cc: <sbest@us.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-21mm: fix is_mem_section_removable() page_order BUG_ON checkKAMEZAWA Hiroyuki
commit 572438f9b52236bd8938b1647cc15e027d27ef55 upstream. page_order() is called by memory hotplug's user interface to check the section is removable or not. (is_mem_section_removable()) It calls page_order() withoug holding zone->lock. So, even if the caller does if (PageBuddy(page)) ret = page_order(page) ... The caller may hit BUG_ON(). For fixing this, there are 2 choices. 1. add zone->lock. 2. remove BUG_ON(). is_mem_section_removable() is used for some "advice" and doesn't need to be 100% accurate. This is_removable() can be called via user program.. We don't want to take this important lock for long by user's request. So, this patch removes BUG_ON(). Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-21mm: fix return value of scan_lru_pages in memory unplugKAMEZAWA Hiroyuki
commit f8f72ad5396987e05a42cf7eff826fb2a15ff148 upstream. scan_lru_pages returns pfn. So, it's type should be "unsigned long" not "int". Note: I guess this has been work until now because memory hotplug tester's machine has not very big memory.... physical address < 32bit << PAGE_SHIFT. Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-21numa: fix slab_node(MPOL_BIND)Eric Dumazet
commit 800416f799e0723635ac2d720ad4449917a1481c upstream. When a node contains only HighMem memory, slab_node(MPOL_BIND) dereferences a NULL pointer. [ This code seems to go back all the way to commit 19770b32609b: "mm: filter based on a nodemask as well as a gfp_mask". Which was back in April 2008, and it got merged into 2.6.26. - Linus ] Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-21mm, x86: Saving vmcore with non-lazy freeing of vmasCliff Wickman
commit 3ee48b6af49cf534ca2f481ecc484b156a41451d upstream. During the reading of /proc/vmcore the kernel is doing ioremap()/iounmap() repeatedly. And the buildup of un-flushed vm_area_struct's is causing a great deal of overhead. (rb_next() is chewing up most of that time). This solution is to provide function set_iounmap_nonlazy(). It causes a subsequent call to iounmap() to immediately purge the vma area (with try_purge_vmap_area_lazy()). With this patch we have seen the time for writing a 250MB compressed dump drop from 71 seconds to 44 seconds. Signed-off-by: Cliff Wickman <cpw@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: kexec@lists.infradead.org LKML-Reference: <E1OwHZ4-0005WK-Tw@eag09.americas.sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-05do_generic_file_read: clear page errors when issuing a fresh read of the pageJeff Moyer
commit 91803b499cca2fe558abad709ce83dc896b80950 upstream. I/O errors can happen due to temporary failures, like multipath errors or losing network contact with the iSCSI server. Because of that, the VM will retry readpage on the page. However, do_generic_file_read does not clear PG_error. This causes the system to be unable to actually use the data in the page cache page, even if the subsequent readpage completes successfully! The function filemap_fault has had a ClearPageError before readpage forever. This patch simply adds the same to do_generic_file_read. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-05tmpfs: insert tmpfs cache pages to inactive list at firstKOSAKI Motohiro
commit e9d6c157385e4efa61cb8293e425c9d8beba70d3 upstream. Shaohua Li reported parallel file copy on tmpfs can lead to OOM killer. This is regression of caused by commit 9ff473b9a7 ("vmscan: evict streaming IO first"). Wow, It is 2 years old patch! Currently, tmpfs file cache is inserted active list at first. This means that the insertion doesn't only increase numbers of pages in anon LRU, but it also reduces anon scanning ratio. Therefore, vmscan will get totally confused. It scans almost only file LRU even though the system has plenty unused tmpfs pages. Historically, lru_cache_add_active_anon() was used for two reasons. 1) Intend to priotize shmem page rather than regular file cache. 2) Intend to avoid reclaim priority inversion of used once pages. But we've lost both motivation because (1) Now we have separate anon and file LRU list. then, to insert active list doesn't help such priotize. (2) In past, one pte access bit will cause page activation. then to insert inactive list with pte access bit mean higher priority than to insert active list. Its priority inversion may lead to uninteded lru chun. but it was already solved by commit 645747462 (vmscan: detect mapped file pages used only once). (Thanks Hannes, you are great!) Thus, now we can use lru_cache_add_anon() instead. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-26hugetlbfs: kill applications that use MAP_NORESERVE with SIGBUS instead of ↵Mel Gorman
OOM-killer commit 4a6018f7f4f1075c1a5403b5ec0ee7262187b86c upstream. Ordinarily, application using hugetlbfs will create mappings with reserves. For shared mappings, these pages are reserved before mmap() returns success and for private mappings, the caller process is guaranteed and a child process that cannot get the pages gets killed with sigbus. An application that uses MAP_NORESERVE gets no reservations and mmap() will always succeed at the risk the page will not be available at fault time. This might be used for example on very large sparse mappings where the developer is confident the necessary huge pages exist to satisfy all faults even though the whole mapping cannot be backed by huge pages. Unfortunately, if an allocation does fail, VM_FAULT_OOM is returned to the fault handler which proceeds to trigger the OOM-killer. This is unhelpful. Even without hugetlbfs mounted, a user using mmap() can trivially trigger the OOM-killer because VM_FAULT_OOM is returned (will provide example program if desired - it's a whopping 24 lines long). It could be considered a DOS available to an unprivileged user. This patch alters hugetlbfs to kill a process that uses MAP_NORESERVE where huge pages were not available with SIGBUS instead of triggering the OOM killer. This change affects hugetlb_cow() as well. I feel there is a failure case in there, but I didn't create one. It would need a fairly specific target in terms of the faulting application and the hugepage pool size. The hugetlb_no_page() path is much easier to hit but both might as well be closed. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-12hugetlb: fix infinite loop in get_futex_key() when backed by huge pagesMel Gorman
commit 23be7468e8802a2ac1de6ee3eecb3ec7f14dc703 upstream. If a futex key happens to be located within a huge page mapped MAP_PRIVATE, get_futex_key() can go into an infinite loop waiting for a page->mapping that will never exist. See https://bugzilla.redhat.com/show_bug.cgi?id=552257 for more details about the problem. This patch makes page->mapping a poisoned value that includes PAGE_MAPPING_ANON mapped MAP_PRIVATE. This is enough for futex to continue but because of PAGE_MAPPING_ANON, the poisoned value is not dereferenced or used by futex. No other part of the VM should be dereferencing the page->mapping of a hugetlbfs page as its page cache is not on the LRU. This patch fixes the problem with the test case described in the bugzilla. [akpm@linux-foundation.org: mel cant spel] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Darren Hart <darren@dvhart.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-12memcg: fix prepare migrationAndrea Arcangeli
commit 93d5c9be1ddd57d4063ce463c9ac2be1e5ee14f1 upstream. If a signal is pending (task being killed by sigkill) __mem_cgroup_try_charge will write NULL into &mem, and css_put will oops on null pointer dereference. BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 IP: [<ffffffff810fc6cc>] mem_cgroup_prepare_migration+0x7c/0xc0 PGD a5d89067 PUD a5d8a067 PMD 0 Oops: 0000 [#1] SMP last sysfs file: /sys/devices/platform/microcode/firmware/microcode/loading CPU 0 Modules linked in: nfs lockd nfs_acl auth_rpcgss sunrpc acpi_cpufreq pcspkr sg [last unloaded: microcode] Pid: 5299, comm: largepages Tainted: G W 2.6.34-rc3 #3 Penryn1600SLI-110dB/To Be Filled By O.E.M. RIP: 0010:[<ffffffff810fc6cc>] [<ffffffff810fc6cc>] mem_cgroup_prepare_migration+0x7c/0xc0 [nishimura@mxp.nes.nec.co.jp: fix merge issues] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-04-26readahead: fix NULL filp dereferenceWu Fengguang
commit 70655c06bd3f25111312d63985888112aed15ac5 upstream. btrfs relocate_file_extent_cluster() calls us with NULL filp: [ 4005.426805] BUG: unable to handle kernel NULL pointer dereference at 00000021 [ 4005.426818] IP: [<c109a130>] page_cache_sync_readahead+0x18/0x3e Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Yan Zheng <yanzheng@21cn.com> Reported-by: Kirill A. Shutemov <kirill@shutemov.name> Tested-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-04-01tmpfs: cleanup mpol_parse_str()KOSAKI Motohiro
commit 926f2ae04f183098cf9a30521776fb2759c8afeb upstream. mpol_parse_str() made lots 'err' variable related bug. Because it is ugly and reviewing unfriendly. This patch simplifies it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-04-01tmpfs: handle MPOL_LOCAL mount option properlyKOSAKI Motohiro
commit 12821f5fb942e795f8009ece14bde868893bd811 upstream. commit 71fe804b6d5 (mempolicy: use struct mempolicy pointer in shmem_sb_info) added mpol=local mount option. but its feature is broken since it was born. because such code always return 1 (i.e. mount failure). This patch fixes it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-04-01tmpfs: mpol=bind:0 don't cause mount error.KOSAKI Motohiro
commit d69b2e63e9172afb4d07c305601b79a55509ac4c upstream. Currently, following mount operation cause mount error. % mount -t tmpfs -ompol=bind:0 none /tmp Because commit 71fe804b6d5 (mempolicy: use struct mempolicy pointer in shmem_sb_info) corrupted MPOL_BIND parse code. This patch restore the needed one. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-04-01tmpfs: fix oops on mounts with mpol=defaultRavikiran G Thirumalai
commit 413b43deab8377819aba1dbad2abf0c15d59b491 upstream. Fix an 'oops' when a tmpfs mount point is mounted with the mpol=default mempolicy. Upon remounting a tmpfs mount point with 'mpol=default' option, the mount code crashed with a null pointer dereference. The initial problem report was on 2.6.27, but the problem exists in mainline 2.6.34-rc as well. On examining the code, we see that mpol_new returns NULL if default mempolicy was requested. This 'NULL' mempolicy is accessed to store the node mask resulting in oops. The following patch fixes it. Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-03-15slab: initialize unused alien cache entry as NULL at alloc_alien_cache().Haicheng Li
commit f3186a9c51eabe75b2780153ed7f07778d78b16e upstream. Comparing with existing code, it's a simpler way to use kzalloc_node() to ensure that each unused alien cache entry is NULL. CC: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-03-15readahead: introduce FMODE_RANDOM for POSIX_FADV_RANDOMWu Fengguang
commit 0141450f66c3c12a3aaa869748caa64241885cdf upstream. This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM. POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor performance: a 16K read will be carried out in 4 _sync_ 1-page reads. In other places, ra_pages==0 means - it's ramfs/tmpfs/hugetlbfs/sysfs/configfs - some IO error happened where multi-page read IO won't help or should be avoided. POSIX_FADV_RANDOM actually want a different semantics: to disable the *heuristic* readahead algorithm, and to use a dumb one which faithfully submit read IO for whatever application requests. So introduce a flag FMODE_RANDOM for POSIX_FADV_RANDOM. Note that the random hint is not likely to help random reads performance noticeably. And it may be too permissive on huge request size (its IO size is not limited by read_ahead_kb). In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall (NFS read) performance of the application increased by 313%! Tested-by: Quentin Barnes <qbarnes+nfs@yahoo-inc.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Andi Kleen <andi@firstfloor.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: <qbarnes+nfs@yahoo-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-02-22memcg: fix oom killing a child process in an other cgroupKAMEZAWA Hiroyuki
Presently the oom-killer is memcg aware and it finds the worst process from processes under memcg(s) in oom. Then, it kills victim's child first. It may kill a child in another cgroup and may not be any help for recovery. And it will break the assumption users have. This patch fixes it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-21mm: Make copy_from_user() in migrate.c statically predictableH. Peter Anvin
x86-32 has had a static test for copy_on_user() overflow for a while. This test currently fails in mm/migrate.c resulting in an allyesconfig/allmodconfig build failure on x86-32: In function ‘copy_from_user’, inlined from ‘do_pages_stat’ at /home/hpa/kernel/git/mm/migrate.c:1012: /home/hpa/kernel/git/arch/x86/include/asm/uaccess_32.h:212: error: call to ‘copy_from_user_overflow’ declared Make the logic more explicit and therefore easier for gcc to understand. v2: rewrite the loop entirely using a more normal structure for a chunked-data loop (Linus Torvalds) Reported-by: Len Brown <lenb@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Reviewed-and-Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Arjan van de Ven <arjan@linux.kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-06Fix potential crash with sys_move_pagesLinus Torvalds
We incorrectly depended on the 'node_state/node_isset()' functions testing the node range, rather than checking it explicitly. That's not reliable, even if it might often happen to work. So do the proper explicit test. Reported-by: Marcus Meissner <meissner@suse.de> Acked-and-tested-by: Brice Goglin <Brice.Goglin@inria.fr> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-02hugetlb: fix section mismatchesJeff Mahoney
hugetlb_sysfs_add_hstate is called by hugetlb_register_node directly during init and also indirectly via sysfs after init. This patch removes the __init tag from hugetlb_sysfs_add_hstate. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-02mm: flush dcache before writing into page to avoid aliasanfei zhou
The cache alias problem will happen if the changes of user shared mapping is not flushed before copying, then user and kernel mapping may be mapped into two different cache line, it is impossible to guarantee the coherence after iov_iter_copy_from_user_atomic. So the right steps should be: flush_dcache_page(page); kmap_atomic(page); write to page; kunmap_atomic(page); flush_dcache_page(page); More precisely, we might create two new APIs flush_dcache_user_page and flush_dcache_kern_page to replace the two flush_dcache_page accordingly. Here is a snippet tested on omap2430 with VIPT cache, and I think it is not ARM-specific: int val = 0x11111111; fd = open("abc", O_RDWR); addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); *(addr+0) = 0x44444444; tmp = *(addr+0); *(addr+1) = 0x77777777; write(fd, &val, sizeof(int)); close(fd); The results are not always 0x11111111 0x77777777 at the beginning as expected. Sometimes we see 0x44444444 0x77777777. Signed-off-by: Anfei <anfei.zhou@gmail.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: <linux-arch@vger.kernel.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-02mm: purge fragmented percpu vmap blocksNick Piggin
Improve handling of fragmented per-CPU vmaps. We previously don't free up per-CPU maps until all its addresses have been used and freed. So fragmented blocks could fill up vmalloc space even if they actually had no active vmap regions within them. Add some logic to allow all CPUs to have these blocks purged in the case of failure to allocate a new vm area, and also put some logic to trim such blocks of a current CPU if we hit them in the allocation path (so as to avoid a large build up of them). Christoph reported some vmap allocation failures when using the per CPU vmap APIs in XFS, which cannot be reproduced after this patch and the previous bug fix. Cc: linux-mm@kvack.org Cc: stable@kernel.org Tested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Nick Piggin <npiggin@suse.de> -- Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-02mm: percpu-vmap fix RCU list walkingNick Piggin
RCU list walking of the per-cpu vmap cache was broken. It did not use RCU primitives, and also the union of free_list and rcu_head is obviously wrong (because free_list is indeed the list we are RCU walking). While we are there, remove a couple of unused fields from an earlier iteration. These APIs aren't actually used anywhere, because of problems with the XFS conversion. Christoph has now verified that the problems are solved with these patches. Also it is an exported interface, so I think it will be good to be merged now (and Christoph wants to get the XFS changes into their local tree). Cc: stable@kernel.org Cc: linux-mm@kvack.org Tested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Nick Piggin <npiggin@suse.de> -- Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>