summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2018-03-20workqueue: Allow retrieval of current task's work structLukas Wunner
[ Upstream commit 27d4ee03078aba88c5e07dcc4917e8d01d046f38 ] Introduce a helper to retrieve the current task's work struct if it is a workqueue worker. This allows us to fix a long-standing deadlock in several DRM drivers wherein the ->runtime_suspend callback waits for a specific worker to finish and that worker in turn calls a function which waits for runtime suspend to finish. That function is invoked from multiple call sites and waiting for runtime suspend to finish is the correct thing to do except if it's executing in the context of the worker. Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Dave Airlie <airlied@redhat.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lyude Paul <lyude@redhat.com> Signed-off-by: Lukas Wunner <lukas@wunner.de> Link: https://patchwork.freedesktop.org/patch/msgid/2d8f603074131eb87e588d2b803a71765bd3a2fd.1518338788.git.lukas@wunner.de Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-03-20hrtimer: Ensure POSIX compliance (relative CLOCK_REALTIME hrtimers)Anna-Maria Gleixner
[ Upstream commit 48d0c9becc7f3c66874c100c126459a9da0fdced ] The POSIX specification defines that relative CLOCK_REALTIME timers are not affected by clock modifications. Those timers have to use CLOCK_MONOTONIC to ensure POSIX compliance. The introduction of the additional HRTIMER_MODE_PINNED mode broke this requirement for pinned timers. There is no user space visible impact because user space timers are not using pinned mode, but for consistency reasons this needs to be fixed. Check whether the mode has the HRTIMER_MODE_REL bit set instead of comparing with HRTIMER_MODE_ABS. Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Cc: Christoph Hellwig <hch@lst.de> Cc: John Stultz <john.stultz@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: keescook@chromium.org Fixes: 597d0275736d ("timers: Framework for identifying pinned timers") Link: http://lkml.kernel.org/r/20171221104205.7269-7-anna-maria@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-03-04profile: hide unused functions when !CONFIG_PROC_FSArnd Bergmann
[ Upstream commit ade356b99a4187578609f2a91c4d2ed88e4e70dc ] A couple of functions and variables in the profile implementation are used only on SMP systems by the procfs code, but are unused if either procfs is disabled or in uniprocessor kernels. gcc prints a harmless warning about the unused symbols: kernel/profile.c:243:13: error: 'profile_flip_buffers' defined but not used [-Werror=unused-function] static void profile_flip_buffers(void) ^ kernel/profile.c:266:13: error: 'profile_discard_flip_buffers' defined but not used [-Werror=unused-function] static void profile_discard_flip_buffers(void) ^ kernel/profile.c:330:12: error: 'profile_cpu_callback' defined but not used [-Werror=unused-function] static int profile_cpu_callback(struct notifier_block *info, ^ This adds further #ifdef to the file, to annotate exactly in which cases they are used. I have done several thousand ARM randconfig kernels with this patch applied and no longer get any warnings in this file. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Robin Holt <robinmholt@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-03-04blktrace: fix unlocked registration of tracepointsJens Axboe
[ Upstream commit a6da0024ffc19e0d47712bb5ca4fd083f76b07df ] We need to ensure that tracepoints are registered and unregistered with the users of them. The existing atomic count isn't enough for that. Add a lock around the tracepoints, so we serialize access to them. This fixes cases where we have multiple users setting up and tearing down tracepoints, like this: CPU: 0 PID: 2995 Comm: syzkaller857118 Not tainted 4.14.0-rc5-next-20171018+ #36 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:52 panic+0x1e4/0x41c kernel/panic.c:183 __warn+0x1c4/0x1e0 kernel/panic.c:546 report_bug+0x211/0x2d0 lib/bug.c:183 fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:177 do_trap_no_signal arch/x86/kernel/traps.c:211 [inline] do_trap+0x260/0x390 arch/x86/kernel/traps.c:260 do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:297 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:310 invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905 RIP: 0010:tracepoint_add_func kernel/tracepoint.c:210 [inline] RIP: 0010:tracepoint_probe_register_prio+0x397/0x9a0 kernel/tracepoint.c:283 RSP: 0018:ffff8801d1d1f6c0 EFLAGS: 00010293 RAX: ffff8801d22e8540 RBX: 00000000ffffffef RCX: ffffffff81710f07 RDX: 0000000000000000 RSI: ffffffff85b679c0 RDI: ffff8801d5f19818 RBP: ffff8801d1d1f7c8 R08: ffffffff81710c10 R09: 0000000000000004 R10: ffff8801d1d1f6b0 R11: 0000000000000003 R12: ffffffff817597f0 R13: 0000000000000000 R14: 00000000ffffffff R15: ffff8801d1d1f7a0 tracepoint_probe_register+0x2a/0x40 kernel/tracepoint.c:304 register_trace_block_rq_insert include/trace/events/block.h:191 [inline] blk_register_tracepoints+0x1e/0x2f0 kernel/trace/blktrace.c:1043 do_blk_trace_setup+0xa10/0xcf0 kernel/trace/blktrace.c:542 blk_trace_setup+0xbd/0x180 kernel/trace/blktrace.c:564 sg_ioctl+0xc71/0x2d90 drivers/scsi/sg.c:1089 vfs_ioctl fs/ioctl.c:45 [inline] do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685 SYSC_ioctl fs/ioctl.c:700 [inline] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691 entry_SYSCALL_64_fastpath+0x1f/0xbe RIP: 0033:0x444339 RSP: 002b:00007ffe05bb5b18 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00000000006d66c0 RCX: 0000000000444339 RDX: 000000002084cf90 RSI: 00000000c0481273 RDI: 0000000000000009 RBP: 0000000000000082 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000206 R12: ffffffffffffffff R13: 00000000c0481273 R14: 0000000000000000 R15: 0000000000000000 since we can now run these in parallel. Ensure that the exported helpers for doing this are grabbing the queue trace mutex. Reported-by: Steven Rostedt <rostedt@goodmis.org> Tested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-03-04kernel/async.c: revert "async: simplify lowest_in_progress()"Rasmus Villemoes
[ Upstream commit 4f7e988e63e336827f4150de48163bed05d653bd ] This reverts commit 92266d6ef60c ("async: simplify lowest_in_progress()") which was simply wrong: In the case where domain is NULL, we now use the wrong offsetof() in the list_first_entry macro, so we don't actually fetch the ->cookie value, but rather the eight bytes located sizeof(struct list_head) further into the struct async_entry. On 64 bit, that's the data member, while on 32 bit, that's a u64 built from func and data in some order. I think the bug happens to be harmless in practice: It obviously only affects callers which pass a NULL domain, and AFAICT the only such caller is async_synchronize_full() -> async_synchronize_full_domain(NULL) -> async_synchronize_cookie_domain(ASYNC_COOKIE_MAX, NULL) and the ASYNC_COOKIE_MAX means that in practice we end up waiting for the async_global_pending list to be empty - but it would break if somebody happened to pass (void*)-1 as the data element to async_schedule, and of course also if somebody ever does a async_synchronize_cookie_domain(, NULL) with a "finite" cookie value. Maybe the "harmless in practice" means this isn't -stable material. But I'm not completely confident my quick git grep'ing is enough, and there might be affected code in one of the earlier kernels that has since been removed, so I'll leave the decision to the stable guys. Link: http://lkml.kernel.org/r/20171128104938.3921-1-linux@rasmusvillemoes.dk Fixes: 92266d6ef60c "async: simplify lowest_in_progress()" Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Tejun Heo <tj@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Adam Wallis <awallis@codeaurora.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: <stable@vger.kernel.org> [3.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28posix-timer: Properly check sigevent->sigev_notifyThomas Gleixner
[ Upstream commit cef31d9af908243421258f1df35a4a644604efbe ] timer_create() specifies via sigevent->sigev_notify the signal delivery for the new timer. The valid modes are SIGEV_NONE, SIGEV_SIGNAL, SIGEV_THREAD and (SIGEV_SIGNAL | SIGEV_THREAD_ID). The sanity check in good_sigevent() is only checking the valid combination for the SIGEV_THREAD_ID bit, i.e. SIGEV_SIGNAL, but if SIGEV_THREAD_ID is not set it accepts any random value. This has no real effects on the posix timer and signal delivery code, but it affects show_timer() which handles the output of /proc/$PID/timers. That function uses a string array to pretty print sigev_notify. The access to that array has no bound checks, so random sigev_notify cause access beyond the array bounds. Add proper checks for the valid notify modes and remove the SIGEV_THREAD_ID masking from various code pathes as SIGEV_NONE can never be set in combination with SIGEV_THREAD_ID. Reported-by: Eric Biggers <ebiggers3@gmail.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28bpf: fix divides by zeroEric Dumazet
[ Upstream commit c366287ebd698ef5e3de300d90cd62ee9ee7373e ] Divides by zero are not nice, lets avoid them if possible. Also do_div() seems not needed when dealing with 32bit operands, but this seems a minor detail. Fixes: bd4cf0ed331a ("net: filter: rework/optimize internal BPF interpreter's instruction set") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28bpf: arsh is not supported in 32 bit alu thus reject itDaniel Borkmann
[ Upstream commit 7891a87efc7116590eaba57acc3c422487802c6f ] The following snippet was throwing an 'unknown opcode cc' warning in BPF interpreter: 0: (18) r0 = 0x0 2: (7b) *(u64 *)(r10 -16) = r0 3: (cc) (u32) r0 s>>= (u32) r0 4: (95) exit Although a number of JITs do support BPF_ALU | BPF_ARSH | BPF_{K,X} generation, not all of them do and interpreter does neither. We can leave existing ones and implement it later in bpf-next for the remaining ones, but reject this properly in verifier for the time being. Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)") Reported-by: syzbot+93c4904c5c70348a6890@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28time: Avoid undefined behaviour in ktime_add_safe()Vegard Nossum
[ Upstream commit 979515c5645830465739254abc1b1648ada41518 ] I ran into this: ================================================================================ UBSAN: Undefined behaviour in kernel/time/hrtimer.c:310:16 signed integer overflow: 9223372036854775807 + 50000 cannot be represented in type 'long long int' CPU: 2 PID: 4798 Comm: trinity-c2 Not tainted 4.8.0-rc1+ #91 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014 0000000000000000 ffff88010ce6fb88 ffffffff82344740 0000000041b58ab3 ffffffff84f97a20 ffffffff82344694 ffff88010ce6fbb0 ffff88010ce6fb60 000000000000c350 ffff88010ce6f968 dffffc0000000000 ffffffff857bc320 Call Trace: [<ffffffff82344740>] dump_stack+0xac/0xfc [<ffffffff82344694>] ? _atomic_dec_and_lock+0xc4/0xc4 [<ffffffff8242df78>] ubsan_epilogue+0xd/0x8a [<ffffffff8242e6b4>] handle_overflow+0x202/0x23d [<ffffffff8242e4b2>] ? val_to_string.constprop.6+0x11e/0x11e [<ffffffff8236df71>] ? timerqueue_add+0x151/0x410 [<ffffffff81485c48>] ? hrtimer_start_range_ns+0x3b8/0x1380 [<ffffffff81795631>] ? memset+0x31/0x40 [<ffffffff8242e6fd>] __ubsan_handle_add_overflow+0xe/0x10 [<ffffffff81488ac9>] hrtimer_nanosleep+0x5d9/0x790 [<ffffffff814884f0>] ? hrtimer_init_sleeper+0x80/0x80 [<ffffffff813a9ffb>] ? __might_sleep+0x5b/0x260 [<ffffffff8148be10>] common_nsleep+0x20/0x30 [<ffffffff814906c7>] SyS_clock_nanosleep+0x197/0x210 [<ffffffff81490530>] ? SyS_clock_getres+0x150/0x150 [<ffffffff823c7113>] ? __this_cpu_preempt_check+0x13/0x20 [<ffffffff8162ef60>] ? __context_tracking_exit.part.3+0x30/0x1b0 [<ffffffff81490530>] ? SyS_clock_getres+0x150/0x150 [<ffffffff81007bd3>] do_syscall_64+0x1b3/0x4b0 [<ffffffff845f85aa>] entry_SYSCALL64_slow_path+0x25/0x25 ================================================================================ Add a new ktime_add_unsafe() helper which doesn't check for overflow, but doesn't throw a UBSAN warning when it does overflow either. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28tracing: Fix converting enum's from the map in trace_event_eval_update()Steven Rostedt (VMware)
[ Upstream commit 1ebe1eaf2f02784921759992ae1fde1a9bec8fd0 ] Since enums do not get converted by the TRACE_EVENT macro into their values, the event format displaces the enum name and not the value. This breaks tools like perf and trace-cmd that need to interpret the raw binary data. To solve this, an enum map was created to convert these enums into their actual numbers on boot up. This is done by TRACE_EVENTS() adding a TRACE_DEFINE_ENUM() macro. Some enums were not being converted. This was caused by an optization that had a bug in it. All calls get checked against this enum map to see if it should be converted or not, and it compares the call's system to the system that the enum map was created under. If they match, then they call is processed. To cut down on the number of iterations needed to find the maps with a matching system, since calls and maps are grouped by system, when a match is made, the index into the map array is saved, so that the next call, if it belongs to the same system as the previous call, could start right at that array index and not have to scan all the previous arrays. The problem was, the saved index was used as the variable to know if this is a call in a new system or not. If the index was zero, it was assumed that the call is in a new system and would keep incrementing the saved index until it found a matching system. The issue arises when the first matching system was at index zero. The next map, if it belonged to the same system, would then think it was the first match and increment the index to one. If the next call belong to the same system, it would begin its search of the maps off by one, and miss the first enum that should be converted. This left a single enum not converted properly. Also add a comment to describe exactly what that index was for. It took me a bit too long to figure out what I was thinking when debugging this issue. Link: http://lkml.kernel.org/r/717BE572-2070-4C1E-9902-9F2E0FEDA4F8@oracle.com Cc: stable@vger.kernel.org Fixes: 0c564a538aa93 ("tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values") Reported-by: Chuck Lever <chuck.lever@oracle.com> Teste-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28futex: Prevent overflow by strengthen input validationLi Jinyue
[ Upstream commit fbe0e839d1e22d88810f3ee3e2f1479be4c0aa4a ] UBSAN reports signed integer overflow in kernel/futex.c: UBSAN: Undefined behaviour in kernel/futex.c:2041:18 signed integer overflow: 0 - -2147483648 cannot be represented in type 'int' Add a sanity check to catch negative values of nr_wake and nr_requeue. Signed-off-by: Li Jinyue <lijinyue@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: peterz@infradead.org Cc: dvhart@infradead.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1513242294-31786-1-git-send-email-lijinyue@huawei.com Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28gcov: disable for COMPILE_TESTArnd Bergmann
[ Upstream commit cc622420798c4bcf093785d872525087a7798db9 ] Enabling gcov is counterproductive to compile testing: it significantly increases the kernel image size, compile time, and it produces lots of false positive "may be used uninitialized" warnings as the result of missed optimizations. This is in line with how UBSAN_SANITIZE_ALL and PROFILE_ALL_BRANCHES work, both of which have similar problems. With an ARM allmodconfig kernel, I see the build time drop from 283 minutes CPU time to 225 minutes, and the vmlinux size drops from 43MB to 26MB. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Signed-off-by: Michal Marek <mmarek@suse.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28futex: Replace barrier() in unqueue_me() with READ_ONCE()Jianyu Zhan
[ Upstream commit 29b75eb2d56a714190a93d7be4525e617591077a ] Commit e91467ecd1ef ("bug in futex unqueue_me") introduced a barrier() in unqueue_me() to prevent the compiler from rereading the lock pointer which might change after a check for NULL. Replace the barrier() with a READ_ONCE() for the following reasons: 1) READ_ONCE() is a weaker form of barrier() that affects only the specific load operation, while barrier() is a general compiler level memory barrier. READ_ONCE() was not available at the time when the barrier was added. 2) Aside of that READ_ONCE() is descriptive and self explainatory while a barrier without comment is not clear to the casual reader. No functional change. [ tglx: Massaged changelog ] Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Darren Hart <dvhart@linux.intel.com> Cc: dave@stgolabs.net Cc: peterz@infradead.org Cc: linux@rasmusvillemoes.dk Cc: akpm@linux-foundation.org Cc: fengguang.wu@intel.com Cc: bigeasy@linutronix.de Link: http://lkml.kernel.org/r/1457314344-5685-1-git-send-email-nasa4836@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28module: Issue warnings when tainting kernelLibor Pechacek
[ Upstream commit 3205c36cf7d96024626f92d65f560035df1abcb2 ] While most of the locations where a kernel taint bit is set are accompanied with a warning message, there are two which set their bits silently. If the tainting module gets unloaded later on, it is almost impossible to tell what was the reason for setting the flag. Signed-off-by: Libor Pechacek <lpechacek@suse.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28module: keep percpu symbols in module's symtabMiroslav Benes
[ Upstream commit e0224418516b4d8a6c2160574bac18447c354ef0 ] Currently, percpu symbols from .data..percpu ELF section of a module are not copied over and stored in final symtab array of struct module. Consequently such symbol cannot be returned via kallsyms API (for example kallsyms_lookup_name). This can be especially confusing when the percpu symbol is exported. Only its __ksymtab et al. are present in its symtab. The culprit is in layout_and_allocate() function where SHF_ALLOC flag is dropped for .data..percpu section. There is in fact no need to copy the section to final struct module, because kernel module loader allocates extra percpu section by itself. Unfortunately only symbols from SHF_ALLOC sections are copied due to a check in is_core_symbol(). The patch changes is_core_symbol() function to copy over also percpu symbols (their st_shndx points to .data..percpu ELF section). We do it only if CONFIG_KALLSYMS_ALL is set to be consistent with the rest of the function (ELF section is SHF_ALLOC but !SHF_EXECINSTR). Finally elf_type() returns type 'a' for a percpu symbol because its address is absolute. Signed-off-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28kernel/signal.c: remove the no longer needed SIGNAL_UNKILLABLE check in ↵Oleg Nesterov
complete_signal() [ Upstream commit 426915796ccaf9c2bd9bb06dc5702225957bc2e5 ] complete_signal() checks SIGNAL_UNKILLABLE before it starts to destroy the thread group, today this is wrong in many ways. If nothing else, fatal_signal_pending() should always imply that the whole thread group (except ->group_exit_task if it is not NULL) is killed, this check breaks the rule. After the previous changes we can rely on sig_task_ignored(); sig_fatal(sig) && SIGNAL_UNKILLABLE can only be true if we actually want to kill this task and sig == SIGKILL OR it is traced and debugger can intercept the signal. This should hopefully fix the problem reported by Dmitry. This test-case static int init(void *arg) { for (;;) pause(); } int main(void) { char stack[16 * 1024]; for (;;) { int pid = clone(init, stack + sizeof(stack)/2, CLONE_NEWPID | SIGCHLD, NULL); assert(pid > 0); assert(ptrace(PTRACE_ATTACH, pid, 0, 0) == 0); assert(waitpid(-1, NULL, WSTOPPED) == pid); assert(ptrace(PTRACE_DETACH, pid, 0, SIGSTOP) == 0); assert(syscall(__NR_tkill, pid, SIGKILL) == 0); assert(pid == wait(NULL)); } } triggers the WARN_ON_ONCE(!(task->jobctl & JOBCTL_STOP_PENDING)) in task_participate_group_stop(). do_signal_stop()->signal_group_exit() checks SIGNAL_GROUP_EXIT and return false, but task_set_jobctl_pending() checks fatal_signal_pending() and does not set JOBCTL_STOP_PENDING. And his should fix the minor security problem reported by Kyle, SECCOMP_RET_TRACE can miss fatal_signal_pending() the same way if the task is the root of a pid namespace. Link: http://lkml.kernel.org/r/20171103184246.GD21036@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Kyle Huey <me@kylehuey.com> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kyle Huey <me@kylehuey.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28kernel/signal.c: protect the SIGNAL_UNKILLABLE tasks from !sig_kernel_only() ↵Oleg Nesterov
signals [ Upstream commit ac25385089f673560867eb5179228a44ade0cfc1 ] Change sig_task_ignored() to drop the SIG_DFL && !sig_kernel_only() signals even if force == T. This simplifies the next change and this matches the same check in get_signal() which will drop these signals anyway. Link: http://lkml.kernel.org/r/20171103184227.GC21036@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Kyle Huey <me@kylehuey.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28kernel/signal.c: protect the traced SIGNAL_UNKILLABLE tasks from SIGKILLOleg Nesterov
[ Upstream commit 628c1bcba204052d19b686b5bac149a644cdb72e ] The comment in sig_ignored() says "Tracers may want to know about even ignored signals" but SIGKILL can not be reported to debugger and it is just wrong to return 0 in this case: SIGKILL should only kill the SIGNAL_UNKILLABLE task if it comes from the parent ns. Change sig_ignored() to ignore ->ptrace if sig == SIGKILL and rely on sig_task_ignored(). SISGTOP coming from within the namespace is not really right too but at least debugger can intercept it, and we can't drop it here because this will break "gdb -p 1": ptrace_attach() won't work. Perhaps we will add another ->ptrace check later, we will see. Link: http://lkml.kernel.org/r/20171103184206.GB21036@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Kyle Huey <me@kylehuey.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28kernel: make groups_sort calling a responsibility group_info allocatorsThiago Rafael Becker
[ Upstream commit bdcf0a423ea1c40bbb40e7ee483b50fc8aa3d758 ] In testing, we found that nfsd threads may call set_groups in parallel for the same entry cached in auth.unix.gid, racing in the call of groups_sort, corrupting the groups for that entry and leading to permission denials for the client. This patch: - Make groups_sort globally visible. - Move the call to groups_sort to the modifiers of group_info - Remove the call to groups_sort from set_groups Link: http://lkml.kernel.org/r/20171211151420.18655-1-thiago.becker@gmail.com Signed-off-by: Thiago Rafael Becker <thiago.becker@gmail.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: NeilBrown <neilb@suse.com> Acked-by: "J. Bruce Fields" <bfields@fieldses.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28kernel/acct.c: fix the acct->needcheck check in check_free_space()Oleg Nesterov
[ Upstream commit 4d9570158b6260f449e317a5f9ed030c2504a615 ] As Tsukada explains, the time_is_before_jiffies(acct->needcheck) check is very wrong, we need time_is_after_jiffies() to make sys_acct() work. Ignoring the overflows, the code should "goto out" if needcheck > jiffies, while currently it checks "needcheck < jiffies" and thus in the likely case check_free_space() does nothing until jiffies overflow. In particular this means that sys_acct() is simply broken, acct_on() sets acct->needcheck = jiffies and expects that check_free_space() should set acct->active = 1 after the free-space check, but this won't happen if jiffies increments in between. This was broken by commit 32dc73086015 ("get rid of timer in kern/acct.c") in 2011, then another (correct) commit 795a2f22a8ea ("acct() should honour the limits from the very beginning") made the problem more visible. Link: http://lkml.kernel.org/r/20171213133940.GA6554@redhat.com Fixes: 32dc73086015 ("get rid of timer in kern/acct.c") Reported-by: TSUKADA Koutaro <tsukada@ascade.co.jp> Suggested-by: TSUKADA Koutaro <tsukada@ascade.co.jp> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28ring-buffer: Mask out the info bits when returning buffer page lengthSteven Rostedt (VMware)
[ Upstream commit 45d8b80c2ac5d21cd1e2954431fb676bc2b1e099 ] Two info bits were added to the "commit" part of the ring buffer data page when returned to be consumed. This was to inform the user space readers that events have been missed, and that the count may be stored at the end of the page. What wasn't handled, was the splice code that actually called a function to return the length of the data in order to zero out the rest of the page before sending it up to user space. These data bits were returned with the length making the value negative, and that negative value was not checked. It was compared to PAGE_SIZE, and only used if the size was less than PAGE_SIZE. Luckily PAGE_SIZE is unsigned long which made the compare an unsigned compare, meaning the negative size value did not end up causing a large portion of memory to be randomly zeroed out. Cc: stable@vger.kernel.org Fixes: 66a8cb95ed040 ("ring-buffer: Add place holder recording of dropped events") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28tracing: Fix crash when it fails to alloc ring bufferJing Xia
[ Upstream commit 24f2aaf952ee0b59f31c3a18b8b36c9e3d3c2cf5 ] Double free of the ring buffer happens when it fails to alloc new ring buffer instance for max_buffer if TRACER_MAX_TRACE is configured. The root cause is that the pointer is not set to NULL after the buffer is freed in allocate_trace_buffers(), and the freeing of the ring buffer is invoked again later if the pointer is not equal to Null, as: instance_mkdir() |-allocate_trace_buffers() |-allocate_trace_buffer(tr, &tr->trace_buffer...) |-allocate_trace_buffer(tr, &tr->max_buffer...) // allocate fail(-ENOMEM),first free // and the buffer pointer is not set to null |-ring_buffer_free(tr->trace_buffer.buffer) // out_free_tr |-free_trace_buffers() |-free_trace_buffer(&tr->trace_buffer); //if trace_buffer is not null, free again |-ring_buffer_free(buf->buffer) |-rb_free_cpu_buffer(buffer->buffers[cpu]) // ring_buffer_per_cpu is null, and // crash in ring_buffer_per_cpu->pages Link: http://lkml.kernel.org/r/20171226071253.8968-1-chunyan.zhang@spreadtrum.com Cc: stable@vger.kernel.org Fixes: 737223fbca3b1 ("tracing: Consolidate buffer allocation code") Signed-off-by: Jing Xia <jing.xia@spreadtrum.com> Signed-off-by: Chunyan Zhang <chunyan.zhang@spreadtrum.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28tracing: Fix possible double free on failure of allocating trace bufferSteven Rostedt (VMware)
[ Upstream commit 4397f04575c44e1440ec2e49b6302785c95fd2f8 ] Jing Xia and Chunyan Zhang reported that on failing to allocate part of the tracing buffer, memory is freed, but the pointers that point to them are not initialized back to NULL, and later paths may try to free the freed memory again. Jing and Chunyan fixed one of the locations that does this, but missed a spot. Link: http://lkml.kernel.org/r/20171226071253.8968-1-chunyan.zhang@spreadtrum.com Cc: stable@vger.kernel.org Fixes: 737223fbca3b1 ("tracing: Consolidate buffer allocation code") Reported-by: Jing Xia <jing.xia@spreadtrum.com> Reported-by: Chunyan Zhang <chunyan.zhang@spreadtrum.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28tracing: Remove extra zeroing out of the ring buffer pageSteven Rostedt (VMware)
[ Upstream commit 6b7e633fe9c24682df550e5311f47fb524701586 ] The ring_buffer_read_page() takes care of zeroing out any extra data in the page that it returns. There's no need to zero it out again from the consumer. It was removed from one consumer of this function, but read_buffers_splice_read() did not remove it, and worse, it contained a nasty bug because of it. Cc: stable@vger.kernel.org Fixes: 2711ca237a084 ("ring-buffer: Move zeroing out excess in page to ring buffer code") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28sched/core: Idle_task_exit() shouldn't use switch_mm_irqs_off()Andy Lutomirski
[ Upstream commit 252d2a4117bc181b287eeddf848863788da733ae ] idle_task_exit() can be called with IRQs on x86 on and therefore should use switch_mm(), not switch_mm_irqs_off(). This doesn't seem to cause any problems right now, but it will confuse my upcoming TLB flush changes. Nonetheless, I think it should be backported because it's trivial. There won't be any meaningful performance impact because idle_task_exit() is only used when offlining a CPU. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Fixes: f98db6013c55 ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler") Link: http://lkml.kernel.org/r/ca3d1a9fa93a0b49f5a8ff729eda3640fb6abdf9.1497034141.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-02-28sched/core: Add switch_mm_irqs_off() and use it in the schedulerAndy Lutomirski
[ Upstream commit f98db6013c557c216da5038d9c52045be55cd039 ] By default, this is the same thing as switch_mm(). x86 will override it as an optimization. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/df401df47bdd6be3e389c6f1e3f5310d70e81b2c.1461688545.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-01-17gcov: support GCC 7.1Martin Liska
[ Upstream commit 05384213436ab690c46d9dfec706b80ef8d671ab ] Starting from GCC 7.1, __gcov_exit is a new symbol expected to be implemented in a profiling runtime. [akpm@linux-foundation.org: coding-style fixes] [mliska@suse.cz: v2] Link: http://lkml.kernel.org/r/e63a3c59-0149-c97e-4084-20ca8f146b26@suse.cz Link: http://lkml.kernel.org/r/8c4084fa-3885-29fe-5fc4-0d4ca199c785@suse.cz Signed-off-by: Martin Liska <mliska@suse.cz> Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-01-17gcov: add support for gcc version >= 6Florian Meier
[ Upstream commit d02038f972538b93011d78c068f44514fbde0a8c ] Link: http://lkml.kernel.org/r/20160701130914.GA23225@styxhp Signed-off-by: Florian Meier <Florian.Meier@informatik.uni-erlangen.de> Reviewed-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Tested-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-01-17gcov: add support for GCC 5.1Lorenzo Stoakes
[ Upstream commit 3e44c471a2dab210f7e9b1e5f7d4d54d52df59eb ] Fix kernel gcov support for GCC 5.1. Similar to commit a992bf836f9 ("gcov: add support for GCC 4.9"), this patch takes into account the existence of a new gcov counter (see gcc's gcc/gcov-counter.def.) Firstly, it increments GCOV_COUNTERS (to 10), which makes the data structure struct gcov_info compatible with GCC 5.1. Secondly, a corresponding counter function __gcov_merge_icall_topn (Top N value tracking for indirect calls) is included in base.c with the other gcov counters unused for kernel profiling. Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Yuan Pengfei <coolypf@qq.com> Tested-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Reviewed-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-01-17sched/deadline: Use deadline instead of period when calculating overflowSteven Rostedt (VMware)
[ Upstream commit 2317d5f1c34913bac5971d93d69fb6c31bb74670 ] I was testing Daniel's changes with his test case, and tweaked it a little. Instead of having the runtime equal to the deadline, I increased the deadline ten fold. Daniel's test case had: attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */ attr.sched_deadline = 2 * 1000 * 1000; /* 2 ms */ attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */ To make it more interesting, I changed it to: attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */ attr.sched_deadline = 20 * 1000 * 1000; /* 20 ms */ attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */ The results were rather surprising. The behavior that Daniel's patch was fixing came back. The task started using much more than .1% of the CPU. More like 20%. Looking into this I found that it was due to the dl_entity_overflow() constantly returning true. That's because it uses the relative period against relative runtime vs the absolute deadline against absolute runtime. runtime / (deadline - t) > dl_runtime / dl_period There's even a comment mentioning this, and saying that when relative deadline equals relative period, that the equation is the same as using deadline instead of period. That comment is backwards! What we really want is: runtime / (deadline - t) > dl_runtime / dl_deadline We care about if the runtime can make its deadline, not its period. And then we can say "when the deadline equals the period, the equation is the same as using dl_period instead of dl_deadline". After correcting this, now when the task gets enqueued, it can throttle correctly, and Daniel's fix to the throttling of sleeping deadline tasks works even when the runtime and deadline are not the same. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/02135a27f1ae3fe5fd032568a5a2f370e190e8d7.1488392936.git.bristot@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-01-17tracing: Allocate mask_str buffer dynamicallyChangbin Du
[ Upstream commit 90e406f96f630c07d631a021fd4af10aac913e77 ] The default NR_CPUS can be very large, but actual possible nr_cpu_ids usually is very small. For my x86 distribution, the NR_CPUS is 8192 and nr_cpu_ids is 4. About 2 pages are wasted. Most machines don't have so many CPUs, so define a array with NR_CPUS just wastes memory. So let's allocate the buffer dynamically when need. With this change, the mutext tracing_cpumask_update_lock also can be removed now, which was used to protect mask_str. Link: http://lkml.kernel.org/r/1512013183-19107-1-git-send-email-changbin.du@intel.com Fixes: 36dfe9252bd4c ("ftrace: make use of tracing_cpumask") Cc: stable@vger.kernel.org Signed-off-by: Changbin Du <changbin.du@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-01-17audit: ensure that 'audit=1' actually enables audit for PID 1Paul Moore
[ Upstream commit 173743dd99a49c956b124a74c8aacb0384739a4c ] Prior to this patch we enabled audit in audit_init(), which is too late for PID 1 as the standard initcalls are run after the PID 1 task is forked. This means that we never allocate an audit_context (see audit_alloc()) for PID 1 and therefore miss a lot of audit events generated by PID 1. This patch enables audit as early as possible to help ensure that when PID 1 is forked it can allocate an audit_context if required. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-01-17workqueue: trigger WARN if queue_delayed_work() is called with NULL @wqTejun Heo
[ Upstream commit 637fdbae60d6cb9f6e963c1079d7e0445c86ff7d ] If queue_delayed_work() gets called with NULL @wq, the kernel will oops asynchronuosly on timer expiration which isn't too helpful in tracking down the offender. This actually happened with smc. __queue_delayed_work() already does several input sanity checks synchronously. Add NULL @wq check. Reported-by: Dave Jones <davej@codemonkey.org.uk> Link: http://lkml.kernel.org/r/20170227171439.jshx3qplflyrgcv7@codemonkey.org.uk Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-01-17kdb: Fix handling of kallsyms_symbol_next() return valueDaniel Thompson
[ Upstream commit c07d35338081d107e57cf37572d8cc931a8e32e2 ] kallsyms_symbol_next() returns a boolean (true on success). Currently kdb_read() tests the return value with an inequality that unconditionally evaluates to true. This is fixed in the obvious way and, since the conditional branch is supposed to be unreachable, we also add a WARN_ON(). Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Cc: linux-stable <stable@vger.kernel.org> Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2018-01-17padata: avoid race in reorderingJason A. Donenfeld
[ Upstream commit de5540d088fe97ad583cc7d396586437b32149a5 ] Under extremely heavy uses of padata, crashes occur, and with list debugging turned on, this happens instead: [87487.298728] WARNING: CPU: 1 PID: 882 at lib/list_debug.c:33 __list_add+0xae/0x130 [87487.301868] list_add corruption. prev->next should be next (ffffb17abfc043d0), but was ffff8dba70872c80. (prev=ffff8dba70872b00). [87487.339011] [<ffffffff9a53d075>] dump_stack+0x68/0xa3 [87487.342198] [<ffffffff99e119a1>] ? console_unlock+0x281/0x6d0 [87487.345364] [<ffffffff99d6b91f>] __warn+0xff/0x140 [87487.348513] [<ffffffff99d6b9aa>] warn_slowpath_fmt+0x4a/0x50 [87487.351659] [<ffffffff9a58b5de>] __list_add+0xae/0x130 [87487.354772] [<ffffffff9add5094>] ? _raw_spin_lock+0x64/0x70 [87487.357915] [<ffffffff99eefd66>] padata_reorder+0x1e6/0x420 [87487.361084] [<ffffffff99ef0055>] padata_do_serial+0xa5/0x120 padata_reorder calls list_add_tail with the list to which its adding locked, which seems correct: spin_lock(&squeue->serial.lock); list_add_tail(&padata->list, &squeue->serial.list); spin_unlock(&squeue->serial.lock); This therefore leaves only place where such inconsistency could occur: if padata->list is added at the same time on two different threads. This pdata pointer comes from the function call to padata_get_next(pd), which has in it the following block: next_queue = per_cpu_ptr(pd->pqueue, cpu); padata = NULL; reorder = &next_queue->reorder; if (!list_empty(&reorder->list)) { padata = list_entry(reorder->list.next, struct padata_priv, list); spin_lock(&reorder->lock); list_del_init(&padata->list); atomic_dec(&pd->reorder_objects); spin_unlock(&reorder->lock); pd->processed++; goto out; } out: return padata; I strongly suspect that the problem here is that two threads can race on reorder list. Even though the deletion is locked, call to list_entry is not locked, which means it's feasible that two threads pick up the same padata object and subsequently call list_add_tail on them at the same time. The fix is thus be hoist that lock outside of that block. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2017-12-08sched: Make resched_cpu() unconditionalPaul E. McKenney
[ Upstream commit 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 ] The current implementation of synchronize_sched_expedited() incorrectly assumes that resched_cpu() is unconditional, which it is not. This means that synchronize_sched_expedited() can hang when resched_cpu()'s trylock fails as follows (analysis by Neeraj Upadhyay): o CPU1 is waiting for expedited wait to complete: sync_rcu_exp_select_cpus rdp->exp_dynticks_snap & 0x1 // returns 1 for CPU5 IPI sent to CPU5 synchronize_sched_expedited_wait ret = swait_event_timeout(rsp->expedited_wq, sync_rcu_preempt_exp_done(rnp_root), jiffies_stall); expmask = 0x20, CPU 5 in idle path (in cpuidle_enter()) o CPU5 handles IPI and fails to acquire rq lock. Handles IPI sync_sched_exp_handler resched_cpu returns while failing to try lock acquire rq->lock need_resched is not set o CPU5 calls rcu_idle_enter() and as need_resched is not set, goes to idle (schedule() is not called). o CPU 1 reports RCU stall. Given that resched_cpu() is now used only by RCU, this commit fixes the assumption by making resched_cpu() unconditional. Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org> Suggested-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-11-05workqueue: replace pool->manager_arb mutex with a flagTejun Heo
[ Upstream commit 692b48258dda7c302e777d7d5f4217244478f1f6 ] Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by lockdep: [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted [ 1270.473240] ----------------------------------------------------- [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: [ 1270.474239] (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280 [ 1270.474994] [ 1270.474994] and this task is already holding: [ 1270.475440] (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0 [ 1270.476046] which would create a new lock dependency: [ 1270.476436] (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.} [ 1270.476949] [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock: [ 1270.477553] (&pool->lock/1){-.-.} ... [ 1270.488900] to a HARDIRQ-irq-unsafe lock: [ 1270.489327] (&(&lock->wait_lock)->rlock){+.+.} ... [ 1270.494735] Possible interrupt unsafe locking scenario: [ 1270.494735] [ 1270.495250] CPU0 CPU1 [ 1270.495600] ---- ---- [ 1270.495947] lock(&(&lock->wait_lock)->rlock); [ 1270.496295] local_irq_disable(); [ 1270.496753] lock(&pool->lock/1); [ 1270.497205] lock(&(&lock->wait_lock)->rlock); [ 1270.497744] <Interrupt> [ 1270.497948] lock(&pool->lock/1); , which will cause a irq inversion deadlock if the above lock scenario happens. The root cause of this safe -> unsafe lock order is the mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock held. Unlocking mutex while holding an irq spinlock was never safe and this problem has been around forever but it never got noticed because the only time the mutex is usually trylocked while holding irqlock making actual failures very unlikely and lockdep annotation missed the condition until the recent b9c16a0e1f73 ("locking/mutex: Fix lockdep_assert_held() fail"). Using mutex for pool->manager_arb has always been a bit of stretch. It primarily is an mechanism to arbitrate managership between workers which can easily be done with a pool flag. The only reason it became a mutex is that pool destruction path wants to exclude parallel managing operations. This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE and make the destruction path wait for the current manager on a wait queue. v2: Drop unnecessary flag clearing before pool destruction as suggested by Boqun. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-11-05locking/lockdep: Add nest_lock integrity testPeter Zijlstra
[ Upstream commit 7fb4a2cea6b18dab56d609530d077f168169ed6b ] Boqun reported that hlock->references can overflow. Add a debug test for that to generate a clear error when this happens. Without this, lockdep is likely to report a mysterious failure on unlock. Reported-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nicolai Hähnle <Nicolai.Haehnle@amd.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-11-05bpf/verifier: reject BPF_ALU64|BPF_ENDEdward Cree
[ Upstream commit e67b8a685c7c984e834e3181ef4619cd7025a136 ] Neither ___bpf_prog_run nor the JITs accept it. Also adds a new test case. Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)") Signed-off-by: Edward Cree <ecree@solarflare.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-11-05sched/cpuset/pm: Fix cpuset vs. suspend-resume bugsPeter Zijlstra
[ Upstream commit 50e76632339d4655859523a39249dd95ee5e93e7 ] Cpusets vs. suspend-resume is _completely_ broken. And it got noticed because it now resulted in non-cpuset usage breaking too. On suspend cpuset_cpu_inactive() doesn't call into cpuset_update_active_cpus() because it doesn't want to move tasks about, there is no need, all tasks are frozen and won't run again until after we've resumed everything. But this means that when we finally do call into cpuset_update_active_cpus() after resuming the last frozen cpu in cpuset_cpu_active(), the top_cpuset will not have any difference with the cpu_active_mask and this it will not in fact do _anything_. So the cpuset configuration will not be restored. This was largely hidden because we would unconditionally create identity domains and mobile users would not in fact use cpusets much. And servers what do use cpusets tend to not suspend-resume much. An addition problem is that we'd not in fact wait for the cpuset work to finish before resuming the tasks, allowing spurious migrations outside of the specified domains. Fix the rebuild by introducing cpuset_force_rebuild() and fix the ordering with cpuset_wait_for_hotplug(). Reported-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: deb7aa308ea2 ("cpuset: reorganize CPU / memory hotplug handling") Link: http://lkml.kernel.org/r/20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-11-05ftrace: Fix kmemleak in unregister_ftrace_graphShu Wang
[ Upstream commit 2b0b8499ae75df91455bbeb7491d45affc384fb0 ] The trampoline allocated by function tracer was overwriten by function_graph tracer, and caused a memory leak. The save_global_trampoline should have saved the previous trampoline in register_ftrace_graph() and restored it in unregister_ftrace_graph(). But as it is implemented, save_global_trampoline was only used in unregister_ftrace_graph as default value 0, and it overwrote the previous trampoline's value. Causing the previous allocated trampoline to be lost. kmmeleak backtrace: kmemleak_vmalloc+0x77/0xc0 __vmalloc_node_range+0x1b5/0x2c0 module_alloc+0x7c/0xd0 arch_ftrace_update_trampoline+0xb5/0x290 ftrace_startup+0x78/0x210 register_ftrace_function+0x8b/0xd0 function_trace_init+0x4f/0x80 tracing_set_tracer+0xe6/0x170 tracing_set_trace_write+0x90/0xd0 __vfs_write+0x37/0x170 vfs_write+0xb2/0x1b0 SyS_write+0x55/0xc0 do_syscall_64+0x67/0x180 return_from_SYSCALL_64+0x0/0x6a [ Looking further into this, I found that this was left over from when the function and function graph tracers shared the same ftrace_ops. But in commit 5f151b2401 ("ftrace: Fix function_profiler and function tracer together"), the two were separated, and the save_global_trampoline no longer was necessary (and it may have been broken back then too). -- Steven Rostedt ] Link: http://lkml.kernel.org/r/20170912021454.5976-1-shuwang@redhat.com Cc: stable@vger.kernel.org Fixes: 5f151b2401 ("ftrace: Fix function_profiler and function tracer together") Signed-off-by: Shu Wang <shuwang@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-11-04tracing: Erase irqsoff trace with empty writeBo Yan
[ Upstream commit 8dd33bcb7050dd6f8c1432732f930932c9d3a33e ] One convenient way to erase trace is "echo > trace". However, this is currently broken if the current tracer is irqsoff tracer. This is because irqsoff tracer use max_buffer as the default trace buffer. Set the max_buffer as the one to be cleared when it's the trace buffer currently in use. Link: http://lkml.kernel.org/r/1505754215-29411-1-git-send-email-byan@nvidia.com Cc: <mingo@redhat.com> Cc: stable@vger.kernel.org Fixes: 4acd4d00f ("tracing: give easy way to clear trace buffer") Signed-off-by: Bo Yan <byan@nvidia.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-11-04tracing: Fix trace_pipe behavior for instance tracesTahsin Erdogan
[ Upstream commit 75df6e688ccd517e339a7c422ef7ad73045b18a2 ] When reading data from trace_pipe, tracing_wait_pipe() performs a check to see if tracing has been turned off after some data was read. Currently, this check always looks at global trace state, but it should be checking the trace instance where trace_pipe is located at. Because of this bug, cat instances/i1/trace_pipe in the following script will immediately exit instead of waiting for data: cd /sys/kernel/debug/tracing echo 0 > tracing_on mkdir -p instances/i1 echo 1 > instances/i1/tracing_on echo 1 > instances/i1/events/sched/sched_process_exec/enable cat instances/i1/trace_pipe Link: http://lkml.kernel.org/r/20170917102348.1615-1-tahsin@google.com Cc: stable@vger.kernel.org Fixes: 10246fa35d4f ("tracing: give easy way to clear trace buffer") Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-10-04ftrace: Fix memleak when unregistering dynamic ops when tracing disabledSteven Rostedt (VMware)
[ Upstream commit edb096e00724f02db5f6ec7900f3bbd465c6c76f ] If function tracing is disabled by the user via the function-trace option or the proc sysctl file, and a ftrace_ops that was allocated on the heap is unregistered, then the shutdown code exits out without doing the proper clean up. This was found via kmemleak and running the ftrace selftests, as one of the tests unregisters with function tracing disabled. # cat kmemleak unreferenced object 0xffffffffa0020000 (size 4096): comm "swapper/0", pid 1, jiffies 4294668889 (age 569.209s) hex dump (first 32 bytes): 55 ff 74 24 10 55 48 89 e5 ff 74 24 18 55 48 89 U.t$.UH...t$.UH. e5 48 81 ec a8 00 00 00 48 89 44 24 50 48 89 4c .H......H.D$PH.L backtrace: [<ffffffff81d64665>] kmemleak_vmalloc+0x85/0xf0 [<ffffffff81355631>] __vmalloc_node_range+0x281/0x3e0 [<ffffffff8109697f>] module_alloc+0x4f/0x90 [<ffffffff81091170>] arch_ftrace_update_trampoline+0x160/0x420 [<ffffffff81249947>] ftrace_startup+0xe7/0x300 [<ffffffff81249bd2>] register_ftrace_function+0x72/0x90 [<ffffffff81263786>] trace_selftest_ops+0x204/0x397 [<ffffffff82bb8971>] trace_selftest_startup_function+0x394/0x624 [<ffffffff81263a75>] run_tracer_selftest+0x15c/0x1d7 [<ffffffff82bb83f1>] init_trace_selftests+0x75/0x192 [<ffffffff81002230>] do_one_initcall+0x90/0x1e2 [<ffffffff82b7d620>] kernel_init_freeable+0x350/0x3fe [<ffffffff81d61ec3>] kernel_init+0x13/0x122 [<ffffffff81d72c6a>] ret_from_fork+0x2a/0x40 [<ffffffffffffffff>] 0xffffffffffffffff Cc: stable@vger.kernel.org Fixes: 12cce594fa ("ftrace/x86: Allow !CONFIG_PREEMPT dynamic ops to use allocated trampolines") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-10-04tracing: Apply trace_clock changes to instance max bufferBaohong Liu
[ Upstream commit 170b3b1050e28d1ba0700e262f0899ffa4fccc52 ] Currently trace_clock timestamps are applied to both regular and max buffers only for global trace. For instance trace, trace_clock timestamps are applied only to regular buffer. But, regular and max buffers can be swapped, for example, following a snapshot. So, for instance trace, bad timestamps can be seen following a snapshot. Let's apply trace_clock timestamps to instance max buffer as well. Link: http://lkml.kernel.org/r/ebdb168d0be042dcdf51f81e696b17fabe3609c1.1504642143.git.tom.zanussi@linux.intel.com Cc: stable@vger.kernel.org Fixes: 277ba0446 ("tracing: Add interface to allow multiple trace buffers") Signed-off-by: Baohong Liu <baohong.liu@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-10-04ftrace: Fix selftest goto location on errorSteven Rostedt (VMware)
[ Upstream commit 46320a6acc4fb58f04bcf78c4c942cc43b20f986 ] In the second iteration of trace_selftest_ops(), the error goto label is wrong in the case where trace_selftest_test_global_cnt is off. In the case of error, it leaks the dynamic ops that was allocated. Cc: stable@vger.kernel.org Fixes: 95950c2e ("ftrace: Add self-tests for multiple function trace users") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-10-03locktorture: Fix potential memory leak with rw lock testYang Shi
[ Upstream commit f4dbba591945dc301c302672adefba9e2ec08dc5 ] When running locktorture module with the below commands with kmemleak enabled: $ modprobe locktorture torture_type=rw_lock_irq $ rmmod locktorture The below kmemleak got caught: root@10:~# echo scan > /sys/kernel/debug/kmemleak [ 323.197029] kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak) root@10:~# cat /sys/kernel/debug/kmemleak unreferenced object 0xffffffc07592d500 (size 128): comm "modprobe", pid 368, jiffies 4294924118 (age 205.824s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 c3 7b 02 00 00 00 00 00 .........{...... 00 00 00 00 00 00 00 00 d7 9b 02 00 00 00 00 00 ................ backtrace: [<ffffff80081e5a88>] create_object+0x110/0x288 [<ffffff80086c6078>] kmemleak_alloc+0x58/0xa0 [<ffffff80081d5acc>] __kmalloc+0x234/0x318 [<ffffff80006fa130>] 0xffffff80006fa130 [<ffffff8008083ae4>] do_one_initcall+0x44/0x138 [<ffffff800817e28c>] do_init_module+0x68/0x1cc [<ffffff800811c848>] load_module+0x1a68/0x22e0 [<ffffff800811d340>] SyS_finit_module+0xe0/0xf0 [<ffffff80080836f0>] el0_svc_naked+0x24/0x28 [<ffffffffffffffff>] 0xffffffffffffffff unreferenced object 0xffffffc07592d480 (size 128): comm "modprobe", pid 368, jiffies 4294924118 (age 205.824s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 3b 6f 01 00 00 00 00 00 ........;o...... 00 00 00 00 00 00 00 00 23 6a 01 00 00 00 00 00 ........#j...... backtrace: [<ffffff80081e5a88>] create_object+0x110/0x288 [<ffffff80086c6078>] kmemleak_alloc+0x58/0xa0 [<ffffff80081d5acc>] __kmalloc+0x234/0x318 [<ffffff80006fa22c>] 0xffffff80006fa22c [<ffffff8008083ae4>] do_one_initcall+0x44/0x138 [<ffffff800817e28c>] do_init_module+0x68/0x1cc [<ffffff800811c848>] load_module+0x1a68/0x22e0 [<ffffff800811d340>] SyS_finit_module+0xe0/0xf0 [<ffffff80080836f0>] el0_svc_naked+0x24/0x28 [<ffffffffffffffff>] 0xffffffffffffffff It is because cxt.lwsa and cxt.lrsa don't get freed in module_exit, so free them in lock_torture_cleanup() and free writer_tasks if reader_tasks is failed at memory allocation. Signed-off-by: Yang Shi <yang.shi@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-10-03perf/core: Fix group {cpu,task} validationMark Rutland
[ Upstream commit 64aee2a965cf2954a038b5522f11d2cd2f0f8f3e ] Regardless of which events form a group, it does not make sense for the events to target different tasks and/or CPUs, as this leaves the group inconsistent and impossible to schedule. The core perf code assumes that these are consistent across (successfully intialised) groups. Core perf code only verifies this when moving SW events into a HW context. Thus, we can violate this requirement for pure SW groups and pure HW groups, unless the relevant PMU driver happens to perform this verification itself. These mismatched groups subsequently wreak havoc elsewhere. For example, we handle watchpoints as SW events, and reserve watchpoint HW on a per-CPU basis at pmu::event_init() time to ensure that any event that is initialised is guaranteed to have a slot at pmu::add() time. However, the core code only checks the group leader's cpu filter (via event_filter_match()), and can thus install follower events onto CPUs violating thier (mismatched) CPU filters, potentially installing them into a CPU without sufficient reserved slots. This can be triggered with the below test case, resulting in warnings from arch backends. #define _GNU_SOURCE #include <linux/hw_breakpoint.h> #include <linux/perf_event.h> #include <sched.h> #include <stdio.h> #include <sys/prctl.h> #include <sys/syscall.h> #include <unistd.h> static int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu, int group_fd, unsigned long flags) { return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags); } char watched_char; struct perf_event_attr wp_attr = { .type = PERF_TYPE_BREAKPOINT, .bp_type = HW_BREAKPOINT_RW, .bp_addr = (unsigned long)&watched_char, .bp_len = 1, .size = sizeof(wp_attr), }; int main(int argc, char *argv[]) { int leader, ret; cpu_set_t cpus; /* * Force use of CPU0 to ensure our CPU0-bound events get scheduled. */ CPU_ZERO(&cpus); CPU_SET(0, &cpus); ret = sched_setaffinity(0, sizeof(cpus), &cpus); if (ret) { printf("Unable to set cpu affinity\n"); return 1; } /* open leader event, bound to this task, CPU0 only */ leader = perf_event_open(&wp_attr, 0, 0, -1, 0); if (leader < 0) { printf("Couldn't open leader: %d\n", leader); return 1; } /* * Open a follower event that is bound to the same task, but a * different CPU. This means that the group should never be possible to * schedule. */ ret = perf_event_open(&wp_attr, 0, 1, leader, 0); if (ret < 0) { printf("Couldn't open mismatched follower: %d\n", ret); return 1; } else { printf("Opened leader/follower with mismastched CPUs\n"); } /* * Open as many independent events as we can, all bound to the same * task, CPU0 only. */ do { ret = perf_event_open(&wp_attr, 0, 0, -1, 0); } while (ret >= 0); /* * Force enable/disble all events to trigger the erronoeous * installation of the follower event. */ printf("Opened all events. Toggling..\n"); for (;;) { prctl(PR_TASK_PERF_EVENTS_DISABLE, 0, 0, 0, 0); prctl(PR_TASK_PERF_EVENTS_ENABLE, 0, 0, 0, 0); } return 0; } Fix this by validating this requirement regardless of whether we're moving events. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Zhou Chengming <zhouchengming1@huawei.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1498142498-15758-1-git-send-email-mark.rutland@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-10-03tracing: Fix freeing of filter in create_filter() when set_str is falseSteven Rostedt (VMware)
[ Upstream commit 8b0db1a5bdfcee0dbfa89607672598ae203c9045 ] Performing the following task with kmemleak enabled: # cd /sys/kernel/tracing/events/irq/irq_handler_entry/ # echo 'enable_event:kmem:kmalloc:3 if irq >' > trigger # echo 'enable_event:kmem:kmalloc:3 if irq > 31' > trigger # echo scan > /sys/kernel/debug/kmemleak # cat /sys/kernel/debug/kmemleak unreferenced object 0xffff8800b9290308 (size 32): comm "bash", pid 1114, jiffies 4294848451 (age 141.139s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff81cef5aa>] kmemleak_alloc+0x4a/0xa0 [<ffffffff81357938>] kmem_cache_alloc_trace+0x158/0x290 [<ffffffff81261c09>] create_filter_start.constprop.28+0x99/0x940 [<ffffffff812639c9>] create_filter+0xa9/0x160 [<ffffffff81263bdc>] create_event_filter+0xc/0x10 [<ffffffff812655e5>] set_trigger_filter+0xe5/0x210 [<ffffffff812660c4>] event_enable_trigger_func+0x324/0x490 [<ffffffff812652e2>] event_trigger_write+0x1a2/0x260 [<ffffffff8138cf87>] __vfs_write+0xd7/0x380 [<ffffffff8138f421>] vfs_write+0x101/0x260 [<ffffffff8139187b>] SyS_write+0xab/0x130 [<ffffffff81cfd501>] entry_SYSCALL_64_fastpath+0x1f/0xbe [<ffffffffffffffff>] 0xffffffffffffffff The function create_filter() is passed a 'filterp' pointer that gets allocated, and if "set_str" is true, it is up to the caller to free it, even on error. The problem is that the pointer is not freed by create_filter() when set_str is false. This is a bug, and it is not up to the caller to free the filter on error if it doesn't care about the string. Link: http://lkml.kernel.org/r/1502705898-27571-2-git-send-email-chuhu@redhat.com Cc: stable@vger.kernel.org Fixes: 38b78eb85 ("tracing: Factorize filter creation") Reported-by: Chunyu Hu <chuhu@redhat.com> Tested-by: Chunyu Hu <chuhu@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-10-03audit: Fix use after free in audit_remove_watch_rule()Jan Kara
[ Upstream commit d76036ab47eafa6ce52b69482e91ca3ba337d6d6 ] audit_remove_watch_rule() drops watch's reference to parent but then continues to work with it. That is not safe as parent can get freed once we drop our reference. The following is a trivial reproducer: mount -o loop image /mnt touch /mnt/file auditctl -w /mnt/file -p wax umount /mnt auditctl -D <crash in fsnotify_destroy_mark()> Grab our own reference in audit_remove_watch_rule() earlier to make sure mark does not get freed under us. CC: stable@vger.kernel.org Reported-by: Tony Jones <tonyj@suse.de> Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Tony Jones <tonyj@suse.de> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>