commit 35327468a79dd9e343eaf7e66cc372f8277b2a84 Author: Sasha Levin Date: Sat Dec 24 11:09:44 2016 -0500 Linux 4.1.37 Signed-off-by: Sasha Levin commit c27edfb64bbff3cd5d8fc723161128dd4bb57789 Author: Sumit Saxena Date: Wed Nov 9 02:59:42 2016 -0800 scsi: megaraid_sas: fix macro MEGASAS_IS_LOGICAL to avoid regression [ Upstream commit 5e5ec1759dd663a1d5a2f10930224dd009e500e8 ] This patch will fix regression caused by commit 1e793f6fc0db ("scsi: megaraid_sas: Fix data integrity failure for JBOD (passthrough) devices"). The problem was that the MEGASAS_IS_LOGICAL macro did not have braces and as a result the driver ended up exposing a lot of non-existing SCSI devices (all SCSI commands to channels 1,2,3 were returned as SUCCESS-DID_OK by driver). [mkp: clarified patch description] Fixes: 1e793f6fc0db920400574211c48f9157a37e3945 Reported-by: Jens Axboe CC: stable@vger.kernel.org Signed-off-by: Kashyap Desai Signed-off-by: Sumit Saxena Tested-by: Sumit Saxena Reviewed-by: Tomas Henzl Tested-by: Jens Axboe Signed-off-by: Martin K. Petersen Signed-off-by: Sasha Levin commit 016d02981cceb7b0f3436278b71fe3ea87542e20 Author: Michal Kubeček Date: Wed Dec 14 13:24:58 2016 +0100 tipc: check minimum bearer MTU [ Upstream commit 3de81b758853f0b29c61e246679d20b513c4cfec ] Qian Zhang (张谦) reported a potential socket buffer overflow in tipc_msg_build() which is also known as CVE-2016-8632: due to insufficient checks, a buffer overflow can occur if MTU is too short for even tipc headers. As anyone can set device MTU in a user/net namespace, this issue can be abused by a regular user. As agreed in the discussion on Ben Hutchings' original patch, we should check the MTU at the moment a bearer is attached rather than for each processed packet. We also need to repeat the check when bearer MTU is adjusted to new device MTU. UDP case also needs a check to avoid overflow when calculating bearer MTU. References: CVE-2016-8632 Fixes: b97bf3fd8f6a ("[TIPC] Initial merge") Signed-off-by: Michal Kubecek Reported-by: Qian Zhang (张谦) Acked-by: Ying Xue Signed-off-by: David S. Miller Conflicts: net/tipc/bearer.c net/tipc/bearer.h due to 1a90632da8c17a27e0c93538ee987764adee43a5: tipc: eliminate remnants of hungarian notation and b1c29f6b10d5981c89d3ea9b9991ca97141ed6d0 tipc: simplify resetting and disabling of bearers Signed-off-by: Philipp Hahn Signed-off-by: Sasha Levin commit efcf38bd40200212ef3de3d38e11c42958f8afaa Author: Kees Cook Date: Wed Dec 14 13:24:57 2016 +0100 net: ping: check minimum size on ICMP header length [ Upstream commit 0eab121ef8750a5c8637d51534d5e9143fb0633f ] Prior to commit c0371da6047a ("put iov_iter into msghdr") in v3.19, there was no check that the iovec contained enough bytes for an ICMP header, and the read loop would walk across neighboring stack contents. Since the iov_iter conversion, bad arguments are noticed, but the returned error is EFAULT. Returning EINVAL is a clearer error and also solves the problem prior to v3.19. This was found using trinity with KASAN on v3.18: BUG: KASAN: stack-out-of-bounds in memcpy_fromiovec+0x60/0x114 at addr ffffffc071077da0 Read of size 8 by task trinity-c2/9623 page:ffffffbe034b9a08 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0x0() page dumped because: kasan: bad access detected CPU: 0 PID: 9623 Comm: trinity-c2 Tainted: G BU 3.18.0-dirty #15 Hardware name: Google Tegra210 Smaug Rev 1,3+ (DT) Call trace: [] dump_backtrace+0x0/0x1ac arch/arm64/kernel/traps.c:90 [] show_stack+0x10/0x1c arch/arm64/kernel/traps.c:171 [< inline >] __dump_stack lib/dump_stack.c:15 [] dump_stack+0x7c/0xd0 lib/dump_stack.c:50 [< inline >] print_address_description mm/kasan/report.c:147 [< inline >] kasan_report_error mm/kasan/report.c:236 [] kasan_report+0x380/0x4b8 mm/kasan/report.c:259 [< inline >] check_memory_region mm/kasan/kasan.c:264 [] __asan_load8+0x20/0x70 mm/kasan/kasan.c:507 [] memcpy_fromiovec+0x5c/0x114 lib/iovec.c:15 [< inline >] memcpy_from_msg include/linux/skbuff.h:2667 [] ping_common_sendmsg+0x50/0x108 net/ipv4/ping.c:674 [] ping_v4_sendmsg+0xd8/0x698 net/ipv4/ping.c:714 [] inet_sendmsg+0xe0/0x12c net/ipv4/af_inet.c:749 [< inline >] __sock_sendmsg_nosec net/socket.c:624 [< inline >] __sock_sendmsg net/socket.c:632 [] sock_sendmsg+0x124/0x164 net/socket.c:643 [< inline >] SYSC_sendto net/socket.c:1797 [] SyS_sendto+0x178/0x1d8 net/socket.c:1761 CVE-2016-8399 Reported-by: Qidan He Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook Signed-off-by: David S. Miller Signed-off-by: Philipp Hahn Signed-off-by: Sasha Levin commit e29fdf045048addaea61c837b60e3c4d2ec43614 Author: Philip Pettersson Date: Wed Dec 14 13:24:56 2016 +0100 packet: fix race condition in packet_set_ring [ Upstream commit 84ac7260236a49c79eede91617700174c2c19b0c ] When packet_set_ring creates a ring buffer it will initialize a struct timer_list if the packet version is TPACKET_V3. This value can then be raced by a different thread calling setsockopt to set the version to TPACKET_V1 before packet_set_ring has finished. This leads to a use-after-free on a function pointer in the struct timer_list when the socket is closed as the previously initialized timer will not be deleted. The bug is fixed by taking lock_sock(sk) in packet_setsockopt when changing the packet version while also taking the lock at the start of packet_set_ring. References: CVE-2016-8655 Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.") Signed-off-by: Philip Pettersson Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller Signed-off-by: Philipp Hahn Signed-off-by: Sasha Levin commit fabaaaa96d54077b4a9f2c811e55dc09ff2874db Author: Sabrina Dubroca Date: Wed Dec 14 13:24:55 2016 +0100 net: add recursion limit to GRO [ Debian: net-add-recursion-limit-to-gro.patch ] Currently, GRO can do unlimited recursion through the gro_receive handlers. This was fixed for tunneling protocols by limiting tunnel GRO to one level with encap_mark, but both VLAN and TEB still have this problem. Thus, the kernel is vulnerable to a stack overflow, if we receive a packet composed entirely of VLAN headers. This patch adds a recursion counter to the GRO layer to prevent stack overflow. When a gro_receive function hits the recursion limit, GRO is aborted for this skb and it is processed normally. Thanks to Vladimír Beneš for the initial bug report. Fixes: CVE-2016-7039 Fixes: 9b174d88c257 ("net: Add Transparent Ethernet Bridging GRO support.") Fixes: 66e5133f19e9 ("vlan: Add GRO support for non hardware accelerated vlan") Signed-off-by: Sabrina Dubroca Reviewed-by: Jiri Benc Acked-by: Hannes Frederic Sowa Signed-off-by: Philipp Hahn Signed-off-by: Sasha Levin commit 7abf32087c1dabacf707506585afc7b69aad21b3 Author: Jaganath Kanakkassery Date: Wed Dec 14 13:24:54 2016 +0100 Bluetooth: Fix potential NULL dereference in RFCOMM bind callback [ Upstream commit 951b6a0717db97ce420547222647bcc40bf1eacd ] addr can be NULL and it should not be dereferenced before NULL checking. References: CVE-2015-8956 Signed-off-by: Jaganath Kanakkassery Signed-off-by: Marcel Holtmann Signed-off-by: Philipp Hahn Signed-off-by: Sasha Levin commit 74cd81c810b98e9373b8ebd2981b5bd3bbee1ae1 Author: Jann Horn Date: Wed Dec 14 13:24:53 2016 +0100 ptrace: being capable wrt a process requires mapped uids/gids [ bugfix/all/ptrace-being-capable-wrt-a-process-requires-mapped-uids-gids.patch ] ptrace_has_cap() checks whether the current process should be treated as having a certain capability for ptrace checks against another process. Until now, this was equivalent to has_ns_capability(current, target_ns, CAP_SYS_PTRACE). However, if a root-owned process wants to enter a user namespace for some reason without knowing who owns it and therefore can't change to the namespace owner's uid and gid before entering, as soon as it has entered the namespace, the namespace owner can attach to it via ptrace and thereby gain access to its uid and gid. While it is possible for the entering process to switch to the uid of a claimed namespace owner before entering, causing the attempt to enter to fail if the claimed uid is wrong, this doesn't solve the problem of determining an appropriate gid. With this change, the entering process can first enter the namespace and then safely inspect the namespace's properties, e.g. through /proc/self/{uid_map,gid_map}, assuming that the namespace owner doesn't have access to uid 0. Changed in v2: The caller needs to be capable in the namespace into which tcred's uids/gids can be mapped. Rederences: CVE-2015-8709 References: https://lkml.org/lkml/2015/12/25/71 Signed-off-by: Jann Horn Signed-off-by: Philipp Hahn Signed-off-by: Sasha Levin commit 8165fc3eb28cbd2e4cca07308f3a205ab347a9d1 Author: Dan Carpenter Date: Wed Dec 14 13:24:52 2016 +0100 scsi: arcmsr: Buffer overflow in arcmsr_iop_message_xfer() [ Upstream commit 7bc2b55a5c030685b399bb65b6baa9ccc3d1f167 ] We need to put an upper bound on "user_len" so the memcpy() doesn't overflow. References: CVE-2016-7425 Cc: Reported-by: Marco Grassi Signed-off-by: Dan Carpenter Reviewed-by: Tomas Henzl Signed-off-by: Martin K. Petersen Signed-off-by: Philipp Hahn Signed-off-by: Sasha Levin commit 1171afc4a34e2926e6e8e27c896cf328c8825ac3 Author: Eric W. Biederman Date: Wed Dec 14 13:24:51 2016 +0100 mnt: Add a per mount namespace limit on the number of mounts [ Upstream commit d29216842a85c7970c536108e093963f02714498 ] CAI Qian pointed out that the semantics of shared subtrees make it possible to create an exponentially increasing number of mounts in a mount namespace. mkdir /tmp/1 /tmp/2 mount --make-rshared / for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done Will create create 2^20 or 1048576 mounts, which is a practical problem as some people have managed to hit this by accident. As such CVE-2016-6213 was assigned. Ian Kent described the situation for autofs users as follows: > The number of mounts for direct mount maps is usually not very large because of > the way they are implemented, large direct mount maps can have performance > problems. There can be anywhere from a few (likely case a few hundred) to less > than 10000, plus mounts that have been triggered and not yet expired. > > Indirect mounts have one autofs mount at the root plus the number of mounts that > have been triggered and not yet expired. > > The number of autofs indirect map entries can range from a few to the common > case of several thousand and in rare cases up to between 30000 and 50000. I've > not heard of people with maps larger than 50000 entries. > > The larger the number of map entries the greater the possibility for a large > number of active mounts so it's not hard to expect cases of a 1000 or somewhat > more active mounts. So I am setting the default number of mounts allowed per mount namespace at 100,000. This is more than enough for any use case I know of, but small enough to quickly stop an exponential increase in mounts. Which should be perfect to catch misconfigurations and malfunctioning programs. For anyone who needs a higher limit this can be changed by writing to the new /proc/sys/fs/mount-max sysctl. Tested-by: CAI Qian Signed-off-by: "Eric W. Biederman" Conflicts: fs/namespace.c kernel/sysctl.c Signed-off-by: Philipp Hahn Signed-off-by: Sasha Levin commit 62fa696b7b435e93ed114dd6a23aa0881d7f81b9 Author: Jan Kara Date: Wed Dec 14 13:24:50 2016 +0100 posix_acl: Clear SGID bit when setting file permissions [ Upstream commit 073931017b49d9458aa351605b43a7e34598caef ] When file permissions are modified via chmod(2) and the user is not in the owning group or capable of CAP_FSETID, the setgid bit is cleared in inode_change_ok(). Setting a POSIX ACL via setxattr(2) sets the file permissions as well as the new ACL, but doesn't clear the setgid bit in a similar way; this allows to bypass the check in chmod(2). Fix that. References: CVE-2016-7097 Reviewed-by: Christoph Hellwig Reviewed-by: Jeff Layton Signed-off-by: Jan Kara Signed-off-by: Andreas Gruenbacher Signed-off-by: Philipp Hahn Signed-off-by: Sasha Levin commit de42b9559d0c540152260d484dbc70b3e81f8738 Author: Jan Kara Date: Wed Dec 14 13:24:49 2016 +0100 fs: Avoid premature clearing of capabilities [ Upstream commit 030b533c4fd4d2ec3402363323de4bb2983c9cee ] Currently, notify_change() clears capabilities or IMA attributes by calling security_inode_killpriv() before calling into ->setattr. Thus it happens before any other permission checks in inode_change_ok() and user is thus allowed to trigger clearing of capabilities or IMA attributes for any file he can look up e.g. by calling chown for that file. This is unexpected and can lead to user DoSing a system. Fix the problem by calling security_inode_killpriv() at the end of inode_change_ok() instead of from notify_change(). At that moment we are sure user has permissions to do the requested change. References: CVE-2015-1350 Reviewed-by: Christoph Hellwig Signed-off-by: Jan Kara Signed-off-by: Philipp Hahn Signed-off-by: Sasha Levin commit cb8e1eef351b640cfdb1a753ef44494fbf59186d Author: Jan Kara Date: Wed Dec 14 13:24:48 2016 +0100 fs: Give dentry to inode_change_ok() instead of inode [ Upstream commit 31051c85b5e2aaaf6315f74c72a732673632a905 ] inode_change_ok() will be resposible for clearing capabilities and IMA extended attributes and as such will need dentry. Give it as an argument to inode_change_ok() instead of an inode. Also rename inode_change_ok() to setattr_prepare() to better relect that it does also some modifications in addition to checks. References: CVE-2015-1350 Reviewed-by: Christoph Hellwig Signed-off-by: Jan Kara Signed-off-by: Philipp Hahn Signed-off-by: Sasha Levin commit 2ee3ceeccd34c3fe589506c0fd822a8773a828bf Author: Andreas Gruenbacher Date: Wed Dec 14 13:24:47 2016 +0100 nfsd: Disable NFSv2 timestamp workaround for NFSv3+ NFSv2 can set the atime and/or mtime of a file to specific timestamps but not to the server's current time. To implement the equivalent of utimes("file", NULL), it uses a heuristic. NFSv3 and later do support setting the atime and/or mtime to the server's current time directly. The NFSv2 heuristic is still enabled, and causes timestamps to be set wrong sometimes. Fix this by moving the heuristic into the NFSv2 specific code. We can leave it out of the create code path: the owner can always set timestamps arbitrarily, and the workaround would never trigger. References: CVE-2015-1350 Signed-off-by: Andreas Gruenbacher Reviewed-by: Christoph Hellwig Signed-off-by: J. Bruce Fields Signed-off-by: Philipp Hahn Signed-off-by: Sasha Levin commit 820bc4582ab98a15f6f08aec443dbe731ef04874 Author: Jan Kara Date: Wed Dec 14 13:24:46 2016 +0100 fuse: Propagate dentry down to inode_change_ok() [ Upstream commit 62490330769c1ce5dcba3f1f3e8f4005e9b797e6 ] To avoid clearing of capabilities or security related extended attributes too early, inode_change_ok() will need to take dentry instead of inode. Propagate it down to fuse_do_setattr(). References: CVE-2015-1350 Acked-by: Miklos Szeredi Reviewed-by: Christoph Hellwig Signed-off-by: Jan Kara Conflicts: Missing file_dentry() from d101a125954eae1d397adda94ca6319485a50493 Signed-off-by: Philipp Hahn Signed-off-by: Sasha Levin commit 89bc54c5402751feb71957101670a4da191f9210 Author: Jan Kara Date: Wed Dec 14 13:24:45 2016 +0100 xfs: Propagate dentry down to inode_change_ok() [ upstream commit 69bca80744eef58fa155e8042996b968fec17b26 ] To avoid clearing of capabilities or security related extended attributes too early, inode_change_ok() will need to take dentry instead of inode. Propagate dentry down to functions calling inode_change_ok(). This is rather straightforward except for xfs_set_mode() function which does not have dentry easily available. Luckily that function does not call inode_change_ok() anyway so we just have to do a little dance with function prototypes. References: CVE-2015-1350 Acked-by: Dave Chinner Reviewed-by: Christoph Hellwig Signed-off-by: Jan Kara Conflicts: Missing file_dentry() from d101a125954eae1d397adda94ca6319485a50493 Signed-off-by: Philipp Hahn Signed-off-by: Sasha Levin commit 1b364dc9edba5ad079a43853fd319a3a255f5b29 Author: Andreas Dilger Date: Wed Dec 14 13:24:44 2016 +0100 xattr: Option to disable meta-data block cache mbcache provides absolutely no value for Lustre xattrs (because they are unique and cannot be shared between files) and as we can see it has a noticable overhead in some cases. In the past there was a CONFIG_MBCACHE option that would allow it to be disabled, but this was removed in newer kernels, so we will need to patch ldiskfs to fix this. References: References: References: CVE-2015-8952 On 13.12.2016 at 15:58 Ben Hutchings wrote: > I decided not to apply this as it's a userland ABI extension that we > would then need to carry indefinitely. Signed-off-by: Philipp Hahn Signed-off-by: Sasha Levin commit 9a66bc6ee0f9908ba98a7d19b94d49ec231ab0e1 Author: Eric Dumazet Date: Wed Dec 14 13:24:43 2016 +0100 tcp: fix use after free in tcp_xmit_retransmit_queue() [ Upstream commit bb1fceca22492109be12640d49f5ea5a544c6bb4 ] When tcp_sendmsg() allocates a fresh and empty skb, it puts it at the tail of the write queue using tcp_add_write_queue_tail() Then it attempts to copy user data into this fresh skb. If the copy fails, we undo the work and remove the fresh skb. Unfortunately, this undo lacks the change done to tp->highest_sack and we can leave a dangling pointer (to a freed skb) Later, tcp_xmit_retransmit_queue() can dereference this pointer and access freed memory. For regular kernels where memory is not unmapped, this might cause SACK bugs because tcp_highest_sack_seq() is buggy, returning garbage instead of tp->snd_nxt, but with various debug features like CONFIG_DEBUG_PAGEALLOC, this can crash the kernel. This bug was found by Marco Grassi thanks to syzkaller. Fixes: 6859d49475d4 ("[TCP]: Abstract tp->highest_sack accessing & point to next skb") Reported-by: Marco Grassi Signed-off-by: Eric Dumazet Cc: Ilpo Järvinen Cc: Yuchung Cheng Cc: Neal Cardwell Acked-by: Neal Cardwell Reviewed-by: Cong Wang Signed-off-by: David S. Miller References: CVE-2016-6828 Signed-off-by: Philipp Hahn Signed-off-by: Sasha Levin commit ebdb88b8e4656bab98515b2bb6f21e9f5925ff3f Author: Sebastian Andrzej Siewior Date: Fri Nov 4 19:39:40 2016 +0100 x86/kexec: add -fno-PIE [ Upstream commit 90944e40ba1838de4b2a9290cf273f9d76bd3bdd ] If the gcc is configured to do -fPIE by default then the build aborts later with: | Unsupported relocation type: unknown type rel type name (29) Tagging it stable so it is possible to compile recent stable kernels as well. Cc: stable@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Michal Marek Signed-off-by: Sasha Levin commit 672612a21845b010644d7e2c14ccad69057a267b Author: Sebastian Andrzej Siewior Date: Fri Nov 4 19:39:39 2016 +0100 scripts/has-stack-protector: add -fno-PIE [ Upstream commit 82031ea29e454b574bc6f49a33683a693ca5d907 ] Adding -no-PIE to the fstack protector check. -no-PIE was introduced before -fstack-protector so there is no need for a runtime check. Without it the build stops: |Cannot use CONFIG_CC_STACKPROTECTOR_STRONG: -fstack-protector-strong available but compiler is broken due to -mcmodel=kernel + -fPIE if -fPIE is enabled by default. Tagging it stable so it is possible to compile recent stable kernels as well. Cc: stable@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Michal Marek Signed-off-by: Sasha Levin commit e06ded86d961bcf8356a82220f6392d918ebfa83 Author: Andy Lutomirski Date: Wed Sep 28 12:34:14 2016 -0700 x86/init: Fix cr4_init_shadow() on CR4-less machines [ Upstream commit e1bfc11c5a6f40222a698a818dc269113245820e ] cr4_init_shadow() will panic on 486-like machines without CR4. Fix it using __read_cr4_safe(). Reported-by: david@saggiorato.net Signed-off-by: Andy Lutomirski Reviewed-by: Borislav Petkov Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Josh Poimboeuf Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: stable@vger.kernel.org Fixes: 1e02ce4cccdc ("x86: Store a per-cpu shadow copy of CR4") Link: http://lkml.kernel.org/r/43a20f81fb504013bf613913dc25574b45336a61.1475091074.git.luto@kernel.org Signed-off-by: Ingo Molnar Signed-off-by: Sasha Levin commit eec7469393728b5d32a2294911e5bc562a576e3e Author: Roger Quadros Date: Thu Sep 29 08:32:55 2016 +0100 ARM: 8617/1: dma: fix dma_max_pfn() [ Upstream commit d248220f0465b818887baa9829e691fe662b2c5e ] Since commit 6ce0d2001692 ("ARM: dma: Use dma_pfn_offset for dma address translation"), dma_to_pfn() already returns the PFN with the physical memory start offset so we don't need to add it again. This fixes USB mass storage lock-up problem on systems that can't do DMA over the entire physical memory range (e.g.) Keystone 2 systems with 4GB RAM can only do DMA over the first 2GB. [K2E-EVM]. What happens there is that without this patch SCSI layer sets a wrong bounce buffer limit in scsi_calculate_bounce_limit() for the USB mass storage device. dma_max_pfn() evaluates to 0x8fffff and bounce_limit is set to 0x8fffff000 whereas maximum DMA'ble physical memory on Keystone 2 is 0x87fffffff. This results in non DMA'ble pages being given to the USB controller and hence the lock-up. NOTE: in the above case, USB-SCSI-device's dma_pfn_offset was showing as 0. This should have really been 0x780000 as on K2e, LOWMEM_START is 0x80000000 and HIGHMEM_START is 0x800000000. DMA zone is 2GB so dma_max_pfn should be 0x87ffff. The incorrect dma_pfn_offset for the USB storage device is because USB devices are not correctly inheriting the dma_pfn_offset from the USB host controller. This will be fixed by a separate patch. Fixes: 6ce0d2001692 ("ARM: dma: Use dma_pfn_offset for dma address translation") Cc: stable@vger.kernel.org Cc: Greg Kroah-Hartman Cc: Santosh Shilimkar Cc: Arnd Bergmann Cc: Olof Johansson Cc: Catalin Marinas Cc: Linus Walleij Reported-by: Grygorii Strashko Signed-off-by: Roger Quadros Signed-off-by: Russell King Signed-off-by: Sasha Levin commit 58024f829d0b57d1094c21b6df6937cd62c33b1d Author: zhong jiang Date: Wed Sep 28 15:22:30 2016 -0700 mm,ksm: fix endless looping in allocating memory when ksm enable [ Upstream commit 5b398e416e880159fe55eefd93c6588fa072cd66 ] I hit the following hung task when runing a OOM LTP test case with 4.1 kernel. Call trace: [] __switch_to+0x74/0x8c [] __schedule+0x23c/0x7bc [] schedule+0x3c/0x94 [] rwsem_down_write_failed+0x214/0x350 [] down_write+0x64/0x80 [] __ksm_exit+0x90/0x19c [] mmput+0x118/0x11c [] do_exit+0x2dc/0xa74 [] do_group_exit+0x4c/0xe4 [] get_signal+0x444/0x5e0 [] do_signal+0x1d8/0x450 [] do_notify_resume+0x70/0x78 The oom victim cannot terminate because it needs to take mmap_sem for write while the lock is held by ksmd for read which loops in the page allocator ksm_do_scan scan_get_next_rmap_item down_read get_next_rmap_item alloc_rmap_item #ksmd will loop permanently. There is no way forward because the oom victim cannot release any memory in 4.1 based kernel. Since 4.6 we have the oom reaper which would solve this problem because it would release the memory asynchronously. Nevertheless we can relax alloc_rmap_item requirements and use __GFP_NORETRY because the allocation failure is acceptable as ksm_do_scan would just retry later after the lock got dropped. Such a patch would be also easy to backport to older stable kernels which do not have oom_reaper. While we are at it add GFP_NOWARN so the admin doesn't have to be alarmed by the allocation failure. Link: http://lkml.kernel.org/r/1474165570-44398-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang Suggested-by: Hugh Dickins Suggested-by: Michal Hocko Acked-by: Michal Hocko Acked-by: Hugh Dickins Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Sasha Levin commit d427d645ccf9a214859e25d116584d758ba2a6a6 Author: Sergei Miroshnichenko Date: Wed Sep 7 16:51:12 2016 +0300 can: dev: fix deadlock reported after bus-off [ Upstream commit 9abefcb1aaa58b9d5aa40a8bb12c87d02415e4c8 ] A timer was used to restart after the bus-off state, leading to a relatively large can_restart() executed in an interrupt context, which in turn sets up pinctrl. When this happens during system boot, there is a high probability of grabbing the pinctrl_list_mutex, which is locked already by the probe() of other device, making the kernel suspect a deadlock condition [1]. To resolve this issue, the restart_timer is replaced by a delayed work. [1] https://github.com/victronenergy/venus/issues/24 Signed-off-by: Sergei Miroshnichenko Cc: linux-stable Signed-off-by: Marc Kleine-Budde Signed-off-by: Sasha Levin commit 791a928972743620602700e2e45341b5d1753a5e Author: Joonwoo Park Date: Sun Sep 11 21:14:58 2016 -0700 cpuset: handle race between CPU hotplug and cpuset_hotplug_work [ Upstream commit 28b89b9e6f7b6c8fef7b3af39828722bca20cfee ] A discrepancy between cpu_online_mask and cpuset's effective_cpus mask is inevitable during hotplug since cpuset defers updating of effective_cpus mask using a workqueue, during which time nothing prevents the system from more hotplug operations. For that reason guarantee_online_cpus() walks up the cpuset hierarchy until it finds an intersection under the assumption that top cpuset's effective_cpus mask intersects with cpu_online_mask even with such a race occurring. However a sequence of CPU hotplugs can open a time window, during which none of the effective CPUs in the top cpuset intersect with cpu_online_mask. For example when there are 4 possible CPUs 0-3 and only CPU0 is online: ======================== =========================== cpu_online_mask top_cpuset.effective_cpus ======================== =========================== echo 1 > cpu2/online. CPU hotplug notifier woke up hotplug work but not yet scheduled. [0,2] [0] echo 0 > cpu0/online. The workqueue is still runnable. [2] [0] ======================== =========================== Now there is no intersection between cpu_online_mask and top_cpuset.effective_cpus. Thus invoking sys_sched_setaffinity() at this moment can cause following: Unable to handle kernel NULL pointer dereference at virtual address 000000d0 ------------[ cut here ]------------ Kernel BUG at ffffffc0001389b0 [verbose debug info unavailable] Internal error: Oops - BUG: 96000005 [#1] PREEMPT SMP Modules linked in: CPU: 2 PID: 1420 Comm: taskset Tainted: G W 4.4.8+ #98 task: ffffffc06a5c4880 ti: ffffffc06e124000 task.ti: ffffffc06e124000 PC is at guarantee_online_cpus+0x2c/0x58 LR is at cpuset_cpus_allowed+0x4c/0x6c Process taskset (pid: 1420, stack limit = 0xffffffc06e124020) Call trace: [] guarantee_online_cpus+0x2c/0x58 [] cpuset_cpus_allowed+0x4c/0x6c [] sched_setaffinity+0xc0/0x1ac [] SyS_sched_setaffinity+0x98/0xac [] el0_svc_naked+0x24/0x28 The top cpuset's effective_cpus are guaranteed to be identical to cpu_online_mask eventually. Hence fall back to cpu_online_mask when there is no intersection between top cpuset's effective_cpus and cpu_online_mask. Signed-off-by: Joonwoo Park Acked-by: Li Zefan Cc: Tejun Heo Cc: cgroups@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: # 3.17+ Signed-off-by: Tejun Heo Signed-off-by: Sasha Levin commit 6b82b0601ac232d85a6728b9f6dfcb8160674a0c Author: Karl Beldan Date: Mon Aug 29 07:45:49 2016 +0000 mtd: nand: davinci: Reinitialize the HW ECC engine in 4bit hwctl [ Upstream commit f6d7c1b5598b6407c3f1da795dd54acf99c1990c ] This fixes subpage writes when using 4-bit HW ECC. There has been numerous reports about ECC errors with devices using this driver for a while. Also the 4-bit ECC has been reported as broken with subpages in [1] and with 16 bits NANDs in the driver and in mach* board files both in mainline and in the vendor BSPs. What I saw with 4-bit ECC on a 16bits NAND (on an LCDK) which got me to try reinitializing the ECC engine: - R/W on whole pages properly generates/checks RS code - try writing the 1st subpage only of a blank page, the subpage is well written and the RS code properly generated, re-reading the same page the HW detects some ECC error, reading the same page again no ECC error is detected Note that the ECC engine is already reinitialized in the 1-bit case. Tested on my LCDK with UBI+UBIFS using subpages. This could potentially get rid of the issue workarounded in [1]. [1] 28c015a9daab ("mtd: davinci-nand: disable subpage write for keystone-nand") Fixes: 6a4123e581b3 ("mtd: nand: davinci_nand, 4-bit ECC for smallpage") Cc: Signed-off-by: Karl Beldan Acked-by: Boris Brezillon Signed-off-by: Brian Norris Signed-off-by: Sasha Levin commit e537a0977f3ee837b6c101037c89936348063b2d Author: Rob Clark Date: Mon Aug 22 15:15:23 2016 -0400 drm/msm: fix use of copy_from_user() while holding spinlock [ Upstream commit 89f82cbb0d5c0ab768c8d02914188aa2211cd2e3 ] Use instead __copy_from_user_inatomic() and fallback to slow-path where we drop and re-aquire the lock in case of fault. Cc: stable@vger.kernel.org Reported-by: Vaishali Thakkar Signed-off-by: Rob Clark Signed-off-by: Sasha Levin commit b56eb9cdc5f12808474c03c80da92a669b9880da Author: Pawel Moll Date: Tue Aug 2 16:45:37 2016 +0100 bus: arm-ccn: Fix PMU handling of MN [ Upstream commit 4e486cba285ff06a1f28f0fc2991dde1482d1dcf ] The "Miscellaneous Node" fell through cracks of node initialisation, as its ID is shared with HN-I. This patch treats MN as a special case (which it is), adding separate validation check for it and pre-defining the node ID in relevant events descriptions. That way one can simply run: # perf stat -a -e ccn/mn_ecbarrier/ Additionally, direction in the MN pseudo-events XP watchpoint definitions is corrected to be "TX" (1) as they are defined from the crosspoint point of view (thus barriers are transmitted from XP to MN). Cc: stable@vger.kernel.org # 3.17+ Signed-off-by: Pawel Moll Signed-off-by: Sasha Levin commit 7298a8bf4c63161114b47c6d627480dd5a2c1e1a Author: Pawel Moll Date: Thu Apr 2 14:01:06 2015 +0100 bus: arm-ccn: Provide required event arguments [ Upstream commit 8f06c51fac1ca4104b8b64872f310e28186aea42 ] Since 688d4dfcdd624192cbf03c08402e444d1d11f294 "perf tools: Support parsing parameterized events" the perf userspace tools understands "argument=?" syntax in the events file, making sure that required arguments are provided by the user and not defaulting to 0, causing confusion. This patch adds the required arguments lists for CCN events. Signed-off-by: Pawel Moll Signed-off-by: Sasha Levin