android_kernel_samsung_univ.../net
Florian Westphal 0c0be310ba netlink: remove mmapped netlink support
commit d1b4c689d4130bcfd3532680b64db562300716b6 upstream.

mmapped netlink has a number of unresolved issues:

- TX zerocopy support had to be disabled more than a year ago via
  commit 4682a03586 ("netlink: Always copy on mmap TX.")
  because the content of the mmapped area can change after netlink
  attribute validation but before message processing.

- RX support was implemented mainly to speed up nfqueue dumping packet
  payload to userspace.  However, since commit ae08ce0021
  ("netfilter: nfnetlink_queue: zero copy support") we avoid one copy
  with the socket-based interface too (via the skb_zerocopy helper).

The other problem is that skbs attached to mmaped netlink socket
behave different from normal skbs:

- they don't have a shinfo area, so all functions that use skb_shinfo()
(e.g. skb_clone) cannot be used.

- reserving headroom prevents userspace from seeing the content as
it expects message to start at skb->head.
See for instance
commit aa3a022094fa ("netlink: not trim skb for mmaped socket when dump").

- skbs handed e.g. to netlink_ack must have non-NULL skb->sk, else we
crash because it needs the sk to check if a tx ring is attached.

Also not obvious, leads to non-intuitive bug fixes such as 7c7bdf359
("netfilter: nfnetlink: use original skbuff when acking batches").

mmaped netlink also didn't play nicely with the skb_zerocopy helper
used by nfqueue and openvswitch.  Daniel Borkmann fixed this via
commit 6bb0fef489 ("netlink, mmap: fix edge-case leakages in nf queue
zero-copy")' but at the cost of also needing to provide remaining
length to the allocation function.

nfqueue also has problems when used with mmaped rx netlink:
- mmaped netlink doesn't allow use of nfqueue batch verdict messages.
  Problem is that in the mmap case, the allocation time also determines
  the ordering in which the frame will be seen by userspace (A
  allocating before B means that A is located in earlier ring slot,
  but this also means that B might get a lower sequence number then A
  since seqno is decided later.  To fix this we would need to extend the
  spinlocked region to also cover the allocation and message setup which
  isn't desirable.
- nfqueue can now be configured to queue large (GSO) skbs to userspace.
  Queing GSO packets is faster than having to force a software segmentation
  in the kernel, so this is a desirable option.  However, with a mmap based
  ring one has to use 64kb per ring slot element, else mmap has to fall back
  to the socket path (NL_MMAP_STATUS_COPY) for all large packets.

To use the mmap interface, userspace not only has to probe for mmap netlink
support, it also has to implement a recv/socket receive path in order to
handle messages that exceed the size of an rx ring element.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Shi Yuejie <shiyuejie@outlook.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-22 12:04:13 +01:00
..
6lowpan
9p
802
8021q net: add recursion limit to GRO 2016-11-15 07:46:38 +01:00
appletalk
atm
ax25 ax25: Fix segfault after sock connection timeout 2017-02-04 09:45:09 +01:00
batman-adv batman-adv: Check for alloc errors when preparing TT local data 2016-12-15 08:49:23 -08:00
bluetooth
bridge bridge: netlink: call br_changelink() during br_dev_newlink() 2017-02-04 09:45:09 +01:00
caif net: caif: fix misleading indentation 2016-09-30 10:18:35 +02:00
can can: Fix kernel panic at security_sock_rcv_skb 2017-02-18 16:39:26 +01:00
ceph libceph: verify authorize reply on connect 2017-01-09 08:07:52 +01:00
core net: use a work queue to defer net_disable_timestamp() work 2017-02-18 16:39:26 +01:00
dcb
dccp dccp: fix freeing skb too early for IPV6_RECVPKTINFO 2017-02-26 11:07:50 +01:00
decnet
dns_resolver
dsa net: dsa: Bring back device detaching in dsa_slave_suspend() 2017-02-04 09:45:10 +01:00
ethernet net: introduce device min_header_len 2017-02-18 16:39:27 +01:00
hsr
ieee802154
ipv4 ip: fix IP_CHECKSUM handling 2017-02-26 11:07:50 +01:00
ipv6 sit: fix a double free on error path 2017-02-18 16:39:27 +01:00
ipx
irda irda: Fix lockdep annotations in hashbin_delete(). 2017-02-26 11:07:50 +01:00
iucv
key
l2tp l2tp: do not use udp_ioctl() 2017-02-18 16:39:28 +01:00
l3mdev
lapb
llc net/llc: avoid BUG_ON() in skb_orphan() 2017-02-26 11:07:49 +01:00
mac80211 mac80211: flush delayed work when entering suspend 2017-03-15 09:57:14 +08:00
mac802154
mpls
netfilter netfilter: nft_dynset: fix element timeout for HZ != 1000 2016-11-26 09:54:54 +01:00
netlabel
netlink netlink: remove mmapped netlink support 2017-03-22 12:04:13 +01:00
netrom
nfc
openvswitch openvswitch: maintain correct checksum state in conntrack actions 2017-02-04 09:45:09 +01:00
packet packet: Do not call fanout_release from atomic contexts 2017-02-26 11:07:50 +01:00
phonet
rds
rfkill
rose
rxrpc
sched net, sched: fix soft lockup in tc_classify 2017-01-15 13:41:34 +01:00
sctp sctp: avoid BUG_ON on sctp_wait_for_sndbuf 2017-02-18 16:39:27 +01:00
sunrpc svcrpc: fix oops in absence of krb5 module 2017-02-09 08:02:45 +01:00
switchdev
tipc tipc: fix NULL pointer dereference in shutdown() 2016-09-30 10:18:36 +02:00
unix af_unix: move unix_mknod() out of bindlock 2017-02-04 09:45:09 +01:00
vmw_vsock
wimax
wireless nl80211: fix sched scan netlink socket owner destruction 2017-01-19 20:17:20 +01:00
x25
xfrm
compat.c
Kconfig
Makefile
socket.c net: socket: fix recvmmsg not returning error from sock_error 2017-02-26 11:07:50 +01:00
sysctl_net.c