kernel-hacking-2024-linux-s.../net/ipv4
John Fastabend c5d2177a72 bpf, sockmap: Fix race in ingress receive verdict with redirect to self
A socket in a sockmap may have different combinations of programs attached
depending on configuration. There can be no programs in which case the socket
acts as a sink only. There can be a TX program in this case a BPF program is
attached to sending side, but no RX program is attached. There can be an RX
program only where sends have no BPF program attached, but receives are hooked
with BPF. And finally, both TX and RX programs may be attached. Giving us the
permutations:

 None, Tx, Rx, and TxRx

To date most of our use cases have been TX case being used as a fast datapath
to directly copy between local application and a userspace proxy. Or Rx cases
and TxRX applications that are operating an in kernel based proxy. The traffic
in the first case where we hook applications into a userspace application looks
like this:

  AppA  redirect   AppB
   Tx <-----------> Rx
   |                |
   +                +
   TCP <--> lo <--> TCP

In this case all traffic from AppA (after 3whs) is copied into the AppB
ingress queue and no traffic is ever on the TCP recieive_queue.

In the second case the application never receives, except in some rare error
cases, traffic on the actual user space socket. Instead the send happens in
the kernel.

           AppProxy       socket pool
       sk0 ------------->{sk1,sk2, skn}
        ^                      |
        |                      |
        |                      v
       ingress              lb egress
       TCP                  TCP

Here because traffic is never read off the socket with userspace recv() APIs
there is only ever one reader on the sk receive_queue. Namely the BPF programs.

However, we've started to introduce a third configuration where the BPF program
on receive should process the data, but then the normal case is to push the
data into the receive queue of AppB.

       AppB
       recv()                (userspace)
     -----------------------
       tcp_bpf_recvmsg()     (kernel)
         |             |
         |             |
         |             |
       ingress_msgQ    |
         |             |
       RX_BPF          |
         |             |
         v             v
       sk->receive_queue

This is different from the App{A,B} redirect because traffic is first received
on the sk->receive_queue.

Now for the issue. The tcp_bpf_recvmsg() handler first checks the ingress_msg
queue for any data handled by the BPF rx program and returned with PASS code
so that it was enqueued on the ingress msg queue. Then if no data exists on
that queue it checks the socket receive queue. Unfortunately, this is the same
receive_queue the BPF program is reading data off of. So we get a race. Its
possible for the recvmsg() hook to pull data off the receive_queue before the
BPF hook has a chance to read it. It typically happens when an application is
banging on recv() and getting EAGAINs. Until they manage to race with the RX
BPF program.

To fix this we note that before this patch at attach time when the socket is
loaded into the map we check if it needs a TX program or just the base set of
proto bpf hooks. Then it uses the above general RX hook regardless of if we
have a BPF program attached at rx or not. This patch now extends this check to
handle all cases enumerated above, TX, RX, TXRX, and none. And to fix above
race when an RX program is attached we use a new hook that is nearly identical
to the old one except now we do not let the recv() call skip the RX BPF program.
Now only the BPF program pulls data from sk->receive_queue and recv() only
pulls data from the ingress msgQ post BPF program handling.

With this resolved our AppB from above has been up and running for many hours
without detecting any errors. We do this by correlating counters in RX BPF
events and the AppB to ensure data is never skipping the BPF program. Selftests,
was not able to detect this because we only run them for a short period of time
on well ordered send/recvs so we don't get any of the noise we see in real
application environments.

Fixes: 51199405f9 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com
2021-11-09 00:58:26 +01:00
..
bpfilter
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next 2021-10-18 14:05:25 +01:00
af_inet.c net: introduce sk_forward_alloc_get() 2021-10-27 18:20:29 -07:00
ah4.c Networking changes for 5.14. 2021-06-30 15:51:09 -07:00
arp.c net: arp: introduce arp_evict_nocarrier sysctl parameter 2021-11-01 19:57:14 -07:00
bpf_tcp_ca.c bpf: Factor out helpers for ctx access checking 2021-11-01 14:10:00 -07:00
cipso_ipv4.c NET: IPV4: fix error "do not initialise globals to 0" 2021-09-19 12:43:56 +01:00
datagram.c net/ipv4/datagram.c: remove superfluous header files from datagram.c 2021-09-29 11:39:33 +01:00
devinet.c net: arp: introduce arp_evict_nocarrier sysctl parameter 2021-11-01 19:57:14 -07:00
esp4.c ipsec: Remove unneeded extra variable in esp4 esp_ssg_unref() 2021-07-20 16:14:23 +02:00
esp4_offload.c xfrm: remove description from xfrm_type struct 2021-06-09 09:38:52 +02:00
fib_frontend.c net: Use nlmsg_unicast() instead of netlink_unicast() 2021-07-13 09:28:29 -07:00
fib_lookup.h ipv4: Fix spelling mistakes 2021-06-07 14:08:30 -07:00
fib_notifier.c net: ipv4: remove superfluous header files from fib_notifier.c 2021-09-28 17:32:56 -07:00
fib_rules.c
fib_semantics.c net: ipv4: Fix rtnexthop len when RTA_FLOW is present 2021-09-24 14:07:10 +01:00
fib_trie.c memcg: enable accounting for IP address and routing-related objects 2021-07-20 06:00:38 -07:00
fou.c fou: remove sparse errors 2021-08-31 12:03:33 +01:00
gre_demux.c net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
gre_offload.c
icmp.c icmp: fix icmp_ext_echo_iio parsing in icmp_build_probe 2021-10-14 07:54:47 -07:00
igmp.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-08-13 06:41:22 -07:00
inet_connection_sock.c tcp: switch orphan_count to bare per-cpu counters 2021-10-15 11:28:34 +01:00
inet_diag.c net: introduce sk_forward_alloc_get() 2021-10-27 18:20:29 -07:00
inet_fragment.c
inet_hashtables.c tcp: switch orphan_count to bare per-cpu counters 2021-10-15 11:28:34 +01:00
inet_timewait_sock.c
inetpeer.c
ip_forward.c
ip_fragment.c
ip_gre.c ip: use dev_addr_set() in tunnels 2021-10-13 09:41:37 -07:00
ip_input.c
ip_options.c
ip_output.c net: ipv4: Fix the warning for dereference 2021-08-30 12:47:09 +01:00
ip_sockglue.c ipv4: guard IP_MINTTL with a static key 2021-10-25 18:02:14 -07:00
ip_tunnel.c ip: use dev_addr_set() in tunnels 2021-10-13 09:41:37 -07:00
ip_tunnel_core.c
ip_vti.c ip: use dev_addr_set() in tunnels 2021-10-13 09:41:37 -07:00
ipcomp.c Networking changes for 5.14. 2021-06-30 15:51:09 -07:00
ipconfig.c net: ipconfig: Release the rtnl_lock while waiting for carrier 2021-10-28 14:36:41 +01:00
ipip.c ip: use dev_addr_set() in tunnels 2021-10-13 09:41:37 -07:00
ipmr.c ipmr: Fix indentation issue 2021-07-07 20:52:25 -07:00
ipmr_base.c
Kconfig
Makefile
metrics.c
netfilter.c
netlink.c
nexthop.c nexthop: Fix memory leaks in nexthop notification chain listeners 2021-09-23 12:33:22 +01:00
ping.c Revert "Revert "ipv4: fix memory leaks in ip_cmsg_send() callers"" 2021-09-14 14:24:31 +01:00
proc.c tcp: switch orphan_count to bare per-cpu counters 2021-10-15 11:28:34 +01:00
protocol.c net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
raw.c Revert "Revert "ipv4: fix memory leaks in ip_cmsg_send() callers"" 2021-09-14 14:24:31 +01:00
raw_diag.c net: Use nlmsg_unicast() instead of netlink_unicast() 2021-07-13 09:28:29 -07:00
route.c net/ipv4/route.c: remove superfluous header files from route.c 2021-09-20 13:09:06 +01:00
syncookies.c net/ipv4/syncookies.c: remove superfluous header files from syncookies.c 2021-09-21 10:48:47 +01:00
sysctl_net_ipv4.c tcp: remove sk_{tr}x_skb_cache 2021-09-23 12:50:26 +01:00
tcp.c net: avoid double accounting for pure zerocopy skbs 2021-11-03 11:19:49 +00:00
tcp_bbr.c bpf: Enable TCP congestion control kfunc from modules 2021-10-05 17:07:41 -07:00
tcp_bic.c
tcp_bpf.c bpf, sockmap: Fix race in ingress receive verdict with redirect to self 2021-11-09 00:58:26 +01:00
tcp_cdg.c
tcp_cong.c net: Only allow init netns to set default tcp cong to a restricted algo 2021-05-04 11:58:28 -07:00
tcp_cubic.c bpf: Enable TCP congestion control kfunc from modules 2021-10-05 17:07:41 -07:00
tcp_dctcp.c bpf: Enable TCP congestion control kfunc from modules 2021-10-05 17:07:41 -07:00
tcp_dctcp.h
tcp_diag.c
tcp_fastopen.c net/ipv4/tcp_fastopen.c: remove superfluous header files from tcp_fastopen.c 2021-09-20 13:09:06 +01:00
tcp_highspeed.c
tcp_htcp.c
tcp_hybla.c
tcp_illinois.c
tcp_input.c tcp: adjust rcv_ssthresh according to sk_reserved_mem 2021-09-30 13:36:46 +01:00
tcp_ipv4.c ipv4: guard IP_MINTTL with a static key 2021-10-25 18:02:14 -07:00
tcp_lp.c
tcp_metrics.c
tcp_minisocks.c net/ipv4/tcp_minisocks.c: remove superfluous header files from tcp_minisocks.c 2021-09-20 13:09:06 +01:00
tcp_nv.c net/ipv4/tcp_nv.c: remove superfluous header files from tcp_nv.c 2021-09-27 12:47:39 +01:00
tcp_offload.c net, gro: Set inner transport header offset in tcp/udp GRO hook 2021-08-02 10:20:56 +01:00
tcp_output.c tcp: Use BIT() for OPTION_* constants 2021-11-04 11:26:15 +00:00
tcp_rate.c tcp: tracking packets with CE marks in BW rate sample 2021-09-24 14:16:40 +01:00
tcp_recovery.c tcp: more accurately check DSACKs to grow RACK reordering window 2021-07-27 20:07:21 +01:00
tcp_scalable.c
tcp_timer.c net: sock: introduce sk_error_report 2021-06-29 11:28:21 -07:00
tcp_ulp.c
tcp_vegas.c
tcp_vegas.h
tcp_veno.c
tcp_westwood.c
tcp_yeah.c tcp_yeah: check struct yeah size at compile time 2021-06-29 11:54:36 -07:00
tunnel4.c net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
udp.c net: Implement ->sock_is_readable() for UDP and AF_UNIX 2021-10-26 12:29:33 -07:00
udp_bpf.c net: Implement ->sock_is_readable() for UDP and AF_UNIX 2021-10-26 12:29:33 -07:00
udp_diag.c net: Use nlmsg_unicast() instead of netlink_unicast() 2021-07-13 09:28:29 -07:00
udp_impl.h
udp_offload.c fou: remove sparse errors 2021-08-31 12:03:33 +01:00
udp_tunnel_core.c net/ipv4/udp_tunnel_core.c: remove superfluous header files from udp_tunnel_core.c 2021-09-21 10:17:20 +01:00
udp_tunnel_nic.c udp_tunnel: Fix udp_tunnel_nic work-queue type 2021-09-13 12:38:45 +01:00
udp_tunnel_stub.c
udplite.c net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
xfrm4_input.c
xfrm4_output.c
xfrm4_policy.c
xfrm4_protocol.c net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
xfrm4_state.c
xfrm4_tunnel.c net/ipv4/xfrm4_tunnel.c: remove superfluous header files from xfrm4_tunnel.c 2021-09-23 10:10:00 +02:00