Commit graph

574457 commits

Author SHA1 Message Date
John Underwood
f534039dd8 i40e: add check for null VSI
Return from i40e_vsi_reinit_setup() if vsi param is NULL.
This makes this code consistent with all the other code that
checks for NULL before using one of the VSI pointers accessed
with an indexed variable. (Indexed VSI pointers are
intentionally set to NULL in i40e_vsi_clear() and
i40e_remove().

Change-ID: I3bc8b909c70fd2439334eeae994d151f61480985
Signed-off-by: John Underwood <johnx.underwood@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 23:52:45 -08:00
Anjali Singhai Jain
fe72608272 i40e: Expose some registers to program parser, FD and RSS logic
This patch adds 7 new register definitions for programming the
parser, flow director and RSS blocks in the HW.

Change-ID: I31e76673125275f3c69a14c646361919d04dc987
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 23:49:31 -08:00
Carolyn Wyborny
730a8f8777 i40e: Fix for unexpected messaging
This fixes an issue where a previously removed message
has returned.  Changing the message type to dev_dbg
leaves the info, if desired, but takes it out of normal
everyday usage. Also changed call to only provide port
data when its valid and not when its not (delete case).

Change-ID: Ief6f33b915f6364c24fa8e5789c2fc3168b5e2ed
Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 23:46:17 -08:00
Neerav Parikh
3fe06f415b i40e: Do not wait for Rx queue disable in DCB reconfig
Just like Tx queues don't wait for Rx queues to be disabled before
DCB has been reconfigured.
Check the queues are disabled only after the DCB configuration has
been applied to the VSI(s) managed by the PF driver.

In case of any timeout issue a PF reset to recover.

Change-ID: Ic51e94c25baf9a5480cee983f35d15575a88642c
Signed-off-by: Neerav Parikh <neerav.parikh@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 23:43:04 -08:00
Kevin Scott
4d7cec078d i40e: Increase timeout when checking GLGEN_RSTAT_DEVSTATE bit
When linking with particular PHY types (ex: copper PHY), the amount of
time it takes for the GLGEN_RSTAT_DEVSTATE to be set increases greatly,
which can lead to a timeout and failure to load the driver.

Change-ID: If02be0dfcd7c57fdde2d5c81cd63651260cd2029
Signed-off-by: Kevin Scott <kevin.c.scott@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 23:39:53 -08:00
Carolyn Wyborny
31b606d0c4 i40e: Fix led blink capability for 10GBaseT PHY
This patch fixes a problem where the ethtool identify adapter
functionality did not work for some copper PHY's.  Without this
patch, the blink led functionality fails on some parts.  This
patch adds PHY write code to blink led's on parts where this
functionality is contained in the PHY rather than the MAC.

Change-ID: Iee7b3453f61d5ffd0b3d03f720ee4f17f919fcc2
Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 23:36:42 -08:00
Carolyn Wyborny
fd077cd339 i40e: Add functions to blink led on 10GBaseT PHY
This patch adds functions to blink led on devices using
10GBaseT PHY since MAC registers used in other designs
do not work in this device configuration.

Change-ID: Id4b88c93c649fd2b88073a00b42867a77c761ca3
Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 23:33:28 -08:00
Alexander Duyck
3bc67973e8 i40e/i40evf: Move Tx checksum closer to TSO
On all of the other Intel drivers we place checksum close to TSO as they
have a significant amount in common and it can help to reduce the decision
tree for how to handle the frame as the first check in TSO is to see if
checksumming is offloaded, and if it is not we can skip _BOTH_ TSO and Tx
checksum offload based on a single check.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 23:30:19 -08:00
Alexander Duyck
2d37490b82 i40e/i40evf: Rewrite logic for 8 descriptor per packet check
This patch is meant to rewrite the logic for how we determine if we can
transmit the frame or if it needs to be linearized.

The previous code for this function was using a mix of division and modulus
division as a part of computing if we need to take the slow path.  Instead
I have replaced this by simply working with a sliding window which will
tell us if the frame would be capable of causing a single packet to span
several descriptors.

The logic for the scan is fairly simple.  If any given group of 6 fragments
is less than gso_size - 1 then it is possible for us to have one byte
coming out of the first fragment, 6 fragments, and one or more bytes coming
out of the last fragment.  This gives us a total of 8 fragments
which exceeds what we can allow so we send such frames to be linearized.

Arguably the use of modulus might be more exact as the approach I propose
may generate some false positives.  However the likelihood of us taking much
of a hit for those false positives is fairly low, and I would rather not
add more overhead in the case where we are receiving a frame composed of 4K
pages.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 23:27:05 -08:00
Alexander Duyck
4ec441df25 i40e/i40evf: Break up xmit_descriptor_count from maybe_stop_tx
In an upcoming patch I would like to have access to the descriptor count
used for the data portion of the frame.  For this reason I am splitting up
the descriptor count function from the function that stops the ring.

Also in order to try and reduce unnecessary duplication of code I am moving
the slow-path portions of the code out of being inline calls so that we can
just jump to them and process them instead of having to build them into
each function that calls them.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 23:23:51 -08:00
David S. Miller
d289cbed9d Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2016-02-18

This series contains updates to i40e and i40evf only.

Alex Duyck provides all the patches in the series to update and fix the
drivers.  Fixed the driver to drop the outer checksum offload on UDP
tunnels, since the issue is that the upper levels of the stack never
requested such an offload and it results in possible errors.  Updates the
TSO function to just use u64 values, so we do not have to end up casting
u32 values.  In the TSO path, factored out the L4 header offsets allowing
us to ignore the L4 header offsets when dealing with the L3 checksum and
length update.  Consolidates all of the spots where we were updating
either the TCP or IP checksums in the TSO and checksum path into the TSO
function.  Fixed two issues by adding support for IPv4 encapsulated in
IPv6, first issue was the fact that iphdr(skb)->protocol was being used to
test for the outer transport protocol which breaks IPv6 support.  The second
was that we cleared the flag for v4 going to v6, but we did not take care
of txflags going the other way.  Added support for IPv6 extension headers
in setting up the Tx checksum.  Added exception handling to the Tx
checksum path so that we can handle cases of TSO where the frame is bad,
or Tx checksum where we did not recognize a protocol.  Fixed a number of
issues to make certain that we are using the correct protocols when
parsing both the inner and outer headers of a frame that is mixed between
IPv4 and IPv6 for inner and outer.  Updated the feature flags to reflect
the newly enabled/added features.

Sorry, no witty patch descriptions this time around, probably should
let Mitch help in writing patch descriptions for Alex. :-)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 23:47:04 -05:00
Yuval Mintz
376471a7b6 bnx2x: Add missing HSI for big-endian machines
Commit e5d3a51cef ("bnx2x: extend DCBx support") was missing HSI
changes for big-endian machine, breaking compilation on such
platforms.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 23:33:23 -05:00
Alexander Duyck
ffcc55c0c2 i40e: Add support for ATR w/ IPv6 extension headers
This patch updates the code for determining the L4 protocol and L3 header
length so that when IPv6 extension headers are being used we can determine
the offset and type of the L4 protocol.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 15:57:26 -08:00
Alexander Duyck
f608e6a60f i40evf: Update feature flags to reflect newly enabled features
Recent changes should have enabled support for IPv6 based tunnels and
support for TSO with outer UDP checksums.  As such we can update the
feature flags to reflect that.

In addition we can clean-up the flags that aren't needed such as SCTP and
RXCSUM since having the bits there doesn't add any value.

I also found one spot where we were setting the same flag twice.  It looks
like it was probably a git merge error that resulted in the line being
duplicated.  As such I have dropped it in this patch.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 15:30:55 -08:00
Alexander Duyck
bc5d252b36 i40e: Update feature flags to reflect newly enabled features
Recent changes should have enabled support for IPv6 based tunnels and
support for TSO with outer UDP checksums.  As such we can update the
feature flags to reflect that.

In addition we can clean-up the flags that aren't needed such as SCTP and
RXCSUM since having the bits there doesn't add any value.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 15:26:28 -08:00
Alexander Duyck
84d5946d49 i40e: Do not drop support for IPv6 VXLAN or GENEVE tunnels
All of the documentation in the datasheets for the XL710 do not call out
any reason to exclude support for IPv6 based tunnels.  As such I am
dropping the code that was excluding these tunnel types from having their
port numbers recognized.  This way we can take advantage of things such as
checksum offload for inner headers over IPv6 based VXLAN or GENEVE
tunnels.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 15:14:59 -08:00
Alexander Duyck
6b037cd465 i40e: Fix ATR in relation to tunnels
This patch contains a number of fixes to make certain that we are using
the correct protocols when parsing both the inner and outer headers of a
frame that is mixed between IPv4 and IPv6 for inner and outer.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Kiran Patil <kiran.patil@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 15:01:35 -08:00
Alexander Duyck
5453205cd0 i40e/i40evf: Enable support for SKB_GSO_UDP_TUNNEL_CSUM
The XL722 has support for providing the outer UDP tunnel checksum on
transmits.  Make use of this feature to support segmenting UDP tunnels with
outer checksums enabled.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 14:46:23 -08:00
Alexander Duyck
fad57330b6 i40e/i40evf: Clean-up Rx packet checksum handling
This is mostly a minor clean-up for the Rx checksum path in order to avoid
some of the unnecessary conditional checks that were being applied.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 14:38:56 -08:00
David S. Miller
f58ee41e83 Merge branch 'qed-vlan-filtering'
Yuval Mintz says:

====================
qed{,e}: Add vlan filtering offload

This series adds vlan filtering offload to qede.
First patch introduces small additional infrastructure needed in
qed to support it, while second contains the main bulk of driver changes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 16:07:45 -05:00
Sudarsana Reddy Kalluru
7c1bfcad9f qede: Add vlan filtering offload support
Device would start receiving only vlan-tagged traffic with tags matching
that of one of the configured vlan IDs, unless:
  - Device is expliicly placed in PROMISC mode.
  - Device exhausts its vlan filter credits.

Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 16:07:45 -05:00
Yuval Mintz
3f9b4a6972 qed: Lay infrastructure for vlan filtering offload
Today, interfaces are working in vlan-promisc mode; But once
vlan filtering offloaded would be supported, we'll need a method to
control it directly [e.g., when setting device to PROMISC, or when
running out of vlan credits].

This adds the necessary API for L2 client to manually choose whether to
accept all vlans or only those for which filters were configured.

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 16:07:44 -05:00
Andrew F. Davis
e12a285c9b net: phy: dp83848: Fix sysfs naming collision warning
Files in sysfs are created using the name from the phy_driver struct,
when two names are the same we may get a duplicate filename warning,
fix this.

Reported-by: kernel test robot <ying.huang@linux.intel.com>
Signed-off-by: Andrew F. Davis <afd@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 15:53:10 -05:00
Alexander Duyck
9e74a6dadb net: Optimize local checksum offload
This patch takes advantage of several assumptions we can make about the
headers of the frame in order to reduce overall processing overhead for
computing the outer header checksum.

First we can assume the entire header is in the region pointed to by
skb->head as this is what csum_start is based on.

Second, as a result of our first assumption, we can just call csum_partial
instead of making a call to skb_checksum which would end up having to
configure things so that we could walk through the frags list.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 15:29:25 -05:00
Benjamin Poirier
e550785c30 ipv6: Annotate change of locking mechanism for np->opt
follows up commit 45f6fad84c ("ipv6: add complete rcu protection around
np->opt") which added mixed rcu/refcount protection to np->opt.

Given the current implementation of rcu_pointer_handoff(), this has no
effect at runtime.

Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 15:27:25 -05:00
David S. Miller
f0a404505c Merge branch 'iptunnel-pkt-scrub-consolidate'
Jiri Benc says:

====================
iptunnel: scrub packet in iptunnel_pull_header

As every IP tunnel has to scrub skb on decapsulation, iptunnel_pull_header
tried to do that and open coded part of skb_scrub_packet. Various tunneling
protocols (VXLAN, Geneve) then called full skb_scrub_packet on their own,
duplicating part of the scrubbing already done.

Consolidate the code, calling skb_scrub_packet from iptunnel_pull_header.
This will allow additional cleanups in VXLAN code, as the packet is scrubbed
early during rx processing after this patchset and VXLAN can start filling
out skb fields earlier.

The full picture of vxlan cleanup patches can be seen at:
https://github.com/jbenc/linux-vxlan/commits/master
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 14:35:02 -05:00
Jiri Benc
7f290c9435 iptunnel: scrub packet in iptunnel_pull_header
Part of skb_scrub_packet was open coded in iptunnel_pull_header. Let it call
skb_scrub_packet directly instead.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 14:34:54 -05:00
Jiri Benc
c9e78efb6f vxlan: move vxlan device lookup before iptunnel_pull_header
This is in preparation for iptunnel_pull_header calling skb_scrub_packet.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 14:34:54 -05:00
Jiri Benc
9fc4754582 geneve: move geneve device lookup before iptunnel_pull_header
This is in preparation for iptunnel_pull_header calling skb_scrub_packet.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 14:34:31 -05:00
Jiri Benc
1e9f12ec92 geneve: implement geneve_get_sk_family helper
Similarly to the existing vxlan_get_sk_family.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 14:34:31 -05:00
Vivien Didelot
7c25b16dbb net: bridge: log port STP state on change
Remove the shared br_log_state function and print the info directly in
br_set_state, where the net_bridge_port state is actually changed.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 14:20:08 -05:00
David S. Miller
37a6351a64 Merge branch 'cxgb4-addr-sync'
Hariprasad Shenai says:

====================
cxgb4: Use __dev_[um]c_[un]sync for MAC address syncing

This patch series adds support to use __dev_uc_sync/__dev_mc_sync to add
MAC address and __dev_uc_unsync/__dev_mc_unsync to delete MAC address.

This patch series has been created against net-next tree and includes
patches on cxgb4 and cxgb4vf driver.

We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 14:16:13 -05:00
Hariprasad Shenai
fe5d2709b0 cxgb4vf: Use __dev_uc_sync/__dev_mc_sync to sync MAC address
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 14:16:12 -05:00
Hariprasad Shenai
fc08a01a69 cxgb4: Use __dev_uc_sync/__dev_mc_sync to sync MAC address
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 14:16:12 -05:00
Alexander Duyck
529f1f652e i40e/i40evf: Add exception handling for Tx checksum
Add exception handling to the Tx checksum path so that we can handle cases
of TSO where the frame is bad, or Tx checksum where we didn't recognize a
protocol

Drop I40E_TX_FLAGS_CSUM as it is unused, move the CHECKSUM_PARTIAL check
into the function itself so that we can decrease indent.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 11:05:58 -08:00
Alexander Duyck
475b4205aa i40e/i40evf: Do not write to descriptor unless we complete
This patch defers writing to the Tx descriptor bits until we know we have
successfully completed a given operation.  So for example we defer updating
the tunnelling portion of the context descriptor until we have fully
identified the type.

The advantage to this approach is that we can assemble values as we go
instead of having to try and kludge everything together all at once.  As a
result we can significantly clean up the tunneling configuration for
instance as we can just do a pointer walk and do the math for the distance
between each set of points.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 11:00:56 -08:00
Jiri Benc
07dabf20d9 vxlan: tun_id is 64bit, not 32bit
The tun_id field in struct ip_tunnel_key is __be64, not __be32. We need to
convert the vni to tun_id correctly.

Fixes: 54bfd872bf ("vxlan: keep flags and vni in network byte order")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 13:55:24 -05:00
Alexander Duyck
a3fd9d8876 i40e/i40evf: Handle IPv6 extension headers in checksum offload
This patch adds support for IPv6 extension headers in setting up the Tx
checksum.  Without this patch extension headers would cause IPv6 traffic to
fail as the transport protocol could not be identified.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 10:49:42 -08:00
Alexander Duyck
a0064728f8 i40e/i40evf: Add support for IPv4 encapsulated in IPv6
This patch fixes two issues.  First was the fact that iphdr(skb)->protocl
was being used to test for the outer transport protocol.  This completely
breaks IPv6 support.  Second was the fact that we cleared the flag for v4
going to v6, but we didn't take care of txflags going the other way.  As
such we would have the v6 flag still set even if the inner header was v4.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 10:45:12 -08:00
Alexander Duyck
b96b78f2b7 i40e/i40evf: Replace header pointers with unions of pointers in Tx checksum path
The Tx checksum path was maintaining a set of 3 pointers and two lengths in
order to prepare the packet for being checksummed.  The thing is we only
really needed 2 pointers, and the lengths that were being maintained can
easily be computed.

As such we can replace the IPv4 and IPv6 header pointers with one single
union that represents both, or a generic pointer to the start of the
network header.  For the L4 headers we can do the same with TCP and a
generic pointer to the start of the transport header.  The length of the
TCP header is obtained by simply multiplying doff by 4, and the network
header length can be obtained by subtracting the network header pointer
from the transport header pointer.

While I was at it I renamed l4_hdr to l4_proto to make it a bit more clear
and less likely to be confused with l4.hdr which is the transport header
pointer.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 10:40:22 -08:00
Alexander Duyck
c777019af1 i40e/i40evf: Consolidate all header changes into TSO function
This patch goes through and pulls all of the spots where we were updating
either the TCP or IP checksums in the TSO and checksum path into the TSO
function.  The general idea here is that we should only be updating the
header after we verify we have completed a skb_cow_head check to verify the
head is writable.

One other advantage to doing this is that it makes things much more
obvious.  For example, in the case of IPv6 there was one spot where the
offset of the IPv4 header checksum was being updated which is obviously
incorrect.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 10:37:15 -08:00
Alexander Duyck
c49a7bc330 i40e/i40evf: Factor out L4 header and checksum from L3 bits in TSO path
This patch makes it so that the L4 header offsets and such can be ignored
when dealing with the L3 checksum and length update.  This is done making
use of two things.

First we can just use the offset from the L4 header to the start of the
packet to determine the L4 offset, and from that we can then make use of
the data offset to determine the full length of the headers.

As far as adjusting the checksum to remove the length we can simply add the
inverse of the length instead of having to recompute the entire
pseudo-header without the length.  In the case of an IPv6 header this
should be significantly cheaper since we can make use of a value we already
needed instead of having to read the source and destination address out of
the packet.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 10:34:05 -08:00
Alexander Duyck
03f9d6a59f i40e/i40evf: Use u64 values instead of casting them in TSO function
Instead of casing u32 values to u64 it makes more sense to just start out
with u64 values in the first place.  This way we don't need to create a
mess with all of the casts needed to populate a 64b value.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 10:30:55 -08:00
Alexander Duyck
a9c9a81f58 i40e/i40evf: Drop outer checksum offload that was not requested
The i40e and i40evf drivers contained code for inserting an outer checksum
on UDP tunnels.  The issue however is that the upper levels of the stack
never requested such an offload and it results in possible errors.

In addition the same logic was being applied to the Rx side where it was
attempting to validate the outer checksum, but the logic there was
incorrect in that it was testing for the resultant sum to be equal to the
header checksum instead of being equal to 0.

Since this code is so massively flawed, and doing things that we didn't ask
for it to do I am just dropping it, and will bring it back later to use as
an offload for SKB_GSO_UDP_TUNNEL_CSUM which can make use of such a
feature.

As far as the Rx feature I am dropping it completely since it would need to
be massively expanded and applied to IPv4 and IPv6 checksums for all parts,
not just the one that supports Tx checksum offload for the outer.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-02-18 10:27:45 -08:00
David S. Miller
f169af2cdc Merge branch 'netlink-mmap-remove'
Florian Westphal says:

====================
netlink: remove mmapped netlink support

As discussed during netconf 2016 in Seville, this series removes
CONFIG_NETLINK_MMAP.

Close to three years after it was merged it has retained several problems
that do not appear to be fixable.

No official netfilter libmnl release contains support for mmap backed netlink
sockets. No openvswitch release makes use of it either.

To use the mmap interface, userspace not only has to probe for mmap netlink
support, it also has to implement a recv/socket receive path in order to
handle messages that exceed the size of an rx ring element (NL_MMAP_STATUS_COPY).

So if there are odd programs out there that attempt to use MMAP netlink
they should continue to work as they already need a socket based code path
to work properly.

The actual revert (first patch) has a list of problems.
The followup patches remove a couple of helpers that are no longer needed
after the revert.

I did a few tests with mmap vs. socket based interface on a 4.4 based
kernel on an i7-4790 box and there are no performance advantages:

loopback, single nfqueue, queueing in -t filter INPUT:
traffic generated by 8 * ping -q -f localhost:
socket backend:
real    0m27.325s
user    0m3.993s
sys     0m23.292s

with mmap ring backend:
real    0m29.054s
user    0m4.924s
sys     0m24.127s

with single tcp stream, unidirectional, loopback mtu set at 1500
(nc localhost discard < /dev/zero > /dev/null):

socket interface:
time nfqdump -b $((8 * 1024 * 1024 * 1024)) -w /dev/null
real    0m15.960s
user    0m1.756s
sys     0m11.143s

mmap ring:
real    0m16.441s
user    0m3.040s
sys     0m13.687s

socket interface nfqdump[1] with --gso option (i.e. MTU is exceeded,
no kernel-side segmentation and checksum fixups) completes in about 5s.

I also tested dumping a conntrack table with 1m entries.
On my box this takes about 2.4 seconds for both mmap and socket backend:

time LD_PRELOAD=../../src/.libs/libmnl.so ./nfct-dump-sk > /dev/null
mnl_cb_run: Success
messages: 1000000
real    0m2.485s
user    0m1.085s
sys     0m1.400s

time LD_PRELOAD=../../src/.libs/libmnl.so ./nfct-dump-mmap > /dev/null
messages: 1000000
real    0m2.451s
user    0m1.124s
sys     0m1.328s

[1] https://git.breakpoint.cc/cgit/fw/nfqdump.git/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 11:42:41 -05:00
Florian Westphal
c5b0db3263 nfnetlink: Revert "nfnetlink: add support for memory mapped netlink"
reverts commit 3ab1f683bf ("nfnetlink: add support for memory mapped
netlink")'

Like previous commits in the series, remove wrappers that are not needed
after mmapped netlink removal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 11:42:22 -05:00
Florian Westphal
905f0a739a nfnetlink: remove nfnetlink_alloc_skb
Following mmapped netlink removal this code can be simplified by
removing the alloc wrapper.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 11:42:19 -05:00
Florian Westphal
263ea09084 Revert "genl: Add genlmsg_new_unicast() for unicast message allocation"
This reverts commit bb9b18fb55 ("genl: Add genlmsg_new_unicast() for
unicast message allocation")'.

Nothing wrong with it; its no longer needed since this was only for
mmapped netlink support.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 11:42:19 -05:00
Florian Westphal
551ddc057e openvswitch: Revert: "Enable memory mapped Netlink i/o"
revert commit 795449d8b8 ("openvswitch: Enable memory mapped Netlink i/o").
Following the mmaped netlink removal this code can be removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 11:42:18 -05:00
Florian Westphal
d1b4c689d4 netlink: remove mmapped netlink support
mmapped netlink has a number of unresolved issues:

- TX zerocopy support had to be disabled more than a year ago via
  commit 4682a03586 ("netlink: Always copy on mmap TX.")
  because the content of the mmapped area can change after netlink
  attribute validation but before message processing.

- RX support was implemented mainly to speed up nfqueue dumping packet
  payload to userspace.  However, since commit ae08ce0021
  ("netfilter: nfnetlink_queue: zero copy support") we avoid one copy
  with the socket-based interface too (via the skb_zerocopy helper).

The other problem is that skbs attached to mmaped netlink socket
behave different from normal skbs:

- they don't have a shinfo area, so all functions that use skb_shinfo()
(e.g. skb_clone) cannot be used.

- reserving headroom prevents userspace from seeing the content as
it expects message to start at skb->head.
See for instance
commit aa3a022094 ("netlink: not trim skb for mmaped socket when dump").

- skbs handed e.g. to netlink_ack must have non-NULL skb->sk, else we
crash because it needs the sk to check if a tx ring is attached.

Also not obvious, leads to non-intuitive bug fixes such as 7c7bdf359
("netfilter: nfnetlink: use original skbuff when acking batches").

mmaped netlink also didn't play nicely with the skb_zerocopy helper
used by nfqueue and openvswitch.  Daniel Borkmann fixed this via
commit 6bb0fef489 ("netlink, mmap: fix edge-case leakages in nf queue
zero-copy")' but at the cost of also needing to provide remaining
length to the allocation function.

nfqueue also has problems when used with mmaped rx netlink:
- mmaped netlink doesn't allow use of nfqueue batch verdict messages.
  Problem is that in the mmap case, the allocation time also determines
  the ordering in which the frame will be seen by userspace (A
  allocating before B means that A is located in earlier ring slot,
  but this also means that B might get a lower sequence number then A
  since seqno is decided later.  To fix this we would need to extend the
  spinlocked region to also cover the allocation and message setup which
  isn't desirable.
- nfqueue can now be configured to queue large (GSO) skbs to userspace.
  Queing GSO packets is faster than having to force a software segmentation
  in the kernel, so this is a desirable option.  However, with a mmap based
  ring one has to use 64kb per ring slot element, else mmap has to fall back
  to the socket path (NL_MMAP_STATUS_COPY) for all large packets.

To use the mmap interface, userspace not only has to probe for mmap netlink
support, it also has to implement a recv/socket receive path in order to
handle messages that exceed the size of an rx ring element.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 11:42:18 -05:00