review.tizen.org Git - platform/kernel/linux-rpi.git/log

tcp: refine tcp_tso_should_defer() after EDT adoption

tcp_tso_should_defer() last step tries to check if the probable
next ACK packet is coming in less than half rtt.

Problem is that the head->tstamp might be in the future,
so we need to use signed arithmetics to avoid overflows.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: do not try to defer skbs with eor mark (MSG_EOR)

Applications using MSG_EOR are giving a strong hint to TCP stack :

Subsequent sendmsg() can not append more bytes to skbs having
the EOR mark.

Do not try to TSO defer suchs skbs, there is really no hope.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: minor optimization in tcp ack fast path processing

Bitwise operation is a little faster.
So I replace after() with using the flag FLAG_SND_UNA_ADVANCED as it is
already set before.

In addtion, there's another similar improvement in tcp_cwnd_reduction().

Cc: Joe Perches <joe@perches.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mv88e6xxx-Support-more-SERDES-interfacxes'

Andrew Lunn says:

====================
net: dsa: mv88e6xxx: Support more SERDES interfacxes

Currently the SERDES interfaces for ports 9 and 10 on the mv88e6390x
are supported, allowing upto 10G. However, when unused, these SERDES
interfaces can be used by some of the lower ports for 1000Base-X.

The tricky bit here is ordering. The SERDES have to become free from
ports 9 or 10 before they can be used with lower ports. Normally, this
would happen only when these ports would be configured up, which is
too late. So at probe time, defaulting ports 9 and 10 to 1000BaseX
frees them for use with lower ports. If they are actually needed, they
will be taken back when port 9 and 10 goes up.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Add support for SERDES on ports 2-8 for 6390X

The 6390X family has 8 SERDES interfaces. When ports 9 and 10 are not
using all their SERDES interfaces, the unused ones can be assigned to
ports 2-8. Add support for interrupts from SERDES interfaces connected
to these lower ports.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Default ports 9/10 6390X CMODE to 1000BaseX

The 6390X family has 8 SERDES interfaces. This allows ports 9 and 10
to support up to 10Gbps using 4 SERDES interfaces. However, when lower
speeds are used, which need fewer SERDES interfaces, the unused SERDES
interfaces can be used by ports 2-8.

The hardware defaults to ports 9 and 10 having all 4 SERDES interfaces
assigned to them. This only gets changed when the interface is
configured after what the SFP supports has been determined, or the 10G
PHY completes auto-neg.

For hardware designs which limit ports 9 and 10 to one or two SERDES
interfaces, and place SFPs on the lower interfaces, this is too
late. Those ports with SFP should not wait until ports 9/10 are up in
order to get access to the SERDES interface. So change the default
configuration when the driver is initialised. Configure ports 9 and 10
to 1000BaseX, so they use a single SERDES interface, freeing up the
others. They can steal them back if they need them.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Differentiate between 6390 and 6390X cmodes

The X family variants support additional ports modes, for 10G
operation, which the non-X variants don't have. Add a port_set_cmode()
for non-X variants to enforce this.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Group cmode ops together

Move .port_set_cmode next to .port_get_cmode.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-phy-convert-advertise-and-supported-to-linkmode'

Andrew Lunn says:

====================
net: phy: convert advertise and supported to linkmode

This is the last part in converting phylib to make use of a linux
bitmap, not a u32, to represent links modes. This will allow support
for PHYs > 1Gbps, which need to use link modes represented by a bit >
32.

A number of MAC and PHY drivers need changes to support this. However
the previous two patchesets reduced the number somewhat, the helpers
which were introduced have been modified instead of the actual
drivers.

The follow on patches then make use of the extra bits, adding support
for more link modes.

Given how invasive this change is, i expect the build is broken for
some architectures i did not test. I will fixup the breakage as fast
as i can.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: Add support for resolving 5G and 2.5G autoneg

Now that 2.5G and 5G can be represented in phydev->advertising and
phydev->lp_advertising, add these two links modes as possible
resolutions to auto negotiation.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: Add more link modes to the settings table

Now that PHYs and MAC can support more than 32 bit masks, add link
modes which are > 31 to the PHY settings table.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: Fixup kerneldoc markup.

Add missing markup for function parameters

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: Convert u32 phydev->lp_advertising to linkmode

Convert phy drivers to report the link partner advertised modes using
a linkmode bitmap. This allows them to report the higher speeds which
don't fit in a u32.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: Convert phydev advertize and supported from u32 to link mode

There are a few MAC/PHYs combinations which now support > 1Gbps. These
may need to make use of link modes with bits > 31. Thus their
supported PHY features or advertised features cannot be implemented
using the current bitmap in a u32. Convert to using a linkmode bitmap,
which can support all the currently devices link modes, and is future
proof as more modes are added.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: remove states PHY_STARTING and PHY_PENDING

Both states aren't used. Most likely they result from an idea that
never materialized. So remove them.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

documentation of some IP/ICMP snmp counters

The snmp_counter.rst explains the meanings of snmp counters. It also
provides a set of experiments (only 1 for this initial patch),
combines the experiments' resutls and the snmp counters'
meanings. This is an initial path, only explains a part of IP/ICMP
counters and provide a simple ping test.

Signed-off-by: yupeng <yupeng0921@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: improve broadcast retransmission algorithm

Currently, the broadcast retransmission algorithm is using the
'prev_retr' field in struct tipc_link to time stamp the latest broadcast
retransmission occasion. This helps to restrict retransmission of
individual broadcast packets to max once per 10 milliseconds, even
though all other criteria for retransmission are met.

We now move this time stamp to the control block of each individual
packet, and remove other limiting criteria. This simplifies the
retransmission algorithm, and eliminates any risk of logical errors
in selecting which packets can be retransmitted.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: LUU Duc Canh <canh.d.luu@dektech.com.au>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-sched-indirect-tc-block-cb-registration'

Jakub Kicinski says:

====================
net: sched: indirect tc block cb registration

John says:

This patchset introduces an alternative to egdev offload by allowing a
driver to register for block updates when an external device (e.g. tunnel
netdev) is bound to a TC block. Drivers can track new netdevs or register
to existing ones to receive information on such events. Based on this,
they may register for block offload rules using already existing
functions.

The patchset also implements this new indirect block registration in the
NFP driver to allow the offloading of tunnel rules. The use of egdev
offload (which is currently only used for tunnel offload) is subsequently
removed.

RFC v2 -> PATCH
- removed embedded tracking function from indir block register (now up to
   driver to clean up after itself)
- refactored NFP code due to recent submissions
- removed priv list clean function in NFP (list should be cleared by
   indirect block unregisters)

RFC v1->v2:
- free allocated owner struct in block_owner_clean function
- add geneve type helper function
- move test stub in NFP (v1 patch 2) to full tunnel offload
   implementation via indirect blocks (v2 patches 3-8)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: flower: remove unnecessary code in flow lookup

Recent changes to NFP mean that stats updates from fw to driver no longer
require a flow lookup and (because egdev offload has been removed) the
ingress netdev for a lookup is now always known.

Remove obsolete code in a flow lookup that matches on host context and
that allows for a netdev to be NULL.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: flower: remove TC egdev offloads

Previously, only tunnel decap rules required egdev registration for
offload in NFP. These are now supported via indirect TC block callbacks.

Remove the egdev code from NFP.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: flower: offload tunnel decap rules via indirect TC blocks

Previously, TC block tunnel decap rules were only offloaded when a
callback was triggered through registration of the rules egress device.
This meant that the driver had no access to the ingress netdev and so
could not verify it was the same tunnel type that the rule implied.

Register tunnel devices for indirect TC block offloads in NFP, giving
access to new rules based on the ingress device rather than egress. Use
this to verify the netdev type of VXLAN and Geneve based rules and offload
the rules to HW if applicable.

Tunnel registration is done via a netdev notifier. On notifier
registration, this is triggered for already existing netdevs. This means
that NFP can register for offloads from devices that exist before it is
loaded (filter rules will be replayed from the TC core). Similarly, on
notifier unregister, a call is triggered for each currently active netdev.
This allows the driver to unregister any indirect block callbacks that may
still be active.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: flower: increase scope of netdev checking functions

Both the actions and tunnel_conf files contain local functions that check
the type of an input netdev. In preparation for re-use with tunnel offload
via indirect blocks, move these to static inline functions in a header
file.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: flower: allow non repr netdev offload

Previously the offload functions in NFP assumed that the ingress (or
egress) netdev passed to them was an nfp repr.

Modify the driver to permit the passing of non repr netdevs as the ingress
device for an offload rule candidate. This may include devices such as
tunnels. The driver should then base its offload decision on a combination
of ingress device and egress port for a rule.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: register callbacks for indirect tc block binds

Currently drivers can register to receive TC block bind/unbind callbacks
by implementing the setup_tc ndo in any of their given netdevs. However,
drivers may also be interested in binds to higher level devices (e.g.
tunnel drivers) to potentially offload filters applied to them.

Introduce indirect block devs which allows drivers to register callbacks
for block binds on other devices. The callback is triggered when the
device is bound to a block, allowing the driver to register for rules
applied to that block using already available functions.

Freeing an indirect block callback will trigger an unbind event (if
necessary) to direct the driver to remove any offloaded rules and unreg
any block rule callbacks. It is the responsibility of the implementing
driver to clean any registered indirect block callbacks before exiting,
if the block it still active at such a time.

Allow registering an indirect block dev callback for a device that is
already bound to a block. In this case (if it is an ingress block),
register and also trigger the callback meaning that any already installed
rules can be replayed to the calling driver.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'PHYID-matching-macros'

Heiner Kallweit says:

====================
net: phy: add macros for PHYID matching in PHY driver config

Add macros for PHYID matching to be used in PHY driver configs.
By using these macros some boilerplate code can be avoided.

Use them initially in the Realtek PHY drivers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: realtek: use new PHYID matching macros

Use new macros for PHYID matching to avoid boilerplate code.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: add macros for PHYID matching

Add macros for PHYID matching to be used in PHY driver configs.
By using these macros some boilerplate code can be avoided.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'phylib-simplifications'

Heiner Kallweit says:

====================
net: phy: further phylib simplifications after recent changes to the state machine

After the recent changes to the state machine phylib can be further
simplified (w/o having to make any assumptions).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: improve and inline phy_change

Now that phy_mac_interrupt() doesn't call phy_change() any longer it's
called from phy_interrupt() only. Therefore phy_interrupt_is_valid()
returns true always and the check can be removed.
In case of PHY_HALTED phy_interrupt() bails out immediately,
therefore the second check for PHY_HALTED including the call to
phy_disable_interrupts() can be removed.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: simplify phy_mac_interrupt and related functions

When using phy_mac_interrupt() the irq number is set to
PHY_IGNORE_INTERRUPT, therefore phy_interrupt_is_valid() returns false.
As a result phy_change() effectively just calls phy_trigger_machine()
when called from phy_mac_interrupt() via phy_change_work(). So we can
call phy_trigger_machine() from phy_mac_interrupt() directly and
remove some now unneeded code.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: don't set state PHY_CHANGELINK in phy_change

State PHY_CHANGELINK isn't needed here, we can call the state machine
directly. We just have to remove the check for phy_polling_mode() to
make this work also in interrupt mode. Removing this check doesn't
cause any overhead because when not polling the state machine is
called only if required by some event.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'remove-PHY_HAS_INTERRUPT'

Heiner Kallweit says:

====================
net: phy: replace PHY_HAS_INTERRUPT with a check for config_intr and ack_interrupt

Flag PHY_HAS_INTERRUPT is used only here for this small check. I think
using interrupts isn't possible if a driver defines neither
config_intr nor ack_interrupts callback. So we can replace checking
flag PHY_HAS_INTERRUPT with checking for these callbacks.
This allows to remove this flag from all driver configs.

v2:
- add helper for check in patch 1
- remove PHY_HAS_INTERRUPT from all drivers, not only Realtek
- remove flag PHY_HAS_INTERRUPT completely

v3:
- rebase patch 2
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: remove flag PHY_HAS_INTERRUPT from driver configs

Now that flag PHY_HAS_INTERRUPT has been replaced with a check for
callbacks config_intr and ack_interrupt, we can remove setting this
flag from all driver configs.
Last but not least remove flag PHY_HAS_INTERRUPT completely.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: replace PHY_HAS_INTERRUPT with a check for config_intr and ack_interrupt

Flag PHY_HAS_INTERRUPT is used only here for this small check. I think
using interrupts isn't possible if a driver defines neither
config_intr nor ack_interrupts callback. So we can replace checking
flag PHY_HAS_INTERRUPT with checking for these callbacks.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: Fix SKB list traversal in sctp_intl_store_ordered().

Same change as made to sctp_intl_store_reasm().

To be fully correct, an iterator has an undefined value when something
like skb_queue_walk() naturally terminates.

This will actually matter when SKB queues are converted over to
list_head.

Formalize what this code ends up doing with the current
implementation.

Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: Fix SKB list traversal in sctp_intl_store_reasm().

To be fully correct, an iterator has an undefined value when something
like skb_queue_walk() naturally terminates.

This will actually matter when SKB queues are converted over to
list_head.

Formalize what this code ends up doing with the current
implementation.

Signed-off-by: David S. Miller <davem@davemloft.net>

iucv: Remove SKB list assumptions.

Eliminate the assumption that SKBs and SKB list heads can
be cast to eachother in SKB list handling code.

This change also appears to fix a bug since the list->next pointer is
sampled outside of holding the SKB queue lock.

Signed-off-by: David S. Miller <davem@davemloft.net>

brcmfmac: Use standard SKB list accessors in brcmf_sdiod_sglist_rw.

Instead of direct SKB list pointer accesses.

The loops in this function had to be rewritten to accommodate this
more easily.

The first loop iterates now over the target list in the outer loop,
and triggers an mmc data operation when the per-operation limits are
hit.

Then after the loops, if we have any residue, we trigger the last
and final operation.

For the page aligned workaround, where we have to copy the read data
back into the original list of SKBs, we use a two-tiered loop.  The
outer loop stays the same and iterates over pktlist, and then we have
an inner loop which uses skb_peek_next().  The break logic has been
simplified because we know that the aggregate length of the SKBs in
the source and destination lists are the same.

This change also ends up fixing a bug, having to do with the
maintainance of the seg_sz variable and how it drove the outermost
loop.  It begins as:

seg_sz = target_list->qlen;

ie. the number of packets in the target_list queue.  The loop
structure was then:

while (seq_sz) {
...
while (not at end of target_list) {
...
sg_cnt++
...
}
...
seg_sz -= sg_cnt;

The assumption built into that last statement is that sg_cnt counts
how many packets from target_list have been fully processed by the
inner loop.  But this not true.

If we hit one of the limits, such as the max segment size or the max
request size, we will break and copy a partial packet then contine
back up to the top of the outermost loop.

With the new loops we don't have this problem as we don't guard the
loop exit with a packet count, but instead use the progression of the
pkt_next SKB through the list to the end.  The general structure is:

sg_cnt = 0;
skb_queue_walk(target_list, pkt_next) {
pkt_offset = 0;
...
sg_cnt++;
...
while (pkt_offset < pkt_next->len) {
pkt_offset += sg_data_size;
if (queued up max per request)
mmc_submit_one();
}
}
if (sg_cnt)
mmc_submit_one();

The variables that maintain where we are in the MMC command state such
as req_sz, sg_cnt, and sgl are reset when we emit one of these full
sized requests.

Signed-off-by: David S. Miller <davem@davemloft.net>

OVS: remove VLAN_TAG_PRESENT - fixup

It turns out I missed one VLAN_TAG_PRESENT in OVS code while rebasing.
This fixes it.

Fixes: 9df46aefafa6 ("OVS: remove use of VLAN_TAG_PRESENT")
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>

infiniband: nes: Fix more direct skb list accesses.

The following:

skb = skb->next;
...
if (skb == (struct sk_buff *)queue)

is transformed into:

skb = skb_peek_next(skb, queue);
...
if (!skb)

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: leds: Don't make our own link speed names

The phy core provides a handy phy_speed_to_str() helper, so use that
instead of doing our own formatting of the different known link speeds.
To do this, increase PHY_LED_TRIGGER_SPEED_SUFFIX_SIZE to 11 so we can fit
'Unsupported' if necessary.

Signed-off-by: Kyle Roeschley <kyle.roeschley@ni.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: improve struct phy_device member interrupts handling

As a heritage from the very early days of phylib member interrupts is
defined as u32 even though it's just a flag whether interrupts are
enabled. So we can change it to a bitfield member. In addition change
the code dealing with this member in a way that it's clear we're
dealing with a bool value.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dpaa2-eth-defer-probe-on-object-allocate'

Ioana Ciornei says:

====================
dpaa2-eth: defer probe on object allocate

Allocatable objects on the fsl-mc bus may be probed by the fsl_mc_allocator
after the first attempts of other drivers to use them. Defer the probe when
this situation happens.

Changes in v2:
- proper handling of IS_ERR_OR_NULL
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

dpaa2-ptp: defer probe when portal allocation failed

The fsl_mc_portal_allocate can fail when the requested MC portals are
not yet probed by the fsl_mc_allocator. In this situation, the driver
should defer the probe.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dpaa2-eth: defer probe on object allocate

The fsl_mc_object_allocate function can fail because not all allocatable
objects are probed by the fsl_mc_allocator at the call time. Defer the
dpaa2-eth probe when this happens.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp6: cleanup stats accounting in recvmsg()

In the udp6 code path, we needed multiple tests to select the correct
mib to be updated. Since we touch at least a counter at each iteration,
it's convenient to use the recently introduced __UDPX_MIB() helper once
and remove some code duplication.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: use the new __netdev_tx_sent_queue() BQL optimisation

__netdev_tx_sent_queue() was added in commit e59020abf0f
("net: bql: add __netdev_tx_sent_queue()") and allows for
better GSO performance.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ptp-more-accurate-PHC-system-clock-synchronization'

Miroslav Lichvar says:

====================
More accurate PHC<->system clock synchronization

RFC->v1:
- added new patches
- separated PHC timestamp from ptp_system_timestamp
- fixed memory leak in PTP_SYS_OFFSET_EXTENDED
- changed PTP_SYS_OFFSET_EXTENDED to work with array of arrays
- fixed PTP_SYS_OFFSET_EXTENDED to break correctly from loop
- fixed timecounter updates in drivers
- split gettimex in igb driver
- fixed ptp_read_* functions to be available without
  CONFIG_PTP_1588_CLOCK

This series enables a more accurate synchronization between PTP hardware
clocks and the system clock.

The first two patches are minor cleanup/bug fixes.

The third patch adds an extended version of the PTP_SYS_OFFSET ioctl,
which returns three timestamps for each measurement. The idea is to
shorten the interval between the system timestamps to contain just the
reading of the lowest register of the PHC in order to reduce the error
in the measured offset and get a smaller upper bound on the maximum
error.

The fourth patch deprecates the original gettime function.

The remaining patches update the gettime function in order to support
the new ioctl in the e1000e, igb, ixgbe, and tg3 drivers.

Tests with few different NICs in different machines show that:
- with an I219 (e1000e) the measured delay was reduced from 2500 to 1300
  ns and the error in the measured offset, when compared to the cross
  timestamping supported by the driver, was reduced by a factor of 5
- with an I210 (igb) the delay was reduced from 5100 to 1700 ns
- with an I350 (igb) the delay was reduced from 2300 to 750 ns
- with an X550 (ixgbe) the delay was reduced from 1950 to 650 ns
- with a BCM5720 (tg3) the delay was reduced from 2400 to 1200 ns
====================

Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tg3: extend PTP gettime function to read system clock

This adds support for the PTP_SYS_OFFSET_EXTENDED ioctl.

Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbe: extend PTP gettime function to read system clock

This adds support for the PTP_SYS_OFFSET_EXTENDED ioctl.

Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Jacob Keller <jacob.e.keller@intel.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

igb: extend PTP gettime function to read system clock

This adds support for the PTP_SYS_OFFSET_EXTENDED ioctl.

Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Jacob Keller <jacob.e.keller@intel.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e1000e: extend PTP gettime function to read system clock

This adds support for the PTP_SYS_OFFSET_EXTENDED ioctl.

Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Jacob Keller <jacob.e.keller@intel.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: deprecate gettime64() in favor of gettimex64()

When a driver provides gettimex64(), use it in the PTP_SYS_OFFSET ioctl
and POSIX clock's gettime() instead of gettime64(). Drivers should
provide only one of the functions.

Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: add PTP_SYS_OFFSET_EXTENDED ioctl

The PTP_SYS_OFFSET ioctl, which can be used to measure the offset
between a PHC and the system clock, includes the total time that the
driver needs to read the PHC timestamp.

This typically involves reading of multiple PCI registers (sometimes in
multiple iterations) and the register that contains the lowest bits of
the timestamp is not read in the middle between the two readings of the
system clock. This asymmetry causes the measured offset to have a
significant error.

Introduce a new ioctl, driver function, and helper functions, which
allow the reading of the lowest register to be isolated from the other
readings in order to reduce the asymmetry. The ioctl returns three
timestamps for each measurement:
- system time right before reading the lowest bits of the PHC timestamp
- PHC time
- system time immediately after reading the lowest bits of the PHC
timestamp

Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Jacob Keller <jacob.e.keller@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: check gettime64 return code in PTP_SYS_OFFSET ioctl

If a gettime64 call fails, return the error and avoid copying data back
to user.

Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: reorder declarations in ptp_ioctl()

Reorder declarations of variables as reversed Christmas tree.

Cc: Richard Cochran <richardcochran@gmail.com>
Suggested-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'hns3-add-code-optimization-for-VF-reset-and-some-new-reset-feature'

Huazhong Tan says:

====================
hns3: add code optimization for VF reset and some new reset feature

Currently hardware supports below reset:
1. VF reset: triggered by sending cmd to IMP(Integrated Management
   Processor). Only reset specific VF function and do not affect
   other PF or VF.
2. PF reset: triggered by sending cmd to IMP. Only reset specific PF
   and it's VF.
3. PF FLR: triggered by PCIe subsystem. Only reset specific PF and
   it's VF.
4. VF FLR: triggered by PCIe subsystem. Only reset specific VF function
   and do not affect other PF or VF.
5. Core reset: triggered by writing to register. Reset most hardware
   unit, such as SSU, which affects all the PF and VF.
6. Global reset: triggered by writing to register. Reset all hardware
   unit, which affects all the PF and VF.
7. IMP reset: triggered by IMU(Intelligent Management Unit) when
   IMP is not longer feeding IMU's watchdog. IMU will reload the IMP
   firmware and IMP will perform global reset after firmware reloading,
   which affects all the PF and VF.

Current driver only support PF/VF reset, incomplete core and global
reset(lacking the vf reset handling). So this patchset adds complete
reset support in hns3 driver.

Also, this patchset contains some optimization related to reset.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: add PCIe FLR support for VF

This patch implements the .reset_prepare and .reset_done
ops from pci framework to support the VF FLR.

This patch uses hclgevf_set_def_reset_request() and
hclgevf_reset_event() to handle FLR, so when
hdev->default_reset_request is non zero, it means there is
some reset requseted by hclgevf_set_def_reset_request() need
to be processed. Also get the hdev from the ae_dev because
hclgevf_reset_event is called with handle being NULL.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: do VF's pci re-initialization while PF doing FLR

While doing PF FLR, VF's PCIe configuration space will be cleared, so
the pci and vector of VF should be re-initialized in the VF's reset
process while PF doing FLR.

Also, this patch fixes some memory not freed problem when pci
re-initialization is done during reset process.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: add PCIe FLR support for PF

This patch implements the .reset_prepare and .reset_done
ops from pci framework to support the PF FLR.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: implement the IMP reset processing for PF

The current code only print the prompt message after receiving
the IMP reset interrupt and does not perform the corresponding driver
reset operation. This patch implements the missing IMP reset handling
in the driver.
1. The driver sets the HCLGE_STATE_CMD_DISABLE to stop sending command
   after receiving the IMP reset interrupt.
2. The driver needs to notify the hardware to reload the IMP firmware.
3. The IMP firmware reloading makes the reset time of hardware longer,
   so it is necessary to extend the driver's waiting time to wait for
   the hardware reset to complete.
4. In hclge_check_event_cause, IMP reset event should have higher
   priority than other events.

Also, after clearing HCLGE_STATE_CMD_DISABLE in the hclge_cmd_init(),
it needs to check whether there is a pending reset, if so, just set
the HCLGE_STATE_CMD_DISABLE back and return.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: stop napi polling when HNS3_NIC_STATE_DOWN is set

When calling napi_disable during reset down process, if NAPIF_STATE_MISSED
is set, napi_complete will call __napi_schedule to do the polling again.
So this patch uses HNS3_NIC_STATE_DOWN to ensure the polling is not
scheduled again.

Also, when napi_complete returns true, it means polling is scheduled
again, it is not neccssary to enable the interrupt.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: add error handler for hclgevf_reset()

Since hclgevf_reset() may fail for some reasons, so it needs an error
handler to deal with it. When VF reset failed, VF can only be restored
by a higher level reset asserted by PF. So, it needs to reinitialize
its command queue, then it can respond to higher level reset.

Also, this patch adds error logging in the hclgevf_notify_client().

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: stop handling command queue while resetting VF

According to hardware's description, after the reset occurs, the driver
needs to re-initialize the command queue before sending and receiving
any commands. Therefore, the VF's driver needs to identify the command
queue needs to re-initialize with HCLGEVF_STATE_CMD_DISABLE, and does
not allow sending or receiving commands before the re-initialization.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: add reset handling for VF when doing Core/Global/IMP reset

When a Core/Global/IMP reset occurs, the hardware sets the reset status
register of all PF/VF and reports a reset interrupt to all PF/VF and
firmware.

When receiving the reset interrupt:
1. The firmware will wait for 100 ms before resetting the hardware and
   clear the reset status register of all PF when hardware reset is done.
2. The PF/VF driver needs to down the netdev within 100 ms and then wait
   for hardware reset to finish.
3. After firmware clearing the reset status register of all PF, the PF
   driver reinitializes the hardware and clear the reset status register
   of it's VF.
4. After PF driver clearing the reset status register of VF, the VF driver
   reinitializes the hardware.

This patch mainly add handling for the step 4.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: add reset handling for VF when doing PF reset

When PF performs a function reset, the hardware will reset both PF
and all the VF belong to this PF. Hence, both PF's driver and VF's
driver need to perform corresponding reset operations.

Before PF driver asserting function reset to hardware, it firstly
set up VF's hardware reset status, and inform the VF driver with
HNAE3_VF_PF_FUNC_RESET, then VF driver sets this reset type to
reset_pending and shechule reset task to stop IO and waits for the
hardware reset status to clear. When PF driver has reinitialized the
hardware and is ready to process mailbox from VF, PF driver clears
VF's hardware reset status for VF to continue its reset process.

Also, this patch uses readl_poll_timeout to simplify the hardware reset
status waitting.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: adjust VF's reset process

Currently when VF need to reset itself, it will send a cmd to PF,
after receiving the VF reset requset, PF sends a cmd to inform
VF to enter the reset process and send a cmd to firmware to do the
actual reset for the VF, it is possible that firmware has resetted
the VF, but VF has not entered the reset process, which may cause
IO not stopped problem when firmware is resetting VF.

This patch fixes it by adjusting the VF reset process, when VF
need to reset itself, it will enter the reset process first, and
it will tell the PF to send cmd to firmware to reset itself.

Add member reset_pending to struct hclgevf_dev, which indicates that
there is reset event need to be processed by the VF's reset task, and
the VF's reset task chooses the highest-level one and clears other
low-level one when it processes reset_pending.

hclge_inform_reset_assert_to_vf function is unused now, but it will
be used to support the PF reset with VF working, so declare it in
the header file.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: add reset_hdev to reinit the hdev in VF's reset process

When doing reset, the reset handling function only need to
reinitialize hardware, it makes sense to add a function to
do that job. Also the error handling of hclgevf_init_hdev is
different when it is used in reset process.

This patch adds reset_hdev to reinitialize hardware when resetting.
Also, this patch removes the hclgevf_dev_ongoing_full_reset because
it is unused now.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4vf: free mac_hlist properly

The locally maintained list for tracking hash mac table was
not freed during driver remove.

Signed-off-by: Arjun Vynipadath <arjun@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4vf: fix memleak in mac_hlist initialization

mac_hlist was initialized during adapter_up, which will be called
every time a vf device is first brought up, or every time when device
is brought up again after bringing all devices down. This means our
state of previous list is lost, causing a memleak if entries are
present in the list. To fix that, move list init to the condition
that performs initial one time adapter setup.

Signed-off-by: Arjun Vynipadath <arjun@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: free mac_hlist properly

The locally maintained list for tracking hash mac table was
not freed during driver remove.

Signed-off-by: Arjun Vynipadath <arjun@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: tcp: remove BUG_ON from tcp_v4_err

if skb is NULL pointer, and the following access of skb's
skb_mstamp_ns will trigger panic, which is same as BUG_ON

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

xen/netfront: remove unnecessary wmb

RING_PUSH_REQUESTS_AND_CHECK_NOTIFY is already able to make sure backend sees
requests before req_prod is updated.

Signed-off-by: Jacob Wen <jian.w.wen@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'nfp-abm-move-code-and-improve-parameter-validation'

Jakub Kicinski says:

====================
nfp: abm: move code and improve parameter validation

This set starts by separating Qdisc handling code into a new file.
Next two patches allow early access to TLV-based capabilities during
probe, previously the capabilities were parsed just before netdevs
were registered, but its cleaner to do some basic validation earlier
and avoid cleanup work.

Next three patches improve RED's parameter validation.  First we provide
a more precise message about why offload failed (and move the parameter
validation to a helper).  Next we make sure we don't set the top bit
in the 32 bit max RED threshold value.  Because FW is treating the value
as signed it reportedly causes slow downs (unnecessary queuing and
marking) when top bit is set with recent firmwares.  Last (and perhaps
least importantly) we offload the harddrop parameter of the Qdisc.
We don't plan to offload harddrop RED, but it seems prudent to make
sure user didn't set that flag as device behaviour would have differed.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: abm: refuse RED offload with harddrop set

RED Qdisc will now inform the drivers about the state of the harddrop
flag. Refuse to offload in case harddrop is set.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: red: inform offloads about harddrop setting

To mirror software behaviour on offload more precisely inform
the drivers about the state of the harddrop flag.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: abm: don't set negative threshold

Turns out the threshold value is used in signed compares in the FW,
so we should avoid setting the top bit.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: abm: provide more precise info about offload parameter validation

Improve log messages printed when RED can't be offloaded because
of Qdisc parameters.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: parse vNIC TLV capabilities at alloc time

In certain cases initialization logic which follows allocation of
the vNIC structure may want to validate the capabilities of that vNIC.
This is easy before vNIC is initialized for normal capabilities which
are at fixed offsets in control memory, easy to locate and read, but
poses a challenge if the capabilities are in form of TLVs. Parse
the TLVs early on so other code can just access parsed info, instead
of having to do the parsing by itself.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: pass ctrl_bar pointer to nfp_net_alloc

Move setting ctrl_bar pointer to the nfp_net_alloc function,
to make sure we can parse capabilities early in the following
patch.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: abm: split qdisc offload code into a separate file

The Qdisc offload code is logically separate, and we will soon
do significant surgery on it to support more Qdiscs, so move
it to a separate file.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp_bbr: update comments to reflect pacing_margin_percent

Recently, in commit ab408b6dc744 ("tcp: switch tcp and sch_fq to new
earliest departure time model"), the TCP BBR code switched to a new
approach of using an explicit bbr_pacing_margin_percent for shaving a
pacing rate "haircut", rather than the previous implict
approach. Update an old comment to reflect the new approach.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-Use-__vlan_hwaccel_-helpers'

Michał Mirosław says:

====================
net: Use __vlan_hwaccel_*() helpers

This series removes from networking core and driver code an assumption
about how VLAN tag presence is stored in an skb. This will allow to free
up overloading of VLAN.CFI bit to incidate tag's presence.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

sky2: use __vlan_hwaccel helpers

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlx4: use __vlan_hwaccel helpers

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>

benet: use __vlan_hwaccel helpers

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4/tunnel: use __vlan_hwaccel helpers

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: use __vlan_hwaccel helpers

This removes assumption than vlan_tci != 0 when tag is present.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>

8021q: use __vlan_hwaccel helpers

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfnetlink/queue: use __vlan_hwaccel helpers

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/core: use __vlan_hwaccel helpers

This removes assumptions about VLAN_TAG_PRESENT bit.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: use __vlan_hwaccel helpers

Use __vlan_hwaccel_put_tag() to set vlan tag and proto fields.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: move __skb_checksum_complete*() to skbuff.c

__skb_checksum_complete_head() and __skb_checksum_complete()
are both declared in skbuff.h, they fit better in skbuff.c
than datagram.c.

Cc: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-ethernet-ti-cpsw-fix-vlan-mcast'

Ivan Khoronzhuk says:

====================
net: ethernet: ti: cpsw: fix vlan mcast

The cpsw holds separate mcast entires for vlan entries. At this moment
driver adds only not vlan mcast addresses, omitting vlan/mcast entries.
As result mcast for vlans doesn't work. It can be fixed by adding same
mcast entries for every created vlan, but this patchseries uses more
sophisticated way and allows to create mcast entries only for vlans
that really require it. Generic functions from this series can be
reused for fixing vlan and macvlan unicast.

Simple example of ALE table before and after this series, having same
mcast entries as for vlan 100 as for real device (reserved vlan 2),
and one mcast address only for vlan 100 - 01:1b:19:00:00:00.

<---- Before this patchset ---->
vlan , vid = 2, untag_force = 0x5, reg_mcast = 0x5, mem_list = 0x5
mcast, vid = 2, addr = ff:ff:ff:ff:ff:ff, port_mask = 0x1
ucast, vid = 2, addr = 74:da:ea:47:7d:9d, persistant, port_num = 0x0
vlan , vid = 0, untag_force = 0x7, reg_mcast = 0x0, mem_list = 0x7
mcast, vid = 2, addr = 33:33:00:00:00:01, port_mask = 0x1
mcast, vid = 2, addr = 01:00:5e:00:00:01, port_mask = 0x1
vlan , vid = 1, untag_force = 0x3, reg_mcast = 0x3, mem_list = 0x3
mcast, vid = 1, addr = ff:ff:ff:ff:ff:ff, port_mask = 0x1
ucast, vid = 1, addr = 74:da:ea:47:7d:9c, persistant, port_num = 0x0
mcast, vid = 1, addr = 33:33:00:00:00:01, port_mask = 0x1
mcast, vid = 1, addr = 01:00:5e:00:00:01, port_mask = 0x1
mcast, vid = 2, addr = 01:80:c2:00:00:00, port_mask = 0x1
mcast, vid = 2, addr = 01:80:c2:00:00:03, port_mask = 0x1
mcast, vid = 2, addr = 01:80:c2:00:00:0e, port_mask = 0x1
mcast, vid = 1, addr = 01:80:c2:00:00:00, port_mask = 0x1
mcast, vid = 1, addr = 01:80:c2:00:00:03, port_mask = 0x1
mcast, vid = 1, addr = 01:80:c2:00:00:0e, port_mask = 0x1
mcast, vid = 2, addr = 33:33:ff:47:7d:9d, port_mask = 0x1
mcast, vid = 2, addr = 33:33:00:00:00:fb, port_mask = 0x1
mcast, vid = 2, addr = 33:33:00:01:00:03, port_mask = 0x1
mcast, vid = 1, addr = 33:33:ff:47:7d:9c, port_mask = 0x1
mcast, vid = 1, addr = 33:33:00:00:00:fb, port_mask = 0x1
mcast, vid = 1, addr = 33:33:00:01:00:03, port_mask = 0x1
mcast, vid = 1, addr = 01:00:5e:00:00:fb, port_mask = 0x1
mcast, vid = 1, addr = 01:00:5e:00:00:fc, port_mask = 0x1
vlan , vid = 100, untag_force = 0x0, reg_mcast = 0x5, mem_list = 0x5
ucast, vid = 100, addr = 74:da:ea:47:7d:9d, persistant, port_num = 0x0
mcast, vid = 100, addr = ff:ff:ff:ff:ff:ff, port_mask = 0x1
mcast, vid = 2, addr = 01:1b:19:00:00:00, port_mask = 0x1
^^^
Here mcast entry (ptpl2), has to be added only for vlan 100
but added for reserved vlan 2...that's not enough.

<---- After this patchset ---->
vlan , vid = 2, untag_force = 0x5, reg_mcast = 0x5, mem_list = 0x5
mcast, vid = 2, addr = ff:ff:ff:ff:ff:ff, port_mask = 0x1
ucast, vid = 2, addr = 74:da:ea:47:7d:9d, persistant, port_num = 0x0
vlan , vid = 0, untag_force = 0x7, reg_mcast = 0x0, mem_list = 0x7
mcast, vid = 2, addr = 33:33:00:00:00:01, port_mask = 0x1
mcast, vid = 2, addr = 01:00:5e:00:00:01, port_mask = 0x1
vlan , vid = 1, untag_force = 0x3, reg_mcast = 0x3, mem_list = 0x3
mcast, vid = 1, addr = ff:ff:ff:ff:ff:ff, port_mask = 0x1
ucast, vid = 1, addr = 74:da:ea:47:7d:9c, persistant, port_num = 0x0
mcast, vid = 1, addr = 33:33:00:00:00:01, port_mask = 0x1
mcast, vid = 1, addr = 01:00:5e:00:00:01, port_mask = 0x1
mcast, vid = 2, addr = 01:80:c2:00:00:00, port_mask = 0x1
mcast, vid = 2, addr = 01:80:c2:00:00:03, port_mask = 0x1
mcast, vid = 2, addr = 01:80:c2:00:00:0e, port_mask = 0x1
mcast, vid = 1, addr = 01:80:c2:00:00:00, port_mask = 0x1
mcast, vid = 1, addr = 01:80:c2:00:00:03, port_mask = 0x1
mcast, vid = 1, addr = 01:80:c2:00:00:0e, port_mask = 0x1
mcast, vid = 2, addr = 33:33:ff:47:7d:9d, port_mask = 0x1
mcast, vid = 1, addr = 33:33:ff:47:7d:9c, port_mask = 0x1
mcast, vid = 2, addr = 33:33:00:00:00:fb, port_mask = 0x1
mcast, vid = 2, addr = 33:33:00:01:00:03, port_mask = 0x1
mcast, vid = 1, addr = 33:33:00:00:00:fb, port_mask = 0x1
mcast, vid = 1, addr = 33:33:00:01:00:03, port_mask = 0x1
vlan , vid = 100, untag_force = 0x0, reg_mcast = 0x5, mem_list = 0x5
ucast, vid = 100, addr = 74:da:ea:47:7d:9d, persistant, port_num = 0x0
mcast, vid = 100, addr = ff:ff:ff:ff:ff:ff, port_mask = 0x1
mcast, vid = 100, addr = 33:33:00:00:00:01, port_mask = 0x1
mcast, vid = 100, addr = 01:00:5e:00:00:01, port_mask = 0x1
mcast, vid = 100, addr = 33:33:ff:47:7d:9d, port_mask = 0x1
mcast, vid = 100, addr = 01:80:c2:00:00:00, port_mask = 0x1
mcast, vid = 100, addr = 01:80:c2:00:00:03, port_mask = 0x1
mcast, vid = 100, addr = 01:80:c2:00:00:0e, port_mask = 0x1
mcast, vid = 100, addr = 33:33:00:00:00:fb, port_mask = 0x1
mcast, vid = 100, addr = 33:33:00:01:00:03, port_mask = 0x1
mcast, vid = 100, addr = 01:1b:19:00:00:00, port_mask = 0x1
^^^
    Here mcast entry (ptpl2), is added only for vlan 100
    as it should be.

Based on net-next/master

v2..v1:
  net: ethernet: ti: cpsw: fix vlan mcast
- removed limit for legacy switch cpsw mode
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: cpsw: fix vlan configuration while down/up

The vlan configuration is not restored after interface donw/up sequence
(if dual-emac - both interfaces). Tested on am572x EVM.

Steps to check:
~# ip link add link eth1 name eth1.100 type vlan id 100
~# ifconfig eth0 down
~# ifconfig eth1 down

Try to remove vid and observe warning:
~# ip link del eth1.100
[ 739.526757] net eth1: removing vlanid 100 from vlan filter
[ 739.533322] failed to kill vid 0081/100 for device eth1

This patch fixes it, restoring only vlan ALE entries and all other
unicast/multicast entries are restored by system calling rx_mode ndo.

Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: cpsw: fix vlan mcast

At this moment, mcast addresses are added for real device only
(reserved vlans for dual-emac mode), even if a mcast address was added
for some vlan only, thus ALE doesn't have corresponding vlan mcast
entries after vlan socket joined multicast group. So ALE drops vlan
frames with mcast addresses intended for vlans and potentially can
receive mcast frames for base ndev. That's not correct. So, fix it by
creating only vlan/mcast entries as requested. Patch doesn't use any
additional lists and is based on device mc address list and cpsw ALE
table entries.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: 8021q: vlan_core: allow use list of vlans for real device

It's redundancy for the drivers to hold the list of vlans when
absolutely the same list exists in vlan core. In most cases it's
needed only to traverse the vlan devices, their vids and sync some
settings with h/w, so add API to simplify this.

At least some of these drivers also can benefit:
grep "for_each.*vid" -r drivers/net/ethernet/

drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:
drivers/net/ethernet/synopsys/dwc-xlgmac-hw.c:
drivers/net/ethernet/qlogic/qlge/qlge_main.c:
drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c:
drivers/net/ethernet/via/via-rhine.c:
drivers/net/ethernet/via/via-velocity.c:
drivers/net/ethernet/intel/igb/igb_main.c:
drivers/net/ethernet/intel/ice/ice_main.c:
drivers/net/ethernet/intel/e1000/e1000_main.c:
drivers/net/ethernet/intel/i40e/i40e_main.c:
drivers/net/ethernet/intel/e1000e/netdev.c:
drivers/net/ethernet/intel/igbvf/netdev.c:
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:
drivers/net/ethernet/intel/ixgb/ixgb_main.c:
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:
drivers/net/ethernet/amd/xgbe/xgbe-dev.c:
drivers/net/ethernet/emulex/benet/be_main.c:
drivers/net/ethernet/neterion/vxge/vxge-main.c:
drivers/net/ethernet/adaptec/starfire.c:
drivers/net/ethernet/brocade/bna/bnad.c:

Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: core: dev_addr_lists: add auxiliary func to handle reference address updates

In order to avoid all table update, and only remove or add new
address, the auxiliary function exists, named __hw_addr_sync_dev().
It allows end driver do nothing when nothing changed and add/rm when
concrete address is firstly added or lastly removed. But it doesn't
include cases when an address of real device or vlan was reused by
other vlans or vlan/macval devices.

For handaling events when address was reused/unreused the patch adds
new auxiliary routine - __hw_addr_ref_sync_dev(). It allows to do
nothing when nothing was changed and do updates only for an address
being added/reused/deleted/unreused. Thus, clone address changes for
vlans can be mirrored in the table. The function is exclusive with
__hw_addr_sync_dev(). It's responsibility of the end driver to
identify address vlan device, if it needs so.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: use the new __netdev_tx_sent_queue BQL optimisation

As added in 3e59020abf0f ("net: bql: add __netdev_tx_sent_queue()"), which
see for performance rationale.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-Remove-VLAN_TAG_PRESENT-from-drivers'

Michał Mirosław says:

====================
net: Remove VLAN_TAG_PRESENT from drivers

This series removes VLAN_TAG_PRESENT use from network drivers in
preparation to removing its special meaning.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>