liquidio: fix Octeon core watchdog timeout false alarm
Detection of watchdog timeout of Octeon cores is flawed and susceptible to
false alarms. Refactor by removing the detection code, and in its place,
leverage existing code that monitors for an indication from the NIC
firmware that an Octeon core crashed; expand the meaning of the indication
to "an Octeon core crashed or its watchdog timer expired". Detection of
watchdog timeout is now delegated to an exception handler in the NIC
firmware; this is free of false alarms.
Also if there's an Octeon core crash or watchdog timeout:
(1) Disable VF Ethernet links.
(2) Decrement the module refcount by an amount equal to the number of
active VFs of the NIC whose Octeon core crashed or had a watchdog
timeout. The refcount will continue to reflect the active VFs of
other liquidio NIC(s) (if present) whose Octeon cores are faultless.
Item (2) is needed to avoid the case of not being able to unload the driver
because the module refcount is stuck at some non-zero number. There is
code that, in normal cases, decrements the refcount upon receiving a
message from the firmware that a VF driver was unloaded. But in
exceptional cases like an Octeon core crash or watchdog timeout, arrival of
that particular message from the firmware might be unreliable. That normal
case code is changed to not touch the refcount in the exceptional case to
avoid contention (over the refcount) with the liquidio_watchdog kernel
thread who will carry out item (2).
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>