habanalabs/gaudi: recover from CPU WD event
author: Oded Gabbay <ogabbay@kernel.org>
Thu, 21 Oct 2021 11:02:40 +0000 (14:02 +0300)
committer: Oded Gabbay <ogabbay@kernel.org>
Sun, 26 Dec 2021 06:59:03 +0000 (08:59 +0200)
In rare cases, the device CPU's watchdog expires. The resulting
watchdog reset causes the CPU to go back to running its preboot f/w.

When that happens, the driver will only know that a heartbeat failure
occurred. As a result, the driver will send a message to the CPU's main
f/w asking it to reset the device, but because the CPU is now running
preboot, it won't respond and the re-initialization process will later
fail when trying to load the f/w.

The solution is to also send the request to the preboot, but only if
the reset was caused by a heartbeat (HB) failure.
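
The decision described above can be sketched in isolation as follows. This
is only an illustrative model, not the driver code: struct mock_device,
enum reset_cause, enum preboot_request and preboot_request_after_gic() are
hypothetical stand-ins invented for this sketch, while the field names
curr_reset_cause and hard_reset_done_by_fw mirror the ones used in the
patch. It assumes the halt-machine request through the GIC has already
been written, and only answers which extra request, if any, goes to the
preboot.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's types; NOT the real habanalabs definitions. */
enum reset_cause { CAUSE_UNKNOWN, CAUSE_HEARTBEAT, CAUSE_OTHER };

/* Which extra request (if any) is sent directly to the preboot. */
enum preboot_request { REQ_NONE, REQ_HARD_RESET, REQ_HALT_MACHINE };

struct mock_device {
	enum reset_cause curr_reset_cause; /* why the current reset was started */
	bool hard_reset_done_by_fw;        /* true if the f/w performs the hard reset itself */
};

/*
 * Models the branch added by the patch: after the request through the GIC,
 * also ask the preboot, but only when the reset was caused by a heartbeat
 * failure, since that is the only hint that a watchdog reset may have occurred.
 */
static enum preboot_request preboot_request_after_gic(const struct mock_device *dev)
{
	if (dev->curr_reset_cause != CAUSE_HEARTBEAT)
		return REQ_NONE;

	return dev->hard_reset_done_by_fw ? REQ_HARD_RESET : REQ_HALT_MACHINE;
}
```

In the sketch, a non-heartbeat reset yields REQ_NONE, which corresponds to
the pre-patch behavior of relying on the GIC request alone.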

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
drivers/misc/habanalabs/gaudi/gaudi.c

index 825737d..d2b7ecb 100644
@@ -1,7 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 
 /*
- * Copyright 2016-2020 HabanaLabs, Ltd.
+ * Copyright 2016-2021 HabanaLabs, Ltd.
  * All Rights Reserved.
  */
 
@@ -4296,6 +4296,24 @@ static void gaudi_hw_fini(struct hl_device *hdev, bool hard_reset, bool fw_reset
 
                WREG32(irq_handler_offset,
                        gaudi_irq_map_table[GAUDI_EVENT_HALT_MACHINE].cpu_id);
+
+               /* This is a hail-mary attempt to revive the card in the small chance that the
+                * f/w has experienced a watchdog event, which caused it to return back to preboot.
+                * In that case, triggering reset through GIC won't help. We need to trigger the
+                * reset as if Linux wasn't loaded.
+                *
+                * We do it only if the reset cause was HB, because that would be the indication
+                * of such an event.
+                *
+                * In case watchdog hasn't expired but we still got HB, then this won't do any
+                * damage.
+                */
+               if (hdev->curr_reset_cause == HL_RESET_CAUSE_HEARTBEAT) {
+                       if (hdev->asic_prop.hard_reset_done_by_fw)
+                               hl_fw_ask_hard_reset_without_linux(hdev);
+                       else
+                               hl_fw_ask_halt_machine_without_linux(hdev);
+               }
        } else {
                if (hdev->asic_prop.hard_reset_done_by_fw)
                        hl_fw_ask_hard_reset_without_linux(hdev);