I have a multi-stage YAML pipeline, and sometimes a stage fails due to transient issues (e.g., network timeouts, flaky tests, etc.). Instead of manually re-running the stage, I want to configure an automatic retry mechanism for failed stages in same build.
Approach Tried:I attempted to implement a monitoring stage that tracks and retries failed stages dynamically. This stage consists of two tasks:
Check the status of each stage using an API call and store the results in a variable.
Rerun failed stages based on the stored statuses.
However, when calling the API to get stage statuses, I encountered the following issue:
"Stage 'Build' is still running or undetermined. Skipping..."
Additionally, I have tried using the YAML property retryCountOnTaskFailure, but it is not a viable solution in our case. Our pipeline Provisions a Kubernetes cluster, Onboards it to Azure Arc and Runs Sonobuoy tests for various extension plugins.
Using retryCountOnTaskFailure causes resource leaks because retries leave many resources undeleted, leading to inconsistencies.
Questions:How can I reliably fetch the status of completed or failed stages in a YAML pipeline?
What is the best way to dynamically rerun only the failed stages while ensuring proper resource cleanup?
Any insights or workarounds would be greatly appreciated!