> ## Documentation Index
> Fetch the complete documentation index at: https://site.aspect.build/llms.txt
> Use this file to discover all available pages before exploring further.

# Workflows alerts

> Reference for alerts the Aspect Workflows system raises, with the underlying concepts and triage steps for Bazel CI runner operators.

This document systematically outlines the technical concepts and theoretical basis for the alert types generated by the Workflows system, along with the initial steps for triage and debugging. These concepts are foundational for understanding the operational health and behavior of the Workflows CI runners and associated infrastructure.

## Understanding workflows alerts

The Workflows system triggers alerts when critical operational thresholds experience breaches or when essential components fail their health checks. These alerts provide immediate notification of a deviation from expected behavior, allowing operators to diagnose and triage the underlying cause.

The Workflows alerts cover key aspects of the system, including:

* Storage resource capacity.
* Runner initialization and boot sequence.
* Core app process health.
* Scaling mechanism integrity.
* External resource performance.

## Key alert categories and associated concepts

The system categorizes Workflows alerts based on the component or process that fails, which dictates the scope of the required investigation.

## Runner initialization alerts

These alerts relate to the fundamental processes required for a runner to become operational. The primary log group for investigating these is generally `/aw/runner/cloud-init/output`.

* **Bootstrap failure**
  The Bootstrap Failure alert indicates that a significant number of runners have failed the initial boot sequence and are repeatedly cycling due to failing health checks.
  Concept: At this stage, the system is unable to bring any new runners online, which directly impacts scaling and job execution capacity.
  Triage Concept: Determine the cause by searching for the string BOOTSTRAP ERROR within the designated log group.

* **Warming restore error**
  The system triggers the Warming Restore Error when the process of restoring the cache fails on multiple runners, even if the runner successfully completes the bootstrap phase.
  Concept: This alert may indicate cache corruption in the primary or secondary warming caches, or a transient failure during the retrieval of cache archives from storage. A runner affected by this is either running cold or has warmed from the previous cache pointer.
  Triage Concept: Diagnosis involves searching for WARMING ERROR in the log group and verifying the successful execution and upload of the warming archive creation job.

### Runner process alerts

These alerts signify an issue with the core app processes running on an already-bootstrapped runner.

* **CI agent error**
  The system triggers this alert when the CI agent process within the runner encounters an error.
  Triage Concept: Determining the cause requires searching the `/aw/runner/ci-agent` log group on Amazon Web Services using CloudWatch log filtering.

* **Workflows CI runner error**
  This alert occurs when the main runner process logs an error at a level of CRITICAL or higher.
  Triage Concept: You can determine the specific cause by filtering the `/aw/runner/workflows` log group on AWS for the corresponding high log level. The higher log levels in this context are EMERGENCY and ALERT.

## Storage and resource alerts

These concepts relate to the size and state of the storage buckets used by Workflows.

* **Warming bucket size alert**
  This alert triggers when the size of the warming bucket exceeds a 500 GB threshold.
  Concept: The warming bucket stores objects that are frequently accessed or recently added to the archive bucket to speed up the cache warming process for runners.
  Lifecycle Policy: A lifecycle policy governs object management and deletes objects in the *archive/* directory that have the tag DeleteMark set to a value of `1`.
  Critical Requirement: Don't enable bucket versioning on the warming bucket. If you find it enabled, turn it off and safely remove all versions of deleted objects.

### Scaling mechanism alerts

These alerts point to issues within the scaling infrastructure, which is responsible for managing runner deployment.

* **Scaling lambda error rate**
  The system triggers this alert when the error rate for the scaling lambda component breaches a defined threshold.
  Triage Concept: Diagnosis requires viewing the Lambda’s logs in the dedicated log group, which has the format `/aws/lambda/aw_<host>_scaling_<state|webhook>__<repo>`.

* **Malicious lambda event**
  The system triggers this specific alert when the scaling lambda receives an event where the checksum key doesn’t match the expected value.
  Concept: This disparity suggests that someone is sending a malicious payload to the lambda endpoint.
  Triage Concept: The lambda logs the string **Potentially malicious event detected** along with additional event details. You can find the corresponding logs in the `/aws/lambda/aw_<host>_scaling_<event|webhook>__<repo>` log group.

### Infrastructure performance alerts

This alert relates to the performance of external load balancing components.

* **Application load balancer response time (Amazon Web Services only)**
  This alert triggers when the **Application Load Balancer (ALB)** records **high response times** from the underlying cluster resources.
  Concept: High ALB response times are frequently correlated with issues in the **remote cache**, such as excessive churn or other process-consuming activity within the cluster.
  Triage Concept: Triage begins by checking the service statistics for each service in the remote cluster to isolate the source of the latency. Additionally, validating the symlink integrity of the various cache drive locations, for example `/dev/cas`, is necessary to ensure that the underlying storage is accessible by the cache services.
