Skip to content

Conversation

@hugocasa
Copy link
Collaborator

@hugocasa hugocasa commented Dec 4, 2025

No description provided.

@cloudflare-workers-and-pages
Copy link

cloudflare-workers-and-pages bot commented Dec 4, 2025

Deploying windmill with  Cloudflare Pages  Cloudflare Pages

Latest commit: d432b5a
Status:🚫  Build failed.

View logs

@hugocasa hugocasa changed the title feat: triggers suspended mode [merge ee first] feat: triggers suspended mode Dec 5, 2025
@hugocasa hugocasa marked this pull request as ready for review December 5, 2025 14:38
@claude
Copy link
Contributor

claude bot commented Dec 5, 2025

Claude finished @hugocasa's task —— View job


PR Review: Triggers Suspended Mode ✅

I've completed a comprehensive review of this PR. This is a substantial and well-architected feature that adds a "suspended mode" to triggers, allowing jobs to be queued without automatic execution. Here's my detailed analysis:


🎯 Overview

This PR introduces a third state for triggers beyond enabled/disabled: suspended mode. When a trigger is suspended, it continues accepting payloads and creating jobs, but those jobs are marked with special "unassigned" job kinds (UnassignedScript, UnassignedFlow, UnassignedSinglestepFlow) and won't execute automatically. Users can then review and either resume or cancel these jobs.

Scope: ~5,200 additions, ~2,300 deletions across 100 files
Impact: All trigger types (HTTP, Websocket, Kafka, NATS, MQTT, SQS, GCP, Postgres, Email)


✅ Strengths

1. Clean Database Migration

The migration from boolean enabled to TRIGGER_MODE enum is well-structured:

  • Creates enum type correctly
  • Migrates data safely (CASE WHEN enabled = true THEN 'enabled' ELSE 'disabled')
  • Includes proper down migration for rollback
  • Covers all 9 trigger tables consistently

2. Good Abstraction with TriggerCrud Trait

The TriggerCrud trait provides excellent abstraction:

// backend/windmill-api/src/triggers/handler.rs:209-276
async fn set_trigger_mode(&self, ...) -> Result<bool>
  • Centralizes mode toggling logic
  • Handles both server-state and non-server-state triggers
  • Includes extensibility hook via set_trigger_mode_extra_action

3. Backward Compatibility

The deprecated enabled field handling shows thoughtful backward compatibility:

// backend/windmill-api/src/triggers/mod.rs:126-142
pub fn mode(&self) -> &TriggerMode {
    self.mode.as_ref().unwrap_or(
        #[allow(deprecated)]
        if self.enabled.unwrap_or(true) { &TriggerMode::Enabled }
        else { &TriggerMode::Disabled }
    )
}

4. Smart Job Reassignment Logic

The resume_suspended_trigger_jobs function intelligently handles two scenarios (backend/windmill-api/src/triggers/global_handler.rs:169-255):

  • Jobs created AFTER trigger edit: Delete and recreate with new config
  • Jobs created BEFORE trigger edit: Simple update to unsuspend
    This prevents stale configuration from being executed.

5. Listener Architecture

The listener system correctly fetches both enabled AND suspended triggers:

// backend/windmill-api/src/triggers/listener.rs:81
.and_where("(mode = 'enabled'::TRIGGER_MODE OR mode = 'suspended'::TRIGGER_MODE)")

6. OpenAPI Updated

Good practice - the OpenAPI spec was properly updated with new endpoints and the TriggerMode enum.


🐛 Issues Found

CRITICAL: Potential Data Loss in Down Migration

Location: backend/migrations/20251127104100_add_suspended_mode_to_triggers.down.sql:6

Issue: The down migration converts suspended triggers to disabled, potentially losing important state. See inline comment for details and recommendations.

Impact: If users roll back while triggers are in suspended mode with queued jobs:

  1. Suspended triggers become disabled
  2. Unassigned jobs remain in limbo (no cleanup)
  3. Users lose track of which triggers were suspended

HIGH: Missing Error Context in Transaction

Location: backend/windmill-api/src/triggers/global_handler.rs:228-254

Issue: Multiple DELETE operations in job cleanup lack error context. See inline comment for recommendations.

Impact: If a delete fails midway, debugging will be difficult without proper error messages.


MEDIUM: SQL Injection Defense

Locations:

  • backend/windmill-api/src/triggers/global_handler.rs:51
  • Similar patterns in handler.rs and listener.rs

Issue: Table names constructed from enum values. While currently safe, defense-in-depth would improve security. See inline comment for explicit validation pattern.


MEDIUM: Inconsistent Error Reporting

Location: backend/windmill-api/src/triggers/listener.rs:456-465

Issue: The disable_with_error function has inconsistent error reporting between trigger mode (uses report_critical_error) and capture mode (uses tracing::error!).

Impact: Critical errors in capture mode might not alert operators properly.

Recommendation: Unify error reporting to use report_critical_error for both paths.


LOW: Potential Race Condition in Mode Toggle

Location: backend/windmill-api/src/triggers/handler.rs:213-276

Issue: set_trigger_mode updates trigger mode and then calls set_trigger_mode_extra_action, but there's a window where the mode is updated but extra action hasn't run yet.

Impact: If a listener picks up the trigger between these two operations, it might process with inconsistent state.

Recommendation:

  1. Consider making set_trigger_mode_extra_action run BEFORE the mode update
  2. OR wrap both in a CTE/single query if possible
  3. Document the ordering requirement

LOW: Missing Index Consideration

Location: Database migration

Issue: Queries frequently filter by mode but no indexes added. See inline comment on listener.rs:81 for recommendation.


🎨 Code Quality Observations

Good Practices

  1. ✅ Explicit column selection in queries (avoiding SELECT *)
  2. ✅ Using prepared statements with bind parameters
  3. ✅ Consistent error types via Result<T, Error>
  4. ✅ Comprehensive logging with tracing::
  5. ✅ Feature flags for enterprise/optional features

Minor Improvements

  1. Verbose logging in production: Consider changing tracing::info!("SQL: {}", sql) to debug! level

  2. Magic number (listener.rs:194):

tokio::time::sleep(tokio::time::Duration::from_secs(5)).await;

Extract to constant: const PING_INTERVAL_SECS: u64 = 5;

  1. Duplicate code: The pattern of checking trigger mode appears in multiple places - consider a helper constant

🔒 Security Considerations

  1. Access Control: ✅ Properly checks scopes before mode changes
  2. Job Cancellation: ✅ Uses existing cancel_jobs which includes proper authorization
  3. SQL Injection: ⚠️ Addressed in inline comments - enum-based table names are mostly safe but could be more defensive

🚀 Performance Considerations

  1. Batch Operations: The resume_suspended_trigger_jobs processes jobs one at a time. For workspaces with hundreds of suspended jobs, consider batch updates.

  2. Lock Contention: Line 645-668 in listener.rs uses row-level locks. With many triggers, this could cause contention.


📝 Documentation & Testing

Needs Documentation

  1. Migration guide for users: How to handle existing enabled/disabled triggers
  2. API documentation for new endpoints (beyond OpenAPI)
  3. Behavioral documentation: What happens to suspended jobs when trigger is deleted?

Test Coverage Recommendations:

  • Suspended → Enabled with queued jobs
  • Suspended → Disabled with queued jobs
  • Trigger edited while jobs suspended (reassignment logic)
  • Concurrent mode changes
  • Race conditions in listener startup
  • Down migration with suspended triggers in database

📋 Testing Instructions

To manually test this feature through the app interface:

Navigate to any trigger page (e.g., HTTP Routes under Triggers section). Create or select an existing trigger. In the trigger editor toolbar, you'll see a toggle for enable/disable along with a dropdown menu. Click the dropdown and select "Suspend job execution" to put the trigger in suspended mode. The toggle will change to show three states: Disabled, Suspended, and Enabled.

Once suspended, trigger the endpoint/event that would normally execute the trigger. Go to the Runs page and add a filter for job status. You should see jobs with "Unassigned" status appearing - these are suspended jobs that won't execute automatically. Click on any suspended job to view its details. Notice the job shows as queued but won't start execution.

Back on the trigger page, if there are suspended jobs, you'll see an alert/modal appear when trying to change the trigger mode. The modal allows you to review all suspended jobs and choose to either resume them (which will execute them with the current trigger configuration) or cancel them. You can select individual jobs or batch process all suspended jobs. After resuming, jobs should transition to normal execution. After canceling, jobs should be marked as canceled in the runs page.

Test edge cases: edit the trigger configuration while jobs are suspended, then resume - jobs should use the NEW configuration. Delete a trigger with suspended jobs - verify appropriate cleanup. Switch between suspended and enabled/disabled modes multiple times to ensure state transitions work correctly.


Summary

This is a well-designed and thoughtfully implemented feature with good architecture and backward compatibility. The main concerns are:

  1. Critical: Down migration data loss (recommend fixing before merge)
  2. High: Transaction cleanup could be more robust
  3. Medium: SQL injection defense could be stronger
  4. Medium: Inconsistent error reporting between trigger/capture modes
  5. Low: Minor race conditions and performance optimizations

The code follows Windmill's patterns well, maintains consistency across trigger types, and provides a valuable feature for users who need manual review before job execution.

Recommendation: Fix the critical down migration issue and high-priority items before merge. Medium and low priority items can be addressed in follow-up PRs if needed.

Inline comments added: 5 specific code locations with detailed recommendations


ALTER TABLE email_trigger
ADD COLUMN enabled BOOLEAN;

UPDATE email_trigger SET enabled = CASE WHEN mode = 'enabled'::TRIGGER_MODE THEN true ELSE false END;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical Issue: Potential Data Loss

When rolling back, triggers in suspended mode will be converted to disabled, but any unassigned jobs will remain orphaned in the database. This could lead to:

  1. Jobs stuck in limbo that users can't see or manage
  2. Loss of user intent (which triggers were suspended vs disabled)

Recommendation: Add cleanup logic before the migration:

-- Clean up unassigned jobs before reverting mode column
-- Option A: Cancel all unassigned jobs
DELETE FROM v2_job WHERE kind IN ('unassigned_script', 'unassigned_flow', 'unassigned_singlestepflow');

-- OR Option B: Auto-execute them by converting to regular jobs  
UPDATE v2_job SET kind = CASE 
    WHEN kind = 'unassigned_script' THEN 'script'::JOB_KIND
    WHEN kind = 'unassigned_flow' THEN 'flow'::JOB_KIND
    WHEN kind = 'unassigned_singlestepflow' THEN 'singlescriptflow'::JOB_KIND
    ELSE kind
END WHERE kind IN ('unassigned_script', 'unassigned_flow', 'unassigned_singlestepflow');

Document which option is chosen and why.

_ => {}
}

let table_name = format!("{}_trigger", trigger_kind.to_string());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium Priority: SQL Injection Defense

While trigger_kind is an enum (which provides safety), constructing table names dynamically could become a vulnerability if the enum handling ever changes.

Recommendation: Add explicit validation:

let table_name = match trigger_kind {
    JobTriggerKind::Http => "http_trigger",
    JobTriggerKind::Kafka => "kafka_trigger",
    JobTriggerKind::Websocket => "websocket_trigger",
    JobTriggerKind::Mqtt => "mqtt_trigger",
    JobTriggerKind::Nats => "nats_trigger",
    JobTriggerKind::Postgres => "postgres_trigger",
    JobTriggerKind::Sqs => "sqs_trigger",
    JobTriggerKind::Gcp => "gcp_trigger",
    JobTriggerKind::Email => "email_trigger",
    JobTriggerKind::Webhook | JobTriggerKind::Schedule | _ => {
        return Err(Error::BadRequest(format!(
            "{} triggers do not support this operation",
            trigger_kind
        )));
    }
};

This provides defense-in-depth and makes supported trigger types explicit.

};

// Delete the unassigned job from all related tables
sqlx::query!("DELETE FROM v2_job_queue WHERE id = $1", job.id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High Priority: Improve Error Context

These cascading deletes cleanup orphaned data after job reassignment. While they're in a transaction, consider adding error context for better debugging:

sqlx::query!("DELETE FROM v2_job_queue WHERE id = $1", job.id)
    .execute(&mut *tx)
    .await
    .context("Failed to delete job from queue during reassignment")?;

sqlx::query!("DELETE FROM v2_job_runtime WHERE id = $1", job.id)
    .execute(&mut *tx)
    .await
    .context("Failed to delete job runtime during reassignment")?;
// ... etc

Also consider if cascade deletes via foreign keys could simplify this (delete only from v2_job and let cascade handle the rest).

}

impl BaseTriggerData {
pub fn mode(&self) -> &TriggerMode {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Practice: Backward Compatibility

Nice handling of the deprecated enabled field! This ensures older API clients continue to work during the transition period. Consider documenting when this can be removed (e.g., "Remove in v2.0" or after a certain number of releases).

for job in jobs {
// If job was created before trigger was edited, simply update it to unsuspend
// instead of deleting and repushing
if job.created_at > trigger.edited_at {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great Logic: Smart Job Reassignment

Excellent decision to handle two scenarios differently:

  • Jobs created AFTER trigger edit: Get new config (delete & recreate)
  • Jobs created BEFORE trigger edit: Keep original config (simple update)

This prevents the confusing scenario where a user edits a trigger and then wonders why old suspended jobs are executing with the new config. The timestamp comparison is a smart way to determine user intent.

"(last_server_ping IS NULL OR last_server_ping < now() - interval '15 seconds')",
);
sqlb.fields(&fields)
.and_where("(mode = 'enabled'::TRIGGER_MODE OR mode = 'suspended'::TRIGGER_MODE)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Practice: Listener Includes Suspended Triggers

Correct implementation - listeners need to pick up both enabled and suspended triggers since suspended mode still processes events (just doesn't auto-execute jobs). This ensures the trigger continues to queue jobs even in suspended state.

Minor optimization consideration: If you have many triggers, consider adding an index on mode column for better query performance:

CREATE INDEX CONCURRENTLY idx_trigger_mode ON {table_name}(workspace_id, mode) 
    WHERE mode IN ('enabled', 'suspended');

JobKind::AIAgent => "jobs.run.ai_agent",
JobKind::UnassignedScript => "jobs.run.unassigned_script",
JobKind::UnassignedFlow => "jobs.run.unassigned_flow",
JobKind::UnassignedSinglestepFlow => "jobs.run.unassigned_singlestepflow",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typographical note: The enum variant UnassignedSinglestepFlow appears to be using a non-standard compound word format. Consider renaming it to UnassignedSingleStepFlow (and updating the corresponding mapping to "jobs.run.unassigned_single_step_flow") for improved readability and consistency.

@hugocasa hugocasa changed the title [merge ee first] feat: triggers suspended mode feat: triggers suspended mode Dec 8, 2025
@rubenfiszel rubenfiszel merged commit eb284df into main Dec 8, 2025
8 of 9 checks passed
@rubenfiszel rubenfiszel deleted the hc/triggers-suspended-mode branch December 8, 2025 16:20
@github-actions github-actions bot locked and limited conversation to collaborators Dec 8, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants