Building Robust Workflows: Queues, Scheduling, and Long-Running Processes

In the world of modern applications, especially those involving AI agents or complex data processing, tasks often need to run reliably in the background, at specific times, or endure for extended periods without interruption. Imagine sending out millions of personalized emails, generating daily reports, or orchestrating a multi-step AI inference process. How do you ensure these operations complete successfully, even if your server crashes or an external API temporarily fails?

This chapter dives deep into the core mechanisms Trigger.dev provides to build such resilient systems: queues, scheduling, and long-running durable execution. We’ll learn how these concepts work together to create workflows that are not just functional, but truly robust and production-ready. By the end, you’ll be equipped to design and implement background tasks that can handle failures gracefully, execute on a precise timetable, and manage complex, multi-stage operations with ease.

If you’ve followed the previous chapters, you’ve already seen how to define basic jobs. Now, we’ll enhance those jobs with advanced capabilities that are crucial for any serious application.

The Pillars of Robustness: Queues, Scheduling, and Durable Execution

Building systems that reliably perform tasks, especially when those tasks are asynchronous, time-sensitive, or long-lived, requires a solid foundation. Trigger.dev provides this foundation by abstracting away much of the complexity associated with distributed systems. It allows developers to focus on the business logic rather than the intricate details of fault tolerance and state management.

What are Queues and Why Do We Need Them?

Imagine a popular e-commerce site on Black Friday. Thousands of orders are coming in every second. If each order immediately tried to process payment, update inventory, and send confirmation emails synchronously, the system would quickly buckle under the load. This is where queues come in.

📌 Key Idea: A queue acts as a buffer between a task producer and a task consumer, decoupling these components and smoothing out processing spikes.

When you send an event to Trigger.dev, it doesn’t immediately execute your job. Instead, the event (representing a task) is placed into a managed queue. Your Trigger.dev worker then picks up events from this queue and processes them at a controlled pace.

Why Queues Matter for Production Systems:

Decoupling: The part of your application that creates a task (e.g., a user clicking “checkout”) doesn’t need to know how or when the task will be completed. It just sends it to the queue. This improves system architecture by reducing direct dependencies.
Load Leveling: Prevents your backend from being overwhelmed during traffic spikes. Tasks are processed at a manageable rate, preventing system crashes and ensuring consistent performance.
Reliability: If a worker fails while processing a task, the task can often be returned to the queue and retried by another worker, ensuring it eventually completes. This is critical for maintaining data integrity and business continuity.
Scalability: You can easily scale your workers up or down independently of the rate at which tasks are produced. This allows for efficient resource allocation based on current demand.

In Trigger.dev, when you call client.sendEvent(), the event enters a managed queue, ready for your defined job to pick it up. This abstraction simplifies building asynchronous systems significantly.
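To make the buffering idea concrete, here is a minimal in-memory sketch of a producer and a consumer decoupled by a queue. It is not how Trigger.dev implements its managed queue; the class and method names (SimpleQueue, enqueue, processTick) are hypothetical and exist only for illustration.

```typescript
// In-memory sketch of a producer/consumer queue (illustrative only).
type QueuedTask = { id: number; eventName: string };

class SimpleQueue {
  private pending: QueuedTask[] = [];
  private nextId = 1;

  // Producer side: record the task and return immediately.
  enqueue(eventName: string): number {
    const id = this.nextId++;
    this.pending.push({ id, eventName });
    return id;
  }

  // Number of tasks still waiting to be processed.
  get depth(): number {
    return this.pending.length;
  }

  // Consumer side: handle at most `batchSize` tasks per tick, oldest first.
  processTick(batchSize: number, handler: (task: QueuedTask) => void): number {
    const batch = this.pending.splice(0, batchSize);
    batch.forEach(handler);
    return batch.length;
  }
}

const orderQueue = new SimpleQueue();

// A burst of 10 orders arrives at once...
for (let i = 0; i < 10; i++) {
  orderQueue.enqueue("order.created");
}

// ...but the worker only handles 3 per tick, so the spike is smoothed out.
const processedIds: number[] = [];
const handledFirstTick = orderQueue.processTick(3, (task) => {
  processedIds.push(task.id);
});
```

The producer never waits for the handler, and the remaining tasks simply stay in the queue until a later tick, which is the load-leveling behavior described above.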

Scheduling Tasks: Doing Things on Time

Some tasks aren’t triggered by an immediate event but need to happen at specific intervals or times. Think of:

Generating a daily sales report at 9:00 AM.
Sending out weekly newsletters every Monday morning.
Checking for expired user sessions every hour.
Running a nightly database backup or data synchronization.

Trigger.dev supports scheduling using standard Cron expressions. If you’ve ever set up a cron job on a Linux server, you’ll be familiar with the syntax. It allows you to define highly precise, recurring schedules for your jobs. This eliminates the need for external cron services or complex server-side task managers.

⚡ Quick Note: Cron expressions are a compact way to define recurring schedules. The standard form has five fields: minute, hour, day of month, month, and day of week. Some schedulers accept an extra field (such as seconds or year), but the examples in this chapter use the five-field form.

Durable Execution for Long-Running Workflows

Consider an AI agent workflow that involves multiple, potentially time-consuming steps:

Transcribing a long audio file (5 minutes).
Summarizing the transcription using an LLM (2 minutes).
Generating an image based on the summary (1 minute).
Sending the result for human review (human might take 30 minutes to 2 hours).
Publishing the final output (1 minute).

This entire process could easily take an hour or more. What happens if your server restarts in the middle of transcription, or the LLM API times out? Without durable execution, the entire workflow might fail, losing all progress and requiring a complete restart, which is inefficient and costly.

🧠 Important: Durable execution means that a workflow’s state is persisted. If the process is interrupted (e.g., worker crash, network failure, scheduled deployment), it can resume from where it last left off, rather than restarting from the beginning. This guarantees progress and minimizes wasted computation.

Trigger.dev achieves durable execution by checkpointing the state of your job. When your job completes an io operation or needs to pause (e.g., for a delay, a retry, or waiting for an external event), Trigger.dev saves its progress. If the worker process dies, another available worker can pick up the job instance and resume it from the last saved state. This is incredibly powerful for:

Long-running tasks: No more worrying about server restarts, temporary network issues, or deployment cycles interrupting critical operations.
Human-in-the-loop workflows: Your workflow can pause indefinitely, waiting for human input or approval, and then resume seamlessly.
Retries: If an API call fails, Trigger.dev can automatically retry it after a delay, ensuring forward progress without manual intervention.
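The checkpointing idea can be sketched in a few lines. This is a conceptual illustration under simplifying assumptions, not Trigger.dev's internals: the CheckpointedRun class and its step method are hypothetical, and a Map stands in for persistent storage. Each step's result is saved under a key, so a re-run after a crash skips steps that already completed.

```typescript
// Conceptual sketch of checkpointed steps (hypothetical names, in-memory store).
class CheckpointedRun {
  private store: Map<string, unknown>;

  constructor(store: Map<string, unknown>) {
    this.store = store;
  }

  // Runs `fn` once per key; later calls return the saved result instead.
  step<T>(key: string, fn: () => T): T {
    if (this.store.has(key)) {
      return this.store.get(key) as T;
    }
    const result = fn();
    this.store.set(key, result);
    return result;
  }
}

const executedSteps: string[] = [];
let crashOnce = true;

function workflow(run: CheckpointedRun): string {
  const transcript = run.step("transcribe", () => {
    executedSteps.push("transcribe");
    return "hello world";
  });

  return run.step("summarize", () => {
    executedSteps.push("summarize");
    if (crashOnce) {
      crashOnce = false;
      throw new Error("simulated worker crash");
    }
    return transcript.toUpperCase();
  });
}

// The store outlives any single "worker", like a database would.
const persistedState = new Map<string, unknown>();
let finalSummary = "";
try {
  finalSummary = workflow(new CheckpointedRun(persistedState));
} catch {
  // A new "worker" resumes with the same persisted state.
  finalSummary = workflow(new CheckpointedRun(persistedState));
}
```

On the second pass the "transcribe" step is served from the store, so only the step that failed runs again.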

Step-by-Step Implementation: Building Robust Jobs

Let’s put these concepts into practice. We’ll create a Trigger.dev project (or continue from previous chapters) and implement jobs that leverage queues, scheduling, and durable execution.

Prerequisites

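Ensure you have your Trigger.dev project set up. If not, you can initialize one with the CLI. The command below targets the v4-beta release; the job definitions in this chapter use the client.defineJob API, so confirm that the SDK version you install exposes it (newer SDK versions define work with task() instead):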

npx trigger.dev@v4-beta init

Choose the Next.js option and follow the prompts. Make sure your environment variables (TRIGGER_SECRET_KEY, TRIGGER_PUBLIC_KEY) are correctly configured as per Chapter 2. This setup ensures your local development environment can connect to the Trigger.dev cloud service.

1. A Simple Queued Job Example

Every job you define in Trigger.dev is inherently queued. When you trigger an event, it goes into a queue. Let’s define a job that simulates a background processing task, such as image manipulation.

Open your src/trigger/jobs.ts (or equivalent) file and add the following code. This file is where all your Trigger.dev jobs are defined.

// src/trigger/jobs.ts
import { client } from "./client";
import { eventTrigger } from "@trigger.dev/sdk";
import { z } from "zod"; // Zod provides the runtime schema for the event payload

// Define a job that simulates processing an image
client.defineJob({
  id: "process-image-job", // A unique identifier for this job
  name: "Process Image in Background", // A human-readable name for the dashboard
  version: "1.0.0", // Helps manage job changes and deployments
  // This job is triggered by an event with a 'process.image' name
  trigger: eventTrigger({
    name: "process.image",
    // Define the expected structure of the event payload for type safety
    schema: z.object({
      url: z.string(), // The URL of the image to process
      userId: z.string(), // The ID of the user who uploaded the image
    }),
  }),
  // The 'run' function contains the core logic of your job
  run: async (payload, io, ctx) => {
    // Log the received payload for debugging and observability
    await io.logger.info("Starting image processing...", { payload });

    // Simulate a long-running image processing task.
    // The `io.wait` function is crucial for durable execution. It takes a unique
    // key for this step and the number of seconds to pause.
    // If the worker process restarts during these 3 seconds, the job will
    // automatically resume from this exact point on another available worker.
    await io.wait("simulate-image-processing", 3);

    // In a real scenario, you'd integrate with an actual image processing service here.
    // For example:
    // const processedImageResult = await someImageService.process(payload.url);
    // await io.logger.info("Image service responded", { result: processedImageResult });

    await io.logger.info(`Image ${payload.url} processed for user ${payload.userId}.`);

    // You could send another event here to notify the user or trigger another job.
    // For example (the first argument is a unique key for this step):
    // await io.sendEvent("notify-image-processed", { name: "image.processed", payload: { userId: payload.userId, processedUrl: "..." } });

    return { message: "Image processing complete!" };
  },
});

Let’s break down the key elements of this job definition:

id, name, version: These are standard metadata. The id is crucial as it uniquely identifies your job within Trigger.dev.
trigger: eventTrigger({ name: "process.image", schema: { ... } }): This tells Trigger.dev that this job should run whenever an event named "process.image" is received. The schema defines the expected data structure for the payload, providing valuable type-checking and documentation.
run: async (payload, io, ctx) => { ... }: This is the asynchronous function containing your job’s logic.
- payload: An object containing the data sent with the triggering event.
- io: The Trigger.dev I/O client. This object provides durable operations like logging (io.logger), pausing (io.wait), and interacting with external services in a retryable manner (io.runTask). All operations performed with io are durable.
- ctx: Provides context about the current job run, such as the run ID.
await io.wait(...): This is a fundamental building block of durable execution. Unlike setTimeout, which is non-durable and would lose state on a worker restart, io.wait tells Trigger.dev to pause the job, persist its state, and then resume it after the specified duration.

Triggering the Queued Job

You can trigger this job from your Next.js API route or anywhere you have access to the client instance. Let’s create a new API route, for example, src/app/api/trigger-image-process/route.ts, that will send the process.image event.

// src/app/api/trigger-image-process/route.ts
import { client } from "@/trigger/client"; // Adjust path based on your project structure
import { NextResponse } from "next/server";

// This Next.js API route will handle POST requests
export async function POST(request: Request) {
  const { imageUrl, userId } = await request.json();

  // Basic validation for incoming data
  if (!imageUrl || !userId) {
    return NextResponse.json({ error: "Missing imageUrl or userId" }, { status: 400 });
  }

  // Send the event to Trigger.dev. This event will be queued, and the
  // 'process-image-job' we defined earlier will pick it up for execution.
  const event = await client.sendEvent({
    name: "process.image", // The name of the event this job listens for
    payload: {
      url: imageUrl,
      userId: userId,
    },
  });

  // Respond to the client indicating the event was successfully sent
  return NextResponse.json({
    message: "Image processing event sent!",
    eventId: event.id, // The ID of the event in Trigger.dev
  });
}

Now, if you send a POST request to /api/trigger-image-process (e.g., using curl, Postman, or a frontend fetch call) with a JSON body like {"imageUrl": "https://example.com/pic.jpg", "userId": "user123"}, Trigger.dev will receive the event, queue it, and your worker will eventually process it. You can observe the job’s status and logs in the Trigger.dev dashboard, seeing it pause for 3 seconds and then complete.
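As a sketch of the frontend fetch option mentioned above, the helper below prepares the arguments for such a call. The function name buildImageProcessRequest and the PreparedRequest type are hypothetical; the route path and body fields are the ones used in this section.

```typescript
// Shape of the JSON body the route in this section expects.
interface ImageProcessBody {
  imageUrl: string;
  userId: string;
}

interface PreparedRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Hypothetical helper: prepares the arguments for a fetch() call to the route.
function buildImageProcessRequest(body: ImageProcessBody): PreparedRequest {
  if (!body.imageUrl || !body.userId) {
    throw new Error("imageUrl and userId are required");
  }
  return {
    url: "/api/trigger-image-process",
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

const prepared = buildImageProcessRequest({
  imageUrl: "https://example.com/pic.jpg",
  userId: "user123",
});

// In a browser, send it with:
// const response = await fetch(prepared.url, prepared.init);
// const { eventId } = await response.json();
```

Validating on the client mirrors the 400 check in the route, so obviously incomplete requests never leave the browser.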

2. Implementing a Scheduled Job

Next, let’s create a job that runs on a predefined schedule, rather than in response to an event. This is perfect for periodic tasks like health checks, data synchronization, or report generation.

Add this job definition to your src/trigger/jobs.ts file, alongside your process-image-job:

// src/trigger/jobs.ts (add to existing file)
// `client` is already imported at the top of this file. Extend the existing
// @trigger.dev/sdk import so it also brings in cronTrigger:
// import { cronTrigger, eventTrigger } from "@trigger.dev/sdk";

// Define a job that runs on a schedule
client.defineJob({
  id: "scheduled-health-check", // Unique ID for this scheduled job
  name: "Hourly System Health Check",
  version: "1.0.0",
  // This job is triggered by a cron schedule
  trigger: cronTrigger({
    // Field order: "minute hour dayOfMonth month dayOfWeek".
    // "* * * * *" runs every minute, which is convenient for this demonstration.
    // For a true hourly check, use "0 * * * *" (minute 0 of every hour).
    cron: "* * * * *",
  }),
  run: async (payload, io, ctx) => {
    await io.logger.info("Running system health check...");

    // Simulate checking various system components
    const status = {
      database: "healthy",
      api_gateway: "healthy",
      cache: "unhealthy", // Oh no, a problem!
    };

    if (status.cache === "unhealthy") {
      await io.logger.error("Cache system is unhealthy!", { status });
      // In a real scenario, you might send an alert or trigger a remediation job.
      // await io.sendEvent("send-cache-alert", { name: "alert.system.cache.unhealthy", payload: { status } });
    } else {
      await io.logger.info("All systems are nominal.", { status });
    }

    return { message: "Health check complete", status };
  },
});

cronTrigger({ cron: "* * * * *" }): This is how you define a scheduled job. The cron property takes a standard cron expression. * * * * * means “every minute of every hour of every day of every month of every day of the week.”

Cron Expression Basics:

A cron expression typically consists of 5 fields, representing the schedule:

* * * * *
| | | | |
| | | | ----- Day of week (0 - 7, Sunday is 0 or 7)
| | | ------- Month (1 - 12)
| | --------- Day of month (1 - 31)
| ----------- Hour (0 - 23)
------------- Minute (0 - 59)

Common Examples:

0 * * * *: Every hour at the 0th minute (e.g., 1:00, 2:00, etc.)
0 9 * * 1: Every Monday at 9:00 AM
0 0 1 * *: At midnight on the first day of every month
*/5 * * * *: Every 5 minutes
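To check how these expressions line up with actual times, here is a small five-field matcher. It is a simplified sketch, not the parser Trigger.dev uses: it evaluates dates in UTC and supports only "*", plain numbers, comma lists, and "*/n" steps (no ranges or month/day names). The function names are hypothetical.

```typescript
// Minimal five-field cron matcher (illustrative sketch, evaluated in UTC).

// Checks one field against a value; `min` is the smallest legal value for the field.
function fieldMatches(field: string, value: number, min: number): boolean {
  return field.split(",").some((part) => {
    if (part === "*") return true;
    if (part.startsWith("*/")) {
      const step = Number(part.slice(2));
      return Number.isInteger(step) && step > 0 && (value - min) % step === 0;
    }
    return Number(part) === value;
  });
}

function matchesCron(expression: string, date: Date): boolean {
  const fields = expression.trim().split(/\s+/);
  if (fields.length !== 5) {
    throw new Error(`Expected 5 cron fields, got ${fields.length}`);
  }
  const [minute, hour, dayOfMonth, month, dayOfWeek] = fields;

  // Treat 7 as Sunday so both "0" and "7" work in the day-of-week field.
  const normalizedDow = dayOfWeek
    .split(",")
    .map((part) => (part === "7" ? "0" : part))
    .join(",");

  const domOk = fieldMatches(dayOfMonth, date.getUTCDate(), 1);
  const dowOk = fieldMatches(normalizedDow, date.getUTCDay(), 0);
  // Classic cron rule: when both day fields are restricted, either may match.
  const bothRestricted = dayOfMonth !== "*" && dayOfWeek !== "*";
  const dayOk = bothRestricted ? domOk || dowOk : domOk && dowOk;

  return (
    fieldMatches(minute, date.getUTCMinutes(), 0) &&
    fieldMatches(hour, date.getUTCHours(), 0) &&
    fieldMatches(month, date.getUTCMonth() + 1, 1) &&
    dayOk
  );
}

// 1 January 2024 was a Monday.
const mondayNineAm = new Date(Date.UTC(2024, 0, 1, 9, 0));
const isWeeklySlot = matchesCron("0 9 * * 1", mondayNineAm); // true
const isFiveMinuteSlot = matchesCron("*/5 * * * *", new Date(Date.UTC(2024, 0, 1, 9, 12))); // false
```

A scheduler would evaluate a check like this once per minute and start a run whenever it returns true.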

Once you deploy your Trigger.dev worker (by running your Next.js app in production mode or deploying it), this job will automatically start running according to its schedule. You’ll see new job runs appearing in your Trigger.dev dashboard every minute (or as per your cron expression).

3. Combining Durability with Retries for Reliability

Trigger.dev jobs come with built-in retry mechanisms, which are essential for handling transient failures (e.g., a temporary network glitch or an external API being briefly unavailable). When the callback passed to io.runTask throws an error, Trigger.dev can retry that step according to its retry settings. This is a huge advantage over traditional background task systems where you’d have to implement complex retry logic yourself.

Let’s modify our image processing job to include a simulated external API call that might fail and show how retries work.

// src/trigger/jobs.ts (modify the existing 'process-image-job')
import { client } from "./client";
import { eventTrigger } from "@trigger.dev/sdk";
import { z } from "zod"; // Zod provides the runtime schema for the event payload

client.defineJob({
  id: "process-image-job",
  name: "Process Image in Background",
  version: "1.0.1", // Increment version since we're changing the logic
  trigger: eventTrigger({
    name: "process.image",
    schema: z.object({
      url: z.string(),
      userId: z.string(),
    }),
  }),
  run: async (payload, io, ctx) => {
    await io.logger.info("Starting image processing...", { payload });

    // Simulate an external API call that might fail.
    // `io.runTask` wraps this logic, making it durable and retryable.
    const processedData = await io.runTask(
      "call-image-api", // Unique ID for this specific task step within the job
      async () => {
        // Simulate a random failure for demonstration purposes.
        // In a real app, this would be an actual external API call.
        const shouldFail = Math.random() < 0.5; // 50% chance of failure

        if (shouldFail) {
          // If an error is thrown here, Trigger.dev will automatically retry this step.
          throw new Error("Simulated external image API failure!");
        }

        // Simulate success and return some processed data.
        await io.logger.info("Successfully called external image API.");
        return {
          originalUrl: payload.url,
          processedUrl: `https://processed.example.com/${payload.userId}-${Date.now()}.jpg`,
          metadata: { width: 800, height: 600 },
        };
      },
      {
        // Optional: configure retry behavior for this task step.
        // Retry settings control how many attempts are made and how long
        // Trigger.dev waits between them (`factor` scales the delay each time).
        // For example, a limit of 3 with a fixed 10-second delay:
        // retry: { limit: 3, factor: 1, minTimeoutInMs: 10000 }
      }
    );

    // This part of the code will only execute if 'call-image-api' succeeds
    // (potentially after several retries).
    await io.logger.info(`Image processing complete for ${payload.url}.`, {
      processedData,
    });

    return { message: "Image processing complete!", data: processedData };
  },
});

// ... (keep the scheduled-health-check job below this if you defined it)

In this updated job:

We incremented the version to 1.0.1. It’s a good practice to update the version whenever you make significant logic changes to a job, especially in production environments, to ensure proper deployment and potential rollback capabilities.
io.runTask("call-image-api", async () => { ... }): This wraps our potentially failing logic. If the async function passed to io.runTask throws an error, Trigger.dev retries this specific step of the job according to the step’s retry policy (typically exponential backoff). This means your entire workflow doesn’t restart from the beginning; only the failing part is re-attempted.
The retry options are commented out, showing where you could customize them. If a step depends on retries, set the options explicitly rather than relying on defaults, and confirm the resulting behavior in the dashboard.

When you trigger this job (via the API route), you might see it fail a few times and then retry, eventually succeeding, or failing permanently after exhausting its retry attempts. This behavior is fully observable in the Trigger.dev dashboard, providing a clear audit trail of each attempt.
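The retry-with-backoff pattern itself is small enough to sketch by hand. The helper below is a generic illustration, not Trigger.dev's implementation; the option names (maxAttempts, factor, minDelayMs, maxDelayMs) and function names are hypothetical. Delays are reported through a callback instead of being slept, so the sketch runs instantly.

```typescript
// Hypothetical option names for a generic backoff policy.
interface BackoffOptions {
  maxAttempts: number; // total attempts, including the first
  factor: number; // multiplier applied to the delay after each failure
  minDelayMs: number; // delay before the first retry
  maxDelayMs: number; // upper bound on any single delay
}

// Delay before retry number `retry` (1-based): minDelayMs * factor^(retry - 1), capped.
function backoffDelay(retry: number, opts: BackoffOptions): number {
  return Math.min(opts.maxDelayMs, opts.minDelayMs * Math.pow(opts.factor, retry - 1));
}

// Runs `operation` until it succeeds or the attempts are exhausted.
// Each wait is reported through `onWait` rather than actually slept.
function retryWithBackoff<T>(
  operation: (attempt: number) => T,
  opts: BackoffOptions,
  onWait: (delayMs: number) => void
): T {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return operation(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < opts.maxAttempts) {
        onWait(backoffDelay(attempt, opts));
      }
    }
  }
  throw lastError;
}

const recordedWaits: number[] = [];
const retryOutcome = retryWithBackoff(
  (attempt) => {
    // Fail the first two attempts, then succeed.
    if (attempt < 3) throw new Error(`attempt ${attempt} failed`);
    return `succeeded on attempt ${attempt}`;
  },
  { maxAttempts: 4, factor: 2, minDelayMs: 1000, maxDelayMs: 30000 },
  (delayMs) => {
    recordedWaits.push(delayMs);
  }
);
// retryOutcome is "succeeded on attempt 3"; recordedWaits is [1000, 2000]
```

With a factor of 2 the waits double after each failure, which gives a struggling API progressively more time to recover; a factor of 1 produces a fixed delay.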

Visualizing a Robust Workflow

This diagram illustrates how an event enters a queue, is picked up by a worker, and how io.runTask provides a durable, retryable step within the workflow, ensuring resilience against failures.

flowchart TD
    User_Action[User Uploads Image] --> Trigger_Event[Trigger Event]
    Trigger_Event --> Job_Queue[Job Queue]
    Job_Queue --> Worker_Instance[Worker Process]
    subgraph JobExecution["Job Execution with Durability"]
        Worker_Instance --> Call_Image_API[Call Image API]
        Call_Image_API -->|Fails| Retry_Step{Retry}
        Retry_Step -->|Retry within limits| Call_Image_API
        Retry_Step -->|Max attempts reached| Notify_User[Notify User]
        Call_Image_API -->|Succeeds| Update_DB[Update Database Record]
        Update_DB --> Notify_User
    end

Mini-Challenge: Scheduled Report Generation

Now it’s your turn to apply what you’ve learned! Create a new job that simulates generating an hourly report.

Challenge:

Define a new Trigger.dev job in your src/trigger/jobs.ts file.
Schedule it to run once every hour, specifically at 30 minutes past the hour (e.g., 1:30, 2:30, etc.).
Inside the job’s run function:
- Simulate fetching data from an external analytics API. You can use io.wait("simulate-analytics-fetch", 5) to represent the API call’s duration and add a Math.random() < 0.6 check (60% chance of failure) to simulate an unreliable API.
- Crucially, ensure this simulated API call is retryable. If it fails, Trigger.dev should automatically retry it at least twice before finally giving up.
- If the API call is successful, log a message using io.logger.info indicating a report was “generated” for the current hour, including the current timestamp (e.g., new Date().toISOString()).
- If it ultimately fails after all retries, log an io.logger.error message.

Hint:

For the cron expression, 30 * * * * will run at 30 minutes past every hour.
Use io.runTask for the simulated API call. You can explicitly set retry: { limit: 3 } within the io.runTask options, which satisfies the “at least twice” requirement whether or not the limit counts the initial attempt.
Remember to give your job a unique id and name.

What to observe/learn:

How scheduled jobs automatically appear in your Trigger.dev dashboard and execute at the precise times.
How Trigger.dev handles retries for io.runTask calls, even across potential worker restarts, ensuring durability.
The difference in dashboard logs and status for a job that completes successfully, fails after exhausting retries, or succeeds on a retry attempt.

Common Pitfalls & Troubleshooting

Even with robust tools like Trigger.dev, distributed systems can present unique challenges. Understanding common pitfalls can save significant debugging time.

Incorrect Cron Expressions: A very common mistake is misconfiguring cron expressions, leading to jobs not running or running at unexpected times.
- Symptom: Your scheduled job either doesn’t run at all, runs at the wrong frequency, or at an unexpected time.
- Troubleshooting: Use an online cron expression validator (e.g., crontab.guru) to test and visualize your expressions. Be mindful of time zones; Trigger.dev schedules are typically evaluated in UTC. If your local machine or dashboard displays times in a different zone, there might be a perceived offset.
Idempotency Issues with Retries: If a job step is retried, it might execute multiple times. If the operation isn’t designed to be idempotent, this can lead to unintended side effects.
- Symptom: Duplicate data entries, multiple notifications sent, or inconsistent state in external systems after failures and retries.
- Troubleshooting: Design your job steps to be idempotent. This means that performing the operation multiple times has the same effect as performing it once. For example, when updating a database record, use UPSERT (update or insert) instead of just INSERT. When sending notifications, ensure your notification system can handle duplicate requests gracefully or use a unique transaction ID to prevent re-sends.
Debugging Long-Running Workflows: Understanding the exact state of a job that pauses for minutes or hours can be tricky without proper logging.
- Symptom: A job appears “stuck,” or you can’t tell which specific step it’s currently on, especially if it’s waiting for external input or a long io.wait.
- Troubleshooting: Leverage io.logger.info and io.logger.error extensively. These logs are persisted with the job run in the Trigger.dev dashboard, giving you a clear, step-by-step timeline of execution and the exact state at each durable step. The Trigger.dev dashboard provides a visual trace of your job’s execution, showing each io call and its status, which is invaluable for pinpointing issues.
Resource Exhaustion for Highly Concurrent Queued Jobs: While queues prevent immediate overload by buffering tasks, a massive backlog can still consume significant resources over time.
- Symptom: Workers become slow, experience memory issues, or jobs take an excessively long time to process, even if they eventually complete.
- Troubleshooting: Monitor your queue depth and worker resource usage (CPU, memory). If your backlog consistently grows or workers are constrained, scale your Trigger.dev workers horizontally. For extremely high-throughput scenarios, consider partitioning your events (e.g., using different event names or queue properties if Trigger.dev introduces them for finer control) to distribute load across more workers or queues.
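The idempotency pitfall above can be illustrated with a small sketch. The NotificationSender class below is hypothetical and keeps its state in memory; a real system would persist the keys in a database or rely on an idempotency feature of the downstream service. The point is that a retried step carrying the same key does not produce a second delivery.

```typescript
// Hypothetical in-memory sender that deduplicates by idempotency key.
class NotificationSender {
  private sentKeys = new Set<string>();
  deliveries: string[] = [];

  // Returns true if the message was delivered, false if it was a duplicate.
  send(idempotencyKey: string, message: string): boolean {
    if (this.sentKeys.has(idempotencyKey)) {
      return false;
    }
    this.sentKeys.add(idempotencyKey);
    this.deliveries.push(message);
    return true;
  }
}

const sender = new NotificationSender();

// Derive the key from stable business identifiers, not from the attempt number.
const dedupeKey = "order-42:confirmation-email";

const firstAttempt = sender.send(dedupeKey, "Your order is confirmed");
// The step is retried after a failure elsewhere; the same key is reused.
const retriedAttempt = sender.send(dedupeKey, "Your order is confirmed");
// firstAttempt is true, retriedAttempt is false, and only one message was delivered.
```

Because the key is built from the order ID and the notification type, every retry of the same step maps to the same key, while a different order or a different message type gets its own.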

Summary

You’ve now mastered the foundational elements for building robust and reliable workflows with Trigger.dev:

Queues: Decouple task producers from consumers, enabling load leveling and graceful handling of processing spikes. Every Trigger.dev job leverages an underlying managed queue for resilience and scalability.
Scheduling: Execute tasks at precise times or intervals using familiar Cron expressions, perfect for reports, maintenance, or recurring checks, eliminating the need for external schedulers.
Durable Execution: Trigger.dev’s ability to persist job state ensures that long-running workflows, pauses (io.wait), and retries (io.runTask) can survive worker restarts and continue exactly where they last left off, guaranteeing progress.
Retries: Built-in, configurable retry mechanisms handle transient failures in external API calls or other operations, making your jobs resilient to temporary service outages without complex manual coding.

These capabilities are indispensable for any production system, especially those orchestrating complex AI agents, data pipelines, or critical business logic. By understanding and applying these principles, you can build systems that are not only powerful but also fault-tolerant and highly available.

In the next chapter, we’ll take these robust workflow capabilities and apply them to the exciting world of AI agents, exploring how Trigger.dev can manage the lifecycle and interactions of intelligent systems, leveraging its durable execution for complex, multi-step agentic behaviors.
