Time-Travel Debugging: Replaying Production Bugs Locally

We’ve all had that sinking feeling. There are multiple crash reports from production. We have the exact input parameters that caused the failures. We have the stack traces. Yet, when we run the code locally, it works perfectly.

We know where it broke, but we can’t see why. Was it a race condition? Did a database read return stale data that has since been overwritten? To find the cause, we have to mentally reconstruct the state of the world as it existed milliseconds before the crash. Welcome to debugging hell.

If we could simply rewind time and watch the code execute exactly as it did for those failed requests, life would be a lot easier.

In Testing Side Effects Without the Side Effects, we explored a JavaScript Effect System where business logic doesn’t execute actions directly. Instead, it returns a description of what it intends to do in the form of a simple Command object. For example:

const validatePromo = (cartContents) => {
    // Define the side effect, but don't run it yet
    const cmdValidatePromo = () => db.checkPromo(cartContents);

    // Define what happens with the result
    const next = (promo) =>
        (promo.isValid ? Success({...cartContents, promo}) : Failure('Invalid promo'));

    return Command(cmdValidatePromo, next);
};

Calling validatePromo(cartContents) runs nothing yet; it simply returns:

{
    type: 'Command',
    cmd: [Function: cmdValidatePromo], // The command waiting to be executed
    next: [Function: next]             // What to do after the command finishes
}
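
For reference, the three step shapes used throughout this article could be built by constructors as small as the ones below. This is a minimal sketch rather than the library’s actual source: the field names value and error are inferred from how the replay function later in this article reads currentStep.value and currentStep.error, and cmdGetAnswer is a made-up stand-in for a real side effect.

```javascript
// Hypothetical constructors; the shapes are assumptions based on this article.
const Success = (value) => ({ type: 'Success', value });
const Failure = (error) => ({ type: 'Failure', error });
const Command = (cmd, next) => ({ type: 'Command', cmd, next });

// A Command is plain data: nothing runs until an interpreter calls cmd().
const step = Command(
    function cmdGetAnswer() { return 42; },  // stand-in for a real side effect
    (answer) => Success({ answer })          // what to do with the result
);

console.log(step.type);      // 'Command'
console.log(step.cmd.name);  // 'cmdGetAnswer'
console.log(step.next(42));  // { type: 'Success', value: { answer: 42 } }
```

Giving the command function a name matters: the function’s name property is what later identifies the step in a trace.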

We often compose multiple command-returning functions in a pipeline to make the most of our Effect System:

const checkoutFlow = (cartSummary) =>
    effectPipe(
        fetchCart,
        validatePromo,
        (cartContents) => chargeCreditCard(cartSummary, cartContents)
    )(cartSummary);

Our effect pipeline handles the Success and Failure cases automatically. If a function returns Success, the next function in the pipeline is called; if it returns Failure, the pipeline terminates.
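
To make that short-circuiting concrete, here is one way such a pipeline could be written. It is a sketch under assumptions, not the real effectPipe: the chain helper, the constructor shapes, and the toy steps (parse, double, explode, load) are all hypothetical.

```javascript
// Hypothetical constructors, assumed to match the shapes used in this article.
const Success = (value) => ({ type: 'Success', value });
const Failure = (error) => ({ type: 'Failure', error });
const Command = (cmd, next) => ({ type: 'Command', cmd, next });

// Continue with `fn` once `step` eventually resolves to a Success.
const chain = (step, fn) => {
    if (step.type === 'Failure') return step;            // short-circuit
    if (step.type === 'Success') return fn(step.value);  // hand the value on
    // Command: keep the same cmd, but defer `fn` until its result is known.
    return Command(step.cmd, (result) => chain(step.next(result), fn));
};

// Sketch of a pipeline: thread the input through each function in order.
const effectPipe = (...fns) => (input) => fns.reduce(chain, Success(input));

// Toy steps for illustration only.
const parse = (s) => (Number.isNaN(Number(s)) ? Failure('Not a number') : Success(Number(s)));
const double = (n) => Success(n * 2);
const explode = () => { throw new Error('never called'); };

console.log(effectPipe(parse, double)('21'));     // { type: 'Success', value: 42 }
console.log(effectPipe(parse, explode)('oops'));  // { type: 'Failure', error: 'Not a number' }

// A step that returns a Command pauses the pipeline until a result is supplied.
const load = (id) => Command(function cmdLoad() { return { id }; }, (row) => Success(row));
const paused = effectPipe(load, (row) => Success(row.id + 1))(1);
console.log(paused.type);              // 'Command'
console.log(paused.next({ id: 1 }));   // { type: 'Success', value: 2 }
```

Note that the pipeline never calls cmd itself. It only rewires next, which is what keeps the business logic pure and leaves execution to the interpreter.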

The series of Command objects generated by the pipeline is then run by an interpreter using runEffect(checkoutFlow(cartSummary)). Because our business logic consists of pure functions that interact with the world only through data, we can record those interactions simply by adding a few hooks for services like OpenTelemetry. And if we can record them, we can replay them deterministically. Best of all, there’s no need to mock a single database or external service.
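
As an illustration of where such a recording hook could live, the sketch below shows a bare-bones interpreter. It is an assumption-laden stand-in for the real runEffect: the second parameter onStep is hypothetical, the event shape simply mirrors the trace format shown later in this article, and commands that throw are not handled here. Forwarding events to a tracing backend is left out.

```javascript
// Hypothetical constructors, assumed to match the shapes used in this article.
const Success = (value) => ({ type: 'Success', value });
const Failure = (error) => ({ type: 'Failure', error });
const Command = (cmd, next) => ({ type: 'Command', cmd, next });

// Sketch of an interpreter: run each Command, report the interaction to the
// (hypothetical) onStep hook, then feed the result back into the pure logic.
async function runEffect(step, onStep = () => {}) {
    let current = step;
    while (current.type === 'Command') {
        const result = await current.cmd();  // the only place side effects run
        onStep({ command: current.cmd.name || 'anonymous', result });
        current = current.next(result);
    }
    return current;  // a Success or a Failure
}

// Usage: collect the trace in memory while running a one-step flow.
const trace = [];
const flow = Command(
    async function cmdFetchGreeting() { return 'hello'; },
    (greeting) => Success(greeting.toUpperCase())
);

runEffect(flow, (event) => trace.push(event)).then((outcome) => {
    console.log(outcome);  // { type: 'Success', value: 'HELLO' }
    console.log(trace);    // [ { command: 'cmdFetchGreeting', result: 'hello' } ]
});
```

Because the hook sees every command name and result in order, the array it fills is already enough to drive a replay.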

When a crash happens, we don’t just get an error message. We get a crash log containing the initial input and the execution trace complete with all outputs.

In the following simplified trace generated by our workflow, a customer uses a 100% off promo code. The business logic calculates the total as $0.00 and attempts to pass it to the payment gateway, but the payment gateway rejects the API call because the minimum charge amount is $0.50, causing a 500 Internal Server Error:

const traceLog = {
  "flowName": "checkout",
  "initialInput": {
    "userId": "some_user_id",
    "cartId": "cart_abc123",
    "promoCode": "FREE_YEAR_VIP"
  },
  "trace": [
    {
      "command": "cmdFetchCart",
      "result": {
        "cartId": "cart_abc123",
        "items": ["annual_subscription"],
        "totalAmount": "120.00"
      }
    },
    {
      "command": "cmdValidatePromo",
      "result": {
        "cartId": "cart_abc123",
        "items": ["annual_subscription"],
        "totalAmount": "120.00",
        "isValid": true,
        "discountType": "%",
        "discountValue": 100
      }
    },
    {
      "command": "cmdChargeCreditCard",
      "result": {
        "error": {
          "code": "invalid_amount",
          "message": "Amount must be non-zero."
        }
      }
    }
  ]
};

Compared to the cryptic stack traces common in imperative code, this execution trace makes the source of the error immediately obvious.

We can even go ahead and write a quick time-travel function like the one below to replay any execution trace locally, complete with built-in support for detecting time paradoxes!

function timeTravel(workflowFn, traceLog) {
    const { initialInput, trace, flowName } = traceLog;
    const format = (v) => JSON.stringify(v, null, 2);

    let currentStep = workflowFn(initialInput);
    let traceIndex = 0;
    console.log(`Replay started with initial input: ${format(initialInput)}`);
    while (true) {
        const stepName = currentStep.type === 'Command' ? currentStep.cmd.name || 'anonymous' : currentStep.type;

        if (currentStep.type === 'Success' || currentStep.type === 'Failure') {
            console.log(`Replay Finished with state: ${currentStep.type}`);
            console.log(
                currentStep.type === 'Failure'
                    ? `Error: ${format(currentStep.error)}`
                    : `Result: ${format(currentStep.value)}`
            );
            break;
        }

        if (currentStep.type === 'Command') {
            const recordedEvent = trace[traceIndex];
            if (!recordedEvent) {
                throw new Error(`Trace ended prematurely at step ${traceIndex}. Workflow expected command: ${stepName}`);
            }
            if (recordedEvent.command !== stepName) {
                throw new Error(
                    `Time paradox detected! Workflow asked for '${stepName}', but trace recorded '${recordedEvent.command}'`
                );
            }
            console.log(`Step ${++traceIndex}: ${recordedEvent.command} returned ${format(recordedEvent.result)}`);
            currentStep = currentStep.next(recordedEvent.result);
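        } else {
            // Guard against looping forever on an unrecognized step shape
            throw new Error(`Unknown step type: ${stepName}`);
        }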
    }
}

Running timeTravel(checkoutFlow, traceLog) exercises our actual checkout workflow against the recorded results and produces the output below. With that, we’ve replayed a production execution trace locally, all without touching any database or external service:

Replay started with initial input: {
  "userId": "some_user_id",
  "cartId": "cart_abc123",
  "promoCode": "FREE_YEAR_VIP"
}
Step 1: cmdFetchCart returned {
  "cartId": "cart_abc123",
  "items": [
    "annual_subscription"
  ],
  "totalAmount": "120.00"
}
Step 2: cmdValidatePromo returned {
  "cartId": "cart_abc123",
  "items": [
    "annual_subscription"
  ],
  "totalAmount": "120.00",
  "isValid": true,
  "discountType": "%",
  "discountValue": 100
}
Step 3: cmdChargeCreditCard returned {
  "error": {
    "code": "invalid_amount",
    "message": "Amount must be non-zero."
  }
}
Replay Finished with state: Failure
Error: {
  "code": "invalid_amount",
  "message": "Amount must be non-zero."
}
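
A recorded trace can also double as a regression test. The sketch below is a quieter variant of the replay loop that returns the final step instead of logging it, so a test can assert on the outcome. Everything here is hypothetical and for illustration only: the replay helper, the toy chargeFlow workflow, and the recorded object are not part of the original checkout code.

```javascript
// Hypothetical constructors, assumed to match the shapes used in this article.
const Success = (value) => ({ type: 'Success', value });
const Failure = (error) => ({ type: 'Failure', error });
const Command = (cmd, next) => ({ type: 'Command', cmd, next });

// Silent replay: feeds recorded results to the workflow and returns the final step.
function replay(workflowFn, { initialInput, trace }) {
    let step = workflowFn(initialInput);
    for (const event of trace) {
        if (step.type !== 'Command') break;  // workflow finished before the trace did
        if (step.cmd.name !== event.command) {
            throw new Error(`Workflow asked for '${step.cmd.name}', but trace recorded '${event.command}'`);
        }
        step = step.next(event.result);
    }
    if (step.type === 'Command') {
        throw new Error(`Trace ended before command '${step.cmd.name}'`);
    }
    return step;
}

// Toy workflow: charge a card and surface gateway errors as a Failure.
const chargeFlow = ({ amount }) =>
    Command(
        function cmdCharge() { /* would call the payment gateway; never run during replay */ },
        (res) => (res.error ? Failure(res.error) : Success({ amount, receipt: res.receipt }))
    );

const recorded = {
    initialInput: { amount: 0 },
    trace: [{ command: 'cmdCharge', result: { error: { code: 'invalid_amount' } } }]
};

const outcome = replay(chargeFlow, recorded);
console.log(outcome.type);        // 'Failure'
console.log(outcome.error.code);  // 'invalid_amount'
```

One caveat with this approach: once the workflow is fixed so that it no longer issues the failing command, the old trace will no longer line up with it, and the replay will either stop early or report a mismatch. That divergence is itself a useful signal that behavior has changed.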

Time-travel debugging might sound like a complex feature reserved for heavy-duty enterprise tools, but it fundamentally comes down to architectural design; it takes fewer than 100 lines of code to implement, and that figure includes our Effect System.

Because every interaction passes through runEffect, we can easily implement a redaction layer to scrub personally identifiable information, such as credit card numbers or emails, before it ever reaches the trace log.
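
A redaction layer of this kind could be as simple as a recursive key filter applied to each event before it is written. The following is only a sketch: the redact helper and the SENSITIVE_KEYS list are hypothetical, a real list would be project-specific, and the function assumes plain JSON-like data (objects, arrays, and primitives).

```javascript
// Hypothetical list of sensitive field names; adjust per project.
const SENSITIVE_KEYS = new Set(['email', 'cardNumber', 'cvv']);

// Return a deep copy with sensitive values masked; the input is never mutated.
function redact(value) {
    if (Array.isArray(value)) return value.map(redact);
    if (value !== null && typeof value === 'object') {
        return Object.fromEntries(
            Object.entries(value).map(([key, v]) =>
                [key, SENSITIVE_KEYS.has(key) ? '[REDACTED]' : redact(v)])
        );
    }
    return value;  // primitives pass through unchanged
}

// Made-up event for illustration.
const event = {
    command: 'cmdChargeCreditCard',
    result: {
        cardNumber: '4242424242424242',
        amount: '12.00',
        customer: { email: 'a@example.com' }
    }
};

console.log(JSON.stringify(redact(event), null, 2));
```

One design consideration: if the business logic branches on a field that has been masked, a replay may take a different path than production did. It is probably safest to redact only values the workflow treats as opaque.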

By pushing side effects to the edges and keeping our core logic pure, we gain a deterministic and secure execution trace. As a result, debugging shifts from guessing what might have happened to watching exactly what did happen, all without compromising user privacy.

GitHub Repository: pure-effect

Aycan Gulez