Error Handling and Resilience (Admin Guide)
This page documents how Pulsar reacts to operational failures — RabbitMQ/relay outages, Pulsar restarts, Galaxy unreachability, transient HTTP errors during staging — and what an administrator needs to configure or monitor to make those guarantees real.
The single guarantee Pulsar provides is: for every job submitted by
Galaxy, exactly one terminal status update ( complete ,
failed , cancelled , or lost ) is delivered, in
non-decreasing lifecycle order, within bounded time after the underlying
fault clears.
The mechanisms below combine to deliver that guarantee. Each section names the configuration knobs, defaults, on-disk artifacts, and operational signals so you can verify the system is doing what it claims.
1. The job lifecycle and where status updates come from
A job moves through these states (see pulsar/managers/status.py):
preprocessing -> queued -> running -> postprocessing -> complete | failed | cancelled | lost
Status transitions are emitted by the stateful manager wrapper
(pulsar/managers/stateful.py):
preprocessing— the moment the setup message is accepted and the staging thread begins; the job directory containslaunch_configbut not yetpreprocessed.running— first call toget_status()that detects the job has begun executing on the underlying runner.terminal (
complete/failed/cancelled/lost) — written as the on-disk metadata filefinal_statusafter postprocessing.
All transitions go through a single callback registered by the messaging
binding (pulsar/messaging/bind_amqp.py and
pulsar/messaging/bind_relay.py), which writes the payload to the
status-update outbox and lets a background drain thread retry the
publish. The synchronous publish path is gone.
2. The persistent status-update outbox
Why
Before this layer existed, a Pulsar that hit a broker hiccup just as a job finished would log the publish exception, kill the postprocess thread, and leave Galaxy permanently unaware that the job was done. This was the single largest data-loss path in the lifecycle.
How
Every status update is written to a per-manager directory before any attempt to publish:
<persistence_directory>/<manager>-status-outbox/ # AMQP modes
<persistence_directory>/<manager>-relay-status-outbox/ # relay mode
Each entry is a single JSON file named <seq>-<uuid>.json. seq is a
monotonic per-outbox counter primed from the maximum seq found on disk at
startup, so a lexically sorted directory listing yields strict FIFO drain
order even after a restart.
A daemon thread (named status-outbox-<dir-basename>) drains pending
entries every few seconds:
on enqueue success the file is removed only after
publish_fnreturns;on publish failure the file is left on disk, the drain thread retries on the next pass, and
enqueueitself never raises;on Pulsar startup the drain thread runs once before consumer threads start, so any backlog left by the previous process is delivered first.
What this gives you
A broker outage — even one that lasts for minutes or spans a Pulsar restart — does not lose a status update. The terminal state sits on disk until the broker is reachable again.
The postprocess thread cannot be terminated by a publish failure.
If the broker eventually has to be replaced or wiped, you can simply delete the affected outbox directories after you accept that those updates are lost — no other Pulsar state needs touching.
Configuration
persistence_directory(yaml, defaultfiles/persisted_data) — where the outbox lives. Required for at-least-once semantics. If set to__none__the outbox is disabled and the legacy synchronous publish path is used; do not run production this way.status_outbox_drain_interval(yaml, default5.0seconds) — how often the daemon thread retries pending entries when no immediate wakeup has fired.
Monitoring
The directory size of
<persistence_directory>/*-status-outbox*/is the natural backpressure signal. Steady state should be 0 entries. Sustained growth means the broker (or Galaxy’s consumer) is unreachable, not that Pulsar is broken.The log line
Outbox <dir> has N pending messages to retryfires on startup whenever the previous run left a backlog — alert on it if N is non-zero for more than a few drain intervals.
3. AMQP durability defenses
When Pulsar talks to RabbitMQ:
amqp_durable: true(off by default) declares thepulsarexchange and every per-name queue withdurable=true, and stamps publishes withdelivery_mode=2(persistent). With this enabled, setup, kill, and status-update messages survive a RabbitMQ restart. The default isfalseso existing deployments with non-durable queues keep working — RabbitMQ refuses to redeclare an existing queue with mismatched durability, so flipping this on a live deployment requires deleting the affected queues first or migrating to a fresh broker. The outbox already covers the publisher-side leak (LP1) regardless of this setting; durable queues are an extra defense for the broker-restart case specifically.amqp_publish_retry: trueenables kombu’s connection-retry policy. When set, pulsar fills in bounded defaults (max_retries: 5,interval_start: 1,interval_step: 2,interval_max: 30) so a single transient hiccup is absorbed in-band without round-tripping through the outbox drain loop.amqp_acknowledge: true(off by default) layers an additional publisher-confirms protocol on top, with its own UUID store under<persistence_directory>/amqp_ack-<manager>/. This is independent of the outbox; both can be enabled together for defense-in-depth.
Note
To enable durable queues on an existing deployment, drain or delete
the existing pulsar__* queues (e.g. rabbitmqctl delete_queue
pulsar__setup for each one) before flipping
amqp_durable: true. On a fresh install you can simply set the
flag from the start.
4. pulsar-relay durability defenses
Relay uses HTTP long-polling with a per-topic cursor (last_message_id)
that the consumer advances as it reads messages. Without persistence the
cursor would reset on every restart, silently skipping any setup or
status_update messages published while the consumer was down.
Pulsar persists the cursor at:
<persistence_directory>/<manager>-relay-cursor.json
written atomically (rename-temp) on every advance, loaded on startup. The
Galaxy-side RelayClientManager accepts a relay_cursor_path keyword
to do the same on the receiver side.
Galaxy job handlers run as separate processes — each one polls the relay
independently and tracks its own cursor — so RelayClientManager
expands the operator-supplied path with a stable, per-handler
identifier before passing it to the transport. The suffix is resolved
in this order:
relay_handler_idconstructor kwarg (typicallyapp.config.server_namefrom the Galaxy runner — stable across restarts).GALAXY_SERVER_NAMEenvironment variable (Galaxy’s standard handler tag, set by the launcher).pidNNNas a last resort, with a startupWARNINGlog: PID is unique per process but changes on every restart, so the persisted cursor would not be picked up by the next run and the F4 guarantee is degraded to “no Galaxy-side cursor persistence.”
A shared file would suffer last-writer-wins corruption when two handlers
persist concurrently and could silently rewind another handler’s
progress, which is why the suffix is mandatory whenever
relay_cursor_path is set.
5. Setup-message idempotency
Setup messages can be redelivered: if Pulsar SIGKILLs after consuming a
setup but before AMQP records the ack, the broker will redeliver to the
next consumer. submit_job short-circuits the redelivery to a no-op
only when the prior run can finish on its own:
the job has reached a terminal status (
final_statusmetadata exists), orthe job is still tracked by
active_jobsandrecover_active_jobswill resume it on next get_status().
If the prior run crashed between persisting launch_config and
activating the job — a narrow window — there is no recovery hook, so the
redelivered setup drives a fresh preprocess_and_launch instead of
silently dropping the work.
This is invisible to admins under normal operation; it surfaces in logs
as Ignoring duplicate setup message for job_id <id>.
6. Active-job recovery on restart
Pulsar’s stateful manager keeps two persistent indices under
persistence_directory:
<persistence_directory>/<manager>-active-jobs/
<persistence_directory>/<manager>-preprocessing-jobs/
Each file is a job_id; its existence means the job was active when the
process last ran. On startup recover_active_jobs walks both
directories and:
for jobs in
-preprocessing-jobs/, re-readslaunch_configand re-launches the preprocessing thread;for jobs in
-active-jobs/, calls the manager’s_recover_active_jobmethod (e.g. requeue from the persisted command line, or re-attach to the DRMAA external id).
Two outcomes you should be aware of:
If recovery cannot reattach to the job (e.g. the DRMAA external id is no longer known) Pulsar publishes a single
loststatus update and removes the active-jobs entry. From Galaxy’s perspective the job ended withlost; nothing is silently abandoned.The
queued_pythonrunner — the one used in the resilience test framework — runs jobs as direct Pulsar subprocesses and cannot survive a Pulsar SIGKILL. Such a restart yieldslost, notcomplete, for the in-flight job. Use a real DRM (DRMAA/Slurm/Kubernetes) for workloads that need to survive Pulsar restarts.
7. Staging error handling (file transfer to/from Galaxy)
Pulsar pulls inputs from Galaxy and pushes outputs back over HTTP. Both
sides go through the RetryActionExecutor with a
should_retry=is_transient_http_error predicate
(pulsar/client/transport/transient.py):
Transient errors are retried — 408, 425, 429, 500, 502, 503, 504, connection errors, and timeouts. Default policy retries indefinitely with bounded backoff (start 2 s, step 2 s, max 30 s).
Permanent errors fail the job — any 4xx other than the rate-limit set above (404, 403, 400, etc.) immediately marks the job
failedwith the upstream HTTP body in the captured stderr.
This is why a misconfigured Galaxy URL or a deleted dataset surfaces as a
single fast failed, while a Galaxy restart or a load-balancer hiccup
recovers transparently after retries.
Configuration knobs (per-manager YAML):
preprocess_action_max_retries/postprocess_action_max_retriespreprocess_action_interval_start/_step/_maxpreprocess_action_should_retry(advanced — replaces the default predicate; only do this if you understand which errors are safe to retry in your environment)
8. Status-update ordering guarantees
Pulsar (sender side)
Within one Pulsar instance, status updates for a given job are emitted in strict lifecycle order, and the outbox drains them in strict FIFO via the monotonic seq prefix described above. That ordering is preserved across process restarts: pending entries from the previous run drain before any new entries from the new run, because their seqs are smaller.
Galaxy (receiver side)
The broker can re-deliver a message whose ack was lost, and the relay backend can replay messages a client claims it has not yet seen. Galaxy must therefore enforce two invariants:
Dedupe by
(job_id, status). The pulsar status-update payload for a given (job_id, status) is idempotent in content, so a duplicate is safe to drop.A terminal status is a sink. Once Galaxy has recorded a terminal status for a job, drop any further updates for that job — both non-terminal regressions (a stale
runningarriving late) and conflicting later terminals (a delayedfailedaftercomplete).
The reference implementation of these rules is the
StatusRecorder class in
test/resilience/mock_galaxy/recorder.py; treat it as the canonical
behavior to mirror in production Galaxy.
Explicit Galaxy-side resets
Operators (or Galaxy itself) sometimes need to reset a job’s state — for
example resubmitting after a cancel or restarting a failed job.
Such a transition deliberately re-enters a non-terminal state from a
terminal one and would be rejected by the regression guard above.
The agreed escape hatch is a _galaxy_reset_token field in the status
payload (any truthy value, intended to be unique per reset event):
the receiver treats the update as the start of a fresh epoch for that
job_id;prior transitions are forgotten for the current-state view of the job;
the audit log retains the full observed history.
Pulsar itself does not emit reset tokens. Galaxy is the authority on what constitutes a reset.
9. Failure-mode → outcome cheat sheet
Failure |
Recovery mechanism |
What you see in logs / on disk |
|---|---|---|
Broker brief disconnect |
kombu/relay reconnect loop |
|
Broker down for minutes |
outbox holds terminal status |
outbox dir size > 0 until reconnect; warns on next drain |
RabbitMQ crash + restart ( |
durable queues + delivery_mode=2 |
no Pulsar-side log; messages restored from broker disk |
RabbitMQ crash + restart (default, non-durable) |
outbox replays publisher side |
in-flight inbound msgs are lost broker-side; outbound resume from outbox |
Pulsar SIGKILL during preprocessing |
|
|
Pulsar SIGKILL during running (DRM) |
|
re-attach via runner; no status change visible to Galaxy |
Pulsar SIGKILL during running (queued_python) |
job dies, |
one |
Pulsar SIGKILL after final_status, before publish |
outbox replay on restart |
|
Galaxy 5xx during input download |
transient retry |
|
Galaxy 4xx during input download |
fail-fast, |
single |
Galaxy unreachable during status_update |
outbox holds, broker buffers |
same as broker outage from Pulsar’s perspective |
Both Pulsar and broker restart together |
outbox (+ durable queues if enabled) |
everything resumes once both are up; ordering preserved |
10. Verifying resilience in your environment
The repo includes a docker-compose-based resilience suite under
test/resilience/ that exercises every failure mode above end-to-end
against a real RabbitMQ, ghcr.io/mvdbeek/pulsar-relay, and toxiproxy
fault injection. Before relying on this guide in production, run:
docker compose -f test/resilience/docker-compose.yml up -d --build
pytest test/resilience -v
48 scenarios pass across the amqp, amqp_ack, and relay modes.
See test/resilience/README.md for layout, what each scenario asserts,
and how to add new ones (e.g. for a custom runner).
11. Tunables, defaults, and where they live
Setting |
Default |
Effect |
|---|---|---|
|
|
outbox + active-jobs + relay cursor live here |
|
|
background outbox retry cadence |
|
|
opt-in: durable queues + delivery_mode=2 (broker-restart resilience) |
|
unset (off) |
kombu publish retry; defaults bounded when on |
|
|
additional publisher-confirms layer |
|
|
consumer drain_events timeout (responsiveness) |
|
|
disable to make Pulsar receive-only |
|
|
disable to make Pulsar send-only |
|
|
join consumer threads on shutdown |
For deep debugging set logging.loggers.pulsar.level: DEBUG in
server.ini (the resilience tests run with INFO and grep for the bind
and outbox log lines as readiness signals — the same pattern works for
production health checks).