Silent Failure Rate, measured: 3,013 monitored automation runs across Zapier, Make, n8n and Pipedream

Name: BenchTruth automation reliability dataset (Silent Failure Rate)
Creator: BenchTruth
License: https://creativecommons.org/licenses/by/4.0/

Live benchmark · updated 2026-07-05 · raw data downloadable below · no affiliate links on this page

Across 3,013 monitored automation runs (July 1–5, 2026), we recorded zero silent failures on n8n (0%, 95% CI 0–0.17%, n=2,315), Make (0%, 95% CI 0–6.0%, n=60) and Zapier (0%, 95% CI 0–9.9%, n=35). But the moment Pipedream's free-tier quota ran out, its webhooks kept answering "success" while silently dropping 14 of 14 deliveries.

A silent failure is the automation failure that hurts most: the platform told you it worked, and it didn't. This page is a continuously running measurement of that rate — not a review, not an opinion round-up. Every number below comes from runs we fired ourselves, with both endpoints under our own control, reconciled one by one.

Scoreboard (webhook-triggered workflows)

Platform (plan measured)	Output-expected runs	SFR	Latency p50	Latency p95
n8n — self-hosted, bulk sampler	2,315	0% (95% CI 0%–0.17%)	724 ms	3.31 s
Make — Make Plan (paid)	60	0% (95% CI 0%–6.0%)	1.01 s	3.17 s
Zapier — Professional (paid)	35	0% (95% CI 0%–9.9%)	4.25 s	8.91 s
Pipedream — free tier, before quota exhaustion	76	0% (95% CI 0%–4.8%)	2.72 s	5.06 s

Latency = fired-at → receipt-at-our-receiver, identical network path for every platform, so the numbers are comparable with each other (not with a vendor's internal benchmark). Median delivery: n8n 724 ms, Make 1.01 s, Zapier 4.25 s. That is roughly a 6× spread (4.25 s ÷ 724 ms ≈ 5.9) on the same workload.
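As a rough illustration of how p50 and p95 can be derived from per-receipt timestamps, here is a minimal sketch. The function names, the sample data and the nearest-rank percentile method are assumptions made for this example; the page does not state which percentile method the published figures use.

```python
import math
from datetime import datetime, timedelta, timezone


def latency_ms(fired_at: datetime, received_at: datetime) -> float:
    """Milliseconds between firing an event and its receipt at the receiver."""
    return (received_at - fired_at) / timedelta(milliseconds=1)


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile (assumed method; other methods interpolate)."""
    if not values:
        raise ValueError("need at least one latency")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: ten receipts arriving 100 ms to 1000 ms after firing.
fired = datetime(2026, 7, 2, 12, 0, 0, tzinfo=timezone.utc)
sample = [latency_ms(fired, fired + timedelta(milliseconds=100 * i)) for i in range(1, 11)]

p50 = percentile(sample, 50)  # 500.0
p95 = percentile(sample, 95)  # 1000.0
```

With small samples the choice of percentile method can move p95 noticeably, which is one more reason to read the tail figures for the 35-run and 60-run platforms with care.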

Beyond delivered/missed, we also check partial executions (a 2-action or 2-branch run that only half-completed: 0 observed), duplicates (0 observed) and filter leaks (runs a filter should have stopped but didn't: 0 observed). All zero so far.

The one real failure mode we caught: the quota wall

On July 2 at 14:24 UTC, our Pipedream free-tier credits ran out mid-measurement. From one event to the next, five seconds apart, delivery went from 100% to 0%. The part that matters: Pipedream's webhooks continued to return a success response for every event they then silently discarded (14 of 14 output-expected runs, 95% CI 78–100%). The sender gets no error, no queue, no replay: just an "ok" and a black hole. We re-verified the behaviour 19 hours later, past the documented daily reset time: still accepting, still dropping.

We do not count these runs in Pipedream's SFR — they are a billing-edge behaviour, not an engine failure (before the wall, Pipedream ran 76 runs without a single drop). But if you run production workloads on a metered free tier, this is the failure semantics you are signing up for: at the quota boundary, "accepted" stops meaning "delivered".

Scheduled (polling) workflows

A separate always-on workflow polls our data source every 30 minutes on each platform, and we track whether every scheduled tick actually happened and whether every new item was picked up.

Platform	Scheduled polls observed	New items delivered
n8n	218	50 / 50
Make	57	7 / 7 since activation*

*One additional item changed before the Make scenario was first switched on and was superseded before its first poll. This is a setup artifact, not a platform miss; it is excluded above and flagged in the raw data.

A cost observation previewed here (full cost benchmark coming): on Make, every empty poll of this workflow consumes 2 billable operations by design (poll + state lookup); on self-hosted n8n the same empty poll costs $0. Polling-heavy workloads pay a standing tax on per-operation platforms even when nothing happens.
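To put that standing tax in numbers, the arithmetic can be sketched as follows. The function name and the 30-day month are assumptions for illustration, and the result is an upper bound that treats every poll as empty.

```python
def empty_poll_operations(interval_minutes: int, ops_per_empty_poll: int, days: int) -> int:
    """Billable operations spent on polls that find nothing new.

    Assumes every scheduled tick fires and every poll is empty.
    """
    polls_per_day = (24 * 60) // interval_minutes
    return polls_per_day * ops_per_empty_poll * days


# 30-minute polling at 2 operations per empty poll (the Make figures above).
per_day = empty_poll_operations(30, 2, 1)     # 48 polls x 2 ops = 96
per_month = empty_poll_operations(30, 2, 30)  # 96 x 30 = 2880
```

Under those assumptions, a single idle 30-minute poller accounts for up to 2,880 operations over 30 days before it delivers anything.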

Method — why these numbers are comparable

Identical workflows everywhere. Four canonical flows (single action; filter + two actions; 30-minute poll; two parallel branches) rebuilt step-for-step on every platform — same trigger type, same step count, same HTTP calls.
Both endpoints are ours. Events enter via each platform's webhook and exit as an HTTP POST to our own receiver. No third-party connectors — a Gmail outage can't masquerade as a platform failure.
Every run is ID-tagged and reconciled. The controller writes a ledger entry before it fires; every receipt echoes the run ID; a reconciler classifies each run as delivered / missed / partial / duplicate / filtered. No sampling of logs — full census.
Plans measured: Zapier Professional and Make's paid plan (both paid for by us — nobody gives us free accounts), n8n self-hosted on a 1 GB cloud VM, Pipedream free tier. Sampling rates differ by billing model: per-task platforms get smaller samples (hence wider intervals, honestly labeled); self-hosted n8n carries the bulk sample.
Window: continuous since July 1, 2026. Editor-mode test runs are excluded; the full per-run ledger is downloadable below.
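The ledger-and-receipt reconciliation described above might look roughly like this. It is a minimal sketch, not the BenchTruth reconciler: the data shapes (a ledger mapping each run ID to the number of receipts it should produce, and a flat list of echoed run IDs) and the label `filter_leak` are assumptions made for the example.

```python
from collections import Counter


def reconcile(ledger: dict[str, int], receipts: list[str]) -> dict[str, str]:
    """Classify every ledgered run by comparing expected and received receipts.

    ledger:   run ID -> receipts expected (0 for runs a filter should stop)
    receipts: run IDs echoed back to the receiver, one entry per receipt
    """
    seen = Counter(receipts)
    outcomes = {}
    for run_id, expected in ledger.items():
        got = seen.get(run_id, 0)
        if expected == 0:
            # A filter should have stopped this run; any receipt is a leak.
            outcomes[run_id] = "filtered" if got == 0 else "filter_leak"
        elif got == 0:
            outcomes[run_id] = "missed"
        elif got < expected:
            outcomes[run_id] = "partial"
        elif got > expected:
            outcomes[run_id] = "duplicate"
        else:
            outcomes[run_id] = "delivered"
    return outcomes


# Hypothetical ledger: "b" is a two-action run, "d" and "f" should be filtered.
ledger = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 1, "f": 0}
receipts = ["a", "b", "e", "e", "f"]
outcomes = reconcile(ledger, receipts)
```

Because the ledger entry is written before the event fires, a run that never produces a receipt still appears in the output as missed instead of vanishing from the count.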

Limitations we know about: paid-platform samples are still small (their intervals say so); latency includes our receiver's network hop (identical for all platforms); results describe these specific plans in this specific window. The meter keeps running, so the intervals narrow as runs accumulate.

Reading 0% honestly

Every platform above currently shows zero silent failures, and those zeros are not equal. Zero in 35 runs is still compatible with a true rate as high as 9.9%; zero in 2,315 runs pins it below 0.17%. That is why this page reports Wilson 95% confidence intervals and keeps accumulating: rare failures only become visible in large samples. If a vendor quotes you a reliability number without a sample size, they are quoting a feeling.

FAQ

What is a silent failure in workflow automation?

A run the platform accepted (its webhook returned success) but that never produced the expected output — and no error was ever surfaced to the user. The sender believes the work happened; it did not. BenchTruth measures this as the Silent Failure Rate (SFR): (missed + partial executions) ÷ all runs expected to produce output, with a Wilson 95% confidence interval.
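The SFR definition translates directly into code. This is a sketch with an assumed function name and signature; the counts in the usage lines come from the figures reported on this page.

```python
def silent_failure_rate(missed: int, partial: int, output_expected: int) -> float:
    """SFR = (missed + partial executions) / runs expected to produce output."""
    if output_expected <= 0:
        raise ValueError("need at least one output-expected run")
    return (missed + partial) / output_expected


# Pipedream before quota exhaustion: 76 output-expected runs, none dropped.
before_wall = silent_failure_rate(missed=0, partial=0, output_expected=76)  # 0.0

# The 14 runs after the quota wall, which are reported separately from the SFR.
after_wall = silent_failure_rate(missed=14, partial=0, output_expected=14)  # 1.0
```

Runs that a filter was supposed to stop are not part of the denominator, since they were never expected to produce output.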

Which automation platform is the most reliable in 2026?

In BenchTruth's measurements so far, no platform has silently dropped a run under normal operation: n8n 0% SFR in 2,315 runs (95% CI up to 0.17%), Make 0% in 60, Zapier 0% in 35. The measured reliability risk was not the platforms' engines but their billing edges: when Pipedream's free-tier quota ran out, it accepted and silently dropped 14 of 14 deliveries while still returning success.

Why do you publish confidence intervals instead of just a percentage?

Because 0 failures in 35 runs and 0 failures in 2,315 runs are very different statements. The Wilson 95% interval makes the difference explicit: after 35 clean runs the true rate could still be as high as ~10%; after 2,315 clean runs it is below 0.17%. Any reliability claim without a sample size and interval is marketing, not measurement.

Do runs stopped by a filter still cost money on Zapier?

No — measured, not assumed. Across our filtered runs on a Zapier Professional plan, runs halted by a Filter step consumed zero tasks; Zapier's own task meter matched our count of executed action steps exactly (45 = 45). The folklore that 'filtered Zaps still burn a task' did not hold in July 2026.

Raw data

Full per-run ledger (run ID, platform, workflow, fired-at, expected vs received, outcome, per-receipt latency): benchtruth-runs.csv · CC BY 4.0 · cite as "BenchTruth reliability dataset, benchtruth.com/reliability".