Tags: n8n, homelab, automation, docker, proxmox, ntfy, uptime-kuma

Building a Self-Healing Homelab with n8n

How I used n8n to turn my homelab from something I had to babysit into something that monitors itself, fixes itself, and tells me when it can't.


There’s a certain kind of homelab pride that comes from having 20+ self-hosted services running smoothly behind Traefik. There’s also a certain kind of homelab anxiety that comes from wondering, at 11pm on a Saturday, whether everything is still actually running.

For a while, my approach to that anxiety was to occasionally open Uptime Kuma, confirm everything was green, and get on with my life. That worked fine until it didn’t: a container would crash, I’d find out hours later when I tried to use the service, and I’d spend ten minutes restarting something that could have restarted itself.

The fix, it turned out, was already sitting in my stack. I’d been using n8n for a few lightweight automations, but I hadn’t really pushed it. This is the story of how I built a homelab that monitors itself, heals itself, and stays quiet unless it actually needs my attention.

The Problem with Passive Monitoring

Uptime Kuma is excellent. It checks all my services, shows me beautiful status graphs, and sends alerts when something goes down. But those alerts went to… my email. Which I check when I remember to. And by the time I’d seen an alert, acknowledged it, and SSHed into the server, the container had usually been dead for a while.

The second problem was subtler. I had no daily visibility into resource usage. Disk space creeping up, RAM getting tight, CPU spikes: these weren’t events that triggered alerts, they were slow trends I’d only notice once they became problems. I’d actually come close to running out of disk space once without realising until I checked manually.

What I wanted was something that would:

  1. Tell me every morning how the server is doing, whether anything needs attention or not
  2. Actually do something when a service goes down, not just tell me about it

Both of those turned out to be n8n workflows.

Workflow One: The Daily Status Report

The first workflow runs every morning at 8am. It SSHes into the Ubuntu VM, runs a stats script, parses the output, and sends a summary to my phone via Ntfy. If any metric is above a threshold, it sends a second, urgent notification on top.

The stats script is a simple bash file that collects the numbers I actually care about:

#!/bin/bash
# Collect the numbers the daily report cares about, one metric per line.

# Disk usage for root and the data mount (size, used, percent)
ROOT=$(df -h / | tail -1)
ROOT_SIZE=$(echo "$ROOT" | awk '{print $2}')
ROOT_USED=$(echo "$ROOT" | awk '{print $3}')
ROOT_PCT=$(echo "$ROOT" | awk '{print $5}')
DATA=$(df -h /mnt/data | tail -1)
DATA_SIZE=$(echo "$DATA" | awk '{print $2}')
DATA_USED=$(echo "$DATA" | awk '{print $3}')
DATA_PCT=$(echo "$DATA" | awk '{print $5}')

# RAM: total, used, available
RAM=$(free -h | grep Mem)
RAM_TOTAL=$(echo "$RAM" | awk '{print $2}')
RAM_USED=$(echo "$RAM" | awk '{print $3}')
RAM_AVAIL=$(echo "$RAM" | awk '{print $7}')

# CPU: vmstat's second sample is the live reading; column 15 is idle %
CPU_IDLE=$(vmstat 1 2 | tail -1 | awk '{print $15}')
CPU_USED=$((100 - CPU_IDLE))

# Machine-parseable output for the n8n Code node to pick apart
echo "ROOT:${ROOT_SIZE}:${ROOT_USED}:${ROOT_PCT}"
echo "DATA:${DATA_SIZE}:${DATA_USED}:${DATA_PCT}"
echo "RAM:${RAM_TOTAL}:${RAM_USED}:${RAM_AVAIL}"
echo "CPU:${CPU_USED}"

The n8n workflow picks that up via SSH, runs a Code node to parse and format it, then checks the numbers against thresholds (80% for disk, 90% for CPU) before sending the notification.
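The parsing step can be sketched roughly like this (the variable names and the sample values are my own, not lifted from the actual Code node):

```javascript
// Hypothetical sketch of the Code node's parsing step. Input is the raw
// stdout of the stats script, as returned by the SSH node.
const raw = `ROOT:47G:16G:34%
DATA:916G:19G:2%
RAM:7.7Gi:4.4Gi:2.9Gi
CPU:10`;

// Split each "KEY:field:field:..." line into a keyed array of fields.
const stats = {};
for (const line of raw.trim().split("\n")) {
  const [key, ...fields] = line.split(":");
  stats[key.toLowerCase()] = fields;
}

// Numeric values for the threshold checks; parseInt drops the trailing "%".
const rootPct = parseInt(stats.root[2], 10); // 34
const dataPct = parseInt(stats.data[2], 10); // 2
const cpuPct = parseInt(stats.cpu[0], 10);   // 10
```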

The result is a phone notification every morning that looks like this:

💾 Disk Usage
  Root (/): 34% used (16G / 47G)
  Data (/mnt/data): 2% used (19G / 916G)

🧠 RAM: 4.4Gi used / 7.7Gi total (2.9Gi available)

⚡ CPU: 10% in use

Have a great day! 🚀

On a bad day, there’s also a second notification: a red alert listing exactly which threshold was breached and by how much. The daily report still arrives regardless; the alert is additive, not a replacement. The logic for that in the Code node is straightforward:

const alerts = [];
if (rootPct >= 80) alerts.push(`Root disk at ${root.pct} (${root.used} / ${root.size})`);
if (dataPct >= 80) alerts.push(`Data disk at ${data.pct} (${data.used} / ${data.size})`);
if (cpuPct >= 90) alerts.push(`CPU at ${cpuPct}%`);

const hasAlerts = alerts.length > 0;

An IF node routes to an urgent Ntfy POST if hasAlerts is true, while the regular report notification fires on both branches. Simple, but it means I’ve gone from checking manually when I remember to having a genuine pulse on the server every single morning.
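The urgent branch itself is just a POST with Ntfy's publish headers. As a sketch (the topic URL is a placeholder, and building the request in a helper function is my own framing, not how the HTTP Request node is configured):

```javascript
// Hypothetical helper showing the shape of the urgent Ntfy request.
// Title, Priority, and Tags are standard Ntfy publish headers;
// "urgent" is Ntfy's highest priority level.
function buildAlertRequest(alerts) {
  return {
    url: "https://ntfy.example.com/homelab", // placeholder topic URL
    method: "POST",
    headers: {
      Title: "Server thresholds breached",
      Priority: "urgent",
      Tags: "rotating_light", // emoji shortcode shown in the notification
    },
    body: alerts.join("\n"),
  };
}

const req = buildAlertRequest(["Root disk at 85% (40G / 47G)"]);
```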

Workflow Two: Service Restart on Failure

This one is the more satisfying automation, because it closes the loop completely. Every five minutes, n8n polls the Uptime Kuma Prometheus metrics endpoint, checks if anything is down, and if so: restarts it and tells me it did.

The Uptime Kuma metrics endpoint (/metrics) exposes monitor status in Prometheus format. Every monitor has a monitor_status line with a value of 1 (up), 0 (down), 2 (pending), or 3 (maintenance). That’s all we need.
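Extracting the down monitors from that text is a one-regex job. A minimal sketch (the `monitor_status` metric and its `monitor_name` label are real; the sample lines are abbreviated, since the real ones carry more labels):

```javascript
// Parse Uptime Kuma's /metrics text and collect monitors with status 0 (down).
const metricsText = `monitor_status{monitor_name="Traefik",monitor_type="http"} 1
monitor_status{monitor_name="Pairdrop",monitor_type="http"} 0
monitor_status{monitor_name="Nextcloud",monitor_type="http"} 1`;

const down = [];
for (const line of metricsText.split("\n")) {
  // Capture the monitor_name label value and the metric value.
  const m = line.match(/^monitor_status\{.*?monitor_name="([^"]+)".*?\}\s+(\d+)/);
  if (m && m[2] === "0") down.push(m[1]); // 0 = down
}
// down => ["Pairdrop"]
```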

The HTTP Request node calls the endpoint directly over the internal Docker network, bypassing Traefik and Authentik entirely:

http://uptime-kuma:3001/metrics

Authentication uses Uptime Kuma’s API key via HTTP Basic Auth, with an empty username and the key as the password, which is the format Uptime Kuma expects.
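Concretely, that produces an Authorization header whose base64 payload is just `":<key>"`. A sketch with a placeholder key:

```javascript
// Basic auth with an empty username and the API key as password.
// "uk1_xxxxxxxx" is a placeholder, not a real key.
const apiKey = "uk1_xxxxxxxx";
const authHeader = "Basic " + Buffer.from(":" + apiKey).toString("base64");
// Decoding the payload gives ":uk1_xxxxxxxx" — empty user, key as password.
```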

The Code node parses the raw Prometheus text and maps monitor names to Docker Compose folders. Some services get restarted automatically, some (like AdGuard, which runs as an LXC container rather than Docker) just trigger an alert, and a few (Proxmox, the Ubuntu VM itself, Uptime Kuma) are ignored entirely since if those are down, n8n is probably dead too.

const serviceMap = {
  'Traefik': 'traefik',
  'Nextcloud': 'nextcloud',
  'Paperless': 'paperless',
  'Authentik': 'authentik',
  // ... all other Docker services
  'AdGuard': null,       // alert only, can't docker compose restart an LXC
  'Proxmox': 'ignore',   // if this is down, we have bigger problems
  'Ubuntu Server': 'ignore',
  'Uptime Kuma': 'ignore',
};
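The routing that follows the lookup can be sketched like this (the bucket names and the trimmed-down map are illustrative, not the workflow's actual code):

```javascript
// Hypothetical routing step: split down monitors into restart candidates
// and alert-only services, based on a serviceMap like the one above.
const serviceMap = {
  'Pairdrop': 'pairdrop', // restartable via docker compose
  'AdGuard': null,        // alert only
  'Proxmox': 'ignore',    // skip entirely
};
const downMonitors = ['Pairdrop', 'AdGuard', 'Proxmox'];

const toRestart = [];
const toAlert = [];
for (const name of downMonitors) {
  const folder = serviceMap[name];
  if (folder === 'ignore' || folder === undefined) continue; // not our problem
  if (folder === null) toAlert.push(name);       // urgent manual-fix alert
  else toRestart.push({ name, folder });         // feeds the SSH restart node
}
// toRestart => [{ name: 'Pairdrop', folder: 'pairdrop' }]; toAlert => ['AdGuard']
```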

For any service that’s down and has a folder mapped, n8n SSHes into the server and runs:

cd /home/em/docker/{{ $json.folder }} && docker compose restart

Then sends a Ntfy notification: “Pairdrop was down and has been automatically restarted.” For services that can’t be auto-restarted (AdGuard), it sends an urgent alert instead: “AdGuard is down but cannot be auto-restarted. Manual intervention required.”

I tested this by manually stopping Pairdrop and triggering the workflow. Within seconds I had a notification on my phone confirming the restart had happened. The service had been down for less than five minutes and I hadn’t had to touch anything.

The Gotcha: Authentik Was in the Way

One thing I didn’t anticipate: n8n sits behind Authentik forward auth via Traefik, which means any request hitting n8n.home.emeffe.com gets redirected to the SSO login page. That’s great for browser access, but it meant Nextcloud’s outgoing webhook (used in a separate automation) was getting blocked before it could reach n8n’s webhook endpoint.

The fix was a second Traefik router specifically for the /webhook/ path, with higher priority than the main router and without the Authentik middleware:

- traefik.http.routers.n8n-webhook.rule=Host(`n8n.home.emeffe.com`) && PathPrefix(`/webhook/`)
- traefik.http.routers.n8n-webhook.middlewares=crowdsec-bouncer@file
- traefik.http.routers.n8n-webhook.priority=10
- traefik.http.routers.n8n.priority=5

Webhook traffic bypasses Authentik. Everything else still goes through SSO. CrowdSec still checks both. It’s a clean pattern that I’ll reuse any time I need a public-facing endpoint on a protected service.

The Broader Automation Stack

The daily report and the service restart are the two I’m most pleased with, but they’re part of a wider set of automations I built over the same period:

  • Nextcloud PDF to Paperless: When a PDF is uploaded to Nextcloud, a webhook fires to n8n, which waits 30 seconds (to let any in-progress upload finish) then SSHes in and copies the file to Paperless’s consume directory, where it gets automatically OCR’d and indexed. No more manual imports.

  • Proxmox Backup Verification: Every Saturday at 8am (six hours after the backup job runs), n8n SSHes directly into the Proxmox host, queries S3 with the AWS CLI, and checks that the backup files are less than 24 hours old. Green notification if everything’s fresh, urgent alert with specific file names and ages if anything is missing or stale.

  • CrowdSec Auto-Unban: Not strictly n8n, but in the same spirit. A systemd oneshot service that runs on boot, waits 30 seconds for Docker to settle, then removes any CrowdSec ban on my Tailscale IP. It was getting banned after every reboot, which was particularly annoying when rebooting remotely. The service also sends a Ntfy notification if it actually had to remove a ban, so I know it happened.

What I’d Tell Myself at the Start

Build the daily report first. It gives you a baseline and reveals things you didn’t know you needed to alert on.

I’d been running the homelab for months without any systematic resource monitoring. The first week of daily reports told me my root disk was at 34%, trending in a direction that would have caused problems if I hadn’t noticed. That alone justified the whole exercise.

The other lesson: n8n’s SSH node is remarkably powerful for homelab automation. Most of what I wanted to automate was already possible through existing scripts; n8n just gave those scripts scheduling, logic, and a notification layer. I didn’t need to rewrite anything, just orchestrate what was already there.

The homelab is quieter now. Not because less is happening, but because the things that used to require my attention are handled before I even know about them. That’s the goal.


All workflows mentioned in this post are available in the homelab repository as exported JSON files.