Uptime Kuma told me everything was fine. It wasn't.

15 min read · Homelab

Last Tuesday I opened Uptime Kuma and saw a wall of green. Every service healthy. Every check passing. Nextcloud, Vaultwarden, Gitea, the website, the databases. All up.

Except Nextcloud had been crawling for three days. File syncs took minutes instead of seconds. The web interface felt like loading a page over dialup. My partner mentioned the calendar had stopped syncing to her phone. I assumed it was a network thing and moved on. It was not a network thing.

I SSH'd into the server and found the Nextcloud container using 3.8 GB of RAM, the Postgres container sitting at 97% memory, and the host itself swapping to disk. The system was alive, technically. It was answering HTTP requests with 200 status codes. Uptime Kuma dutifully recorded each one as a success.

Everything was fine. Nothing was fine.

I spent an hour digging through docker stats output and journalctl logs trying to figure out when this started. The answer was: I had no idea. There was no history. No trend data. No record of what the memory usage looked like a week ago versus today. I was working with a single snapshot of a system that had been degrading for days.

That is when I realized the question "is it up?" is fundamentally different from "is it healthy?" Uptime Kuma answers the first question and does it well. But I had been treating that answer as if it covered both.

The comfort of green checkmarks

I wrote about setting up Uptime Kuma about a year ago. At the time it felt like a huge improvement. Before monitoring, I would discover outages by trying to use a service and finding it dead. With Uptime Kuma, I know within minutes when something goes down. Telegram alerts, clean dashboard, history of incidents. It felt responsible.

And for a while, it was enough. The dashboard is satisfying. You open it, see green, and close it with confidence. The Telegram alerts work. When my VPS reboots and containers take a minute to come back up, I know immediately. That is genuinely valuable.

But it is a binary signal. Up or down. Reachable or not. It cannot tell you that memory is slowly leaking over days. It cannot show you disk filling at 2% per week. It has no idea that one container is consuming all the CPU while the others starve. It does not know that response times have been degrading from 200ms to 4 seconds over the past month. It cannot show you that your backup cron job overlaps with your database maintenance window every night at 2 AM.

Uptime Kuma gives you the symptom. It never gives you the cause. When everything is green but something feels slow, you have nowhere to look. You SSH in, run htop, see that things are bad right now, but have no context for how long it has been bad or what changed. You are debugging in the dark with a flashlight that only points at the present.

I think a lot of homelab operators stay at this stage for a long time. I did. The green dashboard creates a false sense of completeness. You feel like you have monitoring handled. You do not. You have uptime monitoring handled. That is one slice of the picture.

Level 1: Is it up?

To be clear, Uptime Kuma is not the problem. It does exactly what it says it does. I still run it. I still rely on it.

My setup has HTTP monitors for Nextcloud, Vaultwarden, and this website. TCP monitors for databases. Docker container monitors for services on the same host. Ping monitors for network infrastructure. Each checks every 60 seconds. Three consecutive failures trigger a Telegram alert.

This answers one question: is the service reachable right now? And that is a critical question. If your reverse proxy goes down, you want to know. If a container crashes and does not restart, you want to know. Uptime Kuma handles this well.

But it is Level 1. The floor. It tells you about outages after they happen. It cannot predict them. It cannot explain them. And it has a blind spot: a container that crash-loops and recovers within 60 seconds might never trigger an alert. Your service is restarting constantly, losing connections, dropping writes, and Uptime Kuma shows green because it happened to be up at check time.

I had this exact problem with a Gitea instance. The container was running out of memory, getting killed by the OOM killer, and restarting within 30 seconds. This happened multiple times per day. I only discovered it weeks later when I noticed some webhook deliveries had silently failed. Uptime Kuma showed 100% uptime for that period.

If your homelab runs three services, Level 1 is probably fine. If you are running 15+ containers across multiple machines like I do with my Proxmox setup, you need more.

Level 2: What is happening?

This is where Prometheus enters the picture. Prometheus is a time-series database that scrapes metrics from your services on a schedule. Instead of asking "is it up?" it asks "what are all the numbers right now?" and stores those numbers over time so you can see trends.

The concept is simple. You run small programs called exporters alongside your services. Each exporter exposes metrics as an HTTP endpoint in a standard format. Prometheus scrapes those endpoints on a schedule (every 30 seconds in my setup) and stores the data as time-series. Each data point has a metric name, a set of labels, a value, and a timestamp. Then you query it, graph it, alert on it.
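Concretely, an exporter's endpoint returns plain text in the Prometheus exposition format. A scrape of Node Exporter's /metrics looks something like this (the values here are illustrative):

```text
# HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 2.147483648e+09
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 81234.56
```

Each line is one metric: a name, optional labels in braces, and the current value. Prometheus adds the timestamp at scrape time.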

Prometheus stores this data locally on disk and compresses it efficiently. For a homelab with a handful of exporters, you are looking at maybe 1-2 GB per month of storage. Not nothing, but very manageable.

Two exporters cover most homelab needs:

Node Exporter runs on each host and exposes system-level metrics. CPU usage, memory, disk I/O, network traffic, filesystem usage. Everything the operating system knows about itself.

cAdvisor exposes container-level metrics. Per-container CPU, memory, network, and disk usage. This is the one that would have shown me Nextcloud eating 3.8 GB of RAM days before I noticed the slowness.

One thing to know about cAdvisor: the default configuration collects metrics every second, which uses a surprising amount of CPU for something that is supposed to be passive monitoring. In a homelab, you do not need per-second resolution. Set --housekeeping_interval=30s and --docker_only=true to keep it lightweight.

Here is the Docker Compose setup for the whole metrics stack:

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
 
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
 
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    command:
      - '--housekeeping_interval=30s'
      - '--docker_only=true'
 
volumes:
  prometheus_data:

And the minimal prometheus.yml to tie it together:

global:
  scrape_interval: 30s
 
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
 
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
 
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

With this running, I can now see things that were invisible before. CPU usage trends over days, not just a snapshot from htop. Memory pressure building gradually over a week before an OOM kill. Container restart counts revealing instability I never noticed. Disk I/O patterns that explain why everything felt slow during backup windows.
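A few PromQL queries make those views concrete. These are a sketch using standard cAdvisor and Node Exporter metric names; the label values depend on your own setup:

```text
# Per-container memory usage, biggest consumers first (cAdvisor)
sort_desc(container_memory_usage_bytes{name!=""})

# Host CPU utilization over the last 5 minutes (Node Exporter)
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Root filesystem usage as a percentage (Node Exporter)
100 * (1 - node_filesystem_avail_bytes{mountpoint="/"}
         / node_filesystem_size_bytes{mountpoint="/"})
```

Graph any of these over a multi-day window in Grafana and the trends described above fall out immediately.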

The Nextcloud incident I described at the top would have been obvious. I would have seen memory climbing steadily over three days on a Grafana graph. I could have caught it on day one instead of day four. Better yet, with an alert rule, Grafana could have notified me automatically when container memory crossed a threshold.

The difference is context. Before, I knew the server felt slow right now. With Prometheus, I can see that memory started climbing three days ago, correlate it with a container update, and pinpoint the cause. That is the jump from "is it up?" to "what is happening?"

Level 3: What went wrong?

Metrics tell you the system is unhealthy. Logs tell you why.

When something breaks, the first thing I do is SSH into the machine and run docker logs nextcloud or journalctl -u some-service. This works for one machine and one service. It does not work when you have multiple hosts, dozens of containers, and the problem might be in any of them. You end up opening terminals, tailing logs, grepping for errors, trying to correlate timestamps across machines. It is slow and error-prone.

This is the problem centralized logging solves. Collect all logs in one place. Search them from one interface. Correlate events across services. See a memory spike on a Grafana graph, then immediately check the logs from that same time window in the same tool.

I will be honest: I am still setting this layer up. I have been exploring Loki, which is Grafana's log aggregation system, and I am writing about it here because the approach makes a lot of sense for homelabs. I do not have months of experience with it like I do with Uptime Kuma and Prometheus. But the problem it solves is real, and the architecture is well suited for small-scale self-hosted environments.

Loki takes a different approach from tools like Elasticsearch. It indexes labels (container name, service, log level) but not the actual log content. This makes it dramatically cheaper to run. You do not need a cluster of machines just to search your logs. Elasticsearch is powerful, but running it properly requires significant memory and disk. Loki can run in a single container with modest resources. For a homelab, that matters.

The collection pipeline is straightforward. Promtail runs as an agent that reads log files and Docker container logs, attaches labels, and ships them to Loki. Grafana queries Loki the same way it queries Prometheus. The two services below slot into the same Compose file as the metrics stack (remember to add loki_data to the top-level volumes block):

  loki:
    image: grafana/loki:latest
    container_name: loki
    restart: unless-stopped
    ports:
      - "3100:3100"
    volumes:
      - loki_data:/loki
    command: -config.file=/etc/loki/local-config.yaml
 
  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    restart: unless-stopped
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml
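The promtail.yml referenced above is not shown elsewhere, so here is a minimal sketch. It uses Promtail's Docker service discovery so each log stream gets a container label (which the LogQL queries below rely on); the Docker socket is already available through the /var/run mount:

```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      # Docker reports names as "/nextcloud"; strip the leading slash
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'
```

Treat this as a starting point, not a finished config; the official Promtail docs cover pipeline stages for parsing timestamps and log levels.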

Loki uses LogQL for queries. Finding errors in Nextcloud logs is as simple as:

{container="nextcloud"} |= "error"

Or parsing JSON-structured logs and filtering by level (the time range itself comes from Grafana's time picker, not the query):

{container="nextcloud"} | json | level="error"

Grafana Alloy is the newer replacement for Promtail, which Grafana has deprecated in Alloy's favor. Alloy consolidates metrics collection, log collection, and trace collection into a single agent: instead of running Promtail for logs and Node Exporter for metrics as separate processes, Alloy handles both. It is clearly where Grafana is investing as the unified collection layer. But Promtail is simpler to understand when you are learning, and it still works fine. I plan to migrate to Alloy eventually, but there is no rush. Understanding the individual components first makes the consolidated tool easier to reason about later.

The value of centralized logs is hard to appreciate until you need them. When a container is crash-looping at 3 AM and you want to know why, being able to open Grafana and search logs from your phone beats SSH-ing into servers from a mobile terminal. When you see a CPU spike on a graph and want to know what caused it, being able to check the logs from that exact time window in the same interface saves real time.

The dashboard that matters

Once you have Prometheus and Grafana connected (and optionally Loki), the temptation is to build the ultimate dashboard. Twenty panels. CPU per core. Network packets per second. Every metric you can find. It looks impressive in screenshots.

Do not do this.

A dashboard you never look at is just decoration. Worse, a dashboard with too many panels trains your eyes to glaze over. You stop actually reading it. I know this because I built that dashboard. Then I tore it down and started over.

The dashboard that works for me has five panels. Each one answers a specific question I actually ask:

Container health. Status, uptime, and restart count per container. If a container has restarted 47 times today, something is wrong even if it is currently "up." This is the panel that catches crash-looping services that Uptime Kuma misses. I sort it by restart count descending. The noisy containers float to the top.

Disk usage trend. Not the current percentage. The trend line over 30 days. A disk at 60% is fine. A disk at 60% that was at 40% two weeks ago needs attention soon. The trend matters more than the number. Running out of disk space is one of the most common homelab failures, and it is entirely preventable if you can see it coming.

Memory by service. A stacked bar or table showing which containers are using the most RAM. When the host is under memory pressure, this panel immediately tells you who the culprit is. This is the panel that would have saved me from the Nextcloud incident.

CPU over time. A line graph covering the past 7 days. This reveals patterns: backup jobs that spike CPU every night, cron tasks that overlap, or a service that has been steadily climbing since an update. Patterns are the point. A single CPU measurement is almost useless. Seven days of data tells a story.

Recent errors. A log panel (powered by Loki) filtered to error and warning levels across all containers. A live feed of things going wrong, sorted by time. This one is optional until you have Loki set up, but once you do, it becomes the first place you look.

Five panels. Each answering a real question. If you find yourself never looking at a panel, remove it. Dashboards are tools, not trophies.
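For reference, the restart-count panel does not need a dedicated metric. A common trick is to count changes to cAdvisor's container start time; a sketch of the two queries I use for container health (label values are illustrative):

```text
# Restarts in the last 24h, approximated as changes to the start time
changes(container_start_time_seconds{name!=""}[24h])

# Seconds since each container started; persistently low values
# across refreshes mean a crash-looping service
time() - container_start_time_seconds{name!=""}
```

This is exactly the query that would have exposed the crash-looping Gitea instance that Uptime Kuma showed as 100% up.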

What I would do differently

I made mistakes building this stack. Here is what I would change if I started over.

I tried to monitor everything from day one. I added exporters for every service, built panels for metrics I did not understand, and overwhelmed myself with data. The better approach: start with what breaks most. For me that was memory issues and disk space. Monitor those first. Add more when you have a reason.

I did not set up alerts early enough. A dashboard you do not check daily is useless. I went weeks without opening Grafana, which defeated the entire purpose. Alerts are what make monitoring proactive instead of reactive. I should have configured basic threshold alerts before building dashboards.

I underestimated Grafana's learning curve. Grafana is powerful. It is also not intuitive. The query builder, panel configuration, variable templating, dashboard provisioning, and alerting system all have their own concepts and quirks. PromQL (Prometheus's query language) is its own learning curve on top of that. Simple queries like "show me CPU usage" are easy. Anything involving rates, aggregations across labels, or multi-step calculations takes practice. Budget time for learning it. The documentation is good but dense. Community dashboards on Grafana's website are a useful starting point, but I found I always ended up customizing them heavily.

cAdvisor's defaults ate my CPU. The default 1-second housekeeping interval is absurd for a homelab. I noticed my monitoring stack was using more CPU than the services it was monitoring. Setting --housekeeping_interval=30s and --docker_only=true fixed it.

I forgot about retention from the start. Prometheus's default retention is only 15 days, but disk usage still scales with how many targets and series you scrape, and I never thought about sizing at all. Set --storage.tsdb.retention.time=30d (or whatever makes sense for you) deliberately from day one, and keep an eye on the volume. I learned this when my monitoring volume hit 15 GB.

Where I am going next

The stack is useful now, but it is not finished. Here is what I am working toward.

Alerting rules. This is the biggest gap in my current setup. I want Grafana alerting for threshold-based conditions: memory above 90% for 5 minutes, disk above 85%, container restart count above 3 in an hour. Not just up/down alerts from Uptime Kuma, but alerts that catch problems before they become outages. A disk filling alert gives you days to act. An "out of disk space" outage gives you zero days. Alertmanager is the Prometheus-native option, but Grafana's built-in alerting has gotten good enough that I will probably start there. It can send to the same Telegram bot I already use for Uptime Kuma, which keeps notifications in one place.
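As a sketch of what those conditions look like in the Prometheus-native rule format (Grafana's built-in alerting expresses the same logic through its UI; the thresholds mirror the ones above and the metric names are standard Node Exporter ones):

```yaml
groups:
  - name: homelab
    rules:
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host memory above 90% for 5 minutes"

      - alert: DiskAlmostFull
        expr: >
          (1 - node_filesystem_avail_bytes{mountpoint="/"}
             / node_filesystem_size_bytes{mountpoint="/"}) > 0.85
        labels:
          severity: warning
        annotations:
          summary: "Root filesystem above 85%"
```

The for: 5m clause is what keeps a momentary spike from paging you; the alert only fires once the condition has held for the full window.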

Log retention policies. Loki will happily store logs forever, which means the same disk problem I had with Prometheus. I need to configure retention from the beginning this time.

Maybe tracing. For complex request flows across multiple services, distributed tracing with something like Tempo would be valuable. But I am being honest with myself: for a homelab, this is probably overkill. I will revisit if I ever run something that genuinely needs request-level visibility across services.

Grafana Alloy. Running Promtail for logs and Node Exporter for metrics as separate agents works, but Alloy consolidates both into one. Fewer containers to manage, one configuration to maintain. This is a natural next step once the rest of the stack is stable. The configuration format is different from Promtail's YAML, so it is not a drop-in replacement. But simplifying from three collection agents to one is worth the migration effort.

Observability is not a destination. It is a progression. Each layer solves a real problem, and you should not add layers you do not need yet.

For most homelabs, the right starting point is Uptime Kuma plus Prometheus plus Grafana. That covers "is it up?" and "what is happening?" Add Loki when you get tired of SSH-ing into machines to read logs. Add alerting when you get tired of manually checking dashboards. Each step addresses a specific pain point.

Start with what hurts.
