Troubleshooting

Use this page as a first pass before opening an issue. Symptom → likely cause → what to check.

No recommendations in the dashboard

Check	Action
Time	Stats tasks run on a schedule (defaults in `values.yaml`). Wait several intervals after install.
Prometheus URL	From a controller pod, the URL must resolve (in-cluster Service DNS, correct namespace).
Metrics	Confirm Prometheus scrapes the cAdvisor, kubelet, and pod metrics that CruiseKube queries.
New workloads	Very new workloads may be ignored until they pass `newWorkloadThresholdHours` (env: `CRUISEKUBE_RECOMMENDATIONSETTINGS_NEWWORKLOADTHRESHOLDHOURS`).

kubectl logs -n cruisekube-system deploy/cruisekube-controller-manager --tail=200

Look for Prometheus query errors, auth failures, or TLS issues.

Metrics provider (Prometheus / PromQL)

CruiseKube’s metrics provider talks to a Prometheus-compatible HTTP API and runs PromQL queries on a schedule. If the dashboard stays empty or controller logs show query failures, validate the backend and metric names below.

PromQL support

Your endpoint must accept instant and range queries the same way Prometheus does (/api/v1/query, /api/v1/query_range). This works with:

Prometheus
Grafana Mimir / Cortex / Thanos query frontends (when they expose the Prometheus query API)
Other vendors only if they implement compatible PromQL and return the expected metric labels

Managed metrics stacks that expose a custom query language or a limited metric catalog (without the kube-state-metrics / node-exporter style names below) are not supported unless you remote-write those series into Prometheus with matching names.

Confirm PromQL from a controller pod (replace URL and namespace):

kubectl exec -n cruisekube-system deploy/cruisekube-controller-manager -- \
  wget -qO- 'http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090/api/v1/query?query=up' | head -c 400

A JSON body containing "status":"success" means the API is reachable. If you get anything else, check DNS, network policy, TLS, and the value of CRUISEKUBE_DEPENDENCIES_INCLUSTER_PROMETHEUSURL.
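The same reachability check can be scripted. The following is a minimal Python sketch, not part of CruiseKube: the function names `build_query_url`, `api_is_healthy`, and `check_endpoint` are hypothetical, and only the `/api/v1/query` path and the "status":"success" field come from the Prometheus HTTP API.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a Prometheus-compatible API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def api_is_healthy(body: str) -> bool:
    """True when the response body is JSON with "status": "success"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "success"


def check_endpoint(base_url: str, timeout: float = 5.0) -> bool:
    """Run the `up` query against base_url (performs a network call)."""
    url = build_query_url(base_url, "up")
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return api_is_healthy(resp.read().decode("utf-8"))
```

`check_endpoint` raises `urllib.error.URLError` on DNS or connection failures, which is usually the more useful signal: a resolution error points at the URL or namespace, a timeout at network policy.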

Metric names CruiseKube depends on

CruiseKube’s Prometheus provider runs PromQL against the URL you configure. Metric names must match what the controller queries; job, namespace, and other labels are often filtered in code. Typical kube-prometheus-stack job labels:

Source	Expected `job` label(s)
kube-state-metrics	`kube-state-metrics`
node-exporter	`node-exporter`
kubelet / cAdvisor (`container_*` metrics)	`kubelet` or `kubernetes-nodes-cadvisor`

CruiseKube queries container metrics with job=~"kubelet|kubernetes-nodes-cadvisor" so either scrape pool works. Adjust if your install uses different job values.
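Prometheus anchors regex matchers at both ends, so the matcher above accepts only the two exact job values, not names that merely contain them. A small Python sketch to test your own job labels against it. The helper names are hypothetical, and Python's `re` stands in for Prometheus's RE2 engine, which behaves the same for a plain alternation like this one.

```python
import re

# Regex taken from the matcher described above.
CONTAINER_JOB_REGEX = "kubelet|kubernetes-nodes-cadvisor"


def job_matches(job: str, pattern: str = CONTAINER_JOB_REGEX) -> bool:
    """Mimic a Prometheus =~ matcher, which anchors the regex at both ends."""
    return re.fullmatch(pattern, job) is not None


def container_selector(metric: str, pattern: str = CONTAINER_JOB_REGEX) -> str:
    """Build a PromQL selector for a container metric with a job regex."""
    return f'{metric}{{job=~"{pattern}"}}'


# Example job labels; replace with the values your scrape configs produce.
for job in ["kubelet", "kubernetes-nodes-cadvisor", "kubelet-cadvisor", "cadvisor"]:
    print(f"{job}: {'matches' if job_matches(job) else 'does not match'}")
```

If your cAdvisor series carry a job value that does not match, the container metrics exist in Prometheus but the filtered queries return nothing.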

Kubelet / cAdvisor (per-container usage and PSI)

Metric	Used for
`container_cpu_usage_seconds_total`	Workload CPU usage, cluster CPU utilization export
`container_memory_working_set_bytes`	Workload memory usage, cluster memory utilization export
`container_pressure_cpu_waiting_seconds_total`	PSI-aware CPU signals; stats task checks whether PSI data exists
`container_pressure_memory_waiting_seconds_total`	Container memory pressure (exported controller metrics)

kube-state-metrics (Kubernetes object state)

Metric	Used for
`kube_pod_info`	Map pods to owning workloads (with `created_by_kind` / `created_by_name`)
`kube_pod_status_phase`	Running / Pending pods; workload discovery and cluster summaries
`kube_pod_container_resource_requests`	CPU and memory requests (`resource="cpu"` and `resource="memory"`); GPU request series used to exclude GPU workloads where applicable
`kube_node_status_allocatable`	Node allocatable CPU, memory, and GPU capacity
`kube_node_status_capacity`	Node CPU capacity (e.g. load ratio in node load monitoring)
`kube_pod_container_status_last_terminated_reason`	OOM detection (`reason="OOMKilled"`)
`kube_pod_container_status_restarts_total`	OOM / restart correlation
`kube_node_spec_taint`	Exclude GPU nodes from certain cluster rollups (`key="nvidia.com/gpu"`)
`kube_node_labels`	Exclude NVIDIA accelerator nodes from certain cluster rollups

node-exporter (node-level signals)

Metric	Used for
`node_cpu_seconds_total`	Cluster CPU utilization rollups (`mode=~"user|system"`)
`node_load1`	Node load monitoring and load-based metrics
`node_pressure_cpu_waiting_seconds_total`	Node CPU pressure (PSI)
`node_pressure_memory_waiting_seconds_total`	Node memory pressure (PSI)
`node_memory_MemTotal_bytes`	Cluster memory utilization expression
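`node_memory_MemFree_bytes`	Cluster memory utilization expression (combined with `Buffers` and `Cached`)
`node_memory_Buffers_bytes`	Cluster memory utilization expression
`node_memory_Cached_bytes`	Cluster memory utilization expression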
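For orientation, one conventional way to combine the four memory series is to treat free, buffer, and cache memory as available and divide by the total. The sketch below assumes that form. It is not copied from the query CruiseKube runs, and the function name is hypothetical.

```python
def memory_utilization(total: float, free: float, buffers: float, cached: float) -> float:
    """Fraction of memory in use, treating buffers and cache as reclaimable.

    Assumed PromQL equivalent (illustrative only):
      1 - (node_memory_MemFree_bytes + node_memory_Buffers_bytes
           + node_memory_Cached_bytes) / node_memory_MemTotal_bytes
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return 1.0 - (free + buffers + cached) / total


GIB = 1024 ** 3
# 16 GiB total with 4 GiB free, 1 GiB buffers, 3 GiB cached: half is in use.
print(memory_utilization(16 * GIB, 4 * GIB, 1 * GIB, 3 * GIB))  # 0.5
```

If any of the four series is missing, an expression of this shape returns no data at all rather than a partial value, which is why all four rows appear in the table.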

Optional (environment-specific)

Metric	Used for
`karpenter_nodeclaims_disrupted_total`	Karpenter consolidation/eviction counter export (only if you run Karpenter and scrape its metrics)

Automate the checks (port-forward Prometheus locally, then pass the local port):

kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
python3 scripts/check_prometheus_metrics.py --port 9090
# Progress on stderr; open http://localhost:9090/service-discovery while it runs if needed

Spot-check that the core families exist (Prometheus UI or /api/v1/query):

container_cpu_usage_seconds_total{job=~"kubelet|kubernetes-nodes-cadvisor"}
container_memory_working_set_bytes{job=~"kubelet|kubernetes-nodes-cadvisor"}
kube_pod_container_resource_requests{resource="cpu"}
kube_pod_container_resource_requests{resource="memory"}
kube_node_status_allocatable{resource="cpu"}
kube_node_status_allocatable{resource="memory"}
kube_pod_status_phase{phase="Running"}
kube_pod_info
node_cpu_seconds_total{job="node-exporter"}
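To run the spot checks in bulk, wrap each selector in `count()` and treat an empty result vector as "family missing". The sketch below is an illustration under that assumption and is not the bundled `scripts/check_prometheus_metrics.py`. The function names are hypothetical, and fetching each response body is left to whatever HTTP client you use.

```python
import json

# Selectors copied from the spot-check list above.
CORE_SELECTORS = [
    'container_cpu_usage_seconds_total{job=~"kubelet|kubernetes-nodes-cadvisor"}',
    'container_memory_working_set_bytes{job=~"kubelet|kubernetes-nodes-cadvisor"}',
    'kube_pod_container_resource_requests{resource="cpu"}',
    'kube_pod_container_resource_requests{resource="memory"}',
    'kube_node_status_allocatable{resource="cpu"}',
    'kube_node_status_allocatable{resource="memory"}',
    'kube_pod_status_phase{phase="Running"}',
    "kube_pod_info",
    'node_cpu_seconds_total{job="node-exporter"}',
]


def count_query(selector: str) -> str:
    """Wrap a selector in count() so the response stays small."""
    return f"count({selector})"


def series_count(body: str) -> int:
    """Series count from a count() instant query; 0 if empty or failed."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        return 0
    result = payload.get("data", {}).get("result", [])
    if not result:
        return 0
    # Instant vector samples look like {"metric": {...}, "value": [<ts>, "<number>"]}.
    return int(float(result[0]["value"][1]))


def missing_families(responses: dict[str, str]) -> list[str]:
    """Return the selectors whose count() response shows no series."""
    return [sel for sel, body in responses.items() if series_count(body) == 0]
```

Pass `missing_families` a mapping of selector to raw `/api/v1/query` response body; the returned list is the set of families to investigate first.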

If queries are empty:

Scrape targets — Ensure kubelet/cAdvisor, kube-state-metrics, and node-exporter (or equivalents) are scraped into this Prometheus.
Retention — Series must cover lookback windows used by stats and recommendation tasks (see chart values.yaml task schedules).
RBAC / collectors — A broken kube-state-metrics install often drops all kube_* series while node/container metrics still appear.
Wrong or incompatible Prometheus — Point CRUISEKUBE_DEPENDENCIES_INCLUSTER_PROMETHEUSURL at the store that actually holds these names, not a short-retention, federated, or heavily filtered view. Production Prometheus often drops series via relabeling, recording rules, or remote-write cost controls — see Scenario 3 — Dedicated standalone Prometheus.
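For the retention item above, one way to see how far back a series actually reaches is to request a long window from `/api/v1/query_range` and measure the span of the returned samples. A minimal Python sketch: the function names and the tolerance parameter are hypothetical, and the lookback you compare against should come from your own chart values.

```python
import json


def coverage_hours(body: str) -> float:
    """Hours between the oldest and newest sample in a query_range response."""
    payload = json.loads(body)
    timestamps = [
        float(ts)
        for series in payload.get("data", {}).get("result", [])
        for ts, _value in series.get("values", [])
    ]
    if not timestamps:
        return 0.0
    return (max(timestamps) - min(timestamps)) / 3600.0


def covers_lookback(body: str, lookback_hours: float, tolerance_hours: float = 0.0) -> bool:
    """True when the returned samples span at least the lookback window."""
    return coverage_hours(body) >= lookback_hours - tolerance_hours
```

The request's start time must be at least one lookback window in the past, otherwise the response is truncated by the query rather than by retention. A span noticeably shorter than the requested window suggests short retention or a recently added scrape target.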

If even one core metric family above is missing or mislabeled, recommendations, dashboard rollups, or exported cluster metrics are often incomplete or unreliable. Optional metrics only affect the features listed in their row (for example, the Karpenter counter export needs karpenter_nodeclaims_disrupted_total).

See Prerequisites — Prometheus for the three install scenarios.

Prometheus TLS / HTTPS

If Prometheus uses a private CA or a self-signed certificate, you may need to set CRUISEKUBE_DEPENDENCIES_INCLUSTER_INSECURESKIPTLSVERIFY (or the local equivalent) to "true". Do this only in trusted environments, and prefer proper CA trust in production. See the env keys in the chart values.yaml.
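To make the tradeoff concrete, the sketch below shows the two client-side options in Python's standard `ssl` module: trusting a CA bundle versus disabling verification. It illustrates what skipping verification means for any HTTPS client, and it is not CruiseKube's implementation. The function name and the CA file path in the usage note are hypothetical.

```python
import ssl
from typing import Optional


def prometheus_ssl_context(
    ca_file: Optional[str] = None, insecure_skip_verify: bool = False
) -> ssl.SSLContext:
    """Build a TLS context for talking to an HTTPS Prometheus endpoint."""
    if insecure_skip_verify:
        # No certificate or hostname checks: any server can impersonate Prometheus.
        context = ssl.create_default_context()
        context.check_hostname = False  # must be disabled before verify_mode
        context.verify_mode = ssl.CERT_NONE
        return context
    # ca_file=None uses the system trust store; otherwise trust the given bundle.
    return ssl.create_default_context(cafile=ca_file)
```

Calling `prometheus_ssl_context(ca_file="/etc/ssl/private-ca.pem")` keeps full verification against a private CA, which is the production-friendly route; the insecure variant only removes the error, not the risk.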

Webhook not mutating pods

Check	Action
`MutatingWebhookConfiguration`	`kubectl get mutatingwebhookconfiguration` — verify CruiseKube webhook exists and points to the correct service.
Certs	The chart often ships a cert-gen job; check webhook pod logs and API server warnings.
`failurePolicy`	The default is often `Ignore`, so failures fail open: pods are still created, but unmutated.
Namespace selectors	System namespaces may be excluded by design.

kubectl logs -n cruisekube-system deploy/cruisekube-webhook-server --tail=200
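The table checks can be read off the output of `kubectl get mutatingwebhookconfiguration -o json` in one pass. A minimal Python sketch: the field names follow the Kubernetes admissionregistration.k8s.io/v1 API, while the function name and the `cruisekube` name filter are assumptions. Adjust the filter to the configuration name your release creates.

```python
import json


def summarize_webhooks(config_list_json: str, name_hint: str = "cruisekube") -> list[dict]:
    """Summarize webhooks from `kubectl get mutatingwebhookconfiguration -o json`."""
    rows = []
    for config in json.loads(config_list_json).get("items", []):
        config_name = config.get("metadata", {}).get("name", "")
        if name_hint not in config_name:
            continue
        for hook in config.get("webhooks", []):
            service = hook.get("clientConfig", {}).get("service") or {}
            rows.append({
                "config": config_name,
                "webhook": hook.get("name"),
                # Namespace/name of the Service the API server will call.
                "service": f"{service.get('namespace')}/{service.get('name')}" if service else None,
                "failurePolicy": hook.get("failurePolicy"),
                "hasNamespaceSelector": bool(hook.get("namespaceSelector")),
            })
    return rows
```

An empty list means no matching configuration exists. A row whose `service` does not point at the webhook Service in your install namespace, or whose `failurePolicy` is `Ignore`, explains pods that are created without mutation.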

“Nothing changes” but I enabled Cruise mode

Check	Action
Recommend vs Cruise	The workload may still be on Recommend (observe-only); only Cruise applies changes. Confirm in Policies & Configuration (Dashboard, Policies & modes).
HPA	Workloads targeted by an HPA on CPU or memory metrics are skipped entirely.
Best-effort pods	Best-effort QoS classes may be excluded from optimization.
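To find workloads affected by the HPA row, list HPAs with `kubectl get hpa -A -o json` and pick out those that scale on CPU or memory. A minimal Python sketch based on the autoscaling/v2 schema: the function name is hypothetical, and counting `ContainerResource` metrics alongside `Resource` metrics is an assumption about what gets skipped.

```python
import json

RESOURCE_NAMES = {"cpu", "memory"}


def hpa_resource_targets(hpa_list_json: str) -> set[tuple[str, str, str]]:
    """Return (namespace, kind, name) of workloads scaled by an HPA on CPU or memory."""
    targets = set()
    for hpa in json.loads(hpa_list_json).get("items", []):
        spec = hpa.get("spec", {})
        for metric in spec.get("metrics", []):
            kind = metric.get("type")
            if kind == "Resource":
                name = metric.get("resource", {}).get("name")
            elif kind == "ContainerResource":
                name = metric.get("containerResource", {}).get("name")
            else:
                continue  # Pods, Object, and External metrics are not resource-based
            if name in RESOURCE_NAMES:
                ref = spec.get("scaleTargetRef", {})
                namespace = hpa.get("metadata", {}).get("namespace", "")
                targets.add((namespace, ref.get("kind", ""), ref.get("name", "")))
                break
    return targets
```

If the workload you expected to change appears in the returned set, the missing changes are expected behavior rather than a fault.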

Unexpected evictions

Read Tradeoffs — pod eviction.
Review eviction priority for the workload in the dashboard.
Check node memory/CPU pressure; the optimizer may be enforcing feasibility.
Inspect controller logs around the timestamp for eviction messages.

OOM loops or repeated restarts

Confirm that applying memory recommendations is not disabled while you expect limits to rise.
Review OOM handling: the cooldown prevents thrashing, but repeated OOMs may indicate that the limit or request is still too tight for real spikes.
Validate workloads on the JVM and other runtimes that do not tolerate rapid memory changes.

Frontend cannot reach API

Port-forward the frontend and ensure cruisekubeFrontend.backendURL (Helm) points at the controller Service inside the cluster.
Check basic auth credentials on the controller HTTP API if enabled (CRUISEKUBE_SERVER_BASICAUTH_*).
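To rule out a credentials mismatch, send a request with an explicit HTTP Basic Authorization header and compare the response with an unauthenticated one. A minimal Python sketch of building that header: the function names are hypothetical, and the URL is whatever address your port-forward exposes, since no specific API path is assumed here.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Return the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def authed_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a request carrying Basic credentials."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request
```

Pass the result to `urllib.request.urlopen`; a 401 response with credentials that match the configured values points at the frontend configuration rather than the controller.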

Database connection errors

When using an external database, verify that the Postgres host, port, and secret match global.postgresql.auth.*.
For bundled Postgres, check PVC binding and pod readiness:

kubectl get pods -n cruisekube-system -l app.kubernetes.io/name=postgresql
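Before digging into credentials, it helps to confirm that the database port is reachable at all from where the controller runs. A minimal Python sketch of a plain TCP check: the function name is hypothetical, and host and port are whatever your values configure. It verifies connectivity only, not authentication or the database name.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

A `False` result points at DNS, network policy, or a Postgres pod that is not ready; a `True` result with continuing errors points at the secret or auth settings.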

Still stuck?

Collect and attach:

CruiseKube chart version / app version (helm list -n cruisekube-system)
Redacted values.yaml
Controller and webhook logs (last ~500 lines)
Whether Prometheus can run a sample query for container_cpu_usage_seconds_total
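For the redacted values.yaml item, a line-based pass can mask obvious secrets before you attach the file. The Python sketch below is a rough aid under stated assumptions: the key patterns and function name are hypothetical, it only handles single-line `key: value` pairs, and it misses secrets stored as list entries or multi-line strings. Review the output by hand before sharing.

```python
import re

# Key names treated as sensitive; extend to match your own values file.
SENSITIVE_KEY = re.compile(r"(password|passwd|secret|token|apikey|api_key)", re.IGNORECASE)
# Matches "  key: value" lines, capturing indent, key, separator, and value.
KEY_VALUE_LINE = re.compile(r"^(\s*(?:-\s*)?)([\w.\-]+)(\s*:\s*)(\S.*)$")


def redact_values_yaml(text: str) -> str:
    """Replace the value of any single-line key that looks sensitive."""
    redacted = []
    for line in text.splitlines():
        match = KEY_VALUE_LINE.match(line)
        if match and SENSITIVE_KEY.search(match.group(2)):
            line = f"{match.group(1)}{match.group(2)}{match.group(3)}<redacted>"
        redacted.append(line)
    return "\n".join(redacted)
```

The pattern errs toward over-redaction: a key such as `existingSecret` is masked even though it usually holds only a Secret name.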

Then open a GitHub Issue or ask on Discord.