Diagnosing etcd Performance Issues in OpenShift
etcd performance problems are easy to miss. In severe cases, the API server, kube-scheduler, and kube-controller-manager are all restarting and the web console is unresponsive. At the other end of the spectrum, everything looks fine: every Cluster Operator is Available, no pods are restarting, it just doesn't feel as fast as it used to. In both cases, etcd is rarely the first suspect.
In this article, we walk through a real-world must-gather that was collected at 00:50 on 2025-11-18 after numerous pod restarts and multiple timeouts occurred across the cluster, interpret the restart patterns, and use the etcd-ocp-diag script to pinpoint whether the cluster is experiencing a momentary spike or a chronic disk performance problem. In a future article, we will look at the metrics side to see how they correlate with log errors and namespace events.
Collecting etcd Pod Logs
When a customer's cluster starts experiencing slowness or pod restarts, the first step they should take is to collect a must-gather and upload it to a Support Case:
$ oc adm must-gather
A must-gather snapshots the cluster's state and collects core OpenShift configurations and namespaces including the namespace's configmaps, secrets, pod logs, events, CRDs, and more. For etcd troubleshooting with a must-gather, the pod logs are what matter most.
The typical layout for the openshift-etcd namespace looks like the following example. Each control plane node runs an etcd member with its own log file. Problems often appear on all three nodes at once (a slow disk on one node can trigger leader elections that ripple across the cluster), but sometimes they're isolated to a single member.
must-gather.local.<timestamp>/
└── quay-io-openshift-release-dev-.../
└── namespaces/
└── openshift-etcd/
└── pods/
├── etcd-master-0/
│ └── etcd/
│ └── etcd/
│ └── logs/
│ ├── current.log
│ └── previous.log
├── etcd-master-1/
└── etcd-master-2/
Installing omc
To start diagnosing any OpenShift must-gather, install an open source tool called omc, which is a Go-based CLI that lets you interact with a must-gather using familiar oc-style commands. Instead of navigating the directory tree manually, you query it the same way you would a live cluster.
Install omc by running:
$ curl -sL "https://github.com/gmeghnag/omc/releases/latest/download/omc_$(uname)_$(uname -m).tar.gz" \
| tar xzf - omc && chmod +x ./omc && mv ./omc /usr/local/bin/omc
Point it at your must-gather and start querying:
$ omc use must-gather.local.<timestamp>/
A few of the commands below pipe omc output through yq (the Python wrapper around jq) to filter YAML. If you want to follow along with those, install it with pip install yq. Note that this is a different tool than the Go-based mikefarah/yq, and the -yr flags used later are specific to the Python version.
Reviewing the Must-Gather
Start with clusterversion to get a baseline of the cluster: current version, availability, and whether the cluster is mid-update or whether there are any issues. The Since column quickly lets you know if a Cluster Operator's state has changed recently or if the cluster is stable.
$ omc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.16.28 True False 44m Cluster version is 4.16.28
From the previous command, we can see that the cluster last changed 44 minutes ago. Check the cluster's update history to eliminate a cluster upgrade as the reason for the change.
$ omc get clusterversion -o yaml | yq -yr '.items[].status.history'
- completionTime: '2025-06-18T18:29:47Z'
image: quay.io/openshift-release-dev/ocp-release@sha256:7708f832ae02919f2cdb2798fdbc64e17ce7a576d1e3baabdd78a000d2d62f40
startedTime: '2025-06-18T13:04:54Z'
state: Completed
verified: true
version: 4.16.28
We now know that this cluster was previously upgraded 152 days before this must-gather was collected, so we can determine that either a configuration change or instability is the reason for the SINCE column returning 44m.
Checking Cluster Operators
Next, check Cluster Operators. Look for all operators to be on the same version, none Progressing or Degraded, and Since values measured in days or weeks. Operators flapping in a troubled cluster show Since timestamps of minutes or hours, not days.
$ omc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.16.28 True False False 32m
baremetal 4.16.28 True False False 1y
cloud-controller-manager 4.16.28 True False False 1y
cloud-credential 4.16.28 True False False 1y
cluster-autoscaler 4.16.28 True False False 1y
config-operator 4.16.28 True False False 1y
console 4.16.28 True False False 74d
control-plane-machine-set 4.16.28 True False False 1y
csi-snapshot-controller 4.16.28 True False False 74d
dns 4.16.28 True False False 32d
etcd 4.16.28 True False False 13d
image-registry 4.16.28 True False False 1y
ingress 4.16.28 True False False 74d
insights 4.16.28 True False False 48d
kube-apiserver 4.16.28 True False False 1y
kube-controller-manager 4.16.28 True False False 1y
kube-scheduler 4.16.28 True False False 1y
kube-storage-version-migrator 4.16.28 True False False 13d
machine-api 4.16.28 True False False 1y
machine-approver 4.16.28 True False False 1y
machine-config 4.16.28 True False False 166d
marketplace 4.16.28 True False False 1y
monitoring 4.16.28 True False False 1h
network 4.16.28 True False False 1y
node-tuning 4.16.28 True False False 152d
openshift-apiserver 4.16.28 True False False 46m
openshift-controller-manager 4.16.28 True False False 1y
openshift-samples 4.16.28 True False False 152d
operator-lifecycle-manager 4.16.28 True False False 1y
operator-lifecycle-manager-catalog 4.16.28 True False False 1y
operator-lifecycle-manager-packageserver 4.16.28 True False False 49m
service-ca 4.16.28 True False False 1y
storage 4.16.28 True False False 1y
We know that no configuration changes were made to the cluster, so let's look at the operators with the newest status change: authentication, monitoring, openshift-apiserver, and operator-lifecycle-manager-packageserver. We'll focus on the last time each was in the Degraded status or became Available.
$ omc get co authentication -o yaml | yq -yr '.status.conditions[] | select(.type == "Degraded")'
lastTransitionTime: '2025-11-18T00:01:18Z'
message: All is well
reason: AsExpected
status: 'False'
type: Degraded
$ omc get co monitoring -o yaml | yq -yr '.status.conditions[] | select(.type == "Degraded")'
lastTransitionTime: '2025-11-17T23:34:02Z'
status: 'False'
type: Degraded
$ omc get co openshift-apiserver -o yaml | yq -yr '.status.conditions[] | select(.type == "Degraded")'
lastTransitionTime: '2025-11-18T00:04:55Z'
message: All is well
reason: AsExpected
status: 'False'
type: Degraded
$ omc get co operator-lifecycle-manager-packageserver -o yaml | yq -yr '.status.conditions[] | select(.type == "Available")'
lastTransitionTime: '2025-11-18T00:01:44Z'
message: ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver
observed in phase Succeeded
reason: ClusterServiceVersionSucceeded
status: 'True'
type: Available
While the cluster currently looks healthy, we can see that all 4 of these operators went into a degraded state within a 30-minute window, which confirms that something significant happened on this cluster. To verify these findings, we need to look for pods that started less than an hour ago, which would explain the Degraded state, and look for pod restarts across the control plane to determine how unstable the control plane truly is.
Checking Pod Age and Restarts
Because those operators reported their last status change within the last hour, any pods rescheduled during that incident should fall within that window. The first command below correlates pod age back to that 1-hour window by returning running pods whose AGE column reads in minutes, confirming which workloads were rescheduled when the operators flapped. The second command filters for etcd, API server, scheduler, and controller-manager pods and returns only those with at least one restart:
$ omc get pods -A | grep Running | awk '$6 ~ /m$/'
openshift-authentication oauth-openshift-85bc66dc4d-7n6kq 1/1 Running 0 45m
openshift-authentication oauth-openshift-85bc66dc4d-8sqf6 1/1 Running 0 49m
openshift-authentication oauth-openshift-85bc66dc4d-hjrn4 1/1 Running 0 45m
openshift-marketplace certified-operators-gsczr 1/1 Running 0 33m
openshift-marketplace community-operators-twqbv 1/1 Running 0 44m
openshift-marketplace redhat-marketplace-6jlkc 0/1 Running 0 9m
$ omc get pods -A --no-headers | grep -E 'etcd|apiserver|controller-manager|scheduler' | awk '$5 != 0'
openshift-apiserver apiserver-59f65c798-d6fx6 2/2 Running 9 13d
openshift-apiserver apiserver-59f65c798-rhqmm 2/2 Running 19 13d
openshift-apiserver apiserver-59f65c798-x9m7f 2/2 Running 356 152d
openshift-apiserver-operator openshift-apiserver-operator-858db8f58d-bvtrd 1/1 Running 34 13d
openshift-cloud-controller-manager-operator cluster-cloud-controller-manager-operator-58c96c984-pl9bv 3/3 Running 345 152d
openshift-controller-manager controller-manager-58b5c6d67d-f8mnv 1/1 Running 5 13d
openshift-controller-manager controller-manager-58b5c6d67d-jndg2 1/1 Running 25 13d
openshift-controller-manager controller-manager-58b5c6d67d-pl82v 1/1 Running 3 13d
openshift-controller-manager-operator openshift-controller-manager-operator-bc46ffc87-5rm9s 1/1 Running 230 152d
openshift-etcd etcd-master0-openshift 4/4 Running 13 152d
openshift-etcd etcd-master1-openshift 4/4 Running 13 152d
openshift-etcd etcd-master2-openshift 4/4 Running 5 152d
openshift-etcd-operator etcd-operator-5f9ccb4ffb-v7lkj 1/1 Running 266 152d
openshift-kube-apiserver kube-apiserver-master0-openshift 5/5 Running 6 3d
openshift-kube-apiserver kube-apiserver-master1-openshift 5/5 Running 5 3d
openshift-kube-apiserver kube-apiserver-master2-openshift 5/5 Running 10 3d
openshift-kube-apiserver-operator kube-apiserver-operator-5c496bb69b-8gclx 1/1 Running 243 152d
openshift-kube-controller-manager kube-controller-manager-master0-openshift 4/4 Running 5669 152d
openshift-kube-controller-manager kube-controller-manager-master1-openshift 4/4 Running 5467 152d
openshift-kube-controller-manager kube-controller-manager-master2-openshift 4/4 Running 5849 152d
openshift-kube-controller-manager-operator kube-controller-manager-operator-7dc4fc458b-jw4qq 1/1 Running 236 152d
openshift-kube-scheduler openshift-kube-scheduler-master0-openshift 3/3 Running 120 152d
openshift-kube-scheduler openshift-kube-scheduler-master1-openshift 3/3 Running 130 152d
openshift-kube-scheduler openshift-kube-scheduler-master2-openshift 3/3 Running 98 152d
openshift-kube-scheduler-operator openshift-kube-scheduler-operator-58bb8b85c4-fs7dd 1/1 Running 230 152d
openshift-oauth-apiserver apiserver-cf6b5f478-gxzl9 1/1 Running 317 152d
openshift-oauth-apiserver apiserver-cf6b5f478-jz5xh 1/1 Running 1 13d
openshift-oauth-apiserver apiserver-cf6b5f478-l6954 1/1 Running 8 13d
openshift-route-controller-manager route-controller-manager-7c66879896-2sjkt 1/1 Running 7 13d
openshift-route-controller-manager route-controller-manager-7c66879896-lfzfw 1/1 Running 15 13d
openshift-route-controller-manager route-controller-manager-7c66879896-t8n2v 1/1 Running 29 13d
Reviewing the output of these two commands, we can confirm that the authentication, monitoring, openshift-apiserver, and operator-lifecycle-manager-packageserver operators all had new pods scheduled. Additionally, the control plane pods have experienced between 2 and 5849 pod restarts across 14 different key OpenShift namespaces, indicating major instability in this cluster.
The etcd pods themselves don't look great with 5 to 13 restarts across 152 days. If you stopped there, you might conclude that etcd is having issues but isn't the core problem. The kube-controller-manager tells a different story. All three instances are sitting at 5,400 to 5,800 restarts over the same 152 days, roughly 38 restarts per day or one crash every 38 minutes. The kube-scheduler is also elevated at 98 to 130 restarts, and one openshift-apiserver pod has 356.
This pattern is a classic fingerprint of etcd performance pressure. The etcd pods aren't crashing outright, but slow write latency or leader elections are causing watch streams and heartbeat timeouts to fail. The kube-controller-manager and scheduler lose their connections to the API server, hit their own deadlines, and restart. The etcd operator at 266 restarts reflects the same instability: it's repeatedly trying to reconcile a cluster that keeps going sideways.
The operational impact during those restart windows is significant. The kube-controller-manager manages pod lifecycle, endpoint sync, garbage collection, and node lease renewals. When it's down, even briefly, pod scheduling stalls, endpoints go stale, and terminated pods pile up without being cleaned. Users don't see a hard outage. They see slow deployments, delayed pod scheduling, and API calls that take seconds instead of milliseconds.
Reviewing etcd Logs for Performance Issues
Manually grepping etcd logs for performance issues works, but it can quickly get tedious when you have three nodes' worth of logs covering weeks. etcd-ocp-diag automates that work. It's a Python script that scans a must-gather for etcd performance-related errors, groups them by pod and timestamp, shows statistics like max latency and first/last occurrence, and lets you compare multiple pods to determine whether the same error is hitting multiple members simultaneously.
Install it by cloning the repo, making it executable, and then copying it to your $PATH. Python 3.8 or later is the only requirement:
$ git clone https://github.com/cptmorgan-rh/etcd-ocp-diag-script.git && \
cd etcd-ocp-diag-script && chmod +x ./etcd-ocp-diag.py && mv ./etcd-ocp-diag.py /usr/local/bin/etcd-ocp-diag.py
Start with the --errors flag, which gives you a per-pod count of every performance-related error the script knows about:
$ etcd-ocp-diag.py --path must-gather.local.<timestamp>/ --errors
POD ERROR COUNT
etcd-master0-openshift waiting for ReadIndex response took too long, retrying 505
etcd-master0-openshift slow fdatasync 25
etcd-master0-openshift apply request took too long 17339
etcd-master0-openshift leader is overloaded likely from slow disk 504
etcd-master0-openshift lost leader 3
etcd-master0-openshift request stats 16229
etcd-master1-openshift waiting for ReadIndex response took too long, retrying 273
etcd-master1-openshift slow fdatasync 17
etcd-master1-openshift apply request took too long 7087
etcd-master1-openshift leader is overloaded likely from slow disk 214
etcd-master1-openshift lost leader 1
etcd-master1-openshift request stats 6103
etcd-master2-openshift waiting for ReadIndex response took too long, retrying 354
etcd-master2-openshift etcdserver: request timed out 3
etcd-master2-openshift slow fdatasync 7
etcd-master2-openshift apply request took too long 12577
etcd-master2-openshift leader is overloaded likely from slow disk 448
etcd-master2-openshift lost leader 2
etcd-master2-openshift request stats 11688
The first thing that stands out is the significant number of request stats warnings and apply request took too long errors.
Before digging into those, here is what each error the script detects actually means and how seriously to treat it:
| Error | What It Means | Severity |
|---|---|---|
waiting for ReadIndex response took too long, retrying |
A follower sent a ReadIndex request to the leader to confirm it's still leader before serving a linearizable read. The leader didn't respond within the timeout, so the follower retried. | Warning |
slow fdatasync |
etcd called fdatasync() to flush the WAL to disk and it took longer than the acceptable threshold. |
Warning |
apply request took too long |
Applying a committed Raft log entry to the state machine took longer than the acceptable threshold. | Critical |
leader is overloaded likely from slow disk |
The leader's disk is too slow to keep up with the write rate. | Critical |
lost leader |
This member was leader and lost that status, typically because followers stopped receiving its heartbeats. | Informational |
wal: sync duration |
The time to sync the Write-Ahead Log to disk exceeded the acceptable threshold. | Warning |
local node might have slow network |
etcd detected that this member is taking too long to send or receive messages from peers. | Warning |
server is likely overloaded |
etcd's internal request queue is filling faster than it can drain. Caused by a combination of high write rate and slow disk or CPU; left unchecked, requests begin timing out. | Critical |
sending buffer is full |
The outbound message buffer to a peer is saturated, meaning etcd cannot send messages fast enough. | Warning |
request stats |
A warning log line etcd emits summarizing recent request latency that took longer than the acceptable threshold. | Warning |
etcdserver: request timed out |
A client request to etcd exceeded the server-side timeout. The operation failed from the client's perspective. | Critical |
To view pods, dates, and error counts for request stats, use the --request_stats flag:
$ etcd-ocp-diag.py --path must-gather.local.<timestamp>/ --request_stats
POD DATE COUNT
etcd-master0-openshift 2025-11-17 9725
etcd-master0-openshift 2025-11-18 6504
etcd-master1-openshift 2025-11-18 6103
etcd-master2-openshift 2025-11-17 685
etcd-master2-openshift 2025-11-18 11003
One of the script's main features is that using the --date flag returns the number of errors that occurred during each hour and minute along with the highest time spent or took value, making it easier to determine whether the cluster is experiencing chronic or periodic issues. In the output below, we can see that request stats warnings spanned a 55-minute window from 23:36 on 2025-11-17 to 00:31 on 2025-11-18, and in most cases the time spent was far above the 300ms threshold that triggers the log entry.
$ etcd-ocp-diag.py --path must-gather.local.<timestamp>/ --request_stats --date 2025-11-17
POD DATE COUNT MAX_TIME
etcd-master0-openshift 23:36 834 3525.2161ms
etcd-master0-openshift 23:37 331 595.3195ms
etcd-master0-openshift 23:38 691 4450.4680ms
etcd-master0-openshift 23:39 81 1347.1995ms
etcd-master0-openshift 23:40 127 1152.4057ms
etcd-master0-openshift 23:41 52 1125.4104ms
etcd-master0-openshift 23:42 51 1049.1937ms
etcd-master0-openshift 23:43 32 1038.7765ms
etcd-master0-openshift 23:44 58 1096.3438ms
etcd-master0-openshift 23:45 481 1723.8945ms
etcd-master0-openshift 23:46 751 3182.1362ms
etcd-master0-openshift 23:47 460 3368.0477ms
etcd-master0-openshift 23:48 450 37923.3061ms
etcd-master0-openshift 23:49 781 60005.1878ms
etcd-master0-openshift 23:50 536 82543.9393ms
etcd-master0-openshift 23:51 168 1423.6394ms
etcd-master0-openshift 23:52 99 990.9552ms
etcd-master0-openshift 23:53 197 3080.9392ms
etcd-master0-openshift 23:54 55 1138.4741ms
etcd-master0-openshift 23:55 109 1128.5804ms
etcd-master0-openshift 23:56 717 2565.0926ms
etcd-master0-openshift 23:57 1352 4684.8516ms
etcd-master0-openshift 23:58 591 40000.2205ms
etcd-master0-openshift 23:59 721 63401.0197ms
etcd-master2-openshift 23:58 552 13400.5967ms
etcd-master2-openshift 23:59 133 1378.8829ms
$ etcd-ocp-diag.py --path must-gather.local.<timestamp>/ --request_stats --date 2025-11-18
POD DATE COUNT MAX_TIME
etcd-master0-openshift 00:00 181 3465.6826ms
etcd-master0-openshift 00:01 210 1404.0401ms
etcd-master0-openshift 00:02 602 1453.8899ms
etcd-master0-openshift 00:03 196 2469.9012ms
etcd-master0-openshift 00:04 21 520.4142ms
etcd-master0-openshift 00:05 907 1981.8301ms
...
etcd-master2-openshift 00:29 310 2826.0084ms
etcd-master2-openshift 00:30 195 26549.1939ms
etcd-master2-openshift 00:31 389 2219.8426ms
Apply Latency and Disk Sync Times
For a cluster with slow disk symptoms, start by using --fdatasync, which reports warnings when the storage backend takes longer than 1 second to flush the Write-Ahead Log (WAL) data to disk. Follow it up with the --ttl flag, which shows the same time-based breakdown but for the underlying apply latency rather than the client-visible request stats.
$ etcd-ocp-diag.py --path must-gather.local.<timestamp>/ --fdatasync
POD DATE COUNT
etcd-master0-openshift 2025-11-17 6
etcd-master0-openshift 2025-11-18 19
etcd-master1-openshift 2025-11-18 17
etcd-master2-openshift 2025-11-18 7
$ etcd-ocp-diag.py --path must-gather.local.<timestamp>/ --fdatasync --date 2025-11-17
POD DATE COUNT MAX_TIME
etcd-master0-openshift 23:47 2 1590.3280ms
etcd-master0-openshift 23:51 1 1024.1193ms
etcd-master0-openshift 23:58 3 1204.2944ms
$ etcd-ocp-diag.py --path must-gather.local.<timestamp>/ --fdatasync --date 2025-11-18
POD DATE COUNT MAX_TIME
etcd-master0-openshift 00:07 7 3341.3137ms
etcd-master0-openshift 00:08 4 2998.5233ms
etcd-master0-openshift 00:10 1 1214.7937ms
etcd-master0-openshift 00:11 3 1233.9415ms
etcd-master0-openshift 00:18 1 1098.6943ms
etcd-master0-openshift 00:21 3 1377.3595ms
etcd-master1-openshift 00:07 4 1969.3752ms
etcd-master1-openshift 00:08 2 1700.1053ms
etcd-master1-openshift 00:10 1 1268.7579ms
etcd-master1-openshift 00:11 1 2465.6723ms
etcd-master1-openshift 00:17 4 5332.0430ms
etcd-master1-openshift 00:18 1 1253.9097ms
etcd-master1-openshift 00:27 3 1763.0241ms
etcd-master1-openshift 00:30 1 1147.0087ms
etcd-master2-openshift 00:00 1 1004.7974ms
etcd-master2-openshift 00:03 1 1829.8329ms
etcd-master2-openshift 00:05 1 1466.1899ms
etcd-master2-openshift 00:07 1 1772.8069ms
etcd-master2-openshift 00:10 1 1280.2763ms
etcd-master2-openshift 00:11 1 1049.6459ms
etcd-master2-openshift 00:30 1 1190.8360ms
$ etcd-ocp-diag.py --path must-gather.local.<timestamp>/ --ttl
POD DATE COUNT
etcd-master0-openshift 2025-11-17 10079
etcd-master0-openshift 2025-11-18 7260
etcd-master1-openshift 2025-11-18 7087
etcd-master2-openshift 2025-11-17 646
etcd-master2-openshift 2025-11-18 11931
$ etcd-ocp-diag.py --path must-gather.local.<timestamp>/ --ttl --date 2025-11-17
POD DATE COUNT MAX_TIME
etcd-master0-openshift 23:36 847 3513.0858ms
etcd-master0-openshift 23:37 799 595.2415ms
etcd-master0-openshift 23:38 1141 4450.3839ms
etcd-master0-openshift 23:39 124 1347.0568ms
etcd-master0-openshift 23:40 193 1152.2839ms
etcd-master0-openshift 23:41 87 1082.6561ms
etcd-master0-openshift 23:42 68 1049.0904ms
etcd-master0-openshift 23:43 66 1038.6286ms
etcd-master0-openshift 23:44 71 1096.2263ms
etcd-master0-openshift 23:45 462 1723.7928ms
etcd-master0-openshift 23:46 713 3181.9526ms
etcd-master0-openshift 23:47 467 3367.8464ms
etcd-master0-openshift 23:48 267 37923.1994ms
etcd-master0-openshift 23:49 555 63233.9340ms
etcd-master0-openshift 23:50 531 82615.2260ms
etcd-master0-openshift 23:51 170 1422.9851ms
etcd-master0-openshift 23:52 151 990.8517ms
etcd-master0-openshift 23:53 202 3080.6236ms
etcd-master0-openshift 23:54 89 1138.3644ms
etcd-master0-openshift 23:55 160 1128.4636ms
etcd-master0-openshift 23:56 789 2564.9544ms
etcd-master0-openshift 23:57 1168 4684.6261ms
etcd-master0-openshift 23:58 408 39999.9523ms
etcd-master0-openshift 23:59 551 64436.3973ms
etcd-master2-openshift 23:58 504 13399.6897ms
etcd-master2-openshift 23:59 142 1378.7462ms
$ etcd-ocp-diag.py --path must-gather.local.<timestamp>/ --ttl --date 2025-11-18
POD DATE COUNT MAX_TIME
etcd-master0-openshift 00:00 158 3465.4064ms
etcd-master0-openshift 00:01 308 1403.4716ms
etcd-master0-openshift 00:02 577 1453.5319ms
etcd-master0-openshift 00:03 204 2280.3111ms
etcd-master0-openshift 00:04 73 520.2907ms
etcd-master0-openshift 00:05 1256 1981.6568ms
...
etcd-master2-openshift 00:29 492 2825.7174ms
etcd-master2-openshift 00:30 224 26548.9988ms
etcd-master2-openshift 00:31 396 2219.7259ms
Note: Even the healthiest of clusters will have some apply request took too long messages in the etcd pod logs. Double-digit to low-triple-digit occurrences per day are normal. The key is to compare the took value against expected-duration: occasional entries just over the threshold are no cause for concern, but entries where the took value is significantly above expected-duration indicate a problem that requires deeper analysis.
Correlating etcd errors
At this point, it is safe to say that the etcd pods were experiencing significant performance issues between 23:30 and 00:30 from the 17th to the 18th. Reviewing the Max Time, we are seeing apply request took too long times exceeding a minute in some cases. A delay of 400-500ms is significant. A 1+ minute delay will result in API server pods not responding and timing out.
To confirm this, let's look at the events in the openshift-kube-apiserver namespace and filter for warnings during the incident window. The commands below filter for Warning events with a lastTimestamp starting with 2025-11-17T23:5 and 2025-11-18T00:0.
$ omc get events -o yaml | yq -yr '.items[] | select(.type == "Warning" and (.lastTimestamp | startswith("2025-11-17T23:5"))) | {lastTimestamp, reason, message}'
lastTimestamp: '2025-11-17T23:58:25Z'
reason: ProbeError
message: "Readiness probe error: Get \"https://10.1.1.56:6443/readyz\": context deadline\
\ exceeded (Client.Timeout exceeded while awaiting headers)\nbody: \n"
---
lastTimestamp: '2025-11-17T23:58:36Z'
reason: ProbeError
message: 'Readiness probe error: HTTP probe failed with statuscode: 500
body: [+]ping ok
[+]log ok
[+]etcd ok
[-]etcd-readiness failed: reason withheld
[+]api-openshift-apiserver-available ok
[+]api-openshift-oauth-apiserver-available ok
[+]informer-sync ok
[+]poststarthook/openshift.io-oauth-apiserver-reachable ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/quota.openshift.io-clusterquotamapping ok
[+]poststarthook/openshift.io-api-request-count-filter ok
[+]poststarthook/openshift.io-startkubeinformers ok
[+]poststarthook/openshift.io-openshift-apiserver-reachable ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/start-service-ip-repair-controllers ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-system-namespaces-controller ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-garbage-collector ok
[+]poststarthook/start-legacy-token-tracking-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/apiservice-wait-for-first-sync ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/apiservice-openapiv3-controller ok
[+]poststarthook/apiservice-discovery-controller ok
[+]shutdown ok
readyz check failed
'
$ omc get events -o yaml | yq -yr '.items[] | select(.type == "Warning" and (.lastTimestamp | startswith("2025-11-18T00:0"))) | {lastTimestamp, reason, message}'
lastTimestamp: '2025-11-18T00:01:29Z'
reason: KubeAPIReadyz
message: readyz=true
---
lastTimestamp: '2025-11-18T00:04:17Z'
reason: ProbeError
message: "(combined from similar events): Readiness probe error: Get \"https://10.1.1.56:6443/readyz\"\
: net/http: request canceled (Client.Timeout exceeded while awaiting headers)\n\
body: \n"
---
The events tie directly to the etcd latency timestamps. At 23:58:25, the kube-apiserver's readiness probe fails with context deadline exceeded. Eleven seconds later, a second failure returns a full /readyz body with only one failing check: [-]etcd-readiness failed. Everything else reports ok. This lands squarely in the peak latency window identified by running etcd-ocp-diag with --ttl, which showed apply request took too long entries exceeding 40 and 64 seconds on etcd-master0. The kube-apiserver wasn't the problem; etcd was, and the readiness body shows this. By 00:01:29, readyz=true signals that the API has recovered, and the final ProbeError at 00:04:17 shows the cluster wasn't fully stable yet, consistent with --fdatasync showing continued slow syncs across all three members through 00:30.
Digging deeper into the namespace events, we found container restarts in the openshift-kube-scheduler, openshift-kube-controller-manager, and openshift-apiserver-operator namespaces, along with ProbeError, liveness, and readiness errors in the openshift-etcd-operator and openshift-route-controller-manager namespaces.
Conclusion
Everything in this post traces back to one root cause: a disk that could not keep up with etcd's write rate.
The disk pressure showed up first as slow fdatasync, which fed directly into apply request took too long with peak latency hitting 82,615ms at 23:50, over 400 times the 200ms threshold. At those latencies, etcd could not send heartbeats on time, triggering 3 leader elections and 504 leader is overloaded likely from slow disk warnings.
The kube-apiserver felt it first. At 23:58:25, its readiness probe failed to reach /readyz. Eleven seconds later, a second failure returned a full readiness body with one failing check: [-]etcd-readiness failed. Everything else was healthy. The kube-controller-manager and kube-scheduler, which depend on a stable API server connection, began losing that connection and restarting. Four Cluster Operators went degraded within a 30-minute window, and readiness probes across the control plane started failing and triggering further restarts.
One important note about what this must-gather can and cannot show: the etcd pod logs only contain several hours up to the event. The namespace events, however, have a longer retention window and tell a broader story. They show a prior episode that resolved approximately 8 hours before this must-gather was collected, placing it well outside the log retention window and entirely invisible to etcd-ocp-diag. The kube-controller-manager restart counts, 5,400 to 5,800 over 152 days, confirm what that pattern implies: this was not a single bad night. The disk had been failing to keep up for months, and periodic latency episodes had already become a recurring pattern before anyone opened a support case.
The fix is storage. Moving the etcd databases to faster or dedicated hardware and validating performance thresholds will resolve the issue. Fix the storage issues and the cascade stops.
If you want to run etcd-ocp-diag against your own must-gathers, the script is on GitHub at cptmorgan-rh/etcd-ocp-diag-script.
Error Log Reference
Raw log examples for each error etcd-ocp-diag detects. etcd writes structured JSON logs, so each entry below is a single log line from a pod's current.log file.
request stats
{
"level": "warn",
"ts": "2025-11-17T23:50:23.943884Z",
"caller": "v3rpc/interceptor.go:197",
"msg": "request stats",
"start time": "2025-11-17T23:49:01.399932Z",
"time spent": "1m22.543939321s",
"remote": "10.1.1.58:43044",
"response type": "/etcdserverpb.KV/Range",
"request count": 0,
"request size": 98,
"response count": 0,
"response size": 32,
"request content": "key:\"/kubernetes.io/loki.grafana.com/alertingrules/\" range_end:\"/kubernetes.io/loki.grafana.com/alertingrules0\" count_only:true "
}
slow fdatasync
{
"level": "warn",
"ts": "2025-11-17T23:47:06.246348Z",
"caller": "wal/wal.go:805",
"msg": "slow fdatasync",
"took": "1.059814472s",
"expected-duration": "1s"
}
apply request took too long
{
"level": "warn",
"ts": "2025-11-17T23:36:03.279035Z",
"caller": "etcdserver/util.go:170",
"msg": "apply request took too long",
"took": "1.049703217s",
"expected-duration": "200ms",
"prefix": "",
"request": "header:<ID:3424031285281708008 username:\"etcd-client\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/apiserver.openshift.io/apirequestcounts/cronjobs.v1.batch\" mod_revision:1154801942 > success:<request_put:<key:\"/kubernetes.io/apiserver.openshift.io/apirequestcounts/cronjobs.v1.batch\" value_size:71506 >> failure:<request_range:<key:\"/kubernetes.io/apiserver.openshift.io/apirequestcounts/cronjobs.v1.batch\" > >>",
"response": "size:22"
}
leader is overloaded likely from slow disk
{
"level": "warn",
"ts": "2025-11-18T00:11:36.864222Z",
"caller": "etcdserver/raft.go:416",
"msg": "leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk",
"to": "61ce582558a9b290",
"heartbeat-interval": "100ms",
"expected-duration": "200ms",
"exceeded-duration": "27.432518ms"
}
lost leader
{
"level": "info",
"ts": "2025-11-18T00:03:32.178223Z",
"logger": "raft",
"caller": "etcdserver/zap_raft.go:77",
"msg": "raft.node: 88b5f2b6a5a62f84 lost leader 21f9907189308d61 at term 5501"
}
etcdserver: request timed out
{
"level": "warn",
"ts": "2025-11-18T00:03:33.35972Z",
"caller": "etcdserver/server.go:1201",
"msg": "failed to revoke lease",
"lease-id": "32909a5088cbfd4b",
"error": "etcdserver: request timed out"
}
waiting for ReadIndex response took too long, retrying
{
"level": "warn",
"ts": "2025-11-17T23:36:04.078837Z",
"caller": "etcdserver/v3_server.go:920",
"msg": "waiting for ReadIndex response took too long, retrying",
"sent-request-id": 3424031285281708052,
"retry-timeout": "500ms"
}