Diagnosing etcd Performance Issues in OpenShift

etcd performance problems are easy to miss. In severe cases, the API server, kube-scheduler, and kube-controller-manager are all restarting and the web console is unresponsive. At the other end of the spectrum, everything looks fine: every Cluster Operator is Available, no pods are restarting, it just doesn't feel as fast as it used to. In both cases, etcd is rarely the first suspect.

In this article, we walk through a real-world must-gather that was collected at 00:50 on 2025-11-18 after numerous pod restarts and multiple timeouts occurred across the cluster, interpret the restart patterns, and use the etcd-ocp-diag script to pinpoint whether the cluster is experiencing a momentary spike or a chronic disk performance problem. In a future article, we will look at the metrics side to see how they correlate with log errors and namespace events.

Collecting etcd Pod Logs

When a customer's cluster starts experiencing slowness or pod restarts, the first step they should take is to collect a must-gather and upload it to a Support Case:

$ oc adm must-gather

A must-gather snapshots the cluster's state and collects core OpenShift configurations and namespaces including the namespace's configmaps, secrets, pod logs, events, CRDs, and more. For etcd troubleshooting with a must-gather, the pod logs are what matter most.

The typical layout for the openshift-etcd namespace looks like the following example. Each control plane node runs an etcd member with its own log file. Problems often appear on all three nodes at once (a slow disk on one node can trigger leader elections that ripple across the cluster), but sometimes they're isolated to a single member.

must-gather.local.<timestamp>/
└── quay-io-openshift-release-dev-.../
    └── namespaces/
        └── openshift-etcd/
            └── pods/
                ├── etcd-master-0/
                │   └── etcd/
                │       └── etcd/
                │           └── logs/
                │               ├── current.log
                │               └── previous.log
                ├── etcd-master-1/
                └── etcd-master-2/

Installing omc

To start diagnosing any OpenShift must-gather, install an open source tool called omc, which is a Go-based CLI that lets you interact with a must-gather using familiar oc-style commands. Instead of navigating the directory tree manually, you query it the same way you would a live cluster.

Install omc by running:

$ curl -sL "https://github.com/gmeghnag/omc/releases/latest/download/omc_$(uname)_$(uname -m).tar.gz" \
  | tar xzf - omc && chmod +x ./omc && mv ./omc /usr/local/bin/omc

Point it at your must-gather and start querying:

$ omc use must-gather.local.<timestamp>/

A few of the commands below pipe omc output through yq (the Python wrapper around jq) to filter YAML. If you want to follow along with those, install it with pip install yq. Note that this is a different tool than the Go-based mikefarah/yq, and the -yr flags used later are specific to the Python version.


Reviewing the Must-Gather

Start with clusterversion to get a baseline of the cluster: current version, availability, and whether the cluster is mid-update or whether there are any issues. The Since column quickly lets you know if a Cluster Operator's state has changed recently or if the cluster is stable.

$ omc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.28   True        False         44m     Cluster version is 4.16.28

From the previous command, we can see that the cluster last changed 44 minutes ago. Check the cluster's update history to eliminate a cluster upgrade as the reason for the change.

$ omc get clusterversion -o yaml | yq -yr '.items[].status.history'
- completionTime: '2025-06-18T18:29:47Z'
  image: quay.io/openshift-release-dev/ocp-release@sha256:7708f832ae02919f2cdb2798fdbc64e17ce7a576d1e3baabdd78a000d2d62f40
  startedTime: '2025-06-18T13:04:54Z'
  state: Completed
  verified: true
  version: 4.16.28

We now know that this cluster was previously upgraded 152 days before this must-gather was collected, so we can determine that either a configuration change or instability is the reason for the SINCE column returning 44m.

Checking Cluster Operators

Next, check Cluster Operators. Look for all operators to be on the same version, none Progressing or Degraded, and Since values measured in days or weeks. Operators flapping in a troubled cluster show Since timestamps of minutes or hours, not days.

$ omc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.16.28   True        False         False      32m
baremetal                                  4.16.28   True        False         False      1y
cloud-controller-manager                   4.16.28   True        False         False      1y
cloud-credential                           4.16.28   True        False         False      1y
cluster-autoscaler                         4.16.28   True        False         False      1y
config-operator                            4.16.28   True        False         False      1y
console                                    4.16.28   True        False         False      74d
control-plane-machine-set                  4.16.28   True        False         False      1y
csi-snapshot-controller                    4.16.28   True        False         False      74d
dns                                        4.16.28   True        False         False      32d
etcd                                       4.16.28   True        False         False      13d
image-registry                             4.16.28   True        False         False      1y
ingress                                    4.16.28   True        False         False      74d
insights                                   4.16.28   True        False         False      48d
kube-apiserver                             4.16.28   True        False         False      1y
kube-controller-manager                    4.16.28   True        False         False      1y
kube-scheduler                             4.16.28   True        False         False      1y
kube-storage-version-migrator              4.16.28   True        False         False      13d
machine-api                                4.16.28   True        False         False      1y
machine-approver                           4.16.28   True        False         False      1y
machine-config                             4.16.28   True        False         False      166d
marketplace                                4.16.28   True        False         False      1y
monitoring                                 4.16.28   True        False         False      1h
network                                    4.16.28   True        False         False      1y
node-tuning                                4.16.28   True        False         False      152d
openshift-apiserver                        4.16.28   True        False         False      46m
openshift-controller-manager               4.16.28   True        False         False      1y
openshift-samples                          4.16.28   True        False         False      152d
operator-lifecycle-manager                 4.16.28   True        False         False      1y
operator-lifecycle-manager-catalog         4.16.28   True        False         False      1y
operator-lifecycle-manager-packageserver   4.16.28   True        False         False      49m
service-ca                                 4.16.28   True        False         False      1y
storage                                    4.16.28   True        False         False      1y

We know that no configuration changes were made to the cluster, so let's look at the operators with the newest status change: authentication, monitoring, openshift-apiserver, and operator-lifecycle-manager-packageserver. We'll focus on the last time each was in the Degraded status or became Available.

$ omc get co authentication -o yaml | yq -yr '.status.conditions[] | select(.type == "Degraded")'
lastTransitionTime: '2025-11-18T00:01:18Z'
message: All is well
reason: AsExpected
status: 'False'
type: Degraded

$ omc get co monitoring -o yaml | yq -yr '.status.conditions[] | select(.type == "Degraded")'
lastTransitionTime: '2025-11-17T23:34:02Z'
status: 'False'
type: Degraded

$ omc get co openshift-apiserver -o yaml | yq -yr '.status.conditions[] | select(.type == "Degraded")'
lastTransitionTime: '2025-11-18T00:04:55Z'
message: All is well
reason: AsExpected
status: 'False'
type: Degraded

$ omc get co operator-lifecycle-manager-packageserver -o yaml | yq -yr '.status.conditions[] | select(.type == "Available")'
lastTransitionTime: '2025-11-18T00:01:44Z'
message: ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver
  observed in phase Succeeded
reason: ClusterServiceVersionSucceeded
status: 'True'
type: Available

While the cluster currently looks healthy, we can see that all 4 of these operators went into a degraded state within a 30-minute window, which confirms that something significant happened on this cluster. To verify these findings, we need to look for pods that started less than an hour ago, which would explain the Degraded state, and look for pod restarts across the control plane to determine how unstable the control plane truly is.

Checking Pod Age and Restarts

Because those operators reported their last status change within the last hour, any pods rescheduled during that incident should fall within that window. The first command below correlates pod age back to that 1-hour window by returning running pods whose AGE column reads in minutes, confirming which workloads were rescheduled when the operators flapped. The second command filters for etcd, API server, scheduler, and controller-manager pods and returns only those with at least one restart:

$ omc get pods -A | grep Running | awk '$6 ~ /m$/'
openshift-authentication                           oauth-openshift-85bc66dc4d-7n6kq                                  1/1  Running     0        45m
openshift-authentication                           oauth-openshift-85bc66dc4d-8sqf6                                  1/1  Running     0        49m
openshift-authentication                           oauth-openshift-85bc66dc4d-hjrn4                                  1/1  Running     0        45m
openshift-marketplace                              certified-operators-gsczr                                         1/1  Running     0        33m
openshift-marketplace                              community-operators-twqbv                                         1/1  Running     0        44m
openshift-marketplace                              redhat-marketplace-6jlkc                                          0/1  Running     0        9m

$ omc get pods -A --no-headers | grep -E 'etcd|apiserver|controller-manager|scheduler' | awk '$5 != 0'
openshift-apiserver                                apiserver-59f65c798-d6fx6                                         2/2   Running     9       13d
openshift-apiserver                                apiserver-59f65c798-rhqmm                                         2/2   Running     19      13d
openshift-apiserver                                apiserver-59f65c798-x9m7f                                         2/2   Running     356     152d
openshift-apiserver-operator                       openshift-apiserver-operator-858db8f58d-bvtrd                     1/1   Running     34      13d
openshift-cloud-controller-manager-operator        cluster-cloud-controller-manager-operator-58c96c984-pl9bv         3/3   Running     345     152d
openshift-controller-manager                       controller-manager-58b5c6d67d-f8mnv                               1/1   Running     5       13d
openshift-controller-manager                       controller-manager-58b5c6d67d-jndg2                               1/1   Running     25      13d
openshift-controller-manager                       controller-manager-58b5c6d67d-pl82v                               1/1   Running     3       13d
openshift-controller-manager-operator              openshift-controller-manager-operator-bc46ffc87-5rm9s             1/1   Running     230     152d
openshift-etcd                                     etcd-master0-openshift                                            4/4   Running     13      152d
openshift-etcd                                     etcd-master1-openshift                                            4/4   Running     13      152d
openshift-etcd                                     etcd-master2-openshift                                            4/4   Running     5       152d
openshift-etcd-operator                            etcd-operator-5f9ccb4ffb-v7lkj                                    1/1   Running     266     152d
openshift-kube-apiserver                           kube-apiserver-master0-openshift                                  5/5   Running     6       3d
openshift-kube-apiserver                           kube-apiserver-master1-openshift                                  5/5   Running     5       3d
openshift-kube-apiserver                           kube-apiserver-master2-openshift                                  5/5   Running     10      3d
openshift-kube-apiserver-operator                  kube-apiserver-operator-5c496bb69b-8gclx                          1/1   Running     243     152d
openshift-kube-controller-manager                  kube-controller-manager-master0-openshift                         4/4   Running     5669    152d
openshift-kube-controller-manager                  kube-controller-manager-master1-openshift                         4/4   Running     5467    152d
openshift-kube-controller-manager                  kube-controller-manager-master2-openshift                         4/4   Running     5849    152d
openshift-kube-controller-manager-operator         kube-controller-manager-operator-7dc4fc458b-jw4qq                 1/1   Running     236     152d
openshift-kube-scheduler                           openshift-kube-scheduler-master0-openshift                        3/3   Running     120     152d
openshift-kube-scheduler                           openshift-kube-scheduler-master1-openshift                        3/3   Running     130     152d
openshift-kube-scheduler                           openshift-kube-scheduler-master2-openshift                        3/3   Running     98      152d
openshift-kube-scheduler-operator                  openshift-kube-scheduler-operator-58bb8b85c4-fs7dd                1/1   Running     230     152d
openshift-oauth-apiserver                          apiserver-cf6b5f478-gxzl9                                         1/1   Running     317     152d
openshift-oauth-apiserver                          apiserver-cf6b5f478-jz5xh                                         1/1   Running     1       13d
openshift-oauth-apiserver                          apiserver-cf6b5f478-l6954                                         1/1   Running     8       13d
openshift-route-controller-manager                 route-controller-manager-7c66879896-2sjkt                         1/1   Running     7       13d
openshift-route-controller-manager                 route-controller-manager-7c66879896-lfzfw                         1/1   Running     15      13d
openshift-route-controller-manager                 route-controller-manager-7c66879896-t8n2v                         1/1   Running     29      13d

Reviewing the output of these two commands, we can confirm that the authentication, monitoring, openshift-apiserver, and operator-lifecycle-manager-packageserver operators all had new pods scheduled. Additionally, the control plane pods have experienced between 2 and 5849 pod restarts across 14 different key OpenShift namespaces, indicating major instability in this cluster.

The etcd pods themselves don't look great with 5 to 13 restarts across 152 days. If you stopped there, you might conclude that etcd is having issues but isn't the core problem. The kube-controller-manager tells a different story. All three instances are sitting at 5,400 to 5,800 restarts over the same 152 days, roughly 38 restarts per day or one crash every 38 minutes. The kube-scheduler is also elevated at 98 to 130 restarts, and one openshift-apiserver pod has 356.

This pattern is a classic fingerprint of etcd performance pressure. The etcd pods aren't crashing outright, but slow write latency or leader elections are causing watch streams and heartbeat timeouts to fail. The kube-controller-manager and scheduler lose their connections to the API server, hit their own deadlines, and restart. The etcd operator at 266 restarts reflects the same instability: it's repeatedly trying to reconcile a cluster that keeps going sideways.

The operational impact during those restart windows is significant. The kube-controller-manager manages pod lifecycle, endpoint sync, garbage collection, and node lease renewals. When it's down, even briefly, pod scheduling stalls, endpoints go stale, and terminated pods pile up without being cleaned. Users don't see a hard outage. They see slow deployments, delayed pod scheduling, and API calls that take seconds instead of milliseconds.


Reviewing etcd Logs for Performance Issues

Manually grepping etcd logs for performance issues works, but it can quickly get tedious when you have three nodes' worth of logs covering weeks. etcd-ocp-diag automates that work. It's a Python script that scans a must-gather for etcd performance-related errors, groups them by pod and timestamp, shows statistics like max latency and first/last occurrence, and lets you compare multiple pods to determine whether the same error is hitting multiple members simultaneously.

Install it by cloning the repo, making it executable, and then copying it to your $PATH. Python 3.8 or later is the only requirement:

$ git clone https://github.com/cptmorgan-rh/etcd-ocp-diag-script.git && \
cd etcd-ocp-diag-script && chmod +x ./etcd-ocp-diag.py && mv ./etcd-ocp-diag.py /usr/local/bin/etcd-ocp-diag.py

Start with the --errors flag, which gives you a per-pod count of every performance-related error the script knows about:

$ etcd-ocp-diag.py --path must-gather.local.<timestamp>/ --errors
POD                         ERROR                                                   COUNT
etcd-master0-openshift      waiting for ReadIndex response took too long, retrying    505
etcd-master0-openshift      slow fdatasync                                             25
etcd-master0-openshift      apply request took too long                             17339
etcd-master0-openshift      leader is overloaded likely from slow disk                504
etcd-master0-openshift      lost leader                                                 3
etcd-master0-openshift      request stats                                           16229
etcd-master1-openshift      waiting for ReadIndex response took too long, retrying    273
etcd-master1-openshift      slow fdatasync                                             17
etcd-master1-openshift      apply request took too long                              7087
etcd-master1-openshift      leader is overloaded likely from slow disk                214
etcd-master1-openshift      lost leader                                                 1
etcd-master1-openshift      request stats                                            6103
etcd-master2-openshift      waiting for ReadIndex response took too long, retrying    354
etcd-master2-openshift      etcdserver: request timed out                               3
etcd-master2-openshift      slow fdatasync                                              7
etcd-master2-openshift      apply request took too long                             12577
etcd-master2-openshift      leader is overloaded likely from slow disk                448
etcd-master2-openshift      lost leader                                                 2
etcd-master2-openshift      request stats                                           11688

The first thing that stands out is the significant number of request stats warnings and apply request took too long errors.

Before digging into those, here is what each error the script detects actually means and how seriously to treat it:

Error What It Means Severity
waiting for ReadIndex response took too long, retrying A follower sent a ReadIndex request to the leader to confirm it's still leader before serving a linearizable read. The leader didn't respond within the timeout, so the follower retried. Warning
slow fdatasync etcd called fdatasync() to flush the WAL to disk and it took longer than the acceptable threshold. Warning
apply request took too long Applying a committed Raft log entry to the state machine took longer than the acceptable threshold. Critical
leader is overloaded likely from slow disk The leader's disk is too slow to keep up with the write rate. Critical
lost leader This member was leader and lost that status, typically because followers stopped receiving its heartbeats. Informational
wal: sync duration The time to sync the Write-Ahead Log to disk exceeded the acceptable threshold. Warning
local node might have slow network etcd detected that this member is taking too long to send or receive messages from peers. Warning
server is likely overloaded etcd's internal request queue is filling faster than it can drain. Caused by a combination of high write rate and slow disk or CPU; left unchecked, requests begin timing out. Critical
sending buffer is full The outbound message buffer to a peer is saturated, meaning etcd cannot send messages fast enough. Warning
request stats A warning log line etcd emits summarizing recent request latency that took longer than the acceptable threshold. Warning
etcdserver: request timed out A client request to etcd exceeded the server-side timeout. The operation failed from the client's perspective. Critical

To view pods, dates, and error counts for request stats, use the --request_stats flag:

$ etcd-ocp-diag.py --path must-gather.local.<timestamp>/ --request_stats
POD                     DATE            COUNT
etcd-master0-openshift  2025-11-17       9725
etcd-master0-openshift  2025-11-18       6504
etcd-master1-openshift  2025-11-18       6103
etcd-master2-openshift  2025-11-17        685
etcd-master2-openshift  2025-11-18      11003

One of the script's main features is that using the --date flag returns the number of errors that occurred during each hour and minute along with the highest time spent or took value, making it easier to determine whether the cluster is experiencing chronic or periodic issues. In the output below, we can see that request stats warnings spanned a 55-minute window from 23:36 on 2025-11-17 to 00:31 on 2025-11-18, and in most cases the time spent was far above the 300ms threshold that triggers the log entry.

$ etcd-ocp-diag.py --path must-gather.local.<timestamp>/ --request_stats --date 2025-11-17
POD                     DATE    COUNT   MAX_TIME
etcd-master0-openshift  23:36    834    3525.2161ms
etcd-master0-openshift  23:37    331    595.3195ms
etcd-master0-openshift  23:38    691    4450.4680ms
etcd-master0-openshift  23:39     81    1347.1995ms
etcd-master0-openshift  23:40    127    1152.4057ms
etcd-master0-openshift  23:41     52    1125.4104ms
etcd-master0-openshift  23:42     51    1049.1937ms
etcd-master0-openshift  23:43     32    1038.7765ms
etcd-master0-openshift  23:44     58    1096.3438ms
etcd-master0-openshift  23:45    481    1723.8945ms
etcd-master0-openshift  23:46    751    3182.1362ms
etcd-master0-openshift  23:47    460    3368.0477ms
etcd-master0-openshift  23:48    450    37923.3061ms
etcd-master0-openshift  23:49    781    60005.1878ms
etcd-master0-openshift  23:50    536    82543.9393ms
etcd-master0-openshift  23:51    168    1423.6394ms
etcd-master0-openshift  23:52     99    990.9552ms
etcd-master0-openshift  23:53    197    3080.9392ms
etcd-master0-openshift  23:54     55    1138.4741ms
etcd-master0-openshift  23:55    109    1128.5804ms
etcd-master0-openshift  23:56    717    2565.0926ms
etcd-master0-openshift  23:57   1352    4684.8516ms
etcd-master0-openshift  23:58    591    40000.2205ms
etcd-master0-openshift  23:59    721    63401.0197ms
etcd-master2-openshift  23:58    552    13400.5967ms
etcd-master2-openshift  23:59    133    1378.8829ms

$ etcd-ocp-diag.py --path must-gather.local.<timestamp>/ --request_stats --date 2025-11-18
POD                     DATE    COUNT   MAX_TIME
etcd-master0-openshift  00:00    181    3465.6826ms
etcd-master0-openshift  00:01    210    1404.0401ms
etcd-master0-openshift  00:02    602    1453.8899ms
etcd-master0-openshift  00:03    196    2469.9012ms
etcd-master0-openshift  00:04     21    520.4142ms
etcd-master0-openshift  00:05    907    1981.8301ms
...
etcd-master2-openshift  00:29    310    2826.0084ms
etcd-master2-openshift  00:30    195    26549.1939ms
etcd-master2-openshift  00:31    389    2219.8426ms

Apply Latency and Disk Sync Times

For a cluster with slow disk symptoms, start by using --fdatasync, which reports warnings when the storage backend takes longer than 1 second to flush the Write-Ahead Log (WAL) data to disk. Follow it up with the --ttl flag, which shows the same time-based breakdown but for the underlying apply latency rather than the client-visible request stats.

$ etcd-ocp-diag.py --path must-gather.local.<timestamp>/  --fdatasync
POD                     DATE            COUNT
etcd-master0-openshift  2025-11-17       6
etcd-master0-openshift  2025-11-18      19
etcd-master1-openshift  2025-11-18      17
etcd-master2-openshift  2025-11-18       7

$ etcd-ocp-diag.py --path must-gather.local.<timestamp>/  --fdatasync --date 2025-11-17
POD                     DATE    COUNT   MAX_TIME
etcd-master0-openshift  23:47   2       1590.3280ms
etcd-master0-openshift  23:51   1       1024.1193ms
etcd-master0-openshift  23:58   3       1204.2944ms

$ etcd-ocp-diag.py --path must-gather.local.<timestamp>/  --fdatasync --date 2025-11-18
POD                     DATE    COUNT   MAX_TIME
etcd-master0-openshift  00:07   7       3341.3137ms
etcd-master0-openshift  00:08   4       2998.5233ms
etcd-master0-openshift  00:10   1       1214.7937ms
etcd-master0-openshift  00:11   3       1233.9415ms
etcd-master0-openshift  00:18   1       1098.6943ms
etcd-master0-openshift  00:21   3       1377.3595ms
etcd-master1-openshift  00:07   4       1969.3752ms
etcd-master1-openshift  00:08   2       1700.1053ms
etcd-master1-openshift  00:10   1       1268.7579ms
etcd-master1-openshift  00:11   1       2465.6723ms
etcd-master1-openshift  00:17   4       5332.0430ms
etcd-master1-openshift  00:18   1       1253.9097ms
etcd-master1-openshift  00:27   3       1763.0241ms
etcd-master1-openshift  00:30   1       1147.0087ms
etcd-master2-openshift  00:00   1       1004.7974ms
etcd-master2-openshift  00:03   1       1829.8329ms
etcd-master2-openshift  00:05   1       1466.1899ms
etcd-master2-openshift  00:07   1       1772.8069ms
etcd-master2-openshift  00:10   1       1280.2763ms
etcd-master2-openshift  00:11   1       1049.6459ms
etcd-master2-openshift  00:30   1       1190.8360ms

$ etcd-ocp-diag.py --path must-gather.local.<timestamp>/ --ttl
POD                     DATE            COUNT
etcd-master0-openshift  2025-11-17      10079
etcd-master0-openshift  2025-11-18       7260
etcd-master1-openshift  2025-11-18       7087
etcd-master2-openshift  2025-11-17        646
etcd-master2-openshift  2025-11-18      11931

$ etcd-ocp-diag.py --path must-gather.local.<timestamp>/ --ttl --date 2025-11-17
POD                     DATE    COUNT   MAX_TIME
etcd-master0-openshift  23:36    847    3513.0858ms
etcd-master0-openshift  23:37    799    595.2415ms
etcd-master0-openshift  23:38   1141    4450.3839ms
etcd-master0-openshift  23:39    124    1347.0568ms
etcd-master0-openshift  23:40    193    1152.2839ms
etcd-master0-openshift  23:41     87    1082.6561ms
etcd-master0-openshift  23:42     68    1049.0904ms
etcd-master0-openshift  23:43     66    1038.6286ms
etcd-master0-openshift  23:44     71    1096.2263ms
etcd-master0-openshift  23:45    462    1723.7928ms
etcd-master0-openshift  23:46    713    3181.9526ms
etcd-master0-openshift  23:47    467    3367.8464ms
etcd-master0-openshift  23:48    267    37923.1994ms
etcd-master0-openshift  23:49    555    63233.9340ms
etcd-master0-openshift  23:50    531    82615.2260ms
etcd-master0-openshift  23:51    170    1422.9851ms
etcd-master0-openshift  23:52    151    990.8517ms
etcd-master0-openshift  23:53    202    3080.6236ms
etcd-master0-openshift  23:54     89    1138.3644ms
etcd-master0-openshift  23:55    160    1128.4636ms
etcd-master0-openshift  23:56    789    2564.9544ms
etcd-master0-openshift  23:57   1168    4684.6261ms
etcd-master0-openshift  23:58    408    39999.9523ms
etcd-master0-openshift  23:59    551    64436.3973ms
etcd-master2-openshift  23:58    504    13399.6897ms
etcd-master2-openshift  23:59    142    1378.7462ms

$ etcd-ocp-diag.py --path must-gather.local.<timestamp>/ --ttl --date 2025-11-18
POD                     DATE    COUNT   MAX_TIME
etcd-master0-openshift  00:00    158    3465.4064ms
etcd-master0-openshift  00:01    308    1403.4716ms
etcd-master0-openshift  00:02    577    1453.5319ms
etcd-master0-openshift  00:03    204    2280.3111ms
etcd-master0-openshift  00:04     73    520.2907ms
etcd-master0-openshift  00:05   1256    1981.6568ms
...
etcd-master2-openshift  00:29    492    2825.7174ms
etcd-master2-openshift  00:30    224    26548.9988ms
etcd-master2-openshift  00:31    396    2219.7259ms

Note: Even the healthiest of clusters will have some apply request took too long messages in the etcd pod logs. Double-digit to low-triple-digit occurrences per day are normal. The key is to compare the took value against expected-duration: occasional entries just over the threshold are no cause for concern, but entries where the took value is significantly above expected-duration indicate a problem that requires deeper analysis.

Correlating etcd errors

At this point, it is safe to say that the etcd pods were experiencing significant performance issues between 23:30 and 00:30 from the 17th to the 18th. Reviewing the Max Time, we are seeing apply request took too long times exceeding a minute in some cases. A delay of 400-500ms is significant. A 1+ minute delay will result in API server pods not responding and timing out.

To confirm this, let's look at the events in the openshift-kube-apiserver namespace and filter for warnings during the incident window. The commands below filter for Warning events with a lastTimestamp starting with 2025-11-17T23:5 and 2025-11-18T00:0.

$ omc get events -o yaml | yq -yr '.items[] | select(.type == "Warning" and (.lastTimestamp | startswith("2025-11-17T23:5"))) | {lastTimestamp, reason, message}'
lastTimestamp: '2025-11-17T23:58:25Z'
reason: ProbeError
message: "Readiness probe error: Get \"https://10.1.1.56:6443/readyz\": context deadline\
  \ exceeded (Client.Timeout exceeded while awaiting headers)\nbody: \n"
---
lastTimestamp: '2025-11-17T23:58:36Z'
reason: ProbeError
message: 'Readiness probe error: HTTP probe failed with statuscode: 500
  body: [+]ping ok
  [+]log ok
  [+]etcd ok
  [-]etcd-readiness failed: reason withheld
  [+]api-openshift-apiserver-available ok
  [+]api-openshift-oauth-apiserver-available ok
  [+]informer-sync ok
  [+]poststarthook/openshift.io-oauth-apiserver-reachable ok
  [+]poststarthook/start-kube-apiserver-admission-initializer ok
  [+]poststarthook/quota.openshift.io-clusterquotamapping ok
  [+]poststarthook/openshift.io-api-request-count-filter ok
  [+]poststarthook/openshift.io-startkubeinformers ok
  [+]poststarthook/openshift.io-openshift-apiserver-reachable ok
  [+]poststarthook/generic-apiserver-start-informers ok
  [+]poststarthook/priority-and-fairness-config-consumer ok
  [+]poststarthook/priority-and-fairness-filter ok
  [+]poststarthook/storage-object-count-tracker-hook ok
  [+]poststarthook/start-apiextensions-informers ok
  [+]poststarthook/start-apiextensions-controllers ok
  [+]poststarthook/crd-informer-synced ok
  [+]poststarthook/start-service-ip-repair-controllers ok
  [+]poststarthook/rbac/bootstrap-roles ok
  [+]poststarthook/scheduling/bootstrap-system-priority-classes ok
  [+]poststarthook/priority-and-fairness-config-producer ok
  [+]poststarthook/start-system-namespaces-controller ok
  [+]poststarthook/bootstrap-controller ok
  [+]poststarthook/start-cluster-authentication-info-controller ok
  [+]poststarthook/start-kube-apiserver-identity-lease-controller ok
  [+]poststarthook/start-kube-apiserver-identity-lease-garbage-collector ok
  [+]poststarthook/start-legacy-token-tracking-controller ok
  [+]poststarthook/aggregator-reload-proxy-client-cert ok
  [+]poststarthook/start-kube-aggregator-informers ok
  [+]poststarthook/apiservice-registration-controller ok
  [+]poststarthook/apiservice-status-available-controller ok
  [+]poststarthook/apiservice-wait-for-first-sync ok
  [+]poststarthook/kube-apiserver-autoregistration ok
  [+]autoregister-completion ok
  [+]poststarthook/apiservice-openapi-controller ok
  [+]poststarthook/apiservice-openapiv3-controller ok
  [+]poststarthook/apiservice-discovery-controller ok
  [+]shutdown ok
  readyz check failed
  '

$ omc get events -o yaml | yq -yr '.items[] | select(.type == "Warning" and (.lastTimestamp | startswith("2025-11-18T00:0"))) | {lastTimestamp, reason, message}'
lastTimestamp: '2025-11-18T00:01:29Z'
reason: KubeAPIReadyz
message: readyz=true
---
lastTimestamp: '2025-11-18T00:04:17Z'
reason: ProbeError
message: "(combined from similar events): Readiness probe error: Get \"https://10.1.1.56:6443/readyz\"\
  : net/http: request canceled (Client.Timeout exceeded while awaiting headers)\n\
  body: \n"
---

The events tie directly to the etcd latency timestamps. At 23:58:25, the kube-apiserver's readiness probe fails with context deadline exceeded. Eleven seconds later, a second failure returns a full /readyz body with only one failing check: [-]etcd-readiness failed. Everything else reports ok. This lands squarely in the peak latency window identified by running etcd-ocp-diag with --ttl, which showed apply request took too long entries exceeding 40 and 64 seconds on etcd-master0. The kube-apiserver wasn't the problem; etcd was, and the readiness body shows this. By 00:01:29, readyz=true signals that the API has recovered, and the final ProbeError at 00:04:17 shows the cluster wasn't fully stable yet, consistent with --fdatasync showing continued slow syncs across all three members through 00:30.

Digging deeper into the namespace events, we found container restarts in the openshift-kube-scheduler, openshift-kube-controller-manager, and openshift-apiserver-operator namespaces, along with ProbeError, liveness, and readiness errors in the openshift-etcd-operator and openshift-route-controller-manager namespaces.


Conclusion

Everything in this post traces back to one root cause: a disk that could not keep up with etcd's write rate.

The disk pressure showed up first as slow fdatasync, which fed directly into apply request took too long with peak latency hitting 82,615ms at 23:50, over 400 times the 200ms threshold. At those latencies, etcd could not send heartbeats on time, triggering 3 leader elections and 504 leader is overloaded likely from slow disk warnings.

The kube-apiserver felt it first. At 23:58:25, its readiness probe failed to reach /readyz. Eleven seconds later, a second failure returned a full readiness body with one failing check: [-]etcd-readiness failed. Everything else was healthy. The kube-controller-manager and kube-scheduler, which depend on a stable API server connection, began losing that connection and restarting. Four Cluster Operators went degraded within a 30-minute window, and readiness probes across the control plane started failing and triggering further restarts.

One important note about what this must-gather can and cannot show: the etcd pod logs only contain several hours up to the event. The namespace events, however, have a longer retention window and tell a broader story. They show a prior episode that resolved approximately 8 hours before this must-gather was collected, placing it well outside the log retention window and entirely invisible to etcd-ocp-diag. The kube-controller-manager restart counts, 5,400 to 5,800 over 152 days, confirm what that pattern implies: this was not a single bad night. The disk had been failing to keep up for months, and periodic latency episodes had already become a recurring pattern before anyone opened a support case.

The fix is storage. Moving the etcd databases to faster or dedicated hardware and validating performance thresholds will resolve the issue. Fix the storage issues and the cascade stops.

If you want to run etcd-ocp-diag against your own must-gathers, the script is on GitHub at cptmorgan-rh/etcd-ocp-diag-script.


Error Log Reference

Raw log examples for each error etcd-ocp-diag detects. etcd writes structured JSON logs, so each entry below is a single log line from a pod's current.log file.

request stats

{
  "level": "warn",
  "ts": "2025-11-17T23:50:23.943884Z",
  "caller": "v3rpc/interceptor.go:197",
  "msg": "request stats",
  "start time": "2025-11-17T23:49:01.399932Z",
  "time spent": "1m22.543939321s",
  "remote": "10.1.1.58:43044",
  "response type": "/etcdserverpb.KV/Range",
  "request count": 0,
  "request size": 98,
  "response count": 0,
  "response size": 32,
  "request content": "key:\"/kubernetes.io/loki.grafana.com/alertingrules/\" range_end:\"/kubernetes.io/loki.grafana.com/alertingrules0\" count_only:true "
}

slow fdatasync

{
  "level": "warn",
  "ts": "2025-11-17T23:47:06.246348Z",
  "caller": "wal/wal.go:805",
  "msg": "slow fdatasync",
  "took": "1.059814472s",
  "expected-duration": "1s"
}

apply request took too long

{
  "level": "warn",
  "ts": "2025-11-17T23:36:03.279035Z",
  "caller": "etcdserver/util.go:170",
  "msg": "apply request took too long",
  "took": "1.049703217s",
  "expected-duration": "200ms",
  "prefix": "",
  "request": "header:<ID:3424031285281708008 username:\"etcd-client\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/apiserver.openshift.io/apirequestcounts/cronjobs.v1.batch\" mod_revision:1154801942 > success:<request_put:<key:\"/kubernetes.io/apiserver.openshift.io/apirequestcounts/cronjobs.v1.batch\" value_size:71506 >> failure:<request_range:<key:\"/kubernetes.io/apiserver.openshift.io/apirequestcounts/cronjobs.v1.batch\" > >>",
  "response": "size:22"
}

leader is overloaded likely from slow disk

{
  "level": "warn",
  "ts": "2025-11-18T00:11:36.864222Z",
  "caller": "etcdserver/raft.go:416",
  "msg": "leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk",
  "to": "61ce582558a9b290",
  "heartbeat-interval": "100ms",
  "expected-duration": "200ms",
  "exceeded-duration": "27.432518ms"
}

lost leader

{
  "level": "info",
  "ts": "2025-11-18T00:03:32.178223Z",
  "logger": "raft",
  "caller": "etcdserver/zap_raft.go:77",
  "msg": "raft.node: 88b5f2b6a5a62f84 lost leader 21f9907189308d61 at term 5501"
}

etcdserver: request timed out

{
  "level": "warn",
  "ts": "2025-11-18T00:03:33.35972Z",
  "caller": "etcdserver/server.go:1201",
  "msg": "failed to revoke lease",
  "lease-id": "32909a5088cbfd4b",
  "error": "etcdserver: request timed out"
}

waiting for ReadIndex response took too long, retrying

{
  "level": "warn",
  "ts": "2025-11-17T23:36:04.078837Z",
  "caller": "etcdserver/v3_server.go:920",
  "msg": "waiting for ReadIndex response took too long, retrying",
  "sent-request-id": 3424031285281708052,
  "retry-timeout": "500ms"
}