Hello, we recently upgraded our production Kafka cluster from 2.6.1 to 2.7.1 and started seeing two types of errors:
1. When a broker is restarted, on startup it produces a lot of warnings about stale partition leader epochs, such as:
...
[2021-08-23 15:25:55,629] INFO [ReplicaFetcher replicaId=10, leaderId=11, fetcherId=2] Partition redacted-topic1-name-19 has an older epoch (44) than the current leader. Will await the new LeaderAndIsr state before resuming fetching. (kafka.server.ReplicaFetcherThread)
[2021-08-23 15:25:55,630] WARN [ReplicaFetcher replicaId=10, leaderId=11, fetcherId=2] Partition redacted-topic1-name-19 marked as failed (kafka.server.ReplicaFetcherThread)
...
At the end of broker startup I get:
[2021-08-23 15:25:55,645] INFO [ReplicaFetcherManager on broker 10] Removed fetcher for partitions Set(...[a lot of partitions], redacted-topic1-name-19, [a lot of partitions]...) (kafka.server.ReplicaFetcherManager)
2. While running, the broker spams STDOUT roughly every 30 seconds (we have set leader.imbalance.check.interval.seconds=30) with messages like the ones below, which end up in /var/log/messages. They carry no stack trace and do not appear in the standard server.log, only in /var/log/messages, so they look like STDOUT captured by journald:
...
Aug 23 16:50:34 prod-kafka10 java: org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while fetching partition state for redacted-topic1-name-0
Aug 23 16:50:34 prod-kafka10 java: org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while fetching partition state for redacted-topic1-name-1
Aug 23 16:50:34 prod-kafka10 java: org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while fetching partition state for redacted-topic1-name-2
Aug 23 16:50:34 prod-kafka10 java: org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while fetching partition state for redacted-topic1-name-3
Aug 23 16:50:34 prod-kafka10 java: org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while fetching partition state for redacted-topic2-name-0
Aug 23 16:50:34 prod-kafka10 java: org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while fetching partition state for redacted-topic2-name-1
...
These messages appear even for topics/partitions that are not hosted on the broker. For the topics/partitions that the broker does replicate, we get a different exception:
...
Aug 23 16:50:34 prod-kafka10 java: org.apache.kafka.common.errors.NotLeaderOrFollowerException: Failed to find leader log for partition redacted-topic3-name-1 with leader epoch Optional.empty. The current leader is Some(11) and the current epoch 188
Aug 23 16:50:34 prod-kafka10 java: org.apache.kafka.common.errors.NotLeaderOrFollowerException: Failed to find leader log for partition redacted-topic3-name-2 with leader epoch Optional.empty. The current leader is Some(9) and the current epoch 164
...
There are no exceptions for the partitions for which the broker is the leader.
Does anyone know what is wrong with the cluster and how to fix it? So far the cluster appears healthy: producers are successfully writing messages, consumers are reading them, and there appears to be no message loss. The ISR is full on all partitions, and no partitions are under-replicated or offline. We upgraded a number of other clusters in our company before this production cluster, and none of them shows these errors.
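For reference, this is roughly how we verify the "no under-replicated or offline partitions" claim above: a minimal sketch using the Kafka AdminClient. The class name and the bootstrap address (prod-kafka10:9092) are placeholders, not part of our actual tooling.

import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class PartitionHealthCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address; substitute one of your brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "prod-kafka10:9092");

        try (Admin admin = Admin.create(props)) {
            // Describe every topic in the cluster.
            Map<String, TopicDescription> topics =
                    admin.describeTopics(admin.listTopics().names().get()).all().get();

            for (TopicDescription topic : topics.values()) {
                for (TopicPartitionInfo p : topic.partitions()) {
                    boolean offline = p.leader() == null;                         // no elected leader
                    boolean underReplicated = p.isr().size() < p.replicas().size();
                    if (offline || underReplicated) {
                        System.out.printf("%s-%d leader=%s isr=%d/%d%n",
                                topic.name(), p.partition(), p.leader(),
                                p.isr().size(), p.replicas().size());
                    }
                }
            }
        }
    }
}

Running this against the upgraded cluster prints nothing, which matches what the under-replicated/offline partition metrics show.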