Slow cluster recovery after a restart

Hello,

We are upgrading one of our Kafka clusters from version 2.0 to 2.2. The
cluster has 4 brokers. After we stopped the first broker, the cluster kept
operating normally, as expected: it received data from producers and
delivered data to consumers.

When we started the first broker again with the new version 2.2, the
brokers began logging many messages like the following:

INFO [Transaction Marker Request Completion Handler 4]: Sending client-x's
transaction marker for partition topic-name-1 has failed with error
org.apache.kafka.common.errors.NotLeaderForPartitionException, retrying
with current coordinator epoch 0
(kafka.coordinator.transaction.TransactionMarkerRequestCompletionHandler)

...and

INFO [ReplicaFetcher replicaId=1, leaderId=4, fetcherId=8] Retrying
leaderEpoch request for partition another-topic-1 as the leader reported an
error: NOT_LEADER_FOR_PARTITION (kafka.server.ReplicaFetcherThread)

Eventually, all brokers start logging transaction initialization
messages:

INFO [TransactionCoordinator id=1] Initialized transactionalId client-x
with producerId 5388 and producer epoch 6 on partition
__transaction_state-36
(kafka.coordinator.transaction.TransactionCoordinator)

While all of this was being logged, the cluster stopped receiving data. It
took around 2 hours for the cluster to be up and running normally again,
after which all our producers and consumers were able to work correctly.

Only once the brokers stopped constantly logging the transaction
initialization messages did the producers gradually start sending data to
the brokers again.
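
To measure how long this initialization phase lasts and how many producers
go through it, the transaction initialization messages could be tallied
from the broker log. This is only a minimal sketch: the function name is
made up for illustration, and it assumes each message sits on a single line
in the real log file (the excerpts in this mail are wrapped).

```python
import re
from collections import Counter

# Matches the TransactionCoordinator message quoted in this mail.
# Assumption: the message is on one line in the actual broker log.
INIT_RE = re.compile(
    r"\[TransactionCoordinator id=(?P<broker>\d+)\] "
    r"Initialized transactionalId (?P<txn_id>\S+) "
    r"with producerId (?P<pid>\d+) and producer epoch (?P<epoch>\d+) "
    r"on partition (?P<partition>\S+)"
)


def count_initializations(lines):
    """Count how often each transactionalId was initialized (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = INIT_RE.search(line)
        if match:
            counts[match.group("txn_id")] += 1
    return counts


# The two log messages quoted in this mail, joined onto single lines.
sample = [
    "INFO [TransactionCoordinator id=1] Initialized transactionalId client-x "
    "with producerId 5388 and producer epoch 6 on partition "
    "__transaction_state-36 "
    "(kafka.coordinator.transaction.TransactionCoordinator)",
    "INFO [ReplicaFetcher replicaId=1, leaderId=4, fetcherId=8] Retrying "
    "leaderEpoch request for partition another-topic-1 as the leader reported "
    "an error: NOT_LEADER_FOR_PARTITION (kafka.server.ReplicaFetcherThread)",
]

counts = count_initializations(sample)
for txn_id, n in counts.items():
    print(txn_id, n)
```

Running it over the full log of each broker would show whether every
producer is re-initialized once or repeatedly during the 2 hours.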

Is this behavior expected? I mean the fact that the cluster stops working
normally (receiving data from producers and sending data to consumers).

Broker upgrade: 2.0 to 2.2

Replication factor: 4
Min in-sync replicas: 3

Producer version: 0.11.0.3
~1100 producers
Exactly-once semantics and transactions enabled in the producers.

Consumer version: 2.2.0
~64 consumers
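
For reference, the margin that the replication settings above leave during
a rolling restart can be worked out as follows. This is a minimal sketch
with made-up helper names, and it assumes the producers write with acks=all
(which transactional producers require), so a partition only accepts writes
while its in-sync replica count is at least the configured minimum.

```python
REPLICATION_FACTOR = 4
MIN_INSYNC_REPLICAS = 3


def tolerated_offline_replicas(replication_factor, min_insync_replicas):
    """Replicas that may leave the ISR before acks=all writes are rejected."""
    return replication_factor - min_insync_replicas


def can_accept_writes(in_sync_replicas, min_insync_replicas):
    """True while a partition still satisfies the minimum ISR size."""
    return in_sync_replicas >= min_insync_replicas


tolerated = tolerated_offline_replicas(REPLICATION_FACTOR, MIN_INSYNC_REPLICAS)
print("replicas that may be out of sync:", tolerated)

# One broker down (or restarted but not yet caught up): 3 replicas in sync.
print("one replica out:", can_accept_writes(3, MIN_INSYNC_REPLICAS))

# A second replica falling behind at the same time: 2 replicas in sync.
print("two replicas out:", can_accept_writes(2, MIN_INSYNC_REPLICAS))
```

So with these settings only one replica per partition can be out of sync
at a time, and the restarted broker already uses up that margin until it
has caught up.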

I hope someone can give me a clue about what could be happening, or point
me to some configuration to review.


Thanks a lot in advance.
--
Santilli Jonathan
