
Re: Slow cluster recovery after a restart

Hello again community, any thoughts about this? I would really appreciate any clue here.

Now we are facing another problem (after the previous one), an even more serious one, since it does not allow the Broker to start:


 ERROR There was an error in one of the threads during logs loading: org.apache.kafka.common.errors.TransactionCoordinatorFencedException: Invalid coordinator epoch: 7 (zombie), 8 (current) (kafka.log.LogManager)
...
ERROR [KafkaServer id=3] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
...
 ERROR Exiting Kafka. (kafka.server.KafkaServerStartable)


This happens during the second restart described in the "Upgrading From Previous Versions" guide (https://kafka.apache.org/documentation/#upgrade_2_2_0).
Step four says:

4.- Restart the brokers one by one for the new protocol version to take effect. Once the brokers begin using the latest protocol version, it will no longer be possible to downgrade the cluster to an older version.


That is, the error appears when restarting the Brokers after bumping the protocol version from 2.0 to 2.2.
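
For reference, the only change before this second restart is the protocol bump in server.properties, roughly as below (a minimal sketch of what we changed, assuming the standard inter.broker.protocol.version property the guide refers to):

# server.properties, second rolling restart of the upgrade
# previously: inter.broker.protocol.version=2.0
inter.broker.protocol.version=2.2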

Is this because there are too many producers using transactions and EOS?
Could it be the producers' API version (0.11.0.3)?

Any help/guidance would be highly appreciated.


Cheers!
--
Jonathan


On Thu, Apr 25, 2019 at 3:57 PM Jonathan Santilli <jonathansantilli@gmail.com> wrote:
Hello, 

we are updating one of our clusters from version 2.0 to 2.2. The cluster has 4 brokers. After stopping the first broker, the cluster was still operating normally as expected, receiving data from producers and sending data to consumers.

When starting the first broker again, now with the new version 2.2, the brokers start showing lots of messages like the following:

INFO [Transaction Marker Request Completion Handler 4]: Sending client-x's transaction marker for partition topic-name-1 has failed with error org.apache.kafka.common.errors.NotLeaderForPartitionException, retrying with current coordinator epoch 0 (kafka.coordinator.transaction.TransactionMarkerRequestCompletionHandler)

...and

INFO [ReplicaFetcher replicaId=1, leaderId=4, fetcherId=8] Retrying leaderEpoch request for partition another-topic-1 as the leader reported an error: NOT_LEADER_FOR_PARTITION (kafka.server.ReplicaFetcherThread)

Eventually, all Brokers start showing logs about the transaction initialization:

INFO [TransactionCoordinator id=1] Initialized transactionalId client-x with producerId 5388 and producer epoch 6 on partition __transaction_state-36 (kafka.coordinator.transaction.TransactionCoordinator)

While all of this is happening (all those logs showing up), the cluster stops receiving data; it took around 2 hours for the cluster to be up and running normally again. After that, all our producers and consumers were able to work correctly.

Only once all Brokers stop constantly showing the logs about the transaction initialization process do the producers actually start sending data to the Brokers again, gradually.

Is that behavior expected? I mean, the fact that the cluster stops working as expected (receiving data from producers and sending data to consumers normally)?

Broker upgrade: from 2.0 to 2.2

Replication factor: 4
Min in-sync replicas: 3

Producers version: 0.11.0.3
~1100 Producers
Exactly-once semantics and transactions enabled on the Producers (see the sketch below).

Consumer version: 2.2.0
~64 Consumers
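
For context, the producers follow the standard transactional / exactly-once pattern. Below is a minimal sketch of how one of them is set up (the bootstrap address and topic name are placeholders, and "client-x" is just the transactional.id shown in the logs above, not a real value):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.errors.ProducerFencedException;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092");        // placeholder address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("enable.idempotence", "true");                 // required for exactly-once
        props.put("transactional.id", "client-x");               // stable, unique per producer
        props.put("acks", "all");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // Registers with the transaction coordinator and bumps the producer epoch,
        // which is what shows up as "Initialized transactionalId ... producer epoch N"
        producer.initTransactions();
        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("topic-name", "key", "value"));
            producer.commitTransaction();
        } catch (ProducerFencedException e) {
            // Another producer with the same transactional.id holds a newer epoch; cannot continue
        } catch (KafkaException e) {
            producer.abortTransaction();  // transient error: abort and retry later
        } finally {
            producer.close();
        }
    }
}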

I hope someone can give me a clue about what could be happening, or maybe point me to some configuration to review.


Thanks a lot in advance.
--
Santilli Jonathan



--
Santilli Jonathan
