Hi All,
We had a topic partition (replication factor 5) go offline when the leader of the partition went down. Below is some analysis.
Kafka server - 1.1, relevant config: replica.fetch.wait.max.ms=500, replica.fetch.min.bytes=50000, replica.lag.time.max.ms=10000
Topic partition Test.Request-3 - replication factor 5, replica list [17, 425222741, 425222681, 423809494, 425222740], unclean leader election = false
Sequence of events:
1. Leader (425222740) of the partition went down.
2. The controller detected the offline broker:
[2019-08-26 13:00:22,037] INFO [Controller id=423809469] Newly added brokers: , deleted brokers: 425222740, all live brokers: 15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,423809443,423809444,423809450,423809458,423809463,423809464,423809469,423809474,423809494,425218574,425222675,425222681,425222741,425222745 (kafka.controller.KafkaController)
3. The controller sent an UpdateMetadata request, but observed only the leader in the ISR. Note that none of the ISR metrics ("UnderMinIsr", "IsrShrink") were captured before or after the partition went offline.
[2019-08-26 13:00:05,333] TRACE [Controller id=423809469 epoch=206] Sending UpdateMetadata request PartitionState(controllerEpoch=204, leader=425222740, leaderEpoch=435, isr=[425222740], zkVersion=804, replicas=[425222740, 423809494, 17, 425222741, 425222681], offlineReplicas=[]) to brokers Set(425222740, 423809458, 425222741, 24, 425218574, 25, 423809474, 26, 27, 19, 20, 21, 22, 425222681, 23, 423809450, 15, 16, 423809443, 423809494, 425222675, 17, 423809444, 18, 423809469, 425222745, 423809463, 28, 423809464, 29) for Test.Request-3 (state.change.logger)
4. New leader election failed under the OfflinePartitionLeaderElectionStrategy, since no live replicas were in the ISR.
5. All remaining replicas saw replica fetch request errors because they could not connect to the (dead) leader.
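To make the election failure above concrete, here is a minimal sketch (my own illustration, not Kafka's actual code) of the clean-election rule: pick the first replica in assignment order that is both alive and in the ISR. With unclean.leader.election.enable=false and the ISR containing only the dead leader, no candidate qualifies and the partition goes offline:

```java
import java.util.List;
import java.util.Optional;

public class CleanLeaderElection {
    // First assigned replica that is both live and in the ISR, if any.
    public static Optional<Integer> electLeader(List<Integer> assignedReplicas,
                                                List<Integer> isr,
                                                List<Integer> liveBrokers) {
        return assignedReplicas.stream()
                .filter(liveBrokers::contains)
                .filter(isr::contains)
                .findFirst();
    }

    public static void main(String[] args) {
        List<Integer> replicas = List.of(425222740, 423809494, 17, 425222741, 425222681);
        // ISR had shrunk to just the now-dead leader, as in the UpdateMetadata log line.
        List<Integer> isr = List.of(425222740);
        List<Integer> live = List.of(423809494, 17, 425222741, 425222681);
        // No live replica is in the ISR, so the election yields no leader.
        System.out.println(electLeader(replicas, isr, live).isEmpty());
    }
}
```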
Any pointers on why the ISR list shrank just before the leader went down, forcing the partition to go offline?
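For reference, my understanding of the shrink mechanism (sketched below as an illustration, not Kafka's actual code) is that the leader drops a follower from the ISR once that follower has not caught up to the leader's log end offset for more than replica.lag.time.max.ms (10000 ms with our config). A follower that silently stops fetching, or whose fetches stop reaching the leader's LEO, would therefore fall out of the ISR within roughly 10 seconds:

```java
public class IsrShrinkCheck {
    // replica.lag.time.max.ms from our config; an assumption for this sketch.
    static final long REPLICA_LAG_TIME_MAX_MS = 10_000L;

    // lastCaughtUpTimeMs: last time this follower's fetch reached the leader's
    // log end offset. If that was more than the lag window ago, the leader
    // considers the follower out of sync and shrinks the ISR.
    public static boolean isOutOfSync(long nowMs, long lastCaughtUpTimeMs) {
        return nowMs - lastCaughtUpTimeMs > REPLICA_LAG_TIME_MAX_MS;
    }
}
```

If that reading is right, the interesting question is what kept all four followers from catching up in the ~10 s before the broker died (e.g. network partition to the leader, or the leader stalling while still holding its ZooKeeper session).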
Thanks,
Koushik