Hi,
The follower cannot sync with the leader because the leader epochs have
diverged between the leader and the follower.
To confirm this, enable the request logger on the broker and check the
diverging-epoch field (`DivergingEpoch`) in the fetch response:
https://sourcegraph.com/github.com/apache/kafka@a640a81040f6ef6f85819b60194f0394f5f2194e/-/blob/clients/src/main/resources/common/message/FetchResponse.json?L76
This can happen when the `leader-epoch-checkpoint` file is corrupted on
the leader node. To mitigate the issue:
1. Stop the leader broker
2. Remove the `leader-epoch-checkpoint` file for the affected partition
3. Recover the partition by deleting the partition's entry from the
checkpoint files: `log-start-offset-checkpoint`,
`replication-offset-checkpoint`, `recovery-point-offset-checkpoint`, and
`cleaner-offset-checkpoint`. When you remove the entry, you also have to
decrement the entry count stored on line 2 of each of those files.
4. Remove the `.kafka_cleanshutdown` marker file....
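
For step 3, here is a minimal Python sketch of the checkpoint edit. It
assumes the usual text layout of those files: line 1 is a version number,
line 2 is the entry count, and each following line is
`<topic> <partition> <offset>`. The function name `remove_partition_entry`
and the topic name in the usage note are hypothetical, so please verify the
layout against your own files and back them up before changing anything.

```python
def remove_partition_entry(text: str, topic: str, partition: int) -> str:
    """Return checkpoint-file content with one partition entry removed.

    Assumed layout (check against your files first):
      line 1: version
      line 2: number of entries
      rest:   "<topic> <partition> <offset>", one entry per line
    """
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("expected at least a version line and a count line")

    version = lines[0]
    # Ignore blank lines so they are not counted as entries.
    entries = [ln for ln in lines[2:] if ln.strip()]

    # Keep every entry except the one for the affected topic-partition.
    target = [topic, str(partition)]
    kept = [ln for ln in entries if ln.split()[:2] != target]

    # Rewrite line 2 so it matches the number of remaining entries.
    return "\n".join([version, str(len(kept)), *kept]) + "\n"
```

With the broker stopped, you would read each of the four checkpoint files,
pass its content through this function, and write the result back. For
example, `remove_partition_entry("0\n2\norders 0 100\norders 1 250\n", "orders", 1)`
returns `"0\n1\norders 0 100\n"`.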