Hi all,
I have an issue that spans Kafka and K8s. Do you think filing a Kafka bug is appropriate? Is there an alternative configuration that would prevent this from happening again? Would it be any different with KRaft?
Here's what happened:
* A big disruption occurs on the node running the kafka-2 broker. Lots of I/O, OCI, and Docker errors in /var/log/messages.
* The Controller sees kafka-2 disappear and moves leadership to the other brokers that hold replicas of its partitions. Everything is fine.
* The node and kafka-2 aren't actually dead. The Controller sees kafka-2 return and marks it as part of the cluster again. I guess it briefly lost its ZooKeeper registration and then re-registered itself.
* However, the kubelet is unresponsive, so the rest of the K8s cluster has marked the node as unavailable ("Kubelet stopped posting node status").
* Because of this, K8s has removed the kafka-2 pod from the headless service, so its DNS name can no longer be resolved.
* A preferred replica leader election happens and the Controller assigns partition leadership back to kafka-2, but nobody can resolve it, so producers are now stuck.
Finally, we rebooted the node. This caused the Controller to see kafka-2 go away again, at which point it reassigned leadership back to the available brokers. But for roughly an hour before that, all our producers were stuck because the leader for all those partitions was unreachable.
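On the client side, one mitigation I'm considering (it only limits the damage, it doesn't fix the controller behaviour) is bounding how long the producer is allowed to block. A minimal sketch with the standard Java client, assuming the usual timeout configs (max.block.ms, delivery.timeout.ms, request.timeout.ms); the bootstrap address and topic name are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BoundedBlockingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-headless:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "0");
        // Cap how long send() may block waiting for metadata or for buffer space.
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "10000");
        // Fail a record that cannot be delivered within 30s instead of waiting indefinitely.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "30000");
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "15000");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("example-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            // With the timeouts above the send fails with an exception
                            // instead of the application hanging for an hour.
                            System.err.println("Send failed: " + exception);
                        }
                    });
        }
    }
}

That would only turn the silent hang into visible errors on our side; it doesn't address the Controller keeping an unresolvable broker as leader.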
The Controller's logs are full of hundreds of UnknownHostExceptions, so it should be aware that the broker has problems. Yet it leaves kafka-2 as the leader in the metadata.
Kafka: 3.4.0, 9 brokers, replication factor 2
Deployed by: Bitnami Kafka chart 21.2.0
Stuck producers: standard Java client (plain producer, no Streams, etc.)
Connections: PLAINTEXT, acks=0
Metadata: ZooKeeper
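For the configuration question, one idea I'm weighing (an untested assumption on my part, not something from the docs about this failure mode) is to set auto.leader.rebalance.enable=false on the brokers so the Controller doesn't automatically move leadership back, and instead trigger preferred leader elections ourselves only when every broker's hostname actually resolves. A rough sketch with the Java AdminClient; the bootstrap address is a placeholder:

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Collection;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.Node;

public class GuardedPreferredElection {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-headless:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            // Only proceed if every registered broker's advertised hostname resolves.
            Collection<Node> nodes = admin.describeCluster().nodes().get();
            for (Node node : nodes) {
                try {
                    InetAddress.getByName(node.host());
                } catch (UnknownHostException e) {
                    System.err.println("Broker " + node.id() + " (" + node.host()
                            + ") does not resolve; skipping preferred leader election.");
                    return;
                }
            }
            // null = all partitions; runs a preferred replica leader election.
            admin.electLeaders(ElectionType.PREFERRED, null).all().get();
            System.out.println("Preferred leader election triggered.");
        }
    }
}

This would only work around the symptom from the outside; I'd still like to know whether the Controller (ZooKeeper or KRaft) is expected to demote a leader it can no longer resolve.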
Thank you!
Meg