Hi everyone,
I recently observed some behavior in a KRaft-based Kafka cluster (v3.7.2)
that I was hoping to get some technical clarification on.
My Setup:
- *Architecture:* 3 Controllers / 3 Brokers (KRaft mode).
- *Version:* 3.7.2.
The Scenario:
1. I brought down *2 out of 3 controllers*, effectively breaking the KRaft quorum. As expected, topic describe and metadata quorum commands failed.
2. While the quorum was down, I brought down *1 broker*.
3. *Producer Behavior:* Remained fully functional. It automatically redirected messages to the new partition leaders on the remaining nodes without issue.
4. *Consumer Behavior:* The consumer initially stopped when the broker went down, but after I restarted it, it resumed processing without problems. I also waited five minutes for any caches to clear, and it was still able to consume after the restart, though this wasn't the case across all tests.
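For context, the commands I'm referring to in step 1 look roughly like this (the bootstrap address and topic name are placeholders for my actual cluster values):

```shell
# Placeholder values; substitute your own broker address and topic.
BOOTSTRAP=broker1:9092

# Check the controller quorum status (this failed while 2 of 3 controllers were down):
bin/kafka-metadata-quorum.sh --bootstrap-server "$BOOTSTRAP" describe --status

# Describe a topic (this also failed during the outage):
bin/kafka-topics.sh --bootstrap-server "$BOOTSTRAP" --describe --topic my-topic
```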
My Questions:
I am curious how the data plane remains so resilient when the control plane
is offline:
- *Partition Management:* How is partition leadership/assignment maintained and communicated to clients when the Controller Quorum is unavailable to handle new metadata updates?
- *Consumer Rebalancing:* In the event of a broker failure while the quorum is down, how does the Group Coordinator handle partition assignments? Is it simply relying on the last "frozen" metadata state cached on the remaining brokers?
- *Internal Offsets:* Since the __consumer_offsets topic is just another topic, am I correct in assuming that as long as those partitions have surviving replicas, the consumer group logic functions independently of the KRaft quorum status and there would be no impact on data integrity?
I'd appreciate any insights into the internal mechanics of how the brokers
utilize the cached metadata log during a total controller outage.
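(In case it helps frame the question: my understanding is that each broker keeps a local replica of the __cluster_metadata log, which can be inspected with kafka-dump-log.sh. The path below is illustrative; the actual location depends on the broker's log.dirs setting.)

```shell
# Decode a broker's local copy of the cluster metadata log.
# The segment path is an example; check log.dirs on your broker.
bin/kafka-dump-log.sh --cluster-metadata-decoder \
  --files /var/lib/kafka/data/__cluster_metadata-0/00000000000000000000.log
```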
Best regards,
Vruttant Mankad