Hello Ankit,
Kafka Streams' rebalance protocol balances workloads based on the number of
partitions (more specifically, the number of tasks, which is derived from
the input partitions), not on the number of messages or bytes, so it cannot
handle data skewness across partitions, unfortunately.
In practice, if a KS app reads multiple topics, the data skewness could be
remedied, since an instance could get the heavy partitions of one topic
while getting light partitions of another. But if your app reads only a
single topic that has data skewness, it's hard to balance the throughput.
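To make the point above concrete, here is a toy simulation (not Kafka's actual StreamsPartitionAssignor) of an assignment that balances only partition *counts*, ignoring per-partition volume. The topic sizes, the 4-heavy/28-light split, and the chunked assignment are illustrative assumptions, not taken from the thread:

```python
import random

def assign_by_partition_count(partition_sizes, num_instances):
    """Give each instance an equal number of partitions, ignoring their volume."""
    per = len(partition_sizes) // num_instances
    return [sum(partition_sizes[i * per:(i + 1) * per])
            for i in range(num_instances)]

# Single skewed topic: 4 heavy partitions (10k events) and 28 light ones (100).
one_topic = [10_000] * 4 + [100] * 28
single_loads = assign_by_partition_count(one_topic, 4)
# If the heavy partitions land on the same instance, load is badly skewed:
# single_loads == [40400, 800, 800, 800], a ~50x imbalance.

# 50 such topics: heavy partitions from different topics are spread across
# instances, so per-instance load tends to even out.
random.seed(42)
all_partitions = [size for _ in range(50) for size in one_topic]
random.shuffle(all_partitions)  # models heavy partitions mixing across instances
multi_loads = assign_by_partition_count(all_partitions, 4)

def imbalance(loads):
    return max(loads) / min(loads)

print("single-topic imbalance:", imbalance(single_loads))           # 50.5
print("multi-topic imbalance: ", round(imbalance(multi_loads), 2))  # close to 1
```

This is only a sketch of the averaging effect: with many topics the heavy partitions spread out across instances, while with one skewed topic a count-based assignment can leave most of the load on a single instance.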
Guozhang
On Thu, Jul 7, 2022 at 7:29 AM ankit Soni <ankit.soni.geode@gmail.com>
wrote:
> Hello kafka-users,
>
> I have 50 topics, each with 32 partitions where data is being ingested
> continuously.
>
> Data is being published to these 50 topics externally (no control over the
> producers), which causes data skew among the partitions of each topic.
>
> For example: in topic-1, partition-1 contains 100 events, while
> partition-2 can have 10K events, and so on across all 50 topics.
>
> *Consuming data from all 50 topics using the kafka-streams mechanism:*
>
> - Running 4 consumer instances, all within the same consumer-group.
> - Num of threads per consumer process: 8
>
>
> As data among partitions is not evenly distributed (data-skewed partitions
> across topics), I see that 1 or 2 consumer instances (JVMs) are
> processing/consuming far fewer records than the other 2 instances. My
> guess is that these instances are processing the partitions with less data.
>
> *Can someone help: how can I balance the consumers here (distribute the
> consumer workload evenly across the 4 consumer instances)? The expectation
> is that all 4 consumer instances should process approximately the same
> number of events.*
>
> Looking forward to hearing your inputs.
>
> Thanks in advance.
>
> *Ankit.*
>
--
-- Guozhang