Re: Kafka Scaling Ideas

Hm, it's an optimization for the "first layer", so if the bottleneck is in
the "second layer" (i.e. the DB write), as you mentioned, I don't think it
will make much difference.

On Tue, Dec 22, 2020 at 16:02 Yana K <yanak1019@gmail.com> wrote:

> I thought about it, but we don't have much time - will it actually
> improve performance?
>
> On Mon, Dec 21, 2020 at 4:16 PM Haruki Okada <ocadaruma@gmail.com> wrote:
>
> > About the "first layer", right?
> > Then it's better to make sure that you don't get() the result of
> > Producer#send() for each message, because doing so defeats
> > producer batching.
> > The Kafka producer batches messages by default and it's very efficient,
> > so if you produce in an async way, it rarely becomes a bottleneck.
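> >
> > For example, a minimal sketch of the two styles (producer/record are
> > placeholders, not from the actual application):
> >
> >   import org.apache.kafka.clients.producer.*;
> >
> >   // Anti-pattern: blocking on every send defeats batching.
> >   producer.send(record).get();
> >
> >   // Preferred: asynchronous send with a callback; the producer is then
> >   // free to batch records per partition (tune linger.ms / batch.size).
> >   producer.send(record, (metadata, exception) -> {
> >       if (exception != null) {
> >           exception.printStackTrace(); // replace with real error handling
> >       }
> >   });
> >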
> > > Also are there any producer optimizations
> >
> > By the way, if the "first layer" just filters and then produces messages
> > without interacting with any other external DB, using KafkaStreams would
> > be much easier.
> >
> > On Tue, Dec 22, 2020 at 3:27 Yana K <yanak1019@gmail.com> wrote:
> >
> > > Thanks!
> > >
> > > Also are there any producer optimizations anyone can think of in this
> > > scenario?
> > >
> > >
> > >
> > > On Mon, Dec 21, 2020 at 8:58 AM Joris Peeters <joris.mg.peeters@gmail.com> wrote:
> > >
> > > > I'd probably just do it by experiment for your concrete data.
> > > >
> > > > Maybe generate a few million synthetic data rows, and for-each-batch
> > > > insert them into a dev DB, with an outer grid search over various
> > > > candidate batch sizes. You're looking to optimise for flat-out rows/s,
> > > > so whichever batch size wins (given a fixed number of total rows) is
> > > > near-optimal.
> > > > You can repeat the exercise with N simultaneous threads to inspect how
> > > > batch sizes and multiple partitions P would interact (which might well
> > > > be sublinear in P in case of e.g. transactions etc.).
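> > > >
> > > > A rough sketch of that benchmark loop (JDBC; DEV_DB_URL, the table,
> > > > and syntheticPayload() are placeholders):
> > > >
> > > >   import java.sql.*;
> > > >
> > > >   int totalRows = 2_000_000;
> > > >   for (int batchSize : new int[] {100, 500, 1000, 5000, 10000}) {
> > > >       long start = System.nanoTime();
> > > >       try (Connection conn = DriverManager.getConnection(DEV_DB_URL);
> > > >            PreparedStatement ps = conn.prepareStatement(
> > > >                "INSERT INTO events (id, payload) VALUES (?, ?)")) {
> > > >           conn.setAutoCommit(false);
> > > >           for (int i = 0; i < totalRows; i++) {
> > > >               ps.setLong(1, i);
> > > >               ps.setString(2, syntheticPayload(i));
> > > >               ps.addBatch();
> > > >               if ((i + 1) % batchSize == 0) {
> > > >                   ps.executeBatch();  // one round-trip per batch
> > > >                   conn.commit();
> > > >               }
> > > >           }
> > > >           ps.executeBatch();          // flush the remainder
> > > >           conn.commit();
> > > >       }
> > > >       double secs = (System.nanoTime() - start) / 1e9;
> > > >       System.out.printf("batch=%d -> %.0f rows/s%n",
> > > >                         batchSize, totalRows / secs);
> > > >   }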
> > > >
> > > > On Mon, Dec 21, 2020 at 4:48 PM Yana K <yanak1019@gmail.com> wrote:
> > > >
> > > > > Thanks Haruki and Joris.
> > > > >
> > > > > Haruki:
> > > > > Thanks for the detailed calculations. Really appreciate it. What
> > > > > tool/lib is used to load test Kafka?
> > > > > So we've got one consumer group and are running 7 instances of the
> > > > > application - that should be good enough, correct?
> > > > >
> > > > > Joris:
> > > > > Great point.
> > > > > The DB insert is a bottleneck (and hence we moved it to its own
> > > > > layer) - and we are batching, but wondering what the best way is to
> > > > > calculate the batch size.
> > > > >
> > > > > Thanks,
> > > > > Yana
> > > > >
> > > > > On Mon, Dec 21, 2020 at 1:39 AM Joris Peeters <joris.mg.peeters@gmail.com> wrote:
> > > > >
> > > > > > Do you know why your consumers are so slow? 12E6 msg/hour is
> > > > > > 3333 msg/s, which is not very high from a Kafka point of view.
> > > > > > As you're doing database inserts, I suspect that is where the
> > > > > > bottleneck lies.
> > > > > >
> > > > > > If, for example, you're doing a single-row insert in a SQL DB
> > > > > > for every message, then this would incur a lot of overhead. Yes,
> > > > > > you can somewhat alleviate that by parallelising - i.e.
> > > > > > increasing the partition count - but it is also worth looking at
> > > > > > batch inserts, if you aren't yet. Say, each consumer waits for
> > > > > > 1000 messages or for 5 seconds to have passed (whichever comes
> > > > > > first) and then does a single bulk insert of the messages it has
> > > > > > received, followed by a manual commit.
> > > > > >
> > > > > > [A] You might already be doing this, and [B] your DB of choice
> > > > > > might not support bulk inserts (although most do), but otherwise
> > > > > > I'd expect this to work a lot better than increasing the
> > > > > > partition count.
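> > > > > >
> > > > > > A minimal sketch of that consume-batch-commit loop (plain consumer
> > > > > > API; bulkInsert() is a hypothetical helper doing one multi-row
> > > > > > INSERT):
> > > > > >
> > > > > >   import java.time.Duration;
> > > > > >   import java.util.*;
> > > > > >   import org.apache.kafka.clients.consumer.*;
> > > > > >
> > > > > >   List<ConsumerRecord<String, String>> buffer = new ArrayList<>();
> > > > > >   long deadline = System.currentTimeMillis() + 5_000;
> > > > > >   while (true) {
> > > > > >       ConsumerRecords<String, String> records =
> > > > > >           consumer.poll(Duration.ofMillis(200));
> > > > > >       records.forEach(buffer::add);
> > > > > >       // flush on 1000 buffered msgs or 5s, whichever comes first
> > > > > >       if (buffer.size() >= 1000
> > > > > >               || System.currentTimeMillis() >= deadline) {
> > > > > >           if (!buffer.isEmpty()) {
> > > > > >               bulkInsert(buffer);     // single bulk write to the DB
> > > > > >               consumer.commitSync();  // manual commit after the write
> > > > > >               buffer.clear();
> > > > > >           }
> > > > > >           deadline = System.currentTimeMillis() + 5_000;
> > > > > >       }
> > > > > >   }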
> > > > > >
> > > > > > On Mon, Dec 21, 2020 at 8:10 AM Haruki Okada <ocadaruma@gmail.com> wrote:
> > > > > >
> > > > > > > About load test:
> > > > > > > I think it'd be better to monitor the per-message process
> > > > > > > latency and estimate the required partition count from it,
> > > > > > > because it determines the max throughput per single partition.
> > > > > > > - Say you have to process 12 million messages/hour = 3333
> > > > > > > messages/sec.
> > > > > > > - If you have 7 partitions (thus 7 parallel consumers at
> > > > > > > maximum), a single consumer should process 3333 / 7 = 476
> > > > > > > messages/sec.
> > > > > > > - That means the process latency per single message should be
> > > > > > > lower than 2.1 milliseconds (1000 / 476).
> > > > > > > => If you have 14 partitions, it becomes 4.2 milliseconds.
> > > > > > >
> > > > > > > So the required partition count can be calculated from the
> > > > > > > per-message process latency. (I think Spring-Kafka can easily
> > > > > > > be integrated with Prometheus, so you can use that to measure
> > > > > > > it.)
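> > > > > > >
> > > > > > > For instance, a sketch of recording that latency with Micrometer
> > > > > > > (which Spring Boot apps typically expose to Prometheus; the
> > > > > > > metric name and handle() are placeholders):
> > > > > > >
> > > > > > >   import io.micrometer.core.instrument.MeterRegistry;
> > > > > > >   import io.micrometer.core.instrument.Timer;
> > > > > > >
> > > > > > >   Timer processTimer = Timer.builder("consumer.process.latency")
> > > > > > >       .publishPercentiles(0.5, 0.99)
> > > > > > >       .register(registry);  // registry: an injected MeterRegistry
> > > > > > >
> > > > > > >   // in the listener, time the per-message work:
> > > > > > >   processTimer.record(() -> handle(message));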
> > > > > > >
> > > > > > > About increasing the instance count:
> > > > > > > - It depends on current system resource usage.
> > > > > > >   * If the system resources are not so busy (likely, because
> > > > > > > the consumer mostly just waits for the DB write to return),
> > > > > > > you don't need to increase consumer instances.
> > > > > > >   * But I think you should make sure that a single consumer
> > > > > > > instance isn't assigned multiple partitions, to fully
> > > > > > > parallelize consumption across partitions. (If I remember
> > > > > > > correctly, ConcurrentMessageListenerContainer has a property
> > > > > > > to configure the concurrency.)
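> > > > > > >
> > > > > > > If that's the route, a sketch of the container factory setting
> > > > > > > (assuming Spring-Kafka; bean wiring simplified):
> > > > > > >
> > > > > > >   import org.springframework.context.annotation.Bean;
> > > > > > >   import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
> > > > > > >   import org.springframework.kafka.core.ConsumerFactory;
> > > > > > >
> > > > > > >   @Bean
> > > > > > >   ConcurrentKafkaListenerContainerFactory<String, String> factory(
> > > > > > >           ConsumerFactory<String, String> cf) {
> > > > > > >       var factory =
> > > > > > >           new ConcurrentKafkaListenerContainerFactory<String, String>();
> > > > > > >       factory.setConsumerFactory(cf);
> > > > > > >       factory.setConcurrency(7);  // up to one thread per partition
> > > > > > >       return factory;
> > > > > > >   }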
> > > > > > >
> > > > > > > On Mon, Dec 21, 2020 at 15:51 Yana K <yanak1019@gmail.com> wrote:
> > > > > > >
> > > > > > > > So as the next step I see increasing the partitions of the
> > > > > > > > 2nd topic - do I increase the instances of the consumer to
> > > > > > > > match, or keep it at 7?
> > > > > > > > Anything else (besides researching those libs)?
> > > > > > > >
> > > > > > > > Are there any good tools for load testing Kafka?
> > > > > > > >
> > > > > > > > On Sun, Dec 20, 2020 at 7:23 PM Haruki Okada <ocadaruma@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > It depends on how you manually commit offsets.
> > > > > > > > > Auto-commit basically commits offsets in an async manner,
> > > > > > > > > so as long as you do manual commits in the same way, there
> > > > > > > > > should be not much difference.
> > > > > > > > >
> > > > > > > > > And generally, the offset-commit mode doesn't make much
> > > > > > > > > difference to performance, regardless of manual/auto or
> > > > > > > > > async/sync, unless the offset-commit latency takes up a
> > > > > > > > > significant share of the processing time (e.g. you commit
> > > > > > > > > offsets synchronously in every poll() loop).
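> > > > > > > > >
> > > > > > > > > For illustration, the two manual-commit styles on the plain
> > > > > > > > > consumer API (a sketch, not from the actual application):
> > > > > > > > >
> > > > > > > > >   // async: non-blocking, similar in cost to auto-commit
> > > > > > > > >   consumer.commitAsync((offsets, ex) -> {
> > > > > > > > >       if (ex != null) {
> > > > > > > > >           System.err.println("commit failed: " + ex);
> > > > > > > > >       }
> > > > > > > > >   });
> > > > > > > > >
> > > > > > > > >   // sync: blocks the poll loop until the broker acknowledges
> > > > > > > > >   consumer.commitSync();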
> > > > > > > > >
> > > > > > > > > On Mon, Dec 21, 2020 at 11:08 Yana K <yanak1019@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Thank you so much Marina and Haruki.
> > > > > > > > > >
> > > > > > > > > > Marina's response:
> > > > > > > > > > - When you say "if you are sure there is no room for perf
> > > > > > > > > > optimization of the processing itself" - do you mean
> > > > > > > > > > code-level optimizations? Can you please explain?
> > > > > > > > > > - On the second topic you say "I'd say at least 40" - is
> > > > > > > > > > this based on 12 million records/hour?
> > > > > > > > > > - "if you can change the incoming topic" - I don't think
> > > > > > > > > > it is possible :(
> > > > > > > > > > - "you could artificially achieve the same by adding one
> > > > > > > > > > more step (service) in your pipeline" - this is the next
> > > > > > > > > > thing - but I want to be sure this will help, given we
> > > > > > > > > > have to maintain one more layer.
> > > > > > > > > >
> > > > > > > > > > Haruki's response:
> > > > > > > > > > - "One possible solution is creating an intermediate
> > > > > > > > > > topic" - I already did it.
> > > > > > > > > > - I'll look at Decaton - thx.
> > > > > > > > > >
> > > > > > > > > > Are there any thoughts on auto commit vs manual commit -
> > > > > > > > > > whether it can improve performance while consuming?
> > > > > > > > > >
> > > > > > > > > > Yana
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Sat, Dec 19, 2020 at 7:01 PM Haruki Okada <ocadaruma@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi.
> > > > > > > > > > >
> > > > > > > > > > > Yeah, Spring-Kafka processes messages sequentially, so
> > > > > > > > > > > the consumer throughput will be capped by the database
> > > > > > > > > > > latency of each single write.
> > > > > > > > > > > One possible solution is creating an intermediate topic
> > > > > > > > > > > (or altering the source topic) with many more
> > > > > > > > > > > partitions, as Marina suggested.
> > > > > > > > > > >
> > > > > > > > > > > I'd like to suggest another solution, which is
> > > > > > > > > > > multi-threaded processing per single partition.
> > > > > > > > > > > Decaton (https://github.com/line/decaton) is a library
> > > > > > > > > > > to achieve it.
> > > > > > > > > > >
> > > > > > > > > > > Also, Confluent has published a blog post about
> > > > > > > > > > > parallel-consumer (
> > > > > > > > > > > https://www.confluent.io/blog/introducing-confluent-parallel-message-processing-client/
> > > > > > > > > > > ) for that purpose, but it seems it's still in the BETA
> > > > > > > > > > > stage.
> > > > > > > > > > >
> > > > > > > > > > > On Sun, Dec 20, 2020 at 11:41 Marina Popova <ppine7sub@protonmail.com.invalid> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > The way I see it, you can only do a few things - if
> > > > > > > > > > > > you are sure there is no room for perf optimization of
> > > > > > > > > > > > the processing itself:
> > > > > > > > > > > > 1. speed up your processing per consumer thread: which
> > > > > > > > > > > > you already tried by splitting your logic into a
> > > > > > > > > > > > 2-step pipeline instead of 1-step, and delegating the
> > > > > > > > > > > > work of writing to a DB to the second step (make sure
> > > > > > > > > > > > your second, intermediate Kafka topic is created with
> > > > > > > > > > > > many more partitions, to be able to parallelize your
> > > > > > > > > > > > work much more - I'd say at least 40)
> > > > > > > > > > > > 2. if you can change the incoming topic - I would
> > > > > > > > > > > > create it with many more partitions as well - say at
> > > > > > > > > > > > least 40 or so - to parallelize your first-step
> > > > > > > > > > > > service's processing more
> > > > > > > > > > > > 3. and if you can't increase partitions for the
> > > > > > > > > > > > original topic - you could artificially achieve the
> > > > > > > > > > > > same by adding one more step (service) in your
> > > > > > > > > > > > pipeline that would just read data from the original
> > > > > > > > > > > > 7-partition topic1 and push it unchanged into a new
> > > > > > > > > > > > topic2 with, say, 40 partitions - and then have your
> > > > > > > > > > > > other services pick up from this topic2 (a sketch of
> > > > > > > > > > > > such a forwarder is below)
> > > > > > > > > > > >
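> > > > > > > > > > > > A minimal sketch of that pass-through forwarder
> > > > > > > > > > > > (plain consumer/producer API; topic names and all
> > > > > > > > > > > > configs are placeholders):
> > > > > > > > > > > >
> > > > > > > > > > > >   import java.time.Duration;
> > > > > > > > > > > >   import java.util.List;
> > > > > > > > > > > >   import org.apache.kafka.clients.consumer.*;
> > > > > > > > > > > >   import org.apache.kafka.clients.producer.*;
> > > > > > > > > > > >
> > > > > > > > > > > >   consumer.subscribe(List.of("topic1"));
> > > > > > > > > > > >   while (true) {
> > > > > > > > > > > >       for (ConsumerRecord<String, String> rec :
> > > > > > > > > > > >               consumer.poll(Duration.ofMillis(500))) {
> > > > > > > > > > > >           // async send, keyed so per-key ordering is
> > > > > > > > > > > >           // preserved across topic2's 40 partitions
> > > > > > > > > > > >           producer.send(new ProducerRecord<>(
> > > > > > > > > > > >               "topic2", rec.key(), rec.value()));
> > > > > > > > > > > >       }
> > > > > > > > > > > >       consumer.commitAsync();
> > > > > > > > > > > >   }
> > > > > > > > > > > >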
> > > > > > > > > > > >
> > > > > > > > > > > > good luck,
> > > > > > > > > > > > Marina
> > > > > > > > > > > >
> > > > > > > > > > > > Sent with ProtonMail Secure Email.
> > > > > > > > > > > >
> > > > > > > > > > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > > > > > > > > > On Saturday, December 19, 2020 6:46 PM, Yana K <yanak1019@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi
> > > > > > > > > > > > >
> > > > > > > > > > > > > I am new to the Kafka world and running into this
> > > > > > > > > > > > > scale problem. I thought of reaching out to the
> > > > > > > > > > > > > community in case someone can help.
> > > > > > > > > > > > > The problem is that I am trying to consume from a
> > > > > > > > > > > > > Kafka topic that can have a peak of 12 million
> > > > > > > > > > > > > messages/hour. That topic is not under my control -
> > > > > > > > > > > > > it has 7 partitions and sends JSON payloads.
> > > > > > > > > > > > > I have written a consumer (I've used Java and the
> > > > > > > > > > > > > Spring-Kafka lib) that reads that data, filters it,
> > > > > > > > > > > > > and then loads it into a database. I ran into a huge
> > > > > > > > > > > > > consumer lag that would take 10-12 hours to catch
> > > > > > > > > > > > > up. I have 7 instances of my application running to
> > > > > > > > > > > > > match the 7 partitions, and I am using auto commit.
> > > > > > > > > > > > > Then I thought of splitting the write logic off into
> > > > > > > > > > > > > a separate layer. So now my architecture has a
> > > > > > > > > > > > > component that reads, filters, and produces the data
> > > > > > > > > > > > > to an internal topic (also 7 partitions, but as you
> > > > > > > > > > > > > see it's under my control). Then a consumer picks up
> > > > > > > > > > > > > data from that topic and writes it to the database.
> > > > > > > > > > > > > It's better, but it still takes 3-5 hours for the
> > > > > > > > > > > > > consumer lag to catch up.
> > > > > > > > > > > > > Am I missing something fundamental? Are there any
> > > > > > > > > > > > > other ideas for optimization that can help overcome
> > > > > > > > > > > > > this scale challenge? Any pointers and articles will
> > > > > > > > > > > > > help too.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Appreciate your help with this.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > Yana
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > ========================
> > > > > > > > > > > Okada Haruki
> > > > > > > > > > > ocadaruma@gmail.com
> > > > > > > > > > > ========================
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > ========================
> > > > > > > > > Okada Haruki
> > > > > > > > > ocadaruma@gmail.com
> > > > > > > > > ========================
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > ========================
> > > > > > > Okada Haruki
> > > > > > > ocadaruma@gmail.com
> > > > > > > ========================
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > ========================
> > Okada Haruki
> > ocadaruma@gmail.com
> > ========================
> >
>


--
========================
Okada Haruki
ocadaruma@gmail.com
========================
