KafkaJS 2.2.4 – Request Produce(version: 7) timed out, TCP Window Full, and prolonged instability after rolling restart
Hello,
We are experiencing stability issues in our Kafka architecture during
chunked file transfers, especially during load spikes or after broker
restarts. Here are the details:
------------------------------
📦 *Architecture*:
- KafkaJS v2.2.4 used as both *producer and consumer*.
- Kafka cluster with *3 brokers*, connected to a *shared NAS*.
- Files are *split into 500 KB chunks* and sent to a Kafka topic (a minimal producer sketch follows this list).
- A dedicated consumer reassembles the files on the NAS.
- Each file is assigned to the *topic/broker of its first chunk* to preserve ordering.
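
To make the chunking flow concrete, here is a minimal sketch of the producer side, assuming chunks are keyed by a file identifier so they stay on one partition; the topic name, clientId, broker addresses, and header names are illustrative, not our exact code:

const fs = require('fs')
const { Kafka } = require('kafkajs')

const kafka = new Kafka({
  clientId: 'file-chunk-producer', // placeholder
  brokers: ['broker1:9092', 'broker2:9092', 'broker3:9092'], // placeholders
})
const producer = kafka.producer() // producer.connect() must be awaited once before sending

const CHUNK_SIZE = 500 * 1024 // 500 KB per chunk

// Keying every chunk by fileId keeps all chunks of one file on the same
// partition, so the reassembly consumer receives them in order.
async function sendFile(topic, fileId, filePath) {
  const data = fs.readFileSync(filePath)
  const totalChunks = Math.ceil(data.length / CHUNK_SIZE)
  for (let i = 0; i < totalChunks; i++) {
    const chunk = data.subarray(i * CHUNK_SIZE, (i + 1) * CHUNK_SIZE)
    await producer.send({
      topic,
      messages: [{
        key: fileId,
        value: chunk,
        headers: { chunkIndex: String(i), totalChunks: String(totalChunks) },
      }],
    })
  }
}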
------------------------------
❌ *Observed issues*:
1. *Request Produce(version: 7) timed out errors during load spikes*:
   - For roughly *15 minutes*, KafkaJS producers fail with the error *Request Produce(key: 0, version: 7) timed out*.
   - This generally occurs *during traffic spikes*, typically when *multiple files are being sent simultaneously*.
2. *Network behavior – TCP Window Full*:
   - During this period, *TCP Window Full* messages appear on the network.
   - No *CPU or RAM spikes* are observed during the blockage.
   - However, a significant *CPU/RAM increase* occurs *when the system recovers*, suggesting a *sudden backlog clearance*.
3. *Recovery after the blockage*:
   - Connections are reset.
   - A rollback seems to occur, then messages are processed quickly and without errors.
   - The issue may reoccur at the next volume spike.
4. *Amplified behavior and prolonged instability after a rolling restart*:
   - The problem is more frequent after a *rolling restart of the brokers* (spaced 20 to 30 minutes apart).
   - Instability can persist for *several days or even weeks* before subsiding.
   - This suggests a *desynchronization or prolonged delay in partition reassignment, metadata updates, or coordination*.
------------------------------
⚙️ *KafkaJS configuration*:

{
  requestTimeout: 30000, // 30 s
  retry: {
    initialRetryTime: 1000, // 1 s
    retries: 1500 // maximum value, but in practice we rarely exceed ~30 retries (~15 minutes)
  }
}
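
For context, this is roughly how these settings are wired into the client; the clientId and broker addresses below are placeholders:

const { Kafka } = require('kafkajs')

const kafka = new Kafka({
  clientId: 'chunk-transfer', // placeholder
  brokers: ['broker1:9092', 'broker2:9092', 'broker3:9092'], // placeholders
  requestTimeout: 30000, // 30 s
  retry: {
    initialRetryTime: 1000, // 1 s
    retries: 1500, // in practice we rarely exceed ~30 retries (~15 minutes)
  },
})

const producer = kafka.producer()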
------------------------------
❓ *Questions*:
1. Is the Request Produce(version: 7) timed out error generally related to *broker congestion*, *network issues*, or *partition imbalance*?
2. Do the TCP Window Full messages indicate *network or broker buffer saturation*? Are there Kafka logs or metrics you would recommend monitoring?
3. Could assigning each file strictly to a single broker/topic lead to *local saturation*?
4. Could our KafkaJS retry configuration (1500 max retries, but rarely exceeding 30 in practice) *exacerbate congestion*? Would a *progressive backoff strategy* be preferable? (A sketch of what we mean follows this list.)
5. What are the *best KafkaJS practices* for chunked file flows: compression, batching, flush intervals, etc.?
6. Can a *rolling restart* of the brokers cause *temporary metadata desynchronization* or *excessive client wait times*? And why might the instability last so long?
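
For reference, the kind of backoff, batching, and compression configuration questions 4 and 5 are about would look roughly like the sketch below; the values are illustrative examples, not settings we have validated:

const { Kafka, CompressionTypes } = require('kafkajs')

const kafka = new Kafka({
  clientId: 'chunk-transfer', // placeholder
  brokers: ['broker1:9092', 'broker2:9092', 'broker3:9092'], // placeholders
  retry: {
    initialRetryTime: 300,
    maxRetryTime: 30000, // cap each backoff step at 30 s
    factor: 0.2,         // jitter factor (KafkaJS default)
    multiplier: 2,       // exponential growth between attempts (KafkaJS default)
    retries: 8,          // fail fast instead of retrying for ~15 minutes
  },
})

const producer = kafka.producer()

// Question 5: batching several chunks into one request and compressing them.
async function sendChunks(topic, fileId, chunks) {
  await producer.send({
    topic,
    compression: CompressionTypes.GZIP,
    messages: chunks.map((chunk, i) => ({
      key: fileId,
      value: chunk,
      headers: { chunkIndex: String(i) },
    })),
  })
}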
Thank you in advance for your help. Any diagnostic insights or optimization
recommendations would be greatly appreciated.
Best regards,
Mathias VANHEMS