
Entire Kafka Connect cluster stuck because of a stuck sink connector

We have stumbled upon an issue on a running cluster with multiple
source/sink connectors:

1. One of our connectors was a JDBC sink connector connected to a SQL
Server database (using the Oracle JDBC driver).
2. It turned out that the DB instance had a problem that caused all queries
to hang indefinitely, which in turn made the connector's start method hang
indefinitely.
3. After some time, the entire Kafka Connect cluster became unavailable and
the REST API stopped responding, returning {"error_code":500,"message":"Request
timed out"} for most requests.
4. Pausing the problematic connector (just before the deletion of the
consumer group) or deleting it allowed the cluster to run normally again.

We were able to reproduce the same issue by adding Thread.sleep(300000) to
the start method or to the put method of the ConnectorTask.

Is there any wiki page or documentation that describes how to handle this
situation? My approach would be to have the connector throw a timeout
exception after waiting for a fixed period, so that it fails fast instead
of blocking.
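
To make the fail-fast idea concrete, here is a minimal sketch of what I have
in mind. It uses only java.util.concurrent and is not tied to Kafka Connect:
the class name FailFast and the method callWithTimeout are hypothetical
helpers of mine, not part of any Connect or JDBC API. It runs the blocking
call on a separate thread and gives up after a deadline.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (assumption): not part of the Kafka Connect API.
final class FailFast {
    private FailFast() {}

    /**
     * Runs a potentially blocking call on a separate daemon thread and waits
     * at most timeoutMs for its result. Throws TimeoutException when the
     * deadline passes so the caller can fail instead of hanging.
     */
    static <T> T callWithTimeout(Callable<T> blockingCall, long timeoutMs)
            throws TimeoutException, ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "fail-fast-worker");
            t.setDaemon(true); // a stuck call must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(blockingCall);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; this only stops the call if it
            // responds to interruption, otherwise the thread is abandoned.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In a sink task, the wrapped call would be whatever blocks on the database
(for example opening the JDBC connection in start), and the TimeoutException
could be rethrown as a ConnectException so that the task is marked as failed.
Note that a call which ignores interruption still leaves a stuck worker
thread behind, so driver-level settings such as a JDBC login or query
timeout are probably still needed alongside this.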

--
Thanks & Regards,
Hemanth
