Go Kafka `ProduceChannel()` Fills Up and Hangs
I have a server-side app written in Go producing Kafka events. It runs perfectly for days, producing ~1.6k msg/sec, and then hits a sporadic problem, where all Kafka message sending stops, and the server app needs to be manually restarted for Kafka messages to resume sending.
I've included a screenshot of the metric graphs when the incident started. To annotate what I see happening:
- For seven days, the app runs perfectly. For every message queued, there is a delivery event notification sent to kafkaProducer.Events() (the drain loop is sketched after this list). You can see that num queued = num delivered.
- 10:39: The issue starts. The delivery notification count quickly drops to zero. Kafka messages keep getting queued, but the callbacks stop.
- 10:52: kafkaProducer.ProduceChannel() is filled up, and attempts to queue new messages into the Go channel block the goroutine. At this point the app will never send another Kafka message until it is manually restarted.
- 17:55: I manually restarted the application. Kafka message queuing/delivery resumes. kafka_produce_attempts drops back to zero.
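For reference, the delivery-report consumption I'm describing follows roughly this minimal sketch (it assumes kafkaProducer and the counters are defined elsewhere; the counter names are illustrative placeholders, not my exact metrics code):

go func() {
	for e := range kafkaProducer.Events() {
		switch ev := e.(type) {
		case *kafka.Message:
			// Delivery report: the library emits one per produced message,
			// whether the delivery succeeded or failed.
			if ev.TopicPartition.Error != nil {
				deliveryFailedCounter.Inc()
			} else {
				deliveredCounter.Inc()
			}
		default:
			// Client-level events such as kafka.Error (broker down, timeouts, ...).
			log.Printf("Ignored event: %v", ev)
		}
	}
}()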
The one and only place my Go code sends Kafka messages is here:
recordChannelGauge.Inc()
kafkaProducer.ProduceChannel() <- &msg
recordChannelGauge.Dec()
In the metric screenshot, note that recordChannelGauge normally stays at zero, because sending the message to the Kafka ProduceChannel() doesn't block and each Inc() is immediately followed by a matching Dec(). However, when the ProduceChannel() is filled up, the goroutine blocks, recordChannelGauge stays at 1, and it never unblocks until the app is manually restarted.
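I haven't deployed this yet, but one mitigation I'm considering is guarding the channel send with a select and a timeout, so a stall at least becomes observable instead of blocking the goroutine forever. A rough sketch (the 10-second timeout and produceStallCounter are arbitrary, illustrative choices, not my current code):

recordChannelGauge.Inc()
select {
case kafkaProducer.ProduceChannel() <- &msg:
	// Queued as before.
case <-time.After(10 * time.Second):
	// The channel has stayed full for 10s; the producer appears stalled.
	// The message is not queued in this case, just counted and logged.
	produceStallCounter.Inc() // hypothetical metric to alert on
	log.Printf("ProduceChannel() send timed out; librdkafka queue len=%d", kafkaProducer.Len())
}
recordChannelGauge.Dec()

Alternatively, the library's Produce() call enqueues without blocking and returns an error (for example, a queue-full error) rather than hanging, which might make this failure mode easier to detect.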
FYI, my environment details:
- Go server binary built with Go 1.10.x
- Latest version of github.com/confluentinc/confluent-kafka-go/kafka. This library doesn't use version tags; it's pinned to the latest git commit, which as of this writing is two months old, so I'm sure I'm using the latest version.
- Server OS: Ubuntu 16.04.5
- librdkafka1 version 0.11.6~1confluent5.0.1-
I suspect this is due to some internal problem in the confluent-kafka-go client, where it doesn't handle some error scenario appropriately.
Also, I see no relevant log output around the time of the problem. I do see sporadic Kafka broker disconnect and timeout errors in the logs before the problem happened, but they don't seem to be serious. These log messages occurred every few hours or so for days without serious consequence.
Nov 26 06:52:04 01 appserver.linux[6550]: %4|1543215124.447|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-3:9092/bootstrap]: kafka-broker-3:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
Nov 26 06:52:10 01 appserver.linux[6550]: %4|1543215130.448|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-3:9092/bootstrap]: kafka-broker-3:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
Nov 26 08:46:57 01 appserver.linux[6550]: 2018/11/26 08:46:57 Ignored event: kafka-broker-2:9092/bootstrap: Disconnected (after 600000ms in state UP)
Nov 26 08:47:02 01 appserver.linux[6550]: %4|1543222022.803|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-2:9092/bootstrap]: kafka-broker-2:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
Nov 26 08:47:09 01 appserver.linux[6550]: %4|1543222029.807|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-2:9092/bootstrap]: kafka-broker-2:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
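To capture more detail the next time this happens, I'm planning to enable librdkafka debug logging through the producer config. A minimal sketch (the broker address and the chosen debug categories are illustrative, not my production config):

p, err := kafka.NewProducer(&kafka.ConfigMap{
	"bootstrap.servers": "kafka-broker-1:9092",
	// librdkafka debug contexts; "queue" and "msg" should show what happens
	// to messages sitting on the internal queues.
	"debug": "broker,topic,msg,queue",
})
if err != nil {
	log.Fatal(err)
}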
(Screenshot: zoomed in to the problem occurrence)
(Screenshot: zoomed out to show before and after)
Tags: go, apache-kafka, kafka-producer-api, confluent-kafka