MapAsync reprocesses messages on processor restart #368
Comments
Attached are the shutdown-related logs.
Hey @sm003ash, thanks for your issue. I will have a look and get back to you as soon as possible. Kr,
Thanks. Note that we are not setting EnableIdempotence to true; we are using the default at-least-once semantics.
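For context, our stream configuration is essentially the Streamiz defaults. A rough sketch with placeholder values (recalled from the docs, not copied from our repro code):

```csharp
using Streamiz.Kafka.Net;
using Streamiz.Kafka.Net.SerDes;

// Sketch only: nothing beyond the basics is overridden.
var config = new StreamConfig<StringSerDes, StringSerDes>();
config.ApplicationId = "SessionManager";      // placeholder application id
config.BootstrapServers = "localhost:9092";   // placeholder brokers
// Processing stays at the default at-least-once guarantee and
// EnableIdempotence is never set to true anywhere in our code.
```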
Hi @sm003ash, indeed, this is hard to reproduce, but I think the issue is that the StreamThread and the ExternalStreamThread stop asynchronously, and in some edge cases the StreamThread shuts down before the ExternalStreamThread. I will try to apply a fix. Would you be able to test from a dedicated branch?
Yes, I can test it from a dedicated branch. Thanks!
Hey @sm003ash, can you run this branch?
Sure thing. I will test it for a few days and get back to you.
Hi @LGouellec, the fix seems to be working. I haven't been able to reproduce the issue.
All good! I plan to include this fix in the 1.7.0 release, coming soon.
Description
Confluent.SchemaRegistry.Serdes.Avro NuGet version: 2.2.0
Streamiz.Kafka.Net NuGet version: 1.5.1
Streamiz.Kafka.Net.SchemaRegistry.SerDes.Avro NuGet version: 1.5.1
MapAsyncBug.zip
I have attached a Visual Studio solution. In it, a producer writes events to an input topic, and a consumer uses the Streamiz library to implement a simple topology: input topic -> MapAsync() -> output topic. This is a simplification of the topology we run in production. MapAsync() creates two internal topics, one for requests and one for responses.
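For reference, the attached solution boils down to roughly the following topology. This is a sketch only: topic names are placeholders and the MapAsync delegate shape is approximated from the Streamiz docs, so it may not match the repro code line for line.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Streamiz.Kafka.Net;
using Streamiz.Kafka.Net.SerDes;

// input topic -> MapAsync -> output topic.
// MapAsync runs on the ExternalStreamThread and is backed by the two internal
// "...-request" / "...-response" topics that Streamiz manages for it.
var config = new StreamConfig<StringSerDes, StringSerDes>();
config.ApplicationId = "SessionManager";
config.BootstrapServers = "localhost:9092";

var builder = new StreamBuilder();
builder.Stream<string, string>("input-topic")
    .MapAsync(async (record, _) =>
    {
        await Task.Delay(10); // stand-in for the real async call we make in production
        return KeyValuePair.Create(record.Key, record.Value);
    })
    .To("output-topic");

var stream = new KafkaStream(builder.Build(), config);
await stream.StartAsync();
```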
In production we have seen this issue: when the consumer is restarted as part of an upgrade, records from the MapAsync response topic that were already processed and written to the output topic are processed again on startup, so the writes to the output topic are resubmitted. Here are the logs:
Sep 11, 2024 17:35:23.105
"external-task[1] Record persisted: (timestamp 1726076122716) topic=[SessionManager-ProcessIoTOperationalEvents-response] partition=[[5]] offset=[2393]"
Sep 11, 2024 17:35:23.211
"stream-task[2|5]|processor[KSTREAM-SINK-0000000007]- Process<String,Value> message with key ddf1a263-653a-4f4f-9bf8-c1128225fa08 and Kafka.DeviceSessionEvents.Value with record metadata [topic:SessionManager-ProcessIoTOperationalEvents-response|partition:5|offset:2393]"
"stream-task[2|5]|processor[KSTREAM-FILTER-0000000006]- Process<String,Value> message with key ddf1a263-653a-4f4f-9bf8-c1128225fa08 and Kafka.DeviceSessionEvents.Value with record metadata [topic:SessionManager-ProcessIoTOperationalEvents-response|partition:5|offset:2393]"
"stream-task[2|5] - recordQueue [record-queue-SessionManager-ProcessIoTOperationalEvents-response-2-5] Record polled. (Record info [Topic:SessionManager-ProcessIoTOperationalEvents-response|Partition:[5]|Offset:2393])"
"stream-task[2|5] Start processing one record [Topic:SessionManager-ProcessIoTOperationalEvents-response|Partition:5|Offset:2393|Timestamp:1726076122716]"
Sep 11, 2024 17:35:23.212
"stream-task[2|5] Completed processing one record [Topic:SessionManager-ProcessIoTOperationalEvents-response|Partition:5|Offset:2393|Timestamp:1726076122716]"
Sep 11, 2024 17:35:23.262
"stream-task[2|5] Record persisted: (timestamp 1726076122716) topic=[IoTDeviceSessionEvents] partition=[[5]] offset=[2393]"
POST RESTART - notice the same offset 2393
Sep 11, 2024 17:41:45.507
"stream-task[2|5] - recordQueue [record-queue-SessionManager-ProcessIoTOperationalEvents-response-2-5] Record polled. (Record info [Topic:SessionManager-ProcessIoTOperationalEvents-response|Partition:[5]|Offset:2393])"
Another set of logs for offset 2395:
Sep 11, 2024 17:35:23.225
"external-task[1] Record persisted: (timestamp 1726076122716) topic=[SessionManager-ProcessIoTOperationalEvents-response] partition=[[5]] offset=[2395]"
Sep 11, 2024 17:35:23.381
"stream-task[2|5] - recordQueue [record-queue-SessionManager-ProcessIoTOperationalEvents-response-2-5] Record polled. (Record info [Topic:SessionManager-ProcessIoTOperationalEvents-response|Partition:[5]|Offset:2395])"
"stream-task[2|5] Start processing one record [Topic:SessionManager-ProcessIoTOperationalEvents-response|Partition:5|Offset:2395|Timestamp:1726076122716]"
"stream-task[2|5]|processor[KSTREAM-FILTER-0000000006]- Process<String,Value> message with key ddf1a263-653a-4f4f-9bf8-c1128225fa08 and Kafka.DeviceSessionEvents.Value with record metadata [topic:SessionManager-ProcessIoTOperationalEvents-response|partition:5|offset:2395]"
"stream-task[2|5]|processor[KSTREAM-SINK-0000000007]- Process<String,Value> message with key ddf1a263-653a-4f4f-9bf8-c1128225fa08 and Kafka.DeviceSessionEvents.Value with record metadata [topic:SessionManager-ProcessIoTOperationalEvents-response|partition:5|offset:2395]"
"stream-task[2|5] Record persisted: (timestamp 1726076122716) topic=[IoTDeviceSessionEvents] partition=[[5]] offset=[2395]"
"stream-task[2|5] Completed processing one record [Topic:SessionManager-ProcessIoTOperationalEvents-response|Partition:5|Offset:2395|Timestamp:1726076122716]"
POST RESTART
Sep 11, 2024 17:41:45.512
"stream-task[2|5] Completed processing one record [Topic:SessionManager-ProcessIoTOperationalEvents-response|Partition:5|Offset:2395|Timestamp:1726076122716]"
For partition 5 the log-end offsets of the MapAsync response topic and the output topic differ by 9 (11098 - 11089), i.e. 9 events were duplicated:
Response topic = SessionManager-ProcessIoTOperationalEvents-response, output topic = IoTDeviceSessionEvents (values are log-end offsets):

Partition  Response topic  Output topic
0          1017            1017
1          10356           10356
2          1835            1835
3          6833            6833
4          5650            5650
5          11089           11098
We have a Kafka sink connector that syncs data from the output topic to a PostgreSQL table. The GUIDs generated when creating output events are unique, and the corresponding column has a UNIQUE constraint. We noticed that this constraint was being violated, and that is how we found the issue.
How to reproduce
This is hard to reproduce. The attached logs might be your best bet for analyzing the issue, but you can try running the Docker Compose cluster and periodically restarting the consumer app.
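If it helps, one way to hammer the restart path from code rather than by restarting the container is a loop like the one below. This is a hypothetical harness, not part of the attached solution; BuildTopology() and config stand for the topology and StreamConfig from the sketch in the description.

```csharp
using System;
using System.Threading.Tasks;
using Streamiz.Kafka.Net;

// Hypothetical harness: repeatedly start and dispose the stream so the
// StreamThread / ExternalStreamThread shutdown ordering is exercised many times.
for (int i = 0; i < 20; i++)
{
    var stream = new KafkaStream(BuildTopology(), config); // assumed helper returning the topology sketched above
    await stream.StartAsync();                              // run the topology...
    await Task.Delay(TimeSpan.FromMinutes(2));              // ...while the producer keeps writing to the input topic
    stream.Dispose();                                       // triggers the shutdown path where the duplicates appear
}
```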