Tags: kaltura/media-framework
## Executive Summary

On September 23, 2025, at 21:51:45, the Kaltura Media Framework experienced a critical failure with the error `ngx_buf_queue_get: memory limit exceeded`. Our investigation revealed a cascading failure triggered by a bug in the transcoder's time-estimation algorithm during a 90kHz timestamp wraparound event.

## Timeline of Discovery

### Initial Investigation

- Symptom: memory buffer exhaustion causing stream-processing failure
- Channel: `nvp1-1_yqgk4a0v`
- Error: `ngx_buf_queue_get: memory limit exceeded`

### Root Cause Chain Discovered

1. Timestamp wraparound occurs every ~26.5 hours (90kHz clock cycle)
2. Transcoder time-estimator bug creates future timestamps
3. Sync system chaos due to timeline divergence
4. Segmentation failure prevents segment completion
5. Buffer accumulation without release
6. Memory exhaustion and system failure

## The Core Bug: `clock_estimator_get_clock`

### Location

- File: `/Users/igorshevachmac/dev/media-framework/transcoder/utils/time_estimator.c`
- Function: `clock_estimator_get_clock()`, lines 32-58

### The Algorithm Flaw

```c
for (int64_t runner = fifo->framesFifoHead; runner >= fifo->framesFifoTail; runner--) {
    clock_estimator_sample_t* sample = &(fifo->samples[runner % TIME_ESTIMATOR_FIFO_SIZE]);
    int64_t runnerdistance = llabs(dts - sample->dts);
    if (runnerdistance < distance) {
        clock = dts - sample->dts + sample->clock;  // Line 48
        distance = runnerdistance;
        if (distance == 0) {
            break;
        }
    } else {
        break;  // ❌ EXITS EARLY - prevents finding the true minimum
    }
}
```

### The Problem: Early Exit Prevents the Optimal Solution

- Normal operation: the algorithm works correctly while DTS progresses monotonically
- During wraparound: the early exit prevents finding the best match

## The Failure Scenario

### FIFO State During Wraparound

```
FIFO: [90000, 0, 8589467340, ...]  (head to tail)
Processing frame: dts = 8589467340  (pre-wrap frame delayed ~26.5h in the pipeline)
```

### Algorithm Execution

1. Iteration 1: `sample->dts = 90000`, distance = 8589377340
2. Iteration 2: `sample->dts = 0`, distance = 8589467340
3. Distance comparison: 8589467340 > 8589377340 → the algorithm exits
4. Never reached: the perfect match at `sample->dts = 8589467340`

### The Calculation That Created Future Timestamps

```
clock = dts - sample->dts + sample->clock
clock = 8589467340 - 90000 + sample->clock
clock = 8589377340 + sample->clock = FUTURE TIMESTAMP
```

### Proof with Observed Values

- Observed future timestamp: 158287382912982
- Calculation verification: 158287382912982 = 8589467340 + 158278793445642
- Where 158278793445642 was the clock value of the dts=90000 sample

## Why Only Some Tracks Failed

### Successful Tracks (Passthrough)

- `a32@b`, `v32@b`: `"passthrough": true`
- Low latency: the FIFO stays synchronized with the input
- Perfect matches found: distance = 0

### Failed Tracks (Transcoded)

- `a33@b`, `a42@b`, `v33@b`, `v34@b`, `v35@b`, `v42@b`, `v43@b`: active transcoding
- High latency: the encoding pipeline introduces delays
- The FIFO gets "ahead": it contains wrapped samples while pre-wrap frames are still being processed
- Future timestamps generated: the time-estimator bug triggers

## The Cascade Effect

1. **Future timestamps generated (21:47:35)** — the transcoder sends frames with fabricated timestamps in the future
2. **Sync system divergence (21:47:34-21:47:35)**
   - `a32@b`: 158270203044791 (correct, old timeline)
   - `a33@b`: 158278792980979 (incorrect, new timeline)
3. **Segmentation chaos (21:47:35-21:51:45)**
   - The syncer cannot reconcile 26.5-hour time differences
   - "large inter-frame pts jump" warnings
   - "track added/removed, forcing new period" messages
4. **Buffer accumulation (21:47:29-21:51:45)**
   - Each incomplete segment locks buffers for all tracks
   - ~40 incomplete segments × 6-7 tracks = 240-280 buffer locks
   - No segment completion means no buffer releases
5. **Memory exhaustion (21:51:45)** — `ngx_buf_queue_get: memory limit exceeded`

## Evidence from Logs

### FFmpeg Warnings During Wraparound

```
2025-09-23T18:47:35.385 FFMPEG:AVCodecContext WARN Queue input is backward in time
2025-09-23T18:47:35.388 FFMPEG:AVCodecContext WARN Queue input is backward in time
```

### Transcoder Diagnostics

```
21:47:30.226: "drift": -96,      "lastDts": 8589467340  (before wrap)
21:47:45.363: "drift": 95443698, "lastDts": 890130      (after wrap)
```

### Sync Timeline Divergence

```
Sep 23 @ 21:47:34.031: a33@b using channel reference 158270203044791 ✅
Sep 23 @ 21:47:35.406: a33@b using channel reference 158278792980979 ❌
```

## Comparison with a Successful Wraparound (Sep 21)

A similar wraparound event on Sep 21, 2025 @ 21:59:41 succeeded:

- All tracks converged to a single timeline reference
- No future timestamps were generated
- The system recovered normally

Key difference: transcoder pipeline timing and the FIFO state during the wraparound transition.

## The Fix

### Immediate Solution: Remove the Early Exit

```c
uint64_t clock_estimator_get_clock(clock_estimator_t *fifo, int64_t dts)
{
    if (fifo->framesFifoTail == -1) {
        return 0;
    }

    int64_t min_distance = INT64_MAX;
    uint64_t best_clock = 0;

    // Always traverse the entire FIFO to find the true minimum
    for (int64_t runner = fifo->framesFifoHead; runner >= fifo->framesFifoTail; runner--) {
        clock_estimator_sample_t* sample = &(fifo->samples[runner % TIME_ESTIMATOR_FIFO_SIZE]);
        int64_t distance = llabs(dts - sample->dts);

        if (distance < min_distance) {
            min_distance = distance;
            best_clock = dts - sample->dts + sample->clock;
            if (distance == 0) {
                break;  // Perfect match found
            }
        }
        // NO early exit on increasing distance
    }

    return best_clock;
}
```

### Enhanced Solution: Wraparound Detection

```c
// Add a wraparound detection threshold
#define WRAPAROUND_THRESHOLD (8000000000LL)  // ~24 hours in 90kHz ticks

if (min_distance > WRAPAROUND_THRESHOLD) {
    return getClock90Khz();  // Use the current time instead of extrapolating
}
```

## Technical Implications

### Performance Impact

- Minimal: the FIFO typically contains 100-1000 samples
- The O(n) search remains fast for small n
- The benefit of preventing system failures far outweighs the cost

### Reliability Improvement

- Eliminates future-timestamp generation during wraparound
- Prevents cascade failures in downstream systems
- Maintains system stability across 26.5-hour boundaries

## Conclusion

This investigation revealed a subtle but critical bug in time-estimation logic that manifests only during timestamp wraparound events, when transcoding latency is sufficient to create mixed FIFO states. The bug's impact cascades through multiple system layers, ultimately causing memory exhaustion and system failure.

The fix is algorithmic: ensure the time estimator finds the truly optimal match rather than settling for a local optimum that can be catastrophically wrong during wraparound transitions.

---

- Investigation completed: October 1, 2025
- Systems affected: all transcoded tracks in nginx-live-module
- Severity: critical system failure every ~26.5 hours
- Priority: immediate fix required
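The early-exit failure described above can be reproduced outside the framework. The sketch below is a minimal stand-alone model, not the real media-framework code: the struct, function names, and clock values (2000, 1000, 500) are hypothetical, and only the head-to-tail traversal order and the distance logic mirror the original. It contrasts the buggy early-exit traversal with the full-scan fix on the mixed FIFO state from the failure scenario.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the framework's FIFO sample: a (dts, clock)
 * pair. The array below models the wraparound FIFO from the report,
 * listed head (newest) to tail (oldest); the clock values are invented. */
typedef struct { int64_t dts; int64_t clock; } sample_t;

static const sample_t kWrapFifo[3] = {
    { 90000LL,      2000 },  /* head: post-wrap sample        */
    { 0LL,          1000 },  /*       post-wrap sample        */
    { 8589467340LL,  500 },  /* tail: pre-wrap, perfect match */
};

/* Buggy traversal: bails out as soon as the distance stops shrinking,
 * so it never reaches the pre-wrap sample at the tail. */
static int64_t estimate_early_exit(const sample_t *s, int n, int64_t dts) {
    int64_t best = INT64_MAX, clock = 0;
    for (int i = 0; i < n; i++) {
        int64_t d = llabs(dts - s[i].dts);
        if (d < best) {
            best = d;
            clock = dts - s[i].dts + s[i].clock;
            if (best == 0) break;
        } else {
            break;  /* the early exit */
        }
    }
    return clock;
}

/* Fixed traversal: always scans the whole FIFO for the true minimum. */
static int64_t estimate_full_scan(const sample_t *s, int n, int64_t dts) {
    int64_t best = INT64_MAX, clock = 0;
    for (int i = 0; i < n; i++) {
        int64_t d = llabs(dts - s[i].dts);
        if (d < best) {
            best = d;
            clock = dts - s[i].dts + s[i].clock;
            if (best == 0) break;
        }
    }
    return clock;
}
```

On this FIFO, the early-exit version settles on the first post-wrap sample and extrapolates roughly a full wrap period into the future, while the full scan reaches the pre-wrap sample at distance 0 and returns its clock unchanged.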
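The ~26.5-hour cadence of these failures follows directly from the clock itself: MPEG-TS PTS/DTS values are 33-bit counters ticking at 90 kHz, so the counter wraps every 2^33 ticks. A quick arithmetic check (plain C, no framework code):

```c
#include <stdint.h>

/* MPEG-TS PTS/DTS: 33-bit counter at 90 kHz. */
enum { TICKS_PER_SEC = 90000 };
static const int64_t WRAP_TICKS = (int64_t)1 << 33;  /* 8589934592 */

/* Seconds between wraparounds: 2^33 / 90000 ≈ 95443 s ≈ 26.5 h. */
static int64_t wrap_period_seconds(void) {
    return WRAP_TICKS / TICKS_PER_SEC;
}
```

Note that the pre-wrap `lastDts` of 8589467340 in the transcoder diagnostics sits just below 2^33 = 8589934592, and, assuming the `drift` field is in milliseconds, the observed post-wrap jump of 95443698 is consistent with exactly one wrap period (95443.7 s).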
## Follow-up Items

- Fix warnings
- Workaround for downstream sending multiple media info after transcoder reconnection
- Attempt to force RTMP not to send old media info