Skip to content

Monitoring & Observability

Metrics Overview

rtpbridge exposes Prometheus metrics at GET /metrics on the control plane port (default: 9100) in OpenMetrics text format.

All metrics use the rtpbridge_ prefix.

Counters

MetricDescription
rtpbridge_sessions_totalTotal sessions created since startup
rtpbridge_endpoints_totalTotal endpoints created since startup
rtpbridge_packets_routed_totalTotal RTP packets routed between endpoints
rtpbridge_packets_recorded_totalTotal packets written to PCAP recordings
rtpbridge_srtp_errors_totalSRTP authentication or replay check failures
rtpbridge_transcode_errors_totalCodec transcode failures (decode or encode)
rtpbridge_dtmf_events_totalDTMF digits detected
rtpbridge_events_dropped_totalWebSocket events dropped due to client backpressure

Gauges

MetricDescription
rtpbridge_sessions_activeCurrently active sessions
rtpbridge_endpoints_activeCurrently active endpoints
rtpbridge_recordings_activeCurrently active PCAP recordings

Interpreting Metrics

Throughput

rate(rtpbridge_packets_routed_total[5m]) gives you the packet routing rate. For reference:

  • A single two-party call at 20ms ptime generates ~100 packets/sec (50 in each direction)
  • At 10ms ptime, this doubles to ~200 packets/sec

Scale linearly with active sessions to estimate expected throughput.

Error Rates

SRTP errorsrate(rtpbridge_srtp_errors_total[5m]) should be near zero in normal operation. A sustained rate above 1/sec usually indicates a key mismatch or network-level packet corruption. Investigate the specific session causing errors.

Transcode errorsrate(rtpbridge_transcode_errors_total[5m]) should be zero. Non-zero values indicate codec failures (e.g., corrupt Opus frames). Occasional errors on noisy links are expected; sustained errors suggest a codec negotiation problem.

Events droppedrate(rtpbridge_events_dropped_total[5m]) > 0 means clients aren't reading WebSocket events fast enough. This doesn't affect media flow but means your application is missing events (DTMF, VAD, stats).

Capacity

rtpbridge_sessions_active relative to max_sessions tells you how much headroom you have. If you're consistently above 80% capacity, scale out.

rtpbridge_endpoints_active / rtpbridge_sessions_active gives you the average endpoints per session. A sudden increase might indicate a misconfigured client creating too many endpoints.

Alerting

yaml
groups:
  - name: rtpbridge
    rules:
      # Instance down
      - alert: RtpbridgeDown
        expr: up{job="rtpbridge"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "rtpbridge instance {{ $labels.instance }} is down"

      # SRTP errors sustained
      - alert: RtpbridgeSrtpErrors
        expr: rate(rtpbridge_srtp_errors_total[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Sustained SRTP errors on {{ $labels.instance }}"

      # Transcode errors
      - alert: RtpbridgeTranscodeErrors
        expr: rate(rtpbridge_transcode_errors_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Transcode errors on {{ $labels.instance }}"

      # Client backpressure
      - alert: RtpbridgeEventsDropped
        expr: rate(rtpbridge_events_dropped_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Events being dropped on {{ $labels.instance }} — client too slow"

      # Approaching session capacity
      # Adjust the threshold to 80% of your configured max_sessions value.
      - alert: RtpbridgeHighSessionLoad
        expr: rtpbridge_sessions_active > 4000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} is above 80% session capacity"

      # No packets routed (but sessions active)
      - alert: RtpbridgeNoTraffic
        expr: >
          rtpbridge_sessions_active > 0
          and rate(rtpbridge_packets_routed_total[5m]) == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Active sessions but no packets routed on {{ $labels.instance }}"

Threshold Guidelines

MetricNormalInvestigateAlert
rate(srtp_errors_total[5m])0> 0.1/s> 1/s sustained 5m
rate(transcode_errors_total[5m])0> 0> 0 sustained 10m
rate(events_dropped_total[5m])0> 0> 0 sustained 5m
sessions_active vs configured max_sessions< 50%> 70%> 80%
rate(packets_routed_total[5m]) with active sessions> 0== 0 for 2m== 0 for 5m

Dashboards

Grafana Dashboard Panels

Recommended panels for a Grafana dashboard:

Row 1 — Overview

  • Sessions active (gauge) — rtpbridge_sessions_active
  • Endpoints active (gauge) — rtpbridge_endpoints_active
  • Packet rate (graph) — rate(rtpbridge_packets_routed_total[1m])
  • Recordings active (gauge) — rtpbridge_recordings_active

Row 2 — Errors

  • SRTP error rate (graph) — rate(rtpbridge_srtp_errors_total[1m])
  • Transcode error rate (graph) — rate(rtpbridge_transcode_errors_total[1m])
  • Events dropped rate (graph) — rate(rtpbridge_events_dropped_total[1m])

Row 3 — Cumulative

  • Total sessions (counter) — rtpbridge_sessions_total
  • Total DTMF events (counter) — rtpbridge_dtmf_events_total
  • Packets recorded (counter) — rtpbridge_packets_recorded_total

Per-Session Observability

Beyond Prometheus metrics, rtpbridge provides per-session observability through the WebSocket control protocol:

Statistics Subscription

json
{"id":"1","method":"stats.subscribe","params":{"interval_ms":5000}}

Returns per-endpoint stats including packet counts, loss, jitter, RTT, codec, and state. See Statistics for full details.

Event Stream

All session events (DTMF, state changes, VAD, recording stops, media timeouts) are delivered as JSON over the same WebSocket connection. See Events for the full list.

HTTP Endpoints

EndpointDescription
GET /healthHealth check — returns {"status":"ok"}
GET /metricsPrometheus metrics
GET /sessionsList active sessions with endpoint counts
GET /recordingsList recording files with pagination