Understanding Video Streaming Application Architecture

This document outlines the fundamental concepts and architectural components of modern video streaming applications. It covers key terminology, definitions, and the typical five-step pipeline involved in delivering video content.

Key Terminology and Definitions

To understand the architecture of modern streaming, it is essential to define the core technologies, grouped by their functional domains.

1. Media Compression & Codecs

  • Codec: A device or program (COmpressor-DECompressor) that shrinks large video files for transmission and expands them for viewing.

  • H.265 (HEVC): High Efficiency Video Coding. The successor to H.264, offering 25% to 50% better data compression at the same quality level.

  • H.264 (AVC): Advanced Video Coding. The industry standard for over a decade, known for universal compatibility but lower efficiency compared to HEVC.
  • CTU (Coding Tree Unit): The basic processing unit of HEVC, which can be as large as 64×64 pixels, allowing for more efficient processing of high-resolution video than H.264’s 16×16 macroblocks.
  • AAC AudioSpecificConfig: A global header for MPEG-4 Audio that contains essential information for an AAC decoder, such as the Audio Object Type (AOT), sampling rate, and channel configuration. It is typically generated by the encoder and used to initialize the decoder.
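As a sketch, the leading fields of an AudioSpecificConfig can be unpacked with a few bit shifts (layout per MPEG-4 Audio; the escape cases for extended object types and custom sample rates are deliberately skipped):

```python
def parse_audio_specific_config(data: bytes):
    """Parse the leading fields of an MPEG-4 AudioSpecificConfig.

    Bit layout (MSB first):
      5 bits  Audio Object Type (AOT)
      4 bits  sampling frequency index
      4 bits  channel configuration
    Escape values (AOT 31, frequency index 15) are not handled here.
    """
    bits = int.from_bytes(data[:2], "big")   # first 16 bits
    aot = (bits >> 11) & 0x1F
    freq_index = (bits >> 7) & 0x0F
    channels = (bits >> 3) & 0x0F
    freq_table = [96000, 88200, 64000, 48000, 44100, 32000, 24000,
                  22050, 16000, 12000, 11025, 8000, 7350]
    return aot, freq_table[freq_index], channels

# 0x12 0x10 is the classic AAC-LC / 44.1 kHz / stereo configuration
print(parse_audio_specific_config(b"\x12\x10"))  # (2, 44100, 2)
```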

Ingest

  • SRT (Secure Reliable Transport): A UDP-based protocol designed for reliable video ingest over unstable networks, utilizing error correction (ARQ/FEC).
  • RTP (Real-time Transport Protocol): A protocol for delivering audio and video over IP networks. It carries the actual media data and includes sequence numbers and timestamps to help the receiver reassemble the stream correctly.
  • RTSP (Real Time Streaming Protocol): A network control protocol used to manage media sessions. It allows clients to control the media server with commands like play, pause, and stop, acting as a “remote control” for the stream.
  • RTCP (Real-time Transport Control Protocol): A companion protocol to RTP that provides feedback on the quality of data delivery (QoS) and helps synchronize different media streams (e.g., audio and video).
  • Relationship between RTP/RTSP/RTCP:
    Standard: RTSP (TCP) + RTP (UDP) + RTCP (UDP) = 3 connections.
    Interleaved: Everything is squeezed into a single RTSP (TCP) connection.
    Trade-off: This makes the video much more likely to stutter because TCP will pause the video to “fix” tiny errors that UDP would have just ignored.
  • RTMP (Real-Time Messaging Protocol): Originally developed by Macromedia (now Adobe), this is a TCP-based protocol used for streaming audio, video, and data. It remains the industry standard for “first-mile” ingest, where an encoder sends a live stream to a media server or CDN (e.g., YouTube Live or Twitch).

2. Delivery Protocols

  • HLS (HTTP Live Streaming): An adaptive bitrate streaming protocol developed by Apple that serves video via standard HTTP infrastructure.
  • LL-HLS (Low-Latency HLS): An extension of HLS that reduces delay from 30 seconds to the 2–6 second range using partial segments and preload hints.
  • DASH (Dynamic Adaptive Streaming over HTTP): An international standard for adaptive bitrate streaming that allows high-quality streaming of media content over the internet. It works by breaking content into a sequence of small HTTP-based file segments, each containing a small chunk of playback time.
  • LL-DASH (Low-Latency DASH): An extension of the DASH standard designed to reduce end-to-end latency in live streaming. It achieves lower latency by using shorter segment durations and CMAF chunked transfer encoding, delivering media fragments as soon as they are produced (earlier designs also used HTTP/2 Push, which browsers have since deprecated).

  • WebRTC (Web Real-Time Communication): An open-source project and IETF/W3C standard that enables ultra-low latency (sub-500ms) real-time communication directly in web browsers without plugins. Unlike HTTP-based streaming (HLS/DASH), WebRTC is stateful and primarily uses UDP to prioritize speed. It relies on several sub-protocols:

    • Why NAT Traversal is Required: Most devices sit behind routers using Network Address Translation (NAT) and have private IP addresses (e.g., 192.168.x.x). Peers cannot communicate directly because they don’t know each other’s public identities, and firewalls block unsolicited incoming traffic. NAT traversal “punches holes” through these barriers.
    • ICE (Interactive Connectivity Establishment): A framework that coordinates STUN and TURN to find the best path between peers.
    • STUN (Session Traversal Utilities for NAT): Allows a device to discover its public IP address to bypass simple NATs.
    • TURN (Traversal Using Relays around NAT): A relay server used as a fallback when firewalls block direct peer-to-peer connections.
    • DTLS (Datagram Transport Layer Security): Secures the initial connection and handles the exchange of encryption keys over UDP.
    • SRTP (Secure Real-time Transport Protocol): Uses the keys from the DTLS handshake to encrypt the actual media payload (audio/video).
    • SDP (Session Description Protocol): A text-based format used to negotiate session parameters (codecs, resolution, encryption keys). It acts as the “contract” that both sides must agree upon before media can flow.
  • WHIP (WebRTC-HTTP Ingestion Protocol): A standard (RFC 9725) for pushing media from an encoder to a server using HTTP POST. It solves the “signaling problem” by standardizing the SDP (Session Description Protocol) exchange, allowing hardware and software encoders (like OBS) to support WebRTC ingest without custom WebSocket implementations.

  • WHEP (WebRTC HTTP Egress Protocol): A standard for pulling media from a server to a player. Like WHIP, it uses HTTP to standardize signaling for playback, enabling universal WebRTC players that work across different media servers without vendor-specific integration code.
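Since WHIP/WHEP signaling is plain HTTP, the client side needs no special library. A minimal, hypothetical sketch of building the WHIP request with only the Python standard library (the endpoint URL is made up, and producing the SDP offer itself would require a WebRTC stack such as aiortc):

```python
import urllib.request

def build_whip_request(endpoint: str, sdp_offer: str, token=None):
    """Build the HTTP POST that pushes an SDP offer to a WHIP endpoint.

    The server is expected to reply 201 Created with the SDP answer in
    the body and the session resource URL in the Location header
    (issuing DELETE on that URL later tears the session down).
    """
    headers = {"Content-Type": "application/sdp"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(endpoint, data=sdp_offer.encode(),
                                  headers=headers, method="POST")

# urllib.request.urlopen(req) would send it; shown unsent here
req = build_whip_request("https://media.example.com/whip", "v=0\r\n...",
                         token="secret")
print(req.get_method(), req.get_header("Content-type"))  # POST application/sdp
```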

3. Packaging & Standards

  • CMAF (Common Media Application Format): A standard that allows a single fragmented MP4 file to be compatible with both HLS and DASH players.
  • Container Format: A file format that specifies how data (video, audio, subtitles, metadata) is stored together in a single file. It doesn’t compress the data itself but organizes it for playback. Examples include MP4, WebM, and MOV.
  • Manifest (M3U8): A text-based playlist used by HLS that indexes video segments and quality variants.
  • MPEG-TS (MPEG Transport Stream): A standard digital container format for transmission and storage of audio, video, and Program Specific Information (PSI) data. It is commonly used in broadcast systems like DVB and ATSC, and was historically the primary container format for HLS.
  • MP4 (MPEG-4 Part 14): A widely used container format for storing video, audio, and other data. It is based on Apple’s QuickTime File Format and is highly compatible across devices and platforms.
  • fMP4 (Fragmented MP4): An MP4 format that breaks a file into independent segments, making it suitable for live streaming and low-latency delivery.
  • WebM: An open, royalty-free media file format designed for the web. It typically uses VP8 or VP9 video codecs and Vorbis or Opus audio codecs.
  • MOV (QuickTime File Format): A proprietary container format developed by Apple, used primarily by the QuickTime multimedia framework. It can contain multiple tracks of video, audio, and text.

4. Web API

  • MSE (MediaSource Extensions): A Web API that allows JavaScript to construct media streams for <audio> and <video> elements, giving web applications fine-grained control over media data for adaptive streaming.

General Video Streaming Flow

The video streaming process typically follows a predictable five-step pipeline.

  • Capture & Encoding (The Origin)
    A camera captures raw frames. An encoder (e.g., FFmpeg or NVIDIA NVENC) compresses this raw data into a codec like H.265 to reduce its size.

  • Contribution/Ingest (The First Mile)
    The compressed video is sent from the edge device to a central server.

  • Technologies Used: SRT (for reliability over public internet) or RTP/UDP (for lowest overhead in managed networks).

  • Processing & Transcoding (The Core)
    The server receives the stream. It may change the resolution (transcoding) or simply re-package it.

  • Technologies Used: GStreamer or FFmpeg.

  • Packaging & Distribution (The Delivery)
    For HLS, the video is split into small fragments (fMP4) and indexed in a manifest (M3U8). An HTTP server (e.g., Ktor or a CDN) hosts these files.

  • Playback & Consumption (The Last Mile)
    A web browser fetches the playlist and segments. A library like hls.js feeds the data into the browser’s hardware decoder via MediaSource Extensions (MSE).
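To make the last-mile step concrete, here is a toy parser for an HLS media playlist. It extracts only segment durations and URIs, whereas a real player (hls.js, etc.) handles many more tags:

```python
def parse_media_playlist(m3u8_text: str):
    """Extract (duration_seconds, uri) pairs from an HLS media playlist.

    Only #EXTINF and segment URI lines are handled; this is a sketch,
    not a spec-complete parser.
    """
    segments, duration = [], None
    for line in m3u8_text.splitlines():
        line = line.strip()
        if line.startswith("#EXTINF:"):
            duration = float(line[len("#EXTINF:"):].split(",")[0])
        elif line and not line.startswith("#"):
            segments.append((duration, line))
            duration = None
    return segments

playlist = """#EXTM3U
#EXT-X-VERSION:7
#EXT-X-TARGETDURATION:6
#EXTINF:6.0,
seg0.m4s
#EXTINF:6.0,
seg1.m4s
"""
print(parse_media_playlist(playlist))  # [(6.0, 'seg0.m4s'), (6.0, 'seg1.m4s')]
```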

WebRTC Connection Establishment Flow

Establishing a WebRTC connection requires a “handshake” process known as Signaling. Historically, WebRTC did not specify a signaling transport, leading to proprietary implementations. Today, WHIP and WHEP provide the industry standard for HTTP-based signaling, enabling interoperability between different encoders, servers, and players.

  1. Signaling (Offer/Answer Exchange):
    • Offer: The initiator (e.g., a WHIP encoder) generates an SDP describing its media capabilities and sends it to the server via an HTTP POST.
    • Answer: The receiver (Answerer) processes the offer, selects compatible codecs, generates its own SDP, and sends it back.
  2. ICE Candidate Gathering: Both peers contact STUN/TURN servers to discover their public IP addresses and ports (known as ICE Candidates).
  3. ICE Candidate Exchange: Peers share these candidates through the signaling channel. This allows them to find the most efficient network path (Direct P2P vs. Relay).
  4. DTLS Handshake: Once a network path is established, the peers perform a secure handshake to verify identities and generate encryption keys.
  5. Media Flow: Encrypted audio and video data begin flowing using SRTP, utilizing the keys from the DTLS step.

The HLS Protocol and the Clarification on Container Formats

A common misconception in video engineering is that the difference between HLS and LL-HLS is the use of Fragmented MP4 (fMP4). This is technically incorrect.

fMP4 in Standard HLS
Standard HLS has supported fMP4 segments since version 7 (introduced in 2016). Before this, HLS exclusively used MPEG-2 Transport Streams (.ts). The primary advantage of fMP4 in standard HLS is its compatibility with the CMAF standard, allowing a single set of video files to serve both HLS and DASH clients. Most browser engines (via MSE) prefer fMP4 containers for HEVC over legacy TS because fMP4 offers better compatibility with modern streaming standards like CMAF and DASH, and is often a requirement for HEVC/H.265 playback on certain platforms like macOS and iOS.

Defining Differences: Standard HLS vs. LL-HLS

The real difference lies in the delivery mechanisms, not the container format. While standard HLS can use fMP4, it still delivers complete segments (e.g., 2–6 seconds long), whereas LL-HLS introduces several specific technical features to reduce latency:

  • Partial Segments (Parts): LL-HLS divides segments into tiny “parts” (e.g., 200ms). These are advertised in the playlist and can be downloaded as soon as they are ready, long before the full parent segment is complete.

  • Preload Hints: The server informs the player of the URL of the next expected partial segment in advance, allowing the player to issue a request immediately when data becomes available.

  • Blocking Playlist Reloads: Instead of constant polling, the server “holds” a playlist request until new data arrives, eliminating unnecessary network round trips.

  • Playlist Delta Updates: To reduce overhead, the server can send only the changed portions of a playlist rather than the entire file.
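These four features all surface as concrete playlist tags. An illustrative (not server-generated) LL-HLS playlist fragment, with made-up URIs:

```
#EXTM3U
#EXT-X-VERSION:9
#EXT-X-TARGETDURATION:4
#EXT-X-SERVER-CONTROL:CAN-BLOCK-RELOAD=YES,PART-HOLD-BACK=0.6,CAN-SKIP-UNTIL=24.0
#EXT-X-PART-INF:PART-TARGET=0.2
#EXTINF:4.0,
segment100.m4s
#EXT-X-PART:DURATION=0.2,URI="segment101.part0.m4s",INDEPENDENT=YES
#EXT-X-PART:DURATION=0.2,URI="segment101.part1.m4s"
#EXT-X-PRELOAD-HINT:TYPE=PART,URI="segment101.part2.m4s"
```

EXT-X-PART advertises the partial segments, EXT-X-PRELOAD-HINT names the next expected part, and EXT-X-SERVER-CONTROL declares blocking reloads (CAN-BLOCK-RELOAD) and delta-update support (CAN-SKIP-UNTIL).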

How to check which codecs are supported by Chrome

Open chrome://gpu/ and check the “Video Acceleration Information” section, which lists the hardware decode and encode profiles available to the browser.

H.265 is not displayed although the GPU supports it

It is likely that a GPU such as the Intel Arc 140V does have the hardware capability, but that it is being “hidden” or “gated” by the laptop manufacturer (OEM) or a missing Windows component.

There is a known industry issue where manufacturers like Dell and HP have recently begun disabling H.265 hardware support in the BIOS/ACPI tables to avoid paying patent royalty fees on every laptop sold.

To resolve this issue, install the “HEVC Video Extensions” package from the Microsoft Store.

PTS vs. DTS

To understand PTS, you also need to know its partner, DTS (Decoding Time Stamp). They are often different because of how modern video compression (like H.264 or H.265) works:

DTS (Decoding Order): Tells the computer when to process the data.

PTS (Presentation Order): Tells the screen when to show the frame.

Because certain frames (B-frames) need information from “future” frames to be decoded, the computer might decode Frame 4 before it can show Frame 2.

DTS order: 1, 4, 2, 3

PTS order: 1, 2, 3, 4 (The smooth 1-2-3-4 sequence you actually see)
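A toy sketch of the reorder step: the decoder receives frames in DTS order, and a reorder buffer releases them in PTS order.

```python
def to_display_order(decode_order):
    """Return frames in presentation (PTS) order.

    Real decoders release frames incrementally from a small reorder
    buffer; sorting the whole list is just for illustration.
    """
    return sorted(decode_order, key=lambda f: f["pts"])

# A group of pictures where the P frame must be decoded before the two B frames
decode_order = [
    {"type": "I", "dts": 1, "pts": 1},
    {"type": "P", "dts": 2, "pts": 4},   # decoded early, shown late
    {"type": "B", "dts": 3, "pts": 2},
    {"type": "B", "dts": 4, "pts": 3},
]
print([f["pts"] for f in to_display_order(decode_order)])  # [1, 2, 3, 4]
```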

RTP packet jitter

If Packet A is sent at 1.0s and Packet B at 1.1s, but Packet B arrives at the receiver 0.01s before Packet A due to a network hiccup, do-timestamp=true would stamp Packet B with an earlier time than Packet A. This would cause the “back and forth” jitter that makes video stutter or crash muxers.

To utilize the correct order stored in the RTP header, we have to change how GStreamer handles the incoming data.

The Solution on GStreamer: rtpjitterbuffer
Instead of just grabbing the packet and stamping it with the “wall clock” immediately, we need a “waiting room” that looks at the RTP sequence numbers and timestamps.

  • RTP Sequence Number: Tells GStreamer the exact order (1, 2, 3…).
  • RTP Timestamp: Tells GStreamer the intended spacing between frames.

RTP packet drop

Even with a jitter buffer, sometimes a packet is lost forever (it’s UDP, after all). If Packet 5 never arrives, there is a “hole” in the timeline. mp4mux hates holes.

videorate sees the hole and says: “Frame 5 is missing, but I need to keep 30fps for this MP4 file. I will just duplicate Frame 4 and give it the timestamp Frame 5 should have had.” (videorate operates on decoded frames, so a lost packet surfaces as a missing frame.)

GStreamer tips

  1. Added rtpjitterbuffer: This element was inserted after udpsrc to handle network jitter and packet reordering, which helps in constructing valid timestamps from RTP packets. I set latency=200 to provide a buffer against network fluctuations.

  2. Added videorate: This element was inserted before the encoder (x264enc) along with a caps filter video/x-raw,framerate=30/1. This forces a constant frame rate and regenerates timestamps for the raw video frames, ensuring that the encoder receives a stream with perfect, monotonic timestamps. This effectively sanitizes the stream and prevents the “Buffer has no PTS” error in the downstream mp4mux element.
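Combining both tips, the receive pipeline might look like the following sketch (the port, caps, payload type, and codec elements are assumptions that depend on the actual stream):

```
gst-launch-1.0 -e \
  udpsrc port=5000 caps="application/x-rtp,media=video,encoding-name=H264,clock-rate=90000" \
  ! rtpjitterbuffer latency=200 \
  ! rtph264depay ! avdec_h264 \
  ! videorate ! video/x-raw,framerate=30/1 \
  ! x264enc ! h264parse ! mp4mux ! filesink location=out.mp4
```

The `-e` flag forces an EOS on interrupt so mp4mux can write a valid file header before exiting.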

Wall clock display

To accurately display the “wall-clock” time (NTP) for a specific frame, you have to bridge the gap between Media Time (RTP/Segments) and Real Time (NTP).

The method changes depending on whether you are using a packet-based protocol (WebRTC) or a file-based protocol (HLS).


1. WebRTC / RTP (Packet-Level Precision)

In WebRTC, the timestamp is calculated dynamically by the client. It is the most precise method but requires the most math.

  • The Source: The Edge device sends RTP Packets (video) and RTCP Sender Reports (the “Clock Map”).
  • The Map: The RTCP report explicitly says: “RTP timestamp 90,000 = Friday, 10:00:00 AM UTC.”
  • Client Calculation: The browser uses the getStats() API to find the estimatedPlayoutTimestamp. This value represents the exact NTP time the current frame was captured, adjusted for the network delay and jitter buffer.
  • Accuracy: Frame-accurate (millisecond precision).
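The underlying “clock map” arithmetic can be sketched in a few lines: anchor an RTP timestamp to NTP time using the latest Sender Report, then convert later timestamps via the clock rate (the values below are illustrative):

```python
def rtp_to_wallclock(rtp_ts, sr_rtp_ts, sr_ntp_seconds, clock_rate=90000):
    """Map an RTP timestamp to wall-clock time using an RTCP SR anchor.

    Handles 32-bit RTP timestamp wrap-around relative to the SR anchor.
    """
    diff = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF
    if diff >= 0x80000000:        # timestamp is actually *before* the SR
        diff -= 0x100000000
    return sr_ntp_seconds + diff / clock_rate

# The SR said: RTP 90,000 == 36000.0 s (10:00:00). One second of video later:
print(rtp_to_wallclock(180_000, 90_000, 36000.0))  # 36001.0
```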

2. HLS (Manifest-Level Precision)

HLS doesn’t have a continuous “clock map” like RTCP. Instead, it embeds time metadata into the playlist or the video stream itself.

  • The Source: The server converts the RTP stream into segments and writes the time into the .m3u8 manifest.
  • The Tag: #EXT-X-PROGRAM-DATE-TIME. This tag associates the first frame of a segment with an absolute UTC time.
  • Client Calculation: The player (like hls.js) reads this tag. To find the current time, it takes the programDateTime of the segment and adds the current playback offset (e.g., if you are 2 seconds into a 6-second segment).
  • Accuracy: High, but dependent on how frequently the server writes the tag.
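The HLS-side calculation is simple arithmetic, sketched here with illustrative values:

```python
from datetime import datetime, timedelta, timezone

def current_wallclock(program_date_time, offset_into_segment):
    """Wall-clock time of the current playback position within a segment:
    the segment's #EXT-X-PROGRAM-DATE-TIME plus the offset into it."""
    return program_date_time + timedelta(seconds=offset_into_segment)

# #EXT-X-PROGRAM-DATE-TIME:2024-01-05T10:00:00.000Z, and we are 2 s into the segment
pdt = datetime(2024, 1, 5, 10, 0, 0, tzinfo=timezone.utc)
print(current_wallclock(pdt, 2.0).isoformat())  # 2024-01-05T10:00:02+00:00
```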

3. Comparison Summary

| Feature | WebRTC | HLS |
| --- | --- | --- |
| Mechanism | RTCP Sender Reports (SR) | #EXT-X-PROGRAM-DATE-TIME |
| Data Type | Binary packets (UDP) | Text in Manifest (HTTP) |
| Browser Tool | RTCPeerConnection.getStats() | hls.currentProgramDateTime |
| Update Frequency | Every 1–2 seconds | Once per segment (e.g., 2–6s) |
| Main Challenge | Syncing local clock vs. sender clock | Clock “drift” if segments are long |

Which should you use?

  • Use WebRTC if your frontend needs to trigger an action at the exact millisecond something happens (e.g., an AI detecting a face on a security camera).
  • Use HLS if you just need a “Clock” overlay on the video player that shows the approximate time for human reference.


What library is used for receiving RTP and converting it to WebRTC?

In this architecture (Edge device → Web Server → Frontend), the software sitting in the middle is typically called a Media Server or a WebRTC Gateway.

Because browsers cannot receive raw RTP directly (due to the lack of encryption and NAT traversal), this library acts as a “translator” that takes the raw incoming RTP and wraps it in the security layers (DTLS/SRTP) required for WebRTC.


1. Top Libraries/Servers for RTP-to-WebRTC

MediaMTX (Formerly rtsp-simple-server)

This is currently the most popular “out-of-the-box” solution for your exact use case. It is written in Go and is very lightweight.

  • How it works: It can ingest RTSP/RTP from your edge device and automatically provide a WebRTC mount point for your frontend.
  • Best for: Quick setups and edge computing where you don’t want to write much code.

Janus WebRTC Server

A highly modular, C-based server designed by Meetecho.

  • How it works: It has a specific Streaming Plugin. You send your RTP packets to a specific port on the Janus server, and it handles the WebRTC “handshake” (SDP/ICE) with the frontend.
  • Best for: High-performance, scalable systems where you need a rock-solid, battle-tested gateway.

GStreamer

If you are building a custom application in C, C++, or Python, GStreamer is the “Swiss Army Knife” of media.

  • How it works: You build a “pipeline.” One end is rtpbin (to receive RTP) and the other is webrtcbin (to hand the stream to the browser).
  • Best for: Deep technical control, transcoding, or adding AI filters (like face detection) in the middle of the stream.

Mediasoup / Pion

These are more “developer-centric” libraries rather than finished servers.

  • Pion (Go): The leading library if you are writing your web server in Go. It gives you total control over the RTP packets.
  • Mediasoup (Node.js/C++): Extremely powerful for routing media. It is often used in professional conferencing tools.

2. How the “Conversion” Works Internally

The library doesn’t usually change the video data itself (unless you ask it to transcode). Instead, it performs a Header and Security transformation:

  1. RTP Ingest: The library listens on a UDP port for your Edge device’s RTP packets.
  2. Clock Sync: It listens for RTCP Sender Reports to maintain the NTP-to-RTP timestamp mapping.
  3. DTLS/SRTP Encryption: It takes the payload (the H.264/H.265 data) and encrypts it using the keys negotiated with the browser.
  4. Signaling: It generates the SDP (Session Description) that your frontend needs to connect.

Library Comparison

| Library | Language | Complexity | Best Use Case |
| --- | --- | --- | --- |
| MediaMTX | Go | Low | Quick “RTSP to WebRTC” bridge; minimal coding required. |
| Janus | C | Medium | General-purpose WebRTC gateway with a plugin architecture. |
| GStreamer | C / Python | High | Complex pipelines, hardware acceleration, and edge-side processing. |
| Pion | Go | High | Building custom, high-performance media servers in Go. |
| Mediasoup | Node.js / C++ | High | Massive-scale multi-party conferencing (SFU). |

How to measure the End-to-End latency

1. RTCP based method

Calculates Round Trip Time (RTT) and estimates End-to-End (E2E) latency using protocol reports.

  • RTT Formula (RFC 3550): $RTT = A - LSR - DLSR$, where $A$ is the arrival time of the report, $LSR$ is the timestamp from the last Sender Report, and $DLSR$ is the receiver's delay since that SR.
  • E2E Calculation Logic:
    1. Sender: Encoder captures a frame at NTP time $T_{ntp\_send}$.
    2. RTP Packet: The RTP timestamp $T_{rtp}$ is mapped to $T_{ntp\_send}$ using the most recent RTCP Sender Report (SR).
    3. Receiver: Frame is rendered at local NTP time $T_{ntp\_recv}$.
    4. Result: $Latency = T_{ntp\_recv} - T_{ntp\_send}$ (requires synchronized clocks).
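The RTT formula from RFC 3550 can be computed like this (values are in the 32-bit “middle” NTP format that the RTCP fields actually use: 16 bits of seconds, 16 bits of fraction; the numbers are illustrative):

```python
def rtcp_rtt_seconds(arrival_ntp16, lsr, dlsr):
    """Round-trip time from an RTCP Receiver Report (RFC 3550, 6.4.1).

    All inputs are 32-bit 'middle' NTP format: 16 bits of seconds and
    16 bits of fractional seconds (units of 1/65536 s).
    """
    rtt = (arrival_ntp16 - lsr - dlsr) & 0xFFFFFFFF
    return rtt / 65536.0

lsr = 100 << 16               # our SR went out at t = 100 s
dlsr = int(0.3 * 65536)       # the remote held the report for 0.3 s
arrival = int(100.5 * 65536)  # the RR referencing it arrived at t = 100.5 s
print(round(rtcp_rtt_seconds(arrival, lsr, dlsr), 3))  # 0.2
```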

Why this is a “Rough Estimation”

Even with perfect clock sync, three factors can skew results by 10ms to 100ms:
1. Sampling Offset: RTP timestamps are often set at encoding, not capture. Sensor lag adds ~16ms (at 60fps).
2. Monitor Refresh/V-Sync: Frames sit in the GPU buffer waiting for the next refresh cycle (avg. 8.3ms on 60Hz).
3. Mapping Frequency: RTCP SRs are periodic. Clock drift between reports can cause millisecond errors.

Drawback

Some WebRTC gateways terminate RTCP when relaying.

How to calculate End-to-End Latency via RTCP SR

To measure the true end-to-end latency using RTCP SR, Janus must perform NTP-to-NTP mapping (Timestamp Translation).

  1. Source SR: The Edge sends an SR stating that RTP_A corresponds to NTP_Edge.
  2. Janus receives the SR: Janus notes that RTP_A arrived at NTP_Janus_Arrival.
  3. Janus sends an SR to the Browser: Janus sends a new SR stating that RTP_A corresponds to NTP_Janus_Departure.

    Standard Janus behavior is to use its own clock for NTP_Janus_Departure. To get end-to-end latency,
    the NTP value in the SR sent to the browser would need to be the original NTP from the Edge,
    adjusted for any internal processing time in Janus.

    The Limitation
    Standard WebRTC implementations and Janus do not usually “pass through” the original NTP timestamp
    in the RTCP SR because NTP clocks between different servers are rarely synchronized well enough for
    this to work without a common reference (like PTP or a shared NTP server).

2. clockoverlay method

Also known as the “Glass-to-Glass” or physical method. It measures the absolute time from light hitting the sensor to light leaving the display.

  • The Setup: Place a high-speed millisecond timer in the camera’s view.
  • The Capture: Use a second camera to take a photo showing both the original timer and the receiving monitor in the same frame.
  • The Result: Subtract the time shown on the monitor from the time on the original timer.

3. abs-capture-time method

A high-precision software instrumentation method that tracks a packet’s “life cycle” using internal timestamps embedded directly in the stream.

  • Method: Use RTP Header Extensions (RFC 8285) to embed a “Capture Timestamp” (RFC 6051 or Absolute Capture Time) in every packet.
  • Requirement: Requires sub-microsecond clock synchronization (e.g., PTP).

Technical Implementation

The receiver identifies the extension data through a handshake between signaling and packet structure:

  1. SDP Negotiation (The Map):
    The sender defines the extension ID in the SDP:
    a=extmap:5 urn:ietf:params:rtp-hdrext:ntp-64
    Receiver learns: “ID 5 = 8-byte NTP-64 timestamp.”

  2. RTP Header (The Signal):
    The receiver checks the X (extension) bit of the RTP header: the fourth bit of the first header byte, following the 2-bit version and 1-bit padding fields.

    • If X = 1, a Header Extension block follows the SSRC.
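A sketch of both checks: reading the X bit from the first header byte, and walking RFC 8285 one-byte extension elements (the sample bytes are hand-built for illustration):

```python
def rtp_has_extension(packet: bytes) -> bool:
    """True if the RTP header's X (extension) bit is set.

    Byte 0 layout: V (2 bits), P (1 bit), X (1 bit), CC (4 bits)."""
    return bool(packet[0] & 0x10)

def one_byte_ext_ids(ext_payload: bytes):
    """Yield (id, data) from an RFC 8285 one-byte header-extension payload.

    Each element starts with one byte: (id << 4) | (length - 1).
    Padding bytes (0x00) are skipped; this sketch ignores the id=15 marker.
    """
    i = 0
    while i < len(ext_payload):
        b = ext_payload[i]
        if b == 0:                    # padding
            i += 1
            continue
        ext_id, length = b >> 4, (b & 0x0F) + 1
        yield ext_id, ext_payload[i + 1:i + 1 + length]
        i += 1 + length

# 0x90 = version 2, padding 0, X=1 -> an extension block follows
print(rtp_has_extension(bytes([0x90, 0x60]) + bytes(10)))  # True

# One-byte element 0x57: id=5 (our a=extmap:5 NTP-64), length 7+1=8 bytes
ext = bytes([0x57]) + bytes(8)
print([(i, len(d)) for i, d in one_byte_ext_ids(ext)])     # [(5, 8)]
```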

Extension Comparison

| Extension | Primary Goal | Resolution | Survives Transcoding? |
| --- | --- | --- | --- |
| RFC 6051 | Rapid A/V Sync | 64-bit NTP | No (usually) |
| Abs-Capture-Time | E2E Latency / Sync | 64-bit NTP | Yes |
| RFC 5450 (TOffset) | Jitter Compensation | 24-bit Offset | No |
| AST (abs-send-time) | Congestion Control | 24-bit Time | No |

Is it “Correct”?

Yes. For 99% of applications (Video conferencing, standard streaming, etc.), the RTCP/Instrumentation method is considered the “System Latency” and is the industry standard for telemetry.

Relay of Plain RTP

  1. The Relay Evolution: nftables vs. rtpengine
    You are currently at a crossroads. While nftables is excellent for simple, low-overhead UDP forwarding, it is “blind” to the encryption keys required for SRTP.

Current State (Plain RTP): You can use nftables, but it lacks the session management and NAT traversal intelligence needed for scaling.

Future State (SRTP): You must use a media-aware proxy like rtpengine. It handles the “Security Translation” (e.g., converting an Edge’s SDES keys to Janus’s DTLS-SRTP keys) which nftables cannot do.

  2. Encryption Types & Key Management
    The “type” of SRTP you use depends entirely on your Edge hardware.

SDES-SRTP: Keys are sent as visible text in the signaling (SDP). This is common in legacy enterprise IP phones. You must use SIP-over-TLS to keep these keys from leaking.

DTLS-SRTP: Keys are negotiated via a handshake directly on the media path. This is the WebRTC standard and is mandatory for Janus. It is more secure because it provides “Forward Secrecy.”

Understanding the sequence for these protocols is key to grasping how modern VoIP and WebRTC handle security. While they all aim to encrypt media, they handle the “handshake” (key exchange) very differently.


1. SDES-SRTP (The “Old School” Way)

In SDES (Session Description Protocol Security Descriptions), the secret cryptographic keys are sent in plain text within the SDP message. This requires the signaling channel (SIP) to be encrypted (using TLS), otherwise, anyone sniffing the network can see the keys.

  Caller (Alice)                      Signaling Server                      Callee (Bob)
        |                                     |                                     |
        |--- INVITE (SDP: crypto:AES_128...) ->|                                     |
        |    [Key is in the text!]            |--- INVITE (SDP: crypto:AES_128...) ->|
        |                                     |                                     |
        |                                     |<-- 200 OK (SDP: crypto:AES_128...) -|
        |<-- 200 OK (SDP: crypto:AES_128...) -|                                     |
        |                                     |                                     |
        |================== SRTP Encrypted Media (using keys from SDP) =============|


2. DTLS-SRTP (The Modern Standard)

DTLS-SRTP is more secure because the keys are never sent via signaling. Instead, the SDP only contains a “fingerprint” of a certificate. The actual keys are generated via a direct DTLS handshake between the two endpoints once the media path is open.

  Endpoint A                                                              Endpoint B
      |                                                                        |
      |------- SDP Offer (a=fingerprint:SHA-256...) -------------------------->|
      |                                                                        |
      |<------ SDP Answer (a=fingerprint:SHA-256...) --------------------------|
      |                                                                        |
      |   [ ICE Connectivity Checks / STUN / TURN ]                            |
      |                                                                        |
      |----------------------- DTLS Client Hello ----------------------------->|
      |<---------------------- DTLS Server Hello + Cert -----------------------|
      |----------------------- DTLS Key Exchange ------------------------------>|
      |                                                                        |
      |   [ Key derivation happens locally on both sides ]                     |
      |                                                                        |
      |================== SRTP Encrypted Media (using DTLS keys) ==============|


3. WebRTC (The Complete Stack)

WebRTC is a collection of protocols. It strictly mandates DTLS-SRTP for security and ICE for NAT traversal. It is essentially the DTLS-SRTP flow above, but with a complex preamble to find a network path.

 Browser A         Signaling (Web Server)             STUN/TURN             Browser B
     |                        |                           |                     |
     |--- (1) Create Offer -->|                           |                     |
     |    [SDP + Fingerprint] |                           |                     |
     |                        |-------------- (2) Forward Offer --------------->|
     |                        |                           |                     |
     |------ (3) ICE Gathering (Query Candidates) ------->|                     |
     |                        |                           |<-(4) ICE Gathering--|
     |                        |                           |                     |
     |                        |<-------- (5) Answer (SDP + Fingerprint) --------|
     |<--- (6) Forward Answer |                           |                     |
     |                        |                           |                     |
     |<<<<<<<<<<<<<<<<<<<<<< (7) ICE Connectivity Checks >>>>>>>>>>>>>>>>>>>>>>>|
     |                                                                          |
     |<<<<<<<<<<<<<<<<<<< (8) DTLS Handshake (Key Exchange) >>>>>>>>>>>>>>>>>>>>|
     |                                                                          |
     |======================== (9) Secure Media (SRTP) =========================|


Key Differences Summary

| Feature | SDES-SRTP | DTLS-SRTP | WebRTC |
| --- | --- | --- | --- |
| Key Exchange | Inside SDP (Signaling) | Direct DTLS (Media Path) | Direct DTLS (Media Path) |
| Security | Lower (Relies on SIP TLS) | High (Perfect Forward Secrecy) | High (Mandatory) |
| NAT Traversal | Minimal (Standard RTP) | Standard | Built-in (ICE/STUN/TURN) |
| Usage | Legacy VoIP / SIP | Modern VoIP / Hardphones | Browsers / Mobile Apps |


Janus

  1. The VideoRoom Plugin (WebRTC Ingestion)
    The VideoRoom plugin is designed for WebRTC-to-WebRTC communication. Since the WebRTC standard mandates DTLS-SRTP, Janus fully supports it here.

How it works: When a “publisher” (like a browser or a WebRTC-capable SDK) joins a room, Janus establishes a full PeerConnection. This includes the ICE handshake and a DTLS handshake to derive the SRTP keys.

Support: Native and Mandatory. You cannot turn off DTLS-SRTP for these types of connections in Janus.

  2. The Streaming Plugin (External Ingestion)
    This plugin is used when you want to “feed” media into Janus from an external tool like FFmpeg or GStreamer (often called “Mountpoints”).

DTLS-SRTP Support: No. The Streaming plugin generally expects Plain RTP or SDES-SRTP.

Why? Most external RTP tools are not designed to perform the complex DTLS handshake required for WebRTC. Instead, you typically send raw RTP to a local port on the Janus server.

Workaround: If you need to ingest via DTLS-SRTP from an external source, you would typically use a tool like GStreamer with the webrtcsink element or WHIP (WebRTC-HTTP Ingestion Protocol), which Janus supports via a dedicated wrapper.

RTP Packet Overview

The RTP packet format is described here:

https://en.wikipedia.org/wiki/Real-time_Transport_Protocol

Sequence Number vs Timestamp

Why are both needed?

While they might seem redundant at first glance, they serve two distinct and critical purposes in real-time communication over unreliable networks like UDP:

1. Sequence Number (The “Order” Logic)

  • Definition: A 16-bit counter that increments by exactly 1 for every RTP packet sent.
  • Primary Goal: Packet Loss Detection and Reordering.
  • Why it’s needed: UDP does not guarantee delivery or order. If the receiver gets packets in the order 1, 2, 4, 3, the Sequence Number allows the rtpjitterbuffer to put 3 before 4 and realize that no packets are actually missing. If it sees 1, 2, 5, it knows immediately that 3 and 4 were lost in transit.

2. Timestamp (The “Timing” Logic)

  • Definition: A 32-bit value that increments based on the sampling clock (e.g., 90,000 units per second for a 90kHz video clock).
  • Primary Goal: Playback Synchronization and Jitter Compensation.
  • Why it’s needed:
    • Fragmentation: A single large video frame (like an I-frame) is often split into multiple RTP packets to fit within the MTU. All these packets will have the same timestamp (because they belong to the same instant in time) but different sequence numbers.
    • Variable Timing: Video frames aren’t always captured or sent at perfect intervals. The timestamp tells the player exactly how many milliseconds to wait between showing Frame A and Frame B, regardless of when the packets actually arrived over the wire.
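A toy illustration of the sequence-number logic (the 16-bit wrap-around is deliberately ignored for brevity):

```python
def missing_packets(received_seqs):
    """Report sequence numbers missing from a received batch.

    A real jitter buffer does this incrementally over a sliding window,
    and must handle the 16-bit wrap-around that is ignored here.
    """
    expected = set(range(min(received_seqs), max(received_seqs) + 1))
    return sorted(expected - set(received_seqs))

print(missing_packets([1, 2, 4, 3]))  # [] -> just reordered, nothing lost
print(missing_packets([1, 2, 5]))     # [3, 4] -> two packets lost in transit
```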

Summary Comparison

| Feature | Sequence Number | Timestamp |
| --- | --- | --- |
| Increment | Always +1 per packet | Based on clock rate (e.g., +3000 for 30fps @ 90kHz) |
| Duplicate Values? | Never (until 16-bit wrap-around) | Yes (for fragmented packets of the same frame) |
| Main Job | Find missing data / Fix network order | Smooth out “stutter” / Sync Audio with Video |