Understanding Video Streaming Application Architecture

This document outlines the fundamental concepts and architectural components of modern video streaming applications. It covers key terminology, definitions, and the typical five-step pipeline involved in delivering video content.

Key Terminology and Definitions

To understand the architecture of modern streaming, it is essential to define the core technologies, grouped by their functional domains.

1. Media Compression & Codecs

  • Codec: A device or program (COmpressor-DECompressor) that shrinks large video files for transmission and expands them for viewing.

  • H.265 (HEVC): High Efficiency Video Coding. The successor to H.264, offering 25% to 50% better data compression at the same quality level.

  • H.264 (AVC): Advanced Video Coding. The industry standard for over a decade, known for universal compatibility but lower efficiency compared to HEVC.
  • CTU (Coding Tree Unit): The basic processing unit of HEVC, which can be as large as 64×64 pixels, allowing for more efficient processing of high-resolution video than H.264’s 16×16 macroblocks.
  • AAC AudioSpecificConfig: A global header for MPEG-4 Audio that contains essential information for an AAC decoder, such as the Audio Object Type (AOT), sampling rate, and channel configuration. It is typically generated by the encoder and used to initialize the decoder.
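As a concrete illustration, the first two bytes of an AudioSpecificConfig can be unpacked with a few bit operations. A minimal sketch (TypeScript), assuming the common two-byte form without extended object types or explicit sampling frequencies:

```typescript
// Minimal AudioSpecificConfig reader (assumes the common 2-byte layout).
// Bit layout: AOT (5 bits) | sampling frequency index (4 bits) | channel configuration (4 bits)
function parseAudioSpecificConfig(bytes: Uint8Array): {
  audioObjectType: number;
  samplingFrequencyIndex: number;
  channelConfiguration: number;
} {
  const audioObjectType = bytes[0] >> 3;                                      // top 5 bits of byte 0
  const samplingFrequencyIndex = ((bytes[0] & 0x07) << 1) | (bytes[1] >> 7);  // next 4 bits
  const channelConfiguration = (bytes[1] >> 3) & 0x0f;                        // next 4 bits
  return { audioObjectType, samplingFrequencyIndex, channelConfiguration };
}

// Example: 0x12 0x10 -> AOT 2 (AAC-LC), frequency index 4 (44.1 kHz), 2 channels (stereo)
console.log(parseAudioSpecificConfig(new Uint8Array([0x12, 0x10])));
```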

2. Ingest

  • SRT (Secure Reliable Transport): A UDP-based protocol designed for reliable video ingest over unstable networks, using packet retransmission (ARQ) and optional forward error correction (FEC).
  • RTP (Real-time Transport Protocol): A protocol for delivering audio and video over IP networks. It carries the actual media data and includes sequence numbers and timestamps to help the receiver reassemble the stream correctly.
  • RTSP (Real Time Streaming Protocol): A network control protocol used to manage media sessions. It allows clients to control the media server with commands like play, pause, and stop, acting as a “remote control” for the stream.
  • RTCP (Real-time Transport Control Protocol): A companion protocol to RTP that provides feedback on the quality of data delivery (QoS) and helps synchronize different media streams (e.g., audio and video).
  • Relationship between RTP/RTSP/RTCP:
    Standard: RTSP (TCP) + RTP (UDP) + RTCP (UDP) = three separate connections.
    Interleaved: everything is multiplexed into the single RTSP (TCP) connection.
    Trade-off: interleaving makes the video more likely to stutter, because TCP retransmits lost packets and holds back newer data until they arrive, whereas UDP would simply drop them and move on.
  • RTMP (Real-Time Messaging Protocol): Originally developed by Macromedia (now Adobe), this is a TCP-based protocol used for streaming audio, video, and data. It remains the industry standard for “first-mile” ingest, where an encoder sends a live stream to a media server or CDN (e.g., YouTube Live or Twitch).

3. Delivery Protocols

  • HLS (HTTP Live Streaming): An adaptive bitrate streaming protocol developed by Apple that serves video via standard HTTP infrastructure.
  • LL-HLS (Low-Latency HLS): An extension of HLS that reduces latency from the tens of seconds typical of standard HLS to the 2–6 second range, using partial segments and preload hints.
  • DASH (Dynamic Adaptive Streaming over HTTP): An international standard for adaptive bitrate streaming that allows high-quality streaming of media content over the internet. It works by breaking content into a sequence of small HTTP-based file segments, each containing a small chunk of playback time.
  • LL-DASH (Low-Latency DASH): An extension of the DASH standard designed to reduce end-to-end latency in live streaming. It achieves this with shorter segment durations and chunked transfer encoding of CMAF chunks, delivering media fragments as soon as they are produced (early designs also experimented with HTTP/2 Push).

  • WebRTC (Web Real-Time Communication): An open-source project and IETF/W3C standard that enables ultra-low latency (sub-500ms) real-time communication directly in web browsers without plugins. Unlike HTTP-based streaming (HLS/DASH), WebRTC is stateful and primarily uses UDP to prioritize speed. It relies on several sub-protocols:

    • Why NAT Traversal is Required: Most devices sit behind routers using Network Address Translation (NAT) and have private IP addresses (e.g., 192.168.x.x). Peers cannot communicate directly because they don’t know each other’s public identities, and firewalls block unsolicited incoming traffic. NAT traversal “punches holes” through these barriers.
    • ICE (Interactive Connectivity Establishment): A framework that coordinates STUN and TURN to find the best path between peers.
    • STUN (Session Traversal Utilities for NAT): Allows a device to discover its public IP address to bypass simple NATs.
    • TURN (Traversal Using Relays around NAT): A relay server used as a fallback when firewalls block direct peer-to-peer connections.
    • DTLS (Datagram Transport Layer Security): Secures the initial connection and handles the exchange of encryption keys over UDP.
    • SRTP (Secure Real-time Transport Protocol): Uses the keys from the DTLS handshake to encrypt the actual media payload (audio/video).
    • SDP (Session Description Protocol): A text-based format used to negotiate session parameters (codecs, resolution, encryption keys). It acts as the “contract” that both sides must agree upon before media can flow.
  • WHIP (WebRTC-HTTP Ingestion Protocol): A standard (RFC 9725) for pushing media from an encoder to a server using an HTTP POST. It solves the “signaling problem” by standardizing the SDP (Session Description Protocol) exchange, allowing hardware and software encoders (like OBS) to support WebRTC ingest without custom WebSocket implementations.

  • WHEP (WebRTC HTTP Egress Protocol): A standard for pulling media from a server to a player. Like WHIP, it uses HTTP to standardize signaling for playback, enabling universal WebRTC players that work across different media servers without vendor-specific integration code.
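As a rough sketch of how lightweight this HTTP-based signaling is, the snippet below (TypeScript, browser-side; the endpoint URL is a placeholder) performs a WHIP-style publish: POST the local SDP offer, then apply the SDP answer returned in the response body.

```typescript
// Hypothetical WHIP publish: POST the SDP offer, receive the SDP answer.
async function publishViaWhip(endpoint: string, stream: MediaStream): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();
  stream.getTracks().forEach((track) => pc.addTrack(track, stream));

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  // WHIP signaling is a single HTTP POST carrying the SDP offer.
  const res = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/sdp" },
    body: pc.localDescription!.sdp,
  });
  if (!res.ok) throw new Error(`WHIP endpoint rejected the offer: ${res.status}`);

  // The response body is the SDP answer from the media server.
  await pc.setRemoteDescription({ type: "answer", sdp: await res.text() });
  return pc;
}

// Usage sketch (placeholder URL): publish the local camera to a WHIP endpoint.
// const stream = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
// await publishViaWhip("https://example.com/whip/my-stream", stream);
```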

4. Packaging & Standards

  • CMAF (Common Media Application Format): A standard that allows a single fragmented MP4 file to be compatible with both HLS and DASH players.
  • Container Format: A file format that specifies how data (video, audio, subtitles, metadata) is stored together in a single file. It doesn’t compress the data itself but organizes it for playback. Examples include MP4, WebM, and MOV.
  • Manifest (M3U8): A text-based playlist used by HLS that indexes video segments and quality variants.
  • MPEG-TS (MPEG Transport Stream): A standard digital container format for transmission and storage of audio, video, and Program Specific Information (PSI) data. It is commonly used in broadcast systems like DVB and ATSC, and was historically the primary container format for HLS.
  • MP4 (MPEG-4 Part 14): A widely used container format for storing video, audio, and other data. It is based on Apple’s QuickTime File Format and is highly compatible across devices and platforms.
  • fMP4 (Fragmented MP4): An MP4 format that breaks a file into independent segments, making it suitable for live streaming and low-latency delivery.
  • WebM: An open, royalty-free media file format designed for the web. It typically uses VP8 or VP9 video codecs and Vorbis or Opus audio codecs.
  • MOV (QuickTime File Format): A proprietary container file format developed by Apple, primarily used for QuickTime multimedia framework. It can contain multiple tracks of video, audio, and text.

5. Web API

  • MSE (MediaSource Extensions): A Web API that allows JavaScript to construct media streams for <audio> and <video> elements, giving web applications fine-grained control over media data for adaptive streaming.
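A minimal sketch of the MSE flow (TypeScript; the segment URLs and the codec string are assumptions): create a MediaSource, attach it to a <video> element, and append fMP4 segments to a SourceBuffer.

```typescript
// Minimal MSE playback sketch: append an init segment and one media segment.
async function playWithMse(video: HTMLVideoElement): Promise<void> {
  const mime = 'video/mp4; codecs="avc1.64001f"'; // assumed H.264 High profile fMP4
  if (!MediaSource.isTypeSupported(mime)) throw new Error("Codec not supported");

  const mediaSource = new MediaSource();
  video.src = URL.createObjectURL(mediaSource);
  await new Promise<void>((resolve) =>
    mediaSource.addEventListener("sourceopen", () => resolve(), { once: true })
  );

  const sourceBuffer = mediaSource.addSourceBuffer(mime);
  // Placeholder URLs: an fMP4 init segment followed by one media segment.
  for (const url of ["/video/init.mp4", "/video/segment-1.m4s"]) {
    const data = await (await fetch(url)).arrayBuffer();
    sourceBuffer.appendBuffer(data);
    // Wait until the buffer finishes processing before appending more data.
    await new Promise<void>((resolve) =>
      sourceBuffer.addEventListener("updateend", () => resolve(), { once: true })
    );
  }
  await video.play();
}
```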

General Video Streaming Flow

The video streaming process typically follows a predictable five-step pipeline.

  1. Capture & Encoding (The Origin)
    A camera captures raw frames. An encoder (e.g., FFmpeg or NVIDIA NVENC) compresses the raw data with a codec such as H.265 to reduce its size.

  2. Contribution/Ingest (The First Mile)
    The compressed video is sent from the edge device to a central server.
    • Technologies used: SRT (for reliability over the public internet) or RTP/UDP (for the lowest overhead in managed networks).

  3. Processing & Transcoding (The Core)
    The server receives the stream. It may change the resolution (transcoding) or simply re-package it.
    • Technologies used: GStreamer or FFmpeg.

  4. Packaging & Distribution (The Delivery)
    For HLS, the video is split into small fragments (fMP4) and indexed in a manifest (M3U8). An HTTP server (e.g., Ktor or a CDN) hosts these files.

  5. Playback & Consumption (The Last Mile)
    A web browser fetches the playlist and segments. A library like hls.js feeds the data into the browser’s hardware decoder via MediaSource Extensions (MSE); see the playback sketch below.
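A minimal playback sketch for step 5, using hls.js (TypeScript; the manifest URL is a placeholder, and Safari's native HLS support is used as a fallback):

```typescript
import Hls from "hls.js";

// Attach an HLS stream to a <video> element, using MSE via hls.js where needed.
function startPlayback(video: HTMLVideoElement, manifestUrl: string): void {
  if (Hls.isSupported()) {
    // hls.js fetches the M3U8 playlist, downloads segments, and feeds them to MSE.
    const hls = new Hls();
    hls.loadSource(manifestUrl);
    hls.attachMedia(video);
    hls.on(Hls.Events.MANIFEST_PARSED, () => void video.play());
  } else if (video.canPlayType("application/vnd.apple.mpegurl")) {
    // Safari plays HLS natively, without MSE.
    video.src = manifestUrl;
    void video.play();
  }
}

// Usage (placeholder URL):
// startPlayback(document.querySelector("video")!, "https://example.com/live/stream.m3u8");
```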

WebRTC Connection Establishment Flow

Establishing a WebRTC connection requires a “handshake” process known as Signaling. Historically, WebRTC did not specify a signaling transport, leading to proprietary implementations. Today, WHIP and WHEP provide the industry standard for HTTP-based signaling, enabling interoperability between different encoders, servers, and players.

  1. Signaling (Offer/Answer Exchange):
    • Offer: The initiator (e.g., a WHIP encoder) generates an SDP describing its media capabilities and sends it to the server via an HTTP POST.
    • Answer: The receiver (Answerer) processes the offer, selects compatible codecs, generates its own SDP, and sends it back.
  2. ICE Candidate Gathering: Both peers contact STUN/TURN servers to discover their public IP addresses and ports (known as ICE Candidates).
  3. ICE Candidate Exchange: Peers share these candidates through the signaling channel. This allows them to find the most efficient network path (Direct P2P vs. Relay).
  4. DTLS Handshake: Once a network path is established, the peers perform a secure handshake to verify identities and generate encryption keys.
  5. Media Flow: Encrypted audio and video data begin flowing using SRTP, utilizing the keys from the DTLS step.
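A condensed browser-side sketch of steps 1–3 (TypeScript; sendToSignalingServer and onSignalingMessage are hypothetical stand-ins for whatever signaling transport is used, e.g., WHIP/WHEP or a WebSocket):

```typescript
// Hypothetical signaling helpers; replace with WHIP/WHEP or a WebSocket channel.
declare function sendToSignalingServer(msg: unknown): void;
declare function onSignalingMessage(handler: (msg: any) => void): void;

async function connect(): Promise<RTCPeerConnection> {
  // ICE servers: STUN discovers the public address; a TURN entry would be the relay fallback.
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
  });

  // Steps 2-3: gathered ICE candidates are shared over the signaling channel.
  pc.onicecandidate = (e) => {
    if (e.candidate) sendToSignalingServer({ candidate: e.candidate });
  };

  // Step 1: create and send the SDP offer (here, a receive-only video session).
  pc.addTransceiver("video", { direction: "recvonly" });
  await pc.setLocalDescription(await pc.createOffer());
  sendToSignalingServer({ sdp: pc.localDescription });

  // Apply the remote answer and remote candidates as they arrive.
  onSignalingMessage(async (msg) => {
    if (msg.sdp) await pc.setRemoteDescription(msg.sdp);
    if (msg.candidate) await pc.addIceCandidate(msg.candidate);
  });

  // Steps 4-5 (DTLS handshake and SRTP media flow) happen automatically inside
  // the browser once a working candidate pair is selected.
  return pc;
}
```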

The HLS Protocol and a Clarification on Container Formats

A common misconception in video engineering is that the difference between HLS and LL-HLS is the use of Fragmented MP4 (fMP4). This is technically incorrect.

fMP4 in Standard HLS
Standard HLS has supported fMP4 segments since protocol version 7 (introduced in 2016). Before this, HLS exclusively used MPEG-2 Transport Streams (.ts). The primary advantage of fMP4 in standard HLS is its compatibility with the CMAF standard, allowing a single set of video files to serve both HLS and DASH clients. Most browser engines (via MSE) prefer fMP4 containers over legacy TS for HEVC, both because fMP4 aligns with modern standards such as CMAF and DASH and because it is effectively a requirement for HEVC/H.265 playback on platforms such as macOS and iOS.

Defining Differences: Standard HLS vs. LL-HLS

The real difference lies in the delivery mechanisms, not the container format. While standard HLS can use fMP4, it still delivers complete segments (e.g., 2–6 seconds long), whereas LL-HLS introduces several specific technical features to reduce latency:

  • Partial Segments (Parts): LL-HLS divides segments into tiny “parts” (e.g., 200ms). These are advertised in the playlist and can be downloaded as soon as they are ready, long before the full parent segment is complete.

  • Preload Hints: The server informs the player of the URL of the next expected partial segment in advance, allowing the player to issue a request immediately when data becomes available.

  • Blocking Playlist Reloads: Instead of constant polling, the server “holds” a playlist request until new data arrives, eliminating unnecessary network round trips.

  • Playlist Delta Updates: To reduce overhead, the server can send only the changed portions of a playlist rather than the entire file.
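As an illustration of how blocking reloads and delta updates surface on the wire, a playlist request simply carries the _HLS_msn / _HLS_part query directives, and the server holds the response until that media sequence number and part exist. A rough sketch (TypeScript; the URL and numbers are placeholders):

```typescript
// Blocking playlist reload sketch: ask for a playlist that contains at least
// media sequence number 1800, part 2. The server "blocks" until it is available.
async function fetchBlockingPlaylist(playlistUrl: string, msn: number, part: number): Promise<string> {
  const url = new URL(playlistUrl);
  url.searchParams.set("_HLS_msn", String(msn));   // target Media Sequence Number
  url.searchParams.set("_HLS_part", String(part)); // target Partial Segment within it
  const res = await fetch(url.toString());
  if (!res.ok) throw new Error(`Playlist request failed: ${res.status}`);
  return res.text(); // updated playlist, possibly a delta update (EXT-X-SKIP)
}

// fetchBlockingPlaylist("https://example.com/live/stream.m3u8", 1800, 2);
```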

How to check what codec is supported by Chrome

Open chrome://gpu/ and check the “Video Acceleration Information” section, which lists the codec profiles and resolutions the browser can decode (and encode) in hardware.
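Codec support can also be checked programmatically. A quick sketch (TypeScript, e.g., pasted into the browser console; the codec strings are common examples and results vary by OS, GPU drivers, and installed extensions):

```typescript
// Quick programmatic codec checks, as seen by MSE-based players (hls.js, dash.js).
const candidates: Record<string, string> = {
  "H.264 (AVC)": 'video/mp4; codecs="avc1.64001f"',
  "H.265 (HEVC)": 'video/mp4; codecs="hvc1.1.6.L93.B0"',
  "VP9 (WebM)": 'video/webm; codecs="vp09.00.10.08"',
};

for (const [name, mime] of Object.entries(candidates)) {
  console.log(`${name}: ${MediaSource.isTypeSupported(mime) ? "supported" : "not supported"}`);
}
```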

H.265 is not displayed although the GPU supports it

It is highly likely that the GPU (e.g., an Intel Arc 140V) does have the hardware capability, but that it is being “hidden” or “gated” by the laptop manufacturer (OEM) or by a missing Windows component.

There is a known industry issue where manufacturers such as Dell and HP have reportedly begun disabling H.265 hardware support in the BIOS/ACPI tables to avoid paying a patent royalty fee on every laptop sold.

To resolve this, install the “HEVC Video Extensions” package from the Microsoft Store.

PTS vs. DTS

To understand PTS (Presentation Time Stamp), you also need to know its partner, DTS (Decoding Time Stamp). The two often differ because of how modern video compression (such as H.264 or H.265) works:

  • DTS (decode order): tells the decoder when to process the data.
  • PTS (presentation order): tells the screen when to show the frame.

Because certain frames (B-frames) need information from “future” frames to be decoded, the computer might decode Frame 4 before it can show Frame 2.

  • Decode order (DTS): 1, 4, 2, 3
  • Presentation order (PTS): 1, 2, 3, 4 (the smooth 1-2-3-4 sequence you actually see)
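A tiny sketch of the reorder (TypeScript; the 90 kHz tick values assume a 30 fps stream, so one frame is 3000 ticks):

```typescript
// Frames as they arrive in the bitstream (decode order), with 90 kHz timestamps.
// Frame 4 (a P-frame) is decoded early because B-frames 2 and 3 reference it.
const decodeOrder = [
  { frame: 1, type: "I", dts: 0,    pts: 0 },
  { frame: 4, type: "P", dts: 3000, pts: 9000 },
  { frame: 2, type: "B", dts: 6000, pts: 3000 },
  { frame: 3, type: "B", dts: 9000, pts: 6000 },
];

// The renderer simply presents frames in ascending PTS order: 1, 2, 3, 4.
const presentationOrder = [...decodeOrder].sort((a, b) => a.pts - b.pts).map((f) => f.frame);
console.log(presentationOrder); // [1, 2, 3, 4]
```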

RTP packet jitter

If Packet A is sent at 1.0 s and Packet B at 1.1 s, but Packet B arrives at the receiver 0.01 s before Packet A due to a network hiccup, do-timestamp=true (which stamps each buffer with its arrival time) would give Packet B an earlier timestamp than Packet A. The resulting non-monotonic timestamps cause the “back and forth” jitter that makes video stutter or crashes downstream muxers.

To utilize the correct order stored in the RTP header, we have to change how GStreamer handles the incoming data.

The Solution in GStreamer: rtpjitterbuffer
Instead of just grabbing the packet and stamping it with the “wall clock” immediately, we need a “waiting room” that looks at the RTP sequence numbers and timestamps.

  • RTP Sequence Number: Tells GStreamer the exact order (1, 2, 3…).
  • RTP Timestamp: Tells GStreamer the intended spacing between frames.

RTP packet drop

Even with a jitter buffer, sometimes a packet is lost forever (it’s UDP, after all). If Packet 5 never arrives, there is a “hole” in the timeline. mp4mux hates holes.

videorate sees the hole and says: “A frame is missing, but I need to keep 30 fps for this MP4 file. I will just duplicate the previous frame and give it the timestamp the missing frame should have had.”

GStreamer tips

  1. Added rtpjitterbuffer: inserted after udpsrc to handle network jitter and packet reordering, which helps construct valid timestamps from the RTP packets. Setting latency=200 provides a buffer against network fluctuations.

  2. Added videorate: inserted before the encoder (x264enc) along with a caps filter video/x-raw,framerate=30/1. This forces a constant frame rate and regenerates timestamps for the raw video frames, ensuring that the encoder receives a stream with perfect, monotonic timestamps. It effectively sanitizes the stream and prevents the “Buffer has no PTS” error in the downstream mp4mux element.
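A sketch that ties both tips together, launching a gst-launch-1.0 pipeline from a small Node/TypeScript wrapper (the port, caps, and encoder settings are assumptions for an H.264 RTP source; a real service would more likely use GStreamer bindings than spawn the CLI):

```typescript
import { spawn } from "node:child_process";

// Assumed pipeline: H.264 over RTP on UDP port 5000 -> jitter buffer -> decode ->
// constant 30 fps -> re-encode -> MP4 file. Element names follow the tips above.
const pipeline = [
  "udpsrc port=5000 caps=application/x-rtp,media=video,encoding-name=H264,clock-rate=90000,payload=96",
  "rtpjitterbuffer latency=200",        // reorder packets using RTP sequence numbers
  "rtph264depay", "avdec_h264",         // unpack the RTP payload and decode to raw video
  "videoconvert", "videorate", "video/x-raw,framerate=30/1", // enforce monotonic 30 fps timestamps
  "x264enc tune=zerolatency", "h264parse",
  "mp4mux", "filesink location=out.mp4",
].join(" ! ");

// -e sends EOS on Ctrl+C so mp4mux can finalize a playable file.
const gst = spawn(`gst-launch-1.0 -e ${pipeline}`, { shell: true, stdio: "inherit" });
gst.on("exit", (code) => console.log(`gst-launch-1.0 exited with code ${code}`));
```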

Wall clock display

To accurately display the “wall-clock” time (NTP) for a specific frame, you have to bridge the gap between Media Time (RTP/Segments) and Real Time (NTP).

The method changes depending on whether you are using a packet-based protocol (WebRTC) or a file-based protocol (HLS).


1. WebRTC / RTP (Packet-Level Precision)

In WebRTC, the timestamp is calculated dynamically by the client. It is the most precise method but requires the most math.

  • The Source: The Edge device sends RTP Packets (video) and RTCP Sender Reports (the “Clock Map”).
  • The Map: The RTCP report explicitly says: “RTP timestamp 90,000 = Friday, 10:00:00 AM UTC.”
  • Client Calculation: The browser uses the getStats() API to find the estimatedPlayoutTimestamp. This value represents the exact NTP time the current frame was captured, adjusted for the network delay and jitter buffer.
  • Accuracy: Frame-accurate (millisecond precision).
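A rough sketch of the client-side lookup (TypeScript; assumes an already-connected RTCPeerConnection, and note that estimatedPlayoutTimestamp is not exposed by every browser):

```typescript
// Poll inbound-rtp stats and read the sender-clock playout timestamp, if present.
async function readSenderClock(pc: RTCPeerConnection): Promise<void> {
  const stats = await pc.getStats();
  stats.forEach((report: any) => {
    if (report.type === "inbound-rtp" && report.kind === "video") {
      // estimatedPlayoutTimestamp maps the currently played frame onto the
      // sender's NTP-derived clock (built from RTCP Sender Reports).
      if (report.estimatedPlayoutTimestamp !== undefined) {
        console.log("Estimated sender-clock playout time:", report.estimatedPlayoutTimestamp);
      }
    }
  });
}

// Example: sample once per second while the stream is playing.
// setInterval(() => void readSenderClock(pc), 1000);
```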

2. HLS (Manifest-Level Precision)

HLS doesn’t have a continuous “clock map” like RTCP. Instead, it embeds time metadata into the playlist or the video stream itself.

  • The Source: The server converts the RTP stream into segments and writes the time into the .m3u8 manifest.
  • The Tag: #EXT-X-PROGRAM-DATE-TIME. This tag associates the first frame of a segment with an absolute UTC time.
  • Client Calculation: The player (like hls.js) reads this tag. To find the current time, it takes the programDateTime of the segment and adds the current playback offset (e.g., if you are 2 seconds into a 6-second segment).
  • Accuracy: High, but dependent on how frequently the server writes the tag.
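A sketch of that calculation with hls.js (TypeScript; assumes the playlist carries #EXT-X-PROGRAM-DATE-TIME and uses the current fragment's programDateTime plus the playback offset):

```typescript
import Hls from "hls.js";

// Derive the wall-clock time of the current playback position from
// #EXT-X-PROGRAM-DATE-TIME, using the fragment currently being played.
function attachWallClock(hls: Hls, video: HTMLVideoElement): void {
  hls.on(Hls.Events.FRAG_CHANGED, (_event, data) => {
    const frag = data.frag;
    if (frag.programDateTime == null) return; // playlist has no PROGRAM-DATE-TIME tag

    // programDateTime is the absolute time (ms) of the fragment's first frame;
    // add how far playback has progressed into that fragment.
    const offsetMs = (video.currentTime - frag.start) * 1000;
    const wallClock = new Date(frag.programDateTime + offsetMs);
    console.log("Approximate wall-clock time:", wallClock.toISOString());
  });
}
```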

3. Comparison Summary

Feature            WebRTC                            HLS
Mechanism          RTCP Sender Reports (SR)          #EXT-X-PROGRAM-DATE-TIME
Data Type          Binary packets (UDP)              Text in the manifest (HTTP)
Browser Tool       RTCPeerConnection.getStats()      hls.js playingDate / frag.programDateTime
Update Frequency   Every 1–2 seconds                 Once per segment (e.g., 2–6 s)
Main Challenge     Syncing local clock vs. sender clock   Clock “drift” if segments are long

Which should you use?

  • Use WebRTC if your frontend needs to trigger an action at the exact millisecond something happens (e.g., an AI detecting a face on a security camera).
  • Use HLS if you just need a “Clock” overlay on the video player that shows the approximate time for human reference.


What library is used for receiving RTP and converting it to WebRTC?

In your architecture (Edge device → Web Server → Frontend), the software sitting in the middle is typically called a Media Server or a WebRTC Gateway.

Because browsers cannot receive raw RTP directly (it lacks the encryption and NAT traversal that WebRTC requires), this component acts as a “translator” that takes the raw incoming RTP and wraps it in the security layers (DTLS/SRTP) required by WebRTC.


1. Top Libraries/Servers for RTP-to-WebRTC

MediaMTX (Formerly rtsp-simple-server)

This is currently the most popular “out-of-the-box” solution for your exact use case. It is written in Go and is very lightweight.

  • How it works: It can ingest RTSP/RTP from your edge device and automatically provide a WebRTC mount point for your frontend.
  • Best for: Quick setups and edge computing where you don’t want to write much code.

Janus WebRTC Server

A highly modular, C-based server designed by Meetecho.

  • How it works: It has a specific Streaming Plugin. You send your RTP packets to a specific port on the Janus server, and it handles the WebRTC “handshake” (SDP/ICE) with the frontend.
  • Best for: High-performance, scalable systems where you need a rock-solid, battle-tested gateway.

GStreamer

If you are building a custom application in C, C++, or Python, GStreamer is the “Swiss Army Knife” of media.

  • How it works: You build a “pipeline.” One end is rtpbin (to receive RTP) and the other is webrtcbin (to send it to the browser).
  • Best for: Deep technical control, transcoding, or adding AI filters (like face detection) in the middle of the stream.

Mediasoup / Pion

These are more “developer-centric” libraries rather than finished servers.

  • Pion (Go): The leading library if you are writing your web server in Go. It gives you total control over the RTP packets.
  • Mediasoup (Node.js/C++): Extremely powerful for routing media. It is often used in professional conferencing tools.

2. How the “Conversion” Works Internally

The library doesn’t usually change the video data itself (unless you ask it to transcode). Instead, it performs a Header and Security transformation:

  1. RTP Ingest: The library listens on a UDP port for your Edge device’s RTP packets.
  2. Clock Sync: It listens for RTCP Sender Reports to maintain the NTP-to-RTP timestamp mapping.
  3. DTLS/SRTP Encryption: It takes the payload (the H.264/H.265 data) and encrypts it using the keys negotiated with the browser.
  4. Signaling: It generates the SDP (Session Description) that your frontend needs to connect.

Library Comparison

Library     Language        Complexity   Best Use Case
MediaMTX    Go              Low          Quick “RTSP to WebRTC” bridge; minimal coding required.
Janus       C               Medium       General-purpose WebRTC gateway with a plugin architecture.
GStreamer   C / Python      High         Complex pipelines, hardware acceleration, and edge-side processing.
Pion        Go              High         Building custom, high-performance media servers in Go.
Mediasoup   Node.js / C++   High         Massive-scale multi-party conferencing (SFU).