This document outlines the fundamental concepts and architectural components of modern video streaming applications. It covers key terminology, definitions, and the typical five-step pipeline involved in delivering video content.
Key Terminology and Definitions
To understand the architecture of modern streaming, it is essential to define the core technologies, grouped by their functional domains.
1. Media Compression & Codecs
- Codec: A device or program (COmpressor-DECompressor) that shrinks large video files for transmission and expands them for viewing.
- H.265 (HEVC): High Efficiency Video Coding. The successor to H.264, offering 25% to 50% better data compression at the same quality level.
- H.264 (AVC): Advanced Video Coding. The industry standard for over a decade, known for universal compatibility but lower efficiency compared to HEVC.
- CTU (Coding Tree Unit): The basic processing unit of HEVC, which can be as large as 64×64 pixels, allowing for more efficient processing of high-resolution video than H.264’s 16×16 macroblocks.
- AAC AudioSpecificConfig: A global header for MPEG-4 Audio that contains essential information for an AAC decoder, such as the Audio Object Type (AOT), sampling rate, and channel configuration. It is typically generated by the encoder and used to initialize the decoder.
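For reference, the common two-byte AudioSpecificConfig can be unpacked with a few bit operations. The sketch below is my own illustration (not taken from any library) and only covers the typical AAC-LC case without escape values:

```ts
// Sketch: decode the first two bytes of an AAC AudioSpecificConfig.
// Covers the common AAC-LC case (no escape values); illustration only.
const SAMPLE_RATES = [96000, 88200, 64000, 48000, 44100, 32000,
                      24000, 22050, 16000, 12000, 11025, 8000, 7350];

function parseAudioSpecificConfig(bytes: Uint8Array) {
  // Layout: 5 bits Audio Object Type | 4 bits sampling frequency index | 4 bits channel configuration
  const audioObjectType = bytes[0] >> 3;                              // e.g., 2 = AAC-LC
  const samplingIndex   = ((bytes[0] & 0x07) << 1) | (bytes[1] >> 7); // index into SAMPLE_RATES
  const channelConfig   = (bytes[1] >> 3) & 0x0f;                     // e.g., 2 = stereo
  return {
    audioObjectType,
    sampleRate: SAMPLE_RATES[samplingIndex],
    channelConfig,
  };
}

// Example: 0x12 0x10 decodes to AAC-LC, 44100 Hz, stereo.
console.log(parseAudioSpecificConfig(new Uint8Array([0x12, 0x10])));
```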
2. Ingest Protocols
- SRT (Secure Reliable Transport): A UDP-based protocol designed for reliable video ingest over unstable networks, utilizing error correction (ARQ/FEC).
- RTP (Real-time Transport Protocol): A protocol for delivering audio and video over IP networks. It carries the actual media data and includes sequence numbers and timestamps to help the receiver reassemble the stream correctly.
- RTSP (Real Time Streaming Protocol): A network control protocol used to manage media sessions. It allows clients to control the media server with commands like play, pause, and stop, acting as a “remote control” for the stream.
- RTCP (Real-time Transport Control Protocol): A companion protocol to RTP that provides feedback on the quality of data delivery (QoS) and helps synchronize different media streams (e.g., audio and video).
- Relationship between RTP/RTSP/RTCP:
  - Standard: RTSP (TCP) + RTP (UDP) + RTCP (UDP) = 3 connections.
  - Interleaved: Everything is squeezed into a single RTSP (TCP) connection.
  - Trade-off: Interleaving makes the video much more likely to stutter, because TCP will pause the stream to "fix" tiny errors that UDP would have just ignored.
- RTMP (Real-Time Messaging Protocol): Originally developed by Macromedia (now Adobe), this is a TCP-based protocol used for streaming audio, video, and data. It remains the industry standard for "first-mile" ingest, where an encoder sends a live stream to a media server or CDN (e.g., YouTube Live or Twitch).
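The fields that matter most in the discussion above (and in the jitter-buffer section later) live in the fixed 12-byte RTP header. A minimal parsing sketch of the RFC 3550 layout, provided purely as an illustration:

```ts
// Sketch: parse the fixed 12-byte RTP header (RFC 3550). Illustration only.
interface RtpHeader {
  version: number;        // should be 2
  payloadType: number;    // identifies the codec (e.g., a dynamic PT for H.264/H.265)
  sequenceNumber: number; // 16-bit counter, used to detect loss and reordering
  timestamp: number;      // 32-bit media clock (90 kHz for video)
  ssrc: number;           // identifies the stream source
}

function parseRtpHeader(packet: Uint8Array): RtpHeader {
  const view = new DataView(packet.buffer, packet.byteOffset, packet.byteLength);
  return {
    version: packet[0] >> 6,
    payloadType: packet[1] & 0x7f,
    sequenceNumber: view.getUint16(2),
    timestamp: view.getUint32(4),
    ssrc: view.getUint32(8),
  };
}
```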
3. Delivery Protocols
- HLS (HTTP Live Streaming): An adaptive bitrate streaming protocol developed by Apple that serves video via standard HTTP infrastructure.
- LL-HLS (Low-Latency HLS): An extension of HLS that reduces delay from 30 seconds to the 2–6 second range using partial segments and preload hints.
- DASH (Dynamic Adaptive Streaming over HTTP): An international standard for adaptive bitrate streaming that allows high-quality streaming of media content over the internet. It works by breaking content into a sequence of small HTTP-based file segments, each containing a small chunk of playback time.
- LL-DASH (Low-Latency DASH): An extension of the DASH standard designed to reduce end-to-end latency in live streaming. It achieves lower latency by using smaller segment durations, chunked transfer encoding, and HTTP/2 Push to deliver media fragments as soon as they are available.
- WebRTC (Web Real-Time Communication): A standard for sub-500ms real-time communication, typically used for peer-to-peer video conferencing.
- WHIP/WHEP: The WebRTC HTTP Ingestion and Egress Protocols, which standardize how media is pushed to and pulled from WebRTC servers via HTTP signaling.
4. Packaging & Standards
- CMAF (Common Media Application Format): A standard that allows a single fragmented MP4 file to be compatible with both HLS and DASH players.
- Container Format: A file format that specifies how data (video, audio, subtitles, metadata) is stored together in a single file. It doesn’t compress the data itself but organizes it for playback. Examples include MP4, WebM, and MOV.
- Manifest (M3U8): A text-based playlist used by HLS that indexes video segments and quality variants.
- MPEG-TS (MPEG Transport Stream): A standard digital container format for transmission and storage of audio, video, and Program Specific Information (PSI) data. It is commonly used in broadcast systems like DVB and ATSC, and was historically the primary container format for HLS.
- MP4 (MPEG-4 Part 14): A widely used container format for storing video, audio, and other data. It is based on Apple’s QuickTime File Format and is highly compatible across devices and platforms.
- fMP4 (Fragmented MP4): An MP4 format that breaks a file into independent segments, making it suitable for live streaming and low-latency delivery.
- WebM: An open, royalty-free media file format designed for the web. It typically uses VP8 or VP9 video codecs and Vorbis or Opus audio codecs.
- MOV (QuickTime File Format): A proprietary container file format developed by Apple, primarily used for QuickTime multimedia framework. It can contain multiple tracks of video, audio, and text.
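All of the MP4-family formats above (MP4, fMP4, MOV, CMAF) share the same "box"/atom structure: a 4-byte size followed by a 4-byte type. A small sketch that lists a buffer's top-level boxes makes the difference visible: a progressive MP4 shows ftyp/moov/mdat, while an fMP4 segment shows styp/moof/mdat fragments. Illustration only, assuming the whole file fits in memory:

```ts
// Sketch: list the top-level boxes (atoms) of an MP4/fMP4/MOV buffer.
// A progressive MP4 typically shows ftyp, moov, mdat; a CMAF/fMP4 segment
// shows styp/moof/mdat fragments instead. Illustration only.
function listTopLevelBoxes(data: Uint8Array): { type: string; size: number }[] {
  const view = new DataView(data.buffer, data.byteOffset, data.byteLength);
  const boxes: { type: string; size: number }[] = [];
  let offset = 0;
  while (offset + 8 <= data.byteLength) {
    let size = view.getUint32(offset); // 32-bit box size (big-endian)
    const type = String.fromCharCode(...data.subarray(offset + 4, offset + 8));
    if (size === 1) size = Number(view.getBigUint64(offset + 8)); // 64-bit "largesize"
    if (size === 0) size = data.byteLength - offset;              // box extends to end of file
    if (size < 8) break;                                          // malformed box, stop
    boxes.push({ type, size });
    offset += size;
  }
  return boxes;
}
```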
5. Web API
- MSE (Media Source Extensions): A Web API that allows JavaScript to construct media streams for <audio> and <video> elements, giving web applications fine-grained control over media data for adaptive streaming (see the sketch below).
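A minimal MSE sketch, assuming a single fMP4 rendition whose segment URLs are known up front (the URLs and codec string are placeholders). Adaptive players such as hls.js do essentially this, plus manifest parsing and bitrate switching:

```ts
// Sketch: feed fMP4 segments into a <video> element through Media Source Extensions.
// URLs and the codec string are hypothetical placeholders.
const video = document.querySelector("video")!;
const mime = 'video/mp4; codecs="avc1.42E01E, mp4a.40.2"'; // H.264 Baseline + AAC-LC

const mediaSource = new MediaSource();
video.src = URL.createObjectURL(mediaSource);

mediaSource.addEventListener("sourceopen", async () => {
  const sourceBuffer = mediaSource.addSourceBuffer(mime);
  const segments = ["init.mp4", "seg1.m4s", "seg2.m4s"]; // hypothetical segment list

  for (const url of segments) {
    const data = await (await fetch(url)).arrayBuffer();
    sourceBuffer.appendBuffer(data);
    // appendBuffer is asynchronous: wait for updateend before appending the next segment.
    await new Promise((resolve) =>
      sourceBuffer.addEventListener("updateend", resolve, { once: true })
    );
  }
  mediaSource.endOfStream();
});
```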
General Video Streaming Flow
The video streaming process typically follows a predictable five-step pipeline.
1. Capture & Encoding (The Origin): A camera captures raw frames. An encoder (e.g., FFmpeg or NVIDIA NVENC) compresses this raw data using a codec such as H.265 to reduce its size.
2. Contribution/Ingest (The First Mile): The compressed video is sent from the edge device to a central server. Technologies used: SRT (for reliability over the public internet) or RTP/UDP (for lowest overhead in managed networks).
3. Processing & Transcoding (The Core): The server receives the stream. It may change the resolution (transcoding) or simply re-package it. Technologies used: GStreamer or FFmpeg.
4. Packaging & Distribution (The Delivery): For HLS, the video is split into small fragments (fMP4) and indexed in a manifest (M3U8). An HTTP server (e.g., Ktor or a CDN) hosts these files.
5. Playback & Consumption (The Last Mile): A web browser fetches the playlist and segments. A library like hls.js feeds the data into the browser's hardware decoder via Media Source Extensions (MSE), as sketched below.
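A minimal sketch of step 5 using hls.js (the stream URL is a placeholder). hls.js uses MSE under the hood; Safari falls back to native HLS playback:

```ts
import Hls from "hls.js";

// Sketch: play an HLS stream in the browser. The URL is a placeholder.
const video = document.querySelector("video")!;
const src = "https://example.com/live/stream.m3u8";

if (Hls.isSupported()) {
  // MSE path: hls.js downloads the playlist/segments and appends them via SourceBuffer.
  const hls = new Hls();
  hls.loadSource(src);
  hls.attachMedia(video);
  hls.on(Hls.Events.MANIFEST_PARSED, () => video.play());
} else if (video.canPlayType("application/vnd.apple.mpegurl")) {
  // Safari/iOS: native HLS playback, no MSE needed.
  video.src = src;
  video.play();
}
```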
The HLS Protocol and a Clarification on Container Formats
A common misconception in video engineering is that the difference between HLS and LL-HLS is the use of Fragmented MP4 (fMP4). This is technically incorrect.
fMP4 in Standard HLS
Standard HLS has supported fMP4 segments since version 7 (introduced in 2016). Before this, HLS exclusively used MPEG-2 Transport Streams (.ts). The primary advantage of fMP4 in standard HLS is its compatibility with the CMAF standard, allowing a single set of video files to serve both HLS and DASH clients. Most browser engines (via MSE) prefer fMP4 containers for HEVC over legacy TS because fMP4 offers better compatibility with modern streaming standards like CMAF and DASH, and is often a requirement for HEVC/H.265 playback on certain platforms like macOS and iOS.
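One way to observe this is to ask MSE directly which container/codec combinations it accepts. The codec strings below are typical example values, and the results vary by browser, OS, and installed decoders:

```ts
// Sketch: probe MSE container/codec support. Codec strings are typical examples;
// actual support depends on the browser, OS, and available hardware decoders.
const probes = [
  'video/mp4; codecs="avc1.42E01E"',      // H.264 in fMP4: near-universal
  'video/mp4; codecs="hvc1.1.6.L93.B0"',  // H.265/HEVC in fMP4: platform-dependent
  'video/mp2t; codecs="avc1.42E01E"',     // H.264 in MPEG-TS: rejected by most MSE implementations
];

for (const mime of probes) {
  console.log(mime, "->", MediaSource.isTypeSupported(mime));
}
```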
Defining Differences: Standard HLS vs. LL-HLS
The real difference lies in the delivery mechanisms, not the container format. While standard HLS can use fMP4, it still delivers complete segments (e.g., 2–6 seconds long), whereas LL-HLS introduces several specific technical features to reduce latency:
- Partial Segments (Parts): LL-HLS divides segments into tiny “parts” (e.g., 200ms). These are advertised in the playlist and can be downloaded as soon as they are ready, long before the full parent segment is complete.
- Preload Hints: The server informs the player of the URL of the next expected partial segment in advance, allowing the player to issue a request immediately when data becomes available.
- Blocking Playlist Reloads: Instead of constant polling, the server "holds" a playlist request until new data arrives, eliminating unnecessary network round trips.
- Playlist Delta Updates: To reduce overhead, the server can send only the changed portions of a playlist rather than the entire file.
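On the player side these mechanisms are negotiated automatically. In hls.js, for example, low-latency handling is governed by the lowLatencyMode config flag (enabled by default in recent 1.x versions); the values below are assumptions to tune, not requirements:

```ts
import Hls from "hls.js";

// Sketch: explicitly opt in/out of LL-HLS handling in hls.js (v1.x).
// When lowLatencyMode is true and the playlist advertises parts/preload hints,
// hls.js requests partial segments and uses blocking playlist reloads.
const hls = new Hls({
  lowLatencyMode: true,   // set to false to treat an LL-HLS stream as regular HLS
  liveSyncDuration: 1.5,  // target live-edge distance in seconds (tune to your part duration)
});
hls.loadSource("https://example.com/llhls/stream.m3u8"); // placeholder URL
hls.attachMedia(document.querySelector("video")!);
```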
How to check which codecs are supported by Chrome
chrome://gpu/

H.265 is not displayed although the GPU supports it
It is highly likely that your Intel Arc 140V does have the hardware capability, but it is being “hidden” or “gated” by your laptop manufacturer (OEM) or a missing Windows component.
There is a known industry issue where manufacturers like Dell and HP have recently begun disabling H.265 hardware support in the BIOS/ACPI tables to avoid paying patent royalty fees on every laptop sold.
To resolve this issue, the user needs to install the "HEVC Video Extensions" package from the Microsoft Store.
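Independently of chrome://gpu, the Media Capabilities API reports whether the browser can decode HEVC and whether it will do so in hardware (powerEfficient). The codec string and parameters below are typical example values:

```ts
// Sketch: ask the browser whether it can decode HEVC, and whether in hardware.
// The codec string and parameters are typical example values.
async function checkHevcSupport(): Promise<void> {
  const result = await navigator.mediaCapabilities.decodingInfo({
    type: "media-source",
    video: {
      contentType: 'video/mp4; codecs="hvc1.1.6.L120.90"',
      width: 1920,
      height: 1080,
      bitrate: 8_000_000, // bits per second
      framerate: 30,
    },
  });
  // powerEfficient === true usually means a hardware decoder will be used.
  console.log(result.supported, result.smooth, result.powerEfficient);
}
void checkHevcSupport();
```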
PTS vs. DTS
To understand PTS, you also need to know its partner, DTS (Decoding Time Stamp). They are often different because of how modern video compression (like H.264 or H.265) works:
DTS (Decoding Order): Tells the computer when to process the data.
PTS (Presentation Order): Tells the screen when to show the frame.
Because certain frames (B-frames) need information from “future” frames to be decoded, the computer might decode Frame 4 before it can show Frame 2.
DTS order: 1, 4, 2, 3
PTS order: 1, 2, 3, 4 (The smooth 1-2-3-4 sequence you actually see)
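The reorder step a player performs can be shown with a tiny sketch: frames leave the decoder in DTS order and are simply emitted sorted by PTS. Frame numbers follow the 1-4-2-3 example above; this is an illustration, not any particular player's code:

```ts
// Sketch: frames come out of the decoder in DTS order; presentation sorts by PTS.
interface Frame { dts: number; pts: number; label: string; }

// Decode order 1, 4, 2, 3 from the example above (timestamps in 90 kHz ticks, 3000 per frame).
const decoded: Frame[] = [
  { dts: 0,    pts: 0,    label: "Frame 1 (I)" },
  { dts: 3000, pts: 9000, label: "Frame 4 (P)" }, // decoded early because the B-frames reference it
  { dts: 6000, pts: 3000, label: "Frame 2 (B)" },
  { dts: 9000, pts: 6000, label: "Frame 3 (B)" },
];

// Presentation order: the smooth 1-2-3-4 sequence the viewer actually sees.
const presented = [...decoded].sort((a, b) => a.pts - b.pts);
console.log(presented.map((f) => f.label)); // ["Frame 1 (I)", "Frame 2 (B)", "Frame 3 (B)", "Frame 4 (P)"]
```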
RTP packet jitter
If Packet A is sent at 1.0s and Packet B at 1.1s, but Packet B arrives at the receiver 0.01s before Packet A due to a network hiccup, do-timestamp=true would stamp Packet B with an earlier time than Packet A. This would cause the “back and forth” jitter that makes video stutter or crash muxers.
To utilize the correct order stored in the RTP header, we have to change how GStreamer handles the incoming data.
The Solution on GStreamer: rtpjitterbuffer
Instead of just grabbing the packet and stamping it with the “wall clock” immediately, we need a “waiting room” that looks at the RTP sequence numbers and timestamps.
- RTP Sequence Number: Tells GStreamer the exact order (1, 2, 3…).
- RTP Timestamp: Tells GStreamer the intended spacing between frames.
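Conceptually, a jitter buffer is a short waiting room that reorders by sequence number and releases packets only after a fixed latency window. The sketch below illustrates the idea only; it is not how GStreamer's rtpjitterbuffer is implemented:

```ts
// Conceptual sketch of a jitter buffer (not GStreamer's actual implementation):
// hold packets for `latencyMs`, reorder by RTP sequence number, release in order.
interface RtpPacket { seq: number; arrivalMs: number; payload: Uint8Array; }

class TinyJitterBuffer {
  private held: RtpPacket[] = [];
  constructor(private latencyMs: number) {}

  push(packet: RtpPacket): void {
    this.held.push(packet);
    this.held.sort((a, b) => a.seq - b.seq); // RTP sequence number restores sender order
  }

  // Release every packet that has waited at least `latencyMs`, in sequence order.
  pop(nowMs: number): RtpPacket[] {
    const ready = this.held.filter((p) => nowMs - p.arrivalMs >= this.latencyMs);
    this.held = this.held.filter((p) => nowMs - p.arrivalMs < this.latencyMs);
    return ready;
  }
}
```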
RTP packet drop
Even with a jitter buffer, sometimes a packet is lost forever (it’s UDP, after all). If the packet carrying Frame 5 never arrives, there is a “hole” in the timeline. mp4mux hates holes.
videorate sees the hole and says: “Frame 5 is missing, but I need to keep 30fps for this MP4 file. I will just duplicate Frame 4 and give it the timestamp Frame 5 should have had.”
GStreamer tips
- Added rtpjitterbuffer: This element was inserted after udpsrc to handle network jitter and packet reordering, which helps in constructing valid timestamps from RTP packets. I set latency=200 to provide a buffer against network fluctuations.
- Added videorate: This element was inserted before the encoder (x264enc) along with a caps filter video/x-raw,framerate=30/1. This forces a constant frame rate and regenerates timestamps for the raw video frames, ensuring that the encoder receives a stream with perfect, monotonic timestamps. This effectively sanitizes the stream and prevents the “Buffer has no PTS” error in the downstream mp4mux element.
Wall clock display
To accurately display the “wall-clock” time (NTP) for a specific frame, you have to bridge the gap between Media Time (RTP/Segments) and Real Time (NTP).
The method changes depending on whether you are using a packet-based protocol (WebRTC) or a file-based protocol (HLS).
1. WebRTC / RTP (Packet-Level Precision)
In WebRTC, the timestamp is calculated dynamically by the client. It is the most precise method but requires the most math.
- The Source: The Edge device sends RTP Packets (video) and RTCP Sender Reports (the “Clock Map”).
- The Map: The RTCP report explicitly says: “RTP timestamp 90,000 = Friday, 10:00:00 AM UTC.”
- Client Calculation: The browser uses the getStats() API to find the estimatedPlayoutTimestamp. This value represents the exact NTP time the current frame was captured, adjusted for the network delay and jitter buffer (see the sketch below).
- Accuracy: Frame-accurate (millisecond precision).
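A browser-side sketch of this lookup, assuming the RTCPeerConnection already exists and that the browser exposes estimatedPlayoutTimestamp (it is defined in the WebRTC statistics spec but not implemented everywhere, and is specified on the NTP epoch):

```ts
// Sketch: read the estimated NTP playout time of the video frame currently on screen.
// Assumes `pc` is an RTCPeerConnection receiving video; estimatedPlayoutTimestamp
// is an optional stat and may be absent in some browsers.
async function currentFrameWallClock(pc: RTCPeerConnection): Promise<Date | null> {
  const NTP_TO_UNIX_MS = 2_208_988_800_000; // offset between the 1900 NTP epoch and the 1970 Unix epoch
  const stats = await pc.getStats();
  for (const report of stats.values()) {
    if (report.type === "inbound-rtp" && report.kind === "video") {
      const ntpMs: number | undefined = report.estimatedPlayoutTimestamp;
      if (typeof ntpMs === "number") {
        // Assumption: the value follows the spec's NTP-epoch convention.
        return new Date(ntpMs - NTP_TO_UNIX_MS);
      }
    }
  }
  return null; // Stat not exposed by this browser.
}
```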
2. HLS (Manifest-Level Precision)
HLS doesn’t have a continuous “clock map” like RTCP. Instead, it embeds time metadata into the playlist or the video stream itself.
- The Source: The server converts the RTP stream into segments and writes the time into the .m3u8 manifest.
- The Tag: #EXT-X-PROGRAM-DATE-TIME. This tag associates the first frame of a segment with an absolute UTC time.
- Client Calculation: The player (like hls.js) reads this tag. To find the current time, it takes the programDateTime of the segment and adds the current playback offset (e.g., if you are 2 seconds into a 6-second segment). See the sketch below.
- Accuracy: High, but dependent on how frequently the server writes the tag.
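A matching hls.js sketch (assuming hls.js v1.x): track the programDateTime and start time of the fragment currently playing via the FRAG_CHANGED event, then add the playback offset within that fragment. Property names are as exposed by hls.js v1 and may differ in other versions:

```ts
import Hls from "hls.js";

// Sketch: derive wall-clock time from #EXT-X-PROGRAM-DATE-TIME, assuming hls.js v1.x.
const video = document.querySelector("video")!;
const hls = new Hls();
hls.loadSource("https://example.com/live/stream.m3u8"); // placeholder URL
hls.attachMedia(video);

let fragProgramDateTime: number | null = null; // ms since the Unix epoch, from the playlist tag
let fragStart = 0;                             // media time (seconds) at which the fragment begins

hls.on(Hls.Events.FRAG_CHANGED, (_event, data) => {
  fragProgramDateTime = data.frag.programDateTime ?? null;
  fragStart = data.frag.start;
});

function currentWallClock(): Date | null {
  if (fragProgramDateTime === null) return null; // playlist has no PROGRAM-DATE-TIME tag
  const offsetSec = video.currentTime - fragStart; // how far we are into this fragment
  return new Date(fragProgramDateTime + offsetSec * 1000);
}
```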
3. Comparison Summary
| Feature | WebRTC | HLS |
|---|---|---|
| Mechanism | RTCP Sender Reports (SR) | #EXT-X-PROGRAM-DATE-TIME |
| Data Type | Binary packets (UDP) | Text in Manifest (HTTP) |
| Browser Tool | RTCPeerConnection.getStats() | hls.js fragment programDateTime |
| Update Frequency | Every 1–2 seconds | Once per segment (e.g., 2–6s) |
| Main Challenge | Syncing local clock vs. sender clock | Clock “drift” if segments are long |
Which should you use?
- Use WebRTC if your frontend needs to trigger an action at the exact millisecond something happens (e.g., an AI detecting a face on a security camera).
- Use HLS if you just need a “Clock” overlay on the video player that shows the approximate time for human reference.
What library is used for receiving RTP and converting it to WebRTC?
In your architecture (Edge device → Web Server → Frontend), the software sitting in the middle is typically called a Media Server or a WebRTC Gateway.
Because browsers cannot receive raw RTP directly (due to the lack of encryption and NAT traversal), this library acts as a “translator” that takes the raw incoming RTP and wraps it in the security layers (DTLS/SRTP) required for WebRTC.
1. Top Libraries/Servers for RTP-to-WebRTC
MediaMTX (Formerly rtsp-simple-server)
This is currently the most popular “out-of-the-box” solution for your exact use case. It is written in Go and is very lightweight.
- How it works: It can ingest RTSP/RTP from your edge device and automatically provide a WebRTC mount point for your frontend.
- Best for: Quick setups and edge computing where you don’t want to write much code.
Janus WebRTC Server
A highly modular, C-based server designed by Meetecho.
- How it works: It has a specific Streaming Plugin. You send your RTP packets to a specific port on the Janus server, and it handles the WebRTC “handshake” (SDP/ICE) with the frontend.
- Best for: High-performance, scalable systems where you need a rock-solid, battle-tested gateway.
GStreamer
If you are building a custom application in C, C++, or Python, GStreamer is the “Swiss Army Knife” of media.
- How it works: You build a “pipeline.” One end is rtpbin (to receive RTP) and the other is webrtcbin (to send it to the browser).
- Best for: Deep technical control, transcoding, or adding AI filters (like face detection) in the middle of the stream.
Mediasoup / Pion
These are more “developer-centric” libraries rather than finished servers.
- Pion (Go): The leading library if you are writing your web server in Go. It gives you total control over the RTP packets.
- Mediasoup (Node.js/C++): Extremely powerful for routing media. It is often used in professional conferencing tools.
2. How the “Conversion” Works Internally
The library doesn’t usually change the video data itself (unless you ask it to transcode). Instead, it performs a Header and Security transformation:
- RTP Ingest: The library listens on a UDP port for your Edge device’s RTP packets.
- Clock Sync: It listens for RTCP Sender Reports to maintain the NTP-to-RTP timestamp mapping.
- DTLS/SRTP Encryption: It takes the payload (the H.264/H.265 data) and encrypts it using the keys negotiated with the browser.
- Signaling: It generates the SDP (Session Description) that your frontend needs to connect; a minimal browser-side example follows below.
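On the frontend, the last two steps look like a WHEP exchange: the browser sends an SDP offer over HTTP and applies the SDP answer. The endpoint URL below is hypothetical; gateways such as MediaMTX expose WHEP endpoints like this out of the box:

```ts
// Sketch of a WHEP (WebRTC-HTTP Egress Protocol) client: pull a stream from a
// gateway over plain HTTP signaling. The endpoint URL is hypothetical; adapt it
// to whatever your media server (e.g., MediaMTX) actually exposes.
async function playWhep(endpoint: string, video: HTMLVideoElement): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();
  // Receive-only: we only want the gateway's audio/video, not to send our own.
  pc.addTransceiver("video", { direction: "recvonly" });
  pc.addTransceiver("audio", { direction: "recvonly" });
  pc.ontrack = (e) => { video.srcObject = e.streams[0]; };

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  // WHEP signaling: POST the SDP offer, receive the SDP answer in the response body.
  const res = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/sdp" },
    body: offer.sdp,
  });
  const answerSdp = await res.text();
  await pc.setRemoteDescription({ type: "answer", sdp: answerSdp });
  return pc;
}

// Usage (hypothetical URL):
// playWhep("https://gateway.example.com/live/cam1/whep", document.querySelector("video")!);
```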
Library Comparison
| Library | Language | Complexity | Best Use Case |
|---|---|---|---|
| MediaMTX | Go | Low | Quick “RTSP to WebRTC” bridge; minimal coding required. |
| Janus | C | Medium | General-purpose WebRTC gateway with a plugin architecture. |
| GStreamer | C / Python | High | Complex pipelines, hardware acceleration, and edge-side processing. |
| Pion | Go | High | Building custom, high-performance media servers in Go. |
| Mediasoup | Node.js / C++ | High | Massive scale multi-party conferencing (SFU). |