This document outlines the fundamental concepts and architectural components of modern video streaming applications. It covers key terminology, definitions, and the typical five-step pipeline involved in delivering video content.
Key Terminology and Definitions
To understand the architecture of modern streaming, it is essential to define the core technologies, grouped by their functional domains.
1. Media Compression & Codecs
- Codec: A device or program (COmpressor-DECompressor) that shrinks large video files for transmission and expands them for viewing.
- H.265 (HEVC): High Efficiency Video Coding. The successor to H.264, offering 25% to 50% better data compression at the same quality level.
- H.264 (AVC): Advanced Video Coding. The industry standard for over a decade, known for universal compatibility but lower efficiency compared to HEVC.
- CTU (Coding Tree Unit): The basic processing unit of HEVC, which can be as large as 64×64 pixels, allowing for more efficient processing of high-resolution video than H.264’s 16×16 macroblocks.
- AAC AudioSpecificConfig: A global header for MPEG-4 Audio that contains essential information for an AAC decoder, such as the Audio Object Type (AOT), sampling rate, and channel configuration. It is typically generated by the encoder and used to initialize the decoder.
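For reference, the common two-byte AudioSpecificConfig can be unpacked with a few bit operations. The sketch below is my own illustration (not taken from any library) and only covers the typical AAC-LC case without escape values:

```ts
// Sketch: decode the first two bytes of an AAC AudioSpecificConfig.
// Covers the common AAC-LC case (no escape values); illustration only.
const SAMPLE_RATES = [96000, 88200, 64000, 48000, 44100, 32000,
                      24000, 22050, 16000, 12000, 11025, 8000, 7350];

function parseAudioSpecificConfig(bytes: Uint8Array) {
  // Layout: 5 bits Audio Object Type | 4 bits sampling frequency index | 4 bits channel configuration
  const audioObjectType = bytes[0] >> 3;                              // e.g., 2 = AAC-LC
  const samplingIndex   = ((bytes[0] & 0x07) << 1) | (bytes[1] >> 7); // index into SAMPLE_RATES
  const channelConfig   = (bytes[1] >> 3) & 0x0f;                     // e.g., 2 = stereo
  return {
    audioObjectType,
    sampleRate: SAMPLE_RATES[samplingIndex],
    channelConfig,
  };
}

// Example: 0x12 0x10 decodes to AAC-LC, 44100 Hz, stereo.
console.log(parseAudioSpecificConfig(new Uint8Array([0x12, 0x10])));
```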
2. Ingest Protocols
- SRT (Secure Reliable Transport): A UDP-based protocol designed for reliable video ingest over unstable networks, utilizing error correction (ARQ/FEC).
- RTP (Real-time Transport Protocol): A protocol for delivering audio and video over IP networks. It carries the actual media data and includes sequence numbers and timestamps to help the receiver reassemble the stream correctly.
- RTSP (Real Time Streaming Protocol): A network control protocol used to manage media sessions. It allows clients to control the media server with commands like play, pause, and stop, acting as a “remote control” for the stream.
- RTCP (Real-time Transport Control Protocol): A companion protocol to RTP that provides feedback on the quality of data delivery (QoS) and helps synchronize different media streams (e.g., audio and video).
- Relationship between RTP/RTSP/RTCP:
  - Standard: RTSP (TCP) + RTP (UDP) + RTCP (UDP) = 3 connections.
  - Interleaved: Everything is squeezed into a single RTSP (TCP) connection.
  - Trade-off: Interleaving makes the video much more likely to stutter, because TCP will pause the stream to "fix" tiny errors that UDP would have just ignored.
- RTMP (Real-Time Messaging Protocol): Originally developed by Macromedia (now Adobe), this is a TCP-based protocol used for streaming audio, video, and data. It remains the industry standard for "first-mile" ingest, where an encoder sends a live stream to a media server or CDN (e.g., YouTube Live or Twitch).
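The fields that matter most in the discussion above (and in the jitter-buffer section later) live in the fixed 12-byte RTP header. A minimal parsing sketch of the RFC 3550 layout, provided purely as an illustration:

```ts
// Sketch: parse the fixed 12-byte RTP header (RFC 3550). Illustration only.
interface RtpHeader {
  version: number;        // should be 2
  payloadType: number;    // identifies the codec (e.g., a dynamic PT for H.264/H.265)
  sequenceNumber: number; // 16-bit counter, used to detect loss and reordering
  timestamp: number;      // 32-bit media clock (90 kHz for video)
  ssrc: number;           // identifies the stream source
}

function parseRtpHeader(packet: Uint8Array): RtpHeader {
  const view = new DataView(packet.buffer, packet.byteOffset, packet.byteLength);
  return {
    version: packet[0] >> 6,
    payloadType: packet[1] & 0x7f,
    sequenceNumber: view.getUint16(2),
    timestamp: view.getUint32(4),
    ssrc: view.getUint32(8),
  };
}
```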
3. Delivery Protocols
- HLS (HTTP Live Streaming): An adaptive bitrate streaming protocol developed by Apple that serves video via standard HTTP infrastructure.
- LL-HLS (Low-Latency HLS): An extension of HLS that reduces delay from 30 seconds to the 2–6 second range using partial segments and preload hints.
- DASH (Dynamic Adaptive Streaming over HTTP): An international standard for adaptive bitrate streaming that allows high-quality streaming of media content over the internet. It works by breaking content into a sequence of small HTTP-based file segments, each containing a small chunk of playback time.
- LL-DASH (Low-Latency DASH): An extension of the DASH standard designed to reduce end-to-end latency in live streaming. It achieves lower latency by using smaller segment durations, chunked transfer encoding, and HTTP/2 Push to deliver media fragments as soon as they are available.
- WebRTC (Web Real-Time Communication): A standard for sub-500ms real-time communication, typically used for peer-to-peer video conferencing.
- WHIP/WHEP: The WebRTC HTTP Ingestion and Egress Protocols, which standardize how media is pushed to and pulled from WebRTC servers via HTTP signaling.
4. Packaging & Standards
- CMAF (Common Media Application Format): A standard that allows a single fragmented MP4 file to be compatible with both HLS and DASH players.
- Container Format: A file format that specifies how data (video, audio, subtitles, metadata) is stored together in a single file. It doesn’t compress the data itself but organizes it for playback. Examples include MP4, WebM, and MOV.
- Manifest (M3U8): A text-based playlist used by HLS that indexes video segments and quality variants.
- MPEG-TS (MPEG Transport Stream): A standard digital container format for transmission and storage of audio, video, and Program Specific Information (PSI) data. It is commonly used in broadcast systems like DVB and ATSC, and was historically the primary container format for HLS.
- MP4 (MPEG-4 Part 14): A widely used container format for storing video, audio, and other data. It is based on Apple’s QuickTime File Format and is highly compatible across devices and platforms.
- fMP4 (Fragmented MP4): An MP4 format that breaks a file into independent segments, making it suitable for live streaming and low-latency delivery.
- WebM: An open, royalty-free media file format designed for the web. It typically uses VP8 or VP9 video codecs and Vorbis or Opus audio codecs.
- MOV (QuickTime File Format): A proprietary container file format developed by Apple, primarily used for QuickTime multimedia framework. It can contain multiple tracks of video, audio, and text.
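All of the MP4-family formats above (MP4, fMP4, MOV, CMAF) share the same "box"/atom structure: a 4-byte size followed by a 4-byte type. A small sketch that lists a buffer's top-level boxes makes the difference visible: a progressive MP4 shows ftyp/moov/mdat, while an fMP4 segment shows styp/moof/mdat fragments. Illustration only, assuming the whole file fits in memory:

```ts
// Sketch: list the top-level boxes (atoms) of an MP4/fMP4/MOV buffer.
// A progressive MP4 typically shows ftyp, moov, mdat; a CMAF/fMP4 segment
// shows styp/moof/mdat fragments instead. Illustration only.
function listTopLevelBoxes(data: Uint8Array): { type: string; size: number }[] {
  const view = new DataView(data.buffer, data.byteOffset, data.byteLength);
  const boxes: { type: string; size: number }[] = [];
  let offset = 0;
  while (offset + 8 <= data.byteLength) {
    let size = view.getUint32(offset); // 32-bit box size (big-endian)
    const type = String.fromCharCode(...data.subarray(offset + 4, offset + 8));
    if (size === 1) size = Number(view.getBigUint64(offset + 8)); // 64-bit "largesize"
    if (size === 0) size = data.byteLength - offset;              // box extends to end of file
    if (size < 8) break;                                          // malformed box, stop
    boxes.push({ type, size });
    offset += size;
  }
  return boxes;
}
```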
5. Web API
- MSE (Media Source Extensions): A Web API that allows JavaScript to construct media streams for <audio> and <video> elements, giving web applications fine-grained control over media data for adaptive streaming (see the sketch below).
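A minimal MSE sketch, assuming a single fMP4 rendition whose segment URLs are known up front (the URLs and codec string are placeholders). Adaptive players such as hls.js do essentially this, plus manifest parsing and bitrate switching:

```ts
// Sketch: feed fMP4 segments into a <video> element through Media Source Extensions.
// URLs and the codec string are hypothetical placeholders.
const video = document.querySelector("video")!;
const mime = 'video/mp4; codecs="avc1.42E01E, mp4a.40.2"'; // H.264 Baseline + AAC-LC

const mediaSource = new MediaSource();
video.src = URL.createObjectURL(mediaSource);

mediaSource.addEventListener("sourceopen", async () => {
  const sourceBuffer = mediaSource.addSourceBuffer(mime);
  const segments = ["init.mp4", "seg1.m4s", "seg2.m4s"]; // hypothetical segment list

  for (const url of segments) {
    const data = await (await fetch(url)).arrayBuffer();
    sourceBuffer.appendBuffer(data);
    // appendBuffer is asynchronous: wait for updateend before appending the next segment.
    await new Promise((resolve) =>
      sourceBuffer.addEventListener("updateend", resolve, { once: true })
    );
  }
  mediaSource.endOfStream();
});
```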
General Video Streaming Flow
The video streaming process typically follows a predictable five-step pipeline.
1. Capture & Encoding (The Origin): A camera captures raw frames. An encoder (e.g., FFmpeg or NVIDIA NVENC) compresses this raw data using a codec such as H.265 to reduce its size.
2. Contribution/Ingest (The First Mile): The compressed video is sent from the edge device to a central server. Technologies used: SRT (for reliability over the public internet) or RTP/UDP (for lowest overhead in managed networks).
3. Processing & Transcoding (The Core): The server receives the stream. It may change the resolution (transcoding) or simply re-package it. Technologies used: GStreamer or FFmpeg.
4. Packaging & Distribution (The Delivery): For HLS, the video is split into small fragments (fMP4) and indexed in a manifest (M3U8). An HTTP server (e.g., Ktor or a CDN) hosts these files.
5. Playback & Consumption (The Last Mile): A web browser fetches the playlist and segments. A library like hls.js feeds the data into the browser's hardware decoder via Media Source Extensions (MSE), as sketched below.
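A minimal sketch of step 5 using hls.js (the stream URL is a placeholder). hls.js uses MSE under the hood; Safari falls back to native HLS playback:

```ts
import Hls from "hls.js";

// Sketch: play an HLS stream in the browser. The URL is a placeholder.
const video = document.querySelector("video")!;
const src = "https://example.com/live/stream.m3u8";

if (Hls.isSupported()) {
  // MSE path: hls.js downloads the playlist/segments and appends them via SourceBuffer.
  const hls = new Hls();
  hls.loadSource(src);
  hls.attachMedia(video);
  hls.on(Hls.Events.MANIFEST_PARSED, () => video.play());
} else if (video.canPlayType("application/vnd.apple.mpegurl")) {
  // Safari/iOS: native HLS playback, no MSE needed.
  video.src = src;
  video.play();
}
```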
The HLS Protocol and a Clarification on Container Formats
A common misconception in video engineering is that the difference between HLS and LL-HLS is the use of Fragmented MP4 (fMP4). This is technically incorrect.
fMP4 in Standard HLS
Standard HLS has supported fMP4 segments since version 7 (introduced in 2016). Before this, HLS exclusively used MPEG-2 Transport Streams (.ts). The primary advantage of fMP4 in standard HLS is its compatibility with the CMAF standard, allowing a single set of video files to serve both HLS and DASH clients. Most browser engines (via MSE) prefer fMP4 containers for HEVC over legacy TS because fMP4 offers better compatibility with modern streaming standards like CMAF and DASH, and is often a requirement for HEVC/H.265 playback on certain platforms like macOS and iOS.
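One way to observe this is to ask MSE directly which container/codec combinations it accepts. The codec strings below are typical example values, and the results vary by browser, OS, and installed decoders:

```ts
// Sketch: probe MSE container/codec support. Codec strings are typical examples;
// actual support depends on the browser, OS, and available hardware decoders.
const probes = [
  'video/mp4; codecs="avc1.42E01E"',      // H.264 in fMP4: near-universal
  'video/mp4; codecs="hvc1.1.6.L93.B0"',  // H.265/HEVC in fMP4: platform-dependent
  'video/mp2t; codecs="avc1.42E01E"',     // H.264 in MPEG-TS: rejected by most MSE implementations
];

for (const mime of probes) {
  console.log(mime, "->", MediaSource.isTypeSupported(mime));
}
```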
Defining Differences: Standard HLS vs. LL-HLS
The real difference lies in the delivery mechanisms, not the container format. While standard HLS can use fMP4, it still delivers complete segments (e.g., 2–6 seconds long), whereas LL-HLS introduces several specific technical features to reduce latency:
- Partial Segments (Parts): LL-HLS divides segments into tiny “parts” (e.g., 200ms). These are advertised in the playlist and can be downloaded as soon as they are ready, long before the full parent segment is complete.
- Preload Hints: The server informs the player of the URL of the next expected partial segment in advance, allowing the player to issue a request immediately when data becomes available.
- Blocking Playlist Reloads: Instead of constant polling, the server "holds" a playlist request until new data arrives, eliminating unnecessary network round trips.
- Playlist Delta Updates: To reduce overhead, the server can send only the changed portions of a playlist rather than the entire file.
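On the player side these mechanisms are negotiated automatically. In hls.js, for example, low-latency handling is governed by the lowLatencyMode config flag (enabled by default in recent 1.x versions); the values below are assumptions to tune, not requirements:

```ts
import Hls from "hls.js";

// Sketch: explicitly opt in/out of LL-HLS handling in hls.js (v1.x).
// When lowLatencyMode is true and the playlist advertises parts/preload hints,
// hls.js requests partial segments and uses blocking playlist reloads.
const hls = new Hls({
  lowLatencyMode: true,   // set to false to treat an LL-HLS stream as regular HLS
  liveSyncDuration: 1.5,  // target live-edge distance in seconds (tune to your part duration)
});
hls.loadSource("https://example.com/llhls/stream.m3u8"); // placeholder URL
hls.attachMedia(document.querySelector("video")!);
```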
How to check which codecs are supported by Chrome
chrome://gpu/

H.265 is not displayed although the GPU supports it
It is highly likely that your Intel Arc 140V does have the hardware capability, but it is being “hidden” or “gated” by your laptop manufacturer (OEM) or a missing Windows component.
There is a known industry issue where manufacturers like Dell and HP have recently begun disabling H.265 hardware support in the BIOS/ACPI tables to avoid paying patent royalty fees on every laptop sold.
To resolve this issue, the user needs to install the "HEVC Video Extensions" package from the Microsoft Store.
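Independently of chrome://gpu, the Media Capabilities API reports whether the browser can decode HEVC and whether it will do so in hardware (powerEfficient). The codec string and parameters below are typical example values:

```ts
// Sketch: ask the browser whether it can decode HEVC, and whether in hardware.
// The codec string and parameters are typical example values.
async function checkHevcSupport(): Promise<void> {
  const result = await navigator.mediaCapabilities.decodingInfo({
    type: "media-source",
    video: {
      contentType: 'video/mp4; codecs="hvc1.1.6.L120.90"',
      width: 1920,
      height: 1080,
      bitrate: 8_000_000, // bits per second
      framerate: 30,
    },
  });
  // powerEfficient === true usually means a hardware decoder will be used.
  console.log(result.supported, result.smooth, result.powerEfficient);
}
void checkHevcSupport();
```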
PTS vs. DTS
To understand PTS, you also need to know its partner, DTS (Decoding Time Stamp). They are often different because of how modern video compression (like H.264 or H.265) works:
DTS (Decoding Order): Tells the computer when to process the data.
PTS (Presentation Order): Tells the screen when to show the frame.
Because certain frames (B-frames) need information from “future” frames to be decoded, the computer might decode Frame 4 before it can show Frame 2.
DTS order: 1, 4, 2, 3
PTS order: 1, 2, 3, 4 (The smooth 1-2-3-4 sequence you actually see)
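The reorder step a player performs can be shown with a tiny sketch: frames leave the decoder in DTS order and are simply emitted sorted by PTS. Frame numbers follow the 1-4-2-3 example above; this is an illustration, not any particular player's code:

```ts
// Sketch: frames come out of the decoder in DTS order; presentation sorts by PTS.
interface Frame { dts: number; pts: number; label: string; }

// Decode order 1, 4, 2, 3 from the example above (timestamps in 90 kHz ticks, 3000 per frame).
const decoded: Frame[] = [
  { dts: 0,    pts: 0,    label: "Frame 1 (I)" },
  { dts: 3000, pts: 9000, label: "Frame 4 (P)" }, // decoded early because the B-frames reference it
  { dts: 6000, pts: 3000, label: "Frame 2 (B)" },
  { dts: 9000, pts: 6000, label: "Frame 3 (B)" },
];

// Presentation order: the smooth 1-2-3-4 sequence the viewer actually sees.
const presented = [...decoded].sort((a, b) => a.pts - b.pts);
console.log(presented.map((f) => f.label)); // ["Frame 1 (I)", "Frame 2 (B)", "Frame 3 (B)", "Frame 4 (P)"]
```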
RTP packet jitter
If Packet A is sent at 1.0s and Packet B at 1.1s, but Packet B arrives at the receiver 0.01s before Packet A due to a network hiccup, do-timestamp=true would stamp Packet B with an earlier time than Packet A. This would cause the “back and forth” jitter that makes video stutter or crash muxers.
To utilize the correct order stored in the RTP header, we have to change how GStreamer handles the incoming data.
The Solution on GStreamer: rtpjitterbuffer
Instead of just grabbing the packet and stamping it with the “wall clock” immediately, we need a “waiting room” that looks at the RTP sequence numbers and timestamps.
- RTP Sequence Number: Tells GStreamer the exact order (1, 2, 3…).
- RTP Timestamp: Tells GStreamer the intended spacing between frames.
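Conceptually, a jitter buffer is a short waiting room that reorders by sequence number and releases packets only after a fixed latency window. The sketch below illustrates the idea only; it is not how GStreamer's rtpjitterbuffer is implemented:

```ts
// Conceptual sketch of a jitter buffer (not GStreamer's actual implementation):
// hold packets for `latencyMs`, reorder by RTP sequence number, release in order.
interface RtpPacket { seq: number; arrivalMs: number; payload: Uint8Array; }

class TinyJitterBuffer {
  private held: RtpPacket[] = [];
  constructor(private latencyMs: number) {}

  push(packet: RtpPacket): void {
    this.held.push(packet);
    this.held.sort((a, b) => a.seq - b.seq); // RTP sequence number restores sender order
  }

  // Release every packet that has waited at least `latencyMs`, in sequence order.
  pop(nowMs: number): RtpPacket[] {
    const ready = this.held.filter((p) => nowMs - p.arrivalMs >= this.latencyMs);
    this.held = this.held.filter((p) => nowMs - p.arrivalMs < this.latencyMs);
    return ready;
  }
}
```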
RTP packet drop
Even with a jitter buffer, sometimes a packet is lost forever (it’s UDP, after all). If the packet carrying Frame 5 never arrives, there is a “hole” in the timeline. mp4mux hates holes.
videorate sees the hole and says: “Frame 5 is missing, but I need to keep 30fps for this MP4 file. I will just duplicate Frame 4 and give it the timestamp Frame 5 should have had.”
GStreamer tips
- Added rtpjitterbuffer: This element was inserted after udpsrc to handle network jitter and packet reordering, which helps in constructing valid timestamps from RTP packets. I set latency=200 to provide a buffer against network fluctuations.
- Added videorate: This element was inserted before the encoder (x264enc) along with a caps filter video/x-raw,framerate=30/1. This forces a constant frame rate and regenerates timestamps for the raw video frames, ensuring that the encoder receives a stream with perfect, monotonic timestamps. This effectively sanitizes the stream and prevents the “Buffer has no PTS” error in the downstream mp4mux element.
Wall clock display
To accurately display the “wall-clock” time (NTP) for a specific frame, you have to bridge the gap between Media Time (RTP/Segments) and Real Time (NTP).
The method changes depending on whether you are using a packet-based protocol (WebRTC) or a file-based protocol (HLS).
1. WebRTC / RTP (Packet-Level Precision)
In WebRTC, the timestamp is calculated dynamically by the client. It is the most precise method but requires the most math.
- The Source: The Edge device sends RTP Packets (video) and RTCP Sender Reports (the “Clock Map”).
- The Map: The RTCP report explicitly says: “RTP timestamp 90,000 = Friday, 10:00:00 AM UTC.”
- Client Calculation: The browser uses the getStats() API to find the estimatedPlayoutTimestamp. This value represents the exact NTP time the current frame was captured, adjusted for the network delay and jitter buffer (see the sketch below).
- Accuracy: Frame-accurate (millisecond precision).
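A browser-side sketch of this lookup, assuming the RTCPeerConnection already exists and that the browser exposes estimatedPlayoutTimestamp (it is defined in the WebRTC statistics spec but not implemented everywhere, and is specified on the NTP epoch):

```ts
// Sketch: read the estimated NTP playout time of the video frame currently on screen.
// Assumes `pc` is an RTCPeerConnection receiving video; estimatedPlayoutTimestamp
// is an optional stat and may be absent in some browsers.
async function currentFrameWallClock(pc: RTCPeerConnection): Promise<Date | null> {
  const NTP_TO_UNIX_MS = 2_208_988_800_000; // offset between the 1900 NTP epoch and the 1970 Unix epoch
  const stats = await pc.getStats();
  for (const report of stats.values()) {
    if (report.type === "inbound-rtp" && report.kind === "video") {
      const ntpMs: number | undefined = report.estimatedPlayoutTimestamp;
      if (typeof ntpMs === "number") {
        // Assumption: the value follows the spec's NTP-epoch convention.
        return new Date(ntpMs - NTP_TO_UNIX_MS);
      }
    }
  }
  return null; // Stat not exposed by this browser.
}
```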
2. HLS (Manifest-Level Precision)
HLS doesn’t have a continuous “clock map” like RTCP. Instead, it embeds time metadata into the playlist or the video stream itself.
- The Source: The server converts the RTP stream into segments and writes the time into the .m3u8 manifest.
- The Tag: #EXT-X-PROGRAM-DATE-TIME. This tag associates the first frame of a segment with an absolute UTC time.
- Client Calculation: The player (like hls.js) reads this tag. To find the current time, it takes the programDateTime of the segment and adds the current playback offset (e.g., if you are 2 seconds into a 6-second segment). See the sketch below.
- Accuracy: High, but dependent on how frequently the server writes the tag.
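A matching hls.js sketch (assuming hls.js v1.x): track the programDateTime and start time of the fragment currently playing via the FRAG_CHANGED event, then add the playback offset within that fragment. Property names are as exposed by hls.js v1 and may differ in other versions:

```ts
import Hls from "hls.js";

// Sketch: derive wall-clock time from #EXT-X-PROGRAM-DATE-TIME, assuming hls.js v1.x.
const video = document.querySelector("video")!;
const hls = new Hls();
hls.loadSource("https://example.com/live/stream.m3u8"); // placeholder URL
hls.attachMedia(video);

let fragProgramDateTime: number | null = null; // ms since the Unix epoch, from the playlist tag
let fragStart = 0;                             // media time (seconds) at which the fragment begins

hls.on(Hls.Events.FRAG_CHANGED, (_event, data) => {
  fragProgramDateTime = data.frag.programDateTime ?? null;
  fragStart = data.frag.start;
});

function currentWallClock(): Date | null {
  if (fragProgramDateTime === null) return null; // playlist has no PROGRAM-DATE-TIME tag
  const offsetSec = video.currentTime - fragStart; // how far we are into this fragment
  return new Date(fragProgramDateTime + offsetSec * 1000);
}
```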
3. Comparison Summary
| Feature | WebRTC | HLS |
|---|---|---|
| Mechanism | RTCP Sender Reports (SR) | #EXT-X-PROGRAM-DATE-TIME |
| Data Type | Binary packets (UDP) | Text in Manifest (HTTP) |
| Browser Tool | RTCPeerConnection.getStats() | hls.js fragment programDateTime |
| Update Frequency | Every 1–2 seconds | Once per segment (e.g., 2–6s) |
| Main Challenge | Syncing local clock vs. sender clock | Clock “drift” if segments are long |
Which should you use?
- Use WebRTC if your frontend needs to trigger an action at the exact millisecond something happens (e.g., an AI detecting a face on a security camera).
- Use HLS if you just need a “Clock” overlay on the video player that shows the approximate time for human reference.
What library is used for receiving RTP and converting it to WebRTC?
In your architecture (Edge device → Web Server → Frontend), the software sitting in the middle is typically called a Media Server or a WebRTC Gateway.
Because browsers cannot receive raw RTP directly (due to the lack of encryption and NAT traversal), this library acts as a “translator” that takes the raw incoming RTP and wraps it in the security layers (DTLS/SRTP) required for WebRTC.
1. Top Libraries/Servers for RTP-to-WebRTC
MediaMTX (Formerly rtsp-simple-server)
This is currently the most popular “out-of-the-box” solution for your exact use case. It is written in Go and is very lightweight.
- How it works: It can ingest RTSP/RTP from your edge device and automatically provide a WebRTC mount point for your frontend.
- Best for: Quick setups and edge computing where you don’t want to write much code.
Janus WebRTC Server
A highly modular, C-based server designed by Meetecho.
- How it works: It has a specific Streaming Plugin. You send your RTP packets to a specific port on the Janus server, and it handles the WebRTC “handshake” (SDP/ICE) with the frontend.
- Best for: High-performance, scalable systems where you need a rock-solid, battle-tested gateway.
GStreamer
If you are building a custom application in C, C++, or Python, GStreamer is the “Swiss Army Knife” of media.
- How it works: You build a “pipeline.” One end is rtpbin (to receive RTP) and the other is webrtcbin (to send it to the browser).
- Best for: Deep technical control, transcoding, or adding AI filters (like face detection) in the middle of the stream.
Mediasoup / Pion
These are more “developer-centric” libraries rather than finished servers.
- Pion (Go): The leading library if you are writing your web server in Go. It gives you total control over the RTP packets.
- Mediasoup (Node.js/C++): Extremely powerful for routing media. It is often used in professional conferencing tools.
2. How the “Conversion” Works Internally
The library doesn’t usually change the video data itself (unless you ask it to transcode). Instead, it performs a Header and Security transformation:
- RTP Ingest: The library listens on a UDP port for your Edge device’s RTP packets.
- Clock Sync: It listens for RTCP Sender Reports to maintain the NTP-to-RTP timestamp mapping.
- DTLS/SRTP Encryption: It takes the payload (the H.264/H.265 data) and encrypts it using the keys negotiated with the browser.
- Signaling: It generates the SDP (Session Description) that your frontend needs to connect; a minimal browser-side example follows below.
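On the frontend, the last two steps look like a WHEP exchange: the browser sends an SDP offer over HTTP and applies the SDP answer. The endpoint URL below is hypothetical; gateways such as MediaMTX expose WHEP endpoints like this out of the box:

```ts
// Sketch of a WHEP (WebRTC-HTTP Egress Protocol) client: pull a stream from a
// gateway over plain HTTP signaling. The endpoint URL is hypothetical; adapt it
// to whatever your media server (e.g., MediaMTX) actually exposes.
async function playWhep(endpoint: string, video: HTMLVideoElement): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();
  // Receive-only: we only want the gateway's audio/video, not to send our own.
  pc.addTransceiver("video", { direction: "recvonly" });
  pc.addTransceiver("audio", { direction: "recvonly" });
  pc.ontrack = (e) => { video.srcObject = e.streams[0]; };

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  // WHEP signaling: POST the SDP offer, receive the SDP answer in the response body.
  const res = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/sdp" },
    body: offer.sdp,
  });
  const answerSdp = await res.text();
  await pc.setRemoteDescription({ type: "answer", sdp: answerSdp });
  return pc;
}

// Usage (hypothetical URL):
// playWhep("https://gateway.example.com/live/cam1/whep", document.querySelector("video")!);
```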
Library Comparison
| Library | Language | Complexity | Best Use Case |
|---|---|---|---|
| MediaMTX | Go | Low | Quick “RTSP to WebRTC” bridge; minimal coding required. |
| Janus | C | Medium | General-purpose WebRTC gateway with a plugin architecture. |
| GStreamer | C / Python | High | Complex pipelines, hardware acceleration, and edge-side processing. |
| Pion | Go | High | Building custom, high-performance media servers in Go. |
| Mediasoup | Node.js / C++ | High | Massive scale multi-party conferencing (SFU). |