ML-driven Co-optimization of Lightweight Compression and Adaptive Bitrate Allocation for Edge IoT Distributed Video Coding

Qu Wenyue, Wang Jinglong^*, Zhang Yiming, Pei Xinyan and Liang Zhuang

State Grid Shanxi Electric Power Company, Yuncheng Power Supply Company, Yuncheng 044000, Shanxi, China
E-mail: quwenyao@163.com; ycgdwjl@163.com; 623922762@qq.com; peixinyan6@qq.com; 837143258@qq.com
*Corresponding Author

Received 11 September 2025; Accepted 17 October 2025

Abstract

The increasing demand for real-time video services characterizes next-generation wireless networks. This demand exacerbates the conflict between bandwidth-intensive applications and resource-constrained edge infrastructure. This study proposes an ML-driven co-optimization framework that integrates lightweight compression with adaptive bitrate allocation using distributed edge intelligence. The methodology employs a depthwise separable CNN encoder enhanced by channel pruning and quantization-aware training to minimize computational requirements, achieving model sizes of $\leq$ 500 KB and computational complexity of 0.8 GFLOPs per frame on resource-limited nodes. Concurrently, a proximal policy optimization controller is adopted to dynamically adjust bitrate based on real-time channel state information and motion complexity features. A federated alternating optimization mechanism jointly reduces latency, energy consumption, and distortion while preserving data privacy. Experimental validation on edge IoT testbeds demonstrated substantial improvements over state-of-the-art baselines, achieving 42.7% lower encoding latency, 3.2 dB higher PSNR, and 38.5% reduced energy consumption with sub-100 ms processing times. By addressing the fundamental disconnect between compression and transmission optimization, this framework provides a scalable solution for 6G-enabled massive IoT video systems. It effectively bridges theoretical machine learning advances with practical deployment constraints in ultra-reliable low-latency communication environments.

Keywords: Edge IoT, distributed video coding (DVC), machine learning, lightweight compression, adaptive bitrate allocation, co-optimization, resource-constrained networks, 6G communications.

1 Introduction

The exponential proliferation of video data from edge Internet of Things (IoT) devices spanning smart cities, industrial automation, and autonomous systems has intensified demand for ultra-efficient video coding under severe resource constraints. Forecasts indicate video will constitute over 80% of IoT-generated data by 2025, with a single industrial camera producing approximately 1.2 terabytes of raw 1080p video daily, overwhelming computational and energy budgets of sub-2-watt IoT nodes [1, 2]. Traditional video codecs like H.265/HEVC impose prohibitive computational overhead (e.g., $>$ 10 W power draw and $>$ 5 GFLOPs per frame) on resource-constrained sensors, rendering them impractical for edge deployment despite effectiveness in centralized settings [3–5]. Distributed video coding (DVC), grounded in Wyner–Ziv theory, emerged as a promising alternative by shifting complexity to decoders. Early frameworks like DISCOVER demonstrated feasibility for low-power sensors through temporal correlation via edge-side information generation [6, 7]. However, these schemes remain fundamentally constrained by high computational burden during side information refinement, requiring motion compensation and iterative decoding that consumes 2–3 $\times$ more energy than encoding, and exhibit poor adaptability to heterogeneous hardware, as evidenced by 40–60% latency spikes on sub-2 W Raspberry Pi-class devices [8]. Recent IoT-specific adaptations attempted to reduce decoder complexity through simplified correlation models [9, 10], but sacrificed robustness to motion dynamics, leading to irreversible quality degradation in high-variability scenarios like drone surveillance. Crucially, conventional DVC frameworks operate agnostically to dynamic network conditions; under RSSI variations common in wireless industrial environments, DISCOVER exhibits video stall rates exceeding 65%, critically undermining real-time monitoring applications [11].

Machine learning has recently revitalized DVC research through neural network-enhanced side information generation and rate-distortion optimization. Wang et al. achieved 1.8 dB PSNR gains over DISCOVER on low-motion datasets using LSTM-based temporal prediction refinement [12], while transformer variants later improved spatial redundancy exploitation [13]. To address node-side constraints, MobileNetV3-inspired encoders reduced model sizes to 1.2 MB [14], yet incurred significant quality penalties, suffering 5.1 dB PSNR loss at 0.5 Mbps compared to heavier baselines due to insufficient capacity for complex motion modeling. Subsequent quantization-aware training compressed models to 8-bit precision but introduced blocking artifacts in low-light industrial scenes [15]. Critically, these ML-DVC efforts optimize only the compression pipeline while treating bitrate allocation as a static post-processing step. This decoupling renders them vulnerable to network fluctuations; MobileNetV3-DVC exhibits frame drops during RSSI variations despite efficient encoding, as it lacks mechanisms to dynamically adjust transmission rates [16, 17]. Concurrently, reinforcement learning techniques have advanced adaptive bitrate allocation in edge networks. Proximal policy optimization (PPO) controllers demonstrated 25–40% stall rate reductions in 5G testbeds by reacting to channel state metrics (e.g., RSSI, packet loss) [18]. Distributed approaches allocated rate adaptation to gateways but introduced latency overhead through centralized coordination, unacceptable for sub-100 ms IoT applications [19]. Recent federated RL methods reduced raw data exposure by aggregating model updates [20, 21] yet assumed homogeneous network conditions and ignored encoder-side constraints. Most critically, existing bitrate allocators presuppose traditional compression outputs (e.g., H.264/HEVC chunks), creating severe incompatibility with DVC’s probabilistic bitstream generation. When applied to DVC systems, these allocators cause bandwidth over-provisioning during congestion (wasting 22% of scarce IoT energy on redundant transmissions) and under-provisioning during channel recovery (degrading PSNR by 4.7 dB) [22].

This synthesis reveals a critical research gap, that is, no existing framework holistically co-optimizes lightweight DVC compression and adaptive bitrate allocation under edge IoT constraints [23, 24]. Prior work operates in silos, DVC research focuses on encoder/decoder asymmetry without transmission awareness, while bitrate adaptation assumes fixed compression outputs [25, 26]. The resulting “compression–transmission decoupling” negates efficiency gains; encoder improvements are undermined by suboptimal bitrate decisions during network congestion, or vice versa. While federated learning enables distributed training [27], its integration with joint compression–transmission optimization for privacy-preserving edge IoT video delivery remains unexplored; a critical barrier to scalable 6G massive IoT systems requiring ultra-reliable low-latency communication (URLLC) [28].

To bridge this gap, we propose a co-optimization framework for edge IoT distributed video coding, synergistically integrating lightweight compression and adaptive bitrate allocation via machine learning. This approach uniquely decouples computational complexity: ultra-lean depthwise separable CNN encoders ( $\leq$ 500 KB) operate on IoT nodes, while a PPO-based controller dynamically allocates bitrates at the edge using real-time channel metrics (RSSI, packet loss) and content features. Crucially, we introduce a joint optimization mechanism that unifies encoder training and bitrate control through a federated learning pipeline, balancing PSNR, energy, and latency without raw video data exchange. This work makes five key contributions: (i) a depthwise separable CNN encoder reducing FLOPs by 60% versus ResNet-18 with minimal quality loss; (ii) a distributed PPO controller for real-time bitrate adaptation responsive to channel and content dynamics; (iii) a unified loss function enabling co-optimization of compression and transmission; (iv) a privacy-preserving federated training scheme for edge coordination; and (v) rigorous validation on real-world edge IoT testbeds (Raspberry Pi/Jetson) demonstrating 42.7% lower latency, 38.5% less energy consumption, and 3.2 dB higher PSNR than state-of-the-art baselines. By resolving the fundamental tension between node-side efficiency and edge-side adaptability, our framework establishes a scalable foundation for 6G-enabled massive video IoT systems.

2 Proposed Model Design

2.1 Overall Model

A distributed edge IoT architecture is considered, comprising N resource-constrained video sensors (e.g., Raspberry Pi 4 with 1.5 GHz CPU), an edge gateway (e.g., NVIDIA Jetson AGX), and optional cloud infrastructure. Each sensor captures video sequences $X = {x_{1}, x_{2}, \dots, x_{t}}$ where $x_{t} \in ℝ^{H \times W \times 3}$ denotes the tth frame at resolution $H \times W$ . Following Wyner–Ziv theory [29], frames are partitioned into key frames (K) and Wyner–Ziv frames (W), with key frames transmitted at higher quality to enable side information generation at the edge. The wireless channel follows a time-varying Rayleigh fading model with instantaneous bandwidth $B (t)$ and packet loss rate $p_{l} (t)$ , governed by:

B (t) = B_{0} \log_{2} (1 + \frac{P_{t} {| h (t) |}^{2}}{σ^{2}}), p_{l} (t) = 1 - e^{- λ \cdot SNR (t)}

(1)

This equation quantifies bandwidth and packet loss dynamics under fading conditions, which is critical for modeling real-world IoT network fluctuations. Unlike centralized frameworks [30], the proposed model reduces latency by 63% during congestion by performing all operations at the edge, eliminating cloud dependency as validated through CORE network emulator tests with UAV-VisDrone dataset inputs. The complete system architecture is illustrated in Figure 1. As illustrated in Figure 1, the IoT sensors (e.g., Raspberry Pi 4) execute lightweight Wyner–Ziv encoding (green), partitioning frames into key frames and Wyner–Ziv frames. Key frames are transmitted over the wireless channel, modeled as Rayleigh fading with instantaneous bandwidth $B (t)$ and packet loss rate $p ℓ (t)$ , to the edge gateway. The edge gateway (NVIDIA Jetson AGX) is responsible for side information generation (blue) and adaptive bitrate allocation using a PPO controller (red), ensuring efficient rate adaptation under fluctuating network conditions. The entire system is implemented and validated using MATLAB Simulink with CORE network emulator parameters.

Figure 1 Distributed edge IoT video coding architecture.

2.2 Problem Formulation

The objective minimizes the weighted sum of end-to-end latency L, energy consumption E, and video distortion D under hardware constraints:

$min_{θ, ϕ} α L (θ, ϕ) + β E (θ) + γ D (θ, ϕ) s . t . E \leq E_{\max},$
$L \leq L_{\max}, PSNR \geq {PSNR}_{\min}$	(2)

The weights $α, β, γ$ in Equation (2) were determined through a two-stage process. First, we performed an offline grid search over the range [0.2,0.7] for each parameter with $α + β + γ = 1$ using 10% of the UAV-VisDrone sequences and the first 6 h of the Edge-Cam dataset. The primary objective was to minimize the energy–latency product while maintaining PSNR above 30 dB. This search yielded an optimal configuration of $α = 0.5, β = 0.3, γ = 0.2$ , reflecting the higher penalty for latency in ultra-reliable low-latency IoT links. Second, we conducted a sensitivity sweep by perturbing each weight by $\pm$ 20% while holding the others proportionally constant. Results showed that PSNR varied by at most 0.3 dB and end-to-end latency changed by $<$ 5%, indicating the robustness of the framework to moderate weight tuning. We also verified that an energy-biased setting ( $β = 0.5$ ) extended node lifetime by 18% but introduced 9 ms extra latency, which can be tolerated for non-real-time monitoring scenarios. This analysis confirms that the chosen weights achieve a balanced trade-off suitable for real-time edge IoT video delivery.

This unified formulation couples compression and transmission decisions, which optimizes them separately, enabling precise trade-off analysis between computational load and transmission energy. Latency $L = L_{en} c + L_{trans} + L d_{e} c$ comprises encoding time (on the IoT node), transmission delay (over the wireless channel), and decoding time (at the edge). Energy consumption can be modeled as:

E = κ_{comp} \cdot FLOPs (θ) + κ_{comm} \cdot R \cdot T_{trans}

(3)

where $κ_{comp} = 3.2 \times 10^{- 9}$ J/FLOP (Raspberry Pi 4) and $κ_{comm} = 0.4$ J/Mbps (ESP32 radio) [31]. This energy model incorporates device-specific coefficients derived from empirical measurements, with $E_{\max}$ set to 1.2 J/frame based on Raspberry Pi 4 thermal limits and $L_{\max}$ constrained to 95 ms to satisfy real-time requirements. The hardware constraints for the optimization problem are detailed in Table 1, which specifies the device-specific thresholds derived from extensive empirical testing across multiple edge IoT platforms.

Table 1 Hardware constraint parameters for the edge IoT video coding system

	Raspberry	ESP32	NVIDIA
Parameter	Pi 4	Radio	Jetson AGX	Measurement Method
$E_{\max}$	1.2 J/frame	N/A	2.8 J/frame	Power monitor $+$ thermal profiling
$L_{\max}$	95 ms	N/A	40 ms	CORE network emulator stress tests
${PSNR}_{\min}$	26.5 dB	N/A	N/A	User study (MOS $\geq$ 4.0)
$T_{\max}$	80 ms	N/A	25 ms	Thermal throttling threshold
$r_{\max}$	0.1 Mbps	N/A	N/A	RSSI calibration tests
$r_{\max}$	2.0 Mbps	N/A	N/A	Channel capacity tests

2.3 Lightweight Compression Module

A depthwise separable CNN encoder is designed for IoT nodes, inspired by MobileNetV2 [32] but enhanced with channel pruning and quantization-aware training. Input frames undergo YUV 4:2:0 conversion (reducing the chroma bandwidth by 50%) with anti-aliased resizing to 128 $\times$ 128 resolution to preserve motion features. The encoder architecture implements the following three-stage processing. Stage 1 employs 3 $\times$ 3 depthwise convolution with 16 filters (stride $=$ 2), reducing MAC operations from $H \times W \times C_{in} \times C_{out}$ to $H \times W \times C_{in}$ , which cuts computations from 1.97M to 246K ops for 128 $\times$ 128 $\times$ 3 inputs. Stage 2 integrates structured channel pruning where filters are eliminated based on L1-norm importance and motion sensitivity (optical flow magnitude $>$ 0.3 pixels/frame preserved), reducing model size to 487 KB while limiting PSNR loss to 0.4 dB. Stage 3 implements 8-bit symmetric quantization with clipped ReLU (max $=$ 6.0) to prevent overflow, validated to reduce PSNR degradation by 0.2 dB versus standard ReLU6.

The detailed architecture of the lightweight encoder is illustrated in Figure 2, where three key stages enable efficient operation on resource-constrained IoT devices. The first stage applies a 3 $\times$ 3 depthwise convolution with stride 2 and 16 filters, reducing parameters by a factor of 8 while preserving spatial features. The second stage introduces structured channel pruning, in which redundant filters are removed using $ℓ$ 1-norm regularization and motion sensitivity criteria, thereby reducing model size while minimizing quality loss. The third stage implements 8-bit quantization with clipped ReLU activation (maximum value 6.0), enabling integer-only inference with minimal PSNR degradation. Together, these stages allow the encoder to achieve low computational cost and compact model size while retaining high video quality. The quantization-aware training protocol follows a three-phase schedule to ensure a smooth transition from full-precision training to efficient quantized inference, as detailed in Table 2.

Figure 2 Lightweight encoder: Input frames are transformed into a compressed bitstream through three stages: (1) depthwise convolution for spatial feature preservation with reduced parameters, (2) channel pruning to eliminate redundant filters using $ℓ$ 1-regularization and motion sensitivity, and (3) 8-bit quantization with clipped ReLU (max $=$ 6.0) for efficient integer-only inference.

Table 2 Quantization-aware training schedule for the IoT encoder

Training		Learning		Hardware
Phase	Epochs	Rate	Key Operations	Validation Metrics
Full precision	0–50	1 $\times$ 10 $^{- 4}$	Adam optimizer, MSE loss	Baseline PSNR: 32.1 dB
QAT warmup	51–100	1 $\times$ 10 $^{- 5}$	Fake quant ops insertion	PSNR drop: $\leq$ 0.5 dB
Fine-tuning	101–150	5 $\times$ 10 $^{- 6}$	Weight/activation quantization	Target PSNR: 31.7 dB
Post-training	N/A	N/A	Bias correction, range calibration	Final PSNR: 31.6 dB

At the edge gateway, a lightweight GAN is employed to refine side information using a perceptual loss derived from VGG16 features at the conv2_2 and conv3_2 layers, combined with pixel shuffle upsampling for efficient reconstruction. This design achieves an inference latency of less than 15 ms on the NVIDIA Jetson AGX, while reducing computational complexity to 0.8 GFLOPs per frame compared to 5.2 GFLOPs for ResNet-18 [33]. The resulting model size is compressed to 487 KB, enabling deployment on resource-constrained edge devices. Importantly, unlike MobileNetV3-DVC [34], which incurs a 5.1 dB PSNR degradation at 0.5 Mbps, the proposed pruning-enhanced approach preserves motion-sensitive features and limits quality loss to only 0.4 dB, thereby striking a balance between efficiency and perceptual quality.

The encoder loss function incorporates hardware constraints:

ℒ_{en} c = MSE (x, \hat{x}) + β \cdot FLOPs (θ) + γ \cdot \max {(0, T_{en} c - T_{\max})}^{2}

(4)

where $T_{\max} = 80$ ms represents the Raspberry Pi 4 thermal throttling threshold, with $β$ and $γ$ tuned via CMA-ES to satisfy energy and latency constraints with 95% confidence.

The performance of side information refinement across varying motion complexity is presented in Figure 3.

Figure 3 Side information refinement at the edge gateway. A lightweight GAN leverages VGG16 perceptual features (conv2_2 and conv3_2) to enhance side information quality, with performance illustrated as PSNR gain (dB) versus motion complexity (optical flow magnitude).

2.4 Adaptive Bitrate Allocation Module

A PPO [35] controller is proposed that dynamically allocates bitrate $r (t) \in [r_{\min}, r_{\max}]$ based on real-time state observations:

s (t) = [RSSI (t), p_{l} (t), ∥ \nabla M (x_{t}) ∥ 2, B_{e s t} (t)]

(5)

where motion complexity $∥ \nabla M (x_{t}) ∥ 2$ is computed via Farnebäck optical flow [36] on luminance-channel subsampled (64 $\times$ 64) frames to reduce CPU load from 82 ms to 14 ms on Raspberry Pi. The controller maximizes:

R (t) = λ_{1} \cdot PSNR (t) - λ_{2} \cdot L_{trans} (t) - λ_{3} \cdot I [buffer < τ]

(6)

with weights $λ_{1} = 0.7, λ_{2} = 0.2, λ_{3} = 0.1$ tuned via a grid search. The state space engineering incorporates carefully normalized inputs to ensure stable controller operation across varying network conditions, as summarized in Table 3.

Table 3 State vector component specifications for PPO controller

	Measurement	Normalization	Hardware	Sampling
Component	Method	Range	Constraint	Frequency
$RSSI (t)$	ESP32 Wi-Fi RSSI (dBm)	[ $-$ 80, $-$ 50] $\to$ [0,1]	Max rate: 50 Hz	20 ms
$p_{l} (t)$	Packet loss from UDP ACKs	[0, 0.3] $\to$ [0,1]	$<$ 5% CPU overhead	100 ms
$∥ \nabla M (x_{t}) ∥ 2$	Farnebäck optical flow	[0, 15] $\to$ [0,1]	64 $\times$ 64 subsampled frames	40 ms
$B_{e s t} (t)$	Kalman filter on Equation (1)	[0.5, 5] Mbps $\to$ [0,1]	200 ms history window	50 ms

The actor-critic network processes normalized state inputs through a 64-unit FC layer, with actor output scaled to [0.1, 2.0] Mbps via tanh activation. During RSSI dropouts ( $< -$ 85 dBm for $>$ 100 ms), the system switches to fallback bitrate $r p_{a} c_{k} p_{r o u n} d = 0.7 \times r_{\min}$ and recovers via 2 $\times$ keyframe injection upon signal restoration, reducing recovery time by 68% versus DDPG. The PPO update rule implements stable policy optimization:

L^{CLIP} (ϕ) = E_{t} [\min (r_{t} (ϕ) {\hat{A}}_{t}, clip (r_{t} (ϕ), 1 - ϵ, 1 + ϵ) {\hat{A}}_{t})] (ϵ = 0.2)

Figure 4 shows the comparative performance of the PPO controller during channel fluctuations, from which significant improvements in adaptation speed can be seen over DDPG-based approaches. As illustrated in Figure 4, this design substantially accelerates recovery during channel degradation. In emulated tests with the Edge-Cam dataset, an RSSI drop from $-$ 50 dBm to $-$ 80 dBm reduced throughput temporarily, after which the PPO agent restored the target bitrate in 120 ms. By contrast, the DDPG baseline [37] required 385 ms to reach the same level. The gain stems from PPO’s clipped surrogate objective, which stabilizes updates and prevents over-correction, thereby improving convergence speed. This responsiveness reduces playback stalls and enhances quality of experience in edge IoT video streaming.

Figure 4 Bitrate adaptation during channel fluctuation. (A) PPO control architecture; (B) comparative bitrate recovery performance under an RSSI drop from $-$ 50 to $-$ 80 dBm. The PPO controller restores the target bitrate within 120 ms, compared to 385 ms for the DDPG baseline [37].

2.5 Co-optimization Mechanism

The joint optimization is solved through alternating minimization, decoupling encoder and allocator training while preserving end-to-end synergy. In phase I, $ϕ$ is fixed and $θ$ is optimized:

\underset{θ}{m i n} D (θ, ϕ) + β E (θ) via θ \leftarrow θ - η \nabla_{θ} (MSE + β \cdot FLOPs (θ))

(8)

Hardware profiling occurs every epoch. The TFLite model deploys to Raspberry Pi, with $β$ adjusted via CMA-ES if thermal throttling ( $>$ 70 $^{\circ}$ C) occurs. In phase II, $θ$ is fixed and $ϕ$ is trained via federated PPO where sensors compute masked gradients (CPU usage $>$ 70% or motion energy $<$ 0.1 excluded) and the edge gateway aggregates updates: $ϕ_{k}^{+} 1 = ϕ_{k} + η \sum {mask}_{i} \cdot Δ ϕ_{i}$ . The alternating optimization process is detailed in Table 4, displaying the specific parameters and constraints applied during each phase.

In addition to the one-time calibration described earlier, our system performs periodic thermal and latency profiling to ensure compliance with hardware limits throughout both training and deployment. During the encoder optimization phase, each epoch concludes with an automatic hardware-in-the-loop test: the Raspberry Pi nodes run a short encoding sequence (30 s), and their CPU temperature and latency are logged. If the average temperature exceeds 70 $^{\circ}$ C or the encoding latency surpasses $T_{\max} = 95$ ms, the CMA-ES tuner automatically updates the $β$ penalty in the loss function for the next epoch. During deployment, a lightweight monitoring daemon samples the same metrics every 10 minutes using on-device thermal sensors and OS timers. If any node exceeds its thermal budget or shows $>$ 5 ms latency drift, the system sends a signal to the edge gateway to reduce encoder complexity (e.g., additional pruning at the last convolution block) and temporarily scale bitrate targets until temperatures normalize. This closed-loop mechanism allows online enforcement of hardware limits, preventing overheating or latency spikes without manual intervention.

Table 4 Alternating optimization parameters for co-optimization

Optimization Phase	Objective	Constraints	Hardware Constraints	Update Frequency	Convergence Criteria
Phase I (encoder)	$D (θ, ϕ) + β E (θ)$	$E \leq E_{\max}, T_{en} c \leq T_{\max}$	Raspberry Pi 4 thermal limits	1 epoch per hardware test	$Δ$ PSNR $<$ 0.1 dB over 5 epochs
Phase II (allocator)	Maximize $R (t)$	Buffer occupancy $>$ $τ$	Jetson AGX CPU $<$ 70%	5 Hz (max)	$Δ$ PSNR $<$ 0.05 dB over 10 steps

The complete training workflow for the co-optimization mechanism is presented in Figure 5, including the two-phase process with hardware-in-the-loop validation.

Figure 5 Hardware-aware encoder optimization and federated updates. Encoder parameters are dynamically adapted under device-specific constraints (Raspberry Pi 4: temperature $>$ 70 $^{\circ}$ C triggers $β$ adjustment; Jetson AGX: CPU maintained $<$ 70%). Thermal throttling prevents overheating, while federated PPO updates across gateways ensure robust bitrate control with negligible quality loss ( $Δ$ PSNR $<$ 0.1 dB, $Δ$ SSIM $<$ 0.05).

Cross-module synergy occurs through motion complexity index (MCI) feedback and quantized 4-bit MCI values enable PPO to pre-emptively adjust $r (t)$ , reducing adaptation lag by 22 ms. The joint distortion model incorporates channel-induced effects:

D (θ, ϕ) = D_{en} c (θ) + α c_{han} \cdot p_{l} (t) \cdot D d_{e} c (ϕ)

(9)

where $α c_{han}$ is a channel distortion coefficient learned online via exponential moving average. This formulation ensures stable policy updates per Schulman et al. (2017), with convergence guaranteed under stochastic channel conditions and bounded motion complexity. Convergence and performance gains of the co-optimization approach are quantified in Figure 6, which compares PSNR stability under varying packet loss rates.

Figure 6 Co-optimization gain: PSNR (dB) vs. packet loss rate. Joint training (red) maintains 28.7 dB at 25% loss vs. 26.6 dB for separate optimization [38] (blue). Generated using the Edge-Cam dataset with a Rayleigh fading channel.

As shown in Figure 6, the co-optimization framework yields measurable improvements in video quality under lossy channels. With the Edge-Cam dataset over a Rayleigh fading channel, joint training maintains 28.7 dB PSNR at 25% packet loss, compared to 26.6 dB for separate optimization [38]. This represents a $>$ 2 dB gain in high-loss regimes, demonstrating that integrating MCI feedback with the distortion model significantly enhances robustness. Moreover, PSNR stability across varying packet loss rates confirms that the proposed approach achieves consistent quality while reducing adaptation lag and preventing instability during online learning.

2.6 Practical Deployment Protocol

Hardware-aware initialization calibrates device-specific parameters: Raspberry Pi 4 undergoes CPU stress testing to determine $T_{\max}$ (thermal throttling threshold), while $κ_{comp}$ is measured via Power Monitor (actual: 3.18 $\times$ 10 $^{- 9}$ J/FLOP). The runtime adaptation workflow follows a motion-triggered decision process where frames exceeding motion threshold encode as key frames, while Wyner–Ziv frames trigger PPO bitrate decisions using MCI and RSSI inputs. Buffer underflow ( $<$ 30% capacity) triggers immediate keyframe injection at $1.5 \times r_{\max}$ , reducing recovery time by 68%. The failure recovery mechanisms are systematically implemented as shown in Table 5.

Table 5 Failure recovery mechanisms for edge IoT video system

Failure	Detection	Recovery	Recovery	Validation
Scenario	Condition	Action	Time	Method
Edge gateway crash	No ACK for 500 ms	Local keyframe mode (1 keyframe/5 s)	$<$ 1 s	24-hour UAV-VisDrone test
RSSI dropout	RSSI $<$ $-$ 85 dBm for $>$ 100 ms	Fallback bitrate $(0.7 \times r_{\min})$	120 ms	CORE emulator stress test
Buffer underflow	Buffer $<$ 30% capacity	Keyframe injection at $1.5 \times r_{\max}$	320 ms	Edge-Cam dataset validation
Model drift	PSNR drop $>$ 1.5 dB for $>$ 10 s	Federated encoder retraining	2.1 s	100-hour continuous operation
Cloud dependency	Gateway offline $>$ 5 s	LTE-M fallback (50% $r_{\max}$ )

Model drift handling re-triggers federated encoder retraining if PSNR drops $>$ 1.5 dB for $>$ 10 s, validated to maintain stability during 24-hour UAV-VisDrone stress tests. The complete runtime adaptation workflow is formalized and visualized in Figure 7. The hybrid workflow in Figure 7 demonstrates the synergy between hardware-aware calibration, sensing feedback, and reinforcement learning-based control. The proposed hybrid controller adapts rapidly to wireless disturbances. When the received signal strength dropped from $-$ 50 dBm to $-$ 80 dBm, the system restored the target bitrate within 120 ms, recovering PSNR to 28.7 dB, compared to only 26.6 dB for the baseline and a slower 385 ms recovery with DDPG-based DASH-RL. Latency decreased by 42.7%, keeping the end-to-end delay under 100 ms even during severe fading. Structural fidelity was also preserved, maintaining SSIM $=$ 0.91 at 25% packet loss versus 0.84 for the baseline. These results highlight how hardware calibration and sensing-driven reinforcement learning combine to maintain smooth, high-quality video under unpredictable edge network conditions.

Temperature and power calibration ensure stable operation across heterogeneous edge devices, while MCI and RSSI inputs provide fine-grained indicators of motion and channel dynamics. Crucially, the fallback safeguard, originating at the sensing stage, injects keyframes or enforces conservative bitrate allocation under severe fading, preventing quality collapse. The quantitative results validate the effectiveness of this integration. As shown in Figure 7(B), the proposed hybrid controller achieves rapid convergence after channel disturbances, stabilizing PSNR at 28.7 dB compared to 26.6 dB for the baseline, with a latency reduction of 42.7%. Similarly, Figure 7(C) demonstrates superior perceptual stability, maintaining SSIM of 0.91 at 25% packet loss, whereas the baseline degrades to 0.84. Together, these results highlight that the joint design not only accelerates adaptation but also preserves structural fidelity under harsh channel conditions, confirming the robustness of the co-optimization strategy.

Figure 7 Hybrid architecture and performance evaluation. (A) Runtime workflow integrating hardware calibration, sensing of MCI and RSSI, and a PPO-based bitrate controller with fallback safeguard. (B) PSNR over time under channel disruption. (C) SSIM versus packet loss showing structural quality under fading conditions.

3 Evaluation and Results

3.1 Experimental Setup

All experiments were conducted on a real-world edge IoT testbed comprising 15 Raspberry Pi 4 nodes (1.5 GHz CPU, 4 GB RAM, 2.5 W power draw) capturing video at 480p@15fps, connected via Wi-Fi 6 to an NVIDIA Jetson AGX edge server (32 GB RAM, 30 W TDP). Network dynamics were emulated using CORE (Common Open Research Emulator) with time-varying Rayleigh fading (path loss exponent $=$ 3.5, shadowing $=$ 8 dB), simulating industrial environments with RSSI fluctuations from $-$ 50 dBm to $-$ 80 dBm and packet loss rates up to 25%. Two datasets validated our framework: UAV-VisDrone, containing 120 1080p@30fps drone surveillance sequences totaling 8.7 hours with motion complexity ranging from urban traffic (low motion) to emergency response (high motion); and Edge-Cam, a custom dataset from factory-floor Raspberry Pi cameras (480p@15fps, 12 hours) under varying lighting conditions (10–1000 lux) and motion (0–5 m/s object speed).

Five state-of-the-art baselines were compared in this study, namely the DISCOVER [39], the classic DVC with motion-compensated side information; ML-DVC [40], featuring LSTM-based side information refinement; Mobile-HEVC [41], a MobileNetV3-optimized HEVC encoder; DASH-RL [42], a DDPG-based adaptive bitrate allocation system; and Fed-DVC [43], which employs federated learning for distributed DVC. Key implementation details included our depthwise separable CNN encoder (0.8 GFLOPs/frame, 487 KB model) with 8-bit quantization, and a PPO controller trained for 50 federated rounds (batch size $=$ 16, learning rate $=$ 1e $^{- 4}$ , $γ = 0.99$ ). Metrics were measured per frame and included PSNR (dB), SSIM, encoding latency (ms), energy consumption (mJ/frame), and bandwidth efficiency (Mbps), with statistical significance validated via paired t-tests $(p < 0.01)$ and 95% confidence intervals computed over 10 experimental runs to ensure robustness against environmental variability. To ensure a fair and reproducible comparison, all machine learning-based baselines (ML-DVC, DASH-RL, and Fed-DVC) were re-trained or fine-tuned on the exact same datasets (UAV-VisDrone and Edge-Cam) used for our method. Training followed each method’s original hyperparameter settings and optimization schedules, while adapting only dataset-specific preprocessing (e.g., resizing and YUV conversion) to match our pipeline. The classical DISCOVER and Mobile-HEVC baselines were executed using their publicly available reference implementations with recommended parameter presets, as these codecs do not involve learnable parameters. This strategy guarantees that differences in performance stem from algorithmic design rather than dataset mismatch or training bias.

3.2 Results Analysis

As shown in Table 6, our method achieves 32.5 dB PSNR, 3.2 dB higher than ML-DVC and 4.1 dB higher than DISCOVER, while simultaneously reducing encoding latency by 42.7% versus DISCOVER (82 ms vs. 143 ms) and cutting energy consumption by 38.5% versus Mobile-HEVC (68 mJ/frame vs. 111 mJ/frame). Crucially, it maintains real-time processing ( $<$ 100 ms latency) under all tested conditions, unlike Mobile-HEVC which exceeds 200 ms during high-motion sequences, making it unsuitable for time-sensitive industrial applications.

Table 6 Performance comparison at 1.2 Mbps (averaged over 120 sequences)

Method	PSNR (dB)	SSIM	Latency (ms)	Energy (mJ)	Stall Rate (%)
DISCOVER	28.4 $\pm$ 0.3	0.82	143 $\pm$ 12	105 $\pm$ 8	65.2 $\pm$ 4.1
ML-DVC	29.3 $\pm$ 0.4	0.85	128 $\pm$ 9	98 $\pm$ 7	48.7 $\pm$ 3.8
Mobile-HEVC	27.9 $\pm$ 0.5	0.81	215 $\pm$ 18	111 $\pm$ 9	32.4 $\pm$ 2.9
DASH-RL	29.8 $\pm$ 0.3	0.86	132 $\pm$ 10	102 $\pm$ 8	21.5 $\pm$ 2.3
Fed-DVC	30.1 $\pm$ 0.4	0.87	118 $\pm$ 8	95 $\pm$ 6	18.9 $\pm$ 1.7
Ours	32.5 $\pm$ 0.2	0.91	82 $\pm$ 5	68 $\pm$ 4	6.7 $\pm$ 0.8

Figure 8 Energy–latency Pareto frontier on the Edge-Cam dataset under Rayleigh fading. The proposed co-optimization approach (blue curve) dominates the Pareto frontier, achieving the lowest energy–latency product of 5576 mJ ms, compared to 15,015 mJ ms for DISCOVER (red dashed), corresponding to a 62.8% improvement. The inset illustrates dynamic adaptation under RSSI degradation, where encoder pruning and PPO-driven bitrate reduction cooperatively prevent stall spikes.

Figure 8 illustrates the fundamental energy–latency trade-off across competing methods, highlighting how the proposed co-optimization strategy consistently dominates the Pareto frontier. Our approach achieves the lowest energy–latency product of 5576 mJ ms, representing a 62.8% reduction compared to 15,015 mJ ms for DISCOVER. This improvement demonstrates that the system is not only energy efficient but also capable of sustaining low-latency operation under constrained edge resources. By contrast, Mobile-HEVC remains confined to a high-latency regime ( $>$ 200 ms) even at peak energy consumption, underscoring its unsuitability for real-time IoT applications. Similarly, DASH-RL and Fed-DVC achieve modest gains but saturate rapidly once energy exceeds 100 mJ/frame, revealing diminishing returns in their control policies. The inset provides further insight into the adaptability of our framework during wireless channel degradation. When the received signal strength indicator (RSSI) drops (e.g., $-$ 75 dBm), the encoder proactively reduces complexity through channel-pruned inference, while the PPO controller simultaneously adjusts the bitrate. This synergy between model compression and reinforcement learning maintains smooth video delivery, suppressing stall rates to 6.7%, compared to $>$ 18% for Fed-DVC under identical conditions. The cooperative adaptation mechanism thereby extends system robustness to time-varying network conditions, a property absent in prior baselines.

Statistical evaluation further confirms the reliability of these improvements. Pairwise t-tests indicate significance at $p < 0.001$ across all baseline comparisons, and non-overlapping 95% confidence intervals ( $\geq$ 15%) validate consistent performance gains. Together, these findings establish the proposed co-optimization framework as a principled and statistically robust solution for energy-constrained, latency-sensitive IoT video coding, with implications for large-scale edge deployments where both efficiency and reliability are paramount.

Figure 9 Ablation study: Impact of co-optimization components (UAV-VisDrone high-motion subset).

Figure 9 dissects the relative contributions of the proposed components through an ablation analysis. When used in isolation, the lightweight encoder without PPO allocation achieves a PSNR of 30.4 dB, but the absence of adaptive rate control results in a stall rate of 29.3% under channel fluctuations. Conversely, the PPO allocator without channel pruning reduces the stall rate to 12.1% but lowers the PSNR to 29.7 dB as reliance on heavier encoders (e.g., ResNet-18) leads to higher energy consumption and degraded visual quality. By contrast, the joint training framework synergistically combines lightweight encoding and PPO-driven allocation, achieving 32.5 dB PSNR with only 6.7% stall rate. This represents a 2.1 dB quality improvement and a statistically robust stall reduction compared with isolated modules ( $p < 0.001$ ; CI $\geq$ 15%). The benefits are especially pronounced in high-motion sequences (optical flow magnitude $>$ 15 pixels/frame), where joint optimization dynamically reallocates bitrate to compensate for motion-induced distortion, ensuring stable quality and low latency even in challenging wireless conditions.

In real-time video monitoring, user-perceived quality of experience (QoE) depends on both visual fidelity and playback continuity. PSNR measures reconstruction accuracy and visual clarity, while stall rate quantifies playback interruptions caused by buffer underflows or transmission delays. A system that delivers high PSNR but frequent stalls can feel unusable, just as smooth playback with very low PSNR can impair situational awareness.

Our joint optimization reduces stalls to 6.7% – well below the 10% threshold required for industrial monitoring standards (ISO/IEC 23090-12) – while maintaining 32.5 dB PSNR, which aligns with broadcast-grade quality under bandwidth constraints. In contrast, the lightweight encoder alone maintains only moderate quality (30.4 dB) but suffers a 29.3% stall rate, and the PPO allocator alone smooths playback but sacrifices detail (29.7 dB PSNR). This synergy explains why the integrated system achieves the best overall QoE, preserving both smooth motion and fine spatial details under channel fluctuations.

Figure 10 provides strong evidence of the robustness and practical viability of the proposed framework when deployed in a live industrial IoT environment. Under a severe congestion event with 65% packet loss, the system preserved visual quality at 28.1 dB PSNR while maintaining zero video stalls, a level of resilience that baseline methods could not achieve. In comparison, DISCOVER collapsed to 23.4 dB with an 82% stall rate, underscoring its sensitivity to burst losses. The proposed PPO-based controller was able to react within 118 ms, significantly faster than DASH-RL (385 ms), and adapted the bitrate from 1.2 Mbps to 0.7 Mbps within only two channel coherence periods. This rapid convergence highlights the effectiveness of reinforcement learning in capturing temporal dynamics of fading channels and preventing buffer underflows before playback interruptions occur.

Figure 10 Real-world deployment in a smart factory testbed under a severe congestion event (65% packet loss).

Equally important, the framework achieved these gains without compromising energy efficiency. Power consumption was consistently regulated at 72 mJ/frame ( $\pm$ 3.1), demonstrating that encoder-side channel pruning successfully eliminated redundant operations, whereas Mobile-HEVC exhibited an energy surge to 145 mJ/frame as its encoder attempted to compensate for transmission errors. This divergence illustrates the critical importance of co-optimizing compression and control: traditional codec-driven adaptation not only increases latency but also aggravates energy overhead, undermining feasibility in battery-powered IoT deployments.

Finally, the privacy-preserving training protocol adds another layer of significance. By applying federated gradient masking, the reconstruction accuracy of intercepted model updates was reduced to 12.3% compared with 98.7% for raw video streams, thereby meeting GDPR Article 32 standards for secure edge computing.

To evaluate how effectively our gradient masking defends against model inversion attacks, we applied a state-of-the-art gradient inversion pipeline that reconstructs training samples from intercepted updates (Zhu et al., NeurIPS 2019). Attackers were given the same network architecture and optimizer hyperparameters but no access to raw frames. We measured reconstruction fidelity using the structural similarity index (SSIM) and the peak signal-to-noise ratio (PSNR) between recovered and original frames. Without masking, the attack achieved 98.7% SSIM and 34.1 dB PSNR on average, indicating near-perfect recovery. After applying our masking strategy (sparsification to top-30% gradients combined with Gaussian noise $σ = 0.02$ ), the reconstructed frames dropped to 12.3% SSIM and 7.8 dB PSNR, rendering visual content nearly unrecognizable. These results validate that the privacy mechanism significantly degrades attack success while preserving model convergence ( $<$ 0.1 dB PSNR loss in video quality).

This integration of robustness, efficiency, and privacy ensures that the proposed framework is not only technically superior but also compliant with emerging regulatory and ethical requirements, paving the way for trustworthy deployment in mission-critical industrial and smart-factory environments.

3.3 Complexity Analysis

The computational overhead was rigorously profiled using Jetson AGX’s Nsight Systems, revealing that the lightweight encoder adds only 8.2 ms/frame latency (vs. 140 ms for HEVC), with depthwise convolution consuming 63% of this time. The PPO allocator executes in 4.8 ms/frame, 3.1 $\times$ faster than DDPG-based DASH-RL (14.9 ms) due to our simplified state space (4 dimensions vs. DASH-RL’s 12). Figure 11 highlights the computational efficiency and scalability advantages of the proposed co-optimization framework. Panel (A) shows the convergence dynamics, where alternating optimization of the encoder and allocator achieves stable performance in only 18 iterations, nearly halving the training time required by joint end-to-end optimization (35 iterations). This faster convergence reflects the benefit of decoupling encoder and controller updates, which reduces gradient interference and ensures smoother optimization. Panel (B) dissects the computational overhead, revealing that depthwise convolution accounts for 63% of encoder latency, confirming its central role in lightweight design, while other operations (27%) and the PPO allocator (10%) contribute relatively modest overhead. Panel (C) quantifies runtime efficiency, with the PPO allocator executing in just 4.8 ms/frame – 3.1 $\times$ faster than the DDPG-based DASH-RL baseline (14.9 ms/frame) – due to its reduced state space and simplified policy structure. Together, these results demonstrate that the proposed system not only converges more rapidly but also operates well within the computational and latency budgets of edge devices, thereby enabling real-time deployment in resource-constrained IoT environments.

Figure 11 Computational efficiency analysis of the proposed framework. (A) Convergence comparison between alternating optimization and joint end-to-end training. (B) Breakdown of encoder-side overhead, where depthwise convolution dominates (63%), with other encoder operations and the PPO allocator contributing 27% and 10%, respectively. (C) Runtime efficiency of the PPO allocator, achieving 4.8 ms/frame, which is 3.1 $\times$ faster than DASH-RL (DDPG) at 14.9 ms/frame.

Critical analysis of these results reveals several important insights. The 2.1 dB PSNR gain in Figure 9 stems from resolving the compression–transmission decoupling that plagues prior work; when motion complexity spikes (e.g., emergency vehicle in UAV-VisDrone), our encoder reduces spatial detail via channel pruning while the PPO controller simultaneously increases bitrate allocation, preserving temporal coherence without exceeding energy budgets. DISCOVER fails here by maintaining fixed compression, causing buffer underruns, while ML-DVC over-compresses to save energy, sacrificing motion detail. The 6.7% stall rate in Table 1 is particularly significant as it falls below the 10% threshold required for industrial video monitoring (ISO/IEC 23090-12), meaning our approach prevents operational halts in safety-critical applications like robotic arm tracking where $>$ 10% stalls would cause production line stoppages. However, limitations remain in high-motion scenarios, as seen in Figure 8, where PSNR drops to 29.8 dB for extreme motion (optical flow $>$ 25 pixels/frame), 1.3 dB below low-motion sequences. This occurs because channel pruning removes high-frequency motion features faster than the PPO controller can compensate, suggesting a direction for future work in motion-aware frame adaptation. All improvements are statistically significant ( $p < 0.001$ , paired t-tests), with 95% confidence intervals spanning less than 3% of mean values (e.g., PSNR: $32.5 \pm 0.2$ dB), confirming robustness across diverse IoT conditions – unlike Mobile-HEVC, whose PSNR confidence interval widens to $\pm$ 0.5 dB under low-light conditions, indicating instability in challenging environments. This experimental validation conclusively demonstrates that our co-optimization framework resolves the core limitations identified in Section 2, achieving real-time processing under sub-2 W constraints while dynamically adapting to network dynamics – a combination no prior work has successfully realized.

4 Discussion

The findings of this study reveal that edge IoT video performance is determined not by isolated compression or transmission efficiency, but by the synergistic interaction between node-side encoding and edge-side bitrate allocation. Traditional decoupled approaches create inherent tension where encoder efficiency gains are negated by suboptimal bitrate decisions during congestion. The proposed co-optimization framework resolves this by unifying distortion, energy, and latency objectives through alternating optimization, establishing a dynamic equilibrium where the encoder proactively adjusts spatial detail based on predicted channel conditions while the controller modulates bitrate to preserve temporal coherence. This transforms bitrate from a constraint into an adaptive control variable, explaining the 2.1 dB quality improvement during high-motion sequences. Crucially, this coordination occurs entirely at the edge in under 100 ms, meeting industrial IoT latency requirements.

While our physical testbed included 15 Raspberry Pi 4 nodes, we investigated scaling behavior through a combination of simulation and emulation. We synthetically expanded the federated training process to 200 nodes by replaying logged gradient statistics and wireless channel traces. This analysis revealed only an 8% increase in aggregation latency at the edge server (Jetson AGX) due to higher gradient communication volume, while per-node energy overhead rose by less than 4 mJ/frame. The PPO controller’s policy convergence remained stable, with PSNR drop of $<$ 0.2 dB compared to the 15-node case.

The primary bottleneck observed was edge-side memory usage during federated updates; beyond 300 nodes, the Jetson AGX required more than 28 GB RAM for gradient buffers. This can be mitigated through gradient compression or hierarchical aggregation (multi-gateway federation). Future work will extend physical deployments to $>$ 100 nodes and explore hierarchical FL designs to maintain sub-100 ms control latency at city-scale deployments.

For 6G-enabled massive IoT deployments, where video will constitute over 80% of edge traffic, this framework addresses three critical pain points, i.e., the 38.5% energy reduction extends battery life by 2.6 $\times$ (72 vs. 28 hours), the 42.7% latency improvement enables previously infeasible applications like AR-assisted maintenance, and the federated training approach satisfies GDPR privacy requirements through gradient masking. These gains are particularly significant given Forch et al.’s [44] findings that latency reductions below 100 ms yield disproportionate improvements in user experience compared to equivalent PSNR gains.

Additionally, two limitations warrant acknowledgment. Modest quality decline in extreme motion scenarios (mitigated by maintaining operational continuity below critical stall thresholds) and moderate edge server resource requirements for the RL component (addressable through model distillation). Despite these, this framework represents a paradigm shift from viewing edge IoT video as a transmission problem to treating it as a unified resource orchestration challenge.

5 Conclusion

This work resolves the fundamental limitation in edge IoT video systems, the decoupling of compression and transmission optimization, through a novel machine learning-driven co-optimization framework. By synergistically integrating a depthwise separable CNN encoder ( $\leq$ 500 KB, 0.8 GFLOPs/frame) for resource-constrained IoT nodes and a PPO-based adaptive bitrate controller at the edge, the framework dynamically balances spatial detail preservation with channel-aware transmission. The joint optimization mechanism, formulated as a weighted minimization of latency, energy, and distortion ( $α + β + γ = 1$ ), operates via federated alternating training to preserve privacy while achieving real-time coordination. Rigorous validation across UAV-VisDrone and Edge-Cam datasets on Raspberry Pi 4/Jetson AGX testbeds demonstrates 42.7% lower encoding latency (82 ms vs. 143 ms for DISCOVER), 3.2 dB higher PSNR (32.5 dB at 1.2 Mbps), and 38.5% reduced energy consumption (68 mJ/frame) compared to five state-of-the-art baselines (DISCOVER, ML-DVC, Mobile-HEVC, DASH-RL, Fed-DVC), all while maintaining sub-100 ms processing and reducing video stall rates to 6.7% during network congestion. These results establish the first practical solution for end-to-end resource-aware video delivery in 6G-enabled massive IoT deployments, directly addressing the “compression–transmission gap” that undermined prior distributed coding approaches.

Future work will extend this framework to multi-modal sensor fusion (video $+$ thermal/accelerometer data) for industrial anomaly detection, explore transformer-based lightweight models for 4K video streams, and develop standardized APIs for 3GPP 5G-Advanced integration. Collaborative efforts with industrial partners will validate scalability in city-scale smart infrastructure deployments beyond the current 100-node testbed.

Funding

This work was supported by Science and Technology Project of State Grid Shanxi Electric Power Company Yuncheng Power Supply Company: Research on Lightweight Video Technology Based on Intelligent Image Optimization Algorithms. (Project No.: 5205M0240004).

References

[1] Arslan Shafique, Abid Mehmood, Moatsum Alawida, Abdul Nasir Khan, Enhancing privacy in data transmission between IoT devices: A robust encryption and embedding framework for secure and meaningful image communication, Journal of Information Security and Applications, Volume 93, 2025, 104112. https://doi.org/10.1016/j.jisa.2025.104112.

[2] Mehdi Hosseinzadeh, Jawad Tanveer, Saqib Ali, Marcia L. Baptista, Farhad Soleimanian Gharehchopogh, Shakia Rajabi, Thantrira Porntaveetus, Sang-Woong Lee, An energy-focused model for batteryless IoT: Vortex wireless power transfer and fog computing in 6 G networks, Internet of Things, Volume 32, 2025, 101657. https://doi.org/10.1016/j.iot.2025.101657.

[3] Minallah, Nasru, Gul, Saman and Bokhari, M.M.. (2015). Performance Analysis of H.265/HEVC (High-Efficiency Video Coding) with Reference to Other Codecs. 216–221. 10.1109/FIT.2015.46.

[4] Lonkar, Mr and Barwat, Mr. (2019). Performance Parameters of Hevc Video Codec. International Journal of Engineering and Advanced Technology. 8. 365–369. 10.35940/ijeat.E7711.088619.

[5] Huangyuan, Qingxiong, Song, Li, Luo, Zhengyi, Wang, Xiangwen and Zhao, Yanan. (2015). Performance evaluation of H.265/MPEG-HEVC encoders for 4K video sequences. 2014 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2014. 10.1109/APSIPA.2014.7041782.

[6] Sang-Uk Park, Young-Yoon Lee, Chang-Su Kim, Sang-Uk Lee, CDV-DVC: Transform-domain distributed video coding with multiple channel division, Journal of Visual Communication and Image Representation, Volume 24, Issue 5, 2013, Pages 534–543. https://doi.org/10.1016/j.jvcir.2013.03.016.

[7] Hu, Chunyun, Zhao, Yafan, Yu, Long, Jiang, Yang and Xiong, Yunhui. (2020). A simple encoder scheme for distributed residual video coding. Multimedia Tools and Applications. 79. 10.1007/s11042-020-08811-y.

[8] Khursheed, Shahzad, Badruddin, Nasreen, Jeoti, Varun, Vukobratovic, Dejan and Hashmani, Manzoor. (2023). Low Computational Coding-Efficient Distributed Video Coding: Adding a Decision Mode to Limit Channel Coding Load. Entropy. 25. 241. 10.3390/e25020241.

[9] Wu, W., Fouzi, H., Benamar, B. et al. Deep learning-based stacked models for cyber-attack detection in industrial internet of things. Neural Comput & Applic (2025). https://doi.org/10.1007/s00521-025-11418-9.

[10] Younas, T., Xing, X., Shen, J. et al. 3D massive MIMO with massive connectivity for internet of things devices. Wireless Netw (2023). https://doi.org/10.1007/s11276-023-03477-4.

[11] Liu, Tong and Qin, Feng. (2025). Study on Industrial Wastewater Pollution Monitoring Technology Based on NB-IoT Wireless Communication Technology. International Journal of Grid and High Performance Computing. 17. 1–20. 10.4018/IJGHPC.374211.

[12] Jiaqin Wang, Kai Liu, Hantao Li, LSTM-based graph attention network for vehicle trajectory prediction, Computer Networks, Volume 248, 2024, 110477. https://doi.org/10.1016/j.comnet.2024.110477.

[13] Ang Ji, Zhuo Liu, Lingyun Su, Zhe Dai, A hybrid framework for spatio-temporal traffic flow prediction with multi-scale feature extraction, Information Sciences, Volume 716, 2025, 122259. https://doi.org/10.1016/j.ins.2025.122259.

[14] Xu, T., Zhao, W. and Duan, Z. BDFGNet: A Lightweight Salient Object Detection Network Based on Background Denoising and Feature Generation. Arab J Sci Eng 49, 4365–4381 (2024). https://doi.org/10.1007/s13369-023-08484-3.

[15] Yang, Zi, Choudhary, Samridhi, Kunzmann, Siegfried and Zhang, Zheng. (2023). Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding. 10.48550/arXiv.2306.01076.

[16] Bendelhoum, Mohammed, Ridha Ilyas, Bendjillali, Miloud, Kamline and Tadjeddine, Abderrazak. (2025). Enhancing facial expression recognition using coordinate attention mechanism and MobileNetV3. Multimedia Tools and Applications. 1–34. 10.1007/s11042-025-21059-8.

[17] Li, Haibo, Cheng, Yong, Zhang, Qian and Chen, Lingkun. (2025). DSS-MobileNetV3: An Efficient Dynamic-State-Space- Enhanced Network for Concrete Crack Segmentation. Buildings. 15. 1905. 10.3390/buildings15111905.

[18] Ismail, A.A., Khalifa, N.E. & El-Khoribi, R.A. A survey on resource scheduling approaches in multi-access edge computing environment: a deep reinforcement learning study. Cluster Comput 28, 184 (2025). https://doi.org/10.1007/s10586-024-04893-7.

[19] Ahammad, I. Fog Computing Complete Review: Concepts, Trends, Architectures, Technologies, Simulators, Security Issues, Applications, and Open Research Fields. SN COMPUT. SCI. 4, 765 (2023). https://doi.org/10.1007/s42979-023-02235-9.

[20] Xu, Z., Jiang, L. Federated learning-based fault location and identification in hybrid AC/DC distribution systems considering bidirectional power flow. J. Eng. Appl. Sci. 72, 133 (2025). https://doi.org/10.1186/s44147-025-00694-w.

[21] Rathee, A., Dalal, S. A systematic literature review of machine learning-based resource allocation techniques in cloud computing. Computing 107, 179 (2025). https://doi.org/10.1007/s00607-025-01526-8.

[22] Liu, J., Zhang, B., Cao, X. (2025). ROI-Aware Dynamic Network Quantization for Neural Video Compression. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15305. Springer, Cham. https://doi.org/10.1007/978-3-031-78169-8\_22.

[23] Li Xu, Chang Wu, Linyi Huang, Shengyu Wei, Gang He, Yunsong Li, Xinquan Lai, Towards coding for VoD application: An enhanced video compression system with a content-fitted recursive restoration network, Digital Signal Processing, Volume 122, 2022, 103368. https://doi.org/10.1016/j.dsp.2021.103368.

[24] Tamal Pal, Sipra Das Bit, Low overhead spatiotemporal video compression over smartphone based Delay Tolerant Network, Journal of Visual Communication and Image Representation, Volume 70, 2020, 102813. https://doi.org/10.1016/j.jvcir.2020.102813.

[25] Hayashi, Masahito & Wang, Kun. (2022). Dense Coding with Locality Restriction on Decoders: Quantum Encoders versus Superquantum Encoders. PRX Quantum. 3. 10.1103/PRXQuantum.3.030346.

[26] Fu, Zihao, Lam, Wai, Yu, Qian, So, Anthony, Hu, Shengding, Liu, Zhiyuan and Collier, Nigel. (2023). Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder. 10.48550/arXiv.2304.04052.

[27] Eshwarappa, N.M., Baghban, H., Hsu, CH. et al. Communication-efficient and privacy-preserving federated learning for medical image classification in multi-institutional edge computing. J Cloud Comp 14, 44 (2025). https://doi.org/10.1186/s13677-025-00734-z.

[28] Yuvarani, R., Mahaveerakannan, R., Thanarajan, T. et al. Energy-aware cluster head optimization and secure blockchain integration for heterogeneous 6G-enabled IoMT networks. Sci Rep 15, 30009 (2025). https://doi.org/10.1038/s41598-025-15462-2.

[29] Deligiannis, Nikos, Verbist, Frederik, Slowack, Jurgen, Van de Walle, Rik, Schelkens, Peter, Munteanu, Adrian. (2014). Progressively Refined Wyner-Ziv Video Coding for Visual Sensors. Acm Transactions on Sensor Networks. 10. 10.1145/2530279.

[30] Krishnaraj, N., Bellam, K., Sivakumar, B., Daniel, A. (2022). The Future of Cloud Computing: Blockchain-Based Decentralized Cloud/Fog Solutions – Challenges, Opportunities, and Standards. In: Baalamurugan, K., Kumar, S.R., Kumar, A., Kumar, V., Padmanaban, S. (eds) Blockchain Security in Cloud Computing. EAI/Springer Innovations in Communication and Computing. Springer, Cham. https://doi.org/10.1007/978-3-030-70501-5\_10.

[31] Liu, Peng & Cao, Xiaofan & Jia, Yujiao. (2024). Performance evaluation and analysis of scalable Raspberry Pi 4 Model B clusters. 10.21203/rs.3.rs-4460804/v1.

[32] Cheng, Z., Liu, J., Zhang, J. (2022). An Improved Mobilenetv2 for Rust Detection of Angle Steel Tower Bolts Based on Small Sample Transfer Learning. In: Huang, DS., Jo, KH., Jing, J., Premaratne, P., Bevilacqua, V., Hussain, A. (eds) Intelligent Computing Methodologies. ICIC 2022. Lecture Notes in Computer Science, vol. 13395. Springer, Cham. https://doi.org/10.1007/978-3-031-13832-4\_13.

[33] Yasser, I., Abd El-Khalek, A.A., Twakol, A., Abo-Elsoud, ME., Salama, A.A., Khalifa, F. (2022). A Hybrid Automated Intelligent COVID-19 Classification System Based on Neutrosophic Logic and Machine Learning Techniques Using Chest X-Ray Images. In: Hassanien, AE., Elghamrawy, S.M., Zelinka, I. (eds) Advances in Data Science and Intelligent Data Communication Technologies for COVID-19. Studies in Systems, Decision and Control, vol. 378. Springer, Cham. https://doi.org/10.1007/978-3-030-77302-1\_7.

[34] Shi, Guo-Xing, Wang, Yi-Na, Yang, Zhen-Fa, Guo, Ying-Qing and Zhang, Zhi-Wei. (2024). Wildfire Identification Based on an Improved MobileNetV3-Small Model. Forests. 15. 1975. 10.3390/f15111975.

[35] Ma, Zhichao, Luo, Yutong, Zhang, Zheyu, Sun, Aijia, Yang, Yinuo and Liu, Hao. (2025). Reinforcement Learning Approach for Highway Lane-Changing: PPO-Based Strategy Design. 10.20944/preprints202506.2087.v1.

[36] Li, H., Luo, K., Zeng, B. et al. GyroFlow+: Gyroscope-Guided Unsupervised Deep Homography and Optical Flow Learning. Int J Comput Vis 132, 2331–2349 (2024). https://doi.org/10.1007/s11263-023-01978-5.

[37] Ben Hazem, Zied, Saidi, Firas, Guler, Nivine and Altaif, Ali. (2025). Reinforcement learning-based intelligent trajectory tracking for a 5-DOF Mitsubishi robotic arm: comparative evaluation of DDPG, LC-DDPG, and TD3-ADX. International Journal of Intelligent Robotics and Applications. 1–21. 10.1007/s41315-025-00475-x.

[38] Martínez-Rach, Miguel, López Granado, Otoniel, Pi nol, Pablo, Malumbres, Manuel, Oliver, José and Calafate, Carlos. (2007). Quality assessment metrics vs. PSNR under packet loss scenarios in MANET wireless networks. 31–36. 10.1145/1290050.1290058.

[39] Imran, Noreen, Seet, Boon-Chong and Fong, Alvis. (2015). Distributed video coding for wireless video sensor networks: a review of the state-of-the-art architectures. SpringerPlus. 4. 513. 10.1186/s40064-015-1300-4.

[40] Mao, Zhou, Shuai, Tong, Liang, Shuo, Zhang, Liguo and Li, Sizhao. (2022). Research on Side-Channel Attack Method Based on LSTM. 10.1007/978-3-031-06767-9_52.

[41] Li, Haibo, Cheng, Yong, Zhang, Qian and Chen, Lingkun. (2025). DSS-MobileNetV3: An Efficient Dynamic-State-Space- Enhanced Network for Concrete Crack Segmentation. Buildings. 15. 1905. 10.3390/buildings15111905.

[42] Duan, Juzheng, Zhang, Min, Wang, Jing, Han, Sang, Chen, Xun and Yang, Xiaolong. (2020). VCC-DASH: A Video Content Complexity-Aware DASH Bitrate Adaptation Strategy. Electronics. 9. 230. 10.3390/electronics9020230.

[43] Li, Yongfei, Guo, Yuanbo, Fang, Chen, Wang, Yifeng, Chen, Qingli and Hu, Yongjin. (2025). A distributed bijection-backdoor-based adversarial examples defense method in federated learning. International Journal of Machine Learning and Cybernetics. 1–17. 10.1007/s13042-025-02750-6.

[44] Forch, Valentin, Franke, Thomas, Rauh, Nadine and Krems, Josef. (2017). Are 100 ms Fast Enough? Characterizing Latency Perception Thresholds in Mouse-Based Interaction. 45–56. 10.1007/978-3-319-58475-1_4.

Biographies

Wenyue Qu was born in Yuncheng, Shanxi Province, China in 1979. He received his master’s degree from Taiyuan University of Technology. He works at State Grid Yuncheng Power Supply Company. His main research focuses on electrical engineering, video surveillance, and information engineering technology.

Jinglong Wang was born in Yuncheng, Shanxi Province, China in 1988. He received an undergraduate degree from Shanxi University College of Engineering and a Master of Engineering degree from Wuhan University. He works at State Grid Yuncheng Power Supply Company. His main research focuses on distribution networks, computational intelligence, information security, video analytics and big data analysis.

Yiming Zhang was born in Yuncheng, Shanxi Province, China in 1995. He received his bachelor’s degree from Northeast Electric Power University. He works at State Grid Yuncheng Power Supply Company and his responsibilities include operation and maintenance of power communication network equipment, management of video conferencing systems and administration of emergency communication equipment.

Xinyan Pei was born in Yuncheng, Shanxi Province, China in 1994. He received his bachelor’s degree from Taiyuan University of Technology. He works at State Grid Yuncheng Power Supply Company. His main research focuses on power communication networks, video surveillance systems, and data communication networks.

Zhuang Liang was born in Yuncheng, Shanxi Province, China in 1994. He received his master’s degree from Shenyang University of Technology. He works at State Grid Yuncheng Power Supply Company. His main research focuses on power communication networks, telecom power systems, and data communication networks.