Recently, the research work titled “OVLR: Efficient, Scalable, and Robust Training via Output-Level Variance-Reduced Likelihood Ratio”, completed by Nanjing University in collaboration with the Peking University Institute of Advanced Information Technology (PKU-IIAI), Hangzhou IndEngine Intelligence, Xiangjiang Laboratory, and other institutions, has been officially accepted to ICML 2026, a top-tier international conference on machine learning.

ICML is one of the premier international academic conferences in the fields of machine learning and artificial intelligence and is recognized as a Category A international conference by the China Computer Federation (CCF). ICML 2026 will be held from July 6 to 11, 2026, at the COEX Convention & Exhibition Center in Seoul, South Korea.
This work proposes a novel gradient estimation framework named OVLR (Output-level Likelihood Ratio). By adding random perturbations to the model's low-dimensional output space and leveraging a single deterministic forward pass along with one vector-Jacobian product (VJP), OVLR obtains unbiased gradient estimates. This method not only trains standard differentiable tasks with efficiency comparable to backpropagation (BP) but also achieves, for the first time, direct and scalable optimization of objectives such as 0-1 loss and truncated loss.

Core Technology: Output-Layer Perturbation + Variance Reduction
Traditional likelihood ratio (LR) or evolution strategy (ES) methods add noise in the high-dimensional parameter space or hidden layers, causing computational and memory overhead that scales linearly with model size, making them difficult to apply to modern deep networks. OVLR shifts perturbations from the parameter space to the output layer. The model requires only one forward computation to reuse output features. Combined with output-level resampling and antithetic sampling, the gradient variance is almost decoupled from the noise scale, achieving near-constant stability. This feature makes OVLR highly robust to hyperparameters (noise scale σ, number of repetitions n). Experiments show stable performance for σ ∈ [0.1, 5.0] and n ≥ 200.
Theoretical Guarantees and Open-Source Release
The OVLR estimator is unbiased with bounded variance, achieving a convergence rate of O(1/K) under standard smoothness assumptions. More importantly, OVLR requires no modification to the model architecture and can be directly applied to any architecture, including ResNet, ViT, and Mamba. It is open-sourced based on PyTorch, facilitating easy integration by the community.
Comprehensive Experiments with Significant Advantages
Robust Classification: On CIFAR-10, directly optimizing the 0-1 loss achieves 73.0% accuracy under 60% label noise, significantly outperforming the cross-entropy baseline of 66.3%.
Robust Regression: In sine fitting with 20% outliers, OVLR recovers the true signal (MSE = 0.00032), while BP fails completely due to flat regions of the truncated loss (MSE = 4.09).
Efficiency Comparison: Compared to traditional parameter-space LR methods, OVLR achieves a 14× training speedup and 73× memory saving on ResNet-18, with efficiency nearly identical to BP.
Black-Box Optimization: On the IOH benchmark, OVLR achieves a success rate of 86.7% (on a discrete bin-packing problem), far exceeding CEM (33.3%).
Generative Models: On GANs trained on MNIST, OVLR achieves an FID of 40.27, outperforming BP (53.33); on VAEs, OVLR performs on par with BP.
Robotic Manipulation: On the Aloha dual-arm manipulation task, policies trained with OVLR achieve success rates comparable to the BP baseline, laying a foundation for future policy learning in non-differentiable physical simulators.
Cross-Institution Collaborative Innovation
This work is the result of close collaboration among Nanjing University, the Peking University Institute of Advanced Information Technology (PKU-IIAI), Hangzhou IndEngine Intelligence, Xiangjiang Laboratory, and other institutions. It continues the systematic exploration of LR gradient estimation by Professor Yijie Peng's team since 2022 (IJOC 2022, ICLR 2024/2025/2026), marking the transition of output-level likelihood ratio methods from theory to practice. It provides a unified and scalable new paradigm for training gradient-unaware objectives, robust learning, and black-box optimization.

