Note
You are reading the documentation for MMEditing 0.x, which will be deprecated by the end of 2022. We recommend you upgrade to MMEditing 1.0 to enjoy the fruitful new features and better performance brought by OpenMMLab 2.0. Check out the changelog, code, and documentation of MMEditing 1.0 for more details.
Super-Resolution Models¶
BasicVSR (CVPR’2021)¶
Abstract¶
Video super-resolution (VSR) approaches tend to have more components than the image counterparts as they need to exploit the additional temporal dimension. Complex designs are not uncommon. In this study, we wish to untangle the knots and reconsider some most essential components for VSR guided by four basic functionalities, i.e., Propagation, Alignment, Aggregation, and Upsampling. By reusing some existing components added with minimal redesigns, we show a succinct pipeline, BasicVSR, that achieves appealing improvements in terms of speed and restoration quality in comparison to many state-of-the-art algorithms. We conduct systematic analysis to explain how such gain can be obtained and discuss the pitfalls. We further show the extensibility of BasicVSR by presenting an information-refill mechanism and a coupled propagation scheme to facilitate information aggregation. The BasicVSR and its extension, IconVSR, can serve as strong baselines for future VSR approaches.

Results and models¶
Evaluated on RGB channels for REDS4 and Y channel for others. The metrics are PSNR / SSIM.
The pretrained weights of SPyNet can be found here.
Method | REDS4 (BIx4) PSNR/SSIM (RGB) | Vimeo-90K-T (BIx4) PSNR/SSIM (Y) | Vid4 (BIx4) PSNR/SSIM (Y) | UDM10 (BDx4) PSNR/SSIM (Y) | Vimeo-90K-T (BDx4) PSNR/SSIM (Y) | Vid4 (BDx4) PSNR/SSIM (Y) | Download |
---|---|---|---|---|---|---|---|
basicvsr_reds4 | 31.4170/0.8909 | 36.2848/0.9395 | 27.2694/0.8318 | 33.4478/0.9306 | 34.4700/0.9286 | 24.4541/0.7455 | model | log |
basicvsr_vimeo90k_bi | 30.3128/0.8660 | 37.2026/0.9451 | 27.2755/0.8248 | 34.5554/0.9434 | 34.8097/0.9316 | 25.0517/0.7636 | model | log |
basicvsr_vimeo90k_bd | 29.0376/0.8481 | 34.6427/0.9335 | 26.2708/0.8022 | 39.9953/0.9695 | 37.5501/0.9499 | 27.9791/0.8556 | model | log |
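For reference, the Y-channel evaluation mentioned above converts RGB outputs to the luma (Y) channel of YCbCr before computing PSNR/SSIM. Below is a minimal, illustrative sketch of the BT.601 conversion commonly used for this protocol; it is not the exact MMEditing metric code, and the helper name `rgb2y` is ours.

```python
import numpy as np

def rgb2y(img):
    """Convert an RGB image in [0, 255] with shape (H, W, 3) to the BT.601 Y channel.

    Illustrative sketch of the usual Y-channel evaluation protocol; the actual
    MMEditing metric code may differ in rounding and data types.
    """
    img = img.astype(np.float64) / 255.0
    # ITU-R BT.601 "studio-swing" luma used in common SR evaluation scripts.
    y = 65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2] + 16.0
    return y  # values roughly in [16, 235]
```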
Citation¶
@InProceedings{chan2021basicvsr,
author = {Chan, Kelvin CK and Wang, Xintao and Yu, Ke and Dong, Chao and Loy, Chen Change},
title = {BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond},
booktitle = {Proceedings of the IEEE conference on computer vision and pattern recognition},
year = {2021}
}
BasicVSR++ (CVPR’2022)¶
Abstract¶
A recurrent structure is a popular framework choice for the task of video super-resolution. The state-of-the-art method BasicVSR adopts bidirectional propagation with feature alignment to effectively exploit information from the entire input video. In this study, we redesign BasicVSR by proposing second-order grid propagation and flow-guided deformable alignment. We show that by empowering the recurrent framework with the enhanced propagation and alignment, one can exploit spatiotemporal information across misaligned video frames more effectively. The new components lead to an improved performance under a similar computational constraint. In particular, our model BasicVSR++ surpasses BasicVSR by 0.82 dB in PSNR with similar number of parameters. In addition to video super-resolution, BasicVSR++ generalizes well to other video restoration tasks such as compressed video enhancement. In NTIRE 2021, BasicVSR++ obtains three champions and one runner-up in the Video Super-Resolution and Compressed Video Enhancement Challenges. Codes and models will be released to MMEditing.

Results and models¶
The pretrained weights of SPyNet can be found here.
Method | REDS4 (BIx4) PSNR/SSIM (RGB) | Vimeo-90K-T (BIx4) PSNR/SSIM (Y) | Vid4 (BIx4) PSNR/SSIM (Y) | UDM10 (BDx4) PSNR/SSIM (Y) | Vimeo-90K-T (BDx4) PSNR/SSIM (Y) | Vid4 (BDx4) PSNR/SSIM (Y) | Download |
---|---|---|---|---|---|---|---|
basicvsr_plusplus_c64n7_8x1_600k_reds4 | 32.3855/0.9069 | 36.4445/0.9411 | 27.7674/0.8444 | 34.6868/0.9417 | 34.0372/0.9244 | 24.6209/0.7540 | model | log |
basicvsr_plusplus_c64n7_4x2_300k_vimeo90k_bi | 31.0126/0.8804 | 37.7864/0.9500 | 27.7882/0.8401 | 33.1211/0.9270 | 33.8972/0.9195 | 23.6086/0.7033 | model | log |
basicvsr_plusplus_c64n7_4x2_300k_vimeo90k_bd | 29.2041/0.8528 | 34.7248/0.9351 | 26.4377/0.8074 | 40.7216/0.9722 | 38.2054/0.9550 | 29.0400/0.8753 | model | log |
NTIRE 2021 checkpoints
Note that the following models are fine-tuned from smaller models. The training schemes of these models will be released when MMEditing reaches 5k stars. We provide the pre-trained models here.
NTIRE 2021 Video Super-Resolution
NTIRE 2021 Quality Enhancement of Compressed Video - Track 1
NTIRE 2021 Quality Enhancement of Compressed Video - Track 2
NTIRE 2021 Quality Enhancement of Compressed Video - Track 3
Citation¶
@InProceedings{chan2022basicvsrplusplus,
author = {Chan, Kelvin C.K. and Zhou, Shangchen and Xu, Xiangyu and Loy, Chen Change},
title = {BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment},
booktitle = {Proceedings of the IEEE conference on computer vision and pattern recognition},
year = {2022}
}
DIC (CVPR’2020)¶
Abstract¶
Recent works based on deep learning and facial priors have succeeded in super-resolving severely degraded facial images. However, the prior knowledge is not fully exploited in existing methods, since facial priors such as landmark and component maps are always estimated by low-resolution or coarsely super-resolved images, which may be inaccurate and thus affect the recovery performance. In this paper, we propose a deep face super-resolution (FSR) method with iterative collaboration between two recurrent networks which focus on facial image recovery and landmark estimation respectively. In each recurrent step, the recovery branch utilizes the prior knowledge of landmarks to yield higher-quality images which facilitate more accurate landmark estimation in turn. Therefore, the iterative information interaction between two processes boosts the performance of each other progressively. Moreover, a new attentive fusion module is designed to strengthen the guidance of landmark maps, where facial components are generated individually and aggregated attentively for better restoration. Quantitative and qualitative experimental results show the proposed method significantly outperforms state-of-the-art FSR methods in recovering high-quality face images.

Results and models¶
Evaluated on RGB channels; `scale` pixels in each border are cropped before evaluation. The metrics are PSNR / SSIM.
In the log data of dic_gan_x8c48b6_g4_150k_CelebAHQ, DICGAN is verified on the first 9 pictures of the CelebA-HQ test set, so the PSNR/SSIM values shown in the following table differ from those in the log data.
Method | scale | CelebA-HQ | Download |
---|---|---|---|
dic_x8c48b6_g4_150k_CelebAHQ | x8 | 25.2319 / 0.7422 | model | log |
dic_gan_x8c48b6_g4_150k_CelebAHQ | x8 | 23.6241 / 0.6721 | model | log |
Citation¶
@inproceedings{ma2020deep,
title={Deep face super-resolution with iterative collaboration between attentive recovery and landmark estimation},
author={Ma, Cheng and Jiang, Zhenyu and Rao, Yongming and Lu, Jiwen and Zhou, Jie},
booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
pages={5569--5578},
year={2020}
}
EDSR (CVPR’2017)¶
Abstract¶
Recent research on super-resolution has progressed with the development of deep convolutional neural networks (DCNN). In particular, residual learning techniques exhibit improved performance. In this paper, we develop an enhanced deep super-resolution network (EDSR) with performance exceeding those of current state-of-the-art SR methods. The significant performance improvement of our model is due to optimization by removing unnecessary modules in conventional residual networks. The performance is further improved by expanding the model size while we stabilize the training procedure. We also propose a new multi-scale deep super-resolution system (MDSR) and training method, which can reconstruct high-resolution images of different upscaling factors in a single model. The proposed methods show superior performance over the state-of-the-art methods on benchmark datasets and prove its excellence by winning the NTIRE2017 Super-Resolution Challenge.
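The "removing unnecessary modules" mentioned above refers to dropping batch normalization from the residual blocks. The sketch below shows the general shape of such a BN-free residual block with residual scaling; it is a simplified illustration, not the MMEditing EDSR implementation, and the class name `ResBlockNoBN` and sizes are ours.

```python
import torch
import torch.nn as nn

class ResBlockNoBN(nn.Module):
    """Residual block without batch normalization (EDSR-style), simplified."""

    def __init__(self, channels=64, res_scale=1.0):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.res_scale = res_scale  # residual scaling helps stabilize very wide models

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return x + out * self.res_scale

# Example: a 64-channel feature map passes through with its shape unchanged.
feat = torch.randn(1, 64, 48, 48)
print(ResBlockNoBN()(feat).shape)  # torch.Size([1, 64, 48, 48])
```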

Results and models¶
Evaluated on RGB channels; `scale` pixels in each border are cropped before evaluation. The metrics are PSNR / SSIM.
Method | Set5 | Set14 | DIV2K | Download |
---|---|---|---|---|
edsr_x2c64b16_1x16_300k_div2k | 35.7592 / 0.9372 | 31.4290 / 0.8874 | 34.5896 / 0.9352 | model | log |
edsr_x3c64b16_1x16_300k_div2k | 32.3301 / 0.8912 | 28.4125 / 0.8022 | 30.9154 / 0.8711 | model | log |
edsr_x4c64b16_1x16_300k_div2k | 30.2223 / 0.8500 | 26.7870 / 0.7366 | 28.9675 / 0.8172 | model | log |
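As a reference for the protocol above (RGB evaluation with `scale` pixels cropped from each border), the snippet below sketches a PSNR computation under that convention; it is illustrative only and may differ in detail from the metric code MMEditing actually uses.

```python
import numpy as np

def psnr_cropped(sr, gt, scale):
    """PSNR on uint8 RGB arrays (H, W, 3) after cropping `scale` border pixels.

    Illustrative sketch of the border-cropping evaluation convention above.
    """
    sr = sr[scale:-scale, scale:-scale].astype(np.float64)
    gt = gt[scale:-scale, scale:-scale].astype(np.float64)
    mse = np.mean((sr - gt) ** 2)
    if mse == 0:
        return float("inf")
    return 20.0 * np.log10(255.0 / np.sqrt(mse))
```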
Citation¶
@inproceedings{lim2017enhanced,
title={Enhanced deep residual networks for single image super-resolution},
author={Lim, Bee and Son, Sanghyun and Kim, Heewon and Nah, Seungjun and Mu Lee, Kyoung},
booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition workshops},
pages={136--144},
year={2017}
}
EDVR (CVPRW’2019)¶
Abstract¶
Video restoration tasks, including super-resolution, deblurring, etc, are drawing increasing attention in the computer vision community. A challenging benchmark named REDS is released in the NTIRE19 Challenge. This new benchmark challenges existing methods from two aspects: (1) how to align multiple frames given large motions, and (2) how to effectively fuse different frames with diverse motion and blur. In this work, we propose a novel Video Restoration framework with Enhanced Deformable networks, termed EDVR, to address these challenges. First, to handle large motions, we devise a Pyramid, Cascading and Deformable (PCD) alignment module, in which frame alignment is done at the feature level using deformable convolutions in a coarse-to-fine manner. Second, we propose a Temporal and Spatial Attention (TSA) fusion module, in which attention is applied both temporally and spatially, so as to emphasize important features for subsequent restoration. Thanks to these modules, our EDVR wins the champions and outperforms the second place by a large margin in all four tracks in the NTIRE19 video restoration and enhancement challenges. EDVR also demonstrates superior performance to state-of-the-art published methods on video super-resolution and deblurring.

Results and models¶
Evaluated on RGB channels. The metrics are PSNR / SSIM.
Method | REDS4 | Download |
---|---|---|
edvrm_wotsa_x4_8x4_600k_reds | 30.3430 / 0.8664 | model | log |
edvrm_x4_8x4_600k_reds | 30.4194 / 0.8684 | model | log |
edvrl_wotsa_c128b40_8x8_lr2e-4_600k_reds4 | 31.0010 / 0.8784 | model | log |
edvrl_c128b40_8x8_lr2e-4_600k_reds4 | 31.0467 / 0.8793 | model | log |
Citation¶
@InProceedings{wang2019edvr,
author = {Wang, Xintao and Chan, Kelvin C.K. and Yu, Ke and Dong, Chao and Loy, Chen Change},
title = {EDVR: Video restoration with enhanced deformable convolutional networks},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
month = {June},
year = {2019},
}
ESRGAN (ECCVW’2018)¶
Abstract¶
The Super-Resolution Generative Adversarial Network (SRGAN) is a seminal work that is capable of generating realistic textures during single image super-resolution. However, the hallucinated details are often accompanied with unpleasant artifacts. To further enhance the visual quality, we thoroughly study three key components of SRGAN - network architecture, adversarial loss and perceptual loss, and improve each of them to derive an Enhanced SRGAN (ESRGAN). In particular, we introduce the Residual-in-Residual Dense Block (RRDB) without batch normalization as the basic network building unit. Moreover, we borrow the idea from relativistic GAN to let the discriminator predict relative realness instead of the absolute value. Finally, we improve the perceptual loss by using the features before activation, which could provide stronger supervision for brightness consistency and texture recovery. Benefiting from these improvements, the proposed ESRGAN achieves consistently better visual quality with more realistic and natural textures than SRGAN and won the first place in the PIRM2018-SR Challenge.
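The "relative realness" idea mentioned above (borrowed from the relativistic GAN) can be written compactly: the discriminator rates real images relative to the average score of generated ones, and vice versa. Below is a hedged sketch of that discriminator loss; the function name and reduction details are ours and may differ from the MMEditing loss modules.

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(real_logits, fake_logits):
    """Relativistic average discriminator loss (RaGAN-style), simplified.

    real_logits / fake_logits: raw discriminator outputs C(x) for real and
    generated batches. The discriminator is pushed to rate real images as more
    realistic than the *average* fake image, and fake images as less realistic
    than the *average* real image.
    """
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits - fake_logits.mean(), torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits - real_logits.mean(), torch.zeros_like(fake_logits))
    return (loss_real + loss_fake) / 2
```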

Results and models¶
Evaluated on RGB channels; `scale` pixels in each border are cropped before evaluation. The metrics are PSNR / SSIM.
Method | Set5 | Set14 | DIV2K | Download |
---|---|---|---|---|
esrgan_psnr_x4c64b23g32_1x16_1000k_div2k | 30.6428 / 0.8559 | 27.0543 / 0.7447 | 29.3354 / 0.8263 | model | log |
esrgan_x4c64b23g32_1x16_400k_div2k | 28.2700 / 0.7778 | 24.6328 / 0.6491 | 26.6531 / 0.7340 | model | log |
Citation¶
@inproceedings{wang2018esrgan,
title={Esrgan: Enhanced super-resolution generative adversarial networks},
author={Wang, Xintao and Yu, Ke and Wu, Shixiang and Gu, Jinjin and Liu, Yihao and Dong, Chao and Qiao, Yu and Change Loy, Chen},
booktitle={Proceedings of the European Conference on Computer Vision Workshops (ECCVW)},
pages={0--0},
year={2018}
}
GLEAN (CVPR’2021)¶
Abstract¶
We show that pre-trained Generative Adversarial Networks (GANs), e.g., StyleGAN, can be used as a latent bank to improve the restoration quality of large-factor image super-resolution (SR). While most existing SR approaches attempt to generate realistic textures through learning with adversarial loss, our method, Generative LatEnt bANk (GLEAN), goes beyond existing practices by directly leveraging rich and diverse priors encapsulated in a pre-trained GAN. But unlike prevalent GAN inversion methods that require expensive image-specific optimization at runtime, our approach only needs a single forward pass to generate the upscaled image. GLEAN can be easily incorporated in a simple encoder-bank-decoder architecture with multi-resolution skip connections. Switching the bank allows the method to deal with images from diverse categories, e.g., cat, building, human face, and car. Images upscaled by GLEAN show clear improvements in terms of fidelity and texture faithfulness in comparison to existing methods.

Results and models¶
For the meta info used in training and test, please refer to here. The results are evaluated on RGB channels.
Method | PSNR | Download |
---|---|---|
glean_cat_8x | 23.98 | model | log |
glean_ffhq_16x | 26.91 | model | log |
glean_cat_16x | 20.88 | model | log |
glean_in128out1024_4x2_300k_ffhq_celebahq | 27.94 | model | log |
Citation¶
@InProceedings{chan2021glean,
author = {Chan, Kelvin CK and Wang, Xintao and Xu, Xiangyu and Gu, Jinwei and Loy, Chen Change},
title = {GLEAN: Generative Latent Bank for Large-Factor Image Super-Resolution},
booktitle = {Proceedings of the IEEE conference on computer vision and pattern recognition},
year = {2021}
}
IconVSR (CVPR’2021)¶
Abstract¶
Video super-resolution (VSR) approaches tend to have more components than the image counterparts as they need to exploit the additional temporal dimension. Complex designs are not uncommon. In this study, we wish to untangle the knots and reconsider some most essential components for VSR guided by four basic functionalities, i.e., Propagation, Alignment, Aggregation, and Upsampling. By reusing some existing components added with minimal redesigns, we show a succinct pipeline, BasicVSR, that achieves appealing improvements in terms of speed and restoration quality in comparison to many state-of-the-art algorithms. We conduct systematic analysis to explain how such gain can be obtained and discuss the pitfalls. We further show the extensibility of BasicVSR by presenting an information-refill mechanism and a coupled propagation scheme to facilitate information aggregation. The BasicVSR and its extension, IconVSR, can serve as strong baselines for future VSR approaches.

Results and models¶
Evaluated on RGB channels for REDS4 and Y channel for others. The metrics are PSNR / SSIM.
The pretrained weights of the IconVSR components can be found here: SPyNet, EDVR-M for REDS, and EDVR-M for Vimeo-90K.
Method | REDS4 (BIx4) PSNR/SSIM (RGB) | Vimeo-90K-T (BIx4) PSNR/SSIM (Y) | Vid4 (BIx4) PSNR/SSIM (Y) | UDM10 (BDx4) PSNR/SSIM (Y) | Vimeo-90K-T (BDx4) PSNR/SSIM (Y) | Vid4 (BDx4) PSNR/SSIM (Y) | Download |
---|---|---|---|---|---|---|---|
iconvsr_reds4 | 31.6926/0.8951 | 36.4983/0.9416 | 27.4809/0.8354 | 35.3377/0.9471 | 34.4299/0.9287 | 25.2110/0.7732 | model | log |
iconvsr_vimeo90k_bi | 30.3452/0.8659 | 37.3729/0.9467 | 27.4238/0.8297 | 34.2595/0.9398 | 34.5548/0.9295 | 24.6666/0.7491 | model | log |
iconvsr_vimeo90k_bd | 29.0150/0.8465 | 34.6780/0.9339 | 26.3109/0.8028 | 40.0640/0.9697 | 37.7573/0.9517 | 28.2464/0.8612 | model | log |
Citation¶
@InProceedings{chan2021basicvsr,
author = {Chan, Kelvin CK and Wang, Xintao and Yu, Ke and Dong, Chao and Loy, Chen Change},
title = {BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond},
booktitle = {Proceedings of the IEEE conference on computer vision and pattern recognition},
year = {2021}
}
LIIF (CVPR’2021)¶
Abstract¶
How to represent an image? While the visual world is presented in a continuous manner, machines store and see the images in a discrete way with 2D arrays of pixels. In this paper, we seek to learn a continuous representation for images. Inspired by the recent progress in 3D reconstruction with implicit neural representation, we propose Local Implicit Image Function (LIIF), which takes an image coordinate and the 2D deep features around the coordinate as inputs, predicts the RGB value at a given coordinate as an output. Since the coordinates are continuous, LIIF can be presented in arbitrary resolution. To generate the continuous representation for images, we train an encoder with LIIF representation via a self-supervised task with super-resolution. The learned continuous representation can be presented in arbitrary resolution even extrapolate to x30 higher resolution, where the training tasks are not provided. We further show that LIIF representation builds a bridge between discrete and continuous representation in 2D, it naturally supports the learning tasks with size-varied image ground-truths and significantly outperforms the method with resizing the ground-truths.
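The abstract above describes predicting an RGB value from an image coordinate together with the deep features around it. The sketch below shows the core of that idea in a much-reduced form (nearest latent feature plus a relative coordinate fed to an MLP); it omits LIIF's cell decoding, feature unfolding, and local ensemble, and all names and sizes are ours rather than the MMEditing implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_coord_map(h, w):
    """Pixel-center coordinates of an h x w grid in [-1, 1], shape (1, 2, h, w)."""
    ys = -1 + (2 * torch.arange(h).float() + 1) / h
    xs = -1 + (2 * torch.arange(w).float() + 1) / w
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([grid_x, grid_y], dim=0).unsqueeze(0)

class TinyImplicitDecoder(nn.Module):
    """Very reduced LIIF-style decoder: MLP(latent feature, relative coord) -> RGB."""

    def __init__(self, feat_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, feat, coords):
        # feat: (1, C, H, W) encoder features; coords: (1, N, 2) in [-1, 1] as (x, y).
        grid = coords.unsqueeze(1)                                # (1, 1, N, 2)
        z = F.grid_sample(feat, grid, mode="nearest", align_corners=False)
        z = z[:, :, 0].permute(0, 2, 1)                           # (1, N, C)
        centers_map = make_coord_map(*feat.shape[-2:]).to(feat)
        centers = F.grid_sample(centers_map, grid, mode="nearest",
                                align_corners=False)
        centers = centers[:, :, 0].permute(0, 2, 1)               # (1, N, 2)
        rel = coords - centers                                    # relative coordinate
        return self.mlp(torch.cat([z, rel], dim=-1))              # (1, N, 3) RGB

# Query a 2x-denser pixel grid from a toy 64-channel feature map.
feat = torch.randn(1, 64, 12, 12)
coords = make_coord_map(24, 24).flatten(2).permute(0, 2, 1)       # (1, 576, 2)
print(TinyImplicitDecoder()(feat, coords).shape)                  # torch.Size([1, 576, 3])
```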

Results and models¶
Method | scale | Set5 PSNR / SSIM | Set14 PSNR / SSIM | DIV2K PSNR / SSIM | Download |
---|---|---|---|---|---|
liif_edsr_norm_c64b16_g1_1000k_div2k | x2 | 35.7131 / 0.9366 | 31.5579 / 0.8889 | 34.6647 / 0.9355 | model | log |
△ | x3 | 32.3805 / 0.8915 | 28.4605 / 0.8039 | 30.9808 / 0.8724 | △ |
△ | x4 | 30.2748 / 0.8509 | 26.8415 / 0.7381 | 29.0245 / 0.8187 | △ |
△ | x6 | 27.1187 / 0.7774 | 24.7461 / 0.6444 | 26.7770 / 0.7425 | △ |
△ | x18 | 20.8516 / 0.5406 | 20.0096 / 0.4525 | 22.1987 / 0.5955 | △ |
△ | x30 | 18.8467 / 0.5010 | 18.1321 / 0.3963 | 20.5050 / 0.5577 | △ |
liif_rdn_norm_c64b16_g1_1000k_div2k | x2 | 35.7874 / 0.9366 | 31.6866 / 0.8896 | 34.7548 / 0.9356 | model | log |
△ | x3 | 32.4992 / 0.8923 | 28.4905 / 0.8037 | 31.0744 / 0.8731 | △ |
△ | x4 | 30.3835 / 0.8513 | 26.8734 / 0.7373 | 29.1101 / 0.8197 | △ |
△ | x6 | 27.1914 / 0.7751 | 24.7824 / 0.6434 | 26.8693 / 0.7437 | △ |
△ | x18 | 20.8913 / 0.5329 | 20.1077 / 0.4537 | 22.2972 / 0.5950 | △ |
△ | x30 | 18.9354 / 0.4864 | 18.1448 / 0.3942 | 20.5663 / 0.5560 | △ |
Note:
△ refers to ditto.
Evaluated on RGB channels; `scale` pixels in each border are cropped before evaluation.
Citation¶
@inproceedings{chen2021learning,
title={Learning continuous image representation with local implicit image function},
author={Chen, Yinbo and Liu, Sifei and Wang, Xiaolong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={8628--8638},
year={2021}
}
RDN (CVPR’2018)¶
Abstract¶
A very deep convolutional neural network (CNN) has recently achieved great success for image super-resolution (SR) and offered hierarchical features as well. However, most deep CNN based SR models do not make full use of the hierarchical features from the original low-resolution (LR) images, thereby achieving relatively-low performance. In this paper, we propose a novel residual dense network (RDN) to address this problem in image SR. We fully exploit the hierarchical features from all the convolutional layers. Specifically, we propose residual dense block (RDB) to extract abundant local features via dense connected convolutional layers. RDB further allows direct connections from the state of preceding RDB to all the layers of current RDB, leading to a contiguous memory (CM) mechanism. Local feature fusion in RDB is then used to adaptively learn more effective features from preceding and current local features and stabilizes the training of wider network. After fully obtaining dense local features, we use global feature fusion to jointly and adaptively learn global hierarchical features in a holistic way. Extensive experiments on benchmark datasets with different degradation models show that our RDN achieves favorable performance against state-of-the-art methods.
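A residual dense block as described above densely connects its convolutions (each layer sees the concatenation of all previous features), fuses them with a 1x1 convolution, and adds the block input back. The sketch below is a simplified illustration of that structure, not the MMEditing RDN code; the class name, growth rate, and depth are ours.

```python
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """Simplified residual dense block: dense connections + local feature fusion."""

    def __init__(self, channels=64, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(channels + i * growth, growth, 3, padding=1)
            for i in range(num_layers)
        )
        self.relu = nn.ReLU(inplace=True)
        # Local feature fusion: 1x1 conv back to the block's channel width.
        self.fusion = nn.Conv2d(channels + num_layers * growth, channels, 1)

    def forward(self, x):
        feats = [x]
        for conv in self.layers:
            feats.append(self.relu(conv(torch.cat(feats, dim=1))))
        return x + self.fusion(torch.cat(feats, dim=1))  # local residual learning

# The block keeps the spatial size and channel count of its input.
print(ResidualDenseBlock()(torch.randn(1, 64, 32, 32)).shape)  # (1, 64, 32, 32)
```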

Results and models¶
Evaluated on RGB channels; `scale` pixels in each border are cropped before evaluation. The metrics are PSNR / SSIM.
Method | Set5 | Set14 | DIV2K | Download |
---|---|---|---|---|
rdn_x2c64b16_g1_1000k_div2k | 35.9883 / 0.9385 | 31.8366 / 0.8920 | 34.9392 / 0.9380 | model | log |
rdn_x3c64b16_g1_1000k_div2k | 32.6051 / 0.8943 | 28.6338 / 0.8077 | 31.2153 / 0.8763 | model | log |
rdn_x4c64b16_g1_1000k_div2k | 30.4922 / 0.8548 | 26.9570 / 0.7423 | 29.1925 / 0.8233 | model | log |
Citation¶
@inproceedings{zhang2018residual,
title={Residual dense network for image super-resolution},
author={Zhang, Yulun and Tian, Yapeng and Kong, Yu and Zhong, Bineng and Fu, Yun},
booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
pages={2472--2481},
year={2018}
}
RealBasicVSR (CVPR’2022)¶
Abstract¶
The diversity and complexity of degradations in real-world video super-resolution (VSR) pose non-trivial challenges in inference and training. First, while long-term propagation leads to improved performance in cases of mild degradations, severe in-the-wild degradations could be exaggerated through propagation, impairing output quality. To balance the tradeoff between detail synthesis and artifact suppression, we found an image pre-cleaning stage indispensable to reduce noises and artifacts prior to propagation. Equipped with a carefully designed cleaning module, our RealBasicVSR outperforms existing methods in both quality and efficiency. Second, real-world VSR models are often trained with diverse degradations to improve generalizability, requiring increased batch size to produce a stable gradient. Inevitably, the increased computational burden results in various problems, including 1) speed-performance tradeoff and 2) batch-length tradeoff. To alleviate the first tradeoff, we propose a stochastic degradation scheme that reduces up to 40% of training time without sacrificing performance. We then analyze different training settings and suggest that employing longer sequences rather than larger batches during training allows more effective uses of temporal information, leading to more stable performance during inference. To facilitate fair comparisons, we propose the new VideoLQ dataset, which contains a large variety of real-world low-quality video sequences containing rich textures and patterns. Our dataset can serve as a common ground for benchmarking. Code, models, and the dataset will be made publicly available.

Results and models¶
Evaluated on the Y channel. The code for computing NRQM, NIQE, and PI can be found here. The official MATLAB code is used to compute BRISQUE.
Method | NRQM (Y) | NIQE (Y) | PI (Y) | BRISQUE (Y) | Download |
---|---|---|---|---|---|
realbasicvsr_c64b20_1x30x8_lr5e-5_150k_reds | 6.0477 | 3.7662 | 3.8593 | 29.030 | model/log |
Citation¶
@InProceedings{chan2022investigating,
author = {Chan, Kelvin C.K. and Zhou, Shangchen and Xu, Xiangyu and Loy, Chen Change},
title = {RealBasicVSR: Investigating Tradeoffs in Real-World Video Super-Resolution},
booktitle = {Proceedings of the IEEE conference on computer vision and pattern recognition},
year = {2022}
}
Real-ESRGAN (ICCVW’2021)¶
Abstract¶
Though many attempts have been made in blind super-resolution to restore low-resolution images with unknown and complex degradations, they are still far from addressing general real-world degraded images. In this work, we extend the powerful ESRGAN to a practical restoration application (namely, Real-ESRGAN), which is trained with pure synthetic data. Specifically, a high-order degradation modeling process is introduced to better simulate complex real-world degradations. We also consider the common ringing and overshoot artifacts in the synthesis process. In addition, we employ a U-Net discriminator with spectral normalization to increase discriminator capability and stabilize the training dynamics. Extensive comparisons have shown its superior visual performance than prior works on various real datasets. We also provide efficient implementations to synthesize training pairs on the fly.

Results and models¶
Evaluated on RGB channels. The metrics are PSNR / SSIM.
Method | Set5 | Download |
---|---|---|
realesrnet_c64b23g32_12x4_lr2e-4_1000k_df2k_ost | 28.0297/0.8236 | model/log |
realesrgan_c64b23g32_12x4_lr1e-4_400k_df2k_ost | 26.2204/0.7655 | model/log |
Citation¶
@inproceedings{wang2021real,
title={Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic data},
author={Wang, Xintao and Xie, Liangbin and Dong, Chao and Shan, Ying},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)},
pages={1905--1914},
year={2021}
}
SRCNN (TPAMI’2015)¶
Abstract¶
We propose a deep learning method for single image super-resolution (SR). Our method directly learns an end-to-end mapping between the low/high-resolution images. The mapping is represented as a deep convolutional neural network (CNN) that takes the low-resolution image as the input and outputs the high-resolution one. We further show that traditional sparse-coding-based SR methods can also be viewed as a deep convolutional network. But unlike traditional methods that handle each component separately, our method jointly optimizes all layers. Our deep CNN has a lightweight structure, yet demonstrates state-of-the-art restoration quality, and achieves fast speed for practical on-line usage. We explore different network structures and parameter settings to achieve trade-offs between performance and speed. Moreover, we extend our network to cope with three color channels simultaneously, and show better overall reconstruction quality.
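SRCNN itself is a three-layer CNN applied to a bicubically upsampled input (the config name srcnn_x4k915_... hints at the classic 9-1-5 kernel sizes). The sketch below is a minimal illustration of that structure, not the MMEditing SRCNN backbone; the 64/32 layer widths are an assumption here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySRCNN(nn.Module):
    """Three-layer SRCNN sketch: patch extraction, non-linear mapping, reconstruction."""

    def __init__(self, scale=4):
        super().__init__()
        self.scale = scale
        self.extract = nn.Conv2d(3, 64, kernel_size=9, padding=4)
        self.map = nn.Conv2d(64, 32, kernel_size=1)
        self.reconstruct = nn.Conv2d(32, 3, kernel_size=5, padding=2)

    def forward(self, lr):
        # SRCNN operates on an image first upsampled to the target resolution.
        x = F.interpolate(lr, scale_factor=self.scale, mode="bicubic",
                          align_corners=False)
        x = torch.relu(self.extract(x))
        x = torch.relu(self.map(x))
        return self.reconstruct(x)

print(TinySRCNN()(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 3, 128, 128])
```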

Results and models¶
Evaluated on RGB channels; `scale` pixels in each border are cropped before evaluation. The metrics are PSNR / SSIM.
Method | Set5 | Set14 | DIV2K | Download |
---|---|---|---|---|
srcnn_x4k915_1x16_1000k_div2k | 28.4316 / 0.8099 | 25.6486 / 0.7014 | 27.7460 / 0.7854 | model | log |
Citation¶
@article{dong2015image,
title={Image super-resolution using deep convolutional networks},
author={Dong, Chao and Loy, Chen Change and He, Kaiming and Tang, Xiaoou},
journal={IEEE transactions on pattern analysis and machine intelligence},
volume={38},
number={2},
pages={295--307},
year={2015},
publisher={IEEE}
}
SRGAN (CVPR’2016)¶
Abstract¶
Despite the breakthroughs in accuracy and speed of single image super-resolution using faster and deeper convolutional neural networks, one central problem remains largely unsolved: how do we recover the finer texture details when we super-resolve at large upscaling factors? The behavior of optimization-based super-resolution methods is principally driven by the choice of the objective function. Recent work has largely focused on minimizing the mean squared reconstruction error. The resulting estimates have high peak signal-to-noise ratios, but they are often lacking high-frequency details and are perceptually unsatisfying in the sense that they fail to match the fidelity expected at the higher resolution. In this paper, we present SRGAN, a generative adversarial network (GAN) for image super-resolution (SR). To our knowledge, it is the first framework capable of inferring photo-realistic natural images for 4x upscaling factors. To achieve this, we propose a perceptual loss function which consists of an adversarial loss and a content loss. The adversarial loss pushes our solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and original photo-realistic images. In addition, we use a content loss motivated by perceptual similarity instead of similarity in pixel space. Our deep residual network is able to recover photo-realistic textures from heavily downsampled images on public benchmarks. An extensive mean-opinion-score (MOS) test shows hugely significant gains in perceptual quality using SRGAN. The MOS scores obtained with SRGAN are closer to those of the original high-resolution images than to those obtained with any state-of-the-art method.
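The perceptual loss described above combines a VGG-feature content loss with an adversarial loss. The sketch below shows one common way to write it; the choice of VGG layer, the loss weight, and the function names are assumptions for illustration and may not match the MMEditing loss configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19

class VGGContentLoss(nn.Module):
    """MSE between VGG-19 feature maps of the SR output and the ground truth."""

    def __init__(self, layer_index=35):  # which feature layer to use is an assumption
        super().__init__()
        self.features = vgg19(pretrained=True).features[:layer_index].eval()
        for p in self.features.parameters():
            p.requires_grad = False

    def forward(self, sr, hr):
        return F.mse_loss(self.features(sr), self.features(hr))

def srgan_generator_loss(content_loss, disc_fake_logits, sr, hr, adv_weight=1e-3):
    """Perceptual loss = content loss + weighted adversarial loss (sketch)."""
    adversarial = F.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    return content_loss(sr, hr) + adv_weight * adversarial
```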

Results and models¶
Evaluated on RGB channels; `scale` pixels in each border are cropped before evaluation. The metrics are PSNR / SSIM.
Method | Set5 | Set14 | DIV2K | Download |
---|---|---|---|---|
msrresnet_x4c64b16_1x16_300k_div2k | 30.2252 / 0.8491 | 26.7762 / 0.7369 | 28.9748 / 0.8178 | model | log |
srgan_x4c64b16_1x16_1000k_div2k | 27.9499 / 0.7846 | 24.7383 / 0.6491 | 26.5697 / 0.7365 | model | log |
Citation¶
@inproceedings{ledig2016photo,
title={Photo-realistic single image super-resolution using a generative adversarial network},
author={Ledig, Christian and Theis, Lucas and Husz{\'a}r, Ferenc and Caballero, Jose and Cunningham, Andrew and Acosta, Alejandro and Aitken, Andrew and Tejani, Alykhan and Totz, Johannes and Wang, Zehan},
booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition workshops},
year={2016}
}
TDAN (CVPR’2020)¶
Abstract¶
Video super-resolution (VSR) aims to restore a photo-realistic high-resolution (HR) video frame from both its corresponding low-resolution (LR) frame (reference frame) and multiple neighboring frames (supporting frames). Due to varying motion of cameras or objects, the reference frame and each support frame are not aligned. Therefore, temporal alignment is a challenging yet important problem for VSR. Previous VSR methods usually utilize optical flow between the reference frame and each supporting frame to wrap the supporting frame for temporal alignment. Therefore, the performance of these image-level wrapping-based models will highly depend on the prediction accuracy of optical flow, and inaccurate optical flow will lead to artifacts in the wrapped supporting frames, which also will be propagated into the reconstructed HR video frame. To overcome the limitation, in this paper, we propose a temporal deformable alignment network (TDAN) to adaptively align the reference frame and each supporting frame at the feature level without computing optical flow. The TDAN uses features from both the reference frame and each supporting frame to dynamically predict offsets of sampling convolution kernels. By using the corresponding kernels, TDAN transforms supporting frames to align with the reference frame. To predict the HR video frame, a reconstruction network taking aligned frames and the reference frame is utilized. Experimental results demonstrate the effectiveness of the proposed TDAN-based VSR model.
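The core operation described above, predicting sampling offsets from the reference and supporting features and then deformably convolving the supporting feature, can be sketched with torchvision's deformable convolution. This is a heavily reduced illustration (a single deformable layer, no reconstruction network), not the MMEditing TDAN module; the class name and channel sizes are ours.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class TinyDeformAlign(nn.Module):
    """Align a supporting-frame feature to the reference-frame feature (sketch)."""

    def __init__(self, channels=64, kernel_size=3):
        super().__init__()
        # Offsets are predicted from the concatenated reference + supporting features.
        self.offset_conv = nn.Conv2d(2 * channels, 2 * kernel_size * kernel_size,
                                     3, padding=1)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size, padding=1)

    def forward(self, ref_feat, sup_feat):
        offsets = self.offset_conv(torch.cat([ref_feat, sup_feat], dim=1))
        return self.deform_conv(sup_feat, offsets)  # aligned supporting feature

ref = torch.randn(1, 64, 32, 32)
sup = torch.randn(1, 64, 32, 32)
print(TinyDeformAlign()(ref, sup).shape)  # torch.Size([1, 64, 32, 32])
```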

Results and models¶
Evaluated on the Y channel; 8 pixels in each border are cropped before evaluation. The metrics are PSNR / SSIM.
Method | Vid4 (BIx4) | SPMCS-30 (BIx4) | Vid4 (BDx4) | SPMCS-30 (BDx4) | Download |
---|---|---|---|---|---|
tdan_vimeo90k_bix4_ft_lr5e-5_400k | 26.49/0.792 | 30.42/0.856 | 25.93/0.772 | 29.69/0.842 | model | log |
tdan_vimeo90k_bdx4_ft_lr5e-5_800k | 25.80/0.784 | 29.56/0.851 | 26.87/0.815 | 30.77/0.868 | model | log |
Train
Train Instructions
You can use the following command to train a model.
./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
TDAN is trained with two stages.
Stage 1: Train with a larger learning rate (1e-4)
./tools/dist_train.sh configs/restorers/tdan/tdan_vimeo90k_bix4_lr1e-4_400k.py 8
Stage 2: Fine-tune with a smaller learning rate (5e-5)
./tools/dist_train.sh configs/restorers/tdan/tdan_vimeo90k_bix4_ft_lr5e-5_400k.py 8
For more details, you can refer to the Train a model part in getting_started.
Test
Test Instructions
You can use the following command to test a model.
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [--out ${RESULT_FILE}] [--save-path ${IMAGE_SAVE_PATH}]
Example: Test TDAN on SPMCS-30 using Bicubic downsampling.
python tools/test.py configs/restorers/tdan/tdan_vimeo90k_bix4_ft_lr5e-5_400k.py checkpoints/SOME_CHECKPOINT.pth --save-path outputs/
For more details, you can refer to the Inference with pretrained models part in getting_started.
Citation¶
@InProceedings{tian2020tdan,
title={TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution},
author={Tian, Yapeng and Zhang, Yulun and Fu, Yun and Xu, Chenliang},
booktitle = {Proceedings of the IEEE conference on Computer Vision and Pattern Recognition},
year = {2020}
}
TOFlow (IJCV’2019)¶
Abstract¶
Many video enhancement algorithms rely on optical flow to register frames in a video sequence. Precise flow estimation is however intractable; and optical flow itself is often a sub-optimal representation for particular video processing tasks. In this paper, we propose task-oriented flow (TOFlow), a motion representation learned in a self-supervised, task-specific manner. We design a neural network with a trainable motion estimation component and a video processing component, and train them jointly to learn the task-oriented flow. For evaluation, we build Vimeo-90K, a large-scale, high-quality video dataset for low-level video processing. TOFlow outperforms traditional optical flow on standard benchmarks as well as our Vimeo-90K dataset in three video processing tasks: frame interpolation, video denoising/deblocking, and video super-resolution.
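Flow-based registration, as mentioned above, typically backward-warps a neighboring frame toward the reference frame using the estimated flow. The function below is a generic, illustrative warping sketch (not TOFlow's trainable motion-estimation component); the name `flow_warp` and the sampling settings are ours.

```python
import torch
import torch.nn.functional as F

def flow_warp(frame, flow):
    """Backward-warp `frame` (N, C, H, W) with optical `flow` (N, 2, H, W) in pixels.

    flow[:, 0] is the horizontal (x) displacement, flow[:, 1] the vertical (y) one.
    Illustrative sketch of flow-based frame registration.
    """
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=frame.dtype),
                            torch.arange(w, dtype=frame.dtype), indexing="ij")
    base = torch.stack((xs, ys), dim=0).unsqueeze(0).to(frame.device)  # (1, 2, H, W)
    coords = base + flow                                               # sampling positions
    # Normalize to [-1, 1] for grid_sample (align_corners=True convention).
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)                   # (N, H, W, 2)
    return F.grid_sample(frame, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# A zero flow leaves the frame unchanged (up to interpolation error).
frame = torch.rand(1, 3, 64, 64)
print(torch.allclose(flow_warp(frame, torch.zeros(1, 2, 64, 64)), frame, atol=1e-5))
```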

Results and models¶
Evaluated on RGB channels. The metrics are PSNR / SSIM.
Method | Vid4 | Download |
---|---|---|
tof_x4_vimeo90k_official | 24.4377 / 0.7433 | model |
Citation¶
@article{xue2019video,
title={Video enhancement with task-oriented flow},
author={Xue, Tianfan and Chen, Baian and Wu, Jiajun and Wei, Donglai and Freeman, William T},
journal={International Journal of Computer Vision},
volume={127},
number={8},
pages={1106--1125},
year={2019},
publisher={Springer}
}
TTSR (CVPR’2020)¶
Abstract¶
We study on image super-resolution (SR), which aims to recover realistic textures from a low-resolution (LR) image. Recent progress has been made by taking high-resolution images as references (Ref), so that relevant textures can be transferred to LR images. However, existing SR approaches neglect to use attention mechanisms to transfer high-resolution (HR) textures from Ref images, which limits these approaches in challenging cases. In this paper, we propose a novel Texture Transformer Network for Image Super-Resolution (TTSR), in which the LR and Ref images are formulated as queries and keys in a transformer, respectively. TTSR consists of four closely-related modules optimized for image generation tasks, including a learnable texture extractor by DNN, a relevance embedding module, a hard-attention module for texture transfer, and a soft-attention module for texture synthesis. Such a design encourages joint feature learning across LR and Ref images, in which deep feature correspondences can be discovered by attention, and thus accurate texture features can be transferred. The proposed texture transformer can be further stacked in a cross-scale way, which enables texture recovery from different levels (e.g., from 1x to 4x magnification). Extensive experiments show that TTSR achieves significant improvements over state-of-the-art approaches on both quantitative and qualitative evaluations.

Results and models¶
Evaluated on RGB channels; `scale` pixels in each border are cropped before evaluation. The metrics are PSNR / SSIM.
Method | scale | CUFED | Download |
---|---|---|---|
ttsr-rec_x4_c64b16_g1_200k_CUFED | x4 | 25.2433 / 0.7491 | model | log |
ttsr-gan_x4_c64b16_g1_500k_CUFED | x4 | 24.6075 / 0.7234 | model | log |
Citation¶
@inproceedings{yang2020learning,
title={Learning texture transformer network for image super-resolution},
author={Yang, Fuzhi and Yang, Huan and Fu, Jianlong and Lu, Hongtao and Guo, Baining},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={5791--5800},
year={2020}
}