Note

You are reading the documentation for MMEditing 0.x, which will be deprecated by the end of 2022. We recommend upgrading to MMEditing 1.0 to enjoy the new features and better performance brought by OpenMMLab 2.0. Check out the changelog, code, and documentation of MMEditing 1.0 for more details.

Super-Resolution Models

BasicVSR (CVPR’2021)

Abstract

Video super-resolution (VSR) approaches tend to have more components than the image counterparts as they need to exploit the additional temporal dimension. Complex designs are not uncommon. In this study, we wish to untangle the knots and reconsider some most essential components for VSR guided by four basic functionalities, i.e., Propagation, Alignment, Aggregation, and Upsampling. By reusing some existing components added with minimal redesigns, we show a succinct pipeline, BasicVSR, that achieves appealing improvements in terms of speed and restoration quality in comparison to many state-of-the-art algorithms. We conduct systematic analysis to explain how such gain can be obtained and discuss the pitfalls. We further show the extensibility of BasicVSR by presenting an information-refill mechanism and a coupled propagation scheme to facilitate information aggregation. The BasicVSR and its extension, IconVSR, can serve as strong baselines for future VSR approaches.

Results and models

Evaluated on RGB channels for REDS4 and Y channel for others. The metrics are PSNR / SSIM. The pretrained weights of SPyNet can be found here.

| Method | REDS4 (BIx4) PSNR/SSIM (RGB) | Vimeo-90K-T (BIx4) PSNR/SSIM (Y) | Vid4 (BIx4) PSNR/SSIM (Y) | UDM10 (BDx4) PSNR/SSIM (Y) | Vimeo-90K-T (BDx4) PSNR/SSIM (Y) | Vid4 (BDx4) PSNR/SSIM (Y) | Download |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| basicvsr_reds4 | 31.4170/0.8909 | 36.2848/0.9395 | 27.2694/0.8318 | 33.4478/0.9306 | 34.4700/0.9286 | 24.4541/0.7455 | model \| log |
| basicvsr_vimeo90k_bi | 30.3128/0.8660 | 37.2026/0.9451 | 27.2755/0.8248 | 34.5554/0.9434 | 34.8097/0.9316 | 25.0517/0.7636 | model \| log |
| basicvsr_vimeo90k_bd | 29.0376/0.8481 | 34.6427/0.9335 | 26.2708/0.8022 | 39.9953/0.9695 | 37.5501/0.9499 | 27.9791/0.8556 | model \| log |
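
For reference, the evaluation convention above reduces to an optional RGB-to-Y (BT.601) conversion before computing the metrics. Below is a minimal NumPy sketch of the PSNR part of that convention; it is illustrative only, not the exact MMEditing implementation.

```python
import numpy as np

def rgb_to_y(img):
    """Convert an RGB image in [0, 255] to the BT.601 luma (Y) channel."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 65.481 * r / 255. + 128.553 * g / 255. + 24.966 * b / 255. + 16.

def psnr(gt, pred, use_y=False):
    """PSNR between two uint8 HxWx3 images, optionally on the Y channel only."""
    gt = gt.astype(np.float64)
    pred = pred.astype(np.float64)
    if use_y:
        gt, pred = rgb_to_y(gt), rgb_to_y(pred)
    mse = np.mean((gt - pred) ** 2)
    if mse == 0:
        return float('inf')
    return 20. * np.log10(255. / np.sqrt(mse))
```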

Citation

@InProceedings{chan2021basicvsr,
  author = {Chan, Kelvin CK and Wang, Xintao and Yu, Ke and Dong, Chao and Loy, Chen Change},
  title = {BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond},
  booktitle = {Proceedings of the IEEE conference on computer vision and pattern recognition},
  year = {2021}
}

BasicVSR++ (CVPR’2022)

Abstract

A recurrent structure is a popular framework choice for the task of video super-resolution. The state-of-the-art method BasicVSR adopts bidirectional propagation with feature alignment to effectively exploit information from the entire input video. In this study, we redesign BasicVSR by proposing second-order grid propagation and flow-guided deformable alignment. We show that by empowering the recurrent framework with the enhanced propagation and alignment, one can exploit spatiotemporal information across misaligned video frames more effectively. The new components lead to an improved performance under a similar computational constraint. In particular, our model BasicVSR++ surpasses BasicVSR by 0.82 dB in PSNR with similar number of parameters. In addition to video super-resolution, BasicVSR++ generalizes well to other video restoration tasks such as compressed video enhancement. In NTIRE 2021, BasicVSR++ obtains three champions and one runner-up in the Video Super-Resolution and Compressed Video Enhancement Challenges. Codes and models will be released to MMEditing.

Results and models

The pretrained weights of SPyNet can be found here.

| Method | REDS4 (BIx4) PSNR/SSIM (RGB) | Vimeo-90K-T (BIx4) PSNR/SSIM (Y) | Vid4 (BIx4) PSNR/SSIM (Y) | UDM10 (BDx4) PSNR/SSIM (Y) | Vimeo-90K-T (BDx4) PSNR/SSIM (Y) | Vid4 (BDx4) PSNR/SSIM (Y) | Download |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| basicvsr_plusplus_c64n7_8x1_600k_reds4 | 32.3855/0.9069 | 36.4445/0.9411 | 27.7674/0.8444 | 34.6868/0.9417 | 34.0372/0.9244 | 24.6209/0.7540 | model \| log |
| basicvsr_plusplus_c64n7_4x2_300k_vimeo90k_bi | 31.0126/0.8804 | 37.7864/0.9500 | 27.7882/0.8401 | 33.1211/0.9270 | 33.8972/0.9195 | 23.6086/0.7033 | model \| log |
| basicvsr_plusplus_c64n7_4x2_300k_vimeo90k_bd | 29.2041/0.8528 | 34.7248/0.9351 | 26.4377/0.8074 | 40.7216/0.9722 | 38.2054/0.9550 | 29.0400/0.8753 | model \| log |
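
To try one of the checkpoints above without writing a test loop by hand, MMEditing 0.x ships high-level inference helpers in mmedit.apis. The sketch below assumes the init_model and restoration_video_inference helpers keep their 0.x signatures and uses hypothetical local paths; check your installed version for the exact argument names.

```python
# A minimal sketch, assuming the mmedit.apis helpers keep their 0.x signatures;
# the config/checkpoint paths below are hypothetical placeholders.
from mmedit.apis import init_model, restoration_video_inference

config = 'configs/restorers/basicvsr_plusplus/basicvsr_plusplus_c64n7_8x1_600k_reds4.py'
checkpoint = 'checkpoints/basicvsr_plusplus_reds4.pth'  # downloaded from the table above

model = init_model(config, checkpoint, device='cuda:0')
# window_size=0 is the recurrent mode: the whole clip is fed to BasicVSR++ at once.
output = restoration_video_inference(
    model, 'data/demo_clip', window_size=0, start_idx=0, filename_tmpl='{:08d}.png')
print(output.shape)  # expected: (1, T, C, H, W) tensor of restored frames
```
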
NTIRE 2021 checkpoints

Note that the following models are finetuned from smaller models. The training schemes of these models will be released when MMEditing reaches 5k stars. We provide the pre-trained models here.

NTIRE 2021 Video Super-Resolution

NTIRE 2021 Quality Enhancement of Compressed Video - Track 1

NTIRE 2021 Quality Enhancement of Compressed Video - Track 2

NTIRE 2021 Quality Enhancement of Compressed Video - Track 3

Citation

@InProceedings{chan2022basicvsrplusplus,
  author = {Chan, Kelvin C.K. and Zhou, Shangchen and Xu, Xiangyu and Loy, Chen Change},
  title = {BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment},
  booktitle = {Proceedings of the IEEE conference on computer vision and pattern recognition},
  year = {2022}
}

DIC (CVPR’2020)

Abstract

Recent works based on deep learning and facial priors have succeeded in super-resolving severely degraded facial images. However, the prior knowledge is not fully exploited in existing methods, since facial priors such as landmark and component maps are always estimated by low-resolution or coarsely super-resolved images, which may be inaccurate and thus affect the recovery performance. In this paper, we propose a deep face super-resolution (FSR) method with iterative collaboration between two recurrent networks which focus on facial image recovery and landmark estimation respectively. In each recurrent step, the recovery branch utilizes the prior knowledge of landmarks to yield higher-quality images which facilitate more accurate landmark estimation in turn. Therefore, the iterative information interaction between two processes boosts the performance of each other progressively. Moreover, a new attentive fusion module is designed to strengthen the guidance of landmark maps, where facial components are generated individually and aggregated attentively for better restoration. Quantitative and qualitative experimental results show the proposed method significantly outperforms state-of-the-art FSR methods in recovering high-quality face images.
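
Conceptually, the iterative collaboration alternates an SR recovery step conditioned on the current landmark estimate with a landmark estimation step on the newly recovered image. The loop below is a schematic sketch in which recovery_net and landmark_net are hypothetical stand-ins for the two recurrent branches.

```python
import torch

def iterative_face_sr(lr_img, recovery_net, landmark_net, num_steps=4):
    """Schematic DIC-style loop: each step refines the SR image using the
    current landmark heatmaps, then re-estimates landmarks on the new image."""
    sr_img = torch.nn.functional.interpolate(lr_img, scale_factor=8, mode='bicubic')
    heatmaps = landmark_net(sr_img)              # coarse landmarks from the first guess
    for _ in range(num_steps):
        sr_img = recovery_net(lr_img, heatmaps)  # recovery guided by landmark priors
        heatmaps = landmark_net(sr_img)          # better image -> better landmarks
    return sr_img, heatmaps
```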

Results and models

Evaluated on RGB channels; scale pixels in each border are cropped before evaluation. The metrics are PSNR / SSIM.

In the log of dic_gan_x8c48b6_g4_150k_CelebAHQ, DICGAN is evaluated on the first 9 images of the CelebA-HQ test set, so the PSNR/SSIM shown in the following table differs from the log data.

| Method | scale | CelebA-HQ | Download |
| :--- | :---: | :---: | :---: |
| dic_x8c48b6_g4_150k_CelebAHQ | x8 | 25.2319 / 0.7422 | model \| log |
| dic_gan_x8c48b6_g4_150k_CelebAHQ | x8 | 23.6241 / 0.6721 | model \| log |

Citation

@inproceedings{ma2020deep,
  title={Deep face super-resolution with iterative collaboration between attentive recovery and landmark estimation},
  author={Ma, Cheng and Jiang, Zhenyu and Rao, Yongming and Lu, Jiwen and Zhou, Jie},
  booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
  pages={5569--5578},
  year={2020}
}

EDSR (CVPR’2017)

Abstract

Recent research on super-resolution has progressed with the development of deep convolutional neural networks (DCNN). In particular, residual learning techniques exhibit improved performance. In this paper, we develop an enhanced deep super-resolution network (EDSR) with performance exceeding those of current state-of-the-art SR methods. The significant performance improvement of our model is due to optimization by removing unnecessary modules in conventional residual networks. The performance is further improved by expanding the model size while we stabilize the training procedure. We also propose a new multi-scale deep super-resolution system (MDSR) and training method, which can reconstruct high-resolution images of different upscaling factors in a single model. The proposed methods show superior performance over the state-of-the-art methods on benchmark datasets and prove its excellence by winning the NTIRE2017 Super-Resolution Challenge.
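
The "unnecessary modules" removed from conventional residual networks are chiefly the batch-normalization layers; a small residual scaling factor is commonly used to keep very wide variants stable. A minimal sketch of such a block (illustrative, not the exact MMEditing module):

```python
import torch.nn as nn

class ResidualBlockNoBN(nn.Module):
    """EDSR-style residual block: conv-ReLU-conv, no batch norm,
    with residual scaling for training stability."""

    def __init__(self, channels=64, res_scale=0.1):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.res_scale = res_scale

    def forward(self, x):
        return x + self.res_scale * self.conv2(self.relu(self.conv1(x)))
```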

Results and models

Evaluated on RGB channels; scale pixels in each border are cropped before evaluation. The metrics are PSNR / SSIM.

| Method | Set5 | Set14 | DIV2K | Download |
| :--- | :---: | :---: | :---: | :---: |
| edsr_x2c64b16_1x16_300k_div2k | 35.7592 / 0.9372 | 31.4290 / 0.8874 | 34.5896 / 0.9352 | model \| log |
| edsr_x3c64b16_1x16_300k_div2k | 32.3301 / 0.8912 | 28.4125 / 0.8022 | 30.9154 / 0.8711 | model \| log |
| edsr_x4c64b16_1x16_300k_div2k | 30.2223 / 0.8500 | 26.7870 / 0.7366 | 28.9675 / 0.8172 | model \| log |

Citation

@inproceedings{lim2017enhanced,
  title={Enhanced deep residual networks for single image super-resolution},
  author={Lim, Bee and Son, Sanghyun and Kim, Heewon and Nah, Seungjun and Mu Lee, Kyoung},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition workshops},
  pages={136--144},
  year={2017}
}

EDVR (CVPRW’2019)

Abstract

Video restoration tasks, including super-resolution, deblurring, etc, are drawing increasing attention in the computer vision community. A challenging benchmark named REDS is released in the NTIRE19 Challenge. This new benchmark challenges existing methods from two aspects: (1) how to align multiple frames given large motions, and (2) how to effectively fuse different frames with diverse motion and blur. In this work, we propose a novel Video Restoration framework with Enhanced Deformable networks, termed EDVR, to address these challenges. First, to handle large motions, we devise a Pyramid, Cascading and Deformable (PCD) alignment module, in which frame alignment is done at the feature level using deformable convolutions in a coarse-to-fine manner. Second, we propose a Temporal and Spatial Attention (TSA) fusion module, in which attention is applied both temporally and spatially, so as to emphasize important features for subsequent restoration. Thanks to these modules, our EDVR wins the champions and outperforms the second place by a large margin in all four tracks in the NTIRE19 video restoration and enhancement challenges. EDVR also demonstrates superior performance to state-of-the-art published methods on video super-resolution and deblurring.
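
The PCD module aligns each neighboring frame to the reference at the feature level: offsets are predicted from the concatenated features and fed to a deformable convolution, repeated over a coarse-to-fine pyramid. The single-level sketch below illustrates only that core step using torchvision's DeformConv2d; the pyramid and cascading refinement are omitted.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class SingleLevelAlign(nn.Module):
    """Align neighbor features to reference features with one deformable conv.
    EDVR's PCD module stacks this idea over a feature pyramid with cascading."""

    def __init__(self, channels=64):
        super().__init__()
        # 18 = 2 (x/y) * 3 * 3 sampling offsets for a 3x3 kernel
        self.offset_conv = nn.Conv2d(channels * 2, 18, 3, padding=1)
        self.dcn = DeformConv2d(channels, channels, 3, padding=1)

    def forward(self, neighbor_feat, ref_feat):
        offsets = self.offset_conv(torch.cat([neighbor_feat, ref_feat], dim=1))
        return self.dcn(neighbor_feat, offsets)

aligned = SingleLevelAlign()(torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32))
```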

Results and models

Evaluated on RGB channels. The metrics are PSNR / SSIM.

| Method | REDS4 | Download |
| :--- | :---: | :---: |
| edvrm_wotsa_x4_8x4_600k_reds | 30.3430 / 0.8664 | model \| log |
| edvrm_x4_8x4_600k_reds | 30.4194 / 0.8684 | model \| log |
| edvrl_wotsa_c128b40_8x8_lr2e-4_600k_reds4 | 31.0010 / 0.8784 | model \| log |
| edvrl_c128b40_8x8_lr2e-4_600k_reds4 | 31.0467 / 0.8793 | model \| log |

Citation

@InProceedings{wang2019edvr,
  author    = {Wang, Xintao and Chan, Kelvin C.K. and Yu, Ke and Dong, Chao and Loy, Chen Change},
  title     = {EDVR: Video restoration with enhanced deformable convolutional networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
  month     = {June},
  year      = {2019},
}

ESRGAN (ECCVW’2018)

Abstract

The Super-Resolution Generative Adversarial Network (SRGAN) is a seminal work that is capable of generating realistic textures during single image super-resolution. However, the hallucinated details are often accompanied with unpleasant artifacts. To further enhance the visual quality, we thoroughly study three key components of SRGAN - network architecture, adversarial loss and perceptual loss, and improve each of them to derive an Enhanced SRGAN (ESRGAN). In particular, we introduce the Residual-in-Residual Dense Block (RRDB) without batch normalization as the basic network building unit. Moreover, we borrow the idea from relativistic GAN to let the discriminator predict relative realness instead of the absolute value. Finally, we improve the perceptual loss by using the features before activation, which could provide stronger supervision for brightness consistency and texture recovery. Benefiting from these improvements, the proposed ESRGAN achieves consistently better visual quality with more realistic and natural textures than SRGAN and won the first place in the PIRM2018-SR Challenge.
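
The relativistic discriminator borrowed from relativistic GAN predicts whether a real image is more realistic than the average fake, instead of an absolute real/fake score. A minimal sketch of the discriminator-side loss, assuming a discriminator that outputs raw logits:

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(real_logits, fake_logits):
    """Discriminator loss of the relativistic average GAN used in ESRGAN:
    real images should score higher than the *average* fake, and vice versa."""
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits - fake_logits.mean(), torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits - real_logits.mean(), torch.zeros_like(fake_logits))
    return (loss_real + loss_fake) / 2
```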

Results and models

Evaluated on RGB channels; scale pixels in each border are cropped before evaluation. The metrics are PSNR / SSIM.

| Method | Set5 | Set14 | DIV2K | Download |
| :--- | :---: | :---: | :---: | :---: |
| esrgan_psnr_x4c64b23g32_1x16_1000k_div2k | 30.6428 / 0.8559 | 27.0543 / 0.7447 | 29.3354 / 0.8263 | model \| log |
| esrgan_x4c64b23g32_1x16_400k_div2k | 28.2700 / 0.7778 | 24.6328 / 0.6491 | 26.6531 / 0.7340 | model \| log |

Citation

@inproceedings{wang2018esrgan,
  title={Esrgan: Enhanced super-resolution generative adversarial networks},
  author={Wang, Xintao and Yu, Ke and Wu, Shixiang and Gu, Jinjin and Liu, Yihao and Dong, Chao and Qiao, Yu and Change Loy, Chen},
  booktitle={Proceedings of the European Conference on Computer Vision Workshops (ECCVW)},
  pages={0--0},
  year={2018}
}

GLEAN (CVPR’2021)

Abstract

We show that pre-trained Generative Adversarial Networks (GANs), e.g., StyleGAN, can be used as a latent bank to improve the restoration quality of large-factor image super-resolution (SR). While most existing SR approaches attempt to generate realistic textures through learning with adversarial loss, our method, Generative LatEnt bANk (GLEAN), goes beyond existing practices by directly leveraging rich and diverse priors encapsulated in a pre-trained GAN. But unlike prevalent GAN inversion methods that require expensive image-specific optimization at runtime, our approach only needs a single forward pass to generate the upscaled image. GLEAN can be easily incorporated in a simple encoder-bank-decoder architecture with multi-resolution skip connections. Switching the bank allows the method to deal with images from diverse categories, e.g., cat, building, human face, and car. Images upscaled by GLEAN show clear improvements in terms of fidelity and texture faithfulness in comparison to existing methods.

Results and models

For the meta info used in training and test, please refer to here. The results are evaluated on RGB channels.

| Method | PSNR | Download |
| :--- | :---: | :---: |
| glean_cat_8x | 23.98 | model \| log |
| glean_ffhq_16x | 26.91 | model \| log |
| glean_cat_16x | 20.88 | model \| log |
| glean_in128out1024_4x2_300k_ffhq_celebahq | 27.94 | model \| log |

Citation

@InProceedings{chan2021glean,
  author = {Chan, Kelvin CK and Wang, Xintao and Xu, Xiangyu and Gu, Jinwei and Loy, Chen Change},
  title = {GLEAN: Generative Latent Bank for Large-Factor Image Super-Resolution},
  booktitle = {Proceedings of the IEEE conference on computer vision and pattern recognition},
  year = {2021}
}

IconVSR (CVPR’2021)

Abstract

Video super-resolution (VSR) approaches tend to have more components than the image counterparts as they need to exploit the additional temporal dimension. Complex designs are not uncommon. In this study, we wish to untangle the knots and reconsider some most essential components for VSR guided by four basic functionalities, i.e., Propagation, Alignment, Aggregation, and Upsampling. By reusing some existing components added with minimal redesigns, we show a succinct pipeline, BasicVSR, that achieves appealing improvements in terms of speed and restoration quality in comparison to many state-of-the-art algorithms. We conduct systematic analysis to explain how such gain can be obtained and discuss the pitfalls. We further show the extensibility of BasicVSR by presenting an information-refill mechanism and a coupled propagation scheme to facilitate information aggregation. The BasicVSR and its extension, IconVSR, can serve as strong baselines for future VSR approaches.

Results and models

Evaluated on RGB channels for REDS4 and Y channel for others. The metrics are PSNR / SSIM. The pretrained weights of the IconVSR components can be found here: SPyNet, EDVR-M for REDS, and EDVR-M for Vimeo-90K.

| Method | REDS4 (BIx4) PSNR/SSIM (RGB) | Vimeo-90K-T (BIx4) PSNR/SSIM (Y) | Vid4 (BIx4) PSNR/SSIM (Y) | UDM10 (BDx4) PSNR/SSIM (Y) | Vimeo-90K-T (BDx4) PSNR/SSIM (Y) | Vid4 (BDx4) PSNR/SSIM (Y) | Download |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| iconvsr_reds4 | 31.6926/0.8951 | 36.4983/0.9416 | 27.4809/0.8354 | 35.3377/0.9471 | 34.4299/0.9287 | 25.2110/0.7732 | model \| log |
| iconvsr_vimeo90k_bi | 30.3452/0.8659 | 37.3729/0.9467 | 27.4238/0.8297 | 34.2595/0.9398 | 34.5548/0.9295 | 24.6666/0.7491 | model \| log |
| iconvsr_vimeo90k_bd | 29.0150/0.8465 | 34.6780/0.9339 | 26.3109/0.8028 | 40.0640/0.9697 | 37.7573/0.9517 | 28.2464/0.8612 | model \| log |

Citation

@InProceedings{chan2021basicvsr,
  author = {Chan, Kelvin CK and Wang, Xintao and Yu, Ke and Dong, Chao and Loy, Chen Change},
  title = {BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond},
  booktitle = {Proceedings of the IEEE conference on computer vision and pattern recognition},
  year = {2021}
}

LIIF (CVPR’2021)

Abstract

How to represent an image? While the visual world is presented in a continuous manner, machines store and see the images in a discrete way with 2D arrays of pixels. In this paper, we seek to learn a continuous representation for images. Inspired by the recent progress in 3D reconstruction with implicit neural representation, we propose Local Implicit Image Function (LIIF), which takes an image coordinate and the 2D deep features around the coordinate as inputs, predicts the RGB value at a given coordinate as an output. Since the coordinates are continuous, LIIF can be presented in arbitrary resolution. To generate the continuous representation for images, we train an encoder with LIIF representation via a self-supervised task with super-resolution. The learned continuous representation can be presented in arbitrary resolution even extrapolate to x30 higher resolution, where the training tasks are not provided. We further show that LIIF representation builds a bridge between discrete and continuous representation in 2D, it naturally supports the learning tasks with size-varied image ground-truths and significantly outperforms the method with resizing the ground-truths.
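
Since LIIF maps a continuous coordinate plus nearby deep features to an RGB value, rendering at an arbitrary resolution amounts to building a coordinate grid of the target size and querying the decoder at those coordinates. The sketch below shows that query step with a hypothetical decoder callable and nearest-neighbor feature lookup; the actual model additionally feeds a cell-size token and ensembles over local neighbors.

```python
import torch

def make_coord_grid(height, width):
    """Pixel-center coordinates of an H x W grid, normalized to [-1, 1]."""
    ys = (torch.arange(height) + 0.5) / height * 2 - 1
    xs = (torch.arange(width) + 0.5) / width * 2 - 1
    return torch.stack(torch.meshgrid(ys, xs, indexing='ij'), dim=-1)  # (H, W, 2)

def render(decoder, feat_map, out_h, out_w):
    """Query an LIIF-style decoder at an arbitrary output resolution.
    `decoder` maps (feature, coordinate) pairs to RGB; here we look up the
    nearest encoder feature vector for each query coordinate."""
    coords = make_coord_grid(out_h, out_w).view(-1, 2)                  # (N, 2)
    # nearest-neighbor feature lookup; the paper uses local ensembling instead
    feats = torch.nn.functional.grid_sample(
        feat_map, coords.flip(-1).view(1, -1, 1, 2), mode='nearest',
        align_corners=False)[0, :, :, 0].t()                            # (N, C)
    rgb = decoder(torch.cat([feats, coords], dim=-1))                   # (N, 3)
    return rgb.view(out_h, out_w, 3)
```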

Results and models

| Method | scale | Set5 PSNR / SSIM | Set14 PSNR / SSIM | DIV2K PSNR / SSIM | Download |
| :--- | :---: | :---: | :---: | :---: | :---: |
| liif_edsr_norm_c64b16_g1_1000k_div2k | x2 | 35.7131 / 0.9366 | 31.5579 / 0.8889 | 34.6647 / 0.9355 | model \| log |
| △ | x3 | 32.3805 / 0.8915 | 28.4605 / 0.8039 | 30.9808 / 0.8724 | △ |
| △ | x4 | 30.2748 / 0.8509 | 26.8415 / 0.7381 | 29.0245 / 0.8187 | △ |
| △ | x6 | 27.1187 / 0.7774 | 24.7461 / 0.6444 | 26.7770 / 0.7425 | △ |
| △ | x18 | 20.8516 / 0.5406 | 20.0096 / 0.4525 | 22.1987 / 0.5955 | △ |
| △ | x30 | 18.8467 / 0.5010 | 18.1321 / 0.3963 | 20.5050 / 0.5577 | △ |
| liif_rdn_norm_c64b16_g1_1000k_div2k | x2 | 35.7874 / 0.9366 | 31.6866 / 0.8896 | 34.7548 / 0.9356 | model \| log |
| △ | x3 | 32.4992 / 0.8923 | 28.4905 / 0.8037 | 31.0744 / 0.8731 | △ |
| △ | x4 | 30.3835 / 0.8513 | 26.8734 / 0.7373 | 29.1101 / 0.8197 | △ |
| △ | x6 | 27.1914 / 0.7751 | 24.7824 / 0.6434 | 26.8693 / 0.7437 | △ |
| △ | x18 | 20.8913 / 0.5329 | 20.1077 / 0.4537 | 22.2972 / 0.5950 | △ |
| △ | x30 | 18.9354 / 0.4864 | 18.1448 / 0.3942 | 20.5663 / 0.5560 | △ |

Note:

  • △ refers to ditto (same entry as the row above).

  • Evaluated on RGB channels; scale pixels in each border are cropped before evaluation.

Citation

@inproceedings{chen2021learning,
  title={Learning continuous image representation with local implicit image function},
  author={Chen, Yinbo and Liu, Sifei and Wang, Xiaolong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={8628--8638},
  year={2021}
}

RDN (CVPR’2018)

Abstract

A very deep convolutional neural network (CNN) has recently achieved great success for image super-resolution (SR) and offered hierarchical features as well. However, most deep CNN based SR models do not make full use of the hierarchical features from the original low-resolution (LR) images, thereby achieving relatively-low performance. In this paper, we propose a novel residual dense network (RDN) to address this problem in image SR. We fully exploit the hierarchical features from all the convolutional layers. Specifically, we propose residual dense block (RDB) to extract abundant local features via dense connected convolutional layers. RDB further allows direct connections from the state of preceding RDB to all the layers of current RDB, leading to a contiguous memory (CM) mechanism. Local feature fusion in RDB is then used to adaptively learn more effective features from preceding and current local features and stabilizes the training of wider network. After fully obtaining dense local features, we use global feature fusion to jointly and adaptively learn global hierarchical features in a holistic way. Extensive experiments on benchmark datasets with different degradation models show that our RDN achieves favorable performance against state-of-the-art methods.
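
A residual dense block chains convolutions whose input is the concatenation of all preceding features within the block, fuses them with a 1x1 convolution (local feature fusion), and adds the block input back (local residual learning). A minimal sketch, not the exact MMEditing module:

```python
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """RDN-style block: densely connected convs, 1x1 local feature fusion,
    and a local residual connection back to the block input."""

    def __init__(self, channels=64, growth=32, num_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels + i * growth, growth, 3, padding=1),
                nn.ReLU(inplace=True))
            for i in range(num_layers))
        self.fusion = nn.Conv2d(channels + num_layers * growth, channels, 1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # dense connections
        return x + self.fusion(torch.cat(feats, dim=1))   # LFF + local residual
```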

Results and models

Evaluated on RGB channels; scale pixels in each border are cropped before evaluation. The metrics are PSNR / SSIM.

| Method | Set5 | Set14 | DIV2K | Download |
| :--- | :---: | :---: | :---: | :---: |
| rdn_x2c64b16_g1_1000k_div2k | 35.9883 / 0.9385 | 31.8366 / 0.8920 | 34.9392 / 0.9380 | model \| log |
| rdn_x3c64b16_g1_1000k_div2k | 32.6051 / 0.8943 | 28.6338 / 0.8077 | 31.2153 / 0.8763 | model \| log |
| rdn_x4c64b16_g1_1000k_div2k | 30.4922 / 0.8548 | 26.9570 / 0.7423 | 29.1925 / 0.8233 | model \| log |

Citation

@inproceedings{zhang2018residual,
  title={Residual dense network for image super-resolution},
  author={Zhang, Yulun and Tian, Yapeng and Kong, Yu and Zhong, Bineng and Fu, Yun},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages={2472--2481},
  year={2018}
}

RealBasicVSR (CVPR’2022)

Abstract

The diversity and complexity of degradations in real-world video super-resolution (VSR) pose non-trivial challenges in inference and training. First, while long-term propagation leads to improved performance in cases of mild degradations, severe in-the-wild degradations could be exaggerated through propagation, impairing output quality. To balance the tradeoff between detail synthesis and artifact suppression, we found an image pre-cleaning stage indispensable to reduce noises and artifacts prior to propagation. Equipped with a carefully designed cleaning module, our RealBasicVSR outperforms existing methods in both quality and efficiency. Second, real-world VSR models are often trained with diverse degradations to improve generalizability, requiring increased batch size to produce a stable gradient. Inevitably, the increased computational burden results in various problems, including 1) speed-performance tradeoff and 2) batch-length tradeoff. To alleviate the first tradeoff, we propose a stochastic degradation scheme that reduces up to 40% of training time without sacrificing performance. We then analyze different training settings and suggest that employing longer sequences rather than larger batches during training allows more effective uses of temporal information, leading to more stable performance during inference. To facilitate fair comparisons, we propose the new VideoLQ dataset, which contains a large variety of real-world low-quality video sequences containing rich textures and patterns. Our dataset can serve as a common ground for benchmarking. Code, models, and the dataset will be made publicly available.

Results and models

Evaluated on the Y channel. The code for computing NRQM, NIQE, and PI can be found here. The official MATLAB code is used to compute BRISQUE.

| Method | NRQM (Y) | NIQE (Y) | PI (Y) | BRISQUE (Y) | Download |
| :--- | :---: | :---: | :---: | :---: | :---: |
| realbasicvsr_c64b20_1x30x8_lr5e-5_150k_reds | 6.0477 | 3.7662 | 3.8593 | 29.030 | model/log |

Citation

@InProceedings{chan2022investigating,
  author = {Chan, Kelvin C.K. and Zhou, Shangchen and Xu, Xiangyu and Loy, Chen Change},
  title = {RealBasicVSR: Investigating Tradeoffs in Real-World Video Super-Resolution},
  booktitle = {Proceedings of the IEEE conference on computer vision and pattern recognition},
  year = {2022}
}

Real-ESRGAN (ICCVW’2021)

Abstract

Though many attempts have been made in blind super-resolution to restore low-resolution images with unknown and complex degradations, they are still far from addressing general real-world degraded images. In this work, we extend the powerful ESRGAN to a practical restoration application (namely, Real-ESRGAN), which is trained with pure synthetic data. Specifically, a high-order degradation modeling process is introduced to better simulate complex real-world degradations. We also consider the common ringing and overshoot artifacts in the synthesis process. In addition, we employ a U-Net discriminator with spectral normalization to increase discriminator capability and stabilize the training dynamics. Extensive comparisons have shown its superior visual performance than prior works on various real datasets. We also provide efficient implementations to synthesize training pairs on the fly.
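
The high-order degradation model applies a classical degradation chain (blur, resize, noise, JPEG compression) more than once so that the synthetic LR images better resemble real footage. The snippet below sketches the idea with two rounds of simple OpenCV operations; the real pipeline randomizes kernel types, resize modes and noise models, and additionally injects sinc filters for the ringing/overshoot artifacts mentioned above.

```python
import cv2
import numpy as np

def one_degradation_round(img, rng):
    """One round of the classical blur -> resize -> noise -> JPEG model."""
    img = cv2.GaussianBlur(img, (0, 0), sigmaX=rng.uniform(0.5, 3.0))
    scale = rng.uniform(0.3, 1.0)
    img = cv2.resize(img, None, fx=scale, fy=scale, interpolation=cv2.INTER_LINEAR)
    img = np.clip(img + rng.normal(0, rng.uniform(1, 15), img.shape), 0, 255)
    quality = int(rng.uniform(30, 95))
    _, buf = cv2.imencode('.jpg', img.astype(np.uint8),
                          [cv2.IMWRITE_JPEG_QUALITY, quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR).astype(np.float64)

def second_order_degradation(hr_img, seed=0):
    """Apply the classical model twice (second order), as in Real-ESRGAN."""
    rng = np.random.default_rng(seed)
    img = hr_img.astype(np.float64)
    for _ in range(2):
        img = one_degradation_round(img, rng)
    return img.astype(np.uint8)
```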

Results and models

Evaluated on RGB channels. The metrics are PSNR/SSIM.

| Method | Set5 | Download |
| :--- | :---: | :---: |
| realesrnet_c64b23g32_12x4_lr2e-4_1000k_df2k_ost | 28.0297/0.8236 | model/log |
| realesrgan_c64b23g32_12x4_lr1e-4_400k_df2k_ost | 26.2204/0.7655 | model/log |

Citation

@inproceedings{wang2021real,
  title={Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic data},
  author={Wang, Xintao and Xie, Liangbin and Dong, Chao and Shan, Ying},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)},
  pages={1905--1914},
  year={2021}
}

SRCNN (TPAMI’2015)

Abstract

We propose a deep learning method for single image super-resolution (SR). Our method directly learns an end-to-end mapping between the low/high-resolution images. The mapping is represented as a deep convolutional neural network (CNN) that takes the low-resolution image as the input and outputs the high-resolution one. We further show that traditional sparse-coding-based SR methods can also be viewed as a deep convolutional network. But unlike traditional methods that handle each component separately, our method jointly optimizes all layers. Our deep CNN has a lightweight structure, yet demonstrates state-of-the-art restoration quality, and achieves fast speed for practical on-line usage. We explore different network structures and parameter settings to achieve trade-offs between performance and speed. Moreover, we extend our network to cope with three color channels simultaneously, and show better overall reconstruction quality.
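
The "lightweight structure" is just three convolutions (patch extraction, non-linear mapping, reconstruction) applied to a bicubic-upsampled input. A minimal sketch of a 9-1-5 variant (matching the k915 in the config name below); illustrative, not the exact MMEditing module:

```python
import torch.nn as nn

class SRCNN(nn.Module):
    """Three-layer SRCNN (9-1-5): patch extraction, non-linear mapping,
    reconstruction. Expects a bicubic-upsampled low-resolution image."""

    def __init__(self, channels=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2))

    def forward(self, upsampled_lr):
        return self.body(upsampled_lr)
```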

Results and models

Evaluated on RGB channels; scale pixels in each border are cropped before evaluation. The metrics are PSNR / SSIM.

| Method | Set5 | Set14 | DIV2K | Download |
| :--- | :---: | :---: | :---: | :---: |
| srcnn_x4k915_1x16_1000k_div2k | 28.4316 / 0.8099 | 25.6486 / 0.7014 | 27.7460 / 0.7854 | model \| log |

Citation

@article{dong2015image,
  title={Image super-resolution using deep convolutional networks},
  author={Dong, Chao and Loy, Chen Change and He, Kaiming and Tang, Xiaoou},
  journal={IEEE transactions on pattern analysis and machine intelligence},
  volume={38},
  number={2},
  pages={295--307},
  year={2015},
  publisher={IEEE}
}

SRGAN (CVPR’2016)

Abstract

Despite the breakthroughs in accuracy and speed of single image super-resolution using faster and deeper convolutional neural networks, one central problem remains largely unsolved: how do we recover the finer texture details when we super-resolve at large upscaling factors? The behavior of optimization-based super-resolution methods is principally driven by the choice of the objective function. Recent work has largely focused on minimizing the mean squared reconstruction error. The resulting estimates have high peak signal-to-noise ratios, but they are often lacking high-frequency details and are perceptually unsatisfying in the sense that they fail to match the fidelity expected at the higher resolution. In this paper, we present SRGAN, a generative adversarial network (GAN) for image super-resolution (SR). To our knowledge, it is the first framework capable of inferring photo-realistic natural images for 4x upscaling factors. To achieve this, we propose a perceptual loss function which consists of an adversarial loss and a content loss. The adversarial loss pushes our solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and original photo-realistic images. In addition, we use a content loss motivated by perceptual similarity instead of similarity in pixel space. Our deep residual network is able to recover photo-realistic textures from heavily downsampled images on public benchmarks. An extensive mean-opinion-score (MOS) test shows hugely significant gains in perceptual quality using SRGAN. The MOS scores obtained with SRGAN are closer to those of the original high-resolution images than to those obtained with any state-of-the-art method.
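
The perceptual loss described above combines a content loss on VGG feature maps with an adversarial term that pushes the generator toward the natural-image manifold. A minimal sketch using torchvision's VGG19; the layer index and weighting here are illustrative, not the exact SRGAN or MMEditing settings.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

class PerceptualLoss(torch.nn.Module):
    """Content loss on VGG19 features plus an adversarial term, as in SRGAN."""

    def __init__(self, feature_layer=35, adv_weight=1e-3):
        super().__init__()
        # the weights string requires a recent torchvision; older versions
        # use vgg19(pretrained=True) instead
        self.vgg = vgg19(weights='DEFAULT').features[:feature_layer].eval()
        for p in self.vgg.parameters():
            p.requires_grad = False
        self.adv_weight = adv_weight

    def forward(self, sr, hr, fake_logits):
        content = F.mse_loss(self.vgg(sr), self.vgg(hr))
        adversarial = F.binary_cross_entropy_with_logits(
            fake_logits, torch.ones_like(fake_logits))  # try to fool the discriminator
        return content + self.adv_weight * adversarial
```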

Results and models

Evaluated on RGB channels; scale pixels in each border are cropped before evaluation. The metrics are PSNR / SSIM.

| Method | Set5 | Set14 | DIV2K | Download |
| :--- | :---: | :---: | :---: | :---: |
| msrresnet_x4c64b16_1x16_300k_div2k | 30.2252 / 0.8491 | 26.7762 / 0.7369 | 28.9748 / 0.8178 | model \| log |
| srgan_x4c64b16_1x16_1000k_div2k | 27.9499 / 0.7846 | 24.7383 / 0.6491 | 26.5697 / 0.7365 | model \| log |

Citation

@inproceedings{ledig2016photo,
  title={Photo-realistic single image super-resolution using a generative adversarial network},
  author={Ledig, Christian and Theis, Lucas and Husz{\'a}r, Ferenc and Caballero, Jose and Cunningham, Andrew and Acosta, Alejandro and Aitken, Andrew and Tejani, Alykhan and Totz, Johannes and Wang, Zehan},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition workshops},
  year={2016}
}

TDAN (CVPR’2020)

Abstract

Video super-resolution (VSR) aims to restore a photo-realistic high-resolution (HR) video frame from both its corresponding low-resolution (LR) frame (reference frame) and multiple neighboring frames (supporting frames). Due to varying motion of cameras or objects, the reference frame and each support frame are not aligned. Therefore, temporal alignment is a challenging yet important problem for VSR. Previous VSR methods usually utilize optical flow between the reference frame and each supporting frame to wrap the supporting frame for temporal alignment. Therefore, the performance of these image-level wrapping-based models will highly depend on the prediction accuracy of optical flow, and inaccurate optical flow will lead to artifacts in the wrapped supporting frames, which also will be propagated into the reconstructed HR video frame. To overcome the limitation, in this paper, we propose a temporal deformable alignment network (TDAN) to adaptively align the reference frame and each supporting frame at the feature level without computing optical flow. The TDAN uses features from both the reference frame and each supporting frame to dynamically predict offsets of sampling convolution kernels. By using the corresponding kernels, TDAN transforms supporting frames to align with the reference frame. To predict the HR video frame, a reconstruction network taking aligned frames and the reference frame is utilized. Experimental results demonstrate the effectiveness of the proposed TDAN-based VSR model.

Results and models

Evaluated on the Y channel. 8 pixels in each border are cropped before evaluation. The metrics are PSNR / SSIM.

| Method | Vid4 (BIx4) | SPMCS-30 (BIx4) | Vid4 (BDx4) | SPMCS-30 (BDx4) | Download |
| :--- | :---: | :---: | :---: | :---: | :---: |
| tdan_vimeo90k_bix4_ft_lr5e-5_400k | 26.49/0.792 | 30.42/0.856 | 25.93/0.772 | 29.69/0.842 | model \| log |
| tdan_vimeo90k_bdx4_ft_lr5e-5_800k | 25.80/0.784 | 29.56/0.851 | 26.87/0.815 | 30.77/0.868 | model \| log |

Train

Train Instructions

You can use the following command to train a model.

./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]

TDAN is trained with two stages.

Stage 1: Train with a larger learning rate (1e-4)

./tools/dist_train.sh configs/restorers/tdan/tdan_vimeo90k_bix4_lr1e-4_400k.py 8

Stage 2: Fine-tune with a smaller learning rate (5e-5)

./tools/dist_train.sh configs/restorers/tdan/tdan_vimeo90k_bix4_ft_lr5e-5_400k.py 8

For more details, you can refer to the Train a model part in getting_started.

Test

Test Instructions

You can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [--out ${RESULT_FILE}] [--save-path ${IMAGE_SAVE_PATH}]

Example: Test TDAN on SPMCS-30 using Bicubic downsampling.

python tools/test.py configs/restorers/tdan/tdan_vimeo90k_bix4_ft_lr5e-5_400k.py checkpoints/SOME_CHECKPOINT.pth --save-path outputs/

For more details, you can refer to the Inference with pretrained models part in getting_started.

Citation

@InProceedings{tian2020tdan,
  title={TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution},
  author={Tian, Yapeng and Zhang, Yulun and Fu, Yun and Xu, Chenliang},
  booktitle = {Proceedings of the IEEE conference on Computer Vision and Pattern Recognition},
  year = {2020}
}

TOFlow (IJCV’2019)

Abstract

Many video enhancement algorithms rely on optical flow to register frames in a video sequence. Precise flow estimation is however intractable; and optical flow itself is often a sub-optimal representation for particular video processing tasks. In this paper, we propose task-oriented flow (TOFlow), a motion representation learned in a self-supervised, task-specific manner. We design a neural network with a trainable motion estimation component and a video processing component, and train them jointly to learn the task-oriented flow. For evaluation, we build Vimeo-90K, a large-scale, high-quality video dataset for low-level video processing. TOFlow outperforms traditional optical flow on standard benchmarks as well as our Vimeo-90K dataset in three video processing tasks: frame interpolation, video denoising/deblocking, and video super-resolution.
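
"Registering frames" with optical flow in practice means backward-warping a neighboring frame to the reference frame using the estimated flow; TOFlow's contribution is making that flow estimator trainable for the downstream task. A minimal sketch of the warping step:

```python
import torch
import torch.nn.functional as F

def flow_warp(frame, flow):
    """Backward-warp `frame` (N, C, H, W) with `flow` (N, 2, H, W), where
    flow[:, 0] is the horizontal and flow[:, 1] the vertical displacement
    in pixels."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    base = torch.stack((xs, ys), dim=0).float().to(frame)   # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                        # sampling positions
    # normalize to [-1, 1] for grid_sample, which expects (x, y) order
    coords_x = 2 * coords[:, 0] / (w - 1) - 1
    coords_y = 2 * coords[:, 1] / (h - 1) - 1
    grid = torch.stack((coords_x, coords_y), dim=-1)         # (N, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)
```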

Results and models

Evaluated on RGB channels. The metrics are PSNR / SSIM.

| Method | Vid4 | Download |
| :--- | :---: | :---: |
| tof_x4_vimeo90k_official | 24.4377 / 0.7433 | model |

Citation

@article{xue2019video,
  title={Video enhancement with task-oriented flow},
  author={Xue, Tianfan and Chen, Baian and Wu, Jiajun and Wei, Donglai and Freeman, William T},
  journal={International Journal of Computer Vision},
  volume={127},
  number={8},
  pages={1106--1125},
  year={2019},
  publisher={Springer}
}

TTSR (CVPR’2020)

Abstract

We study on image super-resolution (SR), which aims to recover realistic textures from a low-resolution (LR) image. Recent progress has been made by taking high-resolution images as references (Ref), so that relevant textures can be transferred to LR images. However, existing SR approaches neglect to use attention mechanisms to transfer high-resolution (HR) textures from Ref images, which limits these approaches in challenging cases. In this paper, we propose a novel Texture Transformer Network for Image Super-Resolution (TTSR), in which the LR and Ref images are formulated as queries and keys in a transformer, respectively. TTSR consists of four closely-related modules optimized for image generation tasks, including a learnable texture extractor by DNN, a relevance embedding module, a hard-attention module for texture transfer, and a soft-attention module for texture synthesis. Such a design encourages joint feature learning across LR and Ref images, in which deep feature correspondences can be discovered by attention, and thus accurate texture features can be transferred. The proposed texture transformer can be further stacked in a cross-scale way, which enables texture recovery from different levels (e.g., from 1x to 4x magnification). Extensive experiments show that TTSR achieves significant improvements over state-of-the-art approaches on both quantitative and qualitative evaluations.
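
The hard-attention step embeds the (upsampled) LR image as queries and the reference image as keys, finds the most relevant reference position for every LR position, and gathers the corresponding reference features for synthesis. A heavily simplified per-pixel sketch (the paper operates on patches and adds soft attention and cross-scale stacking):

```python
import torch

def hard_attention_transfer(q_feat, k_feat, v_feat):
    """q_feat: (C, H, W) LR-image queries; k_feat, v_feat: (C, H, W) reference
    keys/values. Returns transferred reference features aligned to the queries
    and a relevance (confidence) map usable for soft attention."""
    c, h, w = q_feat.shape
    q = torch.nn.functional.normalize(q_feat.view(c, -1), dim=0)   # (C, HW_lr)
    k = torch.nn.functional.normalize(k_feat.view(c, -1), dim=0)   # (C, HW_ref)
    relevance = k.t() @ q                                          # (HW_ref, HW_lr)
    best = relevance.argmax(dim=0)                                 # hard attention index
    transferred = v_feat.view(c, -1)[:, best]                      # gather ref features
    return transferred.view(c, h, w), relevance.max(dim=0).values.view(h, w)
```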

Results and models

Evaluated on RGB channels; scale pixels in each border are cropped before evaluation. The metrics are PSNR / SSIM.

| Method | scale | CUFED | Download |
| :--- | :---: | :---: | :---: |
| ttsr-rec_x4_c64b16_g1_200k_CUFED | x4 | 25.2433 / 0.7491 | model \| log |
| ttsr-gan_x4_c64b16_g1_500k_CUFED | x4 | 24.6075 / 0.7234 | model \| log |

Citation

@inproceedings{yang2020learning,
  title={Learning texture transformer network for image super-resolution},
  author={Yang, Fuzhi and Yang, Huan and Fu, Jianlong and Lu, Hongtao and Guo, Baining},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5791--5800},
  year={2020}
}