Assessing Image Aesthetic and Technical Quality with Multi-scale Transformers – Google AI Blog

Posted by Junjie Ke, Senior Software Engineer, and Feng Yang, Senior Staff Software Engineer, Google Research

Understanding the aesthetic and technical quality of images is important for providing a better visual experience to users. Image quality assessment (IQA) uses models to build a bridge between an image and a user’s subjective perception of its quality. In the deep learning era, many IQA approaches, such as NIMA, have achieved success by leveraging the power of convolutional neural networks (CNNs). However, CNN-based IQA models are often constrained by the fixed-size input requirement in batch training, i.e., the input images need to be resized or cropped to a fixed shape. This preprocessing is problematic for IQA because images can have very different aspect ratios and resolutions. Resizing and cropping can impact image composition or introduce distortions, thus changing the quality of the image.

In CNN-based models, images need to be resized or cropped to a fixed shape for batch training. However, such preprocessing can alter the image aspect ratio and composition, thus impacting image quality. Original image used under CC BY 2.0 license.

In “MUSIQ: Multi-scale Image Quality Transformer”, published at ICCV 2021, we propose a patch-based multi-scale image quality transformer (MUSIQ) to bypass the CNN constraints on fixed input size and predict image quality effectively on native-resolution images. The MUSIQ model supports the processing of full-size image inputs with varying aspect ratios and resolutions and allows multi-scale feature extraction to capture image quality at different granularities. To support positional encoding in the multi-scale representation, we propose a novel hash-based 2D spatial embedding combined with an embedding that captures the image scaling. We apply MUSIQ on four large-scale IQA datasets, demonstrating consistent state-of-the-art results across three technical quality datasets (PaQ-2-PiQ, KonIQ-10k, and SPAQ) and comparable performance to that of state-of-the-art models on the aesthetic quality dataset AVA.

The patch-based MUSIQ model can process the full-size image and extract multi-scale features, which better aligns with a person’s typical visual response.

In the following figure, we show a sample of images, their MUSIQ score, and, in brackets, their mean opinion score (MOS) from multiple human raters. Scores range from 0 to 100, with 100 being the highest perceived quality. As we can see from the figure, MUSIQ predicts high scores for images with high aesthetic quality and high technical quality, and it predicts low scores for images that are not aesthetically pleasing (low aesthetic quality) or that contain visible distortions (low technical quality).

Predicted MUSIQ score (and ground truth) on images from the KonIQ-10k dataset. Top: MUSIQ predicts high scores for high quality images. Middle: MUSIQ predicts low scores for images with low aesthetic quality, such as images with poor composition or lighting. Bottom: MUSIQ predicts low scores for images with low technical quality, such as images with visible distortion artifacts (e.g., blurry, noisy).

The Multi-scale Image Quality Transformer

MUSIQ tackles the challenge of learning IQA on full-size images. Unlike CNN models, which are often constrained to a fixed resolution, MUSIQ can handle inputs with arbitrary aspect ratios and resolutions.

To accomplish this, we first make a multi-scale representation of the input image, containing the native resolution image and its resized variants. To preserve the image composition, we maintain its aspect ratio during resizing. After obtaining the pyramid of images, we then partition the images at different scales into fixed-size patches that are fed into the model.

Illustration of the multi-scale image representation in MUSIQ.
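The pyramid-and-patch step described above can be sketched in a few lines. This is only an illustration of the bookkeeping, not the released implementation: the function names are hypothetical, and the patch size of 32 and the longer-side targets of 384 and 224 pixels are assumed values chosen for the example. Edge patches that do not fill a full square are assumed to be padded, so the patch counts round up.

```python
import math

def resize_keep_aspect(height, width, longer_side):
    """Return (h, w) with the longer side scaled to `longer_side`, aspect ratio preserved."""
    scale = longer_side / max(height, width)
    return max(1, round(height * scale)), max(1, round(width * scale))

def patch_grid(height, width, patch_size):
    """Rows and columns of patches; partial edge patches are counted (assumed padded)."""
    return math.ceil(height / patch_size), math.ceil(width / patch_size)

def multiscale_patch_counts(height, width, longer_sides=(384, 224), patch_size=32):
    """List (scale_index, (h, w), (rows, cols)) for the native image and each resized variant.

    `longer_sides` and `patch_size` are illustrative assumptions, not values from this post.
    """
    sizes = [(height, width)]
    sizes += [resize_keep_aspect(height, width, s) for s in longer_sides]
    return [(k, size, patch_grid(size[0], size[1], patch_size))
            for k, size in enumerate(sizes)]

if __name__ == "__main__":
    for scale_index, size, grid in multiscale_patch_counts(768, 1024):
        print(scale_index, size, grid, "patches:", grid[0] * grid[1])
```

Because no scale is cropped or squashed, every level of the pyramid keeps the 3:4 shape of the example input, and the number of patches varies with the image instead of being fixed in advance.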

Since patches are from images of different resolutions, we need to effectively encode the multi-aspect-ratio multi-scale input into a sequence of tokens, capturing the pixel, spatial, and scale information. To achieve this, we design three encoding components in MUSIQ: 1) a patch encoding module to encode patches extracted from the multi-scale representation; 2) a novel hash-based spatial embedding module to encode the 2D spatial position for each patch; and 3) a learnable scale embedding to encode different scales. In this way, we can effectively encode the multi-scale input as a sequence of tokens, serving as the input to the Transformer encoder.
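One way to picture the hash-based spatial embedding is as a proportional mapping from a patch's row and column onto a fixed-size grid of learnable embeddings, so that patches at the same relative location share a spatial embedding regardless of image size or scale. The sketch below assumes that reading; the function names, the grid size of 10, and the choice to sum the three embeddings element-wise are illustrative assumptions rather than details stated in this post.

```python
def hash_position(row, col, n_rows, n_cols, grid_size=10):
    """Map patch (row, col) in an n_rows x n_cols patch grid to a cell of a fixed grid.

    The proportional mapping and grid_size=10 are assumptions made for illustration.
    """
    return (row * grid_size) // n_rows, (col * grid_size) // n_cols

def encode_token(patch_vec, row, col, n_rows, n_cols, scale_idx,
                 spatial_table, scale_table):
    """Combine patch, spatial, and scale embeddings by element-wise addition (assumed)."""
    grid_size = len(spatial_table)
    ti, tj = hash_position(row, col, n_rows, n_cols, grid_size)
    spatial_vec = spatial_table[ti][tj]   # looked up by hashed 2D position
    scale_vec = scale_table[scale_idx]    # one vector per scale in the pyramid
    return [p + s + c for p, s, c in zip(patch_vec, spatial_vec, scale_vec)]

if __name__ == "__main__":
    # The centre patch of a 24x32 grid and of a 6x8 grid land in the same cell.
    print(hash_position(12, 16, 24, 32), hash_position(3, 4, 6, 8))
```

In a trained model the two tables would hold learned parameters; here they are plain nested lists so that the lookup logic is easy to follow.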

To predict the final image quality score, we use the standard approach of prepending an additional learnable “classification token” (CLS). The CLS token state at the output of the Transformer encoder serves as the final image representation. We then add a fully connected layer on top to predict the image quality score (IQS). The figure below provides an overview of the MUSIQ model.

Overview of MUSIQ. The multi-scale multi-resolution input is encoded by three components: the scale embedding (SCE), the hash-based 2D spatial embedding (HSE), and the multi-scale patch embedding (MPE).

Since MUSIQ only changes the input encoding, it is compatible with any Transformer variant. To demonstrate the effectiveness of the proposed method, in our experiments we use the classic Transformer with a relatively lightweight setting so that the model size is comparable to ResNet-50.

Benchmark and Evaluation

To evaluate MUSIQ, we run experiments on multiple large-scale IQA datasets. On each dataset, we report the Spearman’s rank correlation coefficient (SRCC) and Pearson linear correlation coefficient (PLCC) between our model prediction and the human evaluators’ mean opinion score. SRCC and PLCC are correlation metrics ranging from -1 to 1. Higher PLCC and SRCC mean better alignment between model prediction and human evaluation. The graph below shows that MUSIQ outperforms other methods on PaQ-2-PiQ, KonIQ-10k, and SPAQ.

Performance comparison of MUSIQ and previous state-of-the-art (SOTA) methods on four large-scale IQA datasets. On each dataset we compare the Spearman’s rank correlation coefficient (SRCC) and Pearson linear correlation coefficient (PLCC) of model prediction and ground truth.
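For readers unfamiliar with the two metrics, the following standard-library sketch shows how SRCC and PLCC can be computed from a list of predicted scores and a list of mean opinion scores. It uses the textbook definitions (Pearson correlation, and Pearson correlation of average ranks for Spearman); the function names and the toy numbers are made up for illustration and are not results from the paper.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC); undefined for constant inputs."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation coefficient (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

if __name__ == "__main__":
    predicted = [10, 20, 30, 40]   # toy model scores
    mos = [1, 4, 9, 16]            # toy mean opinion scores
    print("SRCC:", spearman(predicted, mos))
    print("PLCC:", pearson(predicted, mos))
```

In the toy data the predictions order the images exactly as the raters do, so SRCC is 1 (up to floating-point rounding), while PLCC is slightly lower because the relationship is monotonic but not linear.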

Notably, the PaQ-2-PiQ test set is entirely composed of large pictures having at least one dimension exceeding 640 pixels. This is very challenging for traditional deep learning approaches, which require resizing. MUSIQ outperforms previous methods by a large margin on the full-size test set, which verifies its robustness and effectiveness.

It is also worth mentioning that previous CNN-based methods often required sampling as many as 20 crops for each image during testing. This kind of multi-crop ensemble is a way to mitigate the fixed shape constraint in the CNN models. But since each crop is only a sub-view of the whole image, the ensemble is still an approximate approach. Moreover, CNN-based methods incur additional inference cost for every crop and, because they sample different crops, can introduce randomness in the result. In contrast, because MUSIQ takes the full-size image as input, it can directly learn the best aggregation of information across the full image and it only needs to run the inference once.

To further verify that the MUSIQ model captures different information at different scales, we visualize the attention weights on each image at different scales.

Attention visualization from the output tokens to the multi-scale representation, including the original resolution image and two proportionally resized images. Brighter areas indicate higher attention, which means that those areas are more important for the model output. Images for illustration are taken from the AVA dataset.

We observe that MUSIQ tends to focus on more detailed areas in the full, high-resolution images and on more global areas in the resized ones. For example, for the flower image above, the model’s attention on the original image focuses on the petal details, and the attention shifts to the buds at lower resolutions. This shows that the model learns to capture image quality at different granularities.

Conclusion

We propose a multi-scale image quality transformer (MUSIQ), which can handle full-size image input with varying resolutions and aspect ratios. By transforming the input image to a multi-scale representation with both global and local views, the model can capture the image quality at different granularities. Although MUSIQ is designed for IQA, it can be applied to other scenarios where task labels are sensitive to image resolution and aspect ratio. The MUSIQ model and checkpoints are available at our GitHub repository.

Acknowledgements

This work is made possible through a collaboration spanning several teams across Google. We’d like to acknowledge contributions from Qifei Wang, Yilin Wang and Peyman Milanfar.