Invariance of deep image quality metrics to affine transformations (2024)

Nuria Alabau-Bosque1,2, Paula Daudén-Oliver2, Jorge Vila-Tomás2, Valero Laparra2, Jesús Malo2
1ValgrAI: Valencian Grad. School Research Network of AI, València, 46022, Spain
2Image Processing Lab, Universitat de València, Paterna, 46980, Spain

Abstract

Deep architectures are the current state-of-the-art in predicting subjective image quality. Usually, these models are evaluated according to their ability to correlate with human opinion in databases with a range of distortions that may appear in digital media. However, these databases overlook affine transformations, which may better represent the changes that actually happen to images in natural conditions. Humans can be particularly invariant to these natural transformations, as opposed to the digital ones.
In this work, we evaluate state-of-the-art deep image quality metrics by assessing their invariance to affine transformations, specifically rotation, translation, scaling, and changes in spectral illumination. We propose a methodology to assign invisibility thresholds for any perceptual metric. This methodology involves transforming the distance measured by an arbitrary metric to a common distance representation based on available subjectively rated databases. We psychophysically measure an absolute detection threshold in that common representation and express it in the physical units of each affine transform for each metric. By doing so, we make the analyzed metrics directly comparable with actual human thresholds. We find that none of the state-of-the-art metrics shows human-like results under this strong test based on invisibility thresholds.
This means that tuning the models exclusively to predict the visibility of generic distortions may disregard other properties of human vision, such as invariances or invisibility thresholds. The code to test other metrics is publicly available at https://github.com/Rietta5/InvarianceTestIQA.

Index Terms:

Perceptual Metrics, Perception Thresholds, Invariance to Affine Transformations

I Introduction

Deep architectures are the current state-of-the-art in predicting subjective image quality. These models, usually referred to as perceptual metrics, are often used as measures to optimize other models [1, 2]. This implies that assessing their performance can be critical in a wide range of applications. Usually, these models are evaluated (or even tuned) according to their ability to correlate with human opinion in databases including a wide range of generic distortions [3, 4]. In fundamental terms, this means trying to predict the visibility of generic distortions. However, an exclusive focus on generic distortions may leave aside other relevant phenomena [5], and human vision may also be described in terms of perceptual constancies or invariances [6, 7, 8, 9, 10, 11].

In this regard, considering the structural analysis of distortions and the high-level interpretation of human vision, people have suggested that humans are mainly invariant to transformations which do not change the structure of the scenes [12]. Affine transformations (rotations, translations, scalings, and changes in spectral illumination) are examples of distortions that do not change the structure of the scene. Therefore, humans should be relatively tolerant to them, and the corresponding models to assess image similarity should be invariant to these transformations too.

In fact, the spirit of the influential SSIM was focused on measuring changes of structure [13] to achieve invariance to irrelevant transformations [14]. Moreover, Wang and Simoncelli [12] decomposed generic distortions into structural and non-structural components so that the part not affecting the structure (e.g. affine transformations) could be processed, and weighted, differently. On the other hand, metrics with bio-inspired, explainable architectures [15, 16, 17, 18, 19, 20, 5, 4] work in multi-scale / multi-orientation domains where invariances can be introduced by means of appropriate poolings, as in the scattering transforms [21]. This sort of pooling is thought to happen in the visual brain, leading to invariances and texture metamers [22].

However, current state-of-the-art deep architectures for image quality [3] do not address the invariance problem in any way, while examples that try to apply the SSIM concept in deep nets [23] do not use invariances in simple or explicit ways. As a result, the analysis of invariance in deep image quality metrics remains an open question.

In this work, we compare the ability of metrics to be invariant to affine transformations in the same way as humans are. In particular, we propose to evaluate the metrics from the point of view of human detection thresholds: by assessing the invariance of the metrics to image transformations that are irrelevant (or invisible) to human observers. For example, the classical literature on visual thresholds determines the intensity of certain affine transformations that is invisible for humans [24, 25, 26, 27]. The sizes of invisibility thresholds are related to the more general concept of invariance to transformations. By definition, transformations whose intensity is below the threshold are invisible to the observer; one can then say that, in this region, the observer is invariant to the transform. Here we propose a methodology that allows us to assign specific invisibility thresholds per metric. This proposal uses transduction functions to a common internal representation based on a subjectively rated database and a psychophysically measured threshold in this representation. Then, we can (1) assess whether the thresholds for the metrics are comparable to those found for human observers, and (2) assess whether the sensitivities of a metric for the different distortions follow the same order as the human sensitivities (for instance, humans are more sensitive to rotation changes than to illumination changes). This proposed evaluation of invariance to distortions is a necessary complement to the conventional evaluation of the visibility of distortions because, as we will see, none of the studied metrics presents human-like thresholds or sensitivities for all the transformations considered.

The structure of the paper is as follows: Section II describes the proposed methodology to assess the human-like behavior of metric invariances. The proposed methodology consists of comparing the detection thresholds for humans and metrics. This proposal depends on several concepts (transduction functions, psychophysical thresholds for humans, theoretical thresholds for metrics, etc.) that will be detailed in this section too. Section III describes the experimental setting, and Section IV considers the results on the thresholds and the sensitivities. Finally, the discussion and conclusions are presented in Section V and Section VI, respectively.

II Proposed Methodology: Thresholds and Sensitivities

Given an original image, $i$, it can be distorted through a certain transform whose intensity depends on a parameter $\theta$, $i' = T_\theta(i)$. Image quality metrics are models that try to reproduce the subjective sensation of distance between the original image and the distorted image, $d(i, i')$. Human observers are unable to distinguish between $i$ and $i'$, i.e. they are invariant to the transform $T_\theta$, if its intensity is below a (human) threshold, $\theta_\tau^{H}$.

While invariance thresholds in human observers, $\theta_\tau^{H}$, are easy to understand and measure [24, 25, 26, 27], they are not obvious to define in artificial models of perceptual distance. The reason is simple: for the usual image quality models, any non-zero image distortion leads to non-zero variations in the distance. In this situation, where artificial distances are real-valued functions, one should define a value, the threshold distance, $\mathcal{D}_\tau$, below which the difference between the images can be disregarded (or taken as zero). Once this threshold distance is available (eventually in a common scale for all metrics), one can translate this threshold to the axis that measures the distortion intensity, $\theta$, and hence obtain the threshold of the metric, $\theta_\tau^{M}$, in the same units that have been measured for humans, $\theta_\tau^{H}$. See an illustration of this concept in Fig. 1.

The diagram in Fig. 1 displays the relation between the physical description of the intensity of a certain distortion in abscissas (the parameter $\theta$, e.g. the angle in a rotation transform) and the common internal description of the perceived distance in ordinates (the distance $\mathcal{D}$, e.g. the normalized Mean Opinion Score units explained below). This relation between the physical description of the intensity and the common perceptual distance is what we call the transduction function. The example displays three transduction functions, $\mathcal{D}_M = g_M(\theta)$, with $M = 1, 2, 3$, in blue, red, and green, for the three corresponding metrics. Transduction functions are monotonic: the application of a progressively increasing distortion along the abscissas leads to monotonic increments in distances in ordinates. The threshold distance, $\mathcal{D}_\tau$, in the internal representation is plotted in orange in Fig. 1. This threshold may be uncertain (as represented by the central value, solid line, and the quartile limits, dashed lines), but the empirical transduction functions can be used to put this internal threshold back in the axis that describes the distortion intensity: $\theta_\tau^{M} = g_M^{-1}(\mathcal{D}_\tau)$. In this way, one can check whether the actual invisibility threshold measured for humans (black line) is consistent with the threshold interval deduced for each metric. In our illustration, metric 2 (red line) is the only one compatible with human behavior.

[Figure 1: transduction functions of three example metrics, the common-representation threshold $\mathcal{D}_\tau$ (orange), and the corresponding thresholds in physical units.]

Using the above, on the one hand, we propose to evaluate the alignment between the metric models and human observers by comparing $\theta_\tau^{M}$ versus $\theta_\tau^{H}$. This is a strict comparison that only depends on the experimental uncertainty of the thresholds. On the other hand, we can define an alternative (less strict) comparison by considering the order among the sensitivities for the different distortions in humans and models. In particular, one can define the sensitivity of a metric to a distortion as the variation of the transduction function in terms of the energy (or Mean Squared Error) of the distortion introduced by the transform. The human sensitivity in detection is classically defined to be proportional to the inverse of the energy required to see the distortion [28].

The proposed comparisons of metric vs human thresholds and metric vs human sensitivities are general as long as one can address the following issues:

  (a) The transduction function (red, green, and blue curves in the illustration). For the $M$-th metric, $d_M(i, i')$, the relation between the physical description of the image transform and a common metric-independent distance domain has two components:

    (a.1) The (non-scaled) response function, which can be empirically computed by generating images distorted (transformed) with different intensities, $\theta$, and using the metric expression to compute the corresponding distances from the original image. This leads to the distances $d_M(\theta) = d_M(i, i') = d_M(i, T_\theta(i))$.

    (a.2) A metric equalization function transforming the previous (non-scaled) distance values into the common scale of the internal distance representation (what we called normalized DMOS units in the illustration). Here we propose to use auxiliary empirical data (e.g. certain subjectively rated databases) to scale the range of the different metrics: $\mathcal{D}_M = f_M(d_M)$. This makes the different $\mathcal{D}_M$ comparable.

    Then, the final (scaled) transduction is the composition of response and equalization: $\mathcal{D}(\theta) = g_M(\theta) = f_M(d_M(\theta))$.

  (b) The human thresholds, which can be defined in different domains:

    (b.1) The human threshold in the common internal representation, $\mathcal{D}_\tau$, orange line in the illustration of Fig. 1. In principle, this value is unknown. Here we propose a standard measurement of this threshold through a psychometric function [29] using distorted images of the selected subjectively rated database.

    (b.2) The human threshold in the input physical representation, $\theta_\tau^{H}$, black line in the illustration of Fig. 1. In this work, we explore two options: (i) take the values from the classical literature, which in general uses substantially different stimuli (synthetic as opposed to natural), and (ii) re-determine the thresholds in humans by using comparable natural stimuli and a separate psychometric function for each distortion.

  (c) The metric threshold, which can be expressed in intuitive physical units or in the metric's own units:

    (c.1) In physical units, $\theta_\tau^{M}$: the blue, red, and green points on the x-axis of the illustration of Fig. 1, computed as $\theta_\tau^{M} = g_M^{-1}(\mathcal{D}_\tau)$. These units are useful to compare with the equivalent human values.

    (c.2) In the metric's own units, $d_\tau^{M} = f_M^{-1}(\mathcal{D}_\tau)$. This value is particularly interesting, as it indicates a variation in the distance of each metric for which humans see no difference between the original and the distorted image. Distortions leading to distances below this value are invisible to humans and hence should be neglected.

  (d) The sensitivities of humans and metrics for the different distortions. The sensitivity to a small distortion is usually defined as the inverse of the energy required to be above the invisibility threshold [28]. In the case of metrics, this general definition reduces to the derivative of the transduction function with regard to the energy of the distortion [30].

Below, we elaborate on each of these factors in turn.

II-A Transduction: response and equalization

The response function of a metric, $d_M$, to a certain image transform, $T_\theta$, is just the average, over a set of images $\{i^s\}_{s=1}^{S}$, of the distances to the distorted images for different transform intensities:

$d_M(\theta) = \frac{1}{S}\sum_{s=1}^{S} d_M(i^s, T_\theta(i^s))$   (1)

which can be empirically computed from controlled distortions of the images of the dataset. Of course, when considering $M$ different metrics, their response functions, $d_M$, are not given in a common scale.
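The computation of Eq. 1 can be sketched as follows (a minimal illustration, not the authors' released code; the RMSE metric, the pixel-shift transform, and the random image set are placeholder assumptions):

```python
# Minimal sketch of Eq. 1: average response of a generic metric d_M to a
# parametric transform T_theta over a small set of images.
import numpy as np

def rmse(a, b):
    """Euclidean (RMSE) distance, used here as a stand-in for any metric d_M."""
    return np.sqrt(np.mean((a - b) ** 2))

def shift_columns(img, pixels):
    """Toy transform T_theta: horizontal translation by an integer number of pixels."""
    return np.roll(img, pixels, axis=1)

def response_function(images, metric, transform, thetas):
    """d_M(theta): distance between original and transformed images, averaged over the set."""
    return np.array([
        np.mean([metric(img, transform(img, th)) for img in images])
        for th in thetas
    ])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = [rng.random((32, 32)) for _ in range(5)]   # placeholder image set
    thetas = np.arange(0, 6)                            # transform intensities
    print(response_function(images, rmse, shift_columns, thetas))
```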

In this work, we propose to use auxiliary empirical data to determine a common distance scale, $\mathcal{D}$, for any metric. In particular, subjectively rated image quality databases (e.g. TID [31]) consist of pairs of original and distorted images, $\{(i^p, {i^p}')\}_{p=1}^{P}$, with associated subjective scores, the so-called Mean Opinion Scores, $\{\textrm{DMOS}^p\}_{p=1}^{P}$. The databases contain a wide set of generic distortions of different intensities, thus ranging from invisible distortions to highly noticeable distortions. In this setting, the DMOS values in the database can be normalized to lie in the [0,1] range. In this way, the extreme values of the normalized DMOS represent an invisible distortion and the biggest subjective distortion in the database, respectively. Therefore, if the range of distortions in the database is wide, the variations induced by $T_\theta$ will be within the limits of the normalized DMOS, and hence it can be used to set the common distance scale $\mathcal{D} = \textrm{norm DMOS} \in [0,1]$.

Using the normalized DMOS values and the image pairs of a large subjectively rated database, one can fit an equalization function, $f_M$, for each metric to transform the non-scaled response $d_M$ into the common scale $\mathcal{D}$:

$\mathcal{D} = f_M(d_M(i, i')) = a_M \cdot d_M(i, i')^{\,b_M}$   (2)

where a power function with $a_M > 0$ and $0 < b_M \neq 1$ is chosen because of the nature of the DMOS, which changes rapidly for low distortion intensities (low values of $\theta$) and saturates for bigger distortions (big values of $\theta$). This is due to the maximum number of comparisons performed when building the database. An example of an equalization function is shown in Figure 2.
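A possible sketch of this fit (assuming synthetic distance and DMOS data and SciPy least-squares fitting; not the released implementation) is shown below. The inverse of the fitted power law is also included, since it is what Section II-C and Table IV use to go back to the metric's own units:

```python
# Minimal sketch of Eq. 2: fit the equalization f_M(d) = a_M * d**b_M mapping
# raw metric distances to normalized DMOS in [0, 1], and its inverse f_M^{-1}.
import numpy as np
from scipy.optimize import curve_fit

def equalization(d, a, b):
    return a * np.power(d, b)

def fit_equalization(raw_distances, norm_dmos):
    """Least-squares fit of (a_M, b_M); norm_dmos is assumed already in [0, 1]."""
    (a, b), _ = curve_fit(equalization, raw_distances, norm_dmos, p0=[1.0, 0.5])
    return a, b

def invert_equalization(D, a, b):
    """f_M^{-1}: common-scale distance back to the metric's own units."""
    return (D / a) ** (1.0 / b)

if __name__ == "__main__":
    # Synthetic example: raw distances and saturating (noisy) opinion scores.
    d = np.linspace(0.01, 1.0, 50)
    dmos = np.clip(0.9 * d ** 0.4 + 0.02 * np.random.default_rng(0).normal(size=d.size), 0, 1)
    a, b = fit_equalization(d, dmos)
    print(a, b, invert_equalization(0.44, a, b))   # 0.44 is the threshold of Section II-B
```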

[Figure 2: example of an equalization function $f_M$ mapping raw metric distances to normalized DMOS.]

In summary, (a.1) an application of the distance metric to a controlled set of transformed images, and (a.2) a fit of an equalization function that maps the arbitrary scale of the distance into a common scale given by the set of distortions in a wide subjectively rated database, give us the transduction function of the metric, made of Eqs. 1 and 2:

$\mathcal{D} = g_M(\theta) = f_M \circ d_M(\theta)$   (3)

In our work, we compute these transduction functions for:

  • Six distance metrics (four state-of-the-art deep-learning metrics[3, 4, 32, 23] and two convenient references: the Euclidean metric, RMSE, and the classical SSIM[13]).

  • Four affine transformations: translation, rotation, scale, and change of spectral illumination.

  • Four datasets to transform the images and compute the response functions: MNIST[33], CIFAR10[34], ImageNet[35], and TID2013[31].

  • One subjectively rated dataset, TID2013[31], to define the equalization functions to the common internal distance representation.

II-B Human thresholds

In this section, we measure the human thresholds from two different points of view. First, we use distorted images from a subjectively rated database to measure the threshold in the common internal representation. Then, we measure it in the input physical representation using natural stimuli.

B.1. Human thresholds in the common internal representation

To find out whether the metrics behave similarly to humans, we need to determine the human invisibility threshold, $\mathcal{D}_\tau$. This is the distance in normalized DMOS units below which humans cannot tell the difference between $i$ and $i'$ for low distortion intensities $\theta$. For that, we need a database with ratings and opinions of observers to which we can assign an invisibility threshold value. In this case, we use the TID2013 database [31].

The value of $\mathcal{D}_\tau$ could be roughly estimated by visual inspection of the images presented in Appendix A: images with low values of normalized DMOS (below 0.3) cannot be discriminated from the original, while images with big normalized DMOS (above 0.6) are clearly distinct from the original. However, a more accurate estimation of such a threshold is obtained from the psychometric function in a constant stimulus experiment [29], as applied in other fields within computer vision [36, 37, 38, 39, 40].

In this kind of experiment, given a set of distorted images, one computes the probability that an observer sees some distorted image as different from the original, i.e. $P(\mathcal{D} \geq \mathcal{D}_\tau)$. This is done using the two-alternative forced choice (2AFC) paradigm: each distorted image is randomly presented to the observer together with the original, and the observer is forced to choose the distorted one. This is repeated $R$ times, so the probability, $P(\mathcal{D} \geq \mathcal{D}_\tau)$, is given by the number of correct responses over $R$. Note that if $\mathcal{D} \ll \mathcal{D}_\tau$ the observer will not see the difference and the probability of a correct answer will be 0.5. At the other extreme, if $\mathcal{D} \gg \mathcal{D}_\tau$ the answer will be obvious for the observer, and the probability of a correct answer will be 1. As a result, the threshold can be defined as the point where $P(\mathcal{D} \geq \mathcal{D}_\tau) = 0.75$.

We look for the optimal threshold, $\mathcal{D}_\tau$, and slope, $k$, that best fit the experimental data using the following sigmoid [29]:

$p(\mathcal{D}) = \frac{1}{2} + \frac{1}{2\,(1 + e^{-k(\mathcal{D} - \mathcal{D}_\tau)})}$   (4)

Note that this expression enforces that for extremely distorted images $\lim_{\mathcal{D}\to\infty} p(\mathcal{D}) = 1$, and that the probability at the threshold, $\mathcal{D} = \mathcal{D}_\tau$, is 0.75, as required.
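The fit of Eq. 4 can be sketched on simulated 2AFC data as follows (the trial counts, generating parameters, and initial values are assumptions used only to make the example self-contained; this is not the experimental record):

```python
# Minimal sketch: fit the psychometric function of Eq. 4 to proportions of
# correct 2AFC responses and read off the threshold D_tau at P = 0.75.
import numpy as np
from scipy.optimize import curve_fit

def psychometric(D, k, D_tau):
    """Eq. 4: chance level 0.5, saturating at 1, equal to 0.75 at D = D_tau."""
    return 0.5 + 0.5 / (1.0 + np.exp(-k * (D - D_tau)))

def fit_threshold(D_levels, prop_correct):
    (k, D_tau), _ = curve_fit(psychometric, D_levels, prop_correct, p0=[10.0, 0.5])
    return k, D_tau

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    D_levels = np.linspace(0.0, 1.0, 20)              # 20 tested distances, as in the paper
    true_p = psychometric(D_levels, k=12.0, D_tau=0.44)
    prop = rng.binomial(75, true_p) / 75.0            # 15 trials x 5 observers per level
    k, D_tau = fit_threshold(D_levels, prop)
    print(f"estimated D_tau = {D_tau:.2f}, slope k = {k:.1f}")
```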

In our experiment, the psychometric function has been evaluated at 20 different values of $\mathcal{D}$, repeated 15 times for 5 observers, resulting in $20 \times 15 \times 5 = 1500$ forced choices in total. The fitted psychometric function is shown in Figure 3 and shows that the value of the threshold for humans in the common internal distance representation is

$\mathcal{D}_\tau = 0.44 \pm 0.05$

where the corresponding quartiles give the uncertainty. This threshold corresponds to the orange lines in Figure 1.

[Figure 3: psychometric function fitted to the 2AFC data, giving the human threshold in the common internal representation.]

B.2. Human threshold in physical units

Humans are not equally sensitive to all transformations or to different degrees of distortion, $\theta$. In fact, there are certain thresholds at which these distortions are not perceptible, $\theta_\tau^{H}$, known as invisibility thresholds. Initially, we planned to use the thresholds from the classical literature [24, 25, 26] but, considering that those thresholds were measured with synthetic stimuli and not with natural images, we decided to run the experiments ourselves. To derive the new thresholds with human data, we followed the psychometric function methodology of Equation 4. Contrary to the literature, these experiments have been done with natural images, in order to use the same type of images that the models use. In particular, images from ImageNet have been used to find the invisibility thresholds in a more realistic context. For the geometric affine transformations (Section III-A), the procedure follows the classic psychometric function methodology (Section II-B). The results of the new measurements on humans with natural stimuli can be seen in Figure 4.

[Figure 4: psychometric functions for the geometric affine transformations measured with natural stimuli.]

Illuminant transformations are treated slightly differently because, when changing the spectral illuminant, we move in a 2D chromaticity space. On the one hand, the MacAdam ellipses [27] do not include an ellipse centered at the true white, coordinates $(1/3, 1/3)$ in the chromaticity diagram, which is the starting point of the employed illuminants. This can be solved by fitting a new ellipse at this point from the closest existing ellipses. On the other hand, experimentally measuring a new ellipse with natural images would imply, in our case, fitting 20 psychometric functions. Instead, we only obtain the psychometric functions corresponding to the major and minor axes of the ellipse: the Yellow-Blue (YB) and Red-Green (RG) chromatic directions, respectively. For a more accurate fit, two intermediate hues have also been measured. The fitted psychometric functions for these 4 chromatic directions are shown in Figure 5.

[Figure 5: psychometric functions for the four chromatic directions (YB, RG, and two intermediate hues).]

Once these points are measured, we can fit a new experimental ellipse obtained only with natural images. Both ellipses, the one interpolated from MacAdam's data and the experimental one, have been obtained with the calculations from [41] and can be seen in Figure 6.
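As a rough illustration of such a fit, the sketch below performs a generic centered least-squares conic fit to hypothetical threshold points (this is not the exact procedure of [41]; the directions and radii are made up for the example):

```python
# Minimal sketch: fit a discrimination ellipse centered at the white point from
# threshold points measured along a few chromatic directions in (x, y).
import numpy as np

WHITE = np.array([1 / 3, 1 / 3])

def fit_centered_ellipse(points, center=WHITE):
    """Least-squares fit of A dx^2 + B dx dy + C dy^2 = 1 to threshold points."""
    d = np.asarray(points) - center
    design = np.column_stack([d[:, 0] ** 2, d[:, 0] * d[:, 1], d[:, 1] ** 2])
    coeffs, *_ = np.linalg.lstsq(design, np.ones(len(d)), rcond=None)
    return coeffs  # (A, B, C)

def semi_axes(coeffs):
    """Semi-axes of the fitted ellipse from the eigenvalues of its quadratic form."""
    A, B, C = coeffs
    eigvals = np.linalg.eigvalsh(np.array([[A, B / 2], [B / 2, C]]))
    return 1 / np.sqrt(eigvals)   # larger value = major semi-axis

if __name__ == "__main__":
    # Hypothetical thresholds along YB, RG and two intermediate directions,
    # mirrored through the white point.
    dirs = np.deg2rad([90, 0, 45, 135])
    radii = np.array([0.012, 0.006, 0.009, 0.008])
    offsets = radii[:, None] * np.column_stack([np.cos(dirs), np.sin(dirs)])
    points = np.vstack([WHITE + offsets, WHITE - offsets])
    print(semi_axes(fit_centered_ellipse(points)))
```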

[Figure 6: interpolated MacAdam ellipse and experimental ellipse fitted from the natural-image thresholds.]

Table I summarizes the results with synthetic stimuli (obtained from the classical literature) and natural stimuli (experimentally measured in this work).

TABLE I: Human invisibility thresholds measured with synthetic stimuli (classical literature) and natural stimuli (this work).

Affine transformation | Synthetic stimuli      | Natural stimuli
Translation           | 0.024 degrees [24]     | 0.23 degrees
Rotation              | 3 degrees [25]         | 3.6 degrees
Scale                 | 1.03 scale factor [26] | 1.026 scale factor
Color discriminant    | MacAdam ellipse [27]   | Experimental ellipse (Fig. 6)

From Table I we see that the rotation and scale thresholds are practically equivalent, but this is not the case for the translation threshold. This can be attributed to the fact that the experiments with synthetic stimuli can be considered close to a hyperacuity setting: they are based on an experiment in which images are shown in successive comparisons rather than in simultaneous comparisons, as in our case. Our experiments employ natural images and do not enforce the hyperacuity setting, which results in a considerably higher detection threshold for this transformation. Another, more recent, work we could consider in this area is the set of thresholds measured in 2016 with Gabor stimuli [42]. In that case, the human rotation threshold was found to be 2.7 degrees and the human translation threshold the equivalent of 0.12 degrees of translation. These measurements are quite consistent with ours. Taking into account that the methodology for obtaining the threshold is the same as in our case, the difference in the thresholds is given exclusively by the type of stimulus used. For these reasons, the translation detection threshold from the literature will not be considered in the results: because of the differences in the procedure for obtaining it and in its interpretation in the case of the classical literature, and because of the type of stimulus used in the recent literature.

II-C Metric thresholds in physical units

Following the methodology in Figure 1, we can obtain the metric thresholds in physical units, $\theta_\tau^{M}$; for example, the rotation threshold is expressed in degrees. Through the function $g_M(\theta)$, specific to each metric, a value in physical units is assigned to the threshold in the common internal representation, computed as $\theta_\tau^{M} = g_M^{-1}(\mathcal{D}_\tau)$. These values can be compared numerically with the thresholds obtained for humans, as they are expressed in the same units in which the psychometric functions were measured (Section II-B). Therefore, a metric can be said to have human behavior if the invisibility thresholds for a certain affine transformation coincide, $\theta_\tau^{H} = \theta_\tau^{M}$. In particular, since $\mathcal{D}_\tau$ has an associated uncertainty, $\theta_\tau^{M}$ also has a confidence interval. In this case, we check whether the human threshold falls within the confidence interval of each metric.
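A minimal sketch of this inversion (assuming a sampled, monotonic transduction curve; the toy curve and the rotation example are assumptions, not results) is:

```python
# Minimal sketch: put the common-representation threshold D_tau (and its
# uncertainty) back into physical units via the monotonic transduction g_M,
# i.e. theta_tau^M = g_M^{-1}(D_tau), and check the human threshold against it.
import numpy as np

def invert_transduction(thetas, D_of_theta, D_value):
    """Numerical inverse of a sampled monotonic transduction g_M(theta)."""
    return np.interp(D_value, D_of_theta, thetas)

def metric_threshold_interval(thetas, D_of_theta, D_tau=0.44, D_err=0.05):
    lo = invert_transduction(thetas, D_of_theta, D_tau - D_err)
    mid = invert_transduction(thetas, D_of_theta, D_tau)
    hi = invert_transduction(thetas, D_of_theta, D_tau + D_err)
    return lo, mid, hi

def is_human_like(interval, theta_tau_human):
    lo, _, hi = interval
    return lo <= theta_tau_human <= hi

if __name__ == "__main__":
    thetas = np.linspace(0, 10, 101)            # e.g. rotation angle in degrees
    D_of_theta = 1 - np.exp(-thetas / 4)        # toy monotonic transduction
    interval = metric_threshold_interval(thetas, D_of_theta)
    print(interval, is_human_like(interval, theta_tau_human=3.6))
```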

This test is particularly demanding because it is strictly quantitative, i.e. a numerical comparison. It is to be expected that the metrics will not be able to reproduce these thresholds exactly. This is why an alternative, less demanding, test is proposed, reflecting the qualitative behavior of the metrics in the face of these distortions.

II-D Sensitivities of metrics and humans for the different distortions

Taking into account that the methodology proposed above is particularly demanding, we propose another test. Having calculated the thresholds for metrics and humans, we can define sensitivities for both and compare them. However, this comparison will be qualitative, not quantitative, so this test is less demanding for metrics.

The sensitivity to a small distortion is usually defined as the inverse of the energy required to be above the invisibility threshold (Eq. 5), i.e. expressing the distortion $\theta_\tau^{H}$ in RMSE units. In the case of metrics, this general definition reduces to the derivative of the transduction function with respect to the energy of the distortion, i.e. its slope.

$S = \frac{1}{\,|i - T_{\theta_\tau^{H,M}}(i)|_2\,}$   (5)

In this way, we can obtain the sensitivity of a metric for each affine transform, order them, and check whether the order matches the human one. A real example can be found in Figure 7 (the remaining figures can be found in Appendix D), where the vertical lines represent the human thresholds expressed in terms of energy and the different curves describe the behavior of that metric for the various transformations. Given two curves, the most sensitive is the one with the higher slope.

In Figure 7, the human thresholds indicate that the human sensitivity order (given by the vertical lines) is: scale, translation, rotation, RG, and YB illuminants. However, the metric shown returns the following order (given by the slopes of the curves): scale, rotation, translation, YB, and RG illuminants. Even if the ordering does not strictly match, it is worth noting that this metric maintains more sensitivity to geometric than to chromatic distortions.
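As an illustration, the sketch below evaluates Eq. 5 for a few geometric transforms at hypothetical threshold intensities using generic scipy.ndimage operations (the image, the threshold values, and the interpolation settings are assumptions, not the paper's exact pipeline) and returns the most-to-least sensitive ordering:

```python
# Minimal sketch of Eq. 5: sensitivity as the inverse of the RMSE energy
# introduced at the threshold intensity, and the resulting ordering.
import numpy as np
from scipy import ndimage

def sensitivity(image, transformed):
    """S = 1 / ||i - T_{theta_tau}(i)||_2, with the norm taken as RMSE."""
    energy = np.sqrt(np.mean((image - transformed) ** 2))
    return 1.0 / energy

def scale_about_center(img, factor):
    """Shape-preserving scaling about the image center (a geometric affine transform)."""
    center = (np.array(img.shape) - 1) / 2
    return ndimage.affine_transform(img, np.eye(2) / factor,
                                    offset=center - center / factor, order=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = ndimage.gaussian_filter(rng.random((64, 64)), 2)   # smooth stand-in image
    # Transformed versions at hypothetical threshold intensities.
    at_threshold = {
        "translation": ndimage.shift(img, (0, 2), order=1),            # 2-pixel shift
        "rotation":    ndimage.rotate(img, 3.6, reshape=False, order=1),
        "scale":       scale_about_center(img, 1.026),
    }
    sens = {name: sensitivity(img, out) for name, out in at_threshold.items()}
    # Most-to-least sensitive ordering, to be compared with the human ordering.
    print(sorted(sens, key=sens.get, reverse=True))
```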

[Figure 7: transduction curves of one metric for the different transformations as a function of distortion energy, with the human thresholds shown as vertical lines.]

III Experimental Settings

In this section, we review the selected affine transformations, models, and databases that we use in the experiments. The code to test other metrics is publicly available at https://github.com/Rietta5/InvarianceTestIQA.

III-A Affine Transformations

An affine transformation or affine map (also called an affinity) between two affine spaces is a transformation that satisfies Equation 6:

$F: v \to Mv + b$   (6)

where $v$ can be any vector and the affine transformation is represented by a matrix $M$ and a vector $b$. It satisfies the following properties: first, it maintains the collinearity (and coplanarity) relations between points and, second, it maintains the ratios between distances along a line.

Here, we apply affine transformations in two cases: to the spatial domain (Equation 7) and to the image samples (Equation 8), i.e. modifications within the image vector:

$i'(x) = i(Mx + b)$   (7)
$i'(x) = M\,i(x) + b$   (8)
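These two cases can be sketched as follows (a minimal illustration with generic scipy.ndimage resampling and a toy diagonal color matrix; the specific matrices, interpolation order, and image data are assumptions rather than the transformations used in the experiments):

```python
# Minimal sketch of the two uses of an affine map: Eq. 7 acts on the spatial
# domain (geometric transforms), Eq. 8 acts on the pixel values (e.g. an
# illuminant-like change).
import numpy as np
from scipy import ndimage

def domain_affine(img, M, b):
    """i'(x) = i(Mx + b): resample the image on affinely transformed coordinates (Eq. 7)."""
    return ndimage.affine_transform(img, M, offset=b, order=1)

def value_affine(img_rgb, M, b):
    """i'(x) = M i(x) + b: apply the affine map to each pixel's color vector (Eq. 8)."""
    return np.clip(img_rgb @ M.T + b, 0.0, 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gray = rng.random((32, 32))
    angle = np.deg2rad(3.6)
    R = np.array([[np.cos(angle), -np.sin(angle)],
                  [np.sin(angle),  np.cos(angle)]])
    rotated = domain_affine(gray, R, b=np.zeros(2))   # small rotation (about the origin)

    rgb = rng.random((32, 32, 3))
    warm = np.diag([1.05, 1.0, 0.95])                 # toy illuminant-like color matrix
    tinted = value_affine(rgb, warm, b=np.zeros(3))
    print(rotated.shape, tinted.shape)
```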

Some examples of affine transformations are geometric contraction, expansion, dilation, reflection, rotation, or shear. In this work, we focus specifically on translation, rotation, scale, and illuminant changes. For each tested affine transformation, the original images are modified in the following ways:

  • Translation: Displacements on the vertical and horizontal axes (and combinations of them) with an amplitude of 0.3 degrees of translation in each direction (left, right, up, and down). Given the symmetry in both displacement directions, we calculate the average and show only the displacements to the right in the graphs.

  • Rotation: Rotations from -10 to 10 degrees, in steps of 0.1 degree. Again, given the symmetry in both displacement directions, only positive rotations will be shown in the graphs.

  • Scale: Scale factors range from 0.1 to 2, but do not have a fixed step. Only scale factors that return images of even size are used. This ensures that only scales that do not force a translation are applied. Only scale factors bigger than 1 will be shown.

  • Illuminant changes: We have desaturated the original images and modified the illuminant for 20 hues at 8 saturations, following the distribution on the chromaticity locus indicated in Figure 8.

    [Figure 8: distribution of the tested illuminants on the chromaticity diagram.]

As a summary, examples of all the affine transformations emulated in this work can be seen in Figure 9, and examples of all illuminant changes at different intensities in Figure 11.

[Figure 9: examples of the affine transformations applied in this work.]

III-B Datasets

In terms of dataset selection, we chose four different datasets covering a wide range of features. On the one hand, we have MNIST [33] for black-and-white images, a set that is simple both to understand and to modify. On the other hand, for color images, we selected CIFAR-10 [34] for low-resolution color images, and ImageNet [35] and TID2013 [31] for high-resolution color images.

Specifically, from each dataset we selected 250 images to reduce the computational burden of applying all affine transformations and comparing all metrics, and the images were modified so that, when applying the transformations, the results would not present new artifacts and the central element would not disappear. For MNIST, whose images are originally 28x28, we simply enlarged the images to 56x56 by adding black pixels around the original digits, which gives us more room to move them. For the color images, to avoid the appearance of black borders when applying some transformations, we decided to generate a mosaic. In addition, the images in the mosaic are mirrored to make the transitions between the different copies smoother. Once the mosaic is created, the transformation is applied and a patch of the original size is taken, thus preserving the size while including the modification. Figure 10 shows an example of the resulting modified images.
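A minimal sketch of this mirrored-mosaic strategy (the 3x3 mosaic, the interpolation order, and the rotation example are assumptions) is:

```python
# Minimal sketch: build a mirrored mosaic around a color image, apply a
# geometric transform, and crop back to the original size so that no black
# borders or new artifacts appear.
import numpy as np
from scipy import ndimage

def mirrored_mosaic(img):
    """3x3 mosaic of the image with mirror effect so transitions are smooth."""
    return np.pad(img, ((img.shape[0],) * 2, (img.shape[1],) * 2, (0, 0)), mode="reflect")

def transform_and_crop(img, angle_deg):
    """Rotate the mosaic, then take the central patch with the original size."""
    h, w = img.shape[:2]
    big = mirrored_mosaic(img)
    rotated = ndimage.rotate(big, angle_deg, axes=(0, 1), reshape=False, order=1)
    return rotated[h:2 * h, w:2 * w]

if __name__ == "__main__":
    img = np.random.default_rng(0).random((64, 64, 3))
    out = transform_and_crop(img, angle_deg=5.0)
    print(out.shape)   # (64, 64, 3): same size, no black corners
```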

[Figure 10: examples of the modified images (padded MNIST digits and mirrored-mosaic color images) used to apply the transformations.]

III-C Metrics

We choose these metrics because of their relevance in the context of perceptual metrics, either because of their age and widespread use, or because of the good results they have obtained in other works and areas of study. In this work, the following are used:

  • Mean Squared Error (MSE): Measures the average squared difference between corresponding pixels in two images. It is a basic metric for image quality assessment, often used in image denoising and restoration tasks. However, it does not always align well with human perception.

  • Structural Similarity Index (SSIM) [13]: For two images, the original and the distorted one, it computes three different comparisons: luminance, contrast, and structure. SSIM provides a more perceptually meaningful measure than MSE.

  • Learned Perceptual Image Patch Similarity (LPIPS) [3]: It passes the images through a VGG network trained on ImageNet and computes distances in different feature spaces. Then, it performs a weighted sum of these distances so that the correlation with a perceptual database is maximized.

  • Deep Image Structural Similarity (DISTS) [23]: As in LPIPS, it uses the VGG network, but in addition it computes an SSIM-like index at different intermediate layers. This index combines sensitivity to structural distortions with tolerance to textures sampled elsewhere in the image.

  • PerceptNet [4]: It proposes an architecture to reflect the structure and stages of the early human visual system by considering a succession of canonical linear and non-linear operations. The network is trained to maximize correlation with a perceptual database.

  • Perceptual Information Metric (PIM) [32]: Unlike the other metrics, its training is based on two principles: efficient coding and slowness. On the one hand, it is compressive and, on the other, it captures persistent temporal information. This is achieved by training the network with frames extracted from videos and separated by very short periods of time, which should make it more robust to small variations in object position or subtle changes in lighting.

IV Results

Table II summarizes the numerical results of the experiments. For each geometric affine transformation, we present the discrimination thresholds per metric, $\theta_\tau^{M}$. In the case of illuminant changes, we show the errors made with respect to the interpolated MacAdam ellipse and the experimentally fitted ellipse. As the ellipses are very similar (Figure 6), the errors are very similar. In the cases where they did not coincide completely, the mean of the errors has been calculated and marked with an asterisk (*) in the table. The figures from which these numerical results are derived are given in Appendix B and Appendix C to avoid cluttering the main text.

As we are considering both the human thresholds extracted from the literature, $\theta_\tau^{H_L}$, and the thresholds obtained by ourselves with natural images, $\theta_\tau^{H_N}$, a result is marked in bold if $\theta_\tau^{M}$ falls within $\theta_\tau^{H_L}$ and is underlined if it falls within $\theta_\tau^{H_N}$. Note that a result can be marked in both ways at the same time. A metric with many highlighted cells is a metric with good performance from the point of view of human invariance.

TABLE II: Invisibility thresholds in physical units, $\theta_\tau^{M}$, per metric, affine transformation, and dataset, together with the human thresholds.

Translation | RMSE | SSIM | LPIPS | DISTS | PerceptNet | PIM | Human
MNIST | 0.025±0.003 | 0.030±0.004 | 0.05±0.03 | 0.034±0.008 | 0.031±0.004 | 0.05±0.03 |
CIFAR-10 | 0.033±0.004 | 0.032±0.004 | 0.05±0.03 | 0.036±0.019 | 0.05±0.02 | 0.13±0.12 |
TID13 | 0.027±0.003 | 0.028±0.003 | 0.034±0.009 | 0.08±0.10 | 0.0029±0.003 | 0.06±0.04 | 0.23
ImageNet | 0.036±0.010 | 0.033±0.004 | 0.05±0.04 | 0.13±0.15 | 0.036±0.010 | 0.15±0.13 |

Rotation | RMSE | SSIM | LPIPS | DISTS | PerceptNet | PIM | Human
MNIST | 2±1 | 2.9±1.5 | 3.7±2 | 1.2±0.8 | 4.0±1.7 | 10.7±8.1 | 3
CIFAR-10 | 1.1±0.6 | 1.5±0.7 | 1.4±1.2 | 0.7±0.4 | 3.6±1.8 | 1.7±0.9 |
TID13 | 0.10±0.03 | 0.11±0.04 | 0.10±0.05 | 0.1±0.3 | 0.11±0.04 | 0.09±0.08 | 3.6
ImageNet | 0.19±0.09 | 0.15±0.06 | 0.15±0.11 | 0.11±0.05 | 0.22±0.09 | 0.2±0.5 |

Scale | RMSE | SSIM | LPIPS | DISTS | PerceptNet | PIM | Human
MNIST | 1.10±0.04 | 1.19±0.12 | 1.2±0.2 | 1.089±0.010 | 1.16±0.07 | 1.3±0.2 | 1.03
CIFAR-10 | 1.052±0.006 | 1.055±0.007 | 1.061±0.007 | 1.054±0.007 | 1.067±0.018 | 1.06±0.03 |
TID13 | 1.047±0.006 | 1.048±0.006 | 1.057±0.007 | 1.062±0.008 | 1.040±0.005 | 1.059±0.007 | 1.026
ImageNet | 1.0080±0.0010 | 1.0074±0.0009 | 1.0086±0.0011 | 1.00813±0.00010 | 1.0082±0.0010 | 1.009±0.005 |

Illuminant | RMSE | SSIM | LPIPS | DISTS | PerceptNet | PIM | Human
MNIST | 0.03±0.06* | - | - | 0.016±0.018* | - | 0.05±0.05* | MacAdam
CIFAR-10 | 0.007±0.009* | 0.03±0.05* | 0.03±0.03* | 0.005±0.013* | 0.05±0.13* | 0.06±0.06* |
TID13 | 0.010±0.010* | 0.02±0.02* | 0.02±0.02 | 0.02±0.02 | 0.05±0.09* | 0.005±0.012* | Our ellipse
ImageNet | 0.008±0.010 | 0.02±0.02* | 0.02±0.02 | 0.02±0.02 | 0.05±0.09* | 0.005±0.012* |

The results in Table II show that, following our strong invariance criterion, there is no clear winner: no model shows the required robustness across affine transformations and types of stimuli.

The comparison between the ordering of the human sensitivities and the sensitivities of the metrics can be seen in Table III, where, even if no metric shows complete human behavior, most of them maintain the chromatic ordering.

TABLE III: Sensitivity orderings (from most to least sensitive) for humans and metrics on each dataset (S: scale, R: rotation, T: translation, RG and YB: red-green and yellow-blue chromatic directions).

           | MNIST               | CIFAR-10            | TID13               | ImageNet
Human      | S > R > T > RG > YB | S > R > T > RG > YB | S > T > R > RG > YB | S > T > R > RG > YB
SSIM       | R > S > T > RG > YB | YB > T > S > RG > R | YB > S > R > T > RG | YB > T > S > RG > R
LPIPS      | R > RG > YB > S > T | R > YB > RG > S > T | YB > RG > R > S > T | YB > R > RG > T > S
DISTS      | RG > S > YB > R > T | R > RG > YB > S > T | R > RG > YB > S > T | R > S > RG > YB > T
PerceptNet | S > T > R > RG > YB | T > S > RG > R > YB | S > R > T > YB > RG | T > S > R > YB > RG
PIM        | RG > YB > R > S > T | R > RG > YB > S > T | RG > YB > R > S > T | YB > RG > R > S > T

V Discussion

As we have seen in Table II, bearing in mind that no model is highlighted for all the related transformations, the conclusion is that there is no metric that behaves, in terms of invariance, like a human observer. We can dissect the behavior for the different affine transformations as follows:

  • Translation: There is no particular model that shows human-like translation invariance, but DISTS and PIM are the best in this regard.

  • Rotation: PerceptNet shows the most human-like behavior, but it is very dependent on the database used.

  • Scale: There are isolated highlighted cells. In general, all the models seem less sensitive than humans, except on the ImageNet database, where the threshold is an order of magnitude smaller than the human threshold.

  • Illuminants: Two models rise above the rest: DISTS performs better with small images, while PIM performs better with big images. It caught our attention that PerceptNet's discrimination ellipse has a radically different orientation. We attribute this to the chromatic transformations applied in its first layers, which make all the images greener, as shown in Figure 12.

Turning to Table III, it is not surprising to see that the ordering of the geometric transformations for the human observers does not depend on the chosen dataset. Regarding the metrics' performance, PerceptNet is the only metric that almost maintains the ordering of being more sensitive to geometric than to chromatic transformations. All the metrics are generally more sensitive to YB than to RG, which matches human performance. All in all, no metric can be said to have human-like behavior, even with the least strict test.

In addition, a collateral result of our work is a specific invisibility threshold for each metric, that is, a threshold in the units of each metric indicating the variations in distance that are imperceptible to the human eye. These values are shown in Table IV.

TABLE IV: Invisibility thresholds in the units of each metric.

                       | RMSE  | SSIM  | LPIPS | DISTS | PerceptNet | PIM
Invisibility threshold | 0.020 | 0.016 | 0.017 | 0.017 | 0.008      | 3.09

VI Conclusion

As a complement to the usual reproduction of subjective quality ratings, we argue that the invisibility thresholds of perceptual metrics should also correspond to the invisibility thresholds of humans. This direct comparison (particularly interesting for affine transformations) is a strong test for metrics because human thresholds can be accurately measured with classical psychophysics experiments. This comparison requires (a) data of human thresholds and (b) metric thresholds.

Regarding human thresholds, we used both classical values and values explicitly measured with the same kind of natural images employed by the perceptual metrics. For the metric thresholds, we propose a methodology to assign invisibility regions to any particular metric, which can be expressed in the units of the metric itself or in physical units to facilitate direct comparison with human psychophysics results.

We also propose a less restrictive test: instead of reproducing the exact threshold values, we evaluate whether the metric just matches the sensitivity ordering. That is, using sensitivity as the inverse of the threshold energy, one can check whether a metric is more sensitive to one distortion or to another. This ordering can then be compared to the one in humans.

Making the demanding comparison between human and metric thresholds for a range of state-of-the-art deep image quality metrics shows that none of the studied metrics (deep models as well as RMSE and SSIM) succeeds under these criteria: they do not reproduce all the human thresholds for the different affine transformations, and they do not reproduce the order of sensitivities in humans. No metric is capable of reproducing both the invisibility thresholds and the human sensitivities.

This means that tuning the models exclusively to predict quality ratings may disregard other properties of human vision, such as invariances or invisibility thresholds.


Acknowledgment

This work was supported in part by MICIIN/FEDER/UE under Grant PID2020-118071GB-I00 and PDC2021-121522-C21, in part by Spanish MIU under Grant FPU21/02256, in part by Generalitat Valenciana under Projects GV/2021/074, CIPROM/2021/056 and CIAPOT/2021/9 and in part by valgrAI - GVA. Some computer resources were provided by Artemisa, funded by the European Union ERDF and Comunitat Valenciana as well as the technical support provided by the Instituto de Física Corpuscular, IFIC (CSIC-UV).

References

  • [1]M.Eckert and A.Bradley, “Perceptual quality metrics applied to still image compression,” Signal Processing, vol.70, pp.177–200, 1998.
  • [2]K.Egiazarian, M.Ponomarenko, V.Lukin, and O.Ieremeiev, “Statistical evaluation of visual quality metrics for image denoising,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.6752–6756, 2018.
  • [3]R.Zhang, P.Isola, A.A. Efros, E.Shechtman, and O.Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.586–595, 2018.
  • [4]A.Hepburn, V.Laparra, J.Malo, R.McConville, and R.Santos-Rodriguez, “Perceptnet: A human visual system inspired neural network for estimating perceptual distance,” in 2020 IEEE International Conference on Image Processing (ICIP), IEEE, oct 2020.
  • [5]M.Martinez-Garcia, M.Bertalmío, and J.Malo, “In praise of artifice reloaded: Caution with subjective image quality databases,” Frontiers in Neuroscience, 2019.
  • [6]D.H. Kelly, “Motion and vision. ii. stabilized spatio-temporal threshold surface,” J. Opt. Soc. Am., vol.69, pp.1340–1349, Oct 1979.
  • [7]L.T. Maloney and B.A. Wandell, “Color constancy: a method for recovering surface spectral reflectance,” in Readings in Computer Vision (M.A. Fischler and O.Firschein, eds.), pp.293–297, San Francisco (CA): Morgan Kaufmann, 1987.
  • [8]M.D. Fairchild, Color Appearance Models. John Wiley and Sons, Ltd, 2013.
  • [9]D.J. Heeger, “Normalization of cell responses in cat striate cortex,” Visual Neuroscience, vol.9, no.2, p.181–197, 1992.
  • [10]G.Mather, F.Verstraten, and S.Anstis, The Motion Aftereffect: A Modern Perspective. MIT Press, 1998.
  • [11]M.Webster, “Adaptation and visual coding,” Journal of vision, vol.11, 05 2011.
  • [12]Z.Wang and E.Simoncelli, “An adaptive linear system framework for image distortion analysis,” in IEEE International Conference on Image Processing 2005, vol.3, pp.III–1160, 2005.
  • [13]Z.Wang, A.Bovik, H.Sheikh, and E.Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol.13, no.4, pp.600–612, 2004.
  • [14]Z.Wang and A.C. Bovik, “Mean squared error: Love it or leave it? a new look at signal fidelity measures,” IEEE Signal Processing Magazine, vol.26, no.1, pp.98–117, 2009.
  • [15]J.Malo, A.Pons, and J.Artigas, “Subjective image fidelity metric based on bit allocation of the human visual system in the dct domain,” Image and Vision Computing, vol.15, pp.535–548, 1997.
  • [16]V.Laparra, J.Muñoz, and J.Malo, “Divisive normalization image quality metric revisited,” Journal of the Optical Society of America. A, Optics, image science, and vision, vol.27, pp.852–64, 04 2010.
  • [17]R.K. Mantiuk, K.J. Kim, A.G. Rempel, and W.Heidrich, “Hdr-vdp-2: a calibrated visual metric for visibility and quality predictions in all luminance conditions,” ACM SIGGRAPH 2011 papers, 2011.
  • [18]V.Laparra, J.Ballé, A.Berardino, and E.Simoncelli, “Perceptual image quality assessment using a normalized laplacian pyramid,” Electronic Imaging, vol.2016, pp.1–6, 02 2016.
  • [19]V.Laparra, A.Berardino, J.Ballé, and E.P. Simoncelli, “Perceptually optimized image rendering,” Journal of the Optical Society of America A, vol.34, p.1511, Aug. 2017.
  • [20]M.Martinez-Garcia, P.Cyriac, T.Batard, M.Bertalmío, and J.Malo, “Derivatives and inverse of cascaded linear+nonlinear neural models,” PLOS ONE, vol.13, p.e0201326, Oct. 2018.
  • [21]J.Bruna and S.Mallat, “Invariant scattering convolution networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, no.8, pp.1872–1886, 2013.
  • [22]J.Freeman and E.Simoncelli, “Metamers of the visual stream,” Nature neuroscience, vol.14, pp.1195–201, 08 2011.
  • [23]K.Ding, K.Ma, S.Wang, and E.P. Simoncelli, “Image quality assessment: Unifying structure and texture similarity,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp.1–1, 2020.
  • [24]G.E. Legge and F.Campbell, “Displacement detection in human vision,” Vision Research, vol.21, no.2, pp.205–213, 1981.
  • [25]D.Regan, R.Gray, and S.Hamstra, “Evidence for a neural mechanism that encodes angles,” Vision Research, vol.36, no.2, pp.323–IN3, 1996.
  • [26]R.Teghtsoonian, “On the exponents in stevens’ law and the constant in ekman’s law,” Psychological review, 1971.
  • [27]D.L. MacAdam, “Visual sensitivities to color differences in daylight,” J. Opt. Soc. Am., vol.32, pp.247–274, May 1942.
  • [28]F.Campbell and J.Robson, “Application of fourier analysis to the visibility of gratings,” The Journal of Physiology, 08 1968.
  • [29]F.A. Kingdom and N.Prins, Psychophysics: A Practical Introduction. Academic Press, 2009.
  • [30]A.Hepburn, V.Laparra, R.Santos-Rodriguez, J.Ballé, and J.Malo, “On the relation between statistical learning and perceptual distances,” in International Conference on Learning Representations, 2022.
  • [31]N.Ponomarenko, L.Jin, O.Ieremeiev, V.Lukin, K.Egiazarian, J.Astola, B.Vozel, K.Chehdi, M.Carli, F.Battisti, and C.-C. Jay Kuo, “Image database tid2013: Peculiarities, results and perspectives,” Signal Processing: Image Communication, vol.30, pp.57–77, 2015.
  • [32]S.Bhardwaj, I.Fischer, J.Ballé, and T.Chinen, “An unsupervised information-theoretic perceptual quality metric,” 2021.
  • [33]Y.Lecun, L.Bottou, Y.Bengio, and P.Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol.86, no.11, pp.2278–2324, 1998.
  • [34]A.Krizhevsky, “Learning multiple layers of features from tiny images,” 2009.
  • [35]J.Deng, W.Dong, R.Socher, L.-J. Li, K.Li, and L.Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp.248–255, 2009.
  • [36]R.Shi, K.Ngan, S.Li, R.Paramesran, and H.Li, “Visual quality evaluation of image object segmentation: Subjective assessment and objective metric,” IEEE Transactions on Image Processing, vol.24, pp.1–1, 08 2015.
  • [37]W.J. Scheirer, S.E. Anthony, K.Nakayama, and D.D. Cox, “Perceptual annotation: Measuring human vision to improve computer vision,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.36, no.8, pp.1679–1686, 2014.
  • [38]J.Johnson, E.Krupinski, M.Yan, H.Roehrig, A.Graham, and R.Weinstein, “Using a visual discrimination model for the detection of compression artifacts in virtual pathology images,” IEEE Trans. Med. Imaging, vol.30, pp.306–314, 01 2011.
  • [39]T.S.A. Wallis and P.J. Bex, “Image correlates of crowding in natural scenes,” Journal of vision, vol.12 7, 2011.
  • [40]A.Ninassi, F.Autrusseau, and P.LeCallet, “Pseudo no reference image quality metric using perceptual data hiding,” Human Vision and Electronic Imaging, 02 2006.
  • [41]R.Halíř and J.Flusser, “Numerically stable direct least squares fitting of ellipses,” 1998.
  • [42]A.Baldwin, M.Fu, R.Farivar, and R.Hess, “The equivalent internal orientation and position noise for contour integration,” Scientific Reports, vol.7, 10 2017.