Structural similarity index measure

The structural similarity index measure (SSIM) is a method for predicting the perceived quality of digital television and cinematic pictures, as well as other kinds of digital images and videos. It is also used for measuring the similarity between two images. The SSIM index is a full reference metric; in other words, the measurement or prediction of image quality is based on an initial uncompressed or distortion-free image as reference.

SSIM is a perception-based model that considers image degradation as perceived change in structural information, while also incorporating important perceptual phenomena, including both luminance masking and contrast masking terms. This distinguishes SSIM from other techniques such as mean squared error (MSE) or Peak signal-to-noise ratio (PSNR) that instead estimate absolute errors. Structural information is the idea that the pixels have strong inter-dependencies especially when they are spatially close. These dependencies carry important information about the structure of the objects in the visual scene. Luminance masking is a phenomenon whereby image distortions (in this context) tend to be less visible in bright regions, while contrast masking is a phenomenon whereby distortions become less visible where there is significant activity or "texture" in the image.

History

[edit]

The predecessor of SSIM was called Universal Quality Index (UQI), or Wang–Bovik Index, which was developed by Zhou Wang and Alan Bovik in 2001. This evolved, through their collaboration with Hamid Sheikh and Eero Simoncelli, into the current version of SSIM, which was published in April 2004 in the IEEE Transactions on Image Processing.[1] In addition to defining the SSIM quality index, the paper provides a general context for developing and evaluating perceptual quality measures, including connections to human visual neurobiology and perception, and direct validation of the index against human subject ratings.

The basic model was developed in the Laboratory for Image and Video Engineering (LIVE) at The University of Texas at Austin and further developed jointly with the Laboratory for Computational Vision (LCV) at New York University. Further variants of the model have been developed in the Image and Visual Computing Laboratory at University of Waterloo and have been commercially marketed.

SSIM subsequently found strong adoption in the image processing community and in the television and social media industries. The 2004 SSIM paper has been cited over 50,000 times according to Google Scholar,[2] making it one of the highest cited papers in the image processing and video engineering fields. It was recognized with the IEEE Signal Processing Society Best Paper Award for 2009.[3] It also received the IEEE Signal Processing Society Sustained Impact Award for 2016, indicative of a paper having an unusually high impact for at least 10 years following its publication. Because of its high adoption by the television industry, the authors of the original SSIM paper were each accorded a Primetime Engineering Emmy Award in 2015 by the Television Academy.

Algorithm

[edit]

The SSIM index is calculated on various windows of an image. The measure between two windows and of common size is:[4]

with:

  • the pixel sample mean of ;
  • the pixel sample mean of ;
  • the variance of ;
  • the variance of ;
  • the covariance of and ;
  • , two variables to stabilize the division with weak denominator;
  • the dynamic range of the pixel-values (typically this is );
  • and by default.

Formula components

[edit]

The SSIM formula is based on three comparison measurements between the samples of and : luminance (), contrast () and structure (). The individual comparison functions are:[4]

with, in addition to above definitions:

SSIM is then a weighted combination of those comparative measures:

Setting the weights to 1, the formula can be reduced to the form shown above.

Mathematical Properties

[edit]

SSIM satisfies the identity of indiscernibles, and symmetry properties, but not the triangle inequality or non-negativity, and thus is not a distance function. However, under certain conditions, SSIM may be converted to a normalized root MSE measure, which is a distance function.[5] The square of such a function is not convex, but is locally convex and quasiconvex,[5] making SSIM a feasible target for optimization.

Application of the formula

[edit]

In order to evaluate the image quality, this formula is usually applied only on luma, although it may also be applied on color (e.g., RGB) values or chromatic (e.g. YCbCr) values. The resultant SSIM index is a decimal value between -1 and 1, where 1 indicates perfect similarity, 0 indicates no similarity, and -1 indicates perfect anti-correlation. For an image, it is typically calculated using a sliding Gaussian window of size 11x11 or a block window of size 8×8. The window can be displaced pixel-by-pixel on the image to create an SSIM quality map of the image. In the case of video quality assessment,[6] the authors propose to use only a subgroup of the possible windows to reduce the complexity of the calculation.

Variants

[edit]

Multi-Scale SSIM

[edit]

A more advanced form of SSIM, called Multiscale SSIM (MS-SSIM)[4] is conducted over multiple scales through a process of multiple stages of sub-sampling, reminiscent of multiscale processing in the early vision system. It has been shown to perform equally well or better than SSIM on different subjective image and video databases.[4][7][8]

Multi-component SSIM

[edit]

Three-component SSIM (3-SSIM) is a form of SSIM that takes into account the fact that the human eye can see differences more precisely on textured or edge regions than on smooth regions.[9] The resulting metric is calculated as a weighted average of SSIM for three categories of regions: edges, textures, and smooth regions. The proposed weighting is 0.5 for edges, 0.25 for the textured and smooth regions. The authors mention that a 1/0/0 weighting (ignoring anything but edge distortions) leads to results that are closer to subjective ratings. This suggests that edge regions play a dominant role in image quality perception.

The authors of 3-SSIM have also extended the model into four-component SSIM (4-SSIM). The edge types are further subdivided into preserved and changed edges by their distortion status. The proposed weighting is 0.25 for all four components.[10]

Structural Dissimilarity

[edit]

Structural dissimilarity (DSSIM) may be derived from SSIM, though it does not constitute a distance function as the triangle inequality is not necessarily satisfied.

Video quality metrics and temporal variants

[edit]

It is worth noting that the original version SSIM was designed to measure the quality of still images. It does not contain any parameters directly related to temporal effects of human perception and human judgment.[7] A common practice is to calculate the average SSIM value over all frames in the video sequence. However, several temporal variants of SSIM have been developed.[11][6][12]

Complex Wavelet SSIM

[edit]

The complex wavelet transform variant of the SSIM (CW-SSIM) is designed to deal with issues of image scaling, translation and rotation. Instead of giving low scores to images with such conditions, the CW-SSIM takes advantage of the complex wavelet transform and therefore yields higher scores to said images. The CW-SSIM is defined as follows:

Where is the complex wavelet transform of the signal and is the complex wavelet transform for the signal . Additionally, is a small positive number used for the purposes of function stability. Ideally, it should be zero. Like the SSIM, the CW-SSIM has a maximum value of 1. The maximum value of 1 indicates that the two signals are perfectly structurally similar while a value of 0 indicates no structural similarity.[13]

SSIMPLUS

[edit]

The SSIMPLUS index is based on SSIM and is a commercially available tool.[14] It extends SSIM's capabilities, mainly to target video applications. It provides scores in the range of 0–100, linearly matched to human subjective ratings. It also allows adapting the scores to the intended viewing device, comparing video across different resolutions and contents.

According to its authors, SSIMPLUS achieves higher accuracy and higher speed than other image and video quality metrics. However, no independent evaluation of SSIMPLUS has been performed, as the algorithm itself is not publicly available.

cSSIM

[edit]

In order to further investigate the standard discrete SSIM from a theoretical perspective, the continuous SSIM (cSSIM)[15] has been introduced and studied in the context of Radial basis function interpolation.

SSIMULACRA

[edit]

SSIMULACRA and SSIMULACRA2 are variants of SSIM developed by Cloudinary with the goal of fitted to subjective opinion data. The variants operate in XYB color space and combine MS-SSIM with two types of asymmetric error maps for blockiness/ringing and smoothing/blur, common compression artifacts. SSIMULACRA2 is part of libjxl, the reference implementation of JPEG XL.[16][17]

Other simple modifications

[edit]

The r* cross-correlation metric is based on the variance metrics of SSIM. It's defined as r*(x, y) = σxy/σxσy when σxσy ≠ 0, 1 when both standard deviations are zero, and 0 when only one is zero. It has found use in analyzing human response to contrast-detail phantoms.[18]

SSIM has also been used on the gradient of images, making it "G-SSIM". G-SSIM is especially useful on blurred images.[19]

The modifications above can be combined. For example, 4-G-r* is a combination of 4-SSIM, G-SSIM, and r*. It is able to reflect radiologist preference for images much better than other SSIM variants tested.[20]

Application

[edit]

SSIM has applications in a variety of different problems. Some examples are:

  • Image Compression: In lossy image compression, information is deliberately discarded to decrease the storage space of images and video. The MSE is typically used in such compression schemes. According to its authors, using SSIM instead of MSE is suggested to produce better results for the decompressed images.[13]
  • Image Restoration: Image restoration focuses on solving the problem where is the blurry image that should be restored, is the blur kernel, is the additive noise and is the original image we wish to recover. The traditional filter which is used to solve this problem is the Wiener Filter. However, the Wiener filter design is based on the MSE. Using an SSIM variant, specifically Stat-SSIM, is claimed to produce better visual results, according to the algorithm's authors.[13]
  • Pattern Recognition: Since SSIM mimics aspects of human perception, it could be used for recognizing patterns. When faced with issues like image scaling, translation and rotation, the algorithm's authors claim that it is better to use CW-SSIM,[21] which is insensitive to these variations and may be directly applied by template matching without using any training sample. Since data-driven pattern recognition approaches may produce better performance when a large amount of data is available for training, the authors suggest using CW-SSIM in data-driven approaches.[21]

Performance comparison

[edit]

Due to its popularity, SSIM is often compared to other metrics, including more simple metrics such as MSE and PSNR, and other perceptual image and video quality metrics. SSIM has been repeatedly shown to significantly outperform MSE and its derivates in accuracy, including research by its own authors and others.[7][22][23][24][25][26]

A paper by Dosselmann and Yang claims that the performance of SSIM is "much closer to that of the MSE" than usually assumed. While they do not dispute the advantage of SSIM over MSE, they state an analytical and functional dependency between the two metrics.[8] According to their research, SSIM has been found to correlate as well as MSE-based methods on subjective databases other than the databases from SSIM's creators. As an example, they cite Reibman and Poole, who found that MSE outperformed SSIM on a database containing packet-loss–impaired video.[27] In another paper, an analytical link between PSNR and SSIM was identified.[28]

See also

[edit]

References

[edit]
  1. ^ Wang, Zhou; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. (2004-04-01). "Image quality assessment: from error visibility to structural similarity". IEEE Transactions on Image Processing. 13 (4): 600–612. Bibcode:2004ITIP...13..600W. CiteSeerX 10.1.1.2.5689. doi:10.1109/TIP.2003.819861. ISSN 1057-7149. PMID 15376593. S2CID 207761262.
  2. ^ "Google Scholar". scholar.google.com. Retrieved 2019-07-04.
  3. ^ "IEEE Signal Processing Society, Best Paper Award" (PDF).
  4. ^ a b c d Wang, Z.; Simoncelli, E.P.; Bovik, A.C. (2003-11-01). "Multiscale structural similarity for image quality assessment". The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003. Vol. 2. pp. 1398–1402 Vol.2. CiteSeerX 10.1.1.58.1939. doi:10.1109/ACSSC.2003.1292216. ISBN 978-0-7803-8104-9. S2CID 60600316.
  5. ^ a b Brunet, D.; Vass, J.; Vrscay, E. R.; Wang, Z. (April 2012). "On the mathematical properties of the structural similarity index" (PDF). IEEE Transactions on Image Processing. 21 (4): 2324–2328. Bibcode:2012ITIP...21.1488B. doi:10.1109/TIP.2011.2173206. PMID 22042163. S2CID 13739220.
  6. ^ a b Wang, Z.; Lu, L.; Bovik, A. C. (February 2004). "Video quality assessment based on structural distortion measurement". Signal Processing: Image Communication. 19 (2): 121–132. CiteSeerX 10.1.1.2.6330. doi:10.1016/S0923-5965(03)00076-6.
  7. ^ a b c Søgaard, Jacob; Krasula, Lukáš; Shahid, Muhammad; Temel, Dogancan; Brunnström, Kjell; Razaak, Manzoor (2016-02-14). "Applicability of Existing Objective Metrics of Perceptual Quality for Adaptive Video Streaming" (PDF). Electronic Imaging. 2016 (13): 1–7. doi:10.2352/issn.2470-1173.2016.13.iqsp-206. S2CID 26253431.
  8. ^ a b Dosselmann, Richard; Yang, Xue Dong (2009-11-06). "A comprehensive assessment of the structural similarity index". Signal, Image and Video Processing. 5 (1): 81–91. doi:10.1007/s11760-009-0144-1. ISSN 1863-1703. S2CID 30046880.
  9. ^ Li, Chaofeng; Bovik, Alan Conrad (2010-01-01). "Content-weighted video quality assessment using a three-component image model". Journal of Electronic Imaging. 19 (1): 011003–011003–9. Bibcode:2010JEI....19a1003L. doi:10.1117/1.3267087. ISSN 1017-9909.
  10. ^ Li, Chaofeng; Bovik, Alan C. (August 2010). "Content-partitioned structural similarity index for image quality assessment". Signal Processing: Image Communication. 25 (7): 517–526. doi:10.1016/j.image.2010.03.004.
  11. ^ "Redirect page". www.compression.ru.
  12. ^ Wang, Z.; Li, Q. (December 2007). "Video quality assessment using a statistical model of human visual speed perception" (PDF). Journal of the Optical Society of America A. 24 (12): B61–B69. Bibcode:2007JOSAA..24...61W. CiteSeerX 10.1.1.113.4177. doi:10.1364/JOSAA.24.000B61. PMID 18059915.
  13. ^ a b c Zhou Wang; Bovik, A.C. (January 2009). "Mean squared error: Love it or leave it? A new look at Signal Fidelity Measures". IEEE Signal Processing Magazine. 26 (1): 98–117. Bibcode:2009ISPM...26...98W. doi:10.1109/msp.2008.930649. ISSN 1053-5888. S2CID 2492436.
  14. ^ Rehman, A.; Zeng, K.; Wang, Zhou (February 2015). Rogowitz, Bernice E; Pappas, Thrasyvoulos N; De Ridder, Huib (eds.). "Display device-adapted video quality-of-experience assessment" (PDF). IS&T-SPIE Electronic Imaging, Human Vision and Electronic Imaging XX. Human Vision and Electronic Imaging XX. 9394: 939406. Bibcode:2015SPIE.9394E..06R. doi:10.1117/12.2077917. S2CID 1466973.
  15. ^ Marchetti, F. (January 2021). "Convergence rate in terms of the continuous SSIM (cSSIM) index in RBF interpolation" (PDF). Dolom. Res. Notes Approx. 14: 27–32.
  16. ^ "SSIMULACRA 2 - Structural SIMilarity Unveiling Local And Compression Related Artifacts". Cloudinary. 12 July 2023.
  17. ^ "Detecting the psychovisual impact of compression related artifacts using SSIMULACRA". Cloudinary Blog. 14 June 2017.
  18. ^ Prieto, Gabriel; Guibelalde, Eduardo; Chevalier, Margarita; Turrero, Agustín (21 July 2011). "Use of the cross-correlation component of the multiscale structural similarity metric (R* metric) for the evaluation of medical images: R* metric for the evaluation of medical images". Medical Physics. 38 (8): 4512–4517. doi:10.1118/1.3605634. PMID 21928621.
  19. ^ Chen, Guan-hao; Yang, Chun-ling; Xie, Sheng-li (October 2006). "Gradient-Based Structural Similarity for Image Quality Assessment". 2006 International Conference on Image Processing. pp. 2929–2932. doi:10.1109/ICIP.2006.313132. ISBN 1-4244-0480-0. S2CID 15809337.
  20. ^ Renieblas, Gabriel Prieto; Nogués, Agustín Turrero; González, Alberto Muñoz; Gómez-Leon, Nieves; del Castillo, Eduardo Guibelalde (26 July 2017). "Structural similarity index family for image quality assessment in radiological images". Journal of Medical Imaging. 4 (3): 035501. doi:10.1117/1.JMI.4.3.035501. PMC 5527267. PMID 28924574.
  21. ^ a b Gao, Y.; Rehman, A.; Wang, Z. (September 2011). CW-SSIM based image classification (PDF). IEEE International Conference on Image Processing (ICIP11).
  22. ^ Zhang, Lin; Zhang, Lei; Mou, X.; Zhang, D. (September 2012). "A comprehensive evaluation of full reference image quality assessment algorithms". 2012 19th IEEE International Conference on Image Processing. pp. 1477–1480. CiteSeerX 10.1.1.476.2566. doi:10.1109/icip.2012.6467150. ISBN 978-1-4673-2533-2. S2CID 10716320.
  23. ^ Zhou Wang; Wang, Zhou; Li, Qiang (May 2011). "Information Content Weighting for Perceptual Image Quality Assessment". IEEE Transactions on Image Processing. 20 (5): 1185–1198. Bibcode:2011ITIP...20.1185W. doi:10.1109/tip.2010.2092435. PMID 21078577. S2CID 106021.
  24. ^ Channappayya, S. S.; Bovik, A. C.; Caramanis, C.; Heath, R. W. (March 2008). "SSIM-optimal linear image restoration". 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 765–768. CiteSeerX 10.1.1.152.7952. doi:10.1109/icassp.2008.4517722. ISBN 978-1-4244-1483-3. S2CID 14830268.
  25. ^ Gore, Akshay; Gupta, Savita (2015-02-01). "Full reference image quality metrics for JPEG compressed images". AEU — International Journal of Electronics and Communications. 69 (2): 604–608. doi:10.1016/j.aeue.2014.09.002.
  26. ^ Wang, Z.; Simoncelli, E. P. (September 2008). "Maximum differentiation (MAD) competition: a methodology for comparing computational models of perceptual quantities" (PDF). Journal of Vision. 8 (12): 8.1–13. doi:10.1167/8.12.8. PMC 4143340. PMID 18831621.
  27. ^ Reibman, A. R.; Poole, D. (September 2007). "Characterizing packet-loss impairments in compressed video". 2007 IEEE International Conference on Image Processing. Vol. 5. pp. V – 77–V – 80. CiteSeerX 10.1.1.159.5710. doi:10.1109/icip.2007.4379769. ISBN 978-1-4244-1436-9. S2CID 1685021.
  28. ^ Hore, A.; Ziou, D. (August 2010). "Image Quality Metrics: PSNR vs. SSIM". 2010 20th International Conference on Pattern Recognition. pp. 2366–2369. doi:10.1109/icpr.2010.579. ISBN 978-1-4244-7542-1. S2CID 9506273.
[edit]