Evaluation and Metrics


Evaluation

Participants receive a set of low-SNR (signal-to-noise ratio) images for training. At prediction time, they are expected to produce the denoised counterparts of these images.

During evaluation, the predictions are compared to a hidden set of high-SNR images using metrics frequently used in image reconstruction: PSNR and SSIM, along with their extensions SI-PSNR and MS-SSIM.

You can find an example submission container here: https://github.com/ai4life-opencalls/AI4Life-MDC24-example-submission


Metrics

1. PSNR

Peak Signal-to-Noise Ratio (PSNR) is a widely used metric for assessing the quality of reconstructed images compared to the original ones. It is often used in image compression and restoration applications. When comparing two images, a higher PSNR value indicates that the reconstructed image is of higher quality and has a lower reconstruction error.

PSNR is defined as:

$$\mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right)$$

where $\mathrm{MAX}_I$ is the maximum possible pixel value of the image; for example, when pixels are represented using 8 bits per sample, this is 255. In microscopy, we usually use the range of values in the ground-truth data instead. $\mathrm{MSE}$ is the Mean Squared Error between the original and reconstructed images.

Here, we use the PSNR implementation from the scikit-image package.
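For illustration, here is a minimal sketch of how this could be computed; the arrays and noise model are made up, and the data range is taken from the ground-truth image as described above:

    import numpy as np
    from skimage.metrics import peak_signal_noise_ratio

    # Hypothetical ground-truth and prediction arrays (e.g., loaded from TIFF files)
    gt = np.random.rand(256, 256).astype(np.float32)
    pred = gt + np.random.normal(0, 0.05, gt.shape).astype(np.float32)

    # Use the range of the ground-truth data as MAX_I
    psnr = peak_signal_noise_ratio(gt, pred, data_range=gt.max() - gt.min())
    print(f"PSNR: {psnr:.2f} dB")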

2. Scale-invariant PSNR

This metric is described in Luo, Yi, and Nima Mesgarani, "TasNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation," 2018.

The SI-PSNR metric is invariant to the scale of the signals being compared: if one signal is a scaled version of the other, the score does not change. This addresses a limitation of traditional SNR and PSNR metrics, which are sensitive to changes in signal amplitude.

SI-PSNR is defined as:

$$\text{SI-PSNR} = 10 \cdot \log_{10}\left(\frac{\lVert s_{\text{target}} \rVert^2}{\lVert e_{\text{noise}} \rVert^2}\right)$$

where:

$$s_{\text{target}} = \frac{\langle \hat{s}, s \rangle \, s}{\lVert s \rVert^2}, \qquad e_{\text{noise}} = \hat{s} - s_{\text{target}}$$

Here $\hat{s}$ and $s$ are the estimated and target clean sources, respectively; both are normalized to have zero mean to ensure scale invariance.

Here, we use the scale-invariant implementation from the careamics package.
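The exact careamics function is not reproduced here; the following is a minimal NumPy sketch of one common way to compute a scale-invariant PSNR, assuming zero-mean signals, a least-squares scale fit of the prediction onto the ground truth, and a data range taken from the ground truth:

    import numpy as np

    def scale_invariant_psnr(gt: np.ndarray, pred: np.ndarray) -> float:
        # Remove the means so that only relative structure matters
        gt = gt.astype(np.float64) - gt.mean()
        pred = pred.astype(np.float64) - pred.mean()
        # Least-squares scale factor mapping the prediction onto the ground truth
        alpha = (gt * pred).sum() / (pred * pred).sum()
        mse = np.mean((gt - alpha * pred) ** 2)
        # Range of the ground-truth data, as used throughout this page
        data_range = gt.max() - gt.min()
        return 10 * np.log10(data_range**2 / mse)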

3. SSIM

The Structural Similarity Index (SSIM) stands out as a metric closely aligned with human vision. It compares two images based on luminance, contrast, and structure of image windows, and it ranges from -1 to 1, where 1 indicates perfect similarity. 

This metric was first described in Wang, Zhou, Alan Bovik, Hamid Sheikh, and Eero Simoncelli, "Image Quality Assessment: From Error Visibility to Structural Similarity," 2004.

SSIM is defined as:

$$\mathrm{SSIM}(x, y) = [l(x, y)]^{\alpha} \cdot [c(x, y)]^{\beta} \cdot [s(x, y)]^{\gamma}$$

Where:

  • Luminance: $l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}$, measured from the mean pixel values $\mu_x$ and $\mu_y$.

  • Contrast: $c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}$, measured from the pixel values' standard deviations (square roots of the variances) $\sigma_x$ and $\sigma_y$.

  • Structure: $s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}$, measured by dividing the covariance between the signals, $\sigma_{xy}$, by the product of their standard deviations.

The exponents $\alpha > 0$, $\beta > 0$, and $\gamma > 0$ denote the relative importance of each component, and $C_1$, $C_2$, $C_3$ are small constants that stabilize the divisions.

Here, we use the SSIM implementation from the scikit-image package.
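As a minimal sketch (the array names and noise model are made up):

    import numpy as np
    from skimage.metrics import structural_similarity

    # Hypothetical ground-truth and prediction arrays
    gt = np.random.rand(256, 256).astype(np.float32)
    pred = gt + np.random.normal(0, 0.05, gt.shape).astype(np.float32)

    # As with PSNR, the data range is taken from the ground-truth data
    score = structural_similarity(gt, pred, data_range=gt.max() - gt.min())
    print(f"SSIM: {score:.3f}")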

4. Multiscale SSIM

MS-SSIM was described in Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale Structural Similarity for Image Quality Assessment," 2003.

A drawback of SSIM is that it is a single-scale approach, while the correct scale depends on viewing conditions (e.g., display resolution and viewing distance). The authors propose computing the structural similarity score over multiple image scales and calibrating the parameters that weigh the relative importance of the scales.

MS-SSIM is defined as:

$$\text{MS-SSIM}(x, y) = [l_M(x, y)]^{\alpha_M} \cdot \prod_{j=1}^{M} [c_j(x, y)]^{\beta_j} \, [s_j(x, y)]^{\gamma_j}$$

Each scale $j$ is an image obtained by iteratively applying a low-pass filter and downsampling the filtered image by a factor of 2. The luminance score $l_M$ is computed only at the coarsest scale, $M$, obtained after $M - 1$ iterations. The exponents are set so that $\alpha_j = \beta_j = \gamma_j$ for all $j$, and their values are fixed for the different scales.

This metric accounts for variations in image resolution and viewing conditions, providing a more comprehensive assessment of perceived image quality than the original SSIM.

Here, we use the implementation from https://github.com/VainF/pytorch-msssim.
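A minimal sketch of calling that package follows; the tensors here are made up, and the images must be large enough to survive the repeated 2x downsampling (roughly more than 160 pixels per side with the default settings):

    import torch
    from pytorch_msssim import ms_ssim  # pip install pytorch-msssim

    # Hypothetical images as (batch, channel, height, width) tensors in [0, 1]
    gt = torch.rand(1, 1, 256, 256)
    pred = (gt + 0.05 * torch.randn_like(gt)).clamp(0, 1)

    score = ms_ssim(pred, gt, data_range=1.0, size_average=True)
    print(f"MS-SSIM: {score.item():.3f}")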