Evaluation and Metrics
Evaluation
Participants receive a set of low-SNR (signal-to-noise ratio) images for training. At prediction time, they are expected to produce denoised counterparts of the low-SNR inputs.
During the evaluation, the predictions are compared to a hidden set of high-SNR images using metrics frequently used in image reconstruction: PSNR and SSIM, and their extensions SI-PSNR and MS-SSIM.
You can find an example submission container here: https://github.com/ai4life-opencalls/AI4Life-MDC24-example-submission
Metrics
1. PSNR
Peak Signal-to-Noise Ratio (PSNR) is a widely used metric for assessing the quality of reconstructed images compared to the original ones. It is often used in image compression and restoration applications. When comparing two images, a higher PSNR value indicates that the reconstructed image is of higher quality and has a lower reconstruction error.
PSNR is defined as:

$$\mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{\mathit{MAX}_I^2}{\mathit{MSE}}\right)$$

Where $\mathit{MAX}_I$ is the maximum possible pixel value of the image; for example, when pixels are represented using 8 bits per sample, this is 255. In microscopy, we usually use the range of values in the ground-truth data instead. $\mathit{MSE}$ is the Mean Squared Error between the original and reconstructed images.
Here, we use the PSNR implementation from the scikit-image package.
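As a concrete illustration, here is a minimal NumPy sketch of the formula above (the function name `psnr` and the ground-truth-range default are our own; the challenge itself relies on scikit-image's implementation):

```python
import numpy as np

def psnr(gt, pred, data_range=None):
    """Peak Signal-to-Noise Ratio between a ground-truth and a predicted image."""
    gt = np.asarray(gt, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)
    if data_range is None:
        # As in this challenge, default to the value range of the ground truth.
        data_range = gt.max() - gt.min()
    mse = np.mean((gt - pred) ** 2)
    return 10 * np.log10(data_range**2 / mse)
```

Note that PSNR diverges to infinity when the two images are identical (MSE of zero), so library implementations typically guard against that case.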
2. Scale-invariant PSNR
This metric is based on the scale-invariant SNR (SI-SNR) described in Luo, Yi, and Nima Mesgarani, "TasNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation," 2018.
SI-PSNR is invariant to the scale of the signals being compared: if one signal is a scaled version of another, the score does not change. This addresses a limitation of traditional SNR and PSNR metrics, which are sensitive to changes in signal amplitude.
SI-PSNR follows the SI-SNR definition:

$$\text{SI-SNR} = 10 \log_{10} \frac{\lVert s_{\text{target}} \rVert^2}{\lVert e_{\text{noise}} \rVert^2}$$

Where:

$$s_{\text{target}} = \frac{\langle \hat{s}, s \rangle \, s}{\lVert s \rVert^2}, \qquad e_{\text{noise}} = \hat{s} - s_{\text{target}}$$

Here $\hat{s}$ and $s$ are the estimated and target clean sources, respectively; both are normalized to zero mean to ensure scale invariance.
Here, we use the scale-invariant implementation from the careamics package.
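To make the idea concrete, below is a minimal NumPy sketch of a scale-invariant PSNR: zero-mean both signals, fit the least-squares scale that maps the prediction onto the target, then compute PSNR on the rescaled prediction. This is our own illustration of the principle, not the careamics implementation, which may differ in detail:

```python
import numpy as np

def si_psnr(gt, pred):
    """Scale-invariant PSNR sketch: invariant to affine rescaling of the prediction."""
    gt = np.asarray(gt, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)
    # Zero-mean both signals so an added offset does not affect the score.
    gt = gt - gt.mean()
    pred = pred - pred.mean()
    # Least-squares scale factor mapping the prediction onto the target.
    alpha = (gt * pred).sum() / (pred * pred).sum()
    mse = np.mean((gt - alpha * pred) ** 2)
    data_range = gt.max() - gt.min()
    return 10 * np.log10(data_range**2 / mse)
```

Because of the fitted offset and scale, `si_psnr(gt, pred)` and `si_psnr(gt, 3 * pred + 7)` return the same value.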
3. SSIM
The Structural Similarity Index (SSIM) stands out as a metric closely aligned with human vision. It compares two images based on luminance, contrast, and structure of image windows, and it ranges from -1 to 1, where 1 indicates perfect similarity.
This metric was first described in Wang, Zhou & Bovik, Alan & Sheikh, Hamid & Simoncelli, Eero, "Image Quality Assessment: From Error Visibility to Structural Similarity," 2004
SSIM is defined as:

$$\text{SSIM}(x, y) = l(x, y)^{\alpha} \cdot c(x, y)^{\beta} \cdot s(x, y)^{\gamma}$$

Where, for images $x$ and $y$ with means $\mu_x, \mu_y$, standard deviations $\sigma_x, \sigma_y$, and covariance $\sigma_{xy}$:

Luminance (l) compares the mean pixel values: $l(x, y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}$
Contrast (c) compares the standard deviations (square roots of the variances): $c(x, y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}$
Structure (s) is the covariance between the signals divided by the product of their standard deviations: $s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}$
The exponents α > 0, β > 0, γ > 0 denote the relative importance of each component, and $C_1, C_2, C_3$ are small constants that stabilize the divisions.
Here, we use the SSIM implementation from the scikit-image package.
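For illustration, here is a simplified NumPy sketch that computes the three components from global image statistics (with α = β = γ = 1 and the standard $C_1, C_2$ constants from the Wang et al. paper). Note that scikit-image's implementation instead averages SSIM over local sliding windows, so values will differ:

```python
import numpy as np

def ssim_global(x, y, data_range=255.0):
    """Simplified SSIM using whole-image statistics rather than local windows."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Stabilization constants as proposed by Wang et al. 2004.
    C1 = (0.01 * data_range) ** 2
    C2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    luminance_and_contrast_structure = (
        (2 * mx * my + C1) * (2 * cov + C2)
    ) / ((mx**2 + my**2 + C1) * (vx + vy + C2))
    return luminance_and_contrast_structure
```

Identical images score exactly 1; any distortion lowers the score.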
4. Multiscale SSIM
MS-SSIM was described in Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," 2003
The drawback of the SSIM is that it is a single-scale approach, and the correct scale depends on viewing conditions (e.g., display resolution and viewing distance). The authors propose a method to compute the structural similarity score using multiple image scales and calibrate the parameters that weigh the relative importance between scales.
MS-SSIM is defined as:

$$\text{MS-SSIM}(x, y) = [l_M(x, y)]^{\alpha_M} \prod_{j=1}^{M} [c_j(x, y)]^{\beta_j} \, [s_j(x, y)]^{\gamma_j}$$

The image at scale j is obtained by iteratively applying a low-pass filter and downsampling the filtered image by a factor of 2. The luminance term is computed only at the coarsest scale, M, reached after M−1 iterations. The exponents are set to αj = βj = γj for all j, with fixed values for the different scales.
This metric accounts for variations in image resolution and viewing conditions, providing a more comprehensive assessment of perceived image quality than the original single-scale SSIM.
Here, we use the implementation from https://github.com/VainF/pytorch-msssim
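To show the multiscale structure of the formula, here is a simplified NumPy sketch (our own, not the pytorch-msssim implementation): it uses 2×2 mean pooling as a crude low-pass/downsample step, global image statistics per scale, and the per-scale weights from the original Wang et al. 2003 paper:

```python
import numpy as np

def _luminance_and_cs(x, y, data_range):
    """Luminance (l) and contrast-structure (cs) SSIM terms from global statistics."""
    C1 = (0.01 * data_range) ** 2
    C2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    l = (2 * mx * my + C1) / (mx**2 + my**2 + C1)
    cs = (2 * cov + C2) / (x.var() + y.var() + C2)
    return l, cs

def downsample(img):
    """2x2 mean pooling: a cheap stand-in for low-pass filtering plus decimation."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return (img[0::2, 0::2] + img[1::2, 0::2]
            + img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def ms_ssim_sketch(x, y, data_range=255.0,
                   weights=(0.0448, 0.2856, 0.3001, 0.2363, 0.1333)):
    """Multiscale SSIM sketch: cs terms at every scale, luminance only at the coarsest."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    score = 1.0
    for j, w in enumerate(weights):
        l, cs = _luminance_and_cs(x, y, data_range)
        if j == len(weights) - 1:
            score *= (l * cs) ** w  # luminance enters only at scale M
        else:
            score *= cs ** w
            x, y = downsample(x), downsample(y)
    return score
```

With the default five weights, the input should be large enough to survive four halvings (e.g. at least 32×32 here; pytorch-msssim has stricter minimum-size requirements because of its 11×11 Gaussian window).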