Here are a bunch of my photos I have been using to test highlight recovery algorithms. I am releasing these photos into the public domain.
A mix of real-world photos and photos staged for testing. Most were taken ages ago with my Nikon D40, which is only six megapixels, but there are some 12 and 24 megapixel images too.
One thing that has been on my mind is how great it would be to have a public domain test set of images for multiple different cases, hosted like an open-source project with version control etc. It would probably be useful and appreciated by developers and researchers alike.
Oh good, it has pairs of unclipped and clipped versions!
I wonder if those should also contain some specifically designed to test clipping of individual channels?
Comparing against a ground-truth image to judge the failure modes of algorithms seems like the way to go.
What metric for that comparison makes sense? PSNR, SSIM, \Delta E_{2000}, \Delta E_{ITP}?
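To put some concrete numbers on the table, here is a minimal sketch (assuming scikit-image and 8-bit sRGB files; the file names are placeholders) that computes PSNR, SSIM and a mean \Delta E_{2000} between a reconstruction and its unclipped ground truth. \Delta E_{ITP} is not available in scikit-image, so it is left out here.

```python
# Rough sketch: candidate metrics between a reconstruction and its unclipped
# ground truth. Assumes scikit-image and 8-bit sRGB images; file names are
# placeholders.
import numpy as np
from skimage import io, color
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

truth = io.imread("unclipped_ground_truth.png").astype(np.float64) / 255.0
recon = io.imread("reconstructed_highlights.png").astype(np.float64) / 255.0

psnr = peak_signal_noise_ratio(truth, recon, data_range=1.0)
ssim = structural_similarity(truth, recon, data_range=1.0, channel_axis=-1)

# Pixel-wise CIEDE2000, averaged over the whole image.
de2000 = color.deltaE_ciede2000(color.rgb2lab(truth), color.rgb2lab(recon))

print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}, mean dE2000: {de2000.mean():.2f}")
```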
To compare against a ground-truth image, pixel-shift files could be worth a try. Maybe @nosle can provide some. Using pixel-shift files we would get rid of demosaic artifacts in the processing…
I would be surprised if there is already a test procedure for this.
I wonder whether a case separation into single clipped channel (3 cases), two clipped channels (3 cases) and all three channels clipped (1 case) would give more insight into failure modes. It would require 7 pairs of carefully crafted test images. Not sure if it's worth the effort.
At the same time it is somewhat concerning that there is a (fully justified) plea for more scientific procedures, but the way algorithms are finally "chosen" imho has a lot of leeway. That could be wholly down to me not having every insight into the development steps, so apologies if that is too harsh a judgement!
Some reasonable metric to measure against should be easy to motivate, and comparing to ground truth is not absurd either… I agree that demosaicing should be standardized or taken out of the equation. Maybe just downsample strongly enough and then look at the metrics.
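Those 7 cases would not necessarily need 7 physical scenes. A hypothetical sketch (assuming a linear float RGB array in [0, 1]; the function name and clip level are made up) that derives all 7 channel-clipping cases from one unclipped image:

```python
# Hypothetical sketch: derive the 7 channel-clipping cases from one unclipped
# linear RGB image. The clip level stands in for the per-channel saturation
# point of the sensor and is an arbitrary choice here.
from itertools import combinations
import numpy as np

def make_channel_clipping_cases(linear_rgb, clip_level=0.7):
    """Return a dict mapping channel subsets, e.g. ('R', 'G'), to images in
    which only those channels are clipped. The input stays untouched and
    serves as the ground truth for all 7 cases."""
    channel_index = {"R": 0, "G": 1, "B": 2}
    cases = {}
    for n in (1, 2, 3):
        for subset in combinations("RGB", n):
            img = linear_rgb.copy()
            for name in subset:
                img[..., channel_index[name]] = np.minimum(
                    img[..., channel_index[name]], clip_level)
            cases[subset] = img
    return cases
```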
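The downsampling step could be as simple as this sketch (assuming scikit-image; the factor is a guess): block-average both images by a generous factor before computing any metric, so per-pixel demosaic artifacts mostly wash out.

```python
# Sketch: block-average an HxWx3 image so demosaic-level detail is gone
# before any metric is computed. The factor of 4 is an arbitrary guess.
from skimage.transform import downscale_local_mean

def prefilter(img, factor=4):
    return downscale_local_mean(img, factors=(factor, factor, 1))
```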
PSNR is best suited for 1:1 reconstruction; the problem is that images with qualitative artifacts can end up with a higher PSNR than artifact-free ones. It is not perceptual.
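For reference, PSNR is just a log-scaled mean squared error, which is why a visually ugly result can still score well as long as the average error stays small:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(x_i - y_i)^2, \qquad \mathrm{PSNR} = 10\log_{10}\!\frac{\mathrm{MAX}^2}{\mathrm{MSE}}$$

where MAX is the maximum possible pixel value (e.g. 255 for 8-bit data).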
SSIM tries to use variance to extract content similarity. The problem is that variance has no perceptual meaning and, well, variance is a piece-wise thing… How do you decide the window width?
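To make the window-width issue concrete, a quick sketch (scikit-image, placeholder file names): the SSIM score moves just by changing win_size, without either image changing.

```python
# Same image pair, different SSIM window sizes: the score changes even though
# nothing about the images does. File names are placeholders.
import numpy as np
from skimage import io
from skimage.metrics import structural_similarity

a = io.imread("ground_truth.png").astype(np.float64) / 255.0
b = io.imread("reconstruction.png").astype(np.float64) / 255.0

for win in (7, 11, 21):
    score = structural_similarity(a, b, win_size=win, data_range=1.0,
                                  channel_axis=-1)
    print(f"win_size={win}: SSIM={score:.4f}")
```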
Delta E is perceptual, but pixel-wise and not content-wise. Also, it's not obvious that the goal of HL reconstruction is to reconstruct color, given that the reconstructed values are likely to fall far outside the gamut of the usual color spaces.
I guess using the Retinex model makes sense here; it's the one that uses perceptual relationships to the neighbourhood. But there is no workable implementation I'm aware of, because it's hairy.
Until then, pick the one metric that validates your bias the best. That's how it works with metrics.
I've seen papers which actively "game" PSNR etc. because it became the goal. It would be nice to have something better than an arbitrary choice for this, though.
PSNR and SSIM are bad for the reasons mentioned, but then they are also free of any perceptual concepts. "Qualitative artifact" screams "perceptual metric" to me.
This can be discussed. Obviously there is no perfect answer, just good enough.
If the algo is reconstructing data lying outside of the spectral locus, that data has to come back to display values anyway, through a CAT or gamut compression or whatever, just as the trichromatic sensor values it uses to reconstruct stuff have to.
Being aware of the biases is the first step in the right direction. Since this is pure theoretical spitballing, I would say use all of them: PSNR, SSIM and some flavour of \Delta E, and throw in some Retinex for good measure if there's a workable implementation. Then, once the metrics are in, they can either be weighted against obvious outliers or used to choose algorithms depending on the situation (the data to be reconstructed). For example: use Algo1 for texture reconstruction and Algo2 for gradient reconstruction, because the metrics show that one is better at the former and two at the latter.
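Purely as spitballing, the weighting step could be no more than a hypothetical helper like this (the metric names, sign handling and weights are all assumptions); the weights are exactly where the subjective judgement ends up living.

```python
# Hypothetical aggregation of several metrics into one score per algorithm.
# Names, sign handling and weights are assumptions; metrics would also need
# normalizing to comparable ranges first, which is itself a subjective choice.
def weighted_score(metrics, weights):
    """metrics and weights are dicts keyed by metric name. dE2000 is negated
    because lower is better there, while higher is better for PSNR and SSIM."""
    signed = dict(metrics, de2000=-metrics["de2000"])
    return sum(weights[name] * signed[name] for name in weights)
```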
Not having a metric at all seems like itās not the best idea.
When I say "perceptual", I mean "perceptually scaled". Artifacts only mean you have a working pair of eyes and the image displays ringing, edge duplication, gradient reversals, cartooned objects, and so on, all of which can go along with a high PSNR as long as they minimize the error norm on average.
Go for it, try it. I'll be there when you come back saying it wasn't quite as simple as you initially thought.
All these metrics aim at predicting an average observer's perceived similarity. None of them do it properly. Choosing or weighting them is simply hiding your subjectivity behind bullshit numbers. Might as well pull out a ruler and measure our dicks; that will save CPU cycles and electricity.
I wasn't claiming that sensor values to display values is a solved problem! I wanted to express that reconstructed data is going to go through the same pipeline.
Not choosing any metric at all cannot be what you suggest for judging the discrepancy between a ground truth and the reconstruction quality.
That's why I think @Iain's dataset has such significance: it contains image pairs against which reconstruction quality can be compared.
@PhotoPhysicsGuy Some of the clipped/unclipped pairs are from handheld shots and will not be usable for precise evaluation.
If a ground truth is required for testing algorithms, then perhaps the best thing to do is take an unclipped image and create a clipped version by adjusting its exposure digitally.
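A minimal sketch of that idea, assuming a linear float RGB array in [0, 1] (the function name and number of stops are arbitrary): push exposure digitally, clip at the white level, and keep the pushed-but-unclipped image as the ground truth the reconstruction is judged against.

```python
import numpy as np

def make_clipped_pair(linear_rgb, stops=2.0):
    """Simulate highlight clipping by pushing exposure digitally.

    linear_rgb: float array in [0, 1], linear (not gamma-encoded).
    Returns (ground_truth, clipped): the ground truth is the pushed image
    before clipping; the clipped version saturates at 1.0 like a sensor would.
    """
    pushed = linear_rgb * (2.0 ** stops)
    return pushed, np.clip(pushed, 0.0, 1.0)
```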
Regarding metrics for highlight reconstruction, I tend to go with "if it looks right, it is right". I think one goal of highlight reconstruction is simply to stop the clipped areas from distracting the viewer from the content of the image. Completely flat areas are unnatural and stick out.
It seems to me that ground truths are, as yet, not required. At the same time there is a lot of appeal to testing algorithm quality against ground truths. Obviously, optimizing for random-metric-no.23 doesn't mean much. But I would argue that not having a metric at all leaves equally too much to guessing and opinion.
I agree that manually clipping channels on demand might involve less work.
Personally: not a fan. It can be a rabbit hole. The human visual system is SO complex and SO sensitive to context that, while this might work in X% of cases, it can fail ungracefully for the rest.
Perhaps "if it looks right, then it is right" is oversimplified, but I think everything comes back to what someone thinks when they look at an image.
It seems to me that a good metric for highlight reconstruction is the one which correlates with "it looks right" most of the time. Having a quantifiable metric just means that you don't have to do complicated tests to find out what "looks right" most of the time to most people.