Highlight recovery test set

Here are a bunch of my photos I have been using to test highlight recovery algorithms. I am releasing these photos into the public domain.

A mix of real-world photos and photos staged for testing. Most were taken ages ago with my Nikon D40, which is only six megapixels, but there are some 12 and 24 megapixel images too.

https://drive.google.com/drive/folders/1SmiQ7E01RaflZxIFpfi5FMCeZGpHirj-?usp=sharing

Some photos from other people (not in the public domain) that might be of interest for testing highlight recovery are

11 Likes

Awesome content @Iain !

One thing that has been on my mind is how great it would be to have a public domain test set of images for multiple different cases. Hosted like an open-source project with version control etc. Would probably be useful and appreciated by developers and researchers alike :slight_smile:

2 Likes

Thank you very much

Oh good, it has pairs of unclipped and clipped versions!
I wonder if the set should also contain some images specifically designed to test clipping of individual channels?

Comparing against a ground-truth image to judge the failure modes of algorithms seems like the way to go.
What metric makes sense for that comparison? PSNR, SSIM, \Delta E_{2000}, \Delta E_{ITP}?
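
For a first pass, all of them could simply be computed on each clipped/unclipped pair. A minimal sketch with scikit-image, assuming the pair is pixel-aligned, equally exposed, and stored as 16-bit files (the file names are placeholders, not from the set):

```python
# Minimal sketch: compare a reconstruction against its unclipped ground truth.
# Assumes 16-bit files, pixel alignment and equal exposure; names are placeholders.
import numpy as np
from skimage import io, color
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

gt = io.imread("unclipped_ground_truth.tif").astype(np.float64) / 65535.0
rec = io.imread("reconstructed_from_clipped.tif").astype(np.float64) / 65535.0

psnr = peak_signal_noise_ratio(gt, rec, data_range=1.0)
ssim = structural_similarity(gt, rec, channel_axis=-1, data_range=1.0)
# Delta E 2000 is defined on Lab, so convert first (gives a per-pixel map).
de2000 = color.deltaE_ciede2000(color.rgb2lab(gt), color.rgb2lab(rec))

print(f"PSNR {psnr:.2f} dB  SSIM {ssim:.4f}  mean dE2000 {de2000.mean():.2f}")
```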

To compare against a ground-truth image, pixelshift files could be worth a try. Maybe @nosle can provide some. Using pixelshift files we would get rid of demosaic artifacts in the processing…

1 Like

Unfortunately I am not aware of any "standard" test procedure for this case.

Also, we should keep in mind that most of these images are hard-core clipped. Iain and I have been using them to test for worst-case scenarios.

Nor am I. But using combined pixelshift files would remove one step and get us closer to the truth :wink:

1 Like

I would be surprised if there is already a test procedure for this.

I wondered if separating the cases into single clipped channel (3 cases), two clipped channels (3 cases), and all three channels clipped (1 case) would give more insight into failure modes? It would require 7 pairs of carefully crafted test images. Not sure if it's worth the effort.
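
For what it's worth, such pairs could also be generated synthetically from a single unclipped exposure. A rough sketch; the function name and the artificial white level are made up for illustration:

```python
# Hypothetical generator for the 7 channel-clipping cases (R, G, B, RG, RB, GB, RGB).
# Assumes an unclipped linear RGB float image and simulates sensor saturation
# by capping the selected channels at an artificial white level.
from itertools import combinations
import numpy as np

def clip_cases(unclipped, white_level=0.5):
    """Return {case_name: clipped_copy} for every non-empty channel subset."""
    cases = {}
    for n in (1, 2, 3):
        for subset in combinations(range(3), n):
            clipped = unclipped.copy()
            for c in subset:
                # everything above the artificial white level is lost,
                # just as on a saturated sensor channel
                clipped[..., c] = np.minimum(clipped[..., c], white_level)
            cases["".join("RGB"[c] for c in subset)] = clipped
    return cases
```

A reconstruction algorithm would then be told that white_level is the clipping point and be scored against the original.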

At the same time, it is somewhat disconcerting that there is a (fully justified) plea for more scientific procedures while the way algorithms are finally 'chosen' imho has a lot of leeway. That could well be down to me not having every insight into the development steps, so apologies if that is too harsh a judgement!
Some reasonable metric to measure against should be easy to motivate, and comparing to a ground truth is not absurd either… I agree that demosaicing should be standardized or taken out of the equation. Maybe "just" downsample strongly enough and then look at the metrics.
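
The "downsample strongly, then compare" idea could look something like this sketch; block-averaging both images by the same factor largely averages away per-pixel demosaic artifacts before any metric is computed (the factor of 8 is an arbitrary choice):

```python
# Sketch: block-average both images before metric computation so that
# demosaic-level differences mostly cancel out. Factor 8 is arbitrary.
from skimage.transform import downscale_local_mean

def downsample_pair(gt, rec, factor=8):
    # average over factor x factor blocks per channel, leaving channels intact
    return (downscale_local_mean(gt, (factor, factor, 1)),
            downscale_local_mean(rec, (factor, factor, 1)))
```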

1 Like

PSNR, SSIM and delta E all suck.

PSNR is best suited for 1:1 reconstruction; the problem is that images with qualitative artifacts can end up with a higher PSNR than artifact-free ones. It is not perceptual.
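
For reference, PSNR is just a log-scaled mean squared error over the whole frame, so any artifact that keeps the average error small goes unpunished:

\mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MAX}^2}{\mathrm{MSE}}, \qquad \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( x_i - \hat{x}_i \right)^2

where \mathrm{MAX} is the peak signal value, x_i the ground truth and \hat{x}_i the reconstruction.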

SSIM tries to use variance to extract content similarity. The problem is, variance has no perceptual meaning and, well, variance is a piece-wise thing… How do you decide the window width?

Delta E is perceptual, but pixel-wise rather than content-wise. Also, it's not obvious that the goal of HL reconstruction is to reconstruct color, given that the reconstructed values are likely to be far outside the gamut of the usual color spaces.

I guess using the Retinex model makes sense here; it's the one that uses perceptual relationships to the neighbourhood. But there is no workable implementation I'm aware of, because it's hairy.

Until then, pick the one metric that validates your bias the best. That's how it works with metrics.

2 Likes

I've seen papers which actively "game" PSNR etc., because it became the goal. It would be nice to have something better than an arbitrary choice for this, though.

1 Like

Yes they do.

PSNR and SSIM are bad for the reasons mentioned, but they are free of perceptual concepts. "Qualitative artifact" screams "perceptual metric" to me.

This can be discussed. Obviously there is no perfect answer, just good enough.

If the algo is reconstructing data lying outside of the spectral locus, that data has to be brought back to display values anyway, through a CAT or gamut compression or whatever, just as the trichromatic sensor values it uses to reconstruct things do.

Being aware of the biases is the first step in the right direction. Since this is pure theoretical spitballing, I would say use all of them: PSNR, SSIM and some flavour of \Delta E, and throw in some Retinex for good measure if there's a workable implementation. Then, once the metrics are in, they can either be weighted against obvious outliers or used to prefer algorithms depending on the situation (the data to be reconstructed). For example: for texture reconstruction use Algo1, for gradient reconstruction use Algo2, because the metrics show that one is better at the former and two at the latter.
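
Purely as a spitball of how that combination could work without mixing incompatible scales (the numbers and algorithm names below are invented): rank each algorithm per metric and average the ranks.

```python
# Illustrative only: rank algorithms per metric and average the ranks, which
# sidesteps the fact that PSNR (dB), SSIM ([0, 1]) and Delta E (an error,
# lower is better) live on incompatible scales. All numbers are invented.
results = {
    "Algo1": {"psnr": 31.2, "ssim": 0.91, "de2000": 4.1},  # say, better at texture
    "Algo2": {"psnr": 30.4, "ssim": 0.94, "de2000": 3.2},  # say, better at gradients
}
higher_is_better = {"psnr": True, "ssim": True, "de2000": False}

def mean_rank(results, directions):
    ranks = {name: 0.0 for name in results}
    for metric, higher in directions.items():
        ordered = sorted(results, key=lambda a: results[a][metric], reverse=higher)
        for rank, name in enumerate(ordered, start=1):
            ranks[name] += rank / len(directions)
    return ranks  # lower mean rank = better overall

print(mean_rank(results, higher_is_better))
```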

Not having a metric at all doesn't seem like the best idea.

ꟻLIP: A Difference Evaluator for Alternating Images | Research?

1 Like

When I say "perceptual", I mean "perceptually scaled". Artifacts only mean you have a working pair of eyes and the image displays ringing, edge duplication, gradient reversals, cartooned objects, and so on, all of which can coexist with a high PSNR as long as they minimize the error norm on average.

Go for it, try it. I'll be there when you come back saying it wasn't quite as simple as you initially thought.

All these metrics aim at predicting an average observer's perceived similarity. None of them does it properly. Choosing or weighting them is simply hiding your subjectivity behind bullshit numbers. Might as well pull out a ruler and measure our dicks; that will save CPU cycles and electricity.

Finally, a metric that makes sense… Thanks!

1 Like

Thanks @Iain for the test images and @hanatos for yet another evaluator ꟻor me to play with. :slight_smile:

1 Like

I wasn't claiming that going from sensor values to display values is a solved problem! I wanted to express that reconstructed data is going to go through the same pipeline.

Surely not choosing any metric at all cannot be what you suggest for judging the discrepancy between a ground truth and the reconstruction.
That's why I think @Iain's dataset is so significant: it contains image pairs against which reconstruction quality can be compared.

I haven't checked yet, but are there any examples taken with a lens that displays purple fringing?

I've found that can throw off inpainting algorithms.

@PhotoPhysicsGuy Some of the clipped/unclipped pairs are from handheld shots and will not be usable for precise evaluation.

If a ground truth is required for testing algorithms, then perhaps the best thing to do is to take an unclipped image and create a clipped version by adjusting its exposure digitally.
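
A minimal sketch of that, assuming a linear float image scaled to [0, 1] (the number of stops and the white level of 1.0 are arbitrary choices): push the exposure digitally, cap the pushed copy at the white level, and keep the uncapped copy as ground truth.

```python
# Sketch: create a clipped/unclipped pair digitally from one unclipped exposure.
# The capped copy goes through highlight reconstruction; the pushed-but-uncapped
# copy stays as ground truth (its values above 1.0 are what should be recovered).
import numpy as np

def make_pair(unclipped, stops=2.0):
    pushed = unclipped * (2.0 ** stops)   # digital exposure push
    clipped = np.minimum(pushed, 1.0)     # simulated sensor saturation
    return pushed, clipped                # (ground truth, test input)
```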

Regarding metrics for highlight reconstruction, I tend to go with "if it looks right, it is right". I think one goal of highlight reconstruction is simply to stop the clipped areas from distracting the viewer from the content of the image. Completely flat areas are unnatural and stick out.

@CarVac there is no significant purple fringing

1 Like

It seems to me that ground truths are, as of yet, not required. At the same time, there is a lot of appeal in testing algorithm quality against ground truths. Obviously, optimizing for random-metric-no.23 does not mean much. But I would argue that not having a metric at all leaves equally too much room for guessing and opinion.

I agree that manually clipping channels on demand might involve less work.

Personally: not a fan. It can be a rabbit hole. The human visual system is SO complex and SO sensitive to context that, while this might work in X% of cases, it can fail ungracefully for the rest.

EDIT context sensitivity: HVS Illusion on Twitter

Perhaps 'if it looks right, then it is right' is oversimplified, but I think everything comes back to what someone thinks when they look at an image.

It seems to me that a good metric for highlight reconstruction is one which correlates with 'it looks right' most of the time. Having a quantifiable metric just means that you don't have to run complicated tests to find out what 'looks right' most of the time for most people.

3 Likes

I sure can provide pixelshift images when I figure out what they should show :sweat_smile: