Here are a bunch of my photos I have been using to test highlight recovery algorithms. I am releasing these photos into the public domain.
A mix of real-world photos and photos staged for testing. Most were taken ages ago with my Nikon D40, which is only six megapixels, but there are some 12 and 24 megapixel images too.
One thing that has been on my mind is how great it would be to have a public domain test set of images for multiple different cases, hosted like an open-source project with version control etc. It would probably be useful and appreciated by developers and researchers alike.
Oh good, it has pairs of unclipped and clipped versions!
I wonder if those should also contain some specifically designed to test clipping of individual channels?
Comparing against a ground-truth image to judge the failure modes of algorithms seems like the way to go.
What metric for that comparison makes sense? PSNR, SSIM, \Delta E_{2000}, \Delta E_{ITP}?
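To put some concrete numbers on the table, here is a minimal sketch (assuming scikit-image and 8-bit sRGB files; the file names are placeholders) that computes PSNR, SSIM and a mean \Delta E_{2000} between a reconstruction and its unclipped ground truth. \Delta E_{ITP} is not available in scikit-image, so it is left out here.

```python
# Rough sketch: candidate metrics between a reconstruction and its unclipped
# ground truth. Assumes scikit-image and 8-bit sRGB images; file names are
# placeholders.
import numpy as np
from skimage import io, color
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

truth = io.imread("unclipped_ground_truth.png").astype(np.float64) / 255.0
recon = io.imread("reconstructed_highlights.png").astype(np.float64) / 255.0

psnr = peak_signal_noise_ratio(truth, recon, data_range=1.0)
ssim = structural_similarity(truth, recon, data_range=1.0, channel_axis=-1)

# Pixel-wise CIEDE2000, averaged over the whole image.
de2000 = color.deltaE_ciede2000(color.rgb2lab(truth), color.rgb2lab(recon))

print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}, mean dE2000: {de2000.mean():.2f}")
```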
To compare against a ground-truth image, pixel-shift files could be worth a try. Maybe @nosle can provide some. Using pixel-shift files we would get rid of demosaic artifacts in the processing…
I would be surprised if there is already a test procedure for this.
I wonder whether a case separation into single clipped channel (3 cases), two clipped channels (3 cases) and all three channels clipped (1 case) would give more insight into failure modes. It would require 7 pairs of carefully crafted test images. Not sure if it's worth the effort.
At the same time it is somewhat concerning that there is a (fully justified) plea for more scientific procedures, but the way algorithms are finally "chosen" imho has a lot of leeway. That could be wholly down to me not having every insight into the development steps, so apologies if that is too harsh a judgement!
Some reasonable metric to measure against should be easy to motivate, and comparing to ground truth is not absurd either… I agree that demosaicing should be standardized or taken out of the equation. Maybe just downsample strongly enough and then look at the metrics.
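Those 7 cases would not necessarily need 7 physical scenes. A hypothetical sketch (assuming a linear float RGB array in [0, 1]; the function name and clip level are made up) that derives all 7 channel-clipping cases from one unclipped image:

```python
# Hypothetical sketch: derive the 7 channel-clipping cases from one unclipped
# linear RGB image. The clip level stands in for the per-channel saturation
# point of the sensor and is an arbitrary choice here.
from itertools import combinations
import numpy as np

def make_channel_clipping_cases(linear_rgb, clip_level=0.7):
    """Return a dict mapping channel subsets, e.g. ('R', 'G'), to images in
    which only those channels are clipped. The input stays untouched and
    serves as the ground truth for all 7 cases."""
    channel_index = {"R": 0, "G": 1, "B": 2}
    cases = {}
    for n in (1, 2, 3):
        for subset in combinations("RGB", n):
            img = linear_rgb.copy()
            for name in subset:
                img[..., channel_index[name]] = np.minimum(
                    img[..., channel_index[name]], clip_level)
            cases[subset] = img
    return cases
```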
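The downsampling step could be as simple as this sketch (assuming scikit-image; the factor is a guess): block-average both images by a generous factor before computing any metric, so per-pixel demosaic artifacts mostly wash out.

```python
# Sketch: block-average an HxWx3 image so demosaic-level detail is gone
# before any metric is computed. The factor of 4 is an arbitrary guess.
from skimage.transform import downscale_local_mean

def prefilter(img, factor=4):
    return downscale_local_mean(img, factors=(factor, factor, 1))
```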
PSNR is best suited for 1:1 reconstruction; the problem is that images with qualitative artifacts can end up with a higher PSNR than artifact-free ones. It is not perceptual.
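For reference, PSNR is just a log-scaled mean squared error, which is why a visually ugly result can still score well as long as the average error stays small:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(x_i - y_i)^2, \qquad \mathrm{PSNR} = 10\log_{10}\!\frac{\mathrm{MAX}^2}{\mathrm{MSE}}$$

where MAX is the maximum possible pixel value (e.g. 255 for 8-bit data).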
SSIM tries to use variance to extract content similarity. The problem is that variance has no perceptual meaning and, well, variance is a piece-wise thing… How do you decide the window width?
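To make the window-width issue concrete, a quick sketch (scikit-image, placeholder file names): the SSIM score moves just by changing win_size, without either image changing.

```python
# Same image pair, different SSIM window sizes: the score changes even though
# nothing about the images does. File names are placeholders.
import numpy as np
from skimage import io
from skimage.metrics import structural_similarity

a = io.imread("ground_truth.png").astype(np.float64) / 255.0
b = io.imread("reconstruction.png").astype(np.float64) / 255.0

for win in (7, 11, 21):
    score = structural_similarity(a, b, win_size=win, data_range=1.0,
                                  channel_axis=-1)
    print(f"win_size={win}: SSIM={score:.4f}")
```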
Delta E is perceptual, but pixel-wise and not content-wise. Also, it's not obvious that the goal of HL reconstruction is to reconstruct color, given that the reconstructed values are likely to fall far outside the gamut of the usual color spaces.
I guess using the Retinex model makes sense here; it's the one that uses perceptual relationships to the neighbourhood. But there is no workable implementation I'm aware of, because it's hairy.
Until then, pick the one metric that validates your bias the best. That's how it works with metrics.
I've seen papers which actively "game" PSNR etc. because it became the goal. It would be nice to have something better than an arbitrary choice for this, though.
PSNR and SSIM are bad for the reasons mentioned, but then they are also free of any perceptual concepts. "Qualitative artifact" screams "perceptual metric" to me.
This can be discussed. Obviously there is no perfect answer, just good enough.
If the algo is reconstructing data lying outside of the spectral locus, that data has to come back to display values anyway, through a CAT or gamut compression or whatever, just as the trichromatic sensor values it uses to reconstruct stuff have to.
Being aware of the biases is the first step in the right direction. Since this is pure theoretical spitballing, I would say use all of them: PSNR, SSIM and some flavour of \Delta E, and throw in some Retinex for good measure if there's a workable implementation. Then, once the metrics are in, they can either be weighted against obvious outliers or used to choose algorithms depending on the situation (the data to be reconstructed). For example: use Algo1 for texture reconstruction and Algo2 for gradient reconstruction, because the metrics show that one is better at the former and two at the latter.
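Purely as spitballing, the weighting step could be no more than a hypothetical helper like this (the metric names, sign handling and weights are all assumptions); the weights are exactly where the subjective judgement ends up living.

```python
# Hypothetical aggregation of several metrics into one score per algorithm.
# Names, sign handling and weights are assumptions; metrics would also need
# normalizing to comparable ranges first, which is itself a subjective choice.
def weighted_score(metrics, weights):
    """metrics and weights are dicts keyed by metric name. dE2000 is negated
    because lower is better there, while higher is better for PSNR and SSIM."""
    signed = dict(metrics, de2000=-metrics["de2000"])
    return sum(weights[name] * signed[name] for name in weights)
```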
Not having a metric at all seems like itās not the best idea.
When I say "perceptual", I mean "perceptually scaled". Artifacts only mean you have a working pair of eyes and the image displays ringing, edge duplication, gradient reversals, cartooned objects, and so on, all of which can go along with a high PSNR as long as they minimize the error norm on average.
Go for it, try it. I'll be there when you come back saying it wasn't quite as simple as you initially thought.
All these metrics aim at predicting an average observer's perceived similarity. None of them do it properly. Choosing or weighting them is simply hiding your subjectivity behind bullshit numbers. Might as well pull out a ruler and measure our dicks; that will save CPU cycles and electricity.
I wasn't claiming that sensor values to display values is a solved problem! I wanted to express that reconstructed data is going to go through the same pipeline.
Not choosing any metric at all cannot be what you suggest for judging the discrepancy between a ground truth and the reconstruction quality.
That's why I think @Iain's dataset has such significance: it contains image pairs against which reconstruction quality can be compared.
@PhotoPhysicsGuy Some of the clipped/unclipped pairs are from handheld shots and will not be usable for precise evaluation.
If a ground truth is required for testing algorithms, then perhaps the best thing to do is take an unclipped image and create a clipped version by adjusting its exposure digitally.
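A minimal sketch of that idea, assuming a linear float RGB array in [0, 1] (the function name and number of stops are arbitrary): push exposure digitally, clip at the white level, and keep the pushed-but-unclipped image as the ground truth the reconstruction is judged against.

```python
import numpy as np

def make_clipped_pair(linear_rgb, stops=2.0):
    """Simulate highlight clipping by pushing exposure digitally.

    linear_rgb: float array in [0, 1], linear (not gamma-encoded).
    Returns (ground_truth, clipped): the ground truth is the pushed image
    before clipping; the clipped version saturates at 1.0 like a sensor would.
    """
    pushed = linear_rgb * (2.0 ** stops)
    return pushed, np.clip(pushed, 0.0, 1.0)
```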
Regarding metrics for highlight reconstruction, I tend to go with "if it looks right, it is right". I think one goal of highlight reconstruction is simply to stop the clipped areas from distracting the viewer from the content of the image. Completely flat areas are unnatural and stick out.
It seems to me that ground truths are, as yet, not required. At the same time there is a lot of appeal to testing algorithm quality against ground truths. Obviously, optimizing for random-metric-no.23 doesn't mean much. But I would argue that not having a metric at all leaves equally too much to guessing and opinion.
I agree that manually clipping channels on demand might involve less work.
Personally: not a fan. It can be a rabbit hole. The human visual system is SO complex and SO sensitive to context that, while this might work in X% of cases, it can fail ungracefully for the rest.
Perhaps "if it looks right, then it is right" is oversimplified, but I think everything comes back to what someone thinks when they look at an image.
It seems to me that a good metric for highlight reconstruction is the one which correlates with "it looks right" most of the time. Having a quantifiable metric just means that you don't have to do complicated tests to find out what "looks right" most of the time to most people.