Comparing defaults across software

In light of the heavy chatter about presets, default looks and which program does it “best”, I decided to do a small experiment. I took 7 images from the collection I have gathered over the past years frequenting this forum (I am sorry, I haven’t spent time tracking down the exact sources, but some images may look familiar). These images are very different from each other in file format, sensor format, exposure, dynamic range, noise level, etc., but I won’t claim these images are fully representative for my purpose.

Question: if I take the default processing of various raw conversions from different programs, average them, and then compare each version with that average, which one is closest to the average, the “hypothetical best default”?

Methodology: open the RAW file in the editor of choice, do as little processing as possible (in most cases, zero processing*) and export a 16 bit TIFF. A custom script then loads these files, aligns them and crops away the edges so that we compare identical pixels, resizes them to the size of the embedded JPEG (this is the closest I can get to the OOC JPEG - not every vendor embeds a full size JPEG) or 3000 pixels at most (update: added the size restriction to improve comparison speed), converts all versions including the OOC JPEG to L*a*b* and averages** across channels, recombines the channels to form the final image, and finally calculates the simple Euclidean root-mean-square distance between the average and each original. A rough code sketch of this pipeline follows the notes below.

* I disabled lens distortion correction if it was enabled, or left out the image if I couldn’t (looking at you, Photoshop :eyes: ). In two cases the embedded JPEG had distortion correction applied, which introduces some error I cannot mitigate.
** Originally I took the arithmetic mean, but as people have already pointed out, this may not be a great choice. Update: I now take the median value, which is less sensitive to outliers. The differences between the mean and median images were relatively small.
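For anyone who wants to tinker, here is a minimal Python sketch of the core of the script (this is not my actual Mathematica notebook; it assumes the exports have already been aligned and cropped to identical dimensions, uses NumPy and scikit-image, and its exact normalization may differ from the numbers below):

import numpy as np
from skimage import color, io, transform

def load_lab(path, max_size=3000):
    # Read an integer TIFF/JPEG, downscale so the longest side is at most max_size,
    # and convert to L*a*b*.
    rgb = io.imread(path)
    rgb = rgb.astype(np.float64) / np.iinfo(rgb.dtype).max   # assumes integer input
    scale = max_size / max(rgb.shape[:2])
    if scale < 1:
        new_shape = (round(rgb.shape[0] * scale), round(rgb.shape[1] * scale))
        rgb = transform.resize(rgb, new_shape, anti_aliasing=True)
    return color.rgb2lab(rgb)

def distances_to_reference(paths):
    # Per-pixel median across all renders as the reference, then the RMS of the
    # per-pixel Euclidean Lab distance of each render to that reference.
    labs = np.stack([load_lab(p) for p in paths])             # (n, h, w, 3)
    reference = np.median(labs, axis=0)                       # the "hypothetical best default"
    dist = np.sqrt(((labs - reference) ** 2).sum(axis=-1))    # Euclidean Lab distance per pixel
    return {p: float(np.sqrt((d ** 2).mean())) for p, d in zip(paths, dist)}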

Software: the chosen editors are, in alphabetical order: Affinity 1.9.4, ART (latest dev), Capture One 20, darktable (latest master), Digikam 7.2.0, DxO Photolab 4.3.0, Filmulator (latest release), Lightroom Classic 10.2, LightZone (latest release), Luminar 4, ON1 Photo RAW 2021, PhotoFlow (latest release), Adobe Photoshop 2020, RawTherapee (latest dev), rawproc.
I have omitted Photivo, because it could not process most of my samples properly. I had also omitted rawproc because I saw some weird behavior, but I am in contact with Glenn. Update: rawproc is now included.
Whenever an editor could not open a file (e.g. because it doesn’t support X-Trans sensors), that editor is obviously not included in the analysis for that scene.

Results
This is going to be a long list. For each list, smaller numbers mean better similarity. The image shows the average. I have yet to think of a good way to share the other images if you want to compare them yourself…

Scene 1 - Wide dynamic range, sun clipping

0.00578805	Affinity 1.9.4
0.00876031	Luminar 4
0.0106033	ART
0.0113892	Capture One 20
0.0166794	RawTherapee
0.0206756	OOC JPEG
0.0232419	ON1 Photo RAW 2021
0.0235942	LightZone
0.0253301	Lightroom Classic 10.2
0.0253313	Photoshop 2020
0.026776	Filmulator
0.0476889	darktable
0.0695579	PhotoFlow
0.095769	Digikam
0.198581	rawproc

Scene 2 - Wide dynamic range, detailed, some clipping

0.0269101	Photoshop 2020
0.0269101	Lightroom Classic 10.2
0.0295188	LightZone
0.0316127	OOC JPEG
0.0358251	Capture One 20
0.037328	DxO PhotoLab 4.3.0
0.0427293	ON1 Photo RAW 2021
0.0450921	ART
0.0533722	RawTherapee
0.0615355	PhotoFlow
0.0617971	darktable
0.0668121	Affinity 1.9.4
0.072927	Filmulator
0.0773843	Digikam
0.0978238	Luminar 4
0.178724	rawproc

Scene 3 - Noisy scene

0.0342636	Luminar 4
0.0356646	Affinity 1.9.4
0.0357915	DxO PhotoLab 4.3.0
0.0416407	ART
0.0453766	Photoshop 2020
0.0453766	Lightroom Classic 10.2
0.0487241	Capture One 20
0.0690045	ON1 Photo RAW 2021
0.0732908	darktable
0.0748285	RawTherapee
0.0753369	OOC JPEG
0.102341	Filmulator
0.103192	PhotoFlow
0.107457	LightZone
0.137185	Digikam
0.240059	rawproc

Scene 4 - Typical bright landscape with sky
N.B. OOC JPEG had distortion correction applied, so some haze visible because of that.

0.0285808	Affinity 1.9.4
0.0297209	ART
0.0311513	Filmulator
0.0327813	Luminar 4
0.0358095	RawTherapee
0.0374848	Capture One 20
0.0390227	ON1 Photo RAW 2021
0.0478497	DxO PhotoLab 4.3.0
0.0494095	darktable
0.0635937	Digikam
0.0834656	OOC JPEG
0.0844319	PhotoFlow
0.103437	LightZone
0.216024	rawproc

Scene 5 - Bright blues, sharp small contrasts

0.0255882	darktable
0.0282571	Luminar 4
0.0336957	PhotoFlow
0.0392326	Digikam
0.0482835	ART
0.0485683	Affinity 1.9.4
0.0587456	RawTherapee
0.0631657	Capture One 20
0.0632366	LightZone
0.116961	rawproc
0.123669	OOC JPEG

Scene 6 - Blown highlights, strong CA
N.B. Some misalignment present.

0.044375	Luminar 4
0.0545543	Filmulator
0.0590963	ART
0.0604006	PhotoFlow
0.0615064	Affinity 1.9.4
0.0650114	DxO PhotoLab 4.3.0
0.0670756	RawTherapee
0.0722563	darktable
0.0725098	OOC JPEG
0.0778986	Capture One 20
0.0783475	Photoshop 2020
0.0783475	Lightroom Classic 10.2
0.0793131	ON1 Photo RAW 2021
0.0804826	LightZone
0.11173	    Digikam
0.19541	    rawproc

Scene 7 - Overexposed clouds, bright scene

0.0122128	Photoshop 2020
0.0122128	Lightroom Classic 10.2
0.0165731	Luminar 4
0.0175009	OOC JPEG
0.0182036	Affinity 1.9.4
0.0188194	ART
0.0211127	RawTherapee
0.0241559	ON1 Photo RAW 2021
0.0319731	Filmulator
0.0324239	DxO PhotoLab 4.3.0
0.0324248	LightZone
0.035527	Capture One 20
0.0542818	darktable
0.0932279	Digikam
0.0975635	PhotoFlow
0.184258	rawproc

Scene 8 - New: bright red hue and color target

0.03284	    OOC JPEG
0.0385984	ON1 Photo RAW 2021
0.041847	Capture One 20
0.0430696	DxO PhotoLab 4.3.0
0.0446965	Affinity 1.9.4
0.0475985	Photoshop 2020
0.0475985	Lightroom Classic 10.2
0.0481924	Luminar 4
0.050561	Filmulator
0.0526564	RawTherapee
0.0596389	ART
0.0714936	darktable
0.0756244	Digikam
0.0824655	LightZone
0.0863091	PhotoFlow
0.179325	rawproc

Summary of ranks
Abbreviations should be self-explanatory; columns are Scenes 1-8:

    S1    S2    S3    S4    S5    S6    S7    S8
1   Aff   PS    Lm4   Aff   dt    Lm4   PS    OOC  
2   Lm4   LR    Aff   ART   Lm4   Film  LR    ON1
3   ART   LZ    DxO   Film  PF    ART   Lm4   C1
4   C1    OOC   ART   Lm4   DK    PF    OOC   DxO  
5   RT    C1    PS    RT    ART   Aff   Aff   Aff  
6   OOC   DxO   LR    C1    Aff   DxO   ART   PS   
7   ON1   ON1   C1    ON1   RT    RT    RT    LR   
8   LZ    ART   ON1   DxO   C1    dt    ON1   Lm4  
9   LR    RT    dt    dt    LZ    OOC   Film  Film
10  PS    PF    RT    DK    rp    C1    DxO   RT   
11  Film  dt    OOC   OOC   OOC   PS    LZ    ART
12  dt    Aff   Film  PF          LR    C1    dt
13  PF    Film  PF    LZ          ON1   dt    DK
14  DK    DK    LZ    rp          LZ    DK    LZ   
15  rp    Lm4   DK                DK    PF    PF   
16        rp    rp                rp    rp    rp   

Some conclusions: there are probably many things you can say about the results. I have not studied them extensively yet, but a few things to mention:

  1. No program is a clear winner (it was not a competition anyway).
  2. Photoshop and Lightroom are not always equal.
  3. Being low on the list means you’re different, not necessarily worse.
  4. Whether this average could be considered a better starting point for further editing than any of the individual options is debatable.
  5. This says nothing about how easy it is to improve the result (sometimes a single slider or setting can make a world of difference).

Please feel free to ask for clarifications. Looking forward to hearing some opinions or ideas for further comparison.

11 Likes

Ideally we could run a double-elimination tournament, comparing a pair a day.

1 Like

Nice!

I would call the metric used here ‘distance-to-average’. I’m open to a better name though.

Would it make sense to include a picture of a CC24 chart and measure deltaE for each participant against the mean? (This of course assumes the value of a colorimetrically correct default; people might disagree with that.)
I don’t know how this should be included, if at all, in the ‘distance-to-average’ metric or if it is just an additional metric.

Averaging would allow poorly constructed or unusual data to influence the reference image. Future work could include a more robust statistic or blending method, and factor in image quality metrics to reject outliers (and bad images).

There are many research papers dedicated to the comparison of images, but they are mostly interested in comparing and contrasting specific sensors, algorithms or neural-network workflows; challenging conditions for image enhancement; or artifacts that interfere with machine learning.

Still other research provides databases of images for this type of inquiry, though not many are as broad in scope and large in sample size as would be needed for your type of general analysis.

1 Like

Hi,

Indeed, but I thought that it should be your job to justify this :slight_smile: So, why did you pick it? One very bad (imho) feature of the average is that it’s very sensitive to outliers – if you have one “very bad” (please note the quotes) tool in the pack, it might skew your results pretty significantly…
On the other hand, I do not have an alternative suggestion, but that might be because the purpose of the comparison is not completely clear to me :man_shrugging:

3 Likes

The median of the selected group would be more robust against outliers.

Completing my thoughts from my previous post, this thread is a good start to a less circular discussion on defaults. It is good to substantiate claims and opinions with data.

Choices aside, the method was clearly laid out - a draft we could extend if someone feels there is a need to do so. @Thanatomanic Time and motivation permitting, maybe another to-do would be to elaborate on the method so that other people can reproduce it and add more images and data.

The method is clear, but the purpose, and what being the best according to the chosen metric means, are not - to me at least. But I should probably just shut up; I certainly didn’t want to derail the discussion. My apologies.

1 Like

No worries. It is a legitimate concern that I also noted. What I am saying is that once there is a framework what happens at each step can be replaced with something better to improve the method or adapt it to other purposes. That is what most research papers do anyway: change a little, then hope it gets published and retweeted *ahem* cited.

[Sometimes (well, more than sometimes, because this is quite a problem), it is pure laziness, or copying just enough not to get into trouble, but alas, I digress.]

That’s an interesting one to add! The deltaE metric would be something else, but also interesting to measure. Maybe even per color patch.
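A rough sketch of what a per-patch version could look like (hypothetical input: the 24 patch Lab values sampled from each render plus a reference chart; plain CIE76 deltaE here, not CIEDE2000):

import numpy as np

def delta_e76(lab_a, lab_b):
    # CIE76 delta E: plain Euclidean distance in L*a*b*.
    return np.sqrt(((np.asarray(lab_a) - np.asarray(lab_b)) ** 2).sum(axis=-1))

def per_patch_report(patches_per_render, reference_patches):
    # patches_per_render: {"darktable": (24, 3) Lab array, ...} - hypothetical, sampled by hand
    # reference_patches: (24, 3) Lab array of the chart's nominal values (or the median render)
    for name, patches in patches_per_render.items():
        de = delta_e76(patches, reference_patches)   # one delta E per patch
        print(f"{name:12s}  mean dE = {de.mean():.2f}   max dE = {de.max():.2f}")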

With regard to taking a single arithmetic mean: I agree that this is probably a rather poor averaging method. I can easily rerun the processing taking the median; other metrics are equally possible if there are other suggestions.

@agriggio I didn’t explain myself very well and I have rephrased this in my post. It wasn’t about the choice of the arithmetic mean. It now reads “Whether such an average can be considered a better starting point for further editing than any of the other options is debatable.” The discussion of what makes a good editing starting point is not something I want to address with my comparison. I want to try and focus on the defaults.

To be fair, I don’t have a well-grounded research question. Please call me out on that, if necessary, and stimulate my brain cells :slight_smile:
I think for now my goal was simply to try and somehow quantify how big or small the differences between defaults are. Many people have a (strong) personal preference for a program and its default processing. The premise is that if we average out the default processing of different programs, we average out people’s preferences as well. This might help to conclude what a “better” default is, one that will appease more people. It might also have the completely opposite effect of producing a default that’s never distinctive enough for anyone’s taste. I thought this was worthwhile to investigate.

I’ll have to think about how to do this. I’m not proficient in either G’MIC or Python, but I’m sure people could cook up much nicer scripts to create the average images than my quick (closed-source) Mathematica notebook.
Other than that, I think I explained all my steps in enough detail that they should be reproducible already. If not, I’m happy to clarify.

1 Like

This is a hilarious exercise! I like it. The mean processing is also interesting to see. Unsurprisingly unsurprising!

1 Like

Thanks for doing this, @Thanatomanic.

I agree that median would be a better average than (arithmetic) mean. It may not make much difference to the rankings.

Can you say how the “similarity” numbers are calculated? If the two images were identical, would the number be infinity or 10000 or something else?

I am curious about how much visual difference there is. Could you post a pair of images from one scene, showing the average image and the result that is furthest from the average?

Sure thing.
Scene 1: Mean


Scene 1: Biggest difference (Digikam)

Scene 1: Smallest difference (Lightroom)

Scene 5: Mean


Scene 5: Biggest difference (LightZone)

Scene 5: Smallest difference (OOC JPEG)

Unfortunately, this is the part I’m least certain about and it may turn my entire comparison around. Two identical images give a “similarity” of 0, so lower means more similar. Though it doesn’t look like smaller numbers always correspond to more similar-looking images - at least not from my inspection. Perhaps I need another metric here…

Very cool exercise. Love it.

How about getting rid of the pixel-level comparison and doing both a resize and a slight blur before calculating the averages and the differences? We are after “the Look” after all, not a 1:1 comparison.

I once ran an analysis across my archive with all the images rescaled to 64x64 (unproportionally) - of course that would be way too brutal for this, but I can see an advantage in going with e.g. 1024x1024 or whatever size has enough data for comparison but does not skew the results by relying on minute details.

PS: Of course the rescale should be a non-sharpening one, and I would still add a simple blur somewhere in the 1…3 pixel range.
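In code it could be as simple as this (a rough sketch with Pillow and NumPy; the square resize, resample filter and blur radius are just placeholders):

import numpy as np
from PIL import Image, ImageFilter

def look_rmse(path_a, path_b, size=1024, blur_radius=2):
    # Downscale and blur both images, then compare: a "look" metric rather than a pixel-level one.
    def prep(path):
        img = Image.open(path).convert("RGB")
        img = img.resize((size, size), Image.LANCZOS)              # non-sharpening resample
        img = img.filter(ImageFilter.GaussianBlur(blur_radius))    # throw away the minute details
        return np.asarray(img, dtype=np.float64) / 255.0
    a, b = prep(path_a), prep(path_b)
    return float(np.sqrt(((a - b) ** 2).mean()))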

1 Like

Thanks, @Thanatomanic.

There are clear differences between the results, as if human editors had decided on different relative importances of picture elements.

I am suspicious of the OP’s original statement that “the higher numbers mean a better similarity” if identical images give a similarity of zero.

My usual comparisons are with ImageMagick, using the RMSE (root mean squared error) metric. Results are between 0.0 for identical images and 1.0 for images that are as different as possible. As a rule of thumb, when the result is less than 0.01 and the differences are evenly distributed, I can’t see any visual difference.

With that metric, the RMSE differences from the means, using your supplied JPEGs, are:

The Digikam image is 1283 pixels wide, so I have cropped the final 3 columns off.

scene 1:
Biggest difference (Digikam)     0.145474
Smallest difference (Lightroom)  0.058388
scene 5:
Biggest difference (LightZone)  0.123831
Smallest difference (OOC JPEG)  0.140308

Oops. With this measurement, the scene 5 OOC is more different from the mean than the LZ is, contradicting your results. But there isn’t much between them; they are both radically different from the mean.

We can blur the images with “-blur 0x3”. This is a Gaussian blur with sigma=3 pixels, so the radius is approximately 9 pixels. Results are then:

scene 1:
Biggest difference (Digikam)     0.145067
Smallest difference (Lightroom)  0.0580373

The blur hasn’t affected the scene 1 results much, because it is quite a “smooth” image (little high-frequency detail).

scene 5:
Biggest difference (LightZone)  0.0580373
Smallest difference (OOC JPEG)  0.0792639

The blur has affected the results because the image has loads of high-frequency detail, but hasn’t changed the relative order.

Of course, we should use 16-bit TIFFs, not 8-bit (lossy?) JPEGs, but I expect that doesn’t make much difference.

For transparency, here is my Windows BAT script:

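REM Unblurred RMSE comparisons of each render against the scene mean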
%IMG7%magick ^
  scene1_mean.jpeg ^
  ( scene1_DK.jpeg -crop 1920x1280+0+0 +repage ) ^
  -metric RMSE ^
  -format "%%[distortion]\n" ^
  -compare ^
  info:

%IMG7%magick ^
  scene1_mean.jpeg ^
  scene1_LR.jpeg ^
  -metric RMSE ^
  -format "%%[distortion]\n" ^
  -compare ^
  info:

%IMG7%magick ^
  scene5_mean.jpeg ^
  scene5_LZ.jpeg ^
  -metric RMSE ^
  -format "%%[distortion]\n" ^
  -compare ^
  info:

%IMG7%magick ^
  scene5_mean.jpeg ^
  scene5_OOC.jpeg ^
  -metric RMSE ^
  -format "%%[distortion]\n" ^
  -compare ^
  info:


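REM The same comparisons, with a Gaussian blur (-blur 0x3) applied before comparing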
%IMG7%magick ^
  scene1_mean.jpeg ^
  ( scene1_DK.jpeg -crop 1920x1280+0+0 +repage ) ^
  -blur 0x3 ^
  -metric RMSE ^
  -format "%%[distortion]\n" ^
  -compare ^
  info:

%IMG7%magick ^
  scene1_mean.jpeg ^
  scene1_LR.jpeg ^
  -blur 0x3 ^
  -metric RMSE ^
  -format "%%[distortion]\n" ^
  -compare ^
  info:

%IMG7%magick ^
  scene5_mean.jpeg ^
  scene5_LZ.jpeg ^
  -blur 0x3 ^
  -metric RMSE ^
  -format "%%[distortion]\n" ^
  -compare ^
  info:

%IMG7%magick ^
  scene5_mean.jpeg ^
  scene5_OOC.jpeg ^
  -blur 0x3 ^
  -metric RMSE ^
  -format "%%[distortion]\n" ^
  -compare ^
  info:
3 Likes

Hi @snibgo, I’ve made a big update to my initial post. To summarize:

  1. I take the median of the images as the reference point for comparison. The differences from the mean were not shockingly big.
  2. I calculate the RMS in Lab space to determine similarity. This has changed the order quite a bit, but now I’m much more confident about the correctness of the results.
  3. I scale the images down before determining similarities, purely for performance reasons.
  4. Thanks to a tip from @ggbutcher, rawproc is now included, but I think I have still done something wrong, because from visual inspection it shouldn’t be last in the rankings. Something for a future update.
  5. I’ve included Scene 8, which contains a color chart. This one is very interesting to examine further. There are quite big differences, and all the non-commercial programs end up at the bottom of the list.

As an example, look at the differences between darktable (first), DxO PhotoLab (second) and RawTherapee (third).



4 Likes

It would be interesting to see a mosaic of small thumbnails - columns or rows of the same scene across software - just to know if the difference is perceptible at that scale.

Flipping through those colour chart images at high speed was very interesting. Very different processing. Darktable’s blue patch really stands out.

The RT and DxO colour charts look quite similar, but the red flowers differ a lot. Very interesting stuff!

Oh, don’t be so quick to indict your process; there HAS to be something wrong with rawproc… :crazy_face:

If you don’t mind, can you send me the TIFF? With that, I can inspect both the image itself and the rawproc toolchain, which is in the metadata… I wouldn’t want you to spend quality time chasing a problem in your chain if there’s something in mine…

Interesting differences.

Peculiar that the darktable one seems to be the only one with a good pastel representation (looking at the warmer/colder section, the top part of the colour checker). The other two seem to have trouble with that at the default settings used.

Overall, DxO and RT seem to have the more pleasing colour and detail representation. RT has a bit more detail compared to DxO, and I think I like the colours just a bit better in the DxO version if I do not look at the colour checker. If I do look at the checker, I prefer the RT colours.

Assuming that this would be a starting point for users who use these defaults, I don’t think there’s much of an influential difference between 2 and 3.

They are probably pushed out by the contrast in the DxO and RT samples?

It’s interesting how different the flowers are. I guess that’s a sign of different handling of extreme colours. (a frequent topic on these boards)

I think good* handling of extreme colours and contrast is one area where defaults really matter. Not knowing what the flowers look like makes it difficult to decide, though.

*hehe