Evaluate hue shifts the scientific way

I have seen @elle and @ggbutcher do screen evaluations of some image processing algorithms (including mine) based on display-referred Lch values and histogram likeness, so maybe it’s time to provide a methodical way to stop saying whatever, because it’s bad for my nerves. This is a framework based on Python and darktable output to measure systematic deviations.

What?

You. Can. Measure. Deviations. Only. On. Synthetic. Imagery.

Help yourself here: https://hdrihaven.com/hdri/?h=aerodynamics_workshop

The problem with real-world photos is that nobody knows their true spectral values, because they have been filtered by a sensor and various corrections and adjustments, so you have to trust too many black boxes. 3D renders and synthetic imagery are the only way to get “true”, uncorrected pictures that can be used as references.

Then, you ideally need files encoded as 32-bit floats to avoid any quantization error.
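
For a sense of the scale of that quantization error, here is a quick, purely illustrative check:

import numpy as np

# Round-trip linear values through 8-bit and 16-bit integer encodings and
# measure the worst-case quantization error, versus keeping them in float32.
x = np.linspace(0.0, 1.0, 100001, dtype=np.float32)

x8  = np.round(x * 255.0) / 255.0       # 8-bit round trip
x16 = np.round(x * 65535.0) / 65535.0   # 16-bit round trip

print("max error, 8 bits :", np.abs(x - x8).max())    # about 2e-3
print("max error, 16 bits:", np.abs(x - x16).max())   # about 8e-6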

How?

In darktable, save your files as 32-bit PFM encoded in XYZ space (there is a hidden option to enable Lab and XYZ output). To get from the HDRI files to darktable, you have to convert them through Blender to full-precision PFM.

Then, there is a small lib I have put together to open the PFM file and store it as a numpy array. We convert the XYZ data to xyY and center the xyY space on the equi-energetic white, so that we can express the saturation as the Euclidean norm of the (x, y) vector and the hue as its angle. Then, we compute the root mean square error/deviation over the whole picture, the error being the difference between the reference and output hue and saturation.

import numpy as np
import matplotlib.pyplot as plt

def load_pfm(fn):
    """Load a 3-channel PFM file into a (rows, cols, 3) float32 numpy array."""

    if not fn.endswith(".pfm"):
        print("Not a PFM file!\n")
        return None

    with open(fn, "rb") as fid:
        raw_data = fid.readlines()

    # PFM header: line 0 is the type ("PF"), line 1 is "width height",
    # line 2 is the scale/endianness; the rest is the raw binary data.
    cols, rows = (int(v) for v in raw_data[1].strip().split())

    del raw_data[:3]

    image = np.frombuffer(b"".join(raw_data), dtype=np.float32)
    del raw_data

    # PFM stores scanlines bottom-to-top; reshape channels-last and flip
    # so the array reads top-to-bottom.
    image = image.reshape(rows, cols, 3)[::-1]

    return image

hue = 0
sat = 1

def convert_2_hsY(image):
    """
        Takes an XYZ input image of shape (rows, cols, 3)
        Outputs a [hue, saturation, Y] image of the same shape
    """

    sum_channels = np.sum(image, axis=2)
    tmp = np.empty_like(image)

    # Convert to xy chromaticities centered on (x, y) = (1/3, 1/3),
    # which is the equi-energetic white in xyY
    # https://en.wikipedia.org/wiki/CIE_1931_color_space#/media/File:CIE1931_rgxy.png
    for i in range(2):
        tmp[:, :, i] = image[:, :, i] / sum_channels - 1/3

    out = np.empty_like(image)

    # Copy Y (the second channel of an XYZ image)
    out[:, :, 2] = image[:, :, 1]

    # The hue is the angle of the shifted (x, y) vector
    out[:, :, hue] = np.arctan2(tmp[:, :, 1], tmp[:, :, 0])

    # The saturation is the Euclidean norm of the shifted (x, y) vector
    out[:, :, sat] = (tmp[:, :, 1]**2 + tmp[:, :, 0]**2)**0.5

    return out

def RMSE(x, y):
    # https://en.wikipedia.org/wiki/Root-mean-square_deviation
    SE = (x - y)**2
    # Output the RMSE and the max of the squared error
    return np.mean(SE)**0.5, np.amax(SE)

Results

If you want to compare a set of pictures in a systematic way, you can loop over an array of files, which is more convenient:

files = [
    "Téléchargements/aerodynamics_workshop_16k-filmic-chroma.pfm",
    "Téléchargements/aerodynamics_workshop_16k-filmic-non-chroma.pfm",
    "Téléchargements/aerodynamics_workshop_16k-filmic-chroma-desat.pfm",
    "Téléchargements/aerodynamics_workshop_16k-basecurve-hdr.pfm",
    "Téléchargements/aerodynamics_workshop_16k-basecurve.pfm",
]

reference = load_pfm("Téléchargements/aerodynamics_workshop_16k-reference.pfm")
reference = convert_2_hsY(reference)

for f in files:
    print(f)
    print("---------------------------------------------------")
    img = load_pfm(f)
    img = convert_2_hsY(img)
    
    print("Hue \tRMSE = %.4g \tmax(SE) = %.4g" % RMSE(reference[:, :, hue], img[:, :, hue]))
    print("Sat \tRMSE = %.4g \tmax(SE) = %.4g" % RMSE(reference[:, :, sat], img[:, :, sat]))
    print("All \tRMSE = %.4g \tmax(SE) = %.4g" % RMSE(reference[:, :, hue:sat], img[:, :, hue:sat]))
    print("---------------------------------------------------\n")

Then, the output is nice:

Téléchargements/aerodynamics_workshop_16k-filmic-chroma.pfm
---------------------------------------------------
Hue 	RMSE = 1.111e-07 	max(SE) = 3.638e-12
Sat 	RMSE = 3.939e-05 	max(SE) = 6.815e-09
All 	RMSE = 1.111e-07 	max(SE) = 3.638e-12
---------------------------------------------------

Téléchargements/aerodynamics_workshop_16k-filmic-non-chroma.pfm
---------------------------------------------------
Hue 	RMSE = 1.576e-07 	max(SE) = 3.638e-12
Sat 	RMSE = 4.674e-05 	max(SE) = 6.162e-09
All 	RMSE = 1.576e-07 	max(SE) = 3.638e-12
---------------------------------------------------

Téléchargements/aerodynamics_workshop_16k-filmic-chroma-desat.pfm
---------------------------------------------------
Hue 	RMSE = 1.086e-07 	max(SE) = 2.785e-12
Sat 	RMSE = 4.029e-05 	max(SE) = 4.094e-09
All 	RMSE = 1.086e-07 	max(SE) = 2.785e-12
---------------------------------------------------

Téléchargements/aerodynamics_workshop_16k-basecurve-hdr.pfm
---------------------------------------------------
Hue 	RMSE = 4.627e-07 	max(SE) = 3.029e-10
Sat 	RMSE = 9.426e-05 	max(SE) = 1.463e-08
All 	RMSE = 4.627e-07 	max(SE) = 3.029e-10
---------------------------------------------------

Téléchargements/aerodynamics_workshop_16k-basecurve.pfm
---------------------------------------------------
Hue 	RMSE = 2.699e-07 	max(SE) = 1.455e-11
Sat 	RMSE = 9.373e-05 	max(SE) = 1.549e-08
All 	RMSE = 2.699e-07 	max(SE) = 1.455e-11
---------------------------------------------------

So, the filmic-non-chroma is the current filmic version in darktable’s master branch, filmic-chroma is a variant I’m working on with chroma handcuffs, and filmic-chroma-desat is the same as the previous one with a -50 % desaturation performed in color balance.

You can see that you get roughly half the RMSE with the filmic variants compared to the basecurve, and basecurve + exposure fusion is even worse, hence me calling that thing silly.

This is what the filmic-chroma-desat looks like:

Another version of filmic-chroma-desat

And its metrics:

Téléchargements/aerodynamics_workshop_16k-filmic-chroma-desat-2.pfm
---------------------------------------------------
Hue 	RMSE = 1.663e-07 	max(SE) = 4.604e-12
Sat 	RMSE = 9.019e-05 	max(SE) = 1.786e-08
All 	RMSE = 1.663e-07 	max(SE) = 4.604e-12
---------------------------------------------------

Now, basecurve-hdr:

Recall its metrics:

Téléchargements/aerodynamics_workshop_16k-basecurve-hdr.pfm
---------------------------------------------------
Hue 	RMSE = 4.627e-07 	max(SE) = 3.029e-10
Sat 	RMSE = 9.426e-05 	max(SE) = 1.463e-08
All 	RMSE = 4.627e-07 	max(SE) = 3.029e-10
---------------------------------------------------

The original is here: https://hdrihaven.com/hdri/?h=aerodynamics_workshop


thanks, very interesting! would it be possible to show the metrics for results that are “somewhat visually similar”, at least in terms of luminance and contrast? (I know this might sound a bit vague, but I hope you get my point)

Yes, I would like that. An error map would be nice as well to show sat and hue shifts. :slight_smile:

A poorly chosen set of synthetic images is a good way to tweak an algorithm to produce not so useful results.

From a practical point of view, I don’t care about hue or sat changes in one or another set of synthetic images after being subjected to one or another algorithm, except insofar as said set of synthetic images might allow someone to tweak algorithms to minimize hue changes.

In the digital darkroom, what I care about is what the algorithms do to my actual images. Ideally I want my editing algorithms to keep the hues in my scene-referred renditions of my raw files unchanged until such time as I decide to modify those hues deliberately rather than by accident of whatever poorly chosen editing algorithm. As an aside, GIMP’s LCh and Luminance blend modes make it very possible to keep hues from shifting accidentally.

Well, yes, 32-bit floating point is good. For the last five years my own editing pipeline has been 100% at 32f from raw file forward, until the final export to disk as a finished image.

Why are you using PFM and invoking Blender? Is there a problem with OpenEXR? Floating point tiffs?

The best minds in the field of color science have pointed out that XYZ and xyY are not perceptually uniform.

LAB was designed to allow measurements of “just noticeable difference”, that is, differences real people can really see.

But you are saying hue and sat changes calculated over an entire image using xyY are a useful measurement of the sort of hue and saturation changes people actually perceive?

Have you tried separating out your RMSE measurements for shadows, mid-tones, and highlights?
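
In case it helps, here is one hypothetical way such a split could be scripted on top of the functions above (the Y thresholds are arbitrary placeholders, not a proposal):

import numpy as np

def rmse_by_zone(reference, img, low=0.05, high=0.5):
    """Hue RMSE split by tonal zone, keyed on the reference Y channel
    (index 2 of the [hue, sat, Y] images). Thresholds are illustrative."""
    Y = reference[:, :, 2]
    zones = {
        "shadows":    Y < low,
        "mid-tones":  (Y >= low) & (Y < high),
        "highlights": Y >= high,
    }
    results = {}
    for name, mask in zones.items():
        se = (reference[:, :, 0][mask] - img[:, :, 0][mask]) ** 2
        results[name] = np.mean(se) ** 0.5 if se.size else float("nan")
    return results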

For what it’s worth, in color appearance models “saturation” and “chroma” are not the same. The “Euclidean norm of the (x, y) vector” would correlate with “chroma”, not “saturation”. Unless by “norm” maybe you mean dividing by Y? It would be nice if you could possibly give explanations of the terms that you use, so that everyone might be able to follow along with what you say. Technical terms are nice to have, but even nicer when followed by explanations.

@anon41087856 you opened by saying that real-world photos involve too many black boxes and measuring deviations should only be done on synthetic images, but then you go on to do the opposite and measure a real-world photo - the aerodynamics workshop, which has passed through various filters, a sensor, and various software, and came out the other end of a string of black boxes. Could you clarify what you meant by that?

I agree. A black box isn’t simply a set of tools and systems that aren’t in your wheelhouse. I am pretty sure that the boxes used by @Elle and @ggbutcher aren’t entirely opaque. :stuck_out_tongue: Error measured in this way isn’t foolproof either. Error needs to be grounded in sensible test conditions and criteria in order for it to make sense. This is partly why I asked for an error map and seconded @agriggio’s request. At least then we would be able to compare the magnitude and distribution of these “sat” and “hue” shifts, at least defined by xyY.

The photo I used is a CG 3D rendering.


Except we are comparing algos that come in the middle of a scene-referred pipe, not the whole pipe with its perceptual output. So you set everything equal and change a single parameter at a time.

Who said anything here about perceptual spaces? Since when do photons live in your perceptual field? Also, I would love to see who you call the best minds, since all I have spotted so far are random internet dudes and a bunch of clueless coding monkeys.

I never mentioned people; this is the closest I can get to scene-referred ratios. I really don’t care about people (if it wasn’t already obvious…).

It’s my understanding that saturation is the distance of a color from pure white/grey. In xyY, the Y axis is orthogonal to the chroma plane and the pure white is at (x, y) = (1/3, 1/3). So I believe that, with the x and y origins shifted to (1/3, 1/3), the direction of the (x’, y’) vector is the hue and its Euclidean norm is a representation of the saturation. Chroma is (x, y) in the Cartesian coordinates. But these are just likeness metrics, really, with some extra effort to make them meaningful. Basic RMSE doesn’t need that.

I thought that, given my source code, it was self-explanatory, but since it’s not:

  • Euclidean norm of the vector: ||u|| = sqrt(x^2 + y^2) → saturation
  • direction of the vector: theta = arctan(y/x) → hue
  • the norm and direction of the vector define a polar referential (a tiny worked example follows).
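
A tiny worked example of that polar mapping, with a made-up (x, y) chromaticity (purely illustrative values):

import numpy as np

# Hypothetical chromaticity, expressed relative to the equi-energetic white
x, y = 0.40, 0.35
xs, ys = x - 1/3, y - 1/3        # shift the origin to (1/3, 1/3)

h = np.arctan2(ys, xs)           # direction of the vector, in radians -> hue
s = np.hypot(xs, ys)             # Euclidean norm of the vector -> saturation

print("hue = %.4f rad, sat = %.4f" % (h, s))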

“somewhat visually similar” defeats the whole purpose of any metric, since it’s perceptually tainted. And also, since the transfer functions are different, it’s very difficult to get similar results.

You might start with works by Richard Hunter and Mark Fairchild. Most of us would probably be happy to get started by perusing the color science articles on Wikipedia starting with the articles on CIE XYZ and CIELAB. But surely that’s too low-tech for someone with the level of scientific training that you have. Though the articles referenced at the bottom of the various Wikipedia articles are often really good and considerably more technical than the actual Wikipedia articles.

I don’t claim to be an expert. But when Hunter and Fairchild and the CIE say that XYZ isn’t perceptually uniform and so that’s why we have LAB for measuring color differences, I do pay attention.

I wrote an introductory article for us coding monkeys - please feel free to tear it apart:

Completely Painless Programmer’s Guide to XYZ, RGB, ICC, xyY, and TRCs

You might find this website useful (has lots of references and is very readable):

http://handprint.com/LS/CVS/color.html

You really should pay attention to what @gwgill says when he tries to help explain where your claims regarding ICC profile color management have gone astray; he’s not just some random internet dude. Indeed our friend @troy_s aka “anonymous” has made it abundantly clear that he has the highest regard for Graeme Gill’s expertise on color.

Photons live out there in the world. Color perception happens in the eyes and brain. It’s kind of important to keep straight what sorts of “stuff” requires dealing with photons and what sorts of “stuff” require dealing with color perception and color appearances. Hue shifts belong in the realm of color perception, not photons. xyY is not the right color space for measuring hue shifts.

Yes, that would complicate things. I still think it would be useful if you were to compare them by keeping the end result as similar as possible. That is what most papers do anyway. Comparing changes in a given parameter is only meaningful when you are stress testing the exact same algorithm.

The photo I used is a CG 3D rendering

Are you sure? Normally the images at HDRI Haven are photos. I couldn’t find any hint that this image is not a photo.

@anon41087856, my apologies, I have been picking at your request for quantification, but have been distracted by the day job (curses!), home responsibilities, and the American Thanksgiving holiday. To let you know, the Imagemagick diff tools don’t tell me differences in ways with which I’m concerned, so I’m working on telling why, for certain questions, the image histogram is an appropriate tool. I’m referring to the statistical concept, not the little postage-stamp rendering, but I’ll assert even that histogram rendering can be useful in detecting and even coarsely characterizing change.

My specific question was, ‘can the radiometric relationship of camera data be retained through an ICC colorspace conversion?’, and I do believe comparing the before and after histogram renderings can tell you something significant about that.
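
Not speaking for the actual workflow here, but a minimal sketch of the kind of before/after histogram comparison being described, assuming both renditions are already loaded as linear float arrays of shape (rows, cols, 3):

import numpy as np

def luminance_histogram(img, bins=256, max_val=1.0):
    """Normalized histogram of a rough luminance proxy (channel mean),
    so that images of different sizes remain comparable."""
    lum = img.mean(axis=-1)
    hist, edges = np.histogram(lum, bins=bins, range=(0.0, max_val))
    return hist / hist.sum(), edges

# before and after are hypothetical arrays, e.g. loaded with load_pfm():
# h_before, _ = luminance_histogram(before)
# h_after, _  = luminance_histogram(after)
# print("L1 distance between histograms:", np.abs(h_before - h_after).sum())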

@anon41087856 it is my opinion that the photo you used is a stitched panorama made from bracketed photos. I’ve been doing that for years and see the stitching errors and other issues. Unless the stitching errors and other issues are part of the rendering, but I would find that odd.


Just to clarify some terminology:

  • The xy position of a color on the xy chromaticity plane is the color’s “chromaticity”, not its “chroma”.

  • I’m not aware of anyone having defined a color space that’s a formal polar transform of xyY space. But by analogy with existing color spaces defined by polar transforms (eg LCh, JCh), then:

    • As you’ve already said, the angle from the positive x-axis to the line from the white point to the xy position of the color in question would be referred to as the xyY “hue”.
    • The distance from the xy chromaticity of the white point to the xy position of the color in question would be referred to as the xy “Chroma”, not “Saturation”.
    • By analogy, “saturation” would be the result of dividing “Chroma” by Y.
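
Following that terminology, a minimal sketch of the two quantities for a single xyY value, using the equi-energetic white at (1/3, 1/3) as in the code earlier in the thread (hypothetical helper, not part of that code):

import numpy as np

def chroma_and_saturation(x, y, Y):
    """Chroma as the distance from the white point in the xy plane, and a
    saturation correlate as chroma divided by Y (colorfulness / brightness).
    Y is assumed to be non-zero."""
    chroma = np.hypot(x - 1/3, y - 1/3)
    saturation = chroma / Y
    return chroma, saturation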

This is a nice readable write-up on XYZ and xyY:

http://dougkerr.net/Pumpkin/articles/CIE_XYZ.pdf

From a paper by Fairchild on Color Appearance (http://rit-mcsl.org/fairchild/PDFs/AppearanceLec.pdf):

Due to the lack of a related chromaticity diagram, saturation is not officially defined in CIELAB. However recalling the [color appearance] definitions of chroma (colorfulness/brightness of white), lightness (brightness/brightness of white), and saturation (colorfulness/brightness) . . . [then a correlate of saturation for CIELAB is] Saturation = C*/L*

Also see “Why Is Color” for some brief comments on chroma vs saturation:
http://www.rit-mcsl.org/fairchild/WhyIsColor/Questions/4-8.html

Yes. The R, G, B channel values will change. But if the original image is scene-referred* then converting the image to another linear gamma RGB color space preserves the scene-referred nature of the original camera data. Otherwise it would be impossible to get from camera space to ACES and still have scene-referred data. I’m assuming the color space conversion is either done without any clipping, or else the original image color gamut fits entirely within the destination color gamut.**

* As scene-referred as possible given the “black box” nature of sensors and lenses and the limitations of camera input profiles - by now we should all just assume that as @afre has pointed out, we don’t take photographs with scientific apparatus.

** Problem colors like saturated bright yellow-greens and dark violet-blues as defined/located in XYZ by many linear gamma camera matrix input profiles, don’t actually fit inside ACES. Those colors require special handling, LUT profiles or gamut mapping or such.
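
As a minimal numerical illustration of the point that a linear matrix conversion preserves scene ratios (using the standard linear sRGB to XYZ D65 matrix purely as an example):

import numpy as np

# Linear sRGB -> XYZ (D65), used here only as an example of a linear
# colorspace conversion.
M = np.array([[0.4124, 0.3576, 0.1805],
              [0.2126, 0.7152, 0.0722],
              [0.0193, 0.1192, 0.9505]])

pixel = np.array([0.20, 0.10, 0.05])   # some scene-referred RGB value
brighter = 4.0 * pixel                 # the same patch, 2 EV brighter

# The conversion is linear, so the exposure ratio survives it untouched:
print((M @ brighter) / (M @ pixel))    # -> [4. 4. 4.]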

From the ‘unbounded’ thread, I got that the definition of scene-referred was image data that 1) had its original radiometric relationship, and 2) had been EV-compensated (if necessary) to put the desired middle-gray tone at 0.18.
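
For point 2, that compensation is just one global multiplication, something like this (hypothetical values):

import numpy as np

# Hypothetical linear image whose middle-gray patch currently measures 0.09
img_linear = np.full((4, 4, 3), 0.09, dtype=np.float32)
measured_gray = 0.09

# EV compensation: map the measured middle gray onto 0.18 with a single
# scale factor, which leaves all radiometric ratios untouched.
img_scene_referred = img_linear * (0.18 / measured_gray)
print(img_scene_referred[0, 0])   # -> [0.18 0.18 0.18]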

Yes! Please let it be so! I can’t handle two interventions in one month (first one was my ‘identity’ crisis… :smile: )

Hi,

Ok, let me change my request then. Would it be possible to show the metrics for several different variants of the base curve and filmic with different parameters, to get a feeling for how “stable” the metrics are?

Also, I don’t quite understand why you only want artificial images as input. Since you are evaluating the error relative to the reference image, it shouldn’t matter where the reference comes from, no? In other words, if your reference is a dcraw-generated TIFF from a camera, why can’t you treat that as your “artificial” image?


Agreed, I don’t think it’s at all clear what error you’re measuring here. XYZ is based on human response to tristimulus… you should compare to integrals of spectral data if the purpose is a physical error. If you use xyY it would usually be dominant wavelength and excitation purity as correlates for hue/saturation.

That’s a rather ambiguous statement, one that may lead some people to the wrong conclusion. XYZ is based on tristimulus color matching. If the XYZs match, then (on average) people agree that the colors are the same. It doesn’t follow that a difference in XYZ values corresponds proportionally to the perceived color difference. The development of Yu’v’, L*u*v*, L*a*b*, Delta E 94, Delta E 2000, DIN99 etc. are a testament to this…

Poor correlates to subjective hue and saturation. As a chromaticity space, Yxy is very poor - Yu’v’ is superior in terms of visual uniformity.
etc. etc. Color science hasn’t been standing still for 80 years since the original standard observer was defined.
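
For reference, swapping the xy chromaticities for CIE 1976 u’v’ would only be a small change to the script above, something along these lines (a sketch, reusing the (rows, cols, 3) XYZ layout from load_pfm):

import numpy as np

def xyz_to_uv_prime(image):
    """CIE 1976 UCS chromaticities: u' = 4X / (X + 15Y + 3Z) and
    v' = 9Y / (X + 15Y + 3Z). Expects a (rows, cols, 3) XYZ image."""
    X, Y, Z = image[:, :, 0], image[:, :, 1], image[:, :, 2]
    denom = X + 15.0 * Y + 3.0 * Z
    return 4.0 * X / denom, 9.0 * Y / denom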