Setting aside the philosophically hard part of that problem (à la “how can I know your red is the same as my red”—if I feel your red the way you feel blue, but I call it “red”, you would have no way of knowing), the practical question of how to balance—especially under mixed light sources, which are common in urban environments at night—has been bothering me for a while. The way I have come to approach it (quite recently, I think) is this.
When we observe a scene, our eyes pick up a fairly expansive spectrum and dynamic range of light. Then, somewhat counter-intuitively, our minds more or less construct a 3D scene using that raw information as an input. But the fun thing is, it’s not the only input our mind uses.
The world is complex, and our minds are constantly busy simplifying it and predicting what happens next; in part, the scene we construct in the moment contains what we unconsciously expect to be there. Powerful as your brain is, it still doesn’t have enough capacity to resolve reality down to the tiniest detail, so to let you function in daily life it builds a simplified map of it.
Whatever we have come to expect throughout our lifetimes affects the scene we construct. For example, from early childhood we strongly anchor on human faces—whether one is lit by an orange streetlight or by a sickly green hospital fluorescent tube, we perceive a human face. If there’s a candle involved, we may perceive a romantic vibe (or maybe a spooky one, depending on the rest of the scene). If we see a candle-lit sheet of paper, we see a sheet of paper—we have no problem with it being yellow; we’ve seen plenty of paper. Experiencing a scene over time and from different angles (moving our heads) gives our predictive minds extra information.
Add the component of time to the mix. If you sat in a dimly lit grey room all day and then stepped into a sunlit garden, colours will jump out at you from every direction. If you spent much of your life in a small town and then stepped onto the roof of a building in the middle of a megalopolis at night for the first time, the view will be stunning. Whatever your mood has been that day, whatever you did or consumed, whatever the air smells like, whatever feelings of the past were evoked—a whole mix determines and colours what you perceive in the moment.
Now, what happens when we produce a photo (presuming raw capture)?
When we shoot, we get a static 2D slice of light. Modern tech gives us decent dynamic range, but still nowhere near what our eyes handle, with metameric failures induced by RGB separation, and obviously none of that emotional and experiential baggage.
When we produce a deliverable, we interpret those raw scene values to fit them into the even more limited dynamic range and gamut of our target medium. Whether it’s sRGB, P3, paper, or high-nit Rec. 2020, the range we work with is tiny compared to what our cameras capture, to say nothing of what our eyes saw in the first place. So any idea of “true” reproduction automatically goes out the window for most real-life scenes.
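To make that squeeze concrete, here is a minimal sketch of pushing scene-referred linear values into a 0–1 display range. The numbers and the simple Reinhard curve are my own illustration, not a claim about what any particular raw processor does—the point is only that some compressive interpretation has to happen somewhere.

```python
# Illustrative sketch: compressing scene-referred linear values into a
# display range. A camera may record many stops above middle grey; a
# display-referred encoding tops out at 1.0, so a tone curve (here a
# simple global Reinhard curve, chosen just for illustration) decides
# how highlights are squeezed in rather than clipped outright.

def reinhard(x: float) -> float:
    """Global Reinhard tone curve: maps [0, inf) into [0, 1)."""
    return x / (1.0 + x)

def to_display(scene_linear: list[float], exposure: float = 1.0) -> list[float]:
    """Scale scene-referred values, tone-map, and clamp to [0, 1]."""
    return [min(1.0, max(0.0, reinhard(v * exposure))) for v in scene_linear]

# Middle grey (0.18) and a highlight 8 stops above it (0.18 * 2**8 ≈ 46):
# the highlight still lands just under 1.0 instead of hard-clipping.
vals = to_display([0.18, 0.18 * 2**8])
```

The choice of curve (and exposure) is exactly the interpretive step the paragraph above describes—there is no neutral setting of it.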
For example, if you balance a scene lit by an orange streetlight against a colour card, someone will surely complain that the streetlight at that bus stop is not actually white but yellow. And if you calibrate to some “canonical” white point like D50, any human face under that orange streetlight will clip straight into red.
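That trade-off can be sketched with made-up numbers. All the values below are hypothetical—a skin tone under a very warm lamp and two invented sets of per-channel gains—chosen only to show how neutralising the lamp and holding a daylight-ish white point pull in opposite directions:

```python
# Hypothetical linear RGB of a face under a very warm (~2000 K)
# streetlight: heavily weighted toward red. Numbers are invented.
face_under_lamp = (0.80, 0.45, 0.12)

def apply_wb(rgb, gains):
    """Apply per-channel white-balance gains and clamp to [0, 1]."""
    return tuple(min(1.0, c * g) for c, g in zip(rgb, gains))

# Grey-card-style gains (invented) that would make the lamp itself read
# neutral: a big blue boost, red left alone.
card_gains = (1.0, 1.4, 4.0)
# Daylight-ish gains (invented) that keep the lamp looking orange but
# push the already-red-heavy face over the top.
daylight_gains = (1.6, 1.0, 1.1)

balanced_to_card = apply_wb(face_under_lamp, card_gains)
balanced_to_daylight = apply_wb(face_under_lamp, daylight_gains)
# With the daylight-ish gains the red channel clamps at 1.0—the
# “clips straight into red” failure mode.
```

Either choice is defensible; neither is “correct”—which is the point of the paragraph above.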
So how do we balance in the end (meaning general colour interpretation in the final photo, not necessarily using the WB tool)?
In my opinion, correct answers to this question exist only in the narrow scenarios of technical photography, such as digitising a work of art.
Otherwise: colour cards, calibration, all of that helps—but in the end, use your eyes and do whatever you think is appropriate (there are some super helpful pointers in this thread on how to make that easier) to convey what you want to convey in the best way. The limitations of output media don’t give us enough headroom to reproduce a real-life scene anyway, but it doesn’t matter: no one else will ever see the scene the way you did.