I sometimes encounter images that have almost the same size. My strong suspicion is that only one of these is a direct resize from the original, while the others are resized versions of the smaller one. I want to be able to discard the latter.
Is there a way to accurately determine which is which? Perhaps by looking at artifacts due to JPEG block compression?
I have no idea if this is possible. If I do resizing myself, I always start from the “original”: either the developed raw or a version with lossless compression, mostly PNG. I would only use a JPEG file if no other option exists.
So, if you have different versions of unknown origin, use the one with the largest resolution and the highest compression quality (the quality value can be discovered from the file).
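As an illustration of how that quality value can be read from the file: Pillow exposes the JPEG quantization tables, and lower quality settings produce coarser (larger) quantization coefficients. This is a sketch of my own, not something from the thread; the synthetic in-memory images are just for demonstration:

```python
from io import BytesIO
from PIL import Image

def jpeg_quant_tables(fp):
    """Return the JPEG quantization tables (dict: table id -> 64 coefficients)."""
    with Image.open(fp) as im:
        return im.quantization

# Demo: a lower quality setting yields larger quantization coefficients.
img = Image.new("RGB", (64, 64), "gray")
buf_lo, buf_hi = BytesIO(), BytesIO()
img.save(buf_lo, "JPEG", quality=50)
img.save(buf_hi, "JPEG", quality=95)
buf_lo.seek(0); buf_hi.seek(0)
lo, hi = jpeg_quant_tables(buf_lo), jpeg_quant_tables(buf_hi)
print(sum(lo[0]) > sum(hi[0]))  # True: Q50 tables are coarser than Q95 tables
```

Tools like ImageMagick’s `identify -verbose` do essentially this to report an estimated quality.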
I agree that downsizing to a desired resolution should always be done from the original. However, I have an archive of resized images and not the originals.
I looked into this, but it gives a bad estimate, in my opinion. I have seen many cases where an image was upsized and then saved at Q90+ quality.
You might have a timestamp for the latest modification in the metadata. I’m not sure whether that would reflect the resizing operations, though, but it could give an extra data point (along with size and quality).
Also, resizing operations tend to lose the finest details, and in any case you can’t add detail when upscaling.
So if you have two images very close in size, the one with the most fine detail is probably the original. Note that this is about fine detail, not perceived sharpness (which can be increased through sharpening, e.g. with USM, and should be visible in 100% view).
But that would let you choose the best image; I’m not sure if that’s good enough, or if you really need the oldest.
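The fine-detail criterion above can be put into a number. A common crude proxy is the variance of a discrete Laplacian: of two equally sized images, the one with more fine detail should score higher. A minimal NumPy sketch of my own (not from the thread):

```python
import numpy as np

def fine_detail_score(gray):
    """Variance of the discrete Laplacian -- a crude proxy for fine detail."""
    g = np.asarray(gray, dtype=float)
    lap = (-4.0 * g[1:-1, 1:-1]
           + g[:-2, 1:-1] + g[2:, 1:-1]
           + g[1:-1, :-2] + g[1:-1, 2:])
    return float(lap.var())

# Sanity check on synthetic data: a box-blurred copy loses fine detail.
rng = np.random.default_rng(1)
sharp = rng.integers(0, 256, (200, 200)).astype(float)
blurred = (sharp[:-1, :-1] + sharp[1:, :-1] + sharp[:-1, 1:] + sharp[1:, 1:]) / 4
print(fine_detail_score(sharp) > fine_detail_score(blurred))  # True
```

Note the caveat above, though: unsharp masking also inflates this score, so it tracks sharpening as much as genuine detail.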
Ha ha, examining images is my hobby. My suggestion is two-fold: determine image characteristics using decomposition or transformation methods, and take image quality measurements. Resizing operations, if you can make good guesses about the algorithms used, usually have predictable signatures.
E.g., using @rvietor’s suggestion as the problem to investigate, separate images into detail scales and then compare and contrast the differences among the finest stages between images.
PS - It would be fun if you could provide a sample set to play with. 
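A toy version of that detail-scale idea: split each image into a coarse band (a small blur) and a fine band (the residual), then compare the energy in the finest band. This is my own sketch under assumed parameters (radius-1 box blur as the scale separator), not the poster’s actual method:

```python
import numpy as np

def box_blur(g, r=1):
    """Separable box blur of radius r via an integral image (edges cropped)."""
    k = 2 * r + 1
    c = np.cumsum(np.cumsum(g, axis=0), axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))
    return (c[k:, k:] - c[:-k, k:] - c[k:, :-k] + c[:-k, :-k]) / (k * k)

def finest_band_energy(gray):
    """Mean squared energy of (image - blurred image): the finest detail scale."""
    g = np.asarray(gray, dtype=float)
    r = 1
    detail = g[r:-r, r:-r] - box_blur(g, r)
    return float(np.mean(detail ** 2))

# Demo: a 2x pixel-repeat upscale has far less finest-band energy than
# a native image of the same size.
rng = np.random.default_rng(2)
native = rng.integers(0, 256, (200, 200)).astype(float)
upscaled = np.repeat(np.repeat(native[::2, ::2], 2, axis=0), 2, axis=1)
print(finest_band_energy(native) > finest_band_energy(upscaled))  # True
```

Real resamplers (bilinear, Lanczos) leave weaker but still measurable versions of this signature.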
I thought this problem would be easy, but I haven’t found a solution.
We have two JPEGs of the same image at slightly different sizes, and want to know which is the ancestor of the other, either upsized or downsized. How do we know which is which? And how certain can we be?


That’s because, the way the question is posed, there seems to be no information outside the images themselves. And if they don’t have EXIF info (as in @snibgo’s example), there’s even less to go on.
E.g. if you had recovered a disk with the images stored in a particular way, filesystem dates or directory names could have given clues. If there is EXIF info on your side, camera information could at least help you find the native size of the images, or you could have the name of the editing program used, and so on.
Are you even sure that one of the images is derived from the other, or can both be derived from a third image, perhaps one that’s now lost/deleted?
But even in that case, it looks like it’s going to be a manual job; I see very little chance of automating anything here.
Unfortunately, for privacy reasons, I cannot share any images from the collection. Perhaps I can share some crops later on. But the assumed effect is easy to simulate, see below.
Indeed, the only reliable information I have is the image data. EXIF is often absent, and the file creation date is largely meaningless, as most files have been copied over from an HD and therefore have the same creation date.
Pretty sure, yes. I see images with a higher pixel count but a noticeable loss in quality and fine detail. The only explanation I can think of is that the smaller image was upsized.
Edit: After rereading I must add that your suggestion of a lost ‘original’ is certainly also possible.
Test Case
To make some sort of test case, I did the following:
- Take a ground-truth (GT) image
- Resize GT to 600x400, save as JPEG with Q=50 (to get significant artifacts)
- Resize the saved image to 690x460, save
- Resize GT to 690x460 directly, save
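The steps above can be sketched with Pillow. The resampling filter and the save quality of the last two steps aren’t specified in the thread, so LANCZOS and Q90 below are my assumptions, and the random ground truth is a stand-in for the real (private) image:

```python
from io import BytesIO
from PIL import Image
import numpy as np

def save_load_jpeg(im, quality):
    """Round-trip an image through JPEG at the given quality."""
    buf = BytesIO()
    im.save(buf, "JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf)

# Synthetic stand-in for the ground-truth image.
rng = np.random.default_rng(0)
gt = Image.fromarray(rng.integers(0, 256, (800, 1200, 3), dtype=np.uint8))

result1 = save_load_jpeg(gt.resize((600, 400), Image.LANCZOS), quality=50)
result2 = save_load_jpeg(result1.resize((690, 460), Image.LANCZOS), quality=90)
result3 = save_load_jpeg(gt.resize((690, 460), Image.LANCZOS), quality=90)
```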
Ground truth
Result 1

Result 2

Result 3

Two key questions:
- Is there a way to assert that #2 should be discarded, because it is an upsize of #1?
- Is there a way to assert that #2 should be discarded, because #3 is higher in quality?
To test (1) I could simply rescale #1 and compare it with #2. If that matches, I know I should keep #1, because #2 does not have ‘more information’.
To test (2) I am looking for something to quantify the higher quality of #3 over #2 (to my eyes it is clear).
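One way to put a number on that quality difference is the fraction of spectral power above a frequency cutoff: the downsize-then-upsize chain should have lost more of it. A rough sketch of my own, with an arbitrary cutoff, not an established metric:

```python
import numpy as np

def highfreq_fraction(gray, cutoff=0.25):
    """Fraction of FFT power above `cutoff` (cycles/pixel; Nyquist = 0.5)."""
    g = np.asarray(gray, dtype=float)
    g = g - g.mean()  # drop the DC component
    power = np.abs(np.fft.fft2(g)) ** 2
    fy = np.fft.fftfreq(g.shape[0])[:, None]
    fx = np.fft.fftfreq(g.shape[1])[None, :]
    radius = np.hypot(fy, fx)
    return float(power[radius > cutoff].sum() / power.sum())

# Demo: a pixel-repeat upsize concentrates power at low frequencies.
rng = np.random.default_rng(3)
native = rng.integers(0, 256, (200, 200)).astype(float)
upsized = np.repeat(np.repeat(native[::2, ::2], 2, axis=0), 2, axis=1)
print(highfreq_fraction(native) > highfreq_fraction(upsized))  # True
```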
With a blink comparison at 4× magnification, I can clearly see that result3 has more detail in and around the eye than result2. This isn’t just higher local contrast. So result2 can be discarded in favour of result3. (I am assuming that “more detail” means the image is closer to an original image.)
Result1 is more difficult. It seems to me that it is more similar to result2 than to result3, but I wouldn’t swear to this. And my subjective judgement may be influenced by my knowledge that result2 came from result1.
> To test (1) I could simply rescale #1 and compare it with #2. If that matches, …
That assumes you know how the rescaling (and JPEG compression) was done. If you don’t know that, there are too many variables.
@Thanatomanic: So if you have that little information about the images, why is it so important to identify the “original” image? I think I would go for the best of each lot, with a preference for the larger image if no other criterion gives a winner, not caring overmuch about which is the original (as that info seems to be lost).
My reasoning: you seem to be dealing with rather small size differences between duplicates. The only reason I can think of to enlarge an image slightly is to adapt it for a specific device or print size. But that already assumes a certain level of knowledge to even think of doing that.
You know where those images come from, and may have some idea about their history and the knowledge level of the photographer/user of the images. Perhaps taking that into account would help.
As for your tests, keep in mind that most programs won’t use a default quality setting of 50%, AFAIK. So if the downscaling was done at that level, it must have been intentional (to save space?).
To be honest, I think you are overthinking this a bit. In your tests, you know what’s done to the images. You have no idea what the relation is between similar images in your archive.
And if worst comes to worst, keep all of the images (disk space isn’t that expensive, although that would depend on the archive size…), and make a master archive with the best of each group, for whatever test you use to decide which is “best”. If you are using a Linux file system, you might even use hard links into the original archive, to avoid duplicating images.
