Dumping unmodified raw image data from raw files?

Sorry, I assumed it would. He also mentioned another method, but maybe that is JPEG-only as well…
https://exiftool.org/forum/index.php?topic=6659.msg56358#msg56358

Yes, that’s basically the same method, so it has the same limitations.

I just tried with more RAW formats.

This seems to work for DJI raw DNG files, but not for Apple raw DNG files.

I think DNG files can have a unique value that is independent of the metadata… again, this might be easy to leverage if you have DNG files and they use it… see page 44: https://www.adobe.com/content/dam/acom/en/products/photoshop/pdfs/dng_spec_1.4.0.0.pdf

Okay… I guess the admins could merge your account if you and they wanted.

Thanks for taking the time to research this.

That RawDataUniqueID EXIF tag is just part of the DNG spec, and it’s dependent on whatever the DNG writer chooses to put there, if anything at all.

Most cameras that support this just use some randomly generated ID, which wouldn't be suitable here because it can't be calculated from scratch based solely on the raw image data.

But it would be a good place to store the hashes for DNG files, and it would probably work well with software that recognizes it.

Regardless, I was planning on using exiftool’s ImageUniqueID EXIF tag to store the hashes for all/other raw file types.

ETA: It appears that exiftool supports writing the RawDataUniqueID EXIF tag for non-DNG raw files as well, so I’ll probably end up using this tag since some software already supports it for DNG files.

I think it would be easier if they just removed the daily posting limit on my first account.

I think the discussion about whether you know about DAM and sidecars is relevant, because it scratches at the underlying assumptions in your request. You want to focus on the raw data because you expect other parts of the original image file to change. The point is, if your workflow is set up right, this never needs to happen. Simply hash the original image file and be done with it. Your raw processing editor of choice should never, ever need to modify any bit of the original file.

Then again, if you're set on focusing on the raw data, you need to be aware that for some cameras the sensor data may contain garbage data or black pixels, and the library may cut away those pixels. You need to think about how to handle this.


Just pushed a small fix to reallyraw.cpp: free(imagebuffer). Thought about it last night just before bedtime; lay awake feeling bad for not having it in the code… :scream:

That is true, but sadly not always feasible. Suppose I am a small cog in a large team of image processors, and different team members use different tools to do a variety of tasks, and there is inadequate version control of images. I receive a version of an image. I need to know: is this image identical to one I already have, or not?

In this case, libraw unpack() would be sufficient for hash calculation. I’m not sure why the “reallyraw” data is preferred. Perhaps because that is immune to changes in unpack() processing.

Been thinking about this since @rt966298 (@rt985426?) started the thread; using the product of libraw_unpack() and any of its underlying logic essentially ties any software that does image comparison tests to libraw when there may not be a need. Looking at the metadata exposed by exiftool, decodable tags in the SubIFD for the primary image provide enough information to extract the recorded data (offset from the start-of-file, size).
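To illustrate the "no libraw needed" idea, here is a deliberately simplified sketch of walking a TIFF IFD for the standard StripOffsets (0x0111) and StripByteCounts (0x0117) tags, which locate the recorded image payload. The simplifications are loud ones: little-endian files only, single-strip images with inline LONG values, only the first IFD is scanned, and a little-endian host is assumed; real raw files usually keep these tags in a SubIFD and may use arrays, so treat this as the shape of the approach rather than a working extractor.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Simplified sketch: little-endian TIFF, inline LONG values, IFD0 only.
// Real raw files typically keep these tags in a SubIFD.
static uint16_t rd16(const uint8_t* p) { uint16_t v; std::memcpy(&v, p, 2); return v; }
static uint32_t rd32(const uint8_t* p) { uint32_t v; std::memcpy(&v, p, 4); return v; }

bool find_strip(const std::vector<uint8_t>& f, uint32_t& offset, uint32_t& size) {
    // TIFF header: "II", magic 42, then the offset of the first IFD.
    if (f.size() < 8 || f[0] != 'I' || f[1] != 'I' || rd16(&f[2]) != 42) return false;
    uint32_t ifd = rd32(&f[4]);
    uint16_t n = rd16(&f[ifd]);               // number of 12-byte IFD entries
    bool got_off = false, got_cnt = false;
    for (uint16_t i = 0; i < n; ++i) {
        const uint8_t* e = &f[ifd + 2 + 12u * i];
        uint16_t tag = rd16(e);
        uint32_t val = rd32(e + 8);           // inline value (type LONG, count 1)
        if (tag == 0x0111) { offset = val; got_off = true; }
        if (tag == 0x0117) { size = val;   got_cnt = true; }
    }
    return got_off && got_cnt;
}
```

A production version would of course need bounds checking, big-endian ("MM") support, and SubIFD recursion, but the metadata really does carry everything needed to slice the payload out of the file.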

:rofl: :joy:

Yes, that is the reason.

I previously asked the libraw maintainers about this specifically, and they said that the output from libraw unpack() is not guaranteed to never change in the future.

So if you just want to compare two files, then libraw unpack() would be sufficient.

But if you want to calculate hashes over time, then you would have to freeze the version of libraw for these calculations.

By using the unmodified raw image data, we are future-proofing. This is the most correct way to calculate hashes of the raw image data, imho, since it is not dependent on software, version, or runtime parameters.


@rt966298 I have had some ideas to work around your problem but I will wait for this thread to develop first. In the meantime, I changed the category of this thread and added tags to make it more visible.

As of this writing, ggbutcher’s program already does exactly what it’s supposed to do for the simple raw file formats that it can handle. I think it is already a good, working proof-of-concept program.

I’m already using it to calculate hashes for a bunch of older raw files going back 15+ years, testing multiple file types, and incorporating the hashes into my workflow to see what kinds of issues I encounter in my particular scenarios.

I might eventually write an FAQ to address the questions that keep coming up.


For the more complex file formats with complex data structures, I have consulted with the libraw maintainers for guidance.

Basically, for these files the difficulty arises because there is currently no meaningful data_offset and data_size: libraw loads and processes the files on the fly rather than into memory.

So in order to leverage libraw’s capabilities to handle these types of files, the xxx_load_raw() functionality of each of the decoders would have to be modified to get the equivalents of data_offset and data_size, or possibly copy the bits to a memory buffer as the file is loaded.
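The "copy the bits to a memory buffer as the file is loaded" option could be a tee-style read wrapper. This is emphatically not libraw's API, just a hedged illustration of the shape such a hook might take inside an xxx_load_raw() path:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative "tee" for decoders with no usable data_offset/data_size:
// every byte the decoder consumes is also appended to a capture buffer,
// which can then be hashed after the load completes. Sketch only; the
// real hook would wrap libraw's stream class rather than a vector.
struct TeeReader {
    const std::vector<uint8_t>& file;  // stand-in for the raw file stream
    size_t pos = 0;
    std::vector<uint8_t> captured;     // every byte the decoder consumed

    explicit TeeReader(const std::vector<uint8_t>& f) : file(f) {}

    size_t read(void* dst, size_t n) {
        size_t avail = file.size() - pos;
        if (n > avail) n = avail;
        std::memcpy(dst, file.data() + pos, n);
        captured.insert(captured.end(), file.begin() + pos,
                        file.begin() + pos + static_cast<long>(n));
        pos += n;
        return n;
    }
};
```

One caveat with this approach: the captured stream reflects the order and extent of the decoder's reads, so it is only equivalent to a (data_offset, data_size) slice if the decoder reads the payload exactly once, in order.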

I’m wondering what the best way would be to do this, and possibly provide functionality that the maintainers might want to include like they did for get_internal_data_pointer().

I think getting the equivalents of data_offset and data_size is all that would be needed. I tried modifying your program and set data_size to something static just to see what would happen with complex file formats like Panasonic RW2. It seems to me like this would work with the correct data_offset and data_size.

Give me a make/model of camera that libraw handles that way, and I’ll get a DPReview raw file from that camera and poke around the file structure. For NEFs, I can divine the offset and size from the metadata, without need for libraw.

The Panasonic RW2 raw files that are giving me data_size of 0 are from a Panasonic DC-ZS200.

The iPhone DNG raw files that are giving me inconsistent data_size of 1126, 1146, etc., are from an iPhone 12 Pro Max.

Got a DC-ZS200 file, no joy with exiftool. My other tools are not available to me right now, won’t be able to use them until next week…

I spent a little time over the weekend trying to understand how libraw loads raw files, starting with panasonic_load_raw() for the Panasonic RW2 files that I have, as well as some of the other xxx_load_raw() functionality for other raw formats in decoders_dcraw.cpp and for other formats that I have.

I count 64 different xxx_load_raw(), and the complex ones are all different.

I’m not sure what the best way would be to approach this for a general solution, i.e. modifying each individual xxx_load_raw() vs. something more elegant that happens outside of xxx_load_raw().

Do you have any ideas?

Also, I’m trying to think of a good way to validate the output.