Dumping unmodified raw image data from raw files?

That is true, but sadly not always feasible. Suppose I am a small cog in a large team of image processors, and different team members use different tools to do a variety of tasks, and there is inadequate version control of images. I receive a version of an image. I need to know: is this image identical to one I already have, or not?

In this case, libraw unpack() would be sufficient for hash calculation. I’m not sure why the “reallyraw” data is preferred. Perhaps because that is immune to changes in unpack() processing.

Been thinking about this since @rt966298 (@rt985426?) started the thread; using the product of libraw_unpack() and any of its underlying logic essentially ties any software that does image comparison tests to libraw when there may not be a need. Looking at the metadata exposed by exiftool, decodable tags in the subifd for the primary image provide enough information to extract the recorded data (offset from the start-of-file, size).

:rofl: :joy:

Yes, that is the reason.

I previously asked the libraw maintainers about this specifically, and they said that the output from libraw unpack() is not guaranteed to never change in the future.

So if you just want to compare two files, then libraw unpack() would be sufficient.

But if you want to calculate hashes over time, then you would have to freeze the version of libraw for these calculations.

By using the unmodified raw image data, we are future-proofing. This is the most correct way to calculate hashes of the raw image data, imho, since it is not dependent on software, version, or runtime parameters.

@rt966298 I have had some ideas to work around your problem but I will wait for this thread to develop first. In the meantime, I changed the category of this thread and added tags to make it more visible.

As of this writing, ggbutcher’s program already does exactly what it’s supposed to do for the simple raw file formats that it can handle. I think it is already a good, working proof-of-concept program.

I’m already using it to calculate hashes for a bunch of older raw files going back 15+ years, testing multiple file types, and incorporating the hashes into my workflow to see what kinds of issues I encounter in my particular scenarios.

I might eventually write an FAQ to address the questions that keep coming up.

For the more complex file formats with complex data structures, I have consulted with the libraw maintainers for guidance.

Basically, the difficulty with these files is that there is currently no meaningful data_offset or data_size, because libraw loads and processes the files on the fly rather than into memory.

So in order to leverage libraw’s capabilities to handle these types of files, the xxx_load_raw() functionality of each of the decoders would have to be modified to get the equivalents of data_offset and data_size, or possibly copy the bits to a memory buffer as the file is loaded.

I’m wondering what the best way would be to do this, and possibly provide functionality that the maintainers might want to include like they did for get_internal_data_pointer().

I think getting the equivalents of data_offset and data_size is all that would be needed. I tried modifying your program and set data_size to something static just to see what would happen with complex file formats like Panasonic RW2. It seems to me like this would work with the correct data_offset and data_size.

Give me a make/model of camera that libraw handles that way, and I’ll get a DPReview raw file from that camera and poke around the file structure. For NEFs, I can divine the offset and size from the metadata, without need for libraw.

The Panasonic RW2 raw files that are giving me data_size of 0 are from a Panasonic DC-ZS200.

The iPhone DNG raw files that are giving me inconsistent data_size of 1126, 1146, etc., are from an iPhone 12 Pro Max.

Got a DC-ZS200 file, no joy with exiftool. My other tools are not available to me right now, won’t be able to use them until next week…

I spent a little time over the weekend trying to understand how libraw loads raw files, starting with panasonic_load_raw() for the Panasonic RW2 files that I have, as well as some of the other xxx_load_raw() functionality for other raw formats in decoders_dcraw.cpp and for other formats that I have.

I count 64 different xxx_load_raw(), and the complex ones are all different.

I’m not sure what the best way would be to approach this for a general solution, i.e. modifying each individual xxx_load_raw() vs. something more elegant that happens outside of xxx_load_raw().

Do you have any ideas?

Also, I’m trying to think of a good way to validate the output.

I’ll have to look, see if there’s somewhere in the pipe where the file data is collected. If not, then using libraw for any file won’t work…

Now, the data in the file should be contiguous, at least logically, and any information with regard to that offset and size can be used like is already done in reallyraw2dat.cpp. That approach requires a metadata library that can present those tags; exiv2 then becomes a potentially useful approach.
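To make the offset-and-size idea concrete, here is a minimal Python sketch (not the code from reallyraw2dat.cpp). It assumes the raw data is a single contiguous extent and that the offset and size have already been found by some other means; the function name `hash_extent` and its parameters are made up for illustration.

```python
import hashlib

def hash_extent(path, offset, size, algorithm="md5", chunk_size=1 << 20):
    """Hash `size` bytes of the file at `path`, starting at byte `offset`."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        f.seek(offset)
        remaining = size
        while remaining > 0:
            # Read in chunks so a large raw file is never held in memory at once.
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                raise ValueError("file ended before the requested extent was read")
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()
```

Because only the bytes inside the extent are read, edits to metadata elsewhere in the file would not change the result, which is the point of hashing the unmodified raw image data.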

My knowledge of available hash algorithms is a bit dated, but if you’re not worried about vectors of compromise most any should work.

Yeah, I think exiv2 might provide a better starting point for the general case, since it already dumps the raw image data when writing files, especially if there’s no analogous way to do this with libraw.

I think md5 should be sufficient for most people, and it would work well with the already-existing 128-bit RawDataUniqueID EXIF tag.
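As a quick sanity check on the size argument, this small Python sketch prints the digest width of each algorithm mentioned in this thread. The helper name `digest_bits` is just for illustration.

```python
import hashlib

def digest_bits(algorithm):
    """Return the digest width of a hashlib algorithm, in bits."""
    return hashlib.new(algorithm).digest_size * 8

for name in ("md5", "sha256", "sha512"):
    print(name, digest_bits(name))
# md5 128
# sha256 256
# sha512 512
```

Only md5 matches a 128-bit field exactly; a longer digest would need to be stored somewhere else.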

Support for more hashing algorithms could always be added down the road for people concerned about collisions.

Eventually, the command-line options might look something like this:

reallyraw2dat [ -v ] -i < infile > [ -o < outfile > | -stdout | -md5 | -sha256 | -sha512 | … ]
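For illustration only, that option layout could be prototyped in Python with argparse. This is a sketch, not the actual reallyraw2dat interface: the attribute names (`verbose`, `infile`, `outfile`, `mode`) are assumptions, and only the output choices spelled out above are included.

```python
import argparse

def build_parser():
    """Build a parser mirroring the proposed usage line (hypothetical names)."""
    parser = argparse.ArgumentParser(prog="reallyraw2dat")
    parser.add_argument("-v", dest="verbose", action="store_true",
                        help="verbose output")
    parser.add_argument("-i", dest="infile", required=True,
                        help="input raw file")
    # The output choices are alternatives, so at most one may be given.
    out = parser.add_mutually_exclusive_group()
    out.add_argument("-o", dest="outfile",
                     help="write the raw image data to this file")
    out.add_argument("-stdout", dest="mode", action="store_const",
                     const="stdout", help="write the raw image data to stdout")
    for algo in ("md5", "sha256", "sha512"):
        out.add_argument("-" + algo, dest="mode", action="store_const",
                         const=algo, help="print the " + algo + " hash")
    return parser

if __name__ == "__main__":
    print(build_parser().parse_args())
```

Putting the output options in one mutually exclusive group means a call such as `-md5 -sha256` is rejected up front; whether several hashes should be allowed in one run is a design choice this sketch does not settle.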

So, with a bit of newfound time to spare, I dug into the metadata thing. One thing about the metadata tools, exiftool and exiv2, they won’t display the subifd tags for the various images included in a raw file. However, in exiv2’s samples, exifprint will happily display it all. So, I pulled that out of the exiv2 source tree, compiled it separately, and used it to display all the metadata of one of my NEFs. The relevant file offset and data size are found by running exifprint with a grep for ‘Strip’, like this:

$ exifprint DSG_3111.NEF |grep Strip
Exif.Image.StripOffsets                      0x0111 Long        1  114732
Exif.Image.RowsPerStrip                      0x0116 Long        1  120
Exif.Image.StripByteCounts                   0x0117 Long        1  57600
Exif.SubImage2.StripOffsets                  0x0111 Long        1  1645472
Exif.SubImage2.RowsPerStrip                  0x0116 Long        1  3280
Exif.SubImage2.StripByteCounts               0x0117 Long        1  17885222

SubImage2 is the raw data we’re looking for; Image is a thumbnail. The relevant tags are: 1) RowsPerStrip, which in this case is 3280, the number of rows in the image data, so the full data set is in one strip; 2) StripOffsets, which is the distance into the file to find the data; and 3) StripByteCounts, which is the data glob’s size. I modified reallyraw2dat.cpp’s printf to print the rawimage and imagesize variables, and the values are the same as StripOffsets and StripByteCounts for the same NEF.
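The tag reading above can be automated. The following Python sketch pulls the (offset, size) pair for each image out of exifprint text like the listing above. It assumes the column layout shown there (key, tag id, type, count, value) and only handles single-strip images, i.e. a count of 1; the function name `strip_extents` is made up.

```python
def strip_extents(exifprint_output):
    """Map each IFD group (e.g. 'SubImage2') to (StripOffsets, StripByteCounts)."""
    tags = {}
    for line in exifprint_output.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # not a "key id type count value" line
        parts = fields[0].split(".")
        if len(parts) != 3 or parts[0] != "Exif":
            continue
        group, tag = parts[1], parts[2]
        # Only single-strip images are handled (count field == 1).
        if tag in ("StripOffsets", "StripByteCounts") and fields[3] == "1":
            tags.setdefault(group, {})[tag] = int(fields[4])
    return {
        group: (found["StripOffsets"], found["StripByteCounts"])
        for group, found in tags.items()
        if len(found) == 2
    }

SAMPLE = """\
Exif.SubImage2.StripOffsets                  0x0111 Long        1  1645472
Exif.SubImage2.RowsPerStrip                  0x0116 Long        1  3280
Exif.SubImage2.StripByteCounts               0x0117 Long        1  17885222
"""
print(strip_extents(SAMPLE))  # {'SubImage2': (1645472, 17885222)}
```

Deciding which group holds the raw data (SubImage2 here) is still left to the caller.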

Applying the same thinking to a Panasonic RW2, I used exifprint to extract the same information:

$ exifprint P1013104.RW2 |grep Strip
Exif.PanasonicRaw.StripOffsets               0x0111 Long        1  4294967295
Exif.PanasonicRaw.RowsPerStrip               0x0116 Long        1  3664
Exif.PanasonicRaw.StripByteCounts            0x0117 Long        1  0

Well, the same approach isn’t going to work here: the StripOffsets value is larger than the file size, and a StripByteCounts of 0 doesn’t support the thesis.
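A cheap guard against values like these is to check the extent against the file size before trusting it. This is a sketch in Python; the function name and the file sizes in the example calls are made up. Note that 4294967295 is 0xFFFFFFFF, the largest value a 32-bit field can hold, so it looks more like a placeholder than a real offset.

```python
def extent_is_plausible(offset, size, file_size):
    """True if [offset, offset + size) is a non-empty range inside the file."""
    return size > 0 and offset >= 0 and offset + size <= file_size

# RW2 values from the listing above; the file size is a made-up example.
print(extent_is_plausible(4294967295, 0, 25_000_000))      # False
# NEF SubImage2 values; the file size is again a made-up example.
print(extent_is_plausible(1645472, 17885222, 20_000_000))  # True
```

Passing the check only means the numbers are not obviously wrong; it does not prove the extent is the raw image data.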

Back to the drawing board… the xxx_load_raw() routines will probably give up secrets, but I’m afraid there may be no homogeneity in the approaches…

For the Panasonic RW2, you want to use RawDataOffset, which seems to be correct for this file type. But the StripByteCounts is still 0.

Check out Writer.pl from exiftool:

#------------------------------------------------------------------------------
# Copy image data from one file to another
# Inputs: 0) ExifTool object reference
#         1) reference to list of image data [ position, size, pad bytes ]
#         2) output file ref
# Returns: true on success
sub CopyImageData($$$)
{
    my ($self, $imageDataBlocks, $outfile) = @_;
    my $raf = $$self{RAF};
    my ($dataBlock, $err);
    my $num = @$imageDataBlocks;
    $self->VPrint(0, "  Copying $num image data blocks\n") if $num;
    foreach $dataBlock (@$imageDataBlocks) {
        my ($pos, $size, $pad) = @$dataBlock;
        $raf->Seek($pos, 0) or $err = 'read', last;
        my $result = CopyBlock($raf, $outfile, $size);
        $result or $err = defined $result ? 'read' : 'writ';
        # pad if necessary
        Write($outfile, "\0" x $pad) or $err = 'writ' if $pad;
        last if $err;
    }
    if ($err) {
        $self->Error("Error ${err}ing image data");
        return 0;
    }
    return 1;
}

Nothing new here, the position and size are provided in the command line; the challenge is still about divining them…

I’ve seen @Iliah_Borg post here before; his perspective as one of the libraw developers would probably be quite insightful…

Right, but CopyImageData is provided everything it needs by the rest of ExifTool. So presumably, we could just return those $pos and $size parameters.

Or we could basically copy the bits to a memory buffer or a different output file. That should give us the unmodified raw image data.
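A rough Python analogue of the memory-buffer idea, for illustration only: given the same kind of [position, size, pad bytes] list that CopyImageData receives, it concatenates the blocks into one bytes object that can then be hashed or written out. The function name `read_image_data` is made up, and whether pad bytes belong in a hash is an open question that this sketch does not answer; it simply mirrors the Perl routine.

```python
import io

def read_image_data(infile, blocks):
    """Concatenate (position, size, pad) blocks from a binary file object."""
    out = io.BytesIO()
    for pos, size, pad in blocks:
        infile.seek(pos)
        data = infile.read(size)
        if len(data) != size:
            raise IOError("short read at offset %d" % pos)
        out.write(data)
        # Mirror the Perl routine: append zero bytes when padding is requested.
        out.write(b"\0" * pad)
    return out.getvalue()
```

Usage would be along the lines of `hashlib.md5(read_image_data(f, blocks)).hexdigest()` with `f` opened in binary mode.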

I just tested with a few debug statements:

print "$num image data blocks\n";
print "pos: $pos, size: $size, pad: $pad, ";

and tested on one of the Panasonic RW2 files I have.

I got the same offset as reported by other programs, and the size reported here appears to be plausible: 22986752 for a 23603906-byte file.

I think this might work.

I just tried this on the other file formats that were working with reallyraw2dat, and got the correct values for the offset and size.