Dumping unmodified raw image data from raw files?

As of this writing, ggbutcher’s program already does exactly what it’s supposed to do for the simple raw file formats that it can handle. I think it is already a good, working proof-of-concept program.

I’m already using it to calculate hashes for a bunch of older raw files going back 15+ years, testing multiple file types, and incorporating the hashes into my workflow to see what kinds of issues I encounter in my particular scenarios.

I might eventually write an FAQ to address the questions that keep coming up.

For the more complex file formats with complex data structures, I have consulted with the libraw maintainers for guidance.

For these files, the difficulty is that there is currently no meaningful data_offset or data_size, because libraw loads and processes the file on the fly rather than reading the raw block into memory first.

So in order to leverage libraw’s capabilities to handle these types of files, the xxx_load_raw() functionality of each of the decoders would have to be modified to get the equivalents of data_offset and data_size, or possibly copy the bits to a memory buffer as the file is loaded.

I’m wondering what the best way would be to do this, and possibly provide functionality that the maintainers might want to include like they did for get_internal_data_pointer().

I think getting the equivalents of data_offset and data_size is all that would be needed. I tried modifying your program and set data_size to something static just to see what would happen with complex file formats like Panasonic RW2. It seems to me like this would work with the correct data_offset and data_size.
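To illustrate why an offset and a size would be enough, here is a minimal Python sketch, assuming the raw image data sits in one contiguous run in the file. The function name read_block is hypothetical and is not part of reallyraw2dat or libraw; it only shows the seek-and-read idea.

```python
def read_block(path, data_offset, data_size):
    """Return data_size bytes starting at data_offset, exactly as stored."""
    with open(path, "rb") as f:
        f.seek(data_offset)
        block = f.read(data_size)
    # A short read means the offset/size pair does not fit this file.
    if len(block) != data_size:
        raise ValueError(f"expected {data_size} bytes, got {len(block)}")
    return block
```

Raising on a short read is a deliberate choice here: a wrong data_size should fail loudly instead of silently producing a different hash.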

Give me a make/model of camera that libraw handles that way, and I’ll get a DPReview raw file from that camera and poke around the file structure. For NEFs, I can divine the offset and size from the metadata, without need for libraw.

The Panasonic RW2 raw files that are giving me data_size of 0 are from a Panasonic DC-ZS200.

The iPhone DNG raw files that are giving me inconsistent data_size of 1126, 1146, etc., are from an iPhone 12 Pro Max.

Got a DC-ZS200 file, no joy with exiftool. My other tools are not available to me right now, won’t be able to use them until next week…

I spent a little time over the weekend trying to understand how libraw loads raw files, starting with panasonic_load_raw() for the Panasonic RW2 files that I have, and then looking at some of the other xxx_load_raw() functions in decoders_dcraw.cpp for the other formats that I have.

I count 64 different xxx_load_raw(), and the complex ones are all different.

I’m not sure what the best way would be to approach this for a general solution, i.e. modifying each individual xxx_load_raw() vs. something more elegant that happens outside of xxx_load_raw().

Do you have any ideas?

Also, I’m trying to think of a good way to validate the output.

I’ll have to look, see if there’s somewhere in the pipe where the file data is collected. If not, then using libraw for any file won’t work…

Now, the data in the file should be contiguous, at least logically, and any information with regard to that offset and size can be used like is already done in reallyraw2dat.cpp. That approach requires a metadata library that can present those tags; exiv2 then becomes a potentially useful approach.

My knowledge of available hash algorithms is a bit dated, but if you’re not worried about vectors of compromise most any should work.

Yeah, I think exiv2 might provide a better starting point for the general case, since it already dumps the raw image data when writing files, especially if there’s no analogous way to do this with libraw.

I think md5 should be sufficient for most people, and it would work well with the already-existing 128-bit RawDataUniqueID EXIF tag.

Support for more hashing algorithms could always be added down the road for people concerned about collisions.
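As a rough sketch of how the hashing step could stay algorithm-agnostic, the following Python uses the standard hashlib module. The helper name hash_block is hypothetical. An MD5 digest is 16 bytes (32 hex characters), which is the 128 bits mentioned above.

```python
import hashlib

def hash_block(data, algorithm="md5"):
    """Hash a bytes object with any algorithm name hashlib knows."""
    h = hashlib.new(algorithm)
    h.update(data)
    return h.hexdigest()
```

Switching to another algorithm is then just a different name, e.g. hash_block(data, "sha256").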

Eventually, the command-line options might look something like this:

reallyraw2dat [ -v ] -i < infile > [ -o < outfile > | -stdout | -md5 | -sha256 | -sha512 | … ]

So, with a bit of newfound time to spare, I dug into the metadata thing. One thing about the metadata tools exiftool and exiv2: they won’t display the subifd tags for the various images included in a raw file. However, exifprint, from exiv2’s samples, will happily display it all. So, I pulled that out of the exiv2 source tree, compiled it separately, and used it to display all the metadata of one of my NEFs. The relevant file offset and data size are found by running exifprint with a grep for ‘Strip’, like this:

$ exifprint DSG_3111.NEF |grep Strip
Exif.Image.StripOffsets                      0x0111 Long        1  114732
Exif.Image.RowsPerStrip                      0x0116 Long        1  120
Exif.Image.StripByteCounts                   0x0117 Long        1  57600
Exif.SubImage2.StripOffsets                  0x0111 Long        1  1645472
Exif.SubImage2.RowsPerStrip                  0x0116 Long        1  3280
Exif.SubImage2.StripByteCounts               0x0117 Long        1  17885222

SubImage2 is the raw data we’re looking for; Image is a thumbnail. The relevant tags are: 1) RowsPerStrip, which in this case is 3280, the number of rows in the image data, so the full data set is in one strip; 2) StripOffsets, which is the distance into the file to find the data; and 3) StripByteCounts, which is the data glob’s size. I modified reallyraw2dat.cpp’s printf to print the rawimage and imagesize variables, and the values are the same as StripOffsets and StripByteCounts for the same NEF.
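For anyone who wants to script this, here is a small Python sketch that pulls the offset and size out of exifprint text like the listing above. It assumes the five-column layout shown (key, tag, type, count, value) and a single strip, i.e. a count of 1. The function name strip_location is hypothetical.

```python
def strip_location(exifprint_text, group="SubImage2"):
    """Return (offset, size) from StripOffsets/StripByteCounts of one group."""
    wanted = {
        f"Exif.{group}.StripOffsets": "offset",
        f"Exif.{group}.StripByteCounts": "size",
    }
    found = {}
    for line in exifprint_text.splitlines():
        fields = line.split()
        # Expect: key, tag id, type, count, value (single-strip case only).
        if len(fields) >= 5 and fields[0] in wanted:
            found[wanted[fields[0]]] = int(fields[-1])
    if set(found) != {"offset", "size"}:
        raise KeyError(f"strip tags for group {group!r} not found")
    return found["offset"], found["size"]
```

Files that store the raw data in several strips would need the full list of values, which this sketch does not handle.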

Applying the same thinking to a Panasonic RW2, I used exifprint to extract the same information:

$ exifprint P1013104.RW2 |grep Strip
Exif.PanasonicRaw.StripOffsets               0x0111 Long        1  4294967295
Exif.PanasonicRaw.RowsPerStrip               0x0116 Long        1  3664
Exif.PanasonicRaw.StripByteCounts            0x0117 Long        1  0

Well, the same approach isn’t going to work here: the StripOffsets value (4294967295, i.e. 0xFFFFFFFF) is larger than the file size, and StripByteCounts is 0, so neither tag locates the raw data.
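A cheap guard against values like these is to check the offset and size against the file size before trusting them. The Python sketch below is only an illustration. Treating 0xFFFFFFFF as an "unset" marker is an assumption based on this one RW2 file, and block_is_plausible is a hypothetical name.

```python
UNSET_OFFSET = 0xFFFFFFFF  # 4294967295; assumed "not set" marker, seen in the RW2 above

def block_is_plausible(offset, size, file_size):
    """Reject offset/size pairs that cannot describe data inside the file."""
    if offset == UNSET_OFFSET or size <= 0:
        return False
    return offset + size <= file_size
```

With the RW2 values above (offset 4294967295, size 0) this returns False for any file size, which is the signal to look elsewhere, for example at RawDataOffset.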

Back to the drawing board… the xxx_load_raw() routines will probably give up secrets, but I’m afraid there may be no homogeneity in the approaches…

For the Panasonic RW2, you want to use RawDataOffset, which seems to be correct for this file type. But the StripByteCounts is still 0.

Check out Writer.pl from exiftool:

#------------------------------------------------------------------------------
# Copy image data from one file to another
# Inputs: 0) ExifTool object reference
#         1) reference to list of image data [ position, size, pad bytes ]
#         2) output file ref
# Returns: true on success
sub CopyImageData($$$)
{
    my ($self, $imageDataBlocks, $outfile) = @_;
    my $raf = $$self{RAF};
    my ($dataBlock, $err);
    my $num = @$imageDataBlocks;
    $self->VPrint(0, "  Copying $num image data blocks\n") if $num;
    foreach $dataBlock (@$imageDataBlocks) {
        my ($pos, $size, $pad) = @$dataBlock;
        $raf->Seek($pos, 0) or $err = 'read', last;
        my $result = CopyBlock($raf, $outfile, $size);
        $result or $err = defined $result ? 'read' : 'writ';
        # pad if necessary
        Write($outfile, "\0" x $pad) or $err = 'writ' if $pad;
        last if $err;
    }
    if ($err) {
        $self->Error("Error ${err}ing image data");
        return 0;
    }
    return 1;
}

Nothing new here: the position and size are passed in as arguments; the challenge is still about divining them…

I’ve seen @Iliah_Borg post here before; his perspective as one of the libraw developers would probably be quite insightful…

Right, but CopyImageData is provided everything it needs by the rest of ExifTool. So presumably, we could just return those $pos and $size parameters.

Or we could basically copy the bits to a memory buffer or a different output file. That should give us the unmodified raw image data.

I just tested with a few debug statements:

print "$num image data blocks\n";
print "pos: $pos, size: $size, pad: $pad, ";

and tested on one of the Panasonic RW2 files I have.

I got the same offset as reported by other programs, and the size reported here appears to be plausible: 22986752 for a 23603906-byte file.

I think this might work.

I just tried this on the other file formats that were working with reallyraw2dat, and got the correct values for the offset and size.

I’ll be otherwise tied up with something else today, will pick apart the exiftool routine on Thursday…

Here’s some testing on iPhone 12 Pro Max ProRaw DNG files, which have multiple data blocks.

65 image data blocks
pos: 3612, size: 145845, pad: 0, pos: 149457, size: 153732, pad: 0, pos: 303189, size: 156425, pad: 0, pos: 459614, size: 157925, pad: 0, pos: 617539, size: 158392, pad: 0, pos: 775931, size: 157352, pad: 0, pos: 933283, size: 155805, pad: 0, pos: 1089088, size: 154272, pad: 0, pos: 1243360, size: 146496, pad: 0, pos: 1389856, size: 153557, pad: 0, pos: 1543413, size: 156022, pad: 0, pos: 1699435, size: 155417, pad: 0, pos: 1854852, size: 155741, pad: 0, pos: 2010593, size: 156477, pad: 0, pos: 2167070, size: 147454, pad: 0, pos: 2314524, size: 144603, pad: 0, pos: 2459127, size: 146933, pad: 0, pos: 2606060, size: 153178, pad: 0, pos: 2759238, size: 159627, pad: 0, pos: 2918865, size: 161667, pad: 0, pos: 3080532, size: 159122, pad: 0, pos: 3239654, size: 156860, pad: 0, pos: 3396514, size: 144587, pad: 0, pos: 3541101, size: 146435, pad: 0, pos: 3687536, size: 144645, pad: 0, pos: 3832181, size: 154953, pad: 0, pos: 3987134, size: 159339, pad: 0, pos: 4146473, size: 157849, pad: 0, pos: 4304322, size: 155740, pad: 0, pos: 4460062, size: 154221, pad: 0, pos: 4614283, size: 152040, pad: 0, pos: 4766323, size: 152265, pad: 0, pos: 4918588, size: 146993, pad: 0, pos: 5065581, size: 148942, pad: 0, pos: 5214523, size: 152775, pad: 0, pos: 5367298, size: 144907, pad: 0, pos: 5512205, size: 147170, pad: 0, pos: 5659375, size: 143499, pad: 0, pos: 5802874, size: 142795, pad: 0, pos: 5945669, size: 151072, pad: 0, pos: 6096741, size: 148978, pad: 0, pos: 6245719, size: 147370, pad: 0, pos: 6393089, size: 153343, pad: 0, pos: 6546432, size: 154764, pad: 0, pos: 6701196, size: 152782, pad: 0, pos: 6853978, size: 151604, pad: 0, pos: 7005582, size: 149113, pad: 0, pos: 7154695, size: 145593, pad: 0, pos: 7300288, size: 147684, pad: 0, pos: 7447972, size: 146889, pad: 0, pos: 7594861, size: 148436, pad: 0, pos: 7743297, size: 151330, pad: 0, pos: 7894627, size: 152334, pad: 0, pos: 8046961, size: 151399, pad: 0, pos: 8198360, size: 147186, pad: 0, pos: 8345546, size: 142587, 
pad: 0, pos: 8488133, size: 147791, pad: 0, pos: 8635924, size: 150236, pad: 0, pos: 8786160, size: 148456, pad: 0, pos: 8934616, size: 147215, pad: 0, pos: 9081831, size: 146308, pad: 0, pos: 9228139, size: 146062, pad: 0, pos: 9374201, size: 145635, pad: 0, pos: 9519836, size: 143228, pad: 0, pos: 9663064, size: 218771, pad: 1,
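One way to sanity-check a listing like this is to parse it and confirm that each block ends where the next one begins. The Python sketch below assumes the "pos: …, size: …, pad: …" format produced by the debug prints above; parse_blocks and find_gaps are hypothetical names.

```python
import re

# \s* tolerates the line wrap that can fall between "size: N," and "pad: N".
BLOCK_RE = re.compile(r"pos:\s*(\d+),\s*size:\s*(\d+),\s*pad:\s*(\d+)")

def parse_blocks(debug_text):
    """Turn the debug output into a list of (pos, size, pad) tuples."""
    return [tuple(int(g) for g in m.groups()) for m in BLOCK_RE.finditer(debug_text)]

def find_gaps(blocks):
    """Return (index, gap_bytes) wherever block i does not end at block i+1."""
    gaps = []
    for i in range(len(blocks) - 1):
        pos, size, pad = blocks[i]
        gap = blocks[i + 1][0] - (pos + size + pad)
        if gap != 0:
            gaps.append((i, gap))
    return gaps
```

For the first three blocks listed above, 3612 + 145845 = 149457 and 149457 + 153732 = 303189, so find_gaps reports nothing for them.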

So I think that we can just dump the data as it is being copied for any raw format that exiftool supports.

I just tested this out by opening a test.dat file and writing each block in the foreach loop to this file, omitting the padding.

sub CopyImageData($$$)
{
    my ($self, $imageDataBlocks, $outfile) = @_;
    my $raf = $$self{RAF};
    my ($dataBlock, $err);
    my $num = @$imageDataBlocks;
    $self->VPrint(0, "  Copying $num image data blocks\n") if $num;

    # debug only: also dump each block, byte-for-byte, to a side file
    my $filename = "./test.dat";
    open(my $dumpfh, '>', $filename) or die "Cannot open $filename: $!";
    binmode($dumpfh);

    foreach $dataBlock (@$imageDataBlocks) {
        my ($pos, $size, $pad) = @$dataBlock;
        $raf->Seek($pos, 0) or $err = 'read', last;

        # read $size + $pad bytes so in-file padding is dumped as stored
        my $buff;
        $raf->Read($buff, $size+$pad);
        print $dumpfh $buff;
        $raf->Seek($pos, 0) or $err = 'read', last; # reset for CopyBlock

        my $result = CopyBlock($raf, $outfile, $size);
        $result or $err = defined $result ? 'read' : 'writ';
        # pad if necessary
        Write($outfile, "\0" x $pad) or $err = 'writ' if $pad;
        last if $err;
    }

    close($dumpfh);

    if ($err) {
        $self->Error("Error ${err}ing image data");
        return 0;
    }
    return 1;
}

The size of the DNG file IMG.DNG is 9881836 and the size of the resulting test.dat is 9878223, which seems plausible.

Furthermore, when I do a diff -ua test.dat and IMG.DNG, there are only a few differences: 1) the stuff at the beginning of the IMG.DNG file which clearly has the EXIF data; and 2) no newline at the end of the test.dat.

So far so good.

But there also appears to be 3) some data (roughly 48 bits) at the beginning of test.dat that aren’t also in IMG.DNG, and that will change the hashes.

--- test.dat
+++ IMG.DNG
@@ -1,4 +1,21 @@
-< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
@@ -21881,4 +21898,4 @@
-< stuff >
\ No newline at end of file
+< stuff >
\ No newline at end of file

So this approach will require a little more investigation and testing, but I think it will work.

ETA #1: Upon further examination, it appears that the 3) apparent difference at the beginning of the test.dat dump is actually in-line at the end of the EXIF section, and therefore is an artifact of how diff processes line-by-line.

So the output data matches bit-for-bit for the file types I have tested thus far.
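Since diff works line by line and binary data contains newlines at arbitrary places, a byte-level comparison avoids this kind of false alarm. Here is a minimal Python sketch, assuming the dump is one contiguous run that starts at a known offset in the original file; dump_matches_source is a hypothetical name.

```python
def dump_matches_source(source_path, dump_path, offset):
    """True if the dump equals the bytes at `offset` in the source file."""
    with open(dump_path, "rb") as f:
        dump = f.read()
    with open(source_path, "rb") as f:
        f.seek(offset)
        original = f.read(len(dump))
    return original == dump
```

For the DNG test above, the offset would be the position of the first data block, provided the blocks really are back to back in the file.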

ETA #2: I had incorrectly assumed that only the last dataBlock has any padding, but that’s not true based on the way exiftool is reading the data.

Through my testing, exiftool is finding non-terminal dataBlocks that it determines need padding.

Since we want to copy each dataBlock as it exists in the original file including any in-line garbage data that is considered padding, we should copy each dataBlock for $size + $pad.
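The same idea outside of exiftool might look like the Python sketch below: given a list of (pos, size, pad) blocks, copy size + pad bytes per block and hash them as they go by. This is an illustration, not exiftool's code; dump_blocks is a hypothetical name and MD5 is used only because it was suggested earlier in the thread.

```python
import hashlib

def dump_blocks(source_path, blocks, out_path):
    """Copy each block (including its in-file padding) and return the MD5."""
    md5 = hashlib.md5()
    with open(source_path, "rb") as src, open(out_path, "wb") as out:
        for pos, size, pad in blocks:
            src.seek(pos)
            # Read the padding bytes from the file itself; nothing is
            # synthesized, so the dump stays identical to the source bytes.
            chunk = src.read(size + pad)
            out.write(chunk)
            md5.update(chunk)
    return md5.hexdigest()
```

If a block's padding would run past the end of the file, the read simply returns fewer bytes, so nothing is invented for the dump.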

This is some serious sleuthing you guys… :slight_smile:

One more thing: exiftool writes ‘\0’ bytes for the padding to its output file when, imho, it shouldn’t modify the raw image data at all. For affected raw files, this would change the raw image data the first time exiftool rewrites the file. I’ll ask the maintainer about this.