Dumping unmodified raw image data from raw files?

Got a DC-ZS200 file, no joy with exiftool. My other tools are not available to me right now; I won’t be able to use them until next week…

I spent a little time over the weekend trying to understand how libraw loads raw files, starting with panasonic_load_raw() for the Panasonic RW2 files that I have, as well as some of the other xxx_load_raw() routines in decoders_dcraw.cpp for the other formats I have.

I count 64 different xxx_load_raw() routines, and the complex ones are all different.

I’m not sure what the best way would be to approach this for a general solution, i.e. modifying each individual xxx_load_raw() vs. something more elegant that happens outside of xxx_load_raw().

Do you have any ideas?

Also, I’m trying to think of a good way to validate the output.

I’ll have to look and see if there’s somewhere in the pipeline where the file data is collected. If not, then using libraw won’t work for arbitrary files…

Now, the data in the file should be contiguous, at least logically, and any metadata giving its offset and size can be used the way reallyraw2dat.cpp already does. That approach requires a metadata library that can present those tags; exiv2 then becomes a potentially useful option.

My knowledge of available hash algorithms is a bit dated, but if you’re not worried about vectors of compromise, most any should work.

Yeah, I think exiv2 might provide a better starting point for the general case, since it already dumps the raw image data when writing files, at least if there’s no analogous way to do this with libraw.

I think md5 should be sufficient for most people, and it would work well with the already-existing 128-bit RawDataUniqueID EXIF tag.

Support for more hashing algorithms could always be added down the road for people concerned about collisions.

Eventually, the command-line options might look something like this:

reallyraw2dat [ -v ] -i <infile> [ -o <outfile> | -stdout | -md5 | -sha256 | -sha512 | … ]
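Just to illustrate the -md5 idea (only a sketch, not part of reallyraw2dat; the rawdata.dat name is a placeholder for whatever the dump step produces), a 128-bit digest over a dumped data blob is a few lines of Perl with Digest::MD5, and its width matches the RawDataUniqueID tag:

use strict;
use warnings;
use Digest::MD5;

# sketch: hash a previously dumped raw-data blob and print the 128-bit digest
my $file = shift or die "usage: $0 rawdata.dat\n";
open my $fh, '<:raw', $file or die "cannot open $file: $!\n";
my $md5 = Digest::MD5->new;
$md5->addfile($fh);              # stream the whole file through the digest
close $fh;
print $md5->hexdigest, "  $file\n";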

So, with a bit of newfound time to spare, I dug into the metadata thing. One thing about the metadata tools exiftool and exiv2: they won’t display the SubIFD tags for the various images included in a raw file. However, exiv2’s samples include exifprint, which will happily display it all. So I pulled that out of the exiv2 source tree, compiled it separately, and used it to display all the metadata of one of my NEFs. The relevant file offset and data size turn up by running exifprint and grepping for ‘Strip’, like this:

$ exifprint DSG_3111.NEF |grep Strip
Exif.Image.StripOffsets                      0x0111 Long        1  114732
Exif.Image.RowsPerStrip                      0x0116 Long        1  120
Exif.Image.StripByteCounts                   0x0117 Long        1  57600
Exif.SubImage2.StripOffsets                  0x0111 Long        1  1645472
Exif.SubImage2.RowsPerStrip                  0x0116 Long        1  3280
Exif.SubImage2.StripByteCounts               0x0117 Long        1  17885222

SubImage2 is the raw data we’re looking for; Image is a thumbnail. The relevant tags are: 1) RowsPerStrip, which here (3280) equals the number of rows in the image data, so the full data set is in a single strip; 2) StripOffsets, the offset into the file where the data starts; and 3) StripByteCounts, the size of the data blob. I modified reallyraw2dat.cpp’s printf to print the rawimage and imagesize variables, and the values match StripOffsets and StripByteCounts for the same NEF.
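Just as a sanity check of the idea (a throwaway sketch, not reallyraw2dat itself), the same extraction can be done in a few lines of Perl by hard-coding the SubImage2 offset and byte count reported above:

use strict;
use warnings;

# sketch: copy the SubImage2 strip out of the NEF using the
# StripOffsets / StripByteCounts values printed by exifprint above
my ($infile, $offset, $count) = ('DSG_3111.NEF', 1645472, 17885222);

open my $in,  '<:raw', $infile       or die "cannot open $infile: $!\n";
open my $out, '>:raw', 'rawdata.dat' or die "cannot open rawdata.dat: $!\n";
seek $in, $offset, 0                  or die "seek failed: $!\n";
read($in, my $buff, $count) == $count or die "short read\n";
print $out $buff;
close $out;
close $in;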

Applying the same thinking to a Panasonic RW2, I used exifprint to extract the same information:

$ exifprint P1013104.RW2 |grep Strip
Exif.PanasonicRaw.StripOffsets               0x0111 Long        1  4294967295
Exif.PanasonicRaw.RowsPerStrip               0x0116 Long        1  3664
Exif.PanasonicRaw.StripByteCounts            0x0117 Long        1  0

Well, the same approach isn’t going to work here: the StripOffsets value (0xFFFFFFFF) is larger than the file size, and a StripByteCounts of 0 doesn’t support the thesis either.

Back to the drawing board… the xxx_load_raw() routines will probably give up secrets, but I’m afraid there may be no homogeneity in the approaches…

For the Panasonic RW2, you want to use RawDataOffset, which seems to be correct for this file type. But the StripByteCounts is still 0.
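For what it’s worth, those tags are easy to query programmatically too; here is a small sketch using the Image::ExifTool Perl module (only the tag names are ExifTool’s, the rest is illustrative):

use strict;
use warnings;
use Image::ExifTool;

# sketch: print the offset-related tags for an RW2 (or any raw file)
my $file = shift or die "usage: $0 file.RW2\n";
my $et   = Image::ExifTool->new;
my $info = $et->ImageInfo($file, 'RawDataOffset', 'StripOffsets', 'StripByteCounts');
printf "%-16s %s\n", $_, $info->{$_} // 'n/a'
    for qw(RawDataOffset StripOffsets StripByteCounts);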

Check out Writer.pl from exiftool:

#------------------------------------------------------------------------------
# Copy image data from one file to another
# Inputs: 0) ExifTool object reference
#         1) reference to list of image data [ position, size, pad bytes ]
#         2) output file ref
# Returns: true on success
sub CopyImageData($$$)
{
    my ($self, $imageDataBlocks, $outfile) = @_;
    my $raf = $$self{RAF};
    my ($dataBlock, $err);
    my $num = @$imageDataBlocks;
    $self->VPrint(0, "  Copying $num image data blocks\n") if $num;
    foreach $dataBlock (@$imageDataBlocks) {
        my ($pos, $size, $pad) = @$dataBlock;
        $raf->Seek($pos, 0) or $err = 'read', last;
        my $result = CopyBlock($raf, $outfile, $size);
        $result or $err = defined $result ? 'read' : 'writ';
        # pad if necessary
        Write($outfile, "\0" x $pad) or $err = 'writ' if $pad;
        last if $err;
    }
    if ($err) {
        $self->Error("Error ${err}ing image data");
        return 0;
    }
    return 1;
}

Nothing new here: the position and size are provided on the command line; the challenge is still divining them…

I’ve seen @Iliah_Borg post here before; his perspective as one of the libraw developers would probably be quite insightful…

Right, but CopyImageData is provided everything it needs by the rest of ExifTool. So presumably, we could just return those $pos and $size parameters.

Or we could basically copy the bits to a memory buffer or a different output file. That should give us the unmodified raw image data.

I just tested with a few debug statements:

print "$num image data blocks\n";
print "pos: $pos, size: $size, pad: $pad, ";

and tested on one of the Panasonic RW2 files I have.

I got the same offset as reported by other programs, and the size reported here appears plausible: 22986752 bytes for a 23603906-byte file.

I think this might work.

I just tried this on the other file formats that were working with reallyraw2dat, and got the correct values for the offset and size.

I’ll be otherwise tied up with something else today, will pick apart the exiftool routine on Thursday…

Here’s some testing on iPhone 12 Pro Max ProRaw DNG files, which have multiple data blocks.

65 image data blocks
pos: 3612, size: 145845, pad: 0, pos: 149457, size: 153732, pad: 0, pos: 303189, size: 156425, pad: 0, pos: 459614, size: 157925, pad: 0, pos: 617539, size: 158392, pad: 0, pos: 775931, size: 157352, pad: 0, pos: 933283, size: 155805, pad: 0, pos: 1089088, size: 154272, pad: 0, pos: 1243360, size: 146496, pad: 0, pos: 1389856, size: 153557, pad: 0, pos: 1543413, size: 156022, pad: 0, pos: 1699435, size: 155417, pad: 0, pos: 1854852, size: 155741, pad: 0, pos: 2010593, size: 156477, pad: 0, pos: 2167070, size: 147454, pad: 0, pos: 2314524, size: 144603, pad: 0, pos: 2459127, size: 146933, pad: 0, pos: 2606060, size: 153178, pad: 0, pos: 2759238, size: 159627, pad: 0, pos: 2918865, size: 161667, pad: 0, pos: 3080532, size: 159122, pad: 0, pos: 3239654, size: 156860, pad: 0, pos: 3396514, size: 144587, pad: 0, pos: 3541101, size: 146435, pad: 0, pos: 3687536, size: 144645, pad: 0, pos: 3832181, size: 154953, pad: 0, pos: 3987134, size: 159339, pad: 0, pos: 4146473, size: 157849, pad: 0, pos: 4304322, size: 155740, pad: 0, pos: 4460062, size: 154221, pad: 0, pos: 4614283, size: 152040, pad: 0, pos: 4766323, size: 152265, pad: 0, pos: 4918588, size: 146993, pad: 0, pos: 5065581, size: 148942, pad: 0, pos: 5214523, size: 152775, pad: 0, pos: 5367298, size: 144907, pad: 0, pos: 5512205, size: 147170, pad: 0, pos: 5659375, size: 143499, pad: 0, pos: 5802874, size: 142795, pad: 0, pos: 5945669, size: 151072, pad: 0, pos: 6096741, size: 148978, pad: 0, pos: 6245719, size: 147370, pad: 0, pos: 6393089, size: 153343, pad: 0, pos: 6546432, size: 154764, pad: 0, pos: 6701196, size: 152782, pad: 0, pos: 6853978, size: 151604, pad: 0, pos: 7005582, size: 149113, pad: 0, pos: 7154695, size: 145593, pad: 0, pos: 7300288, size: 147684, pad: 0, pos: 7447972, size: 146889, pad: 0, pos: 7594861, size: 148436, pad: 0, pos: 7743297, size: 151330, pad: 0, pos: 7894627, size: 152334, pad: 0, pos: 8046961, size: 151399, pad: 0, pos: 8198360, size: 147186, pad: 0, pos: 8345546, size: 142587, pad: 0, pos: 8488133, size: 147791, pad: 0, pos: 8635924, size: 150236, pad: 0, pos: 8786160, size: 148456, pad: 0, pos: 8934616, size: 147215, pad: 0, pos: 9081831, size: 146308, pad: 0, pos: 9228139, size: 146062, pad: 0, pos: 9374201, size: 145635, pad: 0, pos: 9519836, size: 143228, pad: 0, pos: 9663064, size: 218771, pad: 1,

So I think that we can just dump the data as it is being copied for any raw format that exiftool supports.

I just tested this out by opening a test.dat file and writing each block to it inside the foreach loop, omitting the padding.

sub CopyImageData($$$)
{
    my ($self, $imageDataBlocks, $outfile) = @_;
    my $raf = $$self{RAF};
    my ($dataBlock, $err);
    my $num = @$imageDataBlocks;
    $self->VPrint(0, "  Copying $num image data blocks\n") if $num;

    # test hack: dump each block to ./test.dat as it is copied
    my $filename = "./test.dat";
    open(FH, '>', $filename) or die "can't open $filename: $!";
    binmode FH;    # the image data is binary

    foreach $dataBlock (@$imageDataBlocks) {
        my ($pos, $size, $pad) = @$dataBlock;
        $raf->Seek($pos, 0) or $err = 'read', last;

        my $buff;
        $raf->Read($buff, $size+$pad);
        print FH $buff;
        $raf->Seek($pos, 0) or $err = 'read', last; # reset

        my $result = CopyBlock($raf, $outfile, $size);
        $result or $err = defined $result ? 'read' : 'writ';
        # pad if necessary
        Write($outfile, "\0" x $pad) or $err = 'writ' if $pad;
        last if $err;
    }

    close(FH);

    if ($err) {
        $self->Error("Error ${err}ing image data");
        return 0;
    }
    return 1;
}

The size of the DNG file IMG.DNG is 9881836 bytes and the size of the resulting test.dat is 9878223 bytes, which seems plausible.

Furthermore, when I run diff -ua on test.dat and IMG.DNG, there are only a few differences: 1) the stuff at the beginning of IMG.DNG, which is clearly the EXIF data; and 2) no newline at the end of test.dat.

So far so good.

But there also appears to be 3) some data (roughly 48 bits) at the beginning of test.dat that isn’t in IMG.DNG, and that would change the hashes.

--- test.dat
+++ IMG.DNG
@@ -1,4 +1,21 @@
-< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
+< stuff >
@@ -21881,4 +21898,4 @@
-< stuff >
\ No newline at end of file
+< stuff >
\ No newline at end of file

So this approach will require a little more investigation and testing, but I think it will work.

ETA #1: Upon further examination, it appears that the apparent difference 3) at the beginning of the test.dat dump is actually in-line at the end of the EXIF section, and is therefore an artifact of how diff processes the files line by line.

So the output data matches bit-for-bit for the file types I have tested thus far.

ETA #2: I had incorrectly assumed that only the last dataBlock has any padding, but that’s not true based on the way exiftool is reading the data.

In my testing, exiftool finds non-terminal dataBlocks that it determines need padding.

Since we want to copy each dataBlock as it exists in the original file, including any in-line garbage data that is considered padding, we should copy each dataBlock for $size + $pad bytes.

This is some serious sleuthing you guys… :slight_smile:

Also, exiftool writes ‘\0’ bytes for the padding to its output file when, imho, it shouldn’t modify the raw image data. That would change the raw image data on the very first write for affected raw files. I’ll ask the maintainer about this.

There are some raw formats whose image data CopyImageData does not read sequentially from the input file.

For example:

3 image data blocks
pos: 725396, size: 24660480, pad: 0
pos: 55962, size: 669434, pad: 0
pos: 33118, size: 22844, pad: 0

So simply dumping this data in the order that CopyImageData reads it (which was a hack for testing purposes) isn’t going to work for the general case since the goal is to return the unmodified raw image data as it exists in the file.

We should probably be using Writer.pl’s CopyBlock in a separate subroutine anyway.

Either that, or maybe: 1) re-order the dataBlocks by pos and then dump each dataBlock; or 2) keep track of pos, size, and pad, and then dump from the lowest pos for the sum of ($size + $pad).
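A rough sketch of option 1), reusing the same test-hack context as above ($raf, the @$imageDataBlocks list, and the FH dump file), could look like this:

    # sketch of option 1): dump the blocks in file order rather than read order
    my @byPos = sort { $$a[0] <=> $$b[0] } @$imageDataBlocks;
    foreach my $dataBlock (@byPos) {
        my ($pos, $size, $pad) = @$dataBlock;
        $raf->Seek($pos, 0) or last;
        my $buff;
        $raf->Read($buff, $size + $pad);   # include the in-file pad bytes
        print FH $buff;
    }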

Nice work with exiftool…

Tomorrow, I’m going to take a look at the libraw routines to see if there’s a simple code insertion at the discovery of each image byte that could save the bytes as the routine goes about its business. I think a libraw solution will be easier to incorporate into a program…

[Ugh, I reached my maximum post count for the day again, so I have to switch back to my other account…]

exiftool apparently also reorders the dataBlocks sequentially by position when necessary on the first write to file.

So I’m not sure of the best way to handle this, since the sections of the before and after files that contain the raw image data are no longer identical.

But after the first write, the sections of the files that contain the raw image data appear not to change on subsequent writes.

Maybe it would be better to use the sorted dataBlocks to calculate hashes, even though that’s different from my original goal of using the data exactly as it exists in the file.

But this way, the reordered raw image data would still be immutable and not necessarily dependent on the software, its version, or runtime parameters.

It’s one extra step in the calculation that you’d have to port to other software implementations, e.g. one using libraw.
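As a sketch of that (same caveats as before about $raf and the block list being supplied by the surrounding ExifTool code), the digest could be computed over the blocks in position order without ever writing a dump file:

    use Digest::MD5;

    # sketch: digest the blocks in file-position order so the result doesn't
    # depend on the order exiftool happens to read them
    my $md5 = Digest::MD5->new;
    foreach my $dataBlock (sort { $$a[0] <=> $$b[0] } @$imageDataBlocks) {
        my ($pos, $size, $pad) = @$dataBlock;
        $raf->Seek($pos, 0) or last;
        my $buff;
        $raf->Read($buff, $size + $pad);
        $md5->add($buff);
    }
    print 'RawDataMD5: ', $md5->hexdigest, "\n";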

I increased your level for both accounts.