Rawsamples.ch and hosting raw sample files

I’ve been having a discussion with @LebedevRI on IRC over the past couple of days about the state of Rawsamples.ch (in short, not good):

Dear Visitor

Rawsamples.ch is actually not up-to-date, as it was hijacked by a SQL-Injection and the database which belongs to the Joomla-CMS was corrupted. I’m very sorry about this, especially as I haven’t had an actual backup of the Database-Content. As long as you see this remark the repair of rawsamples.ch is ongoing and the content is not current (it is between Dec. 2015-Jan.2016). Stay tuned, I’m working on the content.

Thanks for your patience.

Jakob Rohrbach / Nov. 2nd. 2016

LebedevRI was asking me about possibly building out a replacement for the site that would continue its functionality.

As a first step I’ve gone ahead and scraped all of the files from the site and have stored them on the main site (https://pixls.us/files/rawsamples/files/ [6.8G total]). Don’t be fooled into grabbing the all_raws_rawsamples_ch.7z file, it’s missing stuff and is at least an order of magnitude smaller than it should be.

Archiving these files is all well and good, but the harder question is what a good long-term solution might be that will replicate the capabilities that the original site had (or maybe do it better).

In the end it boils down to:

  1. A nice, friendly UI to check whether all expected samples for a given camera are available, and to upload the missing ones
  2. Table output of all available raw samples; the camera list must be cross-referenced with https://github.com/darktable-org/darktable/blob/master/src/external/rawspeed/data/cameras.xml

So I have a question for everyone (mostly @staff). What is a good solution moving forward to continue hosting and maintaining sample raw files? I was thinking primarily of two main approaches:

  1. Use the forums. It’s relatively simple to set up a new category just for posting raw files. We have more than enough eyeballs to make sure any manual formatting of post titles or tags would be managed fine.
  • Pros

    • Low impact/barrier to implementation
    • threaded discussion available (by nature of implementation)
    • already set up to upload files of up to 100MB to Amazon
  • Cons

    • searching/filtering available/missing files might be more clumsy?
    • manual intervention possibly required to correctly sort/tag files
    • possibly polluting topic lists with new raw uploads (showing up as new posts)
  2. Build something new and expose it at an address like pixls.us/raw
    This could be as simple as re-implementing what is already being done on rawsamples.ch. That is, emailing files to one of us to include in the list manually, or possibly using an anonymous FTP and requiring an admin to get, verify, and include the file on the page.
    This could be extended to something more modern, possibly more secure, and possibly at least slightly more automated.
  • Pros

    • Full customization of page listings (cleaner look)
    • Easier to generate view as requested (list of uploaded/missing + tables)
  • Cons

    • More work, higher impact.
      Though, replicating the current workflow of rawsamples.ch wouldn’t be that hard honestly. Something more fancy might require quite a bit more time/work.

Tools immediately available to use are FTP (could set this up right now), the normal pixls.us build system, and this forum.

I will espouse my love of git here and now and say that we should have a git repo of raw files :slight_smile: I would be happy to accept files by email and include them in the git repo; I’d also be happy to take pull requests on github et al and maintain the integrity of the repo. Git works for the obvious reasons: it is distributed, has integrity checking, and there are plentiful amounts of free hosting. Also, simple tools (git + a script to generate a web page) mean low maintenance, fewer maintenance resources, less programming maintenance, and less web site overhead. Less is more!

It wouldn’t be hard to write an exiv2/exiftool script that loops over the directory and generates the table files and links to each individual one.
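
As a rough illustration of that idea, here is a minimal Python sketch that walks a directory of raw files and emits an HTML table with one link per file. The function names (`collect_samples`, `render_table`) and the `<make>/<model>/<file>` directory layout are assumptions made for this example; in a real script the make and model would more likely be read from the file with exiv2 or exiftool than taken from the path.

```python
import html
import os
from urllib.parse import quote


def collect_samples(root):
    """Return sorted (make, model, relative_path, size_bytes) rows.

    Assumes a hypothetical layout of root/<make>/<model>/<file>;
    files stored at any other depth are skipped.
    """
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            parts = os.path.relpath(full, root).split(os.sep)
            if len(parts) != 3:
                continue
            make, model, _ = parts
            rows.append((make, model, "/".join(parts), os.path.getsize(full)))
    return sorted(rows)


def render_table(rows, base_url=""):
    """Render the rows as a plain HTML table with a download link per file."""
    lines = [
        "<table>",
        "<tr><th>Make</th><th>Model</th><th>File</th><th>Bytes</th></tr>",
    ]
    row_fmt = '<tr><td>{}</td><td>{}</td><td><a href="{}">{}</a></td><td>{}</td></tr>'
    for make, model, rel, size in rows:
        link = base_url + quote(rel)  # percent-encode spaces etc., keep '/'
        lines.append(row_fmt.format(
            html.escape(make),
            html.escape(model),
            html.escape(link, quote=True),
            html.escape(rel.rsplit("/", 1)[-1]),
            size,
        ))
    lines.append("</table>")
    return "\n".join(lines)
```

Running `render_table(collect_samples("files"), "/raw/")` on each push would be enough to regenerate a static listing page, which keeps the maintenance burden close to zero.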

We could then include it on the main pixls site, pixls.us/raw or what-have-you.

Are you aware that a complete ISO set of e.g. Pentax K-3 II pixelshift files has a size of 2.6 GB? Add the size of a complete set of Pentax K-3 II HDR files (1.95 GB) and the size of a complete set of Pentax K-3 II standard raw files (0.65 GB), multiply by 2 (the Pentax K-3 II supports PEF and DNG output), and you end up with 10.4 GB for that camera. Of course it’s an extreme example because of the multiple formats this camera supports, but for the Pentax K-1 it’s even more…

We can do one of the following:

  • use git submodules
  • use git LFS
  • use git-annex

But I’m always open to suggestions. :smiley:

So hello there.
I’m only in favor of the second option, and i hope it will be clear why, unless i fail to write up my basic vision of what needs to be there:

  • Basic structure/list of the samples can be autogenerated from that cameras.xml.
  • As you can see, that file has the following entries: <Camera make="$1" model="$2" mode="$3">
    For starters, what we need is: one raw file per "$1 $2 $3" tuple.
    That list can be trivially auto-generated.
    $ grep "<Camera " cameras.xml | wc -l tells me that there are 841 (sic!) entries.
    • Also, there are <Alias id="$4">$5</Alias> sub-entries in some <Camera> entries, so actually
      there is a second tuple: "$1 $5 $3" (yes, no typos here.)
      $ grep "<Alias " cameras.xml | wc -l tells me that there are 43 aliases.
      So multiply some portion of 841 by 43. That is, 1000+ samples?
    • As pointed out, there may be a need for more than one sample per tuple, e.g. just some random shot for regression testing, and a proper color checker picture, etc.
  • For our continuous integration purposes (hallo, apparently new rawspeed maintainer here speaking), all the samples need to be downloadable as one big tarball. And maybe there should be smaller tarballs, e.g. per-manufacturer, per-camera.
  • License for new sample uploads MUST be well-defined, and accepted before upload.
    I’m proposing to enforce at least the same license as rawsamples.ch uses: CC-BY-NC-SA.
    Otherwise, CC0?
  • There needs to be a very simple and friendly web-ui for sample submission.
  • User selects camera manufacturer ($1), then selects camera model ($2+$5), and is presented with either a message that all samples are present, or with a list of wanted samples.
  • New submissions should not go live immediately, but instead there should probably be some pre-moderation.

This is the very basis of what we need for rawspeed, in my opinion.
I’m not so sure that representing it as a forum will work well :slight_smile:
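
The tuple list described above could be generated along these lines. This is only a sketch: the function names `expected_tuples` and `wanted` are made up for this example, and it assumes nothing about cameras.xml beyond the `<Camera make model mode>` attributes and `<Alias>` sub-entries quoted in the post. It counts the tuples directly from the file rather than estimating them.

```python
import xml.etree.ElementTree as ET


def expected_tuples(xml_text):
    """Return sorted, de-duplicated (make, model, mode) tuples.

    Each <Camera> contributes one tuple, plus one extra tuple per
    <Alias> found anywhere below it (alias text replaces the model).
    A missing mode attribute is treated as an empty string.
    """
    root = ET.fromstring(xml_text)
    tuples = set()
    for cam in root.iter("Camera"):
        make = cam.get("make", "")
        mode = cam.get("mode", "")
        tuples.add((make, cam.get("model", ""), mode))
        for alias in cam.iter("Alias"):
            tuples.add((make, (alias.text or "").strip(), mode))
    return sorted(tuples)


def wanted(expected, available):
    """Return the expected tuples for which no sample has been uploaded yet."""
    have = set(available)
    return [t for t in expected if t not in have]
```

The output of `wanted()` is what the submission UI would show as the list of wanted samples once the user has picked a manufacturer and model.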

Sounds good. Do we also want info about the firmware of the camera used to shoot the samples? In most cases I guess that could be extracted on upload. Thus we might also want some database that stores the metadata (make, model, …) to make it easy to search.
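
A minimal sketch of such a metadata store, using SQLite from the Python standard library. The table layout, column names and helper functions here are assumptions for illustration only, not an agreed schema.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the hypothetical sample-metadata database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS samples (
               id INTEGER PRIMARY KEY,
               make TEXT NOT NULL,
               model TEXT NOT NULL,
               mode TEXT NOT NULL DEFAULT '',
               firmware TEXT,
               path TEXT NOT NULL UNIQUE
           )"""
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_make_model ON samples (make, model)"
    )
    return conn


def add_sample(conn, make, model, path, mode="", firmware=None):
    """Record one uploaded file; firmware may be unknown (None)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO samples (make, model, mode, firmware, path) "
            "VALUES (?, ?, ?, ?, ?)",
            (make, model, mode, firmware, path),
        )


def find_samples(conn, make, model=None):
    """Return (make, model, mode, firmware, path) rows for a make, optionally a model."""
    sql = "SELECT make, model, mode, firmware, path FROM samples WHERE make = ?"
    params = [make]
    if model is not None:
        sql += " AND model = ?"
        params.append(model)
    return conn.execute(sql + " ORDER BY model, path", params).fetchall()
```

Further fields (for example the software that produced a file) could be added as extra columns in the same way.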

As an alternative to FTP, you could consider using AWS Lambda. The upload portion can be a simple submit form in pixls.us/raw as @patdavid suggests, and that handles the upload to an “incoming” storage bucket (RAW files and submitter metadata). That triggers a Lambda function which takes care of everything else: licence checks, RAW metadata checks, etc. and finally moves the files into appropriate buckets/folders for serving and updates the database.
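
To make the flow a little more concrete, here is a hedged sketch of the decision logic such a function might contain. It only parses the S3 notification payload and plans where each file should go; the actual copy (via the AWS SDK) and the database update are deliberately left out. The prefixes `pending/` and `rejected/`, the `lookup_license` callback and the accepted-license set are all assumptions for this example.

```python
from urllib.parse import unquote_plus

# Assumption: the licenses proposed earlier in this thread.
ACCEPTED_LICENSES = {"CC-BY-NC-SA", "CC0"}


def incoming_objects(event):
    """Yield (bucket, key) pairs from an S3 event notification payload."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # S3 URL-encodes object keys in notifications (spaces arrive as '+').
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def destination_key(key, license_name):
    """Pick a hypothetical destination prefix based on the declared license."""
    name = key.rsplit("/", 1)[-1]
    if license_name not in ACCEPTED_LICENSES:
        return "rejected/" + name
    # Accepted uploads still wait for a moderator before going live.
    return "pending/" + name


def handler(event, context, lookup_license=lambda bucket, key: None):
    """Lambda-style entry point that returns planned moves instead of performing them.

    lookup_license is a hypothetical hook that would read the submitter
    metadata stored next to the upload; by default nothing is accepted.
    """
    plans = []
    for bucket, key in incoming_objects(event):
        plans.append({
            "bucket": bucket,
            "source": key,
            "destination": destination_key(key, lookup_license(bucket, key)),
        })
    return plans
```

Keeping the planning step free of network calls makes it easy to test locally before wiring it to real buckets.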

Just a suggestion.

I’m a little bit rusty with web dev, but i’m pretty sure that is indeed absolutely needed for proper implementation.

It strikes me as important because it could well help the database indicate which files have metadata that cannot be read by mainstream free software tools. One obvious example of this is Magic Lantern-produced RAWs for Canon cameras that don’t otherwise produce them.

On the other hand, a less obvious example (and maybe less relevant example) is when RAW processors produce a DNG whose metadata is corrupt. I don’t know if the new database will host files of the latter type, but I suppose one could make an argument they do belong. In that case it’s not just a matter of firmware, but the software and version that produced them, potentially adding yet another couple of fields.

A third example is Android. Last I checked the metadata in Android RAW files is mostly unusable due to a bug in Google’s RAW support. So the Android version would be useful too.

From http://www.magiclantern.fm/:

Supported Cameras:
5D2, 5D3, 6D, 7D, 50D, 60D, 500D/T1i, 550D/T2i, 600D/T3i, 650D/T4i, 700D/T5i, 1100D/T3, EOS M
In progress:
70D, 100D/SL1
Inactive ports (help welcome):
5D classic (old version available)
40D (not working at all)

Which of these cameras doesn’t produce raw files?

You don’t mean when exiv2/exiftool is used to modify the raw file?
If not, then that is still a raw file from camera, so i’m pretty sure it fits here.

Same, still a raw file from camera.

Sorry my mistake. I should have written CHDK, which produces CR2 and other files whose metadata cannot be read by exiv2 / ExifTool.

There are two issues here:

  1. The submitter modifies the RAW file before uploading it to the raw samples DB, using some or other tool. It definitely happens, and it’s highly annoying.
  2. Some software programs e.g. DxO OpticsPro can produce linear DNG files. Sometimes their metadata is corrupt.

$ grep -i canon cameras.xml | grep chdk
        <Camera make="Canon" model="PowerShot SD300" supported="no" mode="chdk">
        <Camera make="Canon" model="PowerShot A460" supported="no" mode="chdk">
        <Camera make="Canon" model="PowerShot A610" mode="chdk">
        <Camera make="Canon" model="PowerShot A530" supported="no" mode="chdk">
        <Camera make="Canon" model="PowerShot S3 IS" mode="chdk">
        <Camera make="Canon" model="PowerShot A620" mode="chdk">
        <Camera make="Canon" model="PowerShot A470" supported="no" mode="chdk">
        <Camera make="Canon" model="PowerShot A720 IS" mode="chdk">
        <Camera make="Canon" model="PowerShot A630" mode="chdk">
        <Camera make="Canon" model="PowerShot A640" mode="chdk">
        <Camera make="Canon" model="PowerShot A650" mode="chdk">
        <Camera make="Canon" model="PowerShot SX110 IS" mode="chdk">
        <Camera make="Canon" model="PowerShot SX120 IS" supported="no" mode="chdk">
        <Camera make="Canon" model="PowerShot SX20 IS" supported="no" mode="chdk">
        <Camera make="Canon" model="PowerShot SX220 HS" supported="no" mode="chdk">
        <Camera make="Canon" model="PowerShot SX30 IS" supported="no" mode="chdk">
        <Camera make="Canon" model="PowerShot A3300 IS" supported="no" mode="chdk">

So yes, all these cameras are already in the list “samples wanted”, and are more or less supported by darktable/rawspeed.

Sorry, no sympathy/understanding from me here. Ability to modify raw files is a bug in some other tool. If the samples are modified by the user, then it’s the user’s fault and these samples are just garbage, no point in keeping them.

Do note that those are not really cameras (and this is mostly about cameras), but just buggy software. For rawspeed, i’m mostly sure we won’t be adding workarounds to keep intentionally-broken (read: edited) raw files working. That being said, even darktable’s “create HDR DNG” functionality is still producing half-broken DNGs (no white balance info).

I’ve analyzed every file in the raw sample DB. Last I checked, those CHDK files are impossible to get metadata out of using mainstream free software tools. Maybe that’s not important to you as someone interested primarily in rawspeed? But it is important to me, and maybe to others too. Personally I find them useful as samples of garbage input that my program should handle gracefully. So I think it would be very useful for the DB to differentiate between categories of RAW.

Then they should be purged and the new DB should make it very clear to uploaders that they should never submit user modified RAW files. The existing raw samples site did not do that last I checked.

It’s not clear to me if the new site should host all RAW files or just RAW files straight out of the camera. That’s why I bring up that example. Personally I don’t mind either way but it’s something to keep in mind – I imagine RAW samples generated by software programs might be useful to some free software developers.

Good point! Even corrupted raw files are good for testing purposes.

True. Though it is simple-ish to generate such files, by corrupting the proper raws. (also, fuzzing)
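
For illustration, corrupting a proper raw file can be as simple as flipping a few bytes. The sketch below is an assumption-laden toy (the `corrupt` function, its parameters and the idea of protecting a header region are all made up for this example), not a substitute for a real fuzzer.

```python
import random


def corrupt(data, n_bytes=8, seed=0, skip_header=0):
    """Return a copy of data with min(n_bytes, available) bytes changed.

    Each chosen byte is XOR-ed with a random non-zero value, so it is
    guaranteed to differ from the original. The first skip_header bytes
    are left alone, and a fixed seed makes the result reproducible.
    """
    rng = random.Random(seed)
    buf = bytearray(data)
    candidates = range(skip_header, len(buf))
    if len(candidates) == 0:
        return bytes(buf)
    for pos in rng.sample(candidates, min(n_bytes, len(candidates))):
        buf[pos] ^= rng.randrange(1, 256)
    return bytes(buf)
```

Because the seed is explicit, a corrupted sample that triggers a crash can be regenerated from the original file instead of being stored.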

Yep, i did state that in my first comment already.

Now that is a valid important point.

It’s not really up to me to decide that. I can only say (already did say) what samples i want available.

PS: it’s raw, not RAW.

I’m also in a discussion with @LebedevRI about a regression testing tool for rawspeed. I am wondering: has anyone tried to contact Jakob Rohrbach from rawsamples.ch and ask him about the current (and expected future) state of his site? If so: what did he have to say about it?

I agree that we need (it would be great to have) a source for raw files.

But why do we need it?

There are technical reasons common to all raw development software like getting proper black and white levels and raw crops for example.
The same is true for raw files to check correct decoding (corrupted raw files, manufacturers incorporating new compression formats etc.)

But there are also differences in requirements:

DT needs to get some data for raw denoise iirc (please correct me if I’m wrong here)

RT needs to get proper white levels from raw files scaled by ISO and aperture (means even more raw samples). In fact it does not need it, but it supports it!

PF needs (don’t know, @Carmelo_DrRaw )

Maybe there are more than the above mentioned differences.

Doh, now we have the first difference. That means we have to make a superset of all requirements.

I’m absolutely not against that.
I just want to mention it and that in this case a cross-ref to cameras.xml may not be enough to fulfill the requirements.

PhF is “stealing” the RAW decoding from DT and RT, so it does not have additional requirements so far…

All these “different requirements” are just different samples, per camera.
The camera list is still the same, for every program. Because there is a fixed list of produced cameras.
So i completely fail to see why it would not be possible to generate the camera list from cameras.xml.

Edit: yes, cameras.xml does not list all the cameras ever produced, so additional sources may be used, like RT’s camconst.json(?); and there should probably be a way to select “sample for camera not in the list” during upload. The camera maker and model can be trivially extracted from Exif/Makernotes.

Why tarballs… people aren’t going to download hundreds of small files, and the files are already in a compressed format. A hierarchical directory would be enough (people would just download a subtree…). With Unix links you can have several different hierarchies without duplicating the files.
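
The link-based hierarchies mentioned above could be built with a few lines of Python. This is a sketch for Unix-like systems only; the function name `link_into_hierarchy`, the flat `store_dir` layout and the `classify` callback are assumptions for the example.

```python
import os


def link_into_hierarchy(store_dir, view_dir, classify):
    """Create view_dir/<group>/<file> symlinks pointing at files in store_dir.

    classify(filename) returns the group name, e.g. a camera make or a
    file extension. Files are never copied, so several hierarchies can
    share one store. Returns the link paths, whether new or pre-existing.
    """
    links = []
    for name in sorted(os.listdir(store_dir)):
        target = os.path.join(store_dir, name)
        if not os.path.isfile(target):
            continue
        group_dir = os.path.join(view_dir, classify(name))
        os.makedirs(group_dir, exist_ok=True)
        link = os.path.join(group_dir, name)
        if not os.path.lexists(link):
            # Relative targets keep the tree valid if the whole repository is moved.
            os.symlink(os.path.relpath(target, group_dir), link)
        links.append(link)
    return links
```

Calling it twice with different `classify` functions (one per manufacturer, one per file format, say) gives two browsable trees over the same files.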

Something to take into account is the network bandwidth. Of course DigitalOcean isn’t really metering bandwidth (if we put this on DO) and this may not be a very popular site, but this is still a pretty big repository. So it would help to have some way to limit the bandwidth for anonymous users (those using HTTP/FTP, for instance) and a speedier method for identified people (SCP/SFTP using SSH keys, for instance).