I’ve been having a discussion with @LebedevRI on IRC these past couple of days about the state of Rawsamples.ch (in short, not good):
Dear Visitor
Rawsamples.ch is actually not up-to-date, as it was hijacked by a SQL-Injection and the database which belongs to the Joomla-CMS was corrupted. I’m very sorry about this, especially as I haven’t had an actual backup of the Database-Content. As long as you see this remark the repair of rawsamples.ch is ongoing and the content is not current (it is between Dec. 2015-Jan.2016). Stay tuned, I’m working on the content.
Thanks for your patience.
Jakob Rohrbach / Nov. 2nd. 2016
LebedevRI was asking me about possibly building out a replacement for the site that would continue its functionality.
As a first step I’ve gone ahead and scraped all of the files from the site and have stored them on the main site (https://pixls.us/files/rawsamples/files/ [6.8G total]). Don’t be fooled into grabbing the all_raws_rawsamples_ch.7z file, it’s missing stuff and is at least an order of magnitude smaller than it should be.
Archiving these files is all well and good, but the harder question is what a good long-term solution might be that will replicate the capabilities that the original site had (or maybe do it better).
<LebedevRI>:
but in the end it boils down to:
nice friendly ui to check whether all expected samples for a given camera available or not and upload missing
So I have a question for everyone (mostly @staff). What is a good solution moving forward to continue hosting and maintaining sample raw files? I was thinking primarily of two main approaches:
Use the forums. It’s relatively simple to set up a new category just for posting raw files. We have more than enough eyeballs to make sure any manual formatting of post titles or tags would be managed fine.
Pros
Low impact/barrier to implementation
threaded discussion available (by nature of implementation)
already set up to upload files of up to 100MB to Amazon
Cons
searching/filtering available/missing files might be more clumsy?
manual intervention possibly required to correctly sort/tag files
possibly polluting topic lists with new raw uploads (showing up as new posts)
Build something new and expose it at an address like pixls.us/raw
This could be as simple as re-implementing what is already being done on rawsamples.ch. That is, emailing files to one of us to include in the list manually, or possibly using an anonymous FTP and requiring an admin to get, verify, and include the file on the page.
This could be extended to something more modern, possibly more secure, and possibly at least slightly more automated.
Pros
Full customization of page listings (cleaner look)
Easier to generate view as requested (list of uploaded/missing + tables)
Cons
More work, higher impact.
Though, replicating the current workflow of rawsamples.ch wouldn’t be that hard honestly. Something more fancy might require quite a bit more time/work.
Tools immediately available to use are FTP (could set this up right now), the normal pixls.us build system, this forum.
I will espouse my love of git here and now and say that we should have a git repo of raw files. I would be happy to accept files by email and include them in the git repo; I’d also be happy to take pull requests on GitHub et al. and maintain the integrity of the repo. Git works for the obvious reasons: it is distributed, has integrity checking, and there are plentiful amounts of free hosting. Also, simple tools (git plus a script to generate a web page) mean low maintenance: fewer maintenance resources, less programming maintenance, less website overhead. Less is more!
It wouldn’t be hard to write an exiv2/exiftool script that loops over the directory and generates the table files and links to each individual one.
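To make the idea of a script that loops over the directory a bit more concrete, here is a rough Python sketch. It is only an illustration under assumptions: the make/model/file directory layout, the extension list and the function names are all made up for this example, and it reads make and model from the directory names instead of calling exiv2 or exiftool, which a real version would probably do.

```python
import html
from pathlib import Path
from urllib.parse import quote

# Assumed, incomplete list of raw extensions; extend as needed.
RAW_EXTENSIONS = {".cr2", ".nef", ".arw", ".pef", ".dng", ".raf", ".orf", ".rw2"}


def collect_samples(root):
    """Yield (make, model, relative_path, size_bytes) for each raw file under root."""
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in RAW_EXTENSIONS:
            continue
        rel = path.relative_to(root)
        if len(rel.parts) != 3:
            continue  # skip files outside the assumed make/model/file layout
        make, model, _ = rel.parts
        yield make, model, rel.as_posix(), path.stat().st_size


def render_index(root):
    """Return an HTML table with one row and one download link per sample."""
    rows = ["<tr><th>Make</th><th>Model</th><th>File</th><th>Size</th></tr>"]
    for make, model, rel, size in collect_samples(root):
        rows.append(
            '<tr><td>{}</td><td>{}</td><td><a href="{}">{}</a></td><td>{:.1f} MB</td></tr>'.format(
                html.escape(make),
                html.escape(model),
                html.escape(quote(rel), quote=True),  # URL-quote, then HTML-escape
                html.escape(Path(rel).name),
                size / 1e6,
            )
        )
    return "<table>\n" + "\n".join(rows) + "\n</table>\n"
```

Running `render_index("/path/to/repo")` after every commit and publishing the result would be enough for a static listing page.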
We could then include it on the main pixls site, pixls.us/raw or what-have-you.
Are you aware that a complete ISO set of e.g. Pentax K-3 II pixelshift files has a size of 2.6 GB? Add the size of a complete set of Pentax K-3 II HDR files (1.95 GB) and the size of a complete set of Pentax K-3 II standard raw files (0.65 GB) and multiply by 2 (Pentax K-3 II supports PEF and DNG output) you end up with 10.4 GB for that camera. Of course it’s an extreme example because of the multiple formats this camera supports but for Pentax K-1 it’s even more…
So hello there.
I’m only in favor of the second option, and i hope it will be clear why, unless i fail to write up my basic vision of what needs to be there:
Basic structure/list of the samples can be autogenerated from that cameras.xml.
As you can see, that file has following entries: <Camera make="$1" model="$2" mode="$3">
For starters, what we need is: one raw file per "$1 $2 $3" tuple.
That list can be trivially auto-generated. $ grep "<Camera " cameras.xml | wc -l tells me that there are 841 (sic!) entries.
Also, there are <Alias id="$4">$5</Alias> sub-entries in some <Camera> entries, so actually there is a second tuple: "$1 $5 $3" (yes, no typos here). $ grep "<Alias " cameras.xml | wc -l tells me that there are 43 aliases.
So multiply some portion of 841 by 43. That is, 1000+ samples?
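As a rough illustration of how that list could be generated, here is a minimal Python sketch. The XML fragment is hand-written for this example (it is not copied from the real cameras.xml) and the function name is made up. In this sketch each alias contributes one extra tuple, for the mode of its enclosing <Camera> entry.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only, not real cameras.xml content.
SAMPLE = """
<Cameras>
  <Camera make="ExampleCo" model="X100" mode="">
    <Aliases><Alias id="X100 Mark">X100-EU</Alias></Aliases>
  </Camera>
  <Camera make="ExampleCo" model="X100" mode="sRaw"/>
</Cameras>
"""


def expected_tuples(xml_text):
    """Return sorted (make, model, mode) tuples, with one extra tuple per alias."""
    root = ET.fromstring(xml_text)
    tuples = set()
    for cam in root.iter("Camera"):
        make = cam.get("make", "")
        mode = cam.get("mode", "")  # a missing mode attribute is treated as ""
        tuples.add((make, cam.get("model", ""), mode))
        # iter() also finds <Alias> elements nested in a wrapper element.
        for alias in cam.iter("Alias"):
            tuples.add((make, (alias.text or "").strip(), mode))
    return sorted(tuples)


for entry in expected_tuples(SAMPLE):
    print(entry)
```

Feeding it the full file contents instead of SAMPLE would give the complete list of wanted samples.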
As pointed out, there may be need for more than one sample per tuple, e.g. just some random shot for regression testing, and a proper color checker picture, etc.
For our continuous integration purposes (hallo, apparently new rawspeed maintainer here speaking), all the samples need to be downloadable as one big tarball. And maybe there should be smaller tarballs, e.g. per-manufacturer, per-camera.
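One possible way to produce those tarballs, sketched in Python with the standard tarfile module. The make/model/file layout and the output file names are assumptions made for this example. The archives are written uncompressed, and out_dir has to live outside samples_root or the archives would end up inside themselves.

```python
import tarfile
from pathlib import Path


def build_tarballs(samples_root, out_dir):
    """Write one tar per manufacturer directory plus all.tar; return the paths."""
    samples_root, out_dir = Path(samples_root), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    created = []
    all_path = out_dir / "all.tar"
    with tarfile.open(all_path, "w") as everything:
        for make_dir in sorted(p for p in samples_root.iterdir() if p.is_dir()):
            # add() recurses into the directory; arcname keeps paths relative.
            everything.add(make_dir, arcname=make_dir.name)
            make_path = out_dir / (make_dir.name + ".tar")
            with tarfile.open(make_path, "w") as per_make:
                per_make.add(make_dir, arcname=make_dir.name)
            created.append(make_path)
    created.append(all_path)
    return created
```

Per-camera tarballs would follow the same pattern, one level deeper in the tree.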
License for new sample uploads MUST be well-defined, and accepted before upload.
I’m proposing to enforce at least the same license as rawsamples.ch uses: CC-BY-NC-SA.
Otherwise, CC0?
There needs to be a very simple and friendly web-ui for sample submission.
User selects camera manufacturer ($1), then selects camera model ($2+$5), and is presented with either a message that all samples are present, or with a list of wanted samples.
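The check behind that page could be as small as the following Python sketch. The function names and message wording are invented for this example; `expected` stands for the (make, model, mode) list derived from cameras.xml and `present` for the tuples that already have an accepted sample.

```python
def wanted_samples(expected, present, make, model):
    """Return the modes still missing for one camera, or [] if it is complete."""
    have = set(present)
    return sorted(
        mode
        for (mk, mdl, mode) in expected
        if mk == make and mdl == model and (mk, mdl, mode) not in have
    )


def status_message(expected, present, make, model):
    """Build the text shown after the user has picked a make and model."""
    missing = wanted_samples(expected, present, make, model)
    if not missing:
        return f"All samples for {make} {model} are present."
    # An empty mode string is shown as the camera's default mode.
    labels = ", ".join(mode or "default mode" for mode in missing)
    return f"Wanted for {make} {model}: {labels}"
```

Wanting more than one sample per tuple (a regression shot plus a color checker shot, say) would mean adding a sample-kind field to the tuple, which this sketch leaves out.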
New submissions should not go live immediately, but instead there should probably be some pre-moderation.
This is the very basis of what we need for rawspeed, in my opinion.
I’m not so sure that representing it as a forum will work well
Sounds good. Do we also want info about the firmware version of the camera used to shoot the samples? In most cases I guess that could be extracted on upload. Thus we might also want some database that stores the metadata (make, model, …) to make it easy to search.
As an alternative to FTP, you could consider using AWS Lambda. The upload portion can be a simple submit form in pixls.us/raw as @patdavid suggests, and that handles the upload to an “incoming” storage bucket (RAW files and submitter metadata). That triggers a Lambda function which takes care of everything else: licence checks, RAW metadata checks, etc. and finally moves the files into appropriate buckets/folders for serving and updates the database.
It strikes me as important because it could well help the database indicate which files have metadata that cannot be read by mainstream free software tools. One obvious example of this is magic lantern produced RAWs for Canon cameras that don’t otherwise produce them.
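Here is a hedged sketch of the decision step such a function might run, written in Python. The bucket names, the shape of the submission metadata and the function name are all hypothetical, and the license list only mirrors what was floated earlier in this thread. The sketch only computes where a file should go; an actual handler would then do the move, for example with the boto3 S3 client’s copy_object and delete_object calls.

```python
from urllib.parse import unquote_plus

# Hypothetical names: none of these buckets exist, they only illustrate the flow.
SERVING_BUCKET = "rawsamples-serving"
REJECTED_BUCKET = "rawsamples-rejected"
# Licenses floated earlier in this thread; the final list is still undecided.
ACCEPTED_LICENSES = {"CC-BY-NC-SA", "CC0"}


def plan_move(record, submission):
    """Decide where one uploaded object should go.

    record is a single entry of event["Records"] from an S3 notification;
    submission is the (assumed) form metadata: license, make, model.
    """
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["s3"]["object"]["key"])
    filename = key.rsplit("/", 1)[-1]

    make = submission.get("make", "").strip()
    model = submission.get("model", "").strip()
    accepted = bool(make and model and submission.get("license") in ACCEPTED_LICENSES)

    if accepted:
        destination = (SERVING_BUCKET, f"{make}/{model}/{filename}")
    else:
        destination = (REJECTED_BUCKET, key)
    return {"source": (bucket, key), "destination": destination, "accepted": accepted}
```

The RAW metadata checks would slot in next to the license check, and pre-moderation could be modelled as a third bucket instead of serving directly.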
On the other hand, a less obvious example (and maybe less relevant example) is when RAW processors produce a DNG whose metadata is corrupt. I don’t know if the new database will host files of the latter type, but I suppose one could make an argument they do belong. In that case it’s not just a matter of firmware, but the software and version that produced them, potentially adding yet another couple of fields.
A third example is Android. Last I checked the metadata in Android RAW files is mostly unusable due to a bug in Google’s RAW support. So the Android version would be useful too.
Sorry my mistake. I should have written CHDK, which produces CR2 and other files whose metadata cannot be read by exiv2 / ExifTool.
There are two issues here:
The submitter modifies the RAW file before uploading it to the raw samples DB, using some or other tool. It definitely happens, and it’s highly annoying.
Some software programs e.g. DxO OpticsPro can produce linear DNG files. Sometimes their metadata is corrupt.
So yes, all these cameras are already in the list “samples wanted”, and are more or less supported by darktable/rawspeed.
Sorry, no sympathy/understanding from me here. Ability to modify raw files is a bug in some other tool. If the samples are modified by the user, then it’s the user’s fault and these samples are just garbage, no point in keeping them.
Do note that those are not really cameras (and this is mostly about cameras), but just buggy software. For rawspeed, i’m mostly sure we won’t be adding workarounds to keep intentionally-broken (read: edited) raw files working. That being said, even darktable’s “create HDR DNG” functionality is still producing half-broken DNGs (no white balance info).
I’ve analyzed every file in the raw sample DB. Last I checked those CHDK files are impossible to get metadata out of using mainstream free software tools. Maybe that’s not important to you as someone interested primarily in rawspeed? But it is important to me, and maybe to others too. Personally I find them useful as samples of garbage input that my program should handle gracefully. So I think it would be very useful for the DB to differentiate between categories of RAW.
Then they should be purged and the new DB should make it very clear to uploaders that they should never submit user-modified RAW files. The existing raw samples site did not do that last I checked.
It’s not clear to me if the new site should host all RAW files or just RAW files straight out of the camera. That’s why I bring up that example. Personally I don’t mind either way but it’s something to keep in mind – I imagine RAW samples generated by software programs might be useful to some free software developers.
I’m also in a discussion with @LebedevRI about a regression testing tool for rawspeed. I am wondering: has anyone tried to contact Jakob Rohrbach from rawsamples.ch and ask him about the current (and expected future) state of his site? If so: what did he have to say about it?
I agree that we need (it would be great to have) a source for raw files.
But why do we need it?
There are technical reasons common to all raw development software like getting proper black and white levels and raw crops for example.
The same is true for raw files to check correct decoding (corrupted raw files, manufacturers incorporating new compression formats etc.)
But there are also differences in requirements:
DT needs to get some data for raw denoise iirc (please correct me if I’m wrong here)
RT needs to get proper white levels from raw files scaled by ISO and aperture (means even more raw samples). In fact it does not need it, but it supports it!
Maybe there are more than the above mentioned differences.
Doh, now we have the first difference. That means we have to make a superset of all requirements.
I’m absolutely not against that.
I just want to mention it and that in this case a cross-ref to cameras.xml may not be enough to fulfill the requirements.
All these “different requirements” are just different samples, per camera.
The camera list is still the same, for every program. Because there is a fixed list of produced cameras.
So i completely fail to see why it would not be possible to generate the camera list from cameras.xml.
Edit: yes, cameras.xml does not list all the cameras ever produced, so additional sources may be used, like RT’s camconst.json(?); and there should probably be a way to select “sample for camera not in the list” during upload. The camera maker and model can be trivially extracted from Exif/Makernotes.
Why tarballs… people aren’t going to download hundreds of small files, and the files are already in a compressed format. A hierarchical directory would be enough (people would just download a subtree…). With Unix links you can have several different hierarchies without duplicating the files.
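For what it’s worth, the symlink idea could look roughly like this in Python. The function name and the by-format example are made up for illustration. The alternative hierarchy (view_root) is assumed to live outside the real file store, and files that share a name under the same key are not handled: only the first one gets a link.

```python
import os
from pathlib import Path


def link_view(store_root, view_root, key_for):
    """Mirror every file in store_root into view_root/<key>/ as a relative symlink.

    key_for maps a path relative to store_root to the sub-directory name
    used in the alternative hierarchy. view_root must live outside store_root.
    """
    store_root = Path(store_root).resolve()
    view_root = Path(view_root).resolve()
    created = []
    for path in sorted(store_root.rglob("*")):
        if not path.is_file():
            continue
        target_dir = view_root / key_for(path.relative_to(store_root))
        target_dir.mkdir(parents=True, exist_ok=True)
        link = target_dir / path.name
        if not link.is_symlink():
            # Relative targets keep the tree valid if both roots are moved together.
            link.symlink_to(os.path.relpath(path, target_dir))
            created.append(link)
    return created
```

For example, `link_view("samples", "by-format", lambda rel: rel.suffix.lower().lstrip("."))` would give a by-format/cr2/, by-format/nef/, … tree next to a make/model one without copying any file.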
Something to take into account is the network bandwidth. Of course DigitalOcean isn’t really metering bandwidth (if we put this on DO) and this may not be a very popular site, but this is still a pretty big repository. So it would be good to have some way to limit the bandwidth for anonymous users (those using HTTP/FTP, for instance) and a speedier method for identified people (SCP/SFTP using SSH keys, for instance).