raw.pixls.us zip download issues

I’m trying to download your raw sample archive at raw.pixls.us via the ZIP archive link, with no luck so far: the download stops at about 20–22 GB of the 26 GB every time. I tried multiple computers and browsers with the same behavior, and I’m on a gigabit connection, so a timeout on my side is unlikely to be the problem. Any ideas?

Suggestion: set up an alternate download using Resilio Sync, both for the entire archive as a zip and for the uncompressed catalog.

Oh, and thanks for publishing all these samples!



Just use rsync?

  1. to list: rsync rsync://raw.pixls.us/
  2. to download: rsync -av rsync://raw.pixls.us/data/ raw-pixls-us-data/

Just curious … what will you use the data for? What kind of project?


paging @andabata to add some rsync documentation on https://raw.pixls.us/


Ten years ago I built my own photo editing software, with dcraw included. I’m in the process of porting the codebase to Android and iOS, and went looking for more raw samples to complete my own repository.

I will give rsync a try, thanks for the tip. Just note that the download link fails every single time, so keeping the link on the home page is misleading.

Well, downloading 22+ GB via a zip file is less than optimal in any case, especially for repeated downloads to pick up updates.

Are you grabbing the files for monochrome? (That is your project, no?)

I still believe Resilio Sync has its merits, as it is peer-to-peer for efficiency, and quite resilient.

rsync isn’t peer-to-peer, but it is very efficient at fetching updates to a large file tree.


Okay, I tried the proposed rsync arguments above. It looks like the source consists of symlinks, so all I got were copies of said links. I had better luck with these rsync options:
-b -r -L -v

Yep, monochrome. And yes, separate files are a relief for sure :slight_smile:
I’ll see what I can contribute; I don’t have many cameras here, so most of my test files aren’t my own.

This monochrome?

Yes, you can also access the data via rsync, and indeed those are symlinked files.
You can also browse the directory index at https://raw.pixls.us/data/.

You can also bribe me and I’ll send it on 5¼" floppies :slight_smile:


Yep. Haven’t updated that site in a while.


I’d like to comment about this as my experience may help others.

I tried downloading the archive on macOS, and Safari gives up after about 15 GB. So I tried curl. Much the same: curl: (18) transfer closed with 16957967827 bytes remaining to read.

A little more success with rsync -av rsync://raw.pixls.us/data/ raw-pixls-us-data/; however, I ended up with 3000 links to ‘storage’.

Downloading the links is still useful, because it reveals the structure of the archive and the files available. It only takes a couple of seconds to get an index of the images on offer. And given the path to a file, an rsync command can readily be constructed to download that file, or a directory of interest.
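To make that concrete, here is a tiny sketch of such a construction. The helper name and the "Canon" directory are made up for illustration; the actual top-level directory names in the archive may differ.

```shell
# Hypothetical helper: build a per-directory rsync command for the archive.
# $1 is a path relative to the "data" rsync module, e.g. "Canon".
build_rsync_cmd() {
  printf 'rsync -avL "rsync://raw.pixls.us/data/%s/" "./%s/"\n' "$1" "$1"
}

build_rsync_cmd "Canon"
```

The quoting matters: many paths in the archive contain spaces, so the generated command wraps both the remote and local paths in double quotes.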

Better with rsync -avL rsync://raw.pixls.us/data/ raw-pixls-us-data/

However, that only downloaded about 600 images, for a total of 6 GB.

Success by repeatedly running:

$ cd raw-pixls-us-data/
$ for i in $(find . -maxdepth 1 -type d); do
    i=$(basename "$i")
    rsync -avL "rsync://raw.pixls.us/data/$i" "$i"
  done

I’ve formatted the command for clarity of presentation. In reality it’s the following one-liner, so running it repeatedly is easy.

for i in $(find . -maxdepth 1 -type d);do i=$(basename "$i");rsync -avL "rsync://raw.pixls.us/data/$i" "$i";done

2804 files for a total of 68.3 GB.
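The "run repeatedly until clean" part can be automated with a small wrapper, sketched below under the assumption that an interrupted transfer returns a non-zero exit code (which rsync does); the retry function name is mine.

```shell
# Hypothetical wrapper: re-run a command until it exits successfully.
retry() {
  until "$@"; do
    echo "command failed, retrying..." >&2
    sleep 1
  done
}

# Illustrative usage (network command, not run here):
# retry rsync -avL rsync://raw.pixls.us/data/ raw-pixls-us-data/
```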

Hope somebody finds my observations useful.


Thanks for the heads-up.

It seems the web server doesn’t like serving 32 GB files; I just tried it with wget/curl/lftpget and they all fail. I will remove the zip download option, and add something about rsync as well, as @darix suggested eons ago.
/me gets off his lazy ass


@andabata There’s more to this than I wrote last night. I’ve updated the note to describe what I did to finish the job today. I hope you find that useful.

I am the maintainer of Exiv2, and I believe you’ve used it to obtain the metadata that you publish about the raw images. Thank you for using Exiv2! An acknowledgement of Exiv2 would be appreciated.

Great job, everybody, creating and maintaining this archive. It’s a very useful resource. I’m writing a book and have developed a metadata parser called ‘tvisitor’, which is being tested with your archive, the ExifTool archive, and a collection of test images accumulated from issues reported concerning Exiv2. So, more than 17000 images! Caution: tvisitor is still in development and is expected to be finished in January 2021.




I have been trying to download this myself today.

Looking through the page’s source code I noticed that you can get a full list of the elements in the data grid by making a GET request to: https://raw.pixls.us/json/getrepository.php?set=all.

In that payload there is a data array, which contains one array per image. Element 8 (7 when using zero-based indexing) contains the HTML for a link to the image. With a little bit of parsing, the URL can be extracted from this string.
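That parsing step can also be done in plain shell; here is a sketch using sed, where the function name and the sample anchor tag are fabricated for illustration:

```shell
# Hypothetical helper: pull the single-quoted href value out of an HTML anchor.
extract_href() {
  printf '%s\n' "$1" | sed -n "s/.*href='\([^']*\)'.*/\1/p"
}

extract_href "<a href='https://raw.pixls.us/getfile.php/129/nice/sample.CR2'>sample</a>"
```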

In JavaScript, fetching a list of downloadable URLs looks like this:

const fetch = require('node-fetch');

const RAW_PIXLS_ALL_URL = 'https://raw.pixls.us/json/getrepository.php?set=all';

(async () => {
  const res = await fetch(RAW_PIXLS_ALL_URL);
  const json = await res.json();

  // Element 7 of each row is an HTML anchor; capture the href value.
  const regex = new RegExp("href='(.+?)'");
  const urls = json.data.map(item => item[7].match(regex)[1]);
  console.log(urls);
})();


Building on that, you can get those elements in a shell (if jq is installed) via:

curl "https://raw.pixls.us/json/getrepository.php?set=all" | jq '.data[][7]'

And if you have lynx installed, you can build on my previous command to easily and accurately extract the URLs via:

curl "https://raw.pixls.us/json/getrepository.php?set=all" | jq '.data[][7]' | lynx -stdin -dump -listonly -nonumbers

I eventually hacked this crappy code together to perform the downloads:

#!/usr/bin/env sh

set -eu

wget -O- "https://raw.pixls.us/json/getrepository.php?set=all" | jq '.data[][7]' | lynx -stdin -dump -listonly -nonumbers | uniq > images.txt &&\
wget -nc -i images.txt

It downloads the list as a JSON file, grabs the RAW file URLs, and saves them out as images.txt. Then it downloads each file from the text file one by one, taking care not to re-download anything already present.

Ideally, I would have used wget's -N flag to check file dates and prevent re-downloading in a more intelligent manner, but unfortunately every file is served with the current date and time. This is (I guess) caused by a bug in getfile.php, which should be using the file’s actual stats.

Ideally, we need another API with directly usable data, like:

{
    "manufacturer": "Canon",
    "model": "EOS 7D",
    "type": "sRAW2",
    "ratio": "3:2",
    "fileSize": "17.92",
    "license": "Creative Commons 0 - Public Domain",
    "licenseUrl": "https://creativecommons.org/publicdomain/zero/1.0/",
    "created_at": "2016-12-29",
    "added_at": "2016-12-29",
    "updated_at": "2016-12-29",
    "url": "https://raw.pixls.us/getfile.php/129/nice/Canon - EOS 7D - sRAW2 (sRAW) (3:2).CR2",
    "checksum": "9a32e26509c5c7b3346c27a2135d2b8c2e37ba1c",
    "metadataUrl": "https://raw.pixls.us/getfile.php/129/exif/RAW_CANON_EOS_7D-sraw.CR2.exif.txt"
}
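Until such an API exists, the existing positional rows could be reshaped into named objects with jq. A sketch on a fabricated sample row: every column meaning here is a guess for illustration (only column 7, the link, is documented above).

```shell
# Fabricated sample in the shape of the real payload; field order is assumed.
sample='{"data":[["Canon","EOS 7D","sRAW2"]]}'

# Map positional columns to named keys (column meanings are illustrative only).
printf '%s' "$sample" | jq '[.data[] | {manufacturer: .[0], model: .[1], type: .[2]}]'
```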

Lazy ass was a bit lazy; it took a bit longer than needed. Had to do some yak shaving as well.
But the zip download has been removed, and I added an example of how to use rsync to mirror the data.
I also added links to https://exiv2.org and https://exiftool.org. We use both to extract metadata from the raws.

And you can thank @LebedevRI for maintaining the actual archive. I only created it and host it :slight_smile: