Fellow data-hoarders, how do you manage your files?

I’m the type of person who prefers to archive rather than delete old data. And I’ve suffered data loss before, which has made me pretty diligent about backups as well.

When dealing specifically with my photography raws and video files, I have yet to find a tool, workflow, or strategy that seems to perfectly fit my needs. I’ve cobbled together a combination of tools with some shell scripts and some manual steps to best approximate my desired solution.

I feel in my gut that other creators out there must face similar challenges with managing and keeping backups of collections of digital assets. So I’d like to hear what others are doing.

What’s working for you?
What is your system for organizing your files?
What is your backup strategy like? (Does it change when you’re traveling?)
How do you archive older collections, and how do you retrieve those collections if you need to work on them again?
Are there any challenges you still face with your chosen solution?

I’m hoping the collective wisdom will lead me to a more satisfying and complete solution than my current data management practices.


Filmulator backs up as it imports from card. I have it go to my SSD for main use and an internal hard drive for live backup.

JPEG output goes to the SSD only but I rsync it out to the hard drive periodically.

Then I rsync it to external drives, and clear past years’ photos from the SSD each year.
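The periodic sync is nothing fancy, just plain rsync; a rough sketch with placeholder paths:

```
# SSD -> internal hard drive copy of the JPEG output.
# No --delete, so photos cleared from the SSD each year stay on the hard drive.
rsync -av /mnt/ssd/Photos/jpeg/ /mnt/hdd/Photos/jpeg/

# Internal hard drive -> external drive.
rsync -av /mnt/hdd/Photos/ /media/external/Photos/
```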

I’m using git-annex to manage my raw files and git to version-control the XMPs. I don’t have a backup per se, but my git-annex repos are replicated over 8 separate physical disks.

When I get restic going again, I will include my git annex repos in that backup.
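If it helps anyone, the skeleton of such a setup looks roughly like this (a sketch; remote names, paths, and the copy count are made up):

```
# One-time setup inside the photo directory
git init && git annex init "workstation"

# Raw files go into the annex, sidecars into plain git
git annex add *.NEF
git add *.xmp && git commit -m "import session"

# Each physical disk is just another clone registered as a remote
git remote add disk1 /media/disk1/photos
git annex sync --content disk1

# Refuse to drop content unless at least N copies exist elsewhere
git annex numcopies 3
```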

Since 2005, this:

  1. two computers: one is my working machine, the other was our HTPC behind the TV; now it’s just a server.
  2. on the working machine, copy ALL the raws from a session to ~/Pictures/[year]/[date-title]/NEFs - [date-title] is the top directory of the session, and NEFs is where the Nikon raws go.
  3. batch process ALL of the raws to proof JPEGs deposited in the [date-title] directory
  4. once in a while, I’ll delete images due to an egregious capture, such as an inadvertent shutter button press capturing a nice picture of the dirt… both the proof JPEG and the raw go. I really don’t do this often, as I think of all my captures as data of a particular period in the grand timeline…
  5. log into the server computer and run rsync to sync the working computer’s ~/Pictures directory to the server’s ~/Pictures directory (see the rsync sketch after this list). Once I’ve done this, I usually reformat the camera card since I now have two copies of the NEFs.
  6. For off-site backup, we bought a few 1TB LaCie portable drives; once in a while I’ll connect one to the server and kick off a copy of ~/Pictures to the portable, then give it to one of the kids to take home. They like having the sum collection of images for reminiscing, Facebook sharing, and calendar creation, and it puts backups away from our house.
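Step 5 boils down to a single command; a sketch with a placeholder hostname:

```
# Run from the server: pull the working machine's ~/Pictures tree over SSH.
# --delete is deliberately left out so nothing vanishes from the server
# if I reorganize things on the working machine.
rsync -av workstation:~/Pictures/ ~/Pictures/
```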

My portable computer is a Lenovo tablet; lately I’ve taken to using it to do post-processing directly on the working machine’s files, using a samba share. I’ve got some file permission shenanigans to work out, but it works well even on our old 2.4 GHz wifi…

Oh, when traveling, I back up each day’s shooting on the tablet, and keep the camera cards populated until we get home and I can copy the sum total collection to the two machines described above…

I am also using git-annex for raw and other source files, and git for sidecar files. The git-annex contents are replicated on several backup disks, at least one of them stored off site. On one of the backup disks I have an additional direct-mode git-annex repo of all files. This one is used together with fslint to clear memory cards, ensuring only backed-up files are deleted.

The next step would be to couple git-annex with the XMP metadata such that files with a high star rating are backed up with more copies than, e.g., one-star or rejected files. But that’s a work in progress (or not even started).
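In case anyone is thinking along the same lines, a rough (untested) sketch of how that could look with git-annex metadata and preferred content, with made-up remote names and paths:

```
# Record the rating as git-annex metadata on the file
git annex metadata --set rating=5 2020/0710/DSC_0042.NEF

# A small backup disk only wants the highly rated files...
git annex wanted smalldisk "metadata=rating=5 or metadata=rating=4"

# ...while the big archive disk wants everything
git annex wanted bigdisk anything

# git annex sync --content then shuffles the data around accordingly
```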

Files since 1994 up to today.
Current workflow: two external HDDs (4 TB each), one in my office, one at my home (roughly 25 km apart).

New data is transferred to the HDD at home; every now and then (once a week or once a month) I take the disk from home to the office and bring the other one back home. I then transfer the new data to that disk, too.

Every file is now on three disks: in my PC, at my home, and in the office. If the new data were NEF/raw files, I delete them from the SD card only now (not before). When I am traveling and can’t back up every day, I use new sets of SD cards all the time. Right now I have 12 SD cards (six pairs) in use (32 GB SanDisk, 95 MB/s).

Theft, fire, anything… I’m pretty sure one of my disks will survive. Not a no-brainer, but a safe way to handle my data.


There has been a similar topic before but my setup is as follows. All three cameras shoot DNG raws and jpgs.

Rapid Photo Downloader into:
img/[year]/[month][day]-[jobcode]/[year][month][day][time].DNG on the hard drive of my workstation. I use the card reader on the workstation. JPGs and DNGs go in the same folder.

At best, I then go through with Geeqie and delete errors and obviously inferior shots in a series. Each JPG+DNG pair is collapsed into one file. Sometimes this culling never happens.

Backup using restic to a Raspberry Pi sitting in a router/fiber box above my entrance door. One copy of the disk was swapped out to my parents’ place ~1000 km away about two years ago.
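The restic side is just the stock sftp backend; a sketch with placeholder host and paths:

```
# One-time: create the repository on the Pi over sftp
restic -r sftp:pi@backup-pi:/srv/backup/photos init

# Regular backup run (cron or systemd timer)
restic -r sftp:pi@backup-pi:/srv/backup/photos backup ~/img

# Occasionally verify and thin out old snapshots
restic -r sftp:pi@backup-pi:/srv/backup/photos check
restic -r sftp:pi@backup-pi:/srv/backup/photos forget --keep-monthly 12 --prune
```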

Good shots get one star (exceptional images two stars) and are developed and exported to web and export subfolders.

When using darktable I tagged the full folder of files with Geeqie before processing. Now with RawTherapee I tag only the processed files because RT’s metadata handling is very poor. The downside to this is having to re-tag any files that I tweak and export again.

The files in the web folder get uploaded to either a private or a public personal website at 2048px resolution (the website is old and some files are smaller). The files in the export folder go through a selection and some are printed at 10x15 on an inkjet. The stack is pretty tall by now. A few are printed larger, up to A3. The prints are stored at home or taped to the wall using four small folded pieces of magic tape*. Images are an important part of my work life, and building, reconfiguring and re-thinking collections of ideas and images quickly and freely has leaked into my personal life; I prefer this loose attitude to images. Only art my kids or artist friends have gifted is framed.

I was an early adopter of git-annex and used it for various data I needed access to from a lot of devices. I quickly noticed my foot-shooting capacity was too high for me to manage my photos with git-annex. I still use it for a reference library of images, recipes, notes, etc., where I find the addurl command handy: it automatically records where a file came from, which helps when I need to attribute it.
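addurl in a nutshell, with a made-up URL and path:

```
# Download a file into the annex and record its source URL,
# so the origin shows up later in `git annex whereis`.
git annex addurl --file reference/sharpening-tutorial.pdf https://example.org/sharpening-tutorial.pdf
```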

* Fold a ~4 mm tab at the end of a 20 mm or shorter piece of magic tape. By orienting the non-stick tab inwards toward the paper you can remove or attach the tape an unlimited number of times without ripping the paper or the wall. The tear motion rolls the tape off the edge of the paper in a safe way.

I’m recognizing some familiar themes here that I’m already implementing:

  1. Organizing files by date/roll, but otherwise a relatively flat directory structure.
  2. Having at least two copies (local workstation and either external drive or local NAS) before deleting from the original card.
  3. Syncing or backing up with rsync/restic/git-annex.
  4. Eventual consistency with the off-site storage.
  5. Everyone seems to describe a process that involves a lot of manual steps.

Personally I’ve been using rsync and restic, with both a local and cloud (B2) restic repository. But I very recently learned (the hard way) that restoring a 12TB snapshot from B2 using restic is a painfully slow process. So I’m thinking I may revisit how I’m implementing restic to mitigate these performance problems until they have been sufficiently addressed.
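For reference, what I have now is essentially stock restic with two repositories, roughly like this (bucket name and paths are placeholders):

```
# Local repository on the NAS
restic -r /mnt/nas/restic-photos backup ~/Photos

# Cloud repository on Backblaze B2
# (B2_ACCOUNT_ID and B2_ACCOUNT_KEY exported beforehand)
restic -r b2:my-photo-bucket:photos backup ~/Photos
```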

The solution I’m ultimately seeking would ideally have the following traits:

  1. Automate as much of the process as possible.
  2. Support opportunistically syncing to multiple targets. (External drive, local NAS, cloud … depending on what is available.)
  3. The syncing should happen automatically in the background, but there should be some clear feedback mechanism for the user to see how many copies exist (so we know when it’s safe to clear a memory card).
  4. Versioning support would be ideal (so more restic than rsync), even though the largest binaries tend not to ever change (so git-annex would technically be fine).
  5. For external hard drives, I would ideally want it to keep track of which collections are on which drive (git-annex’s location tracking comes close; see the sketch after this list). I have a growing collection of external drives now, since no single external drive I own is large enough to mirror my entire NAS.
  6. It should be able to handle syncing from multiple workstations. (Because I use multiple computers; at this point handling multiple users is out of scope.)
  7. My ideal solution would have a concept of hot/warm/cold rolls. New rolls that I’m still working on would be Hot, recently finished rolls would be Warm, and old rolls would be Cold… and it would store them accordingly (prioritizing the hottest for faster or more accessible storage and archiving the oldest to slower/cheaper/less accessible storage automatically).
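On point 5: git-annex’s location tracking is the closest thing I’ve found so far, e.g. (paths and remote names made up):

```
# Which drives hold this file right now?
git annex whereis 2019/roll-042/IMG_0001.CR2

# Presence matrix (one column per repository) for a whole directory
git annex list 2019/
```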

These comments have already given me another idea: tying the durability and availability to the star-count. (More stars = more copies or stored to more durable volumes.)

In addition to the tools already mentioned, I’ve also looked at Perkeep and Boar VCS as prior art worth exploring further.

I know that the benefits of content-addressable storage solutions (restic, perkeep, boar) are diminished when working with large and already-compressed binary blobs that never change. However, a more sophisticated backend might selectively replicate objects across storage pools to make some data more durable or more accessible (inspired by object storage systems like S3).


Brutal! :rofl:


Adding some data points to the already good answers (I’m learning one or two things to try). Keep in mind that my image collection is small (400-500GB).

I have three base folders:

  1. images downloaded from the camera
  2. the working folder
  3. images already edited (my archive)

Images start in (1), are copied to (2), and once I’ve finished with them I move the result to (3) and delete the copies in (1) and (2). So far the three folders fit on my 1TB notebook drive, so everything is always accessible. However, I point different databases to each folder:

  • (1) is in a digiKam database in case I need to find a particular image before editing
  • (2) is in a darktable database for culling, tagging, and editing
  • (3) is in separate databases in both digiKam (for fast searching) and darktable (in case I need to re-export)

This way I keep my archive databases clean and independent of the mess that is the import and working databases.

After a long time using home-made incremental rsync scripts, I switched to BackupPC on the home server and I’m very happy with it. I take an incremental snapshot of everything every night, and a full backup every 15 days. Snapshots for folders (1) and (3) are kept for at least a year, snapshots for folder (2) are kept for at least 2 months. Between darktable’s policy of never touching the RAW files and BackupPC’s excellent de-duplication support, this has very little overhead in terms of space.

The latest BackupPC snapshot is exported to an encrypted external drive once every two months and this drive is kept at my office. Besides making an off-site copy, this also verifies that the full backup is OK.

The images are not erased from the camera card until it’s full. As I have several cards (and take few pictures), I usually have a fourth copy of the images in there for a long time.

When I’m traveling I don’t have access to my BackupPC server. In that case I download the images to the computer or tablet I’m traveling with, and use Resilio Sync to send a copy to my home server, just in case (that copy is erased once the images enter the BackupPC pool).

Not challenges per se, but I feel that I could extract a lot more from the tools at hand (and I don’t mean the editing part here, I know I can improve that! :yum:). For example, I need to explore better how to integrate digiKam with darktable. Currently I use digiKam only as a searching database, all tagging and culling is done in darktable, and all file moving is done by hand outside of them. Maybe using the full DAM capabilities of digiKam and leaving darktable as only the RAW developer would be more efficient, but I have yet to test the full workflow (I’m waiting for digiKam 7 to hit the Flathub repo…).

My process when coming home from a photo shooting and putting the SD-card into my PC:

  1. Make a new folder in my RAW edit folder, for ex. /RAW/edit/20200710_shooting_Karen/
  2. Copy files from SD-card into the folder
  3. Bulk rename images (like 20200710_shooting_Karen.dng; see the sketch at the end of this post)
  4. Start file sync with my 2nd PC using Resilio Sync
  5. Develop RAWs with darktable
  6. Export JPG to my photo archive /photos/20200710_shooting_Karen/
  7. Move RAW+XMP files into my RAW archive, for ex. /RAW/archive/20200710_shooting_Karen/ (while file syncing with my 2nd pc is still active btw.)
  8. Copy RAW+XMP files into my Nextcloud (where I keep a copy of my RAW archive)
  9. Delete files from SD-card
  10. Open digiKam (photo archive with JPGs only) and add keywords to the images

My photo archive with all the JPGs is also synced with my 2nd PC and my smartphone; I also keep a copy in my Nextcloud.
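Steps 1–3 could be wrapped in a small script; a rough sketch using the folder names from the example above (the card mount point is a placeholder, and the counter suffix is just one way to do the renaming):

```
#!/bin/sh
# Usage: ./import.sh 20200710_shooting_Karen
NAME="$1"
DEST="/RAW/edit/$NAME"
mkdir -p "$DEST"

# Copy everything off the SD card
cp /media/sdcard/DCIM/*/*.dng "$DEST"

# Bulk rename to 20200710_shooting_Karen_0001.dng, _0002.dng, ...
i=1
for f in "$DEST"/*.dng; do
    mv "$f" "$DEST/$(printf '%s_%04d.dng' "$NAME" "$i")"
    i=$((i + 1))
done
```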

Could you please elaborate a bit on that? While I love the possibilities of git-annex I still feel a tiny bit of doubt since the files are not so easily accessible any more (e.g., fslint does not work on symlinks). Therefore I keep one direct-mode copy. But it would be great to learn about the caveats others have stumbled upon.

I’m reading through this and realizing it seems no one has a good solution for this much data.


OK, what follows is basically a long, unfiltered rant/story. Feel free to skip it. :grin:


@guille2306 is “lucky” with “just” 500 GB. :wink: It’s much more manageable than the multiple-TB archives many of us have. (I have > 3 TB of photos, mostly raw, since 1999. The first few years were still JPEGs.) Dealing with this much data (and more, as I have more than just photos) is a super-unfun, expensive, time-consuming chore.

My “solution” is to have a huge working hard drive over USB 3. I periodically rsync that to my 4-disk NAS. And that’s already too technical for most people, really. (And I’ve been using Linux since 1996. And develop software. So that’s saying something.)

Meanwhile, I haven’t figured out cloud storage (due to the massive volume of data and cost and having a slow network until a month ago) and I don’t have an office to drop off a spare drive at (as I work from home).

And I have had data loss — I’ve had several hard drives crash and die over the years. I’ve been lucky enough to be able to scrape some data off of the spinning disks and piece it together with (mostly otherwise complete) backups to get my whole collection back together. (Once, when this happened, I was lucky enough to not have deleted SD cards.) And this is with backing up.

And it always feels like I’m building new hard drives all the time (even for my NAS, which went through a phase of having one hard drive die after another — even one during a rebuild after replacing another; good thing it’s 2 disk failover).

For what it’s worth, my directory structure is:
[Storage location]/Photos/YYYY/YYYY-MM-DD/ so they’re all organized, I can have photos in some consistent manner, I can view them fine in a sorted manner from a file manager (in addition to darktable or digiKam), and there aren’t way too many photos in a single folder. The stuff like “So and so’s Wedding” or “Random park” or “Sister’s birthday” is all in the metadata. I use tags for this. Both darktable and digikam (and basically everything else) can quickly find photos through metadata, especially if you bookmark it (for building a collection, like adding tags for “Calendar 2021” and you’re slowly adding photos to pick from to narrow down).

…Except, really, the weddings. I keep that as [Storage location]/Photos/Weddings/[name of couple] because they’re (generally) timeless, have lots of people I don’t know, aren’t photos I revisit (that often, unless someone requests it), and I’m not really a wedding photographer.

I used to do the random name affixes to directories and I also used to break down the directories by month, but then I found it’s easy to lose context, hard to search for an exact date, and the browsing should be done in the photo manager, not file manager (generally). (Plus, darktable has an extension to open in a file manager and I think digiKam can do this by default.) So you can still show the containing folder for any photo quite easily, if you want to do “file management” instead.

Yes, it’s redundant to have year folders and include years in the date, but it makes the folders self-contained and more obvious. It’s also redundant with the date/time browsing in darktable, etc., but at least there’s an easy automated way to store files on disk that I don’t need to think about. :wink: Yes, some events span over a day (especially trips) and some days have multiple — but, again, that’s what metadata is for and why darktable, digiKam, and others let you browse by that.

The only thing that really fails with a scheme like this? If you don’t add metadata. But I make it a point to add at least a minimum of the location of where I took the photos. And I have done some archeology to figure out where some photos were taken by looking at spikes of a high # of photos taken together (while date browsing) and looking through events I’ve been to and old emails to figure out where I took something based on when I took it.
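(If anyone needs to shuffle an existing pile of files into a YYYY/YYYY-MM-DD layout like this, exiftool can do the date-based sorting in one pass; a sketch, with placeholder source and destination paths:)

```
# Move files into .../Photos/YYYY/YYYY-MM-DD/ based on the EXIF capture date;
# -r recurses through the source directory.
exiftool -r '-Directory<DateTimeOriginal' -d /mnt/storage/Photos/%Y/%Y-%m-%d ~/incoming
```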

I still wish someone made a nice, simple backup solution with a “best practices” guide that:

  • doesn’t require use of a terminal
  • doesn’t require someone to compile something
  • makes backing up somewhere (especially remote) very easy
  • has the remote in a file structure that’s comprehensible instead of some hashed or encrypted thing (I care about encryption, but not for my photos), so if I don’t have access to the tool (or if it gets confused for whatever reason), I can just copy them over (via rsync or even a drag-and-drop in a file manager)

(I can compile something and use a terminal, but when I’m photo mode on personal time, I don’t really want to have to. Plus, others shouldn’t have to anyway.)

All the solutions currently seem to be like “here’s some random program with lots of options; compile it and set it up with a huge config file… it’s super flexible so choose AWS or Backblaze or something… have fun!” (How do Windows and Mac folks deal with this problem of too-much-data? Surely they have it too?)


I guess this all kind of works so far for me? I still have all my photos, despite hardware failures. But it could be a lot better, I know.


TL;DR: Directory format is a solved problem. Metadata is good to add and have, especially for browsing. Storage is a burden. Backups are even more of a chore (especially figuring out what to do about off-site backups).


I would say there is a “triangle of backup”: cheap, easy, robust, pick two of them :wink:.

Have you considered having a small server dedicated to the backup, to which you can keep the USB3 drive connected (something like an RPi 4)? That would make it automatic and, depending on the tool you use, relatively maintenance-free.

I feel your pain regarding the cloud (my upload speed is 100 kB/s, and even the 10 USD/month that the space for my backup would cost is a bit too much for me). What I’m considering is to ‘convince’ my brother that backup is very important and that he should have a small server like mine, and then rsync both servers.

Other options I’ve seen: send your drive to a friend/family that lives far away by mail, or put it in a safe in a bank (if you have one).

This could be particularly hard to find because in order to be efficient and robust, the most reliable backup systems use some kind of obscure internal format to save the files (hashes help a lot for de-duplication and checking). One option: use a simple sync program, and pass the whole deduplication/compression/snapshot part to the filesystem by using something like ZFS.
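The moving parts of that ZFS option would look roughly like this (pool and dataset names are made up, and I haven’t tested this exact sequence):

```
# One-time: a compressed dataset for the backups
zfs create -o compression=lz4 tank/photos

# Any simple sync tool writes into it
rsync -av ~/Pictures/ /tank/photos/

# Snapshots give you the versioning, and they are cheap
zfs snapshot tank/photos@2020-07-10

# Off-site replication: a full send once, then incremental sends with -i
zfs send tank/photos@2020-07-10 | ssh brother-server zfs receive backup/photos
```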

I currently have a few hundred GB of image data (raws, jpegs, sidecars); my “normal” workflow is:

  1. Copy from SDCard to working folder on my workstation (SSD)
  2. delete & develop (i’m pretty reckless in deleting “not so good” shots, i’d rather keep 2 exceptional shots than 20 good)
  3. the whole bunch of files (raws, jpegs, sidecars) goes to my home server. I use a rather flat directory structure sorted by “generic topic/event/day”.
  4. backup from my home server to 3 hard disks that are kept in a safe at home, plus a nightly encrypted backup to a “cloud storage”.

Backup is an essential point for me, as I suffered data loss a few years ago.

Whatever the tools, there’s a fundamental concept that must be honored: multiple (at least 2) copies on separate storage media.

With digital media, what you need to do is mitigate your exposure to a failure. Sounds simple, but you need to consider not only the probability of that failure but also the time to repair. That means, if a hard drive containing all your photos craps out and you’re down to one copy, you need to restore the failed copy before that last one goes too. This is where NAS with certain RAID options can bite, as two brand-new hard drives of the same make/model will act like car headlights: when one goes, the other is not far behind. Other failure modes are more direct: a house fire that consumes all your copies has also consumed all of your ability to repair in one single event.

I think in the above I just described the fundamental concerns. With that, any compliant amalgam of tools should do…

@garrett I think you understand exactly why I started this thread. I totally grok most of your sentiments.

We may have different levels of comfort with regards to specific implementation details, but I’m also trying to find a solution that works for my wife, who is an easily frustrated, less technical MacBook user. And she’s not the type to diligently follow a manual ritual like most people here have described. (She’s been saved more than once because I installed automatic backup software on all her computers that runs in the background.)

I take this to mean that you prefer rsync over restic/boar/perkeep/etc.

Would that still be the case if the files were stored on-disk in a compressed format, but also exposed as a conventional tree of files (e.g., a FUSE filesystem view) on the primary backup server (your NAS)?

What is your tolerance for having a server-side component at all? (E.g., software running on your NAS or attached to your NAS.)

I’m still undecided if I want a server-side component. A server-side component would offer some advantages, and could simplify the solution in a few ways, but I think the client-side component is the more critical part of the solution.

You can also throw the balance of acceptable losses into the mix. My example:

  • my most probable loss scenario is user error (wrongly deleting a file), followed by somebody stealing my laptop → I make automatic daily backups to my server at home. It’s easy, fast, and I lose at most a day of work
  • next is somebody stealing my laptop AND server by breaking into my home → I make a manual copy every 2 months to an external drive that I leave at work. It’s a bit cumbersome and relatively slow, so I can accept a loss of 1-2 months of data
  • next are very-small-chance scenarios: multiple backup drives failing, city-level destruction, me having to leave without my things, etc. → I have nothing in place yet, but I plan to buy a second external drive and swap it at my parents’ house from time to time. It’s expensive and very slow, but the chances are very small so I can accept a 1-2 year loss of data.

I went through the same discussion when I switched from rsync scripts to BackupPC; I even regarded the ‘obscure’ file structure of the latter as a downside. In the end, the automatic, set-and-forget nature of BackupPC, which I can also use for my wife’s Windows computer, won the fight.