Incredibly Slow File Browser and RT Not Using CPU

  1. Update to 5.7.
  2. HDDs are slow, reading 3000 * 20MB takes time.
  3. Windows Explorer does not use the raw data, it only shows the JPEG image embedded in each raw file.

Regarding point 1, I will do that, thanks!
For 2 and 3, these are just JPEG files; I’m not even editing raw files yet. I was just testing the waters with RT. Why is RawTherapee literally 120x slower than Windows Explorer / Windows Photos at opening, viewing, and sorting them?

Thanks for the information! I tried limiting the thumbnail size, but it didn’t seem to increase the speed. I’m just surprised that other software like Windows Photos can create thumbnails and preview photos almost 100x faster.

Actually I think I understand this now. Windows thumbnails are lower resolution, and Windows doesn’t use the entirety of the JPEG data to create them. Even when I first open an image, it looks heavily compressed until I wait a second or two, which is in line with RawTherapee’s loading time. But a compressed version is perfect for loading quickly so I can decide whether a photo is worth working on. I guess I’ll just have to use software other than RawTherapee to filter through and tag the best photos, then load only those into RawTherapee later from an SSD for edits and batch processing.

Thank you for your help!

Another thing to try (since @heckflosse indicated this is a one time penalty):

Load up the folder in RT, go grab a coffee, come back later. You should only need to do this once as long as something doesn’t invalidate the cache.

@NinjaPhotographer That’s also important as there have been some speedups for filebrowser in RT 5.7 especially for folders with a huge amount of files.

It’s also very slow jumping between images.

My workflow for shooting basketball games consists of cropping and rotating JPEGs from a 7DmkII (20 MP). I like RT over Darktable for this purpose because the cropping is done by drawing the bounds of the box rather than having to pull in from the edges; drawing the box is usually faster to do. For that matter, I’m also carrying a patch to have the crop tool not reset to the move tool (which seems useless anyway if you’re not zoomed in, but that’s another matter). I don’t actually export from RT, which is slow; I wrote a shell script with a perl helper to compute the appropriate crop and rotate and use ImageMagick on the command line to apply it (which lets me parallelize the operations). When you’re trying to process 500 images ASAP, things like that matter. Anyway.

In addition to the file browser loading very slowly (which isn’t that much of a problem, since it only happens once and I can do something else while I’m waiting for a few minutes), moving through the images (with f3 and f4) is also very slow – it takes a second or two (on an i7-6820/6920HQ) to load the next/previous image. There’s no obvious reason why that should be the case, particularly on JPEG files when there’s no processing (that I’m aware of) going on here.

I have callgrind running on RT while it loads 100 files, to see where the time is going. Loading a 20 MP JPEG on a reasonably modern processor really shouldn’t take more than a few tens of milliseconds.

Oh yeah, this is on Linux, using gcc 7.5, on the tip of the dev branch plus two patches I’m carrying (not to switch back to the hand tool, and to set the default crop ratio to not be fixed, neither of which has any performance implications).

It turned out to be a real headache to run under callgrind. Callgrind is of course slow, but this was borderline pathological. It looked like most of the time was being spent in LCMS transform code, even when I thought I had everything set (neutral profile, no monitor correction, cleared out the database and .pp3 files, and so forth).

I’m going to try again. I’d really like image loading to be faster, but I have enough FOSS projects on my plate that I can’t really add another.

If you’re using the OS package for LCMS, it may not be compiled with threading enabled. I compiled and installed my own LittleCMS2 in /usr/local, and my display transforms in rawproc sped up substantially.

Built it myself from source, and it does look like threading is enabled. The most relevant lines in CMakeCache.txt look like

//Build with OpenMP support
OPTION_OMP:BOOL=ON

//CXX compiler flags for OpenMP parallelization
OpenMP_CXX_FLAGS:STRING=-fopenmp

//CXX compiler libraries for OpenMP parallelization
OpenMP_CXX_LIB_NAMES:STRING=gomp;pthread

//C compiler flags for OpenMP parallelization
OpenMP_C_FLAGS:STRING=-fopenmp

//C compiler libraries for OpenMP parallelization
OpenMP_C_LIB_NAMES:STRING=gomp;pthread

//Path to the gomp library for OpenMP
OpenMP_gomp_LIBRARY:FILEPATH=/usr/lib64/gcc/x86_64-suse-linux/7/libgomp.so

//Path to the pthread library for OpenMP
OpenMP_pthread_LIBRARY:FILEPATH=/usr/lib64/libpthread.so

So it looks like part of my problem is that I had some old 5.7 version squirreled away in /usr/local/bin (how my carry patches appeared to work, I have no idea).

% rawtherapee -v
RawTherapee, version 5.7-381-g4fc28370a

Second, turning off every CMS option I could find greatly improves performance, although it’s far from instantaneous. I guess I really need a dedicated crop-and-rotate tool that simply writes out a sidecar that I can parse.

Well, it turns out that LittleCMS was compiled without threads, apparently due to a linker bug. I tried removing --without-threads, and it indeed fails to link (undefined references to various pthread_mutex_*). And I verified that -lpthread is in the link line. It doesn’t seem to matter which compiler I use, either. And I don’t think I quite want to try swapping out my binutils (2.33.1).

From the callgrind output, it looks like it’s making an awful lot of calls to cmsDoTransform (about 28000 per function, from two functions each called about 160 times, which itself seems high considering that I just loaded two photos and flipped back and forth about 4 times until I ran out of time). But as I said, I don’t have time to dig into the code here, a codebase I’m completely unfamiliar with. I have enough with Gutenprint and KPhotoAlbum to keep me busy.

This reminds me of a common xkcd theme:

(xkcd comics are under Creative Commons Attribution-NonCommercial 2.5 License)

A large number of cmsDoTransform calls is appropriate; the function is arranged so a user can call it “per-stride” and do their own parallelization. However, 28000 does sound large, as the stride normally corresponds to a row of pixels in an image…

Well, when I have to select 400-500 keepers from 2500 frames, and I want to get them out to the sports information department and the teams as soon as possible, little things do add up very quickly. 2 extra seconds spent on 400 frames is about 13 minutes, not to mention the frustration of waiting for the next frame to load.
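As a quick check of that arithmetic, here is a minimal sketch; the function name is purely illustrative, and the inputs are the 2 seconds and 400 frames mentioned above.

```python
def extra_minutes(frames: int, seconds_per_frame: float) -> float:
    """Total added waiting time, in minutes, for a fixed per-frame delay."""
    return frames * seconds_per_frame / 60.0


if __name__ == "__main__":
    # 400 frames at 2 extra seconds each is 800 seconds in total.
    print(f"{extra_minutes(400, 2):.1f} minutes")
```

The same function makes it easy to see how the cost scales: the delay is linear in the number of frames, so any per-image saving is multiplied by the size of the shoot.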

Besides, the tuning work I did on KPhotoAlbum made it better for everyone :slight_smile:

If it’s one call per row, that sounds about right (modulo opportunities for caching) then, since I was flipping back and forth between two images, 20 and 24 MP, several times.
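A rough sanity check of the one-call-per-row reading might look like the sketch below. It rests on assumptions that are not in the profile output: 3648 rows for the 20 MP frame (the 7D Mark II's full resolution is 5472x3648), a guessed 4000 rows for the 24 MP frame (6000x4000), one cmsDoTransform call per row at full resolution, no caching, and the 28000 figure covering both images.

```python
# Assumed row counts; only the 20 MP figure corresponds to a known camera,
# the 24 MP one is a guess at a typical 6000x4000 frame.
ROWS_20MP = 3648
ROWS_24MP = 4000


def calls_for_views(rows_per_image, views_per_image):
    """Calls made if every view of every image transforms each row once."""
    return sum(rows * views_per_image for rows in rows_per_image)


def views_implied(total_calls, rows_per_image):
    """Full passes over all images that a given call count would represent."""
    return total_calls / sum(rows_per_image)


if __name__ == "__main__":
    rows = [ROWS_20MP, ROWS_24MP]
    print(calls_for_views(rows, 4))              # calls for 4 views of each image
    print(round(views_implied(28000, rows), 2))  # passes implied by 28000 calls
```

Under those assumptions, 28000 calls works out to a little under four full passes over both images, which is in the same range as flipping back and forth about 4 times.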

I guess I’ll just have to create a profile that turns everything that does color management off, then. But even with color management, this strikes me as very slow on a modern CPU. There must be quite a bit of other processing going on.

UPDATE: I just re-profiled it, and it looks like it’s spending a lot of its time in various GTK drawing operations, which I think can be a lot faster than what I’m seeing. I’m also seeing a lot of time spent in internal pixel conversions that are done by a function call that isn’t getting inlined; that’s adding up too.

So, I scrolled back a bit, and found that the problem du jour is ‘crop’. For that operation, shouldn’t an initial image be produced once and then suffice through the various iterations of the crop? Or am I not understanding the dynamic… very probable.

Who’s doing all that huffin’ n’ puffin’? RT or a GTK component?

With default settings, it’s mostly LittleCMS that’s eating up the time. Going to a completely unmanaged workflow, it looks to be a good part of both. With all the threading it’s a bit hard to interpret the callgrind output, but I’d say it’s probably in the range of 1/3 RT and 2/3 GTK. Watching the UI in action as I flip between images suggests that estimate isn’t too far off the mark. It’s very possible that inefficiencies in the UI are forcing RT to do more computation than necessary.

kcachegrind doesn’t seem to provide a convenient way to cut and paste from the function list, or I’d cut and paste some of it in.

As long as I can extract the geometry of the crop and rotate operations from the sidecar files (and can live with the high ISO image noise), I don’t need RT to actually produce an image, so there’s no need for color management for what I’m doing here (that’s not true for other things, where I actually do want to adjust color).

ImageMagick (via the convert command) appears to be a lot more efficient (not to mention that I can parallelize the images to be processed) than RT. So once I’ve gone through and cropped and rotated the images as needed, I can simply take the sidecar files and apply the geometric transformations to the image files, in addition to adding a watermark, all in one shot. It takes me no more than a few minutes to process 300-400 photos on my E3-1505v5 (i7-6820HQ/i7-6920HQ) laptop; if I ran it on my Ryzen 3900X server, I wouldn’t really have time to get a glass of water.
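The sidecar-to-ImageMagick step described above could be sketched roughly as follows. This is not the shell/Perl script from the post, which isn't shown; it is a Python illustration under stated assumptions: the sidecar is INI-style with a [Crop] section (Enabled, X, Y, W, H) and a [Rotation] section (Degree), and the crop box can be passed straight to convert. In practice the crop coordinates likely need adjusting for the rotation (the post mentions a helper that computes the appropriate values), so treat the naive mapping here as a placeholder. All function names are hypothetical.

```python
import configparser
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_geometry(sidecar_text):
    """Return ((x, y, w, h) or None, rotation in degrees) from sidecar text.

    Assumes [Crop] Enabled/X/Y/W/H and [Rotation] Degree keys; adjust the
    names if your sidecar files differ.
    """
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep key names case-sensitive
    parser.read_string(sidecar_text)
    crop = None
    if parser.has_section("Crop") and parser.getboolean(
        "Crop", "Enabled", fallback=False
    ):
        crop = tuple(parser.getint("Crop", key) for key in ("X", "Y", "W", "H"))
    degree = parser.getfloat("Rotation", "Degree", fallback=0.0)
    return crop, degree


def convert_command(src, dst, crop, degree):
    """Build an ImageMagick convert command line (naive rotate-then-crop)."""
    cmd = ["convert", src]
    if degree:
        cmd += ["-rotate", f"{degree:g}"]
    if crop:
        x, y, w, h = crop
        cmd += ["-crop", f"{w}x{h}+{x}+{y}", "+repage"]
    cmd.append(dst)
    return cmd


def run_all(commands, workers=8):
    """Run the commands concurrently; each convert is its own process."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: subprocess.run(c, check=True), commands))
```

Threads are enough for the parallel part because the real work happens in the separate convert processes; the worker count of 8 is an arbitrary default and would be tuned to the machine.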
