Ah, excellent. Thanks for the advice; I admit I'm flopping about in dt-lua. Spending a lot more time reading than I am writing. Considering everything I want to do is really simple, mostly I just feel kinda dumb. BUT I finally figured out where to grab an XMP path without making bad assumptions, so there's that.
In more recent (unpublished) renditions I spent a chunk of effort getting away from subprocess calls entirely, except for a Python entry point. And then I worked on dropping the darktable dependency entirely, which brings me to:
Rawnind is a colloquialism/shorthand for Dr. Brummer's follow-up paper (I think it's linked somewhere in this thread already). As the name implies, it involves working with a new dataset, thematically similar to the first, but full of raw images. Go figure. He ran a bunch of experiments with denoising, denoising with (implicit) demosaicing, and both of those combined with compression. The codebase has all kinds of tangentially related goodies that come with years of doctoral study, including his own raw development 'modules,' in both NumPy and PyTorch, no less.
I digress. Anyway, the big advantage to feeding the 'nets raw images isn't actually image quality, which is comparable. You get far more effective compression from images that have had noise removed, so there's definitely a business case there; very practical. The other thing you get is computational efficiency: it takes a lot less horsepower to feed the neural nets a raw Bayer pattern (smaller input dimensions), and as a result inference is much faster.
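To make the dimensions point concrete, the standard trick (just a sketch; I'm not claiming this is rawnind's exact loader) is to pack the mosaic into four half-resolution planes:

    import torch

    def pack_bayer_rggb(raw: torch.Tensor) -> torch.Tensor:
        """Pack a (H, W) Bayer mosaic into (4, H/2, W/2), assuming RGGB."""
        r  = raw[0::2, 0::2]  # red sites
        g1 = raw[0::2, 1::2]  # green sites on red rows
        g2 = raw[1::2, 0::2]  # green sites on blue rows
        b  = raw[1::2, 1::2]  # blue sites
        return torch.stack([r, g1, g2, b], dim=0)

    # A 24MP mosaic becomes a 4 x 2000 x 3000 input instead of a
    # 3 x 4000 x 6000 demosaiced one; half the spatial extent per axis,
    # so every convolution touches a quarter of the pixels.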
It seems a lot of his experimentation was with what processing to do before feeding to the net - hence all the modular raw development code.
The side effect is that his new codebase is way easier to adapt and much more flexible from a workflow perspective - which is why I've been sinking more time into wrapping my head around it than I have spent working on this one. It just seems like a better time investment.
I don't have a roadmap or anything; everything I've done so far has been exploratory. Mostly studying code and writing myself documentation. I've been thinking about how it might best be restructured to take advantage of its flexibility and make it more maintainable, with sort of a side focus on arriving at a structure where I can dynamically assemble contiguous traceable graphs.
TL;DR - rawnind makes it much easier to implement pure ex ante and post hoc tooling, so it could support denoise → import or export → denoise (and optionally compress) without having to split the history stack or anything. A standalone tool would basically be gratis, and it would be usable with the photo editing software of your choice.
The extra super cool workflow would be doing your edits in darktable (or whatever), and then using the XMP/style to assemble a full (compilable) ISP and feeding your image(s) through it. Kind of like a self-extracting executable? Taking nondestructive to the max, in any event.
i'm really interested in making this fast and streamlining the integration. i'm running prep_image_dataset.py today; i hope it'll finish. do you have any experience with training? i want to re-train on the most simplified network architecture to make it run faster. i suppose this can be a huge timesink… so any datapoints from prior experience on how many layers/channels are required for what quality trade-off would be very valuable. i'll start with the jddcnn u-net architecture that is already in vkdt. it yielded mixed results, but maybe because of the lack of a good training set.
Well, now that's a right proper query if I'd ever heard one.
There's probably more to answering this than can properly be said in a forum post. Lots of considerations, too. I can give you some broad hints to sort of set you on the right path, maybe. The first being that this is, broadly speaking, what I've been working on most for the last couple weeks, and I haven't actually started any training runs yet. You're not going to 'get lucky' and be set to go with a handful of experiments. Spend some time now putting together some scaffolding; neural architecture search is what you want to read about here. Second,
IMHO. That's probably what I'm going to look at first once I finish scaffolding. Third is that you are right, this is a huge time-sink, so it's important to temper excitement, but it's mostly active work only for your GPU/compute if you do it right. Fourth is to consider what you are targeting. It's probably not going to be "one size fits all." Unless you're Whisper.
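To make the scaffolding idea concrete, here's the sort of shape I have in mind; hypothetical names throughout, nothing from rawnind. Even a dumb restartable grid over depth and width beats ad hoc runs, and proper NAS methods build on the same skeleton:

    import itertools
    import json
    import pathlib

    RESULTS = pathlib.Path("sweep_results.jsonl")  # one JSON record per finished run

    def train_and_validate(depth, base_channels):
        # placeholder: run a short training job, return validation PSNR
        raise NotImplementedError

    def already_done(cfg):
        if not RESULTS.exists():
            return False
        return any(json.loads(line)["cfg"] == cfg for line in RESULTS.open())

    # Sweep depth x width; restartable because finished configs are skipped.
    for depth, base_channels in itertools.product([2, 3, 4], [16, 32, 48]):
        cfg = {"depth": depth, "base_channels": base_channels}
        if already_done(cfg):
            continue
        psnr = train_and_validate(**cfg)
        with RESULTS.open("a") as f:
            f.write(json.dumps({"cfg": cfg, "val_psnr": psnr}) + "\n")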
Take it all with a grain of salt; I am out of practice and have had to play catch-up so my info may or may not be totally out of date and irrelevant. I hope not, though.
I have some small learnings from small tests trying to optimize filter kernels.
- Use PixelShuffle instead of transposed convolution; a lot better!
- Always save your results to a file. It's a waste to always start from scratch, and it makes it easier to do the training scheduling manually sometimes.
- Clipping the gradients to something reasonable was helpful when my weights diverged from too high a learning rate; it made it possible to keep large training steps in the beginning, saving a LOT of time for me.
- Be careful about shuffling data between CPU and GPU, but I feel like you, if anyone, know that one already. (I could fit my entire patchified dataset in VRAM, so I ended up loading everything to the GPU once…)
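In PyTorch terms, the upsampling and the clipping look roughly like this (a sketch; channel counts and the max-norm value are made up):

    import torch
    import torch.nn as nn

    # PixelShuffle upsampling: the conv produces r^2 times the target
    # channel count, then the shuffle rearranges them into a 2x larger grid.
    up = nn.Sequential(
        nn.Conv2d(64, 32 * 4, kernel_size=3, padding=1),
        nn.PixelShuffle(2),  # (N, 128, H, W) -> (N, 32, 2H, 2W)
    )

    def train_step(model, optimizer, loss_fn, x, y, max_norm=1.0):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        # Clip so a large early learning rate can't blow up the weights.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()
        return loss.item()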
yeah so i have this condition. i don’t want to load/link bloatware, so i keep some custom cooperative matrix GEMM code around for the convolutions (as opposed to linking in gigabytes of runtime or even straight python). it’s quite a bit of work to make these run at speed. i really really don’t want to / don’t have the time to look at every fancy detour that results in an additional piece of code that needs to be written. i think this is also the reason why open image denoise uses simplistic nearest upsampling and then optimises the code to death. i suppose you can always fix whatever you oversimplified in this step with one more convolution after the upsampling. this is also the reason why i can’t just use the pretrained weights.
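for reference, the oversimplified path is easy to state in pytorch terms (a sketch with made-up channel counts; what i'd actually write is the shader equivalent):

    import torch.nn as nn

    # nearest upsampling is just an index copy, trivial to hand-roll in a
    # shader; the conv after it cleans up what the simplification loses.
    up = nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(64, 32, kernel_size=3, padding=1),
    )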
i was hoping to simply use the pytorch training scripts provided with rawnind. seems like these are "production proven"
in other news… the image preparation python script has now been running for 24h straight and has finished like 30%. there is a chance that i'll have to cut power due to work on the house before this thing finishes even preparing the input data for the training. i may or may not have the patience to restart it (if any of you knows a place where the readily prepared dataset can be downloaded, or knows of some way to speed up this process, let me know…).
Yeah, IMO that type of thing is the perfect application for AI coding. Limited in scope, very clear requirements. It's not like it's production code, anyway.
Edit: and yeah, I left the AI-produced summary in place because I wanted to be transparent about it lol
Edit2: I guess I should clarify. Yes, the actual code was authored by an LLM, but depending on what you mean by "vibe-coded," it may or may not fit that definition. If you mean did I just ask the AI to implement something for me without understanding what it was doing: no, that's not the case. It was more like I did the T part of TDD and passed the rest off to the LLM.
import os

def find_cached_result(ds_dpath, image_set, gt_file_endpath, f_endpath, cached_results):
    """Return the cached entry for this GT/noisy file pair, or None."""
    gt_fpath = os.path.join(ds_dpath, image_set, gt_file_endpath)
    f_fpath = os.path.join(ds_dpath, image_set, f_endpath)
    for result in cached_results:
        if result["gt_fpath"] == gt_fpath and result["f_fpath"] == f_fpath:
            return result
    return None  # no cached result for this pair
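(For the "T part" flavor: the tests I handed off looked something like this; a made-up example, not one of the real ones.)

    def test_find_cached_result():
        cached = [{
            "gt_fpath": os.path.join("ds", "set1", "gt.exr"),
            "f_fpath": os.path.join("ds", "set1", "noisy.exr"),
        }]
        assert find_cached_result("ds", "set1", "gt.exr", "noisy.exr", cached) is cached[0]
        assert find_cached_result("ds", "set1", "gt.exr", "other.exr", cached) is None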
EDIT: It’s probably not totally clear from what I linked. Here’s some more context:
# Check if matching GT coordinates exist
if coordinates in gt_file_coords:
    fn_gt = gt_file_coords[coordinates]
    crop = {
        "coordinates": list(coordinates),
        "f_linrec2020_fpath": os.path.join(search_dir, fn_f),
        "gt_linrec2020_fpath": os.path.join(prgb_gt_dir, fn_gt),
    }
    if is_bayer:
        f_bayer_path = os.path.join(
            bayer_image_set_dpath,
            "gt" if f_is_gt else "",
            fn_f.replace("." + HDR_EXT, ".npy"),
        )
        gt_bayer_path = os.path.join(
            bayer_image_set_dpath,
            "gt",
            fn_gt.replace("." + HDR_EXT, ".npy"),
        )
        crop["f_bayer_fpath"] = f_bayer_path
        crop["gt_bayer_fpath"] = gt_bayer_path
        # Use cached existence checks
        if not cached_exists(f_bayer_path) or not cached_exists(gt_bayer_path):
            logging.error(
                f"Missing crop: {f_bayer_path} and/or {gt_bayer_path}"
            )
            continue  # Skip instead of breaking
    crops.append(crop)
return crops
I was wrong; it was probably about two-thirds garbage, but it had good ideas. The FFT approach works fine and is fast. The GPU implementation is kinda wild and needed to be rewritten. A bunch of weird bugs, like it had missed the whole "align within a scene" concept and was trying to align everything to everything…
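For context, the core of the FFT alignment is plain phase correlation; roughly this (my paraphrase, not the actual code):

    import numpy as np

    def phase_correlate(a: np.ndarray, b: np.ndarray):
        """Integer (dy, dx) translation between two single-channel crops,
        up to sign convention, via the normalized cross-power spectrum."""
        cross = np.fft.fft2(a) * np.conj(np.fft.fft2(b))
        cross /= np.abs(cross) + 1e-12  # keep phase, drop magnitude
        corr = np.abs(np.fft.ifft2(cross))
        dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
        # shifts past the halfway point wrap around to negative offsets
        if dy > a.shape[0] // 2:
            dy -= a.shape[0]
        if dx > a.shape[1] // 2:
            dx -= a.shape[1]
        return int(dy), int(dx)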
haha nice, thanks for pointing this out. i couldn’t tell python from ai slop, all looks the same to me. i’ll let the slow original continue to run for a while in this case…
The script that reconstructs the directory structure after downloading the dataset did not properly handle the "TEST", "UNK", and "UNK_TEST" image sets, so they all ended up in the same directory and prep_image_dataset was trying to align completely unrelated test images (trying as far as MAX_SHIFT_SEARCH allows, 128 pixels around, and ultimately discarding the images).
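(For illustration only, the routing amounts to prefix matching, with the gotcha that UNK_TEST has to be checked before UNK; this is a sketch, not the actual fix:)

    import os
    import shutil

    def dest_subdir(image_set_name: str) -> str:
        # UNK_TEST must come before UNK, since both match the UNK prefix.
        for prefix in ("UNK_TEST", "TEST", "UNK"):
            if image_set_name.startswith(prefix):
                return prefix
        return "TRAIN"

    def route_image_sets(src_root: str, dst_root: str) -> None:
        for name in os.listdir(src_root):
            dst_dpath = os.path.join(dst_root, dest_subdir(name))
            os.makedirs(dst_dpath, exist_ok=True)
            shutil.move(os.path.join(src_root, name), os.path.join(dst_dpath, name))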
Unfortunately you will have to reprocess the dataset. On the bright side it should go much faster now (fewer image pairs to compare, and the neighborhood search should be relatively quick). It's been running on my ("humble" 6-core Ryzen 5 5600G w/ 128 GB of RAM) computer since last night and is so far 69% done.
(You will also need to update test_reserve: add the TEST_ and UNK_TEST_ prefixes (completes #2 and 3c8… · trougnouf/rawnind_jddc@b6c8159 · GitHub) before launching the training, so that the test images are not used for training. Also, I would recommend including the UNK images in the training data; they are not in the default config. And if you are training a linear RGB model, you could also include the X-Trans images, which are likewise not part of the training data by default since I did not use them in the paper.)
edit: the dataset preparation finished in less than 12 hours with no issue