Also, if your kernel is separable, i.e. the 2D convolution can be factored into a 1D vertical pass followed by a 1D horizontal one, it’s usually better to skip the FFT: for a K×K kernel, the per-pixel cost drops from K² multiply-adds to 2K.
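A minimal sketch of the separable case, assuming a kernel that factors as an outer product (e.g. a binomial/Gaussian-style blur); the image and kernel values are arbitrary:

```python
import numpy as np
from scipy.ndimage import convolve, convolve1d

image = np.random.rand(512, 512)
kv = np.array([1.0, 4.0, 6.0, 4.0, 1.0])  # vertical 1D factor
kh = kv.copy()                            # horizontal 1D factor
k2d = np.outer(kv, kh)                    # the equivalent 5x5 2D kernel

# Full 2D convolution: K*K multiply-adds per output pixel.
full = convolve(image, k2d, mode='reflect')

# Separable version: two 1D passes, 2*K multiply-adds per pixel.
sep = convolve1d(image, kv, axis=0, mode='reflect')
sep = convolve1d(sep, kh, axis=1, mode='reflect')

assert np.allclose(full, sep)  # same result, far fewer operations
```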
That said, minimizing the operation count is not always the right goal: actual performance depends on memory access patterns, cache misses, I/O bandwidth, and raw compute speed. I have seen GPU benchmarks where FFT only starts paying off for kernels of 64×64 and up. It’s difficult to predict performance without benchmarking.
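A rough way to find that crossover on your own machine is to time both paths with SciPy's `method=` switch; a quick sketch, with arbitrary example sizes:

```python
import timeit
import numpy as np
from scipy.signal import convolve

image = np.random.rand(256, 256)
for k in (3, 7, 15, 31):
    kernel = np.random.rand(k, k)
    # Time the direct (nested-loop) path vs. the FFT path.
    t_direct = timeit.timeit(
        lambda: convolve(image, kernel, method='direct'), number=3)
    t_fft = timeit.timeit(
        lambda: convolve(image, kernel, method='fft'), number=3)
    print(f"{k:2d}x{k:2d} kernel: direct {t_direct:.3f}s, fft {t_fft:.3f}s")
```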
Some Python libraries actually run several convolution implementations across various kernel and image sizes, cache the measured runtimes, and later switch to the fastest path for the sizes at hand.
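SciPy ships this idea: `scipy.signal.choose_conv_method` with `measure=True` actually times both the direct and FFT paths on your arrays and reports the winner (caching the decision across calls is up to you):

```python
import numpy as np
from scipy.signal import choose_conv_method, convolve

image = np.random.rand(512, 512)
kernel = np.random.rand(11, 11)

# measure=True runs and times both methods on these exact arrays.
method, times = choose_conv_method(image, kernel, measure=True)
print(method, times)  # e.g. 'fft', {'direct': ..., 'fft': ...}

# Reuse the measured winner for subsequent same-sized convolutions.
result = convolve(image, kernel, method=method)
```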