Hi, G’MIC users and developers,
For the past month, I had the urge to improve rep_uniq_cols after seeing some inconsistency in execution time. Then, I managed to resolve that issue with a huge speedup time. In some cases, the default colormap 0,1,0 is better. Now, in the vast majority of time rep_uniq_cols is clearly the winner.
Here’s the execution time tests:
Test 1:
C:\Windows\System32>gmic sp cat,3000 blur 1 v - tic colormap 0,1,0 toc v +
[gmic]./ Start G'MIC interpreter (v.3.6.3).
[gmic]./ Input sample image 'cat' (1 image 3000x2750x1x3).
[gmic]./ Blur image [0] with standard deviation 1, neumann boundary conditions and gaussian kernel.
[gmic]./ Elapsed time: 21.555 s.
[gmic]./ Display image [0] = '[colormap of cat]'.
[0] = '[colormap of cat]':
size = (8081981,1,1,3) [92.5 Mio of float32].
data = (4.00037,4.00009,4.00002,4,4,4,4,4,4,4.00001,4.00002,4.00002,4,4,4.00001,4.00001,4.00001,4.00001,4.00001,4.00001,4.00001,4.00002,4.00002,4.00002,4.00002,4.00004,4.0001,4.00039,4.00026,4.00006,4.00001,4,4.00001,4.00002,4.00005,4.00006,4.00006,4.00001,4,4.00001,4,4,4.00001,4.00003,4.00004,4.00004,4.00004,4.00004,4.00004,4.00004,4.00004,4.00005,4.00006,4.00006,4.00007,4.00007,4.00008,4.00012,4.00033,64.7721,4.00023,4.00005,4.00001,4, ... ,206.35,87.0153,59.9605,29.1881,5,13.2261,16.9131,8.25004,181.361,105.441,63.0485,31.6775,16.5291,49.3207,30.9786,4.79665,26.0168,110.144,46.4924,26.5073,110.153,4.09372,40.3526,76.6279,25.4192,46.4429,51.3333,174.403,80.3667,167.548,150.046,125.914,9.32063,43.4595,5.01398,42.397,17.4149,85.105,4.00204,20.1766,185.107,28.2411,206.74,37.6532,10.3176,55.2464,150.526,44.3452,127.168,26.6568,17.3785,181.182,206.349,153.928,39.9804,151.3,193.331,105.212,4.98043,28.53,97.7067,17.0119,183.945,8.83532).
min = 4, max = 244, mean = 113.424, std = 66.0772, coords_min = (5,0,0,0), coords_max = (104322,0,0,0).
[gmic]./ End G'MIC interpreter.
C:\Windows\System32>gmic sp cat,3000 blur 1 v - tic rep_uniq_cols 0 toc v +
[gmic]./ Start G'MIC interpreter (v.3.6.3).
[gmic]./ Input sample image 'cat' (1 image 3000x2750x1x3).
[gmic]./ Blur image [0] with standard deviation 1, neumann boundary conditions and gaussian kernel.
[gmic]./ Elapsed time: 2.936 s.
[gmic]./ Display image [0] = '[empty]'.
[0] = '[empty]':
size = (8081981,1,1,3) [92.5 Mio of float32].
data = (173.999,79.8518,168.999,169.429,142.311,96.9179,97.0072,158.001,192.966,109.244,88.3766,61.7228,100.863,166.268,135.397,168.674,166.83,182.754,119.112,128.01,175.479,54.0732,82.1174,82.1174,86.9696,18.3247,164.261,72.0121,164.431,175.581,119.765,107.992,172.436,167.615,166.037,116.867,168.999,86.4267,170.087,167.865,89.0426,96.7374,183.837,210.78,169.004,125.822,77.185,86.9443,183.986,192.627,74.2208,126.684,84.9707,168.995,87.0002,165.632,173.215,169.032,97.7803,154.342,169.189,116.156,198.894,165.064, ... ,39.5352,38.4477,11.5743,81.1834,15.385,166.068,169.416,13.7728,4.02356,43.1095,38.5214,24.8612,45.0869,200.182,24.4272,63.8709,13.0029,38.9897,8.02893,57.6049,4.03924,37.6912,152.23,15.0495,15.2071,4.99976,12.6896,34.5346,60.477,24.4954,72.9876,13.2091,39.999,13.8128,61.8448,4.99922,38.656,15.7934,9.70077,92.5935,183.614,145.596,4.17547,35.81,95.9711,17.2533,13.4432,12.7759,184,4.08177,4.08177,21.3243,147.177,101.167,4.02374,4.71735,28.1805,15.1708,26.0187,106.174,140.016,21.4133,128.957,112.294).
min = 4, max = 244, mean = 113.424, std = 66.0772, coords_min = (2.68222e+06,0,0,0), coords_max = (390420,0,0,0).
[gmic]./ End G'MIC interpreter.
~7x speedup for test 1.
Test 2:
C:\Windows\System32>gmic sp cat v - tic colormap 0,1,0 toc v +
[gmic]./ Start G'MIC interpreter (v.3.6.3).
[gmic]./ Input sample image 'cat' (1 image 600x550x1x3).
[gmic]./ Elapsed time: 0.081 s.
[gmic]./ Display image [0] = '[colormap of cat]'.
[0] = '[colormap of cat]':
size = (256,1,1,3) [3072 b of float32].
data = (4,206,238,116,121,72,163,152,185,166,131,212,141,217,30,176,213,56,177,100,122,16,97,178,166,162,154,201,171,87,140,161,183,194,184,212,184,65,189,192,218,204,231,148,244,127,205,219,164,195,184,157,167,202,205,228,60,188,159,207,182,224,172,214, ... ,8,223,146,185,39,12,8,61,117,7,104,9,115,82,140,7,165,68,8,53,7,115,99,80,31,62,154,85,117,23,44,165,40,40,174,11,15,110,87,90,184,32,196,80,29,105,8,122,16,90,64,8,55,17,33,94,126,8,17,66,137,165,100,70).
min = 4, max = 244, mean = 118.258, std = 66.143, coords_min = (0,0,0,0), coords_max = (44,0,0,0).
[gmic]./ End G'MIC interpreter.
C:\Windows\System32>gmic sp cat v - tic rep_uniq_cols 0 toc v +
[gmic]./ Start G'MIC interpreter (v.3.6.3).
[gmic]./ Input sample image 'cat' (1 image 600x550x1x3).
[gmic]./ Elapsed time: 0.041 s.
[gmic]./ Display image [0] = '[empty]'.
[0] = '[empty]':
size = (256,1,1,3) [3072 b of float32].
data = (202,175,244,179,97,217,178,153,108,198,195,30,204,177,194,213,117,233,177,182,242,183,154,160,234,156,152,228,167,65,210,223,158,162,211,202,150,131,212,189,151,212,195,206,227,135,174,188,161,171,140,167,205,125,189,143,157,154,211,123,108,141,166,122, ... ,223,47,4,115,8,165,169,31,5,89,88,31,8,45,17,41,77,99,85,143,11,27,139,102,63,69,28,104,15,109,41,147,105,11,14,15,68,61,137,73,35,38,54,158,161,6,27,47,32,53,32,40,99,82,189,37,118,17,52,5,56,94,70,79).
min = 4, max = 244, mean = 118.258, std = 66.143, coords_min = (105,0,0,0), coords_max = (2,0,0,0).
[gmic]./ End G'MIC interpreter.
~2x speedup for Test 2
Test 3:
C:\Windows\System32>gmic sp eagle v - tic colormap 0,1,0 toc v +
[gmic]./ Start G'MIC interpreter (v.3.6.3).
[gmic]./ Input sample image 'eagle' (1 image 520x480x1x3).
[gmic]./ Elapsed time: 0.108 s.
[gmic]./ Display image [0] = '[colormap of eagle]'.
[0] = '[colormap of eagle]':
size = (68240,1,1,3) [799.7 Kio of float32].
data = (231,198,229,136,116,238,131,72,147,191,221,217,112,238,137,242,230,209,75,120,87,200,103,167,105,101,233,6,134,35,68,79,33,2,172,4,110,81,106,141,167,233,231,231,198,215,169,112,233,240,68,238,33,238,178,147,116,147,215,151,149,250,151,193, ... ,222,190,224,220,79,34,176,147,144,147,140,105,135,100,130,132,26,188,160,98,192,64,62,73,43,69,2,0,96,39,33,223,97,65,225,221,95,197,80,181,111,142,173,184,110,59,27,65,193,46,157,161,191,159,5,61,3,31,63,134,1,44,102,42).
min = 0, max = 255, mean = 121.897, std = 68.3201, coords_min = (4279,0,0,0), coords_max = (25141,0,0,0).
[gmic]./ End G'MIC interpreter.
C:\Windows\System32>gmic sp eagle v - tic rep_uniq_cols 0 toc v +
[gmic]./ Start G'MIC interpreter (v.3.6.3).
[gmic]./ Input sample image 'eagle' (1 image 520x480x1x3).
[gmic]./ Elapsed time: 0.144 s.
[gmic]./ Display image [0] = '[empty]'.
[0] = '[empty]':
size = (68240,1,1,3) [799.7 Kio of float32].
data = (27,20,19,17,11,14,26,10,18,26,29,221,52,42,233,218,192,240,249,151,33,174,33,241,212,161,37,234,143,190,215,186,12,203,119,22,227,229,86,196,37,222,208,144,251,193,187,248,253,207,251,163,139,208,124,249,61,74,248,232,85,244,211,97, ... ,89,108,43,20,40,62,141,192,24,69,63,74,59,153,195,56,43,61,118,64,54,183,29,43,174,51,91,45,71,22,27,31,18,64,53,25,74,34,73,49,116,76,62,47,75,0,73,39,101,63,34,67,95,43,42,72,90,116,118,67,68,66,92,47).
min = 0, max = 255, mean = 121.897, std = 68.3201, coords_min = (23289,0,0,0), coords_max = (812,0,0,0).
[gmic]./ End G'MIC interpreter.
Suddenly, G’MIC colormap 0,1,0 outperforms my algorithm.
How my own algorithm work? Create buckets per threads, and add colors to each buckets per threads. Then merge buckets per threads with a secondary set of bucket for amortized O(n) insert. During each runs, it executes with different set of seed. So, it’s a completely parallelized version of the stdlib version.
So, I been wondering how exactly can I improve my algorithm.
I look into the parameter, does this look strange?
cpus_to_use,mt_mode,pxs_to_process_per_threads,bitshift_factor,dictionary_buckets_per_threads,bit_mask:="
cpus_to_use=cut(int(whd#0/0x8000),1,$_cpus);
mt_mode=cpus_to_use>1;
pxs_to_process_per_threads=ceil(whd/cpus_to_use);
bitshift_factor=cut(int(log2(sqrt(pxs_to_process_per_threads)))+1,8,16);
dictionary_buckets_per_threads=1<<bitshift_factor;
bit_mask=dictionary_buckets_per_threads-1;
cpus_to_use,mt_mode,pxs_to_process_per_threads,bitshift_factor,dictionary_buckets_per_threads,bit_mask;
"
# snip
# The below line tells us how many threads to use when merging bucket
L={max(1,min(int(is#-1/0x4000),$_cpus,$dictionary_buckets_per_threads))}
0x8000 seems correct to my test, and 0x4000 seems correct too. But, are there better numbers here?
Another idea is that I can probably use defined bucket size when it comes to resizing buckets onto secondary dictionary, but this doesn’t do anything.
Any other idea on how can I improve the execution time for all case?
EDIT: Changing from 0x4000 to 0x8000 makes it execute the same time, but I’d suspect there’s more to this, so I dunno if it can be improved.