UTF-8 supporting tools in G'MIC

After some discussion within this thread and in this issue, it seems that the best solution arrived at is to develop UTF-8-specific command tools to enable easier string processing within G'MIC.

Of course, this requires you to set your interpreter (console) to UTF-8, and to save your command file as UTF-8, for it to work as expected.
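
For example, assuming a Windows console like the one used in the tests below, switching the code page to UTF-8 before invoking gmic should take care of the interpreter side (the command file itself also has to be saved with UTF-8 encoding):

chcp 65001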

Anyway, my first tool is going to convert UTF-8 chars into indexes, in line with this thread, but I may take it further than indexing a little fewer than 256 chars, just so that string processing commands can work with many non-Latin characters.

Here's my first work in progress (note: I have not coded the conversion to indexes yet):

#@cli utf8_into_char_indexes: string,var_name
#@cli : Convert string representation into char indexes. Global variables are set for convenience. 
utf8_into_char_indexes:
skip "${2=}"
if size('$2') g=_$2 fi

eval "
 str='$1';
 const size_str=size(str);
 count_of_bit_set=vector(#size_str,0);
 pos=num_of_chars_analyzed=0;
 while(pos!=size_str,
  current_binary=str[pos]>>4;
  !current_binary?(
   ++pos;
   count_of_bit_set[num_of_chars_analyzed]=1;
  ):
  current_binary<=12?(
   pos+=2;
   count_of_bit_set[num_of_chars_analyzed]=2;
  ):
  current_binary==14?(
   pos+=3;
   count_of_bit_set[num_of_chars_analyzed]=3;
  ):
  current_binary==15?(
   pos+=4;
   count_of_bit_set[num_of_chars_analyzed]=4;
  );
  ++num_of_chars_analyzed;
 );
 set('count_of_bit_set',v2s(count_of_bit_set));
 num_of_chars_analyzed;
 "

num_of_chars=${}
bit_set_per_char={([$count_of_bit_set])[0,$num_of_chars]}
echo $bit_set_per_char

And a test demonstrates that the char count is indeed correct:

C:\WINDOWS\system32>gmic echo ${utf8_into_char_indexes\ €þ×}
[gmic]./ Start G'MIC interpreter (v.3.3.3).
3
[gmic]./ End G'MIC interpreter
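
(For reference: '€' encodes as the three bytes E2 82 AC, whose lead nibble is 0xE = 14, so it is counted as a 3-byte character, while 'þ' and '×' encode as C3 BE and C3 97, with lead nibble 0xC = 12, so each counts as a 2-byte character. That is 7 bytes in total but 3 characters, which matches the output above.)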

OK, I managed to successfully convert a UTF-8 string input into its respective decimal representation (code points), according to the UTF-8 chart:

C:\WINDOWS\system32>gmic echo ${utf8str2int\ €ʃ}
[gmic]./ Start G'MIC interpreter (v.3.3.3).
8364,643
[gmic]./ End G'MIC interpreter.
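
As a sanity check on the first value: '€' is the byte sequence E2 82 AC, and (0xE2 & 0x0F)<<12 | (0x82 & 0x3F)<<6 | (0xAC & 0x3F) = 8192 + 128 + 44 = 8364, which is indeed U+20AC, the Euro sign.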

Current code:

#@cli utf8str2int: string
#@cli : Return integer representation of UTF-8 string.
utf8str2int:

eval "
 str='$1';
 const size_str=size(str);
 num_set=vector(#size_str,0);
 pos=num_of_chars_analyzed=0;
 while(pos!=size_str,
  current_binary=str[pos]>>4;
  !current_binary?(
   ++pos;
   num_set[num_of_chars_analyzed]=1;
  ):
  current_binary<=13?(
   pos+=2;
   num_set[num_of_chars_analyzed]=2;
  ):
  current_binary==14?(
   pos+=3;
   num_set[num_of_chars_analyzed]=3;
  ):
  current_binary==15?(
   pos+=4;
   num_set[num_of_chars_analyzed]=4;
  );
  ++num_of_chars_analyzed;
 );
 
 num_of_chars_analyzed;
 
 const N=0xff;
 const M=N>>2;
 
 pos=0;
 repeat(num_of_chars_analyzed,k,
  size_of_ints_per_char=num_set[k];
  
  size_of_ints_per_char==4?(
   num_set[k]=(str[pos]&(N>>5))<<18|(str[pos+1]&M)<<12|(str[pos+2]&M)<<6|(str[pos+3]&M);
  ):
  size_of_ints_per_char==3?(
   num_set[k]=(str[pos]&(N>>4))<<12|(str[pos+1]&M)<<6|(str[pos+2]&M);
  ):
  size_of_ints_per_char==2?(
   num_set[k]=(str[pos]&(N>>3))<<6|(str[pos+1]&M);
  ):(
   num_set[k]=str[pos];
  );
  
  pos+=size_of_ints_per_char;
 );

 set('num_of_chars',num_of_chars_analyzed);
 num_set;
 "

status {([${}])[0,$num_of_chars]}
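
A note on the two constants: N is 0xFF (11111111 in binary) and M = N>>2 = 0x3F (00111111), the mask for the payload bits of a continuation byte 10xxxxxx. Likewise, N>>3 = 0x1F, N>>4 = 0x0F and N>>5 = 0x07 mask the payload bits of the 110xxxxx, 1110xxxx and 11110xxx lead bytes, so a 2-byte character decodes as (lead & 0x1F)<<6 | (cont & 0x3F), a 3-byte one as (lead & 0x0F)<<12 | (cont1 & 0x3F)<<6 | (cont2 & 0x3F), and so on.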

Now all I need to do is the reverse method to convert back. Actually, it doesn't work as well as I'd like in some cases. :/ I think it's fixed now.

Here are some more tests:

C:\WINDOWS\system32>gmic echo ${utf8str2int\ ߐ}
[gmic]./ Start G'MIC interpreter (v.3.3.3).
2000
[gmic]./ End G'MIC interpreter.

C:\WINDOWS\system32>gmic echo ${utf8str2int\ ߒ}
[gmic]./ Start G'MIC interpreter (v.3.3.3).
2002
[gmic]./ End G'MIC interpreter.

C:\WINDOWS\system32>gmic echo ${utf8str2int\ €}
[gmic]./ Start G'MIC interpreter (v.3.3.3).
8364
[gmic]./ End G'MIC interpreter.

C:\WINDOWS\system32>gmic echo ${utf8str2int\ 𐍈}
[gmic]./ Start G'MIC interpreter (v.3.3.3).
66376
[gmic]./ End G'MIC interpreter.

C:\WINDOWS\system32>gmic echo ${utf8str2int\ 𐍈€}
[gmic]./ Start G'MIC interpreter (v.3.3.3).
66376,8364
[gmic]./ End G'MIC interpreter.

C:\WINDOWS\system32>gmic echo ${utf8str2int\ ص}
[gmic]./ Start G'MIC interpreter (v.3.3.3).
1589
[gmic]./ End G'MIC interpreter.

All seems to check out. So, UTF-8 support for custom G'MIC string-processing commands really is feasible.

In the G'MIC math parser, strings are actually stored as vectors (of double, as it's the only scalar type available), which means it would maybe be better to have functions that convert from UTF-8 to UTF-32 and vice versa.
After the conversion, the string can be manipulated with each character stored as a single integer (32-bit integers all fit in a double).
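
If such built-in functions existed, usage in the math parser could look roughly like this (a purely hypothetical sketch: utf8_to_utf32() and utf32_to_utf8() do not exist in G'MIC at the time of writing):

eval "
 str='héllo';              # stored as a vector of UTF-8 bytes.
 u32=utf8_to_utf32(str);   # hypothetical: one code point per character.
 u32+=1;                   # per-character manipulation becomes trivial.
 utf32_to_utf8(u32);       # hypothetical: re-encode to UTF-8 bytes.
 "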

I'd like that solution. If that doesn't work out, then there are always these commands. I have decided to push them to gmic-community.

Here they are: Add UTF-8 commands · GreycLab/gmic-community@7dadfa5 · GitHub

Test confirms it works:

C:\Windows\System32>gmic echo ${vint2utf8str\ 1589,8364}
[gmic]./ Start G'MIC interpreter (v.3.3.3).
ص€
[gmic]./ End G'MIC interpreter.
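
(The encoding direction is just the decoding math in reverse: 8364 = 0x20AC needs three bytes, so the lead byte is 0xE0 | (0x20AC>>12) = 0xE2 and the two continuation bytes are 0x80 | ((0x20AC>>6) & 0x3F) = 0x82 and 0x80 | (0x20AC & 0x3F) = 0xAC, which is exactly the E2 82 AC sequence for '€'.)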

EDIT: Finally fixed in gmic-community.

More updates: although I had finished up my UTF-8 tools, I restructured them and added explanations to my code.

Here is the final code, with comments:

#@cli utf8str2vint: string
#@cli : Return vector of integer representation of UTF-8 string.
#@cli : Author : Reptorian.
utf8str2vint:
skip "${1=}"
if !$# status "" return fi

eval "
 str='$1';
 const size_str=size(str);
 num_set=vector(#size_str,0);

 pos=num_of_chars=0;
 while(pos<size_str,
  current_binary=str[pos]>>4; # Extract the high 4 bits of the lead byte; pos always points at the lead byte of the current character.
  current_binary<8?( # 0xxxxxxx : a single-byte (ASCII) character.
   ++pos;
   num_set[num_of_chars]=1;
  ):
  current_binary<=13?(
   pos+=2;
   num_set[num_of_chars]=2;
  ):
  current_binary==14?(
   pos+=3;
   num_set[num_of_chars]=3;
  ):
  current_binary==15?(
   pos+=4;
   num_set[num_of_chars]=4;
  );
  ++num_of_chars;
 );
 
 if(pos!=size_str,run('error not_a_UTF-8_string!'););

 const N=0xff; # Base for the mathematics involving encoding or decoding UTF-8 strings.
 const M=N>>2; # Used to select binary numbers within x section of 10xxxxxx.

 pos=0;
 repeat(num_of_chars,k,  # This loop converts each character's multi-byte representation into its single integer code point.
  size_of_ints_per_char=num_set[k];

  size_of_ints_per_char==4?(
   num_set[k]=(str[pos]&(N>>5))<<18|(str[pos+1]&M)<<12|(str[pos+2]&M)<<6|(str[pos+3]&M);
  ):
  size_of_ints_per_char==3?(
   num_set[k]=(str[pos]&(N>>4))<<12|(str[pos+1]&M)<<6|(str[pos+2]&M);
  ):
  size_of_ints_per_char==2?(
   num_set[k]=(str[pos]&(N>>3))<<6|(str[pos+1]&M);
  ):(
   num_set[k]=str[pos];
  );

  pos+=size_of_ints_per_char;
 );

 set('num_of_chars',num_of_chars);
 num_set;
 "

status {([${}])[0,$num_of_chars]}

#@cli vint2utf8str: int_a>=0,...
#@cli : Return the UTF-8 string corresponding to the given integer representation(s).
#@cli : Author : Reptorian.
vint2utf8str:
skip "${1=}"
if !$# status "" return fi

eval "
 list_of_ints=[$*];

 const list_size=size(list_of_ints);
 const init_size=list_size<<2;

 new_list=vector(#init_size,0);

 const M=1<<7; # 10xxxxxx . The continuation-byte marker is 10000000 in binary; the x bits get filled with bits from the integer representation.
 const N=0xff; # Base for the mathematics involving encoding or decoding UTF-8 strings.
 const S=N>>2; # Used to select binary numbers to insert into x section of 10xxxxxx.
 const X1=xor(N,S);    #110xxxxx.
 const X2=xor(N,S>>1); #1110xxxx.
 const X3=xor(N,S>>2); #11110xxx.

 final_size=0;
 repeat(list_size,k,
  cv=list_of_ints[k];

  cv>=0x10000?(LS=3;XC=X3;): # LS is the number of continuation bytes; it drives the loop that fills in the x bits below.
  cv>=0x800?(  LS=2;XC=X2;):
  cv>=0x80?(   LS=1;XC=X1;):
  (            LS=0;      );

  LS?(
   for(ind=final_size+LS,ind>final_size,--ind,
    new_list[ind]=M|(cv&S);
    cv>>=6;
   );
   new_list[final_size]=XC|cv;
  ):(
   new_list[final_size]=cv;
  );

  final_size+=LS+1;
 );

 set('final_size',final_size);
 new_list;
 "

status {`([${}])[0,$final_size]`}
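
To show how the two commands fit together, here is a minimal round-trip sketch (a hypothetical helper, not part of the pushed commands; it assumes utf8str2vint and vint2utf8str are loaded and that the callee's status is read back through ${} in the caller, as in the code above):

#@cli utf8_roundtrip: string
#@cli : Decode a UTF-8 string to code points and re-encode it; should return the input unchanged.
utf8_roundtrip:
utf8str2vint "$1"  # Status now holds the comma-separated code points.
vint2utf8str ${}   # Re-encode them back into a UTF-8 string.

Between the two calls, the comma-separated code points can be manipulated per character, which is the whole point of these tools.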