After some discussion within this thread and in this issue, it seems that the best solution we have arrived at is to develop UTF-8-specific command tools to enable easier string processing within G’MIC.
Of course, this requires you to set your interpreter to UTF-8, and to save your command file as UTF-8, for it to work as expected.
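On Windows, for example, that usually means switching the console code page to UTF-8 before running anything; a minimal sketch, assuming the cmd.exe setup used in the transcripts below:
C:\WINDOWS\system32>chcp 65001
Active code page: 65001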
Anyway, my first tool is going to convert UTF-8 characters into indexes, in line with this thread, but I may even take it further than indexing a little fewer than 256 characters, just so that string processing commands can work with many non-Latin-based characters.
Here’s my first work in progress (note: I have not coded the conversion to indexes yet):
#@cli utf8_into_char_indexes: string,var_name
#@cli : Convert string representation into char indexes. Global variables are set for convenience.
utf8_into_char_indexes:
skip "${2=}"
if size('$2') g=_$2 fi
eval "
str='$1';
const size_str=size(str);
count_of_bit_set=vector(#size_str,0);
pos=num_of_chars_analyzed=0;
while(pos!=size_str,
current_binary=str[pos]>>4;
!current_binary?(
++pos;
count_of_bit_set[num_of_chars_analyzed]=1;
):
current_binary<=12?(
pos+=2;
count_of_bit_set[num_of_chars_analyzed]=2;
):
current_binary==14?(
pos+=3;
count_of_bit_set[num_of_chars_analyzed]=3;
):
current_binary==15?(
pos+=4;
count_of_bit_set[num_of_chars_analyzed]=4;
);
++num_of_chars_analyzed;
);
set('count_of_bit_set',v2s(count_of_bit_set));
num_of_chars_analyzed;
"
num_of_chars=${}
bit_set_per_char={([$count_of_bit_set])[0,$num_of_chars]}
echo $bit_set_per_char
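For reference, the lead byte’s high nibble (str[pos]>>4) determines the sequence length: 0–7 means a 1-byte ASCII character, 12–13 a 2-byte sequence, 14 a 3-byte sequence, and 15 a 4-byte sequence. A minimal math-parser sketch of that nibble test, assuming a UTF-8 console (€ is encoded as the three bytes 0xE2 0x82 0xAC):
C:\WINDOWS\system32>gmic eval "s='€';print(size(s));print(s[0]>>4)"
This should report a byte count of 3 and a high nibble of 14 (0xE).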
And a test demonstrates that the char count is indeed correct:
C:\WINDOWS\system32>gmic echo ${utf8_into_char_indexes\ €þ×}
[gmic]./ Start G'MIC interpreter (v.3.3.3).
3
[gmic]./ End G'MIC interpreter.
Ok, I managed to successfully convert a UTF-8 string input into its respective decimal representations according to the UTF-8 chart:
C:\WINDOWS\system32>gmic echo ${utf8str2int\ €ʃ}
[gmic]./ Start G'MIC interpreter (v.3.3.3).
8364,643
[gmic]./ End G'MIC interpreter.
Current code:
#@cli utf8str2int: string
#@cli : Return integer representation of UTF-8 string.
utf8str2int:
eval "
str='$1';
const size_str=size(str);
num_set=vector(#size_str,0);
pos=num_of_chars_analyzed=0;
while(pos!=size_str,
current_binary=str[pos]>>4;
!current_binary?(
++pos;
num_set[num_of_chars_analyzed]=1;
):
current_binary<=13?(
pos+=2;
num_set[num_of_chars_analyzed]=2;
):
current_binary==14?(
pos+=3;
num_set[num_of_chars_analyzed]=3;
):
current_binary==15?(
pos+=4;
num_set[num_of_chars_analyzed]=4;
);
++num_of_chars_analyzed;
);
num_of_chars_analyzed;
const N=0xff;
const M=N>>2;
pos=0;
repeat(num_of_chars_analyzed,k,
size_of_ints_per_char=num_set[k];
size_of_ints_per_char==4?(
num_set[k]=(str[pos]&(N>>5))<<18|(str[pos+1]&M)<<12|(str[pos+2]&M)<<6|(str[pos+3]&M);
):
size_of_ints_per_char==3?(
num_set[k]=(str[pos]&(N>>4))<<12|(str[pos+1]&M)<<6|(str[pos+2]&M);
):
size_of_ints_per_char==2?(
num_set[k]=(str[pos]&(N>>3))<<6|(str[pos+1]&M);
):(
num_set[k]=str[pos];
);
pos+=size_of_ints_per_char;
);
set('num_of_chars',num_of_chars_analyzed);
num_set;
"
status {([${}])[0,$num_of_chars]}
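To make the masking concrete: with N=0xff and M=N>>2=0x3f, a 3-byte sequence 1110xxxx 10xxxxxx 10xxxxxx is decoded by keeping the low 4 bits of the lead byte and the low 6 bits of each continuation byte. A standalone math-parser sketch, checking this by hand for € (bytes 0xE2 0x82 0xAC):
C:\WINDOWS\system32>gmic eval "cp=(0xE2&0x0F)<<12|(0x82&0x3F)<<6|(0xAC&0x3F);print(cp)"
This evaluates to (2<<12)|(2<<6)|44 = 8364, matching the test above.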
Now all I need to do is the reverse method to convert back. Actually, it doesn’t work as well as I’d like in some cases. :/
I think it’s fixed now.
Here are some more tests:
C:\WINDOWS\system32>gmic echo ${utf8str2int\ ߐ}
[gmic]./ Start G'MIC interpreter (v.3.3.3).
2000
[gmic]./ End G'MIC interpreter.
C:\WINDOWS\system32>gmic echo ${utf8str2int\ ߒ}
[gmic]./ Start G'MIC interpreter (v.3.3.3).
2002
[gmic]./ End G'MIC interpreter.
C:\WINDOWS\system32>gmic echo ${utf8str2int\ €}
[gmic]./ Start G'MIC interpreter (v.3.3.3).
8364
[gmic]./ End G'MIC interpreter.
C:\WINDOWS\system32>gmic echo ${utf8str2int\ 𐍈}
[gmic]./ Start G'MIC interpreter (v.3.3.3).
66376
[gmic]./ End G'MIC interpreter.
C:\WINDOWS\system32>gmic echo ${utf8str2int\ 𐍈€}
[gmic]./ Start G'MIC interpreter (v.3.3.3).
66376,8364
[gmic]./ End G'MIC interpreter.
C:\WINDOWS\system32>gmic echo ${utf8str2int\ ص}
[gmic]./ Start G'MIC interpreter (v.3.3.3).
1589
[gmic]./ End G'MIC interpreter.
All seems to check out. So, UTF-8 support for custom G’MIC string processing commands really is feasible.
In the G’MIC math parser, strings are actually stored as vectors (of double, as it’s the only scalar type available), which means that it would maybe be better to have functions that convert from UTF-8 to UTF-32 and vice versa. After the conversion, the string can be manipulated considering each character is stored as a single integer (32-bit integers all fit in a double).
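As a small illustration of why the one-integer-per-character form is convenient (a hypothetical one-liner, using the command names I ended up with later in this thread): the character count becomes just the size of the vector, no matter how many bytes each character occupies in UTF-8:
C:\WINDOWS\system32>gmic echo {size([${utf8str2vint\ 𐍈€}])}
This should print 2, since the 4-byte 𐍈 and the 3-byte € each collapse to a single integer.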
I’d like that solution. If that doesn’t work out, then there are always these commands. I have decided to push them into gmic-community.
Here they are: Add UTF-8 commands · GreycLab/gmic-community@7dadfa5 · GitHub
Test confirms it works:
C:\Windows\System32>gmic echo ${vint2utf8str\ 1589,8364}
[gmic]./ Start G'MIC interpreter (v.3.3.3).
ص€
[gmic]./ End G'MIC interpreter.
EDIT: Finally fixed it in gmic-community.
More updates: I have finished up my UTF-8 tools, restructured them, and added explanations to my code.
Here is the final code with commentary:
#@cli utf8str2vint: string
#@cli : Return vector of integer representation of UTF-8 string.
#@cli : Author : Reptorian.
utf8str2vint:
skip "${1=}"
if !$# status "" return fi
eval "
str='$1';
const size_str=size(str);
num_set=vector(#size_str,0);
pos=num_of_chars=0;
while(pos<size_str,
current_binary=str[pos]>>4; # Extract the high nibble of the lead byte. pos walks through the raw bytes of the string.
current_binary<8?( # Lead byte 0xxxxxxx: a 1-byte ASCII character.
++pos;
num_set[num_of_chars]=1;
):
current_binary<=13?( # Lead byte 110xxxxx: a 2-byte sequence.
pos+=2;
num_set[num_of_chars]=2;
):
current_binary==14?( # Lead byte 1110xxxx: a 3-byte sequence.
pos+=3;
num_set[num_of_chars]=3;
):
current_binary==15?( # Lead byte 11110xxx: a 4-byte sequence.
pos+=4;
num_set[num_of_chars]=4;
);
++num_of_chars;
);
if(pos!=size_str,run('error not_a_UTF-8_string!'););
const N=0xff; # Base for the mathematics involving encoding or decoding UTF-8 strings.
const M=N>>2; # Used to select binary numbers within x section of 10xxxxxx.
pos=0;
repeat(num_of_chars,k, # This loop converts each multi-byte representation into its single integer index (code point).
size_of_ints_per_char=num_set[k];
size_of_ints_per_char==4?(
num_set[k]=(str[pos]&(N>>5))<<18|(str[pos+1]&M)<<12|(str[pos+2]&M)<<6|(str[pos+3]&M);
):
size_of_ints_per_char==3?(
num_set[k]=(str[pos]&(N>>4))<<12|(str[pos+1]&M)<<6|(str[pos+2]&M);
):
size_of_ints_per_char==2?(
num_set[k]=(str[pos]&(N>>3))<<6|(str[pos+1]&M);
):(
num_set[k]=str[pos];
);
pos+=size_of_ints_per_char;
);
set('num_of_chars',num_of_chars);
num_set;
"
status {([${}])[0,$num_of_chars]}
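A quick sanity check, assuming the behavior of the earlier utf8str2int tests carries over unchanged to the renamed command:
C:\WINDOWS\system32>gmic echo ${utf8str2vint\ €ʃ}
This should print 8364,643, as in the earlier run.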
#@cli vint2utf8str: int_a>=0,...
#@cli : Return the UTF-8 string corresponding to the given integer representation(s).
#@cli : Author : Reptorian.
vint2utf8str:
skip "${1=}"
if !$# status "" return fi
eval "
list_of_ints=[$*];
const list_size=size(list_of_ints);
const init_size=list_size<<2;
new_list=vector(#init_size,0);
const M=1<<7; # 10xxxxxx: the continuation-byte prefix 10000000; the x bits are filled in from the integer representation.
const N=0xff; # Base for the mathematics involving encoding or decoding UTF-8 strings.
const S=N>>2; # Used to select binary numbers to insert into x section of 10xxxxxx.
const X1=xor(N,S); #110xxxxx.
const X2=xor(N,S>>1); #1110xxxx.
const X3=xor(N,S>>2); #11110xxx.
final_size=0;
repeat(list_size,k,
cv=list_of_ints[k];
cv>=0x10000?(LS=3;XC=X3;): # LS = number of continuation (10xxxxxx) bytes to emit; XC = matching lead-byte prefix.
cv>=0x800?( LS=2;XC=X2;):
cv>=0x80?( LS=1;XC=X1;):
( LS=0; );
LS?(
for(ind=final_size+LS,ind>final_size,--ind,
new_list[ind]=M|(cv&S);
cv>>=6;
);
new_list[final_size]=XC|cv;
):(
new_list[final_size]=cv;
);
final_size+=LS+1;
);
set('final_size',final_size);
new_list;
"
status {`([${}])[0,$final_size]`}
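The masks make the encoding direction just as mechanical: X2=xor(N,S>>1)=0xE0 is the 1110xxxx prefix, so a 3-byte character such as € (code point 8364) can be rebuilt by hand in a standalone math-parser sketch:
C:\WINDOWS\system32>gmic eval "cp=8364;b=[0xE0|(cp>>12),0x80|((cp>>6)&0x3F),0x80|(cp&0x3F)];print(b)"
This should yield the byte vector (226,130,172), i.e. 0xE2 0x82 0xAC, which is exactly the UTF-8 encoding of €.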