gema for Unicode

By David N. Gray
Original version August 23, 2023
Updated for beta test January 12, 2024
Release version March 13, 2024

Status

A new 2.0 version of gema which supports Unicode characters is now available. A pre-built executable is available at https://sourceforge.net/projects/gema/files/gema/gema-2.0/. The new source code is in the git branch "Unicode-dev" (https://sourceforge.net/p/gema/code.git/ci/Unicode-dev/tree/) if you would like to try building it. (Make sure that you build with the macro _USE_WCHAR defined. See also file "src/ReadMe.txt".) Let me know of any bugs that you find, or any other suggestions. You can e-mail me at: DGray@acm.org

The functionality is unchanged from the beta test version, but a few minor bugs have been fixed, so if you have the beta test, you should replace it with the official release.

Remember that you can use
gema -version @end
to check which version you are using.

The documentation files have also been updated to match. Following is a summary of the new features. There are no incompatible changes from previous releases, so existing scripts and pattern files should still work the same.

Operational Overview

Input files and pattern files may use any of the following encodings: 8-bit, UTF-8, or UTF-16 (either big or little endian). By default, the input encoding is determined automatically, and the output file will use the same encoding as the corresponding input file. Standard output and standard error are assumed to always be UTF-8.
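
The exact rules gema uses to determine the encoding are not spelled out here, so the following Python fragment is only a rough sketch of how such automatic detection can work: look for a byte order mark first, then check whether any high bytes form valid UTF-8. The function name guess_encoding is hypothetical, and the returned labels are simply borrowed from the -outenc value list; neither is taken from the gema source.

```python
import codecs

def guess_encoding(data: bytes) -> str:
    """Guess an encoding label for raw file bytes (illustrative only)."""
    # A byte order mark, when present, settles the question.
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE"
    # No BOM: data with no high bytes is plain ASCII.
    if all(b < 0x80 for b in data):
        return "ASCII"
    # High bytes that form valid UTF-8 sequences are taken as UTF-8;
    # anything else is treated as plain 8-bit characters.
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        return "8bit"
```

A file with a stray Latin-1 byte, such as b"caf\xe9", fails the UTF-8 check in this sketch and is classified as 8bit, which is the kind of case where -inenc 8bit might otherwise be needed.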

Internally, input data and patterns are seen as streams of 21-bit Unicode characters (or code points, to be more precise). UTF-8 encoding and UTF-16 surrogate pairs are handled transparently, so they are never seen directly. In a pattern file, anywhere that an arbitrary character can appear, the full range of Unicode characters is supported.
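
To show what handling surrogate pairs transparently amounts to, here is a small Python sketch of the standard UTF-16 rule for combining a high and a low surrogate into one code point. The function name decode_utf16_units is made up for this example, and passing an unpaired surrogate through unchanged is my assumption, not a statement about gema's internals.

```python
def decode_utf16_units(units):
    """Turn a sequence of 16-bit code units into a list of code points."""
    code_points = []
    i = 0
    while i < len(units):
        hi = units[i]
        is_pair = (
            0xD800 <= hi <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            lo = units[i + 1]
            # Ten bits come from each half; the result lies above U+FFFF.
            code_points.append(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
            i += 2
        else:
            # Ordinary 16-bit character, or an unpaired surrogate
            # (passed through unchanged in this sketch).
            code_points.append(hi)
            i += 1
    return code_points
```

The largest possible pair, DBFF DFFF, yields 10FFFF, which is why 21 bits are enough to hold any code point.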

One question which has hung over this project is how recognizers such as <L> (for letters) should be affected. Should it include letters in any alphabet, or just the locally selected language? Since there didn't seem to be any clear right answer, I chose to keep the built-in recognizers as applying only to ASCII characters, but if the user wants something else, they can define what they want themselves. The new function @defset (see details below) can be used to either extend the set of letters or to define new categories of characters from scratch.

New Features

Command line options

-outenc encoding: Output encoding. Specifies what encoding to use for output files opened subsequently. The valid values are (case-insensitive): auto, ASCII, 8bit, UTF-8, UTF-16LE, and UTF-16BE. The default is automatic, meaning the same as the input file. If ASCII is specified, any characters outside the ASCII range will be represented using \x or \u escape sequences.
-inenc encoding: Input encoding. Specifies what encoding to use for input files or pattern files opened subsequently. (It may be used more than once.) The values are the same as for -outenc. This option should seldom be necessary as the default is to determine the encoding automatically. You might need the option -inenc 8bit if eight-bit characters are being inappropriately interpreted as UTF-8 encoding.
-nobom: Suppresses writing of a Byte Order Mark at the beginning of the output file. (This is only relevant if the output encoding is UTF-8 or UTF-16.)
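
For readers unfamiliar with the Byte Order Mark, the following Python sketch shows the bytes involved for each encoding and the effect of leaving them out. The helper encode_output and its write_bom parameter are hypothetical names for this illustration; writing the mark by default is assumed only from the description of -nobom above.

```python
import codecs

# Byte order marks and Python codec names, keyed by lower-cased label.
BOMS = {
    "utf-8": codecs.BOM_UTF8,         # EF BB BF
    "utf-16le": codecs.BOM_UTF16_LE,  # FF FE
    "utf-16be": codecs.BOM_UTF16_BE,  # FE FF
}
CODECS = {"utf-8": "utf-8", "utf-16le": "utf-16-le", "utf-16be": "utf-16-be"}

def encode_output(text, encoding, write_bom=True):
    """Encode text, optionally prefixed with the matching byte order mark."""
    name = encoding.lower()  # the labels are case-insensitive
    body = text.encode(CODECS[name])
    return (BOMS[name] if write_bom else b"") + body
```

With write_bom=False the output begins directly with the encoded text, which is what some downstream tools expect from UTF-8 files.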

Pattern syntax

The notation \uxxxx can be used to designate a Unicode character by its hexadecimal code. This is the same as in C and C++, except that there may be anywhere from one to eight hexadecimal digits instead of exactly four. To avoid ambiguity, the hexadecimal digits may optionally be enclosed in braces: \u{xxxx}
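
The ambiguity that the braces resolve can be seen in a small Python sketch of such an escape parser. This is not gema's parser; it assumes that the unbraced form greedily consumes up to eight hexadecimal digits, and the name expand_u_escapes is invented for the example.

```python
import re

# Either \u{digits} or \u followed by one to eight hex digits.
_U_ESCAPE = re.compile(r"\\u(?:\{([0-9A-Fa-f]{1,8})\}|([0-9A-Fa-f]{1,8}))")

def expand_u_escapes(text):
    """Replace \\u escapes in text with the characters they designate."""
    def repl(match):
        digits = match.group(1) or match.group(2)
        # chr() raises ValueError for values above U+10FFFF.
        return chr(int(digits, 16))
    return _U_ESCAPE.sub(repl, text)
```

Under that greedy assumption, \ue9bc is read as the single character U+E9BC, whereas \u{e9}bc is the character U+00E9 followed by the literal letters "bc".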

Templates

The following new recognizers are supported:

<B>: user defined; see @defset below.
<E>: Emojis, emoticons, pictographs, arrows, and other geometric shapes. These are graphic characters which are pictures instead of language elements or conventional symbols.
<H>: Han (Chinese) characters.
<M>: Mathematical operator symbols.
<Q>: user defined; see @defset below.
<R>: user defined; see @defset below.
<V>: Valid characters; excludes control characters, local use codes, unpaired surrogates, out-of-range values, and explicit not-a-character codes.
<Z>: user defined; see @defset below.
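
As an approximation of what <V> accepts, the exclusions listed above can be written out as code point ranges from the Unicode standard. The Python function below is my reading of that description, not gema's actual table; in particular, treating every C0 and C1 control code (including tab and newline) as excluded is an assumption, and the name is_valid_char is hypothetical.

```python
def is_valid_char(cp):
    """Return True if code point cp would count as 'valid' in this sketch."""
    if not 0 <= cp <= 0x10FFFF:
        return False                      # out-of-range value
    if cp < 0x20 or 0x7F <= cp <= 0x9F:
        return False                      # C0 and C1 control characters
    if 0xD800 <= cp <= 0xDFFF:
        return False                      # surrogate code points
    if 0xE000 <= cp <= 0xF8FF or cp >= 0xF0000:
        return False                      # private (local) use areas
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return False                      # explicit not-a-character codes
    return True
```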

Functions

The new function @defset{letter;characters} can be used to define a set of characters for template recognizers. It takes two parameters: a letter naming the character set, and the list of characters. As in a regular expression character set, a hyphen can be used to indicate a range of characters. For example, 123456 can be abbreviated as 1-6. Any naming letter from A to Z is allowed; generally the defined set extends the meaning of the corresponding recognizer <A> to <Z>. Following are the special cases:

B: defines <B> (the default is an empty set)
E: redefines <E> (the default is @defset{E;\u2190-\u21FF\u2300-\u239A\u23B4-\u23FF\u2500-\u27BF\u27F0-\u27FF\u2900-\u297F\u2B00-\u2BFF\uFFE8-\uFFEE\u1CF00-\u1D24F\u1F000-\u1F0FF\u1F300-\u1FBFF})
J: additional lower-case letters to extend the meaning of <J>, <L>, <A>, and <W>
K: additional upper-case letters to extend the meaning of <K>, <L>, <A>, and <W>
H: redefines <H> (the default is @defset{H;\u2E80-\u2EFF\u3000-\u303F\u31C0-\u31EF\u3200-\u9FFF\u20000-\u323FF})
I: additional identifier constituents (besides ASCII letters and digits); this extends <I> and also affects \I. The default is: @defset{I;_} Note that @defset{I;characters} is equivalent to the older (now deprecated) notation @set-parm{idchars;characters} except that @defset allows using a hyphen for a range.
F: the set of characters which are accepted by <F> as being file name constituents, in addition to ASCII letters and digits. The default on Windows is: @defset{F;-.\/_~\#\@%+\=\:\\} (Note that here the hyphen is listed first in order to be taken literally instead of indicating a range.) This supersedes @set-parm{filechars;characters} (which for compatibility does not support ranges).
M: redefines <M>
Q: defines <Q> (the default is an empty set)
R: defines <R> (the default is an empty set)
S: specifies additional white space characters for <S>, -w, \S, \W, and @wrap. For example: @defset{S;\u2000-\u200B\u2028\u2029\u205F}
V: redefines <V>
Z: defines <Z> (the default is an empty set)
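
The hyphen range notation accepted by @defset can be illustrated with a short Python sketch that expands a character list into a set. It assumes escape sequences have already been resolved, and that a hyphen is literal whenever it comes first or last; the second of these is my assumption (only the leading case is stated above). The name expand_set is hypothetical.

```python
def expand_set(spec):
    """Expand a character list such as '1-6' into a set of characters."""
    chars = set()
    i = 0
    while i < len(spec):
        # A hyphen between two characters denotes an inclusive range.
        if i + 2 < len(spec) and spec[i + 1] == "-":
            first, last = ord(spec[i]), ord(spec[i + 2])
            chars.update(chr(cp) for cp in range(first, last + 1))
            i += 3
        else:
            # A lone character; this includes a leading or trailing hyphen.
            chars.add(spec[i])
            i += 1
    return chars
```

For instance, expand_set("1-6") gives the six digits 1 through 6, while expand_set("-._") gives just the three listed characters because the hyphen comes first.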

If both @defset{J;... and @defset{K;... are provided and they have the same number of characters, then the corresponding elements (taken in code value order) are assumed to be corresponding lower and upper case letters. This is used to extend case-insensitive comparison for -i, \C, and @cmpi. It is also used by the functions @upcase and @downcase.
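
The pairing rule just described can be sketched in Python as follows: sort both sets by code value and match them element by element. The function build_case_maps is a made-up name, and returning empty maps when the sizes differ is simply how this sketch models "no pairing"; it says nothing about what gema does internally in that case.

```python
def build_case_maps(lower_set, upper_set):
    """Pair lower- and upper-case letters by position in code value order."""
    lowers = sorted(lower_set)  # sorted() orders strings by code point
    uppers = sorted(upper_set)
    if len(lowers) != len(uppers):
        return {}, {}           # sizes differ: no correspondence assumed
    to_upper = dict(zip(lowers, uppers))
    to_lower = dict(zip(uppers, lowers))
    return to_upper, to_lower

# Example: the basic Cyrillic letters, 32 in each case.
cyrillic_lower = {chr(cp) for cp in range(0x0430, 0x0450)}
cyrillic_upper = {chr(cp) for cp in range(0x0410, 0x0430)}
TO_UPPER, TO_LOWER = build_case_maps(cyrillic_lower, cyrillic_upper)
```

This works cleanly for alphabets whose two cases are laid out in the same order; a set in which the cases are ordered differently would be paired incorrectly by a purely positional rule.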

The new function @getset{letter} returns the current value of the designated set of characters. This may be useful for extending a built-in set. For example, the following action adds ampersand to the set of mathematical symbols: @defset{M;@getset{M}\&}

The new function @as-ascii{characters} is like @quote in that it escapes syntactically significant characters, but it also represents characters beyond the ASCII range using \x or \u escape sequences. This may be particularly useful for a visual representation of a set of Unicode characters. For example: @as-ascii{@getset{E}} shows the ranges of code numbers which define <E>.
(I'm not happy with the name "as-ascii", so I'm open to any better suggestions.)
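
The general idea behind @as-ascii can be sketched in Python. The rule used here for choosing between \x and \u (two hexadecimal digits below 100 hex, \u otherwise) is a guess, since the text above does not say where the boundary lies, and the @quote-style escaping of syntactically significant characters is left out entirely. The function name as_ascii is only for this example.

```python
def as_ascii(text):
    """Render text using only printable ASCII plus \\x and \\u escapes."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp < 0x7F:
            out.append(ch)                 # printable ASCII stays as-is
        elif cp < 0x100:
            out.append("\\x%02X" % cp)     # assumed form for small values
        else:
            out.append("\\u%04X" % cp)     # at least four hex digits
    return "".join(out)
```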

@set-parm{inenc;encoding} is equivalent to the command line option -inenc, and @set-parm{outenc;encoding} is equivalent to the command line option -outenc.

@set-switch{bom;0} is equivalent to the command line option -nobom. The default value is 1 for true.

@encoding{} returns the name string of the input file encoding. For example, the following shell command could be used to find out the encoding of a given file:
gema -match -p "\E=@encoding\n" file
Reporting at the end of the file (\E) enables finding out whether there are any 8-bit characters or UTF-8 sequences in a file of bytes.

Other considerations

Locales

The hard-coded recognizers, such as <L> for letters, are implemented by C95 library character type functions such as iswalpha. When using the default locale of "C", they consider only the ASCII character set. If the gema function @set-locale is used to change the locale for the C library, that may change the behavior of the character type functions, such as including some non-English letters. But the valid locale names are platform-dependent, and the available locales may depend on the local configuration of the operating system. Furthermore, the actual effect on the character type functions is not only platform dependent, but also tends to be undocumented. Thus the locale mechanism seems an unreliable way to handle other languages.

Consequently, I was tempted to just dispense with the C library character type functions and use the new set-of-characters mechanism for all of the recognizers, which would be a simpler and more consistent model. But for the sake of compatibility with previous releases, I didn't do that.

Performance

In theory, the Unicode version should be slightly less efficient than the original 8-bit configuration, but I'm not seeing any noticeable difference. So, while the source code still preserves the ability to build for 8-bit characters, I'm thinking now that there will be no need to support both configurations.