A new release of gema which supports Unicode characters is now available. A pre-built executable is available at https://sourceforge.net/projects/gema/files/gema/gema-2.0/ .
The new source code is in the git branch "Unicode-dev" (https://sourceforge.net/p/gema/code.git/ci/Unicode-dev/tree/) if you would like to try building it. (Make sure that you build with the macro _USE_WCHAR defined. See also the file "src/ReadMe.txt".)
Let me know of any bugs that you find, or any other suggestions.
You can e-mail me at: DGray@acm.org
The functionality is unchanged from the beta test version, but a few minor bugs have been fixed, so if you have the beta test, you should replace it with the official release.
Remember that you can use
gema -version @end
to check which version you are using.
The documentation files have also been updated to match. Following is a summary of the new features. There are no incompatible changes from previous releases, so existing scripts and pattern files should still work the same.
Internally, input data and patterns are seen as streams of 21-bit Unicode characters (or code points, to be more precise). UTF-8 encoding and UTF-16 surrogate pairs are handled transparently, so they won't be seen directly. In a pattern file, anywhere that an arbitrary character can appear, the full range of Unicode characters is supported.
One question which has hung over this project is how recognizers such as <L> (for letters) should be affected. Should they include letters in any alphabet, or just the locally selected language? Since there didn't seem to be any clear right answer, I chose to keep the built-in recognizers as applying only to ASCII characters, but if the user wants something else, they can define what they want themselves. The new function @defset (see details below) can be used either to extend the set of letters or to define new categories of characters from scratch.
The new command line option -outenc encoding specifies the output encoding. The possible values are auto, ASCII, 8bit, UTF-8, UTF-16LE, and UTF-16BE. The default is automatic, meaning the same as the input file. If ASCII is specified, any characters outside the ASCII range will be represented using \x or \u escape sequences.
The new command line option -inenc encoding specifies the input encoding; the possible values are the same as for -outenc. This option should seldom be necessary, as the default is to determine the encoding automatically. You might need the option -inenc 8bit if eight-bit characters are being inappropriately interpreted as UTF-8 encoding (as in the example below).
The new command line option -nobom suppresses writing a Unicode byte order mark (BOM) at the beginning of the output.
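For example, the following command (an illustrative sketch with hypothetical file and rule-file names) would read a file of eight-bit (for example, Latin-1) characters, apply the rules in rules.gema, and write UTF-8 output without a byte order mark:
gema -inenc 8bit -outenc UTF-8 -nobom -f rules.gema input.txt output.txt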
The new escape sequence \uxxxx can be used to designate a Unicode character by its hexadecimal code. This is like in C and C++, except that there may be anywhere from one to eight hexadecimal digits instead of exactly four. To avoid ambiguity, the hexadecimal digits may optionally be enclosed in braces: \u{xxxx}
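As an illustration (not taken from the documentation), a rule such as the following could replace the word "degrees" with the degree sign, U+00B0:
degrees=\u{B0}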
New template recognizers:
<B>  (see @defset below)
<E>
<H>
<M>
<Q>  (see @defset below)
<R>  (see @defset below)
<V>
<Z>  (see @defset below)
The new function @defset{letter;characters} can be used to define a set of characters for template recognizers. It takes two parameters: a letter naming the character set, and the list of characters. As in a regular expression character set, a hyphen can be used to indicate a range of characters; for example, 123456 can be abbreviated as 1-6. Any naming letter from A to Z is allowed; generally the defined set extends the meaning of the corresponding recognizer, <A> to <Z>.
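For example (an illustrative sketch, not taken from the documentation), the accented letters of the Latin-1 range could be added to the set recognized by <L> with:
@defset{L;\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF}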
Following are the special cases:
<B>  (the default is an empty set)
<E>  (the default is @defset{E;\u2190-\u21FF\u2300-\u239A\u23B4-\u23FF\u2500-\u27BF\u27F0-\u27FF\u2900-\u297F\u2B00-\u2BFF\uFFE8-\uFFEE\u1CF00-\u1D24F\u1F000-\u1F0FF\u1F300-\u1FBFF})
<J>  also affects <L>, <A>, and <W>
<K>  also affects <L>, <A>, and <W>
<H>  (the default is @defset{H;\u2E80-\u2EFF\u3000-\u303F\u31C0-\u31EF\u3200-\u9FFF\u20000-\u323FF})
<I>  defines additional identifier characters and also affects \I. The default is: @defset{I;_} Note that @defset{I;characters} is equivalent to the older (now deprecated) notation @set-parm{idchars;characters}, except that @defset allows using a hyphen for a range. (An example follows this list.)
<F>  defines additional characters treated as being file name constituents, in addition to ASCII letters and digits. The default on Windows is: @defset{F;-.\/_~\#\@%+\=\:\\} (Note that here the hyphen is listed first in order to be taken literally instead of indicating a range.) This supersedes @set-parm{filechars;characters} (which for compatibility does not support ranges).
<M>  (mathematical symbols)
<Q>  (the default is an empty set)
<R>  (the default is an empty set)
<S>  also affects -w, \S, \W, and @wrap. For example: @defset{S;\u2000-\u200B\u2028\u2029\u205F}
<V>
<Z>  (the default is an empty set)
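As an example of adjusting one of these sets (an illustration, not taken from the documentation), a script that treats hyphens as identifier constituents, as in Lisp-style names, could use:
@defset{I;-_}
with the hyphen listed first so that it is taken literally rather than indicating a range.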
If both @defset{J;... and @defset{K;... are provided and they have the same number of characters, then the corresponding elements (taken in code value order) are assumed to be corresponding lower- and upper-case letters. This is used to extend case-insensitive comparison for -i, \C, and @cmpi. It is also used by the functions @upcase and @downcase.
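For example (an illustrative sketch, not taken from the documentation), corresponding lower- and upper-case Cyrillic letters could be defined with:
@defset{J;\u0430-\u044F}
@defset{K;\u0410-\u042F}
Both ranges contain 32 characters, so \u0430 pairs with \u0410, \u0431 with \u0411, and so on.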
The new function @getset{letter} returns the current value of the designated set of characters. This may be useful for extending a built-in set. For example, the following action adds ampersand to the set of mathematical symbols:
@defset{M;@getset{M}\&}
The new function @as-ascii{characters} is like @quote in that it escapes syntactically significant characters, but it also represents characters beyond the ASCII range using \x or \u escape sequences. This may be particularly useful for a visual representation of a set of Unicode characters. For example:
@as-ascii{@getset{E}}
shows the ranges of code numbers which define <E>.
(I'm not happy with the name "as-ascii", so I'm open to any better suggestions.)
@set-parm{inenc;encoding} is equivalent to the command line option -inenc, and @set-parm{outenc;encoding} is equivalent to the command line option -outenc.
@set-switch{bom;0} is equivalent to the command line option -nobom. The default value is 1 for true.
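For instance (a sketch which assumes that \B still matches at the beginning of the input, as in previous releases), a pattern file that should always produce ASCII output with no byte order mark could include the rule:
\B=@set-parm{outenc;ASCII}@set-switch{bom;0}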
@encoding{} returns the name string of the input file encoding. For example, the following shell command could be used to find out the encoding of a given file:
gema -match -p "\E=@encoding\n" file
Reporting at the end of the file (\E) enables finding out whether there are any 8-bit characters or UTF-8 sequences in a file of bytes.
The built-in recognizers, such as <L> for letters, are implemented by C95 library character type functions such as iswalpha. When using the default locale of "C", they consider only the ASCII character set. If the gema function @set-locale is used to change the locale for the C library, that may change the behavior of the character type functions, such as including some non-English letters.
But the valid locale names are platform-dependent, and the available
locales may depend on the local configuration of the operating system.
Furthermore, the actual effect on the character type functions is not only
platform dependent, but also tends to be undocumented. Thus the locale
mechanism seems an unreliable way to handle other languages.
Consequently, I was tempted to just dispense with the C library character type functions and use the new set-of-characters mechanism for all of the recognizers, which would be a simpler and more consistent model. But for the sake of compatibility with previous releases, I didn't do that.