Gema Built-in Functions

4 Built-in Functions

There are a large number of built-in functions that can be used in actions. Function calls have the form “@name{args}”, with arguments separated by “;”. For functions without arguments, the argument delimiters “{}” may be omitted if they are not needed to separate the name from the following character.[Footnote 1] Each argument is itself an action which can use any of the special characters defined in actions, including nested function calls. The argument value is the result of evaluating the argument. In a few cases, arguments that are not used are skipped instead of being evaluated, but arguments are never used literally. All functions take a fixed number of arguments, although in a couple of cases the last argument is optional.

The descriptions of the functions use the terminology of a value being returned by the function, but it would be more accurate to speak of the result as being the series of characters that will be written to the current output stream, since in general the result value is not actually materialized as a separate string. Usually while evaluating a function argument, the current output stream is an internal buffer that collects the argument value for the function, and the function argument is actually an input stream that reads from that buffer. But in most cases these distinctions are not important for understanding how to use the functions.

The following sections document groups of related functions.

4.1 Numbers

Since gema is a text processor, it is not designed to be convenient or efficient for numeric operations, but it does provide a set of arithmetic functions that should be sufficient for whatever calculations are necessary.

While all values are character strings, a string can be treated as a number if it consists of decimal digits optionally preceded by + or - and optionally preceded or followed by spaces. Where a numeric argument is required, such a string will be converted internally to a 32-bit signed integer. An error will be reported if the string is not a valid number. Functions that return a number will return a string of decimal digits possibly preceded by a minus sign.
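
The conversion rule above can be pictured with a small Python model. This is only a sketch of the documented rule, not gema's own code; the name parse_number is hypothetical, and rejecting values outside the 32-bit range is an assumption, since the text does not say how overflow is handled.

```python
import re

# Optional spaces, optional sign, decimal digits, optional spaces.
_NUMBER = re.compile(r" *[+-]?[0-9]+ *")

def parse_number(text):
    """Model of how a string argument is accepted as a number."""
    if not _NUMBER.fullmatch(text):
        raise ValueError(f"not a valid number: {text!r}")
    value = int(text.strip(" "))
    # Assumption: out-of-range values are treated as errors.
    if not -2**31 <= value <= 2**31 - 1:
        raise ValueError(f"does not fit in 32 bits: {text!r}")
    return value

def format_number(value):
    """Results are decimal digits, possibly preceded by a minus sign."""
    return str(value)
```

With this model, parse_number(" -42 ") yields -42, while a string such as "4 2" is rejected.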

Following are the arithmetic functions:

@add{number;number}: Addition - returns the sum of the two numbers.
@sub{number;number}: Subtraction - returns the first argument minus the second.
@mul{number;number}: Multiplication - returns the product of the two numbers.
@div{number;number}: Division - returns the quotient of dividing the first argument by the second.
@mod{number;number}: Modulus - returns the first argument modulo the second, as implemented by the C operator “%”.
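
Because @mod follows the C operator “%”, its result for negative operands can differ from the modulo operator of other languages. The following Python sketch illustrates the difference. It assumes C99 semantics, where integer division truncates toward zero, and it assumes that @div produces the matching truncated quotient, which the text above does not state explicitly. The names c_div and c_mod are hypothetical.

```python
def c_div(a, b):
    """Integer quotient truncated toward zero, as in C99."""
    quotient = abs(a) // abs(b)
    return quotient if (a < 0) == (b < 0) else -quotient

def c_mod(a, b):
    """Remainder as computed by C's %: takes the sign of the dividend."""
    return a - b * c_div(a, b)

# Python's own % takes the sign of the divisor, so the two disagree
# whenever the operands have different signs.
examples = [(7, 3), (-7, 3), (7, -3), (-7, -3)]
table = [(a, b, c_mod(a, b), a % b) for a, b in examples]
```

For the operands -7 and 3, the C-style remainder is -1, whereas Python's “%” gives 2.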

Also, the following group of functions can be used to operate on numbers as bit strings:

@and{number;number}: Returns the bit-wise and of the two numbers.
@or{number;number}: Returns the bit-wise or of the two numbers.
@not{number}: Returns the bit-wise inverse of the argument.

Finally, some other assorted functions that deal with numbers:

@cmpn{number;number;less-value;equal-value;greater-value}: Compare numbers - returns the result of evaluating either the third, fourth, or fifth argument depending on whether the first argument is less than, equal to, or greater than the second, when compared as 32-bit signed numbers. The two arguments that are not used are not evaluated. For example, the following rule defines a function that will return the larger of two comma-separated numbers:
maxn:<N>,<N>=@cmpn{$1;$2;$2;$1;$1}
while the following rule sets a variable to the largest number seen:
notemax:<N>=@cmpn{$1;${max};;;@set{max;$1}}
@int-char{number}: Returns the character whose internal code is given by the argument.
@char-int{character}: Returns the decimal number representation of the internal character code of the argument, which should be a single-character string.
@radix{from;to;value}: Radix conversion. The first two arguments must be decimal integers. The third argument is interpreted as a number whose base is specified by the first argument. The result value is that number represented in the base specified by the second argument. As currently implemented, from may be any number from 2 to 32, but to can only be one of 8, 10, or 16. For example, octal constants in a C program could be converted to hexadecimal form by the following rule:
\I0<O>\I=0x@radix{8;16;$1}
For hexadecimal output, upper-case letters are used for the digits greater than 9. If lower-case letters are desired, the @downcase function can be used on the value returned by @radix.
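
As a rough cross-check of what @radix is described as computing, here is a Python sketch that mirrors the documented restrictions on the two bases. The function name radix is hypothetical, and the handling of signs or malformed digits is left to Python's int and is not claimed to match gema.

```python
def radix(from_base, to_base, value):
    """Model of @radix{from;to;value} using the documented base limits."""
    if not 2 <= from_base <= 32:
        raise ValueError("from must be between 2 and 32")
    formats = {8: "o", 10: "d", 16: "X"}  # upper-case hexadecimal digits
    if to_base not in formats:
        raise ValueError("to must be 8, 10, or 16")
    return format(int(value, from_base), formats[to_base])

octal_to_hex = radix(8, 16, "777")        # "1FF"
lower_case = radix(8, 16, "777").lower()  # like applying @downcase
```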

4.2 String functions

The following built-in functions perform various manipulations on character strings.

4.2.1 Output formatting -- padding, filling, and wrapping

The following group of functions take two arguments; the first must be a number and the second is an arbitrary string. If the length of the string is greater than the number, then it is returned unchanged. Otherwise, the returned value will consist of the string padded with spaces to be of the designated length. The choice of function determines how the padding is done:

@left{length;string}: Left-justify the string, padding with spaces to the designated length. For example, “@left{8;ab}” returns “ab” followed by 6 spaces, while “@left{8;hippopotamus}” returns “hippopotamus” with no spaces. If you want long values to be truncated, you can use:
@left{length;@substring{0;length;string}}
or write something like “@left{8;@cut8{arg}}”, accompanied by the rule: “cut8:<U8>=$1@end”.
@right{length;string}: Right-justify the string, padding with spaces to the designated length.
@center{length;string}: Center the string within a field of the designated length.

Note that any of these functions can also be used with an empty second argument as a convenient way to generate a particular number of spaces.

The following group of functions serve a similar purpose, except that padding can be done using any arbitrary string instead of spaces. Here the first argument is the string representing an empty field, and the second argument will be justified within that field.

@fill-left{background;value}: Left-justify the value on top of the background string. For example, “@fill-left{......;foo}” returns “foo...”.
@fill-right{background;value}: Right-justify the value on top of the background string. For example, “@fill-right{00000;12}” returns “00012”.
@fill-center{background;value}: Center the value on top of the background string. For example, “@fill-center{(((())));xy}” returns “(((xy)))”.
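
The overlay behavior of the three fill functions can be modelled in a few lines of Python. This is a sketch consistent with the examples above, not the actual implementation. Two points are assumptions: a value longer than the background is returned unchanged, and an odd amount of leftover space in fill_center puts the extra character on the right.

```python
def fill_left(background, value):
    """Value overlaid on the left end of the background."""
    if len(value) >= len(background):
        return value  # assumption: long values pass through
    return value + background[len(value):]

def fill_right(background, value):
    """Value overlaid on the right end of the background."""
    if len(value) >= len(background):
        return value  # assumption: long values pass through
    return background[:len(background) - len(value)] + value

def fill_center(background, value):
    """Value overlaid on the middle of the background."""
    if len(value) >= len(background):
        return value  # assumption: long values pass through
    start = (len(background) - len(value)) // 2  # assumption: round down
    return background[:start] + value + background[start + len(value):]
```

The model reproduces the three documented results: “foo...”, “00012”, and “(((xy)))”.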

The following functions perform formatting based on the current context in the output stream:

@tab{number}

The return value consists of however many space characters it takes to advance the output stream to the specified column number. If the output stream is already at or beyond the specified column, the return value is empty. Column 1 means the first character position following a newline character or the beginning of the data stream. Thus, for example, if the last character output was a newline, then @tab{10} will return 9 space characters so that the next character written will go in column 10.

@wrap{string}

Output with line wrapping. If there is room for the string on the current line of output, then it will be returned unchanged. Otherwise, when the string is longer than the remaining space on the current line, the return value consists of a newline character followed by an optional indentation string followed by the string argument with any leading whitespace removed. However, if the output stream is already at the beginning of a line, then the return value is the indentation string followed by the argument string with leading whitespace removed. By default, the lines are up to 80 characters long and the indentation string is empty. These parameters can be changed by the @set-wrap function below. Typically the argument string will be a word preceded by a space character, so that the space will separate it from the previous word if it fits on the current line, or will be discarded if a new line is started.

For example, you could reformat a text file with the shell command:

  gema -p '<G>=@wrap{ $1};\n\W\n=\n\n;\S=;' in.text out.text

where the first rule causes the groups of graphic (non-space) characters to be written separated by a single space in 80-character lines, the second rule preserves blank lines as paragraph separators, and the third rule discards other whitespace characters.

@set-wrap{number;string}

For subsequent calls to @wrap, the first argument specifies the maximum number of characters in a line, and the second argument is the indentation string. No value is returned. For example, for output with a four character left margin followed by a maximum of 70 characters of text, do: “@set-wrap{74;\s\s\s\s}”
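
The interaction between @wrap, @set-wrap, and the current output column may be easier to follow in a small Python model. This is a sketch of the rules as described above, under two assumptions: the string “fits” when the current line length plus the string length does not exceed the maximum, and the beginning-of-line case takes priority over the fit test. The class WrapModel and its method names are hypothetical.

```python
class WrapModel:
    """Toy model of @wrap and @set-wrap on a single output stream."""

    def __init__(self):
        self.width = 80    # documented default line length
        self.indent = ""   # documented default indentation string
        self.column = 0    # characters already on the current line
        self.chunks = []

    def set_wrap(self, width, indent):
        self.width, self.indent = width, indent

    def write(self, text):
        """Append text to the output and track the current line length."""
        self.chunks.append(text)
        newline = text.rfind("\n")
        if newline >= 0:
            self.column = len(text) - newline - 1
        else:
            self.column += len(text)

    def wrap(self, text):
        stripped = text.lstrip()
        if self.column == 0:                             # at start of a line
            result = self.indent + stripped
        elif self.column + len(text) <= self.width:      # room on this line
            result = text
        else:                                            # start a new line
            result = "\n" + self.indent + stripped
        self.write(result)
        return result

    def output(self):
        return "".join(self.chunks)

model = WrapModel()
model.set_wrap(10, "> ")
for word in "aaa bbb ccc ddd".split():
    model.wrap(" " + word)  # a word preceded by a space, as suggested above
```

Note that in this model the line length counts the indentation string, which matches the “@set-wrap{74;\s\s\s\s}” example of a four character margin plus 70 characters of text.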

4.2.2 String Comparison

Each of the following functions compares two strings and then returns the value of one of three arguments depending on the result of the comparison. The two arguments that are not used are not evaluated, so these functions can be used for conditional evaluation of side-effects as well as for returning a value.

@cmps{string;string;less-value;equal-value;greater-value}: Compare strings, case-sensitive. The characters are compared simply by their numeric codes, not by any language's collating sequence. The returned value is either the third, fourth, or fifth argument depending on whether the first argument is less than, equal to, or greater than the second.
@cmpi{string;string;less-value;equal-value;greater-value}: Compare strings, case-insensitive.
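
To illustrate why it matters that the unused arguments are not evaluated, the following Python sketch models the three value arguments as callables, so that only the selected one runs. The names cmps, cmpi, and note are hypothetical, and folding case with lower() is an assumption about how the case-insensitive comparison behaves.

```python
def cmps(first, second, less, equal, greater):
    """Compare by character code; evaluate only the selected argument."""
    if first < second:
        chosen = less
    elif first == second:
        chosen = equal
    else:
        chosen = greater
    return chosen()

def cmpi(first, second, less, equal, greater):
    """Case-insensitive variant (assumption: compare lower-cased copies)."""
    return cmps(first.lower(), second.lower(), less, equal, greater)

log = []

def note(tag):
    """Build an argument whose evaluation has a visible side effect."""
    def evaluate():
        log.append(tag)
        return tag
    return evaluate

sensitive = cmps("a", "B", note("less"), note("equal"), note("greater"))
insensitive = cmpi("a", "B", note("less"), note("equal"), note("greater"))
```

Here the case-sensitive comparison selects the “greater” argument, because the code of “a” is higher than that of “B”, while the case-insensitive comparison selects “less”. The log records exactly one entry per call.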

4.2.3 Case conversion

The following functions return a copy of their argument, converting the case of any letters:

@upcase{string}: Convert any lower-case letters to upper-case.
@downcase{string}: Convert any upper-case letters to lower-case.

For example, the following rule will capitalize each word in the input data:
<L1><w>=@upcase{$1}@downcase{$2}

4.2.4 Miscellaneous string functions

@length{string}

Returns the length of the argument as a decimal number. For example, “@length{abcdefghijkl}” returns the string “12”.

@reverse{string}

Returns the characters of the argument in reversed order.
For example, “@reverse{abcd}” returns “dcba”. This may be useful for performing processing that needs to be done from right to left. For example, the following set of rules will insert commas in the proper position in all numbers of four or more digits, grouping the digits by threes from the right-hand end:

  <D3><D>=@reverse{@comma{@reverse{$1$2}}}
  comma:<D3><D0>=$1,
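
The same reverse, group, reverse idea can be expressed in Python for comparison. This is an analogue of the technique, not a translation of the rules above; the function name add_commas is hypothetical.

```python
import re

def add_commas(digits):
    """Group digits by threes from the right by working on the reversal."""
    reversed_digits = digits[::-1]
    # Put a comma after each group of three that is followed by another digit.
    grouped = re.sub(r"([0-9]{3})(?=[0-9])", r"\1,", reversed_digits)
    return grouped[::-1]

examples = [add_commas(n) for n in ["123", "1234", "123456", "1234567"]]
```

Working on the reversed string turns the right-to-left grouping into an ordinary left-to-right scan.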

@substring{skip;length;string}

Returns a substring of the third argument formed by skipping the number of characters indicated by the first argument and then taking the number of characters indicated by the second argument. For example, “@substring{3;4;elephant}” returns “phan”. If the first argument is negative, the effect is the same as zero. If the first argument is greater than the length of the string, then the result value is empty. The result may actually be shorter than length if there are not enough characters in the string: “@substring{3;99;tiger}” returns “er”.
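
The edge cases just described map directly onto a Python slice, as the following sketch shows. The function name substring is hypothetical, and treating a negative length as zero is an assumption, since the text only covers a negative skip.

```python
def substring(skip, length, text):
    """Model of @substring{skip;length;string}."""
    start = max(skip, 0)                      # negative skip acts like zero
    return text[start:start + max(length, 0)]  # slices stop at the end

results = [
    substring(3, 4, "elephant"),   # "phan"
    substring(3, 99, "tiger"),     # "er"
    substring(-2, 3, "tiger"),     # "tig"
    substring(9, 2, "tiger"),      # ""
]
```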

Note that splitting input data into fields is usually more conveniently done by using a template such as “\L<U2><U3><u>\n”. The @substring function is more likely to be useful in cases where the numbers are computed instead of being constants.

@repeat{number;action}

The second argument is repeated the number of times specified by the first argument. For example, a string of eighty hyphens can be constructed by “@repeat{80;-}”.

While this is being listed under string functions because it doesn't seem to fit any other category, it is useful for much more than just repeating strings. Rather than just repeating the value, the second argument is an action which is evaluated the specified number of times, so it can have side-effects which are also performed repeatedly. If the number is less than or equal to zero, the second argument is not evaluated at all. For example, the following action will output the numbers from 1 to 100:

    @set{n;0}@repeat{100;@incr{n} $n}

Note that there is no operator or function needed for concatenation of strings since concatenation of elements is implied simply by juxtaposition.

4.3 Variables

A variable consists of a name and an associated value, both of which are character strings. Variable names are case-sensitive (regardless of the -i option). The value can contain any of the Unicode characters, and the name can contain any Unicode characters except for NUL. Names consisting of a period followed by upper-case letters are by convention reserved for internal use. Both strings may be of any length, limited only by the amount of memory available. Variables are manipulated by using the following action functions. Except for @var, they have no return value.

@set{name;value}: Set the named variable to the designated value. If the variable was already defined, the previous value is discarded. For example, “@set{count;0}” initializes variable “count” to 0.
@var{name}: Returns the current value of the named variable. If the variable is not defined, an error is reported and the return value is unspecified.[Footnote 2]
@var{name;default}: If the named variable is defined, then its current value is returned and the second argument is skipped without being evaluated. Otherwise, when the name is not defined, the return value is the result of evaluating the second argument.
@append{name;string}: The string is appended to the end of the value of the named variable. If the variable was not previously defined, then this acts the same as @set. For example, “@append{buf;$1}” has the same effect as “@set{buf;@var{buf;}$1}”, but using @append is considerably more efficient.
@incr{name}: The value of the named variable is incremented by one. You might think of “@incr{n}” as being an abbreviation for “@set{n;@add{@var{n};1}}”, but it is actually more general than that. The value may contain arbitrary characters before or after the number, and the number will be incremented while leaving the other characters unchanged. For example, if the value is “B9a”, it will be incremented to “B10a”. The value can also be just one or more letters, in which case the last letter will be incremented to the following letter; for example, “a” increments to “b”, and “z” increments to “aa”.
@decr{name}: The value of the named variable is decremented by one. This works like @incr except that the increment is -1 instead of +1, and decrementing a value of “a” is an error.
@bind{name;string}: Sets the value of the named variable to the string. If the variable was already defined, the previous value is remembered so that it can be restored by a subsequent call to @unbind. If @bind is called in the context of a recursive argument for a template match that subsequently fails, then the binding will be undone automatically.
@unbind{name}: The named variable is restored to the value it had before the most recent @bind. If it had not been defined before the @bind, then it becomes undefined again. An error is reported if the variable is undefined or if there is no pending binding.
@push{name;string}: A variable may be thought of as a stack of values, where @var accesses the top-of-stack value, @push pushes a new value onto the top of the stack, @set modifies the top value, and @pop pops the top value off the stack. @push is actually just another name for @bind.
@pop{name}: The variable is restored to the value it had before the most recent @push. This is actually just another name for @unbind. The top-of-stack value is simply discarded; there is no return value.
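
The stack view of variables described in the list above can be sketched in Python as follows. This is a model of the documented semantics only, not of gema's internal data structures. The class VarStore is hypothetical, errors are represented by KeyError, and the default passed to var is a plain value rather than a lazily evaluated argument.

```python
_UNDEFINED = object()  # marks "no value existed before the first @bind"

class VarStore:
    """Each variable is a stack of values; the top is the current value."""

    def __init__(self):
        self._stacks = {}

    def set(self, name, value):
        # @set replaces the top value, creating the variable if needed.
        self._stacks.setdefault(name, [None])[-1] = value

    def var(self, name, default=None):
        if name in self._stacks:
            return self._stacks[name][-1]
        if default is None:
            raise KeyError(f"undefined variable: {name}")
        return default

    def append(self, name, text):
        # Same effect as @set{name;@var{name;}text}.
        self.set(name, self.var(name, "") + text)

    def bind(self, name, value):
        # Remember the previous state so that unbind can restore it.
        self._stacks.setdefault(name, [_UNDEFINED]).append(value)

    def unbind(self, name):
        stack = self._stacks.get(name)
        if stack is None or len(stack) < 2:
            raise KeyError(f"no pending binding for: {name}")
        stack.pop()
        if stack[-1] is _UNDEFINED:
            del self._stacks[name]  # it was undefined before the bind

    push, pop = bind, unbind  # @push and @pop are just other names

store = VarStore()
store.set("x", "1")
store.push("x", "2")
store.set("x", "3")        # modifies only the top value
top = store.var("x")       # "3"
store.pop("x")
restored = store.var("x")  # "1"
```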

Also, @var (both the one and two argument forms) may be abbreviated as $, providing that the name does not begin with a digit. Thus, for example, “${foo}” has the same meaning as “@var{foo}”. Furthermore, if the variable name is a single constant letter (in any Unicode alphabet) and there is no default value argument, then the braces may be omitted. Thus, “@var{i}” can be abbreviated as “$i”. This last form (dollar letter) also has the special property that it can be used in a template to insert the current value of a variable into the template to be matched. All of the other variable operations can only be used in actions.

Lisp programmers may find it helpful to note that @set is like the Lisp set form, @var is like symbol-value, and the combination of @bind and @unbind is like what happens in a let for a variable with dynamic scope.

While there is no support for arrays as such, note that since the name of a variable can contain any characters, and the name is an evaluated argument, it is possible to do things like “@set{A[$i];$1}” which looks like an array and can be used like an array, even though the brackets and subscript are really just part of the variable name.

Variables can also be used as an associative look-up table, where the name is the key. However, the current implementation assumes that the number of variables will be small, so it may become slow if used as a table with a large number of entries.

4.4 Files

4.4.1 Pathname manipulation

This group of functions allows pathnames to be constructed in a way that keeps a pattern file independent of the pathname syntax of any particular operating system.

@makepath{directory;name;suffix}

Returns the file pathname formed by merging the file name in the second argument with the default directory in the first argument and replacing the suffix from the third argument, if not empty. If the second argument is an absolute pathname, then it retains the same directory and the first argument is not used. For example (assuming running on Unix):

@makepath{/home/dir;bar.c;.o} ⇒ /home/dir/bar.o
@makepath{/home/dir;/scr/bar.c;.o} ⇒ /scr/bar.o
@makepath{/home/dir;bar.c;} ⇒ /home/dir/bar.c

@mergepath{pathname;name;suffix}

Returns the file pathname formed by merging the second argument with a default directory extracted from the first argument and replacing the suffix from the third argument, if not empty. This differs from @makepath in that the first argument is a complete file pathname whose name portion is ignored. This would be used to create a new file in the same directory as another file. For example (assuming running on Unix):

@mergepath{/a/foo.i;bar.c;.o} ⇒ /a/bar.o
@mergepath{/a/foo.i;/b/bar.c;.o} ⇒ /b/bar.o
@mergepath{/a/foo.i;bar.c;} ⇒ /a/bar.c

@relative-path{pathname;pathname}

If the two pathnames have the same directory portion, return the second argument with the common directory removed; else return the whole second argument. Note that if the two arguments are the same, this has the effect of separating the file name from the directory. For example:

@relative-path{/a/x/cat.x;/a/x/dog.c} ⇒ dog.c
@relative-path{/a/x/cat.x;/a/y/dog.c} ⇒ /a/y/dog.c
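
For readers who find code easier to follow than prose, the three pathname functions above can be approximated in Python with the standard posixpath module. This sketch covers Unix-style pathnames only and is meant to reproduce the documented examples, not every corner case of the real functions. The Python function names are hypothetical.

```python
import posixpath

def makepath(directory, name, suffix):
    """Merge name with a default directory, optionally replacing the suffix."""
    path = posixpath.join(directory, name)  # an absolute name wins
    if suffix:
        root, _old_suffix = posixpath.splitext(path)
        path = root + suffix
    return path

def mergepath(pathname, name, suffix):
    """Like makepath, but the directory is taken from another pathname."""
    return makepath(posixpath.dirname(pathname), name, suffix)

def relative_path(first, second):
    """Strip the directory from second if it matches the directory of first."""
    if posixpath.dirname(first) == posixpath.dirname(second):
        return posixpath.basename(second)
    return second

object_file = makepath("/home/dir", "bar.c", ".o")  # "/home/dir/bar.o"
sibling = mergepath("/a/foo.i", "bar.c", ".o")      # "/a/bar.o"
```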

@expand-wild{pathname}

Usually this function just returns its argument followed by a newline. When running on MS-DOS or Windows and the pathname is a wild card (i.e. contains “*” or “?”), the return value consists of all files that match the pattern, with a newline following each one. If there are no matches, a warning is written to the error output and the return value is empty. This wild card expansion is done by a system call, so it is consistent with other MS-DOS or Windows command-line utilities, but the meaning of “*” is not completely the same as in gema patterns. On Unix, wild card arguments are presumed to have already been expanded by the shell, so expansion is not done here.

4.4.2 Using alternate input and output files

@err{string}

The argument is evaluated with its output being directed to the standard error output stream (stderr in C terminology). There is no return value. This can be used to write error messages or status messages. Don't forget that newlines must be explicitly provided, so the argument typically should end with “\n”.

@out{string}

The argument is evaluated with its output being sent directly to the current output file instead of to the current output stream. The distinction arises during translation of a recursive argument, where @out can be used to write directly to the output file instead of appending to the value of the argument being translated. Usually this is not what you want to do, but it may be useful in some circumstances.

For example, suppose some algebraic language is to be translated into an assembly-like language. A typical rule might look something like:

  expr:<term>+<term>=@incr{t}@out{\N  ADD $1,$2,R$t\n}R$t

where an expression “x+y” would be processed by outputting “ ADD x,y,R1” and returning “R1” as the result value to be used as an operand of the next instruction.

@write{pathname;string}

First the first argument is evaluated and the file that it names is opened for writing. If the pathname is “-”, then standard output will be used. If the same identical pathname has previously been used in a @write call, then it will continue writing to the end of the same file without re-opening or rewinding it.

Then the second argument is evaluated, with its output being directed to the designated file. Within that evaluation, the function @outpath will return the first argument of the @write.

The file remains open until either the program terminates or the same pathname is referenced in a call to @close or @read.

@close{pathname}

If the argument is identical to one previously appearing as the pathname argument in a call to @write, then that output file will be closed. Otherwise, nothing happens.

@read{pathname}

The file named by the argument is opened for reading. If the pathname is “-”, then standard input is used. If the same identical pathname was previously used in a @write, the output file will be closed before re-opening the file for reading. The result is an input stream that will read from the file as needed, and close it when the end is reached. This is commonly used in the context: “@domain{@read{pathname}}” which says to translate using the alternate file as input. Within this translation, the functions @file, @inpath, @line, @column, and @file-time will all refer to the file named in the argument of @read. However, if the @read function has its result concatenated with something else instead of appearing by itself as the argument to another function, then the effect will be to copy the entire contents of the file to the current output and close the file.

@probe{pathname}

This can be used to test a pathname to see whether it can be opened. The result value is “F” if the argument names an existing file, “D” if it names a directory, “V” if it names a device, “U” if it is undefined, or “X” if it is defined in some unexpected way.
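
A comparable classification can be written in Python with os.stat, which may help clarify what the result letters mean. This is a sketch under assumptions: “U” is taken to mean that the pathname cannot be examined at all, and “V” is taken to cover both character and block devices. The function name probe is hypothetical.

```python
import os
import stat

def probe(pathname):
    """Classify a pathname using the same result letters as @probe."""
    try:
        mode = os.stat(pathname).st_mode
    except OSError:
        return "U"  # assumption: anything that cannot be examined
    if stat.S_ISREG(mode):
        return "F"  # existing file
    if stat.S_ISDIR(mode):
        return "D"  # directory
    if stat.S_ISCHR(mode) or stat.S_ISBLK(mode):
        return "V"  # device (assumption: character or block)
    return "X"      # defined in some unexpected way
```

In a pattern file, the result would typically be tested with @cmps to decide whether to read or create a file.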

4.4.3 File context queries

@outpath{}

Returns the pathname of the current output file, or as much of the pathname as is known. This would be the same as the output file argument on the command line or the pathname argument to the @write function if within that context.

@inpath{}

Returns the pathname of the current input file, or as much of the pathname as is known. This would be the same as the input file argument on the command line or the pathname argument to the @read function.

@file{}

Returns the name of the current input file, with any directories removed.

@line{}

Returns the current line number in the input file. More precisely, this is the line number of the last character matched by the template. If that character is a newline character, this is the number of the line preceding the newline.

@column{}

Returns the column number of the current position in the input stream, i.e., the column of the last character in the text matched by the template. (A UTF-16 surrogate pair counts as a single character, and a BOM at the beginning of the file is not counted.) For example, the following default rule could be used to write an error message for unexpected input:

  ?=@err{\NIllegal character "$1" in line @line, column @column.\n}

@out-column{}

Returns the column number of the current position in the current output file.

@file-time{}

Returns the date and time when the current input file was last modified. The information is presented as formatted by the C function ctime, except without any newline.

@encoding{}

Returns the name string of the input file encoding, which will be one of: ASCII, 8bit, UTF-8, UTF-16LE, or UTF-16BE. For example, the following shell command could be used to find out the encoding of a given file:
gema -match -p "\E=@encoding\n" file
Reporting at the end of the file (\E) enables finding out whether there are any 8-bit characters or UTF-8 sequences in a file of bytes.

4.5 Control flow functions

@end{}

Signals the successful completion of the current translation.

If this appears in the action for a pattern match at the top level of a file, the remainder of the input file will not be read, and if there are no more input files to be processed, the program will terminate with an exit status of 0 (assuming there were no errors before). For example, the following shell command will print the first line that matches and then stop:

  gema -match -p 'Title\:*\n=$0@end' foo

If this appears within the context of a recursive argument, then it ends the argument and returns control to the enclosing template to continue processing the input. For example, with the following rules:

    sign:+=+@end;-=-@end;=@end

the template argument “<sign>” will accept an optional plus or minus sign and nothing more.

@fail{}

Signals failure of the current translation. At top-level, this will terminate processing of the input file like @end, except that the program will have a non-zero exit status. For example, the following command will indicate by the exit status whether the file being tested contains a particular string:

  gema -match -p 'Success=@end;\E=@fail' foo.text

If the string is found, the program exits with 0; if the end of the file is reached, then a non-zero exit status is returned.

Within the context of a recursive argument, this causes the enclosing template to report a failed match.

@terminate{}

This ends the translation of a recursive argument. If the argument value is empty, then the template match fails, like for @fail. Otherwise, when some characters have been accepted, processing of the template continues like for @end. This is typically used instead of @end in a delimiter rule when an empty string is not to be considered a match. For example, with the following rules:

    vowel:a=a;e=e;i=i;o=o;u=u;=@terminate

the argument “<vowel>” will match one or more vowels.

@abort{}

Immediately terminates execution of the program with a non-zero exit status.

@exit-status{number}

This function can be used to specify that a particular exit status value will be returned when the program exits, providing that there is no error condition that specified a higher value first. This could be called before @fail or @abort to cause some particular non-zero value to be returned for the sake of a shell script that wants to test for what kind of failure occurred.

4.6 Other operating system interfaces

@date{}

Returns the current date, in the form: mm/dd/yyyy

@datime{}

Returns the current date and time, as formatted by the C function ctime, except without any newline.

@time{}

Returns the current time, in the form: hh:mm:ss

@getenv{name;default}

Returns the value of an environment variable, as from the C function getenv. The first argument is the name of the environment variable. The second argument is optional, and will be returned as the default value if the environment variable is not defined. For example, on a Unix system, the action “@getenv{USER}” will output the current user ID.

@shell{string}

The argument value is executed as a shell command by passing it to the C function system. Although it would be desirable for this to return the text written by execution of the command, that is not currently implemented. Instead, any output from the command goes directly to standard output, and there is no value returned to the current output stream. For example, the following action could be used to sort a temporary file:

    @shell{sort \< '${tmpfil1}' \> '${tmpfil2}'}

4.7 Definitions

The following functions can be used to add or remove definitions of rules at run time.

@define{patterns}

Define new rules. The evaluated argument value is read as a pattern file, defining rules and performing immediate actions as specified. There is no return value. For example, you can have one pattern file include another by using an immediate action like this:

    @define{@read{foo.pat}}

For another example, to emulate a C pre-processor, the #define directive could be implemented by the following rule (assuming, for simplicity of the example, no arguments, no continuation lines, and no comments):

    \N\#define <I> *\n=@define{\\I$1\\I\=@quote{$2}}

Note that the tricky part here is to get the right level of quoting so that things are evaluated at the proper time. The @quote function is explained below. Given the input line “#define NUM 3*4”, the argument of @define will evaluate to the string “\INUM\I=3\*4” which will then be defined as a new rule.

@quote{string}

Returns a copy of the argument value with backslashes inserted where necessary so that @define and @undefine will treat all of the characters as literals.[Footnote 3] For example, given an argument which evaluates to the string “a * 3”, the return value will be “a\ \*\ 3”.

@as-ascii{string}

This function is like @quote except that it also represents non-ASCII characters using \x or \u escape sequences to show the code value. This is typically useful for human reading on a terminal which doesn't have a Unicode font.

@undefine{patterns}

This can be used to undefine rules. The argument is processed like for @define, except that instead of defining rules, the effect is to cancel any existing rule that exactly matches. The argument may also be just a template, without any “=” or action, in which case any rule with the same template will be cancelled, without regard to its action. For example, the C #undef directive could be emulated (with the same simplifying assumptions as the #define example above) by the rule:

    \N\#undef <I>=@undefine{\\I$1\\I}

@subst{patterns;operand}

Substitution. Return the result of translating the operand according to the patterns specified by the first argument. The first argument is processed the same as by @define, except that the rules are implicitly defined in a temporary domain which is deleted after being used to translate the operand. An explicit domain name (i.e. before a colon) is not allowed. For example, “@subst{\\Iis\\I\=was;this is it}” will return “this was it”. Usually this sort of substitution is more conveniently and efficiently done by using a domain function (for example, “@frob{this is it}” with rule “frob:\Iis\I=was”), but the @subst function can be used in cases where the substitution needs to be computed at run time. For example, to emulate a C #define directive with one argument:

    \N\#define <I>\W(\W<I>\W) *\n=\
      @define{\\I$1\\W(\#)\=@subst{\\I$2\\I\=\\\$1;@quote{*}}}

Here @subst is used to replace references to the C argument name with the “$1” notation used by gema.

4.8 Setting Options

The following group of functions can be used to set various program options. These are typically used as immediate actions in a pattern file to set the options that the file needs, instead of requiring separate command line options. Except for @get-switch and @getset, these functions don't return any result value.

@set-switch{name;value}

Sets the value of any of several option switches that have numeric values. In most cases the value should be either 1 for true or 0 for false. The defined switch names are:

arglen: - maximum length for “*” operands. This is the only switch that takes a number rather than being just true or false.
b: - binary mode
bom: - write a Unicode Byte Order Mark at the beginning of the output file when using UTF-8 or UTF-16 encoding (default is 1 for true)
i: - case-insensitive mode
k: - keep going after errors
line: - line mode
match: - match only mode (input text that doesn't match any template is discarded instead of copied to the output)
t: - token mode (However, this is only part of the “-t” command line option, which is implemented by “@set-switch{t;1}@set-switch{w;1}”.)
trace: - write pattern match diagnostic messages to stderr. (Only recognized if the program was compiled with “-DTRACE”.)
w: - ignore whitespace in the input (This is only part of the “-w” command line option, which is implemented by “@set-switch{w;1}@set-syntax{S;\s\t}”.)

In each case, the switch corresponds to the command line option with the same name, and further explanation of the meaning can be found there.

@get-switch{name}

Returns the current value of the named switch.

@set-parm{name;value}

Sets the value of any of several options that have string values. The defined names are “idchars”, “filechars”, “backup”, “inenc”, and “outenc”. These are used to implement the command line options with the same names, and the meaning is documented there. In version 2.0, however, “idchars” is superseded by “@defset{I;value}” and “filechars” is superseded by “@defset{F;value}”.

@set-syntax{type;charset}

This function can be used to change the meaning of characters in patterns. When used as an immediate action in a pattern file, it takes effect beginning with the next line read. The first argument designates one or more syntactic categories, and the second argument is a set of characters that will now have that meaning. For each character in the first argument, the corresponding character in the second argument acquires the designated syntactic class; when there is only one remaining character in the first argument, it applies to all of the remaining characters in the second argument. The syntactic class may be identified by either the special character which currently has that class (or which has that class by default if it is currently a literal), or by one of the following letters:

A: - argument separator. This is one of the two uses of the semicolon in the default syntax. For example, if you wanted to be able to use comma to separate arguments, do: “@set-syntax{A;,}”
C: - comment. Causes the rest of the line to be ignored as a comment.
D: - domain argument. Characters of this class can be used as an abbreviation for a domain argument with the same name. There are no characters that have this class by default. For example, if you say “@set-syntax{D;%}”, then the character “%” represents a recursive argument in the domain defined by rules prefixed by “%:”. In other words, “%” becomes an abbreviation for “<%>”, and it also can be used in an action to represent the value of the corresponding argument, like with “*”, “?”, and “#”.
E: - escape. Together with the following character, it specifies a control character or template operator. This is half of what the backslash does in the default syntax.
F: - function prefix. This introduces the name of a function to be called. This is one of two uses of “@” in the default syntax.
I: - ignore. Characters with this class will be completely ignored. There are no characters that have this class by default.
K: - character operator. Causes the following character to have its default meaning. This is one of the two things that “@” is used for in the default syntax.
L: - literal. For example, the command line option “-literal '/?^'” is implemented by: “@set-syntax{L;\/\?\^}”
M: - quote until match. Causes the following characters to be taken literally until a second occurrence of the character is found. For the sake of compatibility with earlier versions, there are currently no characters that have this class by default. For example, “@set-syntax{M;\'}” causes all characters between matching apostrophes to be taken literally (even backslash).
Q: - quote one character. Causes the following character to be taken literally. This is half of what the backslash does in the default syntax.
S: - ignored space. Characters with this class will be ignored unless they separate two identifiers, in which case they will be treated like “\S”. There are no characters that have this class by default, but part of what the -w option does is: “@set-syntax{S;\s\t}”
T: - terminator. A character with this class marks the end of a rule. By default, newline is used for this purpose.

For example, “@set-syntax{\*;\~}” would cause tilde to represent a wildcard argument. This doesn't change the meaning of the asterisk, it just means that now either of the characters can be used for that purpose. If you wanted to delimit recursive arguments with square brackets, and let angle brackets be literals, do: “@set-syntax{\<\>LL;\[\]\<\>}”. See also the -ml option.
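The rule for pairing the two arguments of @set-syntax (one class designator per character, with the last designator covering all remaining characters) can be illustrated with a small Python sketch. This is not gema's code: the function name pair_syntax is hypothetical, both arguments are assumed to be already unquoted (no backslashes), and the behavior when the second argument is shorter than the first is an assumption, since the text above does not specify it.

```python
def pair_syntax(types, chars):
    """Pair each character of `chars` with a class designator from `types`.

    Hypothetical helper modelling the documented rule: designators and
    characters correspond position by position, and the final designator
    applies to every character beyond the end of `types`.
    """
    if not types:
        raise ValueError("at least one class designator is required")
    pairs = []
    for i, ch in enumerate(chars):
        # Past the end of `types`, keep using its last designator.
        designator = types[min(i, len(types) - 1)]
        pairs.append((designator, ch))
    # Assumption: surplus designators (when `chars` is shorter) are ignored.
    return pairs


# Square brackets take over from angle brackets, which become literals.
print(pair_syntax("<>LL", "[]<>"))
# A single designator covers every listed character.
print(pair_syntax("L", "/?^"))
```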

@reset-syntax{}

Re-initializes the syntax tables to their default state, thus undoing the effects of any calls to @set-syntax, including any use of the -literal or -ml option.

@set-locale{name}

Sets the internationalization locale, using the C function setlocale. The valid argument values and the exact effect are dependent on the C library implementation on the specific platform. A typical usage would be:

    @set-locale{@getenv{LANG;C}}

This may affect which characters are considered to be letters by recognizer arguments such as “<L>”. This function is not supported on MS-DOS, but support on Windows was added in gema version 1.5.

@defset{letter;characters}

Defines a set of characters for template recognizers (new in version 2.0). It takes two arguments: a letter naming the character set, and the list of characters. As in a regular expression character set, a hyphen can be used to indicate a range of characters; for example, 123456 can be abbreviated as 1-6. Any naming letter from A to Z is allowed; generally the defined set extends the meaning of the corresponding recognizer <A> to <Z>. The special cases are as follows:

B: defines <B> (the default is an empty set)
E: redefines <E>
J: additional lower-case letters to extend the meaning of <J>, <L>, <A>, and <W>
K: additional upper-case letters to extend the meaning of <K>, <L>, <A>, and <W>
H: redefines <H>. The default (for Chinese) is:
@defset{H;\u2E80-\u2EFF\u3000-\u303F\u31C0-\u31EF\u3200-\u9FFF\u20000-\u323FF}
I: additional identifier constituents (besides ASCII letters and digits); this extends <I> and also affects \I and <Y>. The default is: @defset{I;_} For example, for parsing COBOL source code, you would want to do: @defset{I;-} Note that @defset{I;characters} is equivalent to the older (now deprecated) notation @set-parm{idchars;characters} except that @defset allows using a hyphen for a range.
F: the set of characters which are accepted by <F> as being file name constituents, in addition to ASCII letters and digits. The default on Windows is: @defset{F;-.\/_~\#\@%+\=\:\\} (Note that here the hyphen is listed first in order to be taken literally instead of indicating a range.) This supersedes @set-parm{filechars;characters} (which for compatibility does not support ranges).
M: redefines <M>
Q: defines <Q> (the default is an empty set)
R: defines <R> (the default is an empty set)
S: specifies additional white space characters for <S>, -w, \S, \W, and @wrap.
For example, to recognize the various Unicode space characters:
@defset{S;\u2000-\u200B\u2028\u2029\u205F} (This is not the default because there is a slight performance penalty.)
V: redefines <V>
Y: redefines the Unicode punctuation characters matched by <Y> in addition to the ASCII punctuation characters. The default is:
@defset{Y;\u2010-\u2027\u2030-\u205E\u27E6-\u27EF\u2E00-\u2E7F}
Z: defines <Z> (the default is an empty set)
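The hyphen range notation accepted by @defset can be pictured with the following Python sketch. It is an illustration under stated assumptions, not gema's implementation: the name expand_set is hypothetical, the input is assumed to be already decoded (so “\u0400” has become the actual character), a hyphen counts as a range operator only when it stands between two characters, and rejecting a reversed range is a choice made here rather than documented behavior.

```python
def expand_set(spec):
    """Expand range notation such as "1-6" into an explicit set of characters.

    Hypothetical helper. A hyphen that is first or last in `spec` is
    taken literally, matching the note about listing the hyphen first.
    """
    chars = set()
    i = 0
    while i < len(spec):
        is_range = i + 2 < len(spec) and spec[i + 1] == "-"
        if is_range:
            first, last = ord(spec[i]), ord(spec[i + 2])
            if first > last:
                # Assumption: a reversed range is treated as an error.
                raise ValueError("reversed range: " + spec[i:i + 3])
            chars.update(chr(code) for code in range(first, last + 1))
            i += 3
        else:
            chars.add(spec[i])
            i += 1
    return chars


print("".join(sorted(expand_set("1-6"))))
print(sorted(expand_set("-._")))
```

Because a leading hyphen has no character before it, it can never start a range, which is why the file name default above lists the hyphen first.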

If both @defset{J;characters} and @defset{K;characters} are provided and they have the same number of characters, then the corresponding elements (taken in code value order) are assumed to be corresponding lower and upper-case letters. This is used to extend case-insensitive comparison for -i, \C, and @cmpi. It is also used by the functions @upcase and @downcase. For example, for Spanish text:
@defset{J;áéíñóúü}@defset{K;ÁÉÍÑÓÚÜ}
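The pairing of the J and K sets by code value order can be sketched in Python as follows. This is only a model of the rule described above; the function name case_pairs is hypothetical, and returning an empty mapping when the sets differ in size is an assumption standing in for "no pairing is made".

```python
def case_pairs(lower, upper):
    """Map each extra lower-case letter to its upper-case counterpart.

    Hypothetical helper: both sets are sorted by code value and paired
    position by position, so the order in which the letters were
    written does not matter.
    """
    lows = sorted(set(lower))
    ups = sorted(set(upper))
    if len(lows) != len(ups):
        # Assumption: with unequal sizes no correspondence is inferred.
        return {}
    return dict(zip(lows, ups))


spanish = case_pairs("áéíñóúü", "ÁÉÍÑÓÚÜ")
print(spanish["ñ"])
```

Pairing by sorted code value works for the Spanish example because the accented lower-case and upper-case letters appear in the same relative order in Unicode; for letters where that does not hold, this scheme would pair them incorrectly.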

More examples:
To redefine <H> to match Hebrew letters: @defset{H;\u0590-\u05FF}
To define <R> to match Russian (Cyrillic) letters: @defset{R;\u0400-\u052F}
To define <Z> to match Greek letters (including accented variants):
@defset{Z;\u0386-\u03CE\u1F00-\u1FFE}
To redefine <E> to match Egyptian hieroglyphs: @defset{E;\u13000-\u1345F}
To allow numbers to contain fractions: @defset{N;\xBC\xBD\xBE\u2150-\u215F}

@getset{letter}

Returns the current value of the set of characters named by the letter (new in version 2.0). This may be useful for extending a built-in set. For example, the following action adds digits, the period, and square brackets to the set of mathematical symbols: @defset{M;@getset{M}0-9.\[\]} This function also allows seeing the default value of a set; for this purpose, you may want to use the @as-ascii function to see the code values of Unicode ranges. For example, the following shell command displays the default definition of <E>:

    gema "@out{@as-ascii{@getset{E}}\n}@end"

4.9 Informational functions

@show-help{}: Displays on the standard error output a brief explanation of how to use the program. No value is returned. This is used internally to implement the -help option, and is not likely to be of use otherwise. The message is constructed based on the current syntax tables, so use of the @set-syntax function will be reflected here.
@version{}: Returns the program version identification string. This is used internally to implement the -version option, and is not likely to be of use otherwise.
