4 Built-in Functions
There are a large number of built-in functions that can be used in
actions.
Function calls have the form “@name{args}”, with arguments separated by “;”.
For functions without arguments, the argument delimiters “{}” may be
omitted if they are not needed to separate the name from the following
character.[Footnote 1]
Each argument is itself an action which can use any of the special
characters defined in actions, including nested function calls.
The argument value is the result of evaluating the argument.
In a few cases, arguments that are not used are skipped instead of
being evaluated, but arguments are never used literally.
All functions take a fixed number of arguments, although in a couple of cases
the last argument is optional.
The descriptions of the functions use the terminology of a value being
returned by the function, but it would be more accurate to speak of the
result as being the series of characters that will be written to the
current output stream, since in general the result value is not actually
materialized as a separate string. Usually while evaluating a function
argument, the current output stream is an internal buffer that collects the
argument value for the function, and the function argument is actually an
input stream that reads from that buffer. But in most cases these
distinctions are not important for understanding how to use the functions.
The following sections document groups of related functions.
4.1 Numbers
Since gema is a text processor, it is not intended to be
convenient or efficient for performing numeric operations, but it does
have a set of arithmetic functions that should be sufficient to make it
possible to do whatever calculations are necessary.
While all values are character strings, a string can be treated as a
number if it consists of decimal digits optionally preceded by +
or - and optionally preceded or followed by spaces. Where a
numeric argument is required, such a string will be converted internally
to a 32-bit signed integer. An error will be reported if the string is
not a valid number.
Functions that return a number will return a
string of decimal digits possibly preceded by a minus sign.
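The conversion rule above can be modelled outside gema. The following Python sketch only illustrates the stated rule and is not gema's implementation. The names parse_number and format_number are hypothetical, and because the text does not say how values outside the 32-bit range are treated, the sketch does not check for them.

```python
import re

# Optional spaces, an optional sign, one or more decimal digits, optional spaces.
_NUMBER = re.compile(r" *([+-]?[0-9]+) *")


def parse_number(text: str) -> int:
    """Convert a string argument to an integer, or report an error."""
    match = _NUMBER.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid number: {text!r}")
    return int(match.group(1))


def format_number(value: int) -> str:
    """Numeric results are decimal digits, preceded by '-' when negative."""
    return str(value)


print(format_number(parse_number(" +12 ") + parse_number("-7")))
```

Strings such as “12a” or “1 2” are rejected by this model, which corresponds to the error report described above.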
Following are the arithmetic functions:
- @add{number;number}
-
Addition - returns the sum of the two numbers.
- @sub{number;number}
-
Subtraction - returns the first argument minus the second.
- @mul{number;number}
-
Multiplication - returns the product of the two numbers.
- @div{number;number}
-
Division - returns the quotient of dividing the first argument by the second.
- @mod{number;number}
-
Modulus - returns the first argument modulo the second, as implemented by
the C operator “%”.
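Because @mod follows the C operator “%”, its result for negative operands can differ from the modulo operator of other languages. The Python sketch below models the C99 rule, where the quotient is truncated toward zero and the remainder takes the sign of the first operand. The helper name c_mod is hypothetical, and C99 semantics are an assumption, since older C compilers were allowed to behave differently for negative operands.

```python
def c_mod(a: int, b: int) -> int:
    """Remainder as computed by C99 'a % b' (quotient truncated toward zero)."""
    if b == 0:
        raise ZeroDivisionError("modulus by zero")
    quotient = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        quotient = -quotient
    return a - quotient * b


# Python's own % follows the sign of the divisor, so the two differ here.
print(c_mod(-7, 2), -7 % 2)
```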
Also, the following group of functions can be used to operate on numbers
as bit strings:
- @and{number;number}
-
Returns the bit-wise and of the two numbers.
- @or{number;number}
-
Returns the bit-wise or of the two numbers.
- @not{number}
-
Returns the bit-wise inverse of the argument.
Finally, some other assorted functions that deal with numbers:
- @cmpn{number;number;less-value;equal-value;greater-value}
-
Compare numbers - returns the result of evaluating either the third, fourth, or fifth
argument depending on whether the
first argument is less than, equal to, or greater than the second, when
compared as 32-bit signed numbers.
The two arguments that are not used are not evaluated.
For example, the following rule defines a function that will return the
larger of two comma-separated numbers:
maxn:<N>,<N>=@cmpn{$1;$2;$2;$1;$1}
while the following rule sets a variable to the largest number seen:
notemax:<N>=@cmpn{$1;${max};;;@set{max;$1}}
- @int-char{number}
-
Returns the character whose internal code is given by the argument.
- @char-int{character}
-
Returns the decimal number representation of the internal character code
of the argument, which should be a single-character string.
- @radix{from;to;value}
-
Radix conversion.
The first two arguments must be decimal integers. The third argument is
interpreted as a number whose base is specified by the first argument.
The result value is that number represented in the base specified by the
second argument. As currently implemented, from may be any number
from 2 to 32, but to can only be one of 8, 10, or 16.
For example, octal constants in a C program could be
converted to hexadecimal form by the following rule:
\I0<O>\I=0x@radix{8;16;$1}
For hexadecimal output, upper-case letters are used for the digits
greater than 9. If lower-case letters are desired, the @downcase
function can be used on the value returned by @radix.
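As a cross-check on the conversion described for @radix, here is a small Python model of the same arithmetic. It is a sketch under the limits stated above (source base 2 to 32, target base 8, 10, or 16, upper-case hexadecimal digits). The function name radix is only a stand-in, and negative values are not considered because the text does not describe them.

```python
def radix(from_base: int, to_base: int, value: str) -> str:
    """Reinterpret 'value' (written in from_base) in to_base."""
    if not 2 <= from_base <= 32:
        raise ValueError("source base must be between 2 and 32")
    formats = {8: "o", 10: "d", 16: "X"}  # "X" gives upper-case hex digits
    if to_base not in formats:
        raise ValueError("target base must be 8, 10, or 16")
    return format(int(value, from_base), formats[to_base])


# Same conversion as the octal-to-hexadecimal rule: 017 becomes 0xF.
print("0x" + radix(8, 16, "17"))
```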
4.2 String functions
The following built-in functions perform various manipulations on
character strings.
4.2.1 Output formatting -- padding, filling, and wrapping
The following group of functions take two arguments; the first must be a
number and the second is an arbitrary string. If the length of the
string is greater than the number, then it is returned unchanged.
Otherwise, the returned value will consist of the string padded with
spaces to be of the designated length. The choice of function
determines how the padding is done:
- @left{length;string}
-
Left-justify the string, padding with spaces to the designated length.
For example, “@left{8;ab}” returns “ab” followed by 6
spaces, while “@left{8;hippopotamus}” returns “hippopotamus”
with no spaces. If you want long values to be truncated, you can
use:
@left{length;@substring{0;length;string}}
or write
something like “@left{8;@cut8{arg}}”, accompanied
by the rule: “cut8:<U8>=$1@end”.
- @right{length;string}
-
Right-justify the string, padding with spaces to the designated length.
- @center{length;string}
-
Center the string within a field of the designated length.
Note that any of these functions can also be used with an empty second
argument as a convenient way to generate a particular number of spaces.
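The padding behaviour of @left and @right can be pictured with the following Python sketch. It is an illustration rather than gema's code; the function names are stand-ins, and @center is left out because the text does not say where the extra space goes when the padding cannot be split evenly.

```python
def left(length: int, string: str) -> str:
    """Pad on the right with spaces; longer strings come back unchanged."""
    return string.ljust(length)


def right(length: int, string: str) -> str:
    """Pad on the left with spaces; longer strings come back unchanged."""
    return string.rjust(length)


def left_truncated(length: int, string: str) -> str:
    """Like left, but cut long values first (the @substring idiom)."""
    return left(length, string[:length])


print(repr(left(8, "ab")), repr(right(5, "42")), repr(left(3, "")))
```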
The following group of functions serve a similar purpose, except that
padding can be done using any arbitrary string instead of spaces. Here
the first argument is the string representing an empty field, and the
second argument will be justified within that field.
- @fill-left{background;value}
-
Left-justify the value on top of the background string.
For example, “@fill-left{......;foo}” returns “foo...”.
- @fill-right{background;value}
-
Right-justify the value on top of the background string.
For example, “@fill-right{00000;12}” returns “00012”.
- @fill-center{background;value}
-
Center the value on top of the background string.
For example, “@fill-center{(((())));xy}” returns “(((xy)))”.
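The three fill functions amount to overlaying the value on the background string. The Python sketch below reproduces the examples above. Two points are assumptions rather than documented behaviour: a value longer than the background is returned unchanged, and an odd amount of leftover space in fill_center puts the extra background character on the right.

```python
def fill_left(background: str, value: str) -> str:
    """Value first, then whatever of the background still shows."""
    if len(value) >= len(background):
        return value  # assumption: overlong values are not cut
    return value + background[len(value):]


def fill_right(background: str, value: str) -> str:
    """Leading part of the background, then the value."""
    if len(value) >= len(background):
        return value  # assumption: overlong values are not cut
    return background[:len(background) - len(value)] + value


def fill_center(background: str, value: str) -> str:
    """Value placed in the middle of the background."""
    if len(value) >= len(background):
        return value  # assumption: overlong values are not cut
    start = (len(background) - len(value)) // 2  # assumption for odd gaps
    return background[:start] + value + background[start + len(value):]


print(fill_left("......", "foo"), fill_right("00000", "12"),
      fill_center("(((())))", "xy"))
```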
The following functions perform formatting based on the current context
in the output stream:
- @tab{number}
-
The return value consists of however many space characters it
takes to advance the output stream to the specified column number.
If the output stream is already at or beyond the specified column, the return
value is empty.
Column 1 means the first character position following a newline
character or the beginning of the data stream.
Thus, for example, if the last character output was a newline, then
@tab{10} will return 9 space characters so that the next
character written will go in column 10.
- @wrap{string}
-
Output with line wrapping.
If there is room for the string on the current line of output, then it
will be returned unchanged. Otherwise, when the string is longer than
the remaining space on the current line, the return value consists of a
newline character followed by an optional indentation string followed by
the string argument with any leading
whitespace removed. However, if the output stream is already at the
beginning of a line, then the return value is the indentation string
followed by the argument string with leading whitespace removed.
By default, the lines are up to 80 characters long
and the indentation string is empty. These parameters can be changed by the
@set-wrap function below.
Typically the argument string will be a word preceded by a space
character, so that the space will separate it from the previous word if
it fits on the current line, or will be discarded if a new line is started.
For example, you could reformat a text file with the shell command:
gema -p '<G>=@wrap{ $1};\n\W\n=\n\n;\S=;' in.text out.text
where the first rule causes the groups of graphic (non-space) characters to be
written separated by a single space in 80-character lines, the second
rule preserves blank lines as paragraph separators, and the third rule
discards other whitespace characters.
- @set-wrap{number;string}
-
For subsequent calls to @wrap,
the first argument specifies the maximum number of characters in a line, and
the second argument is the indentation string. No value is returned.
For example, for output with a four character left margin followed by a
maximum of 70 characters of text, do: “@set-wrap{74;\s\s\s\s}”
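To make the interplay of @tab, @wrap, and @set-wrap concrete, the following Python sketch models an output stream that tracks its current line. It is one reading of the description above, not gema's implementation: the class and method names are hypothetical, a string is taken to fit when the line would not exceed the maximum length, and the beginning-of-line case is checked before the fits-on-line case.

```python
class Output:
    """Toy output stream with column-aware tab and wrap operations."""

    def __init__(self, width: int = 80, indent: str = "") -> None:
        self.text = ""
        self.width = width    # maximum line length, as set by @set-wrap
        self.indent = indent  # indentation string, as set by @set-wrap

    def _line_length(self) -> int:
        # Characters written since the last newline (or since the start).
        return len(self.text) - (self.text.rfind("\n") + 1)

    def write(self, string: str) -> None:
        self.text += string

    def tab(self, column: int) -> None:
        # Column 1 is the first position of a line.
        self.write(" " * max(0, column - 1 - self._line_length()))

    def wrap(self, string: str) -> None:
        used = self._line_length()
        if used == 0:
            self.write(self.indent + string.lstrip())
        elif used + len(string) <= self.width:
            self.write(string)
        else:
            self.write("\n" + self.indent + string.lstrip())


out = Output(width=10, indent="  ")
for word in "aaa bbb ccc ddd".split():
    out.wrap(" " + word)  # each word preceded by a space, as suggested above
print(out.text)
```

Passing each word with a leading space mirrors the “@wrap{ $1}” idiom: the space is kept when the word stays on the line and dropped when a new line is started.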
4.2.2 String Comparison
The following functions compare two strings and then return the value
of one of three arguments depending on the result of the comparison.
The two arguments that are not used are not evaluated, so these
functions can be used for conditional evaluation of side-effects as well
as for returning a value.
- @cmps{string;string;less-value;equal-value;greater-value}
-
Compare strings, case-sensitive.
The characters are compared simply by their numeric codes, not by any
language's collating sequence.
The returned value is either the third, fourth, or fifth argument
depending on whether the
first argument is less than, equal to, or greater than the second.
- @cmpi{string;string;less-value;equal-value;greater-value}
-
Compare strings, case-insensitive.
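The point that unused arguments are never evaluated is easiest to see when the arguments have side effects. The Python sketch below models the three value arguments as callables that run only when selected; the same idea applies to @cmpn. All names are hypothetical, and folding case with lower() is an assumption about how the case-insensitive comparison works.

```python
from typing import Callable, List

Thunk = Callable[[], str]  # an argument that is evaluated only on demand


def cmps(a: str, b: str, less: Thunk, equal: Thunk, greater: Thunk) -> str:
    """Compare by character codes and evaluate exactly one branch."""
    if a < b:
        return less()
    if a > b:
        return greater()
    return equal()


def cmpi(a: str, b: str, less: Thunk, equal: Thunk, greater: Thunk) -> str:
    """Case-insensitive variant (case folding via lower() is an assumption)."""
    return cmps(a.lower(), b.lower(), less, equal, greater)


log: List[str] = []


def branch(name: str, value: str) -> Thunk:
    """Build an argument that records the fact that it was evaluated."""
    def run() -> str:
        log.append(name)
        return value
    return run


result = cmps("apple", "banana",
              branch("less", "<"), branch("equal", "="), branch("greater", ">"))
print(result, log)
```

Only the selected branch appears in the log, which is what makes these functions usable as conditionals.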
4.2.3 Case conversion
The following functions return a copy of their argument, converting the
case of any letters:
- @upcase{string}
-
Convert any lower-case letters to upper-case.
- @downcase{string}
-
Convert any upper-case letters to lower-case.
For example, the following rule will capitalize each word in the input data:
<L1><w>=@upcase{$1}@downcase{$2}
4.2.4 Miscellaneous string functions
- @length{string}
-
Returns the length of the argument as a decimal number.
For example, “@length{abcdefghijkl}” returns the string “12”.
- @reverse{string}
-
Returns the characters of the argument in reversed order.
For example, “@reverse{abcd}” returns “dcba”.
This may be useful for performing processing that needs to be done from
right to left. For example, the following set of rules will insert
commas in the proper position in all numbers of four or more digits,
grouping the digits by threes from the right-hand end:
<D3><D>=@reverse{@comma{@reverse{$1$2}}}
comma:<D3><D0>=$1,
- @substring{skip;length;string}
-
Returns a substring of the third argument formed by skipping the number of
characters indicated by the first argument and then taking the number of
characters indicated by the second argument.
For example, “@substring{3;4;elephant}” returns “phan”.
If the first argument is negative, the effect is the same as zero.
If the first argument is greater than the length of the string, then the
result value is empty. The result may actually be shorter than length if there are not enough characters in the string:
“@substring{3;99;tiger}” returns “er”.
Note that splitting input data into fields is usually more conveniently
done by using a template such as “\L<U2><U3><u>\n”. The
@substring function is more likely to be useful in cases where
the numbers are computed instead of being constants.
- @repeat{number;action}
-
The second argument is repeated the number of times specified by the
first argument. For example, a string of eighty hyphens can be
constructed by “@repeat{80;-}”.
While this is being listed under string functions because it doesn't seem
to fit any other category, it is useful for much more than just
repeating strings. Rather than just repeating the value, the second
argument is an action which is evaluated the specified number of times,
so it can have side-effects which are also performed repeatedly.
If the number is less than or equal to zero, the second
argument is not evaluated at all.
For example, the
following action will output the numbers from 1 to 100:
@set{n;0}@repeat{100;@incr{n} $n}
Note that there is no operator or function needed for concatenation of
strings since concatenation of elements is implied simply by
juxtaposition.
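Two of the functions above, @substring and @reverse, can be checked against a short Python model. This is only a sketch of the described behaviour: the function names are stand-ins, a non-negative length is assumed, and add_commas illustrates the reverse-group-reverse idea behind the comma rules rather than transcribing them.

```python
import re


def substring(skip: int, length: int, string: str) -> str:
    """Skip 'skip' characters, then take up to 'length' characters."""
    start = max(0, skip)  # a negative skip acts like zero
    return string[start:start + length]


def add_commas(digits: str) -> str:
    """Group digits by threes from the right by working on the reversal."""
    backwards = digits[::-1]
    # Put a comma after each group of three that is followed by another digit.
    grouped = re.sub(r"([0-9]{3})(?=[0-9])", r"\1,", backwards)
    return grouped[::-1]


print(substring(3, 4, "elephant"), substring(3, 99, "tiger"),
      add_commas("1234567"))
```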
4.3 Variables
A variable consists of a name and an associated value,
both of which are character strings. Variable names are case-sensitive
(regardless of the -i option).
The value can contain any of the Unicode characters, and the name
can contain any Unicode characters except for NUL.
Names consisting of a period followed by upper-case letters are
by convention reserved for internal use.
Both strings may be of any length, limited only by the amount of memory
available.
Variables are manipulated by using the following action functions.
Except for @var, they have no return value.
- @set{name;value}
-
Set the named variable to the designated value.
If the variable was already defined, the previous value is discarded.
For example, “@set{count;0}” initializes variable “count” to 0.
- @var{name}
-
Returns the current value of the named variable. If the variable is not
defined, an error is reported and the return value is
unspecified.[Footnote 2]
- @var{name;default}
-
If the named variable is defined, then its current value is returned and
the second argument is skipped without being evaluated.
Otherwise, when the name is not defined, the return value is the result
of evaluating the second argument.
- @append{name;string}
-
The string is appended to the end of the value of the named variable.
If the variable was not previously defined, then this acts the same as
@set.
For example, “@append{buf;$1}” has the same effect as
“@set{buf;@var{buf;}$1}”, but using @append is
considerably more efficient.
- @incr{name}
-
The value of the named variable is incremented by one.
You might think of “@incr{n}” as being an abbreviation for
“@set{n;@add{@var{n};1}}”, but it is actually more general
than that. The value may contain arbitrary characters before or after
the number, and the number will be incremented while leaving the other
characters unchanged. For example, if the value is “B9a”, it
will be incremented to “B10a”. The value can also be just one
or more letters, in which case the last letter will be incremented to
the following letter; for example, “a” increments to “b”,
and “z” increments to “aa”.
- @decr{name}
-
The value of the named variable is decremented by one.
This works like @incr except that the increment is -1 instead of +1,
and decrementing a value of “a” is an error.
- @bind{name;string}
-
Sets the value of the named variable to the string. If the variable was
already defined, the previous value is remembered so that it can be
restored by a subsequent call to @unbind.
If @bind is called in the context of a recursive argument for a
template match that subsequently fails, then the binding will be undone
automatically.
- @unbind{name}
-
The named variable is restored to the value it had before the most
recent @bind. If it had not been defined before the @bind,
then it becomes undefined again.
An error is reported if the variable is undefined or if there is no
pending binding.
- @push{name;string}
-
A variable may be thought of as a stack of values, where @var
accesses the top-of-stack value, @push pushes a new value onto the
top of the stack, @set modifies the top value,
and @pop pops the top value off the stack.
@push is actually just another name for @bind.
- @pop{name}
-
The variable is restored to the value it had before the most recent
@push. This is actually just another name for @unbind.
The top-of-stack value is simply discarded; there is no return value.
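The stack view of variables described for @push and @pop can be sketched in Python as follows. This is a model of the documented behaviour only; the class name and the sentinel are hypothetical, errors are represented by KeyError, and @incr and @decr are left out because their handling of mixed values is not fully specified above.

```python
_UNDEFINED = object()  # marks "no value existed before the first bind"


class Variables:
    """Each variable is a stack; set replaces the top, bind pushes."""

    def __init__(self) -> None:
        self._stacks = {}

    def set(self, name: str, value: str) -> None:
        stack = self._stacks.setdefault(name, [])
        if stack:
            stack[-1] = value  # the previous value is discarded
        else:
            stack.append(value)

    def var(self, name: str, default=None) -> str:
        stack = self._stacks.get(name)
        if stack:
            return stack[-1]
        if default is None:
            raise KeyError(f"undefined variable: {name}")
        return default()  # the default is evaluated only when needed

    def append(self, name: str, string: str) -> None:
        self.set(name, self.var(name, lambda: "") + string)

    def bind(self, name: str, value: str) -> None:  # also known as push
        self._stacks.setdefault(name, [_UNDEFINED]).append(value)

    def unbind(self, name: str) -> None:  # also known as pop
        stack = self._stacks.get(name)
        if stack is None or len(stack) < 2:
            raise KeyError(f"no pending binding: {name}")
        stack.pop()
        if stack == [_UNDEFINED]:
            del self._stacks[name]  # undefined again, as before the bind


v = Variables()
v.set("x", "1")
v.bind("x", "2")
v.set("x", "3")   # modifies the top of the stack only
v.unbind("x")
print(v.var("x"))
```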
Also, @var (both the one and two argument forms) may be abbreviated as
$, providing that the name does not begin with a digit.
Thus, for example, “${foo}” has the same meaning as
“@var{foo}”. Furthermore, if the variable name is a single
constant letter (in any Unicode alphabet) and there is no default value
argument, then the braces may be omitted. Thus,
“@var{i}” can be abbreviated as “$i”.
This last form (dollar letter) also has the special property that it can
be used in a
template to insert the current value of a variable into the template to
be matched. All of the other variable operations can only be used in
actions.
Lisp programmers may find it helpful to note that @set is
like the Lisp set form, @var is like symbol-value,
and the combination of @bind and @unbind is like what
happens in a let for a variable with dynamic scope.
While there is no support for arrays as such, note that since the name
of a variable can contain any characters, and the name is an evaluated
argument, it is possible to do things like
“@set{A[$i];$1}” which looks like an array and can be used like
an array, even though the brackets and subscript are really just part
of the variable name.
Variables can also be used as an associative look-up table, where the
name is the key. However, the current implementation assumes that the
number of variables will be small, so it may become slow if used as a
table with a large number of entries.
4.4.1 Pathname manipulation
This group of functions allows pathnames to be constructed in a manner
that keeps a pattern file independent of the pathname syntax of a
particular operating system.
- @makepath{directory;name;suffix}
-
Returns the file pathname formed by merging the file name in the second
argument with the
default directory in the first argument and replacing the suffix with
the third argument, if that argument is not empty. If the second argument
is an absolute pathname, then it retains the same directory and the first
argument is not used.
For example (assuming running on Unix):
@makepath{/home/dir;bar.c;.o} ⇒ /home/dir/bar.o
@makepath{/home/dir;/scr/bar.c;.o} ⇒ /scr/bar.o
@makepath{/home/dir;bar.c;} ⇒ /home/dir/bar.c
- @mergepath{pathname;name;suffix}
-
Returns the file pathname formed by merging the second argument with a
default directory extracted from the first argument and replacing the
suffix with the third argument, if that argument is not empty. This differs from
@makepath in that the first argument is a complete file pathname
whose name portion is ignored. This would be used to create a new file
in the same directory as another file.
For example (assuming running on Unix):
@mergepath{/a/foo.i;bar.c;.o} ⇒ /a/bar.o
@mergepath{/a/foo.i;/b/bar.c;.o} ⇒ /b/bar.o
@mergepath{/a/foo.i;bar.c;} ⇒ /a/bar.c
- @relative-path{pathname;pathname}
-
If the two pathnames have the same directory portion, return the second
argument with the common directory removed; else return the whole second
argument. Note that if the two arguments are the same, this has the
effect of separating the file name from the directory.
For example:
@relative-path{/a/x/cat.x;/a/x/dog.c} ⇒ dog.c
@relative-path{/a/x/cat.x;/a/y/dog.c} ⇒ /a/y/dog.c
- @expand-wild{pathname}
-
Usually this function just returns its argument followed by a newline.
When running on MS-DOS or Windows
and the pathname is a wild card (i.e. contains
“*” or “?”), the return value consists of all files
that match the pattern, with a newline following each one. If there are
no matches, a warning is written to the error output and the return value
is empty. This wild card expansion is done by a system call, so it
is consistent with other MS-DOS or Windows command-line utilities, but the
meaning of “*”
is not completely the same as in gema patterns.
On Unix, wild card arguments are presumed to have already been expanded by
the shell, so expansion is not done here.
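For readers who want to check their understanding of the merging rules, the following Python sketch mimics @makepath, @mergepath, and @relative-path for Unix-style pathnames using the standard posixpath module. It is an approximation written from the descriptions above, with stand-in function names, and it does not attempt the other pathname syntaxes that the real functions handle.

```python
import posixpath


def makepath(directory: str, name: str, suffix: str) -> str:
    """Merge name with a default directory and optionally swap the suffix."""
    if posixpath.isabs(name):
        path = name  # an absolute name keeps its own directory
    else:
        path = posixpath.join(directory, name)
    if suffix:
        root, _old_suffix = posixpath.splitext(path)
        path = root + suffix
    return path


def mergepath(pathname: str, name: str, suffix: str) -> str:
    """Like makepath, but the directory is taken from another file's path."""
    return makepath(posixpath.dirname(pathname), name, suffix)


def relative_path(first: str, second: str) -> str:
    """Drop the directory of 'second' when it matches that of 'first'."""
    if posixpath.dirname(first) == posixpath.dirname(second):
        return posixpath.basename(second)
    return second


print(makepath("/home/dir", "bar.c", ".o"))
print(mergepath("/a/foo.i", "bar.c", ""))
print(relative_path("/a/x/cat.x", "/a/x/dog.c"))
```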
4.4.2 Using alternate input and output files
- @err{string}
-
The argument is evaluated with its output being directed to the standard
error output stream (stderr in C terminology).
There is no return value.
This can be used to write error messages or status messages.
Don't forget that newlines must be explicitly provided, so the argument
typically should end with “\n”.
- @out{string}
-
The argument is evaluated with its output being sent directly to the
current output file instead of to the current output stream.
The distinction arises during translation of a recursive argument, where
@out can be used to write directly to the output file instead of
appending to the value of the argument being translated. Usually this
is not what you want to do, but it may be useful in some circumstances.
For example, suppose some algebraic language is to be translated into
an assembly-like language. A typical rule might look something like:
expr:<term>+<term>=@incr{t}@out{\N ADD $1,$2,R$t\n}R$t
where an expression “x+y” would be processed by outputting
“ ADD x,y,R1” and returning “R1” as the result value
to be used as an operand of the next instruction.
- @write{pathname;string}
-
The first argument is evaluated first, and the file that it names is
opened for writing. If the pathname is “-”, then standard
output will be used. If the identical pathname has previously been
used in a @write call, then writing continues at the end
of the same file without re-opening or rewinding it.
Then the second argument is evaluated, with its output being directed to
the designated file. Within that evaluation, the function
@outpath will return the first argument of the @write.
The file remains open until either the program terminates or the same
pathname is referenced in a call to @close or @read.
- @close{pathname}
-
If the argument is identical to one previously appearing as the pathname
argument in a call to
@write, then that output file will be closed.
Otherwise, nothing happens.
- @read{pathname}
-
The file named by the argument is opened for reading. If the pathname is “-”, then standard input is used.
If the same identical pathname was previously used in a @write,
the output file will be closed before re-opening the file for reading.
The result is an input stream that will read from the file as needed,
and close it when the end is reached.
This is commonly used in the context:
“@domain{@read{pathname}}”
which says to translate using the alternate file
as input. Within this translation, the functions @file,
@inpath, @line, @column, and @file-time will
all refer to the file named in the argument of @read.
However, if the @read function has its result concatenated with
something else instead of appearing by itself as the argument to another
function, then the effect will be to copy the entire contents of the
file to the current output and close the file.
- @probe{pathname}
-
This can be used to test a pathname to see whether it can be opened. The
result value is “F” if the argument names an existing file,
“D” if it names a directory, “V” if it names a device,
“U” if it is undefined, or “X” if it is defined in some
unexpected way.
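The classification returned by @probe can be approximated in Python with os.stat. This sketch is an assumption about how the result letters map onto file types, not a description of gema's implementation: it inspects the file type rather than trying to open the path, and it treats both character and block devices as “V”.

```python
import os
import stat
import tempfile


def probe(pathname: str) -> str:
    """Return F, D, V, U, or X for the given pathname."""
    try:
        mode = os.stat(pathname).st_mode
    except OSError:
        return "U"  # nothing usable exists under this name
    if stat.S_ISREG(mode):
        return "F"
    if stat.S_ISDIR(mode):
        return "D"
    if stat.S_ISCHR(mode) or stat.S_ISBLK(mode):
        return "V"
    return "X"


with tempfile.TemporaryDirectory() as directory:
    file_path = os.path.join(directory, "a.txt")
    open(file_path, "w").close()
    results = (probe(directory), probe(file_path),
               probe(os.path.join(directory, "missing")))
print(results)
```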
4.4.3 File context queries
- @outpath{}
-
Returns the pathname of the current output file, or as much of the
pathname as is known. This would be the same as the output file
argument on the command line or the pathname argument to the
@write function if within that context.
- @inpath{}
-
Returns the pathname of the current input file, or as much of
the pathname as is known. This would be the same as the input file
argument on the command line or the pathname argument to the
@read function.
- @file{}
-
Returns the name of the current input file, with any directories removed.
- @line{}
-
Returns the current line number in the input file.
More precisely, this is the line number of the last character matched by
the template. If that character is a newline character, this is the
number of the line preceding the newline.
- @column{}
-
Returns the column number of the current position in the input stream,
i.e., the column of the last character in the text matched by the
template. (A UTF-16 surrogate pair counts as a single character, and a BOM
at the beginning of the file is not counted.)
For example, the following default rule could be used to write an error
message for unexpected input:
?=@err{\NIllegal character "$1" in line @line, column @column.\n}
- @out-column{}
-
Returns the column number of the current position in the current output
file.
- @file-time{}
-
Returns the date and time when the current input file was last modified.
The information is presented as formatted by the C function
ctime, except without any newline.
- @encoding{}
-
Returns the name string of the input file encoding, which will be one of:
ASCII, 8bit, UTF-8, UTF-16LE, or UTF-16BE.
For example, the following shell command could be used to find out the
encoding of a given file:
gema -match -p "\E=@encoding\n" file
Reporting at the end of the file (\E) enables finding out
whether there are any 8-bit characters or UTF-8 sequences in a file of bytes.
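The definitions of @line and @column in terms of the last matched character can be restated as a small computation. The Python sketch below takes the whole input as a string together with the index of the last matched character. The function name is hypothetical, and the treatment of a byte order mark is not modelled.

```python
from typing import Tuple


def line_and_column(text: str, index: int) -> Tuple[int, int]:
    """Line and column (both counted from 1) of the character at 'index'."""
    # Newlines strictly before the character decide the line, so a newline
    # character itself is reported on the line that it terminates.
    line = text.count("\n", 0, index) + 1
    line_start = text.rfind("\n", 0, index) + 1
    return line, index - line_start + 1


text = "ab\ncd?e"
print(line_and_column(text, text.index("?")))
```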
4.5 Control flow functions
- @end{}
-
Signals the successful completion of the current translation.
If this appears in the action for a pattern match at the top level of a
file, the remainder of the input file will not be read, and if there are
no more input files to be processed, the program will terminate with
an exit status of 0 (assuming there were no errors before).
For example, the following shell command will print the first line that
matches and then stop:
gema -match -p 'Title\:*\n=$0@end' foo
If this appears within the context of a recursive argument, then it ends
the argument and returns control to the enclosing template to continue
processing the input.
For example, with the following rules:
sign:+=+@end;-=-@end;=@end
the template argument “<sign>” will accept an optional plus or
minus sign and nothing more.
- @fail{}
-
Signals failure of the current translation.
At top-level, this will terminate processing of the input file like
@end, except that the program will have a non-zero exit status.
For example, the following command will indicate by the exit status
whether the file being tested contains a particular string:
gema -match -p 'Success=@end;\E=@fail' foo.text
If the string is found, the program exits with 0; if the end of the file
is reached, then a non-zero exit status is returned.
Within the context of a recursive argument, this causes the enclosing
template to report a failed match.
- @terminate{}
-
This ends the translation of a recursive argument. If the argument
value is empty, then the template match fails, like for @fail.
Otherwise, when some characters have been accepted, processing of the
template continues like for @end.
This is typically used instead of @end in a delimiter rule when
an empty string is not to be considered a match.
For example, with the following rules:
vowel:a=a;e=e;i=i;o=o;u=u;=@terminate
the argument “<vowel>” will match one or more vowels.
- @abort{}
-
Immediately terminates execution of the program with a non-zero exit status.
- @exit-status{number}
-
This function can be used to specify that a particular exit status value
will be returned when the program exits, providing that there is no
error condition that specified a higher value first.
This could be called before @fail or @abort to cause some
particular non-zero value to be returned for the sake of a shell script
that wants to test for what kind of failure occurred.
4.6 Other operating system interfaces
- @date{}
-
Returns the current date, in the form:
mm/dd/yyyy
- @datime{}
-
Returns the current date and time, as formatted by the C function
ctime, except without any newline.
- @time{}
-
Returns the current time, in the form:
hh:mm:ss
- @getenv{name;default}
-
Returns the value of an environment variable, as from the C function
getenv. The first argument is the name of the environment
variable. The second argument is optional, and will be returned as the
default value if the environment variable is not defined.
For example, on a Unix system, the action
“@getenv{USER}” will output the current user ID.
- @shell{string}
-
The argument value is executed as a shell command by passing it to the C
function system.
Although it would be desirable for this to return the text written by
execution of the command, that is not currently implemented. Instead,
any output from the command goes directly to standard output, and there
is no value returned to the current output stream.
For example, the following action could be used to sort a temporary file:
@shell{sort \< '${tmpfil1}' \> '${tmpfil2}'}
4.7 Definitions
The following functions can be used to add or remove definitions of
rules at run time.
- @define{patterns}
-
Define new rules.
The evaluated argument value is read as a pattern file, defining rules
and performing immediate actions as specified.
There is no return value.
For example, you can have one pattern file include another by using an
immediate action like this:
@define{@read{foo.pat}}
For another example,
to emulate a C pre-processor, the #define directive
could be implemented by the following rule (assuming, for simplicity of the
example, no arguments, no continuation lines, and no comments):
\N\#define <I> *\n=@define{\\I$1\\I\=@quote{$2}}
Note that the tricky part here is to get the right level of quoting so
that things are evaluated at the proper time. The @quote
function is explained below.
Given the input line
“#define NUM 3*4”, the argument of @define will evaluate
to the string “\INUM\I=3\*4” which will then be defined as a new
rule.
- @quote{string}
-
Returns a copy of the argument value with backslashes inserted where necessary
so that @define and @undefine
will treat all of the characters as literals.[Footnote 3]
For example, given an argument which evaluates to the string
“a * 3”, the return value will be “a\ \*\ 3”.
- @as-ascii{string}
-
This function is like @quote except that it also represents
non-ASCII characters using \x or \u escape sequences to
show the code value. This is typically useful for human reading on a
terminal which doesn't have a Unicode font.
- @undefine{patterns}
-
This can be used to undefine rules. The argument is processed like for
@define, except that instead of defining rules, the effect
is to cancel any existing rule that exactly matches. The argument may
also be just a template, without any “=” or action, in which
case any rule with the same template will be cancelled, without regard
to its action.
For example, the C #undef directive could be emulated (with the
same simplifying assumptions as the #define example above) by the
rule:
\N\#undef <I>=@undefine{\\I$1\\I}
- @subst{patterns;operand}
-
Substitution.
Return the result of translating the operand according to the patterns
specified by the first argument.
The first argument is processed the same as by @define, except
that the rules are implicitly defined in a temporary domain which is
deleted after being used to translate the operand. An explicit domain
name (i.e. before a colon) is not allowed.
For example, “@subst{\\Iis\\I\=was;this is it}” will return
“this was it”.
Usually this sort of substitution is more conveniently and efficiently
done by using a domain function (for example, “@frob{this is it}”
with rule “frob:\Iis\I=was”), but the @subst function
can be used in cases where the substitution needs to be computed at run
time.
For example, to emulate a C #define directive with one argument:
\N\#define <I>\W(\W<I>\W) *\n=\
@define{\\I$1\\W(\#)\=@subst{\\I$2\\I\=\\\$1;@quote{*}}}
Here @subst is used to replace references to the C argument name
with the “$1” notation used by gema.
4.8 Setting Options
The following group of functions can be used to set various program
options. These are typically used as immediate actions in a pattern
file to set the options that the file needs, instead of requiring
separate command line options. Except for @get-switch and
@getset, these functions don't return any result value.
- @set-switch{name;value}
-
Sets the value of any of several option switches that have numeric values.
In most cases the value should be either 1 for true or 0 for false.
The defined switch names are:
- arglen
- - maximum length for “*” operands. This is
the only switch that takes a number rather than being just true or false.
- b
- - binary mode
- bom
- - write a Unicode Byte Order Mark at the beginning of the
output file when using UTF-8 or UTF-16 encoding (default is 1 for true)
- i
- - case-insensitive mode
- k
- - keep going after errors
- line
- - line mode
- match
- - match only mode (input text that doesn't match any
template is discarded instead of copied to the output)
- t
- - token mode
(However, this is only part of the “-t” command line option,
which is implemented by
“@set-switch{t;1}@set-switch{w;1}”.)
- trace
- - write pattern match diagnostic messages to stderr.
(Only recognized if the program was compiled with “-DTRACE”.)
- w
- - ignore whitespace in the input
(This is only part of the “-w” command line option, which is
implemented by
“@set-switch{w;1}@set-syntax{S;\s\t}”.)
In each case, the switch corresponds to the command line option with the
same name, and further explanation of the meaning can be found there.
- @get-switch{name}
-
Returns the current value of the named switch.
- @set-parm{name;value}
-
Sets the value of any of several options that have string values.
The defined names are “idchars”, “filechars”,
“backup”, “inenc”, and “outenc”.
These are used to implement the command line options
with the same names, and the meaning is documented there.
But in version 2.0, “idchars” is superseded by
“@defset{I;value}” and
“filechars” is superseded by
“@defset{F;value}”.
- @set-syntax{type;charset}
-
This function can be used to change the meaning of characters in
patterns. When used as an immediate action in a pattern file, it takes
effect beginning with the next line read.
The first argument designates one or more syntactic
categories, and the second argument is a set of characters that will now
have that meaning. For each character in the first argument, the
corresponding character in the second argument acquires the designated
syntactic class; when there is only one remaining character in the first
argument, it applies to all of the remaining characters in the second
argument. The syntactic class may be identified by either the special
character which currently has that class (or which has that class by
default if it is currently a literal), or by one of the following
letters:
- A
- - argument separator. This is one of the two uses of
the semicolon in the default syntax. For example, if you wanted
to be able to use comma to separate arguments, do:
“@set-syntax{A;,}”
- C
- - comment. Causes the rest of the line to be ignored
as a comment.
- D
- - domain argument. Characters of this class can be
used as an abbreviation for a domain argument with the same name.
There are no characters that have this class by default.
For example, if you say “@set-syntax{D;%}”, then the
character “%” represents a recursive argument in the
domain defined by rules prefixed by “%:”. In other
words, “%” becomes an abbreviation for “<%>”,
and it also can be used in an action to represent the value of
the corresponding argument, as with “*”, “?”,
and “#”.
- E
- - escape. Together with the following character, it
specifies a control character or template operator. This is half of
what the backslash does in the default syntax.
- F
- - function prefix. This introduces the name of a
function to be called. This is one of two uses of “@”
in the default syntax.
- I
- - ignore. Characters with this class will be
completely ignored. There are no characters that have this
class by default.
- K
- - character operator. Causes the following character
to have its default meaning. This is one of the two things that
“@” is used for in the default syntax.
- L
- - literal. For example, the command line option
“-literal '/?^'” is implemented by:
“@set-syntax{L;\/\?\^}”
- M
- - quote until match.
Causes the following characters to be taken literally until a
second occurrence of the character is found. For the sake of
compatibility with earlier versions, there are currently no
characters that have this class by default.
For example, “@set-syntax{M;\'}” causes all characters
between matching apostrophes to be taken literally (even backslash).
- Q
- - quote one character.
Causes the following character to be taken
literally. This is half of what the backslash does in the
default syntax.
- S
- - ignored space. Characters with this class will be
ignored unless they separate two identifiers, in which case they
will be treated like “\S”. There are no characters
that have this class by default, but part of what the -w
option does is: “@set-syntax{S;\s\t}”
- T
- - terminator. A character with this class marks the
end of a rule. By default, newline is used for this purpose.
For example, “@set-syntax{\*;\~}” would cause tilde to
represent a wildcard argument. This doesn't change the meaning of the
asterisk, it just means that now either of the characters can be used
for that purpose. If you wanted to delimit recursive arguments with
square brackets, and let angle brackets be literals, do:
“@set-syntax{\<\>LL;\[\]\<\>}”
See also the -ml option.
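The way the two arguments of @set-syntax are paired can be modeled with a short Python sketch. This is an illustrative model of the rule as described above, not gema's implementation: the function name pair_syntax is hypothetical, both arguments are assumed to be given with escapes already resolved, and returning nothing for an empty first argument is an assumption.

```python
def pair_syntax(types, chars):
    # Returns (class, character) pairs: each character of `chars`
    # receives the class named at the same position in `types`.
    if not types:
        return []  # assumption: nothing to assign without a class
    pairs = []
    for i, ch in enumerate(chars):
        # The last class character covers all remaining characters.
        cls = types[min(i, len(types) - 1)]
        pairs.append((cls, ch))
    return pairs


# Models @set-syntax{\<\>LL;\[\]\<\>}
print(pair_syntax("<>LL", "[]<>"))
# [('<', '['), ('>', ']'), ('L', '<'), ('L', '>')]

# Models @set-syntax{L;\/\?\^}
print(pair_syntax("L", "/?^"))
# [('L', '/'), ('L', '?'), ('L', '^')]
```

In the first call the square brackets take over the roles of the angle brackets, and the angle brackets themselves become literals; in the second, the single class letter applies to every listed character.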
- @reset-syntax{}
-
Re-initializes the syntax tables to their default state, thus undoing the
effects of any calls to @set-syntax, including any use of the
-literal or -ml option.
- @set-locale{name}
-
Set the internationalization locale, using the C function
setlocale.
The valid argument values and the exact effect are dependent on the C
library implementation on the specific platform.
A typical usage would be:
@set-locale{@getenv{LANG;C}}
This may affect which characters are considered to be letters
by recognizer arguments such as “<L>”.
This function is not supported on MS-DOS,
but support on Windows was added in gema version 1.5.
- @defset{letter;characters}
-
Defines a set of characters for template recognizers. (new in version 2.0)
It takes two parameters, a letter naming the
character set, and the list of characters. As in a regular expression
character set, a hyphen can be used to indicate a range of characters.
For example, 123456 can be abbreviated as 1-6.
Any naming letter from A to Z is allowed;
generally the defined set extends the meaning of the corresponding recognizer
<A> to <Z>.
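The range notation can be illustrated with a small Python model. This is a sketch under stated assumptions rather than gema's actual parsing: the function name expand_set is hypothetical, escapes such as \u0590 are assumed to be resolved already, and a hyphen that is first or last in the list is assumed to be literal.

```python
def expand_set(spec):
    # Expand "x-y" ranges into the individual characters they cover.
    chars = []
    i = 0
    while i < len(spec):
        # A hyphen between two characters denotes an inclusive range.
        if i + 2 < len(spec) and spec[i + 1] == "-":
            lo, hi = ord(spec[i]), ord(spec[i + 2])
            chars.extend(chr(c) for c in range(lo, hi + 1))
            i += 3
        else:
            # Any other character, including a leading hyphen, is literal.
            chars.append(spec[i])
            i += 1
    return "".join(chars)


print(expand_set("1-6"))   # 123456
print(expand_set("-._"))   # -._
```

The second call shows why a literal hyphen is listed first: in that position it cannot sit between two range endpoints.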
Following are the special cases:
- B
- defines <B> (the default is an empty set)
- E
- redefines <E>
- J
- additional lower-case letters to extend the meaning of <J>, <L>, <A>, and <W>
- K
- additional upper-case letters to extend the meaning of
<K>, <L>,
<A>, and <W>
- H
- redefines <H> (the default for Chinese is
@defset{H;\u2E80-\u2EFF\u3000-\u303F\u31C0-\u31EF\u3200-\u9FFF\u20000-\u323FF})
- I
- additional identifier constituents (besides ASCII letters and
digits); this extends
<I> and
also affects \I and <Y>.
The default is: @defset{I;_}
For example, for parsing COBOL source code, you would want to do:
@defset{I;-}
Note that
@defset{I;characters} is equivalent to
the older (now deprecated) notation
@set-parm{idchars;characters} except
that @defset allows using a hyphen for a range.
- F
- the set of characters which are accepted by
<F> as being file name constituents, in
addition to ASCII letters and digits.
The default on Windows is: @defset{F;-.\/_~\#\@%+\=\:\\}
(Note that here the hyphen is listed first in order to be taken literally
instead of indicating a range.)
This supersedes
@set-parm{filechars;characters}
(which for compatibility does not support ranges).
- M
- redefines <M>
- Q
- defines <Q> (the default is an empty set)
- R
- defines <R> (the default is an empty set)
- S
- specifies additional white space characters for
<S>, -w, \S, \W, and @wrap.
For example, to recognize the various Unicode space characters:
@defset{S;\u2000-\u200B\u2028\u2029\u205F}
(This is not the default because there is a slight performance penalty.)
- V
- redefines <V>
- Y
- redefines the Unicode punctuation characters matched by <Y> in
addition to the ASCII punctuation characters.
The default is:
@defset{Y;\u2010-\u2027\u2030-\u205E\u27E6-\u27EF\u2E00-\u2E7F}
- Z
- defines <Z> (the default is an empty set)
If both @defset{J;characters}
and @defset{K;characters} are
provided and they have the same number of characters, then the
corresponding elements (taken in code value order) are assumed to be
corresponding lower and upper-case letters. This is used to extend
case-insensitive comparison for -i, \C, and
@cmpi. It is also used by the functions @upcase
and @downcase.
For example, for Spanish text:
@defset{J;áéíñóúü}@defset{K;ÁÉÍÑÓÚÜ}
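The pairing of the J and K sets can be sketched in Python as follows. This models only the rule stated above (equal lengths, elements matched in code value order); the function name case_pairs is hypothetical, and returning an empty mapping when the lengths differ is an assumption about what "no pairing" means.

```python
def case_pairs(lower, upper):
    # Map each extra lower-case letter to its assumed upper-case partner.
    if len(lower) != len(upper):
        return {}  # assumption: no pairing when the sizes differ
    # Sorting by code value makes the order of the arguments irrelevant.
    return dict(zip(sorted(lower), sorted(upper)))


pairs = case_pairs("áéíñóúü", "ÁÉÍÑÓÚÜ")
print(pairs["ñ"])  # Ñ
```

Because the match is positional after sorting, the two sets must be chosen so that their code value orders agree, as they do for these Latin-1 letters.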
More examples:
To redefine <H> to match Hebrew letters:
@defset{H;\u0590-\u05FF}
To define <R> to match Russian (Cyrillic) letters:
@defset{R;\u0400-\u052F}
To define <Z> to match Greek letters (including accented variants):
@defset{Z;\u0386-\u03CE\u1F00-\u1FFE}
To redefine <E> to match Egyptian hieroglyphs:
@defset{E;\u13000-\u1345F}
To allow numbers to contain fractions:
@defset{N;\xBC\xBD\xBE\u2150-\u215F}
- @getset{letter}
-
Returns the current value of the set of characters named by the
letter. (new in version 2.0)
This may be useful for extending a built-in set. For example, the
following action adds the digits, the period, and square brackets to the
set of mathematical symbols:
@defset{M;@getset{M}0-9.\[\]}
This function can also be used to inspect the default value of a set; for
this purpose, you may want to use the @as-ascii function to see the code
values of Unicode ranges. For example, the following shell command displays
the default definition of <E>:
gema "@out{@as-ascii{@getset{E}}\n}@end"
4.9 Informational functions
- @show-help{}
-
Displays on the standard error output a brief explanation of how to use
the program. No value is returned.
This is used internally to implement the
-help option, and is not likely to be of use otherwise.
The message is constructed based on the current syntax
tables, so use of the @set-syntax function will be reflected here.
- @version{}
-
Returns the program version identification string.
This is used internally to implement the -version option, and is
not likely to be of use otherwise.