A general purpose macro processor
David N. Gray <dgray@acm.org>
March 17, 1995
Revised May 21 and August 20, 1995
Revised November 30, 2003
This document is the user manual for gema, a program whose name stands
for general purpose macro processor. This is a utility program
which is run as a shell command under Unix or MS-DOS and can be used for
performing conversions or translations of data files or extracting information
from files.
The man page for gema provides a tutorial introduction and a
brief reference summary, while this document provides a more detailed
specification. It would probably be best to read the tutorial section first
before reading the rest of this document. The command line options are fully
documented in the man page, so will not be repeated here.
This version of the document corresponds to version 1.2 of the program. (Note
that the -version option can be used to check which version you are
running.)
up
The general model of operation is that the program reads an input file and
writes an output file which consists of the input data transformed in accordance
with a set of transformation rules provided by the user. A rule consists of a
template and an action. The template is a pattern which the
program will attempt to match with the input data. Any input text that matches a
template pattern will be replaced by the result of evaluating the rule's action.
There may be multiple sets of rules, where each set of rules is called a
domain. At any given time, translation is controlled by the rules of
one particular domain, but both templates and actions are able to switch to a
different domain for processing particular portions of the data. A domain can
inherit from another domain, meaning that if no match is found for the current
input text in any of the rules for the current domain, then the rules of the
inherited domain will be tried.
Processing of a file begins using the default domain, whose name is the empty
string. First, if there is a rule with template ``\B'' (beginning of
file) or ``\A'' (beginning of data), then its action is performed. Then
the program begins reading the file. For each character position in the file,
the program attempts to find a rule in the current domain whose template pattern
matches the input text beginning at that point. If a match is found, then the
input stream is advanced to the end of the matched text and the rule's action is
executed. When no template matches the current position, the current character
is copied to the output file (unless the -match option is being used),
the input stream is advanced to the next character, and it tries to find a
template matching the text starting at that position. When the end of the input
file is reached, if there are any rules with template ``\E'' (end of
file) or ``\Z'' (end of data), their actions will be executed, and then
the files will be closed.
However, if a template matches without advancing the input stream (for
example, if it begins with ``\P''), then after executing its action,
the search continues as though it had not matched. This is necessary to avoid
hanging in a loop repeating the same match forever.
A rule may have an empty action, with the effect that the matching text is
simply discarded. In each domain there may be at most one rule with an empty
template, which signifies a default action to be taken when no other rule
matches. However, since an empty template does not cause the input stream to be
advanced and there are no more rules to try, this is only meaningful if the
corresponding action exits the current context by using one of @end,
@terminate, @fail, or @abort.
Generally speaking, while looking for a match, the rules within a domain will
conceptually be tried in the same order in which they were defined, so wherever
there might be ambiguities, the user should define the rules for preferred
special cases before the rules for default general cases. However, there are
some important exceptions:
- Rules beginning with a literal character (including space or
``\S'') will be tried before rules beginning with an argument.
(Operators that don't advance the input stream, such as ``\N'',
``\I'', or ``\L'', are ignored for the purpose of this
determination.) The way this actually works is that the current input
character is used as an array index to find the list of rules beginning with
that character. If none are found, or none match, then the rules beginning
with arguments are tried in sequence.
- If two rules begin with the same sequence of literal characters, the one
with the longer literal string will be tried first. This is because if this
were not done, a more specific rule appearing later would never match, which
would surely not be what the user intended.
- If two rules have identical templates, the second rule will replace the
first on the assumption that it is a redefinition.[Footnote 1]
Rules can be defined either as arguments of the -p command line
option or in pattern files loaded by the -f option. Each line in a
pattern file can be one of the following:
- A comment line, indicated by a ``!'' in the first column.
- Any blank lines are ignored.
- An immediate action - a line beginning with ``@'' can contain one
or more function calls which will be evaluated immediately before reading the
next line. Normally these should be functions that are used for their side
effects and do not return a value. The most common case is using
``@set'' to initialize a variable.
- Any other line is expected to be a rule, which should contain a
``='' separating the template and action. A line may contain multiple
rules, separated by semicolons. A rule may be continued on another line by
ending the line with a backslash, which causes the following newline to be
ignored and skips over any leading spaces on the following line.
Rules may be preceded by a domain name followed by a colon, which causes
all of the rules following on the same line to be defined in the designated
domain. The domain name may optionally be enclosed in angle brackets, and any
leading or trailing blanks will be ignored. If the domain is to be referenced
in an action following ``@'', then the name should be limited to
letters, digits, hyphen, and underscore.
- A line of the form ``name1::name2'' specifies
that the domain named on the left inherits from the domain named on the right.
Each domain can inherit from at most one other, but multiple levels of
inheritance is allowed.
A template may contain any of the following:
- Literal characters, which are to be exactly matched. Literal characters
include letters, digits, special characters quoted by a backslash, control
characters designated by backslash followed by a lower case letter or digit,
control characters designated by ``^'' followed by a letter, and any
other characters that don't have any special meaning. Any of the 256 possible
characters can be used.
- Arguments, which match some variable portion of text, and remember the
matched text so that it can be referenced later. There are several different
kinds of arguments, which are denoted by ``*'', ``?'',
``#'', ``<...>''. and
``/.../''. The current implementation allows a
template to have a maximum of twenty arguments.
- Template operators, denoted by a backslash followed by an upper case
letter, or by the space character. These may set local options, impose
additional requirements for a match, or allow skipping of redundant spaces in
the input.
- A dollar sign may be used followed by a single letter or digit to insert
the value of a variable or a previous argument into the text that is to be
matched. (This does not currently work for ``*'' arguments.)
An action may contain any of the following:
- Literal characters, which when evaluated simply copy themselves to the
output stream.
- The value of a template argument can be output by using a dollar sign
followed by the argument number (enclosed in braces if more than one digit),
or by using the characters ``*'', ``?'', or ``#''
to denote the corresponding argument.
- The notation ``${name}'' can be used to output
the value of a variable.
- The notation ``@name{args}''
can be used to call a built-in function or to translate the argument with a
user-defined domain.
- Some of the operators denoted by a space or backslash followed by an upper
case letter also apply in actions, typically for conditional output of spaces
or newlines.
up
Table of Contents
1 Introduction
2 Operational Overview
3 Notation
3.1 Special characters
3.2 Escape Sequences
3.3 Recognizer arguments
4 Built-in Functions
4.1 Numbers
4.2 String functions
4.2.1 Output formatting -- padding, filling, and wrapping
4.2.2 String Comparison
4.2.3 Case conversion
4.2.4 Miscellaneous string functions
4.3 Variables
4.4 Files
4.4.1 Pathname manipulation
4.4.2 Using alternate input and output files
4.4.3 File context queries
4.5 Control flow functions
4.6 Other operating system interfaces
4.7 Definitions
4.8 Setting Options
4.9 Informational functions
5
Customized command-line processing
6 Exit codes
7 Status and Future development
8 Acknowledgments
Footnotes
Footnote 1:
This policy was adopted before @undef had been invented; it might
be better to warn about duplicates and require explicit undefinition before
redefining. (back)
|
|
 |