3 Notation
In the documentation, quotation marks are often used to visually delimit
examples that are embedded in the text. The quotation marks are never part
of the example.
Letters in italics are descriptive place-holders, not literal characters.
3.1 Special characters
By default, the following characters have special meaning in patterns.
Note that all of these can be changed through the use of the
@set-syntax function. See also the -literal and
-ml options.
- *
-
In a template, this denotes a wild-card argument that matches any number
of characters, from zero up to a maximum of 4096, or as specified by the
-arglen option. (Some limit is needed for efficiency to avoid
reading all the way to the end of the file before concluding that the match
has failed.)
Characters are copied from the input stream into the argument value
until a match is found for the entire remainder of the template.
Thus, when a template has two or more wild-card arguments, the input text is
divided among them as necessary for the complete template to be matched.
(By contrast, a “<u>” argument is similar except that it
terminates when a match is found for whatever sequence of literal
characters follows it, up until the next argument.)
If the -line option is in effect or if “\L” appeared
earlier in the template, then it will not accept a newline character.
In an action, it denotes the value of the corresponding
template argument.
- ?
-
Wild-card argument that matches any one character.
If the -line option is in effect or if “\L” appeared
earlier in the template, then it will not accept a newline character.
- #
-
Recursive argument.
In a template, this denotes an argument whose value is obtained by
translating the input text in the same domain as the current rule until
a match is found for whatever sequence of literal characters follows the
argument (up to the next argument, or the end of the template, or
“\G”).
In an action, it denotes the value of the corresponding template argument.
- <name>
-
Recursive argument, translated according to the named domain,
or a pre-defined recognizer argument. The name may be empty to denote
the default domain.
The name does not have to have been defined before it is referenced.
This can be used only in a template.
If the -ml option (new in version 1.4) is in effect, the syntax is
instead: [name]
- /regexp/
-
In a template, this denotes an argument where
the characters between the slashes are used as a regular expression,
and the argument value is however much text it matches.
If the -ml option is in effect, a vertical bar is used instead of a
slash.
Regular expressions are documented in many other places, so they are not
detailed here. Suffice it to say that the following characters and
combinations have special meaning:
. \ [ ] * + ^ $ \( \) \< \>
A slash that is to be part of the regular expression needs to be
preceded by a backslash.
The characters between the slashes are taken literally instead of being
evaluated according to the usual gema meaning except that
\x and \u escape sequences may be used.
Regular expression arguments never cross line boundaries.
Unlike other kinds of arguments, they will match as many characters as
they can, without regard to whatever follows in the template.
For example, the template “a/[a-z]*/x” will never match anything
because if there is an ending “x”, it will be swallowed by the
argument; however, in the template “a<l>x” the argument will match
on any letter except “x”.
- =
-
This designates the end of a template and the beginning of the
corresponding action.
- $0
-
This can be used in an action to copy the matched text to the output.
The template is evaluated as though it were an action, with each
argument designator being replaced by the actual argument value.
Note that this does not necessarily exactly duplicate the input text
since any ignored whitespace will be lost and recursive arguments are shown in
their translated form.
- $digit or ${digits}
-
In either a template or action, this represents the value of the
numbered argument. The argument number must be enclosed in braces if it
needs more than one digit. In a template, this can refer only
to a preceding argument, and in the current implementation, the value of a
“*” argument cannot be accessed within the same template.
- $letter
-
In either a template or action, this inserts the value of a variable
whose name is a single letter.
An error is reported if the variable is not defined.
In this context only, a letter can actually be any Unicode character
which doesn't have some other syntactic role.
- ${name}
-
In an action, this outputs the value of the named variable. The name
must not begin with a digit. An error is reported if the variable is not
defined.
- ${name;default}
-
In an action, this outputs the value of the named variable, if it is
defined, or evaluates the default action if the variable is not defined.
- \
-
Escape character; see the section on “escape sequences” below.
- ^
-
Control key. Together with the following character, this represents the
control character formed by combining the Control key with the character.
For example, either “^J” or “^j” could be used to
denote the ASCII Line Feed character. This notation is not meaningful if
the character set being used is not based on ASCII.
- Space
-
In a template, a space character matches one or more whitespace
characters in the input, the same as “\S”.
(In the less likely event that you really want to
match exactly one space character, you can use “\ ” or
“\s”.)
In an action, a space character causes one space to be output if the
last character output was not a whitespace character, except that if
there are multiple adjacent spaces, all but the first are taken literally.
However, if the -w option is used, then spaces are ignored except
where they serve to separate two identifiers.
- NewLine
-
The end of a line denotes the end of a rule or immediate action.
- ;
-
The semicolon is used to separate multiple rules on the same line, and
to separate arguments of function calls.
- @name{args}
-
In an action, this notation is used to either call a built-in function
or to translate the argument using the rules of the named domain.
The name may be empty to denote the default domain.
It is permissible to reference a domain name that is defined later in the
file.
The braces may be omitted for functions that take no arguments.
- @spchar
-
When followed by a special character (i.e. not a letter or digit),
the “@” indicates that the following character has its default
meaning, as documented in this list.
This can be used to access the original functionality of a
character that has been changed by the
-literal option or @set-syntax function.
For example, if you had done “-literal /” and then discovered that
you do need to use a regular expression, you could write it as
“@/regexp@/”.
- :
-
The characters to the left of the colon (with any leading and trailing
spaces and surrounding angle brackets removed) constitute the name of
the domain in which the rules that follow on the same line will be defined.
- ::
-
A double colon specifies that the domain whose name appears to the left
inherits from the domain whose name appears to the right.
- !
-
Comment - the rest of the line is ignored. This can either appear at
the beginning of a line to cause the whole line to be ignored, or it can
be used at the end of a rule so that the remainder of the line is a comment.
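The greedy behavior of regular expression arguments, described under “/regexp/” above, can be sketched in Python. This is only an emulation under stated assumptions, not gema's implementation: it assumes the regular expression argument is matched on its own, with no backtracking once the rest of the template is examined, and the two helper functions are hypothetical names for the templates “a/[a-z]*/x” and “a<l>x”.

```python
import re

def match_regex_argument_then_literal(text):
    """Emulate template a/[a-z]*/x: the argument takes all it can, then 'x' must follow."""
    if not text.startswith("a"):
        return None
    # Greedy match, decided without regard to what follows in the template.
    arg = re.match(r"[a-z]*", text[1:])
    rest = text[1 + arg.end():]
    return arg.group(0) if rest.startswith("x") else None

def match_letters_until_literal(text):
    """Emulate template a<l>x: letters are accepted until the literal 'x' is found."""
    if not text.startswith("a"):
        return None
    i = 1
    while i < len(text) and text[i] != "x" and text[i].isalpha():
        i += 1
    return text[1:i] if text.startswith("x", i) else None

# The regular expression swallows the final "x", so the literal never matches.
regex_result = match_regex_argument_then_literal("abcx")
# The recognizer argument stops at the "x", leaving "bc" as its value.
letters_result = match_letters_until_literal("abcx")
```

With the input “abcx”, the first function returns None and the second returns “bc”, which mirrors the contrast drawn in the text.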
3.2 Escape Sequences
The backslash character denotes special handling for the character that
follows it.
- When followed by a lower-case letter or a digit, it
represents a particular control character
or a character constructed from its code.
- When followed by an upper-case letter, it is a pattern match operator.
- A backslash at the end of
a line designates continuation by causing the newline to be ignored
along with any leading white space on the following line.
- Before any other character, the backslash quotes the character so that
it simply represents itself. In particular, a literal backslash is
represented by two backslashes.
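The four cases above can be summarized as a small classifier. The Python sketch below is illustrative only: the function name and the returned labels are hypothetical, it assumes ASCII letters, and it does not check whether a particular sequence is actually defined.

```python
import string

def classify_backslash(following):
    """Classify a backslash by the single character after it.

    An empty string or a newline stands for a backslash at the end of a line.
    """
    if following in ("", "\n"):
        return "continuation"      # newline and following indentation are ignored
    if following in string.ascii_lowercase or following in string.digits:
        return "character"         # control character or character code
    if following in string.ascii_uppercase:
        return "operator"          # pattern match operator
    return "quoted"                # the character simply represents itself

kinds = [classify_backslash(c) for c in ["n", "1", "G", "\n", "\\", "*"]]
```

For these six inputs the result is character, character, operator, continuation, quoted, quoted.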
Following are the defined escape sequences:
- \a
-
Alert (a.k.a. bell) character
- \b
-
Backspace character
- \cx
-
Control key combined with the following character.
For example, “\ci”, “\cI”, “^i”, “^I”,
and “\t” all have the same effect, namely to represent the
ASCII Tab character.
- \d
-
Delete character
- \e
-
Escape character (i.e. ESC, not backslash)
- \f
-
Form feed character
- \i
-
shift In control character
- \n
-
New line character
- \o
-
shift Out control character
- \r
-
carriage Return character
- \s
-
Space character
- \t
-
horizontal Tab character
- \uxxxx
-
Unicode character specified by its hexadecimal code. It takes anywhere
from one to eight hexadecimal digits, which may optionally be enclosed in
braces to avoid ambiguity: \u{xxxx}
- \v
-
Vertical tab character
- \xxx
-
character specified by its two-digit heXadecimal code.
Alternatively, \x{xxxx} means the same as
\u{xxxx}.
- \digits
-
character specified by its octal code
- \A
-
matches the beginning of the input data, either the beginning of a file
or the beginning of the argument for a domain used as a function.
- \B
-
matches the Beginning of file.
This can be used either by itself to specify actions to be taken before
beginning to read the file, or it can be used at the beginning of a
template that is to match only on the first line of the file.
- \C
-
this causes Case-insensitive comparison for letters in the rest of the
template. (See also the -i option which selects case-insensitive
mode globally.)
- \E
-
matches the End of file.
- \G
-
Goal point.
This can be used in a template to indicate the end of the literal string
that is used to recognize the end of the preceding argument.
For example, if the template “a(<T>) done” is applied to the input
data “a(x) b(y) done”, the argument “<T>” will match on
the text “x) b(y”, which is probably not what was desired.
If the template is written as “a(<T>)\G done” then the argument
will be terminated by the first right parenthesis, and then the match will
fail if the text following the parenthesis doesn't match “ done”.
This does not yet work for “*” arguments.
If “\G” immediately follows a recursive argument, then there is no
delimiter, and the argument will continue to accept characters until it
stops itself by executing @end or @terminate.
- \I
-
Identifier separator. In a template, this matches an empty string if it is
not within an identifier. In other words, the template matches only if at
least one of the adjacent characters is not an identifier constituent.
In an action, this outputs a space character if the last character
output is an identifier constituent.
By default, an identifier constituent is a letter, digit, or underscore,
but this can be modified by the -idchars option or
@defset{I;...} function.
- \J
-
Join - locally counteracts the -w and/or -t
option by saying that spaces in the input will not be ignored at this
position, and an identifier delimiter is not required here.
If neither of these options is being used, then it has no effect.
Not meaningful in an action.
- \L
-
Line mode - arguments that follow in the same template are not allowed
to cross line boundaries.
This also means that “\S” and “\W” will not accept
newline characters. However, a line boundary can still be crossed by an
explicit “\n” or “\N”.
- \N
-
New line boundary.
In a template, this matches an empty string if it is at either the
beginning of a line or the end of a line (either before or after a new
line character, or at the beginning or end of the file or data stream).
In an action, it outputs a new line character if the last character
output is not a new line.
- \P
-
Position - if the template matches, the input stream will be left at
this position. Thus everything following this is a look-ahead, and will
be re-read for subsequent pattern matches.
- \S
-
Space. In a template, this matches one or more whitespace characters.
(See also “<S>” which has the same effect except that the
spaces are remembered as an argument value.)
In an action, it outputs one space character if the last character
output is not a whitespace character.
- \W
-
optional Whitespace. In a template, this specifies that any whitespace
characters in the input stream at this point will be skipped over.
(See also “<s>” which has the same effect except that the
spaces are remembered as an argument value.)
However, if this is followed in the template by a literal whitespace
character, then that character will not be skipped. For example, in
“\W\n”, the “\W” will skip any whitespace other than a
newline.
This has no effect in an action.
See also the -w option which ignores spaces everywhere.
- \X
-
word separator. In a template, this matches an empty string if it is
not within a word. In this context, a word consists of letters and digits.
In an action, “\X” outputs a space character if the last character
output is a letter or digit.
- \Z
-
matches the end of the input data, either the end of a file
or the end of the argument for a domain used as a function,
or a look-ahead match of the terminating string for a recursive argument.
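The effect of the “\G” goal point described above can be sketched in Python using the example from that entry. This is a simplified emulation, not gema's matching algorithm: it assumes the argument accepts any character, treats the space in the template as a single literal space, and uses a hypothetical helper name. The delimiter is the literal text up to the goal point; anything after the goal point is checked only once the argument has already ended.

```python
def match_argument(text, start, delimiter, rest=""):
    """Return the argument value beginning at `start`, or None if the match fails.

    The argument ends at the first occurrence of `delimiter`. The string `rest`
    (the part of the template after the goal point) must then follow immediately.
    """
    end = text.find(delimiter, start)
    if end < 0:
        return None
    if not text.startswith(rest, end + len(delimiter)):
        return None
    return text[start:end]

data = "a(x) b(y) done"

# Template "a(<T>) done": the whole literal ") done" terminates the argument.
without_goal = match_argument(data, 2, ") done")

# Template "a(<T>)\G done": only ")" terminates it, then " done" must follow.
with_goal = match_argument(data, 2, ")", " done")
```

Here `without_goal` is “x) b(y”, while `with_goal` is None because the text after the first right parenthesis is “ b(y) done”, not “ done”.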
3.3 Recognizer arguments
The following argument designators, consisting of a single letter between
angle brackets, can be used in templates to match on
various kinds of characters. Preceding the letter with
“-”
inverts the test. The argument requires at least one matching character
if the letter is uppercase, or is optional if the letter is lowercase.
The letter may be followed by a number to match on that many
characters, or up to that maximum for an optional argument. If the
number is 0,
the argument matches if the next character is of the
indicated kind, but the input stream is not advanced past it; in other
words, this acts as a one-character look-ahead.
If the argument is followed in the template by literal characters, then
the argument will be terminated when that literal string is matched,
even if those characters would otherwise qualify for inclusion in the
argument.
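As a rough guide to the letter case, count, and inversion rules above, the following Python sketch translates a designator into an approximately equivalent regular expression. It is an approximation under assumptions: the function name is hypothetical, the character class is supplied by the caller, and the rule about termination by following literal characters is not modeled.

```python
import re

def recognizer_to_regex(spec, char_class):
    """Translate a designator such as 'D3' or '-d' into a regex string.

    `spec` is the text between the angle brackets; `char_class` is the body
    of a regex character class, for example '0-9' for digits.
    """
    negate = spec.startswith("-")
    if negate:
        spec = spec[1:]
    letter, count = spec[0], spec[1:]
    cls = "[" + ("^" if negate else "") + char_class + "]"
    if count == "":
        # Uppercase requires at least one character; lowercase is optional.
        return cls + ("+" if letter.isupper() else "*")
    n = int(count)
    if n == 0:
        return "(?=" + cls + ")"   # one-character look-ahead, nothing consumed
    # Uppercase matches exactly n characters; lowercase matches up to n.
    return cls + ("{%d}" % n if letter.isupper() else "{0,%d}" % n)

three_digits = re.match(recognizer_to_regex("D3", "0-9"), "12345").group(0)
```

For example, “D” with digits becomes `[0-9]+`, “d3” becomes `[0-9]{0,3}`, “D0” becomes the look-ahead `(?=[0-9])`, and “-D” becomes `[^0-9]+`.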
While gema supports the full Unicode character set (beginning in
version 2.0),
character classifications (such as what is a letter) are implemented by C
standard library functions which by default use the “C” locale,
meaning that matches are limited to the ASCII character set.
This may be altered by use
of the @set-locale function, but that behavior is platform dependent.
Users may specify their own extensions to any of these recognizers by using
the @defset function (new in version 2.0).
- <A>
-
Alphanumeric (letters and digits)
(according to C function iswalnum)
- <B>
-
user defined; accepts the characters specified by @defset{B;characters}
- <C>
-
Control characters
(according to C function iswcntrl)
- <D>
-
Digits
- <E>
-
Emojis, emoticons, pictographs, arrows, and other geometric shapes.
These are Unicode graphic characters which are pictures
instead of language elements or conventional symbols.
(new in version 2.0)
- <F>
-
File pathname. See the -filechars option.
- <G>
-
Graphic characters, i.e. any non-space printable character
(according to C function iswgraph)
- <H>
-
Han characters, also known as Chinese, Kanji, or CJK (new in version 2.0)
- <I>
-
Identifier. By default, an identifier consists of letters, digits, and
underscores. See the -idchars option.
- <J>
-
lower case letters (in version 1.2 or later)
(according to C function iswlower)
- <K>
-
upper case letters (in version 1.2 or later)
(according to C function iswupper)
- <L>
-
Letters (either upper or lower case) (according to C function iswalpha)
- <M>
-
Mathematical operator symbols (new in version 2.0)
- <N>
-
Number, i.e. digits with optional sign and decimal point
- <O>
-
Octal digits
- <P>
-
Printing characters, including space
(according to C function iswprint)
- <Q>
-
user defined; accepts the characters specified by @defset{Q;characters}
- <R>
-
user defined; accepts the characters specified by @defset{R;characters}
- <S>
-
white Space characters (space, tab, newline, FF, VT)
(according to C function iswspace)
- <T>
-
Text characters, including all printing characters (iswprint) and
white space (iswspace)
- <U>
-
Universal (matches anything except end-of-file)
- <V>
-
Valid Unicode characters;
excludes control characters (other than white space),
local use codes, unpaired surrogates,
out-of-range values, unallocated blocks,
and explicit not-a-character codes.
(new in version 2.0)
- <W>
-
Word (letters, apostrophe, and hyphen)
- <X>
-
hexadecimal digits
- <Y>
-
punctuation (graphic characters which are conventionally used in text but
are not numbers or identifier constituents)
(according to C function iswpunct, supplemented by
@defset{Y;\u2010-\u2027\u2030-\u205E\u27E6-\u27EF\u2E00-\u2E7F} )
- <Z>
-
user defined; accepts the characters specified by @defset{Z;characters}