Gema Notation

3 Notation

In the documentation, quotation marks are often used to visually delimit examples that are embedded in the text. The quotation marks are never part of the example. Letters in italics are descriptive place-holders, not literal characters.

3.1 Special characters

By default, the following characters have special meaning in patterns. Note that all of these can be changed through the use of the @set-syntax function. See also the -literal and -ml options.

*: In a template, this denotes a wild-card argument that matches any number of characters, from zero up to a maximum of 4096, or as specified by the -arglen option. (Some limit is needed for efficiency to avoid reading all the way to the end of the file before concluding that the match has failed.) Characters are copied from the input stream into the argument value until a match is found for the entire remainder of the template. Thus, when a template has two or more wild-card arguments, the input text is divided among them as necessary for the complete template to be matched. (By contrast, a “<U>” argument is similar except that it terminates when a match is found for whatever sequence of literal characters follows it, up until the next argument.) If the -line option is in effect or if “\L” appeared earlier in the template, then it will not accept a newline character.
In an action, it denotes the value of the corresponding template argument.
?: Wild-card argument that matches any one character. If the -line option is in effect or if “\L” appeared earlier in the template, then it will not accept a newline character.
#: Recursive argument. In a template, this denotes an argument whose value is obtained by translating the input text in the same domain as the current rule until a match is found for whatever sequence of literal characters follows the argument (up to the next argument, or the end of the template, or “\G”).
In an action, it denotes the value of the corresponding template argument.
<name>: Recursive argument, translated according to the named domain, or a pre-defined recognizer argument. The name may be empty to denote the default domain. The name does not have to have been defined before it is referenced. This can be used only in a template. If the -ml option (new in version 1.4) is in effect, the syntax is instead: [name]
/regexp/: In a template, this denotes an argument where the characters between the slashes are used as a regular expression, and the argument value is however much text it matches. If the -ml option is in effect, a vertical bar is used instead of a slash. Regular expressions have been documented in many other places, so they will not be detailed here. Suffice it to say that the following characters and combinations have special meaning:
. \ [ ] * + ^ $ \< \>
A slash that is to be part of the regular expression needs to be preceded by a backslash. The characters between the slashes are taken literally instead of being evaluated according to the usual gema meaning except that \x and \u escape sequences may be used. Regular expression arguments never cross line boundaries. Unlike other kinds of arguments, they will match as many characters as they can, without regard to whatever follows in the template. For example, the template “a/[a-z]*/x” will never match anything because if there is an ending “x”, it will be swallowed by the argument; however, in the template “a<l>x” the argument will match on any letter except “x”.
=: This designates the end of a template and the beginning of the corresponding action.
$0: This can be used in an action to copy the matched text to the output. The template is evaluated as though it were an action, with each argument designator being replaced by the actual argument value. Note that this does not necessarily exactly duplicate the input text since any ignored whitespace will be lost and recursive arguments are shown in their translated form.
$digit or ${digits}: In either a template or action, this represents the value of the numbered argument. The argument number must be enclosed in braces if it needs more than one digit. In a template, this obviously can only refer to a preceding argument, and in the current implementation, the value of a “*” argument cannot be accessed within the same template.
$letter: In either a template or action, this inserts the value of a variable whose name is a single letter. An error is reported if the variable is not defined. In this context only, a letter can actually be any Unicode character which doesn't have some other syntactic role.
${name}: In an action, this outputs the value of the named variable. The name must not begin with a digit. An error is reported if the variable is not defined.
${name;default}: In an action, this outputs the value of the named variable, if it is defined, or evaluates the default action if the variable is not defined.
\: Escape character; see the section on “escape sequences” below.
^: Control key. Together with the following character, this represents the control character formed by combining the Control key with the character. For example, either “^J” or “^j” could be used to denote the ASCII Line Feed character. This notation is not meaningful if the character set being used is not based on ASCII.
Space: In a template, a space character matches one or more whitespace characters in the input, the same as “\S”. (In the less likely event that you really want to match exactly one space character, you can use “\ ” or “\s”.) In an action, a space character causes one space to be output if the last character output was not a whitespace character, except that if there are multiple adjacent spaces, all but the first are taken literally. However, if the -w option is used, then spaces are ignored except where they serve to separate two identifiers.
NewLine: The end of a line denotes the end of a rule or immediate action.
;: The semicolon is used to separate multiple rules on the same line, and to separate arguments of function calls.
@name{args}: In an action, this notation is used to either call a built-in function or to translate the argument using the rules of the named domain. The name may be empty to denote the default domain. It is permissible to reference a domain name that is defined later in the file. The braces may be optionally omitted for functions that take no arguments.
@spchar: When followed by a special character (i.e. not a letter or digit), the “@” indicates that the following character has its default meaning, as documented in this list. This can be used to access the original functionality of a character that has been changed by the -literal option or @set-syntax function. For example, if you had done “-literal /” and then discovered that you do need to use a regular expression, you could write it as “@/regexp@/”.
:: The characters to the left of the colon (with any leading and trailing spaces and surrounding angle brackets removed) constitute the name of the domain in which the rules that follow on the same line will be defined.
::: A double colon specifies that the domain whose name appears to the left inherits from the domain whose name appears to the right.
!: Comment - the rest of the line is ignored. This can either appear at the beginning of a line to cause the whole line to be ignored, or it can be used at the end of a rule so that the remainder of the line is a comment.
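
The way “*” arguments divide the input among themselves can be modelled outside gema. The following Python sketch is an illustration under stated assumptions, not gema's implementation: it handles templates made only of literal characters and “*”, anchors the match at the start of the text, and maps each “*” to a lazy, length-limited regular expression group. The names ARGLEN, template_to_regex, and match_template are hypothetical.

```python
import re

ARGLEN = 4096  # default maximum length of a "*" argument, as stated above


def template_to_regex(template, line_mode=False):
    # Each "*" becomes a lazy group that grows one character at a time
    # until the remainder of the template can be matched.
    parts = []
    for ch in template:
        if ch == "*":
            parts.append("(.{0,%d}?)" % ARGLEN)
        else:
            parts.append(re.escape(ch))
    # Assumption: line mode (-line or "\L") is modelled by not letting
    # "." match a newline character.
    flags = 0 if line_mode else re.DOTALL
    return re.compile("".join(parts), flags)


def match_template(template, text, line_mode=False):
    # Return the list of argument values, or None if the template fails.
    m = template_to_regex(template, line_mode).match(text)
    return list(m.groups()) if m else None
```

Under this model, match_template("*-*-end", "a-b-c-end") yields the two argument values “a” and “b-c”: the first argument stops as soon as the rest of the template can match, and the second keeps growing until the complete literal “-end” is found.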

3.2 Escape Sequences

The backslash character denotes special handling for the character that follows it.

When followed by a lower-case letter or a digit, it represents a particular control character or a character constructed from its code.
When followed by an upper-case letter, it is a pattern match operator.
A backslash at the end of a line designates continuation by causing the newline to be ignored along with any leading white space on the following line.
Before any other character, the backslash quotes the character so that it simply represents itself. In particular, a literal backslash is represented by two backslashes.
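
The four rules above amount to a simple classification of the character that follows the backslash. The Python sketch below restates them; the function name classify_backslash and the returned labels are hypothetical, and only ASCII letters and digits are considered.

```python
import string


def classify_backslash(next_char):
    # Decide which of the four rules applies to the single character
    # that follows a backslash.
    if next_char == "\n":
        return "continuation"   # backslash at the end of a line
    if next_char in string.ascii_lowercase or next_char in string.digits:
        return "character"      # control character or character code
    if next_char in string.ascii_uppercase:
        return "operator"       # pattern match operator
    return "quoted"             # the character simply represents itself
```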

Following are the defined escape sequences:

\a: Alert (a.k.a. bell) character
\b: Backspace character
\cx: Control key combined with the following character. For example, “\ci”, “\cI”, “^i”, “^I”, and “\t” all have the same effect, namely to represent the ASCII Tab character.
\d: Delete character
\e: Escape character (i.e. ESC, not backslash)
\f: Form feed character
\i: shift In control character
\n: New line character
\o: shift Out control character
\r: carriage Return character
\s: Space character
\t: horizontal Tab character
\uxxxx: Unicode character specified by its hexadecimal code. It takes anywhere from one to eight hexadecimal digits, which may optionally be enclosed in braces to avoid ambiguity: \u{xxxx}
\v: Vertical tab character
\xxx: character specified by its two-digit heXadecimal code. Alternatively, \x{xxxx} means the same as \u{xxxx}.
\digits: character specified by its octal code
\A: matches the beginning of the input data, either the beginning of a file or the beginning of the argument for a domain used as a function.
\B: matches the Beginning of file. This can be used either by itself to specify actions to be taken before beginning to read the file, or it can be used at the beginning of a template that is to match only on the first line of the file.
\C: this causes Case-insensitive comparison for letters in the rest of the template. (See also the -i option which selects case-insensitive mode globally.)
\E: matches the End of file.
\G: Goal point. This can be used in a template to indicate the end of the literal string that is used to recognize the end of the preceding argument. For example, if the template “a(<T>) done” is applied to the input data “a(x) b(y) done”, the argument “<T>” will match on the text “x) b(y”, which is probably not what was desired. If the template is written as “a(<T>)\G done” then the argument will be terminated by the first right parenthesis, and then the match will fail if the text following the parenthesis doesn't match “ done”. This does not yet work for “*” arguments.
If “\G” immediately follows a recursive argument, then there is no delimiter, and the argument will continue to accept characters until it stops itself by executing @end or @terminate.
\I: Identifier separator. In a template, this matches an empty string if it is not within an identifier. In other words, it requires either of the adjacent characters to not be an identifier constituent in order for the template to match. In an action, this outputs a space character if the last character output is an identifier constituent. By default, an identifier constituent is a letter, digit, or underscore, but this can be modified by the -idchars option or @defset{I;...} function.
\J: Join - locally counteracts the -w and/or -t option by saying that spaces in the input will not be ignored at this position, and an identifier delimiter is not required here. If neither of these options is being used, then it has no effect. Not meaningful in an action.
\L: Line mode - arguments that follow in the same template are not allowed to cross line boundaries. This also means that “\S” and “\W” will not accept newline characters. However, a line boundary can still be crossed by an explicit “\n” or “\N”.
\N: New line boundary. In a template, this matches an empty string if it is at either the beginning of a line or the end of a line (either before or after a new line character, or at the beginning or end of the file or data stream). In an action, it outputs a new line character if the last character output is not a new line.
\P: Position - if the template matches, the input stream will be left at this position. Thus everything following this is a look-ahead, and will be re-read for subsequent pattern matches.
\S: Space. In a template, this matches one or more whitespace characters. (See also “<S>” which has the same effect except that the spaces are remembered as an argument value.) In an action, it outputs one space character if the last character output is not a whitespace character.
\W: optional Whitespace. In a template, this specifies that any whitespace characters in the input stream at this point will be skipped over. (See also “<s>” which has the same effect except that the spaces are remembered as an argument value.) However, if this is followed in the template by a literal whitespace character, then that character will not be skipped. For example, in “\W\n”, the “\W” will skip any whitespace other than a newline. This has no effect in an action. See also the -w option which ignores spaces everywhere.
\X: word separator. In a template, this matches an empty string if it is not within a word. In this context, a word consists of letters and digits. In an action, “\X” outputs a space character if the last character output is a letter or digit.
\Z: matches the end of the input data, either the end of a file or the end of the argument for a domain used as a function, or a look-ahead match of the terminating string for a recursive argument.
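
The character-valued escape sequences in this list can be checked against their ASCII codes with a small decoder. This Python sketch is not part of gema; the names NAMED, control, decode, and decode_caret are hypothetical. It assumes that the control notation follows the conventional ASCII rule of keeping only the low five bits of the character code, and that Shift In and Shift Out are the ASCII codes SI (0x0F) and SO (0x0E).

```python
# Code points for the single-letter escapes listed above.
NAMED = {
    "a": 0x07, "b": 0x08, "d": 0x7F, "e": 0x1B, "f": 0x0C, "i": 0x0F,
    "n": 0x0A, "o": 0x0E, "r": 0x0D, "s": 0x20, "t": 0x09, "v": 0x0B,
}


def control(ch):
    # Conventional ASCII control mapping (an assumption about gema):
    # keep the low five bits, so "i" and "I" both give Tab.
    return chr(ord(ch) & 0x1F)


def decode(seq):
    # Decode one escape sequence given as a string that starts with
    # a backslash, such as r"\t", r"\ci", r"\x41" or r"\u{1F600}".
    body = seq[1:]
    if body[0] == "c":
        return control(body[1])
    if body[0] in "xu":
        return chr(int(body[1:].strip("{}"), 16))   # hexadecimal code
    if body[0].isdigit():
        return chr(int(body, 8))                    # octal code
    return chr(NAMED[body])


def decode_caret(seq):
    # Decode the caret notation, such as "^J" or "^j".
    return control(seq[1])
```

With these definitions, decode(r"\ci"), decode(r"\cI"), decode_caret("^i"), decode_caret("^I"), and decode(r"\t") all return the same Tab character, matching the equivalence stated in the “\cx” entry.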

3.3 Recognizer arguments

The following argument designators, consisting of a single letter between angle brackets, can be used in templates to match on various kinds of characters. Preceding the letter with “-” inverts the test. The argument requires at least one matching character if the letter is uppercase, or is optional if the letter is lowercase. The letter may be followed by a number to match on that many characters, or up to that maximum for an optional argument. If the number is 0, the argument matches if the next character is of the indicated kind, but the input stream is not advanced past it; in other words, this acts as a one-character look-ahead.
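
The parts of a recognizer designator described above can be pulled apart mechanically. The following Python sketch is only an illustration; the function parse_recognizer and the keys of the dictionary it returns are hypothetical names, not anything defined by gema.

```python
import re

# "<", optional "-", one letter, optional number, ">"
_DESIGNATOR = re.compile(r"<(-?)([A-Za-z])(\d*)>")


def parse_recognizer(designator):
    # Return the properties of a designator such as "<D>" or "<-l3>",
    # or None if the text is not a single-letter recognizer.
    m = _DESIGNATOR.fullmatch(designator)
    if m is None:
        return None
    invert, letter, number = m.groups()
    count = int(number) if number else None
    return {
        "kind": letter.upper(),          # which class of characters
        "inverted": invert == "-",       # "-" inverts the test
        "required": letter.isupper(),    # uppercase needs at least one match
        "count": count,                  # exact count, or maximum if optional
        "lookahead": count == 0,         # 0 means one-character look-ahead
    }
```

For example, parse_recognizer("<-l3>") describes an inverted, optional letter argument of up to three characters, while a longer name such as “<name>” is rejected because it refers to a domain rather than a recognizer.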

If the argument is followed in the template by literal characters, then the argument will be terminated when that literal string is matched, even if those characters would otherwise qualify for inclusion in the argument.

While gema supports the full Unicode character set (beginning in version 2.0), character classifications (such as what is a letter) are implemented by C standard library functions which by default use the “C” locale, meaning that matches are limited to the ASCII character set. This may be altered by use of the @set-locale function, but that behavior is platform dependent. Users may specify their own extensions to any of these recognizers by using the @defset function (new in version 2.0).

<A>: Alphanumeric (letters and digits) (according to C function iswalnum)
<B>: user defined; accepts the characters specified by @defset{B;characters}
<C>: Control characters (according to C function iswcntrl)
<D>: Digits
<E>: Emojis, emoticons, pictographs, arrows, and other geometric shapes. These are Unicode graphic characters which are pictures instead of language elements or conventional symbols. (new in version 2.0)
<F>: File pathname. See the -filechars option.
<G>: Graphic characters, i.e. any non-space printable character (according to C function iswgraph)
<H>: Han characters, also known as Chinese, Kanji, or CJK (new in version 2.0)
<I>: Identifier. By default, an identifier consists of letters, digits, and underscores. See the -idchars option.
<J>: lower case letters (in version 1.2 or later) (according to C function iswlower)
<K>: upper case letters (in version 1.2 or later) (according to C function iswupper)
<L>: Letters (either upper or lower case) (according to C function iswalpha)
<M>: Mathematical operator symbols (new in version 2.0)
<N>: Number, i.e. digits with optional sign and decimal point
<O>: Octal digits
<P>: Printing characters, including space (according to C function iswprint)
<Q>: user defined; accepts the characters specified by @defset{Q;characters}
<R>: user defined; accepts the characters specified by @defset{R;characters}
<S>: white Space characters (space, tab, newline, FF, VT) (according to C function iswspace)
<T>: Text characters, including all printing characters (iswprint) and white space (iswspace)
<U>: Universal (matches anything except end-of-file)
<V>: Valid Unicode characters; excludes control characters (other than white space), local use codes, unpaired surrogates, out-of-range values, unallocated blocks, and explicit not-a-character codes. (new in version 2.0)
<W>: Word (letters, apostrophe, and hyphen)
<X>: hexadecimal digits
<Y>: punctuation (graphic characters which are conventionally used in text but are not numbers or identifier constituents) (according to C function iswpunct, supplemented by @defset{Y;\u2010-\u2027\u2030-\u205E\u27E6-\u27EF\u2E00-\u2E7F})
<Z>: user defined; accepts the characters specified by @defset{Z;characters}
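
A few of these recognizers can be approximated with ASCII character sets, which corresponds to the default “C” locale described above. The Python sketch below is a rough model rather than gema's implementation: SETS and take are hypothetical names, only some of the letters are covered, and inversion and counts are left out. It also models the rule that an argument ends when the literal text that follows it in the template is matched.

```python
import string

# ASCII approximations of some recognizers (default "C" locale assumed).
SETS = {
    "A": string.ascii_letters + string.digits,
    "D": string.digits,
    "I": string.ascii_letters + string.digits + "_",
    "J": string.ascii_lowercase,
    "K": string.ascii_uppercase,
    "L": string.ascii_letters,
    "O": string.octdigits,
    "S": string.whitespace,
    "W": string.ascii_letters + "'-",
    "X": string.hexdigits,
}


def take(kind, text, stop=""):
    # Consume the leading run of characters of the given kind, ending
    # early if the literal text "stop" is found at the current position.
    allowed = SETS[kind.upper()]
    n = 0
    while n < len(text) and text[n] in allowed:
        if stop and text.startswith(stop, n):
            break
        n += 1
    if kind.isupper() and n == 0:
        return None   # an uppercase letter requires at least one character
    return text[:n]
```

In this model, take("l", "bcx", stop="x") returns “bc”, which mirrors the earlier remark that in the template “a<l>x” the argument matches any letter except “x”.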
