[Previous: Introduction] [Table of Contents] [Next: Notation]

2 Operational Overview

The program operates on files which (as in the C programming language) are a stream of characters, with lines separated by a new line character, which is denoted as “\n”. Beginning with version 2.0, the full range of Unicode characters is supported, and files may be in 8-bit, UTF-8, or UTF-16 format. In this document, the word character should be understood to mean the same as what the Unicode standard calls a code point. (The distinction is that two or more code points might combine into a single image in a visual rendering.)

The general model of operation is that the program reads an input file and writes an output file which consists of the input data transformed in accordance with a set of transformation rules provided by the user. A rule consists of a template and an action. The template is a pattern which the program will attempt to match with the input data. Any input text that matches a template pattern will be replaced by the result of evaluating the rule's action.

There may be multiple sets of rules, where each set of rules is called a domain. At any given time, translation is controlled by the rules of one particular domain, but both templates and actions are able to switch to a different domain for processing particular portions of the data. A domain can inherit from another domain, meaning that if no match is found for the current input text in any of the rules for the current domain, then the rules of the inherited domain will be tried.

Processing of a file begins using the default domain, whose name is the empty string. First, if there is a rule with template “\B” (beginning of file) or “\A” (beginning of data), then its action is performed. Then the program begins reading the file. For each character position in the file, the program attempts to find a rule in the current domain whose template pattern matches the input text beginning at that point.
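The lookup performed at a single input position, including the fall-through to an inherited domain, can be sketched in Python as follows. This is only a toy model and not gema's implementation: the names `Domain` and `find_rule` are hypothetical, templates are assumed to be plain literal strings rather than real gema patterns, and rules are assumed to be tried strictly in definition order.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Domain:
    """Toy stand-in for a domain: an ordered list of rules plus an optional parent."""
    name: str
    rules: list = field(default_factory=list)  # (literal_template, replacement) pairs
    parent: Optional["Domain"] = None          # inherited domain, if any


def find_rule(domain, text, pos):
    """Return the first (template, replacement) whose template matches at pos.

    Rules of the given domain are tried in definition order; if none of
    them match, the inherited domain is consulted. Returns None when no
    rule matches anywhere in the inheritance chain.
    """
    current = domain
    while current is not None:
        for template, replacement in current.rules:
            if template and text.startswith(template, pos):
                return template, replacement
        current = current.parent
    return None


# Hypothetical rule sets: the default domain (empty name) inherits from "base".
base = Domain("base", [("cat", "feline")])
default = Domain("", [("dog", "canine")], parent=base)

print(find_rule(default, "dog and cat", 0))  # ('dog', 'canine')
print(find_rule(default, "dog and cat", 8))  # ('cat', 'feline'), found via inheritance
print(find_rule(default, "dog and cat", 3))  # None
```

What happens after a successful or failed lookup is described in the text that follows.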
If a match is found, then the input stream is advanced to the end of the matched text and the rule's action is executed. When no template matches the current position, the current character is copied to the output file (unless the -match option is being used), the input stream is advanced to the next character, and the program tries to find a template matching the text starting at that position. When the end of the input file is reached, if there are any rules with template “\E” (end of file) or “\Z” (end of data), their actions will be executed, and then the files will be closed.

However, if a template matches without advancing the input stream (for example, if it begins with “\P”), then after executing its action, the search continues as though it had not matched. This is necessary to avoid hanging in a loop repeating the same match forever.

A rule may have an empty action, with the effect that the matching text is simply discarded. In each domain there may be at most one rule with an empty template, which signifies a default action to be taken when no other rule matches. However, since an empty template does not cause the input stream to be advanced and there are no more rules to try, this is only meaningful if the corresponding action exits the current context by using one of @end, @terminate, @fail, or @abort.

Generally speaking, while looking for a match, the rules within a domain will conceptually be tried in the same order in which they were defined, so wherever there might be ambiguities, the user should define the rules for preferred special cases before the rules for default general cases. However, there are some important exceptions:
Rules can be defined either as arguments of the -p command line option or in pattern files loaded by the -f option. Each line in a pattern file can be one of the following:
A template may contain any of the following:
An action may contain any of the following:
A notable difference from traditional macro processors is that the text resulting from an action is not automatically re-scanned to look for matches on the result. This has seemed to be more useful since many of the typical uses of gema involve translating from one language or representation to another, so the rules that apply to the input language are not relevant to the output language. Where desired, rescanning can be explicitly invoked as a part of the action by using the notation “@domain{text}” to re-process the constructed text with the rules of the specified domain. Also, the use of recursive arguments will often avoid a need for rescanning.
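The difference between single-pass translation and explicit rescanning can be illustrated with a small Python model. It is a sketch under simplifying assumptions, not a description of how gema is implemented: the function `translate` is hypothetical, templates are plain literal strings, and an action is modelled as a Python function that returns the replacement text.

```python
def translate(text, rules):
    """Toy single-pass translator.

    rules is an ordered list of (literal_template, action) pairs, where
    action is a function taking the matched text and returning its
    replacement. Text produced by an action is appended to the output
    as-is and is never scanned again.
    """
    out = []
    pos = 0
    while pos < len(text):
        for template, action in rules:
            if template and text.startswith(template, pos):
                out.append(action(template))  # the result is not rescanned
                pos += len(template)          # advance past the matched text
                break
        else:
            out.append(text[pos])             # no match: copy one character through
            pos += 1
    return "".join(out)


# Without rescanning, the "b" produced for "a" is not translated again.
rules = [
    ("a", lambda matched: "b"),
    ("b", lambda matched: "c"),
]
print(translate("aab", rules))       # bbc

# An action can ask for rescanning explicitly by translating its own
# result, which is similar in spirit to the "@domain{text}" notation.
rescanning = [
    ("a", lambda matched: translate("b", rules)),
    ("b", lambda matched: "c"),
]
print(translate("aab", rescanning))  # ccc
```

In this model a traditional macro processor would correspond to always applying `translate` to every action result, whereas here the rule author opts in only where it is wanted.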