Gema Customized command-line processing


5 Customized command-line processing

Part of the intent of gema is that it can be used as a means of implementing more specialized tools. A utility program is defined by the command line arguments that it uses as well as by how it processes its input files. Therefore, gema provides a way to customize the handling of command line arguments.

The main program of gema does only some initialization and then processes the command line arguments by translating them with a set of built-in patterns. The rules that handle the command line arguments belong to a domain named “ARGV”. The user is free to add rules to this domain, thereby implementing new command line options, or even to undefine existing rules. In the input stream that is translated by the ARGV domain, the command line arguments are separated by newline characters.[Footnote 1] The actions for the ARGV rules are expected to do all their work through side effects and not to return any value. Any value that is returned by the translation (other than the delimiting newlines) is reported by the main program as an undefined argument.

The complete set of built-in ARGV rules can be seen by looking at the source file “gema.c” in the variable argv_rules. Here are a few representative examples:

  ARGV:\N-idchars\n*\n=@set-parm{idchars;$1}
  ARGV:\N-literal\n*\n=@set-syntax{L;$1}
  ARGV:\N-p\n*\n=@define{*}
  ARGV:\N\L*\=*\n=@define{$0}
  ARGV:\N-odir\n*\n=@set{.ODIR;*}
  ARGV:\N-<L1>\n=@set-switch{$1;1}
  ARGV:\N-*\n=@err{Unrecognized option\:\ "-*"\n}@exit-status{3}
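
As a rough illustration of the mechanism described above, the following Python sketch models how an argument stream could be processed. It is not gema's implementation: the regular expressions are hand-written stand-ins for three of the rules listed above, the names process_args and state are hypothetical, and trying the rules in list order is a simplification of how gema selects a matching rule. It also assumes that every argument, including the last, is followed by a newline.

```python
import re


def process_args(args):
    """Model translating command line arguments with three ARGV-like rules."""
    state = {"odir": None, "switches": set(), "errors": [], "exit_status": 0}

    def set_odir(match):        # stands in for  \N-odir\n*\n=@set{.ODIR;*}
        state["odir"] = match.group(1)

    def set_switch(match):      # stands in for  \N-<L1>\n=@set-switch{$1;1}
        state["switches"].add(match.group(1))

    def unrecognized(match):    # stands in for  \N-*\n=@err{...}@exit-status{3}
        state["errors"].append('Unrecognized option: "-%s"' % match.group(1))
        state["exit_status"] = 3

    # Simplification: rules are tried in this fixed order.
    rules = [
        (re.compile(r"-odir\n(.*)\n"), set_odir),
        (re.compile(r"-([A-Za-z])\n"), set_switch),
        (re.compile(r"-(.*)\n"), unrecognized),
    ]

    # Assumption: each argument is terminated by a newline in the stream.
    stream = "".join(arg + "\n" for arg in args)
    leftover = []
    pos = 0
    while pos < len(stream):
        # Rules may only match at the start of an argument (the \N in the templates).
        at_line_start = pos == 0 or stream[pos - 1] == "\n"
        matched = False
        if at_line_start:
            for pattern, handler in rules:
                match = pattern.match(stream, pos)
                if match:
                    handler(match)      # side effects only, nothing is output
                    pos = match.end()
                    matched = True
                    break
        if not matched:
            leftover.append(stream[pos])    # unmatched text is copied through
            pos += 1

    # Anything left over besides the delimiting newlines counts as undefined.
    state["undefined"] = [line for line in "".join(leftover).split("\n") if line]
    return state
```

In this model, process_args(["-odir", "out", "-x", "-bogus", "stray"]) records the output directory and the switch x, logs one unrecognized option, and lists stray as undefined because none of the three modeled rules consumes it. The full built-in rule set is larger than this model.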

For an example of extending the command line options, suppose you wanted to emulate a C pre-processor by accepting “-D” options to define macros. That could be done by defining rules such as:

  ARGV:\N-D<I>\=*\n=@define{\\I$1\\I\=@quote{$2}}
  ARGV:\N-D<I>\n=@define{\\I$1\\I\=1}

Instead of adding to the built-in rules, it is also possible to suppress the built-in rules and define your own rules from scratch. To do this, start the program with a command line like:

  gema -prim pattern-file ...

The -prim (“primitive mode”) option suppresses loading of the built-in rules and reads patterns from the specified file. Then the remainder of the command line is processed according to whatever ARGV rules were defined in that file. Note that even the default behavior of reading from standard input and writing to standard output is implemented by the ARGV rules. (The -prim option is the only one that is hard-coded instead of being implemented by patterns.)

6 Exit codes

When the program terminates, it returns one of the following status codes to the operating system (unless overridden by the @exit-status function):

0: nothing wrong
1: (reserved for user via @exit-status{1})
2: failed match signaled by @fail or @abort
3: undefined command line argument
4: syntax error in pattern definitions
5: use of undefined name during translation (domain, variable, switch, parameter, syntax type, or locale)
6: invalid numeric operand
7: can't execute shell command for @shell function
8: I/O error on input file
9: I/O error on output file
10: out of memory
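
A wrapper script can use these codes to report what went wrong. The following Python sketch is one way that might look. It assumes an executable named gema is on the search path, and the helper names describe_exit_status and run_gema are hypothetical. Because a pattern file can set any status with @exit-status, the table is only a guide to the default meanings.

```python
import subprocess

# Default meanings of the status codes listed above.
GEMA_EXIT_STATUS = {
    0: "nothing wrong",
    1: "reserved for user via @exit-status{1}",
    2: "failed match signaled by @fail or @abort",
    3: "undefined command line argument",
    4: "syntax error in pattern definitions",
    5: "use of undefined name during translation",
    6: "invalid numeric operand",
    7: "can't execute shell command for @shell function",
    8: "I/O error on input file",
    9: "I/O error on output file",
    10: "out of memory",
}


def describe_exit_status(code):
    """Return the default meaning of a status code, or a fallback message."""
    return GEMA_EXIT_STATUS.get(code, "status %d has no default meaning" % code)


def run_gema(args, executable="gema"):
    """Run gema with the given argument list; return (status, description).

    Assumption: `executable` names a gema binary that can be found on PATH.
    """
    completed = subprocess.run([executable] + list(args))
    return completed.returncode, describe_exit_status(completed.returncode)
```

A caller could then write, for example, status, text = run_gema(["-f", "rules.gema", "in.txt", "out.txt"]) and print text whenever status is nonzero; the file names here are placeholders.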

7 Status and Future development

This program was essentially functionally complete by the end of 1995. There have been only minor enhancements and bug fixes since then, both because it had reached a point where it was sufficient for my own needs and because I had very little time to spend on further development in subsequent years. After retiring in 2023, I was able to fix the known bugs (version 1.5) and implement Unicode support (version 2.0).

I consider this to have been a successful experiment since this program continues to prove very useful for a wide variety of tasks that are not as well served by other tools.

In the original 1995 version of this document, I had written:

“In an ideal world, the current program would be regarded as a completed prototype, and it would be appropriate to start designing the real program to replace it. However, as usually happens in the real world, we ship the prototype because there isn't time to do any more. There is room for improvement in the areas of consistency, ease of use, and performance at least.”

But now, from the perspective of 2024, it seems that the program has proven to be sufficiently robust that there would be nothing substantial to be gained by a redesign. Similarly, while the notation might not be ideal, the benefit of a stable tool outweighs any potential improvement. While performance improvements might be possible, they would not be likely to make enough difference to be noticeable.

Since this was developed by one person as a spare time hobby, it has not had as extensive or systematic testing as could be done, but I continue to use it frequently and it has been used by a number of other people over the years, with only a small number of bugs being discovered.

I don't expect to be spending any more effort on further development, but I am interested in hearing about any bugs found or other suggestions.

Following, in no particular order, are some assorted ideas for potential future enhancements:

The pattern matching should build a multi-level decision tree instead of using just a two-level dispatch with linear search after that.
Should warn about a domain that is defined but not referenced, since it is easy to mistakenly neglect to quote a colon.
It might be useful to have a way to switch (or push and pop) the output file - e.g. to write each chapter of a document to a separate file even though the input might be a single file.
A function to construct a unique pathname for a temporary file.
A default notation for quoting a long section of literal text, in addition to using the backslash for quoting individual characters.
A function to return the pathname of the current directory.
Record the file and line that each rule came from, to be used in run-time error messages.
Improved trace mode as an aid for debugging pattern files. (Some improvement was done in version 1.5.)
A template operator for specifying an action to be taken after all input files have been processed.
A variation of the @shell function that returns the output of the command.
Make the code reentrant (eliminate static variables) to allow it to be used embedded in a multi-threaded application.
Use a more modern, more full-featured regular expression library. (However, regular expressions are rarely used, and with the introduction of @defset may no longer ever be necessary.)
Instead of using C library wide character functions, an alternate approach to Unicode support would be to use the ICU library. The source code allows for that alternate configuration, but it is incomplete.
Support for UTF-32 files could easily be added if anyone actually uses that as a file format.

8 Acknowledgments

This program was conceived as an extension of the concepts embodied in W. M. Waite's “STAGE2” processor [Footnote 2], as implemented by Roger Hall.[Footnote 3]

This program has some similarities to awk, but they are generally due more to similarity of purpose than to any deliberate copying. I did copy the $0 notation and adopt the term action.

This program was designed and coded by myself, David N. Gray, except for the regular expression processor, which utilizes public domain code written by Ozan S. Yigit and updated by Craig Durland and Harlan Sexton. David A. Mundie supplied modifications to enable use on the Macintosh, and offered helpful comments and encouragement. Tod Olson began the work of parameterizing the code for Unicode support.

Thanks to Remo Dentato for the integration with Lua, for setting up a SourceForge project [Footnote 4], and for helping with build scripts.

Thanks also to the several people who reported bugs along the way.
