Looking back, I now see that without always realizing it, I have been searching for the perfect language for text manipulation ever since I became interested in computers. I suppose this is natural, since I came to computer science from a background in philosophy and literature. My first computer project was a program to analyze recipes so as to detect their underlying patterns and generalize them into metarecipes.
My first programming course was taught in CDC Fortran and strings were represented as 10-element arrays of characters; somehow I instinctively realized that computer science could do better than that. My first love in programming languages was Snobol; for one brief but glorious moment I felt that I had found the humanist's programming language, a text-oriented language that transcended the cumbersome, mechanical, left-brain thinking of more usual programming languages. Then I got the bill, and discovered that analyzing just a few recipes had eaten up nearly half of my graduate-student allotment of computer time.
There followed a long period of dabbling in programming languages. There was the Pascal phase, the Ada phase, the ML phase, the Prolog phase, the Awk phase, the Tcl/Tk phase. For a good while I was enamored of Perl's synergistic combination of basic pattern matching, operating system interface, and general-purpose programming features. Each had its merits, but I always knew that something was missing, something that kept me from feeling deeply satisfied.
Then I discovered Gema. From the moment I downloaded it, I realized this was the programming language for me, and since then I have used it extensively whenever the task at hand could be construed as a transformation problem; and of course, that covers a lot of territory. In many ways it is a strange language for me to have settled on, since it lacks the strong typing and other software engineering features that I believe essential to programming in the large, but I find its rule-based paradigm irresistible for the small- to medium-sized projects that I spend most of my time on.
In this article I attempt to explain what it is that draws me to Gema, and why I find it such a powerful paradigm for textual programming. I begin with what is by design its forte: its powerful rule-based transformation capability. In the second section I discuss some ancillary issues of programming language design which make Gema convenient to use, but which would not in and of themselves make it my language of choice.
In writing the article I was torn between the need not to assume any previous knowledge of Gema and the desire not to have to write a tutorial. I have compromised. Throughout the text I am relying on the reader's intuitions to make sense of the examples; in an Appendix I provide a brief overview of Gema's syntax.
| Feature | Gema | Sed | Awk | Perl |
|---|---|---|---|---|
| Regular expressions | + | + | + | + |
| Non-line-oriented | + | | | |
| Non-procedural and rule-based | + | | + | |
| Recognizers | + | | | + |
| User-defined recognizers | + | | | |
| Unification of matching and function definition | + | | | |
| Multiple rule sets | + | | | |
| Context-sensitive matching | + | | | |
| Recursive patterns | + | | | |
| Dynamic patterns | + | | | |
| User-specified match position | + | | | |
| User-definable syntax | + | | | |
| General-purpose programming | + | | + | + |
| OS Interface | + | | | + |
| Command-line extension | + | | | + |

Table 1. Gema's capabilities compared with those of Sed, Awk, and Perl.
Gema is often compared to other languages such as Awk or Perl, and at first glance it can be hard to see what is novel in the language. Table 1 is my attempt at a feature chart showing that Gema in fact offers many uncommon capabilities.
Before beginning to examine those features, I'd like to say a few words about the conciseness of Gema. I freely admit that one of the things that appeals to me in Gema is how short programs written in it can be. I thrill to the pleasure of creating elegant one-liners that accomplish tasks that take many lines of code in other languages: for example, a two-line pattern that parses the input to a CGI application, replacing fifty lines of Perl. Or consider the following one-line keyword-in-context generator:
<G>\P <G> <G> <G> <G> = @fill-right{@repeat{40;\ };$1 $2} $3 $4 $5\n
Although such brevity has always appealed to me in Gema, I used to feel guilty about it, because APL has given brevity in programming languages such a bad name. Eventually, however, I came to feel that Gema was not in fact a "write-once" language, but rather that its brevity most of the time comes from the fact that its solutions have such a good fit with the problem space. It seems to me that the one-line KWIC generator is a very natural expression of what we might have given the algorithm to be: "When you find five graphic-character tokens separated by white space, print the first two of them right-justified in a field of forty blanks, print the rest of them, and end the line; then continue processing just after the first of the tokens." When the solution space matches the problem space, verbosity ceases to be a desideratum.
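To make the correspondence concrete, here is the same one-liner again, glossed element by element (the annotation is my own; the comment lines use Gema's "!" comment character):

! <G>                the key word: a token of graphic characters, captured as $1
! \P                 when the rule fires, resume scanning just past the key word
! <G> <G> <G> <G>    four more space-separated tokens, captured as $2 through $5
<G>\P <G> <G> <G> <G> = @fill-right{@repeat{40;\ };$1 $2} $3 $4 $5\n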
Like Awk, Gema is a rule-based, non-procedural language, and this is the essence of its charm for me. Rule-based rewriting systems have been extensively studied in computer science, but here I wish to stress not their theoretical advantages but rather their practical ones. They go far beyond the simple notational convenience of not having to be constantly manipulating file pointers or nesting control structures. It is rather that the language has built into it a powerful processing engine that is almost impossible to reproduce in a procedural language.
Consider the simple problem of replacing single quotes with double quotes, except when the quotation is within parentheses. In Gema this is expressed quite naturally as
(<u>)=$0 '="
In a regular-expression-based, procedural language, implementing this trivial task requires many lines of complex code, because you either have to preprocess the input to somehow seal off the parenthesized expressions before changing the quotation marks, or write a tokenizer that maintains state and skips ahead to the closing parenthesis. In practice, handling nested constructs with regular expressions is so complex that many programmers opt for an approximate solution that works most of the time. "We'll never have a '(' and its ')' on different lines, will we?" In Gema, all this maintaining of state is done by the transformation engine, and does not have to appear in the user's code.
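For instance, given the input

He said 'yes' (or perhaps 'no').

these two rules produce

He said "yes" (or perhaps 'no').

since the parenthesized text is matched by the first rule and copied through unchanged, quotes and all.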
A fine example of how Gema is tuned for its problem domain is the fact that it supports patterns which cross line boundaries. All the other languages in our survey are line-oriented, whereas most text transformation problems involve multi-line units. It may seem like a small point, but the truth is that such line orientation is a constant source of impedance mismatch which obfuscates the solution space. It means that Grep has no equivalent to Gema's
Gema -match Error\:\n<U>\n

for printing out error messages that are two lines long. It means that the first thing many Perl programs do is to read their input into a large array and delete the line ends. It means that in Awk you can't simply write

(<U>)=@process{$0}

to process all parenthesized expressions, or

\<table\><U>\<\/table\>=@table{$0}

to process an HTML table. And so on and so forth. (I should note that Gema does support a line-oriented mode for those cases when it is appropriate.)
The dominant pattern-matching paradigm in computer technology today is the regular expression, which I think is a pity, because regular expressions are really a cumbersome and error-prone matching technology. Gema supports them, and they can occasionally be useful, but for most purposes what Gema calls recognizers and escape sequences prove to be more convenient.
Recognizers have two main advantages over regular expressions. The first is that where regular expressions are "greedy", swallowing up as much of their input as possible, recognizers are generous, nibbling only as much of their input as necessary, which is almost always what is wanted. Given the input

The owner of the dog had an unquenchable hatred of dogma

the regular expression pattern "/[A-Za-z ]+dog/=($0)" returns

(The owner of the dog had an unquenchable hatred of dog)ma

whereas the recognizer pattern "<U>dog=($0)" returns
(The owner of the dog) had an unquenchable hatred of dogma
Of course, this greediness of regular expressions is especially undesirable in Gema because it is not by default line-oriented, so that the first pattern above would return everything up to the very last "dog" in the file, which could be thousands of lines. As I say, this is hardly ever what you want.
The second advantage of recognizers over regular expressions is that they provide a well-engineered repertoire of matching primitives that are very convenient for building up complex matches. Of course, many regular-expression-based languages beef up regular expressions with special notations for matching at the beginning of a line and so forth, but few of them have the wide range of recognizers built into Gema. Control characters, punctuation, the various flavors of letters and digits and identifier characters and filename characters, all are available in a compact notation. No more writing "[A-Za-z0-9_-]+" for an identifier or "[;.!@#$%^&*()]" for punctuation: instead we can write <I> and <Y>.
More importantly, Gema's recognizers can be parameterized for the length of the string being matched, made optional, and negated. For example, to recognize dates written in the international standard format YYYY-MM-DD, one can write
<D4>-<D2>-<D2>
Without this ability to conveniently bound the length of the match, many pattern-matching tasks become more error-prone. Compare the strings matched by the pattern above with those matched by "[0-9]+-[0-9]+-[0-9]+".
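For instance (my own illustration), both patterns match a well-formed date such as

2023-07-14

but only the regular expression also accepts malformed strings like

20235-7-14159

since it puts no bound on the number of digits in each field.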
The ability to negate these recognizers makes the system still more flexible. Need to find all punctuation marks that are not preceded by a letter? The pattern

<-L><Y1>

is what you need: given "Stop! Really ?", it matches the "?" (preceded by a space) but not the "!" (preceded by a letter). Finally, recognizers can be made optional by writing them in lower case. For example,

<I><c>

will match all identifiers, along with the control characters that come after them, if there are any.
To my way of thinking the essential innovation in Gema is its support for user-defined recognizers. It is this that allows you to write grammars in Gema that are self-executing. In this article we will look at user-defined recognizers from two perspectives. In this section we will view them as an extension of the built-in recognizer mechanism, while in section 2.6 we will view them as providing multiple sets of transformation rules at the same time.
But for now, let us consider the (admittedly artificial) problem of recognizing a construct that consists of the word "dog", 4-digit numbers, identifiers whose third character is "e", and all strings consisting of two hexadecimal digits followed by a punctuation mark. Let us call such an ungodly construct a "whacko". What we would like is to be able to write Gema rules with a "whacko" recognizer in just the same way we write rules with the built-in recognizers. For example, we might want to process all sequences where a whacko is preceded by the word "the" and followed by an identifier:
the <whacko> <I>=@process{$0}

Gema allows us to do exactly that. How do we define what a whacko is? Gema has a very pure programming model: everything is done with pattern matching whenever possible, so it should come as no surprise that it uses transformation rules to specify user-defined recognizers. Here is the definition of a whacko:

whacko:dog=$0@end
whacko:<D4>=@end
whacko:<I2>e<i>=@end
whacko:<X2><Y1>=@end
whacko:=@terminate

Notice that there is a one-to-one correspondence between the components in the natural-language specification and its implementation in Gema. (Thus "dog", "1234", "omega", and "3F;" are all whackos, while the final rule rejects everything else.)
The use of user-defined recognizers lets us write Gema rule sets that are to all intents and purposes self-executing grammars. For example, here is the BNF notation for an XML attribute list declaration taken straight out of the XML 1.0 specification:
AttlistDecl ::= '<!ATTLIST' S Name AttDef* S? '>'

In Gema this can be written as:

AttlistDecl:'<!ATTLIST'<S><Name><AttDef><s>'>'

Apart from the white space, this is clearly isomorphic to the BNF.
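The subsidiary constructs are themselves user-defined recognizers, written in the same style as the whacko example above. A minimal sketch of one of them (my own simplification; the Name production in the XML specification admits more characters than this):

Name:<L><i>=$0@end
Name:=@terminate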
I am not necessarily advocating Gema as a replacement for parser generators; I sometimes think that I may go too far in using Gema as a quick way to implement solutions that in the long run might be better done with dedicated formal-language processing systems. I think Gema is most appropriate for situations where writing a BNF grammar is infeasible. The point I am trying to make is simply that user-defined recognizers provide a very natural way of implementing formal language specifications.
I have often wondered what class of languages Gema can recognize. A friend of mine once pointed out that it permits context-sensitive parsing, because it permits pattern variables that can be dynamically set. For example, consider the problem of matching paired punctuation marks, such as quotes. This can be represented in Gema by the rule

<Y1><u>$1=@process{$0}

This rule matches any single punctuation character, followed by zero or more other characters, followed by the same punctuation character that was matched by the first recognizer. This is not limited to multiple matches of the same pattern, however, since user-defined recognizers can perform arbitrary computations to generate pattern variables. Suppose we want to parenthesize pairs of predators and their prey. We might write:

<animal><u>$p=($0)
animal:cat=$0@set{p;mice}
animal:heron=$0@set{p;fish}

Then if the input is

The cat eats mice, the heron eats fish.

the output will be

The (cat eats mice), the (heron eats fish).

This is a powerful capability that, although not used very frequently, is extremely useful when the need arises.
Up till now we have focused on the recognition portion of Gema; now it is time to consider the actions that can be applied to the input that is recognized. Gema provides the usual set of predefined functions, for arithmetic, string manipulation, file handling, text formatting, operating system calls, and the like.
In addition, Gema provides for user-defined functions. I have stressed that one of the things I like about the language is its consistent, unified computational model, so it should come as no surprise to learn that these user-defined functions are defined as rule-based transformations. In fact, user-defined functions are one and the same thing as user-defined recognizers: any function can be used as a recognizer and vice-versa.
Consider for example a reduplicative function that reduplicates its input, turning "ma" into "ma-ma", "ba" into "ba-ba", etc.:

redup:<I>=$1-$1

On the left side of a transformation this can be used as a recognizer:

<redup>=$0

while on the right-hand side it can be used as a function. If the task were to reduplicate names that start with "J", for example, we might write

J<I>=@redup{$0}

(so that "Joe" becomes "Joe-Joe"). This ability to build up a repertoire of patterns that can be selected depending on the context encourages a divide-and-conquer approach and makes for more robust programs, because it reduces the amount of detail that needs to be considered; it provides a form of lexical scoping and information hiding. To take a real-world example, in processing HTML text it is convenient to be able to write, e.g.,

\<table\><U>\<\/table\>=@table{$1}

In writing the rules for the "table" domain, I can focus on the table element because I know that everything else has been filtered out by the higher-level domain.
I can hear the sarcastic voice of the skeptics: "Ooh, ooh, isn't that cool? Gema's got (eek, eek) subroutines!" Indeed, I find it difficult to convey the synergistic power of this subroutine-as-ruleset paradigm. Part of it, to be sure, is the generality of being able to define the syntax of the parameters. Want to use commas to separate parameters? It's easy:
myfunc:<U>,<U>,<U>=@print{My parameters are $1 $2 $3\n}

Want to use vertical bars instead? OK:

myfunc:<U>|<U>|<U>=@print{My parameters are $1 $2 $3\n}

Need a variable number of parameters? No problem: just define a domain that matches the right input patterns. For example, to set an arbitrary number of variables to the empty string:

@init{a, b, c, d, e, f, g...}
init:<I>=@set{$1;}
I guess the appeal of Gema's subroutine mechanism comes in part from contrasting it with that of Awk and Perl, where the language is split into two conflicting parts: the non-procedural, pattern-matching part and the procedural, algorithmic part, with only an awkward interface between the two.
Another powerful feature of Gema that is lacking in other text processing languages is the ability to define new rules at run-time. This lets the program adjust its behavior depending on the input data. To take a trivial example, consider the task of listing all the distinct words that appear in a text. This is easily accomplished by processing each word as it appears, then defining a rule to ignore the word from then on:
<W>=@process{$1}@define{$1=}

so that, for the input "to be or not to be", @process is applied once each to "to", "be", "or", and "not". Combined with Gema's treatment of files, which returns the file contents as the result of the "read" operator, the define operator can be used to provide an include mechanism:

@define{@read{newrules}}

has the effect of including the file "newrules" as though its content occurred textually at the point of the call.
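For example, if the file "newrules" contains the single rule

tree=bush

then after the @define call every subsequent occurrence of "tree" in the input is replaced by "bush", exactly as if the rule had been written in the main rule file.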
I conclude this section on Gema's text transformation capabilities with two clean solutions to frequent pattern-matching problems. The first is the problem of balanced delimiters. Suppose one wants to enumerate all the parenthesized subexpressions in a given input. That is, if the input is

((a+b)*(c+d))

we want to print out

(a+b)
(c+d)
((a+b)*(c+d))

For this, regular expressions and recognizers are inadequate, because they terminate prematurely: they don't keep track of their nesting level. If we write

(*)=$0@err{$0\n}

we get

((a+b)
(c+d)
((a+b)*(c+d))

because the right parenthesis after "a+b" matches the first left parenthesis, not the second one. To get this right, we need Gema's recursive argument, represented by '#':

(#)=$0@err{$0\n}

This argument performs recursive matches based on the literal characters surrounding it. The first recursive match finds '(a+b)', the second '(c+d)', etc. This capability is invaluable in dealing with the many situations involving nested constructs.
Finally, no discussion of Gema's power as a text transformation tool would be complete without mentioning that it has unlimited look-ahead. After any match, no matter how long, the scanning position can be reset to any arbitrary point.
We have already seen an example of this. Much of the conciseness of the KWIC program cited above comes from the ability to reset the scanning position (\P) to the end of the key word, rather than the default position at the end of the entire match:
<G>\P <G> <G> <G> <G> = @fill-right{@repeat{40;\ };$1 $2} $3 $4 $5\n

In general, look-ahead is a good thing, and the more the better.
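To see the effect, here is my own illustration, with the field width reduced from forty to twelve so the lines fit on the page. Feeding the rule the single line "now is the time for all good men" produces one context line per key word, the scan restarting one token further along each time:

      now is the time for
      is the time for all
    the time for all good
    time for all good men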
I conclude by considering three general programming-language capabilities which add to Gema's usefulness as a text transformation language.
A frequent use of syntax redefinition is in dealing with SGML texts, where the angle brackets and solidus conflict with their default use in Gema. In the redefined syntax, we write recognizers with square brackets instead of angle brackets: [I] instead of <I>, and so forth. The ability to redefine the comment character is also frequently useful.
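Under such a redefinition, the attribute-list rule shown earlier might be written as follows (a sketch showing only the bracket substitution; the redefinition itself is done with Gema's syntax-changing facilities):

AttlistDecl:'<!ATTLIST'[S][Name][AttDef][s]'>'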
To provide an extreme example, there is no reason the Gema fragment
! Cats eat mice
animal:cat=$0@set{p;mice}

could not be written

= Cats eat mice
animal$cat&^0_set[p~mice]

The total illegibility of this example may also serve as a reminder that such syntactic tinkering should be kept to a minimum.
Unlike Perl, Gema does not aim first and foremost to be a language for system programming tasks. Nevertheless, it does provide the basics: environment variables can be queried, pathnames manipulated, and operating system commands executed. The only serious restriction is that, because a design goal was to implement Gema in pure ANSI C, executing a command does not return its output, so that capturing it requires the use of temporary files. Despite this restriction, Gema's facilities in this area are adequate to allow it to be used for many system scripting tasks.
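In practice, then, one writes the command's output to a temporary file and reads it back. A sketch (I am assuming the command-execution function is @shell, and /tmp/listing is just an illustrative scratch file):

! run the command, then pick up its output from the scratch file
@shell{ls > /tmp/listing}@read{/tmp/listing}

Since the "read" operator returns the file's contents, this fragment can appear on the right-hand side of any rule that needs a directory listing.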
Finally, we should mention that, like Perl, Gema supports extensions to its command-line processing. It should come as no surprise that this processing is driven by pattern matching: even the "built-in" command-line handling is done with rules, and it can be entirely replaced by the user if so desired.
For example, a "count" parameter can be passed to Gema by adding the following rule to the special ARGV domain:
ARGV:\N-count\n*\n=@set{count;$1}

To my mind, this should be compared to the complex, never-quite-right packages for command-line processing that are endemic to languages like Perl.
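With that rule loaded, an invocation along the lines of

Gema -f myrules -count 40 input.txt

(hypothetical file names; I am assuming the usual -f option for loading a rule file) makes the value available in the rules as ${count}, for instance in @repeat{${count};\ }.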
| Construct | Example | Meaning |
|---|---|---|
| Recognizer | `<L>` | Match any string of letters |
| Escape Sequence | `\W` | Match optional white space |
| Domain | `table:tree=bush` | When processing a table, replace "tree" with "bush" |
| Variable Reference | `${count}` | The value of the variable "count" |
| Function Call | `@fill-right{000;${count}}` | Call the function "fill-right" with the parameters "000" and "${count}" |