(I set myself the goal of posting every week, but the latest Skydeck release has kept me busy, and it turned out I didn't understand the following as well as I thought.)
After seeing the examples of Camlp4 quotations in my last post, you may wonder:
- what are all the quotations (
str_item
, ctyp
, etc.)?
- what are all the antiquotations (
uid
, `str
, etc.)?
- which antiquotations are allowed where?
To answer these questions, we're going to look at how quotations are implemented in Camlp4. We'll need to learn a little about Camlp4's extensible parsers, and look at the OCaml parser in Camlp4.
Parsing OCaml
A small complication is that there is more than one concrete syntax for OCaml in Camlp4: the original (i.e. normal OCaml syntax) and revised syntaxes. The original syntax parser is given as an extension of the revised syntax one. So we'll begin in camlp4/Camlp4Parsers/Camlp4OCamlRevisedParser.ml (line 588 in the 3.10.2 source):
expr:
[ "top" RIGHTA
[
| "if"; e1 = SELF; "then"; e2 = SELF; "else"; e3 = SELF ->
<:expr< if $e1$ then $e2$ else $e3$ >>
You can read the parser more or less as a BNF grammar. This code defines a nonterminal
expr
by giving a bunch of cases. The cases are grouped together into
levels, which can be labeled and given an associativity (that's what
"top"
and
NONASSOC
are). Levels are used to indicate the precedence of operators, and also to provide hooks into the parser for extending it; for our purpose here you can skip over them.
You can read a case like a pattern match: match the stuff to the left of the arrow, return the stuff to the right. (What's being matched is a stream of tokens from the lexer.) A parser pattern can contain literal strings like "if"
, backquoted data constructors like `INT
(which can carry additional data), nonterminals, and some special keywords like SELF
. You can bind variables using ordinary pattern-matching syntax within token literals, and use x = y
syntax to bind the result of a call to a nonterminal.
The right side is a piece of AST representing what was parsed, and in most cases it is given as a quotation. This is pretty confusing, because often the left and right sides of a case look very similar, and you can't tell what AST node is produced. However, it gives us lots of examples of tricky quotations, and since we have already seen how to expand quotations we can deal with it. (If you're curious how Camlp4 is written using itself see camlp4/boot.)
Focusing on the if
case: the keywords if
, then
, and else
are parsed with an expression after each (at least we know that's the syntax of normal OCaml, and that gives a clue to what SELF
means: parse the current nonterminal); the expressions are bound to a variables; then the pieces are put together into an ExIfe
AST node.
(Some other special keywords you'll see are OPT
, which makes the next item optional, and LIST0
/LIST1
, which parse a list of items separated by the token after SEP
. LIST1
means there must be at least one item.)
OCaml allows you to leave off the else
part; where is the code for that? Turns out this is not allowed in revised syntax, and the original syntax overrides this part of the parser. Take a look at camlp4/Camlp4Parsers/Camlp4OCamlParser.ml (line 292):
expr: LEVEL "top"
[ [
| "if"; e1 = SELF; "then"; e2 = expr LEVEL "top";
"else"; e3 = expr LEVEL "top" ->
<:expr< if $e1$ then $e2$ else $e3$ >>
| "if"; e1 = SELF; "then"; e2 = expr LEVEL "top" ->
<:expr< if $e1$ then $e2$ else () >>
(Notice how the
expr
definition is qualified with the level in the revised grammar where it should slot in.)
Quotations and antiquotations
Hopefully that is enough about parsing to muddle through; let's move on to quotations. Here's another piece of the revised parser (line 670)--these are still cases of expr
:
[ `QUOTATION x -> Quotation.expand _loc x Quotation.DynAst.expr_tag
The
`QUOTATION
token contains a record including the body of the quotation and the tag. The record is passed off to the
Quotation
module to be expanded. The actual expansion happens in
camlp4/Camlp4Parsers/Camlp4QuotationCommon.ml. Looking to the bottom of that file, there are several lines like:
add_quotation "sig_item" sig_item_quot ME.meta_sig_item MP.meta_sig_item;
This installs a quotation expander for the
sig_item
tag. The expander parses the quotation starting at the
sig_item_quot
nonterminal in the parser, then runs the result through the antiquotation expander (see below). (The last two arguments to
add_quotation
have to do with the context where a quotation appears: inside a pattern you get
PaFoo
nodes while inside an expression you get
ExBar
nodes.) So we can answer one of the questions posed at the beginning: what are all the quotation tags? We can see here that there is a quotation for each type in
camlp4/Camlp4/Camlp4Ast.partial.ml.
Now let's look at antiquotations, which are more complicated (line 671):
| `ANTIQUOT ("exp"|""|"anti" as n) s ->
<:expr< $anti:mk_anti ~c:"expr" n s$ >>
The
`ANTIQUOT
token contains the tag and the body again (and the parser can choose a case based on the tag). The
anti
antiquotation creates a special AST node to hold the body of the antiquotation; each type in the AST has a constructor (
ExAnt
,
TyAnt
, etc.) for this purpose. The
mk_anti
function adds another tag, which is not always the same as the one we parsed; the
~c
argument adds a suffix giving the context where the antiquotation appeared.
There are two places where antiquotations are interpreted. First, in camlp4/Camlp4Parsers/Camlp4QuotationCommon.ml (line 89):
[ "`int" -> <:expr< string_of_int $e$ >>
This is one of a bunch of cases in a map over the syntax tree. It handles antiquotations like
<:expr< $`int:5$ >>
, which turns into an
ExInt
. You can also see cases here for the
anti
antiquotations, and some things to do with
list
antiquotations we haven't seen yet (more on this below).
Things that don't match these cases are handled when the AST is pretty-printed. Let's look at camlp4/Camlp4/Printers/OCaml.ml (line 510):
| <:expr< $int:s$ >> -> o#numeric f s ""
This case handles antiquotations like
<:expr< $int:"5"$ >>
. Again, this produces an
ExInt
, but you give it a string instead of an int.
What we have learned
Teaching a person to fish is fine, unless that person starves while trying to finish their PhD in theoretical pescatology. But I hope that you can see how we might go about answering the remaining questions--what are all the antiquotations, and where are they allowed--by examining all the `ANTIQUOT
cases in the parser and puzzling out where they get expanded.
Let's look at a particular example, by way of addressing the comment Nicolas Pouillard (aka Ertai) made on the last post. He points out that the final McOr
in of_string
can go outside the antiquotation. How could we learn this from the Camlp4 code? Let's find where the antiquotation is expanded, starting at the point where the function
keyword is parsed (Camlp4OCamlParser.ml line 299):
| "function"; a = match_case ->
<:expr< fun [ $a$ ] >>
(the right side is revised syntax) which uses
match_case
(line 350):
match_case:
[ [ OPT "|"; l = LIST1 match_case0 SEP "|" -> Ast.mcOr_of_list l ] ]
You might think that
match_case0
parses a single case, but let's check (
Camlp4OCamlRevisedParser.ml line 778):
match_case0:
[ [ `ANTIQUOT ("match_case"|"list" as n) s ->
<:match_case< $anti:mk_anti ~c:"match_case" n s$ >>
| `ANTIQUOT (""|"anti" as n) s ->
<:match_case< $anti:mk_anti ~c:"match_case" n s$ >>
We're interested in the second case for the moment: here's the antiquotation with no tag used in
of_string
. So the list of cases is returned by
match_case0
(as an
McAnt
with
match_case
as its tag) and more cases can be parsed following it.
(Now we can see a justification for a puzzling design decision in the AST: instead of collecting match cases in a list, it collects them with McOr
nodes. Many arrangements of McOr
nodes correspond to the same list of cases. As the above possibility shows, this is useful: an antiquotation can return zero, one, or several match cases, and we don't have to worry about splicing them into the list. On the other hand, it makes consuming the AST a little more complicated.)
We can go one step further: if we use the list
antiquotation, the first case in match_case0
returns an antiquotation with tag listmatch_case
, and we get the following expansion (Camlp4QuotationCommon.ml line 117):
| "listmatch_case" -> <:expr< Ast.mcOr_of_list $e$ >>
So our final
of_string
becomes:
let of_string = function
$list:
List.map
(fun c -> <:match_case< $`str:c$ -> $uid:c$ >>)
cons$
| _ -> invalid_arg "bad string"
Can we do something similar with the generation of the variant type? No, as it turns out. In the revised syntax, the arms of a variant are given inside square brackets, so we can say:
type t = [ $list:List.map (fun c -> <:ctyp< $uid:c$ >>) cons$ ]
But in the original syntax, without at least one constructor to make clear that we're defining a variant, there's no context to interpret a list, and this is reflected in the parser, which doesn't allow a
list
antiquotation there. This kind of problem is apparently why the revised syntax was introduced.
So far I've talked only about generating OCaml code; next time I'll cover how to use Camlp4 to consume OCaml, and build a simple code analysis tool.