Tuesday, January 27, 2009

Reading Camlp4, part 4: consuming OCaml ASTs

It's easy to think of Camlp4 as just "defmacro on steroids"; that is, just a tool for syntax extension, but it is really a box of independently-useful tools. As we've seen, Camlp4 can be used purely for code generation; in this post I'll describe a tool that uses it purely for code consumption: a (minimal, broken) version of otags:
open Camlp4.PreCast
module M = Camlp4OCamlRevisedParser.Make(Syntax)
module N = Camlp4OCamlParser.Make(Syntax)
We're going to call the OCaml parser directly. These functor applications are used only for their effect (which is to fill in an empty grammer with OCaml cases); ordinarily they would be called as part of Camlp4's dynamic loading process. Recall that the original syntax parser is an extension of the revised parser, so we need both, in this order.
let files = ref []

let rec do_fn fn =
  let st = Stream.of_channel (open_in fn) in
  let str_item = Syntax.parse_implem (Loc.mk fn) st in
  let str_items = Ast.list_of_str_item str_item [] in
  let tags = List.fold_right do_str_item str_items [] in
  files := (fn, tags)::!files
We'll call do_fn for each filename on the command line. The Syntax.parse_implem function takes a Loc.t and a stream, and parses the stream into a str_item. (The initial Loc.t just provides the filename so later locations can refer to it, for error messages etc.) Now, recall that even though we got back a single str_item, it can contain several definitions (collected with StSem). We use Ast.list_of_str_item to get an ordinary list, then accumulate tags into files.
and do_str_item si tags =
  match si with
 (* | <:str_item< let $rec:_$ $bindings$ >> -> *)
    | Ast.StVal (_, _, bindings) ->
        let bindings = Ast.list_of_binding bindings [] in
        List.fold_right do_binding bindings tags
    | _ -> tags
We'll only consider value bindings. The commented-out str_item quotation doesn't work (run it through Camlp4 to see why--I'm not sure where the extra StSem/StNil come from), so we fall back to an explicit constructor. (The rec antiquotation matches a flag controlling whether an StVal is a let rec or just a let; here we don't care.) Now we have an Ast.binding, which again can contain several bindings (collected with BiAnd) so we call Ast.list_of_bindings.
and do_binding bi tags =
  match bi with
    | <:binding@loc< $lid:lid$ = $_$ >> ->
      let line = Loc.start_line loc in
      let off = Loc.start_off loc in
      let pre = "let " ^ lid in
      (pre, lid, line, off)::tags
    | _ -> tags
We're going to generate an etags-format file, where each definition consists of a prefix of the line in the source, the tag itself, the line number, and the character offset. If you look in the parser you'll see that the left side of a binding can be any pattern (as you'd expect), but we only handle the case where it's a single identifier; the lid antiquotation extracts it as a string. The line number and character offset are easy to find from the location of the binding (see camlp4/Camlp4/Sig.ml for the Loc functions), which we get with @loc. The prefix is problematic: the location of the binding does not include the let or and part, and anyway what we really want is everything from the beginning of the line. Doable but not so instructive of Camlp4, so we just tack on a "let " prefix (so this doesn't work for and or if there is whitespace).
let print_tags files =
  let ch = open_out "TAGS" in
  ListLabels.iter files ~f:(fun (fn, tags) ->
    Printf.fprintf ch "\012\n%s,%d\n" fn 0;
    ListLabels.iter tags ~f:(fun (pre, tag, line, off) ->
      Printf.fprintf ch "%s\127%s\001%d,%d\n" pre tag line off))
Generating the tags file is straightforward, following the description at the bottom of the otags README. (The 0 is supposed to be the length of the tag data, but my Emacs doesn't seem to care.) We put the pieces together with Arg:
;;
Arg.parse [] do_fn "otags: fn1 [fn2 ...]";
print_tags !files
and finally, a Makefile:
otags: otags.ml
        ocamlc \
          -pp camlp4of \
          -o otags \
          -I +camlp4 -I +camlp4/Camlp4Parsers \
          dynlink.cma camlp4fulllib.cma otags.ml
We could improve this in many ways (error-handling, patterns, types, etc.); clearly we can't replicate otags in a few dozen lines. But Camlp4 takes care of a lot of the hard work. Next time, maybe, an actual syntax extension.

Thursday, January 22, 2009

Reading Camlp4, part 3: quotations in depth

(I set myself the goal of posting every week, but the latest Skydeck release has kept me busy, and it turned out I didn't understand the following as well as I thought.)

After seeing the examples of Camlp4 quotations in my last post, you may wonder:

  • what are all the quotations (str_item, ctyp, etc.)?
  • what are all the antiquotations (uid, `str, etc.)?
  • which antiquotations are allowed where?
To answer these questions, we're going to look at how quotations are implemented in Camlp4. We'll need to learn a little about Camlp4's extensible parsers, and look at the OCaml parser in Camlp4.

Parsing OCaml

A small complication is that there is more than one concrete syntax for OCaml in Camlp4: the original (i.e. normal OCaml syntax) and revised syntaxes. The original syntax parser is given as an extension of the revised syntax one. So we'll begin in camlp4/Camlp4Parsers/Camlp4OCamlRevisedParser.ml (line 588 in the 3.10.2 source):

    expr:
      [ "top" RIGHTA
        [ (* ... *)
        | "if"; e1 = SELF; "then"; e2 = SELF; "else"; e3 = SELF ->
            <:expr< if $e1$ then $e2$ else $e3$ >>
You can read the parser more or less as a BNF grammar. This code defines a nonterminal expr by giving a bunch of cases. The cases are grouped together into levels, which can be labeled and given an associativity (that's what "top" and NONASSOC are). Levels are used to indicate the precedence of operators, and also to provide hooks into the parser for extending it; for our purpose here you can skip over them.

You can read a case like a pattern match: match the stuff to the left of the arrow, return the stuff to the right. (What's being matched is a stream of tokens from the lexer.) A parser pattern can contain literal strings like "if", backquoted data constructors like `INT (which can carry additional data), nonterminals, and some special keywords like SELF. You can bind variables using ordinary pattern-matching syntax within token literals, and use x = y syntax to bind the result of a call to a nonterminal.

The right side is a piece of AST representing what was parsed, and in most cases it is given as a quotation. This is pretty confusing, because often the left and right sides of a case look very similar, and you can't tell what AST node is produced. However, it gives us lots of examples of tricky quotations, and since we have already seen how to expand quotations we can deal with it. (If you're curious how Camlp4 is written using itself see camlp4/boot.)

Focusing on the if case: the keywords if, then, and else are parsed with an expression after each (at least we know that's the syntax of normal OCaml, and that gives a clue to what SELF means: parse the current nonterminal); the expressions are bound to a variables; then the pieces are put together into an ExIfe AST node.

(Some other special keywords you'll see are OPT, which makes the next item optional, and LIST0/LIST1, which parse a list of items separated by the token after SEP. LIST1 means there must be at least one item.)

OCaml allows you to leave off the else part; where is the code for that? Turns out this is not allowed in revised syntax, and the original syntax overrides this part of the parser. Take a look at camlp4/Camlp4Parsers/Camlp4OCamlParser.ml (line 292):

    expr: LEVEL "top"
      [ [ (* ... *)
        | "if"; e1 = SELF; "then"; e2 = expr LEVEL "top";
          "else"; e3 = expr LEVEL "top" ->
            <:expr< if $e1$ then $e2$ else $e3$ >>
        | "if"; e1 = SELF; "then"; e2 = expr LEVEL "top" ->
            <:expr< if $e1$ then $e2$ else () >>
(Notice how the expr definition is qualified with the level in the revised grammar where it should slot in.)

Quotations and antiquotations

Hopefully that is enough about parsing to muddle through; let's move on to quotations. Here's another piece of the revised parser (line 670)--these are still cases of expr:

  [ `QUOTATION x -> Quotation.expand _loc x Quotation.DynAst.expr_tag
The `QUOTATION token contains a record including the body of the quotation and the tag. The record is passed off to the Quotation module to be expanded. The actual expansion happens in camlp4/Camlp4Parsers/Camlp4QuotationCommon.ml. Looking to the bottom of that file, there are several lines like:
  add_quotation "sig_item" sig_item_quot ME.meta_sig_item MP.meta_sig_item;
This installs a quotation expander for the sig_item tag. The expander parses the quotation starting at the sig_item_quot nonterminal in the parser, then runs the result through the antiquotation expander (see below). (The last two arguments to add_quotation have to do with the context where a quotation appears: inside a pattern you get PaFoo nodes while inside an expression you get ExBar nodes.) So we can answer one of the questions posed at the beginning: what are all the quotation tags? We can see here that there is a quotation for each type in camlp4/Camlp4/Camlp4Ast.partial.ml.

Now let's look at antiquotations, which are more complicated (line 671):

        | `ANTIQUOT ("exp"|""|"anti" as n) s ->
            <:expr< $anti:mk_anti ~c:"expr" n s$ >>
The `ANTIQUOT token contains the tag and the body again (and the parser can choose a case based on the tag). The anti antiquotation creates a special AST node to hold the body of the antiquotation; each type in the AST has a constructor (ExAnt, TyAnt, etc.) for this purpose. The mk_anti function adds another tag, which is not always the same as the one we parsed; the ~c argument adds a suffix giving the context where the antiquotation appeared.

There are two places where antiquotations are interpreted. First, in camlp4/Camlp4Parsers/Camlp4QuotationCommon.ml (line 89):

            [ "`int" -> <:expr< string_of_int $e$ >>
This is one of a bunch of cases in a map over the syntax tree. It handles antiquotations like <:expr< $`int:5$ >>, which turns into an ExInt. You can also see cases here for the anti antiquotations, and some things to do with list antiquotations we haven't seen yet (more on this below).

Things that don't match these cases are handled when the AST is pretty-printed. Let's look at camlp4/Camlp4/Printers/OCaml.ml (line 510):

    | <:expr< $int:s$ >> -> o#numeric f s ""
This case handles antiquotations like <:expr< $int:"5"$ >>. Again, this produces an ExInt, but you give it a string instead of an int.

What we have learned

Teaching a person to fish is fine, unless that person starves while trying to finish their PhD in theoretical pescatology. But I hope that you can see how we might go about answering the remaining questions--what are all the antiquotations, and where are they allowed--by examining all the `ANTIQUOT cases in the parser and puzzling out where they get expanded.

Let's look at a particular example, by way of addressing the comment Nicolas Pouillard (aka Ertai) made on the last post. He points out that the final McOr in of_string can go outside the antiquotation. How could we learn this from the Camlp4 code? Let's find where the antiquotation is expanded, starting at the point where the function keyword is parsed (Camlp4OCamlParser.ml line 299):

  | "function"; a = match_case ->
      <:expr< fun [ $a$ ] >>
(the right side is revised syntax) which uses match_case (line 350):
    match_case:
      [ [ OPT "|"; l = LIST1 match_case0 SEP "|" -> Ast.mcOr_of_list l ] ]
You might think that match_case0 parses a single case, but let's check (Camlp4OCamlRevisedParser.ml line 778):
    match_case0:
      [ [ `ANTIQUOT ("match_case"|"list" as n) s ->
            <:match_case< $anti:mk_anti ~c:"match_case" n s$ >>
        | `ANTIQUOT (""|"anti" as n) s ->
            <:match_case< $anti:mk_anti ~c:"match_case" n s$ >>
We're interested in the second case for the moment: here's the antiquotation with no tag used in of_string. So the list of cases is returned by match_case0 (as an McAnt with match_case as its tag) and more cases can be parsed following it.

(Now we can see a justification for a puzzling design decision in the AST: instead of collecting match cases in a list, it collects them with McOr nodes. Many arrangements of McOr nodes correspond to the same list of cases. As the above possibility shows, this is useful: an antiquotation can return zero, one, or several match cases, and we don't have to worry about splicing them into the list. On the other hand, it makes consuming the AST a little more complicated.)

We can go one step further: if we use the list antiquotation, the first case in match_case0 returns an antiquotation with tag listmatch_case, and we get the following expansion (Camlp4QuotationCommon.ml line 117):

            | "listmatch_case" -> <:expr< Ast.mcOr_of_list $e$ >>
So our final of_string becomes:
let of_string = function
    $list:
      List.map
        (fun c -> <:match_case< $`str:c$ -> $uid:c$ >>)
        cons$
  | _ -> invalid_arg "bad string"
Can we do something similar with the generation of the variant type? No, as it turns out. In the revised syntax, the arms of a variant are given inside square brackets, so we can say:
type t = [ $list:List.map (fun c -> <:ctyp< $uid:c$ >>) cons$ ]
But in the original syntax, without at least one constructor to make clear that we're defining a variant, there's no context to interpret a list, and this is reflected in the parser, which doesn't allow a list antiquotation there. This kind of problem is apparently why the revised syntax was introduced.

So far I've talked only about generating OCaml code; next time I'll cover how to use Camlp4 to consume OCaml, and build a simple code analysis tool.

Sunday, January 4, 2009

Reading Camlp4, part 2: quotations

Last time we saw how Camlp4 represents OCaml abstract syntax trees. We also saw some small examples of using Camlp4 quotations to get the AST corresponding to a piece of OCaml code. In this post we'll build a simple code-generating tool that uses quotations in more complicated ways.

The tool takes constructor names on stdin, one per line. It prints an OCaml module on stdout containing a type definition with the given constructors, a function to_string that converts a constructor of the type to the string of its name, and a function of_string that does the reverse. (Not the most useful program ever written.) We'll go through it line by line:

open Camlp4.PreCast

let _loc = Loc.ghost in
We start out with some boilerplate. Recall that Camlp4.PreCast gets us the Ast module containing the AST datatypes (along with some helper functions we'll see below); and that we need a location bound to _loc since the expanded quotations mention it.
let cons =
  let rec loop () =
    try
      match read_line () with
        | "" -> []
        | c -> c :: loop ()
    with End_of_file -> [] in
  loop () in
This part just reads lines from stdin until EOF and puts them in a list, nothing to do with Camlp4.
Printers.OCaml.print_implem
<:str_item<
Now we're into the meat. We've begun a str_item quotation (recall that this is a module body), and we're passing it to a function that will pretty-print it to stdout. The module Printers.OCaml comes from camlp4/Camlp4/Printers/OCaml.ml. (This module is called to pretty-print OCaml code when you run Camlp4; toward the bottom you can see some useful command-line options.)
type t =
    $Ast.TySum (_loc,
               Ast.tyOr_of_list
                 (List.map
                     (fun c -> <:ctyp< $uid:c$ >>)
                     cons))$
It's a bit confusing with the interjected explanations, but we are still in the quotation (up to the >> at the end of the program). This part generates the variant type definition. If the input is the lines "Foo", "Bar", "Baz", we want to generate type t = Foo | Bar | Baz.

This might be a good time to run the above type definition through Camlp4 to see what the AST looks like. You'll see that the body of the definition begins with Ast.TySum, then contains the branches of the variant collected by Ast.TyOr's. Each branch is an Ast.TyId with an Ast.IdUid inside containing the constructor name. We want to come up with the same thing for an arbitrary list of constructor names.

Reading from the inside out, we take the list of constructor names and map a ctyp quotation over it. In this quotation we see an antiquotation. Antiquotations let you fill in code so you can use a quotation as a template. Here the tag uid: turns the string c into an Ast.IdUid inside an Ast.TyId. So we have a list of Ast.TyId (Ast.IdUid "Foo")'s.

Now we call a helper function Ast.tyOr_of_list. There are a bunch of these in camlp4/Camlp4/Struct/Camlp4Ast.mlast, which is where the Ast module comes from. This one collects the elements of the list with Ast.TyOr. (As we saw with Ast.StSem/Ast.StNil, the AST doesn't use regular OCaml lists, but builds list-like things out of data constructors.)

Finally we wrap a Ast.TySum around the whole thing, and use an antiquotation to insert it as the body of the type definition. (Here no tag is needed on the antiquotation; this is usually the case if you're inserting an AST rather than a string.)

let to_string = function
    $Ast.mcOr_of_list
      (List.map
          (fun c -> <:match_case< $uid:c$ -> $`str:c$ >>)
          cons)$

let of_string = function
    $let ors =
       Ast.mcOr_of_list
         (List.map
             (fun c -> <:match_case< $`str:c$ -> $uid:c$ >>)
             cons) in
     Ast.McOr(_loc,
             ors,
             <:match_case< _ -> invalid_arg "bad string" >>)$

>>
Now that we've seen how things work, the functions are pretty easy to understand. There are a few new elements: A pattern match (in either a match or function expression) consists of an Ast.McArr for each arrow (which we make with the match_case quotation), collected by Ast.McOr's (which we make with the Ast.mcOr_of_list helper).

The `str antiquotation turns a string into a quoted OCaml string, escaping any special characters.

If you run a sample match through Camlp4 you'll see constructors like Ast.PaId and Ast.PaStr. These are for patterns on the left side of match cases. When we use the match_case quotation we don't have to worry about it; we can use the same antiquotations on the left and right, and the appropriate constructor is parsed out.

That's the whole program. Here's a Makefile for building it:

variant: variant.ml
        ocamlc \
          -pp camlp4of -I +camlp4 \
          -o variant camlp4lib.cma variant.ml
#       ocamlfind ocamlc -syntax camlp4o \
          -package camlp4.quotations.o -package camlp4.lib \
          -linkpkg -o variant variant.ml

clean:
        rm -f variant *.cm* *~
Quotations and antiquotations are a slick way to generate OCaml code, but it can be pretty hard to figure out how to use them. The first thing to realize is that you don't have to: anything you can do with Camlp4 you can do by working directly with the AST constructors (although you will find it tedious). As you learn your way around you can replace explicit AST code with quotations.

To access the full power of quotations we'll need to dive into the OCaml parser. But that's going to have to wait until the next post.