Thursday, August 26, 2010

ocamljs 0.3

I am happy to announce version 0.3 of ocamljs. Ocamljs is a system for compiling OCaml to Javascript. It includes a Javascript back-end for the OCaml compiler, as well as several support libraries, such as bindings to the browser DOM. Ocamljs also works with orpc for RPC over HTTP, and froc for functional reactive browser programming.

Changes since version 0.2 include:

  • support for OCaml 3.11.x and 3.12.0
  • jQuery binding (contributed by Dave Benjamin)
  • full support for OCaml objects (interoperable with Javascript objects)
  • Lwt 2.x support
  • ocamllex and ocamlyacc support
  • better interoperability with Javascript
  • many small fixes and improvements

Development of ocamljs has moved from Google Code to Github; see

Comparison to js_of_ocaml

Since I last did an ocamljs release, a new OCaml-to-Javascript system has arrived, js_of_ocaml. I want to say a little about how the two systems compare:

Ocamljs is a back-end to the existing OCaml compiler; it translates the “lambda” intermediate language to Javascript. (This is also where the bytecode and native code back-ends connect to the common front-end.) Js_of_ocaml post-processes ordinary OCaml bytecode (compiled and linked with the ordinary OCaml bytecode compiler) into Javascript. With ocamljs you need a special installation of the compiler (and special support for ocamlbuild and ocamlfind), you need to recompile libraries, and you need the OCaml source to build it. With js_of_ocaml you don’t need any of this.

Since ocamljs recompiles libraries, it’s possible to special-case code for the Javascript build to take advantage of Javascript facilities. For example, ocamljs implements the Buffer module on top of Javascript arrays instead of strings, for better performance. Similarly, it implements CamlinternalOO to use Javascript method dispatch directly instead of layering OCaml method dispatch on top. Js_of_ocaml can’t do this (or at least it would be necessary to recognize the compiled bytecode and replace it with the special case).

Because js_of_ocaml works from bytecode, it can’t always know the type of values (at the bytecode level, ints, bools, and chars all have the same representation, for example). This makes interoperating with native Javascript more difficult: you usually need conversion functions between the OCaml and Javascript representation of values when you call a Javascript function from OCaml. Ocamljs has more information to work with, and can represent OCaml bools as Javascript bools, for example, so you can usually call a Javascript function from OCaml without conversions.

Ocamljs has a mixed representation of strings: literal strings and the result of ^, Buffer.contents, and Printf.sprintf are all immutable Javascript strings; strings created with String.create are mutable strings implemented by Javascript arrays (with a toString method which returns the represented string). This is good for interoperability—you can usually pass a string directly to Javascript—but it doesn’t match regular OCaml’s semantics, and it can cause runtime failures (e.g. if you try to mutate an immutable string). Js_of_ocaml implements only mutable strings, so you need conversions when calling Javascript, but the semantics match regular OCaml.

With ocamljs, Javascript objects can be called from OCaml using the ordinary OCaml method-call syntax, and objects written in OCaml can be called using the ordinary Javascript syntax. With js_of_ocaml, a special syntax is needed to call Javascript objects, and OCaml objects can’t easily be called from Javascript. However, there is an advantage to having a special call syntax: with ocamljs it is not possible to partially apply calls to native Javascript methods, but this is not caught by the compiler, so there can be a runtime failure.

Ocamljs supports inline Javascript, while js_of_ocaml does not. I think it might be possible for js_of_ocaml to do so using the same approach that ocamljs takes: use Camlp4 quotations to embed a syntax tree, then convert the syntax tree from its OCaml representation (as lambda code or bytecode) into Javascript. However, you would still need conversion functions between OCaml and Javascript values.

I haven’t compared the performance of the two systems. It seems like there must be a speed penalty to translating from bytecode compared to translating from lambda code. On the other hand, while ocamljs is very naive in its translation, js_of_ocaml makes several optimization passes. With many programs it doesn’t matter, since most of the time is spent in browser code. (For example, the planet example seems to run at the same speed in ocamljs and js_of_ocaml.) It would be interesting to compare them on something computationally intensive like Andrej Bauer’s

Js_of_ocaml is more complete and careful in its implementation of OCaml (e.g. it supports int64s), and it generates much more compact code than ocamljs. I hope to close the gap in these areas, possibly by borrowing some code and good ideas from js_of_ocaml.

Friday, August 20, 2010

Mixing monadic and direct-style code with delimited continuations

The Lwt library is a really nice way to write concurrent programs. A big downside, however, is that you can’t use direct-style libraries with it. Suppose we’re writing an XMPP server, and we want to parse XML as it arrives over a network connection, using Daniel B├╝nzli’s nice xmlm library. Xmlm can read from a string, or from a Pervasives.in_channel, or you can give it a function of type (unit -> int) to return the next character of input. But there is no way to have it read from an Lwt thread; that is, we can’t give it a function of type (unit -> int Lwt.t), since it doesn’t know what do with an Lwt.t. To keep track of the parser state at the point the input blocks, the whole library would need to be rewritten in Lwt style (i.e. monadic style).

Now, Lwt does provide the Lwt_preemptive module, which gives you a way to spin off a preemptive thread (implemented as an ordinary OCaml thread) and wait for its result in the usual Lwt way with bind. This is useful, but has two drawbacks: preemptive threads are preemptive, so you’re back to traditional locking if you want to operate on shared data; and preemptive threads are threads, so they are much heavier than Lwt threads, and (continuing the XMPP hypothetical) it may not be feasible to use one per open connection.


What we would really like is to be able spin off a cooperative, direct-style thread. The thread needs a way to block on Lwt threads, but when it blocks we need to be able to schedule another Lwt thread. As a cooperative thread it of course has exclusive access to the process state while it is running. A cooperative, direct-style thread is sometimes called a coroutine (although to me that word connotes a particular style of inter-thread communication as well, where values are yielded between coroutines), or a fiber.

Here’s an API for mixing Lwt threads with fibers:

  val start : (unit -> 'a) -> 'a Lwt.t 
  val await : 'a Lwt.t -> 'a 

The start function spins off a fiber, returning an Lwt thread which is woken with the result of the fiber once it completes. The await function (which may be called only from within a fiber) blocks on the result of an Lwt thread, allowing another Lwt thread to be scheduled while it is waiting.

With this API we could implement our XMPP server by calling xmlm from within a fiber, and passing it a function that awaits the next character available on the network connection. But how do we implement it?

Delimited continuations

Oleg Kiselyov’s recent announcement of a native-code version of his Delimcc library for delimited continuations in OCaml reminded me of two things:

  1. I should find out what delimited continuations are.
  2. They sound useful for implementing fibers.

The paper describing the library, Delimited Control in OCaml, Abstractly and Concretely, has a pretty good overview of delimited continuations, and section 2 of A Monadic Framework for Delimited Continuations is helpful too.

The core API is small:

  type 'a prompt 
  type ('a,'b) subcont 
  val new_prompt   : unit -> 'a prompt 
  val push_prompt  : 'a prompt -> (unit -> 'a) -> 'a 
  val take_subcont : 
    'b prompt -> (('a,'b) subcont -> unit -> 'b) -> 'a 
  val push_subcont : ('a,'b) subcont -> (unit -> 'a) -> 'b 

I find it easiest to think about these functions as operations on the stack. A prompt is an identifier used to mark a point on the stack (the stack can be marked more than once with the same prompt). The function new_prompt makes a new prompt which is not equal to any other prompt.

The call push_prompt p f marks the stack with p then runs f, so the stack, growing to the right, looks like


where ABCD are stack frames in the continuation of the call to push_prompt, and EFGH are frames created while running f. If f returns normally (that is, without calling take_subcont) then its return value is returned by push_prompt, and we are back to the original stack ABCD.

If take_subcont p g is called while running f, the stack fragment EFGH is packaged up as an ('a,'b) subcont and passed to g. You can think of an ('a,'b) subcont as a function of type 'a -> 'b, where 'a is the return type of the call to take_subcont and 'b is the return type of the call to push_prompt. Take_subcont removes the fragment pEFGH from the stack, and there are some new frames IJKL from running g, so we have


Now g can make use of the passed-in subcont using push_subcont. (Thinking of a subcont as a function, push_subcont is just a weird function application operator, which takes the argument as a thunk). Then the stack becomes


Of course g can call the subcont as many times as you like.

A common pattern is to re-mark the stack with push_prompt before calling push_subcont (so take_subcont may be called again). There is an optimized version of this combination called push_delim_subcont, which produces the stack


The idea that a subcont is a kind of function is realized by shift0, which is like take_subcont except that instead of passing a subcont to g it passes an ordinary function. The passed function just wraps a call to push_delim_subcont. (It is push_delim_subcont rather than push_subcont for historical reasons I think—see the Monadic Framework paper for a comparison of various delimited continuation primitives.)

Implementing fibers

To implement fibers, we want start f to mark the stack, then run f; and await t to unwind the stack back to the mark, wait for t to complete, then restore the stack. Here is start:

  let active_prompt = ref None 
  let start f = 
    let t, u = Lwt.wait () in 
    let p = Delimcc.new_prompt () in 
    active_prompt := Some p; 
    Delimcc.push_prompt p begin fun () -> 
      let r = 
        try Lwt.Return (f ()) 
        with e -> Lwt.Fail e in 
      active_prompt := None; 
      match r with 
        | Lwt.Return v -> Lwt.wakeup u v 
        | Lwt.Fail e -> Lwt.wakeup_exn u e 
        | Lwt.Sleep -> assert false 

We make a sleeping Lwt thread, and store a new prompt in a global (this is OK because we won’t yield control to another Lwt thread before using it; of course this is not safe with OCaml threads). Then we mark the stack with push_prompt and run the fiber. (The let r = ... match r with ... is to avoid calling Lwt.wakeup{,_exn} in the scope of the try; we use Lwt.state as a handy type to store either a result or an exception.) If the fiber completes without calling await then all we do is wake up the Lwt thread with the returned value or exception.

Here is await:

  let await t = 
    let p = 
      match !active_prompt with 
        | None -> failwith "await called outside start" 
        | Some p -> p in 
    active_prompt := None; 
    match Lwt.poll t with 
      | Some v -> v 
      | None -> 
          Delimcc.shift0 p begin fun k -> 
            let ready _ = 
              active_prompt := Some p; 
              k (); 
              Lwt.return () in 
            ignore (Lwt.try_bind (fun () -> t) ready ready) 
          match Lwt.poll t with 
            | Some v -> v 
            | None -> assert false 

We first check to be sure that we are in the scope of start, and that t isn’t already completed (in which case we just return its result). If we actually need to wait for t, we call shift0, which capture the stack fragment back to the push_prompt call in start (this continuation includes the subsequent match Lwt.poll t and everything after the call to await), then try_bind so we can restore the stack fragment when t completes (whether by success or failure). When t completes, the ready function restores the global active_prompt, in case the fiber calls await again, then restores the stack by calling k (recall that this also re-marks the stack with p, which is needed if the fiber calls await again).

It’s pretty difficult to follow what’s going on here, so let’s try it with stacks. After calling start we have


where ABCD is the continuation of push_prompt in start (just the return of t) and EFGH are frames created by the thunk passed to start. Now, a call to await (on an uncompleted thread) calls shift0, which packs up EFGH as k and unwinds the stack to p. The function passed to shift0 stores k in ready but doesn’t call it, and control returns to start (since the stack has been unwound).

The program continues normally until t completes. Now control is in Lwt.run_waiters running threads that were waiting on t; one of them is our ready function. When it is called, the stack is re-marked and EFGH is restored, so we have


where QRST is wherever we happen to be in the main program, ending in Lwt.run_waiters. Now, EFGH ends with the second call to match Lwt.poll in await, which returns the value of t and continues the thunk passed to start. The stack is now marked with p inside Lwt.run_waiters, so when await is called again control returns there.

Events vs. threads

We have seen that we can use fibers to write Lwt threads in direct style. Should we abandon Lwt’s monadic style entirely, and use Lwt only for its event handling?

First, how does each style perform? Every time a fiber blocks and resumes, we have to copy, unwind, and restore its entire stack. With Lwt threads, the “stack” is a bunch of linked closures in the heap, so we don’t need to do anything to block or resume. On the other hand, building and garbage-collecting the closures is more expensive than pushing and popping the stack. We can imagine that which style performs better depends on the thread: if it blocks infrequently enough, the amortized cost of copying and restoring the stack might be lower than the cost of building and garbage-collecting the closures. (We can also imagine that a different implementation of delimited continuations might change this tradeoff.)

Second, how does the code look? The paper Cooperative Task Management without Manual Stack Management considers this question in the context of the “events vs. threads” debate. Many of its points lose their force when translated to OCaml and Lwt—closures, the >>= operator, and Lwt’s syntax extension go a long way toward making Lwt code look like direct style—but some are still germane. In favor of fibers is that existing direct-style code need not be rewritten to work with Lwt (what motivated us in the first place). In favor of monadic style is that the type of a function reflects the possibility that it might block, yield control to another thread, and disturb state invariants.

Direct-style FRP

We could apply this idea, of replacing monadic style with direct style using delimited continuations, to other monads—in particular to the froc library for functional reactive programming. (The Scala.React FRP library also uses delimited continuations to implement direct style; see Deprecating the Observer Pattern for details.)

Here’s the API:

  val direct : (unit -> 'a) -> 'a Froc.behavior 
  val read : 'a Froc.behavior -> 'a 

Not surprisingly, it’s just the same as for Lwt, but with a different monad and different names (I don’t know if direct is quite right but it is better than start). There is already a function Froc.sample with the same type as read, but it has a different meaning: sample takes a snapshot of a behavior but creates no dependency on it.

The implementation is very similar as well:

  let active_prompt = ref None 
  let direct f = 
    let t, u = Froc_ddg.make_changeable () in 
    let p = Delimcc.new_prompt () in 
    active_prompt := Some p; 
    Delimcc.push_prompt p begin fun () -> 
      let r = 
        try Froc_ddg.Value (f ()) 
        with e -> Froc_ddg.Fail e in 
      active_prompt := None; 
      Froc_ddg.write_result u r 
    (Obj.magic t : _ Froc.behavior) 

This is essentially the same code as start, modulo the change of monad. However, some of the functions we need aren’t exported from Froc, so we need to use the underlying Froc_ddg module and magic the result at the end. Froc_ddg.make_changeable is the equivalent of Lwt.wait: it returns an “uninitialized” monadic value along with a writer for that value. We use Froc_ddg.result instead of Lwt.state to store a value or exception, and Froc_ddg.write_result instead of the pattern match and Lwt.wakeup{,_exn}.

  let read t = 
    let p = 
      match !active_prompt with 
        | None -> failwith "read called outside direct" 
        | Some p -> p in 
    active_prompt := None; 
    Delimcc.shift0 p begin fun k -> 
      Froc.notify_result_b t begin fun _ -> 
        active_prompt := Some p; 
        k () 
    Froc.sample t 

And this is essentially the same code as await. A Froc.behavior always has a value, so we don’t poll it as we did with Lwt.t, but go straight to shift0. We have Froc.try_bind but it’s a little more compact to use use notify_result_b, which passes a result.

Monadic reflection

The similarity between these implementations suggests that we could use the same code to get a direct style version of any monad; we only need a way to create an uninitialized monadic value, then set it. The call to Lwt.poll in await is an optimization which we would have to forgo. (In both these examples we have a monad with failure, and try_bind, but we could do without it.)

A little googling turns up Andrzej Filinski’s paper Representing Monads, which reaches the same conclusion, with a lot more rigor. In that work start/direct are called reify, and await/read are called reflect. Reflect is close to the implementations above, but in reify the paper marks the stack inside a function passed to bind rather than creating an uninitialized monadic value and later setting it.

This makes sense—inside bind an uninitialized monadic value is created, then set from the result of the function passed to bind. So we are partially duplicating bind in the code above. If we mark the stack in the right place we should be able to use bind directly. It is hard to see how to make the details work out, however, since Lwt.bind and Froc.bind each have some cases where uninitialized values are not created.

(You can find the complete code for Lwt fibers here and direct-style froc here.)

(revised 10/22)

Friday, August 13, 2010

Reading Camlp4, part 10: custom lexers

As a final modification to our running JSON quotation example, I want to repair a problem noted in the first post—that the default lexer does not match the JSON spec—and in doing so demonstrate the use of custom lexers with Camlp4 grammars. We’ll parse UTF8-encoded Javascript using the ulex library.

To use a custom lexer, we need to pass a module matching the Lexer signature (in camlp4/Camlp4/ to Camlp4.PreCast.MakeGram. (Recall that we get back an empty grammar which we then extend with parser entries. ) Let’s look at the signature and its subsignatures, and our implementation of each:

  module type Error = sig 
    type t 
    exception E of t 
    val to_string : t -> string 
    val print : Format.formatter -> t -> unit 

First we have a module for packaging up an exception so it can be handled generically (in particular it may be registered with Camlp4.ErrorHandler for common printing and handling). We have simple exception needs so we give a simple implementation:

  module Error = 
    type t = string 
    exception E of string 
    let print = Format.pp_print_string 
    let to_string x = x 
  let _ = let module M = Camlp4.ErrorHandler.Register(Error) in () 

Next we have a module defining the tokens our lexer supports:

  module type Token = sig 
    module Loc : Loc 
    type t 
    val to_string : t -> string 
    val print : Format.formatter -> t -> unit 
    val match_keyword : string -> t -> bool 
    val extract_string : t -> string 
    module Filter : ... (* see below *) 
    module Error : Error 

The type t represents a token. This can be anything we like (in particular it does not need to be a variant with arms KEYWORD, EOI, etc. although that is the conventional representation), so long as we provide the specified functions to convert it to a string, print it to a formatter, determine if it matches a string keyword (recall that we can use literal strings in grammars; this function is called to see if the next token matches a literal string), and extract a string representation of it (called when you bind a variable to a token in a grammar—e.g. n = NUMBER). Here’s our implementation:

  type token = 
    | KEYWORD  of string 
    | NUMBER   of string 
    | STRING   of string 
    | ANTIQUOT of string * string 
    | EOI 
  module Token = 
    type t = token 
    let to_string t = 
      let sf = Printf.sprintf in 
      match t with 
        | KEYWORD s       -> sf "KEYWORD %S" s 
        | NUMBER s        -> sf "NUMBER %s" s 
        | STRING s        -> sf "STRING \"%s\"" s 
        | ANTIQUOT (n, s) -> sf "ANTIQUOT %s: %S" n s 
        | EOI             -> sf "EOI" 
    let print ppf x = Format.pp_print_string ppf (to_string x) 
    let match_keyword kwd = 
        | KEYWORD kwd' when kwd = kwd' -> true 
        | _ -> false 
    let extract_string = 
        | KEYWORD s | NUMBER s | STRING s -> s 
        | tok -> 
              ("Cannot extract a string from this token: " ^ 
                 to_string tok) 
    module Loc = Camlp4.PreCast.Loc 
    module Error = Error 
    module Filter = ... (* see below *) 

Not much to it. KEYWORD covers true, false, null, and punctuation; NUMBER and STRING are JSON numbers and strings; as we saw last time antiquotations are returned in ANTIQUOT; finally we signal the end of the input with EOI.

  module Filter : sig 
    type token_filter = 
      (t * Loc.t) Stream.t -> (t * Loc.t) Stream.t 
    type t 
    val mk : (string -> bool) -> t 
    val define_filter : t -> (token_filter -> token_filter) -> unit 
    val filter : t -> token_filter 
    val keyword_added : t -> string -> bool -> unit 
    val keyword_removed : t -> string -> unit 

The Filter module provides filters over token streams. We don’t have a need for it in the JSON example, but it’s interesting to see how it is implemented in the default lexer and used in the OCaml parser. The argument to mk is a function indicating whether a string should be treated as a keyword (i.e. the literal string is used in the grammar), and the default lexer uses it to filter the token stream to convert identifiers into keywords. If we wanted the JSON parser to be extensible, we would need to take this into account; instead we’ll just stub out the functions:

  module Filter = 
    type token_filter = 
      (t * Loc.t) Stream.t -> (t * Loc.t) Stream.t 
    type t = unit 
    let mk _ = () 
    let filter _ strm = strm 
    let define_filter _ _ = () 
    let keyword_added _ _ _ = () 
    let keyword_removed _ _ = () 

Finally we have Lexer, which packages up the other modules and provides the actual lexing function. The lexing function takes an initial location and a character stream, and returns a stream of token and location pairs:

module type Lexer = sig 
  module Loc : Loc 
  module Token : Token with module Loc = Loc 
  module Error : Error 
  val mk : 
    unit -> 
    (Loc.t -> char Stream.t -> (Token.t * Loc.t) Stream.t) 

I don’t want to go through the whole lexing function; it is not very interesting. But here is the main loop:

let rec token c = lexer 
  | eof -> EOI 
  | newline -> next_line c; token c c.lexbuf 
  | blank+ -> token c c.lexbuf 
  | '-'? ['0'-'9']+ ('.' ['0'-'9']* )? 
      (('e'|'E')('+'|'-')?(['0'-'9']+))? -> 
        NUMBER (L.utf8_lexeme c.lexbuf) 
  | [ "{}[]:," ] | "null" | "true" | "false" -> 
      KEYWORD (L.utf8_lexeme c.lexbuf) 
  | '"' -> 
      set_start_loc c; 
      string c c.lexbuf; 
      STRING (get_stored_string c) 
  | "$" -> 
      set_start_loc c; 
      c.enc := Ulexing.Latin1; 
      let aq = antiquot c lexbuf in 
      c.enc := Ulexing.Utf8; 
  | _ -> illegal c 

The lexer syntax is an extension provided by ulex; the effect is similar to ocamllex. The lexer needs to keep track of the current location and return it along with the token (next_line advances the current location; set_start_loc is for when a token spans multiple ulex lexemes). The lexer also needs to parse antiquotations, taking into account nested quotations within them.

(I think it is not actually necessary to lex JSON as UTF8. The only place that non-ASCII characters can appear is in a string. To lex a string we just accumulate characters until we see a double-quote, which cannot appear as part of a multibyte character. So it would work just as well to accumulate bytes. I am no Unicode expert though. This example was extracted from the Javascript parser in jslib, where I think UTF8 must be taken into account.)

Hooking up the lexer

There are a handful of changes we need to make to call the custom lexer:

In Jq_parser we make the grammar with the custom lexer module, and open it so the token constructors are available; we also replace the INT and FLOAT cases with just NUMBER; for the other cases we used the same token constructor names as the default lexer so we don’t need to change anything.

  open Jq_lexer 
  module Gram = Camlp4.PreCast.MakeGram(Jq_lexer) 
      | n = NUMBER -> Jq_number (float_of_string n) 

In Jq_quotations we have Camlp4.PreCast open (so references to Ast in the <:expr< >> quotations resolve), so EOI is Camlp4.PreCast.EOI; we want Jq_lexer.EOI, so we need to write it explicitly:

  json_eoi: [[ x = Jq_parser.json; `Jq_lexer.EOI -> x ]]; 

(Recall that the backtick lets us match a constructor directly; for some reason we can’t module-qualify EOI without it.)

That’s it.

I want to finish off this series next time by covering grammar extension, with an example OCaml syntax extension.

(You can find the complete code for this example here.)

Thursday, August 5, 2010

Reading Camlp4, part 9: implementing antiquotations

In this post I want to complicate the JSON quotation library from the previous post by adding antiquotations.

AST with antiquotations

In order to support antiquotations we will need to make some changes to the AST. Here is the new AST type:

  type t = 
      ... (* base types same as before *) 
    | Jq_array  of t 
    | Jq_object of t 
    | Jq_colon  of t * t 
    | Jq_comma  of t * t 
    | Jq_nil 
    | Jq_Ant    of Loc.t * string 

Let’s first consider Jq_Ant. Antiquotations $tag:body$ are returned from the lexer as an ANTIQUOT token containing the (possibly empty) tag and the entire body (including nested quotations/antiquotations) as a string. In the parser, we deal only with the JSON AST, so we can’t really do anything with an antiquotation but return it to the caller (wrapped in a Jq_Ant).

The lifting functions generated by Camlp4MetaGenerator treat Jq_Ant (and any other constructor ending in Ant) specially: instead of

  | Jq_Ant (loc, s) -> 
      <:expr< Jq_Ant ($meta_loc loc$, $meta_string s$) >> 

they have

  | Jq_Ant (loc, s) -> ExAnt (loc, s) 

Instead of lifting the constructor, they translate it directly to ExAnt (or PaAnt, depending on the context). We don’t otherwise have locations in our AST, but Jq_Ant must take a Loc.t argument because ExAnt does. Later, when we walk the OCaml AST expanding antiquotations, it will be convenient to have them as ExAnt nodes rather than lifted Jq_Ant nodes.

In addition to Jq_Ant, we have new Jq_nil, Jq_comma, and Jq_colon constructors, and we have replaced the lists in Jq_array and Jq_object with just t. The idea here is that in an antiquotation in an array, e.g.

  <:json< [ 1, true, $x$, "foo" ] >> 

we would like to be able to substitute any number of elements (including zero) into the array in place of x. If Jq_array took a list, we could substitute exactly one element only. So instead we build a tree out of Jq_comma and Jq_nil constructors; at any point in the tree we can substitute zero (Jq_nil), one (any other t constructor), or more than one (a Jq_comma subtree) elements. We recover a list by taking the fringe of the final tree. (In the Jq_ast module there are functions t_of_list and list_of_t which convert between these representations.) For objects, we use Jq_colon to associate a name with a value, then build a tree of name/value pairs the same way.

While this AST meets the need, it is now possible to have ill-formed ASTs, e.g. a bare Jq_nil, or a Jq_object where the elements are not Jq_colon pairs, or where the first argument of Jq_colon is not a Jq_string. This is annoying, but it is hard to see how to avoid it without complicating the AST and making it more difficult to use antiquotations.

Parsing antiquotations

Here is the updated parser:

  EXTEND Gram 
    json: [[ 
        ... (* base types same as before *) 
      | `ANTIQUOT 
          (""|"bool"|"int"|"flo"|"str"|"list"|"alist" as n, s) -> 
            Jq_Ant (_loc, n ^ ":" ^ s) 
      | "["; es = SELF; "]" -> Jq_array es 
      | "{"; kvs = SELF; "}" -> Jq_object kvs 
      | e1 = SELF; ","; e2 = SELF -> Jq_comma (e1, e2) 
      | -> Jq_nil 
      | e1 = SELF; ":"; e2 = SELF -> Jq_colon (e1, e2)  

We want to support several kinds of antiquotations: For individual elements, $x$ (where x is a t), or $bool:x$, $int:x$, $flo:x$, or $str:x$ (where x is an OCaml bool, int, float, or string); for these latter cases we need to wrap x in the appropriate t constructor. For lists of elements, $list:x$ where x is a t list, and $alist:x$ where x is a (string * t) list; for these we need to convert x to the Jq_comma / Jq_nil representation above. But in the parser all we do is return a Jq_Ant containing the tag and body of the ANTIQUOT token. (We return it in a single string separated by : because only one string argument is provided in ExAnt.)

It is the parser which controls where antiquotations are allowed, by providing a case for ANTIQUOT in a particular entry, and which tags are allowed in an entry. In this example we have only one entry, so we allow any supported antiquotation anywhere a JSON expression is allowed, but you can see in the OCaml parsers that the acceptable antiquotations can be context-sensitive, and the interpretation of the same antiquotation can vary according to the context (e.g. different conversions may be needed).

For arrays and objects, we parse SELF in place of the list. The cases for Jq_comma and Jq_nil produce the tree representation, and the case for Jq_colon allows name/value pairs. Recall that a token or keyword is preferred over the empty string, so the Jq_nil case matches only when none of the others do. In particular, the quotation <:json< >> parses to Jq_nil.

We can see that not only is the AST rather free, but so is the parser: it will parse strings which are not well-formed JSON, like <:json< 1, 2 >> or <json:< "foo" : true >>. We lose safety, since a mistake may produce an ill-formed AST, but gain convenience, since we may want to substitute these fragments in antiquotations. As an alternative, we could have a more restrictive parser (e.g. no commas allowed at the json entry), and provide different quotations for different contexts (e.g. <:json_list< >>, allowing commas) for use with antiquotations. For this small language I think it is not worth it.

Expanding antiquotations

To expand antiquotations, we take a pass over the OCaml AST we got from lifting the JSON AST; look for ExAst nodes; parse them as OCaml; then apply the appropriate conversion according to the antiquotation tag. To walk the AST we extend the object (generated with the Camlp4FoldGenerator filter) so we don’t need a bunch of boilerplate cases which return the AST unchanged. Here’s the code:

  module AQ = Syntax.AntiquotSyntax 
  let destruct_aq s = 
    let pos = String.index s ':' in 
    let len = String.length s in 
    let name = String.sub s 0 pos 
    and code = String.sub s (pos + 1) (len - pos - 1) in 
    name, code 
  let aq_expander = 
    inherit as super 
    method expr = 
        | Ast.ExAnt (_loc, s) -> 
            let n, c = destruct_aq s in 
            let e = AQ.parse_expr _loc c in 
            begin match n with 
              | "bool" -> <:expr< Jq_ast.Jq_bool $e$ >> 
              | "int" -> 
                  <:expr< Jq_ast.Jq_number (float_of_int $e$) >> 
              | "flo" -> <:expr< Jq_ast.Jq_number $e$ >> 
              | "str" -> <:expr< Jq_ast.Jq_string $e$ >> 
              | "list" -> <:expr< Jq_ast.t_of_list $e$ >> 
              | "alist" -> 
                        (fun (k, v) -> 
                          Jq_ast.Jq_colon (Jq_ast.Jq_string k, v)) 
              | _ -> e 
        | e -> super#expr e 
    method patt = 
        | Ast.PaAnt (_loc, s) -> 
            let _, c = destruct_aq s in 
            AQ.parse_patt _loc c 
        | p -> super#patt p 

When we find an antiquotation, we unpack the tag and contents (with destruct_aq), parse it using the host syntax (given by Syntax.AntiquotSyntax from Camlp4.PreCast, which might be either the original or revised syntax depending which modules are loaded), then insert conversions depending on the tag. Conversions don’t make sense in a pattern context, so for patterns we just return the parsed antiquotation.

Finally we hook into the quotation machinery, mostly as before:

let parse_quot_string loc s = 
  let q = !Camlp4_config.antiquotations in 
  Camlp4_config.antiquotations := true; 
  let res = Jq_parser.Gram.parse_string json_eoi loc s in 
  Camlp4_config.antiquotations := q; 
let expand_expr loc _ s = 
  let ast = parse_quot_string loc s in 
  let meta_ast = Jq_ast.MetaExpr.meta_t loc ast in 
  aq_expander#expr meta_ast 
Q.add "json" Q.DynAst.expr_tag expand_expr; 

Before parsing a quotation we set a flag, which is checked by the lexer, to allow antiquotations; the flag is initially false, so antiquotations appearing outside a quotation won’t be parsed. After lifting the JSON AST to an OCaml AST, we run the result through the antiquotation expander.

For concreteness, let’s follow the life of a quotation as it is parsed and expanded. Say we begin with

  <:json< [ 1, $int:x$ ] >> 

After parsing:

  Jq_array (Jq_comma (Jq_number 1., Jq_Ant (_loc, "int:x"))) 

After lifting:

    Jq_array (Jq_comma (Jq_number 1., $ExAnt (_loc, "int:x")$)) 

After expanding:

    Jq_array (Jq_comma (Jq_number 1., Jq_number (float_of_int x))) 
Nested quotations

Let’s see that again with a nested quotation:

  <:json< $<:json< 1 >>$ >> 

After parsing:

  Jq_Ant (_loc, "<:json< 1 >>") 

After lifting:

  ExAnt (_loc, "<:json< 1 >>") 

After expanding (during which we parse and expand "<:json< 1 >>" to <:expr< Jq_number 1. >>):

  <:expr< Jq_number 1. >> 

A wise man once said “The string is a stark data structure and everywhere it is passed there is much duplication of process.” So it is with Camlp4 quotations: each nested quotation is re-parsed; each quotation implementation must deal with parsing host-language antiquotation strings; and the lexer for each implementation must lex antiquotations and nested quotations. (Since we used the default lexer we didn’t have to worry about this, but see the next post.) It would be nice to have more support from Camlp4. On the other hand, while what happens at runtime seems baroque, the code above is relatively straightforward, and since we work with strings we can use any parser technology we like.

It has not been much (marginal) trouble to handle quotations in pattern contexts, but they are not tremendously useful. The problem is that we normally don’t care about the order of the fields in a JSON object, or if there are extra fields; we would like to write

  match x with 
    | <:json< { 
        "foo" : $foo$ 
      } >> -> ... (* do something with foo *) 

and have it work wherever the foo field is in the object. This is a more complicated job than just lifting the JSON AST. For an alternative approach to processing JSON using a list-comprehension syntax, see json_compr, an example I wrote for the upcoming metaprogramming tutorial at CUFP. For a fancier JSON DSL (including the ability to induct a type description from a bunch of examples!), see Julien Verlauget’s jsonpat. And for a framework to extend OCaml’s pattern-matching syntax, see Jeremy Yallop’s ocaml-patterns.

Next time we will see how to use a custom lexer with a Camlp4 grammar.

(You can find the complete code for this example here.)

Tuesday, August 3, 2010

Reading Camlp4, part 8: implementing quotations

The Camlp4 system of quotations and antiquotations is an awesome tool for producing and consuming OCaml ASTs. In this post (and the following one) we will see how to provide this facility for other syntaxes and ASTs. Here we consider just quotations; we’ll add antiquotations in the following post.


Our running example will be a quotation expander for JSON. Let’s begin with the JSON AST, in a module Jq_ast:

  type t = 
    | Jq_null 
    | Jq_bool   of bool 
    | Jq_number of float 
    | Jq_string of string 
    | Jq_array  of t list 
    | Jq_object of (string * t) list 

This is the same (modulo order and names) as json_type from the json-wheel library, but for various reasons we will not be able to use json_type. The Jq_ prefix is for json_quot, the name of this little library.

Parsing JSON

We’ll use a Camlp4 grammar to parse JSON trees. It is not necessary to use Camlp4’s parsing facilities in order to implement quotations—ultimately we will need to provide just a function from strings to ASTs, so we could use ocamlyacc or what-have-you instead—but it is convenient. Here is the parser:

  open Camlp4.PreCast 
  open Jq_ast 
  module Gram = MakeGram(Lexer) 
  let json = "json" 
  EXTEND Gram 
    json: [[ 
        "null" -> Jq_null 
      | "true" -> Jq_bool true 
      | "false" -> Jq_bool false 
      | i = INT -> Jq_number (float_of_string i) 
      | f = FLOAT -> Jq_number (float_of_string f) 
      | s = STRING -> Jq_string s 
      | "["; es = LIST0 json SEP ","; "]" -> Jq_array es 
      | "{"; 
          kvs = 
              [ s = STRING; ":"; j = json -> (s, j) ] 
              SEP ","; 
        "}" -> Jq_object kvs 

We use the default Camlp4 lexer (with MakeGram(Lexer)); as we have seen, keywords mentioned in a Camlp4 grammar are added to the lexer, so we don’t need to do anything special to lex null etc. However, while JSON/Javascript has a single number type, the default lexer returns different tokens for INT and FLOAT numbers, so we convert each to Jq_number. In fact, these tokens (along with STRING) represent OCaml integer, float and string literals, which do not exactly match the corresponding JSON ones, but they are fairly close so let’s not worry about it for now; we’ll revisit the lexer in a later post.

The parser itself is pleasingly compact; we can make good use of the LIST0 special symbol and an anonymous entry for parsing objects. Unfortunately things will get a little more complicated when we come to antiquotations.

Lifting the AST

Next we need to “lift” values of the JSON AST to values of the OCaml AST. What does “lift” mean, and why do we need to do it? The goal is to convert quotations in OCaml code, such as

  let x = <:json< [ 1, "foo", true ] >> 

into the equivalent

  let x = 
    Jq_ast.Jq_array [ 
      Jq_ast.Jq_number 1.; 
      Jq_ast.Jq_string "foo"; 
      Jq_ast.Jq_bool true 

This is to happen as part of Camlp4 preprocessing, which produces an OCaml AST, so what we produce in place of the <:json< ... >> expression must be a fragment of OCaml AST. We have a parser which takes a valid JSON string to the JSON AST; what remains is to take a JSON AST value to the corresponding OCaml AST. So we need a function with cases something like:

  | Jq_null -> <:expr< Jq_null >> 
  | Jq_number n -> <:expr< Jq_number $`flo:n$ >> 
  | ... 

It is not such a big deal to hand-write this lifting function for a small AST like JSON, but it is arduous and error-prone for full-size ASTs. Fortunately Camlp4 has a filter which does it for us. Let’s first look at the signature of the Jq_ast module:

  open Camlp4.PreCast 
  type t = ... (* as above *) 
  module MetaExpr : 
    val meta_t : Ast.loc -> t -> Ast.expr 
  module MetaPatt : 
    val meta_t : Ast.loc -> t -> Ast.patt 

The generated modules MetaExpr and MetaPatt provide functions to lift a JSON AST to either an OCaml expr (when the quotation appears as an expression) or patt (when it appears as a pattern). The loc arguments are inserted into the resulting OCaml AST so that compile errors have correct locations.

Now the implementation of Jq_ast:

  module Jq_ast = 
    type float' = float 
    type t = (* almost as above *) 
      | Jq_number of float' 
  include Jq_ast 
  open Camlp4.PreCast (* for Ast refs in generated code *) 
  module MetaExpr = 
    let meta_float' _loc f = <:expr< $`flo:f$ >> 
    include Camlp4Filters.MetaGeneratorExpr(Jq_ast) 
  module MetaPatt = 
    let meta_float' _loc f = <:patt< $`flo:f$ >> 
    include Camlp4Filters.MetaGeneratorPatt(Jq_ast) 

The file needs the Camlp4MetaGenerator filter (the camlp4.metagenerator package with findlib). The main idea is that the calls to Camlp4Filters.MetaGenerator{Expr,Patt} are expanded into the lifting functions. But there are a couple of fussy details:

First: The argument module Jq_ast which we pass to the generators is used both on the left and right of the generated function; if you look at the generated code there are cases like:

  | Jq_ast.Jq_null -> <:expr< Jq_ast.Jq_null >> 

(The <:expr< .. >> is already expanded in the actual generated code.) We need the AST to be available qualified by the module Jq_ast both in the current file and also in code that uses the quotation. So we have a nested Jq_ast module (for local uses, on the left-hand side) which we include (for external uses, on the right-hand side).

Second: The generators scan all the types defined in the current module, then generate code from the last-appearing recursive bundle. (In this case the recursive bundle contains just t, but in general there can be more than one; mutually recursive lifting functions are generated.) There are some special cases for predefined types, and in particular for float; however, it seems to be wrong:

  let meta_float _loc s = Ast.ExFlo (_loc, s) 

The ExFlo constructor takes a string representing the float, but calls to this function are generated when you use float in your type. To work around this, we define the type float' (on its own rather than as part of the last-appearing recursive bundle, or else Camlp4 would generate a meta_float' that calls meta_float), and provide correct meta_float' functions. There is a similar bug with meta_int, but meta_bool is correct, so our Jq_bool case does not need fixing.

(It is interesting to contrast this approach of lifting the AST with how it is handled in Template Haskell using the “scrap your boilerplate” pattern; see Geoffrey Mainland’s paper Why It’s Nice to be Quoted.)


Finally we can hook the parser and AST lifter into Camlp4’s quotation machinery, in the Jq_quotations module:

  open Camlp4.PreCast 
  module Q = Syntax.Quotation 
  let json_eoi = "json_eoi" 
  EXTEND Jq_parser.Gram 
    json_eoi: [[ x = Jq_parser.json; EOI -> x ]]; 
  let parse_quot_string loc s = 
    Jq_parser.Gram.parse_string json_eoi loc s 
  let expand_expr loc _ s = 
    Jq_ast.MetaExpr.meta_t loc (parse_quot_string loc s) 
  let expand_str_item loc _ s = 
    let exp_ast = expand_expr loc None s in 
    <:str_item@loc< $exp:exp_ast$ >> 
  let expand_patt loc _ s = 
    Jq_ast.MetaPatt.meta_t loc (parse_quot_string loc s) 
  Q.add "json" Q.DynAst.expr_tag expand_expr; 
  Q.add "json" Q.DynAst.patt_tag expand_patt; 
  Q.add "json" Q.DynAst.str_item_tag expand_str_item; 
  Q.default := "json" 

First, we make a new grammar entry json_eoi which parses a json expression followed by the end-of-input token EOI. Grammar entries ordinarily ignore the rest of the input after a successful parse. If we were to use the json entry directly, we would silently accept quotations with trailing garbage, and in particular incorrect quotations that happen to have a correct prefix, rather than alerting the user.

Then we register quotation expanders for the <:json< >> quotation in the expr, patt, and str_item contexts (str_item is useful because that is the context at the top level prompt), using Syntax.Quotation.add. All the expanders do is call the parser, then run the result through the appropriate lifting function.

Finally we set json as the default quotation, so we can just say << >> for JSON quotations. This is perhaps a bit cheeky, since the user may want something else as the default quotation; whichever module is loaded last wins.

It is worth reflecting on how the quotation mechanism works in the OCaml parser: There is a lexer token for quotations, but no node in the OCaml AST, so everything must happen in the parser. When a quotation is lexed, its entire contents is returned as a string. (Nested quotations are matched in the lexer—see quotation and antiquot in camlp4/Camlpl4/Struct/Lexer.mll—without considering the embedded syntax; this makes the << and >> tokens unusable in the embedded syntax.) The string is then expanded according to the table of registered expanders; expanders return a fragment of OCaml AST which is inserted into the parse tree.

You might have thought (as I did) that something fancy happens with quotations, e.g. Camlp4 switches to a different parser on the fly, then back to the original parser for antiquotations. But it is much simpler than that. At the same time, it is much more complicated than that, as we will see next time when we cover antiquotations (and in particular how nested antiquotations/quotations are handled).

(You can find the complete code here, including a pretty-printer and integration with the top level; after building and installing you can say e.g.

  # << [ 1, "foo", true ] >>;; 
  - : Jq_ast.t = [ 1, "foo", true ] 

although without antiquotations it is not very useful.)