Syntax

The most straightforward way to compose a Citrus grammar is to use Citrus' own custom grammar syntax. This syntax borrows heavily from Ruby, so it should already be familiar to Ruby programmers.

Terminals

Terminals may be represented by a string or a regular expression. Both follow the same rules as Ruby string and regular expression literals.

'abc'         # match "abc"
"abc\n"       # match "abc\n"
/abc/i        # match "abc" in any case
/\xFF/        # match "\xFF"
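
Since terminals follow Ruby's literal rules, plain Ruby is a quick way to check how a given literal behaves. The snippet below is a small illustration that does not involve Citrus at all; the variable names are chosen for this example only.

```ruby
# Plain Ruby, no Citrus required: these are the Ruby literal rules that
# Citrus terminals are described as following.

single = 'abc\n'   # single quotes: the backslash and "n" stay two characters
double = "abc\n"   # double quotes: \n becomes one newline character

puts single.length # 5
puts double.length # 4

# The i flag makes a regular expression ignore case.
puts(/abc/i.match?("ABC")) # true
puts(/abc/.match?("ABC"))  # false
```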

Character classes and the dot (match anything) symbol are supported as well for compatibility with other parsing expression implementations.

[a-z0-9]      # match any lowercase letter or digit
[\x00-\xFF]   # match any octet
.             # match any single character, including newlines

Also, strings may use backticks instead of quotes to indicate that they should match in a case-insensitive manner.

`abc`         # match "abc" in any case

Apart from case sensitivity, backtick strings behave the same as double-quoted strings.

See Terminal and StringTerminal for more information.

Repetition

Quantifiers may be used after any expression to specify how many times it must match. The general form of a quantifier is N*M, where N is the minimum and M is the maximum number of times the expression may match.

'abc'1*2      # match "abc" a minimum of one, maximum of two times
'abc'1*       # match "abc" at least once
'abc'*2       # match "abc" a maximum of twice

Additionally, the minimum and maximum may be omitted entirely to specify that an expression may match zero or more times.

'abc'*        # match "abc" zero or more times

The + and ? operators are supported as well for the common cases of 1* and *1 respectively.

'abc'+        # match "abc" one or more times
'abc'?        # match "abc" zero or one time

See Repeat for more information.
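
The N*M form can be read as a bounded, greedy loop. The following is a minimal sketch of that reading in plain Ruby, assuming a literal string as the repeated expression. It is not Citrus's implementation, and repeat_count with its min and max parameters is a hypothetical helper invented for this illustration.

```ruby
# Illustrative sketch only; not how Citrus implements repetition.
# Greedily matches `literal` at the start of `input` between `min` and `max`
# times (max = nil means no upper bound). Returns the number of repetitions
# matched, or nil when fewer than `min` repetitions are present.
def repeat_count(input, literal, min: 0, max: nil)
  raise ArgumentError, "literal must not be empty" if literal.empty?

  count = 0
  pos = 0
  while (max.nil? || count < max) && input[pos, literal.length] == literal
    count += 1
    pos += literal.length
  end
  count >= min ? count : nil
end

repeat_count("abcabcabc", "abc", min: 1, max: 2) # like 'abc'1*2
repeat_count("abcabcabc", "abc", min: 1)         # like 'abc'1* or 'abc'+
repeat_count("xyz", "abc", max: 1)               # like 'abc'*1 or 'abc'?
```

In this sketch, omitting both bounds corresponds to the bare * quantifier, which succeeds even when the literal does not appear at all.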

Lookahead

Both positive and negative lookahead are supported in Citrus. Use the & and ! operators to indicate that an expression either should or should not match. In neither case is any input consumed.

'a' &'b'      # match an "a" that is followed by a "b"
'a' !'b'      # match an "a" that is not followed by a "b"
!'a' .        # match any character except for "a"

A special form of lookahead is also supported that matches all characters up to the point where a given expression matches.

~'a'          # match all characters until an "a"
~/xyz/        # match all characters until /xyz/ matches

When using this operator (the tilde), at least one character must be consumed for the rule to succeed.

See AndPredicate, NotPredicate, and ButPredicate for more information.
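
The difference between the three operators is whether the input position moves. The sketch below illustrates this with Ruby's standard StringScanner rather than with Citrus itself. The three helper methods are hypothetical names for this example, and stopping at the end of input in but_predicate is an assumption of the sketch, not a statement about Citrus.

```ruby
require 'strscan'

# Illustrative sketch only; Citrus's own predicate classes are not used here.

# & : succeeds when `pattern` matches at the current position.
# StringScanner#check looks ahead without advancing the scanner.
def and_predicate?(scanner, pattern)
  !scanner.check(pattern).nil?
end

# ! : succeeds when `pattern` does not match at the current position.
def not_predicate?(scanner, pattern)
  scanner.check(pattern).nil?
end

# ~ : consumes characters until `pattern` matches at the current position.
# Returns the consumed text, or nil when nothing could be consumed.
# Assumption: the loop also stops at the end of the input.
def but_predicate(scanner, pattern)
  consumed = +""
  consumed << scanner.getch until scanner.eos? || scanner.check(pattern)
  consumed.empty? ? nil : consumed
end

scanner = StringScanner.new("xyzabc")
but_predicate(scanner, /a/)   # consumes "xyz"
and_predicate?(scanner, /a/)  # true, and the position is unchanged
not_predicate?(scanner, /b/)  # true, and the position is unchanged
```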

Sequences

Sequences of expressions may be separated by a space to indicate that the rules should match in that order.

'a' 'b' 'c'   # match "a", then "b", then "c"
'a' [0-9]     # match "a", then a numeric digit

See Sequence for more information.

Choices

Ordered choice is indicated by a vertical bar that separates two expressions. When using choice, each expression is tried in order. When one matches, the rule returns the match immediately without trying the remaining rules.

'a' | 'b'       # match "a" or "b"
'a' 'b' | 'c'   # match "a" then "b" (in sequence), or "c"

It is important to note when using ordered choice that any other operator binds more tightly than the vertical bar. A full chart of operators and their respective levels of precedence is given in the Precedence section below.

See Choice for more information.
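
Because the first matching alternative wins, the order of alternatives matters when one is a prefix of another. The following is a minimal sketch of that behavior in plain Ruby, assuming literal strings as alternatives; ordered_choice is a hypothetical helper and not part of Citrus.

```ruby
# Illustrative sketch only; not how Citrus implements choice.
# Tries each literal in order against the start of `input` and returns the
# first one that matches, without looking at the remaining alternatives.
def ordered_choice(input, alternatives)
  alternatives.find { |literal| input.start_with?(literal) }
end

ordered_choice("ab", ["a", "ab"]) # "a": the shorter alternative is tried first
ordered_choice("ab", ["ab", "a"]) # "ab"
ordered_choice("c",  ["a", "b"])  # nil: no alternative matches
```

In a sketch like this, listing the longer alternative first is what lets it match at all.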

Labels

Match objects may be referred to by a different name than the rule that originally generated them. To add a label, place the label and a colon immediately before any expression.

chars:/[a-z]+/  # the characters matched by the regular expression
                # may be referred to as "chars" in an extension
                # method
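
Ruby's named captures offer a loose analogy for labels: a sub-match is given a name and later retrieved by that name. The snippet below is plain Ruby and only illustrates the idea; it does not show how Citrus exposes labeled matches.

```ruby
# Plain Ruby analogy only: a named capture gives a sub-match a name,
# much as a label names the match produced by an expression.
m = /(?<chars>[a-z]+)/.match("hello42")

puts m[:chars] # hello
```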

Extensions

Extensions may be specified using either “module” or “block” syntax. When using module syntax, place the name of the module that should extend match objects between angle brackets (< and >).

[a-z0-9]5*9 <CouponCode>  # match a string of between 5 and 9 lowercase
                          # letters or digits and extend the match with
                          # the CouponCode module

Additionally, extensions may be specified inline using curly braces. When using this method, the code inside the curly braces may be invoked by calling the value method on the match object.

[0-9] { to_i }        # match any digit and return its integer value when
                      # calling the #value method on the match object

Note that when using the inline block method you may also specify arguments in between vertical bars immediately following the opening curly brace, just like in Ruby blocks.
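
Both forms add methods to an individual match object. The sketch below imitates that in plain Ruby under stated assumptions: FakeMatch is a hypothetical stand-in for a Citrus match object, the valid_length? method is invented for the example, and treating the block as the body of a value method that runs with the match as self is an assumption based on the description above.

```ruby
# Illustrative sketch only. FakeMatch is a hypothetical stand-in for a
# Citrus match object; it is not part of Citrus.
FakeMatch = Struct.new(:string) do
  def to_i
    string.to_i
  end
end

# Hypothetical extension module, as it might be named in <CouponCode>.
module CouponCode
  def valid_length?
    string.length.between?(5, 9)
  end
end

# Module syntax: the module's methods become available on this one object.
coupon = FakeMatch.new("abc123")
coupon.extend(CouponCode)
coupon.valid_length?

# Block syntax: the block is installed as a `value` method on the object,
# so a bare `to_i` inside it is called on the match itself (assumption).
digit = FakeMatch.new("7")
digit.define_singleton_method(:value) { to_i }
digit.value
```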

Super

When including a grammar inside another, all rules in the child that have the same name as a rule in the parent also have access to the super keyword to invoke the parent rule.

grammar Number
  rule number
    [0-9]+
  end
end

grammar FloatingPoint
  include Number

  rule number
    super ('.' super)?
  end
end

In the example above, the FloatingPoint grammar includes Number. Both have a rule named number, so FloatingPoint#number can invoke Number#number by using super.

See Super for more information.
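
The keyword reads much like super in ordinary Ruby modules. As a loose analogy only, the sketch below rebuilds the example with plain Ruby modules whose number methods return regular expression source strings. NumberRules, FloatingPointRules, and Matcher are hypothetical names for this illustration and are not generated by Citrus.

```ruby
# Plain Ruby analogy only; these modules are not Citrus grammars.
module NumberRules
  def number
    '[0-9]+'
  end
end

module FloatingPointRules
  include NumberRules

  # Reuses the parent's pattern twice, like: super ('.' super)?
  def number
    integer = super
    "#{integer}(\\.#{integer})?"
  end
end

class Matcher
  include FloatingPointRules

  # True when the whole text matches the number pattern.
  def match?(text)
    /\A#{number}\z/.match?(text)
  end
end

matcher = Matcher.new
matcher.match?("42")   # integer part only
matcher.match?("3.14") # integer part, dot, fractional part
```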

Precedence

The following table contains a list of all Citrus symbols and operators and their precedence. A higher precedence indicates tighter binding.

Operator   Name                          Precedence
''         String (single quoted)        7
""         String (double quoted)        7
``         String (case insensitive)     7
[]         Character class               7
.          Dot (any character)           7
//         Regular expression            7
()         Grouping                      7
*          Repetition (arbitrary)        6
+          Repetition (one or more)      6
?          Repetition (zero or one)      6
&          And predicate                 5
!          Not predicate                 5
~          But predicate                 5
<>         Extension (module name)       4
{}         Extension (literal)           4
:          Label                         3
e1 e2      Sequence                      2
e1 | e2    Ordered choice                1

Grouping

As is common in many programming languages, parentheses may be used to override the normal binding order of operators. In the following example parentheses are used to make the vertical bar between 'b' and 'c' bind tighter than the space between 'a' and 'b'.

'a' ('b' | 'c')   # match "a", then "b" or "c"