Tips for writing a good JavaCC lexical specification

There are many ways to write the lexical specification for a grammar, but the performance of the generated token manager varies significantly depending on how you write it. Here are a few tips:

  • Try to specify as many tokens as possible as string literals. These are recognized by a deterministic finite automaton (DFA), which is much faster than the nondeterministic finite automaton (NFA) needed to recognize more complex regular expressions. For example, to skip blanks, tabs, and newlines,
        SKIP : { " " | "\t" | "\n" }
    
    is more efficient than
        SKIP : { < ([" ", "\t", "\n"])+ > }
    
    because the first form consists only of string literals and so generates a DFA, whereas the second generates an NFA.
  • Try to use the pattern ~[] by itself as much as possible. For example,
        MORE : { < ~[] > }
    
    is better than
        TOKEN : { < (~[])+ > }
    
    Of course, if your grammar dictates that < ~[] > cannot be used, you don't have a choice, but prefer < ~[] > wherever you can.
  • Specify all string literals in order of increasing length, i.e., all shorter string literals before longer ones. This helps optimize the bit vectors needed for string literals.
  • Try to minimize the use of lexical states. If you do use them, try to move all your complex regular expressions into a single lexical state, leaving the others to recognize only simple string literals.
  • Use IGNORE_CASE judiciously. The best thing to do is to set this option at the grammar level. If that is not possible, try to set it for *all* regular expressions in a lexical state. There is a heavy performance penalty for setting IGNORE_CASE for some regular expressions and not for others in the same lexical state.
  • Try to SKIP as much as possible if you don't care about certain patterns. Be a bit careful about EOF, though: seeing an EOF after a SKIP is fine, whereas seeing an EOF after a MORE is a lexical error.
  • Try to avoid specifying lexical actions with MORE specifications. Every MORE generally ends in a TOKEN (or SPECIAL_TOKEN), so perform the action there, at the TOKEN level, if possible.
  • Also try to avoid lexical actions and lexical state changes with SKIP specifications (especially for single-character SKIPs like " ", "\t", "\n", etc.). For such cases, a simple loop is generated to eat up the skipped single characters. If a lexical action or state change is associated with the SKIP, it is not possible to do it this way.
  • Try to avoid having a choice of string literals for the same token, e.g.
        < NONE : "\"none\"" | "\'none\'" >
    
    Instead, define a separate token kind for each literal and use a nonterminal that chooses between them. The above example can be written as:
        < NONE1 : "\"none\"" >
      |
        < NONE2 : "\'none\'" >
    
    with a nonterminal called None() defined as:
        void None() : {} { <NONE1> | <NONE2> }
    
    This makes recognition much faster. Note, however, that if the choice is between two complex regular expressions, it is fine to keep the choice.
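
Several of the tips above can be combined in one small grammar. The following is only a sketch: the parser name TipsDemo, the lexical state IN_COMMENT, the token names, and the choice of a case-insensitive language are all assumptions made for illustration and do not come from any real grammar.

```javacc
options {
  IGNORE_CASE = true;  // assumed case-insensitive language; set once at the grammar level
}

PARSER_BEGIN(TipsDemo)
public class TipsDemo {}
PARSER_END(TipsDemo)

// Whitespace as plain string literals, with no lexical action or state change.
SKIP : { " " | "\t" | "\n" | "\r" }

// The state change into the comment is attached to a MORE, not to a SKIP.
MORE : { "/*" : IN_COMMENT }

// The MORE sequence ends in a SPECIAL_TOKEN; any lexical action would go here.
<IN_COMMENT> SPECIAL_TOKEN : { < COMMENT : "*/" > : DEFAULT }

// Inside the comment, consume one character at a time with ~[] by itself.
// Reaching EOF in this state is a lexical error (EOF after MORE).
<IN_COMMENT> MORE : { < ~[] > }

// String literals, shorter ones listed before longer ones.
TOKEN : {
    < LT         : "<"   >
  | < LE         : "<="  >
  | < SHL        : "<<"  >
  | < SHL_ASSIGN : "<<=" >
}

// The only complex regular expression, kept in the DEFAULT state.
TOKEN : { < IDENTIFIER : ["a"-"z", "_"] ( ["a"-"z", "0"-"9", "_"] )* > }

// Hypothetical start production so that the file is a complete grammar.
void Input() : {} {
  ( <LT> | <LE> | <SHL> | <SHL_ASSIGN> | <IDENTIFIER> )* <EOF>
}
```

Saved as TipsDemo.jj, this can be passed to the javacc tool to generate the token manager. The IN_COMMENT state contains only a string literal and a bare ~[], so the one complex regular expression stays in DEFAULT.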



