Tokenizes text with context dependence (and can also parse grammars), returning a JavaScript data structure that matches the script to which the syntax definition is applied.
Retokenizer is an ES6 class that turns text into an array of tokens according to a syntax definition you supply. It's essentially a lexical analyzer, but it comes close to being a grammar parser too — recursive sub-syntaxes can loop back on each other for context-sensitive parsing. Most typically you'd reach for it when developing a programming language.
How it differs from other tokenizers
- Keeps the splitters. The characters that split up tokens are retained, not discarded.
- Recursive sub-syntaxes. Enclosures can carry their own syntax — embed different sub-languages within each other, or loop a syntax back on itself for infinite recursive parsing.
- Context-sensitive parsing. The same character can mean different things in different contexts, for instance = as assignment in an open statement but as evaluation inside an if( .. ) condition.
- Retains unrecognized portions. Strings found between splitters can be kept, removed, or made to throw an error.
- Optional position info. Can provide line and character position numbers, and emit tokens as simple strings or richly detailed objects.
- No dependencies. Written for Node.js with no outside dependencies; adapts easily to any modern ES6 browser.
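To make the first and fourth points above concrete, here is a minimal standalone sketch of splitting text while keeping both the splitters and the strings between them. It is not Retokenizer's implementation or API: the function name splitKeeping and the token shape ({ type, value }) are hypothetical, chosen only for illustration.

```javascript
// Illustrative only: NOT Retokenizer's implementation or API.
// Splits text on a list of splitters and keeps both the splitters and the
// text found between them. When several splitters match at the same
// position, the one listed first wins (order of precedence).
function splitKeeping(text, splitters) {
  const tokens = [];
  let between = ''; // unrecognized text accumulated since the last splitter
  let i = 0;
  while (i < text.length) {
    // First non-empty splitter, in precedence order, that matches at i.
    const hit = splitters.find((s) => s.length > 0 && text.startsWith(s, i));
    if (hit === undefined) {
      between += text[i];
      i += 1;
      continue;
    }
    if (between !== '') {
      tokens.push({ type: 'between', value: between });
      between = '';
    }
    tokens.push({ type: 'splitter', value: hit });
    i += hit.length;
  }
  if (between !== '') tokens.push({ type: 'between', value: between });
  return tokens;
}

// '==' is listed before '=' so the longer operator takes precedence.
const tokens = splitKeeping('a == b', ['==', '=', ' ']);
console.log(tokens.map((t) => t.value)); // [ 'a', ' ', '==', ' ', 'b' ]
```

Dropping the space tokens afterwards (for example with tokens.filter) would loosely correspond to excluding tokens from the output, and discarding or rejecting the 'between' tokens to the keep, remove, or throw choice described above.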
A taste
let tokenizer = Retokenizer( syntax, { rich:true, betweens:'keep', condense:true, caseful:false } )
A syntax is a record of splitters (in order of precedence),
removes (tokens to exclude from output), and enclosures
such as quoted strings, comments, or code blocks — each of which may carry its
own nested syntax.
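As a rough picture of such a record, the sketch below builds one as a plain JavaScript object. The top-level keys follow the description above, but the field names inside each enclosure (opener, closer, syntax) are assumptions made for illustration, not a confirmed schema; check example.js in the source for the real shape.

```javascript
// Hypothetical syntax record. The enclosure field names (opener, closer,
// syntax) are assumptions for illustration, not Retokenizer's confirmed schema.
const syntax = {
  // Listed in order of precedence: '==' before '=' so the longer match wins.
  splitters: ['==', '=', '(', ')', ' ', '\n'],
  // Tokens to exclude from the output.
  removes: [' ', '\n'],
  enclosures: [
    { opener: '"', closer: '"' },   // quoted string
    { opener: '/*', closer: '*/' }, // comment
    { opener: '{', closer: '}' },   // code block
  ],
};

// Loop the syntax back on itself: the contents of a { ... } block would be
// tokenized with the same record, allowing arbitrarily deep nesting.
syntax.enclosures[2].syntax = syntax;
```

Because the nested syntax is an ordinary object reference, a different sub-language could be embedded in the same way by pointing an enclosure at a separate record.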
Installation
Retokenizer is available as an npm package. Clone the source to explore the
included example.js, which demonstrates how to use it in your own
code:
git clone https://github.com/Solifugus/retokenizer.git
Downloads & installation
Builds are coming soon. Grab the source from GitHub for now.