retokenizer · tedderland.com

Tokenizes text with context dependence (and can also parse grammars), returning a JavaScript data structure that mirrors the script to which the syntax definition is applied.

Retokenizer is an ES6 class that turns text into an array of tokens according to a syntax definition you supply. It's essentially a lexical analyzer, but it comes close to being a grammar parser too — recursive sub-syntaxes can loop back on each other for context-sensitive parsing. Most typically you'd reach for it when developing a programming language.

How it differs from other tokenizers

Keeps the splitters. The characters that split up tokens are retained, not discarded.
Recursive sub-syntaxes. Enclosures can carry their own syntax: embed different sub-languages within each other, or loop a syntax back on itself to parse nesting of any depth.
Context-sensitive parsing. The same character can mean different things in different contexts — for instance = as assignment in an open statement but as evaluation inside an if( .. ) condition.
Retains unrecognized portions. Strings found between splitters can be kept, removed, or made to throw.
Optional position info. Can provide line and character position numbers, and emit tokens as simple strings or richly detailed objects.
No dependencies. Written for Node.js with no outside dependencies; adapts easily to any modern ES6 browser.
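
To make the first and fifth points concrete, here is a minimal standalone sketch of what "keeping the splitters" with position info can look like. It is not Retokenizer's implementation or API: the function name splitKeeping and the token fields type, value, and position are hypothetical, chosen only for illustration.

```javascript
// Standalone sketch, NOT the Retokenizer API: split text on a list of
// splitters and keep each splitter as its own token instead of discarding it.
// Assumes every splitter is a non-empty string.
function splitKeeping(text, splitters) {
  const tokens = [];
  let between = ''; // text accumulated since the last splitter
  let i = 0;
  while (i < text.length) {
    // Splitters are tried in array order, so earlier entries take precedence.
    const hit = splitters.find((s) => text.startsWith(s, i));
    if (hit === undefined) {
      between += text[i];
      i += 1;
      continue;
    }
    if (between !== '') {
      // Emit the unrecognized text that sat between two splitters.
      tokens.push({ type: 'between', value: between, position: i - between.length });
      between = '';
    }
    tokens.push({ type: 'splitter', value: hit, position: i });
    i += hit.length;
  }
  if (between !== '') {
    tokens.push({ type: 'between', value: between, position: text.length - between.length });
  }
  return tokens;
}

const tokens = splitKeeping('a==b=c', ['==', '=']);
console.log(tokens.map((t) => t.value)); // [ 'a', '==', 'b', '=', 'c' ]
```

Listing '==' before '=' is what keeps the longer operator from being read as two assignments, which is why the order of the splitters matters.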

A taste

let tokenizer = new Retokenizer( syntax, { rich:true, betweens:'keep', condense:true, caseful:false } )

A syntax is a record of splitters (in order of precedence), removes (tokens to exclude from output), and enclosures such as quoted strings, comments, or code blocks — each of which may carry its own nested syntax.
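
As a rough illustration, a syntax record along these lines might look like the sketch below. Only the names splitters, removes, and enclosures come from the description above; the field names opener, closer, and syntax inside each enclosure are assumptions, so check example.js in the repository for the actual shape.

```javascript
// Hypothetical shape: only splitters, removes, and enclosures are named in the
// description; opener, closer, and syntax are assumed field names.
const syntax = {
  splitters: ['==', '=', '(', ')', ' '], // listed in order of precedence
  removes: [' '],                        // tokens excluded from the output
  enclosures: [
    { opener: '"', closer: '"' },        // quoted string, no nested syntax
    { opener: '/*', closer: '*/' },      // comment
  ],
};

// A code block whose contents are tokenized with the same syntax again.
// Pointing the enclosure back at its parent is one way a syntax can loop
// back on itself for recursive parsing.
const block = { opener: '{', closer: '}', syntax };
syntax.enclosures.push(block);
```

Because the block enclosure refers to the same object it lives in, nested braces at any depth would be handled by one definition rather than one definition per level.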

Installation

Retokenizer is available as an npm package. Clone the source to explore the included example.js, which demonstrates how to use it in your own code:

git clone https://github.com/Solifugus/retokenizer.git

Downloads & installation

Builds are coming soon. Grab the source from GitHub for now.
