Customize handling whitespace, comments when using generated tokenizer · Issue #14 · lalrpop/lalrpop · GitHub

Customize handling whitespace, comments when using generated tokenizer #14


Open

nikomatsakis opened this issue Sep 14, 2015 · 13 comments
@nikomatsakis
Collaborator

The current tokenizer generation always uses two fixed precedence categories, so all regular expressions have equal weight. This is useful for giving a keyword like "class" precedence over an identifier regex, but there are times when we would like to give some regexes higher precedence than others. For example, when parsing a case-insensitive language like Pascal, you would like to use a regex like r"[vV][aA][rR]" rather than having to enumerate all combinations of case. But this won't work because of precedence conflicts. Another problem is that the tokenizer implicitly skips whitespace, and there is no way to extend this set to skip other things, like comments.

Prioritization

I've been contemplating a match declaration to address prioritization, which might look like:

match {
    r"[vV][aA][rR]",
    r"[a-zA-Z][a-zA-Z0-9]+",
}

The idea here is that when you have tokens listed in a match declaration, we can create custom precedence levels, so that e.g. here the "var" regex takes precedence over the identifier regex. Tokens not listed would be implicitly added to the end of the list, with literals first and regexes second.
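Concretely, the two listed regexes would form the top of an ordered priority list, with everything else from the grammar appended below them (the unlisted tokens shown here are hypothetical examples, not part of the proposal):

```
match {
    r"[vV][aA][rR]",           // highest priority: listed first
    r"[a-zA-Z][a-zA-Z0-9]+",   // next: listed second
}
// Unlisted tokens would be appended implicitly below these,
// literals (e.g. "begin", ":=") before regexes (e.g. r"[0-9]+").
```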

Comments and whitespace

I'm less clear on what to do here. I contemplated adding things to the match declaration with an "empty action", which would signal "do nothing":

match {
    r"{.*}" => { } // Pascal-style comment
}

or having something like an empty if declaration:

if r"{.*}";

I think I prefer the first alternative, but it doesn't seem great. Another thing that is unclear is whether the implicit whitespace skipping should be disabled or retained. I think I prefer to retain it unless the user asks for it to be disabled, because it's always a bit surprising when you add something and find it implicitly removes another default; that is, adding comments to the list of things to skip would implicitly remove whitespace. But not having the implicit whitespace skipping at all feels like clearly the wrong default.
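Under the first alternative, a grammar that skips comments while restating the whitespace default might look like this (a sketch; the comment regexes are illustrative, not settled syntax):

```
match {
    r"\s+" => { },          // skip whitespace (restating the default)
    r"\{[^}]*\}" => { },    // skip Pascal-style { } comments
    r"//[^\n]*" => { },     // skip line comments
}
```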

More complex tokenizers

Eventually I'd like to support lex-like specifications, where tokenizers can have a state stack -- and perhaps go further as described in #10. It'd be nice if we chose a syntax here that scaled gracefully to that case.

So some uncertainties here!

@nikomatsakis nikomatsakis changed the title Custom priorities, comment support when using an internal tokenizer Custom priorities, comment support when using a generated tokenizer Sep 14, 2015
@nikomatsakis nikomatsakis added this to the 1.0 milestone Nov 5, 2015
@nikomatsakis nikomatsakis changed the title Custom priorities, comment support when using a generated tokenizer Customize handling whitespace, comments when using generated tokenizer Feb 23, 2016
@nikomatsakis
Collaborator Author

I've been thinking about this more and I am feeling pretty good about match declarations as the means of writing tokenizer rules. To start with I'd probably just permit:

match {
    <pattern> => { }
}

Which would indicate that matches of <pattern> are to be ignored completely. Patterns would be given the same priorities as today (0 for regular expressions, 1 for fixed strings).

In the future, I would expand this to permit:

  1. Explicit #[priority=N] declarations.
  2. Fallible actions (=>?).
  3. Actions that produce a token value or multiple token values. See below.
  4. States. Here I imagine match in foo or match in (foo, bar) where foo and bar are states, and some way to transition to a new state. Not quite sure how that should look; maybe even something like self.push_state(foo) in the action, though I'm generally not too keen on "injecting" identifiers into the action code.

The matter of actions that produce tokens still requires some thought. My initial idea was that they could return something that supports IntoIterator<T>, where T is the token type, but that would require specifying the types of tokens. Given that the current tokenizer type is just (usize, &'input str), that's not so good. We'd need some way to give names to the token types that get produced.

It might be that what you want then is a combination of an "extern token" and a custom tokenizer -- basically a way to declare a token enum, and then use match declarations to declare code that converts from text into those variants. We'd probably want some nice shorthands then so that you can just list out the variants and not have to write explicit patterns for them.
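That combination might build on the existing extern token declaration, with match rules producing the declared variants (a sketch; Token and LexError are hypothetical user-defined types, and the action syntax is not settled):

```
extern {
    type Location = usize;
    type Error = LexError;

    enum Token<'input> {
        "var" => Token::Var,
        "id"  => Token::Identifier(<&'input str>),
    }
}

match {
    r"[vV][aA][rR]" => Token::Var,                     // produce a token value
    r"[a-zA-Z][a-zA-Z0-9]*" => Token::Identifier(<>),  // capture the matched text
    r"\{[^}]*\}" => { },                               // produce nothing: skip
}
```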

@nikomatsakis nikomatsakis removed this from the 1.0 milestone Aug 7, 2016
@davll
davll commented Nov 14, 2016

lex and flex offer start conditions to define custom states for special cases like comments. See the flex documentation on start conditions.

@nikomatsakis
Collaborator Author

@davll yes, I am hoping to solve that another way (https://github.com/nikomatsakis/lalrpop/issues/195). In any case I purposefully left it out of this initial design.

@nikomatsakis
Collaborator Author

er, wrong issue, but still true :P

@divoxx
divoxx commented Aug 31, 2017

Any progress on this? Being able to match against newlines specifically would be really welcome. Even if the solution were simply an option to toggle multiline matching on the lexer, allowing the user to choose whether newlines are matched as whitespace.
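With the match declaration proposed earlier in this issue, that choice could fall out naturally by skipping only horizontal whitespace and surfacing newlines as a token (a sketch; the "newline" alias is a hypothetical illustration):

```
match {
    r"[ \t]+" => { },          // skip spaces and tabs, but not newlines
    r"\r?\n" => "newline",     // expose newlines to the grammar as a token
}
```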

@Michael-F-Bryan
Contributor

What are people's thoughts on adding a regex table-based lexer to the lalrpop_util crate, so people have an out-of-the-box solution for custom lexing (e.g. comments or significant whitespace)?

In one of my projects I created something where you register a bunch of regex patterns and "token constructor" functions, and I found it to be quite flexible and good for when you want to get up and running quickly.

The type ended up looking somewhat like this:

// Assumes the `regex` crate.
use regex::Regex;

pub struct Lexer<'input> {
    src: &'input str,
    patterns: Vec<(Regex, Box<Fn(&str) -> Result<Token, LexError>>)>,
    skips: Regex,
    ix: usize,
}
Full Lexer implementation
impl<'input> Lexer<'input> {
    pub fn new(src: &'input str) -> Lexer<'input> {
        Lexer {
            src,
            patterns: Vec::new(),
            skips: Regex::new(r"^\s+").unwrap(),
            ix: 0,
        }
    }

    pub fn with_default_patterns(src: &'input str) -> Lexer<'input> {
        use Token::*;
        const KEYWORDS: &'static [(&'static str, Token<'static>)] = &[
            ("type", Type),
            ("end_type", EndType),
            ("struct", Struct),
            ("end_struct", EndStruct),
        ];
        const PUNCTUATION: &'static [(&'static str, Token<'static>)] = &[
            (r":", Colon),
            (r";", Semicolon),
            (r"-", Minus),
            (r"/", Slash),
            (r",", Comma),
            (r"\^", Carat),
            (r"\*", Asterisk),
            (r"\+", Plus),
            (r"\(", OpenParen),
            (r"\)", CloseParen),
        ];

        let mut this = Lexer::new(src);

        for &(punc, token) in PUNCTUATION {
            let pattern = format!("^{}", punc);
            this.register_pattern(&pattern, move |_| Ok(token));
        }

        // keywords
        for &(kw, token) in KEYWORDS {
            let pattern = format!("^(?i){}", kw);
            this.register_pattern(&pattern, move |_| Ok(token));
        }

        // literals
        this.register_pattern(r"^\d+\.\d+", |s| Ok(Token::Float(s.parse().unwrap())));
        this.register_pattern(r"^\d+", |s| Ok(Token::Integer(s.parse().unwrap())));

        // catch-alls
        this.register_pattern(r"^[\w_][\w\d_]*", |s| Ok(Token::Identifier(s)));
        
        this
    }

    fn register_pattern<F>(&mut self, pattern: &str, constructor: F)
    where
        F: Fn(&str) -> Result<Token, LexError> + 'static,
    {
        assert!(pattern.starts_with("^"));

        let re = Regex::new(pattern).expect("Invalid regex");
        let constructor = Box::new(constructor);

        self.patterns.push((re, constructor));
    }

    fn trim_whitespace(&mut self) {
        let tail = self.tail();
        if let Some(found) = self.skips.find(tail) {
            self.ix += found.as_str().len();
        }
    }

    fn tail(&self) -> &'input str {
        &self.src[self.ix..]
    }

    fn is_finished(&self) -> bool {
        self.src.len() <= self.ix
    }
}

impl<'input> Iterator for Lexer<'input> {
    type Item = Result<(usize, Token<'input>, usize), LexError>;

    fn next(&mut self) -> Option<Self::Item> {
        self.trim_whitespace();

        if self.is_finished() {
            return None;
        }

        let start = self.ix;

        for &(ref pattern, ref constructor) in &self.patterns {
            if let Some(found) = pattern.find(self.tail()) {
                self.ix += found.end();

                let ret = constructor(found.as_str()).map(|t| (start, t, self.ix));
                return Some(ret);
            }
        }

        Some(Err(LexError::Unknown))
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Token<'input> {
    Integer(i64),
    Float(f64),
    Identifier(&'input str),
    Plus,
    Asterisk,
    Minus,
    Slash,
    Carat,
    Colon,
    Semicolon,
    Type,
    EndType,
    Struct,
    EndStruct,
    OpenParen,
    CloseParen,
    Comma,
}

@benkay86

Love the LALRPOP crate, but not having support for comments out of the box is a big letdown. All of the Example Uses in README.md have comments, including LALRPOP itself, so it almost seems hypocritical not to support them. I realize that it's possible to implement comments by writing a custom lexer, but that's not trivial to do!

A lot of ambitious ideas are put forward in this issue, and it looks like the match syntax has made it into the book. If there were a way to just get the comment syntax proposed by nikomatsakis supported that would be a tremendous help! Other more advanced features of a custom tokenizer could come later.

// Ignore <pattern>, a way to implement comments.
match {
    <pattern> => { }
}

@rucoder
rucoder commented Jan 17, 2020

> Love the LALRPOP crate, but not having support for comments out of the box is a big letdown. All of the Example Uses in README.md have comments, including LALRPOP itself, so it almost seems hypocritical not to support them. I realize that it's possible to implement comments by writing a custom lexer, but that's not trivial to do!
>
> A lot of ambitious ideas are put forward in this issue, and it looks like the match syntax has made it into the book. If there were a way to just get the comment syntax proposed by nikomatsakis supported that would be a tremendous help! Other more advanced features of a custom tokenizer could come later.
>
>     // Ignore <pattern>, a way to implement comments.
>     match {
>         <pattern> => { }
>     }

Agreed, that would be perfect. I've finished a grammar for my language and the only thing left is comments. I really do not want to spend my time writing a new lexer just for comments.

@Marwes
Contributor
Marwes commented Jan 17, 2020

This would be great, and it is probably the most requested feature for LALRPOP. Unfortunately there isn't anyone actively maintaining the crate. I believe I am the most active member, and I don't even remember to answer all issues and questions. I'd be happy to review a PR (though it may require some prodding to make me remember), but ultimately I don't have, nor do I believe any other maintainer has, the bandwidth to implement this.

@Marwes
Contributor
Marwes commented Mar 2, 2020

Implemented the most basic, naive version I could in #509 .

Marwes added a commit to Marwes/lalrpop that referenced this issue Mar 3, 2020
@aegooby
aegooby commented Sep 27, 2021

The regex/literal based version is great, but it has issues with nested C-style comments like

/* /* */ */

This syntax

// Ignore <pattern>, a way to implement comments.
match {
    <pattern> => { }
}

would be perfect for handling these. Is there any implementation planned, or is there maybe a workaround using the current system, short of writing an entire lexer?
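For what it's worth, nested comments are outside what any regex-based rule can express, since matching balanced /* */ pairs requires counting depth. A minimal hand-written skipper, sketched here in Rust with illustrative names (nothing here is LALRPOP API), shows how small the needed pre-pass is:

```rust
// Depth-counting skipper for nested `/* ... */` comments.
// `start` must point at an opening `/*`; returns the byte index just
// past the matching `*/`, or None if the comment never closes.
fn skip_nested_comment(src: &str, start: usize) -> Option<usize> {
    if !src[start..].starts_with("/*") {
        return None; // caller did not point us at a comment
    }
    let bytes = src.as_bytes();
    let mut depth = 0usize;
    let mut i = start;
    while i + 1 < bytes.len() {
        match (bytes[i], bytes[i + 1]) {
            (b'/', b'*') => {
                depth += 1; // entering a (possibly nested) comment
                i += 2;
            }
            (b'*', b'/') => {
                depth -= 1; // leaving one nesting level
                i += 2;
                if depth == 0 {
                    return Some(i);
                }
            }
            _ => i += 1,
        }
    }
    None // unterminated comment
}
```

A custom lexer could call a helper like this whenever it sees /*, resume ordinary tokenization at the returned index, and delegate everything else to its normal rules.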

@airstrike

Curious if there's been any progress on allowing for significant whitespace with LALRPOP. I love the ergonomics of the crate overall and am trying not to move away from it while writing a whitespace-significant DSL.

@dburgener
Contributor

The workaround described here should allow you to parse whitespace using the built-in lexer in lalrpop.
