Customize handling whitespace, comments when using generated tokenizer · Issue #14 · lalrpop/lalrpop · GitHub

Customize handling whitespace, comments when using generated tokenizer #14


Open

nikomatsakis opened this issue Sep 14, 2015 · 13 comments
@nikomatsakis
Collaborator

The current tokenizer generation always uses two fixed precedence categories, so all regular expressions have equal weight. This is useful for giving a keyword like "class" precedence over an identifier regex, but there are times when we would like to give some regexes higher precedence than others. For example, when parsing a case-insensitive language like Pascal, you would like to use a regex like r"[vV][aA][rR]" rather than having to enumerate all combinations of case. But this won't work because of precedence conflicts. Another problem is that the tokenizer implicitly skips whitespace, and there is no way to extend this set to skip other things, like comments.

Prioritization

I've been contemplating a match declaration to address prioritization, which might look like:

match {
    r"[vV][aA][rR]",
    r"[a-zA-Z][a-zA-Z0-9]+",
}

The idea here is that when you have tokens listed in a match declaration, we can create custom precedence levels, so that e.g. here the "var" regex takes precedence over the identifier regex. Tokens not listed would be implicitly added to the end of the list, with literals first and regexes second.
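Concretely, the two listed regexes would form the top of an ordered priority list, with everything else from the grammar appended below them (the unlisted tokens shown here are hypothetical examples, not part of the proposal):

```
match {
    r"[vV][aA][rR]",           // highest priority: listed first
    r"[a-zA-Z][a-zA-Z0-9]+",   // next: listed second
}
// Unlisted tokens would be appended implicitly below these,
// literals (e.g. "begin", ":=") before regexes (e.g. r"[0-9]+").
```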

Comments and whitespace

I'm less clear on what to do here. I contemplated adding things to the match declaration with an "empty action", which would signal "do nothing":

match {
    r"{.*}" => { } // Pascal-style comment
}

or having something like an empty if declaration:

if r"{.*}";

I think I prefer the first alternative, but it doesn't seem great. Another thing that is unclear is whether the implicit whitespace skipping should be disabled or retained. I think I prefer to retain it unless the user asks for it to be disabled, because it's always a bit surprising when you add something and find it implicitly removes another default; that is, adding comments to the list of things to skip would implicitly remove whitespace. But not having the implicit whitespace skipping at all feels like clearly the wrong default.
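Under the first alternative, a grammar that skips comments while restating the whitespace default might look like this (a sketch; the comment regexes are illustrative, not settled syntax):

```
match {
    r"\s+" => { },          // skip whitespace (restating the default)
    r"\{[^}]*\}" => { },    // skip Pascal-style { } comments
    r"//[^\n]*" => { },     // skip line comments
}
```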

More complex tokenizers

Eventually I'd like to support lex-like specifications, where tokenizers can have a state stack -- and perhaps go further as described in #10. It'd be nice if we chose a syntax here that scaled gracefully to that case.

So some uncertainties here!

@nikomatsakis nikomatsakis changed the title Custom priorities, comment support when using an internal tokenizer Custom priorities, comment support when using a generated tokenizer Sep 14, 2015
@nikomatsakis nikomatsakis added this to the 1.0 milestone Nov 5, 2015
@nikomatsakis nikomatsakis changed the title Custom priorities, comment support when using a generated tokenizer Customize handling whitespace, comments when using generated tokenizer Feb 23, 2016
@nikomatsakis
Collaborator Author

I've been thinking about this more and I am feeling pretty good about match declarations as the means of writing tokenizer rules. To start with I'd probably just permit:

match {
    <pattern> => { }
}

Which would indicate that matches of <pattern> are to be ignored completely. Patterns would be given the same priorities as today (0 for regular expressions, 1 for fixed strings).

In the future, I would expand this to permit:

  1. Explicit #[priority=N] declarations.
  2. Fallible actions (=>?).
  3. Actions that produce a token value or multiple token values. See below.
  4. States. Here I imagine match in foo or match in (foo, bar) where foo and bar are states, and some way to transition to a new state. Not quite sure how that should look; maybe even something like self.push_state(foo) in the action, though I'm generally not too keen on "injecting" identifiers into the action code.

The matter of actions that produce tokens still requires some thought. My initial idea was that they could return something that supports IntoIterator<T>, where T is the token type, but that would require specifying the types of tokens. Given that the current tokenizer type is just (usize, &'input str), that's not so good. We'd need some way to give names to the token types that get produced.

It might be that what you want then is a combination of an "extern token" and a custom tokenizer -- basically a way to declare a token enum, and then use match declarations to declare code that converts from text into those variants. We'd probably want some nice shorthands then so that you can just list out the variants and not have to write explicit patterns for them.
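That combination might build on the existing extern token declaration, with match rules producing the declared variants (a sketch; Token and LexError are hypothetical user-defined types, and the action syntax is not settled):

```
extern {
    type Location = usize;
    type Error = LexError;

    enum Token<'input> {
        "var" => Token::Var,
        "id"  => Token::Identifier(<&'input str>),
    }
}

match {
    r"[vV][aA][rR]" => Token::Var,                     // produce a token value
    r"[a-zA-Z][a-zA-Z0-9]*" => Token::Identifier(<>),  // capture the matched text
    r"\{[^}]*\}" => { },                               // produce nothing: skip
}
```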

@nikomatsakis nikomatsakis removed this from the 1.0 milestone Aug 7, 2016
@davll
davll commented Nov 14, 2016

lex and flex offer start conditions to define custom states for special cases like comments. See the flex documentation on start conditions.

@nikomatsakis
Collaborator Author

@davll yes, I am hoping to solve that another way (https://github.com/nikomatsakis/lalrpop/issues/195). In any case I purposefully left it out of this initial design.

@nikomatsakis
Collaborator Author

er, wrong issue, but still true :P

@divoxx
divoxx commented Aug 31, 2017

Any progress on this? Being able to match against newlines specifically would be really welcome. Even if the solution were simply an option to toggle multiline matching on the lexer, allowing the user to choose whether newlines are matched as whitespace.
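With the match declaration proposed earlier in this issue, that choice could fall out naturally by skipping only horizontal whitespace and surfacing newlines as a token (a sketch; the "newline" alias is a hypothetical illustration):

```
match {
    r"[ \t]+" => { },          // skip spaces and tabs, but not newlines
    r"\r?\n" => "newline",     // expose newlines to the grammar as a token
}
```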

@Michael-F-Bryan
Contributor

What are people's thoughts on adding a regex table-based lexer to the lalrpop_util crate, so people have an out-of-the-box solution for custom lexing (e.g. comments or significant whitespace)?

In one of my projects I created something where you register a bunch of regex patterns and "token constructor" functions, and I found it to be quite flexible and good for when you want to get up and running quickly.

The type ended up looking somewhat like this:

// Assumes the `regex` crate.
use regex::Regex;

pub struct Lexer<'input> {
    src: &'input str,
    patterns: Vec<(Regex, Box<Fn(&str) -> Result<Token, LexError>>)>,
    skips: Regex,
    ix: usize,
}
Full Lexer implementation
impl<'input> Lexer<'input> {
    pub fn new(src: &'input str) -> Lexer<'input> {
        Lexer {
            src,
            patterns: Vec::new(),
            skips: Regex::new(r"^\s+").unwrap(),
            ix: 0,
        }
    }

    pub fn with_default_patterns(src: &'input str) -> Lexer<'input> {
        use Token::*;
        const KEYWORDS: &'static [(&'static str, Token<'static>)] = &[
            ("type", Type),
            ("end_type", EndType),
            ("struct", Struct),
            ("end_struct", EndStruct),
        ];
        const PUNCTUATION: &'static [(&'static str, Token<'static>)] = &[
            (r":", Colon),
            (r";", Semicolon),
            (r"-", Minus),
            (r"/", Slash),
            (r",", Comma),
            (r"\^", Carat),
            (r"\*", Asterisk),
            (r"\+", Plus),
            (r"\(", OpenParen),
            (r"\)", CloseParen),
        ];

        let mut this = Lexer::new(src);

        for &(punc, token) in PUNCTUATION {
            let pattern = format!("^{}", punc);
            this.register_pattern(&pattern, move |_| Ok(token));
        }

        // keywords
        for &(kw, token) in KEYWORDS {
            let pattern = format!("^(?i){}", kw);
            this.register_pattern(&pattern, move |_| Ok(token));
        }

        // literals
        this.register_pattern(r"^\d+\.\d+", |s| Ok(Token::Float(s.parse().unwrap())));
        this.register_pattern(r"^\d+", |s| Ok(Token::Integer(s.parse().unwrap())));

        // catch-alls
        this.register_pattern(r"^[\w_][\w\d_]*", |s| Ok(Token::Identifier(s)));
        
        this
    }

    fn register_pattern<F>(&mut self, pattern: &str, constructor: F)
    where
        F: Fn(&str) -> Result<Token, LexError> + 'static,
    {
        assert!(pattern.starts_with("^"));

        let re = Regex::new(pattern).expect("Invalid regex");
        let constructor = Box::new(constructor);

        self.patterns.push((re, constructor));
    }

    fn trim_whitespace(&mut self) {
        let tail = self.tail();
        if let Some(found) = self.skips.find(tail) {
            self.ix += found.as_str().len();
        }
    }

    fn tail(&self) -> &'input str {
        &self.src[self.ix..]
    }

    fn is_finished(&self) -> bool {
        self.src.len() <= self.ix
    }
}

impl<'input> Iterator for Lexer<'input> {
    type Item = Result<(usize, Token<'input>, usize), LexError>;

    fn next(&mut self) -> Option<Self::Item> {
        self.trim_whitespace();

        if self.is_finished() {
            return None;
        }

        let start = self.ix;

        for &(ref pattern, ref constructor) in &self.patterns {
            if let Some(found) = pattern.find(self.tail()) {
                self.ix += found.end();

                let ret = constructor(found.as_str()).map(|t| (start, t, self.ix));
                return Some(ret);
            }
        }

        Some(Err(LexError::Unknown))
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Token<'input> {
    Integer(i64),
    Float(f64),
    Identifier(&'input str),
    Plus,
    Asterisk,
    Minus,
    Slash,
    Carat,
    Colon,
    Semicolon,
    Type,
    EndType,
    Struct,
    EndStruct,
    OpenParen,
    CloseParen,
    Comma,
}

@benkay86

Love the LALRPOP crate, but not having support for comments out of the box is a big letdown. All of the Example Uses in README.md have comments, including LALRPOP itself, so it almost seems hypocritical not to support them. I realize that it's possible to implement comments by writing a custom lexer, but that's not trivial to do!

A lot of ambitious ideas are put forward in this issue, and it looks like the match syntax has made it into the book. If there were a way to just get the comment syntax proposed by nikomatsakis supported that would be a tremendous help! Other more advanced features of a custom tokenizer could come later.

// Ignore <pattern>, a way to implement comments.
match {
    <pattern> => { }
}

@rucoder
rucoder commented Jan 17, 2020

> Love the LALRPOP crate, but not having support for comments out of the box is a big letdown. All of the Example Uses in README.md have comments, including LALRPOP itself, so it almost seems hypocritical not to support them. I realize that it's possible to implement comments by writing a custom lexer, but that's not trivial to do!
>
> A lot of ambitious ideas are put forward in this issue, and it looks like the match syntax has made it into the book. If there were a way to just get the comment syntax proposed by nikomatsakis supported that would be a tremendous help! Other more advanced features of a custom tokenizer could come later.
>
>     // Ignore <pattern>, a way to implement comments.
>     match {
>         <pattern> => { }
>     }

Agreed, that would be perfect. I've finished a grammar for my language and the only thing left is comments. I really do not want to spend my time writing a new lexer just for comments.

@Marwes
Contributor
Marwes commented Jan 17, 2020

This would be great, and it is probably the most requested feature for LALRPOP. Unfortunately there isn't anyone actively maintaining the crate. I believe I am the most active member, and I don't even remember to answer all issues and questions. I'd be happy to review a PR (though it may require some prodding to make me remember), but ultimately I don't have, nor do I believe any other maintainer has, the bandwidth to implement this.

@Marwes
Contributor
Marwes commented Mar 2, 2020

Implemented the most basic, naive version I could in #509 .

Marwes added a commit to Marwes/lalrpop that referenced this issue Mar 3, 2020
@aegooby
aegooby commented Sep 27, 2021

The regex/literal based version is great, but it has issues with nested C-style comments like

/* /* */ */

This syntax

// Ignore <pattern>, a way to implement comments.
match {
    <pattern> => { }
}

would be perfect for handling these. Is there any implementation planned, or is there maybe a workaround using the current system, short of writing an entire lexer?
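For what it's worth, nested comments are outside what any regex-based rule can express, since matching balanced /* */ pairs requires counting depth. A minimal hand-written skipper, sketched here in Rust with illustrative names (nothing here is LALRPOP API), shows how small the needed pre-pass is:

```rust
// Depth-counting skipper for nested `/* ... */` comments.
// `start` must point at an opening `/*`; returns the byte index just
// past the matching `*/`, or None if the comment never closes.
fn skip_nested_comment(src: &str, start: usize) -> Option<usize> {
    if !src[start..].starts_with("/*") {
        return None; // caller did not point us at a comment
    }
    let bytes = src.as_bytes();
    let mut depth = 0usize;
    let mut i = start;
    while i + 1 < bytes.len() {
        match (bytes[i], bytes[i + 1]) {
            (b'/', b'*') => {
                depth += 1; // entering a (possibly nested) comment
                i += 2;
            }
            (b'*', b'/') => {
                depth -= 1; // leaving one nesting level
                i += 2;
                if depth == 0 {
                    return Some(i);
                }
            }
            _ => i += 1,
        }
    }
    None // unterminated comment
}
```

A custom lexer could call a helper like this whenever it sees /*, resume ordinary tokenization at the returned index, and delegate everything else to its normal rules.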

@airstrike

Curious if there's been any progress on allowing for significant whitespace with LALRPOP. I love the ergonomics of the crate overall and am trying not to move away from it while writing a whitespace-significant DSL.

@dburgener
Contributor

The workaround described here should allow you to parse whitespace using the built-in lexer in lalrpop.
