Customize handling whitespace, comments when using generated tokenizer #14
I've been thinking about this more and I am feeling pretty good about …

Which would indicate that matches of … In the future, I would expand this to permit: …

The matter of actions that produce tokens still requires some thought. My initial idea was that they could return something that supports … It might be that what you want then is a combination of an "extern token" and a custom tokenizer -- basically a way to declare a token enum, and then using …
@davll yes, I am hoping to solve that another way (https://github.com/nikomatsakis/lalrpop/issues/195). In any case I purposefully left it out of this initial design.
er, wrong issue, but still true :P
Any progress on this? Being able to match against newlines specifically would be really, really welcome. Even if the solution were simply an option to toggle multiline matching on the lexer, allowing one to choose whether newlines are matched as whitespace.
What are people's thoughts on adding a regex table-based lexer to LALRPOP? In one of my projects I created something where you register a bunch of regex patterns and "token constructor" functions, and I found it to be quite flexible and good for when you want to get up and running quickly. The type ended up looking somewhat like this:

```rust
use regex::Regex;

/// A boxed callback that builds a token from the matched text.
type Constructor<'input> = Box<dyn Fn(&'input str) -> Result<Token<'input>, LexError>>;

pub struct Lexer<'input> {
    src: &'input str,
    // tried in registration order; the first anchored pattern that matches wins
    patterns: Vec<(Regex, Constructor<'input>)>,
    // anything matching this (whitespace) is skipped before each token
    skips: Regex,
    ix: usize,
}
```

Full Lexer implementation:

```rust
impl<'input> Lexer<'input> {
    pub fn new(src: &'input str) -> Lexer<'input> {
        Lexer {
            src,
            patterns: Vec::new(),
            skips: Regex::new(r"^\s+").unwrap(),
            ix: 0,
        }
    }
pub fn with_default_patterns(src: &'input str) -> Lexer<'input> {
use Token::*;
const KEYWORDS: &'static [(&'static str, Token<'static>)] = &[
("type", Type),
("end_type", EndType),
("struct", Struct),
("end_struct", EndStruct),
];
const PUNCTUATION: &'static [(&'static str, Token<'static>)] = &[
(r":", Colon),
(r";", Semicolon),
(r"-", Minus),
(r"/", Slash),
(r",", Comma),
(r"\^", Carat),
(r"\*", Asterisk),
(r"\+", Plus),
(r"\(", OpenParen),
(r"\)", CloseParen),
];
        let mut this = Lexer::new(src);
        // punctuation first; patterns are tried in registration order,
        // so earlier registrations win over later ones
        for &(punc, token) in PUNCTUATION {
let pattern = format!("^{}", punc);
this.register_pattern(&pattern, move |_| Ok(token));
}
// keywords
for &(kw, token) in KEYWORDS {
let pattern = format!("^(?i){}", kw);
this.register_pattern(&pattern, move |_| Ok(token));
}
        // literals: register the float pattern before the integer pattern,
        // so "1.5" doesn't lex as Integer(1) followed by ".5"
        this.register_pattern(r"^\d+\.\d+", |s| Ok(Token::Float(s.parse().unwrap())));
        this.register_pattern(r"^\d+", |s| Ok(Token::Integer(s.parse().unwrap())));
// catch-alls
this.register_pattern(r"^[\w_][\w\d_]*", |s| Ok(Token::Identifier(s)));
this
}
    fn register_pattern<F>(&mut self, pattern: &str, constructor: F)
    where
        F: Fn(&'input str) -> Result<Token<'input>, LexError> + 'static,
    {
        // patterns must be anchored so they only match at the cursor
        assert!(pattern.starts_with('^'));
        let re = Regex::new(pattern).expect("Invalid regex");
        let constructor: Constructor<'input> = Box::new(constructor);
        self.patterns.push((re, constructor));
}
fn trim_whitespace(&mut self) {
let tail = self.tail();
if let Some(found) = self.skips.find(tail) {
self.ix += found.as_str().len();
}
}
fn tail(&self) -> &'input str {
&self.src[self.ix..]
}
fn is_finished(&self) -> bool {
self.src.len() <= self.ix
}
}
impl<'input> Iterator for Lexer<'input> {
    // (start, token, end) triples: the "spanned" shape that LALRPOP's
    // external-lexer interface consumes
    type Item = Result<(usize, Token<'input>, usize), LexError>;
fn next(&mut self) -> Option<Self::Item> {
self.trim_whitespace();
if self.is_finished() {
return None;
}
let start = self.ix;
for &(ref pattern, ref constructor) in &self.patterns {
if let Some(found) = pattern.find(self.tail()) {
self.ix += found.end();
let ret = constructor(found.as_str()).map(|t| (start, t, self.ix));
return Some(ret);
}
}
        // no pattern matched; note this doesn't advance `ix`, so callers
        // will keep seeing the same error -- a real lexer should bail out
        Some(Err(LexError::Unknown))
}
}
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Token<'input> {
Integer(i64),
Float(f64),
Identifier(&'input str),
Plus,
Asterisk,
Minus,
Slash,
Carat,
Colon,
Semicolon,
Type,
EndType,
Struct,
EndStruct,
OpenParen,
CloseParen,
Comma,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LexError {
    // returned when no registered pattern matches at the current position
    Unknown,
}
```
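Using it is then just a matter of iterating; a quick sketch (the input string here is made up):

```rust
fn main() {
    let src = "type Point : struct end_struct end_type";
    for tok in Lexer::with_default_patterns(src) {
        // each item is Ok((start, token, end)) or Err(LexError::Unknown)
        println!("{:?}", tok);
    }
}
```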
Love the LALRPOP crate, but not having support for comments out of the box is a big letdown. All of the Example Uses in README.md have comments, including LALRPOP itself, so it almost seems hypocritical not to support them. I realize that it's possible to implement comments by writing a custom lexer, but that's not trivial to do! A lot of ambitious ideas are put forward in this issue, and it looks like the …
Agree, that would be perfect. I've finished a grammar for my language and the only thing left is comments. I really do not want to waste my time writing a new lexer JUST for comments.
This would be great, and it is probably the most requested feature for LALRPOP. Unfortunately there isn't anyone actively maintaining the crate. I believe I am the most active member and I don't even remember to answer all issues and questions. I'd be happy to review a PR (but it may require some prodding to make me remember), but ultimately I don't have, nor do I believe any other maintainer has, the bandwidth to implement this.
Implemented the most basic, naive version I could in #509.
The regex/literal based version is great, but it has issues with nested C-style comments (e.g. `/* outer /* inner */ still a comment */`, where the first `*/` must not end the comment). The syntax sketched in this issue would be perfect for handling these; is there any implementation planned, or maybe a workaround using the current system, besides writing an entire lexer?
Curious if there's any progress on allowing for significant whitespace with LALRPOP. I tend to love the ergonomics of the crate overall and am trying not to move away from it when writing a whitespace-significant DSL.
The workaround described here should allow you to parse whitespace using the built-in lexer in lalrpop.
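Roughly, the trick is to list whitespace in a match block so that it becomes an ordinary terminal the grammar can consume, instead of being skipped; a sketch (the terminal name and regexes are illustrative):

```
match {
    // surface newlines to the grammar as a real token
    r"\n+" => "NEWLINE",
    // keep skipping other whitespace
    r"[ \t]+" => { },
} else {
    _
}
```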
The current tokenizer generation always uses two fixed precedence categories, so all regular expressions have equal weight. This is useful for giving a keyword like "class" precedence over an identifier regex, but there are times when we would like to give some regexes higher precedence than others. For example, if parsing a case-insensitive language like Pascal, you would like to use regexes like r"[vV][aA][rR]" rather than having to enumerate all combinations. But this won't work because of precedence conflicts. Another problem is that the tokenizer implicitly skips whitespace, and there is no way to extend this set to skip other things, like comments.

Prioritization
I've been contemplating a match declaration to address prioritization.
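It might look something like this (a sketch; the exact syntax is illustrative):

```
match {
    // highest priority: the case-insensitive keyword regex
    r"(?i)var" => "VAR",
} else {
    // lower priority: the identifier regex, plus everything else
    r"[a-zA-Z_][a-zA-Z0-9_]*" => IDENTIFIER,
    _
}
```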
The idea here is that when you have tokens listed in a match declaration, we can create custom precedence levels, so that e.g. here the "var" regex takes precedence over the identifier regex. Tokens not listed would be implicitly added to the end of the list, with literals first and regexes second.
Comments and whitespace
I'm less clear on what to do here. I contemplated adding things to the match declaration with an "empty action", which would signal "do nothing", as sketched below:
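(Again, the syntax here is an illustrative sketch.)

```
match {
    r"\s+" => { },        // keep skipping whitespace, as today
    r"//[^\n]*" => { },   // additionally skip line comments
}
```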
or having something like an empty if declaration.
I think I prefer the first alternative, but it doesn't seem great. Another thing that is unclear is whether the implicit whitespace skipping should be disabled or retained. I think I prefer to retain it unless the user asks for it to be disabled, because it's always a bit surprising when you add something and find it implicitly removes another default. That is, adding comments to the list of things to skip would implicitly remove the whitespace default. But not having the implicit whitespace skipping at all feels like clearly the wrong default.
More complex tokenizers
Eventually I'd like to support lex-like specifications, where tokenizers can have a state stack -- and perhaps go further, as described in #10. It'd be nice if we chose a syntax here that scales gracefully to that case.
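Nested C-style comments are a good example of why state helps: no single regex can skip them, but they fall out naturally once the tokenizer can track even one piece of state, such as a nesting depth (a depth counter being the one-state version of a stack). A hand-rolled sketch in Rust (the function name and shape are illustrative):

```rust
// Skip a nested C-style comment starting at the beginning of `src`,
// returning the comment's byte length, or None if it is unterminated.
fn skip_nested_comment(src: &str) -> Option<usize> {
    debug_assert!(src.starts_with("/*"));
    let mut depth = 0usize;
    let mut i = 0;
    while i < src.len() {
        if src[i..].starts_with("/*") {
            depth += 1;
            i += 2;
        } else if src[i..].starts_with("*/") {
            depth -= 1;
            i += 2;
            if depth == 0 {
                return Some(i);
            }
        } else {
            // advance one full character to stay on a char boundary
            i += src[i..].chars().next().map(|c| c.len_utf8()).unwrap_or(1);
        }
    }
    None
}
```

A full state stack would generalize the same idea to strings, here-docs, and other lexer modes.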
So some uncertainties here!