Allow specifying conditions in external token patterns (contextual keywords) · Issue #966 · lalrpop/lalrpop · GitHub

Allow specifying conditions in external token patterns (contextual keywords) #966

Open
osa1 opened this issue Oct 3, 2024 · 8 comments

@osa1
Contributor
osa1 commented Oct 3, 2024

Currently an external token declaration like

enum Token {
    "(" => Token { kind: TokenKind::LParen, .. },
    ...
}

generates this pattern when matching the token:

match __token {
    Token { kind: TokenKind::LParen, .. } if true => Some(2),
    ...
}

I think we should allow the user to specify the `if true` part, something like:

enum Token {
    "(" => Token { kind: TokenKind::LParen, .. },
    "blah" => Token { kind: TokenKind::Id, text } if text.as_str() == "blah",
    ...
}

Which would be compiled to

match __token {
    Token { kind: TokenKind::LParen, .. } if true => Some(2),
    Token { kind: TokenKind::Id, text } if text.as_str() == "blah" => Some(3),
    ...
}

This can be used to handle contextual keywords. Currently, the lexer has to emit contextual keywords as dedicated tokens, and the parser then has to accept those tokens wherever a variable is expected and convert them back into variables.
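To make the proposed expansion concrete, here is a minimal sketch in plain Rust of what such a guarded match could look like. The `Token` and `TokenKind` definitions and the `terminal_index` function are hypothetical stand-ins written for illustration, not LALRPOP's generated code; the indices 2 and 3 are simply the ones quoted above.

```rust
// Hypothetical token types for illustration; not LALRPOP's or the author's real definitions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    LParen,
    Id,
}

#[derive(Debug, Clone)]
struct Token {
    kind: TokenKind,
    text: String,
}

// Hand-written stand-in for the generated terminal lookup.
// The guard on the second arm is what the proposal would let users write.
fn terminal_index(token: &Token) -> Option<usize> {
    match token {
        Token { kind: TokenKind::LParen, .. } => Some(2),
        Token { kind: TokenKind::Id, text } if text.as_str() == "blah" => Some(3),
        _ => None,
    }
}
```

In this sketch an `Id` token whose text is "blah" maps to terminal 3, while any other `Id` token falls through to `None`, since no unguarded `Id` arm is declared here.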

I think this should be fairly straightforward to implement. I can give it a try if maintainers think this can be added.

@dburgener
Contributor

I agree that this doesn't sound too hard to implement. I'm a little fuzzy on the motivating use case though. Why not just have the lexer return different token variants for these cases?

I think @Pat-Lafon has more experience with custom lexers than I do, so hopefully he can chime in as well.

@Pat-Lafon
Contributor

I tried to track down why `if true` is generated, since it seems like a no-op. As far as I can tell, it's "the way it always has been", going back to #122. It seems likely to me that this could very well have been an intended feature that never got implemented. I would be interested in seeing this as a PR; it would also make LALRPOP more consistent, given that guards are available in other parts of the grammar (according to

<lo:@L> <s:Symbol+> <c:("if" <Cond>)?> <a:Action?> <hi:@R> => {
).

I'm curious what other power this might add. Since this condition is probably arbitrary Rust code, could the condition be used to access some arbitrary mutable global state?

enum Token {
    "(" => Token { kind: TokenKind::LParen, .. },
    "blah" => Token { kind: TokenKind::Id, text } if External::IncrementandCheckNumberofBlah(&text),
    ...
}
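As a rough sketch of what a side-effecting guard would mean in plain Rust: a match guard can call any function, including one that mutates global state. Everything below is hypothetical (the `BLAH_COUNT` counter, the `increment_and_check_blah` helper, the "accept only the first two" rule, and the token types), and it says nothing about how often generated parser code would actually evaluate such a guard.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical global counter mutated from inside a match guard.
static BLAH_COUNT: AtomicUsize = AtomicUsize::new(0);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    LParen,
    Id,
}

struct Token {
    kind: TokenKind,
    text: String,
}

// Hypothetical guard helper: counts every "blah" it sees and accepts only the first two.
fn increment_and_check_blah(text: &str) -> bool {
    if text != "blah" {
        return false;
    }
    let seen = BLAH_COUNT.fetch_add(1, Ordering::SeqCst) + 1;
    seen <= 2
}

// Stand-in for a generated terminal lookup whose guard has a side effect.
fn classify(token: &Token) -> Option<usize> {
    match token {
        Token { kind: TokenKind::LParen, .. } => Some(2),
        Token { kind: TokenKind::Id, text } if increment_and_check_blah(text) => Some(3),
        _ => None,
    }
}
```

The result of `classify` here depends on how many times it has been called before, so a guard like this would only behave predictably if the parser evaluated it exactly once per token, which is an assumption rather than something the thread establishes.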

For some minor bikeshedding: I'm a little sad that the guard/condition goes after the => instead of before (which would be more consistent with Rust's guard syntax and LALRPOP's limited guards), but it makes sense given that the variables being guarded on are introduced after the =>. It almost looks out of order to me. I would propose "when" here... but this is very low priority.

(Some relevant stuff I found for my own reference: #112, #14, #10)

@yannham
Contributor
yannham commented Oct 7, 2024

> I agree that this doesn't sound too hard to implement. I'm a little fuzzy on the motivating use case though. Why not just have the lexer return different token variants for these cases?

I suspect the problem with contextual keywords, at least, is the following. For example, in the language I'm working on, `or` isn't a special keyword, so it is a valid identifier and all is good. However, in a pattern, it is used as a combinator: you can write (match { ('Foo x) or ('Bar x) => x }). There is no conflict, because an identifier can't appear at the same position. Still, it's annoying to express: you need a dedicated token for it in the lexer (so very much like a normal keyword), and then you define a rule like Identifier : Ident => { BasicIdentifier, "or" => Ident("or"), "other_contextual_keyword" => etc. }. At least that's what I understand of the problem space.
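The workaround described above can be sketched in plain Rust as follows. The `Token` enum and the `as_identifier` function are hypothetical: they only mimic the role of the `Identifier` rule, which has to list every contextual keyword and turn it back into an identifier.

```rust
// Hypothetical token type: the lexer emits a dedicated variant for each contextual keyword.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Token {
    Ident(String),
    Or, // contextual keyword, lexed like a normal keyword
    LParen,
}

// Plays the role of the `Identifier` rule: wherever an identifier is expected,
// contextual keyword tokens are converted back into plain identifiers.
fn as_identifier(token: &Token) -> Option<String> {
    match token {
        Token::Ident(name) => Some(name.clone()),
        Token::Or => Some("or".to_string()),
        _ => None,
    }
}
```

Each additional contextual keyword needs both a new lexer variant and a new arm in this conversion, which is the boilerplate being complained about.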

However, I'm not sure I understand how the proposed feature could do anything about it. I feel like what you'd need is a guard on rules when matching said token, and not a guard at the token definition site?

@osa1
Contributor Author
osa1 commented Oct 7, 2024

> However, I'm not sure I understand how the proposed feature could do anything about it. I feel like what you'd need is a guard on rules when matching said token, and not a guard at the token definition site?

With the proposed feature, if you have a contextual keyword "or" like in your example, you can do

enum Token {
    ...
    "or" => Token { kind: TokenKind::Var, text } if text.as_str() == "or",
    ...
}

So the lexer keeps recognizing "or" as a variable instead of a keyword, but you define a token in the parser that checks the variable's text when matching an "or".

Depending on your token definition you may be able to do this just with patterns, e.g.

enum Token {
    ...
    "or" => Token { kind: TokenKind::Var, text: "or" }, // <-- works when `text` is a `&str`
    ...
}

But this doesn't work when the text field's type is something other than &str. In my case, it's SmolStr, which I can't pattern match.
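The difference between the two cases can be sketched in plain Rust. The types below are hypothetical, and `String` stands in for `SmolStr` (an assumption made to keep the example free of external crates); like `SmolStr`, it cannot appear as a literal in a pattern but does offer `as_str()`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    Var,
    LParen,
}

// Text borrowed as `&str`: a string literal can appear directly in the pattern.
struct BorrowedToken<'a> {
    kind: TokenKind,
    text: &'a str,
}

// Text held in an owned string type. `String` stands in for `SmolStr` (assumption).
struct OwnedToken {
    kind: TokenKind,
    text: String,
}

// Pattern-only match: works because the field is a `&str`.
fn is_or_borrowed(token: &BorrowedToken<'_>) -> bool {
    matches!(token, BorrowedToken { kind: TokenKind::Var, text: "or" })
}

// The owned field cannot be matched against a literal, so a guard is needed.
fn is_or_owned(token: &OwnedToken) -> bool {
    matches!(token, OwnedToken { kind: TokenKind::Var, text } if text.as_str() == "or")
}
```

Both functions answer the same question; only the second needs the `if` guard, which is the part an external token declaration currently has no way to express.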

@dburgener
Contributor

If I've understood the problem statement correctly, it sounds awfully similar to this recent PR adding a documentation example. In that situation, lexer behavior needed to be controlled by parser-level information, and the example shows how parser context can be passed to the lexer. It seems like the "or" problem discussed here could be solved with that same technique. The lexer defines a mode type, and an Rc<RefCell<LexerMode>> is passed to the parser, so the parser can toggle the state and the lexer knows about the context while lexing. In the "or" example, that means the lexer lexes "or" as TokenKind::Var normally; when the parser enters the context where "or" is a keyword, it flips the mode, and the lexer then lexes "or" as TokenKind::Or until the mode is flipped back.
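A minimal sketch of that shared-mode technique, assuming a toy whitespace-splitting lexer: the `LexerMode` variants, the `Token` enum, and the `Lexer` type are hypothetical and are not taken from the referenced documentation example. In a real setup the parser's actions would flip the mode; here the caller does it by hand.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical mode shared between the parser and the lexer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LexerMode {
    Normal,
    Pattern, // inside a pattern, where "or" acts as a keyword
}

#[derive(Debug, Clone, PartialEq, Eq)]
enum Token {
    Var(String),
    Or,
}

// Toy lexer: splits on whitespace and consults the shared mode for each word.
struct Lexer<'a> {
    words: std::str::SplitWhitespace<'a>,
    mode: Rc<RefCell<LexerMode>>,
}

impl<'a> Lexer<'a> {
    fn new(input: &'a str, mode: Rc<RefCell<LexerMode>>) -> Self {
        Lexer { words: input.split_whitespace(), mode }
    }
}

impl<'a> Iterator for Lexer<'a> {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        let word = self.words.next()?;
        // Read the mode at the moment the token is produced.
        let in_pattern = *self.mode.borrow() == LexerMode::Pattern;
        Some(if in_pattern && word == "or" {
            Token::Or
        } else {
            Token::Var(word.to_string())
        })
    }
}
```

One caveat with this design: the mode is read when a token is produced, so if the parser has already requested a lookahead token before its action flips the mode, that token was lexed under the old mode.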

If I've understood the problem correctly, I'm still confused as to how the guard approach proposed here addresses it. Aren't you just moving some lexer duties into the parser? It seems to me that the guarded entry matches "or" whenever the text is "or", and a variable otherwise, unconditionally, right? So you've moved lexer duties into the parser without actually taking advantage of the parser's contextual awareness in making the lexing decision. Maybe there's a way to take advantage of the parser's contextual awareness using these guards, but I don't see it spelled out above, and it's not currently clear to me how that would work.

If I've understood the problem correctly, and it can be solved using the example I mentioned above, I'm not necessarily opposed to also adding this lexer guard feature to provide another option. I'm just still not really understanding how the lexer guard feature adds value.

@yannham
Contributor
yannham commented Oct 8, 2024

@osa1 I'm sorry, but I have trouble understanding your examples, because your token is already named "or" and thus the condition `if text.as_str() == "or"` appears to be vacuous (always true).

Do you actually mean something more like

    identifier => Token { kind: TokenKind::Var, text } if text.as_str() == "or",

where the identifier token actually matches more things than just `or`?

@osa1
Contributor Author
osa1 commented Oct 8, 2024

@yannham I'm using an external lexer, "or" is just the name I give to the pattern. So in this example: (from my previous comment)

    "or" => Token { kind: TokenKind::Var, text } if text.as_str() == "or",

I use the "or" terminal to match tokens with the pattern Token { kind: TokenKind::Var, text } and the guard text.as_str() == "or". Without the guard, it would match any variable.

@yannham
Contributor
yannham commented Oct 8, 2024

Ah, I see, thanks - I think I was reading that backwards, as I haven't used such token definitions in ages. Thanks for the clarification, it makes sense now.
