
Custom lexer with custom Token type #803


Closed · ethindp opened this issue Jun 12, 2023 · 8 comments

@ethindp commented Jun 12, 2023

I was looking through the custom lexer tutorial, but I'm not really sure how to use the lexer I've written with LALRPOP. In particular, the regular expressions I use are ones that the regex crate does not (and most likely will not) support; I use fancy_regex for those features instead. However, my lexer returns tokens of a type like this:

#[derive(Clone, Debug, Serialize, Deserialize)]
pub enum Token {
    Identifier {
        line: usize,
        column: usize,
        value: String,
    },
    // ...
}

(It also returns an anyhow::Result<Vec<Token>>.) What would be the correct way of incorporating this lexer into a LALRPOP grammar? (I just have a standalone tokenize function, though I can change that to a full Lexer type if that's required.)

@arnaudgolfouse (Contributor) commented

You can read https://lalrpop.github.io/lalrpop/lexer_tutorial/003_writing_custom_lexer.html for a detailed overview, but in short I think what you want to do is:

  • Create a Lexer type that implements Iterator<Item = anyhow::Result<Token>>.

    You may want to separate the line/column info from the token type. In that case, you will want something like

    struct Info { line: usize, column: usize }

    impl Iterator for Lexer {
        type Item = anyhow::Result<(Info, Token, Info)>;
        fn next(&mut self) -> Option<Self::Item> { /* ... */ }
    }
  • In the lalrpop file, add an extern block at the end:

    extern {
        type Location = Info; // can be changed to any type you'd like. Access it with <l:@L> and <r:@R> fragments.
        type Error = anyhow::Error;

        enum Token {
            "IDENT" => Token::Identifier { value: <String> },
            // add every token
        }
    }
    
  • When creating the parser, it will now require the Lexer as input (as opposed to a string slice).

You can now write your rules using the defined tokens:

AnIdent = {
    <id:"IDENT"> => { /* */ },
    <l:@L> "other token" <r:@R> => { /* use `l` and `r` to do whatever you want, generate an error, etc... */ },
}
// etc
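
For the standalone tokenize function mentioned in the question, a minimal sketch of such a Lexer wrapper might look like the following. It assumes tokenize is changed to return spanned tokens, i.e. anyhow::Result<Vec<(Info, Token, Info)>>; the Lexer type and its constructor are illustrative, not part of the original code:

use std::vec::IntoIter;

// Hypothetical wrapper around the standalone `tokenize` function;
// `Info` and `Token` are the types sketched above.
pub struct Lexer {
    tokens: IntoIter<(Info, Token, Info)>,
}

impl Lexer {
    // Assumes `tokenize` returns spanned tokens; if it returns a plain
    // anyhow::Result<Vec<Token>>, the spans have to be attached here instead.
    pub fn new(input: &str) -> anyhow::Result<Self> {
        Ok(Lexer { tokens: tokenize(input)?.into_iter() })
    }
}

impl Iterator for Lexer {
    type Item = anyhow::Result<(Info, Token, Info)>;

    fn next(&mut self) -> Option<Self::Item> {
        // Everything is lexed up front, so iteration itself cannot fail;
        // a streaming lexer would surface errors per token instead.
        self.tokens.next().map(Ok)
    }
}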

@ethindp (Author) commented Jun 15, 2023

@arnaudgolfouse I've made some significant changes to my lexer; notably, I've moved the line and column data to a struct, so that tokens look like this:

#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct Token {
    pub line: usize,
    pub column: usize,
    pub data: TokenData,
}

And then I just have a discriminated enumeration for each token data type:

#[derive(Clone, Debug, Serialize, Deserialize)]
pub enum TokenData {
    /// Whitespace denotes a set of characters utilized to create separation between textual elements, such as words and symbols.
    Whitespace(char),
    /// Identifiers are used as names.
    Identifier(String),
    /// An integer literal is a numeric literal in the conventional notation (that is, it solely consists of an integer in any base, but it is not a significand).
    IntegerLiteral(IBig),
    /// A real literal is a floating-point literal: it can be in any valid base and it consists of a significand and exponent.
    RealLiteral(f64),
    /// A character literal is formed by enclosing a graphic character between two apostrophe characters.
    CharacterLiteral(char),
    /// A string literal is formed by a sequence of graphic characters (possibly none) enclosed between two quotation marks used as string brackets. They are used to represent operator symbols, values of a string type, and array subaggregates.
    StringLiteral(String),
    /// A comment starts with two adjacent hyphens and extends up to the end of the line.
    Comment(String),
    /// Eof marks the end of file.
    Eof,
    // Reserved words
    // ...
    // Delimiters
    // ...
}

I think that this might make writing the parser with LALRPOP a lot easier. I do have a couple of questions about what you wrote:

  1. Why do I need to redefine the Token enum? Presumably my Lexer already defines this, so it seems redundant.
  2. Why do something like "IDENT" => Token::...? What does this accomplish?

@arnaudgolfouse (Contributor) commented

I did not design this library, so someone more knowledgeable than me might correct me!

  1. If you are referring to the enum Token { ... } in the extern block, it doesn't really redefine the Token enum. Instead, it maps each variant to a separate integer (fn __token_to_integer in the generated file). This mapping is used to generate the parsing tables, so LALRPOP needs to know how many variants there are and where they appear in the grammar, which it cannot do if Token is just an arbitrary type you pull in from somewhere else.
  2. From what I understand, the design of LALRPOP is that productions are written as follows:

    Assign = {
        "IDENT" "=" Expr => { /* ... */ },
    }

    But if you supply your own lexer, lalrpop does not know what "IDENT" or "=" refer to, so we need to tell it that these refer to Token::Identifier(<String>) and Token::EqualSign (or similar). And so we have to write the "IDENT" => Token::... part (a concrete sketch follows below).
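
Concretely, with the TokenData enum above, such a mapping could look like this (the terminal names and the EqualSign variant are illustrative, not taken from the actual lexer):

extern {
    type Location = (usize, usize); // line, column
    type Error = anyhow::Error;

    enum TokenData {
        // Left-hand side: the name used for this terminal in the grammar.
        // Right-hand side: the variant it stands for; <String> binds the
        // variant's payload so rules can use it.
        "IDENT" => TokenData::Identifier(<String>),
        "=" => TokenData::EqualSign, // hypothetical delimiter variant
    }
}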

@yannham (Contributor) commented Jun 15, 2023

I didn't take part in the original design either, but I believe what @arnaudgolfouse just said is correct. The enum Token isn't a definition but a mapping, from the symbols that LALRPOP understands as terminals in the grammar to the Rust enum variants representing your tokens.

We could probably imagine a different mechanism, maybe something based on traits with a derive macro and custom attributes acting directly on the definition of the Token enum, keeping everything in one place?
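
Purely as a sketch of that idea (no such derive exists in LALRPOP; the derive and attribute names here are invented for illustration), it might look like:

// Entirely hypothetical — LALRPOP does not provide this derive. It only
// illustrates the "keep the mapping on the enum definition" idea.
#[derive(Clone, Debug, LalrpopToken)]
pub enum TokenData {
    #[terminal("IDENT")]
    Identifier(String),
    #[terminal("=")]
    EqualSign,
}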

@ethindp (Author) commented Jun 15, 2023

@yannham and @arnaudgolfouse I'm still kind of confused, but maybe seeing the generated code would help me understand things. Just to clarify: to make the mapping work with my current setup, I should create a Lexer type, implement Iterator, and then define an extern block like so:

extern {
    type Location = (usize, usize); // line, column
    type Error = anyhow::Error;

    enum TokenData {
        "ident" => TokenData::Identifier(<String>),
        // ...
    }
}

And that's it? Then I just pass in my lexer as well as an input?

@yannham (Contributor) commented Jun 16, 2023

@ethindp yes, that should be it! Note that your iterator mustn't produce just tokens but "spanned tokens", that is Result<(Location, Token, Location), Error>.
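
With the Token/TokenData split above, a minimal sketch of that conversion might look like this (the grammar module name and ProgramParser are placeholders for whatever the .lalrpop file actually generates, and reusing the start position as both span endpoints is a simplification):

// `tokenize` is the standalone function from earlier in the thread.
let tokens = tokenize(source)?;

// Turn each `Token { line, column, data }` into the
// (Location, TokenData, Location) triple the parser expects. A real
// lexer would also track where each token ends.
let spanned = tokens.into_iter().map(|tok| {
    let loc = (tok.line, tok.column);
    Ok((loc, tok.data, loc))
});

let ast = grammar::ProgramParser::new().parse(spanned)?;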

@yannham (Contributor) commented Jul 3, 2023

@ethindp is your initial problem solved? Could we consider closing this issue?

@ethindp (Author) commented Jul 3, 2023

Yes, thank you!
