FR: Documentation Clarification re: logos · Issue #905 · lalrpop/lalrpop

FR: Documentation Clarification re: logos #905

Open
abhillman opened this issue Jun 3, 2024 · 7 comments

@abhillman
abhillman commented Jun 3, 2024

tl;dr: Is it possible to define a tokenizer that does not require a callback using logos with lalrpop?


In this tutorial http://lalrpop.github.io/lalrpop/lexer_tutorial/005_external_lib.html, a token for lexing identifiers is declared:

#[derive(Logos, Clone, Debug, PartialEq)]
#[logos(skip r"[ \t\n\f]+", skip r"#.*\n?", error = LexicalError)]
enum Token {
  // ...
  #[regex("[_a-zA-Z][_0-9a-zA-Z]*", |lex| lex.slice().to_string())]
  Identifier(String),
  // ...
}

in addition to a parser rule that matches lexed identifiers:

pub Term: Box<ast::Expression> = {
  // ...
  <name:"identifier"> => {
    Box::new(ast::Expression::Variable(name))
  },
  // ...
}

What I am noticing is that if a callback is not passed to logos' regex macro, name in the parser binds the token itself, as opposed to its value. But offering a callback should not be required – in theory – because logos' lexer can return the slice that a given token matched. For example:

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\n\f]+")]
enum Token {
    // Note that there is no callback passed to the `regex` macro
    #[regex("[a-zA-Z]+")]
    Text,
}

#[cfg(test)]
mod tests {
    use super::Token;
    use logos::Logos;

    #[test]
    fn t0() {
        let mut lex = Token::lexer("sometext");
        assert_eq!(lex.next(), Some(Ok(Token::Text)));
        assert_eq!(lex.slice(), "sometext");
    }
}

That said, I haven't yet found a way to use lalrpop with logos without providing a callback. In the example from the tutorial, this fails to compile if the callback is removed:

pub Term: Box<ast::Expression> = {
  // ...
  <name:"identifier"> => {
    Box::new(ast::Expression::Variable(name.slice()))
  },
  // ...
}

Is it possible to define a tokenizer that does not require a callback using logos with lalrpop?

@Pat-Lafon
Contributor

Hmm, I may need some more clarification on what the issue is here. For example, what is the compilation error you are hitting? What is the diff you made to the lexer example (doc/lexer)?

I tried to play around with the lexer example by removing the callback for Token::Identifier. I had to introduce some lifetimes which linked the Identifier token to the underlying source but got something that compiled. Here is my diff. How does this differ from your use case?

diff --git a/doc/lexer/src/grammar.lalrpop b/doc/lexer/src/grammar.lalrpop
index fa3d6d7..3852407 100644
--- a/doc/lexer/src/grammar.lalrpop
+++ b/doc/lexer/src/grammar.lalrpop
@@ -1,7 +1,7 @@
 use crate::tokens::{Token, LexicalError};
 use crate::ast;
 
-grammar;
+grammar<'input>;
 
 // ...
 
@@ -9,10 +9,10 @@ extern {
   type Location = usize;
   type Error = LexicalError;
 
-  enum Token {
+  enum Token<'input> {
     "var" => Token::KeywordVar,
     "print" => Token::KeywordPrint,
-    "identifier" => Token::Identifier(<String>),
+    "identifier" => Token::Identifier(<&'input str>),
     "int" => Token::Integer(<i64>),
     "(" => Token::LParen,
     ")" => Token::RParen,
@@ -32,7 +32,7 @@ pub Script: Vec<ast::Statement> = {
 
 pub Statement: ast::Statement = {
   "var" <name:"identifier"> "=" <value:Expression> ";" => {
-    ast::Statement::Variable { name, value }
+    ast::Statement::Variable {name: name.to_string(), value }
   },
   "print" <value:Expression> ";" => {
     ast::Statement::Print { value }
@@ -81,7 +81,7 @@ pub Term: Box<ast::Expression> = {
     Box::new(ast::Expression::Integer(val))
   },
   <name:"identifier"> => {
-    Box::new(ast::Expression::Variable(name))
+    Box::new(ast::Expression::Variable(name.to_string()))
   },
   "(" <e:Expression> ")" => e
 }
\ No newline at end of file
diff --git a/doc/lexer/src/lexer.rs b/doc/lexer/src/lexer.rs
index d77b015..722fefa 100644
--- a/doc/lexer/src/lexer.rs
+++ b/doc/lexer/src/lexer.rs
@@ -6,7 +6,7 @@ pub type Spanned<Tok, Loc, Error> = Result<(Loc, Tok, Loc), Error>;
 
 pub struct Lexer<'input> {
     // instead of an iterator over characters, we have a token iterator
-    token_stream: SpannedIter<'input, Token>,
+    token_stream: SpannedIter<'input, Token<'input>>,
 }
 
 impl<'input> Lexer<'input> {
@@ -19,7 +19,7 @@ impl<'input> Lexer<'input> {
 }
 
 impl<'input> Iterator for Lexer<'input> {
-    type Item = Spanned<Token, usize, LexicalError>;
+    type Item = Spanned<Token<'input>, usize, LexicalError>;
 
     fn next(&mut self) -> Option<Self::Item> {
         self.token_stream
diff --git a/doc/lexer/src/tokens.rs b/doc/lexer/src/tokens.rs
index a11b127..7c2e024 100644
--- a/doc/lexer/src/tokens.rs
+++ b/doc/lexer/src/tokens.rs
@@ -17,14 +17,14 @@ impl From<ParseIntError> for LexicalError {
 
 #[derive(Logos, Clone, Debug, PartialEq)]
 #[logos(skip r"[ \t\n\f]+", skip r"#.*\n?", error = LexicalError)]
-pub enum Token {
+pub enum Token<'a> {
     #[token("var")]
     KeywordVar,
     #[token("print")]
     KeywordPrint,
 
-    #[regex("[_a-zA-Z][_0-9a-zA-Z]*", |lex| lex.slice().to_string())]
-    Identifier(String),
+    #[regex("[_a-zA-Z][_0-9a-zA-Z]*")]
+    Identifier(&'a str),
     #[regex("[1-9][0-9]*", |lex| lex.slice().parse())]
     Integer(i64),
 
@@ -47,7 +47,7 @@ pub enum Token {
     OperatorDiv,
 }
 
-impl fmt::Display for Token {
+impl<'a> fmt::Display for Token<'a> {
     fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
         write!(f, "{:?}", self)
     }

@Pat-Lafon
Contributor

I'll close this issue. Let me know if there is more here and we can reopen.

@abhillman
Author

Thanks for the reply. The thing I am looking for here is the ability to access the slice for a given token, as opposed to storing the slice in the token itself. This allows for nice tests with logos, like the following:

#[derive(Logos, Debug, PartialEq, Clone)]
pub enum Token {
    #[regex(r"\(\*[a-zA-Z0-9 ]*\*\)")]
    BlockComment,

    #[regex("[0-9]+")]
    Int,

    #[regex(r"[0-9]+\.[0-9]*")]
    Float,

    #[regex(r#""[^"]*""#)]
    String_,

    Extra,
}
#[cfg(test)]
mod tests {
    use logos::Logos;
    use crate::tokens::Token;
    use crate::tokens::Token::{BlockComment, Float, Int};

    fn assert_tokens(s: &str, tokens: Vec<Token>) {
        let mut lex = Token::lexer(s);

        let tokens_ = lex.into_iter().map(|v| {
            if v.is_err() {
                eprintln!("{:#?}", v);
            }

            v.unwrap_or(Token::Extra)
        }).collect::<Vec<Token>>();
        assert_eq!(tokens_, tokens)
    }

    #[test]
    fn int() {
        assert_tokens("123", vec![Int])
    }

    #[test]
    fn float() {
        assert_tokens("123.0", vec![Float]);
        assert_tokens("123.", vec![Float])
    }

    #[test]
    fn block_comment() {
        assert_tokens("(* hello *)", vec![BlockComment]);
        assert_tokens("(* hello world *)", vec![BlockComment]);
        assert_tokens("(* hello world 123 *)", vec![BlockComment])
    }

    #[test]
    fn string() {
        assert_tokens(r#""hello""#, vec![Token::String_])
    }
}

Back in my lalrpop grammar, I can't seem to find a way to do this at the moment. One thing I can think of (which feels like a hack) is to provide the input a second time to the parser by declaring something like grammar: String; and then using location tracking.

Do we know if there might be another way to approach this?
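To make that hack concrete, here is a minimal sketch (hypothetical, not tested) that assumes lalrpop's grammar parameters and the @L/@R location markers, with an external lexer that reports byte offsets as in the tutorial:

// Hypothetical sketch of the "provide the input a second time" idea:
// lalrpop grammars can take parameters, and @L/@R expose the external
// lexer's Location values (byte offsets in the tutorial's setup), so a
// rule can slice the matched text back out of the original input.
grammar<'input>(input: &'input str);

pub Term: Box<ast::Expression> = {
  <l:@L> "identifier" <r:@R> => {
    // Recover the identifier's text from the source rather than the token.
    Box::new(ast::Expression::Variable(input[l..r].to_string()))
  },
}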

@abhillman
Author

Related #803

@abhillman
Author
abhillman commented Aug 12, 2024

I got this working with the following:

use crate::ast::Ast;
use crate::ast::f64;
use crate::tokens::Token;
use crate::tokens::Parse;
use crate::tokens;

grammar<'input>;

pub Exprs: Vec<Ast> = {
    <v:(<Expr>)*> => v,
}

pub Expr: Ast = {
    <info: @L> <val:"int"> => Ast::Int(<tokens::Token as Parse<i64>>::parse(&val, info.1)),
    <info: @L> <val:"float"> => Ast::Float(<tokens::Token as Parse<f64>>::parse(&val, info.1)),
    <info: @L> <val:"string"> => Ast::String_(<tokens::Token as Parse<String>>::parse(&val, info.1)),
}

extern {
    type Location = (usize, &'input str);
    type Error = ();

    enum Token {
        "int" => Token::Int,
        "float" => Token::Float,
        "string" => Token::String_,
    }
}

Besides the gymnastics of Ast::Int(<tokens::Token as Parse<i64>>::parse(&val, info.1)) -- which is a bit ugly and which I can probably improve upon -- I wonder if there is any way around having to insert <info: @L>.
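The Parse trait above isn't shown in the thread; a hypothetical reconstruction consistent with how it's called – assuming the custom lexer yields Location = (byte offset, matched slice), i.e. it attaches logos' lex.slice() to each token's start location – might look like:

// Hypothetical reconstruction of the Parse trait referenced above; the
// real definition isn't shown in the thread. It assumes Location carries
// the token's matched slice as its second element.
use crate::tokens::Token;

pub trait Parse<T> {
    fn parse(&self, slice: &str) -> T;
}

impl Parse<i64> for Token {
    fn parse(&self, slice: &str) -> i64 {
        // For Token::Int the lexer matched only digits, so this succeeds.
        slice.parse().expect("lexer guarantees a valid integer")
    }
}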

@abhillman
Author

One thing that comes to mind – which would be a feature request, and probably a major one – is letting the extern block pair a token with its matched text:

// not currently a feature
extern {
  enum Token {
    "int" => (Token::Int, String)
  }
}

grammar<'input>;

pub Exprs: Vec<Ast> = {
    <v:(<Expr>)*> => v,
}

pub Expr: Ast = {
    <val:"int"> => Ast::Int(<tokens::Token as Parse<i64>>::parse(&val.0, val.1)),
    ...
}

This could be cool, but I'm not sure of use-cases beyond this one. I could also certainly refactor my logos test code to get something like what I have above, instead of expecting this from logos (though it would be cool for this case, anyway).

@abhillman
Author
abhillman commented Aug 12, 2024

Another thing that comes to mind, which could help here, is being able to access the lexer inline in the parser -- e.g., in the case of a logos lexer, being able to call lexer.slice().

Again a feature request; it might look something like this:

    // not a feature right now
    <val:"int"> => Ast::Int(<tokens::Token as Parse<i64>>::parse(&val, $lexer.slice())),
