Custom lexer with custom Token type #803
You can read https://lalrpop.github.io/lalrpop/lexer_tutorial/003_writing_custom_lexer.html for a detailed overview, but in short I think what you want to do is:
You can now write your rules using the defined tokens:
|
@arnaudgolfouse I've made some significant changes to my lexer; notably, I've moved the line and column data to a struct, so that tokens look like this:
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct Token {
    pub line: usize,
    pub column: usize,
    pub data: TokenData,
}
And then I just have a discriminated enumeration for each token data type:
#[derive(Clone, Debug, Serialize, Deserialize)]
pub enum TokenData {
    /// Whitespace denotes a set of characters utilized to create separation between textual elements, such as words and symbols.
    Whitespace(char),
    /// Identifiers are used as names.
    Identifier(String),
    /// An integer literal is a numeric literal in the conventional notation (that is, it solely consists of an integer in any base, but it is not a significand).
    IntegerLiteral(IBig),
    /// A real literal is a floating-point literal: it can be in any valid base and it consists of a significand and exponent.
    RealLiteral(f64),
    /// A character literal is formed by enclosing a graphic character between two apostrophe characters.
    CharacterLiteral(char),
    /// A string literal is formed by a sequence of graphic characters (possibly none) enclosed between two quotation marks used as string brackets. They are used to represent operator symbols, values of a string type, and array subaggregates.
    StringLiteral(String),
    /// A comment starts with two adjacent hyphens and extends up to the end of the line.
    Comment(String),
    /// Eof marks the end of the file.
    Eof,
    // Reserved words
    // ...
    // Delimiters
    // ...
}
I think that this might make writing the parser with Lalrpop a lot easier. I do have a couple of questions about what you wrote:
|
I did not design this library, so someone more knowledgeable than me might correct me!
|
I didn't take part in the original design either, but I believe what @arnaudgolfouse just said is correct. The We could probably imagine a different mechanism, maybe something based on traits with a |
@yannham and @arnaudgolfouse I'm still kind of confused, but maybe seeing the generated code would help me understand things. Just to clarify, if I want to make the mapping using the current way I do things, I should create an extern block like this:
extern {
    type Location = (usize, usize); // line, column
    type Error = anyhow::Error;
    enum Token {
        "ident" => TokenData::Identifier(id),
        // ...
    }
}
And that's it? Then I just pass in my lexer as well as an input? |
@ethindp yes, that should be it! Note that your iterator mustn't produce just tokens but "spanned tokens", that is |
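For readers following along: in the LALRPOP custom-lexer tutorial, a "spanned token" is an item of the form `Result<(Location, Token, Location), Error>`, i.e. the token together with its start and end positions. Below is a minimal sketch of an adapter that produces such items from an already-built list of tokens. The `TokenData` here is a cut-down stand-in for the one in this thread, `Lexer`, `Spanned`, and `LexError` are hypothetical names, `String` stands in for `anyhow::Error`, and the way the end position is estimated is an assumption, since the `Token` struct above stores only a start position.

```rust
// Simplified stand-ins for the token types discussed in this thread; the
// real `TokenData` has many more variants.
#[derive(Clone, Debug, PartialEq)]
pub enum TokenData {
    Identifier(String),
    Eof,
}

#[derive(Clone, Debug, PartialEq)]
pub struct Token {
    pub line: usize,
    pub column: usize,
    pub data: TokenData,
}

/// (line, column), matching `type Location = (usize, usize);` in the grammar.
pub type Location = (usize, usize);

/// Hypothetical error type; the thread uses `anyhow::Error` instead.
pub type LexError = String;

/// The item shape a custom lexer hands to a LALRPOP parser.
pub type Spanned = Result<(Location, Token, Location), LexError>;

/// Hypothetical adapter over an already-tokenized input.
pub struct Lexer {
    tokens: std::vec::IntoIter<Token>,
}

impl Lexer {
    pub fn new(tokens: Vec<Token>) -> Self {
        Lexer { tokens: tokens.into_iter() }
    }
}

impl Iterator for Lexer {
    type Item = Spanned;

    fn next(&mut self) -> Option<Self::Item> {
        let token = self.tokens.next()?;
        let start = (token.line, token.column);
        // Assumption: the token does not record its own length, so the end
        // is estimated from the identifier text and otherwise equals start.
        let width = match &token.data {
            TokenData::Identifier(name) => name.chars().count(),
            TokenData::Eof => 0,
        };
        let end = (token.line, token.column + width);
        Some(Ok((start, token, end)))
    }
}
```

A value such as `Lexer::new(tokens)` would then be what gets passed to the generated parser in place of a bare token list. If the real lexer can track token lengths itself, storing the end position in `Token` would be more reliable than estimating it here.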
@ethindp is your initial problem solved? Could we consider closing this issue? |
Yes, thank you! |
I was looking through the custom lexer tutorial, but I'm not really sure how to use the lexer I've written with Lalrpop. In particular, the regular expressions I use are ones that the Regex crate does not (and most likely will not) support; I instead use fancy_regex for those features. However, my lexer returns types like this:
(It also returns an anyhow::Result<Vec<Token>>.) What would be the (correct) way of incorporating this lexer into a Lalrpop grammar? (I just have a standalone tokenize function, though I can change that to a full Lexer type if that's required.)
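One way to keep a standalone `tokenize` function, sketched here under assumptions, is to convert its all-at-once `Result<Vec<Token>, E>` into the stream of `Result<(Location, Token, Location), E>` items that a LALRPOP parser consumes. In the sketch below, `tokenize` is a toy stand-in (it splits on single spaces and rejects `#`), `Token` is simplified to carry its text, and `spanned_tokens` is a hypothetical helper name; none of these come from the actual project.

```rust
/// Simplified stand-in for the issue's token type.
#[derive(Clone, Debug, PartialEq)]
pub struct Token {
    pub line: usize,
    pub column: usize,
    pub text: String,
}

pub type Location = (usize, usize); // line, column
pub type Spanned<E> = Result<(Location, Token, Location), E>;

/// Toy stand-in for the standalone `tokenize` function: splits a single
/// line on spaces and fails on any word containing '#'.
pub fn tokenize(input: &str) -> Result<Vec<Token>, String> {
    let mut tokens = Vec::new();
    let mut column = 1;
    for word in input.split(' ') {
        if word.contains('#') {
            return Err(format!("unexpected '#' at column {}", column));
        }
        if !word.is_empty() {
            tokens.push(Token { line: 1, column, text: word.to_string() });
        }
        // Advance past the word and the single space that followed it.
        column += word.chars().count() + 1;
    }
    Ok(tokens)
}

/// Hypothetical helper: turns the all-at-once result into an iterator of
/// spanned tokens. A lexing failure becomes a single `Err` item.
pub fn spanned_tokens<E: 'static>(
    result: Result<Vec<Token>, E>,
) -> Box<dyn Iterator<Item = Spanned<E>>> {
    match result {
        Ok(tokens) => Box::new(tokens.into_iter().map(|t| -> Spanned<E> {
            let start = (t.line, t.column);
            let end = (t.line, t.column + t.text.chars().count());
            Ok((start, t, end))
        })),
        Err(e) => {
            let item: Spanned<E> = Err(e);
            Box::new(std::iter::once(item))
        }
    }
}
```

With this shape, `spanned_tokens(tokenize(input))` can be handed to the parser directly, so a dedicated `Lexer` type is optional. The trade-off is that the whole input is tokenized before parsing starts, which matches how the existing function already behaves.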