Document custom lexers #39
I am trying to build a parser for the BNF described here. If you scroll down far enough, you'll notice that the BNF also requires a custom lexer. However, without documentation it is pretty hard to implement one. It would be greatly appreciated if you could finish up the documentation, or alternatively help me with the implementation work. |
@mAarnos I should really write up the docs, yes. But in the meantime: do you have a link to what you have attempted? I can show you some examples -- it's actually fairly straightforward. Basically, you just need to define an enum for your tokens and write up something that implements the `Iterator` trait over them. The LALRPOP tokenizer itself makes maximal use of all features: an enum represents each token, and the iterator implementation's item type is `Result<(Location, Token, Location), Error>`. Finally, you need a corresponding section in your grammar that tells LALRPOP what your location and error types are, and which variant corresponds to which terminal; LALRPOP's own grammar has one. |
Anyway, let me know if that is helpful. I didn't go into much depth, so please do ask questions. |
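For concreteness, here is a minimal sketch of the shape such a lexer takes. The `Tok`, `Lexer`, and `LexError` names are invented for illustration; the only part LALRPOP actually prescribes is the iterator's item shape, `Result<(Location, Token, Location), Error>`, with `None` signalling EOF:

```rust
// Hypothetical hand-written lexer sketch for a tiny expression language.

#[derive(Clone, Debug, PartialEq)]
pub enum Tok {
    Num(i64),
    Plus,
    LParen,
    RParen,
}

#[derive(Debug)]
pub struct LexError {
    pub pos: usize,
}

pub struct Lexer<'input> {
    chars: std::iter::Peekable<std::str::CharIndices<'input>>,
    input: &'input str,
}

impl<'input> Lexer<'input> {
    pub fn new(input: &'input str) -> Self {
        Lexer { chars: input.char_indices().peekable(), input }
    }
}

impl<'input> Iterator for Lexer<'input> {
    // (start, token, end): LALRPOP wants both locations around each token.
    type Item = Result<(usize, Tok, usize), LexError>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            match self.chars.next()? { // returning None here signals EOF
                (_, c) if c.is_whitespace() => continue, // skip whitespace
                (_, '#') => {
                    // Skip shell-style comments to end of line; no token
                    // is ever yielded for them.
                    while let Some((_, c2)) = self.chars.next() {
                        if c2 == '\n' { break; }
                    }
                }
                (i, '+') => return Some(Ok((i, Tok::Plus, i + 1))),
                (i, '(') => return Some(Ok((i, Tok::LParen, i + 1))),
                (i, ')') => return Some(Ok((i, Tok::RParen, i + 1))),
                (i, c) if c.is_ascii_digit() => {
                    let mut end = i + c.len_utf8();
                    while let Some(&(j, c2)) = self.chars.peek() {
                        if !c2.is_ascii_digit() { break; }
                        self.chars.next();
                        end = j + c2.len_utf8();
                    }
                    let n = self.input[i..end].parse().unwrap();
                    return Some(Ok((i, Tok::Num(n), end)));
                }
                (i, _) => return Some(Err(LexError { pos: i })),
            }
        }
    }
}
```

Assuming a matching `extern` section in the grammar, the generated parser can then be driven with something like `ExprParser::new().parse(Lexer::new(input))` (the parser name here is hypothetical).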
It would be appreciated if you could briefly explain what the requirements of a lexer are in the library, and how to tell it to use one (in other words, how to tell the library you are using a custom lexer). |
I built a custom lexer (heavily based on the LALRPOP lexer), and it seems like most use cases for hand-writing lexers could be covered by expanding the LALRPOP language. As a first thought, I came up with something like this:

```
grammar; // Same as it is today.

tokens; // Optional 3rd section.

LBRACK = "[";
IDENT = "[a-zA-Z][a-zA-Z0-9-_]*";

// Something to make building comments and escapes easier.
```

As far as the more complex parts of lexing go, Yacc allows for lexer states, so you can transition in and out of states. |
It would also be nice to be able to make tokens ignorable, for example: `ignorable COMMENT = "\#.*\n";` |
The matter of extending LALRPOP's own lexer generation is #14, not this issue. |
I'd love a simple, complete example showing how to strip shell-style comments (it's what I'm actually trying to do right now, and I don't quite know where to start). |
@gnosek Currently, the only way to do that is to write your own lexer. It should implement the `Iterator` interface such that each item it yields is a `Result<(Location, Token, Location), Error>` triple, i.e. the token together with its start and end locations, and EOF is represented by the iterator returning `None`. Now, assuming the hard part is out of the way, here comes the weird part. Intuitively, you might think that if you have a `Token` enum, you can write its variants directly as terminals in your grammar. Instead, you declare a mapping in an `extern` section, and the grammar refers to each terminal by the name on the left-hand side of that mapping (string literals, in my case); the tokenizer produces the corresponding `Token` variant, and LALRPOP matches it through the mapping. Hope that helps. I was pretty bewildered for a couple hours there. |
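To make the mapping concrete, here is a hypothetical sketch of such an `extern` section and a couple of productions that use it. The terminal names and `Tok` variants are invented, matching the lexer sketch earlier in the thread; the shape of the `extern` block follows what LALRPOP expects. Note that shell-style comments never reach the grammar at all: the hand-written lexer simply skips over them and yields no token.

```
use crate::lexer::{Tok, LexError}; // hypothetical module path

grammar;

extern {
    type Location = usize;
    type Error = LexError;

    // Terminal name on the left, token pattern on the right.
    enum Tok {
        "(" => Tok::LParen,
        ")" => Tok::RParen,
        "+" => Tok::Plus,
        "num" => Tok::Num(<i64>),
    }
}

// Productions refer to terminals by the mapped names:
pub Expr: i64 = {
    <l:Expr> "+" <r:Term> => l + r,
    Term,
};

Term: i64 = {
    <n:"num"> => n,
    "(" <Expr> ")",
};
```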
That's mostly right -- except that you don't have to use string literals, you can use other names too. E.g., I used to have a kind of short-hand where you could just use string literals and we assumed that (e.g.) "KwTrace" would be mapped to `Token::KwTrace`. An alternative to writing the lexer by hand, btw, is to use a more automated lexer generator such as https://github.com/LeoTestard/rustlex. Sadly, LALRPOP's own lexer doesn't yet support a more flexible input format, mostly because I haven't had time to devote to it (I'd love to work on this with someone :) |
I am hoping to push for 1.0 soon, so I do intend to spend a fair amount of time just writing docs at least -- I think this issue should be first on my list. |
In the meantime, here is an example of a test in the repository that uses a custom lexer. Notice in particular the `extern` section declaring the location and error types. |
Ah, there is also a simpler test that eschews locations and errors, along with its tokenizer, which is a very simple one. LALRPOP's own tokenizer is probably a better model, though. |
@nikomatsakis Thanks for the clarification! I couldn't catch you on IRC the other day so I was going mainly by guesswork and experimentation. Maybe a tutorial is in order? I'll write something up and send a pull request. As for my own project, I'd prefer to use a lexer generator to help manage the explosion in complexity once I add interpolated strings. Even if the token stream input format accepted by LALRPOP won't soon be changed, perhaps a convention for writing iterator adapters will suffice. This part will have to go on the back burner, but I'll try some things out later. |
I'd like to see documentation on parsing string literals. I'm having trouble parsing them using a grammar. I've searched a bit and it seems like it's a job for the tokenizer, so perhaps it could be a good subject for the tutorial. |
I'm trying to write a […]. However, since further classifying […] |
It seems like the whitespace tutorial covers this. Should this be closed? Perhaps with the opening of more specific issues for things that may be missing? |
Closing. Please reopen (or open a new issue) if there's still something in the docs that is lacking. |
When the lexer yields an iterator of `(Location, Token, Location)` triples, the `Location` type may itself already describe a range, e.g.:

```rust
struct Location {
    pub span: core::ops::Range<usize>,
    pub file_id: usize,
}
```

This type already encapsulates the idea of a text range, so it is a bit weird to have the value duplicated twice in the triple, just because of the typing, where it was assumed that a location is a single point. Perhaps the yielded item could carry a single `Location` per token instead? |
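A small sketch of the duplication being described, assuming the `Location` type above (the `spanned` function and `Tok` enum are purely illustrative):

```rust
use core::ops::Range;

#[derive(Clone)]
struct Location {
    span: Range<usize>,
    file_id: usize,
}

enum Tok {
    Ident,
}

// Because the item shape is (start, token, end), a range-like Location
// ends up cloned into both ends of the triple:
fn spanned(tok: Tok, loc: Location) -> (Location, Tok, Location) {
    (loc.clone(), tok, loc)
}
```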
That may be better; I'm not aware of any problems with changing it. |
The current tutorial and documentation do not explain how one writes and integrates a custom lexer.