Document custom lexers · Issue #39 · lalrpop/lalrpop

Document custom lexers #39

Closed · nikomatsakis opened this issue Nov 5, 2015 · 20 comments

@nikomatsakis
Collaborator

The current tutorial and documentation do not explain how one writes and integrates a custom lexer.

@nikomatsakis nikomatsakis added this to the 1.0 milestone Nov 5, 2015
@mAarnos
mAarnos commented Jan 3, 2016

I am trying to build a parser for the BNF described here. If you scroll down far enough, you will notice that the BNF also contains a custom lexer. However, without documentation it is pretty hard to implement. It would be greatly appreciated if you could finish the documentation, or alternatively help me with the implementation.

@nikomatsakis
Collaborator Author

@mAarnos I should really write up the docs, yes. But in the meantime: do you have a link to what you have attempted? I can show you some examples -- it's actually fairly straightforward.

Basically, you just need to define an enum for your tokens and write up something that implements the Iterator trait for those tokens. It depends a bit on whether you want accurate span (line/column number) information and whether your iterator is fallible (i.e., whether there are inputs that cannot be lexed, which is generally true).

This is the LALRPOP tokenizer, which makes maximal use of all features. This enum represents each token. Here is the implementation of the iterator trait. The item type is Result<Spanned<Tok<'input>>, Error>: basically, the Err case means a lexer error, and the Ok case means a successful token. Spanned is just an alias for (Location, Tok<'input>, Location), where the first Location is the start of the token, and the second is the end of the token.
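For illustration, a minimal hand-written lexer along those lines might look roughly like this (just a sketch; the Tok variants, LexicalError, and Lexer names are invented for the example, not taken from LALRPOP itself):

use std::str::CharIndices;

// Hypothetical token type: one variant per terminal in your grammar.
#[derive(Clone, Debug, PartialEq)]
pub enum Tok {
    Semi,
    // ... more variants
}

// Hypothetical error type for input the lexer cannot handle.
#[derive(Debug)]
pub enum LexicalError {
    UnexpectedCharacter(usize, char),
}

// (start location, token, end location), wrapped in Result because lexing can fail.
pub type Spanned<Tok, Loc, Error> = Result<(Loc, Tok, Loc), Error>;

pub struct Lexer<'input> {
    // peekable so that multi-character tokens can look ahead
    chars: std::iter::Peekable<CharIndices<'input>>,
}

impl<'input> Lexer<'input> {
    pub fn new(input: &'input str) -> Self {
        Lexer { chars: input.char_indices().peekable() }
    }
}

impl<'input> Iterator for Lexer<'input> {
    type Item = Spanned<Tok, usize, LexicalError>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            match self.chars.next() {
                None => return None,                           // end of input
                Some((_, c)) if c.is_whitespace() => continue, // skip whitespace
                Some((i, ';')) => return Some(Ok((i, Tok::Semi, i + 1))),
                Some((i, c)) => return Some(Err(LexicalError::UnexpectedCharacter(i, c))),
            }
        }
    }
}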

Finally, you need a corresponding section in your grammar that tells LALRPOP what your location and error types are, and which variant corresponds to which terminal. Here is the one LALRPOP uses.
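As a rough sketch that matches the toy lexer above (invented names again, not LALRPOP's real grammar), that section looks something like this:

extern {
    type Location = usize;
    type Error = LexicalError;

    enum Tok {
        ";" => Tok::Semi,
        // one line per terminal, mapping the name used in the grammar
        // to the enum variant the lexer produces
    }
}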

@nikomatsakis
Collaborator Author

Anyway, let me know if that is helpful. I didn't go into much depth, so please do ask questions.

@Binero
Binero commented Feb 9, 2016

It would be appreciated if you could briefly explain what the requirements of a lexer are in this library, and how to tell it to use one (in other words, how to tell the library you are using a custom lexer).

@nixpulvis
Contributor

I built a custom lexer (heavily based on the LALRPOP lexer), and it seems like most use cases for hand-writing lexers could be covered by expanding the LALRPOP language. As a first thought, I came up with something like this.

grammar;

// Same as it is today.

tokens;  // Optional 3rd section.

LBRACK = "[";
IDENT = "[a-zA-Z][a-zA-Z0-9-_]*";

// Something to make building comments and escapes easier.

As for the more complex parts of lexing, Yacc allows for lexer states, so you can transition in and out of states like COMMENT or ESCAPE. This works really nicely in my opinion.

@Binero
Binero commented Mar 3, 2016

It would also be nice to make them ignorable, for example:

ignorable COMMENT = "\#.*\n";

@nikomatsakis
Collaborator Author

The matter of extending LALRPOP's own lexer generation is #14, not this issue.

@gnosek
gnosek commented Aug 5, 2016

I'd love a simple, complete example showing how to strip shell-style comments (that's what I'm actually trying to do right now, and I don't quite know where to start).

@malleusinferni
Contributor

@gnosek Currently, the only way to do that is to write your own lexer. It should implement the Iterator trait such that Item is Result<(Loc, Token, Loc), Err> for arbitrary types Token, Loc, and Err, where Loc holds offsets into the input string. (I say "arbitrary", but Token should be an enum, and Loc should just be usize.) So, the return value of your next() method will look like this:

Some(Ok((token_start, Token::SomeToken, token_end)))

...and EOF is represented by None. You're probably gonna end up writing this out by hand. Hopefully it will stay under a thousand lines of code. Good luck.

Now, assuming the hard part is out of the way, here comes the weird part. Intuitively, you might think that if you have a Token enum, you can write Token::Foo directly in the grammar rules. Not so fast. What you're supposed to do (as far as I can tell) is use string literals as symbolic names to represent your enum values. You define all these mappings yourself (reposting @nikomatsakis's link). Thus, if you have some keyword foo, you'll write "foo" in your grammar rules, and in the bottom section you'll also write "foo" => Token::Foo.

So, for example, here's a line from the grammar I wrote today:

"trace" <Expr> ";" => ast::Stmt::Trace(<>),

And here are the mappings it uses:

";" => Tok::EndLn,
"trace" => Tok::KwTrace,
// others omitted

Here, my tokenizer produces Tok::EndLn in response to a newline as well as an actual semicolon... But the parser doesn't know that, because it doesn't handle the character stream directly. Likewise, all my comment-skipping logic is in the tokenizer, and the parser knows nothing about my comment syntax.
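To give a flavor of @gnosek's shell-comment case, the tokenizer-side logic can be tiny. A made-up helper, just to illustrate the idea (not code from my actual lexer):

// Given the byte offset of a '#', return the offset where lexing should resume.
fn skip_shell_comment(input: &str, start: usize) -> usize {
    match input[start..].find('\n') {
        Some(newline) => start + newline + 1, // resume just past the newline
        None => input.len(),                  // comment runs to end of input
    }
}

fn main() {
    let src = "echo hi # a comment\nnext line";
    let hash = src.find('#').unwrap();
    assert_eq!(&src[skip_shell_comment(src, hash)..], "next line");
}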

Hope that helps. I was pretty bewildered for a couple hours there.

@nikomatsakis
Collaborator Author

That's mostly right -- except that you don't have to use string literals; you can use other names too, e.g. trace => Tok::KwTrace. But you do have to define a mapping. Moreover, it's quite flexible: for example, you could write trace => Tok::KwTrace(<u32>, _), which would mean that the value of a trace token is the u32 found in the first argument of the enum variant, and that the second argument should be ignored.
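Putting that together, the enum section of the extern block can mix both styles, roughly like so (token names invented for the example):

extern {
    enum Tok {
        trace => Tok::KwTrace,
        ";" => Tok::Semi,
        num => Tok::Num(<u32>),
    }
}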

I used to have a kind of shorthand where you could just use string literals and we assumed that (e.g.) "KwTrace" would be mapped to Tok::KwTrace, but I think I removed it. I could add it back; it might be handy.

Another option to writing the lexer by hand, btw, is to use a more automated lexer generator such as https://github.com/LeoTestard/rustlex. Sadly, LALRPOP's own lexer generator doesn't yet support a more flexible input format, mostly because I haven't had time to devote to it (I'd love to work on this with someone :)

@nikomatsakis
Collaborator Author

I am hoping to push for 1.0 soon, so I do intend to spend a fair amount of time just writing docs at least -- I think this issue should be first on my list.

@nikomatsakis
Collaborator Author

In the meantime, here is an example of a test that uses a custom lexer:

Notice in particular the extern section, which defines the various bits of that lexer. For example, if using the most capable option, the lexer is expected to yield an iterator of Result<(L, T, L), E> values, where L is a "location" type (usually a usize) used to indicate where the token started and ended, T is the token type, and E is the "error" type. Those various bits are defined in the extern section. I think you can just leave out the type Location and type Error if your lexer either doesn't track locations or doesn't yield errors.
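If so, a stripped-down extern section along these lines (made-up token names) would be enough for an infallible lexer that doesn't track locations:

extern {
    enum Tok {
        "num" => Tok::Num(<i32>),
        "+" => Tok::Plus,
    }
}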

@nikomatsakis
Collaborator Author

Ah, here is a simpler test that eschews locations and errors:

And the tokenizer, a very simple one, is here:

LALRPOP's own tokenizer is probably a better model though:

@malleusinferni
Contributor

@nikomatsakis Thanks for the clarification! I couldn't catch you on IRC the other day so I was going mainly by guesswork and experimentation. Maybe a tutorial is in order? I'll write something up and send a pull request.

As for my own project, I'd prefer to use a lexer generator to help manage the explosion in complexity once I add interpolated strings. Even if the token stream input format accepted by LALRPOP won't soon be changed, perhaps a convention for writing iterator adapters will suffice. This part will have to go on the back burner, but I'll try some things out later.

@sbstp
sbstp commented Aug 16, 2016

I'd like to see documentation on parsing string literals. I'm having trouble parsing them using a grammar. I've searched a bit and it seems like it's a job for the tokenizer, so perhaps it could be a good subject for the tutorial.

@euclio
euclio commented Oct 18, 2016

I'm trying to write a sh parser in Rust (probably way over my head, but I'm patient). I'm wondering if lalrpop is the tool for the job. For my custom lexer, how do I represent context-sensitive tokens as seen here? My lexer can currently convert the input into operators, io_numbers, and tokens.

However, since further classifying a token into word, name, or assignment_word requires knowing the context of the token (and the grammar itself), I'm not quite sure how to proceed.

Perhaps this is related to #156 and #112?

@ahmedcharles
Contributor

It seems like the whitespace tutorial covers this. Should this be closed? Perhaps with the opening of more specific issues for things that may be missing?

@ahmedcharles
Contributor

Closing. Please reopen (or open a new issue) if there's still something in the docs that is lacking.

@eduhenke

In the meantime, here is an example of a test that uses a custom lexer:

Notice in particular the extern section, which defines the various bits of that lexer. For example, if using the most capable option, the lexer is expected to yield an iterator of Result<(L, T, L), E> values, where L is a "location" type (usually a usize) used to indicate where the token started and ended, T is the token type, and E is the "error" type. Those various bits are defined in the extern section. I think you can just leave out the type Location and type Error if your lexer either doesn't track locations or doesn't yield errors.

When the lexer yields an iterator of Result<(L, T, L), E>, is there any specific reason why the L type parameter is used twice? I understand that in the simple case where L=usize, the first L is the start location and the second L is the end location; however, couldn't we model that by making L=(usize, usize)? I am asking because I am currently using L as:

struct Location {
    pub span: core::ops::Range<usize>,
    pub file_id: usize,
}

And this type already encapsulates the idea of a text range. It is a bit weird to have the value duplicated, just because the signature assumed that L was a simpler type like usize.

@Marwes
Contributor
Marwes commented Aug 23, 2021

I understand that in the simple case where L=usize, the first L is the start location and the second L is the end location; however, couldn't we model that by making L=(usize, usize)?

That may be better; I'm not aware of any problems with changing it.
