tokenizer support #10
We currently have some simple tokenizer generation support -- it has no states, and you can't write any custom action code at all. There are some immediate needs for extension described in #14, but it'd be nice to tie this all together in a grand, unified design. |
My current thought is something like this. You can add a tokenizer declaration to the grammar. The action will be to produce something that supports the token interface the parser expects; I had imagined that the internal token form would "desugar" to this. One catch is that if you are hand-writing code, you would presumably want to specify your own token type -- the current internal tokenizer just generates tuples. The main use case for producing multiple tokens is something like making one regex yield two tokens (as in the `&&` example later in this thread). While resolving these annoying questions, I could JUST support a simple form, to allow skipping comments:
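A minimal sketch of what such a form could look like (hypothetical syntax, extrapolated from the `match` examples later in this thread, not the exact snippet from this comment):

```
match {
    r"//[^\n]*" => (), // a rule mapped to `()` produces no token, so the comment is skipped
}
```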
(However, we also want to consider states, so we can support nested comments.) |
It's time to make progress here. I have a concrete proposal I plan to experiment with. It does not yet handle multiple lexer states, but I'm interested in inferring those automatically based on the grammar instead. My idea is roughly this. One can define a `match` declaration:

```
match {
    r"[a-z]\w+", // anything beginning with lowercase letter
    "Foo"        // Foo specifically
} else {
    r"[A-Z]\w+", // anything beginning with uppercase letter
    _,           // include anything else mentioned in the grammar that doesn't already appear here
};
```

In addition, these entries can have actions. Right now, these actions can only take two forms: mapping an entry to a quoted terminal (like `"BEGIN"`) or to a bare identifier (like `IDENTIFIER`).
The default action we saw above is then equivalent to an identity rename (e.g. mapping `"Foo"` to `"Foo"`). So, to accommodate Pascal, where the reserved words are case insensitive, I might do:

```
match {
    r"(?i)begin" => "BEGIN",
    r"(?i)end" => "END",
} else {
    r"[a-zA-Z_][a-zA-Z0-9_]*" => IDENTIFIER,
};
```

This would declare that the case-insensitive keywords map to the terminals `"BEGIN"` and `"END"`, and anything else matching the identifier regex maps to `IDENTIFIER` (with the first block taking precedence over the `else` block).
Note that you can map to any symbol name, with or without quotes etc. So then I can write my grammar using those "cleaned-up" names:
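For illustration, a grammar fragment using those names might look roughly like this (a sketch only; the nonterminals and productions are invented for this example, not taken from the original comment):

```
Block: () = {
    "BEGIN" Statement* "END" => (),
};

Statement: () = {
    IDENTIFIER => (),
};
```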
I feel good about this. |
I've been looking at the code a bit. I'm going to take a few notes on a possible implementation plan, though I've run out of "side project" time for this morning:
To make this system I am describing here work, we have to make changes in a few places:

- First and foremost, we have to parse the `match` declaration itself.
- Next, we can now have "bare" terminals even with an internal tokenizer. So, name resolution needs to find the internal tokenizer `match` declaration and check the right-hand side of each entry.
- In token-check, when we validate a literal like `"BEGIN"`, we should resolve it against the right-hand sides of the `match` entries rather than assuming it is matched verbatim.
- Finally, when we build the DFA, we would build it up using the LHS of each `match` entry. |
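To make the plan above a bit more concrete, the parsed `match` declaration could be represented along these lines (a rough sketch; these are not lalrpop's actual internal types, and all names here are invented):

```rust
/// One entry of a `match { ... } else { ... }` declaration.
/// The LHS drives DFA construction; the RHS is the terminal name the grammar sees.
#[derive(Debug)]
struct MatchEntry {
    precedence: usize,      // entries before `else` out-prioritize the `else` block
    pattern: Pattern,       // LHS: what the DFA matches
    rename: Option<String>, // RHS: e.g. map r"(?i)begin" to "BEGIN"; None means identity
}

#[derive(Debug)]
enum Pattern {
    Literal(String), // e.g. "Foo"
    Regex(String),   // e.g. r"[a-z]\w+"
    CatchAll,        // `_`: anything else mentioned in the grammar
}

fn main() {
    // The Pascal example from above, in this representation:
    let entries = vec![
        MatchEntry { precedence: 1, pattern: Pattern::Regex(r"(?i)begin".into()), rename: Some("BEGIN".into()) },
        MatchEntry { precedence: 1, pattern: Pattern::Regex(r"(?i)end".into()), rename: Some("END".into()) },
        MatchEntry { precedence: 0, pattern: Pattern::Regex("[a-zA-Z_][a-zA-Z0-9_]*".into()), rename: Some("IDENTIFIER".into()) },
    ];
    // Name resolution would look up "BEGIN", "END", and IDENTIFIER against the renames;
    // the DFA would be built from the `pattern` fields, preferring higher precedence.
    for e in &entries {
        println!("[{}] {:?} -> {:?}", e.precedence, e.pattern, e.rename);
    }
}
```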
I like this API; I'll try to see if I can make enough sense of the lalrpop code to make some progress. |
So @wagenet made awesome progress here. There is still a bit more work to go before I'm ready to close this issue. In particular, I want to add support for tokenizers that produce zero or multiple tokens in response to an input. I'm envisioning something like this:

```
match {
    r"&&" => ("&[&]", "&[]"), // produce two tokens in response to 1 regex
} else {
    r"\s+" => (),             // skip whitespace
    r"//.*" => (),            // skip EOL comments
    r"/\*.*\*/" => (),        // skip (non-nested) /* ... */ comments
}
```

This would basically make us able to synthesize any regular (as in regular expressions) lexer. To go beyond regular languages we'd have to add states -- I'm not opposed to it, but want to think on it a bit more. One thing that I don't know: right now, we implicitly skip whitespace for you. I'd like to give users control of that, but I'm not sure when to disable the current behavior. For example, should we disable the whitespace default behavior if there are any tokens that map to `()`? Somehow the "no skip whitespace" thing feels related to all of this. Thoughts? |
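To illustrate what zero- and multi-token rules would mean operationally, here is a rough hand-written equivalent of the snippet above (a sketch only, using the `regex` crate; the token names are made up and this is not what lalrpop actually generates):

```rust
use regex::Regex;

#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Amp, // one `&`; the `&&` rule below emits two of these
    Word(String),
}

/// A hand-rolled lexer that can emit zero, one, or two tokens per match,
/// mirroring the `=> ()` and `=> (a, b)` actions sketched above.
fn tokenize(mut input: &str) -> Vec<Tok> {
    let skip = Regex::new(r"^(\s+|//[^\n]*)").unwrap();
    let amps = Regex::new(r"^&&").unwrap();
    let word = Regex::new(r"^\w+").unwrap();

    let mut out = Vec::new();
    while !input.is_empty() {
        if let Some(m) = skip.find(input) {
            // `=> ()`: zero tokens, just consume the text
            input = &input[m.end()..];
        } else if let Some(m) = amps.find(input) {
            // `=> (a, b)`: two tokens from one match
            out.push(Tok::Amp);
            out.push(Tok::Amp);
            input = &input[m.end()..];
        } else if let Some(m) = word.find(input) {
            // the ordinary case: one token per match
            out.push(Tok::Word(m.as_str().to_string()));
            input = &input[m.end()..];
        } else {
            panic!("unexpected character: {:?}", input.chars().next().unwrap());
        }
    }
    out
}

fn main() {
    // prints [Word("foo"), Amp, Amp, Word("bar")]
    println!("{:?}", tokenize("foo && bar // trailing comment"));
}
```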
In the former case, how would one write a lexer that skips nothing? |
@8573 yeah, good question. I was thinking that the better "magic" answer would be to maybe look to see if any regex patterns may match whitespace, but it feels...kind of magic. |
Opting for the non-magic option seems to be the more predictable choice, and it wouldn't preclude implementing the magic as a fallback when the option is not set. |
Yes. I think there are two viable options: infer when to disable the implicit whitespace-skipping from the declared rules themselves (for example, when some rule maps to `()` or could itself match whitespace), or require an explicit annotation to turn it off.
Right now, I lean towards the former; the annotation just feels awkward.
The next question is: what notation should we use for "skip" rules? I am thinking a bit forward here. I eventually expect to support rules that produce zero tokens (skips), rules that produce multiple tokens, custom action code, and possibly lexer states.
Considering all these things I am leaning towards one of two candidate notations. I sort of lean towards the second one right now, but really either is fine. |
Has there been any progress with "skip" rules? I want to use them for line comments and inline comments without nesting. What needs to be changed to support "skip" rules? |
How does this relate to #14? They both seem related to tokenizers. |
@ahmedcharles yeah I think they may be basically the same thing. =) |
Should one of them be closed? |
What is the status for parsing indentation sensitive languages? This (usually) requires a non-regular lexer to create functionality such as Python ignoring indentation within lists. Is there any documentation on how one can create their own lexer? |
@anchpop http://lalrpop.github.io/lalrpop/lexer_tutorial/index.html explains how to write and integrate a simple lexer. |
Oh, I didn't notice that, thanks |
I would be happy to work on an implementation of something like what's described in this whitepaper, as it would dramatically simplify some of my code, but I would need some guidance from someone experienced with the library on what they think the best way to implement it would be. In addition, being able to configure the lexer to emit ENDLINE or STARTLINE tokens would allow some languages to be described without much more complexity on LALRPOP's side, I think. |
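For what it's worth, a hand-written lexer along the lines of the linked tutorial can already emit synthetic tokens such as ENDLINE. A minimal sketch (assuming the iterator-of-spanned-triples interface described in the lexer tutorial; the token type and the handling here are purely illustrative):

```rust
#[derive(Debug, Clone)]
enum Tok {
    Word(String),
    EndLine, // synthetic token emitted for each newline
}

// The lexer tutorial's convention: an iterator of (start, token, end) results.
type Spanned<T, Loc, E> = Result<(Loc, T, Loc), E>;

struct Lexer<'input> {
    chars: std::iter::Peekable<std::str::CharIndices<'input>>,
    input: &'input str,
}

impl<'input> Lexer<'input> {
    fn new(input: &'input str) -> Self {
        Lexer { chars: input.char_indices().peekable(), input }
    }
}

impl<'input> Iterator for Lexer<'input> {
    type Item = Spanned<Tok, usize, String>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            match self.chars.next()? {
                // a newline becomes an explicit ENDLINE token
                (i, '\n') => return Some(Ok((i, Tok::EndLine, i + 1))),
                // other whitespace is skipped
                (_, c) if c.is_whitespace() => continue,
                // gather a run of word characters
                (start, c) if c.is_alphanumeric() => {
                    let mut end = start + c.len_utf8();
                    while let Some(&(j, c2)) = self.chars.peek() {
                        if !c2.is_alphanumeric() {
                            break;
                        }
                        end = j + c2.len_utf8();
                        self.chars.next();
                    }
                    let word = self.input[start..end].to_string();
                    return Some(Ok((start, Tok::Word(word), end)));
                }
                (i, c) => return Some(Err(format!("unexpected character {:?} at {}", c, i))),
            }
        }
    }
}

fn main() {
    for tok in Lexer::new("foo bar\nbaz") {
        println!("{:?}", tok);
    }
}
```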
We want the ability to add your own tokenizer. This should permit very lightweight specifications but scale to really complex things. I envision the first step as generating a tokenizer based on the terminals that people use (#4) but it'd be nice to actually just permit tokenizer specifications as well, where people can write custom action code based on the strings that have been recognized.
Some things I think we'll want:
- `Tok` for just one token.
- `()`, we expect you to return zero tokens.
- `(Tok, Tok)`, you always return two tokens.
- `Vec<Tok>`, we expect you to return a dynamic number of tokens.
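To make that list concrete, custom action code in a tokenizer specification might hypothetically look something like this (purely illustrative syntax sketched for this issue's proposal, not something lalrpop accepts as written; `Tok`, its variants, and the helper function are invented here):

```
// return type Tok: one regex, one token
r"[0-9]+" => Tok::Num(<>.parse().unwrap()),

// return type (): recognize the text but emit nothing
r"\s+" => (),

// return type (Tok, Tok): always emit exactly two tokens
r"&&" => (Tok::Amp, Tok::Amp),

// return type Vec<Tok>: emit a dynamic number of tokens
r"\$\w+" => expand_abbreviation(<>),
```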