tokenizer support · Issue #10 · lalrpop/lalrpop · GitHub

Open
nikomatsakis opened this issue Jun 24, 2015 · 18 comments

@nikomatsakis
Collaborator

We want the ability to add your own tokenizer. This should permit very lightweight specifications but scale to really complex things. I envision the first step as generating a tokenizer based on the terminals that people use (#4), but it'd be nice to also permit full tokenizer specifications, where people can write custom action code based on the strings that have been recognized.

Some things I think we'll want:

  • A stack of states for the tokenizer to use
  • The ability for action code to yield any number of tokens. I envision that we'll define a trait and require you to return some value that conforms to that trait (a rough sketch follows this list). This trait will permit:
    • if you write a return type of Tok, you return just one token.
    • if you write a return type of (), we expect you to return zero tokens.
    • if you write a return type of (Tok, Tok), you always return two tokens.
    • if you write a return type of Vec<Tok>, we expect you to return a dynamic number of tokens.
    • internally, the generated code will keep a queue of generated tokens; as tokens are requested, they are removed from the front, and we only go back to the input bytes when that queue is exhausted.
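
As a rough sketch of the trait idea in the last bullet, something like the following plain Rust could work. The Tok type and the IntoTokens/TokenQueue names are purely illustrative, not anything LALRPOP defines or generates today:

use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Gt,       // a plain `>`
    GtPrefix, // a `>` known to be followed by another `>`
}

// Anything an action may return: zero, one, or many tokens.
trait IntoTokens {
    fn into_tokens(self, out: &mut VecDeque<Tok>);
}

impl IntoTokens for () {
    fn into_tokens(self, _out: &mut VecDeque<Tok>) {} // zero tokens
}

impl IntoTokens for Tok {
    fn into_tokens(self, out: &mut VecDeque<Tok>) {
        out.push_back(self); // exactly one token
    }
}

impl IntoTokens for (Tok, Tok) {
    fn into_tokens(self, out: &mut VecDeque<Tok>) {
        out.push_back(self.0); // always two tokens
        out.push_back(self.1);
    }
}

impl IntoTokens for Vec<Tok> {
    fn into_tokens(self, out: &mut VecDeque<Tok>) {
        out.extend(self); // a dynamic number of tokens
    }
}

// The generated tokenizer would drain this queue before scanning more bytes.
#[derive(Default)]
struct TokenQueue {
    queue: VecDeque<Tok>,
}

impl TokenQueue {
    fn push_action_result(&mut self, result: impl IntoTokens) {
        result.into_tokens(&mut self.queue);
    }

    fn next_token(&mut self) -> Option<Tok> {
        // An empty queue would trigger scanning more input in the real design.
        self.queue.pop_front()
    }
}
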
@nikomatsakis
Collaborator Author

We currently have some simple tokenizer generation support -- it has no states, and you can't write any custom action code at all. There are some immediate needs for extension described in #14, but it'd be nice to tie this all together in a grand, unified design.

@nikomatsakis
Collaborator Author

My current thought is something like this. You can add match sections to the grammar, with an optional priority annotation. These can contain either literal strings or regular expressions.

#[priority=2]
match {
    "foo" => action, // literal text
    r"bar" => action, // regular expression
    "foo" =>? action, // fallible action
}

The action will be to produce something that supports IntoIterator<Item=TheToken>. Thus you can return None to just skip the token, Some(foo) to produce one token, or vec![...] to produce many tokens (or other such things). If the rule is fallible, you should produce Result<T,E> where E is a ParseError and T is something supporting IntoIterator, as above.
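
Concretely, Option<Tok> and Vec<Tok> both already implement IntoIterator<Item = Tok>, so one bound covers all of these cases. A minimal sketch, with an invented Tok type and emit helper (a fallible action would return Result<T, ParseError> with T bounded the same way):

#[derive(Debug)]
enum Tok {
    Ident(String),
    Shr, // `>>`
}

// Append whatever an action returned -- zero, one, or many tokens -- to the stream.
fn emit(action_result: impl IntoIterator<Item = Tok>, stream: &mut Vec<Tok>) {
    stream.extend(action_result);
}

fn main() {
    let mut stream = Vec::new();
    emit(None, &mut stream);                         // skip: zero tokens
    emit(Some(Tok::Ident("x".into())), &mut stream); // one token
    emit(vec![Tok::Shr, Tok::Shr], &mut stream);     // many tokens
    assert_eq!(stream.len(), 3);
}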

I had imagined that the internal token form would "desugar" to this. One catch is that if you are hand-writing code, you would presumably want to specify your own token type. The current internal tokenizer just generates tuples like (u32, &'input str), where the integer identifies which token it is. This is pretty opaque and not something you could feed by hand (OTOH, the main use for this form is to specify the whitespace to skip, in which case you'd just return None).

The main use case for producing multiple tokens is something like making >>> a single match that then generates 3 tokens, one for each >. (Here, the first two would be a special sort of > that indicates that another > follows afterwards.) This could also be accommodated by incorporating lookahead into the regular expressions, I guess. I'd have to think about how best to do that (or read some papers).

While resolving these annoying questions, I could JUST support this form, to allow skipping comments:

match {
    "//[^\n]*" => None, 
}

(However, we also want to consider states, so we can support nested /* comments. For that I imagined match in state, and returning something like Push(state, tokens), but that also requires some thought about how the states are reflected into rust code, etc. Maybe it should be special syntax. Bah.)

@nikomatsakis
Collaborator Author
nikomatsakis commented Mar 24, 2017

It's time to make progress here. I have a concrete proposal I plan to experiment with. It does not yet handle multiple lexer states, but I'm interested in inferring those automatically based on the grammar instead.

My idea is roughly this. One can define a match block that defines your tokenizer. This consists of multiple sections corresponding to priorities. The regular expressions or strings within a section all have equal priority and hence must not match the same things:

match {
    r"[a-z]\w+", // anything beginning with lowercase letter
    "Foo" // Foo specifically
} else {
    r"[A-Z]\w+", // anything beginning with uppercase letter
    _, // include anything else mentioned in the grammar that doesn't already appear here
};

In addition, these entries can have actions. Right now, these actions can only take two forms:

  • skip (=> ()) -- when we find a match, ignore it entirely
  • rename (=> S where S is some symbol) -- produce this terminal that is used elsewhere in the grammar

The default action we saw above is then equivalent to an identity rename, e.g. "Foo" is equivalent to "Foo" => "Foo". The _ just means "scrape the grammar for literals etc. that don't already appear here and insert them" (e.g., maybe the grammar has r"[0-9]+" for numbers; that would get inserted there).

So, to accommodate Pascal, where the reserved words are case insensitive, I might do:

match {
    r"(?i)begin" => "BEGIN",
    r"(?i)end" => "END",
} else {
    r"[a-zA-Z_][a-zA-Z0-9_]*" => IDENTIFIER,
};

This would declare that:

  1. begin/end have higher priority and
  2. I will refer to them in my grammar as "BEGIN" and "END";
  3. identifiers have lower priority;
  4. and I will refer to them as IDENTIFIER.

Note that you can map to any symbol name, with or without quotes etc.

So then I can write my grammar using those "cleaned-up" names:

BLOCK = "BEGIN" ... "END";
EXPRESSION = IDENTIFIER ...;
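
Putting the two pieces together, a complete grammar along these lines might look roughly like the following. This is only a sketch of the proposal; the Block and Statement nonterminals are invented for illustration:

grammar;

match {
    r"(?i)begin" => "BEGIN",
    r"(?i)end" => "END",
} else {
    r"[a-zA-Z_][a-zA-Z0-9_]*" => IDENTIFIER,
};

pub Block: () = {
    "BEGIN" Statement* "END" => (),
};

Statement: () = {
    IDENTIFIER => (),
};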

I feel good about this.

@nikomatsakis
Collaborator Author

I've been looking at the code a bit. I'm going to take a few notes on a possible implementation plan, though I've run out of "side project" time for this morning:

  • right now, name resolution takes place in two places:
    • for non-terminals, that is resolve/mod.rs. It cares about terminals only insofar as it needs to ignore them. This works by first identifying the set of non-quoted identifiers (e.g., IDENTIFIER) that are actually terminals. Currently, that can only occur if you have an external tokenizer, so it looks for an extern section and extracts the LHS of those pattern declarations.
    • later, in token_check, we walk the terminals. If an external tokenizer is defined, we check that the terminal is in that set. Otherwise we collect the literals ("foo" or r"foo").
  • token-check then constructs an InternTok structure, presuming no external tokenizer has been defined. This basically means we construct the DFA.

To make this system I am describing here work, we have to make changes in a few places.

First and foremost, we have to parse the match declaration above (which I will call a "tokenizer declaration"). When we lower from the parse-tree to the repr, I would then make the invariant that we have either an external tokenizer or an internal tokenizer. If the user did not add an extern or a match section, that is equivalent to an internal tokenizer like match { _ }.

Next, we can now have "bare" terminals even with an internal tokenizer. So, name resolution needs to find the internal tokenizer "match" declaration and check the right-hand side of each =>. So e.g. r"[a-zA-Z_][a-zA-Z0-9_]*" => IDENTIFIER would indicate that IDENTIFIER is actually a terminal. That's easy enough.

In token-check, when we validate a literal like "bar" that appears in the grammar, we will want to check against the tokenizer declaration: if "bar" appears on the RHS of some entry, we're done. Otherwise, if there is a _, we want to collect it in a vector. If there is no _, we want to report an error: this is an unrecognized symbol.

Finally, when we build the DFA, we would build it up using the LHS of each A => B mapping. I think as long as we are careful, no further changes are needed to other parts of LALRPOP.
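
To make that concrete, here is one hypothetical way the lowered tokenizer declaration could be modeled; none of these type names come from LALRPOP's actual repr:

// Invariant after lowering: exactly one of these exists. If the user wrote
// neither an extern nor a match section, we behave as if they had written
// `match { _ }`.
enum TokenizerDef {
    External(ExternTokenizer),
    Internal(MatchDeclaration),
}

struct ExternTokenizer {
    // LHS of the extern pattern declarations; these names are terminals.
    terminals: Vec<String>,
}

struct MatchDeclaration {
    // One entry list per priority level (`match { ... } else { ... }`).
    priority_levels: Vec<Vec<MatchEntry>>,
    // Whether a `_` entry asks us to scrape the grammar for unlisted literals.
    has_catch_all: bool,
}

struct MatchEntry {
    // LHS: the literal or regex the DFA is built from.
    pattern: MatchPattern,
    // RHS: the terminal name used elsewhere in the grammar.
    maps_to: String,
}

enum MatchPattern {
    Literal(String),
    Regex(String),
}

Under that model, name resolution treats every maps_to name as a terminal; token-check looks literals up among the maps_to names and either accepts them, collects them under the _ entry, or reports them as unrecognized; and the DFA is built from the pattern side of each entry.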

@nikomatsakis nikomatsakis added this to the 1.0 milestone Mar 24, 2017
@wagenet
Contributor
wagenet commented Mar 24, 2017

I like this API. I'll see if I can make enough sense of the lalrpop code to make some progress here.

@nikomatsakis
Collaborator Author

So @wagenet made awesome progress here. There is still a bit more work to go before I'm ready to close this issue. In particular, I want to add support for tokenizers that produce zero or multiple tokens in response to an input. I'm envisioning something like this:

match {
    r"&&" => ("&[&]", "&[]"), // produce two tokens in response to 1 regex
} else {
    r"\s+" => (), // skip whitespace
    r"//.*" => (), // skip EOL comments
    r"/\*.*\*/" => (), // skip (non-nested) `/* ... */` comments
}

This would basically let us synthesize any regular (as in regular expressions) lexer. To go beyond regular languages we'd have to add states -- I'm not opposed to that, but I want to think on it a bit more.

One thing that I don't know: right now, we implicitly skip whitespace for you. I'd like to give users control of that, but I'm not sure when to disable the current behavior. For example, should we disable the whitespace default behavior if there are any tokens that map to ()? Or should we have people add a #[no_skip_whitespace] annotation?

Somehow the "no skip whitespace" thing feels related to _ to me -- i.e., the former opts out of default behavior, the latter opts back in -- perhaps they could be combined? i.e., we'd have _ by default unless you wrote #[no_default] or something on the match?

Thoughts?

@8573
8573 commented Mar 30, 2017

For example, should we disable the whitespace default behavior if there are any tokens that map to ()? Or should we have people add a #[no_skip_whitespace] annotation?

In the former case, how would one write a lexer that skips nothing?

@nikomatsakis
Collaborator Author

@8573 yeah, good question. I was thinking that the better "magic" answer would be to maybe look to see if any regex patterns may match whitespace, but it feels...kind of magic.

@hollinwilkins
hollinwilkins commented Apr 4, 2017

Opting for the non-magic option seems like the more predictable choice, and it wouldn't preclude implementing the magic as a fallback when the annotation is not set.

@nikomatsakis
Collaborator Author

Opting for the non-magic option seems like the more predictable choice, and it wouldn't preclude implementing the magic as a fallback when the annotation is not set.

Yes. I think there are two viable options:

  • If there is no _, then disable skipping whitespace and require an explicit line.
  • Some form of annotation like #[no_skip_whitespace]

Right now, I lean towards the former. The annotation just feels awkward, and using _ as a stand-in for "insert default rules" feels ok. Basically _ would be short for:

  • Insert a "foo" => "foo" rule for every terminal that appears in the grammar which is not explicitly listed.
  • Insert r"\s+" => ()

The next question is: what notation should we use for "skip" rules? I am thinking a bit forward here. I eventually expect to support:

  • custom code rules where you can write things like => { /* this will execute */ }
  • probably the ability to transition between states like "/*" => goto COMMENT
    • I had eventually hoped to avoid this but I'm changing my opinion
  • the ability to generate multiple tokens from one regex match (as I already mentioned)
    • but this may want custom code, since I think the number of tokens to generate may sometimes depend on the input
    • and maybe this should just be a custom tokenizer

Considering all these things I am leaning towards either:

  • => (), as originally proposed, or
  • => continue, just use a keyword.

I sort of lean towards the second one right now, but really either is fine.

@pyfisch
Contributor
pyfisch commented Oct 29, 2017

Has there been any progress with "skip" rules? I want to use them for line comments and inline comments without nesting.

What needs to be changed to support "skip" rules?

@ahmedcharles
Contributor

How does this relate to #14? They both seem related to tokenizers and the match functionality. Should one be a duplicate of the other? It's also not obvious what functionality would be sufficient to result in either issue being closed.

@nikomatsakis
Collaborator Author

@ahmedcharles yeah I think they may be basically the same thing. =)

@ahmedcharles
Contributor

Should one of them be closed?

@anchpop
anchpop commented Mar 25, 2019

What is the status of parsing indentation-sensitive languages? This (usually) requires a non-regular lexer to support behavior such as Python ignoring indentation within lists. Is there any documentation on how one can create their own lexer?

@Marwes
Contributor
Marwes commented Mar 25, 2019

@anchpop http://lalrpop.github.io/lalrpop/lexer_tutorial/index.html explains how to write and integrate a simple lexer.
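
For readers landing here: the essence of that tutorial is a hand-written iterator that yields spanned tokens (Result<(Loc, Tok, Loc), Error> triples), which the generated parser then consumes in place of the built-in tokenizer. A minimal sketch, with invented Tok and LexicalError types and a deliberately tiny token set:

#[derive(Debug, Clone, PartialEq)]
pub enum Tok {
    Word(String),
    Newline,
}

#[derive(Debug)]
pub struct LexicalError {
    pub position: usize,
}

pub type Spanned<T, Loc, E> = Result<(Loc, T, Loc), E>;

pub struct Lexer<'input> {
    chars: std::iter::Peekable<std::str::CharIndices<'input>>,
}

impl<'input> Lexer<'input> {
    pub fn new(input: &'input str) -> Self {
        Lexer { chars: input.char_indices().peekable() }
    }
}

impl<'input> Iterator for Lexer<'input> {
    type Item = Spanned<Tok, usize, LexicalError>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            match self.chars.next()? {
                // Surface newlines as their own token; skip other whitespace.
                (i, '\n') => return Some(Ok((i, Tok::Newline, i + 1))),
                (_, c) if c.is_whitespace() => continue,
                (i, c) if c.is_alphabetic() => {
                    let mut word = c.to_string();
                    let mut end = i + c.len_utf8();
                    while let Some(&(j, c2)) = self.chars.peek() {
                        if !c2.is_alphabetic() {
                            break;
                        }
                        word.push(c2);
                        end = j + c2.len_utf8();
                        self.chars.next();
                    }
                    return Some(Ok((i, Tok::Word(word), end)));
                }
                (i, _) => return Some(Err(LexicalError { position: i })),
            }
        }
    }
}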

@anchpop
anchpop commented Mar 25, 2019

Oh, I didn't notice that, thanks

@anchpop
anchpop commented Mar 30, 2019

I would be happy to work on an implementation of something like what's described in this whitepaper, as it would dramatically simplify some of my code. But I would need some guidance from someone experienced with the library on what they think the best way to implement it would be. In addition, being able to configure the lexer to emit ENDLINE or STARTLINE tokens would allow some languages to be described without much more complexity on LALRPOP's side, I think.
