[v2] Ditch custom tokenizers, make grammars more flexible instead #3911
Comments
Sorry, but I don't understand your question. The terms "parent" and "child" are not used anywhere in
You are just reintroducing modify dependencies. I believe I talked at length about modify dependencies and why they make everything more complex in old issues, but to summarize: the main problem is that languages that You are also misunderstanding what
They are not — I've defined them in the OP because I didn't know how else to refer to these. E.g. for the templating use case, parent would be the templating language and child the language being generated (e.g. markup).
Neither, it was about how highlighting templating languages works on a high level.
Interesting. I wonder if there's a solution that fixes this without throwing the baby out with the bathwater. I posted #3916 today (before I saw your reply) brainstorming some ideas, but I hadn't taken this into consideration.
Likewise, My bad, I should have documented all this somewhere.
Background
Currently, a grammar can define a custom tokenizer, which is handled like this in `tokenize()`:

However, searching the codebase for uses of this feature shows that they are invariably templating languages embedded in `markup`:

When the vast majority of uses for such a low-level escape hatch are all of the same type, it's a strong signal to implement a higher-level feature for this use case.
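For readers unfamiliar with the escape hatch being discussed, here is a minimal sketch of the general shape of such a hook. It is not Prism's source: the `customTokenize` property name, the simplified `Grammar` type, and the naive `defaultTokenize` stand-in are all assumptions made for illustration.

```typescript
// Hypothetical, simplified types for illustration; not Prism's real definitions.
type Token = { type: string; content: string };
type TokenStream = (string | Token)[];

interface Grammar {
  // Assumed name for the escape hatch: if present, it replaces the default algorithm.
  customTokenize?: (text: string, grammar: Grammar) => TokenStream;
  patterns: { [type: string]: RegExp };
}

// Naive stand-in for the default algorithm: it tokenizes only the first match
// of the first pattern that matches, and leaves the rest as plain text.
function defaultTokenize(text: string, grammar: Grammar): TokenStream {
  for (const type of Object.keys(grammar.patterns)) {
    const match = grammar.patterns[type].exec(text);
    if (match) {
      const stream: TokenStream = [];
      const before = text.slice(0, match.index);
      const after = text.slice(match.index + match[0].length);
      if (before) stream.push(before);
      stream.push({ type, content: match[0] });
      if (after) stream.push(after);
      return stream;
    }
  }
  return [text];
}

function tokenize(text: string, grammar: Grammar): TokenStream {
  // The escape hatch: a grammar-supplied tokenizer bypasses everything else.
  if (grammar.customTokenize) {
    return grammar.customTokenize(text, grammar);
  }
  return defaultTokenize(text, grammar);
}
```

The point of the sketch is how coarse the hook is: once a grammar supplies its own tokenizer, none of the declarative machinery applies to it, which is why a narrower, higher-level feature may be preferable.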
I can't fully parse what the code in `src/shared/languages/templating.ts` is doing. Assuming in the templating use case "parent" is the templating language and "child" is `markup`:

From a quick scan it looks like it's closer to 3, but it would take a much deeper look to be sure. @RunDevelopment perhaps you can help short-circuit this?
A good solution to these could also help with the issues identified in #3901 (`diff-highlight` and `data-uri-highlight` are actually language definitions).

It seems that the core use case in all of these is languages embedded in other languages. Currently, we support this where the languages are completely separate and known in advance. E.g. CSS in `<style>` tags or a `style` attribute is no problem because it can be handled completely independently. However, in the templating or `diff` case, you need to remove the tokens of the parent language to be able to highlight the child.

Additionally, you don't necessarily know what the child language is.
- For `diff`, it is provided explicitly and defaults to `none` if not provided.
- For templating languages, it is assumed to be `markup` (which is true the vast majority of the time, but not always: I've used templating languages to generate CSS or JSON, for example).
- For `data-uri`, it depends on the MIME type.

What we don't support well is:
1. […] `<style>`), and thus its tokens need to be removed or processed to highlight the child language.
2. […] `data-uri-highlight`, but also languages like CSS extras, JS extras, PHP extras.
3. […] `http`, `markdown`, `data-uri-highlight`, but also plugins like `highlight-keywords`.

Proposal
Explicit parent & child language
For 1, it seems that we need a concept of an explicit "child language": the language being generated, which should be able to have separate defaults per grammar, but also be settable. The `diff` syntax actually seems good for this: `language-parent` would just use the default, and `language-parent-child` would use the given child. Note that we should not need both; `language-parent-child` should suffice. Examples:

- The `diff` language would have no default child, so `language-diff-css` would highlight a diff as CSS, whereas `language-diff` would just highlight the diff.
- The `liquid` language would use `markup` as the default, so `language-liquid` would highlight the Liquid markup and assume the child is `markup` (the grammar default), whereas `language-liquid-none` would only highlight the Liquid tags.

TBD:
- […] `none` via `child: null` or `child: grammar` and that also declares them as parental (😁).

Dynamic content grammars
One idea to implement dynamic content tokenization is to make aliases dynamic. We already have a concept of an `alias` to add additional classes to tokens. What if we supported functions as aliases that would take what we've tokenized so far as input?

From a quick look at the code, that doesn't seem easy, as aliases are currently applied at the very end.

Another idea would be to make `inside` and `rest` dynamic, and allow them to take a function whose parameter contains the matched tokens for both that token and any parents. That seems a lot more feasible, as `inside` is used throughout tokenization.

As an example, `data-uri` could then extend the `uri` token of other languages and add:

This could even be used to make the cases already handled more elegant:
Explicit extends
Currently, languages extending other languages do so imperatively. Ideally, this should be encoded in their metadata, e.g. `css-extras` would have `extends: css`. `css` should not know about `css-extras` and should not have it as an optional dependency. In general, the thing being extended should never know about the thing extending it: that is a dependency going the wrong way, and it violates the Dependency Inversion Principle (DIP).

Perhaps the position it is inserted before/after should be part of the token metadata. E.g. currently `css` does this:

This is a dependency going the wrong way. Instead, `css-extras` should do this:

I think this will help us get rid of most, if not all, optional dependencies, which I think @RunDevelopment had mentioned were problematic in some ways that I don't remember.
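One way the declarative `extends` idea, with insertion position stored in token metadata, could look is sketched below. Everything here is hypothetical: the `LanguageDef` shape, the `insertBefore` field on tokens, the `resolveTokens` helper, and the example tokens are assumptions for illustration, not Prism's API or its actual CSS grammar.

```typescript
// Hypothetical metadata shapes; not Prism's actual API.
interface ExtensionToken {
  pattern: RegExp;
  insertBefore?: string; // name of the base token this one should precede
}

interface LanguageDef {
  id: string;
  extends?: string; // id of the language being extended; the base never references the extension
  tokens: { [name: string]: ExtensionToken };
}

type NamedToken = { name: string; pattern: RegExp };

// Resolves a language to an ordered token list by walking its `extends` chain.
// No cycle detection: this sketch assumes the chain is acyclic.
function resolveTokens(id: string, registry: { [id: string]: LanguageDef }): NamedToken[] {
  const lang = registry[id];
  if (!lang) throw new Error(`Unknown language: ${id}`);

  // Start from the fully resolved base (a fresh array), or from nothing.
  const result: NamedToken[] = lang.extends ? resolveTokens(lang.extends, registry) : [];

  for (const name of Object.keys(lang.tokens)) {
    const { pattern, insertBefore } = lang.tokens[name];
    let index = result.length; // default: append at the end
    for (let i = 0; i < result.length; i++) {
      if (result[i].name === insertBefore) {
        index = i;
        break;
      }
    }
    result.splice(index, 0, { name, pattern });
  }
  return result;
}

const registry: { [id: string]: LanguageDef } = {
  css: {
    id: "css",
    tokens: {
      comment: { pattern: /\/\*[\s\S]*?\*\// },
      property: { pattern: /[a-z-]+(?=\s*:)/ },
      punctuation: { pattern: /[{}:;]/ },
    },
  },
  "css-extras": {
    id: "css-extras",
    extends: "css",
    tokens: {
      // Illustrative token only; the extension declares where it goes.
      variable: { pattern: /--[\w-]+/, insertBefore: "property" },
    },
  },
};

const extrasOrder = resolveTokens("css-extras", registry).map((t) => t.name);
```

In this arrangement the `css` entry contains no mention of `css-extras`, and resolving `css` on its own is unaffected by the extension being registered, which is the dependency direction argued for above.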