[v2] Ditch custom tokenizers, make grammars more flexible instead · Issue #3911 · PrismJS/prism · GitHub


Closed
LeaVerou opened this issue Apr 27, 2025 · 2 comments
LeaVerou commented Apr 27, 2025

Background

Currently, a grammar can define a custom tokenizer, which is handled like this in tokenize():

const customTokenize = grammar[tokenizer];
if (customTokenize) {
	return customTokenize(text, grammar, prism);
}
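
For readers unfamiliar with this escape hatch, here is a minimal, self-contained sketch of how such a hook could be wired up. The `tokenizer` symbol, the toy `wordGrammar`, and the fallback branch are stand-ins invented for illustration; they are not Prism's actual definitions.

```javascript
// Hypothetical stand-in for the key under which a grammar stores its custom tokenizer.
const tokenizer = Symbol('tokenizer');

// A toy grammar that bypasses pattern matching entirely.
const wordGrammar = {
	[tokenizer](text, grammar, prism) {
		// Emit one token per word and keep whitespace runs as plain strings.
		return text
			.split(/(\s+)/)
			.filter(Boolean)
			.map((part) => (/\s/.test(part) ? part : { type: 'word', content: part }));
	},
};

// Simplified dispatcher mirroring the snippet above.
function tokenize(text, grammar, prism) {
	const customTokenize = grammar[tokenizer];
	if (customTokenize) {
		return customTokenize(text, grammar, prism);
	}
	// Fallback for this sketch only: treat the text as one plain string.
	return [text];
}

const tokens = tokenize('a b', wordGrammar, null);
```

Because the hook replaces the whole tokenization step, nothing about what a grammar does with it is visible to the rest of the system, which is why it is hard to reason about.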

However, a search of the codebase for uses of this feature shows that they are invariably templating languages embedded in markup:
[Image: codebase search results for custom tokenizer uses]

When the vast majority of uses for such a low-level escape hatch are all of the same type, it's a strong signal to implement a higher level feature for this use case.

I can't fully parse what the code in src/shared/languages/templating.ts is doing. Assuming in the templating use case "parent" is the templating language and "child" is markup:

  1. Is it tokenizing using two language definitions and then somehow merging the result?
  2. Is it using the parent and then tokenizing some of the tokens left using the child?
  3. Is it tokenizing using the parent, then removing these tokens and tokenizing what's left using the child?

From a quick scan it looks like it's closer to 3, but it would take a much deeper look to be sure. @RunDevelopment perhaps you can help short-circuit this?
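
To make option 3 concrete, here is a rough sketch of one way it could work: parent tokens are swapped for placeholders, the remainder is tokenized as the child, and the parent tokens are then put back. This is not a description of what templating.ts actually does. The `{{ ... }}` syntax, the `___TPLn___` placeholder format, and the toy `tokenizeChild` are all assumptions made for the example.

```javascript
// Toy "parent" (templating) syntax: {{ ... }} tags. An assumption of this sketch.
const TEMPLATE_TAG = /\{\{[\s\S]*?\}\}/g;
// Placeholder format is also made up; real code must avoid collisions with the input.
const PLACEHOLDER = /___TPL(\d+)___/;

// Toy "child" tokenizer: markup-like tags become tokens, the rest stays as strings.
function tokenizeChild(text) {
	return text
		.split(/(<[^>]*>)/)
		.filter(Boolean)
		.map((part) => (part.startsWith('<') ? { type: 'tag', content: part } : part));
}

// Replace placeholders in a string or token with the parent tokens they stand for.
function restore(node, parentTokens) {
	if (typeof node === 'string') {
		// split() with a capture group puts the captured indices at odd positions.
		return node.split(PLACEHOLDER).flatMap((part, i) => {
			if (i % 2 === 1) return [parentTokens[Number(part)]];
			return part ? [part] : [];
		});
	}
	// Placeholders can also end up inside a child token, e.g. in an attribute value.
	const inner = restore(node.content, parentTokens);
	const unchanged = inner.length === 1 && inner[0] === node.content;
	return [unchanged ? node : { ...node, content: inner }];
}

function tokenizeTemplate(text) {
	const parentTokens = [];
	// 1. Tokenize with the parent and swap each parent token for a placeholder.
	const stripped = text.replace(TEMPLATE_TAG, (match) => {
		parentTokens.push({ type: 'template', content: match });
		return `___TPL${parentTokens.length - 1}___`;
	});
	// 2. Tokenize what is left with the child language.
	const childTokens = tokenizeChild(stripped);
	// 3. Put the parent tokens back where their placeholders ended up.
	return childTokens.flatMap((token) => restore(token, parentTokens));
}

const result = tokenizeTemplate('<p class="{{ cls }}">{{ name }}</p>');
```

Note that a parent token may land inside a child token (the `class` attribute above), so the merge step has to recurse into child tokens rather than only splice at the top level.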

A good solution to these could also help with the issues identified in #3901 (diff-highlight and data-uri-highlight are actually language definitions).

It seems that the core use case in all of these is languages embedded in other languages. Currently we support this only when the languages are completely separate and known in advance. E.g. CSS in <style> tags or a style attribute is no problem because it can be handled completely independently. However, in the templating or diff case, you need to remove the tokens of the parent language to be able to highlight the child.

Additionally, you don't necessarily know what the child language is.

  • For diff it is provided explicitly and defaults to none if not provided
  • For templating we just assume it's markup (which is true the vast majority of the time, but not always: I've used templating languages to generate CSS or JSON, for example).
  • For data-uri, it depends on the MIME type.

What we don't support well is:

  1. Meta-languages / preprocessors, such as templating languages, where the parent language produces the child language (instead of the island-like embedding of things like <style>), and thus its tokens need to be removed or processed to highlight the child language.
  2. Languages doing more fine-grained highlighting of some of the tokens of other languages, that we may not want to include by default. That's relevant to data-uri-highlight, but also languages like CSS extras, JS extras, PHP extras.
  3. Languages where the token names (and thus, in some cases, the language being highlighted) dynamically depend on the code. This is the case for http, markdown, data-uri-highlight, but also plugins like highlight-keywords.

Proposal

Explicit parent & child language

For 1, it seems that we need a concept of an explicit "child language" — the language being generated, which should be able to have separate defaults per grammar, but also be settable. The diff syntax actually seems good for this: language-parent would just use the default, language-parent-child would use that. Note that we should not need both — language-parent-child should suffice. Examples:

  • The diff language would have no default child, so language-diff-css would highlight a diff as CSS, whereas language-diff would just highlight the diff.
  • The liquid language would use markup as the default, so language-liquid would highlight the Liquid tags and assume the child is markup (the grammar default), whereas language-liquid-none would only highlight the Liquid tags.
  • HTML can embed CSS and JS, but these are not its children. The child concept is for languages that produce other languages.
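
A rough sketch of how a language-parent-child class could be resolved against per-grammar defaults follows. The `registry` shape and the `child` property are assumptions of this sketch (the property is only floated as an idea in this issue), and `resolveLanguages` is a hypothetical helper, not an existing Prism function.

```javascript
// Hypothetical registry: a grammar that declares `child` (even as null) accepts a child.
const registry = new Map([
	['diff', { child: null }],
	['liquid', { child: 'markup' }],
	['markup', {}],
	['css', {}],
	['css-extras', {}],
]);

function resolveLanguages(className, languages = registry) {
	const match = /(?:^|\s)language-(\S+)/.exec(className);
	if (!match) return null;
	const id = match[1];
	// Ids may contain hyphens themselves (e.g. css-extras), so try an exact match first.
	if (languages.has(id)) {
		return { parent: id, child: languages.get(id).child ?? null };
	}
	// Otherwise try each hyphen as the split point, longest parent first.
	for (let i = id.lastIndexOf('-'); i > 0; i = id.lastIndexOf('-', i - 1)) {
		const parent = id.slice(0, i);
		const child = id.slice(i + 1);
		if (languages.has(parent) && 'child' in languages.get(parent)) {
			return { parent, child: child === 'none' ? null : child };
		}
	}
	return null;
}
```

With this registry, `language-diff-css` resolves to parent diff and child css, `language-liquid` falls back to markup, and `language-liquid-none` disables the child. Since language ids can contain hyphens, the parser needs some disambiguation rule; exact match first is just one option.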

TBD:

  • Do languages opt in to this by somehow declaring themselves able to support a child, or can all languages have a child and some just don't do anything meaningful with it? The former is easier to implement, though the latter is more flexible. We can start with the former and expand it later: languages can just declare the default child, even if it's none, via child: null or child: grammar, and that also declares them as parental (😁).

Dynamic content grammars

One idea to implement dynamic content tokenization is to make aliases dynamic. We already have a concept of an alias to add additional classes to tokens. What if we supported functions as aliases that would take what we've tokenized so far as input?
From a quick look at the code, that doesn't seem easy, as aliases are currently applied at the very end.
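
As a sketch of what a function-valued alias might look like, assume the tokenizer hands the alias function a flat array of the tokens produced so far. `resolveAlias`, `bodyAlias`, and the token shapes are all invented for illustration.

```javascript
// Resolve an alias that may be a string, an array of strings, or a function.
function resolveAlias(alias, tokensSoFar) {
	const value = typeof alias === 'function' ? alias(tokensSoFar) : alias;
	if (!value) return [];
	return Array.isArray(value) ? value : [value];
}

// Example in the spirit of the http case: the class depends on an earlier header token.
const bodyAlias = (tokens) => {
	const header = tokens.find((token) => token.type === 'content-type');
	return header && header.content.includes('json') ? 'language-json' : undefined;
};

const aliases = resolveAlias(bodyAlias, [
	{ type: 'content-type', content: 'application/json' },
]);
```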

Another idea would be to make inside and rest dynamic, and allow them to take a function whose parameter contains matched tokens for both that token and any parents. That seems a lot more feasible, as inside is used throughout tokenization.
As an example, data-uri could then extend the uri token of other languages and add:

inside (tokens) {
	let {mimeType} = ...
	if (!mimeType) return;
	let language = getLanguageByMimeType(mimeType);
	if (!language) return;
	return {
		'mime-type': ...,
		inside: language
	}
}
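
A runnable variation on the idea above, under stated assumptions: the matched text is exposed as `tokens.self`, the MIME lookup is a plain table, and the returned grammar uses `rest` to pull in the child language. None of these details are specified in the proposal; they are placeholders so the sketch can execute.

```javascript
// Hypothetical lookup table standing in for the unspecified getLanguageByMimeType.
const grammarsByMimeType = {
	'text/css': { property: /[-\w]+(?=\s*:)/ },
};

function getLanguageByMimeType(mimeType) {
	return grammarsByMimeType[mimeType];
}

// `inside` may be a plain grammar object or a function of the tokens matched so far.
function resolveInside(inside, tokens) {
	return typeof inside === 'function' ? inside(tokens) : inside;
}

const dataUri = {
	pattern: /data:[^,]*,.*/,
	inside(tokens) {
		// Assumption: the text of the token being processed is available as tokens.self.
		const mimeType = /^data:([^;,]+)/.exec(tokens.self)?.[1];
		if (!mimeType) return;
		const language = getLanguageByMimeType(mimeType);
		if (!language) return;
		return { 'mime-type': /^data:[^;,]+/, rest: language };
	},
};

const inner = resolveInside(dataUri.inside, { self: 'data:text/css,a{color:red}' });
```

Returning nothing when the MIME type is missing or unknown means the token is simply left unhighlighted inside, which keeps the static fallback behavior.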

This could even be used to make the cases already handled more elegant:

'tag': {
	pattern: MARKUP_TAG,
	greedy: true,
	inside: {
		'tag': {
			pattern: /^(<\/?)[^\s>\/]+/,
			lookbehind: true,
			inside: {
				'namespace': /^[^\s>\/:]+:/,
			},
		},
		'attr-value': {
			pattern: /=\s*(?:"[^"]*"|'[^']*'|[^\s'">=]+)/,
			inside: {
				'punctuation': [
					{
						pattern: /^=/,
						alias: 'attr-equals',
					},
					{
						pattern: /^(\s*)["']|["']$/,
						lookbehind: true,
					},
				],
				'entity': entity,
+				rest (tokens) {
+					let name = tokens['attr-name'];
+					if (name === 'style') return 'language-css';
+					else if (name?.startsWith('on')) return 'language-js';
+				}
			},
		},
		'punctuation': /^<\/?|\/?>$/,
		'attr-name': {
			pattern: /[^\s>\/]+/,
			inside: {
				'namespace': /^[^\s>\/:]+:/,
			},
		},
	},
},
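
One open detail in the snippet above is what the tokenizer does with a string like 'language-css'. A minimal sketch, assuming a hypothetical `languages` registry keyed by id and a `tokens` object keyed by token name:

```javascript
// Hypothetical registry of loaded grammars, keyed by language id.
const languages = {
	css: { property: /[-\w]+(?=\s*:)/ },
	js: { keyword: /\b(?:const|let|return)\b/ },
};

// Resolve `rest`, which may be a grammar, a 'language-x' string, or a function returning either.
function resolveRest(rest, tokens, registry) {
	const value = typeof rest === 'function' ? rest(tokens) : rest;
	if (typeof value === 'string' && value.startsWith('language-')) {
		return registry[value.slice('language-'.length)];
	}
	return value;
}

// Same logic as the proposed `rest` above, guarded against a missing attribute name.
const attrValueRest = (tokens) => {
	const name = tokens['attr-name'];
	if (name === 'style') return 'language-css';
	if (name?.startsWith('on')) return 'language-js';
};

const styleGrammar = resolveRest(attrValueRest, { 'attr-name': 'style' }, languages);
```

If the referenced language is not loaded, the lookup yields undefined and the attribute value is left as plain text.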

Explicit extends

Currently, languages extending other languages do so imperatively. Ideally, this should be encoded in their metadata, e.g. css-extras would have extends: css. css should not know about css-extras and should not have it as an optional dependency. In general, the thing being extended should never know about the thing extending it — that is dependency going the wrong way and violates DIP.

Perhaps which position it is inserted before/after should be part of the token metadata. E.g. currently css does this:

const extras = getOptionalLanguage('css-extras');
if (extras) {
	insertBefore(css, 'function', extras);
}

This is dependency going the wrong way. Instead, css-extras should do this:

export default {
	id: 'css-extras',
	extends: 'css',
	insertBefore: 'function', // this should also work for children
	/* ... */
};

I think this will help us get rid of most, if not all optional dependencies, which I think @RunDevelopment had mentioned were problematic in some ways that I don't remember.
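
One possible reading of the declarative form, sketched so that the base grammar is never mutated: the registry builds a new grammar object with the extra tokens placed before the named key. The `definitions` shape, `resolveGrammar`, and `withInsertedBefore` are hypothetical names for this example only.

```javascript
// Build a new object with `extra` entries placed before `beforeKey`; `base` is not mutated.
function withInsertedBefore(base, beforeKey, extra) {
	const result = {};
	let inserted = false;
	for (const [key, value] of Object.entries(base)) {
		if (key === beforeKey) {
			Object.assign(result, extra);
			inserted = true;
		}
		result[key] = value;
	}
	// If the key does not exist, append the extras at the end instead of dropping them.
	return inserted ? result : Object.assign(result, extra);
}

// Resolve a definition by following its `extends` chain.
function resolveGrammar(id, definitions) {
	const def = definitions[id];
	if (!def.extends) return def.grammar;
	const base = resolveGrammar(def.extends, definitions);
	return withInsertedBefore(base, def.insertBefore, def.grammar);
}

const definitions = {
	css: {
		grammar: {
			comment: /\/\*[\s\S]*?\*\//,
			function: /[-\w]+(?=\()/,
			punctuation: /[(){};:,]/,
		},
	},
	'css-extras': {
		extends: 'css',
		insertBefore: 'function',
		grammar: { hexcode: /#[\da-f]{3,8}\b/i },
	},
};

const extended = resolveGrammar('css-extras', definitions);
```

In this reading css itself stays untouched, so whether plain `language-css` blocks should pick up the extras is a separate decision that the sketch does not answer.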

LeaVerou added the v2 label Apr 27, 2025
@RunDevelopment

> From a quick scan it looks like it's closer to 3, but it would take a much deeper look to be sure.

Sorry, but I don't understand your question. The terms "parent" and "child" are not used anywhere in templating.ts, so I don't know what those are supposed to be. I'm also not sure if your question is about the embedIn or templating function.

> Explicit extends

You are just reintroducing modify dependencies. I believe I talked at length about modify dependencies and why they make everything more complex in old issues, but to summarize: the main problem is that languages that extend or otherwise copy css will be different depending on when css is modified. Load order simply isn't defined for modify deps. This made it necessary for the languages that extend or copy css to manually control the load order via optional deps. The end effect was that virtually unrelated languages had to know about each other. I spent years working with this mess, so please heed my advice when I say that they are a bad idea. Just make grammars immutable.
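
The load-order problem described here can be reproduced in a few lines. In this toy model, `derived` stands for any language that copies css when it loads, and `css-extras` stands for a modify dependency that mutates css in place; both are simplified illustrations, not Prism's real loaders.

```javascript
// Simulate loading languages in a given order and report what `derived` ends up with.
function build(order) {
	const css = { property: /[-\w]+(?=\s*:)/ };
	let derived;
	const loaders = {
		// A language that copies css at load time.
		derived: () => {
			derived = { ...css };
		},
		// A "modify" dependency that changes css in place.
		'css-extras': () => {
			css.hexcode = /#[\da-f]{3,8}\b/i;
		},
	};
	for (const id of order) loaders[id]();
	return Object.keys(derived);
}

const extrasFirst = build(['css-extras', 'derived']);
const derivedFirst = build(['derived', 'css-extras']);
```

The two orders give `derived` different token sets even though `derived` never mentions css-extras. If css were immutable, the copy would be the same regardless of order.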

You are also misunderstanding what css-extras is, or rather, what I made it in v2. If Prism wasn't constrained by bundle size, css-extras wouldn't exist. All tokens in css-extras would just be part of css proper. As such, css-extras is not a language or an amendment to css, it is a collection of tokens of css that are made optional for size reasons. So css depends on css-extras, because it contains some tokens of css. It's just like c depends on c-like, because it gets some of its tokens from there.

LeaVerou commented Apr 29, 2025

> > From a quick scan it looks like it's closer to 3, but it would take a much deeper look to be sure.

> Sorry, but I don't understand your question. The terms "parent" and "child" are not used anywhere in templating.ts, so I don't know what those are supposed to be.

They are not — I've defined them in the OP because I didn't know how else to refer to these. E.g. for the templating use case, parent would be the templating language and child the language being generated (e.g. markup).

> I'm also not sure if your question is about the embedIn or templating function.

Neither, it was about how highlighting templating languages works on a high level.

> > Explicit extends

> You are just reintroducing modify dependencies. I believe I talked at length about modify dependencies and why they make everything more complex in old issues, but to summarize: the main problem is that languages that extend or otherwise copy css will be different depending on when css is modified. Load order simply isn't defined for modify deps. This made it necessary for the languages that extend or copy css to manually control the load order via optional deps. The end effect was that virtually unrelated languages had to know about each other. I spent years working with this mess, so please heed my advice when I say that they are a bad idea. Just make grammars immutable.

Interesting. I wonder if there's a solution that fixes this without throwing the baby out with the bathwater. I posted #3916 today (before I saw your reply) brainstorming some ideas, but I hadn't taken this into consideration.
Can you recall cases where the order made such a difference and where virtually unrelated languages had to know about each other? It would really help.

> You are also misunderstanding what css-extras is, or rather, what I made it in v2. If Prism wasn't constrained by bundle size, css-extras wouldn't exist. All tokens in css-extras would just be part of css proper. As such, css-extras is not a language or an amendment to css, it is a collection of tokens of css that are made optional for size reasons. So css depends on css-extras, because it contains some tokens of css. It's just like c depends on c-like, because it gets some of its tokens from there.

css-extras is not separate solely for bundle size reasons. The file sizes of these (CSS extras is ~4KB) are completely inconsequential in 2025; they were far more significant when Prism was originally written, in 2012. It is mainly that this level of granularity is not always desirable, and the perf hit from all the granularity is far more important than the tiny increase in bundle size.

Likewise, c-like was not made separate entirely for bundle size reasons. I'd say that was the least important reason. It was separate so that folks could highlight C-like languages in a basic way even if Prism didn't have a language definition for them yet, and to make it easier for folks to define missing language definitions. Admittedly less of an issue now with its 300 languages, but it was a very significant benefit in 2012 when it launched with 5.

My bad, I should have documented all this somewhere.
