[v2] Language combinations / extensions / embeddings / ... #3927

LeaVerou · 2025-05-05T05:43:21Z

I've been thinking a lot about what's the best way to handle language definitions that depend on or make use of other languages. Some earlier thoughts in:

I have a strong hunch that these are all facets of the same problem and a good API design will minimize the number of separate solutions for each of them, so I'm going to close all three so we can discuss them holistically here.

Note

This is a work in progress and I will update it as I think more about this problem.

But before I go into the weeds, an illustration (real screenshot of our code, taken from VS Code):

This is 4 languages nested in each other!

The outer language is JS (well, TS)
JSDoc in JS doc comments
Markdown in JSDoc
JS in Markdown code blocks

Use cases

There are currently two types of dependencies:

Required (actual ESM import)
Optional (use them if something else imported them, no problem otherwise)

And several types of use cases described below.

Note: Any mention of "now" refers to the simplify branch (draft PR) and not the current v2 branch.

1. Languages using another language as a base (e.g. JavaScript using C-like)

This is the most straightforward case: just simple inheritance.

The base language is now declared via the base key instead of an imperative extend() call (I wonder if parent or extends may be better names) and is considered a required dependency.
It is imported as a regular ESM import
Its grammar is passed to the language's grammar() function via a base key that resolves synchronously.
By default any tokens specified by the child grammar overwrite tokens in the base grammar. If something different is desired, there are the following escape hatches:
- $merge does a deep merge of certain tokens instead
- $insertBefore inserts certain tokens before another
- $insertAfter inserts certain tokens after another
- $insert is a shorthand version of $insertBefore/$insertAfter that is better suited to one-off inserts as the position is specified inside its value via $before/$after
- All of the above are combined into a single patch and are applied as late as possible (but I'm still debating whether that's a good idea).
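To make the default-overwrite rule and the escape hatches above a bit more concrete, here is a rough toy model. It is not the code on the simplify branch: `resolveGrammar`, `deepMerge`, and the exact shapes of `$merge`, `$insertBefore`, and `$insertAfter` are assumptions made for illustration, and `$insert` is left out.

```typescript
// Hypothetical model of a grammar: an ordered map from token name to definition.
type Grammar = Record<string, unknown>;

// Assumed shape of a child grammar; only the `$`-prefixed key names come from the discussion.
interface ChildGrammar {
  $merge?: Grammar;                        // token name -> partial definition to deep-merge
  $insertBefore?: Record<string, Grammar>; // anchor token -> tokens to insert before it
  $insertAfter?: Record<string, Grammar>;  // anchor token -> tokens to insert after it
  [token: string]: unknown;                // anything else overwrites the base token
}

function isPlainObject(value: unknown): value is Grammar {
  return typeof value === "object" && value !== null
    && !Array.isArray(value) && !(value instanceof RegExp);
}

function deepMerge(target: Grammar, source: Grammar): Grammar {
  const out: Grammar = { ...target };
  for (const key of Object.keys(source)) {
    const a = out[key];
    const b = source[key];
    out[key] = isPlainObject(a) && isPlainObject(b) ? deepMerge(a, b) : b;
  }
  return out;
}

function resolveGrammar(base: Grammar, child: ChildGrammar): Grammar {
  const { $merge = {}, $insertBefore = {}, $insertAfter = {}, ...overrides } = child;

  // Default: child tokens overwrite same-named base tokens (new names are appended).
  // $merge: deep-merge into the existing token instead of overwriting it.
  const merged = deepMerge({ ...base, ...overrides }, $merge);

  // Rebuild in order, splicing inserted tokens around their anchor tokens.
  const result: Grammar = {};
  for (const name of Object.keys(merged)) {
    Object.assign(result, $insertBefore[name]);
    result[name] = merged[name];
    Object.assign(result, $insertAfter[name]);
  }
  return result;
}

// Made-up, heavily simplified grammars for illustration only.
const clike: Grammar = {
  comment: /\/\/.*/,
  string: { pattern: /"[^"\n]*"/, greedy: true },
  keyword: /\b(?:if|else|while)\b/,
};

const javascript = resolveGrammar(clike, {
  keyword: /\b(?:if|else|while|const|let)\b/,              // overwrites clike's keyword
  $merge: { string: { pattern: /'[^'\n]*'|"[^"\n]*"/ } },  // keeps `greedy: true`
  $insertBefore: { keyword: { regex: /\/[^\/\n]+\/[gimsuy]*/ } },
});

console.log(Object.keys(javascript).join(", "));
// comment, string, regex, keyword
```

Note that in this sketch the base grammar is copied rather than mutated, so `clike` stays usable on its own.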

Usually, the base language is useful on its own. E.g. clike was not just created to make its child languages more DRY, but to have something to fall back on when one wanted to highlight a C-like language that did not have a dedicated language definition (admittedly far more important when Prism first launched with like 5 languages compared to now).

These days, there are also cases where language definitions exist for the sole reason of making other language definitions more DRY (such as javadoclike), which is a perversion of the concept. No language should be registered and become available as a language-xxx class if it's not useful on its own, otherwise it's not actually a language, it's a shared utility.

2. Languages embedding/embedded in other languages

This can be broken down into two major categories:

Where the language is known in advance (e.g. JS inside <script> elements)
Where the language is not known in advance (e.g. code inside a tagged markdown code block or when highlighting http requests)

The first was already handled by special-casing strings as values of $rest/$inside, but the second was severely problematic and required a fair bit of custom code. #3923 proposes a $language descriptor that can handle both, by also accepting a function that takes the match's named capturing groups as a parameter. I'm still unsure if this is a good solution.
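As an illustration of the two categories, here is a minimal sketch of a language reference that is either a fixed id or a function of the named capturing groups. `LanguageRef`, `resolveLanguage`, the patterns, and the MIME lookup table are all made-up names for this example, not Prism's API.

```typescript
// A reference to an embedded language: known in advance (a string),
// or computed from the named capturing groups of the match (a function).
type Groups = Record<string, string | undefined>;
type LanguageRef = string | ((groups: Groups) => string | undefined);

function resolveLanguage(ref: LanguageRef, match: RegExpExecArray): string | undefined {
  return typeof ref === "string" ? ref : ref(match.groups ?? {});
}

// Category 1: JS inside <script> elements; the language is known in advance.
const scriptPattern = /<script>(?<code>[\s\S]*?)<\/script>/;
const scriptLanguage: LanguageRef = "javascript";

// Category 2: the body of an HTTP message; the language depends on the Content-Type header.
const contentTypePattern = /^content-type:[ \t]*(?<mime>[\w\/+.-]+)/im;
const mimeToLanguage: Record<string, string> = {
  "application/json": "json",
  "text/html": "markup",
  "text/css": "css",
};
const httpBodyLanguage: LanguageRef = ({ mime }) =>
  mime === undefined ? undefined : mimeToLanguage[mime.toLowerCase()];

const request = "POST /api HTTP/1.1\nContent-Type: application/json\n\n{\"ok\": true}";
const scriptMatch = scriptPattern.exec("<script>let x = 1;</script>");
const httpMatch = contentTypePattern.exec(request);

console.log(scriptMatch && resolveLanguage(scriptLanguage, scriptMatch)); // javascript
console.log(httpMatch && resolveLanguage(httpBodyLanguage, httpMatch));   // json
```

The function form returns `undefined` for an unknown MIME type, which leaves it to the caller to decide on a fallback (plain text, or a default language).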

There is also the question of what types of dependencies these are: are they optional or required? It seems like it could go either way, depending on the user's goal, but I'm leaning towards required. But then, for 2, does that mean that now your required dependencies depend on the code being matched?

Perhaps these are actually the only true "optional" dependencies, and there should be a way for Prism users to autoload these as well. In that case, perhaps grammars should support async nodes for these that resolve when they are loaded. The way code is tokenized could easily support parts of it being deferred for later.

3. Languages that are preprocessors for other languages

Example: PHP or Liquid are HTML preprocessors.

This is what #3911 was about. It is further complicated by the fact that these preprocessors could often generate anything, but definitely do have defaults (usually markup). This is the prime reason custom tokenizers exist, which I would love to get rid of.

I no longer think most of #3911 was a good idea, but there is one part that I still think was: languages being able to declare what language they produce, and have that be overridable via two-id language classes (e.g. language-diff-css to highlight a CSS diff or language-liquid-css for a Liquid template that produces CSS).
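One possible reading of the two-id idea, as a sketch: each language declares the language it produces by default, and the class name may override it. The `produces` registry and `parseLanguageClass` below are hypothetical. Ids are resolved by longest known prefix because some ids (such as css-extras) contain hyphens themselves.

```typescript
// Hypothetical registry: language id -> language it produces by default (null = none).
const produces = new Map<string, string | null>([
  ["liquid", "markup"],
  ["diff", null],
  ["css", null],
  ["markup", null],
  ["css-extras", null],
]);

interface ParsedLanguage {
  language: string;       // the language used to tokenize the code
  output: string | null;  // the language it produces, if any
}

function parseLanguageClass(className: string): ParsedLanguage | undefined {
  const prefix = "language-";
  if (!className.startsWith(prefix)) return undefined;
  const parts = className.slice(prefix.length).split("-");

  // Try the longest candidate id first, so `css-extras` wins over `css` + `extras`.
  for (let i = parts.length; i > 0; i--) {
    const language = parts.slice(0, i).join("-");
    if (!produces.has(language)) continue;

    const rest = parts.slice(i).join("-");
    if (rest === "") return { language, output: produces.get(language) ?? null };
    if (produces.has(rest)) return { language, output: rest };
  }
  return undefined;
}

console.log(parseLanguageClass("language-liquid"));     // { language: "liquid", output: "markup" }
console.log(parseLanguageClass("language-liquid-css")); // { language: "liquid", output: "css" }
console.log(parseLanguageClass("language-diff-css"));   // { language: "diff", output: "css" }
console.log(parseLanguageClass("language-css-extras")); // { language: "css-extras", output: null }
```

The lookup needs the registry to disambiguate, which is one cost of sharing the hyphen between ids and the parent/child separator.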

I'm still unsure how exactly these work today, but perhaps a good solution to 2 could also address these (by essentially emulating $rest: childLanguage).

4. Languages that can make other language definitions "richer" but are not strictly necessary

This one is the hairiest category as it encompasses so many diverse use cases.

Examples:

javastacktrace extending log
Tags inside VB.NET/F# doc comments being highlighted as tags if markup is loaded. This one is basically highlighting the need for a shared utility for tag.
JSON in http being highlighted as JSON if that is loaded, or as JS otherwise. That seems to be a bona fide optional dependency.
markdown in graphql comments being highlighted if it is loaded. That seems to be a bona fide optional dependency.
jsdoc in JS doc-comment tokens is highlighted if jsdoc is loaded. That seems to be a bona fide optional dependency.
js-templates extending JS with the ability to highlight template literals tagged with a certain language. Not everyone highlighting JS wants to highlight tagged template literals, but since JS is the host language, it cannot be language-js-templates that activates this functionality.
opencl-extensions extending C and C++. Not everyone highlighting C/C++ wants this.
css-extras extending css with specialized tokens for selectors etc. Not everyone highlighting CSS wants the granularity of css-extras.

Languages should not modify other languages

Previously, there were more of these, which existed for the sole reason of reducing bundle sizes to the extreme (like saving 1KB). These are now eliminated. The ones that remain are those that fundamentally should involve user choice, as described above.

The toughest of all are those like the last three, which are also currently the only uses for extends (#3911). Languages extending other languages are deeply problematic:

It means their resolved grammars cannot be cached — every Prism instance needs to spawn its own (though that might be unavoidable since plugins could also modify them)
@RunDevelopment warned very strongly against them in [v2] Ditch custom tokenizers, make grammars more flexible instead #3911 (comment), mentioning ordering effects that created chaos.

Optional dependencies beyond actual embedding are a smell

Even in use cases that are "proper" optional dependencies, it feels that this logic should really live with the child language. But …if it does, that would mean the child language modifies the parent language, which, as described above, is evil!

Not necessarily. Languages modifying other languages was one way to do it. What if there are others?

Essentially, in all of these, we have one language adding granularity to another. In most of them, we don't want users to have to opt in separately for every use, so languages modifying other languages was invented as a solution to that. E.g. you may want all your CSS examples to be highlighted with the granularity of css-extras, and it would be annoying (and incompatible) to have to specify language-css-extras each time. In many cases, an opt-in doesn't make much sense at all, and it's really about not bloating the bundle size. E.g. of course you want to highlight JS in HTML if you have a JS language definition loaded.

Ideas

These are currently mainly for 4. I have some ideas for the rest, but it's 4 that is the hairiest.

1. Language aliases

We could extend the concept of language aliases to existing languages. Then e.g. css-extras could be defined as just regular inheritance over css and one could alias css to css-extras.

Pros: Predictable, not affected by ordering effects, re-uses an existing mechanism (inheritance)
Cons: Lacks composability. How do I use two different types of "extras"?
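A minimal sketch of what aliasing an existing language could look like, assuming a hypothetical `Registry` class (none of these names are Prism's API, and `tokens` is just a stand-in for a real grammar):

```typescript
interface Language {
  id: string;
  base?: string;     // regular inheritance, e.g. css-extras over css
  tokens: string[];  // stand-in for a real grammar
}

class Registry {
  private languages = new Map<string, Language>();
  private aliases = new Map<string, string>();

  register(language: Language): void {
    this.languages.set(language.id, language);
  }

  // Unlike a classic alias, `from` may be the id of an existing language.
  alias(from: string, to: string): void {
    this.aliases.set(from, to);
  }

  resolve(id: string): Language | undefined {
    const seen = new Set<string>();
    let current = id;
    // Follow alias chains, stopping if we come back to an id we already visited.
    while (this.aliases.has(current) && !seen.has(current)) {
      seen.add(current);
      current = this.aliases.get(current)!;
    }
    return this.languages.get(current);
  }
}

const registry = new Registry();
registry.register({ id: "css", tokens: ["selector", "property"] });
registry.register({
  id: "css-extras",
  base: "css",
  tokens: ["selector", "property", "pseudo-class", "hexcode"],
});

// Opt in once: every language-css block now resolves to css-extras.
registry.alias("css", "css-extras");
console.log(registry.resolve("css")?.id); // css-extras
```

The composability problem shows up directly here: an id can only point at one target, so a second `alias("css", ...)` call would replace the first rather than combine with it.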

2. Language extensions layered on top of existing languages without modifying them

Instead of language extensions actually mutating the host language, what if languages could declare that they are automatically applied within certain tokens of other languages?

Pros: Composability
Cons: Unclear if this would actually not cause the same issues as languages being mutated, since in theory there could still be ordering effects here
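A very rough sketch of the layering idea, under the assumption that an extension declares a host language and the token it applies within. All names and the simplified CSS patterns are hypothetical, including the second extension `css-attr`, which exists only to show two extensions composing. The order is taken from the caller's list rather than from load order, which is one possible way to pin down ordering but not a claim that it removes ordering effects.

```typescript
interface TokenDef { pattern: RegExp; inside?: Grammar; }
type Grammar = Record<string, RegExp | TokenDef>;

// Hypothetical declaration: "apply my tokens inside token `within` of language `host`".
interface Extension { id: string; host: string; within: string; inside: Grammar; }

const languages: Record<string, Grammar> = {
  css: {
    selector: /[^{}\s][^{}]*(?=\s*\{)/,
    property: /[-\w]+(?=\s*:)/,
  },
};

const extensions: Extension[] = [
  { id: "css-extras", host: "css", within: "selector", inside: { "pseudo-class": /:[-\w]+/ } },
  { id: "css-attr", host: "css", within: "selector", inside: { attribute: /\[[^\]]+\]/ } },
];

function toDef(token: RegExp | TokenDef): TokenDef {
  return token instanceof RegExp ? { pattern: token } : token;
}

function withExtensions(hostId: string, enabled: string[]): Grammar {
  const host = languages[hostId];
  if (!host) throw new Error(`Unknown language: ${hostId}`);

  const result: Grammar = { ...host }; // shallow copy; the host grammar is never mutated
  for (const id of enabled) {          // caller-defined order, not load order
    const ext = extensions.find((e) => e.id === id && e.host === hostId);
    if (!ext) continue;
    const token = result[ext.within];
    if (token === undefined) continue;
    const def = toDef(token);
    result[ext.within] = { ...def, inside: { ...def.inside, ...ext.inside } };
  }
  return result;
}

const plain = withExtensions("css", []);
const rich = withExtensions("css", ["css-extras", "css-attr"]);

console.log(plain["selector"] instanceof RegExp);               // true
console.log(Object.keys(toDef(rich["selector"]!).inside ?? {})); // [ "pseudo-class", "attribute" ]
```

Because each call builds a fresh grammar, different parts of a page could in principle use different sets of extensions, at the cost of resolved grammars no longer being trivially cacheable per language id.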

3. Language modifications with defined ordering

Languages like css-extras are never autoloaded, right? They need to be explicitly loaded …somehow. So perhaps the ordering effects go away on their own in v2, simply because loading order is much more well defined.

Additionally, we could soften the blow by making it configurable with a Prism config option for how to handle extends languages:

Default: Modify the parent language
Create a new language definition. E.g. you'd need to use language-css-extras explicitly to use css-extras

In fact, we could create the new language definition anyway.


LeaVerou · 2025-05-08T14:34:52Z

Thinking about this some more:

I'm considering moving back from $language to just a string key with the language name. I'm not sure $language adds anything over it, and there is no ambiguity. For the cases of dynamic languages, we can just allow $rest and $inside to take (groups) => string functions in the same way.
We need a new key that works like $rest but doesn't operate on the unmatched tokens separately but removes them, highlights the result as a whole, and then re-inserts them. This is how templating languages work currently, and a couple more. No idea what that might be called, suggestions welcome.
I've been wondering about introducing a language-xxx:yyy convention for parent-child languages. Then languages can be fetched independently, we don't have to know that language xxx supports child languages (but we would need to fetch it to see what its default child is). E.g. language-liquid would be the same as language-liquid:markup and language-diff would imply language-diff:none, so one would use e.g. language-diff-css to highlight CSS diffs. Alternatively, we could formalize that language ids cannot have hyphens in them, and then we can just use a hyphen to separate the parent from the child. From a quick look we have very few languages with hyphens in them (though I haven't checked aliases yet):

avro-idl
css-extras
css-selector
dns-zone-file
excel-formula
firestore-security-rules
go-module
icu-message-format
js-templates
linker-script
nand2tetris-hdl
opencl-extensions
plant-uml
shell-session
solution-file
splunk-spl
t4-cs
t4-vb
visual-basic
web-idl
One consideration here is that we want to be compatible with GitHub, but none of the language ids it uses have hyphens.

Still no good idea about how to handle languages that make others more granular but should not be enabled by default (css-extras, opencl-extensions, js-templates). We may want to go back to having them as optional languages of their base language and essentially including them is the opt-in, but it would be nice to be able to turn them on or off for parts of the page without having to create separate Prism instances. Perhaps the parent-child syntax could work here too 🤔
Still no good idea about how to handle dynamic required dependencies short of making the core highlighting async.
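For the second point in this list (a key that removes the matched fragments, highlights the result as a whole, and then re-inserts them), the behaviour can be sketched as below. This is a simplified stand-in, not Prism's templating code: the two highlighter callbacks are fakes that wrap text in brackets, and the placeholder format is an assumption that relies on the placeholder text not occurring in the input.

```typescript
// Replace every embedded fragment with a placeholder, highlight the remaining
// host code as one piece, then put the (separately highlighted) fragments back.
function highlightWithPlaceholders(
  code: string,
  embedded: RegExp, // must have the `g` flag so all fragments are replaced
  highlightHost: (code: string) => string,
  highlightEmbedded: (code: string) => string,
): string {
  const fragments: string[] = [];
  const stripped = code.replace(embedded, (fragment) => {
    fragments.push(fragment);
    return `___PLACEHOLDER${fragments.length - 1}___`;
  });

  const highlighted = highlightHost(stripped);

  return highlighted.replace(/___PLACEHOLDER(\d+)___/g, (_match, index: string) =>
    highlightEmbedded(fragments[Number(index)] ?? ""));
}

// Fake highlighters for illustration: they only wrap what they recognise in brackets.
const liquid = '<a href="{{ url }}">x</a>';
const html = highlightWithPlaceholders(
  liquid,
  /\{\{[\s\S]*?\}\}/g,
  (host) => host.replace(/<\/?\w+[^>]*>/g, (tag) => `[tag ${tag}]`),
  (fragment) => `[liquid ${fragment}]`,
);

console.log(html);
// [tag <a href="[liquid {{ url }}]">]x[tag </a>]
```

The example shows why this differs from a $rest-style approach: the opening tag is split in two by the Liquid expression, so highlighting the unmatched pieces separately would never see a complete tag.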

DmitrySharabin · 2025-05-08T16:22:10Z

... none of the language ids it uses have hyphens

There are almost none. There is json-doc. 🙃 They also use underscores, like in literate_coffeescript, common_lisp, and a few more.

LeaVerou · 2025-05-08T20:20:49Z

... none of the language ids it uses have hyphens

There are almost none. There is json-doc. 🙃 They also use underscores, like in literate_coffeescript, common_lisp, and a few more.

Nice catch! Now I'm wondering WTF is json-doc 😛 The description doesn't explain much.

LeaVerou marked this as a duplicate of #3916 May 5, 2025

LeaVerou mentioned this issue May 5, 2025

[v2] Introduce language extensions as an explicit concept #3916

Closed

LeaVerou marked this as a duplicate of #3911 May 5, 2025

LeaVerou marked this as a duplicate of #3923 May 5, 2025
