[localize] Fix "&", "<" and ">" getting replaced with html escape sequences. #5058

IIIMADDINIII · 2025-08-22T20:50:34Z

"&", "<" and ">" do not need to be escaped in template literal strings.
Escaping them produces invalid translations when these symbols are used in a non-HTML context.
I think it is the code author's or translator's responsibility to properly escape special characters depending on the context.
lit/localize cannot know in which context the strings are used.
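
To illustrate the problem, here is a minimal sketch. `escapeHtmlText` is a hypothetical stand-in for the escaping applied to built translations, not the actual @lit/localize-tools code, and the sample string is made up.

```typescript
// Hypothetical stand-in for the escaping step; an assumption for illustration.
function escapeHtmlText(text: string): string {
  return text
    .replace(/&/g, '&amp;') // must run first so later entities are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// A translation that is used as a plain string, not inside an html template.
const translated = 'Terms & Conditions';
const emitted = escapeHtmlText(translated);

// In a non-HTML context (a document title, a log message, a property value)
// nothing decodes the entity, so the user would see it literally.
console.log(emitted); // Terms &amp; Conditions
```

Whether a given string ends up in HTML or not is only known at the call site, which is why the escaping arguably belongs there.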

google-cla · 2025-08-22T20:50:39Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

justinfagnani · 2025-08-22T21:26:16Z

@aomarks any thoughts?

@IIIMADDINIII is there some kind of test you can add so this won't regress if it's a good fix?

changeset-bot · 2025-08-24T09:56:44Z

🦋 Changeset detected

Latest commit: a366b9b

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages

Name	Type
@lit/localize-tools	Minor
@lit-labs/cli-localize	Patch
@lit-labs/cli	Patch


IIIMADDINIII · 2025-08-24T13:28:22Z

After digging a little deeper, I figured out the following:

@justinfagnani There is already a test for this, but the expected value included the escape sequences:
https://github.com/lit/lit/pull/5058/files#diff-f69680f4d4214898d3d6247b7ad6a30a08aec9f7ff0f83fb699c59668cfe07efL347-L351

There also already was a test case for keeping escape sequences in html translations:
https://github.com/lit/lit/blob/main/packages/localize-tools/testdata/build-runtime-xliff/input/foo.ts#L59-L60
The previous logic worked as follows:

Extract would parse html templates with html5 -> removing escape sequences, but only from html templates
Write the translations to the XLIFF file -> adding escape sequences as needed for it to be valid XML (all types)
Build would read the XLIFF file from disk -> removing the previously added escape sequences (all types)
escapeTextContentToEmbedInTemplateLiteral -> always adding escape sequences for some characters (all types)

This pull request changes this by not removing the escape sequences during extract (it copies the source string instead of using the parser result). The escape sequences therefore do not need to be added back by escapeTextContentToEmbedInTemplateLiteral, which is flawed.
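
The two flows can be modelled roughly as below. This is a simplified sketch, not the real implementation: `escapeMarkup`, `unescapeMarkup`, `oldPipeline` and `newPipeline` are hypothetical names, and only the three characters discussed here are handled.

```typescript
type Kind = 'html' | 'str';

// Simplified stand-ins for XML/HTML escaping (assumptions for illustration).
const escapeMarkup = (s: string): string =>
  s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
const unescapeMarkup = (s: string): string =>
  s.replace(/&lt;/g, '<').replace(/&gt;/g, '>').replace(/&amp;/g, '&');

// Previous behavior: four steps, the last one always re-escapes.
function oldPipeline(source: string, kind: Kind): string {
  const extracted = kind === 'html' ? unescapeMarkup(source) : source; // extract
  const xliff = escapeMarkup(extracted); // write XLIFF
  const read = unescapeMarkup(xliff); // read XLIFF during build
  return escapeMarkup(read); // re-escape for the template literal
}

// Proposed behavior: copy the source as written, let XML escaping round-trip.
function newPipeline(source: string): string {
  const xliff = escapeMarkup(source); // write XLIFF (entities become double escaped)
  return unescapeMarkup(xliff); // read XLIFF during build
}

console.log(oldPipeline('a &lt; b', 'html')); // a &lt; b  (round-trips)
console.log(oldPipeline('a < b', 'str')); // a &lt; b  (escape added that was not in the source)
console.log(newPipeline('a &lt; b')); // a &lt; b
console.log(newPipeline('a < b')); // a < b
```

In this model the old flow only round-trips html templates, while the new flow returns every source string unchanged.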

Advantages of this solution:

Not removing escape sequences which are in the source
Not adding escape sequences which are not in the source
I think it is expected behavior that translations do not mess with escape sequences
All Types ("", str`` , html`` ) have the same Behavior.

Disadvantages of this solution:

HTML escape sequences in source are double escaped in XLIFF files (The & sign needs to be escaped in XLIFF)
Changing XLIFF Source Translations for existing projects might be a breaking change

I would consider this a breaking change, because if a project has existing translations, these would need to be fixed to add the escape sequences which were previously removed.
If a source string currently contains a < and was already translated, then the target is not double escaped.
So even if the sources are extracted again, the translations do not contain the double escapes, and during build the escaping would be removed, creating invalid HTML.

I think this pull request implements the correct behavior, but if the breaking change is not desired, it might be a good idea to instead only apply the escaping in escapeTextContentToEmbedInTemplateLiteral for html templates.
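
A rough sketch of that alternative, assuming the template kind is known at the point of emission. `escapeForEmit` and `TemplateKind` are hypothetical names for illustration and do not reflect the real signature of escapeTextContentToEmbedInTemplateLiteral.

```typescript
type TemplateKind = 'html' | 'str' | 'string';

// Hypothetical helper: escape only when the text is emitted into an html template.
function escapeForEmit(text: string, kind: TemplateKind): string {
  if (kind !== 'html') {
    return text; // plain strings and str`` templates are left untouched
  }
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

console.log(escapeForEmit('1 < 2 & 3 > 2', 'str')); // 1 < 2 & 3 > 2
console.log(escapeForEmit('1 < 2 & 3 > 2', 'html')); // 1 &lt; 2 &amp; 3 &gt; 2
```

This would keep existing XLIFF files valid, at the cost of still rewriting escape sequences inside html templates.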

rictic · 2025-08-27T21:18:44Z

packages/localize-tools/src/tests/transform.unit.test.mts

  await checkTransform(
    'msg(str`Hello <b>${msg("World", {id: "bar"})}</b>!`, {id: "foo"});',
-    '`Hola &lt;b&gt;Mundo&lt;/b&gt;!`;',
+    '`Hola <b>Mundo</b>!`;',


This test does look more correct to me. We should not be doing HTML escaping of expressions that only contain msg() calls, string literals, and str templates.

We do need to be careful about HTML that we emit into html templates, because that is a security boundary for Lit (a maliciously crafted html template can execute arbitrary code; this is fine because html templates are themselves source code).

rictic · 2025-08-27T21:31:18Z

packages/localize-tools/testdata/build-runtime-xliff-ph/goldens/xliff/es-419.xlf

 <trans-unit id="h02c268d9b1fcb031">
-  <source>&lt;Hello<ph id="0">&lt;b></ph>&lt;World &amp; Friends><ph id="1">&lt;/b></ph>!></source>
-  <target>&lt;Hola<ph id="0">&lt;b></ph>&lt;Mundo &amp; Amigos><ph id="1">&lt;/b></ph>!></target>
+  <source>&amp;lt;Hello<ph id="0">&lt;b></ph>&amp;lt;World &amp;amp; Friends&amp;gt;<ph id="1">&lt;/b></ph>!&amp;gt;</source>


This looks less correct. Why is it double-escaped?

TL;DR
The original template for this translation is html`&lt;Hello<b>&lt;World &amp; Friends&gt;</b>!&gt;`, which already includes escape sequences. To write this in an XML file, the & symbols need to be escaped.

Long Version:
I am of the opinion that localize should not change how I write my HTML in templates. Let's say I want to put the cent symbol (¢) in a template, but for some reason I need to escape it. I would write: html`&cent;`.

The old version would convert all escape sequences back, so the &cent; would become a ¢ symbol. This is then written to the translation file as a ¢ symbol.
During build it would read the ¢ symbol and output html`¢` to the source. So it is not the same as how I wrote the template.

To fix this, we need to preserve the original escape sequences. So instead of writing the ¢ symbol to the translation file, we need to write &cent;. Doing so will cause the XML serializer to escape the & sign to make it valid XML. This results in &amp;cent; being written to the translation file. If you open this with an XML viewer you will not see the double escape.

The XML parser will convert the &amp;cent; back to &cent; while reading the file during build. So localize can see the original content and will emit html`&cent;` to the source. This is the behavior I would expect.
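
The cent example can be traced with a small sketch. `xmlEscape` and `xmlUnescape` are hand-written, simplified stand-ins for what an XML serializer and parser do with these characters; they are assumptions for illustration, not the library code.

```typescript
// Simplified model of XML text serialization and parsing (illustrative only).
const xmlEscape = (s: string): string =>
  s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
const xmlUnescape = (s: string): string =>
  s.replace(/&lt;/g, '<').replace(/&gt;/g, '>').replace(/&amp;/g, '&');

const source = '&cent;'; // as written inside the html template
const inXliff = xmlEscape(source); // what ends up in the .xlf file
const readBack = xmlUnescape(inXliff); // what build sees after parsing the file

console.log(inXliff); // &amp;cent;
console.log(readBack); // &cent;
```

The double escape only exists in the raw file; after parsing, the text is exactly what was written in the template.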

That explanation sounds reasonable to me

That makes sense to me. I'm surprised that no other tests needed changing here, just these .xlf files. Do we not use them in any other parts of the flow?

Because the behavior is mostly the same and most of the tests do not include escape sequences, not a lot needed to be changed. If it helps, I could add some tests to check that escape sequences are preserved.

Regarding the question: I do not know exactly. My understanding is that the translation files are updated when extract is run. When build is run, it will do an extract first (without updating translations) to check for missing translations and then read the translations from the files.

IIIMADDINIII · 2025-09-10T14:39:17Z

Is there anything I need to do to get this merged?
I am just asking because I am not so familiar with open source development.

Fix "&", "<" and ">" getting replaced with html escape sequences.

c3c1f59

IIIMADDINIII requested a review from kevinpschaaf as a code owner August 22, 2025 20:50

justinfagnani changed the title ~~Fix "&", "<" and ">" getting replaced with html escape sequences.~~ [localize] Fix "&", "<" and ">" getting replaced with html escape sequences. Aug 22, 2025

IIIMADDINIII force-pushed the patch-2 branch from c5c1ac4 to c3c1f59 Compare August 24, 2025 10:01

IIIMADDINIII added 3 commits August 24, 2025 14:40

Fix Parser removing html escapes sequences

1c41fba

Fix Tests

51f5f56

Change Set

a366b9b

IIIMADDINIII force-pushed the patch-2 branch from 548ec13 to a366b9b Compare August 24, 2025 12:42

rictic reviewed Aug 27, 2025


3 participants
