Skip to content
Open
Show file tree
Hide file tree
Changes from 20 commits
Commits
Show all changes
52 commits
Select commit Hold shift + click to select a range
4a8ca98
Auto-escape JavaScript and JSON script tags when necessary
sirreal Dec 15, 2025
4a28ef2
Update ticket number
sirreal Dec 15, 2025
0bef687
Fix those lints
sirreal Dec 15, 2025
cdba027
No trailing function commas
sirreal Dec 15, 2025
ba54ae4
Remove JSON_THROW_ON_ERROR constant
sirreal Dec 15, 2025
a697e9e
fixup! Remove JSON_THROW_ON_ERROR constant
sirreal Dec 15, 2025
2b3d0d0
Merge branch 'trunk' into html-api/auto-escape-javascript-json
sirreal Dec 16, 2025
8246439
Remove JSON +json subtype handling
sirreal Dec 16, 2025
3f7f227
Merge branch 'trunk' into html-api/auto-escape-javascript-json
sirreal Dec 19, 2025
1362451
Add JS and JSON script tag tests
Copilot Dec 19, 2025
8d30680
Fix typo in comment
sirreal Dec 19, 2025
3b5ef4e
Improve comment
sirreal Dec 19, 2025
f8cfdf9
Clean up and fix tests
sirreal Dec 19, 2025
501d201
Clean up and improve type/language logic
sirreal Dec 19, 2025
bcc02ae
Fix svg SCRIPT tag tests
sirreal Dec 19, 2025
ea03441
Add todo on MIME type parsing
sirreal Dec 19, 2025
c7d1827
Update since tags 🤞
sirreal Dec 19, 2025
a134e82
Make is_{javascript,json}_script_tag methods private
sirreal Dec 19, 2025
d4bd4b3
Add language whitespace test
sirreal Dec 19, 2025
edae8d5
How'd that extra space get there, remove it!
sirreal Dec 19, 2025
02ca3c0
Name search parts
sirreal Dec 19, 2025
d058a78
Clean up escaping tests
sirreal Dec 19, 2025
dfb63af
Add ignore and todo tags to new private is_*_script_tag functions
sirreal Dec 19, 2025
11f51c9
Improve example comment
sirreal Dec 19, 2025
a3e0e27
Update regex comment with named groups
sirreal Dec 19, 2025
288b952
Add details to "other" failure to escape comment
sirreal Dec 19, 2025
a7495dd
Improve consistency of comment quoting and differentiate types of dou…
sirreal Dec 19, 2025
d8c320c
Describe JavaScript escaping strategy in detail
sirreal Dec 19, 2025
369eefc
Use the updated_html value in round-trip test
sirreal Dec 19, 2025
55d4e47
Merge branch 'trunk' into html-api/auto-escape-javascript-json
sirreal Dec 22, 2025
2ef0bf0
Fix demonstration comment
sirreal Dec 22, 2025
cb6b990
Extract and document escaping functions
sirreal Dec 22, 2025
e399bf6
Tweak workaround documentation
sirreal Dec 22, 2025
83c1fab
Add ASCII chart about HTML script tags with original source
sirreal Dec 22, 2025
1872681
Improve linking between escapes
sirreal Dec 22, 2025
83ff62f
Fix comments, typos, lints
sirreal Dec 22, 2025
5979782
fixup! Fix comments, typos, lints
sirreal Dec 22, 2025
d469ae4
fixup! fixup! Fix comments, typos, lints
sirreal Dec 22, 2025
402ae9f
Fix \c -> \r (carriage return) typo
sirreal Dec 22, 2025
6b2c9ba
Add note about not parsing MIME types for JS script tags
sirreal Dec 22, 2025
eef0ccb
Add todo comment to is_json_script_tag
sirreal Dec 22, 2025
71d2686
Re-order tag name termination chars to match elsewhere
sirreal Dec 22, 2025
d4693a2
Fix typo
sirreal Dec 22, 2025
eb4e091
Merge branch 'trunk' into html-api/auto-escape-javascript-json
sirreal Dec 29, 2025
4c3b0b2
Update comments on tag prefixes matching search pattern
sirreal Dec 29, 2025
bfa60cf
Use content-type identification and escape without PCRE
dmsnell Dec 30, 2025
3252160
fixup! Use content-type identification and escape without PCRE
dmsnell Dec 30, 2025
0b9b2bd
fixup! Use content-type identification and escape without PCRE
dmsnell Dec 30, 2025
b2533e1
fixup! Use content-type identification and escape without PCRE
dmsnell Dec 31, 2025
ddeb301
fixup! Use content-type identification and escape without PCRE
dmsnell Dec 31, 2025
b36fc32
fixup! Use content-type identification and escape without PCRE
dmsnell Dec 31, 2025
1249af0
fixup! Use content-type identification and escape without PCRE
dmsnell Dec 31, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
238 changes: 223 additions & 15 deletions src/wp-includes/html-api/class-wp-html-tag-processor.php
Original file line number Diff line number Diff line change
Expand Up @@ -3811,29 +3811,81 @@ public function set_modifiable_text( string $plaintext_content ): bool {

switch ( $this->get_tag() ) {
case 'SCRIPT':
/**
* This is over-protective, but ensures the update doesn't break
* the HTML structure of the SCRIPT element.
/*
* SCRIPT tag contents can be dangerous.
*
* The text `</script>` could close the SCRIPT element prematurely.
*
* More thorough analysis could track the HTML tokenizer states
* and to ensure that the SCRIPT element closes at the expected
* SCRIPT close tag as is done in {@see ::skip_script_data()}.
* The text `<script>` could enter the "script data double escaped state", preventing the
* SCRIPT element from closing as expected, for example:
*
* A SCRIPT element could be closed prematurely by contents
* like `</script>`. A SCRIPT element could be prevented from
* closing by contents like `<!--<script>`.
* <script>
* // If this "<!--" then "<script>" the closing tag will not be recognized.
* </script>
* <h1>This appears inside the preceding SCRIPT element.</h1>
*
* The following strings are essential for dangerous content,
* although they are insufficient on their own. This trade-off
* prevents dangerous scripts from being sent to the browser.
* It is also unlikely to produce HTML that may confuse more
* basic HTML tooling.
* The relevant state transitions happen on text like:
* 1. <
* 2. / (optional)
* 3. script (case-insensitive)
* 4. One of the following characters:
* - \t
* - \n
* - \r (\r and \r\n newlines are normalized to \n in HTML pre-processing)
* - \f
* - " " (U+0020 SPACE)
* - /
* - >
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we’re not checking for these state transitions. in the code following, we’re checking for the simplified transition, akin to the simplified diagram

I think it would be helpful to match the discussion of the approach with the comment. that is, I don’t know if it’s particularly helpful to spell out all of the characters when terser descriptions are warranted. e.g.

 * If the plaintext contains any sequences which would could be interpreted as
 * SCRIPT opening or closing tags, then it is sufficient to escape these. This
 * prevents getting into the dangerous double-escape state. Technically, what
 * matters is not the presence of a full or actual SCRIPT tag, but the start of
 * a tag containing the "SCRIPT" tag name.
 *
 * @see URL
 */
if ( false !== stripos( ... ) || false !== stripos( ... ) ) {

}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those lines capture all of the relevant state transitions, it's really about these minimal transitions that either close the script or move to double-escaped:

---
config:
---
stateDiagram-v2
  state "script data" as ScriptData
  state "escaped" as Escaped
  state "double escaped" as DoubleEscaped
  state "Close script" as CloseScript

  Escaped --> DoubleEscaped : #60;script[ \t\f\r\n/>]
  ScriptData --> CloseScript : #60;/script[ \t\f\r\n/>]
  Escaped --> CloseScript : #60;/script[ \t\f\r\n/>]
Loading

The idea being that if those transitions are prevented then the script contents cannot break the HTML structure.


I agree the documentation here needs to be reviewed broadly. It's important to document the escaping well for posterity to understand why this is implemented as it is.

*
* @see https://html.spec.whatwg.org/multipage/parsing.html#script-data-double-escaped-state
*/
if (
false !== stripos( $plaintext_content, '</script' ) ||
false !== stripos( $plaintext_content, '<script' )
) {
return false;
/*
* JavaScript can be safely escaped.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you happy with this comment? I think some things are very elaborate and technical, and worded in ways which could use some refinement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment was one of the last things I added and could certainly use revision.

if ( $this->is_javascript_script_tag() ) {
$plaintext_content = preg_replace_callback(
/*
* This case-insensitive pattern consists of three groups:
*
* 1: "<" or "</"
* 2: "s"
* 3: "cript" + a trailing character that terminates a tag name.
*/
'~(</?)(s)(cript[\\t\\r\\n\\f />])~i',
static function ( $matches ) {
$escaped_s_char = 's' === $matches[2]
? '\\u0073'
: '\\u0053';
return "{$matches[1]}{$escaped_s_char}{$matches[3]}";
},
$plaintext_content
);
} elseif ( $this->is_json_script_tag() ) {
/**
* JSON can be safely escaped.
*
* The following replacement may appear insufficient, "<" is replaced
* with its JSON escape sequence "\u003C" without considering whether
* the "<" is preceded by an escaping backslash. JSON does not support
* arbitrary character escaping in strings (unlike JavaScript) so "\<"
* is invalid JSON and does not need to be considered.
*
* @see https://www.json.org/json-en.html
*/
$plaintext_content = strtr(
$plaintext_content,
array( '<' => '\\u003C' )
);
} else {
/*
* Other types of script tags cannot be escaped safely.
*/
return false;
}
}

$this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement(
Expand Down Expand Up @@ -3891,6 +3943,162 @@ static function ( $tag_match ) {
return false;
}

/**
* Indicates if the currently matched tag is a JavaScript script tag.
*
* @see https://html.spec.whatwg.org/multipage/scripting.html#prepare-the-script-element
*
* @since 7.0.0
*
* @return bool True if the script tag will be evaluated as JavaScript.
*/
private function is_javascript_script_tag(): bool {
if ( 'SCRIPT' !== $this->get_tag() || $this->get_namespace() !== 'html' ) {
return false;
}

/*
* > If any of the following are true:
* > - el has a type attribute whose value is the empty string;
* > - el has no type attribute but it has a language attribute and that attribute's
* > value is the empty string; or
* > - el has neither a type attribute nor a language attribute,
* > then let the script block's type string for this script element be "text/javascript".
*/
$type_attr = $this->get_attribute( 'type' );
$language_attr = $this->get_attribute( 'language' );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like this has the potential to be clearer if we did one of two things:

  • early-abort when both the type and language attributes are missing.
  • null-coalesce to some value like '' which would be semantically the same in these checks as null but allow us to treat the values as strings.

or even something like this…

$type = $this->get_attribute( 'type' );
$type = is_string( $type ) ?: '';        // combine `true` and `null` into ''.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this can be simpler. The different cases seem clear so we can also add some unit tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed this and simplified or clarified it slightly, but I think it matches the specified behavior well and I'm not sure that further changes will improve things.

if ( true === $type_attr || '' === $type_attr ) {
return true;
}
if (
null === $type_attr
&& ( null === $language_attr || true === $language_attr || '' === $language_attr )
) {
return true;
}

/*
* > Otherwise, if el has a type attribute, then let the script block's type string be
* > the value of that attribute with leading and trailing ASCII whitespace stripped.
* > Otherwise, el has a non-empty language attribute; let the script block's type string
* > be the concatenation of "text/" and the value of el's language attribute.
*/
$type_string = null !== $type_attr ? trim( $type_attr, " \t\f\r\n" ) : "text/{$language_attr}";

/*
* > If the script block's type string is a JavaScript MIME type essence match, then
* > set el's type to "classic".
*
* > A string is a JavaScript MIME type essence match if it is an ASCII case-insensitive
* > match for one of the JavaScript MIME type essence strings.
*
* > A JavaScript MIME type is any MIME type whose essence is one of the following:
* >
* > - application/ecmascript
* > - application/javascript
* > - application/x-ecmascript
* > - application/x-javascript
* > - text/ecmascript
* > - text/javascript
* > - text/javascript1.0
* > - text/javascript1.1
* > - text/javascript1.2
* > - text/javascript1.3
* > - text/javascript1.4
* > - text/javascript1.5
* > - text/jscript
* > - text/livescript
* > - text/x-ecmascript
* > - text/x-javascript
*
* @see https://mimesniff.spec.whatwg.org/#javascript-mime-type-essence-match
* @see https://mimesniff.spec.whatwg.org/#javascript-mime-type
*/
switch ( strtolower( $type_string ) ) {
case 'application/ecmascript':
case 'application/javascript':
case 'application/x-ecmascript':
case 'application/x-javascript':
case 'text/ecmascript':
case 'text/javascript':
case 'text/javascript1.0':
case 'text/javascript1.1':
case 'text/javascript1.2':
case 'text/javascript1.3':
case 'text/javascript1.4':
case 'text/javascript1.5':
case 'text/jscript':
case 'text/livescript':
case 'text/x-ecmascript':
case 'text/x-javascript':
return true;

/*
* > Otherwise, if the script block's type string is an ASCII case-insensitive match for
* > the string "module", then set el's type to "module".
*
* A module is evaluated as JavaScript.
*/
case 'module':
return true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PhpStorm complains about having a separate case here:

Image

It suggests moving the module case up above with the others.

This is purely stylistic, I acknowledge.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer this as-is, it allows quotes referencing different specifications to remain separate.

}

/*
* > Otherwise, if the script block's type string is an ASCII case-insensitive match for the string "importmap", then set el's type to "importmap".
* > Otherwise, if the script block's type string is an ASCII case-insensitive match for the string "speculationrules", then set el's type to "speculationrules".
*
* These conditions indicate JSON content.
*/

/*
* > Otherwise, return. (No script is executed, and el's type is left as null.)
*/
return false;
}

/**
* Indicates if the currently matched tag is a JSON script tag.
*
* @since 7.0.0
*
* @return bool True if the script tag should be treated as JSON.
*/
private function is_json_script_tag(): bool {
if ( 'SCRIPT' !== $this->get_tag() || $this->get_namespace() !== 'html' ) {
return false;
}

$type = $this->get_attribute( 'type' );
if ( null === $type || true === $type || '' === $type ) {
return false;
}
$type = strtolower( trim( $type, " \t\f\r\n" ) );

/*
* > …
* > Otherwise, if the script block's type string is an ASCII case-insensitive match for the string "importmap", then set el's type to "importmap".
* > Otherwise, if the script block's type string is an ASCII case-insensitive match for the string "speculationrules", then set el's type to "speculationrules".
* @see https://html.spec.whatwg.org/#script-processing-model
*
* > A JSON MIME type is any MIME type whose subtype ends in "+json" or whose essence
* > is "application/json" or "text/json".
*
* @todo The JSON MIME type handling handles some common cases but when MIME type parsing is available it should be leveraged here.
*
* @see https://mimesniff.spec.whatwg.org/#json-mime-type
*/
if (
'importmap' === $type ||
'speculationrules' === $type ||
'application/json' === $type ||
'text/json' === $type
) {
return true;
}

return false;
}

/**
* Updates or creates a new attribute on the currently matched tag with the passed value.
*
Expand Down
Loading
Loading