Support for unicode and octal escapes in string literals. #65

Open
wagjo wants to merge 1 commit into edn-format:master from wagjo:master

@wagjo wagjo commented May 25, 2014

The spec does not say whether unicode and octal escapes are supported in string literals. Since clojure.edn supports them [1], I've added an explicit mention to the spec. I'm a registered Clojure contributor (signed CA).

[1] https://github.com/clojure/clojure/blob/c6756a8bab137128c8119add29a25b0a88509900/src/jvm/clojure/lang/EdnReader.java#L580

avodonosov commented Apr 22, 2020

@richhickey, the absence of unicode escapes in string literals is really limiting. And the reason for that is unclear, given that unicode escapes are supported for characters.

bpsm added a commit to bpsm/edn-java that referenced this pull request Apr 25, 2020
This is in response to edn-format/edn#65 .

This is an extension, as string literals as currently documented
do not specify support for \uXXXX escapes.

  https://github.com/edn-format/edn/tree/a51127aecd318096667ae0dafa25353ecb07c9c3

Notes:

- Unicode escape must begin with "\u". This is case sensitive: "\U" will
  be rejected.
- "\u" must be followed by exactly four hex digits taken from this set:
  0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
- The digits are not case sensitive.
- Each such Unicode escape encodes a single 16-bit Java char. Since Java
  uses UTF-16 internally (for historical reasons), code points beyond
  the basic multilingual plane are encoded as a pair of unicode escapes
  (see also "surrogate pairs").
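
The rules in the notes above can be sketched in a few lines. This is only an illustration in Python, not edn-java's code: the function name `decode_unicode_escapes` is made up for this example, and other backslash escapes are copied through untouched because they are out of scope here.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def decode_unicode_escapes(body: str) -> str:
    """Decode \\uXXXX escapes in the body of a string literal.

    Hypothetical helper: only the \\u rule from the notes is handled.
    """
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\" or i + 1 >= len(body):
            out.append(ch)
            i += 1
            continue
        marker = body[i + 1]
        if marker == "U":
            # The escape marker is case sensitive: "\U" is rejected.
            raise ValueError("unicode escape must begin with \\u, not \\U")
        if marker != "u":
            # Some other escape (\n, \\, ...): copy both characters verbatim.
            out.append(body[i:i + 2])
            i += 2
            continue
        digits = body[i + 2:i + 6]
        if len(digits) != 4 or not set(digits) <= HEX_DIGITS:
            raise ValueError("\\u must be followed by exactly four hex digits")
        # Each escape names one 16-bit UTF-16 code unit.
        out.append(chr(int(digits, 16)))
        i += 6
    # Python strings hold code points rather than UTF-16 units, so round-trip
    # through UTF-16 to merge a high/low surrogate pair into one character.
    return "".join(out).encode("utf-16", "surrogatepass").decode("utf-16")
```

With this sketch, `decode_unicode_escapes(r"\ud83d\ude00")` yields the single code point U+1F600, which is the "pair of unicode escapes" case from the last bullet. An unpaired surrogate makes the final decode step fail, which is one possible policy among several.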
bpsm added a commit to bpsm/edn-java that referenced this pull request May 1, 2020
This is in response to edn-format/edn#65 .

This is an extension, as string literals as currently documented
do not specify support for \uXXXX escapes.

  https://github.com/edn-format/edn/tree/a51127aecd318096667ae0dafa25353ecb07c9c3

Syntax Notes:

- Unicode escape must begin with "\u". This is case sensitive: "\U" will
  be rejected.
- "\u" must be followed by exactly four hex digits taken from this set:
  0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
- The digits are not case sensitive.
- Each such Unicode escape encodes a single 16-bit Java char. Since Java
  uses UTF-16 internally (for historical reasons), code points beyond
  the basic multilingual plane are encoded as a pair of unicode escapes
  (see also "surrogate pairs").

Disabling:

By default \uXXXX escapes are now supported in String literals.

Parser.Config (and Parser.Config.Builder) now support a flag which can
be set to false to disable support for \uXXXX in string literals. This
restores the old behavior of throwing an EdnSyntaxException when such
escapes are encountered.
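
The on-by-default flag described above might look roughly like this. It is a Python sketch of the idea only: `ParserConfig`, `unicode_escapes_in_strings`, `EdnSyntaxError` and `read_string_body` are hypothetical names, not edn-java's actual API, and the sketch assumes the string body contains no backslash escapes other than `\uXXXX`.

```python
import re
from dataclasses import dataclass


class EdnSyntaxError(ValueError):
    """Hypothetical stand-in for the EdnSyntaxException mentioned above."""


@dataclass(frozen=True)
class ParserConfig:
    # Hypothetical flag name. Escapes are accepted unless this is False.
    unicode_escapes_in_strings: bool = True


_UNICODE_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def read_string_body(body: str, config: ParserConfig = ParserConfig()) -> str:
    # Assumption: no other backslash escapes occur in `body`.
    if not config.unicode_escapes_in_strings:
        if "\\u" in body:
            # Old behaviour: a \u escape in a string is a syntax error.
            raise EdnSyntaxError("\\uXXXX escapes are disabled")
        return body
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), body)
```

Callers who rely on the stricter behaviour would pass `ParserConfig(unicode_escapes_in_strings=False)`; everyone else gets the new default without changing any code.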

avodonosov commented

The maintainer of the edn-java library kindly agreed to implement unicode escapes in the library. Initially this was planned as an option, disabled by default. After it was implemented that way, it turned out that https://github.com/clojure/tools.reader supports unicode escapes by default, so edn-java ended up shipping unicode escapes enabled by default.

It turns out https://github.com/clojure/tools.reader also supports octal escapes in string and character literals, the same as in the Clojure language. (The current edn spec includes unicode escapes for characters, but lacks octal escapes.)

@richhickey IMHO clarity is needed in the spec. It's strange that unicode escapes are not specified for strings while they are specified for characters. And what about octal escapes?

@wagjo, if your pull request includes octal escapes for string literals, it makes sense to include them for characters too (the Clojure language and tools.reader support them in the form \oNNN).

As for backwards compatibility, I would suggest including the escapes in the spec and adding a comment: "Unicode and octal escapes in string literals and octal escapes in character literals were only added to the spec in 2020. Some implementations supported them before that. For compatibility, consumers of EDN documents (including parsing libraries) should always support the escapes. Suppliers of EDN documents should avoid the escapes unless they have verified that all consumers of their documents support them."

avodonosov commented May 3, 2020

BTW, in Java, octal escapes in string literals can contain up to 3 digits (https://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html), while the Clojure reader and clojure.tools.reader.edn require exactly 3 digits after the backslash.

So @wagjo, the wording "as in Java" in the pull request does not precisely match the current implementations.
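
To make the difference concrete, here is a small Python sketch of the two rules. The Java pattern follows the octal escape grammar in the JLS chapter linked above; the fixed-width pattern encodes the "exactly 3 digits" behaviour as described in this comment, not something checked against the readers' source. The names `JAVA_OCTAL`, `FIXED_OCTAL` and `decode_octal` are made up for the example.

```python
import re

# Java string literals: one to three octal digits, where a three-digit
# escape must start with 0-3 so that the value stays within 0..255.
JAVA_OCTAL = re.compile(r"\\([0-3][0-7]{2}|[0-7]{1,2})")

# Fixed-width rule as described above: exactly three digits, same value range.
FIXED_OCTAL = re.compile(r"\\([0-3][0-7]{2})")


def decode_octal(body: str, pattern: re.Pattern) -> str:
    """Replace every escape matched by `pattern` with the character it names.

    Text the pattern does not match is left unchanged. A real reader would
    reject a malformed escape instead of passing it through.
    """
    return pattern.sub(lambda m: chr(int(m.group(1), 8)), body)
```

Under these assumptions `\101` decodes to `A` with either pattern, while `\7` is an escape only under the Java rule.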

RhymeRabbit added a commit to RhymeRabbit/soundes that referenced this pull request Aug 22, 2025
echoxLogic added a commit to echoxLogic/lampTen that referenced this pull request Aug 25, 2025
76293872 added a commit to 76293872/symmet that referenced this pull request Sep 15, 2025
futarisoio added a commit to futarisoio/liteeth that referenced this pull request Oct 12, 2025
fvpasses added a commit to fvpasses/train that referenced this pull request Oct 13, 2025

mnemnion commented

@avodonosov on the "one may always loosen a restriction" premise, the Clojure reader could adopt the Java approach, and this would only impact previously-invalid strings.

I'm not so sure that it should. The advantage of exact widths is that no one will write "\03 3 more", then later edit it to remove the space, giving "\033 more", and have the octal escape "eat" what was until then an ASCII-literal 3. Similar considerations led several languages (example: C) to add \U with eight hex digits to accompany \u with four, because just allowing more digits was a breaking change.

This one would not be breaking, but I'm left with the sentiment that variable-width escapes without delimiters are not an obviously-good idea.
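
The editing hazard described above is easy to reproduce. The following Python sketch uses a hypothetical `decode_variable_width` helper that applies Java-style variable-width octal escapes (one to three digits) to a raw string body:

```python
import re

# Java-style octal escape: up to three digits, value limited to 0..255.
_VARIABLE_OCTAL = re.compile(r"\\([0-3][0-7]{2}|[0-7]{1,2})")


def decode_variable_width(body: str) -> str:
    """Hypothetical decoder that accepts one to three octal digits."""
    return _VARIABLE_OCTAL.sub(lambda m: chr(int(m.group(1), 8)), body)


# Original text: escape \03 (U+0003), a space, then a literal "3".
before = decode_variable_width(r"\03 3 more")

# After deleting the space, the literal "3" becomes part of the escape,
# which now reads as \033 (U+001B, ESC).
after = decode_variable_width(r"\033 more")
```

Here `before` still contains the character `3`, while `after` does not: the digit has been absorbed into the escape. With a fixed three-digit rule, `\03` followed by a space would not have been a valid escape to begin with, so the first form could never have been written.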

avodonosov commented Dec 25, 2025

@mnemnion, to be clear, I didn't suggest adopting the Java way (variable-length octal escapes). I just pointed out the difference, and therefore that the wording "as in Java" is not correct.

You are right that loosening the fixed-length restriction would require very careful consideration.
