ongoing by Tim Bray · RFC 9839 and Bad Unicode


Unicode is good. If you’re designing a data structure or protocol that has text fields, they should contain
Unicode characters encoded in UTF-8. There’s another question, though:
“Which Unicode characters?” The
answer is “Not all of them, please exclude some.”

This issue keeps coming up, so Paul Hoffman and I put together an individual-submission draft
to the IETF and now (where by “now” I mean “two years later”) it’s been published as
RFC 9839. It explains which characters are bad, and why, then offers
three plausible less-bad subsets that you might want to use.
Herewith a bit of background, but…

Please ·
If you’re actually working on something new that will have text fields, please read the RFC. It’s only ten pages long, and that’s
with all the IETF boilerplate. It’s written specifically for software and networking people.

The smoking gun ·
The badness that 9839 focuses on is “problematic characters”, so let’s start with a painful example of what that means.
Suppose you’re designing a protocol that uses JSON and one of your constructs has a username field.
Suppose you get this message (I omit all the non-username fields). It’s
a perfectly legal JSON text:

{
    "username": "\u0000\u0089\uDEAD\uD9BF\uDFFF"
}    

Unpacking all the JSON escaping gibberish reveals that the value of the username field contains four
numeric “code points” identifying Unicode characters:

  1. The first code point is zero, in Unicode jargon U+0000. In human-readable text it
    has no meaning, but it will interfere with the operation of certain programming languages; C, for example,
    uses it to terminate strings.

  2. Next is Unicode U+0089, official name “CHARACTER TABULATION WITH JUSTIFICATION”. It’s what Unicode calls
    a C1 control code, inherited from ISO/IEC 6429:1992, adopted from ECMA 48 (1991), which calls it “HTJ”
    and says:

        HTJ causes the contents of the active field (the field in the presentation component that contains the
        active presentation position) to be shifted forward so that it ends at the character position preceding the
        following character tabulation stop. The active presentation position is moved to that following character
        tabulation stop. The character positions which precede the beginning of the shifted string are put into the
        erased state.

    Good luck with that.

  3. The third code point, U+DEAD, in Unicode lingo, is an “unpaired surrogate”. To understand,
    you’d have to learn how Unicode’s much-detested
    UTF-16 encoding works.
    I recommend not bothering.

    All you need to know is that surrogates are only meaningful when they come in pairs in UTF-16 encoded text. There is
    effectively no such text on the wire and thus no excuse for tolerating surrogates in your data. In fact, the UTF-8 specification
    says that you mustn’t use UTF-8 to encode surrogates. But the real problem is that different libraries in different
    programming languages don’t always do the same things when they encounter this sort of fœtid interloper.
    (There’s a Go sketch just after this list showing what one library does.)

  4. Finally, \uD9BF\uDFFF is JSON for the code point U+7FFFF.
    Unicode has a category called “noncharacter”, containing a few dozen code points that, for a variety of
    reasons, some good,
    don’t represent anything and must not be interchanged on the wire. U+7FFFF is one of those.
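
To make the “different libraries do different things” point concrete, here’s a little Go program that feeds the
example message to encoding/json. This is a sketch of one library’s behavior; as far as I can tell, current Go
quietly repairs the damage as described in the comments, while other languages’ JSON parsers may reject the
message, pass the bad code points through, or mangle them some other way.

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    // The example message from above, verbatim.
    raw := `{"username": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}`

    var msg struct {
        Username string `json:"username"`
    }
    if err := json.Unmarshal([]byte(raw), &msg); err != nil {
        panic(err) // not reached: Go accepts this message without complaint
    }

    // Walk the decoded value code point by code point.
    for _, r := range msg.Username {
        fmt.Printf("U+%04X\n", r)
    }
    // Prints U+0000, U+0089, U+FFFD, U+7FFFF. Go has silently replaced the
    // unpaired \uDEAD with U+FFFD, the replacement character, and combined
    // the \uD9BF\uDFFF surrogate pair into the noncharacter U+7FFFF,
    // i.e. 0x10000 + ((0xD9BF-0xD800)<<10) + (0xDFFF-0xDC00).
}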

The four code points in the example are all clearly problematic.
The just-arrived RFC 9839 formalizes the notion of “problematic” and
offers easy-to-cite language saying which of these problematic types you want to
exclude from your text fields. Which, if you’re going to use JSON, you should probably do.
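
If you want to see roughly what that exclusion looks like in code, here’s a hand-rolled Go predicate covering
the three problematic classes as described above. To be clear, this is my own sketch of the classes, not the
RFC’s formal definition (and not the library mentioned below); check 9839 itself before relying on the exact
ranges.

package main

import "fmt"

// problematic reports whether a code point falls into one of the three
// classes discussed above: legacy controls (C0 except tab/LF/CR, plus
// DEL and the C1 block), surrogates, and noncharacters.
func problematic(r rune) bool {
    switch {
    case r == 0x09 || r == 0x0A || r == 0x0D:
        return false // the useful controls: tab, LF, CR
    case r < 0x20, 0x7F <= r && r <= 0x9F:
        return true // legacy controls: the rest of C0, DEL, C1
    case 0xD800 <= r && r <= 0xDFFF:
        return true // surrogates
    case 0xFDD0 <= r && r <= 0xFDEF:
        return true // the noncharacter block in the BMP
    case r&0xFFFE == 0xFFFE:
        return true // the two noncharacters at the top of each plane
    default:
        return false
    }
}

func main() {
    // The four code points from the example username; all four print true.
    for _, r := range []rune{0x0000, 0x0089, 0xDEAD, 0x7FFFF} {
        fmt.Printf("U+%04X problematic: %v\n", r, problematic(r))
    }
}

One caveat: if you range over a Go string rather than a []rune, malformed UTF-8 (including encoded surrogates)
reaches your code as U+FFFD, not as the original code points, so a real validator also has to think about where
its runes come from.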

Don’t blame Doug ·
Doug Crockford, I mean, the inventor of JSON. If he (or I, or really anyone careful) were inventing JSON now that
Unicode is mature, he’d be fussier about its character repertoire. Having said that, we’re stuck with JSON-as-it-is forever, so we
need a good way to say which of the problematic characters we’re going to exclude even if JSON allows them.

PRECISion ·
You may find yourself wondering why the IETF waited until 2025 to provide help with Bad Unicode.
It didn’t; here’s RFC 8264, “PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized
Strings in Application Protocols”; the first PRECIS predecessor was published in 2002.
8264 is 43 pages long and contains a very thorough discussion of many more potential Bad Unicode issues than 9839
covers.

Like 9839, PRECIS specifies subsets of the Unicode character repertoire and goes further, providing a mechanism for defining
more.

Having said that, PRECIS doesn’t seem to be very widely used by people who are defining new data structures and protocols. My
personal opinion is that there are two problems which make it hard to adopt. First, it’s large and
complex, with many moving parts, and requires careful study to understand. Developers are (for good reason) lazy.

Second, using PRECIS ties you to a specific version of Unicode. In particular, it forbids the use of the (nearly a million)
unassigned code points. Since each release of Unicode includes new code point assignments, that means that a sender and receiver
need to agree on exactly which version of Unicode they’re both going to use if they want reliably interoperable behavior. This
makes life difficult for anyone writing general-purpose code designed to be used in lots of different applications.

I personally think that the only version of Unicode anybody wants to use is “as recent as possible”, so they can be confident
of having all the latest emojis.

Anyhow, 9839 is simpler and dumber than PRECIS. But I think some people will find it useful and now the IETF agrees.

Source code ·
I’ve written a little Go-language library to validate incoming text fields against each of the three subsets that 9839
specifies,
here. I don’t claim it’s optimal, but it is well-tested.

It doesn’t have a version number or release just yet; I’ll wait till a few folk have had a chance to spot any dumb
mistakes I probably made.

Details ·
Here’s a compact summary of the world of problematic Unicode code points and data formats and standards.

              Problematic classes excluded?
              Surrogates    Legacy controls    Noncharacters
CBOR          yes           no                 no
I-JSON        yes           no                 yes
JSON          no            no                 no
Protobufs     no            no                 no
TOML          yes           no                 no
XML           yes           partial [1]        partial [2]
YAML          yes           mostly [3]         partial [2]

RFC 9839 Subsets
Scalars       yes           no                 no
XML           yes           partial            partial
Assignables   yes           yes                yes

Notes:

[1] XML allows C1 controls.

[2] XML and YAML don’t exclude the noncharacters outside the Basic Multilingual Plane.

[3] YAML excludes all the legacy controls except for the mostly-harmless U+0085, another version of
\n used in IBM mainframe documents.
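
To tie the bottom half of that table to something runnable, here are the three 9839 subsets as Go predicates.
Again, this is a sketch based on my reading of the RFC and of XML 1.0’s “Char” production, not normative text:
Scalars just bans the surrogates, the XML subset is the Char rule, and Assignables is Scalars minus the legacy
controls and noncharacters.

package main

import "fmt"

// Unicode Scalars: any code point except the surrogates.
func scalar(r rune) bool {
    return 0 <= r && r <= 0x10FFFF && !(0xD800 <= r && r <= 0xDFFF)
}

// XML Characters: the XML 1.0 Char production.
func xmlChar(r rune) bool {
    return r == 0x09 || r == 0x0A || r == 0x0D ||
        (0x20 <= r && r <= 0xD7FF) ||
        (0xE000 <= r && r <= 0xFFFD) ||
        (0x10000 <= r && r <= 0x10FFFF)
}

// Unicode Assignables: scalars minus legacy controls and noncharacters.
func assignable(r rune) bool {
    switch {
    case r == 0x09 || r == 0x0A || r == 0x0D:
        return true // the useful controls survive
    case r < 0x20, 0x7F <= r && r <= 0x9F:
        return false // legacy controls
    case 0xFDD0 <= r && r <= 0xFDEF, r&0xFFFE == 0xFFFE:
        return false // noncharacters
    default:
        return scalar(r)
    }
}

func main() {
    // One sample from each problematic class: a C1 control, a BMP
    // noncharacter, and a noncharacter outside the BMP.
    for _, r := range []rune{0x0089, 0xFFFE, 0x7FFFF} {
        fmt.Printf("U+%04X scalar=%v xml=%v assignable=%v\n",
            r, scalar(r), xmlChar(r), assignable(r))
    }
}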

Thanks! ·
9839 is not a solo production. It received an extraordinary amount of discussion and improvement from a lot of smart and
well-informed people, and the published version, 15 draft revisions later, is immensely better than my initial draft. My sincere thanks go to my
co-editor Paul Hoffman and to all those mentioned in the RFC’s “Acknowledgements” section.

On individual submissions ·
9839 is the second “individual submission” RFC I’ve pushed through the IETF (the other is
RFC 7725, which registers the HTTP 451 status code). While it’s nice
to decide something is worth standardizing and eventually have that happen, it’s really a lot of work. Some of that work is
annoying.

I’ve been involved in
other efforts as Working-Group member, WG chair, and WG specification editor, and I can report authoritatively that creating an
RFC the traditional way, through a Working Group, is easier and better.

I feel discomfort advising others not to follow in my footsteps, but in this case I think it’s the right advice.



