ongoing by Tim Bray · RFC 9839 and Bad Unicode
 
Unicode is good. If you’re designing a data structure or protocol that has text fields, they should contain
    Unicode characters encoded in UTF-8. There’s another question, though:
    “Which Unicode characters?” The
    answer is “Not all of them, please exclude some.”
This issue keeps coming up, so Paul Hoffman and I put together an individual-submission draft
    to the IETF and now (where by “now” I mean “two years later”) it’s been published as
    RFC 9839. It explains which characters are bad, and why, then offers
    three plausible less-bad subsets that you might want to use.
    Herewith a bit of background, but…
Please ·
If you’re actually working on something new that will have text fields, please read the RFC. It’s only ten pages long, and that’s
    with all the IETF boilerplate. It’s written specifically for software and networking people.
The smoking gun ·
The badness that 9839 focuses on is “problematic characters”, so let’s start with a painful example of what that means.
    Suppose you’re designing a protocol that uses JSON and one of your constructs has a username field.
    Suppose you get this message (I omit all the non-username fields). It’s
    a perfectly legal JSON text:
```json
{
    "username": "\u0000\u0089\uDEAD\uD9BF\uDFFF"
}
```
Unpacking all the JSON escaping gibberish reveals that the value of the username field contains four
    numeric “code points” identifying Unicode characters:
- The first code point is zero, in Unicode jargon U+0000. In human-readable text it has no meaning, but it will interfere with the operation of certain programming languages; C, for example, uses it to mark the end of a string.
- Next is U+0089, official name “CHARACTER TABULATION WITH JUSTIFICATION”. It’s what Unicode calls a C1 control code, inherited from ISO/IEC 6429:1992, adopted from ECMA-48 (1991), which calls it “HTJ” and says: “HTJ causes the contents of the active field (the field in the presentation component that contains the active presentation position) to be shifted forward so that it ends at the character position preceding the following character tabulation stop. The active presentation position is moved to that following character tabulation stop. The character positions which precede the beginning of the shifted string are put into the erased state.” Good luck with that.
- The third code point, U+DEAD, is what Unicode lingo calls an “unpaired surrogate”. To understand it, you’d have to learn how Unicode’s much-detested UTF-16 encoding works. I recommend not bothering. All you need to know is that surrogates are only meaningful when they come in pairs in UTF-16-encoded text. There is effectively no such text on the wire and thus no excuse for tolerating surrogates in your data. In fact, the UTF-8 specification says that you mustn’t use UTF-8 to encode surrogates. But the real problem is that different libraries in different programming languages don’t always do the same things when they encounter this sort of fœtid interloper; the sketch just after this list shows what one of them does.
- Finally, \uD9BF\uDFFF is JSON for the code point U+7FFFF. Unicode has a category called “noncharacter”, containing a few dozen code points that, for a variety of reasons, some good, don’t represent anything and must not be interchanged on the wire. U+7FFFF is one of those.
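To see what one widely used decoder makes of this, here’s a minimal sketch using Go’s standard encoding/json package; it decodes the message above and prints the code points that come out the other end. As far as I can tell, Go quietly substitutes U+FFFD (the replacement character) for the lone \uDEAD rather than rejecting the message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// The example message from above, problematic escapes and all.
	raw := `{"username": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}`

	var msg struct {
		Username string `json:"username"`
	}
	if err := json.Unmarshal([]byte(raw), &msg); err != nil {
		fmt.Println("rejected:", err)
		return
	}

	// Print each code point the decoder actually produced; the
	// lone surrogate doesn't survive the trip intact.
	for _, r := range msg.Username {
		fmt.Printf("U+%04X\n", r)
	}
}
```

Other languages’ JSON libraries make other choices here, which is exactly the interoperability problem.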
The four code points in the example are all clearly problematic.
    The just-arrived RFC 9839 formalizes the notion of “problematic” and
    offers easy-to-cite language saying which of these problematic types you want to
    exclude from your text fields. Which, if you’re going to use JSON, you should probably do.
Don’t blame Doug ·
    Doug Crockford I mean, the inventor of JSON.  If he (or I or really anyone careful) were inventing JSON now that Unicode is
    mature, he’d have been fussier about its character repertoire. Having said that, we’re stuck with JSON-as-it-is forever, so we
    need a good way to say which of the problematic characters we’re going to exclude even if JSON allows them.
PRECISion ·
    You may find yourself wondering why the IETF waited until 2025 to provide help with Bad Unicode.
    It didn’t; here’s
    RFC 8264: PRECIS Framework: Preparation, Enforcement, and
    Comparison of Internationalized Strings in Application Protocols; the first PRECIS predecessor was published in 2002.
    8264 is 43 pages long, containing a very
    thorough discussion of many more potential Bad Unicode issues than 9839 does.
Like 9839, PRECIS specifies subsets of the Unicode character repertoire and goes further, providing a mechanism for defining
    more.
Having said that, PRECIS doesn’t seem to be very widely used by people who are defining new data structures and protocols. My
    personal opinion is that there are two problems which make it hard to adopt. First, it’s large and
    complex, with many moving parts, and requires careful study to understand. Developers are (for good reason) lazy.
Second, using PRECIS ties you to a specific version of Unicode. In particular, it forbids the use of the (nearly a million)
    unassigned code points. Since each release of Unicode includes new code point assignments, that means that a sender and receiver
    need to agree on exactly which version of Unicode they’re both going to use if they want reliably interoperable behavior. This
    makes life difficult for anyone writing general-purpose code designed to be used in lots of different applications.
I personally think that the only version of Unicode anybody wants to use is “as recent as possible”, so they can be confident
    of having all the latest emojis.
Anyhow, 9839 is simpler and dumber than PRECIS. But I think some people will find it useful and now the IETF agrees.
Source code ·
    I’ve written a little Go-language library to validate incoming text fields against each of the three subsets that 9839
    specifies,
    here.  I don’t claim it’s optimal, but it is well-tested.
It doesn’t have a version number or release just yet; I’ll wait till a few folk have had a chance to spot any dumb mistakes I
    probably made.
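If you’d rather see the idea than the library, here’s a from-scratch sketch (the function names are invented for illustration; they’re not the library’s API) of what checking a string against the “Unicode Assignables” subset amounts to: reject legacy controls, surrogates, and noncharacters, and keep everything else.

```go
package main

import "fmt"

// assignable reports whether a single code point is in the
// "Unicode Assignables" subset: no legacy controls, no surrogates,
// no noncharacters. A sketch of the rules, not the library's code.
func assignable(r rune) bool {
	switch {
	case r == 0x09 || r == 0x0A || r == 0x0D:
		return true // the three useful C0 controls: tab, LF, CR
	case r < 0x20:
		return false // the other C0 controls
	case r >= 0x7F && r <= 0x9F:
		return false // DEL and the C1 controls
	case r >= 0xD800 && r <= 0xDFFF:
		return false // surrogates
	case r >= 0xFDD0 && r <= 0xFDEF:
		return false // the block of noncharacters in the BMP
	case r&0xFFFE == 0xFFFE:
		return false // the last two code points of every plane
	case r > 0x10FFFF:
		return false // past the end of the Unicode code space
	default:
		return true
	}
}

// assignables reports whether every code point in s is assignable.
// A real validator would also insist that s is well-formed UTF-8
// (utf8.ValidString); this sketch assumes it already is.
func assignables(s string) bool {
	for _, r := range s {
		if !assignable(r) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(assignables("hello"))           // true
	fmt.Println(assignables("hel\u0000lo"))     // false: NUL
	fmt.Println(assignables("hel\U0007FFFFlo")) // false: noncharacter
}
```

The Scalars subset applies only the surrogate rule; the XML subset, roughly speaking, keeps the C0 and U+FFFE/U+FFFF exclusions but tolerates C1 controls and most other noncharacters, which is why the table below says “partial”.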
Details ·
    Here’s a compact summary of the world of problematic Unicode code points and data formats and standards.
| Problematic classes excluded? | Surrogates | Legacy controls | Noncharacters |
|---|---|---|---|
| CBOR | yes | no | no |
| I-JSON | yes | no | yes |
| JSON | no | no | no |
| Protobufs | no | no | no |
| TOML | yes | no | no |
| XML | yes | partial [1] | partial [2] |
| YAML | yes | mostly [3] | partial [2] |
| **RFC 9839 Subsets** | | | |
| Scalars | yes | no | no |
| XML | yes | partial | partial |
| Assignables | yes | yes | yes |
Notes:
[1] XML allows C1 controls.
[2] XML and YAML don’t exclude the noncharacters outside the Basic Multilingual Plane.
[3] YAML excludes all the legacy controls except for the mostly-harmless U+0085, another version of
    \n used in IBM mainframe documents.
Thanks! ·
    9839 is not a solo production. It received an extraordinary amount of discussion and improvement from a lot of smart and
    well-informed people
    and the published version, 15 draft revisions later, is immensely better than my initial draft. My sincere thanks go to my
    co-editor Paul Hoffman and to all those mentioned in the RFC’s “Acknowledgements” section.
On individual submissions ·
    9839 is the second “individual submission” RFC I’ve pushed through the IETF (the other is
    RFC 7725, which registers the HTTP 451 status code).  While it’s nice
    to decide something is worth standardizing and eventually have that happen, it’s really a lot of work. Some of that work is
    annoying.
I’ve been involved in
    other efforts as a Working-Group member, WG chair, and WG specification editor, and I can report authoritatively that creating an
    RFC the traditional way, through a Working Group, is easier and better.
I feel discomfort advising others not to follow in my footsteps, but in this case I think it’s the right advice.
