ongoing by Tim Bray · RFC 9839 and Bad Unicode
Unicode is good. If you’re designing a data structure or protocol that has text fields, they should contain
Unicode characters encoded in UTF-8. There’s another question, though:
“Which Unicode characters?” The
answer is “Not all of them, please exclude some.”
This issue keeps coming up, so Paul Hoffman and I put together an individual-submission draft
to the IETF and now (where by “now” I mean “two years later”) it’s been published as
RFC 9839. It explains which characters are bad, and why, then offers
three plausible less-bad subsets that you might want to use.
Herewith a bit of background, but…
Please ·
If you’re actually working on something new that will have text fields, please read the RFC. It’s only ten pages long, and that’s
with all the IETF boilerplate. It’s written specifically for software and networking people.
The smoking gun ·
The badness that 9839 focuses on is “problematic characters”, so let’s start with a painful example of what that means.
Suppose you’re designing a protocol that uses JSON and one of your constructs has a username
field.
Suppose you get this message (I omit all the non-username
fields). It’s
a perfectly legal JSON text:
{
"username": "\u0000\u0089\uDEAD\uD9BF\uDFFF"
}
Unpacking all the JSON escaping gibberish reveals that the value of the username
field contains four
numeric “code points” identifying Unicode characters:
- The first code point is zero, in Unicode jargon U+0000. In human-readable text it has no meaning, but it will
  interfere with the operation of certain programming languages.
- Next is Unicode U+0089, official name “CHARACTER TABULATION WITH JUSTIFICATION”. It’s what Unicode calls a C1
  control code, inherited from ISO/IEC 6429:1992, adopted from ECMA 48 (1991), which calls it “HTJ” and says:
  “HTJ causes the contents of the active field (the field in the presentation component that contains the
  active presentation position) to be shifted forward so that it ends at the character position preceding the
  following character tabulation stop. The active presentation position is moved to that following character
  tabulation stop. The character positions which precede the beginning of the shifted string are put into the
  erased state.”
  Good luck with that.
- The third code point, U+DEAD, in Unicode lingo, is an “unpaired surrogate”. To understand, you’d have to learn how
  Unicode’s much-detested UTF-16 encoding works. I recommend not bothering.
  All you need to know is that surrogates are only meaningful when they come in pairs in UTF-16 encoded text. There is
  effectively no such text on the wire and thus no excuse for tolerating surrogates in your data. In fact, the UTF-8
  specification says that you mustn’t use UTF-8 to encode surrogates. But the real problem is that different libraries
  in different programming languages don’t always do the same things when they encounter this sort of fœtid interloper.
- Finally, \uD9BF\uDFFF is JSON for the code point U+7FFFF. Unicode has a category called “noncharacter”, containing a
  few dozen code points that, for a variety of reasons, some good, don’t represent anything and must not be
  interchanged on the wire. U+7FFFF is one of those.
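To make the unpacking concrete, here is a minimal Python sketch. It is an illustration only, not something taken from the RFC; the helper names `combine_surrogates` and `utf8_or_error` are made up for this example. It parses the message with the standard `json` module, lists the resulting code points, redoes the surrogate-pair arithmetic by hand, and then attempts a strict UTF-8 encode. What you see here is how CPython's `json` behaves; other parsers may reject the message or substitute replacement characters, which is exactly the interoperability problem.

```python
import json

# The message from the example, as raw JSON text (the r'' keeps the backslashes literal).
message = r'{"username": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}'

# CPython's json module accepts it, lone surrogate and all.
username = json.loads(message)["username"]
code_points = [f"U+{ord(ch):04X}" for ch in username]
print(code_points)


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into the code point it stands for."""
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# \uD9BF\uDFFF is a well-formed pair, so it collapses to one code point.
print(f"U+{combine_surrogates(0xD9BF, 0xDFFF):X}")


def utf8_or_error(text: str) -> str:
    """Try a strict UTF-8 encode; return the hex bytes or the reason it failed."""
    try:
        return text.encode("utf-8").hex()
    except UnicodeEncodeError as err:
        return f"rejected: {err.reason}"


# The unpaired surrogate makes the whole string unencodable as strict UTF-8.
print(utf8_or_error(username))
```

The parser produced a four-character string without complaint, and the trouble only surfaces later when something tries to serialize it. That delay is a good argument for checking text fields at the point where they arrive.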
The four code points in the example are all clearly problematic.
The just-arrived RFC 9839 formalizes the notion of “problematic” and
offers easy-to-cite language saying which of these problematic types you want to
exclude from your text fields. Which, if you’re going to use JSON, you should probably do.
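As a rough illustration of what “problematic” means in practice, the three classes can be written down as a handful of range checks. This is a sketch based on my reading of the definitions, not code from the RFC: the function name `problem_class` is hypothetical, and the decision to treat tab, newline, and carriage return as useful rather than legacy controls is an assumption you should check against the RFC text.

```python
from typing import Optional

# Assumption: these three C0 controls are considered useful, not "legacy".
USEFUL_CONTROLS = {0x09, 0x0A, 0x0D}  # tab, newline, carriage return


def problem_class(cp: int) -> Optional[str]:
    """Name the problematic class a code point falls into, or None if it has none."""
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    if cp <= 0x1F and cp not in USEFUL_CONTROLS:
        return "legacy control"  # C0 controls
    if 0x7F <= cp <= 0x9F:
        return "legacy control"  # DEL and the C1 controls
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    return None


# The four code points from the username example.
for cp in (0x0000, 0x0089, 0xDEAD, 0x7FFFF):
    print(f"U+{cp:04X}: {problem_class(cp)}")

# Count the noncharacters across the whole code space.
noncharacters = sum(1 for cp in range(0x110000) if problem_class(cp) == "noncharacter")
print(noncharacters)
```

Run over the whole code space, the last line counts 66 noncharacters: 32 in the U+FDD0 block and two at the end of each of the 17 planes.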
Don’t blame Doug ·
Doug Crockford I mean, the inventor of JSON. If he (or I or really anyone careful) were inventing JSON now that Unicode is
mature, he’d be fussier about its character repertoire. Having said that, we’re stuck with JSON-as-it-is forever, so we
need a good way to say which of the problematic characters we’re going to exclude even if JSON allows them.
PRECISion ·
You may find yourself wondering why the IETF waited until 2025 to provide help with Bad Unicode.
It didn’t; here’s
RFC 8264: PRECIS Framework: Preparation, Enforcement, and
Comparison of Internationalized Strings in Application Protocols; the first PRECIS predecessor was published in 2002.
8264 is 43 pages long and contains a very
thorough discussion of many more potential Bad Unicode issues than 9839 covers.
Like 9839, PRECIS specifies subsets of the Unicode character repertoire and goes further, providing a mechanism for defining
more.
Having said that, PRECIS doesn’t seem to be very widely used by people who are defining new data structures and protocols. My
personal opinion is that there are two problems which make it hard to adopt. First, it’s large and
complex, with many moving parts, and requires careful study to understand. Developers are (for good reason) lazy.
Second, using PRECIS ties you to a specific version of Unicode. In particular, it forbids the use of the (nearly a million)
unassigned code points. Since each release of Unicode includes new code point assignments, that means that a sender and receiver
need to agree on exactly which version of Unicode they’re both going to use if they want reliably interoperable behavior. This
makes life difficult for anyone writing general-purpose code designed to be used in lots of different applications.
I personally think that the only version of Unicode anybody wants to use is “as recent as possible”, so they can be confident
of having all the latest emojis.
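The version dependence is easy to see from any language runtime that ships its own Unicode tables. The Python sketch below is an illustration of the general problem, not of PRECIS itself. The helper `is_assigned` is hypothetical, and the claim that U+1FAE9 arrived in Unicode 16.0 is from memory, so treat it as an assumption.

```python
import unicodedata


def is_assigned(ch: str) -> bool:
    """True if this runtime's Unicode tables assign the code point.

    General category "Cn" covers unassigned code points (and noncharacters).
    """
    return unicodedata.category(ch) != "Cn"


print("Unicode version in this runtime:", unicodedata.unidata_version)

# Assumption: U+1FAE9 is an emoji added in Unicode 16.0, so the answer
# here depends on which Python (and thus which Unicode tables) you run.
print("U+1FAE9 assigned?", is_assigned("\U0001FAE9"))
```

Two peers running this on different Python releases can disagree about the second line, and a rule that says “no unassigned code points” would then accept a message on one side and reject it on the other.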
Anyhow, 9839 is simpler and dumber than PRECIS. But I think some people will find it useful and now the IETF agrees.
Source code ·
I’ve written a little Go-language library to validate incoming text fields against each of the three subsets that 9839
specifies,
here. I don’t claim it’s optimal, but it is well-tested.
It doesn’t have a version number or release just yet; I’ll wait till a few folk have had a chance to spot any dumb mistakes I
probably made.
Details ·
Here’s a compact summary of the world of problematic Unicode code points and data formats and standards.
Problematic classes excluded?

Format | Surrogates | Legacy controls | Noncharacters |
---|---|---|---|
CBOR | yes | no | no |
I-JSON | yes | no | yes |
JSON | no | no | no |
Protobufs | no | no | no |
TOML | yes | no | no |
XML | yes | partial [1] | partial [2] |
YAML | yes | mostly [3] | partial [2] |
RFC 9839 Subsets | | | |
Scalars | yes | no | no |
XML | yes | partial | partial |
Assignables | yes | yes | yes |
Notes:
[1] XML allows C1 controls.
[2] XML and YAML don’t exclude the noncharacters outside the Basic Multilingual Plane.
[3] YAML excludes all the legacy controls except for the mostly-harmless U+0085, another version of \n used in IBM
mainframe documents.
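The three RFC 9839 rows of the table can be sketched as lists of inclusive code point ranges. The Python below is an illustration only: the ranges are transcribed by hand from my reading of the RFC's ABNF, so check them against the RFC before relying on them, and the names `SCALARS`, `XML_CHARS`, `ASSIGNABLES`, and `in_subset` are made up for this example. It is not the API of the Go library mentioned above.

```python
# Inclusive (low, high) code point ranges for each subset.
# Assumption: hand-transcribed; the RFC's ABNF is the authority.
SCALARS = [(0x0, 0xD7FF), (0xE000, 0x10FFFF)]

XML_CHARS = [(0x9, 0x9), (0xA, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
             (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]

ASSIGNABLES = (
    [(0x9, 0x9), (0xA, 0xA), (0xD, 0xD), (0x20, 0x7E), (0xA0, 0xD7FF),
     (0xE000, 0xFDCF), (0xFDF0, 0xFFFD)]
    # Planes 1 through 16, minus the two noncharacters at the end of each.
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 17)]
)


def in_subset(text: str, ranges) -> bool:
    """True if every code point in text falls inside one of the ranges."""
    return all(any(lo <= ord(ch) <= hi for lo, hi in ranges) for ch in text)


samples = {
    "plain text": "héllo\n",
    "U+0000": "\x00",
    "U+0089": "\x89",
    "U+DEAD": "\udead",
    "U+7FFFF": "\U0007FFFF",
}
for label, text in samples.items():
    verdicts = [in_subset(text, s) for s in (SCALARS, XML_CHARS, ASSIGNABLES)]
    print(label, verdicts)  # [Scalars, XML, Assignables]
```

The results line up with the notes under the table: the C1 control U+0089 and the noncharacter U+7FFFF both pass the XML subset, while only Assignables rejects all four of the bad code points from the username example.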
Thanks! ·
9839 is not a solo production. It received an extraordinary amount of discussion and improvement from a lot of smart and
well-informed people
and the published version, 15 draft revisions later, is immensely better than my initial draft. My sincere thanks go to my
co-editor Paul Hoffman and to all those mentioned in the RFC’s “Acknowledgements” section.
On individual submissions ·
9839 is the second “individual submission” RFC I’ve pushed through the IETF (the other is
RFC 7725, which registers the HTTP 451 status code). While it’s nice
to decide something is worth standardizing and eventually have that happen, it’s really a lot of work. Some of that work is
annoying.
I’ve been involved in
other efforts as Working-Group member, WG chair, and WG specification editor, and I can report authoritatively that creating an
RFC the traditional way, through a Working Group, is easier and better.
I feel discomfort advising others not to follow in my footsteps, but in this case I think it’s the right advice.