[rescue] Corrupted list messages
Peter Corlett
abuse at cabal.org.uk
Fri Oct 1 02:05:25 CDT 2021
On Wed, Sep 29, 2021 at 12:22:05PM -0400, Mouse wrote:
[...]
> Of course it will. You've got UTF-8 there, and Linux has _heavily_ drunk
> the "UTF-8 is the One True Way to represent characters" koolaid.
So has everybody else, pretty much. UCS-2 was a thing for a while in the
1990s because C/C++ demands fixed-width characters, which is why Java and
the Windows APIs use it, but once it became clear there needed to be more
than 65,536 code points, it was extended into the variable-width UTF-16,
which is a pain because of endianness issues and because US-ASCII text ends
up full of NUL bytes. UTF-8 is by far the sanest Unicode encoding, which is
why it can now be pretty much assumed.
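A quick Python sketch (just illustrative, using the standard codec names)
makes those differences concrete:

    # UTF-8 leaves ASCII alone and only uses multi-byte sequences where
    # needed; UTF-16 embeds NUL bytes and needs an endianness choice.
    print("A".encode("utf-8").hex())           # '41'
    print("A".encode("utf-16-le").hex())       # '4100'  (NUL byte appears)
    print("\u20ac".encode("utf-8").hex())      # 'e282ac' (euro sign)
    print("\u20ac".encode("utf-16-le").hex())  # 'ac20'
    print("\u20ac".encode("utf-16-be").hex())  # '20ac'  (same text, two
                                               #  different byte orders)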
Among 8-bit encodings, Windows-1252 -- the source of the "smart quotes"
which mess up messages -- is the most common.
Both UTF-8 and Windows-1252 encodings are no-ops when encoding US-ASCII
text. This is not an accident.
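A two-line Python check (nothing here beyond the standard codecs) shows the
same bytes come out either way:

    text = "Plain old US-ASCII text."
    # Both encoders pass the ASCII range through byte for byte.
    assert (text.encode("utf-8") == text.encode("cp1252")
            == text.encode("ascii"))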
> Try feeding it to a tool which actually shows you the underlying octet
> sequences, if you can find such a tool in Linux (hexdump -C, maybe?) or if
> you can get the octet sequence onto an OS that isn't quite so manic for
> UTF-8.
"hexdump -C" does the job, but is hard to read if you're analysing text
rather than binary data. For this purpose, "LC_ALL=C less" (assuming a
typical modern Unix) will show ASCII text normally and give the hex codes of
high-bit-set characters. That's because less(1) will show hex codes for
invalid characters, and the C locale switches it to use 7 bit characters
instead of the typical default of UTF-8 on modern Unix systems.
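If neither tool is to hand, a few lines of Python do roughly the same thing
(a sketch only, not a replacement for less):

    import sys

    # Pass printable ASCII (plus tab and newline) straight through and
    # show everything else as <XX> hex escapes, much as LC_ALL=C less does.
    data = sys.stdin.buffer.read()
    out = []
    for b in data:
        if b in (0x09, 0x0A) or 0x20 <= b <= 0x7E:
            out.append(chr(b))
        else:
            out.append("<%02X>" % b)
    sys.stdout.write("".join(out))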
[...]
> even if it contains octets in the 0x00-0x1f range (control charaters),
> which, except for a very few such as 0x09, are not printable US-ASCII. My
> SMTP server rejected the result because of the mismatch between the
> labeling and the content. (It appears to be one of the few that actually
> rejects this sort of insanity. Most appear to believe standards are a
> quaint relic of the past, or some such, rather than being the only basis
> the net has for interoperability.)
Most things about this high-bit stripping are merely cosmetic, but this is
actually a problem. Most octets can safely pass through "text" protocols
such as SMTP, but NUL, CR, and LF are special. In Windows-1252 the euro
sign is encoded as byte 0x80 and S-caron as 0x8A, and stripping their high
bits turns them into NUL and LF respectively. (0x8D is unused; perhaps
somebody at Microsoft anticipated this exact problem.) Other Windows-1252
extensions also generate ASCII control codes which may screw up the terminal
when the mail is read.
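Here's a minimal Python sketch of that failure mode (the sample text is made
up, but the byte values are the real Windows-1252 ones):

    # Windows-1252 text run through a naive "strip the 8th bit" filter.
    sample = "€5 off a Škoda"
    encoded = sample.encode("cp1252")
    stripped = bytes(b & 0x7F for b in encoded)
    print(encoded.hex(" "))   # starts "80 35 ..." and has 8a for S-caron
    print(stripped.hex(" "))  # 0x80 became 0x00 (NUL), 0x8A became 0x0A (LF)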
I note that ohno.mrbill.net explicitly advertises 8BITMIME support, so the
excuse of "the server's just using an older version of the SMTP standard and
is not obliged to be 8 bit clean" does not wash.
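A quick way to verify that claim from Python, assuming the host still
answers on port 25 (the hostname is simply the one mentioned above):

    import smtplib

    # Ask the server's EHLO response whether it advertises the 8BITMIME
    # extension (RFC 6152).
    with smtplib.SMTP("ohno.mrbill.net") as smtp:
        smtp.ehlo()
        print(smtp.has_extn("8bitmime"))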