[rescue] Corrupted list messages
Mouse
mouse at Rodents-Montreal.ORG
Wed Sep 29 11:22:05 CDT 2021
I was looking at blocked mail for unrelated reasons and saw the traffic
about how messages are coming through sprinkled with `b's, as in
>> Liam Proven b Profile: https://about.me/liamproven
>> Email: lproven at cix.co.uk b gMail/gTalk/gHangouts: lproven at gmail.com
>> Twitter/Facebook/LinkedIn/Flickr: lproven b Skype: liamproven
>> UK: +44 7939-087884 b D R (+ WhatsApp/Telegram/Signal): +420 702 829 053
What is going on here, I believe, is that people are sending UTF-8
text, but the mailing list is stripping all high bits. This leads to,
to pick a hypothetical example, U+0416, CYRILLIC CAPITAL LETTER ZHE,
which in UTF-8 is 0xd0 0x96, turning into 0x50 0x16, or a P and a ^V.
The repeated b arises because many of these characters begin with 0xe2,
which turns into 0x62, or lowercase b. (The characters that will do
this are the ones in the range U+2000 through U+2fff, which includes
many ranges, but in particular includes the "General Punctuation"
range, which is probably responsible for most of them.)
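To make the arithmetic concrete, here is a minimal Python sketch of the mangling described above. The function name strip_high_bits is made up for illustration, and the assumption that the list simply clears bit 7 of every octet is inferred from the symptoms, not taken from the list software itself.

```python
def strip_high_bits(data: bytes) -> bytes:
    """Clear bit 7 of every octet (assumed model of what the list does)."""
    return bytes(b & 0x7F for b in data)


def hex_octets(data: bytes) -> str:
    """Render octets as space-separated two-digit hex."""
    return " ".join(f"{b:02x}" for b in data)


# U+0416 CYRILLIC CAPITAL LETTER ZHE is 0xd0 0x96 in UTF-8.
zhe = "\u0416".encode("utf-8")
print(strip_high_bits(zhe))  # b'P\x16', i.e. a P and a ^V

# A few General Punctuation characters: en dash, bullet, right single quote.
# Each starts with 0xe2 in UTF-8, so each comes out starting with 0x62, "b".
for ch in ("\u2013", "\u2022", "\u2019"):
    raw = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {hex_octets(raw)} -> {hex_octets(strip_high_bits(raw))}")
```

Under this model the bullet, U+2022, becomes 0x62 0x00 0x22, which is a b, a NUL, and a double quote.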
The recommendation to avoid the misnamed "smart quotes" feature is
good (see also demoroniser), but it is not enough. You need to stick
to ASCII for the list, or your text _will_ get mangled. The list
strips high bits _without_ looking at the charset= marking, so it's not
a question of what character you try to send but how it's encoded. For
example, if you try to send U+00d7, MULTIPLICATION SIGN, as UTF-8,
you'll send 0xc3 0x97, which the list will convert to 0x43 0x17 - but
if you send that same character as 8859-1, you'll send just 0xd7, which
the list will convert to 0x57.
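The dependence on encoding rather than on character can be checked with a short Python sketch. As before, strip_high_bits is a hypothetical stand-in for whatever the list software actually does, assumed here to be a plain mask with 0x7f.

```python
def strip_high_bits(data: bytes) -> bytes:
    """Clear bit 7 of every octet (assumed model of what the list does)."""
    return bytes(b & 0x7F for b in data)


times = "\u00d7"  # MULTIPLICATION SIGN

# Same character, two encodings, two different corruptions.
for charset in ("utf-8", "iso-8859-1"):
    sent = times.encode(charset)
    received = strip_high_bits(sent)
    print(f"{charset}: sent {sent.hex()} -> received {received.hex()} {received!r}")
```

The UTF-8 form comes out as a C followed by a control character; the 8859-1 form comes out as a W.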
> I tested it in a Linux console, no X.11 or anything, and it still
> looks fine at my end.
Of course it will. You've got UTF-8 there, and Linux has _heavily_
drunk the "UTF-8 is the One True Way to represent characters" koolaid.
Try feeding it to a tool which actually shows you the underlying octet
sequences, if you can find such a tool in Linux (hexdump -C, maybe?) or
if you can get the octet sequence onto an OS that isn't quite so manic
for UTF-8.
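If no such tool is at hand, a few lines of Python will do the job. This is only a rough sketch of a hexdump-style display; the function name and the layout are illustrative choices, not a reimplementation of hexdump -C.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Show each row of octets as offset, hex values, and printable ASCII."""
    pad = width * 3 - 1  # room for `width` two-digit values plus separators
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        # Anything outside printable ASCII is shown as a dot.
        text = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
        lines.append(f"{offset:08x}  {hexpart:<{pad}}  |{text}|")
    return "\n".join(lines)


# What a terminal renders as a single en dash is three octets on the wire.
print(hexdump("Proven \u2013 Profile".encode("utf-8")))
```

Feed it the raw bytes of the message (for example, a file opened in binary mode), not a string the terminal has already decoded.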
> I've swapped the en-dashes to bullets; any better now?
I saw b" instead of b<control-char>, so perhaps marginally better. But
your mail will continue to get corrupted until you stop sending
anything non-ASCII to the list. (Whether it gets corrupted into
anything specific depends on the details of what you're sending.)
This is also related to why I didn't see this traffic until I looked at
blocked mail. The list is partially to blame: it strips high bits, but
then it (incorrectly) doesn't check the result, sending it out
mislabeled
Content-Type: text/plain; charset="us-ascii"
even if it contains octets in the 0x00-0x1f range (control characters),
which, except for a very few such as 0x09, are not printable US-ASCII.
My SMTP server rejected the result because of the mismatch between the
labeling and the content. (It appears to be one of the few that
actually rejects this sort of insanity. Most appear to believe
standards are a quaint relic of the past, or some such, rather than
being the only basis the net has for interoperability.)
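The kind of check involved can be sketched in Python. This is not the actual logic of my SMTP server or of any particular one; the function name is hypothetical, and the set of tolerated control characters (TAB, LF, CR) is an assumption made for the example.

```python
# Assumption for this sketch: only TAB, LF and CR are tolerated controls.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def bad_octets(body: bytes):
    """Return (offset, octet) pairs that do not belong in printable US-ASCII text."""
    return [
        (i, b)
        for i, b in enumerate(body)
        if b > 0x7E or (b < 0x20 and b not in ALLOWED_CONTROLS)
    ]


# A body that has been through high-bit stripping: the en dash became b, NUL, ^S.
mangled = b"Liam Proven b\x00\x13 Profile\r\n"
for offset, octet in bad_octets(mangled):
    print(f"offset {offset}: 0x{octet:02x} is not printable US-ASCII")
```

A body labeled charset="us-ascii" for which this list comes back non-empty is mislabeled.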
/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse at rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B