[rescue] UTF-8 [was T5220 update]
Jonathan Patschke
jp at celestrion.net
Thu Nov 2 09:14:11 CDT 2017
On Wed, 1 Nov 2017, Mouse wrote:
> Storage compactness is a completely spurious claim except for those
> using mostly-ASCII characters. UTF-8, as compared to a stream of
> 16-bit codepoints, does not save storage for anything except ASCII,
This was a design goal, and I don't think it's as bad as all that. Many
of the scripts using those wider characters have a greater information
density per glyph than the western scripts do. Cyrillic gets shafted, and
that's probably somewhere between coincidence and politics.
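To put rough numbers on the storage question, here is a small Python
sketch. The sample strings are illustrative assumptions, not anything
from this thread, and the UTF-16 figures exclude a byte-order mark.

```python
# Illustrative samples (assumptions, not taken from the thread).
samples = {
    "ascii": "hello",
    "cyrillic": "\u043f\u0440\u0438\u0432\u0435\u0442",  # "privet" in Cyrillic
    "cjk": "\u4f60\u597d",                               # two CJK ideographs
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no byte-order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{name}: {len(text)} code points, "
          f"UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

For these samples the ASCII text halves in size under UTF-8, the CJK text
grows from 2 to 3 bytes per character, and the Cyrillic text costs 2
bytes per character either way, which is twice what a single-byte legacy
encoding such as KOI8-R would use.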
> In my opinion - and that's all it is, my opinion, and it's probably
> worth about what you paid for it - UTF-8 is an abomination. The
> benefits of each character being the same size in memory far outweigh,
> to me, the storage compaction UTF-8 provides for ASCII text (or, if you
> use 24- instead of 16-bit codepoints, the handful of writing systems
> outlined above).
My instinct is to agree, but for applications where that matters (nearly
any in-memory processing), there's UTF-32/UCS-4. You decode to code points
once, in bulk, instead of on every access, and your in-core view of the
file has characters of the same size. UTF-8's compactness is intended
primarily for transfer and storage[0].
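A minimal sketch of that decode-once pattern, in Python. The helper names
to_utf32 and code_point_at are hypothetical, and a real program would
more likely keep the decoded str than a raw UTF-32 buffer; the buffer is
used here only to make the fixed 4-byte stride visible.

```python
import struct


def to_utf32(raw_utf8):
    """Decode UTF-8 bytes once and re-encode as fixed-width UTF-32 (LE)."""
    return raw_utf8.decode("utf-8").encode("utf-32-le")


def code_point_at(buf, index):
    """Fetch code point number `index` by plain offset arithmetic."""
    (cp,) = struct.unpack_from("<I", buf, 4 * index)
    return cp


# Sample text is an assumption: "naive" with a diaeresis, a space, two ideographs.
raw = "na\u00efve \u6587\u5b57".encode("utf-8")
buf = to_utf32(raw)

print(len(raw), "bytes on disk,", len(buf), "bytes in core")
print(hex(code_point_at(buf, 6)))  # seventh code point, no scanning needed
```

The variable-width work happens exactly once, in to_utf32; after that,
every lookup is a multiplication and a load.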
Further, a system that defaults to single-byte storage ends arguments
about byte order; expand the bytes into core however you see fit, but
serialization with other systems won't depend on byte-order marking.
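The byte-order point can be seen with a single character. This Python
sketch uses U+20AC (EURO SIGN) as an arbitrary example; the variable
names are mine.

```python
text = "\u20ac"  # EURO SIGN, chosen arbitrarily

utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")
utf8 = text.encode("utf-8")

# Two valid 16-bit serializations, differing only in byte order.
print("UTF-16LE:", utf16_le.hex())
print("UTF-16BE:", utf16_be.hex())

# UTF-8 is defined as a byte sequence, so there is only one form.
print("UTF-8:   ", utf8.hex())

# Guessing the byte order wrong yields a different, still valid, character.
misread = utf16_le.decode("utf-16-be")
print("misread as:", hex(ord(misread)))
```

Without a byte-order mark or an out-of-band agreement, the receiver of
the 16-bit form cannot tell which of the two it has been handed.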
>> For its faults, UTF-8 and Unicode are _FAR_ better than their
>> predecessors.
>
> Maybe they would have been if there were no installed base - though I
> still consider variable-sized (in storage) characters an abomination.
The Big Win for the notion of variable-width characters, if we're talking
about installed bases, is that UTF-8 software can correctly process all
7-bit ASCII text--including control codes. This is, by far, the single
largest set of legacy electronic textual data.
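That compatibility is easy to check. A short Python sketch, with variable
names of my own choosing:

```python
# Every 7-bit value, control codes included.
all_ascii = bytes(range(128))

as_ascii = all_ascii.decode("ascii")
as_utf8 = all_ascii.decode("utf-8")

# A UTF-8 decoder reads pure ASCII exactly as an ASCII decoder does,
# and writes it back byte for byte.
same_text = as_ascii == as_utf8
round_trips = as_utf8.encode("utf-8") == all_ascii

# Conversely, every byte of a multi-byte sequence has the high bit set,
# so an ASCII byte in a UTF-8 stream always means that ASCII character.
multi = "\u20ac".encode("utf-8")
no_ascii_inside = all(b >= 0x80 for b in multi)

print(same_text, round_trips, no_ascii_inside)
```

The last property is why byte-oriented tools that split on newlines,
slashes, or NULs keep working on UTF-8 input without modification.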
That lets software gain wide-character support incrementally, without a
Flag Day on which every character must suddenly be 24 or 32 bits wide.
>> Thompson and Pike were presenting talks on UTF-8 in the early-to-mid
>> 1990s.
>
> So? I can't see that as relevant, unless your stance is something
> like, UTF-8 is the best encoding of the best character set for all
> users and purposes, so it is reasonable to expect everyone/everything
> to support it as soon as it was introduced (modulo implementation
> delay).
At 24 years on, that delay could involve conceiving the programmer who
would later implement UTF-8 support and sending him/her through
university. "My system is more than a year or two old," would be a
perfectly valid excuse if Unicode were a passing fad with niche
applicability and a majority of the planet well-serviced by ASCII.
> Perhaps that is your stance, in which case, I have the painful duty to
> break it to you that it's not so. There are lots of users and purposes
> for which Unicode, never mind UTF-8, is a wrong answer, even today.
> Many of them involve the sort of hardware and software this list
> focuses on, hence my remark.
The thing about the network is that something doesn't have to be the best
to be nigh-universal, which is how we got Unix to begin with. There will
probably never be a best-in-all-cases-ever instance of any technology,
but there will usually be one that's pretty reasonable to support by
default.
That used to be ASCII. These days, it really looks[1] to be Unicode, for
better or worse. Looping all the way back to the start of this
divergence, if software needs ASCII, iconv is a much better input filter
than &= 127.
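To make the difference concrete, here is a Python sketch. The codec call
is a stand-in for what an iconv-style conversion does, not a call into
iconv itself, and both function names are hypothetical.

```python
def mask_high_bit(raw):
    """The '&= 127' approach: clear bit 7 of every byte."""
    return bytes(b & 0x7F for b in raw)


def to_ascii(raw):
    """Decode as UTF-8, then mark anything ASCII cannot represent."""
    return raw.decode("utf-8").encode("ascii", errors="replace")


# Sample input is an assumption: "cafe" with an acute accent on the e.
raw = "caf\u00e9".encode("utf-8")

print(mask_high_bit(raw))  # each continuation byte becomes unrelated ASCII
print(to_ascii(raw))       # one visible placeholder per lost character
```

Masking turns the two bytes of the accented letter into two unrelated
ASCII characters with no sign that anything was lost. The conversion
emits a single "?" instead, so the damage is at least visible.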
[0] Filesystem support for lz4 and similar compression schemes makes even
this a hard claim to defend, but in 1993 the relative processing
overhead was much higher.
[1] I very likely have a bias in my perception as to how valuable a
universal character set is due to most of my coworkers speaking
English (or any Western language) as a second or third language.
--
Jonathan Patschke
Austin, TX
USA