Uppercase Sharp S Issues
by Dr Asmus Freytag
The following was written as a contribution to the deliberations of the Unicode Technical Committee, which was then in final review of a proposal to encode a CAPITAL SHARP S. Many of the arguments and counter arguments summarized here are taken from a rather lengthy internal and external discussion and are presented here without the evidentiary support with which they may have been presented when originally introduced into the discussion.
The text was initially presented to a non-public Unicode discussion list on 8 March 2007 and has been edited slightly for this publication.
In German orthography (especially after the recent reform), there is a clear distinction between “ß” and “ss” in lower case. There are some word pairs where it is the only distinction. (The same is true for some personal names.)
For purposes of searching personal names, and for sorting words, it is expedient to suppress that distinction. Part of that (probably) has to do with the fact that spellings of personal names cannot be predicted by sound, and that sorting similar-sounding names together is generally useful. Pre-reform, the “ß” and “ss” were also used in distinction, but in ways that were not as clearly related to the pronunciation of the word. Ergo, sorting words had the same issues as sorting personal names. However, sorting and searching are special in that they often create fairly wide equivalence sets, compared to the distinctions needed in accurately writing text or names.
The origin of the “SS” case mapping for “ß” is not actually known with certainty. However, it was decreed at a time when the use of Fraktur and typewriters was common. Typewriters had extreme limitations in the number of signs they could support, and ALL UPPERCASE text in Fraktur is an absurdity. Since the “ß” does not (ordinarily) occur at the start of a word, TitleCase, which is very common in German (nouns), is unaffected, so the impact of the standard orthographic rule is limited.
Nevertheless, the post office (on forms), sign writers, certain name registries, and many other users that use ALL UPPERCASE text (in modern style, not Fraktur), feel that suppressing the distinction between words and names that contain “ß” and those that contain “ss” is not appropriate.
There are three ways this distinction can be maintained in ALL UPPERCASE text: using “SZ”, retaining lower case “ß” as-is, and using an uppercase form of “ß”. All three forms can be found, and all three ways have their adherents. Yes, that means that Germany is not united after all. ;-)
For the following argument, it is important not to conflate any of these three forms with the standard orthography, which does equate “ß” with “SS” in ALL UPPERCASE text. The standard orthography is the only one that (outside sorting and searching) allows the equivalence between “SS” as uppercase of “ß” and “SS” as uppercase of “ss” (while simultaneously distinguishing carefully between their lower case forms).
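The conflation described here can be seen in a minimal sketch, assuming Python 3, whose `str.upper()` follows the default Unicode case mappings. The word pair Maße/Masse is an illustrative choice and is not taken from the discussion.

```python
# Default (untailored) case mapping as exposed by Python's str methods.
measures = "Maße"   # "measurements"
mass = "Masse"      # "mass"

# The two words are distinct in ordinary mixed-case text ...
distinct_in_mixed_case = measures != mass

# ... but become identical in ALL UPPERCASE, because "ß" maps to "SS".
upper_measures = measures.upper()
upper_mass = mass.upper()
conflated_in_uppercase = upper_measures == upper_mass

print(upper_measures, upper_mass, conflated_in_uppercase)
```

Once uppercased, nothing in the string records which of the two words was meant.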
If one were to desire a distinction between “ss” and “ß” in lower as well as in uppercase, for semantic reasons, then choosing an encoding that is based on a glyphic variation of “SS” would give the desired presentation but would hide the distinction at the character level.
Of course, it is in principle possible to arrange layout engines as well as all text processing to magically do the right thing, no matter how a text element is encoded, and no matter what the cost, but, putting it briefly, the Unicode philosophy is to model things close to the common understanding of the text element, unless the script model consistently supports a non-intuitive approach. I see no recent precedent, incidentally, that, by itself, would make deciding the current question a slam-dunk, but I tend to dislike piling complex-script-like approaches onto Latin.
If you desire to carry the distinction between “SS” and “ß” in ALL UPPERCASE TEXT, for semantic reasons, there are currently these three ways:
- Using “SZ”. This is unattractive because converting the string to lower case results in nonsense, and few if any text processes consider any equivalence between “sz” and “ß”. It feels unnatural to many readers. Nevertheless it is used in certain cases.
- Using “ß” as is. This does not suffer from the aforementioned problem, but is visually not appealing. Nevertheless, of the three, it is currently the most widespread solution.
- Using an uppercase form of “ß”. This is currently only possible with ad-hoc support. Nevertheless its use can be documented, and given the technical challenges, is surprisingly frequent.
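The lowercasing behavior of the first two ways can be checked with a small sketch, assuming Python 3 and its default case mappings. The sample word is illustrative only.

```python
# Round-tripping the two currently encodable ways back to lower case.
original = "maße"

way_sz = "MASZE"        # first way: "SZ" stands in for "ß"
way_kept = "MAßE"       # second way: lower case "ß" retained as-is

lowered_sz = way_sz.lower()
lowered_kept = way_kept.lower()

# "SZ" lowercases to "sz", which no default process equates with "ß".
sz_recovers_original = lowered_sz == original
# The retained "ß" survives lowercasing unchanged.
kept_recovers_original = lowered_kept == original

print(lowered_sz, sz_recovers_original)
print(lowered_kept, kept_recovers_original)
```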
By itself, the proposal to encode a CAPITAL SHARP S does not change the current orthography. The proposal (as such) does not even try to standardize on the third form, but merely proposes that the uppercase form of “ß” be considered a character, and implemented as such. (Individuals among the proposers or elsewhere may have an interest in promoting a change in writing practices, but it is not Unicode's role to take sides on such larger issues, and there is little objective reason to fear radical and imminent change in the majority usage. Raising the threat of such change as if it were imminent and inevitable would seem to border on fear-mongering, so let us agree that it is neither.)
Given that the use of an uppercase form of “ß” is clearly a variation of a (currently more common) practice of using the lowercase form for the same purpose, a search for a solution should start from the “ß” and not from the equivalence to the “SS”. The reason is that while that equivalence is present in the standard orthography, it is explicitly rejected by users of all three alternative ways. Starting from the “ß” would follow the principle of least surprise for users and implementers.
Given that ALL UPPERCASE contexts are relatively uncommon, that retaining the distinction between “ß” and “SS” is less common than giving up that distinction as dictated by the standard orthography, and that out of three possible ways, only one uses an uppercase form of “ß”, the expectation of the average German user would first and foremost be that existing texts and implementations behave as before.
Adding a new character would therefore not change the default case mapping of “ß” to “SS”. Users of the third way would need to enter their new character by hand, or use special purpose software. The former is appropriate for signage, book covers, and similar uses. The latter is what the post office might use in a data processing center entering hand-filled forms using “ß.” Institutions maintaining lists of names in ALL UPPERCASE might utilize similar special purpose software.
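What such special purpose software might do can be sketched as follows, assuming Python 3. The function names are hypothetical, and the code point U+1E9E (LATIN CAPITAL LETTER SHARP S) is the one assigned after this text was written, so it is an assumption relative to the discussion here.

```python
CAPITAL_SHARP_S = "\u1E9E"  # assumption: code point assigned after this text was written


def upper_default(text: str) -> str:
    """Default behavior: "ß" is uppercased to "SS"."""
    return text.upper()


def upper_third_way(text: str) -> str:
    """Hypothetical tailoring: map "ß" to the capital form, then uppercase the rest."""
    return text.replace("ß", CAPITAL_SHARP_S).upper()


print(upper_default("Straße"))    # default mapping, unchanged by the new character
print(upper_third_way("Straße"))  # tailored mapping used only by opt-in software
```

The default function is untouched by the tailoring, which mirrors the point that existing texts and implementations would behave as before.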
For users of the third way, what would change as a result of adding a character is that current ad-hoc solutions could be replaced by conformant solutions with initially equal functionality. To the degree that certain very common font suites were to add a glyph for this character, reasonable transmission on the web and in e-mail would work in the medium term. If the default lowercase mapping of the character is to the existing “ß,” name and form data can be converted to standard orthography by title casing (nouns/names) or lowercasing, which would be useful (and retain the desired distinction).
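A brief sketch of that conversion, assuming Python 3 with a Unicode database that includes U+1E9E and maps it to “ß” in lower case (the code point was assigned after this text was written). The surnames are made-up examples.

```python
CAPITAL_SHARP_S = "\u1E9E"  # assumption: code point assigned after this text was written

# Two hypothetical registry entries held in ALL UPPERCASE.
name_with_sharp_s = "GRO" + CAPITAL_SHARP_S + "MANN"
name_with_double_s = "GROSSMANN"

# Title casing and lowercasing restore standard orthography ...
titled_sharp = name_with_sharp_s.title()
titled_double = name_with_double_s.title()
lowered_sharp = name_with_sharp_s.lower()

# ... and the two names remain distinct afterwards.
still_distinct = titled_sharp != titled_double

print(titled_sharp, titled_double, still_distinct)
```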
Extending the weak equivalence to “SS” for sorting and searching (by default) would make data using the new character equally accessible. Obviously, however, the whole reason for using the “ß” is so that some search modes would not make that equivalence. Such search modes are already required to support users of the second way, which is currently the most common way of supporting the distinction between “ß” and “SS” in ALL UPPERCASE contexts.
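The two kinds of search mode can be approximated in a short sketch, assuming Python 3: `str.casefold()` stands in for the weak equivalence and `str.lower()` for a stricter mode. The key function names are hypothetical, and real collation or search engines would use proper tailorable algorithms rather than these shortcuts.

```python
CAPITAL_SHARP_S = "\u1E9E"  # assumption: code point assigned after this text was written


def loose_key(text: str) -> str:
    """Weak equivalence: full case folding treats every sharp s form as "ss"."""
    return text.casefold()


def strict_key(text: str) -> str:
    """Stricter mode: ignores case but keeps "ß" apart from "ss"."""
    return text.lower()


standard = "MASSE"                        # standard orthography (or the word "Masse")
second_way = "MAßE"                       # lower case "ß" retained
third_way = "MA" + CAPITAL_SHARP_S + "E"  # capital form

loose_keys = {loose_key(s) for s in (standard, second_way, third_way)}
strict_keys = {strict_key(s) for s in (standard, second_way, third_way)}

print(loose_keys)   # one equivalence class
print(strict_keys)  # two classes: the "ß" forms versus "ss"
```

Under the strict key the second and third ways fall together, which is consistent with treating the capital form as a variation of retaining “ß”.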
The existence of this 'second way' (retaining lower case “ß” 'as-is') and the fact that it is, for now, the most common non-standard way of retaining the distinction between “ß” and “SS” in ALL UPPERCASE context, means that the third way cannot be considered in isolation. For example, a lot less would be gained by basing the third way on an encoding that is based on “SS,” because that makes it different from the second way. On the contrary, many of the potential complications of, as well as solutions for, addressing the third way with a new character are already present because of the second way.
The primary exception on the text processing level would be the lack of a (default) uppercase mapping from “ß” to the new character. I concur with the proposers' judgment that this is not an issue for the average user, and that the adherents of the third way either can live with that restriction or that they will (be able to) use tailored software. (It is possible to disagree with that judgment, but that comes down to a matter of opinion.)
The primary exception on the display level would be the lack (for a transition time) of a glyph in many or most fonts.
It is sometimes claimed that <S, ZWJ, S> would gracefully fall back to “SS” and that this would make it more attractive than the “missing glyph” that would ensue if there were a new character, but no glyph in the font. While the fallback does work wherever the system enforces the default-ignorable property of ZWJ, it violates the rule of “no surprises,” since anyone who intends to communicate a distinction between “ß” and “SS” will no longer be able to predict what the other side will see, and there will be no obvious indication of error. (Users of the third way who anticipate transmission problems would presumably rather fall back, manually, to the second way.)
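The silent nature of that fallback can be modeled in a simplified sketch, assuming Python 3. Stripping ZWJ is only a stand-in for a renderer that ignores the character and has no ligature to offer; the function name is hypothetical.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER, a default-ignorable character

ligature_request = "MA" + "S" + ZWJ + "S" + "E"  # intends the capital sharp s form
plain_double_s = "MASSE"                          # standard orthography


def shown_without_ligature(text: str) -> str:
    """Simplified model of display on a system lacking the ligature: ZWJ is ignored."""
    return text.replace(ZWJ, "")


# The encoded strings differ ...
differ_in_encoding = ligature_request != plain_double_s
# ... but the reader sees the same thing, with no indication that anything was lost.
same_on_screen = shown_without_ligature(ligature_request) == plain_double_s

print(differ_in_encoding, same_on_screen)
```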
Incidentally, it is equally unclear whether such a ligature could or would be enabled without affecting the use of all other ligatures in the document. Ligatures across compound-word internal boundaries are not desirable in German, and might have to be suppressed individually with ZWNJ before ligatures could be enabled globally for German text. Positive ligature support may be absent or may not be controllable in forms. Such complications can easily mean that using an SS ligature is as limiting in practice as using a new character with initially limited font support.
Lowercasing such data opens a new issue, i.e. that of displaying <s, ZWJ, s>. If fonts were to utilize a “ß” glyph for that sequence, which might be tempting, then it could encourage a dual representation of the lower case “ß”. If they were not, then lowercasing a text that intends to make a distinction that is unequivocally correct and required in lower case text would result in its being removed, unless a special mapping <S, ZWJ, S> → ß were to be widely implemented. (Not to mention that such a mapping would go against the principle of not having ZWJ affect casing.)
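A small sketch of the lowercasing problem, assuming Python 3 and its default case mappings. The function `lower_with_zwj_rule` is hypothetical and models the special mapping that would have to be widely implemented; it is shown only to make the cost visible, not as a recommendation.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

ligature_text = "MA" + "S" + ZWJ + "S" + "E"

# Default lowercasing leaves ZWJ in place and yields <s, ZWJ, s>, not "ß".
default_lowered = ligature_text.lower()


def lower_with_zwj_rule(text: str) -> str:
    """Hypothetical special mapping: treat <S, ZWJ, S> as "ß" before lowercasing."""
    return text.replace("S" + ZWJ + "S", "ß").lower()


special_lowered = lower_with_zwj_rule(ligature_text)

print(repr(default_lowered))
print(repr(special_lowered))
```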
While the facts about actual usage can be established and putative
consequences for both proposed solution and counterproposal can be
mapped, the weighting of this information is and remains a matter of
judgment, and true precedents for such a complicated situation are
lacking.
Finally, what of the non-technical factors that UTC should consider when making encoding decisions?
There seems to be agreement that Unicode does not restrict itself to standard orthography, that it is descriptive rather than prescriptive, and that it takes no sides in settling orthographies, but retains the right to determine how best to reflect a given orthography in an encoding. All three ways discussed here would qualify for being encodable, based on their degree of documented usage (two of which, of course, are already encodable).
There is considerably less agreement on how to account for historical
development, including the origin (putative or documented) of a form,
trends in the development of an orthography (observable or speculative)
and predictions of future (or far future) outcomes. In the case at
hand, I tend to believe in the existence of overarching trends, while
simultaneously disbelieving a concrete possibility of real and
widespread change in actual practices on the ground in the near to
medium term.
In terms of stability of properties, it is claimed that proponents of the third way would ask (eventually) for a change of the mapping from “ß” to “SS” to a mapping from “ß” to uppercase “ß”. Well, they might, but my firm assumption is that UTC will do the research to base its decisions on the needs of the average user. As long as the standard orthography remains the standard, those needs are unchanged. Not encoding a new character, by the way, is no safeguard, because proponents of the second way (and there are more of them) could ask for a similar incompatible change in mapping (to always leave the “ß” as-is).
Under the assumption that UTC continues to be able to do due diligence in this case, neither scenario represents a true risk, up until that potential far-in-the-future time that the average user wants a different behavior, at which time the UTC has worse problems than whether the uppercase “ß” should be a character or <S, ZWJ, S>. (In fact, in precisely such a case, that elegant fall-back would likely be a true liability.)
For these reasons I continue to support, on balance, the proposal as submitted and continue to discount many of the scare scenarios. Even with the addition of a new character, none of the three ways discussed here is ideal, and neither is the standard orthography as it stands. However, the existence of these multiple ways is itself a mirror of the (near glacial) change in interpretation and usage of the “ß.” This is a historical process, and if Unicode has a role, it is to remain neutral, but supportive.
Last modified: 01.07.2008 01:36