Thoughts on the Capital Sharp S
by Dr. Asmus Freytag
The following text was written as a contribution to the deliberations of the Unicode Technical Committee, when a proposal to encode the CAPITAL SHARP S was in its final discussion there. Many of the arguments and counterarguments summarized here come from longer discussions, both internal and external; they are presented here without individual citations (many of which were originally given in those discussions).
The text first appeared on May 8, 2007 in a non-public Unicode discussion forum and was lightly revised for this publication. A German translation is in preparation.
In German orthography (especially after the recent reform), there is a clear distinction between "ß" and "ss" in lower case. There are some word pairs where it is the only distinction. (The same is true for some personal names.)
For purposes of searching personal names, and for sorting words, it is expedient to suppress that distinction. Part of that (probably) has to do with the fact that spellings of personal names cannot be predicted by sound, and that sorting similar-sounding names together is generally useful. Before the reform, "ß" and "ss" were also distinguished, but in ways that were not as clearly related to the pronunciation of the word. Ergo, sorting words had the same issues as sorting personal names. However, sorting and searching are special in that they often create fairly wide equivalence sets, compared to the distinctions needed in accurately writing text or names.
The origin of the "SS" case mapping for "ß" is not actually known with certainty. However, it was decreed at a time when Fraktur and typewriters were in common use. Typewriters were extremely limited in the number of signs they could support, and ALL UPPERCASE text in Fraktur is an absurdity. Since the "ß" does not (ordinarily) occur at the start of a word, it is not affected by TitleCase, which is very common in German (nouns), so the impact of the standard orthographic rule is limited.
Nevertheless, the post office (on forms), sign writers, certain name registries, and many other users of ALL UPPERCASE text (in modern style, not Fraktur) feel that suppressing the distinction between words and names that contain "ß" and those that contain "ss" is not appropriate.
There are three ways this distinction can be maintained in ALL UPPERCASE text: using "SZ", retaining lower case "ß" as is, and using an uppercase form of "ß". All three forms can be found, and all three ways have their adherents. Yes, that means that Germany is not united after all. ;-)
For the following argument, it is important not to conflate any of these three forms with the standard orthography, which does equate "ß" with "SS" in ALL UPPERCASE text. The standard orthography is the only one that (outside sorting and searching) allows the equivalence between "SS" as uppercase of "ß" and "SS" as uppercase of "ss" (while simultaneously distinguishing carefully between their lower case forms).
If one were to desire a distinction between "ss" and "ß" in lower as well as in uppercase, for semantic reasons, then choosing an encoding that is based on a glyphic variation of "SS" would give the desired presentation but would hide the distinction at the character level.
Of course, it is in principle possible to arrange layout engines as well as all text processing to magically do the right thing, no matter how a text element is encoded, and no matter what the cost, but, putting it briefly, the Unicode philosophy is to model things close to the common understanding of the text element, unless the script model consistently supports a non-intuitive approach. I see no recent precedent, incidentally, that, by itself, would make deciding the current question a slam-dunk, but I tend to dislike piling complex-script-like approaches onto Latin.
If you desire to carry the distinction between "SS" and "ß" in ALL UPPERCASE TEXT, for semantic reasons, there are currently these three ways:
- Using "SZ". This is unattractive because converting the string to lower case results in nonsense, and few if any text processes consider any equivalence between "sz" and "ß". It feels unnatural to many readers. Nevertheless it is used in certain cases.
- Using "ß" as is. This does not suffer from the aforementioned problem, but is visually not appealing. Nevertheless, of the three, it is currently the most widespread solution.
- Using an uppercase form of "ß". This is currently only possible with ad-hoc support. Nevertheless its use can be documented, and given the technical challenges, is surprisingly frequent.
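The lowercasing behavior of the three ways can be sketched in a few lines. This is only an illustration: it uses Python's built-in `str.lower()` as a stand-in for default Unicode lowercasing, and it assumes that the uppercase form is available as a character (U+1E9E, where it was later encoded) whose lowercase mapping is the existing "ß". The word pair "Maße"/"Masse" is chosen purely as an example.

```python
# Illustrative sketch: lowercasing ALL UPPERCASE text written in each way.
# Assumption: U+1E9E is the capital sharp s and lowercases to U+00DF (ß).
CAPITAL_SHARP_S = "\u1E9E"

samples = {
    "standard": "MASSE",                      # ß replaced by SS
    "first": "MASZE",                         # ß replaced by SZ
    "second": "MA\u00DFE",                    # lower case ß kept as is
    "third": "MA" + CAPITAL_SHARP_S + "E",    # uppercase form of ß
}

# Lowercase every sample with the default (untailored) mapping.
lowered = {label: text.lower() for label, text in samples.items()}

for label, text in lowered.items():
    print(f"{label:9} -> {text}")
```

Only the second and third ways come back as "maße"; the first way yields "masze", and the standard form can no longer be told apart from "Masse".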
By itself, the proposal to encode a CAPITAL SHARP S does not change the current orthography. The proposal (as such) does not even try to standardize on the third form, but merely proposes that the uppercase form of "ß" be considered a character, and implemented as such. (Individuals among the proposers or elsewhere may have an interest in promoting a change in writing practices, but it is not Unicode's role to take sides on such larger issues, and there is little objective reason to fear radical and imminent change in the majority usage. Raising the threat of such change as if it were imminent and inevitable would seem to border on fear-mongering, so let us agree that it is neither.)
Given that the use of an uppercase form of "ß" is clearly a variation of a (currently more common) practice of using the lowercase form for the same purpose, a search for a solution should start from the "ß" and not from the equivalence to the "SS". The reason is that while that equivalence is present in the standard orthography, it is explicitly rejected by users of all three alternative ways. Starting from the "ß" would follow the principle of least surprise to the users and implementers.
Given that ALL UPPERCASE contexts are relatively uncommon, that retaining the distinction between "ß" and "SS" is less common than giving up that distinction as dictated by the standard orthography, and that out of three possible ways only one uses an uppercase form of "ß", the expectation of the average German user would first and foremost be that existing texts and implementations behave as before.
Adding a new character would therefore not change the default case mapping of "ß" to "SS". Users of the third way would need to enter their new character by hand, or use special purpose software. The former is appropriate for signage, book covers, and similar uses. The latter is what the post office might use in a data processing center entering hand-filled forms using "ß". Institutions maintaining lists of names in ALL UPPERCASE might utilize similar special purpose software.
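The split between default behavior and special purpose software might look roughly like the following sketch. The function names `upper_default` and `upper_third_way` are hypothetical, the new character is assumed to be U+1E9E (its later code point), and Python's `str.upper()` stands in for the default case mapping.

```python
# Sketch of default versus tailored uppercasing.
# Assumption: the new character is U+1E9E; function names are hypothetical.
SHARP_S = "\u00DF"          # ß
CAPITAL_SHARP_S = "\u1E9E"  # uppercase form of ß


def upper_default(text: str) -> str:
    """Default mapping: ß becomes SS, as in the standard orthography."""
    return text.upper()


def upper_third_way(text: str) -> str:
    """Tailored mapping: substitute the capital form, then uppercase the rest."""
    return text.replace(SHARP_S, CAPITAL_SHARP_S).upper()


name = "Mei\u00DFner"
print(upper_default(name))    # MEISSNER
print(upper_third_way(name))  # MEI + U+1E9E + NER
```

Existing texts and implementations keep calling the default path and see no change; only software that opts into the tailoring produces the new character.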
For users of the third way, what would change as a result of adding a character is that current ad-hoc solutions could be replaced by conformant solutions with initially equal functionality. To the degree that certain very common font suites were to add a glyph for this character, reasonable transmission on the web and in e-mail would work in the medium term. If the default lowercase mapping of the character is to the existing "ß", name and form data can be converted to standard orthography by title casing (nouns/names) or lowercasing, which would be useful (and retain the desired distinction).
Extending the weak equivalence to "SS" for sorting and searching (by default) would make data using the new character equally accessible. Obviously, however, the whole reason for using the "ß" is so that some search modes would not make that equivalence. Such search modes are already required to support users of the second way, which is currently the most common way of supporting the distinction between "ß" and "SS" in ALL UPPERCASE contexts.
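The two kinds of search mode can be sketched as two key functions. This is a minimal illustration, not a description of any particular search engine: `loose_key` and `strict_key` are hypothetical names, `str.casefold()` is used as an approximation of the weak equivalence, and the new character is again assumed to be U+1E9E.

```python
# Sketch of a wide (default) and a narrow search mode.
# Assumption: key function names are hypothetical; U+1E9E is the capital ß.

def loose_key(text: str) -> str:
    """Wide equivalence: casefold() folds both ß and U+1E9E to 'ss'."""
    return text.casefold()


def strict_key(text: str) -> str:
    """Narrow mode: lowercase only, so ß stays distinct from ss."""
    return text.lower()


entries = ["Maße", "MASSE", "MA\u00DFE", "MA\u1E9EE", "Masse"]

loose_groups = {loose_key(e) for e in entries}
strict_groups = {strict_key(e) for e in entries}

print(sorted(loose_groups))   # one equivalence class
print(sorted(strict_groups))  # two classes: with ß and with ss
```

Under the strict key, data written the second way and data written the third way land in the same class, which is why the narrow mode already needed for the second way would also serve the new character.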
The existence of this 'second way' (retaining lower case "ß" 'as-is') and the fact that it is, for now, the most common non-standard way of retaining the distinction between "ß" and "SS" in ALL UPPERCASE contexts, means that the third way cannot be considered in isolation. For example, a lot less would be gained by basing the third way on an encoding that is based on "SS", because that makes it different from the second way. On the contrary, many of the potential complications of, as well as solutions for, addressing the third way with a new character are already present because of the second way.
The primary exception on the text processing level would be the lack of a (default) uppercase mapping from "ß" to the new character. I concur with the proposers' judgment that this is not an issue for the average user, and that the adherents of the third way either can live with that restriction or that they will (be able to) use tailored software. (It is possible to disagree with that judgment, but that comes down to a matter of opinion.)
The primary exception on the display level would be the lack (for a transition time) of a glyph in many or most fonts.
It is sometimes claimed that <S, ZWJ, S> would gracefully fall back to "SS" and that this would make it more attractive than the "missing glyph" that would ensue if there were a new character but no glyph in the font. While the fallback does work wherever the system enforces the default-ignorable property of ZWJ, it violates the rule of "no surprises", since anyone who intends to communicate a distinction between "ß" and "SS" will no longer be able to predict what the other side will see, and there will be no obvious indication of error. (Users of the third way who anticipate transmission problems would presumably rather fall back, manually, to the second way.)
Incidentally, it is equally unclear whether such a ligature could or would be enabled without affecting the use of all other ligatures in the document. Ligatures across compound-word internal boundaries are not desirable in German, and might have to be suppressed individually with ZWNJ before ligatures could be enabled globally for German text. Positive ligature support may be absent or may not be controllable in forms. Such complications can easily mean that using an SS ligature is as limiting in practice as using a new character with initially limited font support.
Lowercasing such data opens a new issue, i.e. that of displaying <s, ZWJ, s>. If fonts were to utilize a "ß" glyph for that sequence, which might be tempting, then it could encourage a dual representation of the lower case "ß". If they were not, then lowercasing a text that intends to make a distinction that is unequivocally correct and required in lower case text would result in its being removed, unless a special mapping <S, ZWJ, S> → ß were to be widely implemented. (Not to mention that such a mapping would go against the principle of not having ZWJ affect casing.)
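The behavior of the sequence-based counterproposal under default processing can be sketched as follows. The helper `strip_zwj` is hypothetical and merely models a system that ignores ZWJ (U+200D) as a default-ignorable character; `str.lower()` stands in for default lowercasing, in which ZWJ plays no part.

```python
# Sketch: what happens to <S, ZWJ, S> under fallback and under lowercasing.
# Assumption: strip_zwj is a hypothetical model of ignoring ZWJ on display.
ZWJ = "\u200D"
SHARP_S = "\u00DF"  # ß

ligature_form = "MA" + "S" + ZWJ + "S" + "E"


def strip_zwj(text: str) -> str:
    """Model a system that treats ZWJ as ignorable."""
    return text.replace(ZWJ, "")


fallback = strip_zwj(ligature_form)  # what a reader may see: MASSE
lowered = ligature_form.lower()      # <s, ZWJ, s> survives; no ß appears

print(fallback)
print([f"U+{ord(c):04X}" for c in lowered])
```

After the fallback, the text is indistinguishable from the standard "SS" spelling, and after lowercasing it contains "s", ZWJ, "s" rather than "ß", so the intended distinction depends entirely on further special handling.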
While the facts about actual usage can be established and putative consequences for both proposed solution and counterproposal can be mapped, the weighting of this information is and remains a matter of judgment, and true precedents for such a complicated situation are lacking.
Finally, what of the non-technical factors that UTC should consider when making encoding decisions?
There seems to be agreement that Unicode does not restrict itself to standard orthography, that it is descriptive rather than prescriptive, and that it takes no sides in settling orthographies, but retains the right to determine how best to reflect a given orthography in an encoding. All three ways discussed here would qualify for being encodable, based on their degree of documented usage (two of which, of course, are already encodable).
There is considerably less agreement on how to account for historical development, including the origin (putative or documented) of a form, trends in the development of an orthography (observable or speculative) and predictions of future (or far future) outcomes. In the case at hand, I tend to believe in the existence of overarching trends, while simultaneously disbelieving a concrete possibility of real and widespread change in actual practices on the ground in the near to medium term.
In terms of stability of properties, it is claimed that proponents of the third way would ask (eventually) for a change of the mapping from "ß" to "SS" to a mapping from "ß" to uppercase "ß". Well, they might, but my firm assumption is that UTC will do the research to base its decisions on the needs of the average user. As long as the standard orthography remains the standard, those needs are unchanged. Not encoding a new character, by the way, is no safeguard, because proponents of the second way (and there are more of them) could ask for a similar incompatible change in mapping (to always leave the "ß" as is).
Under the assumption that UTC continues to be able to do due diligence in this case, neither scenario represents a true risk, up until that potential far-in-the-future time when the average user wants a different behavior, at which time the UTC has worse problems than whether the uppercase "ß" should be a character or <S, ZWJ, S>. (In fact, in precisely such a case, that elegant fall-back would likely be a true liability.)
For these reasons I continue to support, on balance, the proposal as submitted and continue to discount many of the scare scenarios. Even with the addition of a new character, none of the three ways discussed here is ideal, and neither is the standard orthography as it stands. However, the existence of these multiple ways is itself a mirror of the (near glacial) change in interpretation and usage of the "ß". This is a historical process, and if Unicode has a role, it is to remain neutral, but supportive.
Last modified: 1 July 2008, 01:36