Soft Hyphens

A soft hyphen (U+00AD, written &shy; in HTML) marks a place in text where a word may be broken and hyphenated if it falls at the end of a line, without forcing a break there when the text is reflowed. Recently I had to work out how to handle soft hyphens in a web-based application, and I quickly discovered that they are not as simple a topic as I had initially thought.

Why should a web developer care one iota about soft hyphens? Because their use, while not yet universal, is becoming more common in languages other than English, e.g. German. Long words that a browser cannot hyphenate automatically lead to ugly layouts, especially when there’s not a lot of horizontal space.

HTML recognizes two types of hyphens: the plain hyphen and the soft hyphen. The plain hyphen should be interpreted by a browser as just another character. The soft hyphen tells the user agent where a line break may occur. Browsers that support soft hyphens must observe the following semantics: If a line is broken at a soft hyphen, a hyphen character must be displayed at the end of the first line. If a line is not broken at a soft hyphen, the user agent must not display a hyphen character. For operations such as searching and sorting, the soft hyphen should always be ignored. In HTML, the soft hyphen is written as the character entity reference &shy; (or numerically as &#173; or &#xAD;), and when it is displayed it is rendered by a graphic symbol that’s identical to a standard hyphen (-).
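The "ignore soft hyphens when searching and sorting" rule can be illustrated with a short Python sketch. This is only one way to do it, assuming that stripping U+00AD before comparing is acceptable for the application; the helper names strip_soft_hyphens, contains, and sort_ignoring_soft_hyphens are made up for this example and do not come from any library.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN, written as an escape so it stays visible


def strip_soft_hyphens(text: str) -> str:
    """Return text with every soft hyphen removed."""
    return text.replace(SOFT_HYPHEN, "")


def contains(haystack: str, needle: str) -> bool:
    """Substring search that ignores soft hyphens in both arguments."""
    return strip_soft_hyphens(needle) in strip_soft_hyphens(haystack)


def sort_ignoring_soft_hyphens(words):
    """Sort words by their text with soft hyphens removed (code point order)."""
    return sorted(words, key=strip_soft_hyphens)


# A German compound with hyphenation hints (hypothetical sample data).
word = "Donau\u00addampf\u00adschiff"

print("dampfschiff" in word)            # naive search misses the match
print(contains(word, "dampfschiff"))    # search on stripped text finds it
print(sort_ignoring_soft_hyphens(["cow", "co\u00adop"]))
```

A naive comparison treats the soft hyphen as an ordinary character with code point 173, so it can both hide matches and change sort order. Stripping it only for comparison keeps the hyphenation hints in the stored text.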

Unfortunately, not all standards agree on what a soft hyphen is. According to ISO Latin 1 (ISO 8859-1), it is a visible hyphen. According to the Unicode standard, it is a hidden hyphenation hint. The HTML 4 specification also defines a soft hyphen to be a hyphenation hint. This conflict of standards has led to various interpretations of the soft hyphen by application developers.

The Unicode FAQ includes the following:

Q: Unicode now treats the SOFT HYPHEN as format control (Cf) character when formerly it was a punctuation character (Pd). Doesn’t this break ISO 8859-1 compatibility?

A: No. The ISO 8859-1 standard defines the SOFT HYPHEN as “[a] graphic character that is imaged by a graphic symbol identical with, or similar to, that representing hyphen” (section 6.3.3), but does not specify details of how or when it is to be displayed, nor other details of its semantics. The soft hyphen has had a long history of legacy implementation in two or more incompatible ways.

Unicode clarifies the semantics of this character for Unicode implementations, but this does not affect its usage in ISO 8859-1 implementations. Processes that convert back and forth may need to pay attention to semantic differences between the standards, just as for any other character.

In a terminal emulation environment, particularly in ISO-8859-1 contexts, one could display the soft hyphen as a hyphen in all circumstances. The change in semantics of the Unicode character does not require that implementations of terminal emulators in other environments, such as ISO 8859-1, make any change in their current behavior.
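To make the conversion point in the FAQ concrete, here is a small Python sketch using the standard unicodedata module. It shows that byte 0xAD in ISO 8859-1 maps to the code point U+00AD, which Unicode classifies as a format character (Cf). The to_visible_hyphen helper is a hypothetical example of the "always display a hyphen" behavior described for terminal emulators, not something taken from a real terminal implementation.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"

# ISO 8859-1 bytes; 0xAD is the soft hyphen in that encoding.
raw = b"co\xadoperate"

# Decoding preserves the code point: byte 0xAD becomes U+00AD.
text = raw.decode("iso-8859-1")

# Unicode's view of the character.
name = unicodedata.name(SOFT_HYPHEN)
category = unicodedata.category(SOFT_HYPHEN)

# Encoding back to ISO 8859-1 restores the original bytes.
round_trip = text.encode("iso-8859-1")


def to_visible_hyphen(s: str) -> str:
    """Hypothetical terminal-style rendering: show every soft hyphen as '-'."""
    return s.replace(SOFT_HYPHEN, "-")


print(name, category)
print(to_visible_hyphen(text))
```

The bytes survive the round trip unchanged. What differs between the two standards is how a program is expected to display the character, which is why converters and renderers have to decide on a policy explicitly.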

By the way, the Unicode standard specifies other types of hyphens, including two non-breaking hyphen characters: U+2011 NON-BREAKING HYPHEN and U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR. See Table 6.3 of the Unicode standard for a full list of hyphens and hyphen-like characters. You will be surprised by how many there are!
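If you want to check what a particular hyphen-like character is, Python's unicodedata module can report its official name and general category. The sketch below looks up a handful of characters. The list is a small sample chosen for this example, not the full table from the standard, and describe is a hypothetical helper name.

```python
import unicodedata

# A small, hand-picked sample of hyphen-like characters (not exhaustive).
HYPHENS = ["\u002d", "\u00ad", "\u2010", "\u2011", "\u0f0c"]


def describe(ch: str):
    """Return (code point label, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


table = [describe(ch) for ch in HYPHENS]

for code_point, name, category in table:
    print(code_point, category, name)
```

The soft hyphen is the only one in this sample with the format category Cf. The plain hyphens are classed as dash punctuation (Pd).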
