Experiments With Machine Translation: Case Mapping & Case Folding

Modern English draws little distinction between "uppercase" and "titlecase" letters. For the most part, a "titlecased" word in which the first letter is capitalized (such as the first word in each sentence in this article) is identical to the same word with its first letter simply uppercased. In other words, in most cases in English, applying "titlecase()" and "uppercase()" to a given letter yields the same result.

This is not the case in all languages. For example, take the Latin character "ǆ" used in Serbo-Croatian. While this appears as two visible letters ("d" followed by a "z" with a caron), it is actually a single code point. Because the digraph is encoded as a single code point, there are three forms of this character: its lowercase form "ǆ" (U+01C6), its uppercase form "Ǆ" (U+01C4) and its titlecase form "ǅ" (U+01C5), in which only the "D" is capitalized.
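The three forms can be inspected directly. The following is a minimal sketch in Python 3, chosen here only for illustration; the `describe` helper and the variable names are hypothetical and not part of any package discussed in this article. It also shows how the English-centric shortcut of uppercasing the first character produces the uppercase form rather than the titlecase form.

```python
import unicodedata

# Lowercase digraph "dž" encoded as the single code point U+01C6.
lower = "\u01C6"

# Python's built-in string methods apply the Unicode case mappings.
upper = lower.upper()   # uppercase mapping
title = lower.title()   # titlecase mapping

# Naive "titlecasing": uppercase the first character, keep the rest.
naive_title = lower[0].upper() + lower[1:]


def describe(ch: str) -> str:
    """Return the code point, Unicode name and general category of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})"


for label, ch in [("lower", lower), ("upper", upper), ("title", title)]:
    print(f"{label:5} {ch} {describe(ch)}")

# The naive approach yields the uppercase form, not the titlecase form.
print("naive matches true titlecase:", naive_title == title)
```

The general category printed for each form ("Ll", "Lu" and "Lt") is how the Unicode Character Database distinguishes lowercase, uppercase and titlecase letters.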

Unicode provides mappings between these distinct code points to indicate that they represent different casings of the same letter, along with a fourth mapping, "case folding", which is used for case-insensitive comparison.

Due to the practical equivalence of uppercase and titlecase in English, not all analytic packages provide native support for titlecasing or case folding. Instead, they incorrectly point users to simply uppercase the initial character for titlecasing, or to lowercase both texts for case-insensitive matching. Even some major, widely used packages do not support titlecasing or fully compliant case folding, and preprocessing may be required for some workflows.
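To illustrate why lowercasing both texts is not a substitute for case folding, here is a minimal sketch, again in Python 3 as an illustrative choice. The helper names `equal_by_lowercasing` and `equal_by_casefolding` are hypothetical. The German "ß" is used as an assumed example because its lowercase form is itself, while its case-folded form is "ss".

```python
def equal_by_lowercasing(a: str, b: str) -> bool:
    """Compare two strings after lowercasing both (the common shortcut)."""
    return a.lower() == b.lower()


def equal_by_casefolding(a: str, b: str) -> bool:
    """Compare two strings after applying Unicode case folding to both."""
    return a.casefold() == b.casefold()


pairs = [
    ("straße", "STRASSE"),   # "ß" folds to "ss" but lowercases to itself
    ("\u01C5", "\u01C4"),    # titlecase and uppercase forms of the "dž" digraph
]

for a, b in pairs:
    print(a, b,
          "lower:", equal_by_lowercasing(a, b),
          "casefold:", equal_by_casefolding(a, b))
```

In this sketch the digraph pair matches under both approaches, while the "ß" pair matches only under case folding. Note that case folding alone does not handle differences in Unicode normalization, so a workflow that also needs to match precomposed and decomposed forms would likely have to normalize the text as a separate step.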

The GDELT Project
