Martin, I found this article that might be of interest...
https://developer.twitter.com/en/docs/b ... characters
Counting Characters
General Concepts
The “café” issue mentioned above raises the question of how you count the characters in the Tweet string “café”. To the human eye the length is clearly four characters. Depending on how the data is represented this could be either five or six UTF-8 bytes. Twitter does not want to penalize a user for the fact we use UTF-8 or for the fact that the API client in question used the longer representation. Therefore, Twitter does count “café” as four characters no matter which representation is sent.
Nearly all user input methods automatically convert the longer combining mark version into the composed version, but the Twitter API cannot count on that. Even if we ignored that, the byte length of the “é” character is two bytes rather than the one you would expect. Below there is some more specific information on how to get that information out of Ruby/Rails, but for now I’ll cover the general concepts that should be available in any language.
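To make the byte counts concrete, here is a minimal Python sketch (Python is just a choice for illustration; the variable names are made up for this example). It builds both representations of “café” and compares their UTF-8 lengths.

```python
# The two representations of "café" discussed above.
composed = "caf\u00e9"      # é as the single codepoint U+00E9
decomposed = "cafe\u0301"   # plain "e" followed by the combining acute accent U+0301

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

print(len(composed_bytes))      # 5 bytes: c, a, f, then two bytes for é
print(len(decomposed_bytes))    # 6 bytes: c, a, f, e, then two bytes for the mark
print(composed == decomposed)   # False: same on screen, different underlying data
```

Neither byte count matches the four characters a reader sees, which is why counting bytes is not enough.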
The Unicode Standard covers much more than a listing of characters with numbers associated. Unicode does provide such a list of “codepoints” (more info: http://www.unicode.org/charts/), which is the U+XXXX notation you sometimes see. The Unicode Standard also provides several different ways to encode those codepoints (UTF-8 and UTF-16 are examples, but there are others). The Unicode Standard also provides some detailed information on how to deal with character issues such as Sorting, Regular Expressions and, of importance to this issue, Normalization.
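As a small illustration of the difference between a codepoint and its encodings, the following Python sketch (variable names are hypothetical) takes the composed “é” and shows its U+XXXX number alongside its UTF-8 and UTF-16 byte sequences.

```python
e_acute = "\u00e9"  # the composed character é

# The codepoint is an abstract number, written in U+XXXX notation.
codepoint = f"U+{ord(e_acute):04X}"

# The same codepoint produces different bytes under different encodings.
utf8 = e_acute.encode("utf-8")       # two bytes: 0xC3 0xA9
utf16 = e_acute.encode("utf-16-be")  # two bytes: 0x00 0xE9 (big-endian, no BOM)

print(codepoint, utf8.hex(" "), utf16.hex(" "))
```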
Combining Diacritical Marks - A Prelude to Normalization
So, back in the café, the issue of multiple byte sequences having the same on-screen representation was breezed right by. There is an entire section of the Unicode tables devoted to the “Combining Diacritical Marks” (see that Unicode “block” here). These are not stand-alone characters but instead the additional “diacritical marks” used in addition to other base characters in many languages. For example the ¨ over the ü, common to German; or the ˜ over the ñ in Spanish. There are a great many combinations needed to cover all languages in the world so Unicode provides some simple building blocks, the Combining Diacritical Marks.
For the most common characters (like é, ü and company) there is also a character just for the combination. The reasons for that are mostly historical but since they exist it’s something we’ll always need to be aware of. This historical oddity is the exact reason for the two “café” representations. If you look back at the representations you’ll see one uses 0x65 0xCC 0x81, where 0x65 is simply the letter “e” and 0xCC 0x81 is the Combining Diacritical Mark for ´. Since there are multiple ways to represent the same thing using Unicode, the Unicode Standard provides information on how to normalize the multiple different representations.
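The long-form sequence can be inspected with Python's standard `unicodedata` module. This is only a sketch to show what the bytes 0x65 0xCC 0x81 contain; the variable names are invented for the example.

```python
import unicodedata

decomposed_e = "e\u0301"  # base letter plus combining mark

# UTF-8 bytes of the sequence: 0x65, then 0xCC 0x81 for the combining mark.
raw = decomposed_e.encode("utf-8")

# Official Unicode names of the two codepoints.
names = [unicodedata.name(ch) for ch in decomposed_e]

# A non-zero combining class marks a character that attaches to a base character.
classes = [unicodedata.combining(ch) for ch in decomposed_e]

for ch, name, cls in zip(decomposed_e, names, classes):
    print(f"U+{ord(ch):04X}", name, cls)
```

The first codepoint is an ordinary letter with combining class 0, while the second has a non-zero class, meaning it is drawn on top of whatever precedes it.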
Unicode Normalization
The Unicode Standard describes normalization based on two kinds of equivalence, Canonical and Compatibility. There is a full description of the different options in the Unicode Standard Annex #15, the report on normalization. The normalization report is 32 pages and covers the issue in great detail. Reproducing the entire report here would be of very little use so instead we’ll focus on what normalization Twitter is using.
Twitter counts the length of a Tweet using the Normalization Form C (NFC) version of the text. This type of normalization favors the use of a fully combined character (0xC3 0xA9 from the café example) over the long-form version (0x65 0xCC 0x81). Twitter also counts the number of codepoints in the text rather than UTF-8 bytes. The 0xC3 0xA9 from the café example is one codepoint (U+00E9) that is encoded as two bytes in UTF-8, whereas 0x65 0xCC 0x81 is two codepoints encoded as three bytes.
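The counting rule described above (normalize to NFC, then count codepoints) can be sketched in Python as follows. The helper `nfc_length` is a hypothetical name for this example, and it covers only the rule stated here; it is not Twitter's actual implementation.

```python
import unicodedata


def nfc_length(text: str) -> int:
    """Count codepoints after NFC normalization (illustrative helper only)."""
    normalized = unicodedata.normalize("NFC", text)
    # len() on a Python str counts codepoints, not UTF-8 bytes.
    return len(normalized)


composed = "caf\u00e9"      # é as U+00E9
decomposed = "cafe\u0301"   # e followed by U+0301

print(len(decomposed))         # 5 codepoints before normalization
print(nfc_length(composed))    # 4
print(nfc_length(decomposed))  # 4
```

Both inputs come out as four, matching what the reader sees, regardless of which representation the client sent.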