Unicode Properties Of Strings

  • I have a primer on Unicode here
  • I have a post on more basic Unicode character classes here
  • Properties of Strings are properties that apply to a sequence of codepoints rather than individual codepoints
  • The emoji 👩🏼‍🤝‍👨🏽 is made up of seven Unicode code points 👩 (U+1F469), 🏼 (U+1F3FC), ‍ (U+200D), 🤝 (U+1F91D), ‍ (U+200D), 👨 (U+1F468) and 🏽 (U+1F3FD)
  • In JavaScript you can write them either directly as codepoint escapes with \u{...}, or as \uXXXX escapes after converting each codepoint to its UTF-16 code units (a surrogate pair for codepoints above U+FFFF)
//Using the Unicode codepoints as is with \u{}
let targetString = "\u{1F469}\u{1F3FC}\u{200D}\u{1F91D}\u{200D}\u{1F468}\u{1F3FD}";
console.log(targetString);
//Using the Unicode codepoints after converting them to UTF-16 with \u
targetString = "\uD83D\uDC69\uD83C\uDFFC\u200D\uD83E\uDD1D\u200D\uD83D\uDC68\uD83C\uDFFD";
console.log(targetString);
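As a quick check that the two spellings really are the same string, the sketch below compares them and counts codepoints versus UTF-16 code units. The variable names are only illustrative.

```javascript
// The same emoji written with codepoint escapes and with UTF-16 surrogate pairs
const byCodePoint = "\u{1F469}\u{1F3FC}\u{200D}\u{1F91D}\u{200D}\u{1F468}\u{1F3FD}";
const bySurrogates = "\uD83D\uDC69\uD83C\uDFFC\u200D\uD83E\uDD1D\u200D\uD83D\uDC68\uD83C\uDFFD";

// Both literals produce exactly the same string
const sameString = byCodePoint === bySurrogates;

// .length counts UTF-16 code units, while the string iterator yields codepoints
const codeUnitCount = byCodePoint.length;
const codePointCount = [...byCodePoint].length;

// List each codepoint in the usual U+XXXX notation
const codePointsHex = [...byCodePoint].map(
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase()
);

console.log(sameString);              // true
console.log(codeUnitCount);           // 12
console.log(codePointCount);          // 7
console.log(codePointsHex.join(" ")); // U+1F469 U+1F3FC U+200D U+1F91D U+200D U+1F468 U+1F3FD
```

The two ZWJ codepoints (U+200D) take one code unit each and the other five take two each, which is where the 12 comes from.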
  • The sequence of seven Unicode codepoints, taken as a whole, has the property ‘RGI_Emoji_ZWJ_Sequence’ (the individual codepoints do not), and you can match the string of those seven codepoints in a regular expression with \p{RGI_Emoji_ZWJ_Sequence}
let pattern = /\p{RGI_Emoji_ZWJ_Sequence}/v;
//Using the Unicode codepoints as is with \u{}
let targetString = "\u{1F469}\u{1F3FC}\u{200D}\u{1F91D}\u{200D}\u{1F468}\u{1F3FD}";
let result = targetString.match(pattern);
console.log(result[0]);
//Using the Unicode codepoints after converting them to UTF-16 with \u
targetString = "\uD83D\uDC69\uD83C\uDFFC\u200D\uD83E\uDD1D\u200D\uD83D\uDC68\uD83C\uDFFD";
result = targetString.match(pattern);
console.log(result[0]);
  • Note that all of the codepoints in the above strings were matched without any repetition specifiers like *, + or {}
  • This is what is meant by Properties of Strings: the property matches a whole string of codepoints, without repetition specifiers, because that string forms a grapheme cluster. A grapheme cluster is a string of codepoints that forms a single user-perceived character
  • You will need to use the 'v' flag to match properties of strings
  • You can match multiple grapheme clusters by using repetition characters. The variable 'targetString' in the code below holds a string of 11 UTF-16 code units. That string encodes three emojis, 🥷🏾, ✌🏽 and 🫸🏿, which are made up of 4, 3 and 4 UTF-16 code units respectively
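//Use + rather than * so that the pattern cannot also produce an empty match at the end of the string
let wordPttn = /\p{RGI_Emoji_Modifier_Sequence}+/vg;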
let targetString = "\uD83E\uDD77\uD83C\uDFFE\u270C\uD83C\uDFFD\uD83E\uDEF8\uD83C\uDFFF";
//console.log(targetString);
let results = targetString.matchAll(wordPttn);
for (let result of results) {
  console.log(result[0]);
}
  • There are three emojis in the above string
    • A ninja 🥷🏾 which is made up of the two codepoints U+1F977 and U+1F3FE, or 4 UTF-16 code units ‘\uD83E\uDD77\uD83C\uDFFE’
    • A victory sign ✌🏽 which is made up of the two codepoints U+270C and U+1F3FD, or 3 UTF-16 code units ‘\u270C\uD83C\uDFFD’
    • A rightwards pushing hand 🫸🏿 which is made up of the two codepoints U+1FAF8 and U+1F3FF, or 4 UTF-16 code units ‘\uD83E\uDEF8\uD83C\uDFFF’
  • Note that the regex engine knows how to group the 11 UTF-16 code units ‘\uD83E\uDD77\uD83C\uDFFE\u270C\uD83C\uDFFD\uD83E\uDEF8\uD83C\uDFFF’ into the correct grapheme clusters of 4, 3 and 4 UTF-16 code units
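  • Properties of Strings are binary properties: a given string of codepoints either has the property or it does not
  • There is no negation for properties of strings: \P{...} with a property of strings is a syntax error, and so is a negated character class such as [^\p{...}] that contains one
  • The following are the primary “Properties of Strings” defined by Unicode (in UTS #51, Unicode Emoji)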
    • Basic_Emoji
    • Emoji_Keycap_Sequence
    • RGI_Emoji
    • RGI_Emoji_Flag_Sequence
    • RGI_Emoji_Modifier_Sequence
    • RGI_Emoji_Tag_Sequence
    • RGI_Emoji_ZWJ_Sequence
  • You can get an extensive list of codepoints with properties of strings here and here