- I have a primer on Unicode here
- I have a post on more basic Unicode character classes here
- Properties of Strings are properties that apply to a sequence of codepoints rather than individual codepoints
- The emoji 👩🏼‍🤝‍👨🏽 is made up of seven Unicode code points: 👩 (U+1F469), 🏼 (U+1F3FC), a zero-width joiner (U+200D), 🤝 (U+1F91D), a second zero-width joiner (U+200D), 👨 (U+1F468) and 🏽 (U+1F3FD)
- In JavaScript you can print them using the Unicode code point values as is with '\u{}', or convert them to UTF-16 and use '\u'
// Using the Unicode code points as is with \u{}
let targetString = "\u{1F469}\u{1F3FC}\u{200D}\u{1F91D}\u{200D}\u{1F468}\u{1F3FD}";
console.log(targetString);

// Using the Unicode code points after converting them to UTF-16 with \u
targetString = "\uD83D\uDC69\uD83C\uDFFC\u200D\uD83E\uDD1D\u200D\uD83D\uDC68\uD83C\uDFFD";
console.log(targetString);
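The conversion from code points to UTF-16 escapes can be sketched as follows. This is a minimal illustration, not a library routine: `toUtf16Escapes` is a hypothetical helper name, and it assumes its input is a valid code point (0 to 0x10FFFF). Code points above U+FFFF are split into a high and a low surrogate; everything else stays a single code unit.

```javascript
// Hypothetical helper: render one code point as its UTF-16 \uXXXX escape(s).
function toUtf16Escapes(codePoint) {
  if (codePoint <= 0xFFFF) {
    // Basic Multilingual Plane: one code unit, padded to four hex digits.
    return "\\u" + codePoint.toString(16).toUpperCase().padStart(4, "0");
  }
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);  // top 10 bits go into the high surrogate
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits go into the low surrogate
  return [high, low]
    .map((unit) => "\\u" + unit.toString(16).toUpperCase())
    .join("");
}

// The seven code points of the emoji discussed above.
const codePoints = [0x1F469, 0x1F3FC, 0x200D, 0x1F91D, 0x200D, 0x1F468, 0x1F3FD];
const escapes = codePoints.map(toUtf16Escapes).join("");
console.log(escapes);
```

The printed escapes should be the same ones used in the second string above, which is a quick way to check a hand conversion.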
- Taken together, the seven code points form a string that has the property ‘RGI_Emoji_ZWJ_Sequence’, and you can match that whole string in a regular expression with \p{RGI_Emoji_ZWJ_Sequence}
let pattern = /\p{RGI_Emoji_ZWJ_Sequence}/v;

// Using the Unicode code points as is with \u{}
let targetString = "\u{1F469}\u{1F3FC}\u{200D}\u{1F91D}\u{200D}\u{1F468}\u{1F3FD}";
let result = targetString.match(pattern);
console.log(result[0]);

// Using the Unicode code points after converting them to UTF-16 with \u
targetString = "\uD83D\uDC69\uD83C\uDFFC\u200D\uD83E\uDD1D\u200D\uD83D\uDC68\uD83C\uDFFD";
result = targetString.match(pattern);
console.log(result[0]);
- Note that all of the codepoints in the above strings were matched without any repetition specifiers like *, + or {}
- This is what is meant by Properties of Strings: a whole string of code points is matched by a single property, without repetition specifiers, because the property is defined on the sequence itself. Here the sequence also forms a grapheme cluster, that is, a string of code points that forms a single user-perceived character
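To make the three ways of counting concrete, the sketch below measures the same emoji in UTF-16 code units, in code points and in grapheme clusters. It assumes a runtime that ships `Intl.Segmenter` with reasonably recent Unicode data (for example a current Node.js or browser); the variable names are only illustrative.

```javascript
// The couple-holding-hands emoji, written with code point escapes.
const couple = "\u{1F469}\u{1F3FC}\u{200D}\u{1F91D}\u{200D}\u{1F468}\u{1F3FD}";

// Intl.Segmenter splits a string into user-perceived characters.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = Array.from(segmenter.segment(couple), (part) => part.segment);

const codeUnitCount = couple.length;       // String.length counts UTF-16 code units
const codePointCount = [...couple].length; // string iteration yields code points
const graphemeCount = graphemes.length;    // segments are grapheme clusters

console.log(codeUnitCount, codePointCount, graphemeCount);
```

With up-to-date Unicode data this should report 12 code units, 7 code points and a single grapheme cluster.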
- You will need to use the 'v' flag to match properties of strings
- You can match multiple grapheme clusters by using repetition characters. The variable 'targetString' in the code below holds a string of 11 UTF-16 code units. In that string are encoded three emojis, 🥷🏾, ✌🏽 and 🫸🏿, which are made up of 4, 3 and 4 UTF-16 code units respectively
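// The + quantifier lets a single match span consecutive modifier sequences
// (with * the loop would also report an empty match at the end of the string)
let wordPttn = /\p{RGI_Emoji_Modifier_Sequence}+/vg;
let targetString = "\uD83E\uDD77\uD83C\uDFFE\u270C\uD83C\uDFFD\uD83E\uDEF8\uD83C\uDFFF";
//console.log(targetString);
let results = targetString.matchAll(wordPttn);
for (let result of results) {
  console.log(result[0]);
}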
- There are three emojis in the above string
- A ninja 🥷🏾, which is made up of the two code points U+1F977 and U+1F3FE, or 4 UTF-16 code units ‘\uD83E\uDD77\uD83C\uDFFE’
- A victory sign ✌🏽, which is made up of the two code points U+270C and U+1F3FD, or 3 UTF-16 code units ‘\u270C\uD83C\uDFFD’
- A rightwards pushing hand 🫸🏿, which is made up of the two code points U+1FAF8 and U+1F3FF, or 4 UTF-16 code units ‘\uD83E\uDEF8\uD83C\uDFFF’
- Note that the regex engine knows how to group the 11 UTF-16 code units ‘\uD83E\uDD77\uD83C\uDFFE\u270C\uD83C\uDFFD\uD83E\uDEF8\uD83C\uDFFF’ into the correct grapheme clusters of 4, 3 and 4 code units
- Properties of Strings are binary properties: a given string either has the property or it does not
- There is no negation operator '\P' for properties of strings
- The following are the primary “Properties of Strings” defined in the Unicode Standard
- Basic_Emoji
- Emoji_Keycap_Sequence
- RGI_Emoji
- RGI_Emoji_Flag_Sequence
- RGI_Emoji_Modifier_Sequence
- RGI_Emoji_Tag_Sequence
- RGI_Emoji_ZWJ_Sequence
