Programmer Lite

A LITTLE MORE THAN BASIC…

Unicode Properties Of Strings

Apr 25

I have a primer on Unicode here
I have a post on more basic Unicode character classes here
Properties of strings are Unicode properties that apply to a sequence of code points rather than to an individual code point.
The emoji 👩🏼‍🤝‍👨🏽 is made up of seven Unicode code points 👩 (U+1F469), 🏼 (U+1F3FC), ‍ (U+200D), 🤝 (U+1F91D), ‍ (U+200D), 👨 (U+1F468) and 🏽 (U+1F3FD)
In JavaScript you can write them either as code point escapes with \u{}, or as UTF-16 code units with \u (a code point above U+FFFF becomes a surrogate pair of two \u escapes).

			
// Using the Unicode code points as-is with \u{}
let targetString = "\u{1F469}\u{1F3FC}\u{200D}\u{1F91D}\u{200D}\u{1F468}\u{1F3FD}";
console.log(targetString);

// Using the same code points converted to UTF-16 code units with \u
targetString = "\uD83D\uDC69\uD83C\uDFFC\u200D\uD83E\uDD1D\u200D\uD83D\uDC68\uD83C\uDFFD";
console.log(targetString);
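
To make the relationship between the two notations concrete, here is a minimal sketch that checks the two literals are the same string and counts code units versus code points. The variable names are illustrative only and not part of the original examples.

```javascript
// The same emoji written with code point escapes and with UTF-16 code unit escapes.
const viaCodePoints = "\u{1F469}\u{1F3FC}\u{200D}\u{1F91D}\u{200D}\u{1F468}\u{1F3FD}";
const viaCodeUnits = "\uD83D\uDC69\uD83C\uDFFC\u200D\uD83E\uDD1D\u200D\uD83D\uDC68\uD83C\uDFFD";

// Both literals should produce exactly the same string.
const sameString = viaCodePoints === viaCodeUnits;

// .length counts UTF-16 code units; iterating a string yields whole code points.
const codeUnitCount = viaCodePoints.length;
const codePointList = Array.from(viaCodePoints, (ch) =>
  "U+" + ch.codePointAt(0).toString(16).toUpperCase()
);

console.log(sameString, codeUnitCount, codePointList.length);
console.log(codePointList.join(" "));
```

The string has 12 UTF-16 code units but only 7 code points, because five of the seven code points lie above U+FFFF and need a surrogate pair each.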

		

Taken together, these seven code points form a sequence that has the property ‘RGI_Emoji_ZWJ_Sequence’ (the individual code points do not have it), and you can match the whole string of seven code points in a regular expression with \p{RGI_Emoji_ZWJ_Sequence}

			
// The 'v' flag is required for properties of strings
let pattern = /\p{RGI_Emoji_ZWJ_Sequence}/v;

// Using the Unicode code points as-is with \u{}
let targetString = "\u{1F469}\u{1F3FC}\u{200D}\u{1F91D}\u{200D}\u{1F468}\u{1F3FD}";
let result = targetString.match(pattern);
console.log(result[0]);

// Using the same code points converted to UTF-16 code units with \u
targetString = "\uD83D\uDC69\uD83C\uDFFC\u200D\uD83E\uDD1D\u200D\uD83D\uDC68\uD83C\uDFFD";
result = targetString.match(pattern);
console.log(result[0]);

		

Note that all of the code points in the above strings were matched without any repetition specifiers like *, + or {}.
This is what is meant by properties of strings: the property is defined for the whole sequence of code points, so a single \p{} escape matches the entire sequence without repetition specifiers. Here the sequence also forms one grapheme cluster, that is, a string of code points that form a single user-perceived character.
You need the 'v' flag to match properties of strings; with only the 'u' flag the same pattern is a syntax error.
You can match several consecutive sequences by adding a repetition specifier. The variable 'targetString' in the code below holds a string of 11 UTF-16 code units. That string encodes three emojis, 🥷🏾, ✌🏽 and 🫸🏿, which are made up of 4, 3 and 4 UTF-16 code units respectively

			
// '+' matches a run of one or more modifier sequences.
// ('*' would also report an empty match at the end of the string.)
let wordPttn = /\p{RGI_Emoji_Modifier_Sequence}+/vg;

let targetString = "\uD83E\uDD77\uD83C\uDFFE\u270C\uD83C\uDFFD\uD83E\uDEF8\uD83C\uDFFF";
//console.log(targetString);
let results = targetString.matchAll(wordPttn);
for (let result of results) {
  console.log(result[0]);
}

		

There are three emojis in the above string:
- A ninja 🥷🏾, which is made up of the two code points U+1F977 and U+1F3FE, or 4 UTF-16 code units ‘\uD83E\uDD77\uD83C\uDFFE’
- A victory sign ✌🏽, which is made up of the two code points U+270C and U+1F3FD, or 3 UTF-16 code units ‘\u270C\uD83C\uDFFD’
- A rightward pushing hand 🫸🏿, which is made up of the two code points U+1FAF8 and U+1F3FF, or 4 UTF-16 code units ‘\uD83E\uDEF8\uD83C\uDFFF’

Note that the regex engine knows how to group the 11 UTF-16 code units ‘\uD83E\uDD77\uD83C\uDFFE\u270C\uD83C\uDFFD\uD83E\uDEF8\uD83C\uDFFF’ into the correct sequences of 4, 3 and 4 UTF-16 code units

Properties of strings are binary properties: a string either has the property or it does not

There is no negated form \P{} for properties of strings, and they cannot be used inside a negated character class [^ ] either

The following are the primary “Properties of Strings”, defined by Unicode in UTS #51 (Unicode Emoji)
- Basic_Emoji
- Emoji_Keycap_Sequence
- RGI_Emoji
- RGI_Emoji_Flag_Sequence
- RGI_Emoji_Modifier_Sequence
- RGI_Emoji_Tag_Sequence
- RGI_Emoji_ZWJ_Sequence

You can get an extensive list of codepoints with properties of strings here and here
