Unicode Character Classes In JavaScript Regex

,

·

·

  • If you need to understand what Unicode is, I have a little primer here
  • Many of the character class escape sequences like '\w' do not work with Unicode characters yet. They are limited to ASCII characters only

Unicode Character Classes

  • You can define your own Unicode character classes using the Unicode escape sequence \u
  • Just like you would define that the character class [a-p] would match any lower case alphabet from a to p, you can define a character class that matches any Devanagari vowel with Unicode codepoints like [\u0905-\u0914]
  • [\u00E0] matches the French character à
  • [\u00E0\U00E2] would match either of the French characters à and â
  • [\u00E0\U00E2\u00E6-\u00EB] would match any one of the French characters à, â, æ, ç, è, é, ê, ë. Note that this character class definition is a mix of individual codepoints and a range. Also there are several more French characters that included in the above character class
  let targetString = `Je m’appelle Jessica. Je suis une 
  fille, je suis française et j’ai treize ans. Je vais à 
  l’école à Nice,   mais j’habite à Cagnes-Sur-Mer. J’ai 
  deux frères. Le premier s’appelle Thomas, il a quatorze 
  ans. Le second s’appelle Yann et il a neuf ans.`

  let frenchWord = /[\w\u2019\u00E0\u00E2\u00E6-\u00EB]+/g;
    
  //Print all french words
  results = targetString.matchAll(frenchWord);
  
  console.log("French Words");
  for (let result of results) {
    console.log(`${result[0]}`);
  } 
  • The Greek unicode block is in the range \u0370-\u03FF and Greek extended is \u1F00-\u1FFF. You can create a character class [\u0370-\u03FF\u1F00-\u1FFF] to match Greek letters

Character Classes With \p

  • You can use the Unicode property escape sequence '\p' and its negation '\P' to create character classes that match unicode codepoints by their properties. Ex: \p{General_Category=Letter}. You can use the property name of 'General_Category' or 'gc' and property value of 'Letter' to match a letter in any script. You must use either the flag 'u' or 'v' to use this escape sequence. The flag 'v' supercedes 'u'. Below are lower case Latin alphabets, Upper case Latin alphabets, Devanagari alphabet, lower case Greek alphabet and upper case Greek alphabet. You can match all of them with the regular expression /\p{General_Category=Letter}/v
//Find letter in any script
let letter =
/\p{General_Category=Letter}/gv;
let targetString = `a b c d e f g h i j k l m n o p q r
s t u v w x y z,
A, B, C, D, E, F, G, H, I, J, K, L, M,
N, O, P, Q, R, S, T, U, V, W, X, Y, Z,
अ, आ, इ, ई, उ, ऊ, ऋ, ए, ऐ, ओ, औ, अं, अः,
क, ख, ग, घ, ङ, च, छ,
ज, झ, ञ, ट, ठ, ड, ढ, ण, त, थ, द, ध, न, प, फ, ब,
भ, म, य, र, ल, व, श, ष, स, ह, क्ष, त्, ज्ञ, श्र,
α, β, γ, δ, ε, ζ, η, θ, ι, κ, λ, μ, ν, ξ,
ο, π, ρ, σ ς, τ, υ, φ, χ, ψ, ω,
Α, Β, Γ, Δ, Ε, Ζ, Η, Θ, Ι, Κ, Λ, Μ, Ν, Ξ,
Ο, Π, Ρ, Σ, Τ, Υ, Φ, Χ, Ψ, Ω`;
let results = targetString.matchAll(letter);
console.log("All Characters");
for (let result of results) {
console.log(`${result[0]}`);
}
  • ‘gc’ is an alias for General_Category. The above character class be written as \p{gc=Letter}
  • If the property name is 'General_Category' (or 'gc'), you can omit the property name /\p{Letter}/v
  • 'L' is a short form for 'Letter'. So \p{L} is the same as \p{Letter}. So the regular expression to match letters would be /\p{L}/v
  • You can use \P{L} (note the uppercase P) to match anything that is NOT a letter in any script
//Find non letter in any script
let letter =
/\P{L}/gv;
//targetstring contains Letters and Numbers from various scripts
let targetString = `a, X, Z, अ, आ, इ,
ρ, σ ς, τ, Ⅰ, Ⅱ, ८, ९, 1, 2`;
let results = targetString.matchAll(letter);
console.log("Not Letters");
for (let result of results) {
console.log(`${result[0]}`);
}
  • To match only uppercase letters use 'Uppercase_Letter' or its short form 'Lu'. Similarly for lowercase letters it will be 'Lowercase_Letter' or 'Ll'. NOTE: Not all scripts have cased letters
//Find Upper cased letters
/*
Will match upper case letters in Latin script
and Greek script.
Will not match any Devanagari letters because they have
no case.
*/
let upperCaseLetter =
/\p{Lu}/gv;
let targetString = `a b c d e f g h i j k l m n o p q r
s t u v w x y z,
A, B, C, D, E, F, G, H, I, J, K, L, M,
N, O, P, Q, R, S, T, U, V, W, X, Y, Z,
अ, आ, इ, ई, उ, ऊ, ऋ, ए, ऐ, ओ, औ, अं, अः,
क, ख, ग, घ, ङ, च, छ,
ज, झ, ञ, ट, ठ, ड, ढ, ण, त, थ, द, ध, न, प, फ, ब,
भ, म, य, र, ल, व, श, ष, स, ह, क्ष, त्, ज्ञ, श्र,
α, β, γ, δ, ε, ζ, η, θ, ι, κ, λ, μ, ν, ξ,
ο, π, ρ, σ ς, τ, υ, φ, χ, ψ, ω,
Α, Β, Γ, Δ, Ε, Ζ, Η, Θ, Ι, Κ, Λ, Μ, Ν, Ξ,
Ο, Π, Ρ, Σ, Τ, Υ, Φ, Χ, Ψ, Ω`;
let results = targetString.matchAll(upperCaseLetter);
console.log("Uppercase Characters");
for (let result of results) {
console.log(`${result[0]}`);
}
  • Note that upper case letters have the property values ‘L’ and ‘Lu’ for the same property name ‘gc’ which means there can be multiple values for the same property name
  • Numbers have the property 'Number' or 'N'. \p{N} matches any number. Below are Roman numerals, Devanagari numbers and Hindu Arabic numbers. \p{N} matches all of them
//Find Numbers
let number = /\p{N}/gv;
let targetString = `Ⅰ, Ⅱ, Ⅲ, Ⅳ, Ⅴ, Ⅵ,
०, १, २, ३, ४, ५, ६, ७, ८, ९,
1, 2, 3, 4, 5, 6, 7, 8, 9, 0`;
let results = targetString.matchAll(number);
console.log("Numbers");
for (let result of results) {
console.log(`${result[0]}`);
}
  • Decimal numbers have the property 'Decimal_Number' or 'Nd'. \p{Nd} matches only decimal numbers
//Find Decimal Numbers
let decimalNumber = /\p{Nd}/gv;
let targetString = `Ⅰ, Ⅱ, Ⅲ, Ⅳ, Ⅴ, Ⅵ,
०, १, २, ३, ४, ५, ६, ७, ८, ९,
1, 2, 3, 4, 5, 6, 7, 8, 9, 0`;
let results = targetString.matchAll(decimalNumber);
console.log("Decimal Numbers");
for (let result of results) {
console.log(`${result[0]}`);
}
  • You can match a script using \p{Script=<script_name>}. To match Greek letters use \p{Script=Greek}
//Find Greek Letters
let greekLetter = /\p{Script=Greek}/gv;
let targetString = `a b c d e f g h i j k l m n o p q r
s t u v w x y z,
A, B, C, D, E, F, G, H, I, J, K, L, M,
N, O, P, Q, R, S, T, U, V, W, X, Y, Z,
अ, आ, इ, ई, उ, ऊ, ऋ, ए, ऐ, ओ, औ, अं, अः,
क, ख, ग, घ, ङ, च, छ,
ज, झ, ञ, ट, ठ, ड, ढ, ण, त, थ, द, ध, न, प, फ, ब,
भ, म, य, र, ल, व, श, ष, स, ह, क्ष, त्, ज्ञ, श्र,
α, β, γ, δ, ε, ζ, η, θ, ι, κ, λ, μ, ν, ξ,
ο, π, ρ, σ ς, τ, υ, φ, χ, ψ, ω,
Α, Β, Γ, Δ, Ε, Ζ, Η, Θ, Ι, Κ, Λ, Μ, Ν, Ξ,
Ο, Π, Ρ, Σ, Τ, Υ, Φ, Χ, Ψ, Ω`;
let results = targetString.matchAll(greekLetter);
console.log("Greek Letters");
for (let result of results) {
console.log(`${result[0]}`);
}
  • You can use set operations like union, intersection, subtraction and negation on Unicode character classes. You will need the ‘v’ flag to use intersection and subtraction. To Find uppercase or greek letters Find Uppercase or Greek letters use /[\p{Lu}\p{Script=Greek}]/v
  • You can use intersection to find Greek Uppercase letters
    /[\p{Lu}&&\p{Script=Greek}]/v
  • Some characters are a combination of a Letter followed by one or more Marks eg: अं, अः To match them you can use a pattern like
    /\p{Letter}\p{Mark}*/v
  • You can use repetition specifiers with Unicode character classes to match words. To match words use /\p{Letter}+/v. NOTE: This regex will not match the Devanagari words because Devanagari characters may be made up of a Letter and one or more Marks
let wordPttn = /\p{Letter}+/gv;
let targetString = `Je m’appelle Jessica.
Je suis une fille, je suis française et
j’ai treize ans. Je vais à l’école à Nice, mais
j’habite à Cagnes-Sur-Mer.
ΌΠιθανή επιστροφή για συνομιλίες ΗΠΑ και Ιράν στο
Ισλαμαμπάντ. Σε ισχύ ο ναυτικός αποκλεισμός των
ΗΠΑ στα λιμάνια του Ιράν.
हालाँकि सूर के जीवन के बारे में कई जनश्रुतियाँ प्रचलित हैं,
पर इन में कितनी सच्चाई है यह कहना कठिन है। कहा जाता है
उनका जन्म सन् १४७८ में दिल्ली के पास एक ग़रीब ब्राह्मीण
परिवार में हुआ.
Albeit there are many folk lores about the life
of sur das. It is said that he was born in 1478
near Delhi in a poor Brahmin's house.`;
let results = targetString.matchAll(wordPttn);
for (let result of results) {
console.log(result[0]);
}
  • As stated above /\p{Letter}+/gv will not match characters (for example Devanagari characters) that are a combination of a Letter and one or more Marks. You have to modify it to /(\p{Letter}\p{Mark}*)+/gv to match such characters
let wordPttn = /(\p{Letter}\p{Mark}*)+/gv;
let targetString = `Je m’appelle Jessica.
Je suis une fille, je suis française et
j’ai treize ans. Je vais à l’école à Nice, mais
j’habite à Cagnes-Sur-Mer.
ΌΠιθανή επιστροφή για συνομιλίες ΗΠΑ και Ιράν στο
Ισλαμαμπάντ. Σε ισχύ ο ναυτικός αποκλεισμός των
ΗΠΑ στα λιμάνια του Ιράν.
हालाँकि सूर के जीवन के बारे में कई जनश्रुतियाँ प्रचलित हैं,
पर इन में कितनी सच्चाई है यह कहना कठिन है। कहा जाता है
उनका जन्म सन् १४७८ में दिल्ली के पास एक ग़रीब ब्राह्मीण
परिवार में हुआ.
Albeit there are many folk lores about the life
of sur das. It is said that he was born in 1478
near Delhi in a poor Brahmin's house.`;
let results = targetString.matchAll(wordPttn);
for (let result of results) {
console.log(result[0]);
}

Leave a Reply

Discover more from Pretentious Programmer

Subscribe now to keep reading and get access to the full archive.

Continue reading