- If you need to understand what Unicode is, I have a little primer here
- Many of the character class escape sequences, like '\w', do not work with non-ASCII Unicode characters yet. They are limited to ASCII characters only
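To see the ASCII-only limitation in action, here is a minimal sketch. The sample string and the variable names are my own, chosen only for illustration; the accented letters are written as \u escapes so there is no doubt about which codepoints are in the string.

```javascript
// "française école", with ç (U+00E7) and é (U+00E9) written as escapes
const frenchSample = "fran\u00E7aise \u00E9cole";

// \w is shorthand for [A-Za-z0-9_], so accented letters end a match
const asciiWords = frenchSample.match(/\w+/g);

console.log(asciiWords); // [ 'fran', 'aise', 'cole' ]
```

Each word is cut at the accented letter, which is why the rest of this post builds character classes from codepoints and Unicode properties instead.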
Unicode Character Classes
- You can define your own Unicode character classes using the Unicode escape sequence \u
- Just like the character class [a-p] matches any lowercase letter from a to p, you can define a character class that matches any Devanagari vowel using Unicode codepoints: [\u0905-\u0914]
- [\u00E0] matches the French character à
- [\u00E0\u00E2] would match either of the French characters à and â
- [\u00E0\u00E2\u00E6-\u00EB] would match any one of the French characters à, â, æ, ç, è, é, ê, ë. Note that this character class definition is a mix of individual codepoints and a range. Also, there are several more French characters that are not included in the above character class
let targetString = `Je m’appelle Jessica. Je suis une
fille, je suis française et j’ai treize ans. Je vais à
l’école à Nice, mais j’habite à Cagnes-Sur-Mer. J’ai
deux frères. Le premier s’appelle Thomas, il a quatorze
ans. Le second s’appelle Yann et il a neuf ans.`
let frenchWord = /[\w\u2019\u00E0\u00E2\u00E6-\u00EB]+/g;
// Print all French words
// (results is declared with let so the snippet also runs in strict mode)
let results = targetString.matchAll(frenchWord);
console.log("French Words");
for (let result of results) {
  console.log(`${result[0]}`);
}
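The Devanagari vowel class mentioned earlier can be tried the same way. This is a small sketch with a made-up test string (the names devanagariVowel, vowelSample and vowelsFound are hypothetical). Note that क (U+0915) and ख (U+0916) sit just outside the \u0905-\u0914 range, so they should not match.

```javascript
// Independent Devanagari vowels from अ (U+0905) to औ (U+0914)
const devanagariVowel = /[\u0905-\u0914]/g;

// अ, आ, क, ख, इ followed by two Latin letters
const vowelSample = "\u0905, \u0906, \u0915, \u0916, \u0907, a, e";

const vowelsFound = vowelSample.match(devanagariVowel);
console.log(vowelsFound); // [ 'अ', 'आ', 'इ' ]
```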
- The Greek Unicode block is in the range \u0370-\u03FF and Greek Extended is \u1F00-\u1FFF. You can create a character class [\u0370-\u03FF\u1F00-\u1FFF] to match Greek letters
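Here is a rough sketch of the Greek block ranges at work. The test string is my own and is written with escapes so the exact codepoints are unambiguous. One caveat: a block range matches every codepoint in the block, including the few that are not letters (for example U+037E, the Greek question mark), so treat this as "characters from the Greek blocks" rather than strictly "Greek letters".

```javascript
// One or more characters from the Greek and Greek Extended blocks
const greekRun = /[\u0370-\u03FF\u1F00-\u1FFF]+/g;

// "alpha α, beta β, ἀρχή, Ωmega"
const greekSample =
  "alpha \u03B1, beta \u03B2, \u1F00\u03C1\u03C7\u03AE, \u03A9mega";

const greekRuns = greekSample.match(greekRun);
console.log(greekRuns); // [ 'α', 'β', 'ἀρχή', 'Ω' ]
```

The third match starts with ἀ (U+1F00), which lives in Greek Extended, so both ranges are needed to capture the whole word.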
Character Classes With \p
- You can use the Unicode property escape sequence '\p' and its negation '\P' to create character classes that match Unicode codepoints by their properties. Ex: \p{General_Category=Letter}. You can use the property name 'General_Category' or 'gc' and the property value 'Letter' to match a letter in any script. You must use either the flag 'u' or 'v' to use this escape sequence. The flag 'v' supersedes 'u', and the two cannot be combined. Below are lowercase Latin letters, uppercase Latin letters, Devanagari letters, lowercase Greek letters and uppercase Greek letters. You can match all of them with the regular expression /\p{General_Category=Letter}/v
// Find a letter in any script
let letter = /\p{General_Category=Letter}/gv;
let targetString = `a b c d e f g h i j k l m n o p q r s t u v w x y z, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, अ, आ, इ, ई, उ, ऊ, ऋ, ए, ऐ, ओ, औ, अं, अः, क, ख, ग, घ, ङ, च, छ, ज, झ, ञ, ट, ठ, ड, ढ, ण, त, थ, द, ध, न, प, फ, ब, भ, म, य, र, ल, व, श, ष, स, ह, क्ष, त्, ज्ञ, श्र, α, β, γ, δ, ε, ζ, η, θ, ι, κ, λ, μ, ν, ξ, ο, π, ρ, σ ς, τ, υ, φ, χ, ψ, ω, Α, Β, Γ, Δ, Ε, Ζ, Η, Θ, Ι, Κ, Λ, Μ, Ν, Ξ, Ο, Π, Ρ, Σ, Τ, Υ, Φ, Χ, Ψ, Ω`;
let results = targetString.matchAll(letter);
console.log("All Characters");
for (let result of results) {
  console.log(`${result[0]}`);
}
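It is easy to forget the flag, and the engine will not warn you. The sketch below (variable names are mine) shows what I would expect to happen: without 'u' or 'v', '\p' is treated as an escaped letter p and '{L}' as literal text, so the pattern silently looks for the string "p{L}". I use the 'u' flag for the working version because it is enough for a plain property escape.

```javascript
const withoutFlag = /\p{L}/;  // no 'u' or 'v': matches the literal text "p{L}"
const withFlag = /\p{L}/u;    // property escape: matches any letter

const flagDemo = {
  letterWithoutFlag: withoutFlag.test("\u00E9"), // é -> false
  literalWithoutFlag: withoutFlag.test("p{L}"),  // -> true
  letterWithFlag: withFlag.test("\u00E9"),       // é -> true
};
console.log(flagDemo);
```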
- 'gc' is an alias for General_Category. The above character class can be written as \p{gc=Letter}
- If the property name is 'General_Category' (or 'gc'), you can omit the property name: /\p{Letter}/v
- 'L' is a short form for 'Letter', so \p{L} is the same as \p{Letter}. The regular expression to match letters would be /\p{L}/v
- You can use \P{L} (note the uppercase P) to match anything that is NOT a letter in any script
// Find a non-letter in any script
let letter = /\P{L}/gv;
// targetString contains letters and numbers from various scripts
let targetString = `a, X, Z, अ, आ, इ, ρ, σ ς, τ, Ⅰ, Ⅱ, ८, ९, 1, 2`;
let results = targetString.matchAll(letter);
console.log("Not Letters");
for (let result of results) {
  console.log(`${result[0]}`);
}
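If you want to convince yourself that the long form, the alias and the short forms really are interchangeable, a quick check like the one below can help. It is only a sketch: the probe characters and the names letterForms, probes and formReport are my own choices.

```javascript
// Four spellings of the same property escape
const letterForms = [
  /\p{General_Category=Letter}/u,
  /\p{gc=Letter}/u,
  /\p{Letter}/u,
  /\p{L}/u,
];
const notLetter = /\P{L}/u;

// Probes: Latin a and Z, Devanagari अ, Greek ω, a digit and a comma
const probes = ["a", "Z", "\u0905", "\u03C9", "1", ","];

const formReport = probes.map((ch) => {
  const answers = letterForms.map((re) => re.test(ch));
  return {
    ch,
    isLetter: answers[0],
    formsAgree: answers.every((a) => a === answers[0]),
    negationIsOpposite: notLetter.test(ch) === !answers[0],
  };
});
console.table(formReport);
```

Every row should show formsAgree and negationIsOpposite as true, with isLetter true for the first four probes only.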
- To match only uppercase letters use 'Uppercase_Letter' or its short form 'Lu'. Similarly, for lowercase letters it will be 'Lowercase_Letter' or 'Ll'. NOTE: Not all scripts have cased letters
// Find uppercase letters
/* Will match uppercase letters in Latin script and Greek script.
   Will not match any Devanagari letters because they have no case. */
let upperCaseLetter = /\p{Lu}/gv;
let targetString = `a b c d e f g h i j k l m n o p q r s t u v w x y z, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, अ, आ, इ, ई, उ, ऊ, ऋ, ए, ऐ, ओ, औ, अं, अः, क, ख, ग, घ, ङ, च, छ, ज, झ, ञ, ट, ठ, ड, ढ, ण, त, थ, द, ध, न, प, फ, ब, भ, म, य, र, ल, व, श, ष, स, ह, क्ष, त्, ज्ञ, श्र, α, β, γ, δ, ε, ζ, η, θ, ι, κ, λ, μ, ν, ξ, ο, π, ρ, σ ς, τ, υ, φ, χ, ψ, ω, Α, Β, Γ, Δ, Ε, Ζ, Η, Θ, Ι, Κ, Λ, Μ, Ν, Ξ, Ο, Π, Ρ, Σ, Τ, Υ, Φ, Χ, Ψ, Ω`;
let results = targetString.matchAll(upperCaseLetter);
console.log("Uppercase Characters");
for (let result of results) {
  console.log(`${result[0]}`);
}
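To make the "not all scripts have cased letters" point concrete, here is a small hypothetical helper, letterCase, that I made up for this post. It sorts a single character into uppercase, lowercase, a letter that is neither, or not a letter at all.

```javascript
// Hypothetical helper: classify one character by its letter category
function letterCase(ch) {
  if (/^\p{Lu}$/u.test(ch)) return "upper";
  if (/^\p{Ll}$/u.test(ch)) return "lower";
  if (/^\p{L}$/u.test(ch)) return "letter without upper/lower case";
  return "not a letter";
}

// Latin A and a, Greek Σ and σ, Devanagari अ, Devanagari digit १
const caseSamples = ["A", "a", "\u03A3", "\u03C3", "\u0905", "\u0967"];
const caseResults = caseSamples.map((ch) => letterCase(ch));
console.log(caseResults);
```

The Devanagari letter lands in the third bucket: it is a letter, but neither \p{Lu} nor \p{Ll} matches it.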
- Note that uppercase letters match both \p{L} and \p{Lu}. Each codepoint has a single General_Category value (here 'Lu'), but 'L' is a group value that covers all the letter categories, including 'Lu' and 'Ll'
- Numbers have the property value 'Number' or 'N'. \p{N} matches any number. Below are Roman numerals, Devanagari digits and Hindu-Arabic digits. \p{N} matches all of them
// Find numbers
let number = /\p{N}/gv;
let targetString = `Ⅰ, Ⅱ, Ⅲ, Ⅳ, Ⅴ, Ⅵ, ०, १, २, ३, ४, ५, ६, ७, ८, ९, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0`;
let results = targetString.matchAll(number);
console.log("Numbers");
for (let result of results) {
  console.log(`${result[0]}`);
}
- Decimal numbers have the property value 'Decimal_Number' or 'Nd'. \p{Nd} matches only decimal digits, so it skips the Roman numerals, which belong to the category 'Letter_Number' ('Nl')
// Find decimal numbers
let decimalNumber = /\p{Nd}/gv;
let targetString = `Ⅰ, Ⅱ, Ⅲ, Ⅳ, Ⅴ, Ⅵ, ०, १, २, ३, ४, ५, ६, ७, ८, ९, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0`;
let results = targetString.matchAll(decimalNumber);
console.log("Decimal Numbers");
for (let result of results) {
  console.log(`${result[0]}`);
}
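The examples above match one digit at a time. If you add a '+' you can pull out whole numbers regardless of script. The sketch below uses a sentence I made up, with the year 1478 in both Hindu-Arabic and Devanagari digits plus a Roman numeral, and compares \p{Nd}+ with \p{N}+.

```javascript
// "born in 1478 (१४७८), chapter Ⅳ"
const numberSample =
  "born in 1478 (\u0967\u096A\u096D\u096E), chapter \u2163";

const digitRuns = numberSample.match(/\p{Nd}+/gu);  // decimal digits only
const numberRuns = numberSample.match(/\p{N}+/gu);  // any kind of number

console.log(digitRuns);  // [ '1478', '१४७८' ]
console.log(numberRuns); // [ '1478', '१४७८', 'Ⅳ' ]
```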
- You can match a script using \p{Script=<script_name>}. To match Greek letters use \p{Script=Greek}
// Find Greek letters
let greekLetter = /\p{Script=Greek}/gv;
let targetString = `a b c d e f g h i j k l m n o p q r s t u v w x y z, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, अ, आ, इ, ई, उ, ऊ, ऋ, ए, ऐ, ओ, औ, अं, अः, क, ख, ग, घ, ङ, च, छ, ज, झ, ञ, ट, ठ, ड, ढ, ण, त, थ, द, ध, न, प, फ, ब, भ, म, य, र, ल, व, श, ष, स, ह, क्ष, त्, ज्ञ, श्र, α, β, γ, δ, ε, ζ, η, θ, ι, κ, λ, μ, ν, ξ, ο, π, ρ, σ ς, τ, υ, φ, χ, ψ, ω, Α, Β, Γ, Δ, Ε, Ζ, Η, Θ, Ι, Κ, Λ, Μ, Ν, Ξ, Ο, Π, Ρ, Σ, Τ, Υ, Φ, Χ, Ψ, Ω`;
let results = targetString.matchAll(greekLetter);
console.log("Greek Letters");
for (let result of results) {
  console.log(`${result[0]}`);
}
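The same Script property works for other scripts too. As a sketch, here is a hypothetical scriptOf helper (my own name, not a built-in) that checks a character against a small table of scripts. Only three scripts are listed, so anything else, including digits and punctuation shared across scripts, falls through to "other".

```javascript
// Small lookup table of script name -> property escape
const scriptPatterns = {
  Latin: /\p{Script=Latin}/u,
  Greek: /\p{Script=Greek}/u,
  Devanagari: /\p{Script=Devanagari}/u,
};

// Hypothetical helper: return the first script in the table that matches
function scriptOf(ch) {
  for (const [name, pattern] of Object.entries(scriptPatterns)) {
    if (pattern.test(ch)) return name;
  }
  return "other";
}

// Latin a, Greek ω, Devanagari क, an ASCII digit
const scriptResults = ["a", "\u03C9", "\u0915", "1"].map((ch) => scriptOf(ch));
console.log(scriptResults); // [ 'Latin', 'Greek', 'Devanagari', 'other' ]
```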
- You can use set operations like union, intersection, subtraction and negation on Unicode character classes. You will need the 'v' flag to use intersection and subtraction. To find uppercase or Greek letters use /[\p{Lu}\p{Script=Greek}]/v
- You can use intersection to find Greek uppercase letters: /[\p{Lu}&&\p{Script=Greek}]/v
- Some characters are a combination of a Letter followed by one or more Marks, e.g. अं, अः. To match them you can use a pattern like /\p{Letter}\p{Mark}*/v
- You can use repetition specifiers with Unicode character classes to match words. To match words use /\p{Letter}+/v. NOTE: This regex will not match the Devanagari words in full, because Devanagari characters may be made up of a Letter and one or more Marks
let wordPttn = /\p{Letter}+/gv;
let targetString = `Je m’appelle Jessica. Je suis une fille, je suis française et j’ai treize ans. Je vais à l’école à Nice, mais j’habite à Cagnes-Sur-Mer. ΌΠιθανή επιστροφή για συνομιλίες ΗΠΑ και Ιράν στο Ισλαμαμπάντ. Σε ισχύ ο ναυτικός αποκλεισμός των ΗΠΑ στα λιμάνια του Ιράν.हालाँकि सूर के जीवन के बारे में कई जनश्रुतियाँ प्रचलित हैं, पर इन में कितनी सच्चाई है यह कहना कठिन है। कहा जाता है उनका जन्म सन् १४७८ में दिल्ली के पास एक ग़रीब ब्राह्मीण परिवार में हुआ.Albeit there are many folk lores about the life of sur das. It is said that he was born in 1478 near Delhi in a poor Brahmin's house.`;
let results = targetString.matchAll(wordPttn);
for (let result of results) {
  console.log(result[0]);
}
- As stated above, /\p{Letter}+/gv will not match characters (for example Devanagari characters) that are a combination of a Letter and one or more Marks. You have to modify it to /(\p{Letter}\p{Mark}*)+/gv to match such characters
let wordPttn = /(\p{Letter}\p{Mark}*)+/gv;
let targetString = `Je m’appelle Jessica. Je suis une fille, je suis française et j’ai treize ans. Je vais à l’école à Nice, mais j’habite à Cagnes-Sur-Mer. ΌΠιθανή επιστροφή για συνομιλίες ΗΠΑ και Ιράν στο Ισλαμαμπάντ. Σε ισχύ ο ναυτικός αποκλεισμός των ΗΠΑ στα λιμάνια του Ιράν.हालाँकि सूर के जीवन के बारे में कई जनश्रुतियाँ प्रचलित हैं, पर इन में कितनी सच्चाई है यह कहना कठिन है। कहा जाता है उनका जन्म सन् १४७८ में दिल्ली के पास एक ग़रीब ब्राह्मीण परिवार में हुआ.Albeit there are many folk lores about the life of sur das. It is said that he was born in 1478 near Delhi in a poor Brahmin's house.`;
let results = targetString.matchAll(wordPttn);
for (let result of results) {
  console.log(result[0]);
}