A Primer To Unicode

  • Unicode is an encoding standard for characters maintained by the Unicode Consortium
  • Unlike ASCII, which supports only 128 characters, Unicode aims to support every character in every script in the world, including emoji
  • As of 2025, Unicode defines more than 159,000 characters in 172 scripts
  • A Unicode character, as a reader perceives it, is made up of one or more Unicode codepoints
  • Unicode codepoints range from U+0000 to U+10FFFF
  • The letter ‘a’ is made up of one Unicode codepoint, U+0061
console.log('U+0061 = \u0061');
  • The Devanagari character लाँ is made up of three Unicode codepoints: the letter ल (U+0932) and the marks ा (U+093E) and ँ (U+0901)
console.log('\u0932 + \u093E + \u0901 = \u0932\u093E\u0901');
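Going the other way, from a string back to its codepoints, can be sketched as below. This is a minimal illustration, not a definitive implementation; `listCodePoints` is a hypothetical helper name. It relies on `Array.from` iterating a string by codepoint rather than by UTF-16 code unit.

```javascript
// Hypothetical helper: list the codepoints of a string in U+XXXX notation.
function listCodePoints(str) {
  // Array.from walks the string one codepoint at a time.
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

console.log(listCodePoints('a').join(' '));                  // U+0061
console.log(listCodePoints('\u0932\u093E\u0901').join(' ')); // U+0932 U+093E U+0901
```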
  • The Devanagari character ‘कि’ is made up of two Unicode codepoints, ‘U+0915’ (क) and ‘U+093F’ (ि)
console.log('U+0915 = \u0915');
console.log('U+093F = \u093f');
console.log('U+0915 + U+093F = \u0915\u093f');
  • Another example: the emoji 👩🏼‍🤝‍👨🏽 is made up of 7 Unicode codepoints: 👩 (U+1F469) + 🏼 (U+1F3FC) + ZWJ (U+200D) + 🤝 (U+1F91D) + ZWJ (U+200D) + 👨 (U+1F468) + 🏽 (U+1F3FD). Some of these codepoints need more than 16 bits, while JavaScript strings are encoded in UTF-16, so a \uXXXX escape has to spell each such codepoint as a UTF-16 surrogate pair. For example, U+1F469 becomes the surrogate pair 0xD83D and 0xDC69. Unicode encodes more than 3,790 emoji
console.log('\uD83D\uDC69\uD83C\uDFFC\u200D\uD83E\uDD1D\u200D\uD83D\uDC68\uD83C\uDFFD');
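The surrogate pair arithmetic can be sketched as follows. This is an illustration under stated assumptions, not a library API: `toSurrogatePair` is a hypothetical helper that only makes sense for codepoints above U+FFFF. The last lines show that the string length counts UTF-16 code units, not codepoints.

```javascript
// Hypothetical helper: split a codepoint above U+FFFF into a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;     // 20 bits remain after the subtraction
  const high = 0xD800 + (offset >> 10);   // top 10 bits go into the high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits go into the low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F469);
console.log(high.toString(16), low.toString(16)); // d83d dc69

const woman = String.fromCodePoint(0x1F469);
console.log(woman === '\uD83D\uDC69'); // true
console.log(woman.length);             // 2 (UTF-16 code units)
console.log([...woman].length);        // 1 (codepoints)
```

In source code, the ES2015 escape `'\u{1F469}'` produces the same string without writing the pair by hand.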
  • Each ASCII character is a single Unicode codepoint in the range U+0000 to U+007F (0 to 127)
// Print the lowercase ASCII letters a (U+0061) through z (U+007A)
for (let i = 0x0061; i < 0x007b; i++) {
  console.log(String.fromCodePoint(i));
}
  • Each Unicode codepoint has some properties associated with it
  • One of the most commonly used properties is General_Category, or gc for short
  • gc can take any of several values
    • If the Unicode codepoint is a letter in any script, it belongs to the group ‘Letter’, or ‘L’ for short
    • If the Unicode codepoint is a lowercase letter, its gc value is ‘Lowercase_Letter’, or ‘Ll’ for short
    • Strictly speaking, each codepoint has exactly one gc value, and ‘L’ is a grouping of the letter values (‘Lu’, ‘Ll’, ‘Lt’, ‘Lm’, ‘Lo’). In practice the lowercase letter ‘a’ matches both ‘L’ and ‘Ll’, because a lowercase letter is also a letter
    • Similarly, uppercase letters match both ‘L’ and ‘Lu’
    • Any number in any writing system, including Hindu-Arabic numerals (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), Devanagari numerals (०, १, २, ३, ४, ५, ६, ७, ८, ९), Roman numerals (Ⅰ, Ⅱ, Ⅲ, Ⅳ, Ⅴ, Ⅵ), etc., matches the gc group ‘Number’, or ‘N’
    • Decimal digits such as the Hindu-Arabic and Devanagari numerals have the more specific gc value ‘Decimal_Number’, or ‘Nd’ for short. Roman numerals do not; their value is ‘Letter_Number’ (‘Nl’)
    • Other values include ‘Mark’ (M), ‘Punctuation’ (P), ‘Symbol’ (S) depending on what the codepoint represents
    • You can find a full list of values for gc here
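
In JavaScript, gc values can be tested with Unicode property escapes (`\p{...}`) in regular expressions that use the `u` flag. The sketch below is illustrative only; `gcMatches` is a hypothetical helper, and it checks just a handful of values rather than the full list.

```javascript
// Hypothetical helper: report which of a few gc values and groups a single character matches.
function gcMatches(ch) {
  const patterns = {
    L: /^\p{L}$/u,   // Letter (group)
    Ll: /^\p{Ll}$/u, // Lowercase_Letter
    Lu: /^\p{Lu}$/u, // Uppercase_Letter
    N: /^\p{N}$/u,   // Number (group)
    Nd: /^\p{Nd}$/u, // Decimal_Number
    Nl: /^\p{Nl}$/u, // Letter_Number
    M: /^\p{M}$/u,   // Mark (group)
  };
  return Object.keys(patterns).filter((name) => patterns[name].test(ch));
}

console.log(gcMatches('a').join(' '));      // L Ll
console.log(gcMatches('A').join(' '));      // L Lu
console.log(gcMatches('\u096A').join(' ')); // ४ matches N Nd
console.log(gcMatches('\u2163').join(' ')); // Ⅳ matches N Nl
console.log(gcMatches('\u093F').join(' ')); // ि matches M

// The long property and value names work as well.
console.log(/^\p{General_Category=Decimal_Number}$/u.test('7')); // true
```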
