Convert between text and Unicode/HTML entities
Convert to HTML entities for special characters
String encoding and escape handling
Solve text data encoding issues
Identify and analyze invisible characters
Unicode is an international standard for representing all the world's characters in a unified system. It assigns a unique code point (written U+XXXX) to each character, covering Korean, English, Chinese, Arabic, emoji, and more. The code space holds 1,114,112 code points (U+0000 to U+10FFFF), and as of Unicode 15.1, 149,813 characters are defined.
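As a rough illustration of code points and HTML entities, the Python sketch below converts text to U+XXXX notation and to hexadecimal numeric entities. The helper names `to_code_points` and `to_numeric_entities` are hypothetical and chosen only for this example; `html.escape` and `html.unescape` come from the standard library.

```python
import html


def to_code_points(text: str) -> list[str]:
    """Return the U+XXXX notation for each code point in text."""
    return [f"U+{ord(ch):04X}" for ch in text]


def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal HTML entity."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


print(to_code_points("A가"))            # ['U+0041', 'U+AC00']
print(to_numeric_entities("A가"))       # A&#xAC00;
print(html.unescape("&#xAC00;&amp;"))   # 가&
print(html.escape("<a & b>"))           # &lt;a &amp; b&gt;
```

Note that `html.escape` only handles the HTML special characters (&, <, >, and quotes); turning arbitrary characters into numeric entities needs a helper like the one sketched here.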
UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character: ASCII characters use 1 byte, and Korean characters use 3 bytes. It is the most widely used encoding on the web and Linux, and is backward-compatible with ASCII. UTF-16 represents characters in the Basic Multilingual Plane in 2 bytes and all other characters in 4 bytes; it is used internally by Windows, Java, and JavaScript strings.
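The byte counts can be checked directly. The sketch below is a minimal Python example; `byte_lengths` is a hypothetical helper name, and the "utf-16-le" codec is used so that the 2-byte byte-order mark is not counted.

```python
def byte_lengths(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for a single character."""
    # "utf-16-le" skips the byte-order mark that plain "utf-16" would prepend.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch in ["A", "é", "한", "😀"]:
    utf8_len, utf16_len = byte_lengths(ch)
    print(f"U+{ord(ch):04X}  UTF-8: {utf8_len} bytes  UTF-16: {utf16_len} bytes")

# Backward compatibility: pure ASCII text produces identical bytes in UTF-8.
print("ASCII".encode("utf-8") == "ASCII".encode("ascii"))  # True
```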
Most emoji are located in the Supplementary Multilingual Plane of Unicode, at code points such as U+1F600 and above, although some older symbols displayed as emoji sit in the Basic Multilingual Plane. In UTF-16 a supplementary-plane emoji is represented as one surrogate pair (two 16-bit code units), and in UTF-8 it is encoded as 4 bytes. Some emoji are sequences of multiple emoji joined by ZWJ (Zero Width Joiner, U+200D).
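The surrogate pair arithmetic can be sketched in a few lines of Python. The function name `surrogate_pair` is hypothetical; the calculation follows the standard UTF-16 rule of subtracting 0x10000 and splitting the remaining 20 bits into two 10-bit halves.

```python
def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into its UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")                  # D83D DE00
print("\U0001F600".encode("utf-16-be").hex())   # d83dde00
print(len("\U0001F600".encode("utf-8")))        # 4

# A family emoji built from three emoji joined by ZWJ (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(len(family))                              # 5 code points
print([f"U+{ord(ch):04X}" for ch in family])
```

A ZWJ sequence renders as a single glyph on supporting platforms, yet it remains several code points, which matters when counting or truncating text.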
Precomposed Hangul syllables are assigned from U+AC00 (가) to U+D7A3 (힣), covering 11,172 characters. Hangul Jamo (individual consonants and vowels) are located in the U+1100–U+11FF range. Each Hangul syllable is composed of an initial consonant, a medial vowel, and an optional final consonant, all of which are supported by Unicode.
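Because the 11,172 syllables are laid out in order as 19 initials × 21 medials × 28 finals (counting "no final" as one), a syllable can be split with integer arithmetic. The sketch below assumes that standard layout; `decompose` and `to_jamo` are hypothetical helper names, and the result is compared with `unicodedata.normalize` from the standard library.

```python
import unicodedata

HANGUL_BASE = 0xAC00
MEDIAL_COUNT, FINAL_COUNT = 21, 28
SYLLABLE_COUNT = 19 * MEDIAL_COUNT * FINAL_COUNT  # 11,172


def decompose(syllable: str) -> tuple[int, int, int]:
    """Return (initial, medial, final) indices; final 0 means no final consonant."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < SYLLABLE_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    initial, rest = divmod(index, MEDIAL_COUNT * FINAL_COUNT)
    medial, final = divmod(rest, FINAL_COUNT)
    return initial, medial, final


def to_jamo(syllable: str) -> str:
    """Map the indices onto conjoining jamo in the U+1100 block."""
    initial, medial, final = decompose(syllable)
    jamo = chr(0x1100 + initial) + chr(0x1161 + medial)
    if final:
        jamo += chr(0x11A7 + final)
    return jamo


print(decompose("한"))  # (18, 0, 4)
print(to_jamo("한") == unicodedata.normalize("NFD", "한"))  # True
```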
Before Unicode, each country used its own character encoding system. English-speaking regions used ASCII (128 characters), Korea used EUC-KR, Japan used Shift-JIS, and China used GB2312; in total, hundreds of encodings coexisted. Data exchange between systems that used different encodings frequently produced garbled text (mojibake).
In 1987, engineers from Xerox and Apple started the Unicode project, and Unicode 1.0 was released in 1991. Today, major tech companies including Apple, Google, Microsoft, and IBM participate in the Unicode Consortium. Unicode is kept synchronized with the international standard ISO/IEC 10646.
HTML5 recommends UTF-8 as the default encoding, and most modern websites use UTF-8. JavaScript strings are stored internally as UTF-16, and the encodeURIComponent() function performs URL encoding based on UTF-8. Databases are also moving toward Unicode encodings as the default setting; MySQL 8.0, for example, uses utf8mb4 by default.
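To make the UTF-8 basis of URL encoding concrete, here is a Python sketch that approximates what encodeURIComponent() does. The name `percent_encode` is hypothetical, and the `safe` set is an assumption chosen to mirror the punctuation that encodeURIComponent() leaves unescaped; `quote` and `unquote` are from the standard library's urllib.parse.

```python
from urllib.parse import quote, unquote


def percent_encode(text: str) -> str:
    """Percent-encode text as UTF-8, roughly like JavaScript's encodeURIComponent()."""
    # Letters, digits, and "-_.~" are always left as-is by quote();
    # the safe set adds the remaining marks encodeURIComponent() keeps.
    return quote(text, safe="!*'()", encoding="utf-8")


encoded = percent_encode("한글")
print(encoded)                     # %ED%95%9C%EA%B8%80
print("한".encode("utf-8").hex())  # ed959c, the same three bytes
print(unquote(encoded))            # 한글
print(percent_encode("a b&c/d"))   # a%20b%26c%2Fd
```

Each Korean character becomes three %XX groups because it occupies three bytes in UTF-8.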