Resources

Unicode

Unicode is a standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. At its core is a table that maps each letter, emoji, or symbol to a number; this number is called a code point.
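As a quick illustration of the character-to-number mapping, Python's built-in ord() and chr() convert between a character and its code point. The variable names below are only for this sketch.

```python
# ord() returns the code point of a character; chr() is the inverse.
bang = ord("!")
pound = ord("£")

print(bang)             # 33
print(pound)            # 163
print(chr(163))         # £
print(f"U+{pound:04X}") # U+00A3, the usual way of writing a code point
```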

Multibyte Characters

Here, a multibyte character means a character whose encoding requires more than 1 byte. Ordinary strings (arrays of chars) can hold such characters, which makes them multibyte strings.
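A small sketch of the difference between counting characters and counting bytes, assuming the string is stored as UTF-8 (the encoding choice is an assumption of this example, not something the text above fixes):

```python
text = "a£"
encoded = text.encode("utf-8")  # the bytes a char array would hold

print(len(text))      # 2 characters
print(len(encoded))   # 3 bytes: "a" takes 1 byte, "£" takes 2
print(list(encoded))  # [97, 194, 163]
```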

Wide Characters

A wide character is a computer character datatype that generally has a size greater than the traditional 8-bit (1-byte) character.
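How wide a wide character is depends on the platform. One way to check, sketched here with Python's ctypes module, is to ask for the size of the C wchar_t type that ctypes.c_wchar maps to. It is commonly 4 bytes on Linux and macOS and 2 bytes on Windows, so no fixed output is shown.

```python
import ctypes

# Size in bytes of the platform's C wchar_t and of a plain char.
wchar_size = ctypes.sizeof(ctypes.c_wchar)
char_size = ctypes.sizeof(ctypes.c_char)

print(f"char: {char_size} byte, wchar_t: {wchar_size} bytes")
```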

Encodings

An encoding tells us how to represent a code point in memory. There are several Unicode encodings:

UTF-8
UTF-16
UTF-32
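To get a feel for how these encodings differ, the sketch below encodes the same characters with each one and records how many bytes they take. The little-endian variants ("utf-16-le", "utf-32-le") are used only so that Python does not prepend a byte order mark; that choice is an assumption of this example.

```python
encodings = ("utf-8", "utf-16-le", "utf-32-le")

# Number of bytes each character needs in each encoding.
sizes = {
    ch: [len(ch.encode(enc)) for enc in encodings]
    for ch in ("!", "£", "😀")
}

for ch, counts in sizes.items():
    print(ch, dict(zip(encodings, counts)))
# "!"  takes 1, 2 and 4 bytes
# "£"  takes 2, 2 and 4 bytes
# "😀" takes 4, 4 and 4 bytes
```

UTF-32 always uses 4 bytes per code point, while UTF-8 and UTF-16 use a variable number.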

UTF-8

If the code point fits in 7 bits: pad it with leading 0s up to 8 bits.
- Example: the character ! has code point 33, so it is encoded as 00100001
If it needs more than 7 bits:
- Start the first byte with as many 1s as the total number of bytes you need (counting the first byte itself), then add a 0
- Start every byte after the first with 10
- Fill the remaining positions with the bits of the code point, padding with leading 0s as needed to complete the bytes
- Example: the character £ has code point 163 (10100011), so it is encoded as 11000010 10100011
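The steps above can be sketched as a small hand-written encoder. This is a minimal illustration, not a replacement for str.encode(); the names utf8_encode and to_bits are made up for this example, and the byte-count thresholds (0x80, 0x800, 0x10000) are the standard UTF-8 ranges rather than something stated in the notes.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 by following the rules by hand."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")

    # Fits in 7 bits: a single byte with a leading 0.
    if code_point < 0x80:
        return bytes([code_point])

    # Otherwise decide how many bytes are needed in total.
    if code_point < 0x800:
        n = 2
    elif code_point < 0x10000:
        n = 3
    else:
        n = 4

    # Every byte after the first starts with 10 and carries 6 payload bits.
    tail = []
    for _ in range(n - 1):
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6

    # First byte: n ones, then a 0, then the remaining high bits.
    prefix = (0xFF << (8 - n)) & 0xFF
    lead = prefix | code_point

    return bytes([lead] + tail[::-1])


def to_bits(data: bytes) -> str:
    """Show bytes as space-separated groups of 8 bits."""
    return " ".join(f"{b:08b}" for b in data)


print(to_bits(utf8_encode(33)))   # 00100001
print(to_bits(utf8_encode(163)))  # 11000010 10100011
```

Comparing the result against Python's own encoder, for example utf8_encode(ord("£")) == "£".encode("utf-8"), is an easy way to check the sketch.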