Text

Resources on text as a form of data, i.e. the more technical details behind the characters we use all the time.

Introduction
ASCII
Encoding
Percent-Encoding (an encoding)
Programming Languages
Punycode (an encoding)
- Volkswagen Example in Punycode, and UTF-8, and Percent-Encoding
Unicode (a character set standard)
- Unicode Lab
UTF-8 (an encoding)
UTF-16 (an encoding)
Web Pages

Introduction

Something as seemingly simple as text can become rather complicated once it is handled by computers. Phew, that sentence shows that text maybe isn't that easy for humans either...!

Anyhow, when text is handled in computers, which basically comes down to zeros (0) and ones (1), one must:

Decide which characters to handle: some set of allowed characters, a character set, or sometimes charset for short.
Decide how to represent each character in this set as zeros and ones: some form of encoding.

The earliest character sets used in computers (originating in the 1960s) include ASCII and EBCDIC (wikipedia), both of which allowed only a limited set of US-English letters and symbols.

Today, all modern handling of text in computers should aim to use Unicode with an encoding like UTF-8 or UTF-16.

Character Set	Alphabet(s) handled, Max # of characters handled	Size per character	E.g. "A"
ASCII	US English with a limited number of symbols (128 characters, 7 bits) or variants of extended ASCII with different sets for different languages (256 characters, 8 bits)	7 or 8 bits, 1 byte	65_dec
EBCDIC	Similar to ASCII but with different encodings than ASCII	8 bits, 1 byte	193_dec
Unicode	"the latest version of Unicode contains a repertoire of more than 128,000 characters covering 135 modern and historic scripts, as well as multiple symbol sets."	1-4 bytes, 8-32 bits	0041_hex (=65_dec)
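
As a small illustration of the two decisions above (which characters, and which bits for each), the sketch below looks up the letter "A" in the table's three character sets using Python's built-in codecs. It assumes `cp037` as a stand-in for EBCDIC, which is only one of several EBCDIC code pages.

```python
# Same character, different encodings (a sketch; cp037 is just one EBCDIC code page).
letter = "A"

code_point = ord(letter)                 # Unicode code point as an integer
ascii_byte = letter.encode("ascii")[0]   # ASCII value
ebcdic_byte = letter.encode("cp037")[0]  # EBCDIC (code page 037) value
utf8_bytes = letter.encode("utf-8")      # UTF-8 bytes for the same code point

print(f"U+{code_point:04X}", ascii_byte, ebcdic_byte, utf8_bytes.hex())
```

The numbers match the 'E.g. "A"' column: 65 for ASCII, 193 for EBCDIC and 0041 hex for Unicode.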

ASCII

"ASCII (/ˈæski/ ass-kee), abbreviated from American Standard Code for Information Interchange, is a character encoding standard (the Internet Assigned Numbers Authority (IANA) prefers the name US-ASCII). ASCII codes represent text in computers, telecommunications equipment, and other devices. Most modern character-encoding schemes are based on ASCII, although they support many additional characters." [wikipedia.org/wiki/ASCII]

It gets an honorable mention here because many lower-level systems, existing ones and quite likely future ones too, may be using this old encoding of characters.

Pros:

Widespread use: essentially all development tools etc. handle ASCII without any fuss or special techniques.

Cons:

None, as long as only the first 128 characters (the lower part of the ASCII table) are used. That does, however, limit the alphabet to English A-Z, a-z: no Greek, and very few special symbols for currency, mathematics, etc. I.e. ASCII is good for programming language syntax and so on but FAR LESS so for the data handled by the program, including text used in the user interface.

The strong recommendation today is to handle all text in a program as Unicode, in the UTF-8 or UTF-16 encoding.

Rec use:

See Cons.

Charsets

"In computing, a character encoding is used to represent a repertoire of characters by some kind of an encoding system. Depending on the abstraction level and context, corresponding code points and the resulting code space may be regarded as bit patterns, octets, natural numbers, electrical pulses, etc. A character encoding is used in computation, data storage, and transmission of textual data. Character set, character map, codeset and code page are related, but not identical, terms." [wikipedia.org/wiki/Character_encoding, the article presented when searching for 'charset']

Character Set	Size	Encoding Notes
ISO 646, ASCII	1 byte	An "ASCII number" (0-255, where 0-127 is common to the variants of extended ASCII; the international ISO 646 variants replace a few symbols within 0-127)
ISO 8859, ISO 8859-n	1 byte	(E.g.) wikipedia.org/wiki/ISO/IEC_8859; ISO 8859-1, also known as Latin1 - wikipedia.org/wiki/ISO/IEC_8859-1, another example is ISO 8859-5, wikipedia.org/wiki/ISO/IEC_8859-5
Unicode	1-4 bytes	A "Unicode code point", e.g. U+0041 (Latin A; ASCII 65)
Universal Coded Character Set (UCS), ISO/IEC 10646 - a parallel standard to Unicode	1-4 bytes	"ISO 10646 and Unicode have an identical repertoire and numbers—the same characters with the same numbers exist on both standards, although Unicode releases new versions and adds new characters more often. Unicode has rules and specifications outside the scope of ISO 10646." [wikipedia]

Encoding

google.com/search?q=encoding
- wikipedia.org/wiki/Code
  - wikipedia.org/wiki/Parsing

UTF-8
Base64
- wikipedia.org/wiki/Base64
- wikipedia.org/wiki/MIME
Q-encoded

Percent-Encoding (an encoding)

Also known as URL Encoding.

wikipedia.org/wiki/Percent-encoding
tools.ietf.org/html/rfc3986

www.w3schools.com/TAgs/ref_urlencode.asp

URL encoding replaces unsafe ASCII characters with a "%" followed by two hexadecimal digits.

URLs cannot contain spaces. URL encoding replaces a space with %20, or with a plus (+) sign in form-encoded query strings.

Examples:

Character	From Windows-1252	From UTF-8
space	%20	%20
ƒ	%83	%C6%92
…	%85	%E2%80%A6
•	%95	%E2%80%A2

google.com/search?q=URL+Encoding+(Percent+Encoding)
On this page - Volkswagen Example in Punycode, and UTF-8, and Percent-Encoding

Programming Languages, Development

First, it is important to recognize one detail:

Source code for a program (e.g. written in PHP or JavaScript) may have one encoding (e.g. ASCII or UTF-8) and text strings used in constants and variables within the program may have a different encoding (e.g. UTF-16).

how to determine which alphabet utf8 - google.com/search?q=what+alphabet+ %D0%B8 ...

In HTML

See Web Pages below (HTML is not a programming language) for more.

"In HTML markup language (not a programming language), use &#x<hex>; (hex) or &#<dec>; (dec)"

JavaScript

Two forms of escape sequences:

'\x': followed by two hex digits, for Unicode U+0000 through U+00FF.
Example: \x41 (65 dec), Latin/ASCII letter 'A'.
NOTE: this format only supports the first 256 characters (0-255d; 00-FFh).
'\u': followed by four hex digits, for Unicode U+0000 through U+FFFF.
Example: \u0041 (65 dec), Latin/ASCII letter 'A'.
NOTE: this format only supports the first 65536 characters (0-65535d; 0000-FFFFh).

JavaScript has a Unicode problem Published 20th October 2013 mathiasbynens.be/notes/javascript-unicode

http://speakingjs.com/es5/ch24.html

String to code unit or code point

JavaScript

charCodeAt - developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charCodeAt

codePointAt - developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/codePointAt

PHP

Regular Expressions

Punycode (an encoding)

Punycode is the encoding used for handling Unicode characters in Internationalized Domain Names (IDNs). See our TLD database for more examples.

For example, the domain åland.ax is encoded as xn--land-poa.ax in Punycode. Another example is xn--0zwm56d, which represents the international top-level domain (TLD) 测试.
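
As a hedged sketch, Python's standard library can reproduce these examples. It ships two related codecs: `punycode` (the raw algorithm) and `idna` (which applies it per domain label and adds the `xn--` prefix). Note that the built-in `idna` codec follows the older IDNA 2003 rules, so results for unusual names may differ from what current registries use.

```python
# Raw Punycode of a single label versus a full domain name via the IDNA codec.
raw = "åland".encode("punycode")      # just the algorithm, no "xn--" prefix
domain = "åland.ax".encode("idna")    # label by label, prefix added where needed
tld = "测试".encode("idna")            # the test TLD mentioned above

print(raw, domain, tld)

# Decoding goes the other way.
back = b"xn--land-poa.ax".decode("idna")
```

The ASCII-only label "ax" passes through unchanged; only labels containing non-ASCII characters get the `xn--` form.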

Learn
- wikipedia: Punycode
- ietf.org/rfc/rfc3492.txt
Programming
- google php Punycode
- zedwood.com/article/php-idn-punycode-converter
- ckon.wordpress.com/2010/08/24/punycode-to-unicode-converter-php/
- PHP
  - php.net/manual/en/function.idn-to-utf8.php
    Result of executing idn_to_utf8('xn--3oq18vl8pn36a'): 大众汽车.
Tools
- charset.org/pages/punycode.php
  - example encode a string "abcåäö" (abc%C3%A5%C3%A4%C3%B6)

Volkswagen Example in Punycode, and UTF-8, and Percent-Encoding

Volkswagen, the German car manufacturer, was early to reserve its own top-level domains (TLDs) when this became possible.

Top Level Domain	Type	Created	Organization
volkswagen	gTLD	2015-12-23	Volkswagen Group of America Inc.
大众汽车 (xn--3oq18vl8pn36a)	gTLD	2016-07-21	Volkswagen (China) Investment Co., Ltd.

(Google Translate translates '大众汽车' to 'Volkswagen')

Encoding	Result
UTF-8, Unicode	大众汽车
In Punycode	xn--3oq18vl8pn36a
In Percent-Encoding	%E5%A4%A7%E4%BC%97%E6%B1%BD%E8%BD%A6
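
The three representations of the Chinese TLD can be derived from the same string. This is a minimal sketch using only the Python standard library (whose `idna` codec implements the older IDNA 2003 rules):

```python
from urllib.parse import quote

name = "大众汽车"

utf8_bytes = name.encode("utf-8")            # three bytes per character here
percent = quote(name, safe="")               # percent-encoded UTF-8 bytes
puny = name.encode("idna").decode("ascii")   # IDNA/Punycode form of the label

print(utf8_bytes.hex(" ").upper())
print(percent)
print(puny)
```

The percent-encoded form is simply the UTF-8 bytes written as hex with a "%" in front of each, while Punycode is a separate algorithm with its own output alphabet.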

Google 大众汽车 (googling 'Volkswagen' in Chinese):

google.com/search?q=%E5%A4%A7%E4%BC%97%E6%B1%BD%E8%BD%A6

Wikipedia is a great site both for learning about topics in general and for metadata about crossovers between languages:

English-language page:	en.wikipedia.org/wiki/Volkswagen
Related page in Chinese	zh.wikipedia.org/wiki/大众汽车 (zh.wikipedia.org/wiki/%E5%A4....)

Unicode (a character set standard)

"Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. Developed in conjunction with the Universal Coded Character Set (UCS) standard and published as The Unicode Standard, the latest version of Unicode contains a repertoire of more than 128,000 characters covering 135 modern and historic scripts, as well as multiple symbol sets." [wikipedia]

"Unicode defines a codespace of 1,114,112 code points in the range 0_hex to 10FFFF_hex. Normally a Unicode code point is referred to by writing "U+" followed by its hexadecimal number. For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g., U+0058 for the character LATIN CAPITAL LETTER X); for code points outside the BMP, five or six digits are used, as required (e.g., U+E0001 for the character LANGUAGE TAG and U+10FFFD for the character PRIVATE USE CHARACTER-10FFFD)." [wikipedia]

For the first 128 code points, a Unicode code point corresponds to the ASCII number: e.g. U+0041 (hex) is ASCII 65 (decimal), the letter 'A'.

Unicode Lab

"Characters outside the BMP, e.g. U+1D306 tetragram for centre (𝌆), can only be encoded in UTF-16 using two 16-bit code units: 0xD834 0xDF06. This is called a surrogate pair. Note that a surrogate pair only represents a single character." [mathiasbynens.be]

Extract from wikipedia: UTF-8

The following table summarises this conversion, as well as others with different lengths in UTF-8.

Character	Code point	Binary code point	Binary UTF-8	Hexadecimal UTF-8
$	`U+0024`	`010 0100`	`00100100`	`24`
¢	`U+00A2`	`000 1010 0010`	`11000010 10100010`	`C2 A2`
€	`U+20AC`	`0010 0000 1010 1100`	`11100010 10000010 10101100`	`E2 82 AC`
𐍈	`U+10348`	`0 0001 0000 0011 0100 1000`	`11110000 10010000 10001101 10001000`	`F0 90 8D 88`
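
How the code point bits are spread over the UTF-8 bytes can be sketched by hand. The function name `utf8_encode` below is made up for illustration; in practice `str.encode("utf-8")` does the same job, and the loop checks the sketch against it.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point to UTF-8 by hand (illustrative sketch)."""
    if 0xD800 <= code_point <= 0xDFFF or not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for ch in "$¢€𐍈":
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")   # agrees with the built-in codec
    print(f"U+{ord(ch):04X}", encoded.hex(" ").upper())
```

The leading bits of the first byte (0, 110, 1110, 11110) tell a decoder how many bytes the character occupies; every continuation byte starts with 10.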

Extract from https://en.wikipedia.org/wiki/UTF-16

Consider the encoding of U+10437 (𐐷):

Subtract 0x10000 from 0x10437. The result is 0x00437, 0000 0000 0100 0011 0111.

Split this into the high 10-bit value and the low 10-bit value: 0000000001 and 0000110111.

Add 0xD800 to the high value to form the high surrogate: 0xD800 + 0x0001 = 0xD801.

Add 0xDC00 to the low value to form the low surrogate: 0xDC00 + 0x0037 = 0xDC37.

The following table summarizes this conversion, as well as others.

Character	Code point	Binary code point	Binary UTF-16	UTF-16 hex code units	UTF-16BE hex bytes	UTF-16LE hex bytes
$	`U+0024`	`0000 0000 0010 0100`	`0000 0000 0010 0100`	`0024`	`00 24`	`24 00`
€	`U+20AC`	`0010 0000 1010 1100`	`0010 0000 1010 1100`	`20AC`	`20 AC`	`AC 20`
𐐷	`U+10437`	`0001 0000 0100 0011 0111`	`1101 1000 0000 0001 1101 1100 0011 0111`	`D801 DC37`	`D8 01 DC 37`	`01 D8 37 DC`
𤭢	`U+24B62`	`0010 0100 1011 0110 0010`	`1101 1000 0101 0010 1101 1111 0110 0010`	`D852 DF62`	`D8 52 DF 62`	`52 D8 62 DF`
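
The subtract-and-split steps for U+10437 can be sketched as a small function. The name `to_surrogate_pair` is hypothetical; the final check compares the result with Python's built-in big-endian UTF-16 codec.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a surrogate pair")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10437)
print(f"{high:04X} {low:04X}")

# Cross-check against the built-in big-endian UTF-16 codec.
expected = "\U00010437".encode("utf-16-be")
assert expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```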


unicode.org/
unicode.org/versions/Unicode9.0.0/
unicode.org/charts/
unicode.org/charts/charindex.html
wikipedia: Unicode
wikipedia: List of Unicode characters - includes a nice color-coded map of where different scripts (language families) are used
wikipedia: Comparison of Unicode encodings

Encoding	Character Length	Note
UCS-2	Fixed-width, 16 bits/2 bytes	Obsolete, old; more modern alt is UTF-16
UCS-4	Fixed-width, 32 bits/4 bytes	Obsolete, old; more modern alt is UTF-32
UTF-8	one to four 8-bit code units (variable-width, 8,16,24, or 32 bits, 1-4 bytes)	Recommended for Web use!
UTF-16	one or two 16-bit code units (variable-width, 16 or 32 bits, 2 or 4 bytes)	Commonly used in operating systems (e.g. MS Windows and Mac OS X) and in programming languages for internal handling of text. NOT recommended for Web use.
UTF-32	uses exactly 32 bits per Unicode code point, Fixed-width, 32 bits (4 bytes)	"Each 32-bit value in UTF-32 is exactly equal to a code point's numerical value." [wikipedia] See Wikipedia for more, including usage. NOT recommended for Web use.

Eight-bit environments

Code range, hexadecimal (decimal)*	UTF-8	UTF-16	UTF-32	UTF-EBCDIC	GB 18030
000000 – 00007F (0-127)	1	2	4	1	1
000080 – 00009F (128-159)	2	2	4	1	2 for characters inherited from GB 2312/GBK (e.g. most Chinese characters), 4 for everything else
0000A0 – 0003FF (160-1023)	2	2	4	2	2 or 4
000400 – 0007FF (1024-2047)	2	2	4	3	2 or 4
000800 – 003FFF (2048-16383)	3	2	4	3	2 or 4
004000 – 00FFFF (16384-65535)	3	2	4	4	2 or 4
010000 – 03FFFF (65536-262143)	4	4	4	4	4
040000 – 10FFFF (262144-1114111)	4	4	4	5	4
110000 – FFFFFFFF (1114112-4294967295)	No encodings, undefined values in Unicode (yet)

*) Possible numerical values; a value does NOT necessarily mean it's a valid Unicode code point, as certain values are invalid or undefined for various reasons. Example ranges:

The range D800-DFFF consists of invalid code points because they are used as escape values in UTF-16 (called surrogates).
http://www.unicode.org/charts/PDF/UD800.pdf
D800-DFFF cannot be encoded as characters in UTF-16, which in turn makes them invalid in UTF-8 and UTF-32.
FFFE-FFFF:
- FFFE is an invalid code point because it is used to detect an endian (byte order) mismatch in UTF-16 (U+FEFF, the BOM (Byte Order Mark) character, is expected at the beginning of a UTF-16 stream).
- There is no strictly necessary reason for FFFF to be an invalid code point. It could be used as an escape character to extend UTF-16 in the future, should that be necessary, but a currently unassigned value formed from a surrogate pair could also be used.
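
As a rough illustration in Python: a `str` can hold a lone surrogate value, but the standard UTF codecs refuse to encode it, and the two byte orders of the BOM show why FFFE signals an endian mismatch.

```python
lone_surrogate = "\ud800"   # a surrogate value on its own, not a real character

# Try to encode the lone surrogate with each UTF codec and record what happens.
results = {}
for codec in ("utf-8", "utf-16", "utf-32"):
    try:
        lone_surrogate.encode(codec)
        results[codec] = "encoded"
    except UnicodeEncodeError:
        results[codec] = "rejected"
print(results)

# U+FEFF is the byte order mark; read with the wrong byte order it looks like FFFE.
bom_be = "\ufeff".encode("utf-16-be")   # big-endian bytes
bom_le = "\ufeff".encode("utf-16-le")   # little-endian bytes
print(bom_be.hex().upper(), bom_le.hex().upper())
```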

32 bits, 4 bytes, 8 hex: FFFFFFFF, 4,294,967,296 values (0-4,294,967,295)

Quick Guide to understanding Unicode Data Transfer Formats - azillionmonkeys.com/qed/unicode.html

http://php.net/manual/en/regexp.reference.unicode.php

UTF-8 (an encoding)

UTF-8, UTF-16, UTF-32 & BOM - unicode.org/faq/utf_bom.html
wikipedia: UTF-8, includes comments on derivatives like
- CESU-8
- Modified UTF-8
- WTF-8 (Wobbly Transformation Format – 8-bit), an extension of UTF-8
google utf-8 v utf-16

Extract from stackoverflow.com/questions/4655250/difference-between-utf-8-and-utf-16

I believe there are a lot of good articles about this around the Web, but here is a short summary.

Both UTF-8 and UTF-16 are variable length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 character length starts with 16 bits.

Main UTF-8 pros:

Basic ASCII characters like digits, Latin characters with no accents, etc. occupy one byte which is identical to US-ASCII representation. This way all US-ASCII strings become valid UTF-8, which provides decent backwards compatibility in many cases.
No null bytes, which allows to use null-terminated strings, this introduces a great deal of backwards compatibility too.
UTF-8 is independent of byte order, so you don't have to worry about Big Endian / Little Endian issue.

Main UTF-8 cons:

Many common characters have different length, which slows indexing by codepoint and calculating a codepoint count terribly.
Even though byte order doesn't matter, sometimes UTF-8 still has BOM (byte order mark) which serves to notify that the text is encoded in UTF-8, and also breaks compatibility with ASCII software even if the text only contains ASCII characters. Microsoft software (like Notepad) especially likes to add BOM to UTF-8.

<UTF-16 pros and cons>

In general, UTF-16 is usually better for in-memory representation while UTF-8 is extremely good for text files and network protocols.
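
Several of the points in the summary above can be checked directly with Python's built-in codecs. This is only a sketch; `utf-8-sig` is Python's name for UTF-8 with a leading BOM.

```python
text = "abc"

# ASCII-only text has identical bytes in ASCII and UTF-8.
same_as_ascii = text.encode("utf-8") == text.encode("ascii")

# UTF-16 depends on byte order and contains null bytes for ASCII text.
utf16_be = text.encode("utf-16-be")
utf16_le = text.encode("utf-16-le")

# "utf-8-sig" prepends the UTF-8 BOM (EF BB BF) that some editors add.
with_bom = text.encode("utf-8-sig")

# Variable width: the character count and the byte count differ.
mixed = "a¢€𐍈"
char_count = len(mixed)
byte_count = len(mixed.encode("utf-8"))

print(same_as_ascii, utf16_be, utf16_le, with_bom, char_count, byte_count)
```

The BOM bytes at the start of `with_bom` are what trips up software that expects plain ASCII, even though the rest of the bytes are unchanged.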

UTF-16 (an encoding)

Web Pages

(For (e.g.) JavaScript and PHP, see Programming Languages above.)

In HTML markup language (not a programming language), use &#x<hex>; (hex) or &#<dec>; (dec).

Example:

Unicode U+21D4 (double arrow) ⇔: &#x21D4; (hex) or &#8660; (dec)
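
The two reference forms can be generated from a character's code point. The helper name `numeric_references` is made up for this sketch; `html.unescape` from the Python standard library converts either form back.

```python
import html


def numeric_references(ch: str) -> tuple[str, str]:
    """Return the hex and decimal HTML character references for one character."""
    cp = ord(ch)
    return f"&#x{cp:X};", f"&#{cp};"


hex_ref, dec_ref = numeric_references("⇔")
print(hex_ref, dec_ref)

# html.unescape turns either reference back into the character.
assert html.unescape(hex_ref) == html.unescape(dec_ref) == "⇔"
```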

See (e.g.) http://www.w3schools.com/charsets/ref_utf_dingbats.asp for examples of useful symbols.

w3.org

Extract

Avoid these encodings

The HTML5 specification calls out a number of encodings that you should avoid.

Documents must not use JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, JOHAB (Windows code page 1361), encodings based on ISO-2022, or encodings based on EBCDIC. This is because they allow ASCII code points to represent non-ASCII characters, which poses a security threat.

Documents must also not use CESU-8, UTF-7, BOCU-1, or SCSU encodings, since they were never intended for Web content and the HTML5 specification forbids browsers from recognising them.

The specification also strongly discourages the use of UTF-16, and the use of UTF-32 is 'especially discouraged'.

Other character encodings listed in the Encoding specification should also be avoided. These include Big5 and EUC-JP encodings, which have interoperability issues. ISO-8859-8 (Hebrew encoding for visually ordered text) should also be avoided, in favour of an encoding that works with logically ordered text (i.e. UTF-8, or failing that ISO-8859-8-i).

The replacement encoding, listed in the Encoding specification, is not actually an encoding; it is a fallback that maps every octet to the Unicode code point U+FFFD REPLACEMENT CHARACTER. Obviously, it is not useful to transmit data in this encoding.

The x-user-defined encoding is a single-byte encoding whose lower half is ASCII and whose upper half is mapped into the Unicode Private Use Area (PUA). Like the PUA in general, using this encoding on the public Internet is best avoided because it damages interoperability and long-term use.

<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8"> <!-- Strongly recommended encoding and charset for HTML5 -->
  <!-- ... rest of head -->
</head>
<body>
  <!-- ... -->
</body>
</html>

https://www.smashingmagazine.com/2012/06/all-about-unicode-utf8-character-sets/
