Text

Resources about text as a form of data, i.e. the more technical details behind the characters we use all the time.

Introduction

Something as generally well understood as text can become rather complicated when it is handled by computers. Phew, that sentence shows that text maybe isn't that easy for humans either!

Anyhow, when text is handled in computers, where everything comes down to zeros (0) and ones (1), one must:

  • Decide which characters to handle: some set of allowed characters, a character set, or charset for short.
  • Decide how to represent each character in this set as zeros and ones: some form of encoding.

Among the earliest character sets used in computers (originating in the 1960s) are ASCII and EBCDIC (wikipedia), both of which allowed only a limited set of US-English letters and symbols.

Today, all modern handling of text in computers should aim to use Unicode and encodings like UTF-8 or UTF-16.

Character Set | Alphabet(s) handled, max # of characters handled | Size per character | E.g. "A"
--- | --- | --- | ---
ASCII | US English with a limited number of symbols (128 characters, 7 bits), or variants of extended ASCII with different sets for different languages (256 characters, 8 bits) | 7 or 8 bits, 1 byte | 65dec
EBCDIC | Similar to ASCII but with different encodings than ASCII | 8 bits, 1 byte | 193dec
Unicode | "the latest version of Unicode contains a repertoire of more than 128,000 characters covering 135 modern and historic scripts, as well as multiple symbol sets." | 1-4 bytes, 8-32 bits (depending on the encoding) | 0041hex (=65dec)
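To make the split between character set and encoding a bit more concrete, here is a minimal sketch in JavaScript. It assumes a runtime with the standard TextEncoder (modern browsers, Node.js); the helper name describeChar is made up for this example.

```javascript
// Hypothetical helper: report where a character sits in the character set
// (its code point) and how one particular encoding (UTF-8) stores it.
function describeChar(ch) {
  const codePoint = ch.codePointAt(0);
  const hex = 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
  const utf8 = Array.from(new TextEncoder().encode(ch)); // bytes as numbers
  return { codePoint, hex, utf8 };
}

const a = describeChar('A');    // code point 65 (U+0041), one UTF-8 byte
const euro = describeChar('€'); // code point 8364 (U+20AC), three UTF-8 bytes
console.log(a, euro);
```

The code point is the same no matter which Unicode encoding is used; only the byte sequence changes.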

 

ASCII

"ASCII (/ˈæski/ ass-kee), abbreviated from American Standard Code for Information Interchange, is a character encoding standard (the Internet Assigned Numbers Authority (IANA) prefers the name US-ASCII). ASCII codes represent text in computers, telecommunications equipment, and other devices. Most modern character-encoding schemes are based on ASCII, although they support many additional characters." [wikipedia.org/wiki/ASCII]

ASCII gets an honorable mention here because many lower-level systems, existing ones and quite likely future ones too, use this old encoding of characters.

Pros: Widespread use; essentially all development tools etc. handle ASCII without any fuss or special techniques.
Cons:

None, as long as only the first 128 characters (the lower part of the ASCII table) are used. That, however, limits the alphabet to English with A-Z and a-z: no Greek, and very few special symbols for currency, mathematics, etc. I.e. ASCII is good for programming language syntax and so on, but FAR LESS so for the data handled by the program, including text used in the user interface.

The strong recommendation today is to handle all text in a program as Unicode, in UTF-8 or UTF-16 encoding.

Rec use: See Cons.

 

Charsets

"In computing, a character encoding is used to represent a repertoire of characters by some kind of an encoding system. Depending on the abstraction level and context, corresponding code points and the resulting code space may be regarded as bit patterns, octets, natural numbers, electrical pulses, etc. A character encoding is used in computation, data storage, and transmission of textual data. Character set, character map, codeset and code page are related, but not identical, terms." [wikipedia.org/wiki/Character_encoding, the article presented when searching for 'charset']

Character Set | Size | Encoding | Notes
--- | --- | --- | ---
ISO 646, ASCII | 1 byte | An "ASCII number" (0-255, where 0-127 is common to the 8-bit variants of ASCII, including the international ones) |
ISO 8859, ISO 8859-n | 1 byte | | E.g. wikipedia.org/wiki/ISO/IEC_8859; ISO 8859-1, also known as Latin1 - wikipedia.org/wiki/ISO/IEC_8859-1; another example is ISO 8859-5, wikipedia.org/wiki/ISO/IEC_8859-5
Unicode | 1-4 bytes | A "Unicode code point", e.g. U+0041 (Latin A; ASCII 65) |
Universal Coded Character Set (UCS), ISO/IEC 10646 - a parallel standard to Unicode | 1-4 bytes | | "ISO 10646 and Unicode have an identical repertoire and numbers - the same characters with the same numbers exist on both standards, although Unicode releases new versions and adds new characters more often. Unicode has rules and specifications outside the scope of ISO 10646." [wikipedia]

 

Encoding

 

Percent-Encoding (an encoding)

Also known as URL Encoding.
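A minimal sketch of percent-encoding in JavaScript, using the built-in encodeURIComponent and decodeURIComponent. The sample strings are just illustrations; the encoding works on the UTF-8 bytes of each character and writes every byte that is not URL-safe as %XX.

```javascript
// 'и' is U+0438 (Cyrillic); its UTF-8 bytes are D0 B8.
const encoded = encodeURIComponent('и');       // '%D0%B8'
const decoded = decodeURIComponent('%D0%B8');  // 'и'

// Reserved ASCII characters are escaped too; plain letters are left alone.
const query = encodeURIComponent('a b&c');     // 'a%20b%26c'

console.log(encoded, decoded, query);
```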

 

Programming Languages, Development

First, it's important to recognize a few details:

  1. The source code of a program (e.g. written in PHP or JavaScript) may have one encoding (e.g. ASCII or UTF-8), while the text strings held in constants and variables within the running program may have a different encoding (e.g. UTF-16).
  2. #

how to determine which alphabet utf8 - google.com/search?q=what+alphabet+ %D0%B8 ...
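One possible way to answer the "which alphabet" question in code is a regular expression with Unicode property escapes (\p{Script=...}, needs an ES2018+ engine). This is only a sketch: the list of candidate scripts and the helper name scriptOf are assumptions made for the example.

```javascript
// Candidate scripts to test for; this short list is an assumption.
const candidateScripts = ['Latin', 'Cyrillic', 'Greek', 'Han'];

// Hypothetical helper: return the first candidate script a single character belongs to.
function scriptOf(ch) {
  for (const name of candidateScripts) {
    const pattern = new RegExp('^\\p{Script=' + name + '}$', 'u');
    if (pattern.test(ch)) return name;
  }
  return 'unknown';
}

const letter = decodeURIComponent('%D0%B8'); // the percent-encoded character from the search above
console.log(scriptOf(letter), scriptOf('A'), scriptOf('大'));
```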

In HTML

See Web Pages below (HTML is not a programming language) for more.

"In the HTML markup language (not a programming language), use &#x<hex>; (hex) or &#<dec>; (dec)"

JavaScript

Two forms of escape sequences:

  1. '\x': followed by two hex digits, for Unicode U+0000 through U+00FF.
    Example, \x41 (65 dec), Latin/ASCII letter 'A'.
    NOTE: This format only supports the first 256 characters (0-255d; 00-FFh).
  2. '\u': followed by four hex digits, for Unicode U+0000 through U+FFFF.
    Example, \u0041 (65 dec), Latin/ASCII letter 'A'.
    NOTE: This format only supports the first 65536 characters (0-65535d; 0000-FFFFh).
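A short sketch of the two escape forms in use. The third form, \u{...}, is not listed above; it was added in ES2015 and is included here on the assumption that the code runs on an ES2015+ engine.

```javascript
const viaX = '\x41';            // two hex digits
const viaU = '\u0041';          // four hex digits
const viaPair = '\uD834\uDF06'; // a character above U+FFFF needs two \u escapes (a surrogate pair)
const viaBraces = '\u{1D306}';  // ES2015 code point escape for the same character

console.log(viaX === 'A', viaU === 'A', viaPair === viaBraces);
```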

"JavaScript has a Unicode problem" (published 20th October 2013) - mathiasbynens.be/notes/javascript-unicode

http://speakingjs.com/es5/ch24.html

string to

JavaScript

charCodeAt - developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charCodeAt

codePointAt - developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/codePointAt
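The difference between the two methods shows up for characters outside the BMP. A small sketch, using U+1D306 as the sample character:

```javascript
const tetragram = '\u{1D306}'; // one character, stored as two UTF-16 code units

// charCodeAt returns a single 16-bit code unit at the given index.
const unit0 = tetragram.charCodeAt(0).toString(16);  // 'd834' (high surrogate)
const unit1 = tetragram.charCodeAt(1).toString(16);  // 'df06' (low surrogate)

// codePointAt combines a surrogate pair starting at the index into one code point.
const point = tetragram.codePointAt(0).toString(16); // '1d306'

console.log(unit0, unit1, point);
```

For characters inside the BMP, such as 'A', both methods return the same number.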

PHP

Regular Expressions

 

 

 

Punycode (an encoding)

Punycode is the encoding used for handling Unicode characters in International Domain Names (IDNs). See our TLD database for more examples.

Example: the domain åland.ax is encoded as xn--land-poa.ax in Punycode. Another example is xn--0zwm56d, which represents the international top-level domain (TLD) 测试.
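The two examples can be reproduced in Node.js with the built-in url module, as a sketch. This assumes Node.js (the functions domainToASCII and domainToUnicode are Node-specific and not available in browsers) and a CommonJS script.

```javascript
const { domainToASCII, domainToUnicode } = require('url');

// Unicode domain name -> ASCII form with the xn-- prefix
const asciiDomain = domainToASCII('åland.ax');

// ASCII (Punycode) form -> Unicode
const unicodeTld = domainToUnicode('xn--0zwm56d');

console.log(asciiDomain, unicodeTld);
```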

Volkswagen Example in Punycode, and UTF-8, and Percent-Encoding

Volkswagen, the German car manufacturer, was early to reserve its own top-level domains (TLDs) when this became a possibility.

Top Level Domain | Type | Created | Organization
--- | --- | --- | ---
volkswagen | gTLD | 2015-12-23 | Volkswagen Group of America Inc.
大众汽车 (xn--3oq18vl8pn36a) | gTLD | 2016-07-21 | Volkswagen (China) Investment Co., Ltd.

(Google Translate translates '大众汽车' to 'Volkswagen')

Encoding | Result
--- | ---
UTF-8, Unicode | 大众汽车
In Punycode | xn--3oq18vl8pn36a
In Percent-Encoding | %E5%A4%A7%E4%BC%97%E6%B1%BD%E8%BD%A6

Google 大众汽车 (googling 'Volkswagen' in Chinese):

google.com/search?q=%E5%A4%A7%E4%BC%97%E6%B1%BD%E8%BD%A6
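The percent-encoded form is simply the UTF-8 bytes of the name, each written as %XX. The sketch below derives it by hand and compares it with the built-in encodeURIComponent; the helper name percentEncodeAllBytes is made up for this example, and TextEncoder is assumed to be available.

```javascript
// Hypothetical helper: percent-encode every UTF-8 byte of a string.
// (Unlike encodeURIComponent it would also escape ASCII letters and digits.)
function percentEncodeAllBytes(str) {
  return Array.from(new TextEncoder().encode(str))
    .map((byte) => '%' + byte.toString(16).toUpperCase().padStart(2, '0'))
    .join('');
}

const vwChinese = '大众汽车';
const manual = percentEncodeAllBytes(vwChinese);
const builtIn = encodeURIComponent(vwChinese);

console.log(manual, manual === builtIn);
```

Four characters of three UTF-8 bytes each give the twelve %XX groups seen in the table.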

Wikipedia is a great site both for learning about topics in general and for metadata about crossovers between languages:

English-language page: en.wikipedia.org/wiki/Volkswagen
Related page in Chinese: zh.wikipedia.org/wiki/大众汽车
(zh.wikipedia.org/wiki/%E5%A4....)

 

Unicode (a character set standard)

"Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. Developed in conjunction with the Universal Coded Character Set (UCS) standard and published as The Unicode Standard, the latest version of Unicode contains a repertoire of more than 128,000 characters covering 135 modern and historic scripts, as well as multiple symbol sets." [wikipedia]

"Unicode defines a codespace of 1,114,112 code points in the range 0hex to 10FFFFhex. Normally a Unicode code point is referred to by writing "U+" followed by its hexadecimal number. For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g., U+0058 for the character LATIN CAPITAL LETTER X); for code points outside the BMP, five or six digits are used, as required (e.g., U+E0001 for the character LANGUAGE TAG and U+10FFFD for the character PRIVATE USE CHARACTER-10FFFD)." [wikipedia]

For the first 128 code points, a Unicode code point corresponds to the ASCII number with the same value (e.g. U+0041, hex 41, is ASCII 65 decimal, the letter 'A').


"Characters outside the BMP, e.g. U+1D306 tetragram for centre (𝌆), can only be encoded in UTF-16 using two 16-bit code units: 0xD834 0xDF06. This is called a surrogate pair. Note that a surrogate pair only represents a single character." [mathiasbynens.be]

Extract from wikipedia: UTF-8

The following table summarises the conversion for characters of different lengths in UTF-8.

Character | Code point | Binary code point | Binary UTF-8 | Hexadecimal UTF-8
--- | --- | --- | --- | ---
$ | U+0024 | 010 0100 | 00100100 | 24
¢ | U+00A2 | 000 1010 0010 | 11000010 10100010 | C2 A2
€ | U+20AC | 0010 0000 1010 1100 | 11100010 10000010 10101100 | E2 82 AC
𐍈 | U+10348 | 0 0001 0000 0011 0100 1000 | 11110000 10010000 10001101 10001000 | F0 90 8D 88
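The bit distribution in the table above can be sketched as a small hand-written encoder. This is an illustration only: the function names utf8Bytes and toHex are made up, and the sketch does not reject surrogates or values above 10FFFF the way a real encoder must.

```javascript
// Hypothetical helper: encode one code point as UTF-8 bytes.
function utf8Bytes(cp) {
  if (cp < 0x80) return [cp];                                  // 0xxxxxxx
  if (cp < 0x800) return [0xC0 | (cp >> 6),                    // 110xxxxx
                          0x80 | (cp & 0x3F)];                 // 10xxxxxx
  if (cp < 0x10000) return [0xE0 | (cp >> 12),                 // 1110xxxx
                            0x80 | ((cp >> 6) & 0x3F),
                            0x80 | (cp & 0x3F)];
  return [0xF0 | (cp >> 18),                                   // 11110xxx
          0x80 | ((cp >> 12) & 0x3F),
          0x80 | ((cp >> 6) & 0x3F),
          0x80 | (cp & 0x3F)];
}

// Format bytes the way the table does, e.g. 'E2 82 AC'.
const toHex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, '0')).join(' ');

console.log(toHex(utf8Bytes(0x24)), '|', toHex(utf8Bytes(0xA2)), '|',
            toHex(utf8Bytes(0x20AC)), '|', toHex(utf8Bytes(0x10348)));
```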

 

Extract from https://en.wikipedia.org/wiki/UTF-16

Consider the encoding of U+10437 (𐐷):

  • Subtract 0x10000 from 0x10437. The result is 0x00437, 0000 0000 0100 0011 0111.
  • Split this into the high 10-bit value and the low 10-bit value: 0000000001 and 0000110111.
  • Add 0xD800 to the high value to form the high surrogate: 0xD800 + 0x0001 = 0xD801.
  • Add 0xDC00 to the low value to form the low surrogate: 0xDC00 + 0x0037 = 0xDC37.
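The four steps above can be sketched as a function. The name toSurrogatePair is made up for this example; the built-in String.fromCodePoint does the same job and is used only as a cross-check.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(cp) {
  if (cp < 0x10000 || cp > 0x10FFFF) {
    throw new RangeError('only code points U+10000..U+10FFFF need a surrogate pair');
  }
  const offset = cp - 0x10000;             // step 1: subtract 0x10000 (20 bits remain)
  const high = 0xD800 + (offset >> 10);    // steps 2-3: top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);   // steps 2, 4: low 10 bits -> low surrogate
  return [high, low];
}

const pair = toSurrogatePair(0x10437).map((u) => u.toString(16).toUpperCase());
const sameAsBuiltIn =
  String.fromCharCode(...toSurrogatePair(0x10437)) === String.fromCodePoint(0x10437);

console.log(pair, sameAsBuiltIn);
```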

The following table summarizes this conversion, as well as others.

Character | Code point | Binary code point | Binary UTF-16 | UTF-16 hex code units | UTF-16BE hex bytes | UTF-16LE hex bytes
--- | --- | --- | --- | --- | --- | ---
$ | U+0024 | 0000 0000 0010 0100 | 0000 0000 0010 0100 | 0024 | 00 24 | 24 00
€ | U+20AC | 0010 0000 1010 1100 | 0010 0000 1010 1100 | 20AC | 20 AC | AC 20
𐐷 | U+10437 | 0001 0000 0100 0011 0111 | 1101 1000 0000 0001 1101 1100 0011 0111 | D801 DC37 | D8 01 DC 37 | 01 D8 37 DC
𤭢 | U+24B62 | 0010 0100 1011 0110 0010 | 1101 1000 0101 0010 1101 1111 0110 0010 | D852 DF62 | D8 52 DF 62 | 52 D8 62 DF

 


Encoding | Character Length | Note
--- | --- | ---
UCS-2 | Fixed-width, 16 bits/2 bytes | Obsolete, old; more modern alt is UTF-16
UCS-4 | Fixed-width, 32 bits/4 bytes | Obsolete, old; more modern alt is UTF-32
UTF-8 | One to four 8-bit code units (variable-width; 8, 16, 24, or 32 bits, 1-4 bytes) | Recommended for Web use!
UTF-16 | One or two 16-bit code units (variable-width, 16 or 32 bits, 2 or 4 bytes) | Commonly used in operating systems (e.g. MS Windows and Mac OS X) and in programming languages for internal handling of text. NOT recommended for Web use.
UTF-32 | Exactly 32 bits per Unicode code point (fixed-width, 4 bytes) | "Each 32-bit value in UTF-32 is exactly equal to a code point's numerical value." [wikipedia] See Wikipedia for more, including usage. NOT recommended for Web use.
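To get a feel for the sizes in the table, the sketch below counts how many bytes one sample string would take in each of the three UTF encodings (without any BOM). The helper name encodedSizes and the sample string are assumptions made for the example.

```javascript
// Hypothetical helper: byte size of a string in UTF-8, UTF-16 and UTF-32 (no BOM).
function encodedSizes(str) {
  const codePoints = Array.from(str).length;    // Array.from iterates by code point
  return {
    utf8: new TextEncoder().encode(str).length, // 1-4 bytes per code point
    utf16: str.length * 2,                      // .length counts 16-bit code units
    utf32: codePoints * 4,                      // always 4 bytes per code point
  };
}

// 'A' (U+0041), '€' (U+20AC) and U+1D306, i.e. 1-, 3- and 4-byte characters in UTF-8.
const sizes = encodedSizes('A\u20AC\u{1D306}');
console.log(sizes);
```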

 

Eight-bit environments

Number of bytes per code point:

Code range, hexadecimal (decimal)* | UTF-8 | UTF-16 | UTF-32 | UTF-EBCDIC | GB 18030
--- | --- | --- | --- | --- | ---
000000 - 00007F (0-127) | 1 | 2 | 4 | 1 | 1
000080 - 00009F (128-159) | 2 | 2 | 4 | 1 | 2 for characters inherited from GB 2312/GBK (e.g. most Chinese characters), 4 for everything else
0000A0 - 0003FF (160-1023) | 2 | 2 | 4 | 2 | (same)
000400 - 0007FF (1024-2047) | 2 | 2 | 4 | 3 | (same)
000800 - 003FFF (2048-16383) | 3 | 2 | 4 | 3 | (same)
004000 - 00FFFF (16384-65535) | 3 | 2 | 4 | 4 | (same)
010000 - 03FFFF (65536-262143) | 4 | 4 | 4 | 4 | 4
040000 - 10FFFF (262144-1114111) | 4 | 4 | 4 | 5 | 4
110000 - FFFFFFFF (1114112-4294967295) | No encodings, undefined values in Unicode (yet)

*) These are possible numerical values; a value is NOT necessarily a valid Unicode code point, as certain values are invalid or undefined for various reasons. Example ranges:

  • The range D800-DFFF contains invalid code points because these values are used as escape values in UTF-16 (called surrogates).
  • http://www.unicode.org/charts/PDF/UD800.pdf
  • D800-DFFF cannot be encoded in UTF-16, which in turn makes the range invalid in UTF-8 and UTF-32.
  • FFFE-FFFF:
    • FFFE is an invalid code point because it is used to detect an endian (byte order) mismatch in UTF-16 (U+FEFF, the BOM (Byte Order Mark) character, is commonly placed at the beginning of a UTF-16 stream; read with the wrong byte order it shows up as FFFE).
    • There is currently no necessary reason for FFFF to be an invalid code point. Although it could be used as an escape character to extend UTF-16 in the future, should that be necessary, a larger value formed from a currently unassigned surrogate pair could also be used.
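A rough classifier for the ranges discussed above, as a sketch. The function name and the category labels are made up; "noncharacter" is the Unicode Standard's term for values such as FFFE and FFFF, and the extra range FDD0-FDEF (also noncharacters) is not mentioned in the list above.

```javascript
// Hypothetical helper: classify a number as a kind of Unicode value.
function classifyCodePoint(n) {
  if (!Number.isInteger(n) || n < 0 || n > 0x10FFFF) return 'outside Unicode';
  if (n >= 0xD800 && n <= 0xDFFF) return 'surrogate';
  // The last two values of every plane (xxFFFE, xxFFFF) plus FDD0-FDEF.
  if ((n & 0xFFFE) === 0xFFFE || (n >= 0xFDD0 && n <= 0xFDEF)) return 'noncharacter';
  return 'scalar value';
}

console.log(
  classifyCodePoint(0x41),     // ordinary character 'A'
  classifyCodePoint(0xD834),   // high surrogate
  classifyCodePoint(0xFFFE),   // byte-swapped BOM
  classifyCodePoint(0xFEFF),   // the BOM itself is an ordinary character
  classifyCodePoint(0x110000)  // beyond the Unicode codespace
);
```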

32 bits, 4 bytes, 8 hex: FFFFFFFF, 4,294,967,296 values (0-4,294,967,295)

Quick Guide to understanding Unicode Data Transfer Formats - azillionmonkeys.com/qed/unicode.html

http://php.net/manual/en/regexp.reference.unicode.php

UTF-8 (an encoding)

Extract from stackoverflow.com/questions/4655250/difference-between-utf-8-and-utf-16

I believe there are a lot of good articles about this around the Web, but here is a short summary.

Both UTF-8 and UTF-16 are variable length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 character length starts with 16 bits.

Main UTF-8 pros:

  • Basic ASCII characters like digits, Latin characters with no accents, etc. occupy one byte which is identical to US-ASCII representation. This way all US-ASCII strings become valid UTF-8, which provides decent backwards compatibility in many cases.
  • No null bytes, which allows to use null-terminated strings, this introduces a great deal of backwards compatibility too.
  • UTF-8 is independent of byte order, so you don't have to worry about Big Endian / Little Endian issue.

Main UTF-8 cons:

  • Many common characters have different length, which slows indexing by codepoint and calculating a codepoint count terribly.
  • Even though byte order doesn't matter, sometimes UTF-8 still has BOM (byte order mark) which serves to notify that the text is encoded in UTF-8, and also breaks compatibility with ASCII software even if the text only contains ASCII characters. Microsoft software (like Notepad) especially likes to add BOM to UTF-8.
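Two of the points above, ASCII compatibility and the UTF-8 BOM, can be tried out with the standard TextEncoder and TextDecoder (assumed available, as in modern browsers and Node.js). Note that the ignoreBOM option is named a little confusingly: when it is true, the BOM is kept in the decoded string.

```javascript
// ASCII text encodes to exactly its ASCII numbers in UTF-8.
const asciiBytes = Array.from(new TextEncoder().encode('Text')); // [84, 101, 120, 116]

// 'A' preceded by the UTF-8 BOM (EF BB BF), as some editors write it.
const withBom = new Uint8Array([0xEF, 0xBB, 0xBF, 0x41]);

// By default the decoder strips a leading BOM.
const stripped = new TextDecoder('utf-8').decode(withBom);

// With ignoreBOM: true the BOM stays in the result as U+FEFF.
const kept = new TextDecoder('utf-8', { ignoreBOM: true }).decode(withBom);

console.log(asciiBytes, stripped.length, kept.length);
```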

<UTF-16 pros and cons>

In general, UTF-16 is usually better for in-memory representation while UTF-8 is extremely good for text files and network protocols.

 

UTF-16 (an encoding)

See also Unicode above.

Extract from stackoverflow.com/questions/4655250/difference-between-utf-8-and-utf-16

I believe there are a lot of good articles about this around the Web, but here is a short summary.

Both UTF-8 and UTF-16 are variable length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 character length starts with 16 bits.

<UTF-8 pros and cons>

Main UTF-16 pros:

  • BMP (basic multilingual plane) characters, including Latin, Cyrillic, most Chinese (the PRC made support for some codepoints outside BMP mandatory), most Japanese can be represented with 2 bytes. This speeds up indexing and calculating codepoint count in case the text does not contain supplementary characters.
  • Even if the text has supplementary characters, they are still represented by pairs of 16-bit values, which means that the total length is still divisible by two and allows to use 16-bit char as the primitive component of the string.

Main UTF-16 cons:

  • Lots of null bytes in US-ASCII strings, which means no null-terminated strings and a lot of wasted memory.
  • Using it as a fixed-length encoding “mostly works” in many common scenarios (especially in US / EU / countries with Cyrillic alphabets / Israel / Arab countries / Iran and many others), often leading to broken support where it doesn't. This means the programmers have to be aware of surrogate pairs and handle them properly in cases where it matters!
  • It's variable length, so counting or indexing codepoints is costly, though less than UTF-8.
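JavaScript strings are sequences of UTF-16 code units, so the "mostly works" problem is easy to reproduce. A small sketch, with a sample string containing one supplementary character:

```javascript
const sample = 'a\u{1D306}b';              // 'a', U+1D306, 'b'

const codeUnits = sample.length;           // counts 16-bit code units
const codePoints = Array.from(sample);     // string iteration goes by code point

// Indexing by code unit lands in the middle of the surrogate pair...
const brokenHalf = sample[1];              // lone high surrogate, not a character
// ...while indexing the code point array gives the whole character.
const wholeChar = codePoints[1];

console.log(codeUnits, codePoints.length, brokenHalf.length, wholeChar.length);
```

Building the code point array walks the whole string, which is the indexing cost the list above refers to.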

In general, UTF-16 is usually better for in-memory representation while UTF-8 is extremely good for text files and network protocols.

 

 

Web Pages

(For (e.g.) JavaScript and PHP, see Programming Languages above.)

In the HTML markup language (not a programming language), use &#x<hex>; (hex) or &#<dec>; (dec).

Example:

Unicode U+21D4 (double arrow): &#x21D4; (hex) or &#8660; (dec)
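Both forms can be generated from a character with a few lines of JavaScript, as a sketch; the helper name toNumericReferences is made up for this example.

```javascript
// Hypothetical helper: build the hex and decimal HTML references for one character.
function toNumericReferences(ch) {
  const cp = ch.codePointAt(0);
  return {
    hex: '&#x' + cp.toString(16).toUpperCase() + ';',
    dec: '&#' + cp + ';',
  };
}

const arrow = toNumericReferences('\u21D4'); // the double arrow from the example
console.log(arrow.hex, arrow.dec);
```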

See (e.g.) http://www.w3schools.com/charsets/ref_utf_dingbats.asp for examples of useful symbols.

Extract

Avoid these encodings

The HTML5 specification calls out a number of encodings that you should avoid.

Documents must not use JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, JOHAB (Windows code page 1361), encodings based on ISO-2022, or encodings based on EBCDIC. This is because they allow ASCII code points to represent non-ASCII characters, which poses a security threat.

Documents must also not use CESU-8, UTF-7, BOCU-1, or SCSU encodings, since they were never intended for Web content and the HTML5 specification forbids browsers from recognising them.

The specification also strongly discourages the use of UTF-16, and the use of UTF-32 is 'especially discouraged'.

Other character encodings listed in the Encoding specification should also be avoided. These include Big5 and EUC-JP encodings, which have interoperability issues. ISO-8859-8 (Hebrew encoding for visually ordered text) should also be avoided, in favour of an encoding that works with logically ordered text (i.e. UTF-8, or failing that ISO-8859-8-i).

The replacement encoding, listed in the Encoding specification, is not actually an encoding; it is a fallback that maps every octet to the Unicode code point U+FFFD REPLACEMENT CHARACTER. Obviously, it is not useful to transmit data in this encoding.

The x-user-defined encoding is a single-byte encoding whose lower half is ASCII and whose upper half is mapped into the Unicode Private Use Area (PUA). Like the PUA in general, using this encoding on the public Internet is best avoided because it damages interoperability and long-term use.

 

 

<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8"> <!-- Strongly recommended encoding and charset for HTML5 -->
  <!-- ... rest of head -->
</head>
<body>
  <!-- ... -->
</body>
</html>

 

 

 

https://www.smashingmagazine.com/2012/06/all-about-unicode-utf8-character-sets/