ASCII CODING
ASCII (acronym for American Standard Code for Information Interchange) is a character encoding. The ASCII standard was published by the American National Standards Institute (ANSI) in 1968. In Italian it is pronounced aschi /ˈaski/ or asci /ˈaʃʃi/, while the original English pronunciation is askey /ˈæski/. The initial specification, based on 7-bit codes, was followed over the years by many proposals for 8-bit extensions, with the aim of doubling the number of representable characters. On IBM PCs, one of these extensions, now a de facto standard, is called extended ASCII or high ASCII. In this extended ASCII, the added characters are accented vowels, semigraphic symbols, and other less commonly used symbols. Extended ASCII characters are organized in so-called code pages. The term extended ASCII (or high ASCII) designates an encoding of 8 bits or more, capable of representing many other characters in addition to the traditional 128 of 7-bit ASCII. The use of this term has often been criticized, as it may suggest (erroneously) that ASCII has been updated or that it is the same encoding. A new encoding called Unicode was developed in 1991 to encode many more characters in a standard way and to allow the use of multiple sets of extended characters (e.g. Greek and Cyrillic) in a single document; this character set is widespread today. Initially it had 65,536 code points and was later extended to 1,114,112 (= 2^20 + 2^16); so far about 101,000 have been assigned. The first 256 code points are exactly the same as those of ISO 8859-1. Most of the assigned codes are used to encode languages such as Chinese, Japanese and Korean. Unicode can use up to 4 bytes per character, but not always, as we will see; otherwise it would be a huge waste of RAM, which is a very precious resource.
THE UNICODE CODING
Unicode is an encoding system that assigns a unique number to each character used for writing texts, independently of the language, the computer platform and the program used. Unicode was originally conceived as a 16-bit encoding (four hexadecimal digits), capable of encoding 65,536 (2^16) characters. This was believed to be sufficient to represent the characters used in all the written languages of the world. Today, however, the Unicode standard, which tends to be perfectly aligned with the ISO/IEC 10646 standard, provides for an encoding of up to 21 bits and supports a repertoire of numeric codes that can represent about one million characters. This appears sufficient to also cover the coding needs of the writing systems of humanity's historical heritage, in its various languages and sign systems. As of 2009, only a very small part of this availability of codes had been assigned. Seventeen planes are foreseen for the development of the codes, numbered from 00 to 10 in hexadecimal, each with 65,536 positions (four hexadecimal digits), but only the first three and the last three planes are currently assigned, and of these the first, also called the BMP, is practically sufficient to cover all the most widely used languages. In practice, this repertoire of numeric codes is serialized by means of different recoding schemes, which allow the use of more compact codes for the most frequently used characters. Encodings with 8-bit (byte), 16-bit (word) and 32-bit (double word) units are provided, described respectively as UTF-8, UTF-16 and UTF-32.
THE BASIC MULTILINGUAL PLANE (BMP)
In this scheme each single square represents 256 characters; each symbol or character is called a code point. In order not to break compatibility with pre-existing programs, block 00 contains the 128 characters common to all countries; from 128 onwards, the characters are this time fixed for everyone. The grid shown in the figure is called a plane; each plane contains 65,536 code points. That may already seem an infinity, but in reality there are 17 planes. The plane we have just discussed is called the Basic Multilingual Plane (BMP). The coordinates of each code point are identified by 6 hexadecimal digits (3 bytes): two digits identify the plane (the BMP has coordinate 00), and four hexadecimal digits identify each individual code point within a plane. The BMP contains characters for almost all modern languages and a large number of symbols. A primary goal of the BMP is to support the unification of legacy character sets and characters for writing. Most of the code points assigned in the BMP are used to encode Chinese, Japanese and Korean (CJK) characters. 65,472 of the 65,536 code points in this plane have been assigned to a Unicode block, leaving only 64 code points in unallocated ranges (48 code points from 0870 to 089F and 16 code points from 2FE0 to 2FEF).
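The plane/offset arithmetic described above (two hexadecimal digits for the plane, four for the position within it) can be sketched in a few lines of Python; the function name is ours, for illustration:

```python
# Sketch: deriving the plane number and the in-plane offset of a code point,
# following the two-digit-plane / four-digit-offset scheme described above.
def plane_of(codepoint: int) -> tuple[int, int]:
    """Return (plane, offset within the plane) for a Unicode code point."""
    return divmod(codepoint, 0x10000)  # each plane holds 65,536 code points

print(plane_of(0x8BF4))   # (0, 35828)  -> 说 lives in the BMP (plane 0)
print(plane_of(0x1F600))  # (1, 62976)  -> 😀 lives in plane 1
```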
DIAGRAM OF THE 17 UNICODE PLANES
BMP DIAGRAM
THE UTF-8, UTF-16, UTF-32 CHARACTER ENCODINGS
Various character encodings (UTF-8, UTF-16, UTF-32) have been introduced (UTF stands for Unicode Transformation Format). These encodings take advantage of the different frequency of symbols across languages: some characters are used very often, others rarely. For this reason, variable-length encoding was introduced. The basic idea is to use fewer bytes for the more frequent code points and more bytes for the less frequent ones.
UTF-8 CODING
UTF-8 uses 1 to 4 bytes to represent a Unicode character. For example, only one byte is needed to represent the 128 characters of the ASCII alphabet, corresponding to the Unicode positions from U+0000 to U+007F. Four bytes may seem too much for a single character; however, this is only required for characters outside the Basic Multilingual Plane, which are generally very rare. Furthermore, UTF-16 (the main alternative to UTF-8) also requires four bytes for these characters. Which is more efficient, UTF-8 or UTF-16, depends on the range of characters used, and the use of traditional compression algorithms significantly reduces the difference between the two encodings. For short pieces of text, where traditional compression algorithms are inefficient and a low memory footprint is important, the Standard Compression Scheme for Unicode could be used. The Internet Engineering Task Force (IETF) requires that all Internet protocols identify the character encoding used, and that they be able to use at least UTF-8. UTF-8 is described in the RFC 3629 standard (UTF-8, a transformation format of ISO 10646). Briefly, the bits that make up a Unicode character are divided into groups, which are then distributed among the least significant bits of the bytes that make up the UTF-8 encoding of the character. Characters whose Unicode value is less than U+0080 are represented with a single byte containing their value; they correspond exactly to the 128 ASCII characters. In all other cases, up to 4 bytes are required, each with the most significant bit set to 1, in order to distinguish them from the 7-bit ASCII characters, especially those whose code is less than U+0020, traditionally used as control characters. Anyone who reads a website in English or an email in Japanese not only speaks both languages, but is also most likely witnessing the triumph of UTF-8 encoding.
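As a quick check, Python's built-in str.encode shows the 1-to-4-byte lengths described above:

```python
# Sketch: observing UTF-8's variable length per character.
for ch in ["A", "é", "€", "😀"]:
    b = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch!r}: {len(b)} byte(s) -> {b.hex(' ')}")

# U+0041 'A':  1 byte  (ASCII range, below U+0080)
# U+00E9 'é':  2 bytes (below U+0800)
# U+20AC '€':  3 bytes (rest of the BMP)
# U+1F600 '😀': 4 bytes (outside the BMP)
```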
“UTF-8” is the abbreviation for “8-Bit UCS Transformation Format” and represents the most widespread character encoding on the World Wide Web. The international Unicode standard covers all linguistic characters and text elements of (almost) all the languages of the world for EDP processing. UTF-8 encoding plays a vital role in the Unicode character set.
For example, the character alef (א), corresponding to Unicode U+05D0, is represented in UTF-8 with this procedure: it falls in the range from 0x0080 to 0x07FF. According to the table it is represented with two bytes (110XXXXX 10XXXXXX); hexadecimal 0x05D0 is equivalent to the binary 101-1101-0000; the eleven bits are copied in order into the positions marked with “X”: 110-10111 10-010000; the final result is the pair of bytes 11010111 10010000, or in hexadecimal 0xD7 0x90. In summary, the first 128 characters are represented with a single byte. The next 1,920 require two, and include the Latin alphabets with diacritics, Greek, Cyrillic, Coptic, Armenian, Hebrew and Arabic. The remaining characters in the Basic Multilingual Plane need three bytes, and the rest four.
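The alef walk-through can be verified with a short Python sketch that fills the 110XXXXX 10XXXXXX template by hand and compares the result with the standard encoder:

```python
# Sketch: reproducing the 2-byte UTF-8 encoding of alef (א, U+05D0) by hand.
cp = 0x05D0                              # binary 101-1101-0000 (11 bits)
byte1 = 0b11000000 | (cp >> 6)           # 110xxxxx: the top 5 of the 11 bits
byte2 = 0b10000000 | (cp & 0b00111111)   # 10xxxxxx: the low 6 bits
encoded = bytes([byte1, byte2])

assert encoded == "א".encode("utf-8") == b"\xd7\x90"
print(f"{byte1:08b} {byte2:08b}")        # 11010111 10010000
```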
MORE INFORMATION ON UTF-8 CODING
UTF-8 is a character encoding that assigns each existing Unicode character a specific sequence of bits, which can also be read as a binary number. This means that UTF-8 assigns a fixed binary number to every letter, number and symbol in an increasing number of languages. International organizations, which are particularly interested in Internet standards and consequently in their standardization, are working to make UTF-8 the coding model par excellence. Some examples are the W3C and the Internet Engineering Task Force. In fact, as early as 2009, most of the websites in the world used UTF-8 encoding. According to a March 2018 W3Techs report, 90.9% of all existing websites use this character encoding.
Problems before the introduction of UTF-8
Different regions with related languages and writing systems had each developed their own coding standards to meet their needs. In English-speaking countries, for example, ASCII encoding was sufficient: its structure assigns 128 characters to computer-readable bit sequences. However, Asian scripts and the Cyrillic alphabet use many more distinct characters, and German umlauts (such as the letter ä) are also missing from ASCII. Furthermore, different encodings could assign overlapping codes: a document written in Russian, for example, could appear on an American computer with the Latin letters assigned to those codes instead of Cyrillic letters. Results of this kind obviously made international communication much more difficult.
BIRTH OF UTF-8
To solve this problem, Joseph D. Becker developed the universal Unicode character set for Xerox between 1988 and 1991. From 1992, the X/Open IT consortium was also looking for a way to replace ASCII and expand its character repertoire. An important requirement was that the new encoding be compatible with ASCII. However, the first encoding, called UCS-2, did not meet this requirement and was limited to converting the numeric value of the characters into 16-bit values. The compatibility goal was not achieved by the UTF-1 encoding either, since its Unicode assignments partially overlapped with those of existing ASCII characters. A server set to ASCII could therefore easily produce incorrect characters; this was a considerable problem, since most English-speaking computers at the time worked with that encoding. The next attempt was Dave Prosser’s File System Safe UCS Transformation Format (FSS-UTF), which solved the problem of overlapping ASCII characters. In August of the same year the proposal spread among the experts. At Bell Labs, known for its numerous Nobel laureates, Unix co-creators Ken Thompson and Rob Pike were working on the Plan 9 operating system; they took up Prosser’s idea, developing an encoding capable of self-synchronization (each character indicates how many bits it needs) and establishing rules for assigning characters that could be represented differently in the code (example: “ä” as a single character or as “a + ¨”). They successfully used this encoding in their operating system and later presented it to those responsible for the standard. Thus was born the FSS-UTF encoding, now known as “UTF-8”.
UTF-8 IN THE UNICODE CHARACTER SET: A STANDARD FOR ALL LANGUAGES
UTF-8 encoding is a conversion format belonging to the Unicode standard. The Unicode character set is broadly defined in the international standard ISO 10646, referred to as the “Universal Coded Character Set”. To make it more convenient to use, the developers of the standard decided to limit some parameters. The standard is intended to ensure a uniform and internationally compatible encoding of characters and text elements. When it was introduced in 1991, the Unicode standard defined 24 modern writing systems and currency symbols for data processing. By June 2017 there were 139. There are several Unicode transformation formats, the so-called “UTFs”, which reproduce the 1,114,112 possible code points. Three formats have prevailed: UTF-8, UTF-16 and UTF-32. Other encodings such as UTF-7 or SCSU also have their advantages, but have never established themselves. Unicode is divided into 17 planes, each of which contains 65,536 characters. Each plane is commonly drawn as a grid of 16 columns and 16 rows, where each cell is a block of 256 code points. The first plane, called the “Basic Multilingual Plane” (plane 0), covers most of the writing systems currently in use in the world, as well as punctuation, control characters and symbols. Five other planes are currently in use:
- Supplementary Multilingual Plane (plane 1): ancient writing systems, rarely used characters.
- Supplementary Ideographic Plane (plane 2): rare CJK (“Chinese, Japanese, Korean”) ideographic characters.
- Supplementary Special-purpose Plane (plane 14): individual control characters.
- Supplementary Private Use Area-A (plane 15): private use.
- Supplementary Private Use Area-B (plane 16): private use.
UTF encodings give access to all Unicode characters, and thanks to their respective characteristics the individual planes can be used in different contexts.
THE ALTERNATIVES: UTF-32 and UTF-16
UTF-32 always uses 32-bit number sequences, i.e. 4 bytes. The simplicity of its structure improves the readability of the format. In languages that mainly use the Latin alphabet and therefore only the first 128 characters, this encoding takes up much more memory than necessary (4 bytes instead of 1). UTF-16 has established itself as a display format in operating systems such as Apple macOS and Microsoft Windows and is also used in many software development frameworks. It is one of the oldest UTF encodings still in use. Its structure is particularly suitable for encoding non-Latin characters, because it takes up little memory space. Most characters can be represented with 2 bytes (16 bits). Only in the case of rare characters can the length double up to 4 bytes.
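The memory trade-off between the three encodings can be measured directly in Python:

```python
# Sketch: comparing the encoded size of the same text in UTF-8, UTF-16, UTF-32.
# ("-le" picks a fixed byte order so no byte-order mark is added.)
samples = {"Latin": "Hello", "Cyrillic": "Привет", "CJK": "你好"}
for script, text in samples.items():
    sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(script, sizes)

# Latin text: UTF-8 needs 1 byte per character, UTF-32 always 4.
# CJK text: UTF-16 (2 bytes/char) beats UTF-8 (3 bytes/char) here.
```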
EFFICIENT AND SCALABLE: UTF-8
UTF-8 uses sequences of up to four bytes, each of 8 bits, while its ASCII predecessor uses bit sequences of 7 bits. Both encodings define the first 128 characters identically, so each of the characters coming mainly from the English-speaking world is covered by a single byte. For languages with the Latin alphabet, this format therefore makes efficient use of storage space. The Unix and Linux operating systems use it internally. UTF-8 encoding plays its most important role, however, in Internet applications, particularly in the representation of text on the World Wide Web and in electronic mail.
Thanks to the self-synchronizing structure, readability is maintained despite the variable length per character. Without the Unicode restriction, UTF-8 in its original form (with sequences of up to 6 bytes) could encode a total of 2^31 (= 2,147,483,648) characters. With the 4-byte limit imposed by the Unicode standard, the maximum drops to 2^21 (= 2,097,152) sequences, which is more than enough for the 1,114,112 code points provided. The Unicode standard itself still has empty planes for many other writing systems.
Exact assignment avoids overlaps between code points, which hindered communication in the past. UTF-16 and UTF-32 also guarantee exact assignment, but UTF-8 uses storage space particularly efficiently for the Latin writing system and is designed to allow different writing systems to coexist and be covered easily, permitting their simultaneous and sensible display within a single text field without compatibility problems.
BASICS: UTF-8 CODING AND COMPOSITION
UTF-8 encoding offers several advantages, such as backward compatibility with ASCII and a self-synchronizing structure that makes it easier for developers to identify sources of error, even after the fact. UTF-8 uses only 1 byte for each ASCII character. The total number of bytes in a sequence can be recognized from the first bits of the binary number. Since the ASCII code consists of only 7 bits, the first bit is 0; this 0 pads the character to a full byte and signals a sequence with no subsequent bytes. If we encode the name “UTF-8” as binary with UTF-8 encoding, it looks like this (one byte per ASCII character):
U 01010101
T 01010100
F 01000110
- 00101101
8 00111000
UTF-8 encoding assigns ASCII characters, such as those in the table above, to a single sequence of bits. All of the following characters and symbols within the Unicode standard consist of two to four 8-bit sequences. The first sequence is the start byte, or initial byte; the following sequences are the continuation bytes. Start bytes of multi-byte sequences always begin with 11, while continuation bytes always begin with 10. If you manually search for a certain point in the code, you can therefore recognize the beginning of a character by the markers 0 and 11. The first printable multi-byte character is the inverted exclamation mark (¡):
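The marker logic just described can be sketched in a few lines of Python (the helper name is ours, for illustration):

```python
# Sketch: classifying each byte of a UTF-8 stream by its leading bits,
# as described above: 0xxxxxxx = ASCII, 11... = start byte, 10... = continuation.
def classify(byte: int) -> str:
    if byte < 0b10000000:
        return "single-byte ASCII character"
    if byte & 0b11000000 == 0b10000000:
        return "continuation byte"
    return "start byte of a multi-byte sequence"

for b in "A¡".encode("utf-8"):       # ¡ (U+00A1) encodes as 0xC2 0xA1
    print(f"{b:08b}: {classify(b)}")
```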
CODING OF THE PREFIX
Prefix encoding prevents another character from being encoded within a sequence of bytes. If a stream of bytes begins in the middle of a document, the computer still displays the legible characters correctly, as incomplete characters are simply not represented. If you are looking for the beginning of a character, bearing in mind the 4-byte limit, you need to go back at most three bytes from any point to find the start byte. Another structurally important element is that the number of 1 bits at the beginning of the start byte indicates the length of the byte sequence: as shown above, 110xxxxx stands for 2 bytes, 1110xxxx for 3 bytes and 11110xxx for 4 bytes. In Unicode the assigned byte value corresponds to the character number, allowing lexical ordering. However, there are gaps: the Unicode range U+007F to U+009F contains control characters, and in this range the UTF-8 standard assigns no printable characters, only commands. UTF-8 encoding could, as mentioned above, theoretically chain together longer byte sequences, but Unicode prescribes a maximum length of 4 bytes; consequently, byte sequences of 5 or more bytes are invalid by definition. This limitation reflects the goal of representing the code as compactly as possible, that is, in the most efficient way in terms of storage space, and in the most structured way possible. A basic rule of UTF-8 is to prefer the shortest possible encoding. For example, the letter ä is encoded using 2 bytes: 11000011 10100100. In theory one could combine the code points of the letter a (01100001) and the combining umlaut ¨ (11001100 10001000) to represent ä: 01100001 11001100 10001000. In UTF-8, however, this form is considered too long, and the precomposed form is preferred. Some Unicode value ranges are not defined in UTF-8 in order to remain available for UTF-16 surrogates.
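The shortest-encoding rule can be checked with a minimal Python sketch (the decomposed “a + ¨” form is a normalization question; here we only test that an overlong byte sequence is rejected by a conforming decoder):

```python
# Sketch: the "shortest possible encoding" rule in practice.
# ä (U+00E4) must be encoded with the 2-byte form 0xC3 0xA4.
assert b"\xc3\xa4".decode("utf-8") == "ä"

# An overlong 2-byte encoding of "a" (U+0061), which fits in a single byte,
# is rejected: 0xC1 can never be a valid UTF-8 start byte.
try:
    b"\xc1\xa1".decode("utf-8")
    print("accepted (non-conforming decoder)")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)  # Python reports "invalid start byte"
```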
The overview shows which bytes in UTF-8, in the Unicode standard, are valid according to the Internet Engineering Task Force (IETF) (ranges marked in green are valid bytes, those marked in red are invalid).
EXAMPLE
The character ᅢ (Hangul Jungseong AE) corresponds to U+1162 in Unicode.
UTF-8 requires 3 bytes for the code point U+1162, because it lies in the range U+0800 to U+FFFF. The start byte therefore begins with 1110, and the next two bytes each start with 10. Into the free bits, which do not specify the structure, the binary number is inserted from right to left; the remaining bit positions in the start byte are filled with 0 until the octet is full. We thus obtain the following UTF-8 encoding:
11100001 10000101 10100010 (the bits of the code point fill the x positions).
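The same 3-byte construction, done in Python and checked against the standard encoder:

```python
# Sketch: filling the 1110xxxx 10xxxxxx 10xxxxxx template with the
# 16 bits of U+1162 (ᅢ, Hangul Jungseong AE).
cp = 0x1162
b1 = 0b11100000 | (cp >> 12)            # top 4 bits of the code point
b2 = 0b10000000 | ((cp >> 6) & 0x3F)    # middle 6 bits
b3 = 0b10000000 | (cp & 0x3F)           # low 6 bits
encoded = bytes([b1, b2, b3])

assert encoded == "ᅢ".encode("utf-8")
print(" ".join(f"{b:08b}" for b in encoded))  # 11100001 10000101 10100010
```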
PRINTING A UNICODE CHARACTER IN WORD
Based on this table, we see that encoding a text that uses only the 128 characters of the ASCII code with UTF-16, or even worse with UTF-32, would be just a waste of RAM.
Link to the Wikipedia page containing all the plans.
https://en.wikipedia.org/wiki/Plane_%28Unicode%29
Let’s try to print in Word a character found in the BMP.
说 — this ideogram is found in the range 8000–8FFF of the BMP, at address 8BF4.
We convert the hexadecimal number into decimal:
4×16^0 + 15×16^1 + 11×16^2 + 8×16^3 = 4 + 240 + 2816 + 32768 = 35828
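The same conversion, checked in Python:

```python
# Sketch: converting 0x8BF4 to decimal by positional weights and by built-ins.
digits = [4, 0xF, 0xB, 8]   # 8BF4 read from the least significant digit
value = sum(d * 16**i for i, d in enumerate(digits))

assert value == int("8BF4", 16) == ord("说") == 35828
print(value)  # 35828
```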
Finally, we print the ideogram in Word by holding down Alt, activating the numeric keypad, and typing the sequence 35828.
说