Unicode: Difference between revisions

Latest revision as of 08:54, 4 October 2019

Unicode is a universal character set that defines more than 100,000 characters, together with additional information about various character properties. It covers the writing systems of virtually all of the world's living languages, plus other symbols and notations. Old operating systems like Windows 95/98/Me could work only with character sets of 256 characters (ANSI), meaning that an operating system could handle only 1 or 2 different alphabets/languages at a time. Modern operating systems use Unicode in one form or another and are therefore able to work with almost any language and alphabet.

When it comes to applications, many are still unable to work with Unicode text or process Unicode filenames. This is because they were originally designed for the old operating systems, and from the programmer's point of view, redesigning an application to work with Unicode on new operating systems is not an easy task.

ASCII vs ANSI

ASCII defines 128 characters, with ordinal values 0-127 ($00-$7F Hex). Of these, 33 are control characters and the remaining 95 are printable characters: the letters of the English alphabet, digits, common symbols, and punctuation marks. ANSI, on the other hand, is an extension to ASCII. Since each ASCII character occupies 1 byte (8 bits, with ordinal value range 0-255), 128 mappings remain unused. ANSI uses them to support letters and symbols for the alphabets of other languages. Each language can use its own custom extension to ASCII with its own letters included, and each such extension has its own name (code page), for example: Windows-1252 for Western European languages (Latin alphabet), Windows-1251 for Russian (Cyrillic alphabet), etc.
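
The effect of code pages can be illustrated with a short Python sketch. It is only an illustration, not part of any particular application: the byte value 0xE9 and the variable names are arbitrary choices, and `cp1252` / `cp1251` are Python's codec names for Windows-1252 and Windows-1251. The same byte in the upper half of the range decodes to a different letter under each code page, while bytes in the ASCII range decode identically.

```python
# A single byte in the upper half (128-255), which ASCII leaves undefined.
raw = bytes([0xE9])

# The same byte maps to a different character in each code page.
western = raw.decode("cp1252")   # Windows-1252 (Western European)
cyrillic = raw.decode("cp1251")  # Windows-1251 (Cyrillic)

# Plain ASCII cannot decode the byte at all.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Bytes in the ASCII range (0-127) decode the same way in both code pages.
shared = b"Hello".decode("cp1252") == b"Hello".decode("cp1251")

print(f"cp1252: U+{ord(western):04X}, cp1251: U+{ord(cyrillic):04X}")
print("decodable as ASCII:", ascii_ok, "| ASCII range shared:", shared)
```

This is why text saved under one code page turns into the wrong letters when opened under another: the bytes are unchanged, only their interpretation differs.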

Encoding

Unicode text can be stored using a range of encoding schemes, including UTF-8, UTF-16, and UTF-32. UTF-16 and UTF-32 additionally come in byte order variants (little-endian and big-endian); UTF-8 is a sequence of single bytes and has no byte order variants.
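
As a rough illustration of the byte order variants, the Python sketch below encodes the letter "A" (chosen arbitrarily) with the standard library codecs `utf-16-le` and `utf-16-be`. The plain `utf-16` codec is assumed here only to show that Python prepends a byte order mark (BOM) so that a reader can tell which variant follows.

```python
import codecs

text = "A"  # U+0041

# Explicit byte order: the same 16-bit code unit, with its two bytes swapped.
little = text.encode("utf-16-le")
big = text.encode("utf-16-be")

# The generic "utf-16" codec writes a byte order mark first, then the text
# in the platform's native byte order.
with_bom = text.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# UTF-8 works on single bytes, so there is nothing to swap.
utf8 = text.encode("utf-8")

print(little.hex(), big.hex(), utf8.hex(), has_bom)
```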

Each encoding offers a different balance between compactness (storage efficiency) and convenience (ease of manipulation). The choice depends heavily on the application's needs.

Number of bytes taken by each encoding to represent a Unicode character:

Encoding	Bytes per character
UTF-8	1-4
UTF-16	2 or 4
UTF-32	4
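
The figures in the table can be checked with a small Python sketch. The sample characters and the helper name `byte_counts` are illustrative choices; the little-endian codecs are used so that no byte order mark is counted.

```python
# Sample characters from different parts of the Unicode range.
samples = [
    "A",           # U+0041, ASCII letter
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

# Explicit little-endian codecs avoid counting a byte order mark.
encodings = ["utf-8", "utf-16-le", "utf-32-le"]


def byte_counts(ch: str) -> dict:
    """Return the number of bytes each encoding needs for one character."""
    return {enc: len(ch.encode(enc)) for enc in encodings}


for ch in samples:
    print(f"U+{ord(ch):04X}", byte_counts(ch))
```

UTF-8 is the most compact for ASCII-heavy text, while UTF-32 always uses 4 bytes, which makes indexing by character trivial at the cost of space.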

@@ Line 1: / Line 1: @@
-{{Expand}}
+[http://en.wikipedia.org/wiki/Unicode Unicode] is a universal character set which consists of more than 100,000 characters and additional information regarding various character properties. Unicode represents all of the world's living languages, plus other symbols and notations. Old operating systems like Windows 95/98/Me were able to work only with character sets consisting only out of 256 characters ([http://en.wikipedia.org/wiki/ANSI ANSI]), meaning that an operating system could work only in 1 or 2 different alphabets/languages. These days, modern operating systems use Unicode in one form or another, hence, are able to work with almost any language and alphabet.
-[http://en.wikipedia.org/wiki/Unicode Unicode] is a universal character set which consists of more than 100,000 characters and additional information regarding various character properties. Unicode represents all of the world's living languages, plus other symbols and notations. Old operating systems like Windows 95/98/Me were able to work only with character sets consisting only out of 256 characters ([http://en.wikipedia.org/wiki/ANSI ANSI]), meaning that an operating system could work only in 1 or 2 different alphabets/languages. These days, modern operating systems use Unicode in one form or another, hence, are able to work with almost any language and alphabet.
+When it comes to applications, many of them are not able to work with Unicode text or process Unicode filenames. This is because they were initially designed for the old operating systems, and from the programmer's point of view it is not an easy task to redesign an application to work with Unicode on new operating systems.
 == ASCII vs ANSI ==
-'''Question:''' Is it ANSI character or ASCII character? (The distinction is, ANSI also has the extensions, such as escape sequences). -- [[User:Narayan|Narayan]] 07:15, 21 June 2009 (UTC)
+[http://en.wikipedia.org/wiki/ASCII ASCII] includes definitions for 128 characters, with ordinary values 0-127 ($00-$7F Hex). These represent 33 control characters and the rest are printable characters consisting of letters of English alphabet, some common symbols and punctuations. [http://en.wikipedia.org/wiki/ANSI ANSI], on the other hand, is an extension to ASCII. Since each ASCII character occupies 1 byte (8 bits, with ordinary value range 0-255), there are 128 unused mappings. ANSI extends ASCII to support letters and symbols for alphabets of other languages. Each language can use their own custom extension to ASCII with their own letters included, and each such custom extension will have it's own name (code page), for example: [http://en.wikipedia.org/wiki/Windows-1252 Windows-1252] for Latin, [http://en.wikipedia.org/wiki/Windows-1251 Windows-1251] for Russian, etc.
+== Encoding ==
+Unicode text can be stored using a [https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings range of encoding schemas], including UTF-8, UTF-16, UTF-32, and their relevant byte order variants (little-endian and big-endian).
+Each encoding offers a different balance between compactness (storage efficiency) and convenience (ease of manipulation). The choice depends heavily on the application needs.
+Number of bytes taken by each encoding to represent a Unicode character:
+{| class="wikitable"
+! Encoding
+! Bytes
+|-
+| UTF-8 || 1-4
+|-
+| UTF-16 || 2-4
+|-
+| UTF-32 || 4
+|}
+== External References ==
-'''Answer:''' [http://en.wikipedia.org/wiki/ASCII ASCII] includes definitions for 128 characters, with ordinary values 0-127 ($00-$7F Hex). These represent 33 control characters and the rest are printable characters consisting of letters of English alphabet, some common symbols and punctuations. [http://en.wikipedia.org/wiki/ANSI ANSI], on the other hand, is an extension to ASCII. Since each ASCII character occupies 1 byte (8 bits, with ordinary value range 0-255), there are 128 unused mappings. ANSI extends ASCII to support letters and symbols for alphabets of other languages. Each language can use their own custom extension to ASCII with their own letters included, and each such custom extension will have it's own name (code page), for example: [http://en.wikipedia.org/wiki/Windows-1252 Windows-1252] for Latin, [http://en.wikipedia.org/wiki/Windows-1251 Windows-1251] for Russian, etc. -- [[User:Den4b|Den4b]] 11:25, 21 June 2009 (UTC)
+* http://en.wikipedia.org/wiki/Unicode
+* http://en.wikipedia.org/wiki/ANSI
+* http://en.wikipedia.org/wiki/UTF-8
+* http://en.wikipedia.org/wiki/Windows-1252
+* http://en.wikipedia.org/wiki/Windows-1251
