The meaning of «big5»

Big-5 or Big5 is a Chinese character encoding method used in Taiwan, Hong Kong, and Macau for traditional Chinese characters.

The People's Republic of China (PRC), which uses simplified Chinese characters, uses the GB 18030 character set instead.

Big5 gets its name from the consortium of five companies in Taiwan that developed it.[1]

The original Big5 character set is sorted first by usage frequency, second by stroke count, lastly by Kangxi radical.

The original Big5 character set lacked many commonly used characters. To solve this problem, each vendor developed its own extension. The ETen extension became part of the current Big5 standard by virtue of its popularity.

The structure of Big5 does not conform to the ISO 2022 standard, but rather bears a certain similarity to the Shift JIS encoding. It is a double-byte character set (DBCS) with the following structure for standard assignments (the prefix 0x signifies hexadecimal numbers):

First byte ("lead byte"): 0xA1 to 0xFE

Second byte ("trail byte"): 0x40 to 0x7E, or 0xA1 to 0xFE

Standard assignments (excluding vendor or user-defined extensions) do not use the bytes 0x7F through 0xA0, nor 0xFF, as either lead (first) or trail (second) bytes. Bytes 0xA1 through 0xFE are used for both lead and trail bytes for double-byte (Big5) codes. Bytes 0x40 through 0x7E are used as trail bytes following a lead byte, or for single-byte codes otherwise. If the second byte is not in either range, behaviour is unspecified (i.e., varies from system to system). Additionally, certain variants of the Big5 character set, for example the HKSCS, use an expanded range for the lead byte, including values in the 0x81 to 0xA0 range (similar to Shift JIS), whereas others use reduced lead byte ranges (for instance, the Apple Macintosh variant uses 0xFD through 0xFF as single-byte codes, limiting the lead byte range to 0xA1 through 0xFC).[2]
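The byte ranges described above can be written as simple range checks. The following Python sketch covers only the standard assignments; the function names are illustrative and not part of any library, and vendor variants such as HKSCS or the Apple Macintosh variant would need different ranges.

```python
def is_lead_byte(b: int) -> bool:
    """True if b can start a double-byte code (standard assignments only)."""
    return 0xA1 <= b <= 0xFE


def is_trail_byte(b: int) -> bool:
    """True if b can be the second byte of a double-byte code."""
    return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE


def is_unused_byte(b: int) -> bool:
    """True if standard assignments use b as neither lead nor trail byte."""
    return 0x7F <= b <= 0xA0 or b == 0xFF
```

Note that 0x40 through 0x7E overlap with ASCII, so a byte such as 0x40 can only be interpreted once it is known whether a lead byte precedes it.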

The numerical values of individual Big5 codes are frequently given as 4-digit hexadecimal numbers, which describe the two bytes that make up the Big5 code as if they were the big-endian representation of a 16-bit number. For example, the Big5 code for a full-width space, which is the byte pair 0xa1 0x40, is usually written as 0xa140 or just A140.
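As a small illustration of this notation, the Python sketch below converts between a two-byte sequence and its 4-digit hexadecimal form. The helper names big5_code and big5_bytes are hypothetical; only the standard int.from_bytes and int.to_bytes methods are used.

```python
def big5_code(pair: bytes) -> str:
    """Format a two-byte Big5 sequence as a 4-digit hexadecimal string."""
    if len(pair) != 2:
        raise ValueError("a Big5 double-byte code is exactly two bytes")
    # Read the pair as a big-endian 16-bit number: lead byte first.
    return format(int.from_bytes(pair, "big"), "04X")


def big5_bytes(code: str) -> bytes:
    """Turn a 4-digit hexadecimal code (with or without 0x) back into two bytes."""
    return int(code, 16).to_bytes(2, "big")


print(big5_code(b"\xa1\x40"))   # the full-width space from the text
print(big5_bytes("0xa140"))
```

With Python's built-in "big5" codec, the same byte pair decodes to the ideographic space U+3000.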

Strictly speaking, the Big5 encoding contains only DBCS characters. In practice, however, Big5 codes are always used together with an unspecified, system-dependent single-byte character set (ASCII, or an 8-bit character set such as code page 437), so Big5-encoded text contains a mix of DBCS characters and single-byte characters. Bytes in the range 0x00 to 0x7f that are not part of a double-byte character are assumed to be single-byte characters. (For a more detailed description of this problem, please see the discussion on "The Matching SBCS" below.)
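One way to picture this mixture is a scanner that walks the byte stream and splits it into single-byte and double-byte units. The sketch below is a simplified illustration, assuming the standard lead byte range 0xA1 to 0xFE and treating the single-byte set as plain ASCII; the function name split_big5 and the labels it returns are invented for this example.

```python
def split_big5(data: bytes) -> list[tuple[str, bytes]]:
    """Split a Big5 byte stream into labelled single- and double-byte units."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        nxt = data[i + 1] if i + 1 < len(data) else None
        is_lead = 0xA1 <= b <= 0xFE
        has_trail = nxt is not None and (0x40 <= nxt <= 0x7E or 0xA1 <= nxt <= 0xFE)
        if is_lead and has_trail:
            # A lead byte followed by a valid trail byte forms one DBCS character.
            units.append(("double", data[i:i + 2]))
            i += 2
        elif b <= 0x7F:
            # Not part of a double-byte character: assumed single-byte.
            units.append(("single", data[i:i + 1]))
            i += 1
        else:
            # Any other byte has a system-dependent meaning.
            units.append(("unspecified", data[i:i + 1]))
            i += 1
    return units


print(split_big5(b"A\xa1\x40B"))
```

Here the byte 0x40 is consumed as a trail byte rather than read as the ASCII character "@", which is why a Big5 stream has to be scanned from the start and cannot safely be split at arbitrary byte positions.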

The meaning of non-ASCII single bytes outside the permitted values that are not part of a double-byte character varies from system to system. In old MS-DOS-based systems, they are likely to be displayed as 8-bit characters; in modern systems, they are likely to either give unpredictable results or generate an error.

