
Fully understanding GBK and UTF8, the encodings that have troubled programmers for many years

In fact, once you understand these encodings, you will see that GBK is the Chinese national standard, UTF8 is the standard for network transmission, and Unicode is the global standard.


Let’s first introduce GBK: (the development history of GBK)

Next we have to mention the location code (区位码):

A location code is four decimal digits. The first two digits are the "area" (区, also translated as "district") and the last two digits are the "position" (位, sometimes translated as "bit"). Area numbers for Chinese characters start from 16, and position numbers start from 1. The areas before 16 hold symbols, digits, letters, phonetic symbols (as used in Taiwan), box-drawing characters, Japanese characters, and so on. Simply put, codes below 1600 represent characters other than Chinese characters, and some of the codes from 1600 to 9999 represent Chinese characters. Of course, the Chinese characters defined at that time did not use up all of the available numbers.

Next came GB2312:

It is based on the location code and uses double-byte encoding to represent Chinese characters and Chinese symbols. The general encoding rule is: first byte = 0xA0 + area number, second byte = 0xA0 + position (bit) number. For example, the location code of "安" (an) in the location code table is 1618 (decimal), so the GB2312 encoding of "安" is 0xA0+16 0xA0+18, which is 0xB0 0xB2. According to the location code table, the Chinese character range of GB2312 is 0xB0A1~0xF7FE.
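The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `quwei_to_gb2312` is made up for illustration, and the range check assumes the usual 94 x 94 layout of the location code table.

```python
def quwei_to_gb2312(code: int) -> bytes:
    """Convert a four-digit location code such as 1618 into GB2312 bytes."""
    area, position = divmod(code, 100)  # 1618 -> area 16, position 18
    if not (1 <= area <= 94 and 1 <= position <= 94):
        raise ValueError("area and position must both be in 1..94")
    # Each byte is 0xA0 plus the corresponding half of the location code.
    return bytes([0xA0 + area, 0xA0 + position])


encoded = quwei_to_gb2312(1618)
print(encoded.hex())             # b0b2
print(encoded.decode("gb2312"))  # 安
```

Python's built-in `gb2312` codec agrees with the hand calculation, which is a quick way to confirm a location code lookup.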

ASCII characters are still encoded as ASCII; that is to say, modern GBK encoding is compatible with ASCII. For example, the digit 2 is stored as the single byte 0x32, not as 0xA3 0xB2. So the question is, what does 0xA3 0xB2 correspond to? Still a 2: "２". Look closely, is there a difference between "2" and "２"? There is indeed. The double-byte "２" is the full-width 2, and the ASCII "2" is the half-width 2. This is the difference behind the full-width/half-width switch found in most input methods.
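The two kinds of "2" can be compared directly in Python. The sketch below uses the built-in `gbk` codec; the variable names are only illustrative.

```python
import unicodedata

half = "2"       # half-width ASCII digit, U+0032
full = "\uff12"  # full-width digit two, U+FF12

print(half.encode("gbk"))  # b'2'        (one byte, 0x32)
print(full.encode("gbk"))  # b'\xa3\xb2' (two bytes)

# The two are different characters, although NFKC normalization
# folds the full-width form back to the ASCII digit.
print(half == full)                                 # False
print(unicodedata.normalize("NFKC", full) == half)  # True
```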

So in fact, GBK is an extension of GB2312. In turn, GB18030 is an extension of GBK.

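How do you tell ASCII and Chinese apart in the same file? From the ASCII table, we know that standard ASCII has only 128 characters, 0~127, which is 0x00~0x7F (0111 1111). So the rule is: reading from the start of the file, if the highest bit of a byte is 0, it is a single-byte ASCII character; if it is 1, that byte is the high (first) byte of a double-byte Chinese character, and the next byte belongs to the same character. The file has to be read in order, because in GBK the second byte of a character may itself be below 0x80.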
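The high-bit rule can be sketched as a small scanner. This is an illustration rather than a full decoder: the function name `split_gbk` is hypothetical, it does not validate that the byte pairs are defined GBK characters, and it does not handle the four-byte sequences that GB18030 adds.

```python
from typing import List


def split_gbk(data: bytes) -> List[bytes]:
    """Split GBK-encoded bytes into one-byte (ASCII) and two-byte units."""
    units = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # Highest bit is 0: a single-byte ASCII character.
            units.append(data[i:i + 1])
            i += 1
        else:
            # Highest bit is 1: lead byte of a double-byte character.
            if i + 1 >= len(data):
                raise ValueError("truncated double-byte character")
            units.append(data[i:i + 2])
            i += 2
    return units


sample = "a安2".encode("gbk")
print(split_gbk(sample))  # [b'a', b'\xb0\xb2', b'2']
```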

That completes the introduction to GBK, China's own standard. Does it feel a little clearer now? In the end, it is just a numbering scheme that maps one-to-one onto Chinese characters, hehe!

So how does the rest of the world encode characters? In fact, it is similar, except that it covers not only Chinese characters but also characters from countries all over the world.


Many programming languages and systems originally supported Unicode only as double-byte units, so Unicode was commonly treated as a double-byte standard for all the characters in the world (two bytes can hold 65,536 values). The current Unicode standard is larger than that: code points run from U+0000 to U+10FFFF, and the first 65,536 of them form the Basic Multilingual Plane.

With a double-byte encoding, every English character takes two bytes, which greatly increases storage cost and traffic. Therefore, in most cases text is not stored or transmitted in the raw double-byte form, but is converted and encoded into UTF8. This is how UTF8 came about.
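The storage cost is easy to measure. The sketch below compares a double-byte encoding (Python's `utf-16-be` codec, chosen here because it writes no BOM) with UTF8; the helper name `encoded_sizes` is only illustrative.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes for UTF-16 and UTF-8."""
    return {
        "utf-16": len(text.encode("utf-16-be")),  # two bytes per BMP character
        "utf-8": len(text.encode("utf-8")),       # one to four bytes per character
    }


print(encoded_sizes("hello"))  # {'utf-16': 10, 'utf-8': 5}
print(encoded_sizes("安"))     # {'utf-16': 2, 'utf-8': 3}
```

English text halves in size under UTF8, while a Chinese character grows from two bytes to three, which is the trade-off UTF8 makes.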

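The conversion between Unicode and UTF8 is performed through the following table (the standard UTF8 mapping, where x marks the bits of the code point):

  Unicode code point (hex)    UTF8 bytes (binary)
  U+0000  ~ U+007F            0xxxxxxx
  U+0080  ~ U+07FF            110xxxxx 10xxxxxx
  U+0800  ~ U+FFFF            1110xxxx 10xxxxxx 10xxxxxx
  U+10000 ~ U+10FFFF          11110xxx 10xxxxxx 10xxxxxx 10xxxxxx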
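As a rough illustration of the conversion, the sketch below encodes a single code point by hand using the standard UTF8 bit layout (one byte up to U+007F, two up to U+07FF, three up to U+FFFF, four up to U+10FFFF). The function name `utf8_encode` is hypothetical; in real code `str.encode("utf-8")` does this for you.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point into UTF-8 bytes by hand."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x80:            # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:           # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:         # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:       # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")


print(utf8_encode(ord("安")).hex())  # e5ae89
```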

The last question is the BOM. What is a BOM?

The so-called BOM header (Byte Order Mark) is the first few bytes of a text file; they mark the encoding and byte order and do not represent any visible character. You can see it with a hex editor (such as bz.exe).

  1. The BOM header of UTF8 is 0xEF 0xBB 0xBF

  2. The BOM header of Unicode big endian mode (UTF-16 BE) is 0xFE 0xFF

  3. The BOM header of Unicode little endian mode (UTF-16 LE) is 0xFF 0xFE
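Checking for these three headers takes only a prefix comparison. The sketch below uses the BOM constants from Python's standard `codecs` module; the function name `detect_bom` is made up for illustration, and it deliberately covers only the three BOMs listed above.

```python
import codecs
from typing import Optional, Tuple

# (BOM bytes, codec name) pairs for the three BOMs listed above.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),      # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def detect_bom(data: bytes) -> Tuple[Optional[str], int]:
    """Return (codec name, BOM length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


print(detect_bom(b"\xef\xbb\xbfhello"))  # ('utf-8-sig', 3)
print(detect_bom(b"hello"))              # (None, 0)
```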

How can you tell whether a text without a BOM is UTF8 or GBK?

The answer is that it can only be determined by analyzing the byte patterns of the content itself. In practice the recognition accuracy is quite high for text of reasonable length. Tools that do this include commonly used editors such as Notepad++, PHP's mb_ series of functions, Python's chardet library, and ports to other languages such as jchardet and jschardet.
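A much cruder approach than the libraries above is to try strict decoding. The sketch below is only that simple heuristic, not how chardet or Notepad++ work: it assumes the input is either UTF8 or GBK, and the function name `guess_encoding` is hypothetical. UTF8 is tried first because its byte structure is strict, so GBK text rarely passes as valid UTF8, whereas many UTF8 byte sequences also happen to decode as GBK.

```python
def guess_encoding(data: bytes) -> str:
    """Guess 'utf-8' or 'gbk' by attempting strict decoding, UTF-8 first."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("gbk")
        return "gbk"
    except UnicodeDecodeError:
        return "unknown"


print(guess_encoding("安".encode("utf-8")))  # utf-8
print(guess_encoding("安".encode("gbk")))    # gbk
```

Pure ASCII input is reported as `utf-8`, since ASCII bytes are valid in both encodings. Very short inputs can be ambiguous, which is why real detectors rely on statistics over longer text.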

Articles are uploaded by users and are for non-commercial browsing only. Posted by: Lomu, please indicate the source: https://www.daogebangong.com/en/articles/detail/che-di-gao-dong-kun-rao-cheng-xu-yuan-duo-nian-de-GBK-he-UTF8.html
