从认识mysql开始就开始受困于mysql的字符集问题,希望能花点时间搞清楚,下面是一个什么样的问题呢:
“汉字”在utf8的字符集中是6个字节的,然而,这里却是8个字节,怎么解释呢?而且,“汉字”在utf8字符集中的16进制编码应该是“e6 b1 89 e5 ad 97”
参考资料: http://mysql.rjweb.org/doc.php/charcoll
里面有如下一段描述,似乎可以解释上述的问题:
—————————–
“Double encoding” is a term I made up for the following situation. It is a case where “two wrongs make a right”.
⚈ Table defined utf8
⚈ INSERTer declared latin1 (used default instead of doing SET NAMES)
⚈ Data being INSERTed is actually utf8 already.
What happened:
⚈ A 2-byte letter (say, a grave-e) was correctly represented in utf8.
⚈ The INSERT statement handed the 2 bytes to MySQL, but implied the need for conversion
⚈ Each byte was converted to utf8
⚈ The table has 4 bytes.
When SELECTing, the reverse happens 4->2->1, and the user is oblivious of the bug. Looking CHAR_LENGTH will spot it. Strange orderings may happen.
—————————–
看来预期6个字节存储为8个字节还不算严重的,严重的是上面存储翻倍的情况,上面的引用翻译如下:
================
“Double encoding” 是我用来描述下述情况的一个术语,这是一种“错错得对”的情况。
⚈ 表定义为utf8
⚈ INSERT时声明的字符集为latin1(如果没有使用 set names,则默认就是latin1的)
⚈ 要插入的数据确实是utf8编码
发生了什么?
⚈ 一个utf8编码的2字节的字母
⚈ 插入语句写mysql时要对2字节做处理,但是因为字符集声明为latin1的,所以要做编码转换
⚈ 每个字节用2字节的utf8编码表示
⚈ 存储结果为4个字节
在SELECT 的时候,发生一个反向的转换 4->2->1,并且用户不知道这个’bug’,于是就出现了上面的问题
=================