关于mysql的字符集问题

从认识mysql开始就开始受困于mysql的字符集问题，希望能花点时间搞清楚，下面是一个什么样的问题呢：

“汉字”在utf8的字符集中是6个字节的，然而，这里却是8个字节，怎么解释呢？而且，“汉字”在utf8字符集中的16进制编码应该是“e6 b1 89 e5 ad 97”

参考资料： http://mysql.rjweb.org/doc.php/charcoll

里面有如下一段描述，似乎可以解释上述的问题：

—————————–

“Double encoding” is a term I made up for the following situation. It is a case where “two wrongs make a right”.

⚈  Table defined utf8
⚈  INSERTer declared latin1 (used default instead of doing SET NAMES)
⚈  Data being INSERTed is actually utf8 already.
What happened:
⚈  A 2-byte letter (say, a grave-e) was correctly represented in utf8.
⚈  The INSERT statement handed the 2 bytes to MySQL, but implied the need for conversion
⚈  Each byte was converted to utf8
⚈  The table has 4 bytes.
When SELECTing, the reverse happens 4->2->1, and the user is oblivious of the bug. Looking CHAR_LENGTH will spot it. Strange orderings may happen.

—————————–

看来预期6个字节存储为8个字节还不算严重的，严重的是上面存储翻倍的情况，上面的引用翻译如下：

================

“Double encoding” 是我用来描述下述情况的一个术语，这是一种“错错得对”的情况。

⚈ 表定义为utf8
⚈ INSERT时声明的字符集为latin1（如果没有使用 set names，则默认就是latin1的）
⚈ 要插入的数据确实是utf8编码

发生了什么？
⚈ 一个utf8编码的2字节的字母
⚈ 插入语句写mysql时要对2字节做处理，但是因为字符集声明为latin1的，所以要做编码转换
⚈ 每个字节用2字节的utf8编码表示
⚈ 存储结果为4个字节

在SELECT 的时候，发生一个反向的转换 4->2->1，并且用户不知道这个’bug’,于是就出现了上面的问题

=================

关于mysql的字符集问题

留下评论

取消回复