您现在的位置是：网站首页 -> 数据库文章内容

mysql字符编码问题latin1到utf8-itarticl.cc-IT技术类文章记录&分享

发布时间: 8年前【数据库】 264人已围观【返回】

测试目的：

Java从MySQL读取数据并打印

测试环境1

a. Hadoop集群某节点的环境是utf8，java代码也是utf8编码

b. 需要读取的mysql服，数据库、数据表均是latin1

运行mysql -u* -p* -A -h 进入mysql服

（一）查看mysql编码的基本方法

先在mysql下确认原始数据是什么编码，经过下面三步可以确认原始数据是latin1

进入mysql > use db;

mysql> show variables like 'character%';

+--------------------------+----------------------------+

| Variable_name | Value |

+--------------------------+----------------------------+

| character_set_client | gbk |

| character_set_connection | gbk |

| character_set_database | latin1 |

| character_set_filesystem | binary |

| character_set_results | gbk |

| character_set_server | latin1 |

| character_set_system | utf8 |

| character_sets_dir | /usr/share/mysql/charsets/ |

+--------------------------+----------------------------+

8 rows in set (0.00 sec)

// set names的作用是改变三个红色部分的三个参数

mysql> set names 'latin1';

Query OK, 0 rows affected (0.00 sec)

mysql> show variables like 'character%';

+--------------------------+----------------------------+

| Variable_name | Value |

+--------------------------+----------------------------+

| character_set_client | latin1 |

| character_set_connection | latin1 |

| character_set_database | latin1 |

| character_set_filesystem | binary |

| character_set_results | latin1 |

| character_set_server | latin1 |

| character_set_system | utf8 |

| character_sets_dir | /usr/share/mysql/charsets/ |

+--------------------------+----------------------------+

8 rows in set (0.00 sec)

Set names 'utf8'等价于下面三条语句

SET character_set_client = utf8 用来设置客户端送给MySQL服务器的数据的字符集

SET character_set_results = utf8 服务器返回查询结果时使用的字符集

SET character_set_connection = utf8 MySQL 服务器把客户端传来的数据，从character_set_client字符集转换成character_set_connection字符集

mysql的信息流向及编码转换

接收：client --> connection --> database

反馈：database--->connection-->results

mysql> SELECT LOWER(consumption_name),consumption FROM dimen_table;

+-------------------------+-------------+

| LOWER(consumption_name) | consumption |

+-------------------------+-------------+

| gm命令 | -1019 |

| 神器解锁 | -1018 |

| 购买经验药水 | -1017 |

| 购买时装 | -1016 |

| 运营活动(领物品) | -1015 |

| 神器精炼(棍) | -1014 |

| 神器精炼(拳) | -1013 |

| 商店购买 | -1012 |

（二） java下的代码测试

测试1： java连接代码

jdbc:mysql://106.2.67.10/sdc_hdfs?useUnicode=true&characterEncoding=UTF-8

Statement statement = con.createStatement();

statement.execute("set names 'utf8'");//与参数characterEncoding=UTF-8的作用类似

sql = "SELECT LOWER(consumption_name),consumption FROM dimen_table"

ResultSet rs = statement.executeQuery(sql);

while (rs.next()) {

System.out.println(new String(rs.getBytes(i+1),"cp1252")); // latin1 对应的解码用ISO-8859-1或者Cp1252，我测试的时候发现应该用cp1252

}

// 查看mysql支持的字符集及其描述

mysql> SHOW CHARACTER SET;

+----------+-----------------------------+---------------------+--------+

+----------+-----------------------------+---------------------+--------+

| big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |

| dec8 | DEC West European | dec8_swedish_ci | 1 |

| cp850 | DOS West European | cp850_general_ci | 1 |

| hp8 | HP West European | hp8_english_ci | 1 |

| swe7 | 7bit Swedish | swe7_swedish_ci | 1 |

【参考】

http://stackoverflow.com/questions/21689665/mysql-latin1-to-utf-8-using-java-hibernate-jpa ：

提到latin1应该用cp1252解码：

MySQL's version of latin1 is an extended version of CP1252: it uses the 5 bytes that CP1252 leaves undefined. Unfortunately the current Connector/J has a "bug" in that it uses the original CP1252 rather than MySQL's own version. Therefore it's impossible to recover strings whose encoding uses one of these 5 bytes. Patching the Connector/J source to fix the bug could solve the problem, but ideally you should migrate the tables to UTF-8.

提到修改sql语句的解决方法：SELECT CONVERT(CONVERT(CONVERT( column_name USING latin1) USING binary) using utf8) FROM...

测试2，改sql，将编码转换交给mysql来处理

url置为jdbc:mysql://123.*.*.108/db

sql置成SELECT CONVERT(CONVERT(CONVERT(LOWER(consumption_name) USING latin1) USING binary) USING utf8),consumption FROM dimen_table

执行sql前，先执行statement.execute("set names 'utf8'");

在ResultSet中，rs.getString()获得的就是utf-8编码

在windows下显示正常。

在linux下显示有问题。

gm命令 -1019

神器解锁 -1018

购买经验药水 -1017

购买时装 -1016

运营活动(领物-1015

神器精炼(-1014

神器精炼(-1013

商店购买 -1012

（说明：红色部分不能正常解析）

这个是linux环境的编码环境导致的：解决方法

1) 使用locale查看file.encoding这个系统变量，如果不是utf8可以运行这个命令 export LANG=zh_CN.utf8

或者这样LANG=zh_CN.utf8 java -Djava.ext.dir=/sdfls/asdlfjal/test Main执行之后

2) 或者java -Dfile.encoding=utf8 MainClass

测试3：修改测试环境

a. hadoop集群某节点的环境是utf8，java代码也是utf8编码

b. 将上面的latin1数据库的编码改为utf8（修改生效的判定：新建的表默认是utf8.修改之前默认的编码是latin1）

步骤及结果

url置为jdbc:mysql://123.*.*.108/db?CharSet=utf8&useUnicode=true&characterEncoding=utf8

sql还是普通写法:SELECT LOWER(consumption_name),consumption FROM info.dimen_table

在执行上面这个sql之前，查看字符集

System.out.println(SqlTest.getSqlResutl(statement, "show variables like '%char%'"));// 初始连接进去之后查看字符集

/** 结果5

character_set_client utf8

character_set_connection utf8

character_set_database utf8

character_set_filesystem binary

character_set_results

<span style="color:#FF0000;">character_set_server latin1</span>

character_set_system utf8

character_sets_dir /usr/share/mysql/charsets/

statement.execute("set names 'utf8'"); //修改连接方式

//statement.execute("set character_set_server='utf8'");

System.out.println(SqlTest.getSqlResutl(statement, "show variables like '%char%'"));// 修改之后，查看连接使用的字符集

/** 结果6

character_set_client utf8

character_set_connection utf8

character_set_database utf8

character_set_filesystem binary

character_set_results

<span style="color:#FF0000;">character_set_server utf8</span>

character_set_system utf8

character_sets_dir /usr/share/mysql/charsets/

在ResultSet中，用utf8解释：new String(rs.getString(i).getBytes("UTF-8"));// 右边这个写法完全是乱码 new String(rs.getString(i).getBytes(),"UTF-8")

跟测试2的最终输出结果是一样的。

（对于纯粹的utf8环境：服务器编码是utf8;mysql的默认编码也是utf8，进入mysql之后查看字符集，会看到字符集展示就是结果6，即默认的character_set_server=utf8）

感觉所有问题的根源在于默认的character_set_server=latin1

http://stackoverflow.com/questions/27866533/whacky-latin1-to-utf8-conversion-in-jdbc ：提到jdbc对不识别的latin1的编码字符插入了特殊的替换字符

JDBC seems to insert a utf8 replacement character when asked to read from a latin1 column containing undefined latin1 codepage characters

【参考】使用Java读写存储在latin1编码的MySQL中的UTF-8编码的中文

Character set bug at server with utf8 column and latin1 connection

Description: // This bug is reproduced using a MySQL Linux default installation where "character_set_server" is "latin1"‘

其他相关资料及描述

貌似最好把所有字符集设置成utf8的方法，解决起来最彻底；并且这个问题只出现在jdbc中，使用Python操作的时候，读取都是正常的

conn10 = MySQLdb.connect(host=db10, user=mdUser, passwd=mdPasswd, db="sdc_hdfs" )

cursor10 = conn10.cursor()

cursor10.execute("set @@autocommit=1")

cursor10.execute("SHOW VARIABLES LIKE 'character_set_database'")

data=cursor10.fetchone()

if data[1]=='utf8':

cursor10.close()

cursor10 = MySQLdb.connect(db10, mdUser, mdPasswd, 'sdc_hdfs', charset='utf8', use_unicode=False).cursor() #使用utf8来连接

cursor10.execute("set @@autocommit=1")

Java中String解码、编码

Strings: although Java uses Unicode all the time under the hood, when you convert between String and byte[] using String#getBytes() or String(byte[]), you should rather use the overloaded method/constructor which takes the character encoding:

byte[] bytesInDefaultEncoding = someString.getBytes(); // May generate corrupt bytes.

byte[] bytesInUTF8 = someString.getBytes("UTF-8"); // Correct.

String stringUsingDefaultEncoding = new String(bytesInUTF8); // Unknown bytes becomes "?".

String stringUsingUTF8 = new String(bytesInUTF8, "UTF-8"); // Correct.

Otherwise the platform default encoding will be used, which can be the one of the underlying operating system or the IDE(!).

Unicode - How to get the characters right?

latin1转gbk的乱码问题，jdbc的bug

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

最后发现：

1. 不同jdbc对编码的支持情况不一样

2. 测试2，其实能解决问题，也就是set names 'utf8' + 在SQL里边使用convert()来转码是万能的：能绕开不同jdbc带来的问题。

到MySQL命令行输入“SET NAMES UTF8;”，然后执行“show variebles like“character_set_%”;”，发现原来为latin1的那些变量“character_set_client”、“character_set_connection”、“character_set_results”的值全部变为utf8了，原来是这3个变量在捣蛋。

查阅手册，上面那句等于：

SET character_set_client = utf8;

SET character_set_results = utf8;

SET character_set_connection = utf8;

看看这3个变量的作用：

信息输入路径：client→connection→server；

信息输出路径：server→connection→results。

换句话说，每个路径要经过3次改变字符集编码。以出现乱码的输出为例，server里utf8的数据，传入connection转为latin1，传入results转为latin1，utf-8页面又把results转过来。如果两种字符集不兼容，比如latin1和utf8，转化过程就为不可逆的，破坏性的。

但这里要声明一点，“SET NAMES UTF8”作用只是临时的，MySQL重启后就恢复默认了。

发布时间: 8年前【数据库】264人已围观【返回】【回到顶端】

很赞哦！ (1)

上一篇：utf8_general_ci 与 utf8_general_cs 和 utf8_bin的区别