Linux | c&cpp | Email | github | QQ群:425043908 关注本站

itarticle.cc

您现在的位置是:网站首页 -> 数据库 文章内容

mysql字符编码问题latin1到utf8-itarticl.cc-IT技术类文章记录&分享

发布时间: 8年前数据库 153人已围观返回

测试目的:

Java从MySQL读取数据并打印

测试环境1

a. Hadoop集群某节点的环境是utf8,java代码也是utf8编码

b. 需要读取的mysql服,数据库、数据表均是latin1

运行mysql -u* -p* -A -h 进入mysql服

(一)查看mysql编码的基本方法

先在mysql下确认原始数据是什么编码,经过下面三步可以确认原始数据是latin1

进入mysql > use db;

mysql> show variables like 'character%';

+--------------------------+----------------------------+

| Variable_name | Value |

+--------------------------+----------------------------+

| character_set_client | gbk |

| character_set_connection | gbk |

| character_set_database | latin1 |

| character_set_filesystem | binary |

| character_set_results | gbk |

| character_set_server | latin1 |

| character_set_system | utf8 |

| character_sets_dir | /usr/share/mysql/charsets/ |

+--------------------------+----------------------------+

8 rows in set (0.00 sec)

// set names的作用是改变三个红色部分的三个参数

mysql> set names 'latin1';

Query OK, 0 rows affected (0.00 sec)

mysql> show variables like 'character%';

+--------------------------+----------------------------+

| Variable_name | Value |

+--------------------------+----------------------------+

| character_set_client | latin1 |

| character_set_connection | latin1 |

| character_set_database | latin1 |

| character_set_filesystem | binary |

| character_set_results | latin1 |

| character_set_server | latin1 |

| character_set_system | utf8 |

| character_sets_dir | /usr/share/mysql/charsets/ |

+--------------------------+----------------------------+

8 rows in set (0.00 sec)

Set names 'utf8'等价于下面三条语句

SET character_set_client = utf8 用来设置客户端送给MySQL服务器的数据的 字符集

SET character_set_results = utf8 服务器返回查询结果时使用的字符集

SET character_set_connection = utf8 MySQL 服务器 把客户端传来的数据,从character_set_client字符集转换成character_set_connection字符集

mysql的信息流向及编码转换

接收:client --> connection --> database

反馈:database--->connection-->results

mysql> SELECT LOWER(consumption_name),consumption FROM dimen_table;

+-------------------------+-------------+

| LOWER(consumption_name) | consumption |

+-------------------------+-------------+

| gm命令 | -1019 |

| 神器解锁 | -1018 |

| 购买经验药水 | -1017 |

| 购买时装 | -1016 |

| 运营活动(领物品) | -1015 |

| 神器精炼(棍) | -1014 |

| 神器精炼(拳) | -1013 |

| 商店购买 | -1012 |

(二) java下的代码测试

测试1: java连接代码

jdbc:mysql://106.2.67.10/sdc_hdfs?useUnicode=true&characterEncoding=UTF-8

Statement statement = con.createStatement();

statement.execute("set names 'utf8'");//与参数characterEncoding=UTF-8的作用类似

sql = "SELECT LOWER(consumption_name),consumption FROM dimen_table"

ResultSet rs = statement.executeQuery(sql);

while (rs.next()) {

System.out.println(new String(rs.getBytes(i+1),"cp1252")); // latin1 对应的解码用ISO-8859-1或者Cp1252,我测试的时候发现应该用cp1252

}

// 查看mysql支持的字符集及其描述

mysql> SHOW CHARACTER SET;

+----------+-----------------------------+---------------------+--------+

| Charset | Description | Default collation | Maxlen |

+----------+-----------------------------+---------------------+--------+

| big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |

| dec8 | DEC West European | dec8_swedish_ci | 1 |

| cp850 | DOS West European | cp850_general_ci | 1 |

| hp8 | HP West European | hp8_english_ci | 1 |

| koi8r | KOI8-R Relcom Russian | koi8r_general_ci | 1 |

| latin1 | cp1252 West European | latin1_swedish_ci | 1 |

| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |

| swe7 | 7bit Swedish | swe7_swedish_ci | 1 |

| ascii | US ASCII | ascii_general_ci | 1 |

| ujis | EUC-JP Japanese | ujis_japanese_ci | 3 |

【参考】

http://stackoverflow.com/questions/21689665/mysql-latin1-to-utf-8-using-java-hibernate-jpa :

提到latin1应该用cp1252解码:

MySQL's version of latin1 is an extended version of CP1252: it uses the 5 bytes that CP1252 leaves undefined. Unfortunately the current Connector/J has a "bug" in that it uses the original CP1252 rather than MySQL's own version. Therefore it's impossible to recover strings whose encoding uses one of these 5 bytes. Patching the Connector/J source to fix the bug could solve the problem, but ideally you should migrate the tables to UTF-8.

提到修改sql语句的解决方法:SELECT CONVERT(CONVERT(CONVERT( column_name USING latin1) USING binary) using utf8) FROM...

测试2,改sql,将编码转换交给mysql来处理

url置为jdbc:mysql://123.*.*.108/db

sql置成SELECT CONVERT(CONVERT(CONVERT(LOWER(consumption_name) USING latin1) USING binary) USING utf8),consumption FROM dimen_table

执行sql前,先执行statement.execute("set names 'utf8'");

在ResultSet中,rs.getString()获得的就是utf-8编码

在windows下显示正常。

在linux下显示有问题。

gm命令 -1019

神器解锁 -1018

购买经验药水 -1017

购买时装 -1016

运营活动(领物-1015

神器精炼(-1014

神器精炼(-1013

商店购买 -1012

(说明:红色部分不能正常解析)

这个是linux环境的编码环境导致的:解决方法

1) 使用locale查看file.encoding这个系统变量,如果不是utf8可以运行这个命令 export LANG=zh_CN.utf8

或者这样LANG=zh_CN.utf8 java -Djava.ext.dir=/sdfls/asdlfjal/test Main执行之后

2) 或者java -Dfile.encoding=utf8 MainClass

测试3:修改测试环境

a. hadoop集群某节点的环境是utf8,java代码也是utf8编码

b. 将上面的latin1数据库的编码改为utf8(修改生效的判定:新建的表默认是utf8.修改之前默认的编码是latin1)

步骤及结果

url置为jdbc:mysql://123.*.*.108/db?CharSet=utf8&useUnicode=true&characterEncoding=utf8

sql还是普通写法:SELECT LOWER(consumption_name),consumption FROM info.dimen_table

在执行上面这个sql之前,查看字符集

System.out.println(SqlTest.getSqlResutl(statement, "show variables like '%char%'"));// 初始连接进去之后查看 字符集

/** 结果5

character_set_client utf8

character_set_connection utf8

character_set_database utf8

character_set_filesystem binary

character_set_results

<span style="color:#FF0000;">character_set_server latin1</span>

character_set_system utf8

character_sets_dir /usr/share/mysql/charsets/

*/


statement.execute("set names 'utf8'"); //修改连接方式

//statement.execute("set character_set_server='utf8'");

System.out.println(SqlTest.getSqlResutl(statement, "show variables like '%char%'"));// 修改之后,查看连接使用的字符集

/** 结果6

character_set_client utf8

character_set_connection utf8

character_set_database utf8

character_set_filesystem binary

character_set_results

<span style="color:#FF0000;">character_set_server utf8</span>

character_set_system utf8

character_sets_dir /usr/share/mysql/charsets/

*/

在ResultSet中,用utf8解释:new String(rs.getString(i).getBytes("UTF-8"));// 右边这个写法完全是乱码 new String(rs.getString(i).getBytes(),"UTF-8")

跟测试2的最终输出结果是一样的。

(对于纯粹的utf8环境:服务器编码是utf8;mysql的默认编码也是utf8,进入mysql之后查看字符集,会看到字符集展示就是结果6,即默认的character_set_server=utf8)

感觉所有问题的根源在于默认的character_set_server=latin1

http://stackoverflow.com/questions/27866533/whacky-latin1-to-utf8-conversion-in-jdbc :提到jdbc对不识别的latin1的编码字符插入了特殊的替换字符

JDBC seems to insert a utf8 replacement character when asked to read from a latin1 column containing undefined latin1 codepage characters

【参考】使用Java读写存储在latin1编码的MySQL中的UTF-8编码的中文

Character set bug at server with utf8 column and latin1 connection

Description: // This bug is reproduced using a MySQL Linux default installation where "character_set_server" is "latin1"‘

其他相关资料及描述

貌似最好把所有字符集设置成utf8的方法,解决起来最彻底;并且这个问题只出现在jdbc中,使用Python操作的时候,读取都是正常的

conn10 = MySQLdb.connect(host=db10, user=mdUser, passwd=mdPasswd, db="sdc_hdfs" )

cursor10 = conn10.cursor()

cursor10.execute("set @@autocommit=1")

cursor10.execute("SHOW VARIABLES LIKE 'character_set_database'")

data=cursor10.fetchone()

if data[1]=='utf8':

cursor10.close()

cursor10 = MySQLdb.connect(db10, mdUser, mdPasswd, 'sdc_hdfs', charset='utf8', use_unicode=False).cursor() #使用utf8来连接

cursor10.execute("set @@autocommit=1")

Java中String解码、编码

Strings: although Java uses Unicode all the time under the hood, when you convert between String and byte[] using String#getBytes() or String(byte[]), you should rather use the overloaded method/constructor which takes the character encoding:

byte[] bytesInDefaultEncoding = someString.getBytes(); // May generate corrupt bytes.

byte[] bytesInUTF8 = someString.getBytes("UTF-8"); // Correct.

String stringUsingDefaultEncoding = new String(bytesInUTF8); // Unknown bytes becomes "?".

String stringUsingUTF8 = new String(bytesInUTF8, "UTF-8"); // Correct.

Otherwise the platform default encoding will be used, which can be the one of the underlying operating system or the IDE(!).


Unicode - How to get the characters right?

latin1转gbk的乱码问题,jdbc的bug

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

最后发现:

1. 不同jdbc对编码的支持情况不一样

2. 测试2,其实能解决问题,也就是set names 'utf8' + 在SQL里边使用convert()来转码是万能的: 能绕开不同jdbc带来的问题。



到MySQL命令行输入“SET NAMES UTF8;”,然后执行“show variebles like“character_set_%”;”,发现原来为latin1的那些变量“character_set_client”、“character_set_connection”、“character_set_results”的值全部变为utf8了,原来是这3个变量在捣蛋。

查阅手册,上面那句等于:

SET character_set_client = utf8;

SET character_set_results = utf8;

SET character_set_connection = utf8;

看看这3个变量的作用:

信息输入路径:client→connection→server;

信息输出路径:server→connection→results。

换句话说,每个路径要经过3次改变字符集编码。以出现乱码的输出为例,server里utf8的数据,传入connection转为latin1,传入results转为latin1,utf-8页面又把results转过来。如果两种字符集不兼容,比如latin1和utf8,转化过程就为不可逆的,破坏性的。

但这里要声明一点,“SET NAMES UTF8”作用只是临时的,MySQL重启后就恢复默认了。

发布时间: 8年前数据库153人已围观返回回到顶端

很赞哦! (1)

文章评论

  • 请先说点什么
    热门评论
    152人参与,0条评论

站点信息

  • 建站时间:2016-04-01
  • 文章统计:728条
  • 文章评论:82条
  • QQ群二维码:扫描二维码,互相交流