• PRODUCT

    PRODUCT

  • PRICING
    PRICING

  • HELP
    HELP

  • BLOG
    BLOG

  • APPSTORE
    APPSTORE

  • COMPANY
    COMPANY

  • LEGAL
    LEGAL

  • LOGIN
    LOGIN

  • String Encodings and Mysql


  • String Encodings and Mysql


  • There are some subtle tricks that every good engineer should be aware of when importing data into mysql using a .sql file.

    mismatched encodings
    we were trying to import large customer log .sql files (>1GB) into our sql database and hit this error:

    mysql -u user -p mydatabase < db-backup.sql

    ERROR 1366 (HY000) at line 2119643: Incorrect string value: '\\xE2\\x80\\x8E' for column 'xx' at row 1

    Looking at the .sql file with vim:

    INSERT INTO extract VALUES(...., "Tokyo<200e>", ...);

    Next step was to look at the binary values (we used hexdump | grep to find the location of the line first):

    hexdump -C -s 0x0238e430 -n 5 db-backup.sql

    0238e479 e2 80 8e 22 2c |...",|

    The values seem like things that would be OK for a utf_8 string, but not so for latin1, ascii. Lets try to verify that with Python:

    >>> b='\\xe2\\x80\\x8e\\x22\\x2c ';print b.decode("utf_8")

    ϠφϠášÏ á¹",

    >>> b='\\xe2\\x80\\x8e\\x22\\x2c ';print b.decode("latin1")

    Ï áœÏ¡Ð›Ï á™Ï ášÏ á™Ï á¹",

    >>> b='\\xe2\\x80\\x8e\\x22\\x2c ';print b.decode("ascii")

    Traceback (most recent call last):

    File "", line 1, in

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

    Interestingly enough, the column that the insert statement was targetting was a varchar with CHARSET=latin1; So it seems like mysql's decoder is 'slightly' different that pythons. The other surprise was that we tried using UNHEX('...') and mysql was OK with it, even though we were trying to push illegal byte values into a latin1 datatype (the only difference was how the input insert statement was written). So the workaround was to change the CHARSET to utf8.

    Hack

    If you don't care about data integrity, another trick is to manipulate the client char set and have the bad chars filtered by the mysql interpreter (not a good method, but it was fun). This trick is also useful when you dont want to or can not change the contents of the sql file (i.e. large file on a readonly backup volume that you can not easily edit.)

    Our default charactersetclient was utf8, so we changed it:

    mysql --default-character-set=latin1 -u user -p mydatabase < db-backup.sql

    Or from console:

    set charset ascii;

    SHOW VARIABLES LIKE 'character_set%';

    charactersetclient ascii

    Importing the .sql file will be successful but you will get warnings:

    1 row(s) affected, 1 warning(s): 1300 Invalid ascii character string: ' ...

    And you will get the familiar '???' characters as replacement for things that mysql is not able to convert to latin1.


  • Izyware Blog
    Izyware Blog