MySQL’s UTF8 isn’t *really* UTF8 (and how to properly support Unicode)

I was forced to pull out my hair recently, after several SQL queries with UTF8 uni-coded text being fired from my project of the day, were failing to insert into my UTF8 MySQL database. Working on a project with a lot of middle eastern and asian content was simply crashing out my SQL insert queries and I was loosing a tremendous quantity of data.

Turns out MySQL’s UTF8 character set only partially implements proper UTF-8 encoding. It can only store UTF-8-encoded symbols that consist of one to three bytes; encoded symbols that take up four bytes aren’t supported. Luckily, MySQL 5.5.3 introduced a new encoding called utf8mb4 which maps to proper UTF-8 and thus fully supports Unicode, including astral symbols.

Switching from MySQL’s utf8 to utf8mb4

First, we need to change the character set and collation properties of the database, tables, and columns, to use utf8mb4, instead of utf8.

# For each database:
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE utf8mb4_unicode_ci;
# For each table:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# For each column:
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Note: Don’t simply copy-paste this! You will need to tailor the queries to meet your specific requirements, the above lines are for illustrative purposes only.

Repair and optimize all tables:

Do this with some simple SQL Queries:

# For each table
REPAIR TABLE table_name;
OPTIMIZE TABLE table_name;

Or using the MySQL command-line:

$ mysqlcheck -u root -p --auto-repair --optimize --all-databases

Note: Don’t forget to change the character encoding for your PHP/Ruby code to use the correct character encoding – i.e. utf8mb4 instead of utf8.

Test the changes:

You can test that the changes have taken effect by running this command to see if the collation/charsets have been switched to utf8mb4:

SHOW VARIABLES WHERE Variable_name LIKE ’character\_set\_%’ OR Variable_name LIKE ’collation%’;

Alternatively, change the default character set in MySQL.

In the /etc/my.cnf config file, we set the following instructions:

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
collation-server = utf8mb4_unicode_ci
init-connect='SET NAMES utf8mb4'
character-set-server = utf8mb4

That’s it. Obviously you will need to restart my SQL first though, and it will only apply to databases made after the default change, but I highly recommend this change. Even if you think you don’t need the full range of UTF8 characters, it only takes one to mess things up!

Conclusion

Never use utf8 in MySQL — always use utf8mb4 instead. Arbitrarily limiting the set of symbols you can use is just silly and is bound to cause a problem if your user is a young Japanese girl putting lots of cats in your strings. Why would you risk loosing lose data every time an Arabian user writes a message in their native locale. There’s no logical reason for it, and a few minutes of pain now, might save you a lot of heartache later. Do the right thing, and use utf8mb4 instead.

Leave a Reply

Your email address will not be published. Required fields are marked *