Fixing MySQL Encoding Issues with Ruby1.8
Having a recent and urgent need to convert a large (9 GB) MySQL database from latin1 to utf8, I went looking for a straight-forward solution. With over a half-million Google results for MySQL UTF-8 conversion, I was left with more questions than answers. In-place conversion? Dump, convert, and reload? Convert, then fix encoding errors? Fix encoding errors, then convert? Do both at the same time? Perl? Shell? Python? iconv?
(I’ll save my rant on MySQL documentation and default settings for a later post.)
Data encoding conversion is one of those types of tasks that doesn’t happen often, but can easily overshoot time-to-complete estimates by a considerable amount. It’s confusing, much of the documentation out there doesn’t help, and not everyone wants to become an expert just to get the data converted correctly. Conversion is totally necessary, though, and getting it done correctly shouldn’t be such a mystery. I’m not sure if this post helps or adds more confusion to the process, but I imagine there are still hundreds or thousands of large and useful MySQL databases out there that need to go through similar steps to what I’ve laid out here.
So, what follows is my attempt to explain a no-nonsense solution – after many dissimilar failed attempts – of using just Ruby 1.8.7 and a couple of gems to convert a poorly maintained latin1 MySQL database into utf8. (If I don’t cover it, assume I tried it and it didn’t work or took entirely too long for proper exploration.)
Legacy MySQL Database: Large, In Charge, and Brok’n
The database in question is a bit larger than the machine’s available RAM and it is “legacy” in the sense that concepts such as encoding and character sets were never considerations in the original design. It is also the company’s main business database with an everyone-will-be-calling level of consequences conversion failure.
Not only did the database need to be converted from the latin1 (ISO-8859-1) character set to utf8, there were four different classes of non-ASCII characters that prevented an easy conversion process:
- UTF-8 incorrectly encoded as ISO-8859-1 (many)
- Properly encoded ISO-8859-1 (few, unfortunately)
- UTF-8 correctly encoded in HTML entities (few)
- ISO-8859-1 incorrectly encoded as UTF-8 encoded as ISO-8859-1 (rare, but WTF‽)
So, shoving this kind of data through iconv doesn’t work as expected. iconv will choke, munge, or ignore anything that isn’t technically correct. Thus, a straight iconv conversion was out of the question.
Enter Ruby
I chose the “dump, fix and convert, reload” conversion method rather than the “fix the running database, table at a time” method. This way, I was able to test on sample data, then on snapshots of single tables, diff the results with the original snapshots, and update the conversion script accordingly. Once I settled on this method, it didn’t take long to get the proper results and prepare for converting the entire database.
The Ruby script below is my final conversion script. A database dump (taken with mysqldump) is piped in. Fresh new UTF-8 will be the output, preferably redirected to the filesystem. UNIX and Ruby at their simplest!
#!/usr/bin/env ruby1.8
require 'iconv'
require 'rubygems'
require 'htmlentities'
# double encoding error fixes
BAD_LATIN_TO_HTML = {
"\xe2\x84\xa2" => "’", # ™ -> ’
"\xc2\x9d" => "“", # ? -> “
"\xc5\x93" => "”", # œ -> ”
"\xe2\x80\x9c" => "–", # “ -> –
"\xc2\xa6" => "…" # ¦ -> …
}
# UTF-8 to HTML entity conversion map
# and "bad string" list
REPLACEMENTS = {
'‘' => '‘',
'’' => '’',
'–' => '–',
'“' => '“',
'”' => '”',
'…' => '…',
'é' => 'é',
' ' => ' '
}
html_coder = HTMLEntities.new
iconv = Iconv.new('latin1', 'UTF-8')
STDIN.each do |line|
# stage 1: double encoding fixes
line.gsub!(/\303\242\342\202\254[\177-\377]+/){ |chars|
chars.gsub!(/\303\242\342\202\254/, '')
if BAD_LATIN_TO_HTML[chars]
chars = BAD_LATIN_TO_HTML[chars]
else
STDERR.puts "No translation for #{chars}"
end
chars
}
# stage 2: proper UTF-8 -> HTML entities conversion
# and "bad string" replacement(s)
REPLACEMENTS.each do |bad, good|
line.gsub!(bad, good)
end
# stage 3: reverse iconv conversion: UTF-8 -> Latin1
# (unintuitive, but correct!)
line = iconv.iconv(line)
# stage 4: HTML entity conversion
line = html_coder.decode(line)
# stage 5: database specific conversion
line.gsub!('CHARSET=latin1', 'CHARSET=utf8')
puts line
end
Discussion
Because of the particular distribution of non-ASCII characters in my case (class #1), I elected to reverse-convert the database dump from utf8 to latin1, which is handled in the script by stage 3. For many data encoding conversions, this is the incorrect approach. Do it on a correctly encoded latin1 data and you’ll have even more of a mess. For cases like mine where incorrectly encoded characters are the norm, this approach will yield the correct results most of the time.
Encoding conversion can easily be handled with iconv, either through the command line or extension libraries in your favorite open source language. Before iconv has a chance to work with the data, all incorrect encodings must be dealt with. If you lie to iconv about the input encoding, the incorrectly encoded characters will be ignored, munged, or simply cause iconv to choke on the error. Not good.
(iconv, the command line program, also has this little issue where it puts the entire input stream in memory, which may not be how you want to handle a multi-gigabyte conversion)
With my case, there were two character encoding classes that had to be handled before iconv got ahold of the input data. Classes #2 and # 4 caused iconv to choke or return more garbage. What I needed was a way to keep some of the non-ASCII characters accurate through the iconv conversion process without iconv choking or munging the data. My solution (handled by the script in stages 1 and 2) was to convert double-encoded and correctly encoded characters into HTML entities.
With harmfully encoded characters out of the way in ASCII-land, stage 3 of the script calls iconv to back-convert the class #1 characters into proper UTF-8. Unintuitive, but correct.
The final character encoding issue, class #4, is handled in the script by the htmlentities gem in stage 4. Convert HTML to UTF-8 and we’re done with the four classes of encoding issues.
Stage 5 in the script handles the CHARSET= statements so that latin1 is replaced with utf8.
Process and Additional Tasks
Using this script, or one based on it, goes something like this: First, grab an affected table in the database:
$ mysqldump --skip-extended-insert -p \
$DATABASE $TABLE \
> $TABLE.latin1.mysql
Second, run the conversion script on the table snapshot:
$ ./utf8fix.rb < $TABLE.latin1.mysql \
> $TABLE.utf8.mysql
Third, diff the results and look for issues:
$ vimdiff $TABLE.*.mysql
Fourth, continue doing the first three steps until the errors stop and all of the output character look correct.
Fifth, do the first four steps on each table in the database that is likely to be affected by the conversion.
Sixth, take a full database snapshot, convert it, and load it into MySQL under a similar database name.
$ mysqldump -p $DATABASE \
> $DATABASE.latin1.mysql
$ ./utf8fix.rb < $DATABASE.latin1.mysql \
> $DATABASE.utf8.mysql
$ sed -e "s/`$DATABASE`/`$DATABASE_utf8`/g" \
$DATABASE.utf8.mysql \
| mysql -p
Seventh, test the new database with your existing development code (modified, hopefully!) to make sure these types of encoding issues won’t be repeated.
Eighth, set aside the deployment time to convert the existing database as done above, deploy the encoding-aware code, and set your production environment (MySQL, PHP, Perl, Rails, etc.) to use UTF-8 from now on, by default.
For MySQL, I added the following lines, under their respective sections, in /etc/mysql/my.cnf:
[client]
default-character-set = utf8
[mysqld]
default-character-set = utf8
default-collation = utf8_general_ci
For PHP5, this line is needed in /etc/php5/apache2/php.ini
default_charset = "utf-8"
For Ruby (ActiveRecord, really), the database configuration hash needs an extra option:
:encoding => 'utf8'
Finally, the converted database itself was still set to latin1, despite all tables, all columns, and the server itself using utf8. Here’s the fix:
mysql> ALTER DATABASE $DATABASE
CHARACTER SET = 'utf8';
Conclusion
While I didn’t cover the steps needed to modify code for encoding conversion, I did lay out an honest and thorough approach to converting data in a MySQL database, which seems to give developers a bit more trouble.
Part of the confusion, I think, comes from the lack of a de facto approach for MySQL data encoding conversion. Another source of confusion is MySQL’s continued choice of using latin1 as the character set default. The biggest issue, though, is that MySQL was the FLOSS database server of choice for an entire generation of webapp programmers who knew less than nothing about encoding issues. (“Encoding?”) Many of them have moved on, while their creations are still silently collecting UTF-8 data and storing it incorrectly, waiting for the unlucky developer who recognizes the problem.