Dummy’s guide for converting character set of a web application

When developing a web application, one has to be careful about the “character encoding”. There are many options such as latin1, utf8, etc, and the best character encoding depends on the language that your targeted audience use for writing. Since most  web applications are required to handle all types of languages and, thus, are required to handle all different types of characters such as í, é, á. The best way to handle all these various characters is to use UTF8 encoding.

Now the question is where to set UTF8 character encoding. And the short answer is “everywhere”:
          1) at the database level,
          2) at the interface/connection between database and client level (PHP), and
          3) at the data display level (HTML).

Below are the steps that I took to convert Memento from latin1 to UTF8 character encoding

(Note: Broadly these steps will remain same for converting any web application from one character encoding to another. But here I will assume that the application is being changed from latin1 to UTF8 encoding)

 1. Changing Database to UTF8: First step is to set database to UTF8 character set. Although MySQL has SQL commands for converting character set, based on my trials, I found that these commands are not full proof. What I ended up is adopting a script from maisonbisson to convert my database from latin1 to utf8. This script creates empty tables by copying the structure of the original tables and set their character set to UTF8. It finally, copies all the data from the original table to the new table, making necessary adjustments to char and text fields.

<?php
$username = “username”;
$password = “change_to_your_password”;
$hostname = “localhost”; 
$dbh = mysql_connect($hostname, $username, $password) or die(“Unable to connect to MySQL”);
@mysql_select_db(“your_database_name”,$dbh) or die(“Could not select first_test”);
$tables=mysql_query(‘SHOW TABLES;’);
while($row = mysql_fetch_row($tables))
{
 $table = $row[0];
 //$changes = array();
 if(mysql_query(“CREATE TABLE conv_{$table} LIKE {$table};”) === false){
  echo “Failed creating conv_{$table}:”, mysql_error(), “\n”;
  continue;
 }
 if(mysql_query(“ALTER TABLE conv_$table DEFAULT CHARACTER SET=utf8;”) === false){
  echo “Failed Altering conv_{$table}:”, mysql_error(), “\n”;
  continue;
 }
 
 // copy the data into the new table, transcoding to UTF8
 $fields = mysql_query(“DESCRIBE {$table};”);
 $allfields = $selectfields = array();
 while($field = mysql_fetch_array($fields)){
  $allfields[] = $field[‘Field’];
  $selectfields[] = (stripos($field[‘Type’], ‘char’) !== false || stripos($field[‘Type’], ‘text’) !== false) ? “CONVERT({$field[‘Field’]} USING utf8)” : “{$field[‘Field’]}”;
 }
 
 if(mysql_query(“INSERT INTO conv_{$table} (“. implode($allfields, ‘, ‘) .”) SELECT “. implode($selectfields, ‘, ‘) .” FROM {$table};”) === false){
  echo “Failed Insert into conv_{$table}:”, mysql_error(), “\n”;
  continue; 
 }
 
 if(mysql_query(“RENAME TABLE {$table} TO preutf8_{$table};”) === false){
  echo “Failed rename {$table} to preutf8_{$table}:”, mysql_error(), “\n”;
  continue;
 }
 
 if(mysql_query(“RENAME TABLE conv_{$table} TO {$table};”) === false){
  echo “Failed rename conv_{$table} to {$table}:”, mysql_error(), “\n”;
  continue; 
 }
 //echo implode($changes, “;\n”); 
}
 

mysql_close($dbh);

?>

Step 2: Specifying Connection between MySQL and PHP to be UTF8: The second step is specifying UTF8 as the default character encoding for transferring data between database (MySQL) and server side code (PHP). To do so, use “SET Names” command (MySQL). According to MySQL Reference, SET NAMES indicates what character set the client will use to send SQL statements to the server (exactly what we wanted). Run this command each time your application establishes a new connection with database.

If you are using CakePHP, it can automatically set this for you. All you need to do is set “encoding” variable in your database configuration file. Thanks to Dev weblog for posting this tip. Below is a sample database configuration file that I have used in my CakePHP application.

var $default = array(
              ‘driver’ => ‘mysql’,
              ‘persistent’ => false,
              ‘host’ => ‘localhost’,
              ‘login’ => ‘username’,
              ‘password’ => ‘my_password’,
              ‘database’ => ‘mydatabase’,
              ‘prefix’ => ”,
              ‘encoding’ => ‘utf8′
 );

Step 3: Setting Ouput (HTML) to be in UTF8: Final step is to specify the output to be in UTF8 character set. This is important otherwise browsers will have a hard time guessing the correct encoding. And the bad part is they are really bad in guessing the character set. As, in most cases the output is HTML and you can easily set character encoding of HTML by using the meta tag shown below:

                       <META http-equiv=”Content-Type” Content=”text/html; charset=utf-8″>

Just add this line in the “head” section of the html file. If you are using CakePHP, you can add this line in your layout.

That’s all.
Note: Depending how dirty is your data, you might need some more intermediary steps at the database level. In my case, just using the above script worked.

References:

Introduction to character code issues and related general readings
1. A tutorial on Character Code Issues: http://www.cs.tut.fi/~jkorpela/chars.html
2. Multilingual Forms – http://www.w3.org/International/questions/qa-forms-utf-8.en.php
3. Wikipedia on Character encoding – http://en.wikipedia.org/wiki/Character_encoding

Blogs on changing character code of database
1. http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_latin1_t.html -
2. Script for converting character set: http://maisonbisson.com/blog/post/11972/freaking-mysql-character-set-encodings
3. Quick PHP Script to figure out bad UTF8 characters – http://mysqldba.blogspot.com/2006/08/quick-php-script-to-figure-out-bad.html

CakePHP Specific
1. CakePHP and Character Set in the database – http://nik.chankov.net/2007/10/01/cakephp-and-character-set-in-the-database/
2. UTF8 Multibyte Behaviour: http://bakery.cakephp.org/articles/view/utf8-multibyte-behavior

About these ads

About Ritesh Agrawal

I am a applied researcher who enjoys anything related to statistics, large data analysis, data mining, machine learning and data visualization.
This entry was posted in CakePHP, Database, MySQL, PHP, Web. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s