Mastering UTF-8 encoding in PHP

Encoding issues can appear in several places between the backend, the database and the frontend. I’m going to walk through each of those places and show how to implement clean encoding throughout your project.

What is a multibyte character?

A single byte, in its primitive form, is a number between 0 and 255, for a total of 256 different values. If every character is stored as a single byte, we are using ASCII (which only defines the values 0 to 127) or one of the limited single-byte ISO sets. Now imagine a website that stores and displays ALL characters existing in the world or on the internet, including symbols and non-Roman characters, just like Twitter or many other international websites.

We have to expand.

To give all of those characters a place, we need to make room and combine several bytes to represent a single character. This is then a multibyte character. UTF-8 uses between 1 and 4 bytes per character and can encode every Unicode code point from U+0000 to U+10FFFF. That is more than a million possible values, far beyond the 65,536 that two bytes could hold.
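The difference between bytes and characters is easy to see in PHP itself. The following is only a minimal sketch: the helper names utf8ByteCount() and utf8CharCount() are made up for this illustration, and the \u{...} escapes assume PHP 7 or newer.

```php
<?php
declare(strict_types=1);

// strlen() counts bytes, not characters.
function utf8ByteCount(string $text): int
{
    return strlen($text);
}

// The /u modifier makes PCRE treat the subject as UTF-8, so "." matches one
// whole character. mb_strlen($text, 'UTF-8') does the same job when the
// mbstring extension is installed. Invalid UTF-8 yields 0 here.
function utf8CharCount(string $text): int
{
    return (int) preg_match_all('/./su', $text);
}

// Hypothetical sample characters, written as escapes so the result does not
// depend on the encoding of this script file.
$samples = [
    'A'       => 'A',           // U+0041, 1 byte
    'e-acute' => "\u{00E9}",    // U+00E9, 2 bytes
    'euro'    => "\u{20AC}",    // U+20AC, 3 bytes
    'emoji'   => "\u{1F600}",   // U+1F600, 4 bytes
];

foreach ($samples as $label => $char) {
    printf(
        "%-8s %d character, %d byte(s), hex %s\n",
        $label,
        utf8CharCount($char),
        utf8ByteCount($char),
        bin2hex($char)
    );
}
```

Every sample is a single character, but the byte count grows from 1 to 4. This is why strlen() and substr() give surprising results on multibyte text.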

PHP file encoding

Notepad++ lets you easily switch between encodings. In NetBeans, for example, you set a global encoding in the project settings. Some libraries are not able to handle UTF-8 with BOM correctly, so prefer UTF-8 without BOM. Sometimes you save a file as UTF-8 without BOM and another program still displays it as ANSI. This happens when the file contains no multibyte characters: as long as you only use ASCII characters (values 0 to 127), the bytes are identical in ANSI and UTF-8, so don’t worry.
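If you are unsure whether a file carries a BOM, you can check its first three bytes. This is a small sketch under my own assumptions: hasUtf8Bom(), stripUtf8Bom() and isPureAscii() are hypothetical helper names, not functions that ship with PHP.

```php
<?php
declare(strict_types=1);

// The UTF-8 byte order mark is the byte sequence EF BB BF.
const UTF8_BOM = "\xEF\xBB\xBF";

// True when the given bytes start with the UTF-8 BOM.
function hasUtf8Bom(string $bytes): bool
{
    return strncmp($bytes, UTF8_BOM, 3) === 0;
}

// Returns the bytes without a leading BOM; other input is returned unchanged.
function stripUtf8Bom(string $bytes): string
{
    return hasUtf8Bom($bytes) ? substr($bytes, 3) : $bytes;
}

// True when every byte is in the ASCII range 0-127. Such content is
// byte-for-byte identical in ANSI and in UTF-8.
function isPureAscii(string $bytes): bool
{
    return preg_match('/^[\x00-\x7F]*$/D', $bytes) === 1;
}

// Check this script file itself.
$source = (string) file_get_contents(__FILE__);
var_dump(hasUtf8Bom($source));
var_dump(isPureAscii($source));
```

A file for which isPureAscii() returns true is exactly the case where an editor cannot tell ANSI and UTF-8 apart.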

When PHP generates content (you write something to a file or send data through an API), the string literals in your script are emitted as the exact bytes stored in the script file, so the output carries the same encoding as the script itself.

Data from MySQL or any other Database

After setting the UTF-8 encoding for the PHP script, you might still get strange data from the database. This happens when the stored data or the database connection uses a different encoding.
To set the database connection encoding, you can issue a query such as SET NAMES utf8 (or SET CHARACTER SET utf8) before any other.

But be careful: such queries only change the encoding on the server side. The client library does not learn about the change, and that matters for functions like mysqli_real_escape_string(); see http://php.net/manual/en/mysqlinfo.concepts.charset.php.

As the PHP manual explains, the character set should be understood and defined, as it has an effect on every action and has security implications. For example, the escaping mechanism (e.g., mysqli_real_escape_string() for mysqli, mysql_real_escape_string() for mysql, and PDO::quote() for PDO_MySQL) will adhere to this setting. It is important to know that these functions will not use the character set that is defined in the SET NAMES/SET CHARACTER SET queries.
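A way to make the client library aware of the encoding is to set it when connecting instead of through a query. The sketch below assumes PDO with the MySQL driver and the DSN charset option; buildMysqlDsn() and connectUtf8() are hypothetical helper names, and the host, database and credentials are placeholders you would replace with your own.

```php
<?php
declare(strict_types=1);

// Builds a PDO MySQL DSN that carries the connection charset, so the client
// library knows the encoding from the very first query.
function buildMysqlDsn(string $host, string $database, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $database, $charset);
}

// Opens the connection. Not called in this sketch, because it needs a
// running MySQL server and real credentials.
function connectUtf8(string $host, string $database, string $user, string $password): PDO
{
    return new PDO(buildMysqlDsn($host, $database), $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

// With mysqli the counterpart is a method call right after connecting:
//   $mysqli = new mysqli($host, $user, $password, $database);
//   $mysqli->set_charset('utf8mb4');

echo buildMysqlDsn('localhost', 'example_db'), "\n";
```

Because the charset is part of the connection setup, escaping functions such as PDO::quote() and mysqli_real_escape_string() work with the same encoding the server expects.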

The database / table / column collation in MySQL mostly matters for sorting and comparing, but it is suggested to set the default collation to either utf8_general_ci or utf8_unicode_ci (read this very good Stack Overflow discussion if unsure: http://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci).

Saving data in MySQL – best practice

When creating tables, you often encounter the “collation” field. It specifies how the database engine should compare the contents when looking up or sorting data. The collation decides, for instance, whether accented characters are treated as distinct: with an accent-insensitive collation, a search for Hello also returns Héllo. The collation is also used for the index (which is primarily involved in sorting and searching data).

The best practice here is to use the collation utf8mb4_unicode_ci. It is the extended version of utf8_unicode_ci for the utf8mb4 character set, which can store up to 4 bytes per character instead of 3. It will shrink the number of characters that fit into an index, but it’s worth it!

When using utf8mb4_unicode_ci, you have less space for the index because each character can take up more bytes than what you may have used before. To avoid issues, don’t simply max out your VARCHAR lengths; only allow what you really need.
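To get a feeling for the numbers, here is a small calculation. I’m assuming the 767-byte index key limit of older InnoDB row formats; newer MySQL versions and row formats allow more, so check your own setup. maxIndexableChars() is a hypothetical helper for this illustration.

```php
<?php
declare(strict_types=1);

// Assumption: 767 bytes is the index key limit of older InnoDB row formats.
const ASSUMED_INDEX_LIMIT_BYTES = 767;

// How many characters fit into an index when every character may need
// $bytesPerChar bytes in the worst case.
function maxIndexableChars(int $bytesPerChar, int $limitBytes = ASSUMED_INDEX_LIMIT_BYTES): int
{
    return intdiv($limitBytes, $bytesPerChar);
}

printf("utf8    (3 bytes per char): %d characters\n", maxIndexableChars(3)); // 255
printf("utf8mb4 (4 bytes per char): %d characters\n", maxIndexableChars(4)); // 191
```

Under that assumption, a fully indexed VARCHAR(255) column fits with utf8 but not with utf8mb4, where 191 characters is the maximum.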

HTTP (HTML page) encoding

The browser needs to know which encoding you are using, and the declared encoding should match what PHP really sends. This is also important to prevent UTF-7 XSS vulnerabilities.
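From within PHP, the encoding can be declared with the Content-Type header. This is a minimal sketch; contentTypeHeader() is a hypothetical helper, while header() and headers_sent() are PHP built-ins.

```php
<?php
declare(strict_types=1);

// Builds the header line that tells the browser which encoding to expect.
function contentTypeHeader(string $mimeType = 'text/html', string $charset = 'UTF-8'): string
{
    return sprintf('Content-Type: %s; charset=%s', $mimeType, $charset);
}

// Headers can only be sent before any output, so guard the call.
if (!headers_sent()) {
    header(contentTypeHeader());
}
```

Call it at the very top of your script, before any echo or HTML output (a BOM at the start of the file counts as output, too).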

In nginx, simply add

    charset utf-8;

to your server { } block.

In Apache, put this anywhere in your configuration (or in a .htaccess file):

    AddDefaultCharset utf-8

Finally, include the encoding meta tag in your <head>:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

or the HTML5 short version:

    <meta charset="utf-8">
