subscribe

Internationalized domain names, are you ready?

Since may 11 TLD's (top-level domainnames) have been added. In order for this to work successfully, a lot of applications will have to be fixed.

Many email-validation scripts might use an approach like this:


$ok = preg_match('/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$/i', $email);

This one is pretty simple, it matches the most common address formats, as long as the tld (.com, nl, .uk, etc) is under 6 characters. For a bit more sophistication you might want to ensure that the tld is a bit more valid:


$ok = preg_match('/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)$/i',$email);

Note: both these regexes were taken from regular-expression.info. The top google hit, and decent examples.

The new TLD's use non-ascii characters, and they might become aliases for existing top-level domains, or new tld's altogether. Here are the currently working examples:

At first sight these look like regular utf-8, characters, but if you look at the sourcecode of this page, you'll notice that it's actually encoded differently.

The korean url http://실례.테스트, is actually encoded as http://xn--9n2bp8q.xn--9t4b11yi5a/. This is called Punycode.

If you want support for these new urls (and thus domainnames in emails), you should have support for punycode. You will likely receive UTF-8 encoded domainnames for email address (example@실례.테스트), but internally you must make sure that you only deal with the punycode representation.

This translating is also what modern browsers do. If you were to paste "http://xn--9n2bp8q.xn--9t4b11yi5a/" directly in the firefox address bar, it will show you the UTF-8 characters instead. Firefox will re-encode to punycode though and use that format for HTTP requests.

The best way really to check for valid email addresses is to use a very liberal regex, but verify with a simple MX record lookup if a mailserver exists for the given domain. This example is an expansion on the first regex.


$email = 'example@xn--9n2bp8q.xn--9t4b11yi5a';

if(preg_match('/^[A-Z0-9._%+-]+@([A-Z0-9.-]+\.[A-Z0-9-]{2,})$/i', $email,$matches)) {
    $hostname = $matches[1];
        if (!getmxrr($hostname, $hosts)) {
            echo "Host has an MX record\n";
        } else {
            echo "Host does not exist or does not have an MX record\n";
        }
} else {
    echo "Email address did not match regular expression\n";
}

The preceeding code does not convert UTF-8 to punycode though. There's not yet an easy native way in PHP to do this, but Pear's Net_IDNA2 provides a way. The implementation seems very complex though, and leaves me wondering if there's an easier way to go about it.

Web mentions