basename() is locale-aware

For years I assumed that:

$baseName = basename('dir/file');

was just a shorthand for:

$file = 'dir/file';
$baseName = substr($file,strrpos($file,'/')+1);

It turns out basename() does a bit more than just splitting the string at the last slash, because it's locale-aware. In my case I was dealing with a multi-byte UTF-8 string. It took me quite some time to figure out what was going on, because I was testing from the console, which used the en_US.UTF-8 locale, while the bug only appeared under Apache, which defaults to the C locale.
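A quick way to spot this kind of mismatch is to print the active locale from both environments. This is only a diagnostic sketch: run it once from the console and once through the web server, then compare. Passing '0' as the locale asks setlocale() for the current setting without changing it.

```php
<?php

// Diagnostic sketch: show which SAPI is running and which LC_CTYPE is active.
// setlocale() with '0' only reports the current value, it changes nothing.
echo 'SAPI:     ', PHP_SAPI, "\n";
echo 'LC_CTYPE: ', setlocale(LC_CTYPE, '0'), "\n";
```

If the two runs report different values, any locale-aware function can behave differently between them.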

Example:

<?php

$str = urldecode('%C3%A0fo%C3%B3');

setlocale(LC_ALL,'C');
echo urlencode(basename($str)) . "\n";

setlocale(LC_ALL,'en_US.UTF-8');
echo urlencode(basename($str)) . "\n";

?>

Output:

fo%C3%B3
%C3%A0fo%C3%B3
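One possible workaround is to set LC_CTYPE explicitly around the call, so the result no longer depends on what the environment happened to start with. The sketch below is an assumption-laden illustration: utf8_basename() is a hypothetical helper, and the locale names 'en_US.UTF-8' and 'C.UTF-8' are guesses that may not be installed on a given system, in which case setlocale() returns false and nothing changes.

```php
<?php

// Hypothetical wrapper: run basename() under a UTF-8 LC_CTYPE, then restore
// whatever locale was active before the call.
function utf8_basename(string $path): string
{
    // '0' returns the current setting without changing it.
    $previous = setlocale(LC_CTYPE, '0');

    // Assumed locale names; setlocale() tries each in turn and returns
    // false if none of them is available.
    $switched = setlocale(LC_CTYPE, 'en_US.UTF-8', 'C.UTF-8');

    $base = basename($path);

    // Only restore if we actually switched and know what to restore to.
    if ($switched !== false && $previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }

    return $base;
}

echo urlencode(utf8_basename(urldecode('dir/%C3%A0fo%C3%B3'))), "\n";
```

Keep in mind that the locale is per-process state, so on a threaded server changing it can affect other requests running in the same process.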

What bugs me about this is that there was no way for me to know that basename() operates on anything other than bytes. The PHP manual doesn't point this out either. It makes me wonder how many other string functions change behaviour based on the locale.
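If all you want is the byte-oriented behaviour, it is easy enough to write by hand. Here is a minimal sketch, assuming '/' is the only directory separator; byte_basename() is a hypothetical name, not something PHP provides. Because rtrim(), strrpos() and substr() work on raw bytes, the locale has no influence on the result.

```php
<?php

// Hypothetical helper, not part of PHP: returns the trailing path component
// using byte-level string functions only, so the locale is irrelevant.
// Assumes '/' is the only directory separator.
function byte_basename(string $path): string
{
    // Drop trailing slashes so 'dir/file/' yields 'file'.
    $path = rtrim($path, '/');

    $pos = strrpos($path, '/');

    // No slash at all: the whole string is the base name.
    return $pos === false ? $path : substr($path, $pos + 1);
}

$str = urldecode('%C3%A0fo%C3%B3');

setlocale(LC_ALL, 'C');
echo urlencode(byte_basename($str)), "\n";          // %C3%A0fo%C3%B3
echo urlencode(byte_basename('dir/' . $str)), "\n"; // %C3%A0fo%C3%B3
```

Unlike the substr()/strrpos() one-liner above, this version also handles a path without any slash, where strrpos() returns false.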

Comments

  • Sean Coates

    Hopefully it bugs you enough to help fix the manual: http://php.net/dochowto (-: S
  • Andy Thompson

    basename() is also platform-aware, which is another reason to use it.
  • Tobias Nyholm

    Thank you for this post. I needed to write a test for this bug. Btw, here is the solution for your problem:

    pathinfo($str, PATHINFO_BASENAME)