basename() is locale-aware

For years I assumed that:

$baseName = basename('dir/file');

was just a convenient shorthand for:

$file = 'dir/file';
$baseName = substr($file, strrpos($file, '/') + 1);

It turns out basename() does a bit more than split the string at the last slash: it is locale-aware. In my case I was dealing with a multi-byte UTF-8 string. It took me quite some time to figure out what was going on, because I was testing from the console, which used the en_US.UTF-8 locale, while the bug only appeared under Apache, which defaults to the C locale.
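A difference like this between the console and Apache can be confirmed by printing the active locale in each environment. The snippet below is only a minimal sketch: it assumes that LC_CTYPE (the category that governs multi-byte character handling in C) is the relevant one, and it relies on the documented behaviour that passing "0" to setlocale() returns the current setting without changing it.

```php
<?php

// Passing "0" asks setlocale() to report the current setting
// without modifying it.
$ctype = setlocale(LC_CTYPE, '0');

// PHP_SAPI tells you which environment you are in (e.g. "cli").
echo PHP_SAPI . ': LC_CTYPE=' . $ctype . "\n";
```

Running it once from the command line and once through the web server should show whether the two environments really disagree.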

Here is an example that shows the difference:

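<?php

// "àfoó" as UTF-8 bytes
$str = urldecode('%C3%A0fo%C3%B3');

// C locale: no multi-byte character support
setlocale(LC_ALL, 'C');
echo urlencode(basename($str)) . "\n";

// UTF-8 locale (must be installed on the system)
setlocale(LC_ALL, 'en_US.UTF-8');
echo urlencode(basename($str)) . "\n";

?>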

Output:

fo%C3%B3
%C3%A0fo%C3%B3

What bugs me about this is that I had no way of knowing that basename() operates on anything other than bytes. As of this writing, the PHP manual doesn't point this out either. It makes me wonder how many other string functions change behaviour based on the locale.
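If the byte-wise behaviour I had assumed is what you actually want, one option is to skip basename() entirely. The following is a rough sketch, not a drop-in replacement: byte_basename() is a hypothetical helper name, it only treats '/' as a separator, and it ignores basename()'s optional suffix argument. Splitting on the '/' byte is safe for UTF-8, because that byte never occurs inside a multi-byte sequence; I make no such claim for other encodings.

```php
<?php

/**
 * Hypothetical helper: a byte-wise basename that ignores the locale.
 * Only '/' is treated as a separator; no suffix handling.
 */
function byte_basename(string $path): string
{
    // Drop trailing slashes so 'dir/file/' yields 'file'.
    $path = rtrim($path, '/');

    $pos = strrpos($path, '/');

    // strrpos() returns false when there is no slash at all;
    // in that case the whole string is the base name.
    return $pos === false ? $path : substr($path, $pos + 1);
}

$str = urldecode('%C3%A0fo%C3%B3');

// Even in the C locale the leading multi-byte character survives.
setlocale(LC_ALL, 'C');
echo urlencode(byte_basename($str)) . "\n";          // %C3%A0fo%C3%B3
echo urlencode(byte_basename('dir/' . $str)) . "\n"; // %C3%A0fo%C3%B3
```

The explicit check for false matters: the substr()/strrpos() one-liner from the top of this post silently drops the first character when the string contains no slash.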