UTF-8 aware parse_url() replacement.
I've realized that even though UTF-8 characters are not allowed in URL's, I have to work with a lot of them and parse_url() will break.
Based largely on the work of "mallluhuct at gmail dot com", I added parse_url() compatible "named values" which makes the array values a lot easier to work with (instead of just numbers). I also implemented detection of port, username/password and a back-reference to better detect URL's like this: //en.wikipedia.com
... which, although is technically an invalid URL, it's used extensively on sites like wikipedia in the href of anchor tags where it's valid in browsers (one of the types of URL's you have to support when crawling pages). This will be accurately detected as the host name instead of "path" as in all other examples.
I will submit my complete function (instead of just the RegExp) which is an almost "drop-in" replacement for parse_url(). It returns a cleaned up array (or false) with values compatible with parse_url(). I could have told the preg_match() not to store the unused extra values, but it would complicate the RegExp and make it more difficult to read, understand and extend. The key to detecting UTF-8 characters is the use of the "u" parameter in preg_match().
<?php
function parse_utf8_url($url)
{
static $keys = array('scheme'=>0,'user'=>0,'pass'=>0,'host'=>0,'port'=>0,'path'=>0,'query'=>0,'fragment'=>0);
if (is_string($url) && preg_match(
'~^((?P<scheme>[^:/?#]+):(//))?((\\3|//)?(?:(?P<user>[^:]+):(?P<pass>[^@]+)@)?(?P<host>[^/?:#]*))(:(?P<port>\\d+))?' .
'(?P<path>[^?#]*)(\\?(?P<query>[^#]*))?(#(?P<fragment>.*))?~u', $url, $matches))
{
foreach ($matches as $key => $value)
if (!isset($keys[$key]) || empty($value))
unset($matches[$key]);
return $matches;
}
return false;
}
?>
UTF-8 URL's can/should be "normalized" after extraction with this function.