Regular expressions for handling UTF-8 strings in PHP
Jun 07
When developing multilingual sites, UTF-8 is the most convenient and preferable encoding for HTML pages. It supports all or almost all existing languages, and it encodes ASCII characters (the Latin alphabet, digits and special characters) in one byte and the characters of national alphabets in several bytes. As a result, a character in UTF-8 has a variable physical length, and this sometimes causes problems when programming multilingual sites.
For example, the PHP functions strlen and substr return incorrect results if a string contains characters of a national alphabet, because they are designed for single-byte encodings. Of course, PHP has functions such as mb_strlen and mb_substr, intended specifically for multibyte strings. But support for Multibyte String Functions is turned off in PHP by default, which automatically limits the choice of hosting for the site being developed. In addition, the set of supported languages is specified when the mbstring module is enabled, so the language you need may turn out not to be in the supported list.
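The mismatch between bytes and characters can be seen in a minimal sketch. The sample strings are illustrative assumptions, and the file is assumed to be saved as UTF-8:

```php
<?php
// Illustrative strings only; any non-ASCII text shows the same effect.
$ascii    = 'Hello';
$cyrillic = 'Привет';   // 6 characters, each stored as 2 bytes in UTF-8

// strlen() counts bytes, not characters.
$asciiBytes    = strlen($ascii);     // 5
$cyrillicBytes = strlen($cyrillic);  // 12

// substr() cuts at byte offsets, so it can split a character in half:
// the result is the two bytes of 'П' plus only the first byte of 'р'.
$broken = substr($cyrillic, 0, 3);

echo $asciiBytes, ' ', $cyrillicBytes, ' ', bin2hex($broken), "\n";
// prints "5 12 d09fd1"
```

The trailing `d1` byte is an incomplete character, which is why such output tends to show up as a broken symbol in the browser.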
However, there is another, more convenient and flexible solution. The PCRE functions handle UTF-8 correctly when the /u modifier is used, so they can be used to write the functions utf8_strlen and utf8_substr:
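```php
function utf8_strlen($s)
{
    // /u makes "." match one UTF-8 character; /s lets it match "\n" too,
    // so newlines are counted as well.
    return preg_match_all('/./us', $s, $tmp);
}

function utf8_substr($s, $offset, $len = 'all')
{
    $total = utf8_strlen($s);
    $offset = (int) $offset;
    if ($offset < 0) {
        // A negative offset counts from the end; clamp it to the start.
        $offset = max(0, $total + $offset);
    }
    // Strict comparison: with "!=", a length of 0 equals 'all' before PHP 8.
    if ($len !== 'all') {
        $len = (int) $len;
        if ($len < 0) {
            $len = $total - $offset + $len;
        }
        // Keep the length within 0..(characters remaining after $offset).
        $len = max(0, min($len, $total - $offset));
        preg_match('/^.{' . $offset . '}(.{0,' . $len . '})/us', $s, $tmp);
    } else {
        preg_match('/^.{' . $offset . '}(.*)/us', $s, $tmp);
    }
    // PCRE caps counted repeats such as .{N} at 65535, so very long offsets fail.
    return isset($tmp[1]) ? $tmp[1] : false;
}
```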
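Both functions depend on how the pattern modifiers change what the dot matches. The following is a small sketch of that behaviour, not part of the functions themselves. The sample string is an assumption, written with explicit byte escapes so that it does not depend on the encoding of the source file:

```php
<?php
// "Zoë" followed by a newline; 'ë' is the two bytes C3 AB in UTF-8,
// so the whole string is 5 bytes long.
$s = "Zo\xC3\xAB\n";

// Without /u the dot matches single bytes.
$byteCount = preg_match_all('/./s', $s, $m);        // 5

// With /u the dot matches whole UTF-8 characters;
// /s lets it match the newline as well.
$charCount = preg_match_all('/./us', $s, $m);       // 4

// With /u but without /s the newline is skipped.
$withoutNewline = preg_match_all('/./u', $s, $m);   // 3

// Splitting on the empty pattern with /u yields one element per character.
$chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);

echo $byteCount, ' ', $charCount, ' ', $withoutNewline, ' ', count($chars), "\n";
// prints "5 4 3 4"
```

This is why a character count intended to include line breaks needs both modifiers.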
Continuing the theme of working with UTF-8 strings, here are a few more functions that work without the Multibyte String Functions extension installed in PHP, namely utf8_strpos and utf8_substr_count:
```php
function utf8_strpos($haystack, $needle, $offset = 0)
{
    // Skip the first $offset characters (a negative offset is treated as 0).
    $offset = max(0, (int) $offset);
    if ($offset > 0) {
        preg_match('/^.{' . $offset . '}(.*)/us', $haystack, $dummy);
        $haystack = isset($dummy[1]) ? $dummy[1] : '';
    }
    if ($haystack === '') {
        return false;
    }

    // Byte position of the needle in the remaining string.
    $p = strpos($haystack, $needle);
    if ($p === false) {
        return false;
    }

    // Convert the byte position to a character position by stepping from
    // lead byte to lead byte (assumes $haystack is valid UTF-8).
    $r = $offset;
    $i = 0;
    while ($i < $p) {
        $byte = ord($haystack[$i]);
        if ($byte < 0x80) {
            $i += 1;    // ASCII character, one byte
        } elseif ($byte < 0xE0) {
            $i += 2;    // lead byte 110xxxxx
        } elseif ($byte < 0xF0) {
            $i += 3;    // lead byte 1110xxxx
        } else {
            $i += 4;    // lead byte 11110xxx
        }
        $r++;
    }
    return $r;
}

function utf8_substr_count($h, $n)
{
    // Escape regex metacharacters and the "/" delimiter in the needle.
    $n = preg_quote($n, '/');
    // preg_match_all returns the number of non-overlapping matches.
    return preg_match_all('/' . $n . '/u', $h, $dummy);
}
```
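The reason utf8_strpos has to convert the result of strpos can be shown in a short sketch. It uses a different, more compact technique than the byte walk above: it counts the characters in the byte prefix before the match. The sample strings are assumptions, and the file is assumed to be saved as UTF-8:

```php
<?php
$haystack = 'Привет, мир';
$needle   = 'мир';

// strpos() reports a byte offset: "Привет" is 12 bytes, ", " adds 2 more.
$bytePos = strpos($haystack, $needle);   // 14

// In valid UTF-8 a match always starts on a character boundary, so the
// prefix before it is itself valid UTF-8 and its characters can be counted.
$prefix  = substr($haystack, 0, $bytePos);
$charPos = preg_match_all('/./us', $prefix, $m);   // 8

echo $bytePos, ' ', $charPos, "\n";
// prints "14 8"
```

The byte walk avoids running a second regular expression over the prefix. The prefix count is shorter to write and may be easier to check by eye.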