Wednesday, February 1, 2017

mb_strtok implementation in PHP: a string tokenizer for multibyte strings


This is a simple function that implements a kind of mb_strtok() in PHP. As you may be aware, PHP has no mb_strtok() function for multibyte (Unicode) strings, so this is my attempt to solve the problem. There is a known bug: the program halts if the input text is too long (how long, I am not sure yet). Perhaps you can improve it to give a better result.

Thank you and happy coding.


The PHP code, mb_strtok.php:

<html>
<head>
<title>String token for MB_STRING</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<h1>String token for MB_STRING</h1>
<h2><a href="http://kerul.net">kerul.net</a></h2>

<form method="GET" ACTION="">
Input text <br>
<textarea name="txtinput" cols=30 rows=10></textarea>
<br>
<input type="submit" >
</form>

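<?php
// Read the submitted text; fall back to an empty string on the first page load.
$in = isset($_GET["txtinput"]) ? $_GET["txtinput"] : "";
$inputlen = mb_strlen($in, "UTF-8");
echo ("Input length: $inputlen characters. <br> ");

// Delimiters: space, CR, LF, tab, question mark and full stop.
$tokens = mb_strtok(" \r\n\t?.", $in);
echo ("List of TOKENS<br> ");
//echo $tokens;
for ($i = 0; $i < count($tokens); $i++) {
    // Escape the token so that submitted markup is shown as text.
    echo ("[$i] -> " . htmlspecialchars($tokens[$i], ENT_QUOTES, "UTF-8") . " <br> ");
}

// Splits $str on any character in $delimiters and returns the list of tokens.
function mb_strtok($delimiters, $str = NULL)
{
    static $pos = 0; // Keep track of the position in the string for each subsequent call.
    static $string = "";
    static $listtoken = array();
    // If a new string is passed, reset the static parameters.
    if ($str !== NULL) {
        $pos = 0;
        $string = $str;
        $listtoken = array();
    }

    // Initialize the token.
    $token = "";
    $length = mb_strlen($string, "UTF-8"); // measure once, not on every iteration

    while ($pos < $length) { // loop till end of input string
        $char = mb_substr($string, $pos, 1, "UTF-8"); // fetch one character, pos = char position
        $pos++;
        //echo ("Char at $pos => $char <br> ");//trace character at position

        if (mb_strpos($delimiters, $char, 0, "UTF-8") === FALSE) { // if character is not a delimiter
            $token .= $char; // put character in the token node
        } else {
            // on arriving at a delimiter, push the token to listtoken
            if ($token !== "") { // skip empty tokens produced by consecutive delimiters
                array_push($listtoken, $token);
            }
            $token = ""; // clear the token node
        }
    }
    // push the last token, which has no delimiter after it
    if ($token !== "") {
        array_push($listtoken, $token);
    }
    // return the list of tokens (an empty array when there are none)
    return $listtoken;
}
?>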
</body>
</html>
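
For comparison, here is a minimal sketch of a different approach, not the function above: it lets preg_split() do the splitting in one pass with the /u (UTF-8) modifier, which avoids calling mb_substr() once per character and may cope better with long input. The function name mb_tokenize_sketch is hypothetical and used only for illustration.

```php
<?php
// Hypothetical helper (not part of the original post): split $str on any
// character listed in $delimiters and drop empty tokens.
function mb_tokenize_sketch(string $str, string $delimiters): array
{
    // With no delimiters there is nothing to split on.
    if ($delimiters === "") {
        return $str === "" ? [] : [$str];
    }
    // Quote the delimiters so they are safe inside a character class.
    $class = preg_quote($delimiters, '/');
    // The /u modifier makes PCRE treat the pattern and the subject as UTF-8.
    $tokens = preg_split('/[' . $class . ']+/u', $str, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false on failure, for example on invalid UTF-8.
    return $tokens === false ? [] : $tokens;
}

// Usage: ASCII delimiters and a multibyte delimiter.
print_r(mb_tokenize_sketch("naïve café. Apa khabar?", " \r\n\t?."));
print_r(mb_tokenize_sketch("東京、大阪、京都", "、"));
```

Unlike strtok(), this returns all tokens at once as an array, which matches how the listing above is used.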


Here is another version. This time the separators (. , ; :) are also stored in the list of tokens (listtoken).


<html>
<head>
<title>String token for MB_STRING</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<h1>String token for MB_STRING</h1>
<h2><a href="http://kerul.net">kerul.net</a></h2>

<form method="GET" ACTION="">
Input text <br>
<textarea name="txtinput" cols=30 rows=10></textarea>
<br>
<input type="submit" >
</form>

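<?php
// Read the submitted text; fall back to an empty string on the first page load.
$in = isset($_GET["txtinput"]) ? $_GET["txtinput"] : "";
$inputlen = mb_strlen($in, "UTF-8");
echo ("Input length: $inputlen characters. <br> ");

$tokens = mb_strtok(" \r\n\t\f", $in); // delimit by whitespace only
echo ("List of TOKENS<br> ");
//echo $tokens;
for ($i = 0; $i < count($tokens); $i++) {
    // Escape the token so that submitted markup is shown as text.
    echo ("[$i] -> " . htmlspecialchars($tokens[$i], ENT_QUOTES, "UTF-8") . " <br> ");
}

// Splits $str on whitespace and also stores the separators . ; : , as tokens.
function mb_strtok($delimiters, $str = NULL)
{
    static $pos = 0; // Keep track of the position in the string for each subsequent call.
    static $string = "";
    static $listtoken = array();
    // If a new string is passed, reset the static parameters.
    if ($str !== NULL) {
        $pos = 0;
        $string = $str;
        $listtoken = array();
    }

    // Initialize the token.
    $token = "";
    $length = mb_strlen($string, "UTF-8"); // measure once, not on every iteration

    while ($pos < $length) { // loop till end of input string
        $char = mb_substr($string, $pos, 1, "UTF-8"); // fetch one character, pos = char position
        echo ("Char at $pos => " . htmlspecialchars($char, ENT_QUOTES, "UTF-8") . " <br> "); // trace character at position

        if (mb_strpos($delimiters, $char, 0, "UTF-8") === FALSE) { // if character is not a delimiter
            if ($char == "." || $char == ";" || $char == ":" || $char == ",") {
                // a separator ends the current token and is stored as a token itself
                if ($token !== "") {
                    echo "Token detected " . htmlspecialchars($token, ENT_QUOTES, "UTF-8") . " <br> ";
                    array_push($listtoken, $token);
                    $token = ""; // clear the token node
                }
                array_push($listtoken, $char);
            } else {
                $token .= $char; // put character in the token node
            }
        } else {
            // on arriving at a delimiter, push the token to listtoken
            if ($token !== "") { // skip empty tokens produced by consecutive delimiters
                echo "Token detected " . htmlspecialchars($token, ENT_QUOTES, "UTF-8") . " <br> ";
                array_push($listtoken, $token);
            }
            $token = ""; // clear the token node
        }
        $pos++;
    }
    // push the last token, which has no delimiter after it
    if ($token !== "") {
        array_push($listtoken, $token);
    }
    // return the list of tokens (an empty array when there are none)
    return $listtoken;
}
?>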
</body>
</html>
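
The separator-keeping behaviour can also be sketched with preg_split() and the PREG_SPLIT_DELIM_CAPTURE flag. This is an alternative under stated assumptions, not the listing above: the name mb_tokenize_keep_sketch and its $keep parameter are hypothetical, whitespace is matched with \s, and each kept separator is returned as its own token.

```php
<?php
// Hypothetical helper (not part of the original post): whitespace is
// discarded, while every character in $keep becomes a token of its own.
function mb_tokenize_keep_sketch(string $str, string $keep = ".,;:"): array
{
    if ($keep === "") {
        // Nothing to keep: split on whitespace only.
        $pattern = '/\s+/u';
    } else {
        // The capturing group marks separators that should be kept;
        // the \s+ alternative is outside the group, so it is dropped.
        $pattern = '/([' . preg_quote($keep, '/') . '])|\s+/u';
    }
    $flags = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY;
    $tokens = preg_split($pattern, $str, -1, $flags);
    // preg_split() returns false on failure, for example on invalid UTF-8.
    return $tokens === false ? [] : $tokens;
}

// Usage: separators appear in the list in their original order.
print_r(mb_tokenize_keep_sketch("café, thé; fin."));
```

Because the kept separators are captured by the pattern itself, they come out in the same order as they appear in the input.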



Modified from the code at http://www.anastis.gr/mb_strtok-a-php-implementation/