[bug] garbled characters in some plugins

Posted: Fri Jun 23, 2006 12:23 am
by deminy
Although s9y supports many languages, including East Asian languages, there are still some minor bugs in its Asian-language support. When you use s9y to build a blog in a multibyte language (such as Chinese or Japanese), you may sometimes see garbled characters in the sidebar.

At a minimum, garbled characters appear in the sidebar when using the internal plugin "serendipity_archives_plugin" and the sidebar plugin "serendipity_plugin_comments".

There are two possible reasons why the garbled characters occur: 1. the web server's PHP doesn't have the mbstring extension; 2. PHP functions like "wordwrap" don't support multibyte strings.
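
As a rough illustration of the second reason (this is not code from s9y): byte-oriented functions such as strlen() and wordwrap() measure bytes, so a forced cut can land in the middle of a multibyte character. The sketch below assumes the file is saved as UTF-8 and that mbstring is loaded; the sample string and variable names are made up.

```php
<?php
// Assumption: this file is saved as UTF-8 and PHP has the mbstring extension.
if (!extension_loaded('mbstring')) {
    throw new RuntimeException('mbstring is not available on this server');
}

$text = '中文博客'; // made-up sample: 4 characters, 3 bytes each in UTF-8

$byteCount = strlen($text);             // counts bytes
$charCount = mb_strlen($text, 'UTF-8'); // counts characters

// wordwrap() measures width in bytes, so a forced cut can land inside a character.
$pieces   = explode("\n", wordwrap($text, 4, "\n", true));
$firstCut = $pieces[0];

// A character-based cut keeps every character whole.
$cleanCut = mb_substr($text, 0, 2, 'UTF-8');

echo $byteCount, ' bytes, ', $charCount, " characters\n";
var_dump(mb_check_encoding($firstCut, 'UTF-8')); // false: the piece ends mid-character
var_dump(mb_check_encoding($cleanCut, 'UTF-8')); // true
```

A piece that ends mid-character is no longer valid UTF-8, and that is the kind of fragment a browser renders as garbage.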

Here is the solution I found. It works for s9y v0.8.x to v1.0.

1. For the internal plugin "serendipity_archives_plugin":

In file "./include/lang.inc.php", check line 63 (in function serendipity_mb()). The original code is:

Code:

return mb_strtoupper(mb_substr($args[1], 0, 1)) . mb_substr($args[1], 1);

Modify it to:

Code:

return mb_strtoupper(mb_substr($args[1], 0, 1, mb_detect_encoding($args[1])), mb_detect_encoding($args[1])) . mb_substr($args[1], 1, mb_strlen($args[1], mb_detect_encoding($args[1])), mb_detect_encoding($args[1]));
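
The one-line patch above calls mb_detect_encoding() several times on the same string. A minimal sketch of the same idea with the encoding detected once might look like this. The function name and the UTF-8 fallback are hypothetical and not part of s9y.

```php
<?php
// Hypothetical helper, not part of s9y: upper-case the first character of a
// multibyte string, detecting the encoding once and passing it explicitly.
function sketch_mb_ucfirst($string, $fallbackEncoding = 'UTF-8')
{
    $encoding = mb_detect_encoding($string);
    if ($encoding === false) {
        $encoding = $fallbackEncoding; // assumed fallback when detection fails
    }

    $first = mb_substr($string, 0, 1, $encoding);
    $rest  = mb_substr($string, 1, mb_strlen($string, $encoding), $encoding);

    return mb_strtoupper($first, $encoding) . $rest;
}

echo sketch_mb_ucfirst('élan'), "\n"; // Élan
echo sketch_mb_ucfirst('中文'), "\n"; // unchanged, there is no upper case
```

The behavior is meant to match the patched line; only the repeated detection is factored out.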

2. For the sidebar plugin "serendipity_plugin_comments":

In file "...../serendipity_plugin_comments.php", from line 153 to line 202 (in function generate_content(&$title)), make the following changes:

2.1. Replace "$serendipity['lang'] == "ja"" with:

Code:

($serendipity['lang'] == "ja" || $serendipity['lang'] == "cn" || $serendipity['lang'] == "zh" || $serendipity['lang'] == "ko" || $serendipity['lang'] == "tw" || $serendipity['lang'] == "tn")

2.2. For multibyte functions such as mb_strimwidth() and mb_strlen(), pass the encoding explicitly as the last parameter.

For example:

Original source code:

Code:

mb_strlen( $comment)

After modification:

Code:

mb_strlen( $comment, mb_detect_encoding($comment))
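
Steps 2.1 and 2.2 taken together can be sketched roughly as follows. This is not the plugin's actual code: the function name, the "..." marker, and the UTF-8 fallback are assumptions made for illustration, and the language codes are copied from step 2.1.

```php
<?php
// Hypothetical sketch, not the plugin's actual code: shorten a comment for
// display in the sidebar. Assumes $maxWidth is larger than 3.
function sketch_trim_comment($comment, $lang, $maxWidth)
{
    // Step 2.1: language codes copied from the post above.
    $multibyteLangs = array('ja', 'cn', 'zh', 'ko', 'tw', 'tn');

    if (in_array($lang, $multibyteLangs, true)) {
        $encoding = mb_detect_encoding($comment);
        if ($encoding === false) {
            $encoding = 'UTF-8'; // assumed fallback when detection fails
        }
        // Step 2.2: pass the encoding explicitly as the last parameter.
        return mb_strimwidth($comment, 0, $maxWidth, '...', $encoding);
    }

    // Other languages: the byte-based functions are used as before.
    if (strlen($comment) > $maxWidth) {
        return substr($comment, 0, $maxWidth - 3) . '...';
    }
    return $comment;
}

echo sketch_trim_comment('中文博客测试', 'zh', 8), "\n";
echo sketch_trim_comment('hello world', 'en', 8), "\n";
```

Note that mb_strimwidth() measures display width rather than character count, so wide East Asian characters use up the limit faster than Latin letters.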

I wrote a blog post in Chinese discussing this problem:
http://www.deminy.net/blog/archives/4214-y.html

Posted: Fri Jun 23, 2006 3:06 am
by deminy
I forgot to mention one thing.

S9y does have a "mb_internal_encoding()" statement in the file "include/lang.inc.php", but it seems to have no effect on the multibyte functions that are called later in s9y.

For example, as mentioned above, in file "...../serendipity_plugin_comments.php", when you call the PHP function "mb_strlen", you might think that, since the internal encoding has already been set, you could write the code this way:

Code:

mb_strlen( $comment)

Here you would expect "mb_strlen" to use the default encoding (the one set earlier by the call to "mb_internal_encoding").

BUT, actually, to avoid garbled characters (for East Asian languages), you MUST write it like this:

Code:

mb_strlen( $comment, mb_detect_encoding($comment))

The above changes won't affect performance much, especially when you are using a language with a single-byte encoding, such as English.
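
To show why the explicit parameter can matter, here is a small sketch of what happens when the internal encoding in effect differs from the string's real encoding. Whether this is what actually happens inside s9y is not established in this thread; the ISO-8859-1 setting is only a simulated mismatch, and the sample string is made up. The file is assumed to be saved as UTF-8.

```php
<?php
// Assumption: this file is saved as UTF-8. The sample string is made up.
$comment = '中文'; // 2 characters, 6 bytes in UTF-8

$previous = mb_internal_encoding(); // remember the current setting

// Simulate an internal encoding that does not match the string.
mb_internal_encoding('ISO-8859-1');
$implicitLength = mb_strlen($comment);                               // every byte counts as a character
$explicitLength = mb_strlen($comment, mb_detect_encoding($comment)); // uses the detected encoding

// With a matching internal encoding the short form gives the same answer.
mb_internal_encoding('UTF-8');
$matchingLength = mb_strlen($comment);

mb_internal_encoding($previous); // restore the original setting

echo $implicitLength, ' ', $explicitLength, ' ', $matchingLength, "\n"; // 6 2 2
```

The explicit form does not depend on whatever internal encoding happens to be active at the call site.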

Posted: Fri Jun 23, 2006 10:48 am
by garvinhicking
Hi!

Thanks a lot for helping to improve that situation! I committed your changes to SVN trunk.

However, so that I understand things: do you know why mb_internal_encoding() does not work? I would have thought that setting it to LANG_CHARSET (which should be 'UTF-8' in your case) would do the trick.

I believe you if you say it doesn't work, but isn't this an unsatisfying situation? :)

Best regards,
Garvin

Posted: Fri Jun 23, 2006 4:02 pm
by deminy
Detailed debugging and testing would cost me a lot of time, which I cannot afford right now.

I did run a few simple tests, but I still couldn't determine the exact reason. Setting it to LANG_CHARSET might work for very simple code, but not for s9y.

Let me make it clear.

According to PHP's documentation, the sample code above,

Code:

mb_strlen( $comment)

SHOULD have the same effect as the following code:

Code:

mb_strlen( $comment, mb_detect_encoding($comment))

But actually, in s9y, the first causes garbled characters while the second doesn't.

The cause could be either a bug in PHP's multibyte functions, or some other unknown bug or setting in s9y or in my test. (My guess is that this is not a bug in s9y itself, but I also don't think I did anything wrong in the test.)

It might be an unknown bug in PHP (though it seems not). For more information, you can read the comments on the mb_strtolower function at PHP.net:
http://ca.php.net/mb_strtolower
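
One way to narrow this down would be to log, at the point where the sidebar text is generated, which internal encoding is active and which encoding is detected for the string. The helper below is a hypothetical debugging sketch, not part of s9y, and the sample string is made up.

```php
<?php
// Hypothetical debugging helper, not part of s9y: report the encodings that
// the short and the explicit form of mb_strlen() would use for a string.
function sketch_mb_report($string)
{
    $detected = mb_detect_encoding($string);

    return array(
        'internal_encoding' => mb_internal_encoding(),
        'detected_encoding' => $detected,
        'length_implicit'   => mb_strlen($string),
        'length_explicit'   => $detected === false ? null : mb_strlen($string, $detected),
    );
}

print_r(sketch_mb_report('中文'));
```

If the two encodings or the two lengths differ at the call site, the internal encoding set in "include/lang.inc.php" is not the one in effect there.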

Posted: Sat Jun 24, 2006 12:28 am
by garvinhicking
Hi!

Oh, okay. I can understand that. I'll try to run some tests, but mainly my Chinese is a bit rusty. ;))

Are you using a UTF-8 environment? With which serendipity language file?

Regards,
Garvin

Posted: Sat Jun 24, 2006 6:41 am
by deminy
I am using Simplified Chinese ( utf-8 ).

Based on my experience, I don't think it makes much difference whether you choose Simplified Chinese ( utf-8 ) or Simplified Chinese ( gb2312 ).

These two charsets use exactly the same language files in s9y.

Good luck