Are there plans for full unicode support?

Discussion corner for Developers of Serendipity.
hansm
Regular
Posts: 5
Joined: Sat Oct 10, 2009 4:03 pm

Are there plans for full unicode support?

Post by hansm »

Are there plans for full Unicode support, i.e. not only in blog entries, but also for author names, category titles etc.?

Update: The "permalink" field in the DB table permalinks looks quite odd when author names, category titles or entry titles contain Unicode characters: non-ASCII characters are simply dropped. The malformed entries are produced by the function serendipity_makeFilename(), which is also used for the HTML output of URLs. I wonder what strange string manipulations it does. This is certainly not what I would call Unicode support.

In my local copy, I have replaced the whole function body with the following code:

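    if ($stripDots) {
        $str = str_replace('.', '', $str);
    }
    $str = str_replace(' ', '_', $str);
    // Note: urlencode() below escapes the '%' of these sequences again
    // (e.g. '/' ends up as '%252F' in the result).
    $str = str_replace('&', '%25', $str);
    $str = str_replace('/', '%2F', $str);
    return urlencode($str);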

This is not perfect, but at least I can now use permalinks without these weird ID numbers.
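
To make the effect of the replacement body easier to judge, here is a minimal sketch that wraps it in a stand-alone function and runs it on two sample names. The function name make_filename_sketch() and the sample strings are made up for illustration; this is not the real serendipity_makeFilename() signature.

```php
<?php
// Hypothetical wrapper around the replacement body quoted above.
function make_filename_sketch(string $str, bool $stripDots = false): string
{
    if ($stripDots) {
        $str = str_replace('.', '', $str);
    }
    $str = str_replace(' ', '_', $str);
    $str = str_replace('&', '%25', $str);
    $str = str_replace('/', '%2F', $str);
    return urlencode($str);
}

// "\u{00FC}" is the UTF-8 encoded letter u-umlaut.
echo make_filename_sketch("J\u{00FC}rgen M\u{00FC}ller"), "\n"; // J%C3%BCrgen_M%C3%BCller
echo make_filename_sketch('AC/DC'), "\n";                       // AC%252FDC
```

Non-ASCII characters survive as percent-encoded UTF-8 bytes, while the pre-escaped slash is encoded a second time by urlencode().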

Any comments? Can anybody explain the deeper meaning of the original code?
garvinhicking
Core Developer
Posts: 30022
Joined: Tue Sep 16, 2003 9:45 pm
Location: Cologne, Germany

Re: Are there plans for full unicode support?

Post by garvinhicking »

Hi!

The permalinks are used in URL context, and Unicode isn't valid there?! This is why we try to replace all characters, so that permalinks look good and are not mangled with unreadable characters (%5D or whatever).

You can already remove the ID numbers by going to the permalink section of the Serendipity configuration and removing the %id% placeholder there.

Regards,
Garvin
# Garvin Hicking (s9y Developer)
# Did I help you? Consider making me happy: http://wishes.garv.in/
# or use my PayPal account "paypal {at} supergarv (dot) de"
# My "other" hobby: http://flickr.garv.in/
hansm
Regular
Posts: 5
Joined: Sat Oct 10, 2009 4:03 pm

Re: Are there plans for full unicode support?

Post by hansm »

Yes, in URLs all non-ASCII characters (and even some ASCII ones) need to be escaped and, indeed, those URLs are hard to read. However, they are unambiguous and correct. On the other hand, all modern browsers decode these URLs when displaying them in the status line or the address bar, as long as the encoded part is not in the URL's query part (i.e. not after the question mark). When you use the shorter URLs (without "index.php?/") with mod_rewrite, even UTF-8 encoded URLs are displayed in a quite readable way by the browser.
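
As a small illustration of "hard to read, but unambiguous", the following sketch percent-encodes a UTF-8 name and decodes it again. The sample name is arbitrary; rawurlencode() and rawurldecode() are standard PHP functions.

```php
<?php
// Three Cyrillic letters, written as code points so the file encoding does not matter.
$name = "\u{0416}\u{0443}\u{043A}";

$encoded = rawurlencode($name);
echo $encoded, "\n";                          // %D0%96%D1%83%D0%BA

// Decoding restores exactly the original bytes, so nothing is lost.
var_dump(rawurldecode($encoded) === $name);   // bool(true)
```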

When I configure the permalinks as, say, "authors/%realname%" (without the "%id%-" part), s9y needs to look up the author's ID from the realname. This is done by querying the permalinks DB table. Here is the problem: what if the realname consists entirely of non-ASCII characters, or has only very few ASCII characters? The author's name is then not found at all, or becomes ambiguous. That is not a big deal as long as you use common European languages, but it is fatal in all other cases.
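
The ambiguity can be sketched as follows. Note that ascii_only_slug() is only a hypothetical stand-in for a scheme that drops non-ASCII characters, not the actual serendipity_makeFilename() logic, and the author list is invented.

```php
<?php
// Hypothetical: keep only ASCII letters, digits, '_' and '-'.
function ascii_only_slug(string $name): string
{
    $name = str_replace(' ', '_', $name);
    // Without the /u modifier the pattern works on bytes,
    // so every byte of a multi-byte UTF-8 character is removed.
    return (string) preg_replace('/[^A-Za-z0-9_-]/', '', $name);
}

// Hypothetical: keep everything, percent-encoded.
function encoded_slug(string $name): string
{
    return urlencode(str_replace(' ', '_', $name));
}

$authors = [
    1 => '山田 太郎',
    2 => '鈴木 一郎',
];

foreach ($authors as $id => $name) {
    printf("%d: stripped=%s encoded=%s\n", $id, ascii_only_slug($name), encoded_slug($name));
}
```

Both authors collapse to the same stripped value ("_"), so a lookup by permalink cannot tell them apart, whereas the encoded values stay distinct.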

That's why I have decided to store the full, URL-encoded realname in the permalinks table instead of what serendipity_makeFilename() used to produce.
garvinhicking
Core Developer
Posts: 30022
Joined: Tue Sep 16, 2003 9:45 pm
Location: Cologne, Germany

Re: Are there plans for full unicode support?

Post by garvinhicking »

Hi!

I see. If an author has ambiguous translations, then the %id% should IMHO best be put into the permalink.

However, I do get your point. I think we could add a configuration option for future Serendipity versions that will allow you to turn off the substitutions.

Currently, language include files can dictate their own replacement key/values ($globals['i18n_from'] and i18n_to), but that might not be a usable way out of your situation.
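
For readers unfamiliar with this kind of table-driven replacement, a rough sketch of the idea follows. The arrays $from and $to and their contents are purely illustrative; how the real language files name and fill them may differ.

```php
<?php
// Hypothetical replacement tables, as a German language file might define them.
$from = ['ä', 'ö', 'ü', 'ß'];
$to   = ['ae', 'oe', 'ue', 'ss'];

// str_replace() accepts parallel arrays of search and replacement strings.
$slug = str_replace($from, $to, 'Jürgen Groß');
echo $slug, "\n"; // Juergen Gross
```

This works well where a conventional ASCII transliteration exists, but it offers no obvious table for names written entirely in a non-Latin script.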

For the moment, your patch should work just fine. I'll try to work up a similar configurable patch for the next s9y version, as soon as we can switch our SVN trunk development to 1.6.

Thanks,
Garvin
hansm
Regular
Posts: 5
Joined: Sat Oct 10, 2009 4:03 pm

Re: Are there plans for full unicode support?

Post by hansm »

Fine, a configurable switch would be great. Then monolingual users of European languages can keep the current character encoding/dropping scheme and get readable URLs even without rewriting, while users of non-European languages can benefit from full Unicode support.

Just one thing: when writing the HTML output, HTML-escaping and URL-escaping of author, category and archive names seem to be applied in the wrong order. This only affects names that contain HTML special characters like "&" or quote marks, so it is no big limitation. But please take care of this when including the patch in the trunk.
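
To illustrate the ordering issue, here is a sketch under the assumption that a name is placed into a link's href. The "categories/" path and the sample name are invented; the point is only the order of the two escaping steps.

```php
<?php
$category = 'Tom & "Jerry"';

// Intended order: URL-escape the path segment first,
// then HTML-escape the finished attribute value.
$href  = 'categories/' . rawurlencode($category);
$label = htmlspecialchars($category, ENT_QUOTES, 'UTF-8');
$link  = '<a href="' . htmlspecialchars($href, ENT_QUOTES, 'UTF-8') . '">' . $label . '</a>';
echo $link, "\n";
// <a href="categories/Tom%20%26%20%22Jerry%22">Tom &amp; &quot;Jerry&quot;</a>

// Reversed order: the entities themselves get percent-encoded,
// so the URL no longer decodes to the original name.
$wrong = rawurlencode(htmlspecialchars($category, ENT_QUOTES, 'UTF-8'));
var_dump(rawurldecode($wrong) === $category); // bool(false)
```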

Big thanks.
garvinhicking
Core Developer
Posts: 30022
Joined: Tue Sep 16, 2003 9:45 pm
Location: Cologne, Germany

Re: Are there plans for full unicode support?

Post by garvinhicking »

Hi!

Great, thanks for the contribution. I've added this topic to my todo list. Even though it might take me until the end of this month, I won't forget it. :-)

Thanks,
Garvin
garvinhicking
Core Developer
Posts: 30022
Joined: Tue Sep 16, 2003 9:45 pm
Location: Cologne, Germany

Re: Are there plans for full unicode support?

Post by garvinhicking »

Hi!

I've committed an experimental patch to the new s9y 1.6 trunk, which will use your provided patch once the variable $i18n_filename_utf8 is set in either a language include file or the serendipity_config_local.inc.php file.
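
For anyone wanting to try it, the setting might look like the fragment below. This is only a sketch: the post says the variable has to be "set", so the value true is an assumption, as is the exact way the trunk code checks it.

```php
<?php
// In serendipity_config_local.inc.php (or a language include file).
// Assumption: any set value enables the new behaviour; true is an illustrative choice.
$i18n_filename_utf8 = true;
```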

If anyone wants to try it out, I'd appreciate feedback. Once this new code proves stable, we can add it as default to some of the UTF-8 languages.

Regards,
Garvin