Hi!
The easiest way to debug charset issues with the database is to start as low-level as possible.
Write a "utf8test.php" script:
Code: Select all
<?php
// Adjust credentials, database name and table prefix to your setup.
$c = mysqli_connect('localhost', 'root', 'supersecretpassword');
mysqli_select_db($c, 'serendipity');
header('Content-Type: text/plain; charset=UTF-8');

echo "Output without setting a charset:\n";
$r = mysqli_query($c, "SELECT title FROM serendipity_entries LIMIT 15");
while ($row = mysqli_fetch_array($r, MYSQLI_ASSOC)) {
    echo "* " . $row['title'] . "\n";
}

echo "\nOutput with setting a charset:\n";
mysqli_query($c, "SET NAMES utf8");
$r = mysqli_query($c, "SELECT title FROM serendipity_entries LIMIT 15");
while ($row = mysqli_fetch_array($r, MYSQLI_ASSOC)) {
    echo "* " . $row['title'] . "\n";
}
Adapt that code to match your database credentials, database name and Serendipity table prefix, then run it in your browser.
In an ideal world, where the MySQL default charset is UTF-8, you should see proper UTF-8 characters in both outputs (given that at least one of your 15 entries contains an umlaut in the title; raise the LIMIT if none does).
In a bad world you might get broken UTF-8 output in the first variant, but proper output in the second. If that happens, your MySQL server uses ISO-8859-1 as its default charset.
In a very bad world, both will yield double UTF-8 encodings. The most likely reason is the following: when you installed Serendipity, you might have used a MySQL server that had ISO-8859-1 as its default charset, so the tables were created with ISO-8859-1 collations. But then you configured s9y to use UTF-8, and all the data in the tables got saved as UTF-8.
That probably worked out just fine, simply because you saved raw UTF-8 and MySQL stored it as-is, without performing any charset conversion. But at the point where MySQL started to use UTF-8 as its default, you might have gotten "?" characters in the output; that would have been fixed with a "SET NAMES utf8" command, which s9y enables with the "Set dbNames" option you also found.
Now, what probably happened is that some automatic upgrade script tried to be smarter than you: it saw that you had tables with latin1/ISO-8859-1 collations and thought "Hey, that's bad, we want UTF-8". It then used MySQL's character set conversion to transform latin1 to UTF-8 and "fix" the collation of the tables.
Where it failed is that the tables already stored UTF-8! So the content got double-encoded, and is now stored like that.
The real way to perform a UTF-8 migration in that case is to take the table in its latin1 state, save an SQL dump to a file, and NOT change the content of that file: simply add a "SET NAMES utf8" statement (if not already present) as the first line of the dump, change all latin1 collations to "utf8_general_ci", and then reimport the dump. This way only the table metadata gets updated, and the actual content remains proper UTF-8 as it is. I think this should be documented in the UTF-8 migration FAQ on our documentation site.
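To illustrate the metadata-only rewrite, here is a minimal sketch. It assumes the dump uses MySQL's usual latin1 defaults ("DEFAULT CHARSET=latin1", "COLLATE=latin1_swedish_ci"); the function name and file names are made up for this example, and you should check your dump for other latin1 collation names before relying on it:

Code: Select all
<?php
// Rewrite only the table metadata of a latin1 dump so that MySQL
// reimports the (already UTF-8) row data unchanged.
function fix_dump_charset(string $dump): string
{
    // Make sure the import session treats the bytes as UTF-8.
    if (stripos($dump, 'SET NAMES') === false) {
        $dump = "SET NAMES utf8;\n" . $dump;
    }
    // Swap the latin1 table definitions for utf8 ones.
    $dump = str_replace('DEFAULT CHARSET=latin1', 'DEFAULT CHARSET=utf8', $dump);
    $dump = str_replace('COLLATE=latin1_swedish_ci', 'COLLATE=utf8_general_ci', $dump);
    return $dump;
}

// Example usage (file names are placeholders):
// file_put_contents('dump.fixed.sql', fix_dump_charset(file_get_contents('dump.sql')));

Note that only CREATE TABLE clauses are touched; no byte of the INSERT data is altered, which is the whole point of this migration path.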
In your case, where you no longer have the old table structure with latin1 collations, you will need to resort to scripts that try to re-encode double UTF-8 back to "single" UTF-8. This can basically work as described in
https://stackoverflow.com/questions/114 ... tf-8-table and needs to be done for every field, in every table.
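The per-column repair from that kind of answer boils down to one UPDATE statement: interpret the stored value as latin1 again (undoing one encoding layer) and re-read the resulting bytes as UTF-8. As a sketch, a small helper like the following (the function name is made up, and table/column names must be trusted, so don't feed it user input) can generate that statement; back up the table before running it:

Code: Select all
<?php
// Build the repair query for one double-encoded column:
// CONVERT(col USING latin1) recovers the original UTF-8 bytes,
// BINARY strips the charset label, CONVERT(... USING utf8)
// then reinterprets those bytes as UTF-8.
function build_repair_query(string $table, string $column): string
{
    return sprintf(
        "UPDATE %s SET %s = CONVERT(BINARY CONVERT(%s USING latin1) USING utf8)",
        $table, $column, $column
    );
}

// e.g. build_repair_query('serendipity_entries', 'title')

You would then run the generated query once per affected column, e.g. via mysqli_query().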
The other way I sometimes resort to is a simple PHP script like this, which operates on an SQL dump you have created with the "--skip-set-charset" option:
Code: Select all
<?php
// Reads a dump created with: mysqldump --skip-set-charset ...
$dump = file_get_contents('dump.sql');
$fp = fopen('dump.fixed.sql', 'wb');
// utf8_decode() strips one layer of UTF-8 encoding from the dump.
fwrite($fp, "SET NAMES utf8;\n" . utf8_decode($dump));
fclose($fp);
This removes "one" layer of the double UTF-8 encoding, but in some edge cases it can really fail. The best way is to work with the dump in the first place, before an automatic upgrade screwed with it.
Best regards,
Garvin