metadesc: extract issues (p vs. div, classes & ids)

serotonic · Post by **serotonic** » Tue Jan 11, 2011 4:24 pm

Hi,

I've experienced some odd behavior while using the automatic extract function of the metadesc plugin. (I've already reported the issues in German. Garvin thought it would be advantageous to discuss them in English, so I'll try.)

1. Description: Number of Characters
Under certain circumstances, the automatic extraction delivers more than the first 120 characters, as you can see here (> 2,000 characters).

Timbalu suggested changing the extract_description function (lines 91, 96) from <p> to <div>, which worked perfectly fine for me in many respects (see 3 & 4).

2. Description: Line Wrap
The meta description of the blogpost about tasty, tasty Wirsinggemüse contains several line wraps. Not too bad, but not so pretty either.

(I've seen this in other blogposts, too. Please let me know if you need more examples.)

3. Description: <p> vs. <div>
I don't understand why the extraction is restricted to an initial <p> tag, and I think a <div> would do much better in this case. Don't get me wrong, I am using <p> tags and love semantic markup, but I can imagine a whole lot of blogposts containing not a single <p> tag. (Blogposts starting with (or even consisting of) lists, blockquotes and subheadlines, for example.)

So why not change the <p> to the <div>? It shouldn't do any harm to users who only use <p> tags, should it?

4. Description and Keywords: Classes and IDs
The automatic extraction skips all elements supplied with classes or ids.

Since Garvin enhanced the extract_keywords part, it works for some elements with classes, but for some it still doesn't. E.g. in my blog, keywords should be extracted from h3, h4 and cite, but as you can see in this example, the <h3>s are still being skipped.

extract_description doesn't skip these elements if <p> is changed to <div>.

I know that classes and ids are usually not used by s9y and most of its users, but I happen to use them very frequently and would appreciate your help a lot.

Regards,
serotonic

Timbalu · Post by **Timbalu** » Tue Jan 11, 2011 5:28 pm

serotonic wrote:Timbalu suggested changing the extract_description function (lines 91, 96) from <p> to <div>, which worked perfectly fine for me in many respects (see 3 & 4).

Just some small notes on our discussion, for the developers:
Well, I suggested using a div instead of a p, but in the entry text itself, as a quick way to get around the first paragraph not showing up in the meta output.

I don't think you need all this stuff in this function, since an automatically generated meta description snipped from the entry text does not need any HTML tags. So, very simply, IMHO it could just look like this:

Code: Select all

    function extract_description($text) {
        return substr(strip_tags($text), 0, 120);
    }

serotonic wrote:2. Description: Line Wrap

The automatic description does not need any linebreaks, tags or htmlspecialchars.

serotonic wrote:3. Description: <p> vs. <div>
So why not change the <p> to the <div>?

What for? There is no need for it in meta tags.

serotonic wrote:4. Description and Keywords: Classes and IDs
The automatic meta extraction skips all elements supplied with classes or ids.

Since Garvin enhanced the extract_keywords part, it works for some elements with classes, but for some it still doesn't. E.g. in my blog, keywords should be extracted from h3, h4 and cite, but as you can see in this example, the <h3>s are still being skipped.

I still don't see the need for the regex, since it is used to find freetag "tags" in normal entry text, as far as I understand this plugin.
So, in the end, you just need to get rid of HTML <tags> with strip_tags() and find freetag "tags" in the plain text. That's all.

Having had a look at the source code of your example page, I would say all these entities need to be removed from the description too. Google does not need something like this:
<meta name="description" content="&#9733;&#9733;&#9733;&#9733;&#9733;&#9733;&#9733;&#9734;&#9734;&#9734;" />

Regards,
Ian

serotonic · Post by **serotonic** » Tue Jan 11, 2011 6:27 pm

Ian, just some small notes to your notes

Timbalu wrote:
2. Description: Line Wrap
The automatic description does not need any linebreaks, tags or htmlspecialchars.

Sure it doesn't. But for now, there are linebreaks.

Timbalu wrote:
3. Description: <p> vs. <div>
So why not change the <p> to the <div>?
what for? No need in Meta tags.

Oh, I'm not a coder, I don't know how to solve this in a neat way. I just didn't understand the limitation to the <p> tag and thought it would be better to extract the first 120 characters of (text) content, no matter which tag is wrapped around it.

4. Description and Keywords: Classes and IDs
The automatic meta extraction skips all elements supplied with classes or ids.

Timbalu wrote:I still don't see the need for the regex, since it is used to find freetag "tags" in normal entry text, as far as I understand this plugin.

This one has nothing to do with the freetag plugin, which isn't even installed on this blog. As far as I understand the metadesc plugin, it searches for tags like <b> or <strong> and uses their content as meta keywords.

Timbalu wrote:Having had a look at the source code of your example page, I would say all these entities need to be removed from the description too. Google does not need something like this:
<meta name="description" content="&#9733;&#9733;&#9733;&#9733;&#9733;&#9733;&#9733;&#9734;&#9734;&#9734;" />

It depends

First of all, if the headline wasn't skipped due to the classes-issue, the description would look this way:

Code: Select all

<meta name="description" content="Private Practice – Staffel 3 &#9733;&#9733;&#9733;&#9733;&#9733;&#9733;&#9733;&#9734;&#9734;&#9734;">

Displaying my rating of the series in the search results would definitely make sense.

Regards,
serotonic

Timbalu · Post by **Timbalu** » Tue Jan 11, 2011 6:56 pm

serotonic wrote:Ian, just some small notes to your notes

OK, I have never used this plugin before; I was just guessing after a very quick dive into the plugin's code.

And I posted my notes here for Don and Judebert to think about.

serotonic wrote:
Timbalu wrote:
2. Description: Line Wrap
The automatic description does not need any linebreaks, tags or htmlspecialchars.
Sure it doesn't. But for now, there are linebreaks.

Yes, that's why you don't need htmlspecialchars and will need some sort of regex to get rid of \n.

serotonic wrote:
Timbalu wrote:
3. Description: <p> vs. <div>
So why not change the <p> to the <div>?
what for? No need in Meta tags.
Oh, I'm not a coder, I don't know how to solve this in a neat way. I just didn't understand the limitation to the <p> tag and thought it would be better to extract the first 120 characters of (text) content, no matter which tag is wrapped around it.

Well, the cut at 120 is built in, as far as I know (and this function does not need tags to do so). As far as I understood Garvin, this function is only used when you do not set the meta description manually.

4. Description and Keywords: Classes and IDs
The automatic meta extraction skips all elements supplied with classes or ids.

serotonic wrote:
Timbalu wrote:I still don't see the need for the regex, since it is used to find freetag "tags" in normal entry text, as far as I understand this plugin.
This one has nothing to do with the freetag plugin, which isn't even installed on this blog. As far as I understand the metadesc plugin, it searches for tags like <b> or <strong> and uses their content as meta keywords.

Oh holy Sh.., back to the start! I thought the freetag tags were meant, sorry. In that case you really do need the regex!

serotonic wrote:
Timbalu wrote:Having had a look at the source code of your example page, I would say all these entities need to be removed from the description too. Google does not need something like this:
<meta name="description" content="&#9733;&#9733;&#9733;&#9733;&#9733;&#9733;&#9733;&#9734;&#9734;&#9734;" />
It depends

First of all, if the headline wasn't skipped due to the classes-issue, the description would look this way:
Code: Select all
<meta name="description" content="Private Practice – Staffel 3 &#9733;&#9733;&#9733;&#9733;&#9733;&#9733;&#9733;&#9734;&#9734;&#9734;">
Displaying my rating of the series in the search results would definitely make sense.

Yes truly, but none of these exiting entities....

Cheers,
Ian

serotonic · Post by **serotonic** » Wed Jan 12, 2011 11:27 am

Timbalu wrote:
serotonic wrote: First of all, if the headline wasn't skipped due to the classes-issue, the description would look this way:
Code: Select all
<meta name="description" content="Private Practice – Staffel 3 &#9733;&#9733;&#9733;&#9733;&#9733;&#9733;&#9733;&#9734;&#9734;&#9734;">
Displaying my rating of the series in the search results would definitely make sense.
Yes truly, but none of these exiting entities....

Hehe

These exiting entities seem to be the only way to display unicode star characters in entries. Using the characters themselves only works for static pages, and Google at least interprets the black star, as you can see here.

So I still don't see an urgent need to get rid of them in the matter of improving the metadesc plugin --although I'd love to use the character instead of its entities (in entrybody AND metadesc).

Please tell me if we are talking past each other

Regards,
serotonic

Timbalu · Post by **Timbalu** » Wed Jan 12, 2011 12:04 pm

serotonic wrote: Hehe These exiting entities seem to be the only way to display unicode star characters in entries. Using the characters themselves only works for static pages, and Google at least interprets the black star, as you can see here.

Yes, but Google, although able to read entities, does not use your meta description to display these results (at least I think so). And I once read this:
"To html-encode Unicode characters that may not be part of your document character set (given in the META tag of your page), and so cannot be output directly into your document source, you need to use mb_encode_numericentity(). Pay attention to its conversion map argument."
And surely there is the opposite as well, mb_decode_numericentity().

serotonic wrote:So I still don't see an urgent need to get rid of them in the matter of improving the metadesc plugin --although I'd love to use the character instead of its entities (in entrybody AND metadesc).

I am able to do so in my local blog. ★ = &#9733; but Meta is a question of htmlspecialchars, I assume.

serotonic wrote:Please tell me if we are talking past each other

Are we?

Regards,
Ian

Post by **garvinhicking** » Wed Jan 12, 2011 12:26 pm

Hi!

You did see that I added a new config option to disable htmlspecialchars() into the plugin, yes?

Regards,
Garvin

serotonic · Post by **serotonic** » Wed Jan 12, 2011 12:34 pm

Hi Garvin,

yes, I did!

When it is set to "no", the output is:

Code: Select all

<meta name="description" content="&#9733;&#9733;&#9733;&#9733;&#9733;&#9733;&#9733;&#9734;&#9734;&#9734;" />

And set to "yes" (default):

Code: Select all

<meta name="description" content="&#9733;&#9733;&#9733;&#9733;&#9733;&#9733;&#9733;&#9734;&#9734;&#9734;" />

So using this option helps me to have accurate entities, but it won't help getting rid of them?

Timbalu wrote:I am able to do so in my local blog. ★ = &#9733

On my blog using ★ in entries leads to ?. Maybe I should post that to a new thread.

Regards,
serotonic
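For readers following along, here is a minimal sketch of why an escaping option can change how entities come out. It assumes the new option simply switches a call to htmlspecialchars() on or off, which is a guess based on Garvin's description and not taken from the plugin source. The variable names are made up for illustration.

```php
<?php
// Assumption: the entry text arrives from the database with numeric entities in it.
$fromDatabase = '&#9733;&#9734;';

// Plain escaping turns the leading "&" of each entity into "&amp;",
// so the entity is shown literally instead of as a star.
$escaped = htmlspecialchars($fromDatabase, ENT_QUOTES, 'UTF-8');

// With the fourth argument (double_encode) set to false,
// entities that already exist are left untouched.
$kept = htmlspecialchars($fromDatabase, ENT_QUOTES, 'UTF-8', false);

echo $escaped, "\n", $kept, "\n";
```

Either way the stars stay entities; escaping only decides whether they get encoded a second time.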

Timbalu · Post by **Timbalu** » Wed Jan 12, 2011 12:42 pm

Garvin, could you try with an mb_decode_numericentity in the no htmlspecialchars section?
Then they should still appear in meta desc, but as stars.

Ian

Post by **garvinhicking** » Wed Jan 12, 2011 1:18 pm

Hi!

mb* is not always available, which is why I'd like to avoid depending on it.

The entities should only be there because they are inside your database table, serotonic. It could be that your blog and the database tables have a mismatching charset; the tables and your blog should run in UTF-8 - of course only if your entered chars are also part of UTF-8? Many browsers encode entities on their own, so you might want to check if changing the browser to submit an entry might help. Also, if you're not doing that already, avoid WYSIWYG editors, those might also translate real characters to entities.

The goal for you/us would be to make sure that the characters will not get saved as entities, but proper UTF-8 characters. Are they maybe only UTF-16 characters? Or maybe only contained in latin1?

Regards,
Garvin
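As a side note on the mb* concern: PHP's core html_entity_decode() also decodes numeric entities and does not depend on the mbstring extension. The sketch below only shows that call on a made-up string; whether it belongs in the plugin is an open question, and fixing the charset at the source, as Garvin suggests, avoids the entities in the first place.

```php
<?php
// Hypothetical stored value containing numeric entities instead of real characters.
$stored = 'Private Practice &#9733;&#9733;&#9734;';

// Decode the entities into real UTF-8 characters using a core function (no mbstring needed).
$decoded = html_entity_decode($stored, ENT_QUOTES, 'UTF-8');

echo $decoded, "\n";
```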

serotonic · Post by **serotonic** » Thu Jan 13, 2011 1:07 pm

Hi!

For the sake of completeness: Garvin was right, my database tables had a mismatching charset, which is now fixed. So no need to talk about entities anymore.

Meanwhile, this thread looks quite confusing -- so I'll try to outline the problems I see with the automatic(!) features of this plugin again.

1. Description: Number of Characters
I know that a cut at 120 characters is built in, but as I mentioned before, the automatic extraction delivers more than the first 120 characters under certain circumstances. Example, > 2,000 characters

2. Description: Linebreaks
The automatic meta description contains linebreaks, though the meta description doesn't need any linebreaks, as Timbalu mentioned, too.

3. Description: Why only <p> tags?
I don't understand why the automatic extraction is restricted to an initial <p> tag. I can imagine a whole lot of blogposts containing not a single <p> tag. (Blogposts starting with (or even consisting of) lists, blockquotes and subheadlines, for example.)

4. Description and Keywords: Classes and IDs
The automatic extraction skips all elements supplied with classes or ids.

extract_description: Maybe it wouldn't skip these elements if there was no restriction to the <p> tag.
extract_keywords: Since Garvin enhanced this, it works for some elements with classes, but for some it still doesn't. E.g. in my blog, keywords should be extracted from h3, h4 and cite, but as you can see in this example, the <h3>s are still being skipped.

Hope this summary helps make things a bit clearer.
I appreciate your feedback and help a lot!

Regards,
serotonic

Timbalu · Post by **Timbalu** » Thu Jan 13, 2011 1:54 pm

serotonic wrote:1. Description: Number of Characters
2. Description: Linebreaks

Hi Serotonic

As I tried to say before:
If you want the automatic description to be parsed from the entry text, the text is handled by the function extract_description($text).
It looks for the first occurrence of <p> or </p>. If there is no opening or closing p tag, it returns the text cut to 120 characters. If there are any p tags, it returns the stripped text without the cut to 120 characters. Don't ask why!

What we need to put in there now should be something like this

Code: Select all

    return substr(strip_tags(str_replace('\n',' ',$title)), 0, 120);

at the end of this function, replacing the

Code: Select all

$title = strip_tags($title);  
return $title;

with it.
You could apply this at lines 102/103 of the latest revision of serendipity_event_metadesc.php and give it a try.

Update Serendipity if you haven't already! Your version is possibly vulnerable!

Regards,
Ian

Edit:

serotonic wrote:4. Description and Keywords: Classes and IDs

your h3 looks like

Code: Select all

<h3 id="PrivatePractice"><a href="http://www.imdb.com/title/tt0972412/">Private Practice</a> – Staffel 3</h3>

Which keyword do you want: "Private Practice" or " – Staffel 3"?
I know it was you who wrote the cite tags, but did you write the <h3> and <a> yourself too?

If so, the regex has to do something like this:
look for <
look for $tags[$i]
look for anything with spaces and characters following until
look for >
which covers the <h3 id="PrivatePractice">
capture and hold the rest (.*?)
look for </
look for $tags[$i]
look for >
then
$tag = strip_tags($rest)
which would be close to this kind of regex change, but I am not good at writing regexes

Code: Select all

preg_match_all('/[<' . $tags[$i] . '[*.?][^>]*>]([^>]*)[<\/' . $tags[$i] . '>]/si', $text, strip_tags($match))) {
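The steps described above can be sketched as a small standalone function. This is only an illustration under assumptions, not the plugin's actual code: the function name, the default tag list and the whitespace cleanup are made up for the example, and a regex remains a rough tool for HTML (nested elements of the same tag, for instance, are not handled).

```php
<?php
// Hypothetical helper for illustration; not the plugin's actual code.
function sketch_extract_keywords($html, array $tags = array('h3', 'h4', 'cite')) {
    $keywords = array();
    foreach ($tags as $tag) {
        $name = preg_quote($tag, '/');
        // "<tag", optional attributes such as class or id, ">",
        // then the lazily captured content up to the matching "</tag>".
        $pattern = '/<' . $name . '\b[^>]*>(.*?)<\/' . $name . '\s*>/si';
        if (preg_match_all($pattern, $html, $matches)) {
            foreach ($matches[1] as $inner) {
                // Drop nested tags like <a> and tidy up the whitespace.
                $keyword = trim(preg_replace('/[ \t\r\n]+/', ' ', strip_tags($inner)));
                if ($keyword !== '') {
                    $keywords[] = $keyword;
                }
            }
        }
    }
    return array_values(array_unique($keywords));
}

$html = '<h3 id="PrivatePractice"><a href="http://www.imdb.com/title/tt0972412/">Private Practice</a> – Staffel 3</h3>';
print_r(sketch_extract_keywords($html));
```

Note that strip_tags() is applied to the captured text after matching, not to the match array passed into preg_match_all(), which has to be a plain variable.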

serotonic · Post by **serotonic** » Fri Jan 14, 2011 1:45 pm

Hi Timbalu!

1. Description: Number of Characters
2. Description: Linebreaks

You are right, editing lines 102/103 of serendipity_event_metadesc.php to

Code: Select all

    return substr(strip_tags(str_replace('\n',' ',$title)), 0, 120);

fixes 1. Description: Number of Characters. Thanks a lot!

But on blogposts starting with an element using a class/id, there is still a linebreak. (Wirsinggemüse) So apparently it does not fix the entire 2. Description: Linebreaks-thingy.

4. Description and Keywords: Classes and IDs

For the keyword part of this, I simply wasn't aware that tags with nested tags won't work. I had expected the plugin to ignore the <a> tag and return "Private Practice – Staffel 3" as one keyword.

I've tried the regex change you posted on line 112 (111 according to the change on lines 102/103), but it returns no keywords at all.

3. Description: Why only <p> tags?

After all, all of the automatic description issues trace back to the fact that it relies on the use of the first, class-and/or-id-less <p> tag. I still think that it would be a much better choice to use the first 120 characters of text content, no matter which tag is wrapped around it. It covers so many more entry scenarios!

I'd really like to convince the developers of the metadesc plugin that this would return more matching and less error-prone results for all kinds of s9y users (for those who are using wysiwyg editors, mainly using paragraphs AND those who use everything html gives us in the matter of semantics and love to have individual formatting).

Regards,
serotonic

Timbalu · Post by **Timbalu** » Fri Jan 14, 2011 2:23 pm

Sounds good - we could try to solve number 2 with

Code: Select all

return substr(strip_tags(str_replace('\n',' ',trim($title))), 0, 120);

To point 3: yes, full ACK!

Number 4 regex will get improved, just wait a minute....

Ian
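One possible reason a linebreak can survive the snippets above: in PHP, '\n' in single quotes is a backslash followed by the letter n, not a newline, so str_replace('\n', ...) never touches real line breaks; "\n" in double quotes would. The following is a minimal sketch of a description extractor that takes the first 120 characters of text regardless of the wrapping tag. The function name and the mb_substr fallback are assumptions for illustration, not the plugin's code.

```php
<?php
// Hypothetical helper for illustration; not the plugin's actual code.
function sketch_extract_description($html, $limit = 120) {
    // Turn every tag into a space so "</h3><p>" does not glue two words together.
    $text = preg_replace('/<[^>]*>/', ' ', $html);
    // Collapse each run of spaces, tabs, \r and \n into a single space.
    $text = trim(preg_replace('/[ \t\r\n]+/', ' ', $text));
    // Prefer a character-based cut when mbstring is available (assumes UTF-8 input);
    // otherwise fall back to the byte-based substr().
    if (function_exists('mb_substr')) {
        return mb_substr($text, 0, $limit, 'UTF-8');
    }
    return substr($text, 0, $limit);
}

echo sketch_extract_description("<h3 class=\"recipe\">Wirsing</h3>\n<p>tasty,\r\ntasty</p>"), "\n";
```

The byte-based fallback can cut a multi-byte UTF-8 character in half at the limit, which is the trade-off for not depending on mbstring.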

Post by **garvinhicking** » Fri Jan 14, 2011 2:37 pm

Hi!

(I don't have time/motivation for most of these issues, but:)

serotonic wrote:I'd really like to convince the developers of the metadesc plugin that this would return more matching and less error-prone results for all kinds of s9y users (for those who are using wysiwyg editors, mainly using paragraphs AND those who use everything html gives us in the matter of semantics and love to have individual formatting).

You can use individual meta content/properties instead of automatically detected ones, and this would eradicate your problems, wouldn't it? There you'd have your individual formatting?

(I don't use the plugin, so I'm simply assuming)

Regards,
Garvin
