metadesc: extract issues (p vs. div, classes & ids)
Hi,
I've experienced some odd behavior while using the automatic extract function of the metadesc plugin. (I've already reported the issues in German. Garvin thought it would be advantageous to discuss them in English, so I'll try.)
1. Description: Number of Characters
Under certain circumstances, the automatic extraction delivers more than the first 120 characters, as you can see here (more than 2,000 characters).
Timbalu suggested changing the extract_description function (lines 91 and 96) from <p> to <div>, which worked perfectly fine for me in many respects (see 3 and 4).
2. Description: Line Wrap
The meta description of the blog post about tasty, tasty Wirsinggemüse contains several line breaks. Not too bad, but not so pretty either. (I've seen this in other blog posts, too. Please let me know if you need more examples.)
3. Description: <p> vs. <div>
I don't understand why the extraction is restricted to an initial <p> tag, and I think that a <div> would do much better in this case. Don't get me wrong, I am using <p> tags and love semantic markup, but I can imagine a whole lot of blog posts containing not a single <p> tag. (Blog posts starting with, or even consisting of, lists, blockquotes and subheadlines, for example.)
So why not change the <p> to the <div>? It shouldn't do any harm to users who only use <p> tags, should it?
4. Description and Keywords: Classes and IDs
The automatic extraction skips all elements supplied with classes or ids.
Since Garvin enhanced the extract_keywords part, it works for some elements with classes, but for some it still doesn't. E.g. in my blog, keywords should be extracted from h3, h4 and cite, but as you can see in this example, the <h3>s are still being skipped.
extract_description doesn't skip these elements if <p> is changed to <div>.
I know that classes and ids are usually not used by s9y and most of its users, but I happen to use them very frequently and would appreciate your help a lot.
Regards,
serotonic
Last edited by serotonic on Tue Jan 11, 2011 5:34 pm, edited 1 time in total.
Re: metadesc: extract issues (p vs. div, classes & ids)
Just some small notes on our discussion, for the developers....
serotonic wrote: "Timbalu suggested to change the extract_description function (Lines 91, 96) from <p> to <div>, which worked perfectly fine for me in many aspects (see 3&4)."
Well, I (and all the others) suggested using a div instead of a p, but in the entry text itself, as a quick way around the first paragraph not being displayed in the meta output.
I don't think you need all this stuff in this function, since an automatically set meta description snipped from the entry text does not need any HTML tags. So, IMHO, it could simply look like this:
Code: Select all
function extract_description($text) {
    // Strip all HTML tags, then keep the first 120 bytes of the plain text.
    return substr(strip_tags($text), 0, 120);
}
Quote: "2. Description: Line Wrap"
The automatic description does not need any line breaks, tags or htmlspecialchars.
Quote: "3. Description: <p> vs. <div> / So why not change the <p> to the <div>?"
What for? No need in meta tags.
Quote: "4. Description and Keywords: Classes and IDs / The automatic meta extraction skips all elements supplied with classes or ids. Since Garvin enhanced the extract_keywords part, it works for some elements with classes, but for some it still doesn't. E.g. in my blog, keywords should be extracted from h3, h4 and cite, but as you can see in this example, the <h3>s are still being skipped."
I still don't see the real need for the regex, since it is used to find freetag "tags" in normal entry text, as far as I understand this plugin.
So, in the end, you just need to get rid of HTML <tags> with strip_tags() and find freetag "tags" in the pure text. That's all.
Having had a look at the source code of your example page, I would say all these entities in the description need to go too. Google does not need something like this:
<meta name="description" content="★★★★★★★☆☆☆" />
Regards,
Ian
Re: metadesc: extract issues (p vs. div, classes & ids)
Ian, just some small notes to your notes.
Timbalu wrote: "the automatic description does not need any linebreaks, tags or htmlspecialchars"
Sure it doesn't. But for now, there are line breaks.
Timbalu wrote: "what for? No need in Meta tags."
Oh, I'm not a coder, I don't know how to solve this in a neat way. I just didn't get the limitation to the <p> tag and thought it would be better to extract the first 120 characters of (text) content, no matter which tag is wrapped around it.
Timbalu wrote: "I still dont see the very need for the regex, while it is used to find freetag "tags" in normal entry text, as far as I understand this plugin."
This one's got nothing to do with the freetag plugin; it's not even installed on this installation. As far as I understand the metadesc plugin, it searches for tags like <b> or <strong> and uses their content as meta keywords.
Timbalu wrote: "Trying to have a look into the source code of your example page, I would say it is a need to get rid of all these entities in the description too. Google does not need something like this: <meta name="description" content="★★★★★★★☆☆☆" />"
It depends. First of all, if the headline wasn't skipped due to the classes issue, the description would look this way:
Code: Select all
<meta name="description" content="Private Practice – Staffel 3 ★★★★★★★☆☆☆">
Displaying my rating of the series in the search results definitely would make sense.
Regards,
serotonic
Re: metadesc: extract issues (p vs. div, classes & ids)
serotonic wrote: "Ian, just some small notes to your notes"
OK, I never used this before; I am/was just guessing after a very quick dive into the plugin's code. And my notes appeared here for Don and Judebert to think about.
serotonic wrote: "Sure it doesn't. But for now, there are linebreaks."
Yes, that's why you don't need htmlspecialchars and will need some sort of regex to get rid of \n.
serotonic wrote: "Oh, I'm not a coder, I don't know how to solve this in a neat way. I just didn't get the limitation to the <p> tag and thought it would be better to extract the first 120 characters of (text) content, no matter which tag is wrapped around it."
Well, the cut at 120 is built in, as far as I know.... (and this function does not need tags to do so....) As far as I understood Garvin, this function is only used when you do not set the meta desc manually.
serotonic wrote: "This one's got nothing to do with the freetag plugin, it's not even installed at this installation. As far as I understand the metadesc plugin, it searches tags like <b> or <strong> and uses their content as metakeywords."
O holy Sh.., back to start! I thought these were meant, sorry. In this case you really need the regex!
serotonic wrote: "It depends. First of all, if the headline wasn't skipped due to the classes-issue, the description would look this way: <meta name="description" content="Private Practice – Staffel 3 ★★★★★★★☆☆☆"> Displaying my rating of the series in the search results definitely would make sense."
Yes truly, but none of these exciting entities....
Cheers,
Ian
Re: metadesc: extract issues (p vs. div, classes & ids)
Timbalu wrote: "Yes truly, but none of these exciting entities...."
Hehe. These exciting entities seem to be the only way to display Unicode star characters in entries. Using the characters themselves only works for static pages, and Google at least interprets the black star, as you can see here.
So I still don't see an urgent need to get rid of them for the purpose of improving the metadesc plugin, although I'd love to use the characters instead of their entities (in the entry body AND metadesc).
Please tell me if we are talking past each other.
Regards,
serotonic
Re: metadesc: extract issues (p vs. div, classes & ids)
serotonic wrote: "Hehe These exiting entities seem to be the only way to display unicode star characters in entries. Using the characters itself only works for static pages, and google at least interprets the black star, as you can see here."
Yes, but Google, although able to read entities, does not use your meta desc to display these results.... (at least I think so...) and ... I once read this:
"To html-encode Unicode characters that may not be part of your document character set (given in the META tag of your page), and so can not be output directly into your document source, you need to use mb_encode_numericentity(). Pay attention to its conversion map argument."
And surely the opposite, mb_decode_numericentity.
serotonic wrote: "So I still don't see an urgent need to get rid of them in the matter of improving the metadesc plugin --although I'd love to use the character instead of its entities (in entrybody AND metadesc)."
I am able to do so in my local blog. ★ = ★ but Meta is a question of htmlspecialchars, I assume.
serotonic wrote: "Please tell me if we are talking past each other"
Are we?
Regards,
Ian
-
- Core Developer
- Posts: 30022
- Joined: Tue Sep 16, 2003 9:45 pm
- Location: Cologne, Germany
- Contact:
Re: metadesc: extract issues (p vs. div, classes & ids)
Hi!
You did see that I added a new config option to disable htmlspecialchars() into the plugin, yes?
Regards,
Garvin
# Garvin Hicking (s9y Developer)
# Did I help you? Consider making me happy: http://wishes.garv.in/
# or use my PayPal account "paypal {at} supergarv (dot) de"
# My "other" hobby: http://flickr.garv.in/
Re: metadesc: extract issues (p vs. div, classes & ids)
Hi Garvin,
yes, I did!
When it is set to "no", the output is:
Code: Select all
<meta name="description" content="★★★★★★★☆☆☆" />
And set to "yes" (default):
Code: Select all
<meta name="description" content="★★★★★★★☆☆☆" />
So using this option helps me to have accurate entities, but it won't help getting rid of them?
Timbalu wrote: "I am able to do so in my local blog. ★ = ★"
On my blog using ★ in entries leads to ?. Maybe I should post that to a new thread.
Regards,
serotonic
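A minimal sketch of why the two settings of that option can produce different output, assuming the stored entry text contains numeric entities such as &#9733; (an assumption; the thread does not show the raw database content). By default htmlspecialchars() escapes the ampersand of an entity that is already there, unless its $double_encode parameter is false. The variable names are illustrative and this is not the plugin's code:

```php
<?php
// Assumed raw entry text: two black stars and one white star as numeric entities.
$stored = '&#9733;&#9733;&#9734;';

// Default behaviour: the "&" of each existing entity is escaped again.
$escaped = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');

// With $double_encode = false, existing entities are left untouched.
$kept = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8', false);

echo $escaped, "\n"; // &amp;#9733;&amp;#9733;&amp;#9734;
echo $kept, "\n";    // &#9733;&#9733;&#9734;
```

Either way the entities survive into the meta tag; escaping alone does not turn them back into characters.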
Re: metadesc: extract issues (p vs. div, classes & ids)
Garvin, could you try with an mb_decode_numericentity in the no htmlspecialchars section?
Then they should still appear in meta desc, but as stars.
Ian
Re: metadesc: extract issues (p vs. div, classes & ids)
Hi!
mb* is not always available, which is why I'd like to avoid depending on it.
The entities should only be there because they are inside your database table, serotonic. It could be that your blog and the database tables have a mismatching charset; the tables and your blog should run in UTF-8 - of course only if your entered chars are also part of UTF-8? Many browsers encode entities on their own, so you might want to check if changing the browser to submit an entry might help. Also, if you're not doing that already, avoid WYSIWYG editors, those might also translate real characters to entities.
The goal for you/us would be to make sure that the characters will not get saved as entities, but proper UTF-8 characters. Are they maybe only UTF-16 characters? Or maybe only contained in latin1?
Regards,
Garvin
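A minimal sketch of how numeric entities could be decoded without depending on mbstring, assuming UTF-8 text. The helper name sketch_decode_entities is hypothetical and not part of the plugin; it uses mb_decode_numericentity() when the extension is loaded and falls back to the core function html_entity_decode() otherwise. The conversion map values are an illustrative choice:

```php
<?php
// Hypothetical helper: decode numeric entities, preferring mbstring when present.
function sketch_decode_entities(string $text): string
{
    if (function_exists('mb_decode_numericentity')) {
        // Conversion map: [first code point, last code point, offset, mask].
        $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
        return mb_decode_numericentity($text, $map, 'UTF-8');
    }
    // Core PHP fallback; unlike the branch above, this also decodes
    // named entities such as &amp;.
    return html_entity_decode($text, ENT_QUOTES, 'UTF-8');
}

echo sketch_decode_entities('&#9733;&#9733;&#9734;'), "\n"; // ★★☆
```

Decoding at output time only treats the symptom; storing real UTF-8 characters, as described above, avoids the need for it.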
Re: metadesc: extract issues (p vs. div, classes & ids)
Hi!
For the sake of completeness: Garvin was right, my database tables had a mismatching charset, which is now fixed. So no need to talk about entities anymore.
Meanwhile, this thread looks quite confusing, so I'll try to outline the problems I see with the automatic(!) features of this plugin again.
1. Description: Number of Characters
I know that a cut at 120 characters is built in, but as I mentioned before, the automatic extraction delivers more than the first 120 characters under certain circumstances. Example: more than 2,000 characters.
2. Description: Linebreaks
The automatic meta description contains line breaks, though the meta description doesn't need any, as Timbalu mentioned, too.
3. Description: Why only <p> tags?
I don't understand why the automatic extraction is restricted to an initial <p> tag. I can imagine a whole lot of blog posts containing not a single <p> tag. (Blog posts starting with, or even consisting of, lists, blockquotes and subheadlines, for example.)
4. Description and Keywords: Classes and IDs
The automatic extraction skips all elements supplied with classes or ids.
extract_description: Maybe it wouldn't skip these elements if there was no restriction to the <p> tag.
extract_keywords: Since Garvin enhanced this, it works for some elements with classes, but for some it still doesn't. E.g. in my blog, keywords should be extracted from h3, h4 and cite, but as you can see in this example, the <h3>s are still being skipped.
Hope this summary helps making things a bit clearer.
I appreciate your feedback and help a lot!
Regards,
serotonic
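Points 1 to 3 of the summary above could be sketched roughly as follows: take the text content regardless of the wrapping tag, collapse line breaks, and always cut at 120 characters. This is only a sketch under assumptions, not the plugin's extract_description(); the function name sketch_extract_description and the sample entry are made up for illustration, and UTF-8 input is assumed:

```php
<?php
// Hypothetical helper, not the plugin's actual extract_description().
function sketch_extract_description(string $html, int $limit = 120): string
{
    // Put a space before every tag so neighbouring blocks do not run
    // together, then drop the tags themselves (with or without class/id).
    $plain = strip_tags(str_replace('<', ' <', $html));
    // Collapse line breaks, tabs and repeated spaces into single spaces.
    $plain = trim(preg_replace('/\s+/u', ' ', $plain) ?? $plain);
    // Cut on character boundaries; the /u flag treats the text as UTF-8.
    if (preg_match('/^.{0,' . $limit . '}/us', $plain, $m) === 1) {
        return $m[0];
    }
    // Fallback for input that is not valid UTF-8: cut by bytes instead.
    return substr($plain, 0, $limit);
}

$entry = "<h3 id=\"PrivatePractice\">Private Practice</h3>\n<ul class=\"x\">\n<li>First item</li>\n</ul>";
echo sketch_extract_description($entry), "\n"; // Private Practice First item
```

Because the cut is applied unconditionally, the result cannot exceed the limit, whichever tags the entry starts with.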
Re: metadesc: extract issues (p vs. div, classes & ids)
Hi serotonic,
serotonic wrote: "1. Description: Number of Characters / 2. Description: Linebreaks"
As I tried to say before:
If you want the automatic description parsed from the entry text, the text is processed by the function extract_description($text).
This one looks for the first occurrence of <p> or </p> and returns the text cut at 120 characters if there isn't any starting or ending p tag. If there are any p tags, it takes the stripped text without the cut at 120 characters. Don't ask why!
What we need to put in there now should be something like this
Code: Select all
// "\n" needs double quotes; in single quotes it is a literal backslash plus n.
return substr(strip_tags(str_replace("\n",' ',$title)), 0, 120);
instead of the current
Code: Select all
$title = strip_tags($title);
return $title;
You could apply this to serendipity_event_metadesc.php, lines 102/103 of the latest revision, and give it a try.
Update Serendipity if you haven't done so already! It is possibly vulnerable!
Regards,
Ian
Edit:
serotonic wrote: "4. Description and Keywords: Classes and IDs"
Your h3 looks like
Code: Select all
<h3 id="PrivatePractice"><a href="http://www.imdb.com/title/tt0972412/">Private Practice</a> – Staffel 3</h3>
I know it was you who wrote the cite tags, but did you do it with <h3> and <a> too?
If so, the regex has to be something like
look for <
look for $tag[$i]
look for (*.?) anything with spaces and characters following until
look for >
which covers the <h3 id="PrivatePractice">
var and hold the rest (*.?)
look for </
look for $tag[$i]
look for >
then
$tag = strip tags($rest)
which would be close to this kind of regex change, but I am not good at coding regexes
Code: Select all
preg_match_all('/[<' . $tags[$i] . '[*.?][^>]*>]([^>]*)[<\/' . $tags[$i] . '>]/si', $text, strip_tags($match))) {
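The steps listed above could be sketched as a working pattern roughly like this. It is only an illustration, not the plugin's extract_keywords(): the function name sketch_extract_keywords is hypothetical, and the tag list is taken from the elements named in this thread. The pattern allows attributes such as id or class on the opening tag, and nested tags such as <a> are stripped from the captured content afterwards:

```php
<?php
// Hypothetical helper; the default tag list mirrors the elements named in the thread.
function sketch_extract_keywords(string $html, array $tags = ['h3', 'h4', 'cite']): array
{
    $keywords = [];
    foreach ($tags as $tag) {
        $t = preg_quote($tag, '/');
        // "<tag", optional attributes, ">", lazily captured content, "</tag>".
        $pattern = '/<' . $t . '\b[^>]*>(.*?)<\/' . $t . '\s*>/si';
        if (preg_match_all($pattern, $html, $matches)) {
            foreach ($matches[1] as $inner) {
                // Drop nested tags such as <a> and normalise whitespace.
                $word = strip_tags($inner);
                $word = trim(preg_replace('/\s+/u', ' ', $word) ?? $word);
                if ($word !== '') {
                    $keywords[] = $word;
                }
            }
        }
    }
    // Remove duplicates and reindex from zero.
    return array_values(array_unique($keywords));
}

$html = '<h3 id="PrivatePractice"><a href="http://www.imdb.com/title/tt0972412/">Private Practice</a> – Staffel 3</h3>';
print_r(sketch_extract_keywords($html)); // one keyword: Private Practice – Staffel 3
```

A regex like this still breaks on a tag nested inside a tag of the same name; an HTML parser would be the more robust choice where that matters.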
Re: metadesc: extract issues (p vs. div, classes & ids)
Hi Timbalu!
Quote: "1. Description: Number of Characters / 2. Description: Linebreaks"
You are right, editing lines 102/103 of serendipity_event_metadesc.php to
Code: Select all
// "\n" needs double quotes; in single quotes it is a literal backslash plus n.
return substr(strip_tags(str_replace("\n",' ',$title)), 0, 120);
fixes 1. Description: Number of Characters. Thanks a lot!
But on blog posts starting with an element using a class/id, there is still a line break (Wirsinggemüse). So apparently it does not fix the entire 2. Description: Linebreaks thingy.
Quote: "4. Description and Keywords: Classes and IDs"
For the keyword part of this, I simply wasn't aware that tags with nested tags won't work. I expected the plugin to ignore the <a> tag and return "Private Practice – Staffel 3" as one keyword.
I've tried the regex change you posted on line 112 (111 according to the change on lines 102/103), but it returns no keywords at all.
Quote: "3. Description: Why only <p> tags?"
After all, all of the automatic description issues trace back to the fact that it relies on the first <p> tag without a class or id. I still think that it would be the much better choice to use the first 120 characters of text content, no matter which tag is wrapped around it. It covers so many more entry scenarios!
I'd really like to convince the developers of the metadesc plugin that this would return better matching and less error-prone results for all kinds of s9y users (for those who use WYSIWYG editors and mainly paragraphs AND those who use everything HTML gives us in terms of semantics and love to have individual formatting).
Regards,
serotonic
Re: metadesc: extract issues (p vs. div, classes & ids)
Sounds good - we could try to solve number 2 with
Code: Select all
// "\n" needs double quotes; in single quotes it is a literal backslash plus n.
return substr(strip_tags(str_replace("\n",' ',trim($title))), 0, 120);
To point 3: yes, full ACK!
Number 4 regex will get improved, just wait a minute....
Ian
Re: metadesc: extract issues (p vs. div, classes & ids)
Hi!
(I don't have time/motivation for most of these issues, but:)
serotonic wrote: "I'd really like to convince the developers of the metadesc plugin that this would return more matching and less error-prone results for all kind of s9y users (for those who are using wysiwyg editors, mainly using paragraphs AND those who use everything html gives us in the matter of semantics and love to have individual formatting)."
You can use individual meta content/properties instead of automatically detected ones, and this would eradicate your problems, wouldn't it? There you'd have your individual formatting?
(I don't use the plugin, so I'm simply assuming)
Regards,
Garvin