Regular Expression help required

sonichouse
Regular
Posts: 196
Joined: Sun May 11, 2008 2:53 am

Post by sonichouse » Mon May 12, 2008 11:26 pm

I am trying to debug the Amazon plugin, but have hit a wall.

The HTML returned from Amazon contains the following:
<td id="prodImageCell" height="280" width="280"><a href="http://www.amazon.co.uk/gp/product/images/B000Z2ZQ7U/ref=dp_image_0?ie=UTF8&n=300703&s=videogames" target="AmazonHelp" onclick="return amz_js_PopWin(this.href,'AmazonHelp','width=700,height=600,resizable=1,scrollbars=1,toolbar=1,status=1');" ><img src="http://ecx.images-amazon.com/images/I/51m6ypIN3tL._SL500_AA280_.jpg" id="prodImage" width="280" height="280" border="0" alt="Winter Sports (Wii)" /></a></td>


How do I get the image src from the entry that contains the ASIN B000Z2ZQ7U?

Thanks.

Don Chambers
Regular
Posts: 3638
Joined: Mon Feb 13, 2006 2:40 am
Location: Chicago, IL, USA

Post by Don Chambers » Mon May 12, 2008 11:56 pm

I do not personally understand the problem. What are you trying to accomplish, and why is it a problem that the image path contains "B000Z2ZQ7U"??????

sonichouse
Regular
Posts: 196
Joined: Sun May 11, 2008 2:53 am

Post by sonichouse » Tue May 13, 2008 12:48 am

Don Chambers wrote:I do not personally understand the problem. What are you trying to accomplish, and why is it a problem that the image path contains "B000Z2ZQ7U"??????
The old method relied on the img src containing the ASIN, e.g. images/blah/ASIN.jpg.

So the plugin just searched for the img src containing the ASIN using:

Code:

            if (preg_match('@<\s*img.*?src=\s*([\'"])(http://[^/]*images(|-de\.|-jp\.|-eu\.|\-)amazon\.com/images/.+/'.$asin.'.+?\.(png|jpg|gif))\1\s+@i', $content, $matches)) 


However, the img src tag does not have the ASIN contained in it any more, so we need to find the container with the ASIN reference, and fetch the obfuscated image url.

Does that help explain it any better ?

garvinhicking
Core Developer
Posts: 30020
Joined: Tue Sep 16, 2003 9:45 pm
Location: Cologne, Germany

Re: Regular Expression help required

Post by garvinhicking » Tue May 13, 2008 8:51 am

Hi!

Phew. That is one wicked RegExp. I worked on this a few minutes, but creating a regexp that returns the same matches with a different regexp is very hard. I'll see if Judebert has a clue :)

Regards,
Garvin
# Garvin Hicking (s9y Developer)
# Did I help you? Consider making me happy: http://wishes.garv.in/
# or use my PayPal account "paypal {at} supergarv (dot) de"
# My "other" hobby: http://flickr.garv.in/

sonichouse
Regular
Posts: 196
Joined: Sun May 11, 2008 2:53 am

Re: Regular Expression help required

Post by sonichouse » Tue May 13, 2008 10:35 am

garvinhicking wrote:Hi!

Phew. That is one wicked RegExp. I worked on this a few minutes, but creating a regexp that returns the same matches with a different regexp is very hard. I'll see if Judebert has a clue :)

Regards,
Garvin
Hi,

thanks, you can see why I struggled :wink:

judebert
Regular
Posts: 2478
Joined: Sat Oct 15, 2005 6:57 am
Location: Orlando, FL

Post by judebert » Tue May 13, 2008 3:47 pm

Okay, let's see here... *cracks knuckles*

Holy carp!

So, knowing the href contains the .../{ASIN}/..., you want to pull the src="{weird stuff}" out of that HTML? And we're scraping the entire Amazon product page to get that HTML?

Don't they provide tools for Amazon affiliates that make this much easier? Sheesh.

I think what we need here is:

Code:

            if (preg_match('@id=([\'"])prodImageCell\1.*images/'.$asin.'/.*<\s*img.*?src=\s*([\'"])(http://[^/]*images(|-de\.|-jp\.|-eu\.|\-)amazon\.com/images/.+/.+?\.(png|jpg|gif))\2@i', $content, $matches))


That should find the element with the ID "prodImageCell", skip over everything to the URL with the ASIN (just to confirm it's there), skip some more to the <img> tag, and pull everything from the src= attribute. You'll have to change the $matches[2] to $matches[3] for the $image_url, and the $file_type will likewise need to be $matches[5] instead of $matches[4].

This contains some extra stuff to ensure we're not being sidetracked: we don't really *need* to check for the prodImageCell ID, and we don't need to ensure the src= occurs in an <img> tag. It could be made more specific, if necessary: we could make sure the id= were in a <td> tag, and even check for the id="prodImage" attribute after the src.
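
As a rough illustration of the group numbering described above, here is an untested-in-the-plugin sketch run against the HTML from the first post (the href is trimmed for readability). It assumes the quote character after id= is wrapped in its own capture group so that \1 has something to refer to; that is what shifts the URL to $matches[3] and the file type to $matches[5]. The sample string and variable values are only for demonstration.

```php
<?php
// Sample markup based on the snippet in the first post (href shortened).
$content = '<td id="prodImageCell" height="280" width="280">'
    . '<a href="http://www.amazon.co.uk/gp/product/images/B000Z2ZQ7U/ref=dp_image_0" target="AmazonHelp">'
    . '<img src="http://ecx.images-amazon.com/images/I/51m6ypIN3tL._SL500_AA280_.jpg" id="prodImage" width="280" height="280" border="0" alt="Winter Sports (Wii)" /></a></td>';
$asin = 'B000Z2ZQ7U';

// Group 1: quote around the id value (assumed capture group, see lead-in)
// Group 2: quote around the src value
// Group 3: the image URL
// Group 4: the optional host variant (-de., -jp., -eu., or a plain hyphen)
// Group 5: the file extension
$pattern = '@id=([\'"])prodImageCell\1.*images/' . $asin . '/.*<\s*img.*?src=\s*([\'"])'
    . '(http://[^/]*images(|-de\.|-jp\.|-eu\.|\-)amazon\.com/images/.+/.+?\.(png|jpg|gif))\2@i';

$image_url = null;
$file_type = null;
if (preg_match($pattern, $content, $matches)) {
    $image_url = $matches[3];
    $file_type = strtolower($matches[5]);
}

echo "image_url = $image_url\n";
echo "file_type = $file_type\n";
```

Note that without the s modifier the .* parts will not cross line breaks, so this only works while the whole cell sits on one line, as it does in the sample.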

I say "think" because I haven't tested it myself. Let me know if there are any problems.
Judebert

sonichouse
Regular
Posts: 196
Joined: Sun May 11, 2008 2:53 am

Post by sonichouse » Tue May 13, 2008 9:43 pm

Thanks for that....

I couldn't get it to work, and to be honest I do not understand the syntax fully.

I was able to write a grep regex that I think is close to my requirement as

Code:

grep -E "id=\"prodImageCell\".*img src=.*\.(png|jpg|gif).*id=\"prodImage\""

I tried to translate to use the php syntax but failed miserably.

judebert
Regular
Posts: 2478
Joined: Sat Oct 15, 2005 6:57 am
Location: Orlando, FL

Post by judebert » Tue May 13, 2008 11:09 pm

Well, that doesn't account for changing quotes, or optional spaces, or any of the other stuff the original does, but... it can be translated to PCRE.

In fact, this is what it would look like:

Code:

preg_match('@id="prodImageCell".*img src="(.*\.(png|jpg|gif))".*id="prodImage"@', $content, $matches);


There are some malformed URLs that could cause problems with that regexp, but we don't expect any of that from Amazon. It also doesn't account for the ASIN at all. But it should get the job done.

The file_type is now $matches[2] (the second set of parens), and the image_url is now $matches[1] (the first set of parens).
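
To make the numbering concrete, here is a small sketch that runs the simplified pattern against the HTML from the first post (href shortened). The sample string is only for demonstration; groups are counted by their opening parenthesis, so the outer set is 1 and the nested set is 2.

```php
<?php
// Sample markup based on the snippet in the first post (href shortened).
$content = '<td id="prodImageCell" height="280" width="280">'
    . '<a href="http://www.amazon.co.uk/gp/product/images/B000Z2ZQ7U/ref=dp_image_0">'
    . '<img src="http://ecx.images-amazon.com/images/I/51m6ypIN3tL._SL500_AA280_.jpg" id="prodImage" width="280" height="280" border="0" alt="Winter Sports (Wii)" /></a></td>';

$pattern = '@id="prodImageCell".*img src="(.*\.(png|jpg|gif))".*id="prodImage"@';

$image_url = null;
$file_type = null;
if (preg_match($pattern, $content, $matches)) {
    // $matches[0] is the whole match,
    // $matches[1] is the outer parens (the full URL),
    // $matches[2] is the inner parens (the extension).
    $image_url = $matches[1];
    $file_type = $matches[2];
}

echo "image_url = $image_url\n";
echo "file_type = $file_type\n";
```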
Judebert

sonichouse
Regular
Posts: 196
Joined: Sun May 11, 2008 2:53 am

Post by sonichouse » Tue May 13, 2008 11:24 pm

judebert wrote:Well, that doesn't account for changing quotes, or optional spaces, or any of the other stuff the original does, but... it can be translated to PCRE.

In fact, this is what it would look like:

Code:

preg_match('@id="prodImageCell".*img src="(.*\.(png|jpg|gif))".*id="prodImage"@', $content, $matches);


There are some malformed URLs that could cause problems with that regexp, but we don't expect any of that from Amazon. It also doesn't account for the ASIN at all. But it should get the job done.

The file_type is now $matches[2] (the second set of parens), and the image_url is now $matches[1] (the first set of parens).
Cheers, I will throw the ASIN into that to verify we have the right image.

Thanks for the mini tutorial, I will get there eventually :lol:

sonichouse
Regular
Posts: 196
Joined: Sun May 11, 2008 2:53 am

Post by sonichouse » Wed May 14, 2008 12:36 am

judebert wrote:There are some malformed URLs that could cause problems with that regexp, but we don't expect any of that from Amazon. It also doesn't account for the ASIN at all. But it should get the job done.
With many thanks to judebert, I have hacked a temporary solution that works for me on amazon.co.uk.

Code:

if (preg_match('@id="prodImageCell".*images/'.$asin.'.*img src="(.*\.(png|jpg|gif))".*id="prodImage"@', $content, $matches)) {
    echo "found image url = $matches[1]<br />\n";
    $image_url = $matches[1]; // was $matches[2]
    $file_type = strtolower($matches[2]); // was $matches[4]
} else {
    echo "could not find image url.\n";
}


I will leave the clever stuff to those that know what they are doing, but at the moment it is working for me.
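
For anyone who wants to tighten this up later, here is one possible variation, offered as a sketch rather than a tested plugin change. The function name find_product_image is made up for illustration. The assumptions are: the ASIN is passed through preg_quote() so it is always treated literally, the src value is matched with [^"]+ so it cannot run past its closing quote, and the s and i modifiers are added so the match survives line breaks and upper-case tags.

```php
<?php
// Hypothetical helper, not part of the plugin: returns [url, extension] or null.
function find_product_image(string $content, string $asin): ?array
{
    $pattern = '@id="prodImageCell".*?images/' . preg_quote($asin, '@')
        . '.*?<img\s+src="([^"]+\.(png|jpg|gif))".*?id="prodImage"@is';

    if (preg_match($pattern, $content, $matches)) {
        return [$matches[1], strtolower($matches[2])];
    }
    return null;
}

// Sample markup based on the first post, split across lines on purpose.
$content = "<td id=\"prodImageCell\" height=\"280\" width=\"280\">\n"
    . "<a href=\"http://www.amazon.co.uk/gp/product/images/B000Z2ZQ7U/ref=dp_image_0\">\n"
    . "<img src=\"http://ecx.images-amazon.com/images/I/51m6ypIN3tL._SL500_AA280_.jpg\" id=\"prodImage\" /></a></td>";

$found = find_product_image($content, 'B000Z2ZQ7U');
$missing = find_product_image($content, 'B000000000');

echo $found === null ? "could not find image url.\n" : "found image url = {$found[0]}\n";
echo $missing === null ? "no match for the other ASIN\n" : "unexpected match\n";
```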

einliterbier
Posts: 2
Joined: Fri May 23, 2008 10:14 pm

...doesn't work... :-(

Post by einliterbier » Fri May 23, 2008 10:17 pm

Hey guys,

I just copied and pasted the code in place of the old one, but got no result.
How can I get this to work?

cheers

einliterbier
Posts: 2
Joined: Fri May 23, 2008 10:14 pm

:-)

Post by einliterbier » Fri May 23, 2008 10:20 pm

It does work, thanks.
I just deleted the wrong images and reloaded.
Is there a way to resize the images?

sonichouse
Regular
Posts: 196
Joined: Sun May 11, 2008 2:53 am

Re: :-)

Post by sonichouse » Fri May 23, 2008 11:22 pm

einliterbier wrote:It does work, thanks.
I just deleted the wrong images and reloaded.
Is there a way to resize the images?
When the plugin writes the <img> tag, you could force the display size, e.g. height="120" width="120".
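
A minimal sketch of that idea, assuming you build the tag yourself; build_img_tag is a made-up helper name, not a function from the plugin. It only changes the display size in the browser, the downloaded file stays the same.

```php
<?php
// Hypothetical helper, not part of the plugin: builds an <img> tag with a forced square display size.
function build_img_tag(string $url, string $alt, int $size): string
{
    return sprintf(
        '<img src="%s" width="%d" height="%d" alt="%s" />',
        htmlspecialchars($url, ENT_QUOTES, 'UTF-8'),
        $size,
        $size,
        htmlspecialchars($alt, ENT_QUOTES, 'UTF-8')
    );
}

$tag = build_img_tag(
    'http://ecx.images-amazon.com/images/I/51m6ypIN3tL._SL500_AA280_.jpg',
    'Winter Sports (Wii)',
    120
);
echo $tag, "\n";
```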
