
Regular Expression help required

Posted: Mon May 12, 2008 11:26 pm
by sonichouse
I am trying to debug the Amazon plugin, but have hit a wall.

The HTML returned from Amazon contains the following:
<td id="prodImageCell" height="280" width="280"><a href="http://www.amazon.co.uk/gp/product/imag ... videogames" target="AmazonHelp" onclick="return amz_js_PopWin(this.href,'AmazonHelp','width=700,height=600,resizable=1,scrollbars=1,toolbar=1,status=1');" ><img src="http://ecx.images-amazon.com/images/I/5 ... AA280_.jpg" id="prodImage" width="280" height="280" border="0" alt="Winter Sports (Wii)" /></a></td>
How do I get the image src when the entry contains the ASIN B000Z2ZQ7U?

Thanks.

Posted: Mon May 12, 2008 11:56 pm
by Don Chambers
I do not personally understand the problem. What are you trying to accomplish, and why is it a problem that the image path contains "B000Z2ZQ7U"??????

Posted: Tue May 13, 2008 12:48 am
by sonichouse
Don Chambers wrote:I do not personally understand the problem. What are you trying to accomplish, and why is it a problem that the image path contains "B000Z2ZQ7U"??????
The old method relied on the img src containing the ASIN, e.g. images/blah/ASIN.jpg.

So the plugin just searched for an img src containing the ASIN, using:

Code: Select all

            if (preg_match('@<\s*img.*?src=\s*([\'"])(http://[^/]*images(|-de\.|-jp\.|-eu\.|\-)amazon\.com/images/.+/'.$asin.'.+?\.(png|jpg|gif))\1\s+@i', $content, $matches)) 
However, the img src tag does not have the ASIN contained in it any more, so we need to find the container with the ASIN reference, and fetch the obfuscated image url.

Does that help explain it any better?

Re: Regular Expression help required

Posted: Tue May 13, 2008 8:51 am
by garvinhicking
Hi!

Phew. That is one wicked RegExp. I worked on this for a few minutes, but writing a different regexp that returns the same matches is very hard. I'll see if Judebert has a clue :)

Regards,
Garvin

Re: Regular Expression help required

Posted: Tue May 13, 2008 10:35 am
by sonichouse
Hi,

thanks, you can see why I struggled :wink:

Posted: Tue May 13, 2008 3:47 pm
by judebert
Okay, let's see here... *cracks knuckles*

Holy carp!

So, knowing the href contains the .../{ASIN}/..., you want to pull the src="{weird stuff}" out of that HTML? And we're scraping the entire Amazon product page to get that HTML?

Don't they provide tools for Amazon affiliates that make this much easier? Sheesh.

I think what we need here is:

Code: Select all

            if (preg_match('@id=([\'"])prodImageCell\1.*images/'.$asin.'/.*<\s*img.*?src=\s*([\'"])(http://[^/]*images(|-de\.|-jp\.|-eu\.|\-)amazon\.com/images/.+/.+?\.(png|jpg|gif))\2@i', $content, $matches))
That should find the element with the ID "prodImageCell", skip over everything to the URL with the ASIN (just to confirm it's there), skip some more to the <img> tag, and pull everything from the src= attribute. You'll have to change the $matches[2] to $matches[3] for the $image_url, and the $file_type will likewise need to be $matches[5] instead of $matches[4].

This contains some extra stuff to ensure we're not being sidetracked: we don't really *need* to check for the prodImageCell ID, and we don't need to ensure the src= occurs in an <img> tag. It could be made more specific, if necessary: we could make sure the id= were in a <td> tag, and even check for the id="prodImage" attribute after the src.
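
To make the group numbering concrete, here is a minimal sketch of how a pattern of that shape numbers its capture groups. It assumes the quote character in front of prodImageCell is captured as group 1 (so that \1 has something to refer to), and it runs against made-up sample markup: the href and src URLs are placeholders, not the truncated ones from the first post. The preg_quote() call is an added precaution and is not taken from the plugin.

```php
<?php
// Hypothetical sample markup: both URLs are made-up placeholders.
$asin = 'B000Z2ZQ7U';
$content = '<td id="prodImageCell" height="280" width="280">'
    . '<a href="http://www.amazon.co.uk/gp/product/images/B000Z2ZQ7U/ref=example" target="AmazonHelp">'
    . '<img src="http://ecx.images-amazon.com/images/I/EXAMPLE._AA280_.jpg" id="prodImage"'
    . ' width="280" height="280" border="0" alt="Winter Sports (Wii)" />'
    . '</a></td>';

// Group 1: quote around the id value, group 2: quote around the src value,
// group 3: image URL, group 4: host suffix, group 5: file extension.
$pattern = '@id=([\'"])prodImageCell\1.*images/' . preg_quote($asin, '@') . '/'
    . '.*<\s*img.*?src=\s*([\'"])'
    . '(http://[^/]*images(|-de\.|-jp\.|-eu\.|\-)amazon\.com/images/.+/.+?\.(png|jpg|gif))\2@i';

if (preg_match($pattern, $content, $matches)) {
    echo "image_url = {$matches[3]}\n";
    echo "file_type = {$matches[5]}\n";
} else {
    echo "no match\n";
}
```

Note that "." does not match a newline by default, so this only works while the whole cell sits on one line, as it does in the HTML quoted in the first post.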

I say "think" because I haven't tested it myself. Let me know if there are any problems.

Posted: Tue May 13, 2008 9:43 pm
by sonichouse
Thanks for that....

I couldn't get it to work, and to be honest I do not understand the syntax fully.

I was able to write a grep regex that I think is close to what I need:

Code: Select all

grep -E "id=\"prodImageCell\".*img src=.*\.(png|jpg|gif).*id=\"prodImage\""
I tried to translate it to PHP syntax but failed miserably.

Posted: Tue May 13, 2008 11:09 pm
by judebert
Well, that doesn't account for changing quotes, or optional spaces, or any of the other stuff the original does, but... it can be translated to PCRE.

In fact, this is what it would look like:

Code: Select all

preg_match('@id="prodImageCell".*img src="(.*\.(png|jpg|gif))".*id="prodImage"@', $content, $matches);
There are some malformed URLs that could cause problems with that regexp, but we don't expect any of that from Amazon. It also doesn't account for the ASIN at all. But it should get the job done.

The file_type is now $matches[2] (the second set of parens), and the image_url is now $matches[1] (the first set of parens).
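
As an illustration of one way the greedy ".*" inside the src capture can go wrong, here is a small sketch. The markup is made up (the real Amazon page may never contain such an attribute), and the [^"]* variant is only a suggestion, not something the plugin uses.

```php
<?php
// Made-up markup: an extra attribute ending in an image extension sits
// between src and id.
$content = '<td id="prodImageCell"><img src="http://img.example/a.jpg" '
    . 'title="b.png" id="prodImage" /></td>';

// Pattern from the post above: ".*" is greedy and can cross attribute borders.
$greedy = '@id="prodImageCell".*img src="(.*\.(png|jpg|gif))".*id="prodImage"@';

// Variant: [^"]* cannot run past the closing quote of the src attribute.
$bounded = '@id="prodImageCell".*img src="([^"]*\.(png|jpg|gif))".*id="prodImage"@';

preg_match($greedy, $content, $g);
preg_match($bounded, $content, $b);

echo "greedy:  {$g[1]} ({$g[2]})\n";
echo "bounded: {$b[1]} ({$b[2]})\n";
```

With this sample the greedy capture swallows the title attribute as well, while the bounded one stops at the end of the src value. On the single-attribute markup from the first post, both patterns should capture the same URL.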

Posted: Tue May 13, 2008 11:24 pm
by sonichouse
Cheers, I will throw the ASIN into that to verify we have the right image.

Thanks for the mini tutorial, I will get there eventually :lol:

Posted: Wed May 14, 2008 12:36 am
by sonichouse
judebert wrote:There are some malformed URLs that could cause problems with that regexp, but we don't expect any of that from Amazon. It also doesn't account for the ASIN at all. But it should get the job done.
With many thanks to judebert, I have hacked a temporary solution that works for me on amazon.co.uk.

Code: Select all

if (preg_match('@id="prodImageCell".*images/'.$asin.'.*img src="(.*\.(png|jpg|gif))".*id="prodImage"@', $content, $matches)) {
    echo "found image url = $matches[1]<br />\n";
    $image_url = $matches[1]; // was 2
    $file_type = strtolower($matches[2]); // was 4
} else {
    echo "could not find image url.\n";
}
I will leave the clever stuff to those that know what they are doing, but at the moment it is working for me.
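
For anyone who wants to reuse the idea outside the plugin, the same check could be wrapped in a small function. This is only a sketch: find_product_image() is a hypothetical name, the sample URLs are placeholders, and the preg_quote() call, the [^"]* capture and the "i" flag are additions that are not in the code above.

```php
<?php
/**
 * Hypothetical helper (not part of the plugin).
 * Returns [image_url, file_type], or null when the cell or the ASIN is missing.
 */
function find_product_image(string $content, string $asin): ?array
{
    $pattern = '@id="prodImageCell".*images/' . preg_quote($asin, '@')
        . '.*img src="([^"]*\.(png|jpg|gif))".*id="prodImage"@i';

    if (preg_match($pattern, $content, $matches)) {
        return [$matches[1], strtolower($matches[2])];
    }
    return null;
}

// Made-up sample markup; the URLs are placeholders.
$html = '<td id="prodImageCell"><a href="http://www.amazon.co.uk/gp/product/images/B000Z2ZQ7U/ref=example">'
    . '<img src="http://ecx.images-amazon.com/images/I/EXAMPLE._AA280_.jpg" id="prodImage" /></a></td>';

$found   = find_product_image($html, 'B000Z2ZQ7U'); // ASIN present in the href
$missing = find_product_image($html, 'B000000000'); // ASIN not on the page

var_dump($found, $missing);
```

Returning null instead of echoing lets the caller decide what to do when the ASIN is not found in the image cell.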

...doesn't work... :-(

Posted: Fri May 23, 2008 10:17 pm
by einliterbier
Hey guys,

I just copy/pasted the code in place of the old one, without any result.
How can I get this to work?

cheers

:-)

Posted: Fri May 23, 2008 10:20 pm
by einliterbier
It does work. Thanks.
I just deleted the wrong images and reloaded.
Is there a way to resize the images?

Re: :-)

Posted: Fri May 23, 2008 11:22 pm
by sonichouse
einliterbier wrote:it does work. thx
i just deleted the false images and reloaded.
is there a way to resize the images?
When the plugin writes the <img> tag, you could force the display size, e.g. height="120" width="120".
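
A minimal sketch of that idea, assuming a hypothetical helper called render_cover_image() (the plugin's real output code is not shown in this thread, and the URL below is a placeholder):

```php
<?php
// Hypothetical helper: builds an <img> tag with a forced display size.
function render_cover_image(string $url, string $alt, int $size = 120): string
{
    return sprintf(
        '<img src="%s" alt="%s" width="%d" height="%d" />',
        htmlspecialchars($url, ENT_QUOTES),
        htmlspecialchars($alt, ENT_QUOTES),
        $size,
        $size
    );
}

$tag = render_cover_image('http://img.example/cover.jpg', 'Winter Sports (Wii)');
echo $tag, "\n";
```

This only changes how large the browser draws the image; the file that gets downloaded stays the same size.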