Regular expression help

Post by jdawg1989 » Wed Aug 21, 2013 11:15 am

I am trying to extract the href from the following code:

Code:
<a id="leader-1029391503" class="description" title="White iPhone 4 like new" href="http://www.gumtree.com/p/for-sale/white-iphone-4-like-new/1029391503">


Here the id and title values can be anything.
So far I've tried:

Code:
link = re.compile('<a id="\d"\sclass="description"\stitle=".*?"\shref="(.*?)">')
link = re.findall(link,htmltext)
print link
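
One likely reason the attempt above returns an empty list: `\d` matches exactly one digit, while the id in the sample is `leader-1029391503`. The following is a minimal sketch in Python 3 syntax (the thread's own code is Python 2), assuming the attributes always appear in the order id, class, title, href as in the sample. The names `original` and `relaxed` are illustrative only.

```python
import re

html = ('<a id="leader-1029391503" class="description" '
        'title="White iPhone 4 like new" '
        'href="http://www.gumtree.com/p/for-sale/white-iphone-4-like-new/1029391503">')

# The pattern from the post: id="\d" only accepts a single digit as the whole id.
original = re.compile(r'<a id="\d"\sclass="description"\stitle=".*?"\shref="(.*?)">')

# Relaxed variant: [^"]* accepts any run of characters up to the closing quote.
relaxed = re.compile(
    r'<a id="[^"]*"\sclass="description"\stitle="[^"]*"\shref="([^"]*)">'
)

original_links = original.findall(html)
relaxed_links = relaxed.findall(html)
print(original_links)
print(relaxed_links)
```

This still breaks if the site reorders attributes or adds new ones, which is the usual argument for an HTML parser over a regular expression.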


I am trying to scrape Gumtree. I have managed to scrape everything else I need; the full code is below:
Code:
import urllib
import re

searchwords = ["apple"] #Enter words you want to search Gumtree for
searchlocation = ["cardiff"] #Enter locations where you want to search for your words
searchcategory = ["all"]

i1=0 #location
i2=0 #words
i3=1 #page number

while i1<len(searchlocation):
    while i2<len(searchwords):
        while i3 > 0:
            url = "http://www.gumtree.com/" +searchcategory[0] +"/" +searchlocation[i1] +"/" +searchwords[i2] +"/page" +str(i3) +"?price=over_0"
            htmlfile = urllib.urlopen(url)
            htmltext = htmlfile.read()
            title = re.compile('<span class="ad-title-text" itemprop="itemListElement">(.+?)</span>')
            title = re.findall(title,htmltext)
            if len(title) ==0:
                i3 = 0
                break
            else:
                price = re.compile('<span class="price">(.+?)</span>')
                price = re.findall(price,htmltext)
               
                des = re.compile('<div class="ad-description"><span>(.+?)</span></div>')
                des = re.findall(des,htmltext)
               
                link = re.compile('<a id="\d"\sclass="description"\stitle=".*?"\shref="(.*?)">')
                link = re.findall(link,htmltext)
                print link
             
                # note: starts at index 1, so the first result (index 0) is not printed
                for i in range(1, len(title)):
                    print title[i] +" (" +searchlocation[i1] +")"
                    print des[i]
                    print link[i]
                    print price[i]
                    print "\n"
            i3+=1
        i3 = 1 # reset the page counter so the next search word starts from page 1
        i2+=1
    i2=0
    i1+=1

print "end"
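
The nested counter loops above can also be written with `for` loops and `itertools.count`, which avoids tracking `i1`, `i2`, and `i3` by hand. This is only a sketch in Python 3 syntax, not the poster's code: `build_url`, `scrape_titles`, and `fake_fetch` are hypothetical names, and the page download is passed in as a function so the example runs without network access.

```python
import itertools
import re

TITLE_RE = re.compile(
    r'<span class="ad-title-text" itemprop="itemListElement">(.+?)</span>'
)


def build_url(category, location, word, page):
    """Build a search URL in the same shape as the script above."""
    return ("http://www.gumtree.com/" + category + "/" + location + "/" + word
            + "/page" + str(page) + "?price=over_0")


def scrape_titles(fetch, categories, locations, words):
    """Yield (location, word, page, title) until a page has no titles.

    `fetch` is a hypothetical callable: it takes a URL and returns HTML text.
    """
    for location in locations:
        for word in words:
            for page in itertools.count(1):  # restarts at 1 for each word
                html = fetch(build_url(categories[0], location, word, page))
                titles = TITLE_RE.findall(html)
                if not titles:
                    break  # no results on this page: move to the next word
                for title in titles:
                    yield location, word, page, title


def fake_fetch(url):
    """Stand-in for a real HTTP request: only page 1 has a result."""
    if "/page1?" in url:
        return ('<span class="ad-title-text" itemprop="itemListElement">'
                'White iPhone 4 like new</span>')
    return ""


results = list(scrape_titles(fake_fetch, ["all"], ["cardiff"], ["apple", "iphone"]))
print(results)
```

In real use, `fetch` would wrap whatever HTTP call is available, for example `urllib.request.urlopen` in Python 3.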

Re: Regular expression help

Post by micseydel » Wed Aug 21, 2013 11:32 am

The standard answer to a question like this is that regular expressions are not the right tool for this task. I use the third-party lxml module for my XML/HTML parsing.
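
To illustrate the parser-based approach without extra dependencies, here is a rough sketch that uses Python 3's standard-library `html.parser` rather than lxml (a deliberate substitution; lxml would be used differently). The class name `DescriptionLinkParser` and the sample HTML are made up for the example.

```python
from html.parser import HTMLParser  # Python 3 standard library


class DescriptionLinkParser(HTMLParser):
    """Collect href values of <a> tags whose class list includes 'description'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # Attribute values can be None for valueless attributes.
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if "description" in classes and href:
            self.links.append(href)


sample_html = (
    '<a id="leader-1029391503" class="description" '
    'title="White iPhone 4 like new" '
    'href="http://www.gumtree.com/p/for-sale/white-iphone-4-like-new/1029391503">'
    'White iPhone 4</a>'
    '<a class="other" href="http://www.gumtree.com/help">Help</a>'
)

parser = DescriptionLinkParser()
parser.feed(sample_html)
parser.close()
links = parser.links
print(links)
```

Because the parser works on attributes by name, the order of id, class, title, and href no longer matters.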

Re: Regular expression help

Post by jdawg1989 » Thu Aug 22, 2013 2:53 pm

Thanks for your suggestion, micseydel.

I have solved my issue; I used the BeautifulSoup library instead.

Code:
url = listing.find("a", class_="description").get("href")
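
The post does not show where `listing` comes from. As a hedged sketch, assuming each ad sits in an element such as `<li class="listing">` (a made-up structure, not confirmed by the thread) and that the third-party `beautifulsoup4` package is installed, the line above might be used like this in Python 3:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical page fragment; the real Gumtree markup may differ.
sample_html = """
<ul>
  <li class="listing">
    <a id="leader-1029391503" class="description" title="White iPhone 4 like new"
       href="http://www.gumtree.com/p/for-sale/white-iphone-4-like-new/1029391503">White iPhone 4</a>
  </li>
  <li class="listing">
    <span>No link in this one</span>
  </li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

urls = []
for listing in soup.find_all("li", class_="listing"):
    anchor = listing.find("a", class_="description")
    if anchor is not None:  # find() returns None when nothing matches
        urls.append(anchor.get("href"))
print(urls)
```

Checking for `None` before calling `.get("href")` avoids an `AttributeError` on listings that have no matching link.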