Regular expression help


Post by jdawg1989 » Wed Aug 21, 2013 11:15 am

I am trying to extract the href from the following HTML:

Code: Select all
<a id="leader-1029391503" class="description" title="White iPhone 4 like new" href="http://www.gumtree.com/p/for-sale/white-iphone-4-like-new/1029391503">


The values of the id and title attributes can be anything.
So far I've tried:

Code: Select all
link = re.compile('<a id="\d"\sclass="description"\stitle=".*?"\shref="(.*?)">')
link = re.findall(link,htmltext)
print link
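
One likely reason the pattern above returns an empty list: `\d` matches exactly one digit, while the id in the sample is `leader-1029391503`, so the match fails before it ever reaches the href. Below is a minimal sketch of a looser pattern, written for Python 3. It assumes the attributes always appear in the order id, class, title, href, as in the sample; the names `sample`, `original` and `relaxed` are only illustrative.

```python
import re

# Sample anchor tag copied from the question.
sample = (
    '<a id="leader-1029391503" class="description" '
    'title="White iPhone 4 like new" '
    'href="http://www.gumtree.com/p/for-sale/white-iphone-4-like-new/1029391503">'
)

# Original pattern: \d matches a single digit, so id="leader-1029391503" never matches.
original = re.compile(r'<a id="\d"\sclass="description"\stitle=".*?"\shref="(.*?)">')

# Looser pattern: accept any id and title value that contains no double quote.
# Assumes the attribute order id, class, title, href shown in the sample.
relaxed = re.compile(
    r'<a\s+id="[^"]*"\s+class="description"\s+title="[^"]*"\s+href="([^"]*)"'
)

original_links = original.findall(sample)
relaxed_links = relaxed.findall(sample)
print(original_links)
print(relaxed_links)
```

A pattern like this still breaks as soon as the site reorders the attributes or adds a new one, which is the usual argument for an HTML parser.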


I am trying to scrape Gumtree and have managed to extract everything else I need. The full code is below:
Code: Select all
import urllib
import re

searchwords = ["apple"] #Enter words you want to search Gumtree for
searchlocation = ["cardiff"] #Enter locations where you want to search for your words
searchcategory = ["all"]

i1=0 #location
i2=0 #words
i3=1 #page number

while i1<len(searchlocation):
    while i2<len(searchwords):
        while i3 > 0:
            url = "http://www.gumtree.com/" +searchcategory[0] +"/" +searchlocation[i1] +"/" +searchwords[i2] +"/page" +str(i3) +"?price=over_0"
            htmlfile = urllib.urlopen(url)
            htmltext = htmlfile.read()
            title = re.compile('<span class="ad-title-text" itemprop="itemListElement">(.+?)</span>')
            title = re.findall(title,htmltext)
            if len(title) ==0:
                i3 = 0
                break
            else:
                price = re.compile('<span class="price">(.+?)</span>')
                price = re.findall(price,htmltext)
               
                des = re.compile('<div class="ad-description"><span>(.+?)</span></div>')
                des = re.findall(des,htmltext)
               
                link = re.compile('<a id="\d"\sclass="description"\stitle=".*?"\shref="(.*?)">')
                link = re.findall(link,htmltext)
                print link
             
                listIterator = []
                listIterator[:] = range(1,len(title))
           
                for i in listIterator:
                    print title[i] +" (" +searchlocation[i1] +")"
                    print des[i]
                    print link[i]
                    print price[i]
                    print "\n"
            i3+=1
        i2+=1
    i2=0
    i1+=1

print "end"

Re: Regular expression help

Post by micseydel » Wed Aug 21, 2013 11:32 am

The standard answer to a question like this is that regular expressions are not the right tool for this task. I use the third-party lxml module for my XML/HTML parsing.
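
A minimal sketch of what that might look like with lxml, written for Python 3 and run against the sample tag from the question. The `sample` string, the enclosing `<div>` and the variable names are assumptions for illustration, and lxml has to be installed separately.

```python
from lxml import html

# Sample tag from the question; the enclosing <div> is added only for illustration.
sample = (
    '<div><a id="leader-1029391503" class="description" '
    'title="White iPhone 4 like new" '
    'href="http://www.gumtree.com/p/for-sale/white-iphone-4-like-new/1029391503">'
    'White iPhone 4 like new</a></div>'
)

tree = html.fromstring(sample)

# XPath: every <a> whose class attribute is exactly "description"; return its href.
hrefs = tree.xpath('//a[@class="description"]/@href')
print(hrefs)
```

The XPath here compares the whole class attribute, so it would miss a tag such as `class="description featured"`. In that case something like `contains(@class, "description")` may be a better fit. Unlike the regular expression, it does not depend on the id, the title or the order of the attributes.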

Re: Regular expression help

Post by jdawg1989 » Thu Aug 22, 2013 2:53 pm

Thanks for your suggestion, micseydel.

I have solved my issue by using the BeautifulSoup library instead.

Code: Select all
url = listing.find("a", class_="description").get("href")
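
For context, here is a minimal sketch of how that line might sit inside a loop, written for Python 3 with the `bs4` package. The post does not show how `listing` was obtained, so the `<li>` wrappers, the `find_all("li")` call and the `urls` list are assumptions; only the `<a class="description">` tag comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <ul>/<li> wrappers are assumptions for illustration;
# only the <a class="description"> tag is taken from the question.
htmltext = """
<ul>
  <li><a id="leader-1029391503" class="description" title="White iPhone 4 like new"
         href="http://www.gumtree.com/p/for-sale/white-iphone-4-like-new/1029391503">White iPhone 4 like new</a></li>
  <li><span>An entry without a description link</span></li>
</ul>
"""

soup = BeautifulSoup(htmltext, "html.parser")

urls = []
for listing in soup.find_all("li"):
    anchor = listing.find("a", class_="description")
    # find() returns None when no matching tag exists, so guard before calling .get().
    if anchor is not None:
        urls.append(anchor.get("href"))

print(urls)
```

The `None` check matters because calling `.get("href")` directly on the result of `find()` raises an AttributeError for any listing that has no description link.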