Scraping a website and getting specific data

Post by SMNALLY » Tue Sep 24, 2013 2:49 pm

Hello Python Forum,

I am trying to scrape some specific areas from this website for horse racing results, but I am having trouble defining the area that I want to read. I have tried a few variations of the code, and the one below is probably the closest to what I want to achieve.

Code:
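import urllib2
import re

# The results page URL is missing from the post; set it before running.
RESULTS_URL = '...'

def getData(url):
    # urlopen has to be called with the URL; read() returns the page source.
    html = urllib2.urlopen(url).read()
    # Non-greedy match from each crBlock div up to its closing resultGrid comment.
    pattern1 = re.compile(r'<div class="crBlock(.*?)<!-- \.resultGrid -->', re.S)
    dataset_items = pattern1.findall(html)

    ## If dataset_items is true (data exists) then...
    if dataset_items:

        ## For each dataset in dataset_items...
        for dataset in dataset_items:
            print dataset

getData(RESULTS_URL)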


The html below represents one of the racecourses and all of the races for that race meeting. I initially want to be able to select each of these html tables, because I ultimately need to know the course for which I am collecting results, and the course isn't included in the html on a race-by-race basis. After that, I want to be able to select each race within these tables and collect selected pieces of data.

Please note that for the first html table the class has ' noBorder', whereas the subsequent html tables do not contain this.

Code:
<div class="crBlock noBorder">
 <a name="175"></a>
 <table class="raceHead">
 <tr>
 <td class="meeting">
 <div class="topLink">
 <h3>
 <


This is my first post in the Python forum, so I am unsure how much help will be available, but I thought it was worth a shot.

Thanks, SMNALLY
Last edited by SMNALLY on Wed Sep 25, 2013 11:57 pm, edited 1 time in total.
SMNALLY
 
Posts: 5
Joined: Tue Sep 24, 2013 2:16 pm

Re: Scraping a website and getting specific data

Post by stranac » Tue Sep 24, 2013 3:14 pm

Please don't use regular expressions for scraping html.
There are parsers made for the job. I would recommend using lxml.
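
For what it's worth, here is a minimal sketch of the parser-based approach. lxml is a third-party install, so this version swaps in the standard library's html.parser (Python 3) instead. SAMPLE_HTML, MeetingParser and get_meetings are names made up for this example, and the course names and the contents of the <h3> are assumptions, since the fragment in the first post is cut off. It treats crBlock as one class token, so the div with ' noBorder' is picked up along with the others.

```python
from html.parser import HTMLParser

# Hypothetical sample modelled on the fragment in the first post; the course
# names and the text inside <h3> are invented for illustration only.
SAMPLE_HTML = """
<div class="crBlock noBorder">
 <a name="175"></a>
 <table class="raceHead">
 <tr><td class="meeting"><div class="topLink"><h3>Course A</h3></div></td></tr>
 </table>
</div>
<div class="crBlock">
 <a name="176"></a>
 <table class="raceHead">
 <tr><td class="meeting"><div class="topLink"><h3>Course B</h3></div></td></tr>
 </table>
</div>
"""


class MeetingParser(HTMLParser):
    """Collect the <h3> text found inside each <div class="crBlock ...">."""

    def __init__(self):
        super().__init__()
        self.meetings = []     # one heading string per crBlock div
        self._div_depth = 0    # greater than 0 while inside a crBlock div
        self._in_h3 = False

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            # Split the class attribute into tokens so "crBlock noBorder"
            # and "crBlock" both match, but "crBlockX" does not.
            classes = (dict(attrs).get("class") or "").split()
            if self._div_depth:
                self._div_depth += 1      # nested div inside a block
            elif "crBlock" in classes:
                self._div_depth = 1       # entering a new block
                self.meetings.append("")
        elif tag == "h3" and self._div_depth:
            self._in_h3 = True

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1
        elif tag == "h3":
            self._in_h3 = False

    def handle_data(self, data):
        if self._in_h3 and self._div_depth:
            self.meetings[-1] += data.strip()


def get_meetings(html):
    """Return the heading text of every crBlock div, in document order."""
    parser = MeetingParser()
    parser.feed(html)
    parser.close()
    return parser.meetings


print(get_meetings(SAMPLE_HTML))  # prints ['Course A', 'Course B']
```

Tracking the div depth is what ties each heading to its enclosing crBlock, so the course is known before any race inside that block is read. The same idea carries over to lxml, where the blocks would be selected with an XPath or CSS query rather than a hand-written handler.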
stranac
 
Posts: 890
Joined: Thu Feb 07, 2013 3:42 pm

Re: Scraping a website and getting specific data

Post by SMNALLY » Tue Sep 24, 2013 6:54 pm

stranac wrote:Please don't use regular expressions for scraping html.
There are parsers made for the job. I would recommend using lxml.


Thanks for the comment. I will check this out.

Return to General Coding Help

Who is online

Users browsing this forum: No registered users and 5 guests