How best to achieve the following?

Postby AEA » Thu May 30, 2013 1:31 pm

Hey guys, I'm not after help with some specific code here, but rather advice about how to achieve my current objective.

I have created a web scraper that scrapes live user data every 5 minutes. At the beginning of the day it will scrape statistics about users A, B, C (because they are the only ones who have entered data for that day); 5 minutes later it might be A, B, C, D, E. However, this leaves me with the list A, B, C, A, B, C, D, E, whereas on the second scrape I would like to collect only the new data: first scrape A, B, C; second scrape D, E; total scraped A, B, C, D, E.

My question to the forum is: how best to achieve this? Note: each user's data has a unique user ID and a unique username.

Just for the record, I have been using Python and coding for about a week (the jargon catches me out right now).

Any comments and advice would be appreciated.

Many thanks AEA
AEA
 
Posts: 32
Joined: Thu Apr 18, 2013 11:37 am

Re: How best to achieve the following?

Postby hansn » Thu May 30, 2013 2:29 pm

Hard to say exactly without looking at your code, and I have no experience with web scrapers.

But would not a simple
Code: Select all
if A not in seen:
    seen.append(A)

suffice?
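
For example, a minimal sketch (the names `seen` and `collect` are hypothetical, since the scraper's actual return values aren't shown in the thread):
Code: Select all
# Hypothetical sketch: keep one list for the whole day and only keep
# users that haven't already been collected by an earlier scrape.
seen = []

def collect(scraped):
    """'scraped' is the list of usernames from one 5-minute scrape."""
    new = []
    for user in scraped:
        if user not in seen:   # the membership test suggested above
            seen.append(user)
            new.append(user)
    return new                 # only the users first seen this scrape

print collect(['A', 'B', 'C'])            # ['A', 'B', 'C']
print collect(['A', 'B', 'C', 'D', 'E'])  # ['D', 'E']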

Kudos for making a working web scraper after 1 week of programming!
hansn
 
Posts: 87
Joined: Thu Feb 21, 2013 8:46 pm

Re: How best to achieve the following?

Postby AEA » Thu May 30, 2013 2:58 pm

Thanks for the reply hansn. I didn't want to include my code purely so the website in question doesn't show up in search results for a web scraper. I will amend my code and supply examples of what it currently returns.

Code: Select all
import mechanize
import urllib
import json
import re

 
def getData(): 
    post_url = "URL"
    browser = mechanize.Browser()
    browser.set_handle_robots(False)
    browser.addheaders = [('User-agent', 'Firefox')]
 
    # These are the parameters you've got from checking with the
    # aforementioned tools
    parameters = {'page' : '1',
                  'rp' : '250',
                  'sortname' : 'roi',
                  'sortorder' : 'desc'
                 }
    # Encode the parameters and POST them
    data = urllib.urlencode(parameters)
    trans_array = browser.open(post_url, data).read().decode('UTF-8')

    xmlload1 = json.loads(trans_array)
    # Patterns for extracting fields from the HTML fragments in the JSON
    pattern1 = re.compile(r'>&nbsp;&nbsp;(.*)<')                          # username
    pattern2 = re.compile(r"/control/profile/view/(.*)' title=")          # user ID
    pattern3 = re.compile(r"<span style='font-size:12px;'>(.*)</span>")   # selection
    pattern4 = re.compile(r"title='Naps posted: (.*) Winners:")           # number of selections
    pattern5 = re.compile(r"Winners: (.*)'><img src=")                    # number of winners


    # Each of the 250 returned rows holds one user's data as HTML fragments
    for i in xrange(250):
        user_delimiter = xmlload1['rows'][i]['cell']['username']
        selection_delimiter = xmlload1['rows'][i]['cell']['race_horse']

        username_delimiter_results = re.findall(pattern1, user_delimiter)[0]
        userid_delimiter_results = int(re.findall(pattern2, user_delimiter)[0])
        user_selection = re.findall(pattern3, selection_delimiter)[0]
        user_numberofselections = float(re.findall(pattern4, user_delimiter)[0])
        user_numberofwinners = float(re.findall(pattern5, user_delimiter)[0])

        # Strike rate: winners as a percentage of selections
        strikeratecalc1 = user_numberofwinners/user_numberofselections
        strikeratecalc2 = strikeratecalc1*100

        print "user id = ",userid_delimiter_results
        print "username = ",username_delimiter_results
        print "user selection = ",user_selection
        print "best price available as decimal = ",xmlload1['rows'][i]['cell']['tws.best_price']
        print "race time = ",xmlload1['rows'][i]['cell']['race_time']
        print "race meeting = ",xmlload1['rows'][i]['cell']['race_meeting']
        print "ROI = ",xmlload1['rows'][i]['cell']['roi']
        print "number of selections = ",user_numberofselections
        print "number of winners = ",user_numberofwinners
        print "Strike rate = ",strikeratecalc2,"%"
        print ""


Examples of what it is currently printing:

Code: Select all
user id =  1764
username =  user1
user selection =  selection1
best price available as decimal =  3.500
race time =  19:50
race meeting =  SAND
ROI =  -16%
number of selections =  83.0
number of winners =  12.0
Strike rate =  14.4578313253 %

user id =  736
username =  user2
user selection =  selection2
best price available as decimal =  3.500
race time =  14:50
race meeting =  LING
ROI =  -17%
number of selections =  187.0
number of winners =  51.0
Strike rate =  27.2727272727 %

user id =  1567
username =  user3
user selection =  selection3
best price available as decimal =  4.000
race time =  18:10
race meeting =  SAND
ROI =  -17%
number of selections =  135.0
number of winners =  28.0
Strike rate =  20.7407407407 %

user id =  373
username =  user4
user selection =  selection4
best price available as decimal =  4.500
race time =  20:20
race meeting =  SAND
ROI =  -18%
number of selections =  243.0
number of winners =  48.0


I am assuming that I need to make a list of the user IDs in order to achieve this?
AEA
 
Posts: 32
Joined: Thu Apr 18, 2013 11:37 am

Re: How best to achieve the following?

Postby setrofim » Thu May 30, 2013 3:19 pm

Use a dict that maps user_id to the object containing the user's data. On each run of the scraper, check whether the user ID is already in the dict; if so, update the corresponding object, otherwise add a new entry to the dict.
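
A minimal sketch of that idea (the names `users` and `process_scrape`, and the (user_id, data) tuple format, are assumptions for illustration; the IDs come from the output posted above):
Code: Select all
# Sketch only: a dict mapping user_id -> dict of that user's data,
# kept across runs of the scraper.
users = {}

def process_scrape(scrape):
    # 'scrape' is assumed to be a list of (user_id, data_dict) pairs
    for user_id, data in scrape:
        if user_id in users:
            users[user_id].update(data)   # seen before: refresh their stats
        else:
            users[user_id] = data         # new user: add an entry

# First run sees user 1764; second run updates 1764 and adds 736
process_scrape([(1764, {'username': 'user1', 'roi': '-16%'})])
process_scrape([(1764, {'roi': '-15%'}), (736, {'username': 'user2'})])
print users[1764]['roi']   # -15% (overwritten on the second run)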
setrofim
 
Posts: 288
Joined: Mon Mar 04, 2013 7:52 pm

Re: How best to achieve the following?

Postby AEA » Thu May 30, 2013 8:02 pm

setrofim wrote:Use a dict that maps user_id to the object containing the user's data. On each run of the scraper, check whether the user ID is already in the dict; if so, update the corresponding object, otherwise add a new entry to the dict.


Hi, thanks for the reply setrofim. Currently I have no idea how to do what you said. Any hints about which functions I should watch tutorials for in order to better grasp this concept?

Kind regards

AEA
AEA
 
Posts: 32
Joined: Thu Apr 18, 2013 11:37 am

Re: How best to achieve the following?

Postby hansn » Thu May 30, 2013 8:38 pm

AEA wrote:Hi, thanks for the reply setrofim. Currently I have no idea how to do what you said

Assuming that you wrote the web scraper on your own, I'm guessing that you don't know how to use a dictionary - see here: http://docs.python.org/2/tutorial/datas ... ctionaries
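
In short, a dict lets you look data up by a unique key, which is exactly what the user IDs give you. The basics (made-up values for illustration):
Code: Select all
# Basic dictionary operations, keyed by user ID
users = {}
users[1764] = 'user1'          # add an entry
users[736] = 'user2'
print 1764 in users            # True  -- membership test by key
print users[1764]              # user1 -- lookup by key
users[1764] = 'user1_updated'  # assigning to an existing key overwrites it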
hansn
 
Posts: 87
Joined: Thu Feb 21, 2013 8:46 pm

Re: How best to achieve the following?

Postby AEA » Thu May 30, 2013 8:45 pm

hansn wrote:
AEA wrote:Hi, thanks for the reply setrofim. Currently I have no idea how to do what you said

Assuming that you wrote the web scraper on your own, I'm guessing that you don't know how to use a dictionary - see here: http://docs.python.org/2/tutorial/datas ... ctionaries


I have had help, but for the most part I have got it to work on my own. The use of mechanize and the POST request I got help on, as I was copying an example that used a GET request. I also needed help getting the division to work using float() and getting... in fact, I think I have needed bits of help for most of it, but once I have understood the concept I can copy it and apply it to my own problems (it seems).
AEA
 
Posts: 32
Joined: Thu Apr 18, 2013 11:37 am

Re: How best to achieve the following?

Postby metulburr » Thu May 30, 2013 9:03 pm

AEA wrote:Hi, thanks for the reply setrofim. Currently I have no idea how to do what you said. Any hints about which functions I should watch tutorials for in order to better grasp this concept?

There is also a tutorial on the forum about dictionaries.
metulburr
 
Posts: 1471
Joined: Thu Feb 07, 2013 4:47 pm
Location: Elmira, NY

Re: How best to achieve the following?

Postby AEA » Thu May 30, 2013 9:39 pm

metulburr wrote:There is also a tutorial on the forum about dictionaries.

Cheers metulburr!
AEA
 
Posts: 32
Joined: Thu Apr 18, 2013 11:37 am

