scraping google.

Postby sheffieldlad » Wed Apr 03, 2013 12:47 pm

Hi all,

I have 3 scripts which I use to extract business email addresses from Google.

The 1st script asks for a search term, then returns a list of domain names.

The 2nd script parses this list of domain names, removing duplicates and some domains I don't need, like youtube.com or facebook.com.
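
Roughly, the filtering step in the 2nd script works along these lines (a simplified sketch rather than my exact code - the filenames and the exclusion list here are just examples):

Code:
# simplified sketch: dedupe a raw list of domains and drop ones I don't want
EXCLUDE = ("youtube.com", "facebook.com")

seen = set()
with open("raw_domains.txt", "r") as f:
    for line in f:
        domain = line.strip().lower()
        if domain and domain not in seen and domain not in EXCLUDE:
            seen.add(domain)

with open("domains.txt", "w") as out:
    for domain in sorted(seen):
        out.write(domain + "\n")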

I finish up with a text file full of domain names which looks something like this:

mtarch.co.uk
khouryarchitects.co.uk
la-architects.co.uk
swarch.co.uk
ents24.com
zmarchitecture.co.uk


The third script - the one I am having problems with - reads in this text file and searches the first ten pages of Google results for email addresses belonging to each domain, before writing each unique email address to a text file.

This used to work just fine, but recently it has begun failing with HTTP error 503 - Service Unavailable - mentioned in the traceback.

Here is my code for the 3rd script...

Code:
import re
import urllib2
import random
import time


# crude HTML tag stripper: repeatedly remove the first <...> pair found
def StripTags(text):
    finished = 0
    while not finished:
        finished = 1
        start = text.find("<")
        if start >= 0:
            stop = text[start:].find(">")
            if stop >= 0:
                text = text[:start] + text[start + stop + 1:]
                finished = 0
    return text


# read in the text file of domains, one per line
with open("domains.txt", "r") as f:
    domains = [line.strip() for line in f]

d = {}              # unique email addresses found so far
page_counter = 0    # Google 'start' parameter (goes up by 10 per results page)

for domain_name in domains:

    while page_counter < 100:  # 100 equates to the first 10 pages of results

        results = ('http://www.google.com/search?q=%40' + domain_name +
                   '&hl=en&lr=&ie=UTF-8&start=' + str(page_counter) + '&sa=N')

        request = urllib2.Request(results)
        request.add_header('User-Agent',
                           'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)')
        opener = urllib2.build_opener()
        text = opener.open(request).read()

        # pull out anything that looks like an address at this domain
        emails = re.findall(r'([\w\.\-]+@' + re.escape(domain_name) + ')',
                            StripTags(text))
        for email in emails:
            d[email] = 1

        page_counter = page_counter + 10
        true_page = page_counter / 10

        # give feedback to the user
        print "Checking page " + str(true_page) + " for @" + domain_name
        print "Found these addresses " + str(emails) + " on page " + str(true_page)

        # pause the script for human emulation to stop Google banning my IP
        i = random.random() * 35
        print "Pausing script for " + str(i) + " seconds"
        time.sleep(i)

    page_counter = 0  # reset for the next domain

# all domains have been searched, write the email addresses found to mailaddresses.txt
with open("mailaddresses.txt", "a") as w:
    for uniq_email in d.keys():
        w.write("\n" + uniq_email)

print "Finished. Open mailaddresses.txt to view email addresses"




Here is the traceback in full...

Traceback (most recent call last):
  File "C:\Python27\mail2.py", line 53, in <module>
    text = opener.open(request).read()
  File "C:\Python27\lib\urllib2.py", line 397, in open
    response = meth(req, response)
  File "C:\Python27\lib\urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python27\lib\urllib2.py", line 429, in error
    result = self._call_chain(*args)
  File "C:\Python27\lib\urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 605, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "C:\Python27\lib\urllib2.py", line 397, in open
    response = meth(req, response)
  File "C:\Python27\lib\urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python27\lib\urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "C:\Python27\lib\urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 503: Service Unavailable



If anyone can suggest why I am suddenly getting this error, or suggest a way to handle the error and move on to the next domain name in the list, I would appreciate it.

Error handling is not my strong point...


Many thanks,

Paul.
Python 2.7
Windows XP
sheffieldlad
 
Posts: 37
Joined: Sat Feb 09, 2013 3:03 pm
Location: UK

Re: scraping google.

Postby setrofim » Wed Apr 03, 2013 12:54 pm

You should not be scraping Google search results pages. Instead, you should be using Google Search APIs. The 503 error you're getting is probably due to Google detecting that the requests are being made by a script and refusing to service them.
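
If you did want to go the API route, the Custom Search JSON API can be queried with plain urllib2. Something along these lines (untested sketch - the key and cx values are placeholders you get from the Google API console, and the field names are as I remember them from the docs):

Code:
import json
import urllib
import urllib2

# placeholders: get a real API key and search engine id from the Google API console
params = urllib.urlencode({
    'key': 'YOUR_API_KEY',
    'cx': 'YOUR_SEARCH_ENGINE_ID',
    'q': '@example.com',
})
url = 'https://www.googleapis.com/customsearch/v1?' + params

data = json.load(urllib2.urlopen(url))
for item in data.get('items', []):
    print item['link']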
setrofim
 
Posts: 288
Joined: Mon Mar 04, 2013 7:52 pm

Re: scraping google.

Postby sheffieldlad » Wed Apr 03, 2013 12:58 pm

setrofim wrote:You should not be scraping Google search results pages. Instead, you should be using Google Search APIs.


I'm aware of that, thank you.

The free account doesn't offer me enough queries per day, and the paid account would not be cost-effective.

I am aware of the ethics, and in this case I believe them to be one-sided.

Google is the biggest web scraper in the world, and I have never heard of them paying for scraping information from websites.

That's my opinion. I understand many people here probably don't share my opinion on this matter.
It would be a boring world if we all thought the same....


Paul.

[edit] If Google were detecting a script, I would expect it to throw captchas up for legitimate searches afterwards for a while, but I may be wrong. [/edit]
Python 2.7
Windows XP
sheffieldlad
 
Posts: 37
Joined: Sat Feb 09, 2013 3:03 pm
Location: UK

Re: scraping google.

Postby setrofim » Wed Apr 03, 2013 1:11 pm

sheffieldlad wrote:I am aware of the ethics, and in this case I believe them to be one-sided.

Google is the biggest web scraper in the world, and I have never heard of them paying for scraping information from websites.


You are not paying for scraping but for the use of Google's search engine and database on an "industrial" scale, i.e. going beyond what is expected of an average (human) user. You're using bandwidth and putting a load on Google's servers. This is not the same as scraping static content -- you are using a service, not just accessing web pages. It is Google's prerogative to charge you for that if they want.

When it comes to ethics of scraping, there is one golden rule: respect the robots.txt for the site. All Google's scrapers follow that rule, so there is nothing one-sided about it. If you examine Google's robots.txt, you will see that searching is explicitly prohibited:
Google's robots.txt wrote:User-agent: *
Disallow: /search


Ethics aside, this is about what Google lets you get away with. You can try introducing a sizable delay (of a few seconds) between making the requests -- that might fool Google into believing that the requests are being made by a human (obviously, that would also mean waiting much longer for your results).
setrofim
 
Posts: 288
Joined: Mon Mar 04, 2013 7:52 pm

Re: scraping google.

Postby sheffieldlad » Wed Apr 03, 2013 1:18 pm

setrofim wrote:
sheffieldlad wrote:I am aware of the ethics, and in this case I believe them to be one-sided.

Google is the biggest web scraper in the world, and I have never heard of them paying for scraping information from websites.


You are not paying for scraping but for the use of Google's search engine and database on an "industrial" scale, i.e. going beyond what is expected of an average (human) user. You're using bandwidth and putting a load on Google's servers; it is Google's prerogative to charge you for that if they want.

When it comes to ethics of scraping, there is one golden rule: respect the robots.txt for the site. All Google's scrapers follow that rule, so there is nothing one-sided about it. If you examine Google's robots.txt, you will see that searching is explicitly prohibited:
Google's robots.txt wrote:User-agent: *
Disallow: /search


Ethics aside, this is about what Google lets you get away with. You can try introducing a sizable delay (of a few seconds) between making the requests -- that might fool Google into believing that the requests are being made by a human (obviously, that would also mean waiting much longer for your results).



Thanks for your reply.

There are delays built into the script - anywhere between 1 and 35 seconds.
I guess you may be right and Google has smelt a script, but again, I would expect captchas for legitimate searches from my IP, which I don't get.
Ethics aside, does anyone have any advice on how to handle errors in Python?

-Paul.
Python 2.7
Windows XP
sheffieldlad
 
Posts: 37
Joined: Sat Feb 09, 2013 3:03 pm
Location: UK

Re: scraping google.

Postby setrofim » Wed Apr 03, 2013 1:29 pm

sheffieldlad wrote: Does anyone have any advice on how to handle errors in python?

In Python, you handle errors by catching the exception and dealing with it. For an HTTPError this may often involve simply logging it and moving on to the next request. In your case, that won't be of much help, since once you get a 503, subsequent requests are going to return the same thing. So you might want to just terminate the script (or move on to the next part of the script once the requests start failing).
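
Something along these lines (an untested sketch - the fetch helper is made up for illustration, but the urllib2 calls are standard):

Code:
import urllib2

def fetch(url, user_agent='Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)'):
    # fetch a URL and return its body, or None if the request fails
    request = urllib2.Request(url)
    request.add_header('User-Agent', user_agent)
    try:
        return urllib2.urlopen(request).read()
    except urllib2.HTTPError as e:
        # the server answered, but with an error status (e.g. 503)
        print "HTTP error %d (%s) for %s" % (e.code, e.msg, url)
        return None
    except urllib2.URLError as e:
        # no proper response at all (DNS failure, connection refused, ...)
        print "Failed to reach %s: %s" % (url, e.reason)
        return None

# in your loop you would then do something like:
#     text = fetch(results)
#     if text is None:
#         break   # give up on this domain and move on to the next one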

If by that question you mean "how do I get around this particular error and get Google to give me the results?", then, if the delays aren't helping, you could try using mechanize to emulate a browser more fully (rather than just faking the user-agent header).
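
Roughly like this (untested - the URL and user-agent string are just examples):

Code:
import mechanize

br = mechanize.Browser()
# mechanize honours robots.txt by default, and Google's robots.txt
# disallows /search, so this has to be switched off for it to work at all
br.set_handle_robots(False)
br.addheaders = [('User-Agent',
                  'Mozilla/5.0 (Windows NT 5.1; rv:20.0) Gecko/20100101 Firefox/20.0')]

response = br.open('http://www.google.com/search?q=%40example.com&hl=en')
html = response.read()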
setrofim
 
Posts: 288
Joined: Mon Mar 04, 2013 7:52 pm

Re: scraping google.

Postby sheffieldlad » Wed Apr 03, 2013 1:31 pm

setrofim wrote:
sheffieldlad wrote: Does anyone have any advice on how to handle errors in python?

In Python, you handle errors by catching the exception and dealing with it. For an HTTPError this may often involve simply logging it and moving on to the next request. In your case, that won't be of much help, since once you get a 503, subsequent requests are going to return the same thing. So you might want to just terminate the script (or move on to the next part of the script once the requests start failing).

If by that question you mean "how do I get around this particular error and get Google to give me the results?", then, if the delays aren't helping, you could try using mechanize to emulate a browser more fully (rather than just faking the user-agent header).



Many thanks, I will look into using Mechanize.

-Paul.
Python 2.7
Windows XP
sheffieldlad
 
Posts: 37
Joined: Sat Feb 09, 2013 3:03 pm
Location: UK

Re: scraping google.

Postby micseydel » Thu Apr 04, 2013 8:37 pm

One way scraping is detected is when you don't modify the user-agent in your requests.
Join the #python-forum IRC channel on irc.freenode.net!

Please do not PM members regarding questions which are meant to be discussed publicly. The point of the forum is so that others can benefit from it. We don't want to help you over PMs or emails.
micseydel
 
Posts: 1390
Joined: Tue Feb 12, 2013 2:18 am
Location: Mountain View, CA

Re: scraping google.

Postby sheffieldlad » Thu Apr 04, 2013 9:17 pm

micseydel wrote:One way scraping is detected is when you don't modify the user-agent in your requests.


Thanks.

I have managed to get around the problem for the time being by minimizing the number of requests I'm sending - I don't need the first 10 pages of results; between 3 and 5 usually gets me the information I need - and by adding looooong delays between requests.
I do need to come up with a more permanent solution, but what I'm doing isn't a long-term thing.
I don't intend to scrape Google forever - it's a means to an end - but I need to get on top of my code, just to satisfy my own mind and hopefully to learn.

Since my last post I have introduced code to inform the user what is happening and to handle errors gracefully, which is something I wasn't sure how to do before.

I would like to take my little project further, but I won't have a real need for it for much longer (apart from the learning aspect), and there are other things I would enjoy coding a lot more. :)

-Paul.
Python 2.7
Windows XP
sheffieldlad
 
Posts: 37
Joined: Sat Feb 09, 2013 3:03 pm
Location: UK

