read whole website (i.e. not just one webpage) urllib2

Post by nico82 » Fri Apr 05, 2013 12:27 pm

Hello all!

I am trying to write a Python program that reads a whole website.
Searching Google, I only found explanations of how to fetch a single webpage with urllib2.
Here is my code:

import urllib2

# Build an opener that sends a browser-like User-agent header
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
# Fetch a single page and read its raw HTML
infile = opener.open('http://ru.wikipedia.org/wiki/%D0%97%D0%B0%D0%B2%D0%BE%D0%B4_%D0%B8%D0%BC%D0%B5%D0%BD%D0%B8_%D0%9C%D0%B0%D0%BB%D1%8B%D1%88%D0%B5%D0%B2%D0%B0')
page = infile.read()


Now, if I want to read all of Wikipedia, for example, how should I proceed?

Not only http://en.wikipedia.org but all the webpages whose addresses start with http://en.wikipedia.org/blablabla....

Thanks a lot for your attention!

Post by setrofim » Fri Apr 05, 2013 1:28 pm

Try Scrapy.
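
For a sense of what a crawler such as Scrapy automates, here is a minimal standard-library sketch. It is written for Python 3, where the functionality of urllib2 lives in urllib.request. The names LinkParser, fetch, crawl, fetch_page and max_pages are made up for this illustration and do not come from Scrapy or from this thread. It follows links breadth-first and keeps only URLs that start with a given prefix; it does not handle robots.txt, rate limiting, or non-HTML content, all of which a real crawl of a large site would need.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import Request, urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download one page, sending the same User-agent as the original post."""
    request = Request(url, headers={"User-agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, prefix, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl of pages whose URL starts with `prefix`.

    Returns a dict mapping each visited URL to its HTML source.
    """
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part
            link = urldefrag(urljoin(url, href)).url
            if link.startswith(prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Something like crawl('http://en.wikipedia.org/wiki/Main_Page', 'http://en.wikipedia.org/') would then stop after max_pages pages. The cap is deliberate: a site the size of Wikipedia is far too large to fetch page by page this way.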

Post by micseydel » Fri Apr 05, 2013 9:36 pm

I use mechanize and lxml for my scraping. setrofim, do you have a Scrapy tutorial you recommend? I remember briefly checking it out and being excited about it, then getting massively confused and abandoning it.

Post by setrofim » Sat Apr 06, 2013 2:14 pm

micseydel wrote: setrofim, do you have a Scrapy tutorial you recommend?

Couldn't find anything I'd recommend by Googling, so I wrote one. Let me know if it makes sense, or if I should add/change something.

Post by micseydel » Sun Apr 07, 2013 1:12 am

Holy crap! Kudos, and thanks, setrofim! If I wasn't insanely busy I'd look it over right this minute.

Post by nico82 » Tue Apr 16, 2013 2:01 pm

OK, in the end I just decided to read the HTML source code and detect all the HTML parts. That was enough for what I wanted to do.
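
The post does not say which parts of the HTML were needed. One plausible reading is pulling the visible text out of a page's source, and a minimal Python 3 sketch of that with the standard-library html.parser module might look like the following. TextExtractor and extract_text are illustrative names, not code from this thread.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text that is not inside a skipped element
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html):
    """Return the list of text fragments found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.chunks
```

Here html would be the string read from the page, for example the page variable from the first post after decoding it to text.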