3.3 Web scraper - 2 specific questions

Postby deeeeets » Tue Jun 04, 2013 12:38 am

I've spent about 50 hours trying to figure out how to write a webscraper in Python 3.3. It appears hopeless at this point, so I'm posting out of desperation, with the thought that one of you super geniuses can probably resolve my 50-hour failure in a couple minutes.

I have two specific actions I am trying to execute, and I know Python can do them, but I can't bridge the gap from concept to execution.
[Also, I can write the loops for these actions just fine; the URL activities are what I can't figure out]

I have a webpage that has 200 links on it, each of which pertains to a sports team.
--from the source code, I need to ( a ) extract all 200 of these URLs and compile them into a list (for the following purpose)
--each of the team URLs shows up in the following template: <h5><a href="http://website.com/team/id/[id_number]/[abbrev_team_name]" class="bi">[Team_Name]</a></h5>

Then, within each of these pages, there are ~12 links to individual players' pages
--I need to ( b ) make another list of the URLs of all 2400 players
--each of the player URLs shows up in the following template: <td><a href="http://website.com/player/id/[PLR_id_number]/[abbrev_player_name]">[Player_Name]</a></td>

Finally, once I have the list of 2400 URLs, I need to ( c ) capture a table of data from each of them, but I think that part is going to be too complex to describe for now. If you do have any advice where that is concerned (scraping tables of data), please DO share, though.

Thanks !!!!!!!!!!!!!

Deets
deeeeets
 
Posts: 9
Joined: Sat Jun 01, 2013 9:59 am

Re: 3.3 Web scraper - 2 specific questions

Postby metulburr » Tue Jun 04, 2013 3:04 am

Stop relying 100% on regex to parse HTML for your web scraper. That is what BeautifulSoup was made for. I would never attempt a task like this without BeautifulSoup. It is a third-party module that you do have to install, but it makes tasks like this much simpler.


It would be easier if you had given the link to the website, but some examples would be:
Code: Select all
from bs4 import BeautifulSoup


table = '''
<table border="1">
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
</tr>
<tr>
<td>row 2, cell 1</td>
<td>row 2, cell 2</td>
<td><a href="http://website.com/player/id/[PLR_id_number]/[abbrev_player_name]">[Player_Name]</a></td>
</tr>
</table>
'''

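# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(table, 'html.parser')
section = soup.find('table')
td_list = section.find_all('td')

print(td_list)                        # every <td> in the table
print(td_list[0].text)                # text of the first cell
print(td_list[-1].text)               # text of the last cell (the link text)
print(td_list[-1].find('a')['href'])  # URL of the link in the last cell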

Output:
[<td>row 1, cell 1</td>, <td>row 1, cell 2</td>, <td>row 2, cell 1</td>, <td>row 2, cell 2</td>, <td><a href="http://website.com/player/id/[PLR_id_number]/[abbrev_player_name]">[Player_Name]</a></td>]
row 1, cell 1
[Player_Name]
http://website.com/player/id/[PLR_id_number]/[abbrev_player_name]
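
For part ( c ), the same idea extends to whole tables. This is only a rough sketch, since the real player pages are not known here: the helper name table_to_rows and the sample table (Season/Games) are made up for illustration, and it assumes the data you want is in the first <table> on the page.

```python
from bs4 import BeautifulSoup


def table_to_rows(html):
    """Return the first <table> in html as a list of rows of cell text."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    if table is None:
        return []
    rows = []
    for tr in table.find_all('tr'):
        # Collect header and data cells alike, with surrounding whitespace stripped
        cells = [cell.get_text(strip=True) for cell in tr.find_all(['td', 'th'])]
        if cells:
            rows.append(cells)
    return rows


# Hypothetical sample; the real table layout will differ
sample = '''
<table>
<tr><th>Season</th><th>Games</th></tr>
<tr><td>2012</td><td>16</td></tr>
<tr><td>2013</td><td>4</td></tr>
</table>
'''

rows = table_to_rows(sample)
print(rows)
```

Each inner list is one row, so rows[0] is the header row and rows[1:] are the data rows. If the page has several tables, you would need to pick the right one (by id or class, for example) instead of taking the first.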

Code: Select all
from bs4 import BeautifulSoup
links = '''
<h5><a href="http://website.com/team/id/[id_number]/[abbrev_team_name1]" class="bi1">[Team_Name1]</a></h5>
<h5><a href="http://website.com/team/id/[id_number]/[abbrev_team_name2]" class="bi1">[Team_Name2]</a></h5>
<h5><a href="http://website.com/team/id/[id_number]/[abbrev_team_name3]" class="bi2">[Team_Name3]</a></h5>
'''

soup = BeautifulSoup(links, 'html.parser')
# Pass the attribute filter as a dict (attribute name -> value)
print(soup.find_all('a', {'class': 'bi1'}))

Output:
[<a class="bi1" href="http://website.com/team/id/[id_number]/[abbrev_team_name1]">[Team_Name1]</a>, <a class="bi1" href="http://website.com/team/id/[id_number]/[abbrev_team_name2]">[Team_Name2]</a>]
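
Putting the two examples together for parts ( a ) and ( b ), something along these lines might work. Treat it as a sketch under assumptions: the function names (fetch_html, team_urls, player_urls), the sample ids and names, and INDEX_URL are all made up here, and it assumes the team links are always inside <h5> with class "bi" and the player links are inside <td> with "/player/id/" in the URL, as in the templates from the first post.

```python
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw bytes (BeautifulSoup accepts bytes)."""
    with urlopen(url) as response:
        return response.read()


def team_urls(html):
    """Return the href of every <h5><a class="bi"> link on the index page."""
    soup = BeautifulSoup(html, 'html.parser')
    urls = []
    for h5 in soup.find_all('h5'):
        a = h5.find('a', class_='bi')
        if a is not None and a.has_attr('href'):
            urls.append(a['href'])
    return urls


def player_urls(html):
    """Return the href of every player link that sits inside a <td>."""
    soup = BeautifulSoup(html, 'html.parser')
    pattern = re.compile(r'/player/id/')
    return [a['href'] for a in soup.find_all('a', href=pattern)
            if a.find_parent('td') is not None]


# Hypothetical stand-ins for the real pages
index_page = '''
<h5><a href="http://website.com/team/id/1/aaa" class="bi">Team A</a></h5>
<h5><a href="http://website.com/team/id/2/bbb" class="bi">Team B</a></h5>
<p><a href="http://website.com/about">About</a></p>
'''
team_page = '''
<table><tr>
<td><a href="http://website.com/player/id/10/p-one">Player One</a></td>
<td><a href="http://website.com/team/id/1/aaa">Team A</a></td>
</tr></table>
'''

print(team_urls(index_page))
print(player_urls(team_page))

# Against the real site the loop would look roughly like this
# (INDEX_URL is a placeholder for the page with the 200 team links):
# all_players = []
# for url in team_urls(fetch_html(INDEX_URL)):
#     all_players.extend(player_urls(fetch_html(url)))
```

The parsing functions take HTML as an argument rather than fetching it themselves, so you can try them on a saved copy of a page before hitting the site 200 times.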
metulburr
 
Posts: 1374
Joined: Thu Feb 07, 2013 4:47 pm
Location: Elmira, NY

Re: 3.3 Web scraper - 2 specific questions

Postby deeeeets » Tue Jun 04, 2013 11:47 am

@metulburr I've tried to install beautifulsoup4 (I've unzipped the tarball and everything), but I can't seem to get python to use it. Any advice on how to get it to open?

Thanks,
Deets
deeeeets
 
Posts: 9
Joined: Sat Jun 01, 2013 9:59 am

Re: 3.3 Web scraper - 2 specific questions

Postby setrofim » Tue Jun 04, 2013 12:07 pm

Did you run
Code: Select all
python setup.py install

in the unzipped location?
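
One quick way to check whether the install worked is to try the import from the same interpreter you run your scripts with. A minimal sketch (the helper name bs4_version is made up here); printing sys.executable helps if you have more than one Python installed, because the package may have gone into a different one:

```python
import sys


def bs4_version():
    """Return the installed bs4 version string, or None if bs4 cannot be imported."""
    try:
        import bs4
    except ImportError:
        return None
    return getattr(bs4, '__version__', 'unknown')


print('interpreter:', sys.executable)
version = bs4_version()
if version is None:
    print('bs4 is not importable by this interpreter')
else:
    print('bs4 version:', version)
```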
setrofim
 
Posts: 288
Joined: Mon Mar 04, 2013 7:52 pm

Re: 3.3 Web scraper - 2 specific questions

Postby deeeeets » Tue Jun 04, 2013 12:25 pm

I got it to work. I had been trying to use the python equivalent of cd and could not figure it out. Thanks! :D :D :D :D :D :D :D :D
deeeeets
 
Posts: 9
Joined: Sat Jun 01, 2013 9:59 am

