BeautifulSoup 4 - complication with .get('href')

Postby deeeeets » Thu Jun 06, 2013 6:14 pm

I am using BS4 (with Python 3.3) and trying to capture the URLs of about 350 links. Here is an example of what prettify() shows for one of these links:
Code:
<h5>
 <a class="bi" href="http://espn.go.com/mens-college-basketball/team/_/id/399/albany-great-danes">
  Albany
 </a>
</h5>


All I want to capture is the "http://espn.go.com/mens-college-basketball/team/_/id/399/albany-great-danes" portion. To do that I wrote the following code, but it prints "None" 350 times, so I am clearly doing something wrong. Can anybody tell me how to capture the URL? Many thanks!!

Code:
from bs4 import BeautifulSoup

import re
import urllib.request

url = "http://espn.go.com/mens-college-basketball/teams"
page1 = urllib.request.urlopen(url)
soup = BeautifulSoup(page1)
h5 = soup.find_all("h5")

print(h5[3])

for link in soup.find_all('h5'):
    print(link.get('href'))
deeeeets
 
Posts: 9
Joined: Sat Jun 01, 2013 9:59 am

Re: BeautifulSoup 4 - complication with .get('href')

Postby metulburr » Thu Jun 06, 2013 7:27 pm

Code:
from bs4 import BeautifulSoup
import urllib.request

url = "http://espn.go.com/mens-college-basketball/teams"
response = urllib.request.urlopen(url)
soup = BeautifulSoup(response)
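# attrs should be a dict of attribute name to value;
# the original {'class', 'bi'} was a set literal
sections = soup.find_all('a', {'class': 'bi'})
for section in sections:
    print(section.get('href'))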


This was my attempt, but it stops at the New Hampshire link a few links down. I expected it to print all of them, since they all appear to have <a class="bi", so I guess I can't offer much insight. Maybe someone here who is fluent in BeautifulSoup will post.
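
For anyone wondering why the original loop printed None: in the prettified markup from the first post, the href sits on the <a> nested inside each <h5>, not on the <h5> itself, so calling .get('href') on the <h5> has nothing to return. Below is a minimal offline sketch of that, assuming bs4 is installed. The SAMPLE_HTML string is only modelled on the first post; the second link and the link-less heading are made up for illustration and are not from the ESPN page.

```python
from bs4 import BeautifulSoup

# Inline sample modelled on the prettified markup in the first post.
# The second and third <h5> entries are hypothetical.
SAMPLE_HTML = """
<h5><a class="bi" href="http://espn.go.com/mens-college-basketball/team/_/id/399/albany-great-danes">Albany</a></h5>
<h5><a class="bi" href="http://example.com/team/hypothetical">Hypothetical</a></h5>
<h5>No link here</h5>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The <h5> tag itself carries no href attribute, so .get() returns None.
h5_hrefs = [h5.get("href") for h5 in soup.find_all("h5")]

# The href lives on the <a> nested inside each <h5>.
links = []
for h5 in soup.find_all("h5"):
    anchor = h5.find("a", class_="bi")
    if anchor is not None:  # skip headings without a matching link
        links.append(anchor.get("href"))

print(h5_hrefs)
print(links)
```

Searching for the <a> tags directly, as in the code above this note, gets to the same place; going through the <h5> first is only useful if other parts of the page also use class="bi".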
New Users, Read This
version Python 3.3.2 and 2.7.5, tkinter 8.5, pyqt 4.8.4, pygame 1.9.2 pre
OS Ubuntu 14.04, Arch Linux, Gentoo, Windows 7/8
https://github.com/metulburr
metulburr
 
Posts: 1122
Joined: Thu Feb 07, 2013 4:47 pm
Location: Elmira, NY

Re: BeautifulSoup 4 - complication with .get('href')

Postby stranac » Thu Jun 06, 2013 7:56 pm

Not sure what problem BeautifulSoup is having, but lxml.html works just fine here:
Code:
>>> import lxml.html
>>> doc = lxml.html.parse('http://espn.go.com/mens-college-basketball/teams')
>>> links = doc.xpath('//a[@class="bi"]/@href')
>>> len(links)
347
>>> links[0]
'http://espn.go.com/mens-college-basketball/team/_/id/399/albany-great-danes'


I would recommend using lxml anyway, because it's just so much better than BeautifulSoup.
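
One thing worth knowing about the XPath above: @class="bi" compares the whole attribute string, so a link with class="bi other" would not match. Here is a small offline sketch of the difference, assuming lxml is installed. The SAMPLE_HTML string and the example.com links are made up for illustration; whether the real page has any multi-class links is not something this thread confirms.

```python
import lxml.html

# Hypothetical sample: one exact class, one multi-class link, one with no class.
SAMPLE_HTML = """
<html><body>
  <h5><a class="bi" href="http://example.com/exact">Exact</a></h5>
  <h5><a class="bi other" href="http://example.com/two-classes">Two classes</a></h5>
  <h5><a href="http://example.com/no-class">No class</a></h5>
</body></html>
"""

doc = lxml.html.fromstring(SAMPLE_HTML)

# Exact string comparison on the class attribute.
exact = doc.xpath('//a[@class="bi"]/@href')

# Token-aware match: pad the class list with spaces and look for " bi ".
token_aware = doc.xpath(
    '//a[contains(concat(" ", normalize-space(@class), " "), " bi ")]/@href'
)

print(exact)
print(token_aware)
```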
Friendship is magic!

R.I.P. Tracy M. You will be missed.
stranac
 
Posts: 909
Joined: Thu Feb 07, 2013 3:42 pm

Re: BeautifulSoup 4 - complication with .get('href')

Postby deeeeets » Thu Jun 06, 2013 8:36 pm

@stranac - so lxml.html is much better than BS? I feel like I'm on a wild goose chase.

Re: BeautifulSoup 4 - complication with .get('href')

Postby metulburr » Thu Jun 06, 2013 8:46 pm

I believe which parser you use is a matter of preference. Regardless, his example works in this case, so I would go with his.
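
On the parser point: BeautifulSoup does not have to be an either/or choice against lxml, because it takes the parser name as its second argument and can sit on top of lxml when that is installed. The sketch below tries lxml first and falls back to the built-in html.parser. The make_soup helper and the SAMPLE_HTML string are hypothetical names for illustration, and nothing in this thread confirms that the parser was the reason the earlier attempt stopped at the New Hampshire link.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample; the second link is made up for illustration.
SAMPLE_HTML = """
<h5><a class="bi" href="http://espn.go.com/mens-college-basketball/team/_/id/399/albany-great-danes">Albany</a></h5>
<h5><a class="bi" href="http://example.com/team/hypothetical">Hypothetical</a></h5>
"""


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # Raised by bs4 when the requested parser is not available.
            continue
    raise RuntimeError("none of the requested parsers are installed")


soup, parser_used = make_soup(SAMPLE_HTML)

# href=True keeps only <a> tags that actually have an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", class_="bi", href=True)]

print(parser_used)
print(hrefs)
```

Naming the parser explicitly also keeps results consistent between machines, since BeautifulSoup otherwise picks whichever parser it finds installed.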

Re: BeautifulSoup 4 - complication with .get('href')

Postby deeeeets » Thu Jun 06, 2013 8:48 pm

@metulburr it worked! Thank you!

Re: BeautifulSoup 4 - complication with .get('href')

Postby deeeeets » Thu Jun 06, 2013 9:09 pm

Seriously, you guys have no idea how helpful you are!! Spent ALL afternoon on this...

