BeautifulSoup 4 - complication with .get('href')


Postby deeeeets » Thu Jun 06, 2013 6:14 pm

I am using BS4 (and Python 3.3) and trying to capture the URLs of about 350 links. Here is an example of what prettify() outputs for these links:
Code:
<h5>
 <a class="bi" href="http://espn.go.com/mens-college-basketball/team/_/id/399/albany-great-danes">
  Albany
 </a>
</h5>


All I want to capture is the "http://espn.go.com/mens-college-basketball/team/_/id/399/albany-great-danes" portion. To do that, I wrote the following code, but it returns "None" 350 times, i.e., I am doing something wrong. Can anybody tell me how to capture the URL? Many thanks!!

Code:
from bs4 import BeautifulSoup

import re
import urllib.request

url = "http://espn.go.com/mens-college-basketball/teams"
page1 = urllib.request.urlopen(url)
soup = BeautifulSoup(page1)
h5 = soup.find_all("h5")

print(h5[3])

for link in soup.find_all('h5'):
    print(link.get('href'))

Postby metulburr » Thu Jun 06, 2013 7:27 pm

Code:
from bs4 import BeautifulSoup
import urllib.request

url = "http://espn.go.com/mens-college-basketball/teams"
response = urllib.request.urlopen(url)
soup = BeautifulSoup(response)
# attrs should be a dict mapping attribute name to value, not a set
sections = soup.find_all('a', {'class': 'bi'})
for section in sections:
    print(section.get('href'))


My attempt was this, but it stops at the New Hampshire link a few links down. I expected all of them in the output, since they all appear to have <a class="bi", so I guess I can't give much insight. Maybe someone here who is fluent in BeautifulSoup will post.
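
For reference, the loop in the first post prints None because it calls .get('href') on each <h5> tag, and the href attribute lives on the <a> nested inside it, not on the <h5> itself. Below is a minimal sketch of reaching the nested tag. It runs on an inline sample modelled on the prettified markup quoted above rather than on the live page, and the names SAMPLE_HTML and team_links are made up for illustration.

```python
from bs4 import BeautifulSoup

# Inline sample modelled on the markup quoted in the first post.
# The second <h5> has no link, to show that case is skipped safely.
SAMPLE_HTML = """
<h5>
 <a class="bi" href="http://espn.go.com/mens-college-basketball/team/_/id/399/albany-great-danes">
  Albany
 </a>
</h5>
<h5>No link here</h5>
"""


def team_links(html):
    """Return the href of the first <a class="bi"> inside each <h5>."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for heading in soup.find_all("h5"):
        # heading.get("href") would be None here: <h5> has no href attribute.
        anchor = heading.find("a", class_="bi")
        if anchor is not None and anchor.get("href"):
            links.append(anchor["href"])
    return links


print(team_links(SAMPLE_HTML))
```

Against the real page, the same function could be fed the bytes returned by urllib.request.urlopen(url).read(); whether every link is then found depends on how well the chosen parser copes with that page's markup.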

Postby stranac » Thu Jun 06, 2013 7:56 pm

Not sure what problem BeautifulSoup is having, but lxml.html works just fine here:
Code:
>>> import lxml.html
>>> doc = lxml.html.parse('http://espn.go.com/mens-college-basketball/teams')
>>> links = doc.xpath('//a[@class="bi"]/@href')
>>> len(links)
347
>>> links[0]
'http://espn.go.com/mens-college-basketball/team/_/id/399/albany-great-danes'


I would recommend using lxml anyway, because it's just soooo much better than bs.

Postby deeeeets » Thu Jun 06, 2013 8:36 pm

@stranac - so lxml.html is much better than BS? I feel like I'm on a wild goose chase. :roll:

Postby metulburr » Thu Jun 06, 2013 8:46 pm

I believe the parser you use is a matter of preference. Regardless, his example works in this case, so I would go with his.
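
On parser choice: BeautifulSoup itself can be told which underlying parser to use via its second argument, and "lxml" and "html.parser" are two of the documented options. Whether the parser was the reason the earlier loop stopped partway is only a guess; the thread does not confirm it. The sketch below tries lxml first and falls back to the standard-library parser if lxml is not installed. The helper name make_soup and the /team/399 href are hypothetical, used only for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred=("lxml", "html.parser")):
    """Build a soup with the first parser in `preferred` that is available.

    Returns a (soup, parser_name) pair so the caller can see which one was used.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(html, parser), parser
        except FeatureNotFound:
            # Raised by bs4 when the requested parser is not installed.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Hypothetical one-link sample; the href is a placeholder, not a real ESPN path.
soup, parser_used = make_soup('<h5><a class="bi" href="/team/399">Albany</a></h5>')
hrefs = [a["href"] for a in soup.find_all("a", class_="bi")]
print(parser_used, hrefs)
```

Naming the parser explicitly also keeps results consistent between machines, since BeautifulSoup otherwise picks whichever parser it finds installed.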

Postby deeeeets » Thu Jun 06, 2013 8:48 pm

@metulburr it worked! Thank you!

Postby deeeeets » Thu Jun 06, 2013 9:09 pm

Seriously, you guys have no idea how helpful you are!! Spent ALL afternoon on this...