Selecting HTML block using BeautifulSoup help


Postby SMNALLY » Wed Sep 25, 2013 5:57 pm

I am trying to parse several div blocks from a website's HTML using Beautiful Soup. However, I cannot work out which function to use to select these div blocks. I have tried the following:

Code: Select all
   import urllib2
   from bs4 import BeautifulSoup

   def getData():
      
      html = urllib2.urlopen("websitename.com", timeout=10).read().decode('UTF-8')

      soup = BeautifulSoup(html)

      print(soup.title)
      print(soup.find_all('<div class="crBlock ">'))
      
   getData()

I want to be able to select everything between `<div class="crBlock ">` and its correct end `</div>` (obviously there are other div tags but I want to select the block all the way down to the one that represents the end of this section of html.)

Many thanks SMNALLY
Last edited by SMNALLY on Wed Sep 25, 2013 9:41 pm, edited 1 time in total.

Postby ochichinyezaboombwa » Wed Sep 25, 2013 7:08 pm

Use
Code: Select all
soup.find_all('div', {"class":"crBlock "})
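
To illustrate the suggestion above, here is a minimal, self-contained sketch. The HTML snippet is made up for the example, and the built-in html.parser backend is an assumption. It shows that find_all returns the whole matching tag, nested div elements included, so there is no need to locate the closing </div> by hand. The class is written without the trailing space in the search, because how a trailing space is matched may depend on the Beautiful Soup version.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<html><head><title>Demo</title></head><body>
<div class="crBlock ">
  <div class="inner">first nested block</div>
  <div class="inner">second nested block</div>
</div>
<div class="other">unrelated</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag name first, then a dict of attributes to match.
blocks = soup.find_all("div", {"class": "crBlock"})

for block in blocks:
    # Each result is a Tag holding everything up to its matching </div>.
    nested = block.find_all("div")
    print(len(nested), "nested divs inside the block")
    print(block.get_text(" ", strip=True))
```

Calling str(block) on a result gives back the markup of that block, which is probably what you want if the goal is to keep the HTML rather than just the text.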

Postby SMNALLY » Wed Sep 25, 2013 9:44 pm

Hi, thanks ochichinyezaboombwa, that works perfectly. I also found two alternatives and was wondering if you could advise.

What's the difference between your version and the following two:

Code: Select all
soup.find_all('div', class_="crBlock ")

soup.find_all('div', class="crBlock ")


Note that one has an underscore and one doesn't.

Many thanks

SMNALLY

Postby micseydel » Wed Sep 25, 2013 9:53 pm

The one without the underscore is preferred.

Postby SMNALLY » Wed Sep 25, 2013 10:47 pm

So functionally they do the same thing?

Postby micseydel » Wed Sep 25, 2013 11:19 pm

Actually, does using class= work for you? When I saw your question, I assumed it was a legacy-code kind of thing, but when I tried to write an example just now, I couldn't use class= without getting a SyntaxError. If it works, I'd use that, since the underscore version is probably just legacy code, and it's nicer without the underscore.
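
The SyntaxError has a simple cause: class is a reserved word in Python, so it cannot be used as a keyword argument name. As far as I know, that is why Beautiful Soup accepts class_ instead (documented since version 4.1.2), which would make the underscore form the supported spelling and not a legacy one. A minimal sketch using only the standard library; the find_all calls are only parsed as text, never executed, so no soup object is needed:

```python
import keyword

# `class` is a reserved word, so it cannot name a keyword argument.
print(keyword.iskeyword("class"))
print(keyword.iskeyword("class_"))

def compiles(source):
    """Return True if the source text parses as a valid Python expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False

# Only parsing is checked here; `soup` does not need to exist.
with_underscore = compiles('soup.find_all("div", class_="crBlock")')
without_underscore = compiles('soup.find_all("div", class="crBlock")')
print(with_underscore, without_underscore)
```

The dictionary form, {"class": "crBlock"}, sidesteps the problem because there "class" is just a string key.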

Postby ochichinyezaboombwa » Tue Oct 01, 2013 3:06 am

soup.find_all('div', {"class":"crBlock "}) means:
find all elements 'div' that have an attribute "class" and its value == "crBlock ";

The {} around "class":"crBlock " are a generic way of addressing more than one attribute simultaneously (they all are combined in a single dictionary which find_all knows what to do with).

For example,
soup.find_all('div', {"class":"crBlock ", "color": "black"}) would find all divs whose class is "crBlock " AND whose color is "black"; it won't return green ones.
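
A small sketch of that multi-attribute lookup, using invented markup and assuming the built-in html.parser backend. The color attribute is only there to mirror the example above:

```python
from bs4 import BeautifulSoup

# Made-up markup: two divs share the class but differ in "color".
html = """
<div class="crBlock" color="black">black block</div>
<div class="crBlock" color="green">green block</div>
<div class="other" color="black">other block</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Every key/value pair in the dict must match for a tag to be returned.
matches = soup.find_all("div", {"class": "crBlock", "color": "black"})
texts = [tag.get_text() for tag in matches]
print(texts)
```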

Moreover,
Code: Select all
soup.find_all('div', {"class":"crBlock"})
soup.find_all('div', {"class":"crBlock "})
There is a big difference between these two. If your HTML contains both, you're in trouble and need to do something like
Code: Select all
soup.find_all('div', {"class":"crBlock"}) + soup.find_all('div', {"class":"crBlock "})
If you have attributes "class", "class_" and "_class", you are dealing with very, very sick HTML.
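
A caveat on the trailing space, offered as a hedged sketch and not as a statement about every release: Beautiful Soup treats class as a multi-valued attribute and splits it on whitespace, so in the versions I am aware of, searching for the bare token crBlock matches a div whether or not its class attribute was written with a trailing space. The markup below is invented for illustration, and html.parser is an assumed backend.

```python
from bs4 import BeautifulSoup

html = """
<div class="crBlock">no trailing space</div>
<div class="crBlock ">trailing space</div>
<div class="crBlock extra">two classes</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ is the keyword form; it matches any single class token.
found = soup.find_all("div", class_="crBlock")
for tag in found:
    # tag["class"] is a list of tokens, not the raw attribute string.
    print(tag["class"], "->", tag.get_text())
```

If that holds for your installed version, a single search on the bare class name may be enough, and concatenating two result lists would not be needed.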