Parsing XML RSS feed byte stream for <item> tag

Postby GreySquirl » Thu Feb 07, 2013 8:40 pm

I'm attempting to parse an RSS feed for the first instance of an <item> element.
Code: Select all
import urllib2

def pageReader(url):
    try:
        readPage = urllib2.urlopen(url)
    except urllib2.HTTPError as e:
        # HTTPError is a subclass of URLError, so it has to be caught first
        # print('The server couldn\'t fulfill the request.')
        # print('Error code: ', e.code)
        return 404
    except urllib2.URLError as e:
        # print 'We failed to reach a server.'
        # print 'Reason: ', e.reason
        return 404
    else:
        outputPage = readPage.read()
    return outputPage

Assume the arguments being passed are correct. The function returns a str object whose value is simply an entire RSS feed. I've confirmed the type with:
Code: Select all
a = isinstance(value, str)
if not a:
   return -1

So, an entire RSS feed has been returned from the function call. It's at this point that I hit a brick wall: I've tried parsing the feed with BeautifulSoup, lxml and various other libs, but with no success. (I had some success with BeautifulSoup, but it wasn't able to pull certain child elements from the parent.) I'm just about ready to resort to writing my own parser, but I'd like to know if anybody has any suggestions.

To recreate my error, simply call the above function with an argument similar to:

http://www.cert.org/nav/cert_announcements.rss

You'll see I'm trying to return the first <item> child.

Code: Select all
<item>
<title>New Blog Entry: Common Sense Guide to Mitigating Insider Threats - Best Practice 16 (of 19)</title>
<link>http://www.cert.org/blogs/insider_threat/2013/02/common_sense_guide_to_mitigating_insider_threats_-_best_practice_16_of_19.html</link>
<description>This sixteenth of 19 blog posts about the fourth edition of the Common   Sense Guide to Mitigating Insider Threats describes Practice 16: Develop a formalized insider threat program.</description>
<pubDate>Wed, 06 Feb 2013 06:38:07 -0500</pubDate>
</item>

As I've said, BeautifulSoup fails to find both pubDate and link, which are crucial to my app.

Any advice would be greatly appreciated.

Re: Parsing XML RSS feed byte stream for <item> tag

Postby metulburr » Thu Feb 07, 2013 9:05 pm

It's not that BeautifulSoup cannot do it, you are just using it wrong.

From what I've seen, pubdate is not capitalized in the parsed tags (BeautifulSoup's HTML parsers convert tag names to lowercase), so if you're searching for a capital D in that tag, it makes sense that it would not find it: the tag in the soup is pubdate.

This is an example grabbing the first one. The indexes on title and desc are 1 because index 0 is the title and desc of the entire RSS feed, not of the first item.

By the way, this is written in Python 3.x, so the function return_html() grabs the HTML via the 3.x libs.

Code: Select all
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup

def return_html(url, values=None):
   header = {
      'User-Agent':
      'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1'
      }
   if values:
      data = urllib.parse.urlencode(values).encode()
   else:
      data=None

   req = urllib.request.Request(url, data, header)
   res = urllib.request.urlopen(req)
   html = res.read().decode()
   return html
   
url = 'http://www.cert.org/nav/cert_announcements.rss'
html = return_html(url)

soup = BeautifulSoup(html)

title = soup.findAll('title')
desc = soup.findAll('description')
date = soup.findAll('pubdate')

print('FIRST TITLE: {}'.format(title[1].text))
print('FIRST DESC: {}'.format(desc[1].text))
print('FIRST DATE: {}'.format(date[0].text))


And the response I get:
Code: Select all
FIRST TITLE: New Blog Entry: Common Sense Guide to Mitigating Insider Threats - Best Practice 16 (of 19)
FIRST DESC: This sixteenth of 19 blog posts about the fourth edition of the Common Sense Guide to Mitigating Insider Threats describes Practice 16: Develop a formalized insider threat program.
FIRST DATE: Wed, 06 Feb 2013 06:38:07 -0500


There are numerous ways to go about using BeautifulSoup to get the data you need.
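
A small sketch of why the lowercase name is needed, assuming BeautifulSoup is running on top of an HTML parser. It uses only the standard library's html.parser rather than BeautifulSoup itself; TagCollector and SAMPLE are made-up names, and SAMPLE is a cut-down copy of the item from the question.

```python
from html.parser import HTMLParser

# A cut-down copy of the <item> from the question, used as sample input.
SAMPLE = (
    "<item>"
    "<title>New Blog Entry</title>"
    "<pubDate>Wed, 06 Feb 2013 06:38:07 -0500</pubDate>"
    "</item>"
)


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start-tag name the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes the tag name already converted to lowercase.
        self.tags.append(tag)


collector = TagCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.tags)  # ['item', 'title', 'pubdate']
```

The capital D is gone by the time the tag reaches the tree, which is why searching for 'pubdate' works and 'pubDate' does not. If lxml is installed, passing 'xml' as the second argument to BeautifulSoup should select an XML parser that keeps the original case, but that is worth checking against the bs4 docs for your version.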

Re: Parsing XML RSS feed byte stream for <item> tag

Postby stranac » Thu Feb 07, 2013 10:15 pm

You could also use a library specially made for parsing rss feeds, such as feedparser.
It will parse the entire feed into a dictionary, and let you easily extract information.

Example:
Code: Select all
>>> import feedparser
>>> from pprint import pprint
>>> feed = feedparser.parse('http://www.cert.org/nav/cert_announcements.rss')
>>> first_item = feed['items'][0]
>>> pprint(first_item)
{'link': u'http://www.cert.org/blogs/insider_threat/2013/02/common_sense_guide_to_mitigating_insider_threats_-_best_practice_16_of_19.html',
 'links': [{'href': u'http://www.cert.org/blogs/insider_threat/2013/02/common_sense_guide_to_mitigating_insider_threats_-_best_practice_16_of_19.html',
            'rel': u'alternate',
            'type': u'text/html'}],
 'published': u'Wed, 06 Feb 2013 06:38:07 -0500',
 'published_parsed': time.struct_time(tm_year=2013, tm_mon=2, tm_mday=6, tm_hour=11, tm_min=38, tm_sec=7, tm_wday=2, tm_yday=37, tm_isdst=0),
 'summary': u'This sixteenth of 19 blog posts about the fourth edition of the Common Sense Guide to Mitigating Insider Threats describes Practice 16: Develop a formalized insider threat program.',
 'summary_detail': {'base': u'http://www.cert.org/nav/cert_announcements.rss',
                    'language': None,
                    'type': u'text/html',
                    'value': u'This sixteenth of 19 blog posts about the fourth edition of the Common Sense Guide to Mitigating Insider Threats describes Practice 16: Develop a formalized insider threat program.'},
 'title': u'New Blog Entry: Common Sense Guide to Mitigating Insider Threats - Best Practice 16 (of 19)',
 'title_detail': {'base': u'http://www.cert.org/nav/cert_announcements.rss',
                  'language': None,
                  'type': u'text/plain',
                  'value': u'New Blog Entry: Common Sense Guide to Mitigating Insider Threats - Best Practice 16 (of 19)'}}
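
Since the date matters for the app: a minimal sketch, using only the standard library, of turning the 'published' string into a datetime. It also shows why published_parsed above reports tm_hour=11 while the feed says 06:38, because the parsed value is in UTC. The variable names are just for illustration.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

# The 'published' value from the first item above.
published = 'Wed, 06 Feb 2013 06:38:07 -0500'

# RSS dates use the RFC 822 style format that email.utils understands.
local_dt = parsedate_to_datetime(published)

# Convert to UTC to compare with feedparser's published_parsed.
utc_dt = local_dt.astimezone(timezone.utc)

print(local_dt.isoformat())  # 2013-02-06T06:38:07-05:00
print(utc_dt.isoformat())    # 2013-02-06T11:38:07+00:00
```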


But there's no reason any XML parser (I highly recommend lxml) wouldn't work, if you correct the problems metulburr pointed out.

PS. There's no need to use his return_html() function. Whatever method you were using to get the source of the page should work just fine.
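
For illustration, here is a rough sketch of the plain XML parser route. It uses xml.etree.ElementTree from the standard library instead of lxml so it runs without installing anything (the find() call looks much the same in lxml.etree). FEED is made-up sample data shaped like an RSS 2.0 feed, and first_item() is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up sample data shaped like an RSS 2.0 feed.
FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <link>http://example.com/</link>
    <description>Channel-level description</description>
    <item>
      <title>First entry</title>
      <link>http://example.com/first.html</link>
      <description>First description</description>
      <pubDate>Wed, 06 Feb 2013 06:38:07 -0500</pubDate>
    </item>
    <item>
      <title>Second entry</title>
      <link>http://example.com/second.html</link>
      <description>Second description</description>
      <pubDate>Tue, 05 Feb 2013 09:00:00 -0500</pubDate>
    </item>
  </channel>
</rss>"""


def first_item(xml_text):
    """Return the children of the first <item> as a dict, or None if absent."""
    root = ET.fromstring(xml_text)  # accepts str or bytes
    # find() returns the first match in document order.
    item = root.find('./channel/item')
    if item is None:
        return None
    # XML tag names are case-sensitive, so the key is 'pubDate', not 'pubdate'.
    return {child.tag: (child.text or '').strip() for child in item}


item = first_item(FEED)
print(item['link'])     # http://example.com/first.html
print(item['pubDate'])  # Wed, 06 Feb 2013 06:38:07 -0500
```

Because this is a real XML parser, link keeps its text and pubDate keeps its capital D. The bytes you get back from read() can be passed straight to first_item() without decoding them first.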