parsing/consuming html (noob)

Postby igotapochahontas » Fri Apr 18, 2014 6:35 pm

I have googled this for over a week and have found a thousand ways to do this,
but have come up with nothing that actually works. The basic idea is for my script
to wake me up in the morning by going to a website, extracting a joke, and reading
it to me. I can't use Beautiful Soup because I'm using a scripting layer
app, not a usual Python install. I'm using Python for Android. I've also had no luck with jsoup.
This is my non-working code:
Code: Select all
import android
import urllib

droid = android.Android()
current = 0
newlist = []

sock = urllib.urlopen("http://m.funtweets.com/random")
htmlSource = sock.read()
sock.close()
rawhtml = []
rawhtml.append (htmlSource)

while current < len(rawhtml):
    while current != "<div class=":
        if [current] == "</b></a>":
            newlist.append (current)
            current += 1
         

print newlist
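For what it's worth, the loop above compares the integer counter current against strings, so the condition can never match and nothing is ever appended. A minimal sketch of the cut-between-two-markers idea with plain string methods (the sample html string here is made up to mirror the page's markup):

```python
# sample html mimicking one tweet block on the page (assumed markup)
html = ('<div class="tweet"><a href="#"><b>@user</b></a> '
        'Facebook needs a "who cares?" button.'
        '<div class="clearfix"></div></div>')

start = html.find('</b></a>') + len('</b></a>')   # index just past the opening marker
end = html.find('<div class="clearfix">')         # index where the closing marker starts
joke = html[start:end].strip()
print(joke)
```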
igotapochahontas
 
Posts: 20
Joined: Thu Apr 10, 2014 5:51 pm

Re: parsing/consuming html (noob)

Postby metulburr » Fri Apr 18, 2014 6:48 pm

If it has the lxml module you can use
Code: Select all
lxml.html


Something along the lines of:
Code: Select all
try:
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen
import lxml.html

code = urlopen("http://m.funtweets.com/random")
html = lxml.html.fromstring(code.read())
for el in html.xpath('//div[@class="tweet"]'):
    print(el.text_content())


I would give you a better example, but I am pretty bad at it myself. Stranac could give you examples for this if it would work for you.

EDIT:
the urlopen is pointless for lxml
Code: Select all
import lxml.html

html = lxml.html.parse('http://m.funtweets.com/random')
users = html.xpath('//div[@class="tweet"]/a[@class="tweet-user-link"]')

for user in users:
    print(user.text_content())
metulburr
 
Posts: 1512
Joined: Thu Feb 07, 2013 4:47 pm
Location: Elmira, NY

Re: parsing/consuming html (noob)

Postby igotapochahontas » Fri Apr 18, 2014 7:00 pm

Edit: I do have beautiful soup. Sorry, I'm new to this. I've tried using lxml, but I will try your code...
igotapochahontas

Re: parsing/consuming html (noob)

Postby igotapochahontas » Fri Apr 18, 2014 7:09 pm

"No such module" lxml.html. I'm not sure how to add modules to sl4a; if it's not
available from the module page, I hear it's a pretty involved process...
igotapochahontas

Re: parsing/consuming html (noob)

Postby snippsat » Fri Apr 18, 2014 7:46 pm

lxml will for sure not work for Android (sl4a), because lxml also needs a couple of C libraries as part of the install.
You can use BeautifulSoup, the famous 3.08 version.
There is no install, because all of it is in one .py file, BeautifulSoup.py.
Just place it in the same location as the code you are running.
Code: Select all
from BeautifulSoup import BeautifulSoup

html = """\
<html>
<head>
   <title>html page</title>
</head>
<body>
  <sometag>abc</sometag>
</body>
</html>
"""
soup = BeautifulSoup(html)
tag = soup.find('sometag')
my_data = tag.text
print my_data  # abc

The newer version of BeautifulSoup (bs4) is larger, and may need an install.
For many years BeautifulSoup was just one Python file, which of course makes it easy to get it working on all platforms.
snippsat
 
Posts: 273
Joined: Thu Feb 21, 2013 12:04 am

Re: parsing/consuming html (noob)

Postby igotapochahontas » Fri Apr 18, 2014 10:45 pm

OK, got that to work. Now I need to strip all the html gibberish away from the data I'm looking for.
I used:
Code: Select all
page = urllib2.urlopen("http://www.m.funtweets.com/random")
soup = BeautifulSoup(page)
print soup.find("div", {"class" : "tweet"})

It returns TONS of html that I don't want or understand, and the Beautiful Soup documentation doesn't specify how to extract
exactly what I want from it (unless I'm missing something). It returns:
Code: Select all
(html)
<div class="tweet">
         <a href="http://m.funtweets.com/u/SEAempire" class="tweet-user-link"><img src="http://pbs.twimg.com/profile_images/431902438065336322/uv8F1rmv_normal.jpeg"><b><span>@</span>SEAempire</b></a> Facebook needs a "who cares?" button.
         <div class="clearfix"></div>
         <a href="http://twitter.com/intent/retweet?tweet_id=23238411636703232&related=fun_tweets%3AFunny%20witty%20silly%20tweets" class="button">Retweet</a> <a href="http://twitter.com/intent/favorite?tweet_id=23238411636703232&related=fun_tweets%3AFunny%20witty%20silly%20tweets" class="button">Favorite</a> <a href="http://m.funtweets.com/t/3452" class="button date">Jan 7 2011</a>
      </div>

I only want "Facebook needs a "who cares?" button." I would like to start saving to a list or file after:
Code: Select all
</b></a>

and then stop saving at:
Code: Select all
<div class="clearfix">

I can easily save the above soup to a list, but then how do I parse out the extra junk? (A list sees the whole thing as one element, and Beautiful Soup doesn't have a command unique to this task.) Any suggestions?
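If BeautifulSoup keeps fighting you, the standard library's html.parser can also do this particular job. A sketch (Python 3 names; on Python 2 the import is from HTMLParser import HTMLParser, and the sample markup is trimmed from the output pasted above) that keeps only the text sitting directly inside the tweet div, skipping everything inside nested tags:

```python
from html.parser import HTMLParser


class TweetText(HTMLParser):
    """Collect text that sits directly inside <div class="tweet">."""

    def __init__(self):
        super().__init__()
        self.in_tweet = False   # are we inside the tweet div?
        self.depth = 0          # nesting level below the tweet div
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.in_tweet:
            self.depth += 1
        elif tag == 'div' and ('class', 'tweet') in attrs:
            self.in_tweet = True

    def handle_endtag(self, tag):
        if self.in_tweet:
            if self.depth == 0:
                self.in_tweet = False   # closing the tweet div itself
            else:
                self.depth -= 1

    def handle_data(self, data):
        # only keep text at depth 0, i.e. not inside <a>, <b>, etc.
        if self.in_tweet and self.depth == 0:
            self.parts.append(data)


p = TweetText()
p.feed('<div class="tweet"><a href="#"><b>@user</b></a> '
       'Facebook needs a "who cares?" button.'
       '<div class="clearfix"></div></div>')
print(''.join(p.parts).strip())
```

The depth counter is what skips the username and the Retweet/Favorite links: their text arrives while depth > 0, so it is never appended.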
igotapochahontas

Re: parsing/consuming html (noob)

Postby snippsat » Sat Apr 19, 2014 12:23 am

I only want "facebook needs a "who cares?" button." I would like to start saving to a list or file after:

Code: Select all
>>> html = soup.find("div", {"class" : "tweet"})
>>> html
<div class="tweet">
<a href="http://m.funtweets.com/u/SEAempire" class="tweet-user-link"><img src="http://pbs.twimg.com/profile_images/431902438065336322/uv8F1rmv_normal.jpeg" /><b><span>@</span>SEAempire</b></a> Facebook needs a "who cares?" button.
         <div class="clearfix"></div>
<a href="http://twitter.com/intent/retweet?tweet_id=23238411636703232&amp;related=fun_tweets%3AFunny%20witty%20silly%20tweets" class="button">Retweet</a> <a href="http://twitter.com/intent/favorite?tweet_id=23238411636703232&amp;related=fun_tweets%3AFunny%20witty%20silly%20tweets" class="button">Favorite</a> <a href="http://m.funtweets.com/t/3452" class="button date">Jan 7 2011</a>
</div>

>>> html.contents[2]
u' Facebook needs a "who cares?" button.\n         '
snippsat

Re: parsing/consuming html (noob)

Postby igotapochahontas » Thu Apr 24, 2014 1:44 pm

This is the code for extracting one joke with no html tags if anyone is interested:
Code: Select all
import re
import urllib2
def remove_html_tags(text):
    pattern = re.compile(r'</b></a>')
    return pattern.sub('', text)

page = urllib2.urlopen("http://www.m.funtweets.com/random").read()
umatch = re.search(r"<span>@</span>(\w+)", page)
user = umatch.group()
utext = re.search(r"</b></a> (\w.*)", page)
text = utext.group()
print remove_html_tags(text)
 
igotapochahontas

Re: parsing/consuming html (noob)

Postby snippsat » Thu Apr 24, 2014 2:55 pm

This is the code for extracting one joke with no html tags if anyone is interested:

It's terrible; the code first removes the html tag and then extracts :(
I guess you have gotten the code from another place?
In the other post here, I did post a regex for all jokes.
Just to make it very clear: regex for scraping html is a bad solution. I only made it because Android (sl4a) can have problems with 3rd-party libraries.

Just a little better (still bad) regex solution for one joke.
Code: Select all
import re
import urllib2

page = urllib2.urlopen("http://www.m.funtweets.com/random").read()
print re.search(r'</b></a> (\w.*)', page).group(1)

Output:
Code: Select all
Everytime I hold someone's baby, I whisper "You aint shit" into their ear. Just to bring their huge baby-ego back down to Earth.
snippsat

Re: parsing/consuming html (noob)

Postby igotapochahontas » Sun Jul 06, 2014 4:35 pm

^ If that code worked before, it doesn't now. I think they changed something on the site. Here's my traceback:
Traceback (most recent call last):
  File "/storage/sdcard0/sl4a/scripts/itcantbedone.py", line 6, in <module>
    print re.search(r'</b></a>(\w.*)',page).group(1)
AttributeError: 'NoneType' object has no attribute 'group'
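That traceback is what happens when re.search finds no match: it returns None, and calling .group on None crashes. It's not a fix for the site change itself, but checking the match first turns the crash into a readable message. A sketch with a made-up page string standing in for the fetched html:

```python
import re

# stand-in for the fetched page after the site changed its markup
page = '<div class="tweet">layout changed, no anchor pattern here</div>'

match = re.search(r'</b></a> (\w.*)', page)
if match:
    print(match.group(1))
else:
    print('joke pattern not found; the page layout probably changed')
```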
igotapochahontas

