3.3 web scraper help!! [ re.findall() ]


Postby deeeeets » Sat Jun 01, 2013 10:41 am

Hello! I'm trying to write a basic web scraper in 3.3--right now just trying to grab the page titles--and when I run the code below, I get the error message that follows in red :o . Can anybody tell me in plain English :) what is going on with my "re.findall()" and how I can fix it? Thanks so much!!!

Code: Select all
import urllib
import urllib.request
import re

urls = ["http://google.com", "http://nytimes.com", "http://CNN.com"]
i = 0
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)

while i < len(urls):
    htmlfile = urllib.request.urlopen(urls[i])
    htmltext = htmlfile.read()
    titles = re.findall(pattern, htmltext)

    print(titles)
    i += 1


Code: Select all
Traceback (most recent call last):
  File "C:\Python33\0601_ScrapeBld.py", line 13, in <module>
    titles = re.findall(pattern,htmltext)
  File "C:\Python33\lib\re.py", line 201, in findall
    return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
Last edited by Yoriz on Sat Jun 01, 2013 11:08 am, edited 1 time in total.
Reason: Added code tags and re-indented the code

Re: 3.3 web scraper help!! [ re.findall() ]

Postby Yoriz » Sat Jun 01, 2013 12:08 pm

With reference to http://docs.python.org/3/library/urllib.request.html#examples
Note that urlopen returns a bytes object. This is because there is no way for urlopen to automatically determine the encoding of the byte stream it receives from the http server. In general, a program will decode the returned bytes object to string once it determines or guesses the appropriate encoding.

Code: Select all
>>> import urllib.request
>>> f = urllib.request.urlopen('http://www.python.org/')
>>> print(f.read(300))
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n\n<head>\n
<meta http-equiv="content-type" content="text/html; charset=utf-8" />\n
<title>Python Programming '


As the python.org website uses utf-8 encoding, as specified in its meta tag, we will use the same for decoding the bytes object.

Code: Select all
>>> with urllib.request.urlopen('http://www.python.org/') as f:
...     print(f.read(100).decode('utf-8'))
...
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtm


So you will need to do the appropriate decoding before you can use re.findall.
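To see how this fixes the original error, here is a minimal sketch that uses a hard-coded byte string in place of what htmlfile.read() would return (no network needed):

```python
import re

# Stand-in for urllib.request.urlopen(url).read(): it returns bytes, not str
htmlbytes = b'<html><head><title>Example Page</title></head></html>'

pattern = re.compile('<title>(.+?)</title>')

# Decode first, so the string pattern is matched against a str
htmltext = htmlbytes.decode('utf-8')
titles = pattern.findall(htmltext)
print(titles)  # ['Example Page']
```

In the original script, that means changing htmltext = htmlfile.read() to htmltext = htmlfile.read().decode('utf-8') (assuming the page really is utf-8).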

Re: 3.3 web scraper help!! [ re.findall() ]

Postby stranac » Sat Jun 01, 2013 12:15 pm

Another possibility is using a bytes pattern:
Code: Select all
regex = b'<title>(.+?)</title>'
pattern = re.compile(regex)
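With a bytes pattern, re.findall can search the raw bytes directly and no decode step is needed; note the results come back as bytes too. A small self-contained sketch:

```python
import re

# Stand-in for the raw bytes returned by urlopen(url).read()
htmlbytes = b'<html><head><title>Example Page</title></head></html>'

# A bytes pattern may be matched against a bytes object directly
pattern = re.compile(b'<title>(.+?)</title>')
titles = pattern.findall(htmlbytes)
print(titles)  # [b'Example Page'] -- the matches are bytes, not str
```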


Also, parsing html is not something you should use regular expressions for.
Consider using an existing library for that (I'm not sure which good ones support Python 3, but even the built-in parser is better than this).


Also, this is awful code:
Code: Select all
i = 0
...
while i < len(urls):
    htmlfile = urllib.request.urlopen(urls[i])
    ...
    i += 1

A better alternative:
Code: Select all
for url in urls:
    htmlfile = urllib.request.urlopen(url)

Re: 3.3 web scraper help!! [ re.findall() ]

Postby metulburr » Sat Jun 01, 2013 2:39 pm

Also, parsing html is not something you should use regular expressions for.
Consider using an existing library for that (I'm not sure which good ones support Python 3, but even the built-in parser is better than this).

Not sure about others, but BeautifulSoup 4 is compatible with 3.x.

To the OP: here is an example of BeautifulSoup obtaining the same thing.
Code: Select all
from bs4 import BeautifulSoup
from  urllib.request import urlopen

url = 'http://python-forum.org/index.php'
res = urlopen(url)
html = res.read().decode()

soup = BeautifulSoup(html)
print(soup.title.string)


output:
python-forum.org • Index page

Re: 3.3 web scraper help!! [ re.findall() ]

Postby deeeeets » Sat Jun 01, 2013 2:59 pm

Resolved! Thank you!!! Particular thanks to stranac. Great advice on all fronts.

Deeeeets

Re: 3.3 web scraper help!! [ re.findall() ]

Postby snippsat » Sat Jun 01, 2013 8:32 pm

As posted above, using regex to parse html is bad.
Sure, you can get away with a lot of stuff, but it will bring you down to a dark place :twisted:
Read this answer by bobince; it's funny and good.

So Python has BeautifulSoup and lxml; both work with 3.3.
@metulburr, some advice: do not read and decode before you pass the url in to BeautifulSoup.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one.
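A quick illustration of that auto-detection; this is a sketch assuming BeautifulSoup 4 with the built-in html.parser, feeding it raw utf-8 bytes that carry a meta charset tag:

```python
from bs4 import BeautifulSoup

# Raw bytes, as urlopen(url).read() would return them; b'\xc3\xa9' is
# the utf-8 encoding of 'e acute'
raw = b'<html><head><meta charset="utf-8"><title>Caf\xc3\xa9</title></head></html>'

# No manual .decode() -- BeautifulSoup detects the encoding itself
soup = BeautifulSoup(raw, 'html.parser')
print(soup.title.string)  # Café
```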

A couple of examples.
Code: Select all
import lxml.html

tag = lxml.html.parse('http://CNN.com')
print(tag.find(".//title").text)

Code: Select all
from bs4 import BeautifulSoup
from  urllib.request import urlopen

url = urlopen('http://CNN.com')
soup = BeautifulSoup(url)
tag = soup.find('title')
print(tag.text)

Both output:
Code: Select all
CNN.com International - Breaking, World, Business, Sports, Entertainment and Video News


Some notes: in the first example we see that lxml has its own way of reading a url; it doesn't need urlopen.
lxml also supports XPath, as used in .//title.
Regex can be used in combo with a parser, so for tag in soup.find_all(re.compile("t")): print(tag.name) will find the <title> and <html> tags.
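That regex-plus-parser combination can be sketched like this, again with a local HTML string instead of a live page (assuming BeautifulSoup 4 and the built-in html.parser):

```python
import re
from bs4 import BeautifulSoup

html = '<html><head><title>Example</title></head><body><p>hi</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# find_all accepts a compiled regex, which is matched against tag names;
# here it selects every tag whose name contains the letter "t"
matches = [tag.name for tag in soup.find_all(re.compile('t'))]
print(matches)  # ['html', 'title']
```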

Re: 3.3 web scraper help!! [ re.findall() ]

Postby metulburr » Sat Jun 01, 2013 10:50 pm

@metulburr, some advice: do not read and decode before you pass the url in to BeautifulSoup.

ah, didn't know that

