BeautifulSoup extract string from href


Postby metulburr » Mon Apr 22, 2013 3:52 pm

I am not sure why I am getting the whole <a> tag, as I have tried index['href'], index.text, and index.string. What is weird is that when I print the value I get the '1.png' string I want, but when I pass it into urljoin to download it, the <a> tag somehow still remains, even though the print did not show it.

I have tried:
Code: Select all
u = urllib.parse.urljoin(url,index['href'])

and
Code: Select all
u = urllib.parse.urljoin(url,index.text)

and
Code: Select all
u = urllib.parse.urljoin(url,index.string)
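A quick standalone check (a sketch, assuming a minimal tag like the ones on that page) confirms that all three accessors do return the plain string:
Code: Select all
```python
from bs4 import BeautifulSoup

# A minimal tag like the ones in the directory listing
soup = BeautifulSoup('<a href="1.png">1.png</a>', 'html.parser')
tag = soup.find('a')

print(tag['href'])   # '1.png' -- the href attribute
print(tag.text)      # '1.png' -- the tag's text content
print(tag.string)    # '1.png' -- a NavigableString, compares equal to str
```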


Code: Select all
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import shutil
import os


url = 'http://littlealchemy.com/img/base/'
req = urllib.request.urlopen(url)
html = req.read().decode()

soup = BeautifulSoup(html)
print(html)
lister = soup.findAll('a')

for index in lister:
   try:
      u = urllib.parse.urljoin(url,index['href'])
      f = urllib.request.urlopen(u)
      print('downloading {}'.format(u))
      with open(index,'wb') as lf:
         shutil.copyfileobj(f,lf)
         
   except Exception as e:
      print(e)


But regardless, I get this output:
...
invalid file: <a href="95.png">95.png</a>
downloading http://littlealchemy.com/img/base/96.png
invalid file: <a href="96.png">96.png</a>
downloading http://littlealchemy.com/img/base/97.png
invalid file: <a href="97.png">97.png</a>
downloading http://littlealchemy.com/img/base/98.png
invalid file: <a href="98.png">98.png</a>
downloading http://littlealchemy.com/img/base/99.png
invalid file: <a href="99.png">99.png</a>

Re: BeautifulSoup extract string from href

Postby stranac » Mon Apr 22, 2013 4:19 pm

This is the line that's raising the error:
Code: Select all
      with open(index,'wb') as lf:

I don't think you wanted to pass index as filename.

Also, don't catch Exception, catch specific exceptions instead.
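A sketch of what that block could look like instead, using the href string both for the URL join and as the local filename, and catching a specific exception (the function name and structure here are my own guess at the intent):
Code: Select all
```python
import shutil
import urllib.error
import urllib.parse
import urllib.request

def save_link(base_url, href):
    """Download base_url + href, saving under the href string as the filename."""
    u = urllib.parse.urljoin(base_url, href)
    try:
        f = urllib.request.urlopen(u)
    except urllib.error.URLError as e:   # specific exception, not bare Exception
        print('skipping {}: {}'.format(u, e))
        return None
    with open(href, 'wb') as lf:         # href is a plain string; the Tag is not
        shutil.copyfileobj(f, lf)
    return href
```
Then the loop body is just save_link(url, index['href']).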

Re: BeautifulSoup extract string from href

Postby snippsat » Mon Apr 22, 2013 5:04 pm

Code: Select all
lister = soup.findAll('a')

You are using the new BeautifulSoup (bs4); there the method is called soup.find_all('a') (a better name, as it follows PEP 8 advice).

To write something that downloads all the images to my hdd, I use urlretrieve(); then I don't need shutil.
Also, lister is a terrible variable name ;)
Code: Select all
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup


url = 'http://littlealchemy.com/img/base/'
req = urllib.request.urlopen(url)
soup = BeautifulSoup(req)
links = soup.find_all('a')
for img_link in links:
    if img_link['href'].endswith('.png'):
        img_name = ('{}{}'.format(req.geturl(),img_link['href']))
        urllib.request.urlretrieve(img_name, img_link['href'])

Re: BeautifulSoup extract string from href

Postby metulburr » Mon Apr 22, 2013 6:39 pm

Oh, I am a dumbass, sending the index as the filename.

Yeah, it wasn't great code; it was literally to save time instead of manually downloading each one. I made a quick fix before you both posted and just used a for loop with range(400), since the filenames were integers, lol.

Actually, now on the subject: urlretrieve(). For some reason I thought that wouldn't grab images; that is why I used shutil. I was thinking it was mainly used to download zipped archives.

Re: BeautifulSoup extract string from href

Postby snippsat » Mon Apr 22, 2013 7:43 pm

Actually, now on the subject: urlretrieve(). For some reason I thought that wouldn't grab images

urlretrieve() will grab any file type.

The reason to use shutil.copyfileobj() is for downloading large files, because it copies in chunks (16*1024 bytes by default) instead of reading the whole file into memory.
The source code of shutil.copyfileobj():
Code: Select all
def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    while 1:
        buf = fsrc.read(length)
        if not buf:
            break
        fdst.write(buf)

You can write the same for urllib.
But because of the size of most images, chunk size is not important here.
Code: Select all
import urllib.request

req = urllib.request.urlopen(url)
CHUNK = 16 * 1024
with open(file, 'wb') as fp:
    while True:
        chunk = req.read(CHUNK)
        if not chunk:
            break
        fp.write(chunk)

