retrieve url contents (dynamic gallery?)

Postby Kebap » Tue Oct 01, 2013 8:49 pm

Hi guys, I am a bit puzzled. I am working on a script (an ugly one) that will help us download lots of our own photos from a friendly social site. However, I have run into an odd problem that I can't quite grasp.

I have this url:
* http://animexx.onlinewelten.com/cosplay ... 6/#seite=0

At the bottom of the page, there is a little gallery showing 15 of a total of 21 photos. I want to download all of those, and also the ones on the next page.

So when I click the link for page 2 (in my browser), I reach the following URL:
* http://animexx.onlinewelten.com/cosplay ... 6/#seite=1

It is the same url, only with something added at the end. There, I can see the last 6 photos of this gallery (in my browser). So far, so very good.

The only thing is: when I fetch those exact URLs via urllib2.urlopen, I always get the contents of the first page, no matter which page I request:

Code: Select all
>>> from urllib2 import urlopen
>>> url1 = "http://animexx.onlinewelten.com/cosplay/mitglied/6125/order_1_0/4336/#seite=0"
>>> url2 = "http://animexx.onlinewelten.com/cosplay/mitglied/6125/order_1_0/4336/#seite=1"
>>> url1 == url2
False
>>> urlopen(url1).read() == urlopen(url2).read()
True


What is happening there? I briefly see a little spinning icon (in my browser) when clicking the gallery page numbers. Maybe there is some dynamic refresh going on?! How can Python get at this?

Re: retrieve url contents (dynamic gallery?)

Postby micseydel » Tue Oct 01, 2013 10:05 pm

If you examine the HTML source code, you'll see that there's a JavaScript function associated with that link (the href is "#", which just points back to the same page).
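(Worth noting, too: everything after the # in a URL is a fragment, which the browser keeps client-side and never sends to the server - so urlopen is fetching exactly the same page both times. A quick sanity check, just a sketch:)

Code: Select all
# The "#seite=..." part is a URL fragment; urllib2 never sends it to the
# server, so both requests ask for exactly the same resource.
url1 = "http://animexx.onlinewelten.com/cosplay/mitglied/6125/order_1_0/4336/#seite=0"
url2 = "http://animexx.onlinewelten.com/cosplay/mitglied/6125/order_1_0/4336/#seite=1"
print url1.split('#')[0] == url2.split('#')[0]  # True - the part sent to the server is identical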

As for a solution, I don't know one. Whenever I see JavaScript like that, I just give up. I've made some half-assed efforts to learn JavaScript, thinking I could reverse-engineer the JS they write and then write Python code that exploits whatever I learned, but I've never made it that far. I learned HTML and CSS about 8 years ago and never moved on to JS.

If you do figure it out, I would love to know what you did.

Re: retrieve url contents (dynamic gallery?)

Postby stranac » Wed Oct 02, 2013 3:15 pm

You basically have two options here:
  1. Reverse-engineer the JavaScript - the page seems to be reading escaped HTML data from URLs like
    http://animexx.onlinewelten.com/cosplay ... calc=false
    You could try parsing this data yourself (see the sketch below).
  2. Use a Python library that can execute JavaScript - I have never really used one, but I do know they exist.
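For instance, a rough sketch of option 1 might look like the following; the parameter names are only what the observed request seems to use, so adjust to whatever the network tab actually shows:

Code: Select all
# Sketch: request the AJAX URL directly and parse the HTML it returns.
# Parameter names are copied from the observed request; adjust as needed.
from urllib2 import urlopen
from urllib import urlencode

def fetch_gallery_page(cosplay_nr, seite):
    base = "http://animexx.onlinewelten.com/cosplay/ajax-cosplay-filter/"
    params = urlencode({"cosplay": cosplay_nr, "seite": seite,
                        "sortierung": 1, "recalc": "false"})
    return urlopen(base + "?" + params).read()

# Unlike the "#seite=" URLs, page 0 and page 1 should now really differ:
# print fetch_gallery_page(4336, 0) == fetch_gallery_page(4336, 1)  # expected: False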

Re: retrieve url contents (dynamic gallery?)

Postby Kebap » Wed Oct 02, 2013 5:41 pm

stranac wrote:Reverse-engineer the JavaScript - the page seems to be reading escaped HTML data from URLs like
http://animexx.onlinewelten.com/cosplay ... calc=false
You could try parsing this data yourself

Thanks, how did you arrive at that link? I will see if I can work with this. It is similar to what I was trying before anyway.

Re: retrieve url contents (dynamic gallery?)

Postby stranac » Thu Oct 03, 2013 7:35 am

I opened the network tab in Firebug and clicked the link.

Re: retrieve url contents (dynamic gallery?)

Postby Kebap » Thu Oct 03, 2013 6:40 pm

I have now successfully completed my task! Thank you guys for helping.
micseydel wrote:If you do figure it out, I would love to know what you did.

I worked with the link that stranac found. There, I can change the URL to see the contents of the next pages, as I would have expected from the get-go. Do you want more info? I can certainly paste my whole code, only it is very ugly :mrgreen:

Re: retrieve url contents (dynamic gallery?)

Postby stranac » Thu Oct 03, 2013 8:57 pm

You should paste it so we can shout at you for writing ugly code. :twisted:

Re: retrieve url contents (dynamic gallery?)

Postby Kebap » Fri Oct 04, 2013 9:01 am

Alright then, here goes nothing! :D

Code: Select all
from bs4 import BeautifulSoup
from urllib2 import urlopen
from urlparse import urljoin
from urllib import urlretrieve
from os.path import join, basename, abspath, exists, isdir
from os import chdir, makedirs
from time import sleep
 
 
def lese_seite1(url):
  # IN: source_url of overview page for all cosplays
  # OUT: list of numbers of those cosplays
  html_file = urlopen(url)
  html_doc = html_file.read()
  html_doc = html_doc.replace("""document.write(\'<scr\' + \'ipt src="\' + src + \'"></scr\' + \'ipt>\');""", "")
  # This is needed as I have no LXML and the default parser can't handle this line obviously
  soup = BeautifulSoup(html_doc)
 
  # All links to cosplay sub-pages
  links_doc = soup.find(id="kost_sort_table").find_all("a")
 
  links = set()
  for link in links_doc:
      link_URL = link.get("href")
      links.add(urljoin(html_file.geturl(), link_URL))
 
  # links LIKE [u'http://animexx.onlinewelten.com/cosplay/mitglied/6125/order_1_0/3188/', ]

  print u"%s cosplays found." % len(links)
 
  # grab cosplay numbers from links
  stripme = len("http://animexx.onlinewelten.com/cosplay/mitglied/6125/order_1_0/")
  cosplays = []
 
  for cos in links:
    cos = cos[stripme:-1]
    cos = int(cos)
    cosplays.append(cos)
 
  return cosplays
 

def lese_cosname(cosplay):
  url = "".join((source_url, str(cosplay)))
  html_file = urlopen(url)
  html_doc = html_file.read()
  html_doc = html_doc.replace("""document.write(\'<scr\' + \'ipt src="\' + src + \'"></scr\' + \'ipt>\');""", "")
  soup = BeautifulSoup(html_doc)
 
  # fetch headline which contains cosplay name
  cos_name = soup.find_all("h1")[0].get_text()
  # cos_name LIKE: Dachbodenluke als: Adora  (Magna Carta)
  cos_name = cos_name[cos_name.find(":") + 2 :]
  cos_name = cos_name.encode('ascii', 'ignore')
  cos_name = cos_name.replace(":", ";")
  # cos_name LIKE: Adora(Magna Carta)
  print "Looking at: {}".format(cos_name)
 
  return cos_name
 

def make_cosplay_url(cosplay_nr, seite):
  # thanks for this hack to stranac who can use firebug
  url = "http://animexx.onlinewelten.com/cosplay/ajax-cosplay-filter/?"
  url += "&query=&query_id=0&sortierung=1&recalc=false"
  url += "&cosplay={}&seite={}".format(cosplay_nr, seite)
  return url
 
 
def lese_seite2(cosplay):
  seite = 0
  foto_ids = []
 
  while True:
    html_doc = urlopen(make_cosplay_url(cosplay, seite))
    html_content = html_doc.read()
    soup = BeautifulSoup(html_content)
    fotos_zuvor = len(foto_ids)
 
    for td in soup.find_all("td"):
      foto_ids.append(int(td.get("data-fotoid")))
 
    if len(foto_ids) == fotos_zuvor:
      return foto_ids
    else:
      seite += 1
 
 
def lese_seite3(url):
  # Reads the page of a single photo. Will return a link to and the name of said photo
  # IN: "http://animexx.onlinewelten.com/cosplay/mitglied/6125/order_1_0/2630/16340/"
  # OUT 1: 'http://media.animexx.onlinewelten.com/himitsu/fotos/7/7/116/16340.jpg?st=kSAyUrfISog_OabrNxtxow&e=1380657600'
  # OUT 2: '16340.jpg'
 
  html_file = urlopen(url)
  html_doc = html_file.read()
  html_doc = html_doc.replace("""document.write(\'<scr\' + \'ipt src="\' + src + \'"></scr\' + \'ipt>\');""", "")
  soup = BeautifulSoup(html_doc)
 
  # Find area that contains the photo, and the img-tag inside that
  links_doc = soup.find(id="PhotoContainer").find_all("img")
 
  # Read link to and name of photo-file. Only 1 photo expected here
  if len(links_doc) == 1:
    link = links_doc[0].get("src")
    foto_name = get_fotoname(link)
  else:
    raise ValueError("Page does not contain exactly 1 photo!")
 
  return link, foto_name
 
 
 
 
def get_fotoname(url):
  # IN: http://media.animexx.onlinewelten.com/himitsu/fotos/7/7/116/16340.jpg?st=FbUrsVZcwA2nzc2pk3Ssvg&e=1380652200
  # STEP 2: 16340.jpg?st=FbUrsVZcwA2nzc2pk3Ssvg&e=1380652200
  # STEP 3: 16340.jpg
  fotoname = basename(url)
  return fotoname[:fotoname.find("?")]
 
 
 
def wget(url, destination):
  try:
    urlretrieve(url, destination)
    print ".",
    # print 'saved {} to {}'.format(url, abspath(destination))
  except IOError:
    print 'problem saving url {} to destination {}'.format(url, abspath(destination))
 
 
 
if __name__ == "__main__":
  folder_root = r"F:\#own\python\34-mexx foto download\dwnld"
  source_url = "http://animexx.onlinewelten.com/cosplay/mitglied/6125/order_1_0/"

  cosplays = lese_seite1(source_url)
  chdir(folder_root)
 
  for cosplay in cosplays:
    # prepare folder structure for cosplay
    cos_name = lese_cosname(cosplay)
    if not isdir(cos_name):
      makedirs(cos_name)
 
    fotos = lese_seite2(cosplay)
    for foto in fotos:
      foto_page = urljoin(source_url, "/".join((str(cosplay), str(foto))))
      link, foto_name = lese_seite3(foto_page)
      folder_cos_and_filename = join(folder_root, cos_name, foto_name)
      wget(link, folder_cos_and_filename)


This code will actually "idle out" my IDE. It works just fine from the command line, but the IDE stops responding and stops producing output after a few seconds of running. This is PyScripter on Win7, btw. I am not too eager to rework this code, as it did its job and will likely not be needed again. However, feel free to point out any major blunders or ways I could improve. :geek:

Re: retrieve url contents (dynamic gallery?)

Postby Kebap » Wed Oct 23, 2013 10:47 am

I can't blame you for ignoring this ugly code. I would have probably done the same.

Otherwise, feel free to give me any hints as to what can be improved next time. :geek:

Re: retrieve url contents (dynamic gallery?)

Postby stranac » Wed Oct 23, 2013 11:13 am

Two-space indentation is yucky.
You have unused imports.

This looks like it should be in a function:
Code: Select all
html_file = urlopen(url)
html_doc = html_file.read()
html_doc = html_doc.replace("""document.write(\'<scr\' + \'ipt src="\' + src + \'"></scr\' + \'ipt>\');""", "")
soup = BeautifulSoup(html_doc)
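Something like this, for instance - just a sketch, reusing the imports your script already has:

Code: Select all
def get_soup(url):
    # Fetch a page, strip the document.write() line the default parser
    # chokes on, and return the parsed soup.
    html_doc = urlopen(url).read()
    html_doc = html_doc.replace("""document.write(\'<scr\' + \'ipt src="\' + src + \'"></scr\' + \'ipt>\');""", "")
    return BeautifulSoup(html_doc)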


I don't like this next part. In the general case, I would use implicit string concatenation inside parens.
In this specific case, I would use urllib.urlencode() and get rid of this function entirely.
Code: Select all
url = "http://animexx.onlinewelten.com/cosplay/ajax-cosplay-filter/?"
url += "&query=&query_id=0&sortierung=1&recalc=false"
url += "&cosplay={}&seite={}".format(cosplay_nr, seite)


Also, BeautifulSoup :? :roll:

Re: retrieve url contents (dynamic gallery?)

Postby metulburr » Wed Oct 23, 2013 11:22 am

I am not a fan of:
Code: Select all
from os.path import join, basename, abspath, exists, isdir

but rather:
Code: Select all
import os
os.path.join
os.path.exists
etc..


The two-space indentation also drives me nuts when just reading it.

Re: retrieve url contents (dynamic gallery?)

Postby Kebap » Wed Oct 23, 2013 2:00 pm

Thanks!

I actually prefer 2-space indentation over 4 spaces. I guess it is a matter of taste. Anyway, I have now increased my indentation for your reading pleasure:

Code: Select all
from bs4 import BeautifulSoup
from urllib2 import urlopen
from urlparse import urljoin
from urllib import urlretrieve
from os.path import join, basename, abspath, exists, isdir
from os import chdir, makedirs
from time import sleep


def lese_seite1(url):
    # IN: source_url of overview page for all cosplays
    # OUT: list of numbers of those cosplays
    html_file = urlopen(url)
    html_doc = html_file.read()
    html_doc = html_doc.replace("""document.write(\'<scr\' + \'ipt src="\' + src + \'"></scr\' + \'ipt>\');""", "")
    # This is needed as I have no LXML and the default parser can't handle this line obviously
    soup = BeautifulSoup(html_doc)

    # All links to cosplay sub-pages
    links_doc = soup.find(id="kost_sort_table").find_all("a")

    links = set()
    for link in links_doc:
        link_URL = link.get("href")
        links.add(urljoin(html_file.geturl(), link_URL))

    # links LIKE [u'http://animexx.onlinewelten.com/cosplay/mitglied/6125/order_1_0/3188/', ]

    print u"%s cosplays found." % len(links)

    # grab cosplay numbers from links
    stripme = len("http://animexx.onlinewelten.com/cosplay/mitglied/6125/order_1_0/")
    cosplays = []

    for cos in links:
        cos = cos[stripme:-1]
        cos = int(cos)
        cosplays.append(cos)

    return cosplays


def lese_cosname(cosplay):
    url = "".join((source_url, str(cosplay)))
    html_file = urlopen(url)
    html_doc = html_file.read()
    html_doc = html_doc.replace("""document.write(\'<scr\' + \'ipt src="\' + src + \'"></scr\' + \'ipt>\');""", "")
    soup = BeautifulSoup(html_doc)

    # fetch headline which contains cosplay name
    cos_name = soup.find_all("h1")[0].get_text()
    # cos_name LIKE: Dachbodenluke als: Adora  (Magna Carta)
    cos_name = cos_name[cos_name.find(":") + 2 :]
    cos_name = cos_name.encode('ascii', 'ignore')
    cos_name = cos_name.replace(":", ";")
    # cos_name LIKE: Adora(Magna Carta)
    print "Looking at: {}".format(cos_name)

    return cos_name


def make_cosplay_url(cosplay_nr, seite):
    # thanks for this hack to stranac who can use firebug
    url = "http://animexx.onlinewelten.com/cosplay/ajax-cosplay-filter/?"
    url += "&query=&query_id=0&sortierung=1&recalc=false"
    url += "&cosplay={}&seite={}".format(cosplay_nr, seite)
    return url


def lese_seite2(cosplay):
    seite = 0
    foto_ids = []

    while True:
        html_doc = urlopen(make_cosplay_url(cosplay, seite))
        html_content = html_doc.read()
        soup = BeautifulSoup(html_content)
        fotos_zuvor = len(foto_ids)

        for td in soup.find_all("td"):
            foto_ids.append(int(td.get("data-fotoid")))

        if len(foto_ids) == fotos_zuvor:
            return foto_ids
        else:
            seite += 1


def lese_seite3(url):
    # Reads the page of a single photo. Will return a link to and the name of said photo
    # IN: "http://animexx.onlinewelten.com/cosplay/mitglied/6125/order_1_0/2630/16340/"
    # OUT 1: 'http://media.animexx.onlinewelten.com/himitsu/fotos/7/7/116/16340.jpg?st=kSAyUrfISog_OabrNxtxow&e=1380657600'
    # OUT 2: '16340.jpg'

    html_file = urlopen(url)
    html_doc = html_file.read()
    html_doc = html_doc.replace("""document.write(\'<scr\' + \'ipt src="\' + src + \'"></scr\' + \'ipt>\');""", "")
    soup = BeautifulSoup(html_doc)

    # Find area that contains the photo, and the img-tag inside that
    links_doc = soup.find(id="PhotoContainer").find_all("img")

    # Read link to and name of photo-file. Only 1 photo expected here
    if len(links_doc) == 1:
        link = links_doc[0].get("src")
        foto_name = get_fotoname(link)
    else:
        raise ValueError("Page does not contain exactly 1 photo!")

    return link, foto_name




def get_fotoname(url):
    # IN: http://media.animexx.onlinewelten.com/himitsu/fotos/7/7/116/16340.jpg?st=FbUrsVZcwA2nzc2pk3Ssvg&e=1380652200
    # STEP 2: 16340.jpg?st=FbUrsVZcwA2nzc2pk3Ssvg&e=1380652200
    # STEP 3: 16340.jpg
    fotoname = basename(url)
    return fotoname[:fotoname.find("?")]



def wget(url, destination):
    try:
        urlretrieve(url, destination)
        print ".",
        # print 'saved {} to {}'.format(url, abspath(destination))
    except IOError:
        print 'problem saving url {} to destination {}'.format(url, abspath(destination))



if __name__ == "__main__":
    folder_root = r"F:\#own\python\34-mexx foto download\dwnld"
    source_url = "http://animexx.onlinewelten.com/cosplay/mitglied/6125/order_1_0/"

    cosplays = lese_seite1(source_url)
    chdir(folder_root)

    for cosplay in cosplays:
        # prepare folder structure for cosplay
        cos_name = lese_cosname(cosplay)
        if not isdir(cos_name):
            makedirs(cos_name)

        fotos = lese_seite2(cosplay)
        for foto in fotos:
            foto_page = urljoin(source_url, "/".join((str(cosplay), str(foto))))
            link, foto_name = lese_seite3(foto_page)
            folder_cos_and_filename = join(folder_root, cos_name, foto_name)
            wget(link, folder_cos_and_filename)


I was under the impression that if I do "import os", it will pull too much unused stuff into memory. That is why I only wanted to do "from os.path import join", etc.
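(A quick check suggests the whole module is loaded either way; the from-import only changes which names end up bound in my namespace:)

Code: Select all
import sys
from os.path import join
# The full os / os.path modules are loaded regardless of the import style;
# only the names bound in this namespace differ.
print 'os' in sys.modules, 'os.path' in sys.modules  # True True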

The weird string concatenation (the document.write line) is copied directly from the HTML source. BeautifulSoup could not parse it, which is why I strip it out. I deliberately chose Soup, as lxml was not available on my workstation. What would be other alternatives, and why is Soup so :rolleyez:?

I had never heard of urllib.urlencode(). It seems like a good tool for the job. In my case, all parameters were constant and only 2 ever changed, which is why I chose that odd approach in a function of its own, make_cosplay_url(). I also never knew about implicit string concatenation. From what I read, it may be deprecated soon, though.

Re: retrieve url contents (dynamic gallery?)

Postby stranac » Wed Oct 23, 2013 2:29 pm

Kebap wrote:What would be other alternatives, and why is Soup so :rolleyez:?

lxml is probably the best thing out there, html5lib is also great, and there's even xml.etree.ElementTree in the stdlib, but I don't know if it can parse broken HTML.
Also, for scraping stuff, I usually use scrapy.

And Beautiful Soup is just awful, it's so unintuitive and hard to use.
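For comparison, the td loop from lese_seite2() might look roughly like this with lxml (a sketch - html_content here stands for the AJAX response body from your script):

Code: Select all
# Same idea as the soup.find_all("td") loop, but with lxml and an xpath.
import lxml.html

tree = lxml.html.fromstring(html_content)
foto_ids = [int(td.get("data-fotoid")) for td in tree.xpath("//td[@data-fotoid]")]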

Kebap wrote:I had never heard of urllib.urlencode(). It seems like a good tool for the job.

Using requests would be even better.
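With requests, the query parameters can just go in a dict - roughly like this, with the parameter names lifted from make_cosplay_url() above:

Code: Select all
import requests

params = {"query": "", "query_id": 0, "sortierung": 1,
          "recalc": "false", "cosplay": 4336, "seite": 0}
response = requests.get("http://animexx.onlinewelten.com/cosplay/ajax-cosplay-filter/",
                        params=params)
html_content = response.text  # same HTML fragment the AJAX call returns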

