Parsing Web page with Dynamic content

Post by somnathpal49 » Fri Mar 29, 2013 9:43 am

Hello!

I want to parse a web page with dynamic content (such as href links, images, etc.). I heard that this is possible by incorporating web browser capability into a Python script. Does anyone have an idea how to do that, that is, how to retrieve the full page with all of its dynamic content?

Thanks.

Re: Parsing Web page with Dynamic content

Post by metulburr » Fri Mar 29, 2013 9:49 am

New Users, Read This
OS Ubuntu 14.04, Arch Linux, Gentoo, Windows 7/8
https://github.com/metulburr
steam

Re: Parsing Web page with Dynamic content

Post by setrofim » Fri Mar 29, 2013 9:52 am


Re: Parsing Web page with Dynamic content

Post by metulburr » Sat Mar 30, 2013 7:42 am

What are the good points, bad points, and just "what exactly is" of Selenium, Scrapy, and mechanize?

Suppose you're working with a site that has a lot of JavaScript?

Re: Parsing Web page with Dynamic content

Post by setrofim » Sat Mar 30, 2013 8:58 am

metulburr wrote: What are the good points, bad points, and just "what exactly is" of Selenium, Scrapy, and mechanize?

  • Selenium allows you to control a web browser from Python. With Selenium, you actually fire up a browser, such as Firefox, and issue commands to it. This is comparatively slow and clunky. Selenium was designed for, and is primarily used in, testing.
  • mechanize emulates certain browser behaviors without relying on an external program. Basically, it is itself a lightweight browser. It will not give you the full range of functionality of a real browser, but it is faster and much simpler to use.
  • Scrapy is a web scraping framework. It also implements some browser behaviors, and in that respect it is similar to mechanize. But it also comes with tools to help you scrape data from web sites, such as crawlers and classes for defining the data structures to be extracted. It is bigger and more complex than mechanize and has a steeper learning curve, but once you learn it you can implement scrapers much faster, because the framework gives you a lot of what you would have to implement yourself with mechanize. Scrapy is also faster for scraping a large number of web pages.
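
To make the difference concrete, here is a minimal sketch of what a tool without a JavaScript engine gets to work with. It uses only the standard library's html.parser as a stand-in; it is not the mechanize or Scrapy API. The class name, the function name, and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class LinkAndImageParser(HTMLParser):
    """Collects href values of <a> tags and src values of <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])


def extract_links_and_images(html):
    """Return (links, images) found in the static HTML text."""
    parser = LinkAndImageParser()
    parser.feed(html)
    parser.close()
    return parser.links, parser.images


# Hypothetical page: one static link, one static image, and a script that
# would add a second link only if something actually executed it.
SAMPLE_HTML = """
<html><body>
  <a href="/static-page">Static link</a>
  <img src="/logo.png" alt="Logo">
  <script>
    var a = document.createElement("a");
    a.href = "/added-by-script";
    document.body.appendChild(a);
  </script>
</body></html>
"""

links, images = extract_links_and_images(SAMPLE_HTML)
print(links)   # ['/static-page']
print(images)  # ['/logo.png']
```

The link that the script would create never shows up, because nothing here runs the script. A parser like this only sees the markup that the server sent.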

metulburr wrote: Suppose you're working with a site that has a lot of JavaScript?

JavaScript is difficult. The issue is not only running the JavaScript itself, but also emulating the DOM that it operates on. If you need to deal with pages that contain a lot of JavaScript, then you do need something like Selenium or PyV8.
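
For the Selenium route, a rough sketch of retrieving the page after the browser has run its scripts might look like the following. It assumes the selenium package is installed and that Firefox and its driver are available on the machine. The function name fetch_rendered_html and the timeout parameter are my own choices for illustration, not part of Selenium.

```python
def fetch_rendered_html(url, timeout=10):
    """Load url in a real browser and return the HTML after scripts have run."""
    # Imported here so this module can be loaded even where Selenium is missing.
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are installed
    try:
        driver.get(url)
        # Wait until the browser reports that the document has finished loading.
        WebDriverWait(driver, timeout).until(
            lambda d: d.execute_script("return document.readyState") == "complete"
        )
        # page_source reflects the current DOM, including script-made changes.
        return driver.page_source
    finally:
        # Always close the browser, even if loading failed.
        driver.quit()
```

Calling fetch_rendered_html with the address of the page should return HTML that you can then hand to whatever parser you like. Note that a readyState of "complete" does not guarantee that content loaded later by background requests has arrived, so for such pages you would probably need to wait for a specific element instead.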