HTML of Website


Post by MarcelF6 » Sun Jul 28, 2013 10:04 pm

Hi everyone,

I'm looking for a small Python script that writes the HTML code of several web pages to a file. The pages' URLs differ only in an integer at the end. Each page is displayed as its HTML source.
Since the HTML contains some information I don't need, it isn't necessary to save the whole page.
I only need the content inside the tag "<html xmlns = "http://www.test.de">". Inside it there is another tag, "<head>", whose content I don't need either.
In addition to the restrictions above, I don't need to copy the metadata (i.e. "[: ... :]") that can appear inside the tag I want to copy.
Furthermore, after each page there should be a line break, the tag "<hr>", and another line break; then the whole thing repeats with the next page. After the last page, no "<hr>" tag should be inserted.
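
The metadata and separator rules described above can be sketched with the standard library alone. This is only a rough illustration: the function names `strip_metadata` and `join_pages` are made up for this example, and the pattern assumes that a marker always opens with "[:" and closes with ":]" and is never nested.

```python
import re

# Matches "[: ... :]" markers, including ones that span several lines.
# The exact marker syntax is an assumption based on the description above.
METADATA = re.compile(r"\[:.*?:\]", re.DOTALL)


def strip_metadata(text):
    """Remove every "[: ... :]" marker from text."""
    return METADATA.sub("", text)


def join_pages(pages):
    """Join page contents with a line break, "<hr>", and a line break.

    str.join only places the separator between items, so nothing
    is appended after the last page.
    """
    return "\n<hr>\n".join(strip_metadata(page) for page in pages)


if __name__ == "__main__":
    demo = ["first [: author=x :]page", "second page"]
    print(join_pages(demo))
```

Using `str.join` avoids a special case for the last page, because the separator is only ever written between two pages.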

Is this comprehensible? :S
In any case, I would be grateful for any suggestions and hints :)
Thanks a lot!

Re: HTML of Website

Post by Yoriz » Sun Jul 28, 2013 11:37 pm

MarcelF6 wrote: Hi everyone,

I'm looking for a little python script that writes the HTML-code of several websites


Are you looking for someone to write the script for you?
If so, we can move this to the jobs section of the forum for you.

Re: HTML of Website

Post by MarcelF6 » Sun Jul 28, 2013 11:54 pm

No, I'd like someone to show me how this is done.
This is what I have so far:
Code:
from lxml import html

# Parse the page and print the text content of its <html> element.
print(html.parse('http://test.com/m=0033?action=source').xpath('//html')[0].text_content())


This prints everything.
But how can I ignore the "<head>" tag and markers like "[: ... :]"?
And how can I iterate through the pages? (0033 is an arbitrary page number.)
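
One possible way to approach the "<head>" and iteration questions with lxml is sketched below. It is not a definitive solution. The names `URL_TEMPLATE`, `page_urls`, `body_text` and `fetch_text` are invented for this example, and the four-digit, zero-padded page number is an assumption taken from the "0033" in the URL above. The download goes through `urllib.request` instead of handing the URL to `html.parse`.

```python
from urllib.request import urlopen

from lxml import html

# Hypothetical URL pattern modelled on the example above; the four-digit,
# zero-padded page number is an assumption.
URL_TEMPLATE = "http://test.com/m={:04d}?action=source"


def page_urls(first, last):
    """Return the URLs for pages first..last (inclusive)."""
    return [URL_TEMPLATE.format(n) for n in range(first, last + 1)]


def body_text(root):
    """Return the text of a parsed page without its <head> content."""
    for head in root.xpath("//head"):
        # drop_tree() removes the element together with its children.
        head.drop_tree()
    return root.text_content()


def fetch_text(url):
    """Download one page and return its text without the <head>."""
    with urlopen(url) as response:
        root = html.fromstring(response.read())
    return body_text(root)
```

Calling `fetch_text` for each URL from `page_urls(first, last)` would give one string per page. The "[: ... :]" markers could then be removed from each string with a regular expression such as `re.sub(r"\[:.*?:\]", "", text, flags=re.DOTALL)`, again assuming the markers are never nested.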

Thanks a lot.