Read a web page as plain text

Postby Sudheshna » Thu Dec 05, 2013 5:41 pm

I tried accessing a web page, but I am only able to get the contents of a different URL, not the content of the exact page I requested.

I tried something like this:
Code:
import nltk
from urllib import urlopen  # Python 2

urlHTML = ""
raw = urlopen(urlHTML).read()
notraw = nltk.clean_html(raw)  # strip the HTML markup
tokens = nltk.word_tokenize(notraw)
print tokens

and also:
Code:
>>> page = urlopen("")
>>> text = page.read()
>>> page.close()
>>> soup = BeautifulSoup(text)
>>> text

Both of them read only that other content, not the exact page I asked for.

What am I missing here? Please help me out.
Last edited by micseydel on Thu Dec 05, 2013 6:26 pm, edited 1 time in total.
Reason: Code tags, locked OP.
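
For readers on Python 3, the two attempts above can be sketched with the standard library alone. This is a minimal sketch, not the poster's method: it swaps in `html.parser` for `nltk.clean_html` and BeautifulSoup, and the names `TextExtractor`, `html_to_text` and `fetch_text` are made up for illustration. The cause of the problem in the post is not known; returning the final URL is only one possible check, on the assumption that the server might be redirecting the request to a different page.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.parts)


def fetch_text(url):
    """Return (final_url, text). final_url differs from url after a redirect."""
    with urlopen(url) as response:
        final_url = response.geturl()
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return final_url, html_to_text(html)
```

Calling `fetch_text(url)` and comparing the returned `final_url` with the requested `url` shows whether the request ended up somewhere else. If the two match and the text is still not what the browser shows, the page content may be generated by JavaScript, which this approach does not execute.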

Re: Read a web page as plain text

Postby tnknepp » Thu Dec 05, 2013 6:46 pm

Python: 2.7 via Anaconda
Numpy: 1.7
Pandas: 0.11
OS: Windows 7
IDE: Spyder/IPython