lxml.html


Postby metulburr » Wed Jun 26, 2013 2:57 pm

I will probably have a ton of questions about lxml, as the tutorial on their website is confusing me more than it is answering my questions.

I wrote this code
Code: Select all
from lxml import etree, html

url = 'http://www.wunderground.com/cgi-bin/findweather/hdfForecast?query=14901'
doc = html.parse(url)
res = doc.xpath('//span[@id="rapidtemp"]/@value')
res2 = doc.xpath('//span[@id="rapidtemp"]')
print(res[0])

print(res2[0].keys())
print(res2[0].values())

in an attempt to get the degrees, which worked, but now I am trying to get the entire outer HTML of that section:

Code: Select all
<span id="rapidtemp" class="pwsrt" pwsid="KNYELMIR5" pwsunit="english" pwsvariable="tempf" english="°F" metric="°C" value="70.6">
  <span class="nobr"><span class="b">70.6</span>&nbsp;°F</span>
</span>
New Users, Read This
OS Ubuntu 14.04, Arch Linux, Gentoo, Windows 7/8
https://github.com/metulburr
steam
User avatar
metulburr
 
Posts: 1387
Joined: Thu Feb 07, 2013 4:47 pm
Location: Elmira, NY

Re: lxml.html

Postby stranac » Wed Jun 26, 2013 4:22 pm

I'm not 100% sure, and I have no way of testing, but I think you want lxml.html.tostring()
It's either that or something very similar.

I agree that their documentation is pretty bad.
Friendship is magic!

R.I.P. Tracy M. You will be missed.
User avatar
stranac
 
Posts: 1097
Joined: Thu Feb 07, 2013 3:42 pm

Re: lxml.html

Postby metulburr » Wed Jun 26, 2013 5:39 pm

With tostring I get this traceback:
Code: Select all
Traceback (most recent call last):
  File "test3.py", line 16, in <module>
    print(html.tostring(res2))
  File "/usr/lib/python3/dist-packages/lxml/html/__init__.py", line 1581, in tostring
    doctype=doctype)
  File "lxml.etree.pyx", line 3122, in lxml.etree.tostring (src/lxml/lxml.etree.c:63526)
TypeError: Type 'b'list'' cannot be serialized.


I don't really understand how to get at nested spans. For example, the regular temperature was my first parse, and now I am attempting to grab the "feels like" temperature, which is nested in some spans inside a div.
Code: Select all
from lxml import etree, html
from urllib.request import urlopen

#parse from string

url = 'http://www.wunderground.com/cgi-bin/findweather/hdfForecast?query=14901'

res = urlopen(url)
doc = html.fromstring(res.read())
#doc2 = etree.fromstring(res.read())

temp = doc.xpath('//span[@id="rapidtemp"]/@value')
temp2 = doc.xpath('//div[@id="tempFeel"]')
print(len(temp2))
print(temp[0])
print(temp2[0])
print(temp2[0].text)


The outer HTML:
Code: Select all
<div id="tempFeel">Feels Like
  <span class="nobr"><span class="b">73</span>&nbsp;°F</span>
</div>

I'm ultimately trying to grab the 73 here via lxml.html.

Re: lxml.html

Postby stranac » Wed Jun 26, 2013 7:38 pm

That should be:
Code: Select all
html.tostring(res2[0])
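
To spell out why the list failed: xpath() returns a list of matches even when only one element matches, while tostring() wants a single element. Here is a minimal sketch of that, using an inline copy of the markup from the first post instead of the live page (the trimmed-down `snippet` string and the wrapping div are assumptions for illustration):

```python
from lxml import html

# Inline stand-in for the downloaded page (assumption: trimmed copy of the span).
snippet = (
    '<div><span id="rapidtemp" class="pwsrt" value="70.6">'
    '<span class="nobr"><span class="b">70.6</span>&nbsp;&#176;F</span>'
    '</span></div>'
)

doc = html.fromstring(snippet)
matches = doc.xpath('//span[@id="rapidtemp"]')  # element queries return a list
element = matches[0]                            # pick the single element out of it

# tostring() returns bytes by default; encoding='unicode' asks for a str instead.
outer_html = html.tostring(element, encoding='unicode')
print(outer_html)
```

If the xpath might match nothing, it is probably worth checking that the list is non-empty before indexing into it.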


But from your example, it seems to me that you don't need the html at all.
To get the text inside the nested span, first navigate to the span, and then extract the text.
This (or something similar) should work:
Code: Select all
doc.xpath('//div[@class="tempFeel"//span[@class="b"]/text()'

Be sure to check that for spelling mistakes, I'm typing this on a phone.

If you just wanted all the text inside the div, you can use '//text()' in the xpath, or .text_content() (I think) of the element.
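
A rough sketch of the difference between those options, run against an inline copy of the tempFeel div rather than the real page (the `snippet` string is an assumption for illustration). It also suggests why `.text` on the div only gave "Feels Like": `.text` holds just the text before the first child element.

```python
from lxml import html

# Inline stand-in for the div from the page (assumption for illustration).
snippet = (
    '<div id="tempFeel">Feels Like '
    '<span class="nobr"><span class="b">73</span>&nbsp;&#176;F</span></div>'
)
doc = html.fromstring(snippet)
div = doc.xpath('//div[@id="tempFeel"]')[0]

direct = div.xpath('text()')     # text nodes that are direct children of the div
nested = div.xpath('.//text()')  # every text node below the div, as a list
joined = div.text_content()      # the same text joined into one string

print(direct)
print(nested)
print(joined)
```

The list form is handy when you want one specific piece, while text_content() is simpler when you want everything in one string.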

Re: lxml.html

Postby metulburr » Wed Jun 26, 2013 7:49 pm

Oh, OK, you put it all in one argument to xpath. Yeah, there were a couple of errors.
Code: Select all
doc.xpath('//div[@id="tempFeel"]//span[@class="b"]/text()')


So the whole path goes into the one string passed to xpath, no matter how deeply nested the target is. I haven't found a good tutorial that explains that, or the format to use.
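
For comparison, here is a small sketch showing that the single expression and a step-by-step lookup end up at the same text node. It runs against an inline copy of the div, not the live page (the `snippet` string is an assumption for illustration):

```python
from lxml import html

# Inline stand-in for the div from the page (assumption for illustration).
snippet = (
    '<div id="tempFeel">Feels Like '
    '<span class="nobr"><span class="b">73</span>&nbsp;&#176;F</span></div>'
)
doc = html.fromstring(snippet)

# One expression: the div with id tempFeel, then any descendant span with class b.
one_step = doc.xpath('//div[@id="tempFeel"]//span[@class="b"]/text()')

# The same lookup in two steps; the leading dot keeps the search inside the div.
div = doc.xpath('//div[@id="tempFeel"]')[0]
two_step = div.xpath('.//span[@class="b"]/text()')

print(one_step, two_step)
feels_like = int(one_step[0])  # text() results are strings, so convert if needed
```

Without the leading dot, `//span[...]` called on an element would search the whole document again instead of only that element's descendants.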

Re: lxml.html

Postby stranac » Wed Jun 26, 2013 8:23 pm

Well, the format is just xpath.
It's an actual xml query language thingy.
lxml uses version 1 of xpath, which is a bit limited, but still very powerful.
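
As one illustration of "limited but still powerful": XPath 1.0 has no lower-case() or ends-with() (those arrived in XPath 2.0), but functions such as contains(), concat(), normalize-space(), translate() and count() can cover a lot of ground. The markup below is made up for this sketch and is not from the weather page:

```python
from lxml import html

# Made-up markup for illustration only.
snippet = (
    '<div><span class="b big">73</span>'
    '<span class="nobr">n/a</span>'
    '<span class="B">70.6</span></div>'
)
doc = html.fromstring(snippet)

# Match "b" as a whole class token, so class="nobr" is not picked up by accident.
token_match = doc.xpath(
    '//span[contains(concat(" ", normalize-space(@class), " "), " b ")]/text()'
)

# No lower-case() in XPath 1.0; translate() can map "B" to "b" before comparing.
case_insensitive = doc.xpath('//span[translate(@class, "B", "b") = "b"]/text()')

# xpath() can also return a number instead of a list, e.g. for count().
span_count = doc.xpath('count(//span)')

print(token_match, case_insensitive, span_count)
```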

Re: lxml.html

Postby metulburr » Wed Jun 26, 2013 9:48 pm


Re: lxml.html

Postby Mekire » Thu Jun 27, 2013 4:25 am

I am not familiar with your subject matter, but I will say that w3schools has become quite notorious as a terrible resource with a very good advertising campaign.
W3Schools: An Intervention

-Mek
User avatar
Mekire
 
Posts: 986
Joined: Thu Feb 07, 2013 11:33 pm
Location: Amakusa, Japan

Re: lxml.html

Postby metulburr » Thu Jun 27, 2013 12:44 pm

Really? I used w3schools a lot as a quick tutorial guide for HTML, and it seemed accurate.

