github.com/hoffmann Peter Hoffmann on Stackoverflow @peterhoffmann on twitter Peter Hoffmann on Facebook Contact me per email Subscribe to Atom Feed

Peter Hoffmann

Software Engineer
prev page next page

Python and web-tags regex

Posted on August 10, 2009
#stackoverflow #python

This my Answer to the stackoverflow question: Python and web-tags regex:

While it is ok to use rexex for quick and dirty html processing a much better and cleaner way is to use a html parser like lxml.html and to query the parsed tree with XPath or CSS Selectors.

html = """<html><body><div class="deg">DATA1</div><div class="deg">DATA2</div></body></html>"""

import lxml.html

page = lxml.html.fromstring(html)
#page = lxml.html.parse(url)

for element in page.findall('.//div[@class="deg"]'):
    print element.text

#using css selectors
from lxml.cssselect import CSSSelector
sel = CSSSelector("div.deg")

for element in sel(page):
    print element.text