Peter Hoffmann Director Data Engineering at Blue Yonder. Python Developer, Conference Speaker, Mountaineer

Python and web-tags regex

This my Answer to the stackoverflow question: Python and web-tags regex:

While it is ok to use rexex for quick and dirty html processing a much better and cleaner way is to use a html parser like lxml.html and to query the parsed tree with XPath or CSS Selectors.

html = """<html><body><div class="deg">DATA1</div><div class="deg">DATA2</div></body></html>"""

import lxml.html

page = lxml.html.fromstring(html)
#page = lxml.html.parse(url)

for element in page.findall('.//div[@class="deg"]'):
    print element.text

#using css selectors
from lxml.cssselect import CSSSelector
sel = CSSSelector("div.deg")

for element in sel(page):
    print element.text