Cox Media Group
Kryptonite
Conference Schedule
Track one or two? Eeenie, meenie, miney, moe…
Y’all Wanna Scrape with Us? Content Ain’t a Thing: Web Scraping With Our Favorite Python Libraries
Outline
- lxml fu: etree vs html
- lxml faves: iterlinks, prev/next, strip_tags, linepos
- incorporating XPath
- building your XML views/templates with lxml (optional: we may not have time, but I would love to hear whether folks would find this useful)
- learning how to build a good JSON API handler: what you can learn from some amazing API handlers when you have to build your own
- feedparser, HTMLParser, and re: the quick & dirty ways to parse when lxml isn't fast enough
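
To give a flavor of the last bullet, here is a minimal sketch of the "quick & dirty" approach using only the standard library. It is not taken from the talk itself: the sample markup, the `LinkCollector` class, and the helper names `links_with_parser` and `links_with_regex` are all made up for illustration, and it assumes Python 3, where the `HTMLParser` class lives in `html.parser`.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page, invented for this sketch.
SAMPLE_HTML = """
<html><body>
  <h1>Neighborhood Blogs</h1>
  <a href="/blogs/dupont">Dupont</a>
  <a class="ext" href="http://example.com/feed">Feed</a>
  <a name="top">No link here</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs from <a> tags that carry an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def links_with_parser(html):
    """Event-driven parse: tolerant of messy markup, keeps the link text."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# The dirtiest option: a regex that only understands double-quoted hrefs.
HREF_RE = re.compile(r'href="([^"]+)"')


def links_with_regex(html):
    """Regex scan: fine for a known, regular page; brittle anywhere else."""
    return HREF_RE.findall(html)


if __name__ == "__main__":
    print(links_with_parser(SAMPLE_HTML))
    print(links_with_regex(SAMPLE_HTML))
```

The regex version is shorter but returns only the raw `href` values and would miss single-quoted or unquoted attributes; the parser version costs a few more lines and keeps the anchor text alongside each link.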
Why learn from me? I’ve used these libraries to help build high-scale Django applications for the Washington Post and USA TODAY, covering everything from neighborhood blog aggregators and election coverage to Katrina mapping and financial reporting.