DjangoCon US 2011

6–8 September 2011

talks

9–10 September 2011

sprints

Conference Schedule

Y’all Wanna Scrape with Us? Content Ain’t a Thing: Web Scraping With Our Favorite Python Libraries

A talk presented by Katharine Jarmul

Time

Thursday, September 8th, 3:30 p.m.–4:10 p.m.

Level

Experienced

Description

Love them or hate them, the top Python scraping libraries have some hidden gems and tricks that you can use to enhance, update and diversify your Django models. This talk will teach you more advanced techniques for aggregating content from RSS feeds, Twitter, Tumblr and plain old websites for your Django projects.

Abstract

Outline

lxml fu: etree vs html
lxml faves: iterlinks, prev/next, strip_tags, linepos
incorporating XPath
building your XML views/templates with lxml (this bullet is optional: there may not be time, but I would love to hear whether folks would find it useful)
learning how to build a good JSON API handler: what you can learn from some amazing API handlers when you have to build your own
feedparser, HTMLParser, re: the quick & dirty ways to parse when lxml isn't fast enough
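
As a rough illustration of the last outline item, here is a minimal sketch of the two standard-library "quick & dirty" routes: an HTMLParser subclass that collects links, next to a regular expression that does a cruder version of the same job. It is not taken from the talk. The sample markup and the names SAMPLE_HTML, LinkCollector, links_with_parser, HREF_RE and hrefs_with_regex are all invented for this example, and the code assumes Python 3, where the module lives at html.parser (Python 2, current when the talk was given, called it HTMLParser).

```python
import re
from html.parser import HTMLParser  # Python 2 spelled this module `HTMLParser`

# Hypothetical sample markup, invented for this sketch.
SAMPLE_HTML = """
<html><body>
  <h1>Neighborhood news</h1>
  <p>Read <a href="/posts/1">the first post</a> or
     <a href='http://example.com/feed'>the feed</a>.</p>
  <a name="top">no href here</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags that carry an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def links_with_parser(html):
    """Event-driven route: tolerant of attribute order and quoting style."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Regex route: workable for one-off jobs on markup you know well,
# fragile on anything else (unquoted attributes, odd whitespace, comments).
HREF_RE = re.compile(r"""<a\s[^>]*?href\s*=\s*["']([^"']+)["']""", re.IGNORECASE)


def hrefs_with_regex(html):
    """Return only the href values; link text is not captured."""
    return HREF_RE.findall(html)


if __name__ == "__main__":
    print(links_with_parser(SAMPLE_HTML))
    print(hrefs_with_regex(SAMPLE_HTML))
```

The parser version keeps the link text alongside each URL, which is usually what a Django model needs; the regex version is shorter but returns bare URLs and silently skips anything its pattern did not anticipate.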

Why learn from me? I’ve used these libraries to help build high-scale Django applications for the Washington Post and USA TODAY, covering everything from neighborhood blog aggregators and election coverage to Katrina mapping and financial reporting.