Conference Schedule

Track one or two? Eeenie, meenie, miney, moe…


Y’all Wanna Scrape with Us? Content Ain’t a Thing: Web Scraping With Our Favorite Python Libraries

A Talk presented by Katharine Jarmul

Time

Thursday, September 8th 3:30 p.m.–4:10 p.m.

Level

Experienced

Description

Love them or hate them, the top Python scraping libraries have some hidden gems and tricks that you can use to enhance, update and diversify your Django models. This talk will teach you more advanced techniques for aggregating content from RSS feeds, Twitter, Tumblr and plain old websites for your Django projects.

Abstract

Outline


  • lxml fu: etree vs html
  • lxml faves: iterlinks, prev/next, strip_tags, linepos
  • incorporating xpath
  • building your xml views/templates with lxml (this bullet is optional: may not have time but would love to hear if folks might find this useful)
  • learning how to build a good JSON API handler: what you can learn from some amazing API handlers when you have to build your own
  • feedparser, HTMLParser, re: the quick & dirty ways to parse when lxml isn't fast enough
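
As a rough illustration of the last bullet (not taken from the talk itself), here is a minimal sketch of the "quick & dirty" route using only the standard library. It assumes Python 3, where the HTMLParser module lives at html.parser. The names LinkCollector, extract_links, extract_hrefs_regex and SAMPLE are hypothetical and exist only for this example.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags. Hypothetical helper."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a> with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Parse html and return a list of (href, text) tuples."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def extract_hrefs_regex(html):
    """Even dirtier: a regex that only handles double-quoted href values."""
    return re.findall(r'<a\s[^>]*href="([^"]*)"', html, flags=re.IGNORECASE)


# Made-up sample markup for the sketch.
SAMPLE = (
    '<p><a href="/news/1">First <b>story</b></a> and '
    '<a class="x" href="/news/2">second</a></p>'
)

if __name__ == "__main__":
    print(extract_links(SAMPLE))
    print(extract_hrefs_regex(SAMPLE))
```

The parser version keeps the link text even when it is wrapped in nested tags. The regex version is shorter but breaks on single-quoted or unquoted attributes, so it is probably only worth reaching for on markup you control or have already inspected.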

Why learn from me? I’ve used these libraries to help build high-scale Django applications for the Washington Post and USA TODAY, covering everything from neighborhood blog aggregators and election coverage to Katrina mapping and financial reporting.