I have created a monster and this post is about killing it off by scraping the contents of this blog into structured Python objects. Sometime later I will convert the HTML content to markdown and download the images and other resources locally.
I want to put the contents into a ZODB object database to get a feel for working with an object database. A greater goal is to migrate the content to a new blog engine. I don't want to go into why I felt I needed to scrape it or why I want to migrate to another blog engine, as it's depressing.
Moving on, I wanted to put the content into these classes:
class Post():
    title = ''
    content = ''
    author = ''
    date = ''
    tags = []
    comments = []

class Comment():
    content = ''
    author = ''
    date = ''
    website = ''
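Since these are destined for ZODB, they will eventually need to become persistent objects. A minimal sketch of what Post might look like then, assuming the persistent package that ships with ZODB (I haven't wired any of this up yet):

import persistent
from persistent.list import PersistentList

class Post(persistent.Persistent):
    def __init__(self):
        self.title = ''
        self.content = ''
        self.author = ''
        self.date = None
        # PersistentList notices in-place changes, which a plain list would not
        self.tags = PersistentList()
        self.comments = PersistentList()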
The scraping code is not elegant, but it was quite fun to write because I could build it all up from an interactive console session. I found BeautifulSoup fantastic at turning HTML into something easy to work with, although I would have liked jQuery/CSS-style selectors (there's a sketch of that after the listing).
from BeautifulSoup import BeautifulSoup
from datetime import datetime
import urllib2
import re

def ParseComment(soup):
    comment = Comment()
    # the first tag inside the author paragraph holds the name (and sometimes a link)
    author = soup.find('p', {"class": "author"}).first()
    comment.author = author.string.strip()
    content = soup.find('p', {"class": "content"})
    if content:
        comment.content = content.prettify()
    if author.has_key('href'):
        comment.website = author['href']
    # comment dates are rendered as dd/mm/yyyy hh:mm
    r = re.compile(r'\d+/\d+/\d+ \d+:\d+')
    date = r.findall(soup.find('p', {"class": "date"}).renderContents())[0]
    comment.date = datetime.strptime(date, '%d/%m/%Y %H:%M')
    return comment
def ParsePost(postSoup):
    post = Post()
    post.title = postSoup.find('a', {"class": re.compile('posthead.*')}).string
    post.content = postSoup.find('div', {"class": "entry"})
    descr = postSoup.find('div', {"class": "descr"})
    # the description starts with a date like 'March 17, 2010 22:16';
    # the last four characters are trimmed off before parsing
    date = descr.contents[0][:-4]
    post.date = datetime.strptime(date, '%B %d, %Y %H:%M')
    post.author = descr.first().string
    post.tags = map(lambda x: x.string, postSoup('a', {"rel": "tag"}))
    # each comment sits in its own div inside the comment list
    commentList = postSoup.find('div', {"id": "commentlist"})
    comments = commentList('div') if commentList else []
    post.comments = [ParseComment(commentSoup) for commentSoup in comments]
    return post
def DownloadPost(url):
    postHtml = urllib2.urlopen('http://blog.sharpthinking.com.au/' + url).read()
    postSoup = BeautifulSoup(postHtml)
    return ParsePost(postSoup)
def GetPosts():
    page = urllib2.urlopen("http://blog.sharpthinking.com.au/archive.aspx")
    soup = BeautifulSoup(page)
    postUrls = map(lambda x: x['href'], soup('a', href=re.compile('/post/.*')))
    return [DownloadPost(url) for url in postUrls[:-10]]
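As an aside, the jQuery-style selection I was missing does exist in Python land. Here's a rough sketch using lxml and its cssselect support (an assumption on my part, not what I actually used, and it presumes the same p.author markup):

from lxml import html

# parse the raw page and query it with a CSS selector, jQuery-style
doc = html.fromstring(postHtml)
for link in doc.cssselect('p.author a'):
    print link.get('href'), link.text_content()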
I'm sure there is a better way still, but this was better than any way I've used previously. Anyway, I've done a lot of work untangling the mess I created.
>>> posts = GetPosts()
>>> for post in posts[:5]:
... print post.date, post.title
...
2010-03-17 22:16:00 OMG. It's a JavaScript Rhino
2010-03-12 13:53:00 Devevenings Presentation - IOC/Unit Testing/Mocking in ASP.NET MVC
2010-02-20 17:18:00 Revisiting Pygments in the browser with Silverlight, now with BackgroundWorker
2010-02-17 19:25:00 Revisiting Modal Binding an Interface, now with DictionaryAdapterFactory
2010-02-16 20:34:00 Modal Binding an Interface with DynamicProxy
I wanted to put the contents into the object database tonight, but for now I have pickled them to be revisited later.
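Pickling the whole list is cheap, which is why "revisit later" is an easy call. Something like this is all it takes (posts.pickle is just the filename I chose):

import pickle

# serialise the scraped posts to disk; pickle.load gets them back
with open('posts.pickle', 'wb') as f:
    pickle.dump(posts, f)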