Back in December 2008 I wrote a small Perl script to let you enjoy podcasts from the Open University in MythStream, an add-on for MythTV that lets you watch streaming video content. The OU's podcasts site has a number of RSS feeds that relate to the various subject areas the podcasts cover and to a number of containing sections like OU Life and OU Research. At the time the script was written there was no easy way to autodiscover all of these feeds and tie them together, so I wrote a bit of code that worked this out from the menu rendered on the right-hand side of the page. This sort of screen scraping is great as a short-term way to get the data we need, but the problem is that it relies on output that was intended for a human to read rather than for a machine to process, so it can easily break if the layout of the page changes. To solve this problem I've been working with Chris Valentine of the Knowledge Media Institute at the OU, who has kindly provided a better way to extract this information (many thanks Chris!).
To save having to do screen scraping, Chris has put together an OPML file that describes the content of that menu. As this is based on XML, it can be parsed by our script and the information easily extracted for use in our output for MythStream. OPML is an XML format designed to represent an outline, which makes it well suited to describing a hierarchy of RSS feeds. This is a perfect way to represent the menu structure of the podcasts site and to make sure that the same structure is picked up by our script and shown by MythStream, so that the experience is consistent between the podcasts site and your TV with MythTV: you choose the same menu items to get to the podcast you want. As this file is specifically designed to be used by other programs, it is much easier and more reliable to parse, so a change in the layout or design of the podcasts site won't cause a problem in our application.
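To give a feel for how little code this takes, here is a minimal sketch of walking an OPML outline and collecting each feed together with its menu path. It is written in Python purely for illustration (the attached script is Perl and may work differently), and the sample OPML, the example.com URLs and the function names walk_outlines and feeds_from_opml are all made up for this example rather than taken from the OU's file.

```python
import xml.etree.ElementTree as ET

# Hypothetical OPML, loosely modelled on a podcasts menu. The real
# file's structure, names and URLs are assumptions here.
SAMPLE_OPML = """<opml version="1.0">
  <head><title>Example podcasts</title></head>
  <body>
    <outline text="Subjects">
      <outline text="Science" type="rss" xmlUrl="http://example.com/science.xml"/>
      <outline text="History" type="rss" xmlUrl="http://example.com/history.xml"/>
    </outline>
    <outline text="OU Life" type="rss" xmlUrl="http://example.com/life.xml"/>
  </body>
</opml>"""


def walk_outlines(element, path=()):
    """Yield (menu_path, feed_url) for every outline that carries an xmlUrl."""
    for outline in element.findall("outline"):
        # Fall back to "title" because not every OPML producer sets "text".
        label = outline.get("text") or outline.get("title") or "(untitled)"
        here = path + (label,)
        url = outline.get("xmlUrl")
        if url:
            yield here, url
        # Nested outlines become sub-menus, so recurse with the longer path.
        yield from walk_outlines(outline, here)


def feeds_from_opml(opml_text):
    """Parse OPML text and return a list of (menu_path, feed_url) pairs."""
    root = ET.fromstring(opml_text)
    body = root.find("body")
    if body is None:
        return []
    return list(walk_outlines(body))


if __name__ == "__main__":
    for menu_path, feed_url in feeds_from_opml(SAMPLE_OPML):
        print(" > ".join(menu_path), feed_url)
```

Keeping the full menu path with each feed is what would let a front end rebuild the same menu hierarchy that the site shows, instead of presenting one flat list of feeds.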
OPML isn't as standardised as RSS, so it can be difficult to extract all of the information from it without knowing the details of the site it relates to. The <outline> element can have attributes that only make sense for that site, which makes data portability a little more difficult. For this sort of application that is not much of a problem, as we are writing a script to extract data from one specific site. OPML allows us to extract information about the contents and structure of a podcasts site in a simple and efficient way, which means we can bring those contents to new interfaces and new audiences. This has uses beyond MythStream as well: the same technique could be adapted for other clients capable of streaming media, like Boxee.
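One way to cope with site-specific attributes is to separate the ones you expect from everything else, so that anything unfamiliar is easy to spot and either use or ignore. The sketch below is again in Python for illustration only. The set of "common" attribute names is my own selection of attributes often seen on feed outlines, not an exhaustive or official list, and the siteSection attribute is invented to stand in for a site-specific one.

```python
import xml.etree.ElementTree as ET

# Attribute names often seen on feed outlines. This selection is an
# assumption made for the example, not a complete or official list.
COMMON_ATTRS = {"text", "type", "xmlUrl", "htmlUrl", "title", "description"}


def split_attributes(outline):
    """Return (common, extra) dicts of an outline element's attributes."""
    common = {k: v for k, v in outline.attrib.items() if k in COMMON_ATTRS}
    extra = {k: v for k, v in outline.attrib.items() if k not in COMMON_ATTRS}
    return common, extra


# "siteSection" is a made-up attribute standing in for something site-specific.
EXAMPLE = ET.fromstring(
    '<outline text="Science" type="rss" '
    'xmlUrl="http://example.com/science.xml" siteSection="subjects"/>'
)

if __name__ == "__main__":
    common, extra = split_attributes(EXAMPLE)
    print("common:", common)
    print("site-specific:", extra)
```

A script written for one site can simply read the extra attributes it knows about, while a more general tool could skip them without failing.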
The new script is attached to this post. To use it, follow the instructions in Getting Open University Podcasts on your TV with MythStream. Comments are very welcome, and even though this script relates to OU podcasts, it could be adapted to other situations.