Back in December 2008 I wrote a small Perl script that lets you enjoy podcasts from the Open University in MythStream, an add-on for MythTV that lets you watch streaming video content. The OU's podcasts site has a number of RSS feeds, some relating to the various subject areas that the podcasts cover and some to broader sections such as OU Life and OU Research. At the time the script was written there was no easy way to autodiscover all of these feeds and tie them together, so I wrote a bit of code that worked this out from the menu rendered on the right-hand side of the page. This sort of screen scraping is great as a short-term way to get the data we need, but it relies on output that was intended for a human to read rather than for a machine to process, so it can easily break if the layout of the page changes. To solve this problem I've been working with Chris Valentine of the Knowledge Media Institute at the OU, who has kindly provided a better way to extract this information (many thanks Chris!).
To save having to do screen scraping, Chris has put together an OPML file that describes the contents of that menu. As it is based on XML, our script can parse it and easily extract the information needed for our MythStream output. OPML (Outline Processor Markup Language) is an XML format designed to represent an outline, in this case a hierarchy of RSS feeds. This is a perfect way to represent the menu structure of the podcasts site: the same structure can be picked up by our script and shown by MythStream, so you choose the same menu items to reach a podcast whether you are on the podcasts site or on your TV with MythTV. As this feed is specifically designed to be used by other programs, it is much easier and more reliable to parse, and a change in the layout or design of the podcasts site won't cause a problem in our application.
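To make the idea concrete, here is a minimal sketch of walking an OPML outline and listing its feeds. It is written in Python rather than the Perl of the attached script, and it is not that script's implementation. The sample document, its URLs and the function names `walk` and `menu_entries` are made up for illustration; the real OU feed may use different attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical OPML sample; the real OU feed's titles, URLs and
# attributes may differ.
SAMPLE_OPML = """<opml version="1.1">
  <head><title>Example podcasts</title></head>
  <body>
    <outline text="Science">
      <outline text="Astronomy" type="rss"
               xmlUrl="http://example.org/astronomy/rss.xml"
               htmlUrl="http://example.org/astronomy"/>
    </outline>
    <outline text="OU Life" type="rss"
             xmlUrl="http://example.org/ou-life/rss.xml"/>
  </body>
</opml>"""


def walk(outline, depth=0):
    """Yield (depth, text, xmlUrl) for an outline element and its descendants."""
    yield depth, outline.get("text", ""), outline.get("xmlUrl")
    for child in outline.findall("outline"):
        yield from walk(child, depth + 1)


def menu_entries(opml_text):
    """Return a flat list of (depth, text, xmlUrl) in document order."""
    body = ET.fromstring(opml_text).find("body")
    entries = []
    for top in body.findall("outline"):
        entries.extend(walk(top))
    return entries


if __name__ == "__main__":
    # Print the menu as an indented tree, with the feed URL where one exists.
    for depth, text, url in menu_entries(SAMPLE_OPML):
        print("  " * depth + text + (f" -> {url}" if url else ""))
```

Because the nesting of the outline elements mirrors the menu, the depth value is all that is needed to rebuild the same hierarchy in another interface.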
OPML isn't as standardised as RSS, so it can be difficult to extract all of the information from it without knowing the details of the site it relates to. The <outline> element can have attributes that only make sense for that site, which makes data portability a little more difficult. For this sort of application that is not much of a problem, as we are writing a script to extract data from one specific site. OPML lets us extract information about the contents and structure of a podcasts site in a simple and efficient way, which means we can bring those contents to new interfaces and new audiences. This has uses beyond MythStream as well: the same technique could be adapted for other clients capable of streaming media, such as Boxee.
The new script is attached to this post. To use it, follow the instructions in Getting Open University Podcasts on your TV with MythStream. Comments are very welcome, and even though this script relates to OU podcasts it could be adapted to other situations.
Re: Linking a podcast site into MythStream using OPML (the ...
I think this would be a great feature for my Mythbuntu, but I can't get it to work. I copied the oupodcasts.pl into the right folder and added the stream as you wrote in the post before. But it tells me to test the parser on the command line. :(
When I run it in a terminal by "perl oupodcasts.pl", it says "Not an ARRAY reference at oupodcasts.pl line 77.". I'm no perl expert, so I don't know what's wrong.
Thanks.
Re: Linking a podcast site into MythStream using OPML (the ...
Hi! The structure of the OPML feed has been improved, which has sadly broken my script :( I will fix it and post an updated version here in a few days. Sorry about that.
Re: Linking a podcast site into MythStream using OPML (the ...
Hi Liam,
Here's a patch for your perl script:
dug@spug:~/.mythtv/mythstream/parsers$ diff -u oupodcasts.pl.orig oupodcasts.pl
--- oupodcasts.pl.orig 2009-02-06 23:41:21.000000000 +1100
+++ oupodcasts.pl 2009-05-11 21:03:17.000000000 +1000
@@ -74,6 +74,9 @@
if ($outline_element->{outline}) {
# loop round children to find details for channel and its children
$channel{"subchannels"} = ();
+ if (ref($outline_element->{outline}) eq 'HASH') {
+ $outline_element->{outline} = [ $outline_element->{outline} ];
+ }
foreach my $subchannel (@{$outline_element->{outline}}) {
if ($subchannel->{text} eq $channel{"name"}) {
$channel{"url"} = $subchannel->{"htmlUrl"};
If $outline_element->{outline} only contained one element it was becoming a reference to a hash rather than an array... the above just detects this and sticks it in an array reference.
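For illustration only, the same normalisation can be sketched in Python (not the Perl of the script itself). Some XML-to-data-structure parsers return a lone child as a single mapping and several children as a list, so the calling code wraps the lone case before looping. The helper name `ensure_list` and the sample data are hypothetical.

```python
def ensure_list(value):
    """Return value as a list: None -> [], a lone item -> [item], a list unchanged."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Hypothetical parsed data: one channel has a single child stored as a dict,
# the other has several children stored as a list of dicts.
single = {"text": "Science", "outline": {"text": "Astronomy"}}
several = {"text": "Arts", "outline": [{"text": "History"}, {"text": "Music"}]}


def child_names(element):
    """Collect the names of an element's children, however many there are."""
    return [child["text"] for child in ensure_list(element.get("outline"))]
```

With the wrapper in place the loop body never needs to know whether the parser produced one child or many.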
Thanks for the script... the only problem I have is I get about 2 seconds of each podcast's audio before it stops for unknown reasons... nothing in the log... but obviously your script is doing its job since I can drill down the XML hierarchy.
P.S. The html editor seems to have taken out all the patch's indentation but the critical lines are prefaced with a "+". Hope that helps.
Cheers,
Doug
Re: Linking a podcast site into MythStream using OPML (the ...
Hi Doug, thanks for fixing this! I've patched the script as suggested and uploaded the new version, which is now attached to this post.