Scrape the archive of Radiolab podcasts from WNYC

RL_archive_bs4.ipynb

Last updated: 2020-08-29

I have been a fan of Radiolab ever since first hearing "Musical Language" in an undergraduate ethnomusicology class. Over time, the program has shifted away from the science and technology topics that originally drew me to it. Moreover, older episodes continue to be dropped from common podcast apps, so it has become difficult to access older material conveniently.

As a workaround, I wrote a quick scraper for the wnycstudios website, which hosts Radiolab's entire archive of episodes. The scraper, demonstrated below, programmatically downloads the archive, which can then be stored on a mobile device or a personal streaming service (such as Plex).

In [6]:
import pandas as pd
from bs4 import BeautifulSoup
from wget import download
from urllib.request import urlopen
import re
from time import sleep

First attempt

Let's get a working example using Beautiful Soup.

In [134]:
# Where to store my downloads
library_path = '/Volumes/Elements/NVIDIA_SHIELD/PlexLibrary/Podcasts/radiolab/'

# Defining some structure for the WNYC site
stump_url = 'https://www.wnycstudios.org'
base_url = 'https://www.wnycstudios.org/shows/radiolab/podcasts'
In [11]:
html = urlopen(base_url).read()
In [13]:
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
bs = BeautifulSoup(html, 'html.parser')
In [16]:
# Each podcast is represented as an <article> within each page.
articles = bs.find_all('article')
In [21]:
url = articles[0].find('a')['href']
In [30]:
podcast_url = stump_url + url
podcast_html = BeautifulSoup(urlopen(podcast_url).read(), 'html.parser')
In [37]:
podcast_download_link = podcast_html.find('a', attrs={'class': 'download-link'})['href']
In [41]:
download(podcast_download_link, library_path)
Out[41]:
'/Volumes/Elements/NVIDIA_SHIELD/PlexLibrary/Podcasts/radiolab//radiolab_podcast20graham.mp3'

That seems to work. Time to generalize and apply it to the entire archive.

In [118]:
def download_podcast(podcast_url):
    '''
    Given a direct podcast URL, find the download link (if it exists) and download the file to the pre-specified library destination.
    '''
    podcast_html = BeautifulSoup(urlopen(podcast_url).read(), 'html.parser')
    podcast_download = podcast_html.find('a', attrs={'class': 'download-link'})
    if podcast_download is None:  # An article is not always a downloadable episode
        return
    href = podcast_download.get('href')
    if href is None:  # A page can also erroneously omit the download link
        return
    # In rare cases, the redirect URL does not work, but the original URL will.
    # Keep only the last embedded URL, i.e. everything from the final 'http' onward.
    podcast_download_link = 'http' + re.split('http', href)[-1]
    download(podcast_download_link, library_path)

def get_podcast_urls(page_no):
    '''
    The Radiolab archive is paginated. Given a page number, return a list of all direct podcast links featured on that page.
    '''
    url = base_url + '/' + str(page_no)
    bs = BeautifulSoup(urlopen(url).read(), 'html.parser')
    articles = bs.find_all('article')
    urls = [stump_url + article.find('a')['href'] for article in articles]
    return urls
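
The redirect handling inside download_podcast is easier to see in isolation. Below is a minimal sketch, assuming a made-up redirect link: the example.com addresses and the `strip_redirect` name are hypothetical and not taken from the WNYC site. It uses `str.rfind` rather than the `re.split` expression above, so a link with no embedded URL is returned unchanged.

```python
def strip_redirect(href):
    """Return the last URL embedded in href, or href itself if nothing is embedded."""
    idx = href.rfind('http')  # position of the final 'http', or -1 if absent
    # idx > 0 means a second URL starts somewhere after the beginning of the string
    return href[idx:] if idx > 0 else href

# Hypothetical redirect link that wraps the real media URL
wrapped = 'https://redirect.example.com/track/https://media.example.com/radiolab/episode.mp3'
print(strip_redirect(wrapped))
```

A plain link, or a relative path with no 'http' in it, passes through untouched.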
In [ ]:
pages = range(1, 100)  # The archive is paginated; 99 pages should be more than enough to cover the history
for page_no in pages:
    podcast_urls = get_podcast_urls(page_no)
    if len(podcast_urls) > 0:
        for p in podcast_urls:
            sleep(15)  # Pause between episodes to go easy on the server
            print(p)
            download_podcast(p)
        sleep(60*3)  # Longer pause between archive pages
    else:
        print('No more pages after ' + str(page_no))
        break
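
Since the loop sleeps between requests, a full run is slow, and it may be worth skipping episodes that are already on disk when re-running it. Here is a rough sketch, assuming the saved file is named after the last path component of the download URL. That is only an assumption: wget may pick a different name in some cases, for example from response headers. `local_path_for` and `already_downloaded` are hypothetical helpers, not part of the notebook above.

```python
import os
from urllib.parse import urlparse, unquote

def local_path_for(download_link, library_path):
    """Guess where a download would land: library_path plus the URL's final path component."""
    # urlparse(...).path drops any query string; unquote turns %20 back into a space
    filename = os.path.basename(unquote(urlparse(download_link).path))
    return os.path.join(library_path, filename)

def already_downloaded(download_link, library_path):
    """True if a file with the guessed name already exists in the library."""
    return os.path.isfile(local_path_for(download_link, library_path))
```

One way to use it would be to call `already_downloaded(podcast_download_link, library_path)` just before `download(...)` and return early when it is true.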