Scrape the archive of Radiolab podcasts from WNYC

RL_archive_bs4.ipynb

Last updated: 2020-08-29

I have been a longtime fan of Radiolab ever since first hearing "Musical Language" for an undergraduate ethnomusicology class. Over time, the program has shifted away from the science and technology topics that originally drew me to it. Moreover, older episodes continue to be dropped from common podcast apps. As a result, it has become difficult to access older material conveniently.

As a workaround, I decided to write a quick scraper for the wnycstudios website, which hosts Radiolab's entire archive of episodes. This scraper, demonstrated below, programmatically downloads the archive, which can then be stored on either a mobile device or a personal streaming service (such as Plex).

from bs4 import BeautifulSoup
from wget import download
from urllib.request import urlopen
import re
from time import sleep

First attempt

Let's get a working example using Beautiful Soup

# Where to store my downloads
library_path = '/Volumes/Elements/NVIDIA_SHIELD/PlexLibrary/Podcasts/radiolab/'

# Defining some structure for the WNYC site
stump_url = 'https://www.wnycstudios.org'
base_url = 'https://www.wnycstudios.org/shows/radiolab/podcasts'

html = urlopen(base_url).read()

# Name the parser explicitly so Beautiful Soup does not have to guess (and warn)
bs = BeautifulSoup(html, 'html.parser')

# Each podcast is represented as an <article> within each page.
articles = bs.find_all('article')

url = articles[0].find('a')['href']

podcast_url = stump_url + url
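podcast_html = BeautifulSoup(urlopen(podcast_url).read(), 'html.parser')

# The episode page exposes the audio file through an <a class="download-link"> element
podcast_download_link = podcast_html.find('a', attrs={'class': 'download-link'})['href']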

download(podcast_download_link, library_path)

'/Volumes/Elements/NVIDIA_SHIELD/PlexLibrary/Podcasts/radiolab//radiolab_podcast20graham.mp3'
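A small aside on how the episode URL is assembled above: plain concatenation works as long as the href is site-relative, which was the case here. A minimal sketch of a slightly more forgiving alternative uses `urljoin` from the standard library, which also passes absolute links through unchanged. The episode path and the second host below are made up for illustration and are not real Radiolab links.

```python
from urllib.parse import urljoin

stump_url = 'https://www.wnycstudios.org'

# Hypothetical site-relative href, of the kind found inside an <article>
relative_href = '/podcasts/radiolab/episodes/example-episode'
episode_url = urljoin(stump_url, relative_href)
print(episode_url)

# Hypothetical absolute href: urljoin leaves it untouched,
# whereas stump_url + href would produce a broken URL
absolute_href = 'https://other.example.com/episodes/example-episode'
absolute_url = urljoin(stump_url, absolute_href)
print(absolute_url)
```

I have not seen absolute links on the listing pages, so this is a precaution rather than a fix for an observed problem.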

Seems to work. Time to generalize and apply to the entire archive

def download_podcast(podcast_url):
    '''
    Given a direct podcast URL, find the download link (if one exists) and
    download the file to the pre-specified library destination.
    '''
    podcast_html = BeautifulSoup(urlopen(podcast_url).read(), 'html.parser')
    podcast_download = podcast_html.find('a', attrs={'class': 'download-link'})
    if podcast_download is not None:  # An article is not always a downloadable episode
        href = podcast_download.get('href')
        if href is not None:  # A page can also erroneously omit the download link
            # In rare cases the redirect URL does not work but the original URL does,
            # so keep only the last 'http...' segment embedded in the href.
            podcast_download_link = ['http' + x for x in re.split('http', href)][-1]
            download(podcast_download_link, library_path)
    
def get_podcast_urls(page_no):
    '''
    The Radiolab archive is paginated. Given a page number, return a list of
    all direct podcast links featured on that page.
    '''
    url = base_url + '/' + str(page_no)
    bs = BeautifulSoup(urlopen(url).read(), 'html.parser')
    articles = bs.find_all('article')
    # The first <a> inside each <article> links to the episode page
    urls = [stump_url + article.find('a')['href'] for article in articles]
    return urls
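The redirect handling inside download_podcast is easier to see in isolation. The sketch below pulls that one expression into a helper and runs it on a made-up redirect-style link; the helper name `last_embedded_url` and both example hosts are hypothetical and exist only for illustration.

```python
import re

def last_embedded_url(href):
    """Return the last 'http...' segment found in href."""
    # re.split drops the 'http' delimiter, so it is re-attached to every piece;
    # the final piece is the innermost (original) URL.
    return ['http' + part for part in re.split('http', href)][-1]

# Hypothetical redirect-style link that wraps the real file URL
wrapped = 'https://redirect.example.com/track/https://audio.example.com/episode.mp3'
print(last_embedded_url(wrapped))

# A link with no wrapper comes back unchanged
plain = 'https://audio.example.com/episode.mp3'
print(last_embedded_url(plain))
```

The trick assumes the href contains at least one 'http'. A site-relative link would come back with a stray 'http' prefix, so it is only safe on absolute download links like the ones this site serves.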

pages = range(1, 100)  # The archive is paginated; 99 pages should be more than enough to cover the history
for page_no in pages:
    podcast_urls = get_podcast_urls(page_no)
    if len(podcast_urls) > 0:
        for p in podcast_urls:
            sleep(15)  # Pause between episodes to go easy on the server
            print(p)
            download_podcast(p)
        sleep(60 * 3)  # Longer pause before moving on to the next page
    else:
        print('No more pages after ' + str(page_no))
        break
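The stopping rule in the loop above (keep requesting pages until one comes back empty) can be tried out without touching the network. This is a minimal sketch, not part of the original scraper: `iter_podcast_urls`, `fake_archive`, and `fake_get_urls` are hypothetical stand-ins, with the fake fetcher playing the role of get_podcast_urls.

```python
def iter_podcast_urls(get_urls, max_pages=99):
    """Yield podcast URLs page by page, stopping at the first empty page."""
    for page_no in range(1, max_pages + 1):
        urls = get_urls(page_no)
        if not urls:
            break  # An empty page means we have run past the end of the archive
        yield from urls

# Stand-in for the real site: two pages of made-up links, then nothing
fake_archive = {1: ['/episode-a', '/episode-b'], 2: ['/episode-c']}

def fake_get_urls(page_no):
    return fake_archive.get(page_no, [])

collected = list(iter_podcast_urls(fake_get_urls))
print(collected)
```

Separating the paging logic from the downloading makes it possible to check the stopping rule quickly, before committing to a run that sleeps for minutes at a time.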